Tags: data-extraction, structured-data, python, openai, pydantic, instructor, schema-based-extraction, nlp

Extract Structured Data from Unstructured Text

Use the Data Extraction skill to convert raw text into a structured, validated JSON object. Define a target data schema, provide the text, and an AI model will parse the information and populate your schema automatically.

beginner, 15 min, 4 steps
The play
  1. Set Up Your Environment
    To perform AI-powered Data Extraction, you need libraries to interact with a large language model (LLM) and to define data schemas. We'll use `openai` to call the LLM, `pydantic` to define the schema, and `instructor` to apply it. Install them via pip (`pip install openai instructor pydantic`).
  2. Define Your Target Data Schema
    The core of schema-based extraction is defining the structure you want to pull from the text. Create a Pydantic model that represents your desired data. This schema guides the AI, ensuring the output is typed and validated.
  3. Prepare the AI Client and Input Text
    Patch the OpenAI client using `instructor` to enable schema-based responses. Then, define the unstructured text you want to parse. Make sure your OpenAI API key is set as an environment variable (`OPENAI_API_KEY`).
  4. Execute the Data Extraction
    Call the chat completions API with the `response_model` parameter set to your `Invoice` schema. `instructor` builds the schema-aware request and validates the LLM's output against your model, so the call returns an `Invoice` instance rather than raw text.
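To make the "typed and validated" idea from step 2 concrete, here is a minimal offline sketch that exercises the schema without calling any model. It assumes Pydantic v2; the `try_parse` helper and the sample payloads are illustrative names, not part of `instructor` or the starter code.

```python
from typing import Optional

from pydantic import BaseModel, Field, ValidationError


class Invoice(BaseModel):
    invoice_id: str
    amount: float
    due_date: str = Field(description="The due date in YYYY-MM-DD format")


def try_parse(payload: dict) -> Optional[Invoice]:
    """Return a validated Invoice, or None if the payload does not fit the schema."""
    try:
        return Invoice.model_validate(payload)
    except ValidationError:
        return None


# A numeric string is coerced to float by Pydantic's default (lax) mode.
good = try_parse({"invoice_id": "INV-123", "amount": "99.99", "due_date": "2024-02-25"})

# A non-numeric amount and a missing due_date both fail validation.
bad = try_parse({"invoice_id": "INV-123", "amount": "ninety-nine"})

print(good)
print(bad)
```

The same validation step is what runs on the model's reply in step 4, which is why a malformed reply surfaces as an error instead of silently producing a partial object.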
Starter code
import os
import pydantic
import openai
import instructor

# 1. Ensure your OPENAI_API_KEY is set in your environment
# For example: export OPENAI_API_KEY='sk-...'
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set!")

# 2. Define the schema for the Data Extraction skill
# This tells the AI exactly what to look for and in what format.
class Invoice(pydantic.BaseModel):
    invoice_id: str
    amount: float
    due_date: str = pydantic.Field(description="The due date in YYYY-MM-DD format")

# 3. Patch the OpenAI client so create() accepts a response_model argument
# (recent instructor releases also provide instructor.from_openai for this)
client = instructor.patch(openai.OpenAI())

# 4. Define the unstructured text to process
text_block = "Hi, please find attached invoice INV-123 for $99.99. The payment is due on Feb 25, 2024."

# 5. Execute the extraction by providing the response_model
try:
    invoice_details = client.chat.completions.create(
        model="gpt-4o",
        response_model=Invoice,
        messages=[
            {"role": "user", "content": f"Extract the invoice details from the following text: {text_block}"}
        ]
    )

    print("--- Extracted Data ---")
    print(invoice_details.model_dump_json(indent=2))

except Exception as e:
    print(f"An error occurred: {e}")

# Expected output (exact values depend on the model's response):
# --- Extracted Data ---
# {
#   "invoice_id": "INV-123",
#   "amount": 99.99,
#   "due_date": "2024-02-25"
# }
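Note that `due_date` is declared as a plain `str`, so the `YYYY-MM-DD` format in the field description is a hint to the model, not an enforced rule. One way to check the returned value afterwards is sketched below using only the standard library; `is_iso_date` is a hypothetical helper name, not something the starter code or `instructor` provides.

```python
from datetime import date


def is_iso_date(value: str) -> bool:
    """Return True if value is a valid calendar date written as YYYY-MM-DD."""
    try:
        # fromisoformat raises ValueError for malformed or impossible dates.
        parsed = date.fromisoformat(value)
    except ValueError:
        return False
    # The round-trip comparison rejects other ISO spellings (such as "20240225")
    # that newer Python versions accept in fromisoformat.
    return parsed.isoformat() == value


print(is_iso_date("2024-02-25"))
print(is_iso_date("Feb 25, 2024"))
```

In the starter code this could be called as `is_iso_date(invoice_details.due_date)` after the extraction succeeds. Declaring the field as `datetime.date` in the Pydantic model is an alternative that moves the check into the schema itself.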