data-extraction, structured-data, python, openai, pydantic, instructor, schema-based-extraction, nlp
Extract Structured Data from Unstructured Text
Use the Data Extraction skill to convert raw text into a structured, validated JSON object. Define a target data schema, provide the text, and an AI model will parse the information and populate your schema automatically.
Beginner | 15 min | 4 steps
The play
- Set Up Your Environment: To perform AI-powered Data Extraction, you need libraries to interact with a large language model (LLM) and to define data schemas. We'll use `openai` to call the LLM, `pydantic` to define the schema, and `instructor` to apply that schema to the model's response. Install them via pip.
- Define Your Target Data Schema: The core of schema-based extraction is defining the structure you want to pull from the text. Create a Pydantic model that represents your desired data. This schema guides the AI and lets the output be type-checked and validated.
- Prepare the AI Client and Input Text: Patch the OpenAI client using `instructor` to enable schema-based responses. Then define the unstructured text you want to parse. Make sure your OpenAI API key is set as an environment variable (`OPENAI_API_KEY`).
- Execute the Data Extraction: Call the chat completions API and pass your `Invoice` schema through the `response_model` parameter. `instructor` handles the prompting and validates the LLM's output against your model, so you get back an `Invoice` instance rather than unchecked text.
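What "typed and validated" means in the schema step can be seen without calling the API. The sketch below assumes Pydantic v2 (the starter code's use of `model_dump_json` suggests it) and feeds two hand-written payloads to the same `Invoice` model. Both payloads are invented for illustration and are not real model responses.

```python
import pydantic


class Invoice(pydantic.BaseModel):
    invoice_id: str
    amount: float
    due_date: str = pydantic.Field(description="The due date in YYYY-MM-DD format")


# A well-formed payload, shaped like the data the extraction is meant to return.
# The amount is given as a string on purpose to show type coercion.
good = Invoice.model_validate(
    {"invoice_id": "INV-123", "amount": "99.99", "due_date": "2024-02-25"}
)
print(type(good.amount).__name__, good.amount)

# A malformed payload: the amount is not numeric and due_date is missing.
try:
    Invoice.model_validate({"invoice_id": "INV-124", "amount": "ninety-nine"})
    errors = []
except pydantic.ValidationError as exc:
    errors = exc.errors()

# Each error records which field failed and why.
for err in errors:
    print(err["loc"], err["msg"])
```

The second payload never becomes an `Invoice`; Pydantic reports one error per offending field, which is the same check applied to the model's response during extraction.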
Starter code
import os

import pydantic
import openai
import instructor

# 1. Ensure your OPENAI_API_KEY is set in your environment
# For example: export OPENAI_API_KEY='sk-...'
if not os.getenv("OPENAI_API_KEY"):
    raise ValueError("OPENAI_API_KEY environment variable not set!")

# 2. Define the schema for the Data Extraction skill
# This tells the AI exactly what to look for and in what format.
class Invoice(pydantic.BaseModel):
    invoice_id: str
    amount: float
    due_date: str = pydantic.Field(description="The due date in YYYY-MM-DD format")

# 3. Patch the OpenAI client to add the extraction capability
client = instructor.patch(openai.OpenAI())

# 4. Define the unstructured text to process
text_block = "Hi, please find attached invoice INV-123 for $99.99. The payment is due on Feb 25, 2024."

# 5. Execute the extraction by providing the response_model
try:
    invoice_details = client.chat.completions.create(
        model="gpt-4o",
        response_model=Invoice,
        messages=[
            {"role": "user", "content": f"Extract the invoice details from the following text: {text_block}"}
        ]
    )
    print("--- Extracted Data ---")
    print(invoice_details.model_dump_json(indent=2))
except Exception as e:
    print(f"An error occurred: {e}")

# Expected output (exact values depend on the model's response):
# --- Extracted Data ---
# {
#   "invoice_id": "INV-123",
#   "amount": 99.99,
#   "due_date": "2024-02-25"
# }
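
The `description` on `due_date` is only a hint for the model. Because the field is typed as `str`, Pydantic accepts any string there, including "Feb 25, 2024". A minimal post-extraction check, using only the standard library, might look like the sketch below. `is_iso_date` is a hypothetical helper name, not part of `instructor` or `pydantic`.

```python
from datetime import datetime


def is_iso_date(value: str) -> bool:
    """Return True if value is a real calendar date written as YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        # Wrong layout (e.g. "Feb 25, 2024") or impossible date (e.g. Feb 30).
        return False
    # Round-trip so loosely formatted input such as "2024-2-5" is rejected.
    return parsed.isoformat() == value


for candidate in ["2024-02-25", "Feb 25, 2024", "2024-02-30", "2024-2-5"]:
    print(candidate, is_iso_date(candidate))
```

In the starter code, this could be applied as `is_iso_date(invoice_details.due_date)` before the value is passed on to anything that expects a strict date format.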