pdf-extraction, data-processing, python, unstructured, ocr, document-ai, rag, automation
Build a PDF Extraction Pipeline in Python
Use the `unstructured` library to build a PDF extraction pipeline that pulls text, tables, and metadata from both native and scanned PDFs, using OCR where a page has no usable text layer. The pipeline detects page layout automatically and returns typed elements that can be chunked for downstream AI applications.
Intermediate · 30 min · 5 steps
The play
- Set Up Your Environment: The pipeline relies on the `unstructured` Python library and on Tesseract for OCR. Install `unstructured` with PDF and Tesseract extras. You must also install the Tesseract engine itself on your system (e.g., `brew install tesseract` on macOS or via the official installer on Windows).
- Extract All Elements from a PDF: Use the `partition_pdf` function to process a document. With the default `strategy="auto"`, it chooses between direct text extraction and OCR based on the document. The function returns a list of `Element` objects, such as `Title`, `NarrativeText`, and `Table`.
- Force OCR on Scanned Documents: For image-only PDFs, or whenever you want to guarantee that OCR is used, set the `strategy` parameter to `"ocr_only"`. The pipeline then uses Tesseract to read the text from the page images.
- Isolate and Extract Tables: The pipeline detects tables and returns them as `Table` elements. You can filter the results to process only this structured data. The `.text` attribute contains the table content as plain text, while `.metadata.text_as_html` provides an HTML representation when `infer_table_structure=True` is set, as in the starter code.
- Output as Structured JSON: For easy integration with other systems, convert the extracted elements into a structured JSON format. This is ideal for feeding data into databases, APIs, or other scripts.
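
Steps 3 to 5 can be sketched roughly as follows. This is a minimal illustration, not the library's prescribed workflow: the helper names (`partition_document`, `tables_only`, `table_record`, `elements_as_json`) are hypothetical, and the code assumes each element exposes a `category` string and a `to_dict()` method alongside the `.text` and `.metadata.text_as_html` attributes described above.

```python
import json


def partition_document(path, force_ocr=False):
    """Partition a PDF, optionally forcing OCR (Step 3). Hypothetical helper."""
    # Imported lazily so the helpers below can be used on element-like
    # objects even where unstructured is not installed.
    from unstructured.partition.pdf import partition_pdf

    if force_ocr:
        # "ocr_only" reads text from the page images instead of the text layer.
        return partition_pdf(filename=path, strategy="ocr_only")
    # Same settings as the starter code: layout analysis plus table structure.
    return partition_pdf(
        filename=path,
        strategy="hi_res",
        infer_table_structure=True,
    )


def tables_only(elements):
    """Keep only table elements (Step 4).

    Assumes each element carries a `category` attribute equal to "Table"
    for tables; elements without the attribute are skipped.
    """
    return [el for el in elements if getattr(el, "category", None) == "Table"]


def table_record(table):
    """Return the plain-text and HTML views of one table element."""
    return {
        "text": table.text,
        # May be None if table structure inference was not enabled.
        "html": getattr(table.metadata, "text_as_html", None),
    }


def elements_as_json(elements):
    """Serialize elements to a JSON string (Step 5).

    Assumes each element provides a to_dict() method.
    """
    return json.dumps([el.to_dict() for el in elements], indent=2)
```

A typical call sequence under these assumptions would be `elements = partition_document("sample.pdf", force_ocr=True)`, then `[table_record(t) for t in tables_only(elements)]` for the tables and `elements_as_json(elements)` for the full output.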
Starter code
import os

from unstructured.partition.pdf import partition_pdf

# --- Setup ---
# 1. Create a file named 'sample.pdf' in the same directory.
# 2. Make sure you have installed Tesseract: https://tesseract-ocr.github.io/tessdoc/Installation.html
# 3. Run this script: python run_pipeline.py

FILENAME = "sample.pdf"

if not os.path.exists(FILENAME):
    print(f"Error: Please create a PDF file named '{FILENAME}' to run this script.")
else:
    print(f"Processing {FILENAME} with the PDF Extraction Pipeline...")

    # Use the hi_res strategy for better layout analysis and table detection
    elements = partition_pdf(
        filename=FILENAME,
        strategy="hi_res",
        infer_table_structure=True
    )

    print("\n--- Extracted Elements ---\n")

    # Print the type and first 100 chars of each element found
    for i, element in enumerate(elements):
        element_type = type(element).__name__
        element_text_preview = element.text[:100].replace('\n', ' ')
        print(f"{i+1}. [{element_type}]: {element_text_preview}...")

    print(f"\nTotal elements extracted: {len(elements)}")
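
The starter script assumes that both `unstructured` and the Tesseract engine are already installed. A small preflight check, sketched below using only the standard library, can confirm that before any PDF is processed. The function names `preflight` and `missing` are hypothetical, and the check only looks for an importable `unstructured` package and a `tesseract` executable on `PATH`; it does not verify versions or language data.

```python
import importlib.util
import shutil


def preflight():
    """Report whether the Step 1 prerequisites appear to be available."""
    return {
        # True if the `unstructured` package can be found in this environment.
        "unstructured": importlib.util.find_spec("unstructured") is not None,
        # True if a `tesseract` executable is found on PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


def missing(status):
    """Return the names of prerequisites that were not found, sorted."""
    return sorted(name for name, ok in status.items() if not ok)
```

Calling `missing(preflight())` returns an empty list when both prerequisites are found, so a script could print the result and exit early otherwise.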