Tags: pdf-extraction, data-processing, python, unstructured, ocr, document-ai, rag, automation

Build a PDF Extraction Pipeline in Python

Use the `unstructured` library to build a PDF extraction pipeline that pulls text, tables, and metadata from both native and scanned PDFs, using OCR where needed. The pipeline detects page layout and can chunk content for downstream AI applications.

intermediate · 30 min · 5 steps
The play
  1. Set Up Your Environment
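    The pipeline relies on the `unstructured` Python library, with Tesseract as the OCR engine. Install the library with its PDF extra (`pip install "unstructured[pdf]"`). Tesseract is a separate system program rather than a pip extra, so install it yourself (e.g., `brew install tesseract` on macOS or via the official installer on Windows). The `hi_res` and OCR strategies generally also expect Poppler on the system for rendering pages to images.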
  2. Extract All Elements from a PDF
    Call `partition_pdf` to process a document. With the default `strategy="auto"`, it chooses between direct text extraction and OCR based on the document. The function returns a list of `Element` objects such as `Title`, `NarrativeText`, and `Table`.
  3. Force OCR on Scanned Documents
    For image-only PDFs, or whenever you want to guarantee OCR, set the `strategy` parameter to `"ocr_only"`. The pipeline then runs Tesseract on the page images instead of reading any embedded text layer.
  4. Isolate and Extract Tables
    The pipeline detects tables and returns them as `Table` elements, so you can filter the results to process only this structured data. The `.text` attribute contains the table content as plain text, while `.metadata.text_as_html` provides an HTML representation when `infer_table_structure=True` is set, as in the starter code.
  5. Output as Structured JSON
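    For integration with other systems, convert the extracted elements to structured JSON, for example with each element's `to_dict()` method or `elements_to_json` from `unstructured.staging.base`. JSON output is convenient for feeding data into databases, APIs, or other scripts.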
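Steps 3 to 5 can be sketched together as follows. This is a minimal illustration rather than the library's own API: the helper names `partition_scanned_pdf`, `table_records`, and `elements_as_json`, the JSON shape, and the stand-in `Title` and `Table` classes are all assumptions made for this example. The helpers only rely on an element's class name, its `.text`, and `.metadata.text_as_html`, so the demo at the bottom runs without a PDF or the library installed. For real output you would likely prefer the library's own `elements_to_json`.

```python
import json
from types import SimpleNamespace
from typing import Any, Dict, Iterable, List, Optional


def partition_scanned_pdf(path: str) -> List[Any]:
    """Step 3: partition a PDF with OCR forced on.

    The import is deferred so the helpers below work without
    `unstructured` installed (hypothetical convenience wrapper).
    """
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path, strategy="ocr_only")


def table_records(elements: Iterable[Any]) -> List[Dict[str, Optional[str]]]:
    """Step 4: keep only elements whose class is named 'Table'."""
    records = []
    for element in elements:
        if type(element).__name__ != "Table":
            continue
        metadata = getattr(element, "metadata", None)
        records.append({
            "text": element.text,
            # None when table structure inference was not enabled
            "html": getattr(metadata, "text_as_html", None),
        })
    return records


def elements_as_json(elements: Iterable[Any]) -> str:
    """Step 5: serialize type and text of each element (illustrative shape)."""
    rows = [{"type": type(e).__name__, "text": e.text} for e in elements]
    return json.dumps(rows, indent=2, ensure_ascii=False)


# Stand-in objects for the demo; real elements come from partition_pdf.
class Title(SimpleNamespace):
    pass


class Table(SimpleNamespace):
    pass


demo = [
    Title(text="Quarterly report", metadata=SimpleNamespace()),
    Table(
        text="Region Sales North 10",
        metadata=SimpleNamespace(
            text_as_html="<table><tr><td>North</td><td>10</td></tr></table>"
        ),
    ),
]
tables = table_records(demo)
payload = elements_as_json(demo)
print(payload)
```

With a real document, replace `demo` with the list returned by `partition_pdf` (or by `partition_scanned_pdf("sample.pdf")` for a scanned file). Note that `ocr_only` output will generally not carry `text_as_html`, since table structure inference belongs to the `hi_res` path used in the starter code.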
Starter code
import os
from unstructured.partition.pdf import partition_pdf

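# --- Setup ---
# 1. Install the library with its PDF extra: pip install "unstructured[pdf]"
# 2. Install the Tesseract engine: https://tesseract-ocr.github.io/tessdoc/Installation.html
# 3. Place a PDF named 'sample.pdf' in the same directory as this script.
# 4. Save this script (e.g., as run_pipeline.py) and run it: python run_pipeline.py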

FILENAME = "sample.pdf"

if not os.path.exists(FILENAME):
    print(f"Error: Please create a PDF file named '{FILENAME}' to run this script.")
else:
    print(f"Processing {FILENAME} with the PDF Extraction Pipeline...")
    # Use the hi_res strategy for better layout analysis and table detection
    elements = partition_pdf(
        filename=FILENAME, 
        strategy="hi_res",
        infer_table_structure=True
    )

    print("\n--- Extracted Elements ---\n")
    # Print the type and the first 100 characters of each element found
    for i, element in enumerate(elements, start=1):
        element_type = type(element).__name__
        text = element.text.replace("\n", " ")
        # Add an ellipsis only when the preview was actually truncated
        preview = text[:100] + ("..." if len(text) > 100 else "")
        print(f"{i}. [{element_type}]: {preview}")

    print(f"\nTotal elements extracted: {len(elements)}")
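The summary mentions chunking content for downstream AI applications, which the starter code stops short of. The library ships its own chunker (`chunk_by_title` in `unstructured.chunking.title`). The sketch below is a hand-rolled simplification of the same idea, not the library's algorithm: the function name `chunk_by_heading`, the `max_chars` parameter, and the stand-in element classes are assumptions made for illustration. It starts a new chunk at every `Title` element or when the current chunk would grow past roughly `max_chars`.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List


def chunk_by_heading(elements: Iterable[Any], max_chars: int = 500) -> List[str]:
    """Group element text into chunks.

    A new chunk starts at each element whose class is named 'Title', or
    when adding the next element would push the chunk past max_chars.
    A single element longer than max_chars is kept whole as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    size = 0
    for element in elements:
        text = element.text.strip()
        if not text:
            continue
        is_title = type(element).__name__ == "Title"
        if current and (is_title or size + len(text) + 1 > max_chars):
            chunks.append("\n".join(current))
            current, size = [], 0
        current.append(text)
        size += len(text) + 1  # +1 accounts for the joining newline
    if current:
        chunks.append("\n".join(current))
    return chunks


# Stand-in objects for the demo; real elements come from partition_pdf.
class Title(SimpleNamespace):
    pass


class NarrativeText(SimpleNamespace):
    pass


demo = [
    Title(text="Introduction"),
    NarrativeText(text="First paragraph."),
    NarrativeText(text="Second paragraph."),
    Title(text="Methods"),
    NarrativeText(text="Third paragraph."),
]
chunks = chunk_by_heading(demo, max_chars=200)
for chunk in chunks:
    print(repr(chunk))
```

Keeping each heading together with the text that follows it tends to give retrieval systems more self-contained passages than fixed-size splitting, which is the usual motivation for title-based chunking.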