Build a PDF Extraction Pipeline in Python

Use the `unstructured` library to build a PDF Extraction Pipeline. Extract text, tables, and metadata from native PDFs directly and from scanned PDFs with OCR. The pipeline detects page layout and returns typed elements that can then be chunked for downstream AI applications.

Intermediate · 30 min · 5 steps

The play

Set Up Your Environment
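The PDF Extraction Pipeline relies on the `unstructured` Python library and Tesseract for OCR. Install `unstructured` with its PDF extra: `pip install "unstructured[pdf]"`. The Tesseract engine is not a Python package, so install it separately on your system (e.g., `brew install tesseract` on macOS or via the official installer on Windows). The OCR and `hi_res` strategies also render pages to images, which requires Poppler to be installed.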
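Before running anything heavier, it can help to confirm that the dependencies are visible from your interpreter. The following is a minimal sketch using only the standard library. The `check_environment` helper is hypothetical (not part of `unstructured`), and the `pdftoppm` check assumes Poppler is the backend used to render PDF pages to images.

```python
import importlib.util
import shutil


def check_environment():
    """Report which pipeline dependencies are visible from this interpreter."""
    return {
        # Python package: True if `unstructured` can be imported.
        "unstructured": importlib.util.find_spec("unstructured") is not None,
        # System binaries: True if the executable is found on PATH.
        "tesseract": shutil.which("tesseract") is not None,
        # Assumption: Poppler's pdftoppm is used to render pages to images.
        "pdftoppm (Poppler)": shutil.which("pdftoppm") is not None,
    }


for name, found in check_environment().items():
    print(f"{name}: {'found' if found else 'MISSING'}")
```

Any line reporting `MISSING` points at an install step to revisit before moving on.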
Extract All Elements from a PDF
Use the `partition_pdf` function to process a document. With the default `strategy="auto"`, it chooses between direct text extraction and OCR based on the document. The function returns a list of element objects, such as `Title`, `NarrativeText`, and `Table`.
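A quick way to see what the partitioner returned is to count elements by type. This is a minimal sketch: `load_elements` and `count_element_types` are hypothetical helper names, and the import is done lazily so the counting helper can be tried on its own.

```python
from collections import Counter


def load_elements(path):
    """Partition a PDF and return its list of elements."""
    # Imported lazily so the helper below works without the library installed.
    from unstructured.partition.pdf import partition_pdf

    return partition_pdf(filename=path)


def count_element_types(elements):
    """Count elements by class name, e.g. Title, NarrativeText, Table."""
    return Counter(type(element).__name__ for element in elements)


# Usage (requires a real PDF on disk):
# for name, count in count_element_types(load_elements("sample.pdf")).most_common():
#     print(f"{name}: {count}")
```

A document that comes back as mostly `NarrativeText` with no `Title` or `Table` elements is often a hint that a different strategy is worth trying.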
Force OCR on Scanned Documents
For image-only PDFs, or whenever you want to guarantee that OCR is used, set the `strategy` parameter to `"ocr_only"`. The PDF Extraction Pipeline will then use Tesseract to read the text from the rendered page images.
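One possible way to decide when to force OCR is to try direct text extraction first and fall back to `"ocr_only"` when almost nothing comes back. This is a sketch of that heuristic, not the library's own logic: the helper names and the `min_chars` threshold of 50 are assumptions to tune for your documents.

```python
def extracted_char_count(elements):
    """Total number of non-whitespace-padded characters across all elements."""
    return sum(len((element.text or "").strip()) for element in elements)


def partition_with_ocr_fallback(path, min_chars=50):
    """Try the "fast" strategy first; rerun with "ocr_only" if little text is found."""
    # Imported lazily so extracted_char_count can be used without the library.
    from unstructured.partition.pdf import partition_pdf

    elements = partition_pdf(filename=path, strategy="fast")
    if extracted_char_count(elements) < min_chars:
        # Likely a scanned, image-only PDF: force Tesseract OCR.
        elements = partition_pdf(filename=path, strategy="ocr_only")
    return elements


# Usage (requires a real PDF on disk):
# elements = partition_with_ocr_fallback("scanned.pdf")
```

The fallback costs one extra fast pass on scanned files but avoids running OCR on PDFs that already carry a text layer.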
Isolate and Extract Tables
With the `hi_res` strategy, the pipeline detects tables and returns them as `Table` elements. You can filter the results to process only this structured data. The `.text` attribute contains the content, while `.metadata.text_as_html` provides an HTML representation when `infer_table_structure=True` is set.
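The filtering step described above can be sketched as follows. `extract_tables` is a hypothetical helper; it relies on each element exposing a `category` string and treats a missing `text_as_html` as `None`, since the HTML is only expected when table structure inference was enabled during partitioning.

```python
def extract_tables(elements):
    """Return the text and HTML (if present) of every Table element."""
    tables = []
    for element in elements:
        # Skip anything that is not a table (titles, narrative text, ...).
        if getattr(element, "category", None) != "Table":
            continue
        metadata = getattr(element, "metadata", None)
        tables.append({
            "text": element.text,
            # None when no HTML representation was produced.
            "html": getattr(metadata, "text_as_html", None),
        })
    return tables


# Usage, given `elements` from partition_pdf(..., strategy="hi_res",
# infer_table_structure=True):
# for table in extract_tables(elements):
#     print(table["html"] or table["text"])
```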
Output as Structured JSON
For easy integration with other systems, convert the extracted elements into a structured JSON format. This is ideal for feeding data into databases, APIs, or other scripts.
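A minimal sketch of the JSON step, assuming each element exposes a `to_dict()` method: `elements_to_json_string` is a hypothetical helper name chosen for this example. The library also ships its own serialization helpers in `unstructured.staging.base`; check the installed version for their exact signatures.

```python
import json


def elements_to_json_string(elements, indent=2):
    """Serialize a list of elements to a JSON array of dictionaries."""
    records = [element.to_dict() for element in elements]
    # default=str is a safety net for any value json cannot encode natively.
    return json.dumps(records, indent=indent, ensure_ascii=False, default=str)


# Usage, given `elements` from partition_pdf:
# with open("sample.json", "w", encoding="utf-8") as handle:
#     handle.write(elements_to_json_string(elements))
```

Keeping one dictionary per element preserves the element type and metadata alongside the text, which makes later filtering possible without re-parsing the PDF.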

Starter code

import os
from unstructured.partition.pdf import partition_pdf

# --- Setup ---
# 1. Place a PDF named 'sample.pdf' in the same directory as this script.
# 2. Make sure you have installed Tesseract: https://tesseract-ocr.github.io/tessdoc/Installation.html
# 3. Save this script as run_pipeline.py and run it: python run_pipeline.py

FILENAME = "sample.pdf"

if not os.path.exists(FILENAME):
    print(f"Error: Please create a PDF file named '{FILENAME}' to run this script.")
else:
    print(f"Processing {FILENAME} with the PDF Extraction Pipeline...")
    # Use the hi_res strategy for better layout analysis and table detection
    elements = partition_pdf(
        filename=FILENAME,
        strategy="hi_res",
        infer_table_structure=True
    )

    print("\n--- Extracted Elements ---\n")
    # Print the type and first 100 chars of each element found
    for i, element in enumerate(elements):
        element_type = type(element).__name__
        element_text_preview = element.text[:100].replace('\n', ' ')
        print(f"{i+1}. [{element_type}]: {element_text_preview}...")

    print(f"\nTotal elements extracted: {len(elements)}")