
Tags: ocr, python, tesseract, pytesseract, document-parsing, text-extraction, image-processing, computer-vision

Build a Python OCR Pipeline with Tesseract

Learn to build an end-to-end OCR pipeline. This guide covers image preprocessing, text extraction with Tesseract, and post-processing with regex to pull structured data, such as invoice numbers and amounts, out of an image.

intermediate · 30 min · 4 steps

The play

Install Tesseract and Python Libraries
An OCR pipeline requires an engine. We'll use Tesseract, an open-source OCR engine. Install it on your system first, then install two Python libraries: `pytesseract`, a wrapper that calls the Tesseract executable, and `Pillow`, for image handling.
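Before going further, it can help to confirm that Python can actually find the Tesseract binary. The sketch below uses only the standard library; the helper names `find_tesseract` and `tesseract_version` are hypothetical and not part of `pytesseract`.

```python
import shutil
import subprocess


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


def tesseract_version(path):
    """Return the first line printed by `tesseract --version`."""
    result = subprocess.run(
        [path, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the banner to stderr rather than stdout, so check both.
    banner = result.stdout or result.stderr
    return banner.splitlines()[0] if banner else ""


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract not found on PATH; install it or set "
              "pytesseract.pytesseract.tesseract_cmd to its location.")
    else:
        print(f"Found {path}: {tesseract_version(path)}")
```

If the lookup returns `None` even though Tesseract is installed, pointing `pytesseract.pytesseract.tesseract_cmd` at the executable is the usual workaround.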
Load and Preprocess the Image
Raw images are often not optimal for OCR. Preprocessing steps like converting to grayscale or to pure black-and-white (binarization) can improve accuracy by discarding color information and increasing the contrast between text and background. We'll use the Pillow library for this.
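As a rough illustration of binarization, the sketch below converts an image to grayscale and then applies a fixed threshold with Pillow. The `binarize` helper and the threshold of 128 are assumptions chosen for the example; a scan with uneven lighting would likely need a different value or an adaptive method.

```python
from PIL import Image


def binarize(image, threshold=128):
    """Convert an image to grayscale, then map every pixel to pure black or white."""
    gray = image.convert("L")  # 8-bit grayscale, values 0-255
    # Pixels brighter than the threshold become white (255); the rest become black (0).
    return gray.point(lambda p: 255 if p > threshold else 0)


if __name__ == "__main__":
    # Tiny 2x1 test image: one light-gray pixel and one dark-gray pixel.
    sample = Image.new("RGB", (2, 1))
    sample.putpixel((0, 0), (200, 200, 200))
    sample.putpixel((1, 0), (60, 60, 60))

    result = binarize(sample)
    print(result.getpixel((0, 0)), result.getpixel((1, 0)))  # prints: 255 0
```

The returned image can be passed to `pytesseract.image_to_string` in place of a plain grayscale image.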
Extract Raw Text with Pytesseract
With a clean image, you can now run the core OCR process. The `pytesseract.image_to_string` function sends the image to the Tesseract engine and returns the recognized text. You can pass configuration options, like `--psm 6`, which tells Tesseract to assume a single uniform block of text.
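The configuration string can be assembled by a small helper, which makes it easier to try different page segmentation modes. This is a minimal sketch: `build_config` and `extract_text` are hypothetical names, and only `pytesseract.image_to_string` with its `lang` and `config` parameters comes from the library itself.

```python
def build_config(psm=6, oem=3):
    """Build a Tesseract config string from a page segmentation mode and engine mode."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


def extract_text(image, psm=6, lang="eng"):
    """Run Tesseract on a PIL image and return the recognized text."""
    # Imported here so build_config can be used without pytesseract installed.
    import pytesseract

    return pytesseract.image_to_string(image, lang=lang, config=build_config(psm=psm))


if __name__ == "__main__":
    print(build_config())       # prints: --oem 3 --psm 6
    print(build_config(psm=7))  # prints: --oem 3 --psm 7
```

Mode 6 suits a document laid out as one block of text. For a cropped single line, `--psm 7` is often a better fit, and `--psm 11` targets sparse text scattered across the page.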
Post-Process to Get Structured Data
The raw text is unstructured. The final step of an OCR pipeline is to parse it into specific, structured fields. Regular expressions (regex) are well suited to finding patterns like invoice numbers, dates, or amounts.
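The parsing step can be sketched on its own, without running OCR, by feeding regex patterns a sample string. The text below is hypothetical OCR output with some stray spacing, and the patterns are slightly more tolerant of whitespace than a strict literal match; `parse_fields` is an illustrative helper name.

```python
import re

# Hypothetical OCR output, including the kind of spacing noise OCR can introduce.
SAMPLE_TEXT = """INVOICE  #INV-00123
Date: 2023-10-26
Total Amount : $ 42.99
"""

PATTERNS = {
    "invoice_number": r"INVOICE\s*#\s*(\S+)",
    "date": r"Date\s*:\s*(\d{4}-\d{2}-\d{2})",
    "total_amount": r"Total\s+Amount\s*:\s*\$\s*(\d+\.\d{2})",
    "contact_email": r"Contact\s*:\s*(\S+@\S+)",
}


def parse_fields(text, patterns=PATTERNS):
    """Return a dict with one entry per pattern; fields that did not match are None."""
    fields = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        fields[name] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    print(parse_fields(SAMPLE_TEXT))
    # {'invoice_number': 'INV-00123', 'date': '2023-10-26',
    #  'total_amount': '42.99', 'contact_email': None}
```

Recording unmatched fields as `None` rather than omitting them makes it obvious downstream which values the OCR or the patterns failed to recover.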

Starter code

import pytesseract
from PIL import Image, ImageDraw, ImageFont
import re
import json

# --- OCR Pipeline Starter ---
# This script demonstrates a full OCR Pipeline: image creation, preprocessing, text extraction, and structured data parsing.
#
# PREREQUISITES:
# 1. Install Tesseract OCR on your system:
#    - macOS: brew install tesseract
#    - Debian/Ubuntu: sudo apt-get install tesseract-ocr
#    - Windows: Download installer from https://github.com/UB-Mannheim/tesseract/wiki
# 2. Install Python libraries: pip install pytesseract Pillow
#
# NOTE: If Tesseract is not in your system's PATH, you may need to specify its location:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

def create_sample_invoice_image():
    """Creates a sample image in memory to make the script self-contained."""
    img = Image.new('RGB', (450, 200), color='white')
    d = ImageDraw.Draw(img)
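    try:
        # arial.ttf may not be installed on every system (it is often missing on Linux).
        font = ImageFont.truetype("arial.ttf", 16)
    except OSError:
        # Fall back to Pillow's built-in font; it is small, so OCR accuracy may drop.
        font = ImageFont.load_default()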
    
    d.text((10, 10), "INVOICE #INV-00123", fill='black', font=font)
    d.text((10, 40), "Date: 2023-10-26", fill='black', font=font)
    d.text((10, 70), "Description: 1x AI Consulting", fill='black', font=font)
    d.text((10, 100), "Total Amount: $42.99", fill='black', font=font)
    d.text((10, 130), "Contact: support@example.com", fill='black', font=font)
    return img

# 1. GET IMAGE: Create our sample document
source_image = create_sample_invoice_image()

# 2. PREPROCESS: Convert to grayscale for better OCR performance
processed_image = source_image.convert('L')

# 3. EXTRACT: Run Tesseract OCR engine on the preprocessed image
print("Running OCR...")
# --oem 3 selects Tesseract's default engine mode; --psm 6 assumes a single uniform block of text.
custom_config = r'--oem 3 --psm 6'
raw_text = pytesseract.image_to_string(processed_image, config=custom_config)
print("\n--- Raw OCR Output ---")
print(raw_text)

# 4. POST-PROCESS: Parse raw text into structured data using regex
structured_data = {}

patterns = {
    'invoice_number': r'INVOICE #(\S+)',
    'total_amount': r'Total Amount: \$(\d+\.\d+)',
    'contact_email': r'Contact: (\S+@\S+)'
}

for key, pattern in patterns.items():
    match = re.search(pattern, raw_text)
    if match:
        structured_data[key] = match.group(1)

print("\n--- Structured Data Extracted ---")
print(json.dumps(structured_data, indent=2))
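The extracted values are still strings. If later code needs to compare or sum them, one option is to convert them to typed values and treat anything that fails to convert as missing. The sketch below assumes amounts use a dot as the decimal separator and dates are in ISO format, as in the sample invoice; `to_amount` and `to_date` are hypothetical helper names.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def to_amount(raw):
    """Convert a captured amount such as '42.99' or '1,042.99' to Decimal, or None."""
    if raw is None:
        return None
    try:
        # Drop thousands separators before parsing.
        return Decimal(raw.replace(",", ""))
    except InvalidOperation:
        return None


def to_date(raw):
    """Convert an ISO date string such as '2023-10-26' to a date, or None."""
    if raw is None:
        return None
    try:
        return date.fromisoformat(raw)
    except ValueError:
        return None


if __name__ == "__main__":
    print(to_amount("42.99"))     # prints: 42.99
    print(to_date("2023-10-26"))  # prints: 2023-10-26
    print(to_date("2023-1O-26"))  # prints: None (letter O misread for zero)
```

`Decimal` and `date` objects are not JSON-serializable by default, so convert them back with `str()` before passing the result to `json.dumps`.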