Article
ocr, python, tesseract, pytesseract, document-parsing, text-extraction, image-processing, computer-vision
Build a Python OCR Pipeline with Tesseract
Learn to build an end-to-end OCR Pipeline. This guide covers image preprocessing, text extraction with Tesseract, and post-processing with regex to pull structured data like invoice numbers and amounts from an image.
intermediate · 30 min · 4 steps
The play
- Install Tesseract and Python Libraries: An OCR pipeline requires an engine. We'll use Tesseract, an open-source OCR engine. Install it on your system first, then install two Python packages: `pytesseract`, a wrapper that calls the Tesseract engine, and `Pillow` for image handling.
- Load and Preprocess the Image: Raw images are often not optimal for OCR. Preprocessing steps such as converting to grayscale or to pure black and white (binarization) often improve accuracy by removing color noise and increasing contrast. We'll use the Pillow library for this.
- Extract Raw Text with Pytesseract: With a clean image, you can run the core OCR step. The `pytesseract.image_to_string` function sends the image to the Tesseract engine and returns the recognized text. You can pass configuration options such as `--psm 6`, which tells Tesseract to assume a single uniform block of text.
- Post-Process to Get Structured Data: The raw text is unstructured. The final step of an OCR pipeline is to parse this text and extract specific, structured information. Regular expressions (regex) are well suited to finding patterns like invoice numbers, dates, or amounts.
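For the installation step, it can help to confirm that the Tesseract binary is reachable before running any OCR. The following is a minimal sketch using only the standard library; the helper name `find_tesseract` is hypothetical, and it assumes the executable is named `tesseract`.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path to the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


tesseract_path = find_tesseract()
if tesseract_path is None:
    # pytesseract would fail here; install Tesseract or set
    # pytesseract.pytesseract.tesseract_cmd to the binary's location.
    print("Tesseract not found on PATH.")
else:
    print(f"Tesseract found at: {tesseract_path}")
```

If the check prints "not found" even though Tesseract is installed, the install directory is probably missing from PATH.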
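The binarization mentioned in the preprocessing step could be sketched with Pillow's `Image.point`. The helper name `binarize` and the fixed threshold of 128 are assumptions for illustration; real scans may need a different threshold or an adaptive method.

```python
from PIL import Image


def binarize(image, threshold=128):
    """Convert an image to grayscale, then map each pixel to pure black or white."""
    gray = image.convert("L")
    # Pixels brighter than the threshold become white (255), the rest black (0).
    return gray.point(lambda p: 255 if p > threshold else 0)


# Tiny synthetic example: one dark pixel next to one white pixel.
sample = Image.new("RGB", (2, 1), color="white")
sample.putpixel((0, 0), (40, 40, 40))
result = binarize(sample)
print(result.getpixel((0, 0)), result.getpixel((1, 0)))
```

The returned image stays in mode `L`, so it can be passed to `pytesseract.image_to_string` the same way as a plain grayscale image.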
Starter code
import pytesseract
from PIL import Image, ImageDraw, ImageFont
import re
import json
# --- OCR Pipeline Starter ---
# This script demonstrates a full OCR Pipeline: image creation, preprocessing, text extraction, and structured data parsing.
#
# PREREQUISITES:
# 1. Install Tesseract OCR on your system:
# - macOS: brew install tesseract
# - Debian/Ubuntu: sudo apt-get install tesseract-ocr
# - Windows: Download installer from https://github.com/UB-Mannheim/tesseract/wiki
# 2. Install Python libraries: pip install pytesseract Pillow
#
# NOTE: If Tesseract is not in your system's PATH, you may need to specify its location:
# pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
def create_sample_invoice_image():
    """Creates a sample image in memory to make the script self-contained."""
    img = Image.new('RGB', (450, 200), color='white')
    d = ImageDraw.Draw(img)
    try:
        font = ImageFont.truetype("arial.ttf", 16)
    except IOError:
        font = ImageFont.load_default()
    d.text((10, 10), "INVOICE #INV-00123", fill='black', font=font)
    d.text((10, 40), "Date: 2023-10-26", fill='black', font=font)
    d.text((10, 70), "Description: 1x AI Consulting", fill='black', font=font)
    d.text((10, 100), "Total Amount: $42.99", fill='black', font=font)
    d.text((10, 130), "Contact: support@example.com", fill='black', font=font)
    return img
# 1. GET IMAGE: Create our sample document
source_image = create_sample_invoice_image()
# 2. PREPROCESS: Convert to grayscale for better OCR performance
processed_image = source_image.convert('L')
# 3. EXTRACT: Run Tesseract OCR engine on the preprocessed image
print("Running OCR...")
custom_config = r'--oem 3 --psm 6'
raw_text = pytesseract.image_to_string(processed_image, config=custom_config)
print("\n--- Raw OCR Output ---")
print(raw_text)
# 4. POST-PROCESS: Parse raw text into structured data using regex
structured_data = {}
patterns = {
    'invoice_number': r'INVOICE #(\S+)',
    'total_amount': r'Total Amount: \$(\d+\.\d+)',
    'contact_email': r'Contact: (\S+@\S+)'
}
for key, pattern in patterns.items():
    match = re.search(pattern, raw_text)
    if match:
        structured_data[key] = match.group(1)
print("\n--- Structured Data Extracted ---")
print(json.dumps(structured_data, indent=2))
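The starter code hard-codes `--oem 3 --psm 6`. When experimenting with other page segmentation modes, a small builder can catch out-of-range values early. This is a sketch only: the helper name `build_tesseract_config` is hypothetical, and the ranges (0 to 13 for `--psm`, 0 to 3 for `--oem`) assume Tesseract 4 or later.

```python
def build_tesseract_config(psm=6, oem=3):
    """Build a config string for pytesseract from segmentation and engine modes."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    if not 0 <= oem <= 3:
        raise ValueError(f"oem must be between 0 and 3, got {oem}")
    return f"--oem {oem} --psm {psm}"


# Default values reproduce the string used in the starter code.
print(build_tesseract_config())
# psm 7 asks Tesseract to treat the image as a single text line.
print(build_tesseract_config(psm=7))
```

The result can be passed as the `config` argument of `pytesseract.image_to_string`.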
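OCR output often has inconsistent spacing or casing, so the exact-match patterns above can miss fields. The sketch below shows one possible way to loosen them and to convert the amount to a numeric type. The function name `parse_invoice_text`, the pattern names, and the sample text are illustrative assumptions, not part of the starter script.

```python
import re
from decimal import Decimal

# Tolerate extra whitespace and case differences that OCR may introduce.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"INVOICE\s*#\s*(\S+)", re.IGNORECASE),
    "total_amount": re.compile(r"Total\s+Amount:\s*\$\s*(\d+\.\d{2})", re.IGNORECASE),
    "contact_email": re.compile(r"Contact:\s*(\S+@\S+)", re.IGNORECASE),
}


def parse_invoice_text(raw_text):
    """Return a dict with every expected field; fields that did not match are None."""
    result = {}
    for key, pattern in FIELD_PATTERNS.items():
        match = pattern.search(raw_text)
        result[key] = match.group(1) if match else None
    if result["total_amount"] is not None:
        # Decimal avoids floating-point rounding on currency values.
        result["total_amount"] = Decimal(result["total_amount"])
    return result


sample_text = "INVOICE  #INV-00123\ntotal amount:  $ 42.99\n"
print(parse_invoice_text(sample_text))
```

Returning `None` for missing fields makes gaps explicit to downstream code. Note that `Decimal` values need to be converted (for example with `str`) before being passed to `json.dumps`.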