Paper·arxiv.org
machine-learning · evaluation · research · data-pipelines
The Character Error Vector: Decomposable errors for page-level OCR evaluation
Implement the Character Error Vector (CEV) for robust, page-level OCR evaluation. Unlike traditional CER, CEV provides decomposable errors, allowing granular analysis even with parsing inaccuracies. This leads to more targeted OCR model improvements and reliable document processing.
intermediate · 30 min · 5 steps
The play
- Understand CER's Limitations: Recognize why traditional Character Error Rate (CER) is inadequate for page-level OCR evaluation, especially when document parsing errors are present, as it becomes undefined or misleading.
- Adopt CEV for Granular Insight: Integrate the Character Error Vector (CEV) as your primary metric for assessing OCR quality at the page level, moving beyond simple aggregate error rates to gain deeper insights into error types and locations.
- Deconstruct Error Vectors: Analyze the decomposable errors provided by CEV. This includes understanding character-level substitutions, insertions, and deletions, as well as errors related to page parsing and structure, which CER cannot capture.
- Pinpoint OCR Weaknesses: Use the detailed breakdown from CEV to identify specific areas where your OCR model or document processing pipeline underperforms. This lets you separate character recognition failures from structural parsing issues.
- Iterate for Targeted Improvement: Apply insights derived from CEV to refine model training, adjust parsing logic, or improve pre/post-processing steps. Continuously monitor with CEV to ensure robust and accurate OCR system performance in real-world applications.
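The character-level part of the "Deconstruct Error Vectors" step can be sketched by backtracing a Levenshtein table and counting each edit type separately. This is a minimal illustration only: the function name `edit_operation_counts` is hypothetical, and the code is not the paper's CEV definition, which also covers parsing and structure errors that are not modelled here.

```python
from collections import Counter


def edit_operation_counts(reference: str, hypothesis: str) -> Counter:
    """Count substitutions, insertions and deletions on one minimal-cost alignment."""
    n, m = len(reference), len(hypothesis)

    # Standard Levenshtein table: dp[i][j] = distance between the first i
    # reference characters and the first j hypothesis characters.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, attributing each step to an edit type.
    counts = Counter(substitutions=0, insertions=0, deletions=0)
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                counts["substitutions"] += cost  # cost 0 means a match
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletions"] += 1  # reference character missing from the OCR output
            i -= 1
        else:
            counts["insertions"] += 1  # extra character in the OCR output
            j -= 1
    return counts


counts = edit_operation_counts("This is a test document.", "Thiz is atest document.")
print(dict(counts))
```

When several alignments share the minimal cost, the split between edit types can depend on the tie-breaking order in the backtrace; the sketch above prefers diagonal steps, then deletions, then insertions.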
Starter code
def calculate_cer(reference_text: str, ocr_output_text: str) -> float:
    """
    Calculates the Character Error Rate (CER) between a reference text and OCR output.
    This is a basic Levenshtein distance-based CER: edit distance / reference length.
    """
    if not reference_text and not ocr_output_text:
        return 0.0
    if not reference_text:
        # CER is undefined for an empty reference (division by zero).
        # As a convention here, report the raw insertion count instead.
        return float(len(ocr_output_text))
    if not ocr_output_text:
        return 1.0  # All deletions: len(reference_text) / len(reference_text)
    # Levenshtein distance calculation
    len_ref = len(reference_text)
    len_ocr = len(ocr_output_text)
    dp = [[0] * (len_ocr + 1) for _ in range(len_ref + 1)]
    for i in range(len_ref + 1):
        dp[i][0] = i
    for j in range(len_ocr + 1):
        dp[0][j] = j
    for i in range(1, len_ref + 1):
        for j in range(1, len_ocr + 1):
            cost = 0 if reference_text[i - 1] == ocr_output_text[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # Deletion
                           dp[i][j - 1] + 1,         # Insertion
                           dp[i - 1][j - 1] + cost)  # Substitution (or match when cost is 0)
    distance = dp[len_ref][len_ocr]
    return distance / len_ref
# Example Usage:
ref_text = "This is a test document."
ocr_text = "Thiz is atest document."
cer_value = calculate_cer(ref_text, ocr_text)
print(f"Traditional CER: {cer_value:.4f}")
# Note: The Character Error Vector (CEV) provides a more advanced, decomposable
# error analysis, addressing limitations of traditional CER, especially for
# page-level OCR with parsing errors. Its implementation would involve a more
# complex algorithm to categorize and localize different error types beyond this basic CER.
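
To make the point about parsing errors concrete, here is a small sketch of how a single page-level CER can mislead when the parser returns text regions in the wrong order. The region strings and the helper name `levenshtein` are made up for illustration, and matching each reference region to its closest OCR region is a simplifying assumption, not the paper's CEV procedure.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


# Hypothetical page with two regions. Every character is recognized
# correctly, but the parser emits the regions in swapped order.
reference_regions = ["Title of the page", "Body text of the page."]
ocr_regions = ["Body text of the page.", "Title of the page"]

# Page-level CER on the concatenated text penalizes the reordering heavily.
page_ref = "\n".join(reference_regions)
page_ocr = "\n".join(ocr_regions)
page_cer = levenshtein(page_ref, page_ocr) / len(page_ref)

# Region-level CER, matching each reference region to its closest OCR region,
# shows that character recognition itself was perfect.
region_cers = [
    min(levenshtein(ref, ocr) for ocr in ocr_regions) / len(ref)
    for ref in reference_regions
]

print(f"Page-level CER: {page_cer:.4f}")
print(f"Per-region CERs: {region_cers}")
```

The page-level number mixes a structural error (reading order) with recognition quality, while the per-region numbers isolate recognition. Separating these sources of error is the kind of decomposition the CEV is meant to provide.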