The Character Error Vector: Decomposable errors for page-level OCR evaluation

Implement the Character Error Vector (CEV) for robust, page-level OCR evaluation. Unlike the traditional Character Error Rate (CER), CEV provides decomposable errors, so a page can still be analyzed at a granular level when document parsing is inaccurate. That breakdown supports more targeted OCR model improvements and more reliable document processing.

intermediate · 30 min · 5 steps

The play

Understand CER's Limitations
Recognize why traditional Character Error Rate (CER) is inadequate for page-level OCR evaluation, especially when document parsing errors are present, as it becomes undefined or misleading.
Adopt CEV for Granular Insight
Integrate the Character Error Vector (CEV) as your primary metric for assessing OCR quality at the page level, moving beyond simple aggregate error rates to gain deeper insights into error types and locations.
Deconstruct Error Vectors
Analyze the decomposable errors provided by CEV. This includes understanding character-level substitutions, insertions, and deletions, as well as errors related to page parsing and structure, which CER cannot capture.
Pinpoint OCR Weaknesses
Use the detailed breakdown from CEV to identify specific areas where your OCR model or document processing pipeline underperforms. This allows for precise identification of character recognition failures versus structural parsing issues.
Iterate for Targeted Improvement
Apply insights derived from CEV to refine model training, adjust parsing logic, or improve pre/post-processing steps. Continuously monitor with CEV to ensure robust and accurate OCR system performance in real-world applications.
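
To make the distinction between recognition failures and parsing failures concrete, here is a minimal sketch. It is not the CEV definition from the paper. It assumes each page has already been parsed into text blocks keyed by a block identifier, which is a simplification invented for this example, and the names `PageErrors`, `split_page_errors`, and `edit_distance` are hypothetical. Errors inside blocks found on both sides count as recognition errors; characters in blocks found on only one side count as parsing errors.

```python
from dataclasses import dataclass


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, keeping only one previous row in memory."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution or match
        prev = curr
    return prev[-1]


@dataclass
class PageErrors:
    recognition: int  # character edits inside blocks present on both sides
    missed: int       # characters in reference blocks the parser did not return
    spurious: int     # characters in OCR blocks with no reference counterpart


def split_page_errors(ref_blocks: dict[str, str],
                      ocr_blocks: dict[str, str]) -> PageErrors:
    """Separate character-level errors from block-level (parsing) errors.

    Assumption: block ids are comparable across reference and OCR output.
    """
    shared = ref_blocks.keys() & ocr_blocks.keys()
    recognition = sum(edit_distance(ref_blocks[k], ocr_blocks[k]) for k in shared)
    missed = sum(len(t) for k, t in ref_blocks.items() if k not in shared)
    spurious = sum(len(t) for k, t in ocr_blocks.items() if k not in shared)
    return PageErrors(recognition, missed, spurious)


# Made-up page: one misread character, one dropped character,
# one missing block ("footer") and one extra block ("stamp").
ref_page = {"title": "Invoice 2024", "body": "Total due: 120.50", "footer": "Page 1"}
ocr_page = {"title": "Invoice 2O24", "body": "Total due: 12050", "stamp": "PAID"}
errors = split_page_errors(ref_page, ocr_page)
print(errors)
```

On this made-up page the result is 2 recognition errors, 6 missed characters, and 4 spurious characters. A single CER value would fold all three into one number, while keeping them separate shows whether the recognizer or the parser needs attention.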

Starter code

def calculate_cer(reference_text: str, ocr_output_text: str) -> float:
    """
    Calculates the Character Error Rate (CER) between a reference text and OCR output.
    This is a basic Levenshtein distance-based CER.
    """
    if not reference_text and not ocr_output_text:
        return 0.0
    if not reference_text:
        # CER divides by the reference length, so it is undefined here;
        # report infinity rather than a raw character count.
        return float("inf")
    if not ocr_output_text:
        return 1.0  # Every reference character was deleted

    # Levenshtein distance calculation
    len_ref = len(reference_text)
    len_ocr = len(ocr_output_text)

    dp = [[0] * (len_ocr + 1) for _ in range(len_ref + 1)]

    for i in range(len_ref + 1):
        dp[i][0] = i
    for j in range(len_ocr + 1):
        dp[0][j] = j

    for i in range(1, len_ref + 1):
        for j in range(1, len_ocr + 1):
            cost = 0 if reference_text[i-1] == ocr_output_text[j-1] else 1
            dp[i][j] = min(dp[i-1][j] + 1,      # Deletion
                           dp[i][j-1] + 1,      # Insertion
                           dp[i-1][j-1] + cost) # Substitution

    distance = dp[len_ref][len_ocr]
    return distance / len_ref

# Example Usage:
ref_text = "This is a test document."
ocr_text = "Thiz is atest document."
cer_value = calculate_cer(ref_text, ocr_text)
print(f"Traditional CER: {cer_value:.4f}")

# Note: The Character Error Vector (CEV) provides a more advanced, decomposable
# error analysis, addressing limitations of traditional CER, especially for
# page-level OCR with parsing errors. Its implementation would involve a more
# complex algorithm to categorize and localize different error types beyond this basic CER.
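
As a first step toward the decomposition the note above describes, the same dynamic-programming table can be traced back to count substitutions, insertions, and deletions separately. This is a minimal sketch, not the CEV algorithm from the paper: the function name `edit_operation_counts` is hypothetical, and it covers only character-level operations, not parsing or structural errors.

```python
from collections import Counter


def edit_operation_counts(reference_text: str, ocr_output_text: str) -> Counter:
    """Count substitutions, insertions, and deletions along one optimal alignment."""
    n, m = len(reference_text), len(ocr_output_text)

    # Same Levenshtein table as in calculate_cer.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference_text[i - 1] == ocr_output_text[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, preferring the diagonal move.
    counts = Counter(substitutions=0, insertions=0, deletions=0)
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            cost = 0 if reference_text[i - 1] == ocr_output_text[j - 1] else 1
            if dp[i][j] == dp[i - 1][j - 1] + cost:
                counts["substitutions"] += cost  # adds 0 for a match
                i, j = i - 1, j - 1
                continue
        if i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            counts["deletions"] += 1   # reference character missing from OCR
            i -= 1
        else:
            counts["insertions"] += 1  # extra character in OCR output
            j -= 1
    return counts


counts = edit_operation_counts("This is a test document.", "Thiz is atest document.")
print(dict(counts))
```

For the example strings this reports one substitution ("s" read as "z") and one deletion (the dropped space). When several optimal alignments exist, the split between operation types can depend on the tie-breaking order in the traceback, so treat the counts as one valid decomposition, not the only one.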

Source

Paper: arxiv.org