Clean Raw Text for LLMs with a Python Script

Use this data cleaning script to preprocess text for fine-tuning or RAG. It strips HTML, normalizes whitespace, removes exact duplicate lines, and drops lines that are too short or too long. Optionally, it also repairs encoding errors with ftfy. Cleaner input generally means a better dataset and, in turn, better model behavior.

Beginner, 15 min, 4 steps

The play

Install and Run Initial Cleanup
Get the script and its main dependency, BeautifulSoup. Run a basic cleanup to remove HTML tags and standardize inconsistent whitespace. This is the essential first pass for any raw text scraped from the web.
Deduplicate Your Dataset
Pass the --deduplicate flag to drop lines that are identical after cleaning, so the model does not overfit on repeated data. The flag catches exact matches only. For near-duplicate detection, consider integrating a library like 'text-dedup'.
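To make the idea of near-duplicate detection concrete, here is a minimal standard-library sketch based on word shingles and Jaccard similarity. It is not how 'text-dedup' works internally and is not part of data_cleaner.py. The function names and the 0.8 threshold are assumptions chosen for illustration.

```python
import re

def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def drop_near_duplicates(lines, threshold=0.8):
    """Keep a line unless it is too similar to a line that was already kept.

    Every candidate is compared with every kept line, so this is O(n^2)
    and only practical for small datasets.
    """
    kept, kept_shingles = [], []
    for line in lines:
        current = shingles(line)
        if any(jaccard(current, previous) >= threshold for previous in kept_shingles):
            continue
        kept.append(line)
        kept_shingles.append(current)
    return kept

docs = [
    "The quick brown fox jumps over the lazy dog near the river bank.",
    "The quick brown fox jumps over the lazy dog near the river bank today.",
    "A completely different sentence about cleaning text for training.",
]
result = drop_near_duplicates(docs)
print(result)  # the second line is dropped as a near-duplicate of the first
```

The pairwise comparison keeps the sketch short. For large corpora, approximate methods such as MinHash avoid comparing every pair, which is the kind of work a dedicated library handles.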
Filter by Quality Heuristics
Filter out documents that are too short or too long to be useful. This helps remove boilerplate, error messages, and overly verbose documents that can harm training quality. Set --min-length and --max-length (measured in characters after cleaning; the defaults are 10 and 100000) to fit your specific data.
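One way to choose length thresholds is to look at how line lengths are distributed in your own data first. The sketch below is a hypothetical helper, not part of data_cleaner.py. It reports the 5th, 50th, and 95th percentiles of line length using only the standard library.

```python
import statistics

def length_percentiles(lines):
    """Return the 5th, 50th, and 95th percentile of line lengths in characters.

    Blank lines are ignored. At least two non-blank lines are required.
    """
    lengths = [len(line.strip()) for line in lines if line.strip()]
    # quantiles(n=20) returns 19 cut points at 5%, 10%, ..., 95%.
    cuts = statistics.quantiles(lengths, n=20, method="inclusive")
    return {"p5": cuts[0], "p50": cuts[9], "p95": cuts[18]}

# Illustrative data only; in practice, pass the lines of your own file.
sample = ["short", "a medium length line of text", "x" * 500]
print(length_percentiles(sample))
```

A value near the 5th percentile is a reasonable starting point for the minimum length and one near the 95th for the maximum. Treat both as starting points to inspect by hand, not as fixed rules.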
Fix Common Encoding Errors
Install the 'ftfy' library (pip install ftfy) and pass the --fix-encoding flag to fix mojibake and other text encoding issues (e.g., 'â€™' becomes '’'). Clean encoding keeps garbled byte sequences out of the tokenizer, so the LLM sees the characters the author actually wrote.
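The mojibake in that example comes from UTF-8 bytes being decoded with the wrong codec. The sketch below reproduces and reverses that single mistake with the standard library. It only illustrates the mechanism and is not how ftfy is implemented. ftfy detects such cases automatically and handles many more of them.

```python
original = "It\u2019s"  # "It’s", with U+2019 RIGHT SINGLE QUOTATION MARK

# The mistake: encode as UTF-8, then decode the bytes as Windows-1252.
garbled = original.encode("utf-8").decode("cp1252")

# The repair: re-encode with the wrong codec, then decode as UTF-8.
repaired = garbled.encode("cp1252").decode("utf-8")

print(garbled)   # Itâ€™s
print(repaired)  # It’s
```

The manual round trip works only when you already know which wrong codec was used, and it raises an error on text that was never garbled in that way. That is why a detection-based tool is the safer choice for mixed data.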

Starter code

#!/bin/bash

# This script sets up a minimal environment, creates the Data Cleaning Script,
# prepares a sample raw data file, runs the cleaning process, and shows the result.

# 1. Install dependencies
if ! python -c "import bs4" &> /dev/null; then
    echo "Installing BeautifulSoup..."
    # Use "python -m pip" so the package is installed for the same interpreter checked above
    python -m pip install beautifulsoup4
fi

# 2. Create the Data Cleaning Script python file
cat << 'EOF' > data_cleaner.py
import argparse
import re
import sys
from bs4 import BeautifulSoup

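_ftfy_warned = False  # Ensures the missing-ftfy warning is printed only once

def clean_text(text, fix_encoding=False):
    global _ftfy_warned
    # Fix encoding issues with ftfy if available
    if fix_encoding:
        try:
            import ftfy
            text = ftfy.fix_text(text)
        except ImportError:
            # Warn once instead of once per input line
            if not _ftfy_warned:
                print("Warning: 'ftfy' not installed. Skipping encoding fix. Run 'pip install ftfy'", file=sys.stderr)
                _ftfy_warned = True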

    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()

    return text

def main():
    parser = argparse.ArgumentParser(description="Cleans and normalizes text data for LLM consumption.")
    parser.add_argument("input_file", help="Path to the input text file (one document per line).")
    parser.add_argument("output_file", help="Path to save the cleaned text file.")
    parser.add_argument("--min-length", type=int, default=10, help="Minimum character length for a line to be kept.")
    parser.add_argument("--max-length", type=int, default=100000, help="Maximum character length for a line to be kept.")
    parser.add_argument("--deduplicate", action="store_true", help="Remove duplicate lines.")
    parser.add_argument("--fix-encoding", action="store_true", help="Fix common text encoding issues using ftfy.")
    args = parser.parse_args()

    seen_lines = set()
    with open(args.input_file, 'r', encoding='utf-8') as infile, open(args.output_file, 'w', encoding='utf-8') as outfile:
        for line in infile:
            cleaned = clean_text(line, args.fix_encoding)

            if not (args.min_length <= len(cleaned) <= args.max_length):
                continue

            if args.deduplicate:
                if cleaned in seen_lines:
                    continue
                seen_lines.add(cleaned)

            outfile.write(cleaned + '\n')

if __name__ == "__main__":
    main()

EOF

# 3. Create a sample raw data file
cat << 'EOF' > raw_data.txt
<p>This is the first sentence.   It has extra spaces.</p>
<div>This is the second sentence.</div> It also has HTML.
This is a very short line.
This is the first sentence.   It has extra spaces.
This line has a weird character: â€™
EOF

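# 4. Run the script
# --min-length 30 drops the 26-character "very short" sample line;
# --deduplicate drops the repeated first sentence.
echo "--- Running Data Cleaning Script ---"
python data_cleaner.py raw_data.txt cleaned_data.txt --min-length 30 --deduplicate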

# 5. Show the cleaned output
# printf interprets \n; bash's builtin echo would print it literally
printf '\n--- Cleaned Data (cleaned_data.txt) ---\n'
cat cleaned_data.txt