Article
Tags: data-cleaning, python, script, nlp, data-preprocessing, llm, rag, fine-tuning
Clean Raw Text for LLMs with a Python Script
Use this data cleaning script to preprocess text for fine-tuning or RAG. It strips HTML, normalizes whitespace, removes exact duplicate lines, and filters out lines that are too short or too long. Cleaner input improves dataset quality and typically helps downstream model performance.
beginner · 15 min · 4 steps
The play
- Install and Run Initial Cleanup: Get the script and its main dependency, BeautifulSoup. Run a basic cleanup to remove HTML tags and collapse inconsistent whitespace. This is the essential first pass for any raw text scraped from the web.
- Deduplicate Your Dataset: Remove identical lines so the model does not overfit on repeated data. Use the --deduplicate flag for this; it catches only lines that are exactly equal after cleaning. For near-duplicate detection, consider integrating a library like 'text-dedup'.
- Filter by Quality Heuristics: Filter out documents that are too short or too long to be useful, using --min-length and --max-length (both measured in characters). This helps remove boilerplate, error messages, and overly verbose documents that can harm training quality. Adjust both limits to fit your specific data.
- Fix Common Encoding Errors: Install the 'ftfy' library and pass --fix-encoding to fix mojibake and other text encoding issues (e.g., 'â€™' becomes '’'). Clean encoding keeps garbled characters away from the tokenizer, so the LLM sees the intended text.
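To make the encoding step concrete, here is a minimal stdlib-only sketch of how the 'â€™' style of mojibake arises and why it is reversible. It is an illustration, not a description of how ftfy works internally; the helper names `garble` and `repair` are hypothetical, and the sketch assumes the text was UTF-8 misread as Windows-1252, which is only one of the cases ftfy handles.

```python
def garble(text: str) -> str:
    """Simulate mojibake: UTF-8 bytes misread as Windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Undo that one specific misreading (assumes the wrong codec is known)."""
    return text.encode("cp1252").decode("utf-8")


original = "It’s clean"
broken = garble(original)
print(broken)          # Itâ€™s clean
print(repair(broken))  # It’s clean
```

In real data the wrong codec is usually unknown and can vary from line to line, which is the reason to reach for a dedicated library rather than a fixed round trip like this.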
Starter code
#!/bin/bash
# This script sets up a minimal environment, creates the Data Cleaning Script,
# prepares a sample raw data file, runs the cleaning process, and shows the result.
# 1. Install dependencies
if ! python -c "import bs4" &> /dev/null; then
    echo "Installing BeautifulSoup..."
    pip install beautifulsoup4
fi
# 2. Create the Data Cleaning Script python file
cat << 'EOF' > data_cleaner.py
import argparse
import re
import sys

from bs4 import BeautifulSoup


def clean_text(text, fix_encoding=False):
    # Fix encoding issues with ftfy if available
    if fix_encoding:
        try:
            import ftfy
            text = ftfy.fix_text(text)
        except ImportError:
            print("Warning: 'ftfy' not installed. Skipping encoding fix. Run 'pip install ftfy'", file=sys.stderr)
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text


def main():
    parser = argparse.ArgumentParser(description="Cleans and normalizes text data for LLM consumption.")
    parser.add_argument("input_file", help="Path to the input text file (one document per line).")
    parser.add_argument("output_file", help="Path to save the cleaned text file.")
    parser.add_argument("--min-length", type=int, default=10, help="Minimum character length for a line to be kept.")
    parser.add_argument("--max-length", type=int, default=100000, help="Maximum character length for a line to be kept.")
    parser.add_argument("--deduplicate", action="store_true", help="Remove duplicate lines.")
    parser.add_argument("--fix-encoding", action="store_true", help="Fix common text encoding issues using ftfy.")
    args = parser.parse_args()

    # Cleaned lines already written, used for exact-match deduplication
    seen_lines = set()
    with open(args.input_file, 'r', encoding='utf-8') as infile, open(args.output_file, 'w', encoding='utf-8') as outfile:
        for line in infile:
            cleaned = clean_text(line, args.fix_encoding)
            # Length filter runs on the cleaned text, not the raw line
            if not (args.min_length <= len(cleaned) <= args.max_length):
                continue
            if args.deduplicate:
                if cleaned in seen_lines:
                    continue
                seen_lines.add(cleaned)
            outfile.write(cleaned + '\n')


if __name__ == "__main__":
    main()
EOF
# 3. Create a sample raw data file
cat << 'EOF' > raw_data.txt
<p>This   is the first sentence.    It has   extra   spaces.</p>
<div>This is the second sentence.</div> It also has HTML.
This is a very short line.
This is the first sentence. It has extra spaces.
This line has a weird character: â€™
EOF
# 4. Run the script (the 26-character "very short line" falls below --min-length 30)
echo "--- Running Data Cleaning Script ---"
python data_cleaner.py raw_data.txt cleaned_data.txt --min-length 30 --deduplicate
# 5. Show the cleaned output
printf '\n--- Cleaned Data (cleaned_data.txt) ---\n'
cat cleaned_data.txt
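The --deduplicate flag only removes lines that are identical after cleaning. For the near-duplicate detection mentioned in the deduplication step, the following is a minimal sketch using word shingles and Jaccard similarity. It does not use 'text-dedup' or reproduce its API; the function names, the shingle size of 3, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of lowercase word n-grams (shingles) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < n:
        # Too short for a full shingle: treat the whole line as one unit
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(lines, threshold=0.8, n=3):
    """Keep the first line of each group whose similarity reaches threshold."""
    kept = []  # (line, shingle set) pairs for lines kept so far
    for line in lines:
        current = shingles(line, n)
        if any(jaccard(current, previous) >= threshold for _, previous in kept):
            continue
        kept.append((line, current))
    return [line for line, _ in kept]


docs = [
    "The quick brown fox jumps over the lazy dog near the river bank today",
    "The quick brown fox jumps over the lazy dog near the river bank yesterday",
    "A completely different sentence about cleaning text for language models",
]
result = drop_near_duplicates(docs)
print(len(result))  # 2
```

The second document shares 11 of 13 distinct shingles with the first (similarity about 0.85), so it is dropped. This comparison is quadratic in the number of kept lines, so it suits small datasets; larger corpora generally need an approximate method such as MinHash with locality-sensitive hashing.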