Paper·arxiv.org
llm · research · machine-learning · fine-tuning · data-pipelines
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts
Improve LLM factual recall and reduce hallucinations by pruning training data. The technique adjusts the training data distribution, prioritizing quality and relevance over volume, with the goal of more reliable models for knowledge-intensive tasks.
intermediate · 1 hour · 6 steps
The play
- Acknowledge LLM Factual Gaps: Recognize that Large Language Models frequently struggle with factual memorization, leading to hallucinations and suboptimal performance on knowledge-intensive tasks.
- Prioritize Data Quality: Shift your focus from simply scaling up training data volume to carefully selecting and curating high-quality, relevant data for LLM training.
- Implement Data Pruning Strategies: Apply techniques to identify and remove redundant, noisy, or less informative data points from your training datasets that might hinder factual memorization.
- Optimize Data Distribution: Analyze and adjust the distribution of your training data to ensure critical factual knowledge is adequately represented and retained by the model.
- Evaluate Factual Recall: Develop and use specific evaluation metrics to measure improvements in factual memorization and reductions in hallucinations after implementing data pruning.
- Integrate into Data Pipelines: Incorporate data pruning and quality assessment as standard steps within your LLM training data pipelines for continuous optimization and improved model reliability.
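The pruning step can be illustrated with a minimal sketch of redundancy removal. This is not the paper's method: it assumes a greedy pass that drops any document whose cosine similarity to an already kept document reaches a threshold, and it uses bag-of-words counts as a crude stand-in for a real sentence embedding so that it runs with the standard library only. The function names (`bow_vector`, `cosine`, `prune_redundant`) are hypothetical.

```python
import math
import re
from collections import Counter


def bow_vector(text: str) -> Counter:
    """Lowercased bag-of-words counts (assumed stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def prune_redundant(docs: list[str], threshold: float = 0.95) -> list[str]:
    """Greedily keep a document only if it is below the similarity
    threshold against every document kept so far."""
    kept: list[str] = []
    kept_vecs: list[Counter] = []
    for doc in docs:
        vec = bow_vector(doc)
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(doc)
            kept_vecs.append(vec)
    return kept


docs = [
    "Paris is the capital of France.",
    "paris is the capital of france",   # near-duplicate of the first entry
    "The Nile flows through Egypt.",
]
pruned = prune_redundant(docs)
```

The greedy pass compares each document against everything kept so far, so it is quadratic in the number of kept documents; a real pipeline would likely swap in dense embeddings and an approximate nearest-neighbor index.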
Starter code
{
  "data_pruning_config": {
    "strategy_name": "factual_memorization_optimization",
    "filters": [
      {
        "type": "semantic_redundancy",
        "threshold": 0.95,
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "scope": "document_level"
      },
      {
        "type": "source_credibility",
        "whitelist": ["arxiv.org", "ncbi.nlm.nih.gov", "wikipedia.org"],
        "blacklist": ["unverified_blogs", "social_media_posts"]
      },
      {
        "type": "information_density",
        "min_factual_statements_per_100_tokens": 5,
        "nlp_model_for_fact_extraction": "spacy_fact_extractor"
      }
    ],
    "post_pruning_analysis": {
      "sample_size": 0.05,
      "metrics": ["factual_recall_rate", "hallucination_score"]
    }
  }
}
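One possible reading of the `post_pruning_analysis` block is sketched here. It assumes that `sample_size` is a fraction of a held-out set of question and answer probes, and that `factual_recall_rate` is the share of probes whose expected answer appears in the model's reply. Both interpretations, the helper names, and the toy `answer_fn` are assumptions made for illustration; the config itself does not define them.

```python
import random
from typing import Callable


def sample_probes(
    probes: list[tuple[str, str]],
    fraction: float = 0.05,
    seed: int = 0,
) -> list[tuple[str, str]]:
    """Draw a reproducible random subset of (question, expected_answer) probes."""
    rng = random.Random(seed)
    k = max(1, round(len(probes) * fraction))
    return rng.sample(probes, min(k, len(probes)))


def factual_recall_rate(
    probes: list[tuple[str, str]],
    answer_fn: Callable[[str], str],
) -> float:
    """Fraction of probes whose expected answer occurs (case-insensitively)
    as a substring of the model's reply."""
    if not probes:
        return 0.0
    hits = sum(
        1
        for question, expected in probes
        if expected.lower() in answer_fn(question).lower()
    )
    return hits / len(probes)


# Toy stand-in for a model: a lookup table with one wrong answer.
toy_answers = {
    "Capital of France?": "The capital is Paris.",
    "Capital of Japan?": "Tokyo",
    "Capital of Egypt?": "Cairo is the capital.",
    "Capital of Peru?": "I believe it is Quito.",
}
toy_probes = [
    ("Capital of France?", "Paris"),
    ("Capital of Japan?", "Tokyo"),
    ("Capital of Egypt?", "Cairo"),
    ("Capital of Peru?", "Lima"),
]
rate = factual_recall_rate(toy_probes, lambda q: toy_answers[q])
```

Substring matching is deliberately simple and will miscount paraphrased or partially correct answers. Comparing the rate for models trained on pruned and unpruned data, with the same probes and seed, is the comparison this metric is meant to support.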