Paper·arxiv.org
llm · research · machine-learning · fine-tuning · data-pipelines

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

Improve LLM factual recall and reduce hallucinations by pruning training data. Pruning reshapes the data distribution to favor quality and relevance over sheer volume, which makes models more reliable on knowledge-intensive tasks.

intermediate · 1 hour · 6 steps
The play
  1. Acknowledge LLM Factual Gaps
    Recognize that Large Language Models frequently struggle with factual memorization, leading to hallucinations and suboptimal performance on knowledge-intensive tasks.
  2. Prioritize Data Quality
    Shift your focus from simply scaling up training data volume to meticulously selecting and curating high-quality, relevant data for LLM training.
  3. Implement Data Pruning Strategies
    Identify and remove redundant, noisy, or low-information data points from your training datasets that may hinder factual memorization.
  4. Optimize Data Distribution
    Analyze and adjust the distribution of your training data to ensure critical factual knowledge is adequately represented and retained by the model.
  5. Evaluate Factual Recall
    Define and apply targeted evaluation metrics to measure gains in factual memorization and reductions in hallucinations after data pruning.
  6. Integrate into Data Pipelines
    Incorporate data pruning and quality assessment as standard steps within your LLM training data pipelines for continuous optimization and improved model reliability.
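Steps 3 and 4 can be sketched in a few lines of Python. This is a minimal illustration and not the paper's procedure: it assumes each training record is a dict with hypothetical `text` and `fact_id` fields, treats records whose normalized text matches as redundant, and flattens the distribution by capping how many records may express the same fact. The cap value `max_per_fact` is likewise an illustrative parameter.

```python
import hashlib
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace so trivial variants compare equal."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def drop_exact_duplicates(records):
    """Step 3 (sketch): keep only the first record for each normalized text."""
    seen = set()
    kept = []
    for rec in records:
        digest = hashlib.sha256(normalize(rec["text"]).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(rec)
    return kept


def cap_per_fact(records, max_per_fact):
    """Step 4 (sketch): keep at most max_per_fact records per fact_id.

    fact_id is a hypothetical label; real data would need a fact-linking step.
    """
    counts = Counter()
    kept = []
    for rec in records:
        if counts[rec["fact_id"]] < max_per_fact:
            counts[rec["fact_id"]] += 1
            kept.append(rec)
    return kept


# Toy records: one fact is heavily over-represented, another appears once.
records = [
    {"fact_id": "paris_capital", "text": "Paris is the capital of France."},
    {"fact_id": "paris_capital", "text": "paris is the capital of  france"},
    {"fact_id": "paris_capital", "text": "The capital of France is Paris."},
    {"fact_id": "paris_capital", "text": "France's capital city is Paris."},
    {"fact_id": "canberra_capital", "text": "Canberra is the capital of Australia."},
]

deduped = drop_exact_duplicates(records)
pruned = cap_per_fact(deduped, max_per_fact=2)
fact_counts = Counter(rec["fact_id"] for rec in pruned)
```

Capping is only one way to rebalance a skewed fact distribution; a production pipeline would likely pair it with near-duplicate detection, since exact matching on normalized text misses paraphrases.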
Starter code
{
  "data_pruning_config": {
    "strategy_name": "factual_memorization_optimization",
    "filters": [
      {
        "type": "semantic_redundancy",
        "threshold": 0.95,
        "embedding_model": "sentence-transformers/all-MiniLM-L6-v2",
        "scope": "document_level"
      },
      {
        "type": "source_credibility",
        "whitelist": ["arxiv.org", "ncbi.nlm.nih.gov", "wikipedia.org"],
        "blacklist": ["unverified_blogs", "social_media_posts"]
      },
      {
        "type": "information_density",
        "min_factual_statements_per_100_tokens": 5,
        "nlp_model_for_fact_extraction": "spacy_fact_extractor"
      }
    ],
    "post_pruning_analysis": {
      "sample_size": 0.05,
      "metrics": ["factual_recall_rate", "hallucination_score"]
    }
  }
}
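One way such a config could be consumed is sketched below, under several assumptions: documents are dicts with hypothetical `id`, `url`, and `embedding` fields, the embeddings are precomputed toy vectors (no embedding model is loaded here), and only the `semantic_redundancy` and `source_credibility` filters are implemented. The `factual_recall_rate` function is an assumed exact-match definition of the metric named in `post_pruning_analysis`; the source does not define it.

```python
import json
import math
from urllib.parse import urlparse

# A trimmed copy of the starter config, limited to the filters this sketch handles.
CONFIG = json.loads("""
{
  "filters": [
    {"type": "semantic_redundancy", "threshold": 0.95},
    {"type": "source_credibility",
     "whitelist": ["arxiv.org", "ncbi.nlm.nih.gov", "wikipedia.org"]}
  ]
}
""")


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def semantic_redundancy(docs, spec):
    """Greedily keep a doc only if it is below the threshold against every kept doc."""
    kept = []
    for doc in docs:
        if all(cosine(doc["embedding"], k["embedding"]) < spec["threshold"] for k in kept):
            kept.append(doc)
    return kept


def source_credibility(docs, spec):
    """Keep docs whose host is a whitelisted domain or a subdomain of one."""
    def allowed(url):
        host = urlparse(url).hostname or ""
        return any(host == d or host.endswith("." + d) for d in spec["whitelist"])
    return [doc for doc in docs if allowed(doc["url"])]


FILTERS = {
    "semantic_redundancy": semantic_redundancy,
    "source_credibility": source_credibility,
}


def apply_filters(docs, config):
    """Run the configured filters in order, feeding each one's output to the next."""
    for spec in config["filters"]:
        docs = FILTERS[spec["type"]](docs, spec)
    return docs


def factual_recall_rate(predictions, gold):
    """Assumed metric: share of gold questions answered with a case-insensitive exact match."""
    hits = sum(predictions.get(q, "").strip().lower() == a.strip().lower()
               for q, a in gold.items())
    return hits / len(gold) if gold else 0.0


# Toy documents with placeholder URLs and hand-written embeddings.
docs = [
    {"id": "a", "url": "https://arxiv.org/abs/0000.00000", "embedding": [1.0, 0.0, 0.0]},
    {"id": "b", "url": "https://en.wikipedia.org/wiki/Example", "embedding": [0.99, 0.05, 0.0]},
    {"id": "c", "url": "https://example.com/post", "embedding": [0.0, 1.0, 0.0]},
    {"id": "d", "url": "https://www.ncbi.nlm.nih.gov/", "embedding": [0.0, 0.0, 1.0]},
]
kept_ids = [doc["id"] for doc in apply_filters(docs, CONFIG)]

gold = {"capital of France?": "Paris", "capital of Australia?": "Canberra"}
predictions = {"capital of France?": "paris", "capital of Australia?": "Sydney"}
recall = factual_recall_rate(predictions, gold)
```

In this toy run, document `b` is dropped as a near-duplicate of `a`, and `c` is dropped because its host is not whitelisted. The greedy pass compares each document against everything already kept, which is quadratic in the worst case; at corpus scale an approximate nearest-neighbor index would be the more practical choice.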
Source
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts — Action Pack