Fine-Tune a Model with Hugging Face Transformers Training Script

Use the Hugging Face Transformers `Trainer` API to fine-tune a pre-trained model for text classification. This pack guides you through preparing a dataset, configuring training arguments, and launching the fine-tuning process.

intermediate · 30 min · 5 steps

The play

Install Dependencies
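Install the `transformers`, `datasets`, and `accelerate` libraries. `transformers` provides the models and the `Trainer` API, `datasets` loads and preprocesses the data, and `accelerate` handles device placement and distributed training for the `Trainer`. The starter code also imports `torch`, so PyTorch must be installed as well.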
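
After running something like `pip install transformers datasets accelerate`, it can help to confirm what actually ended up in the environment. The sketch below uses only the standard library; the helper name `installed_versions` is hypothetical, and `torch` is included on the assumption that PyTorch is the backend in use.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    # torch is listed on the assumption that PyTorch is the backend in use.
    wanted = ["transformers", "datasets", "accelerate", "torch"]
    for name, ver in installed_versions(wanted).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` needs to be installed before running the starter code.
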
Load and Tokenize Data
Load a pre-trained model's tokenizer and a dataset. Here, we use `distilbert-base-uncased` and the `imdb` dataset. Create a function to tokenize the text and apply it to the entire dataset using `.map()`.
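
To make the effect of `padding="max_length"` and `truncation=True` concrete, here is a toy illustration of what those two options do to a single sequence of token IDs. It is not the tokenizer's implementation: `pad_or_truncate` is a hypothetical helper, the token IDs are made up, `pad_id=0` is an illustrative default, and a real tokenizer also preserves special tokens when it truncates.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    kept = token_ids[:max_length]          # mimics truncation=True
    n_pad = max_length - len(kept)
    input_ids = kept + [pad_id] * n_pad    # mimics padding="max_length"
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [1] * len(kept) + [0] * n_pad
    return input_ids, attention_mask


short_example = pad_or_truncate([5, 6, 7], max_length=6)
long_example = pad_or_truncate([5, 6, 7, 8, 9, 10, 11], max_length=6)
print(short_example)  # ([5, 6, 7, 0, 0, 0], [1, 1, 1, 0, 0, 0])
print(long_example)   # ([5, 6, 7, 8, 9, 10], [1, 1, 1, 1, 1, 1])
```

Because every example is padded to the same fixed length, short reviews carry a lot of padding. Padding each batch only to its longest member (for instance with `DataCollatorWithPadding`) is a common alternative, at the cost of a slightly more involved setup.
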
Prepare Model and Datasets
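Load the pre-trained model for sequence classification with two labels (`num_labels=2`), one per sentiment class. Then shuffle the tokenized splits and select 100 examples each for training and evaluation. Subsets this small keep the run short, which is useful for quick testing and iteration, but they are not meant to produce a well-trained model.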
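
Shuffling before selecting matters whenever a split happens to be stored grouped by label, because taking the first rows would then yield a single class. The sketch below shows the idea in plain Python with a hypothetical list of labels. It is an analogy for `.shuffle(seed=42).select(range(100))`, not the `datasets` implementation, and the helper names are made up.

```python
import random
from collections import Counter

# Hypothetical stand-in for a split sorted by label:
# 1,000 negative examples followed by 1,000 positive ones.
labels = [0] * 1000 + [1] * 1000


def first_n(items, n):
    """Take the first n items without shuffling."""
    return items[:n]


def shuffled_first_n(items, n, seed=42):
    """Shuffle a copy with a seeded RNG, then take the first n items."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    return shuffled[:n]


print(Counter(first_n(labels, 100)))           # only label 0 appears
print(Counter(shuffled_first_n(labels, 100)))  # both labels appear
```

Fixing the seed keeps the subset identical across runs, so results from repeated experiments stay comparable.
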
Configure Training Arguments
Instantiate `TrainingArguments` to define the training configuration: the output directory, evaluation strategy, number of epochs, per-device batch sizes, and logging settings. Hyperparameters that are not set explicitly, such as the learning rate, keep their library defaults. This object is the central configuration passed to the `Trainer`.
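
It is worth working out how many optimizer steps a configuration implies, since `logging_steps` is counted in steps rather than examples. The following is a back-of-the-envelope sketch, assuming a single device, no gradient accumulation, and that the final partial batch is kept. The function `training_schedule` is hypothetical and is not part of `transformers`.

```python
import math


def training_schedule(num_examples, batch_size, num_epochs, logging_steps):
    """Estimate step counts for a single-device run without gradient accumulation."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    steps_per_epoch = math.ceil(num_examples / batch_size)
    total_steps = steps_per_epoch * num_epochs
    # Number of times the step counter reaches a multiple of logging_steps.
    logging_events = total_steps // logging_steps
    return {
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "logging_events": logging_events,
    }


# Values from the starter code: 100 examples, batch size 8, 1 epoch, logging_steps=10.
print(training_schedule(100, 8, 1, 10))
# {'steps_per_epoch': 13, 'total_steps': 13, 'logging_events': 1}
```

Under these assumptions the quick test runs for only 13 steps, so the training loss is logged once, at step 10.
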
Initialize and Run Trainer
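Create a `Trainer` instance, passing the model, the training arguments, and the training and evaluation datasets. The `Trainer` class handles the training loop, including batching, optimization steps, evaluation, and logging. Call `.train()` to start fine-tuning, then optionally call `.evaluate()` to compute metrics on the evaluation subset.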

Starter code

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# 1. Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Load and preprocess dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

dataset = load_dataset("imdb")
tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Use smaller subsets for a quick example run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))

# 3. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    num_train_epochs=1,  # Set to 1 for a quick test
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=10,
)

# 4. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)

# 5. Start training
print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning complete.")

# Optional: Evaluate the model
results = trainer.evaluate()
print("Evaluation results:", results)
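
By default `trainer.evaluate()` reports the loss and runtime statistics but no task metric such as accuracy. One way to add it is a `compute_metrics` callback. The version below is a minimal sketch that assumes the callback receives a `(logits, labels)` pair; it is written without NumPy so it has no extra dependencies, although `np.argmax` would be the more usual choice.

```python
def compute_metrics(eval_pred):
    """Return accuracy for a (logits, labels) pair."""
    logits, labels = eval_pred
    # Pick the index of the highest logit in each row (an argmax over classes).
    predictions = [max(range(len(row)), key=lambda i: row[i]) for row in logits]
    correct = sum(int(pred == int(label)) for pred, label in zip(predictions, labels))
    return {"accuracy": correct / len(labels)}


# Toy check with made-up logits for four examples and two classes.
toy_logits = [[0.1, 0.9], [0.8, 0.2], [0.3, 0.7], [2.0, -1.0]]
toy_labels = [1, 1, 1, 0]
print(compute_metrics((toy_logits, toy_labels)))  # {'accuracy': 0.75}
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the evaluation results should then include the metric under the key `eval_accuracy`.
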