Tags: hugging-face, transformers, fine-tuning, nlp, text-classification, python, pytorch, script
Fine-Tune a Model with Hugging Face Transformers Training Script
Use the Hugging Face Transformers `Trainer` API to fine-tune a pre-trained model for text classification. This pack guides you through preparing a dataset, configuring training arguments, and launching the fine-tuning process.
intermediate | 30 min | 5 steps
The play
- Install Dependencies: Install the `transformers`, `datasets`, and `accelerate` libraries. `transformers` provides the models and the `Trainer` API, `datasets` loads and preprocesses data, and `accelerate` handles device placement and distributed training.
- Load and Tokenize Data: Load a pre-trained model's tokenizer and a dataset. Here, we use `distilbert-base-uncased` and the `imdb` dataset. Create a function to tokenize the text and apply it to the entire dataset using `.map()`.
- Prepare Model and Datasets: Load the pre-trained model for sequence classification. Then create small training and evaluation subsets so the example runs quickly. This is a common practice for quick testing and iteration.
- Configure Training Arguments: Instantiate `TrainingArguments` to define the training configuration. This includes the output directory, evaluation strategy, learning rate, and other hyperparameters. This object holds the full configuration for the training run.
- Initialize and Run Trainer: Create a `Trainer` instance, passing the model, the training arguments, and the training and evaluation datasets. The `Trainer` class handles the training loop for you. Call the `.train()` method to start the fine-tuning process.
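Before running the starter code, it can help to confirm that the libraries from the first step are importable. The following is a minimal sketch, not part of the original pack: the helper name `missing_packages` is hypothetical, and `torch` is included only because the starter code imports it.

```python
from importlib.util import find_spec

# Packages named in the install step, plus torch, which the starter code imports.
REQUIRED = ("transformers", "datasets", "accelerate", "torch")


def missing_packages(names=REQUIRED):
    """Return the package names that cannot be found in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages:", ", ".join(missing))
        print("Try: pip install " + " ".join(missing))
    else:
        print("All required packages are available.")
```

This only checks that each package can be located; it does not verify versions or GPU support.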
Starter code
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# 1. Load tokenizer and model
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
# 2. Load and preprocess dataset
def tokenize_function(examples):
    # Pad every example to the model's maximum length and truncate longer texts
    return tokenizer(examples["text"], padding="max_length", truncation=True)
dataset = load_dataset("imdb")
tokenized_datasets = dataset.map(tokenize_function, batched=True)
# Use smaller subsets for a quick example run
small_train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(100))
small_eval_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(100))
# 3. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    eval_strategy="epoch",  # named evaluation_strategy in transformers releases before 4.41
    num_train_epochs=1,  # Set to 1 for a quick test
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    logging_dir="./logs",
    logging_steps=10,
)
# 4. Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=small_train_dataset,
    eval_dataset=small_eval_dataset,
)
# 5. Start training
print("Starting fine-tuning...")
trainer.train()
print("Fine-tuning complete.")
# Optional: Evaluate the model
results = trainer.evaluate()
print("Evaluation results:", results)
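As written, the starter code defines no metric function, so `trainer.evaluate()` reports the evaluation loss and runtime statistics but no accuracy. One possible addition is a small metric function, sketched here under the assumption that the evaluation predictions arrive as a `(logits, labels)` pair of NumPy arrays. The function body and the sample numbers are illustrative and not part of the original pack.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Compute accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    # The predicted class is the index of the largest logit in each row
    predictions = np.argmax(logits, axis=-1)
    return {"accuracy": float((predictions == labels).mean())}


# Made-up logits for three examples and two classes, for illustration only
example_logits = np.array([[2.0, -1.0], [0.3, 0.9], [1.5, 0.2]])
example_labels = np.array([0, 1, 1])
print(compute_metrics((example_logits, example_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`. The evaluation results should then include the metric under the key `eval_accuracy`.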