
Tags: transfer-learning, fine-tuning, hugging-face, transformers, nlp, pytorch, data-efficient-learning, text-classification

Perform Transfer Learning with Hugging Face Transformers

Leverage a pre-trained model for a new task. This transfer learning guide uses Hugging Face Transformers to fine-tune a text classification model on a custom dataset, which takes far less training time and compute than training a model from scratch.

intermediate | 30 min | 5 steps

The play

Install Dependencies
Set up your environment by installing the necessary libraries: `transformers` provides the models and training infrastructure, `datasets` handles data loading, and `torch` is the backend framework. Recent `transformers` releases also require `accelerate` when `Trainer` runs on PyTorch.
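
Before running anything else, it can help to confirm which of these libraries are actually installed. The sketch below is one way to do that with the standard library only; the helper name `installed_versions` and the `REQUIRED` tuple are illustrative, and including `accelerate` in the list is an assumption based on what recent `transformers` releases expect.

```python
from importlib import metadata

# Packages this guide relies on. `accelerate` is listed as an assumption:
# recent transformers releases need it when Trainer runs on PyTorch.
REQUIRED = ("transformers", "datasets", "torch", "accelerate")


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    status = version if version else f"not installed (pip install {name})"
    print(f"{name}: {status}")
```

Any package reported as missing can be installed with `pip install` before moving on.
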
Load Pre-trained Model & Tokenizer
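Select and load a pre-trained model from the Hugging Face Hub. For transfer learning, we start with a model that has already been pre-trained on large amounts of general text, like 'distilbert-base-uncased', and its corresponding tokenizer. Loading it for sequence classification adds a new, randomly initialized classification head, which fine-tuning then trains.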
Load and Prepare Your Dataset
Load your target dataset. We'll use the 'imdb' dataset for sentiment analysis. The tokenizer then converts the text into a format the model can understand (input IDs, attention mask).
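
To make "input IDs" and "attention mask" concrete, here is a toy illustration of the output layout. It is not the real tokenizer: the whitespace splitting, the six-word vocabulary, and the `toy_encode` helper are all made up for this sketch, and special tokens such as `[CLS]` and `[SEP]` are left out. Real tokenizers split text into subwords and use a vocabulary of tens of thousands of entries.

```python
# Toy stand-in for a tokenizer: whitespace splitting and a tiny made-up
# vocabulary. Only the shape of the output mirrors a real tokenizer.
PAD_ID, UNK_ID = 0, 1
TOY_VOCAB = {"[PAD]": PAD_ID, "[UNK]": UNK_ID, "a": 2, "great": 3, "movie": 4, "boring": 5}


def toy_encode(text, max_length):
    """Return input IDs and an attention mask, truncated or padded to max_length."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]           # truncation: drop tokens beyond max_length
    mask = [1] * len(ids)            # 1 marks a real token
    padding = max_length - len(ids)  # number of pad positions to append
    return {
        "input_ids": ids + [PAD_ID] * padding,
        "attention_mask": mask + [0] * padding,  # 0 marks padding the model ignores
    }


encoded = toy_encode("A great movie", max_length=5)
print(encoded["input_ids"])       # [2, 3, 4, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 0, 0]
```

The starter code's `padding="max_length", truncation=True` arguments produce the same fixed-length layout, with the model's own maximum length in place of 5.
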
Define Training Arguments
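Configure the fine-tuning process using `TrainingArguments`. This class controls hyperparameters such as learning rate, number of epochs, and batch size. Fine-tuning typically uses a small learning rate (5e-5 in the starter code) and few epochs, so the pre-trained weights are adjusted rather than overwritten.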
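
As a quick sanity check on these settings, the arithmetic below works out how many optimizer steps the starter code's configuration implies. The numbers are taken from the starter code; the calculation assumes a single device and no gradient accumulation, which may not match your setup.

```python
import math

# Values taken from the starter code in this guide.
num_examples = 200   # size of the IMDB subset
test_size = 0.2      # fraction held out for evaluation
batch_size = 8       # per_device_train_batch_size
num_epochs = 1       # num_train_epochs
logging_steps = 10

# Assumptions: one device, no gradient accumulation, last batch not dropped.
num_eval = math.ceil(num_examples * test_size)
num_train = num_examples - num_eval
steps_per_epoch = math.ceil(num_train / batch_size)
total_steps = steps_per_epoch * num_epochs

print(num_train, num_eval)           # 160 40
print(total_steps)                   # 20
print(total_steps // logging_steps)  # 2
```

With only 20 steps in total, the loss is logged just twice, so expect a very short training log for this example run.
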
Fine-Tune the Model
Instantiate the `Trainer` with your model, training arguments, and training and evaluation datasets. Calling `train()` starts fine-tuning, which adjusts the model's weights to specialize in your task.
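
Without a metrics function, evaluation reports only the loss. `Trainer` accepts an optional `compute_metrics` callable, and the sketch below shows one possible accuracy implementation using NumPy. The function body and the made-up logits are illustrative and not part of the starter code.

```python
import numpy as np


def compute_metrics(eval_pred):
    """Accuracy from the (logits, labels) pair that Trainer passes at evaluation."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    return {"accuracy": float((predictions == labels).mean())}


# Stand-alone check with made-up logits for three examples:
# predicted classes are 0, 1, 0, so two of the three labels match.
fake_logits = np.array([[2.0, -1.0], [0.1, 0.9], [1.5, 0.3]])
fake_labels = np.array([0, 1, 1])
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the accuracy should then appear alongside the evaluation loss.
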

Starter code

import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer

# 1. Load Model and Tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# 2. Load and Prepare Dataset
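# We use a very small subset for a quick example run.
# The IMDB train split is ordered by label, so shuffle before selecting;
# taking the first 200 rows directly would yield examples of a single class.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(200))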

def tokenize_function(examples):
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)

# Split into training and evaluation sets
train_test_split = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]

# 3. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    # Named `evaluation_strategy` in transformers releases before v4.41
    eval_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    logging_dir='./logs',
    logging_steps=10,
)

# 4. Initialize and Run Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

print("Starting Transfer Learning (fine-tuning)...")
trainer.train()
print("Fine-tuning complete.")

# 5. Test the fine-tuned model
text = "This movie was not the best, but it had some good moments."
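# Trainer may have moved the model to a GPU, so put the inputs on the same device
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

model.eval()  # disable dropout for inference
with torch.no_grad():
    logits = model(**inputs).logits

# Index of the highest-scoring class (0 = negative, 1 = positive in IMDB)
predicted_class_id = logits.argmax(dim=-1).item()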
print(f"Input text: '{text}'")
print(f"Predicted class: {'Positive' if predicted_class_id == 1 else 'Negative'}")
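
The starter code prints only the winning class. To see how confident the model is, the logits can be turned into probabilities with a softmax. The sketch below does this in plain Python on made-up logits so that it runs on its own; on the real tensor from the starter code, `torch.softmax(logits, dim=-1)` performs the same computation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    shift = max(values)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up logits for one review, ordered [negative, positive].
example_logits = [-0.3, 0.8]
labels = ["Negative", "Positive"]

probs = softmax(example_logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(f"{labels[best]} ({probs[best]:.2f})")  # Positive (0.75)
```

After training on only 160 examples, probabilities close to 0.5 are to be expected, so treat the prediction as a smoke test and not as a measure of model quality.
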