Tags: transfer-learning, fine-tuning, hugging-face, transformers, nlp, pytorch, data-efficient-learning, text-classification
Perform Transfer Learning with Hugging Face Transformers
Leverage a pre-trained model for a new task. This Transfer Learning guide uses Hugging Face Transformers to fine-tune a text classification model on a labeled dataset (IMDB movie reviews in the starter code), which takes far less training time and compute than training a model from scratch.
intermediate | 30 min | 5 steps
The play
- Install Dependencies: Set up your environment by installing the necessary libraries. `transformers` provides the models and training infrastructure, `datasets` handles data loading, and `torch` is the backend framework.
- Load Pre-trained Model & Tokenizer: Select and load a pre-trained model from the Hugging Face Hub. For Transfer Learning, we start with a model that already understands language, like 'distilbert-base-uncased', and its corresponding tokenizer.
- Load and Prepare Your Dataset: Load your target dataset. We'll use the 'imdb' dataset for sentiment analysis. The tokenizer then converts the text into a format the model can understand (input IDs, attention mask).
- Define Training Arguments: Configure the fine-tuning process using `TrainingArguments`. This class controls hyperparameters like learning rate, number of epochs, and batch size, which are crucial for successful Transfer Learning.
- Fine-Tune the Model: Instantiate the `Trainer` with your model, arguments, and dataset. Calling `train()` starts the fine-tuning process, where the model's weights are adjusted to specialize in your specific task.
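To make the "input IDs, attention mask" format from the dataset step concrete, here is a toy sketch in plain Python. The vocabulary, the `encode` helper, and the whitespace splitting are all hypothetical and chosen only for illustration; the real DistilBERT tokenizer uses a learned subword vocabulary and adds special tokens. Only the shape of the result (fixed-length `input_ids` with a matching `attention_mask`) mirrors what the model receives.

```python
# Toy illustration only: this is NOT the Hugging Face tokenizer.
PAD_ID = 0  # hypothetical ID used for padding positions
UNK_ID = 1  # hypothetical ID used for words missing from the vocabulary
TOY_VOCAB = {"this": 2, "movie": 3, "was": 4, "great": 5, "terrible": 6}

def encode(text, max_length):
    """Map words to integer IDs, then truncate or pad to max_length."""
    ids = [TOY_VOCAB.get(word, UNK_ID) for word in text.lower().split()]
    ids = ids[:max_length]             # truncation: drop tokens past the limit
    mask = [1] * len(ids)              # 1 marks a real token
    padding = max_length - len(ids)
    ids += [PAD_ID] * padding          # padding: fill up to the fixed length
    mask += [0] * padding              # 0 marks padding the model should ignore
    return {"input_ids": ids, "attention_mask": mask}

encoded = encode("This movie was great", max_length=6)
print(encoded["input_ids"])       # [2, 3, 4, 5, 0, 0]
print(encoded["attention_mask"])  # [1, 1, 1, 1, 0, 0]
```

The attention mask is what lets reviews of different lengths share one batch: padded positions are present in the tensor but are masked out when attention is computed.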
Starter code
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
# 1. Load Model and Tokenizer
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
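# 2. Load and Prepare Dataset
# We use a very small subset for a quick example run.
# The IMDB train split is ordered by label, so shuffle before selecting;
# a plain 'train[:200]' slice would contain examples of only one class.
dataset = load_dataset("imdb", split="train").shuffle(seed=42).select(range(200))

def tokenize_function(examples):
    # Pad or truncate every review to the model's maximum input length
    return tokenizer(examples["text"], padding="max_length", truncation=True)

tokenized_datasets = dataset.map(tokenize_function, batched=True)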
# Split into training and evaluation sets
train_test_split = tokenized_datasets.train_test_split(test_size=0.2)
train_dataset = train_test_split["train"]
eval_dataset = train_test_split["test"]
# 3. Define Training Arguments
training_args = TrainingArguments(
    output_dir="./test_trainer",
    # Evaluate at the end of every epoch. Older transformers releases
    # name this argument `evaluation_strategy` instead.
    eval_strategy="epoch",
    num_train_epochs=1,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    learning_rate=5e-5,
    logging_dir='./logs',
    logging_steps=10,
)
# 4. Initialize and Run Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
print("Starting Transfer Learning (fine-tuning)...")
trainer.train()
print("Fine-tuning complete.")
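# 5. Test the fine-tuned model
text = "This movie was not the best, but it had some good moments."
# Trainer may have moved the model to a GPU, so place the inputs on the same device
inputs = tokenizer(text, return_tensors="pt").to(model.device)
model.eval()  # disable dropout for inference
with torch.no_grad():
    logits = model(**inputs).logits
predicted_class_id = logits.argmax(dim=-1).item()
print(f"Input text: '{text}'")
# In the IMDB dataset, label 0 is negative and label 1 is positive
print(f"Predicted class: {'Positive' if predicted_class_id == 1 else 'Negative'}")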
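The starter code evaluates once per epoch, but without a metrics function the `Trainer` reports only the evaluation loss. A minimal sketch of an accuracy metric follows. It assumes NumPy is installed, and the function name and the made-up logits are illustrative rather than part of the original guide. `Trainer` accepts such a callable through its `compute_metrics` argument and hands it the model's logits together with the true labels.

```python
import numpy as np

def compute_metrics(eval_pred):
    """Return accuracy from a (logits, labels) pair."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)  # highest-scoring class per example
    accuracy = float((predictions == np.asarray(labels)).mean())
    return {"accuracy": accuracy}

# Quick standalone check with made-up logits for three examples
fake_logits = np.array([[0.2, 1.5], [2.0, -1.0], [0.9, 0.1]])
fake_labels = np.array([1, 0, 1])
print(compute_metrics((fake_logits, fake_labels)))
```

To use it, pass `compute_metrics=compute_metrics` when constructing the `Trainer`; the accuracy is then reported next to the loss at each evaluation. With only 40 evaluation examples the number will be noisy, so treat it as a sanity check rather than a benchmark.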