
Tags: active-learning, data-labeling, python, scikit-learn, machine-learning, data-efficiency, uncertainty-sampling, low-resource

Implement Active Learning to Reduce Data Labeling Costs

Use active learning to select the most informative examples for human labeling. By spending the annotation budget on the examples the model is most uncertain about, you can cut labeling costs and often reach a target level of performance with fewer labels than random sampling would need.

Intermediate · 30 min · 5 steps

The play

Train an Initial Model
Start with a small, randomly selected labeled dataset and a large pool of unlabeled data. Train a baseline model on the initial labeled set. Its score on a held-out test set is the baseline against which every later Active Learning cycle is measured.
Calculate Prediction Uncertainty
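Use the current model to predict class probabilities for every sample in the unlabeled pool, then compute an uncertainty score for each one. A simple and effective method for classifiers is Least Confidence Sampling, which scores a sample as 1 minus its highest predicted class probability: uncertainty = 1 - max(predicted_probabilities). A prediction of 0.90 for the top class gives an uncertainty of 0.10; a top-class probability of 0.40 gives 0.60.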
Query the Most Uncertain Samples
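Rank all unlabeled samples by uncertainty score in descending order and select the top 'k'. These are the samples whose labels the current model is least able to predict, so labeling them is expected to be the most informative. The choice of 'k' is a trade-off: a small 'k' retrains more often per label, while a large 'k' keeps annotators busy with fewer retraining rounds.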
Label and Retrain
Send the 'k' queried samples to your human annotators for labeling. Once labeled, add them to your training set and remove them from the unlabeled pool. Retrain your model on this newly expanded dataset to improve its performance.
Iterate the Loop
Repeat the cycle: predict on the remaining unlabeled data, query the most uncertain samples, get them labeled, and retrain. Continue this process until your model's performance meets your target, you run out of budget, or performance plateaus.
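
To make the "Calculate Prediction Uncertainty" and "Query the Most Uncertain Samples" steps concrete, here is a minimal sketch on a toy probability matrix. The helper names `least_confidence` and `query_top_k` and the probability values are illustrative assumptions, not part of any library.

```python
import numpy as np

def least_confidence(probabilities):
    """Return 1 - (highest class probability) for each row."""
    return 1.0 - np.max(probabilities, axis=1)

def query_top_k(uncertainty, k):
    """Return the indices of the k most uncertain samples, most uncertain first."""
    return np.argsort(uncertainty)[::-1][:k]

# Hypothetical predicted probabilities: four unlabeled samples, three classes
toy_probabilities = np.array([
    [0.90, 0.05, 0.05],  # confident prediction
    [0.40, 0.35, 0.25],  # uncertain
    [0.60, 0.30, 0.10],  # somewhat uncertain
    [0.34, 0.33, 0.33],  # nearly uniform, the most uncertain row
])

scores = least_confidence(toy_probabilities)
selected = query_top_k(scores, k=2)

print("Uncertainty scores:", np.round(scores, 2))
print("Indices to send for labeling:", selected)
```

With these values the nearly uniform row (index 3) and the 0.40 row (index 1) are selected, while the confident 0.90 row is left in the pool.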

Starter code

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# --- 1. Setup: Create a simulated dataset ---
def create_dataset():
    X, y = make_classification(
        n_samples=1000,
        n_features=20,
        n_informative=5,
        n_redundant=5,
        n_classes=3,
        random_state=42
    )
    # Hold out 20% of the data as a test set, then keep 5% of the remainder as
    # the initial labeled set; the other 95% becomes the unlabeled pool
    X_train_full, X_test, y_train_full, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(X_train_full, y_train_full, test_size=0.95, random_state=42)
    return X_labeled, y_labeled, X_unlabeled, y_unlabeled, X_test, y_test

X_labeled, y_labeled, X_unlabeled, y_unlabeled, X_test, y_test = create_dataset()

print(f"Initial labeled set size: {len(X_labeled)}")
print(f"Unlabeled pool size: {len(X_unlabeled)}")
print(f"Test set size: {len(X_test)}")

# --- 2. Initial Training --- 
model = LogisticRegression()
model.fit(X_labeled, y_labeled)
initial_preds = model.predict(X_test)
initial_accuracy = accuracy_score(y_test, initial_preds)
print(f"\nInitial Model Accuracy: {initial_accuracy:.4f}")

# --- 3. Active Learning Loop (1 iteration) ---
print("\n--- Starting Active Learning Cycle ---")

# a. Predict on unlabeled data
probabilities = model.predict_proba(X_unlabeled)

# b. Calculate uncertainty (Least Confidence Sampling)
uncertainty = 1 - np.max(probabilities, axis=1)

# c. Query the most uncertain samples
QUERY_SIZE = 20  # Number of samples to label in this cycle
# argsort sorts ascending, so the last QUERY_SIZE indices have the highest uncertainty
query_indices = np.argsort(uncertainty)[-QUERY_SIZE:]

# d. "Simulate" labeling by getting the true labels and moving data
# (in a real project y_unlabeled does not exist; annotators supply these labels)
X_queried, y_queried = X_unlabeled[query_indices], y_unlabeled[query_indices]

# Add queried samples to the labeled set
X_labeled_new = np.concatenate([X_labeled, X_queried])
y_labeled_new = np.concatenate([y_labeled, y_queried])

# Remove queried samples from the unlabeled pool
X_unlabeled_new = np.delete(X_unlabeled, query_indices, axis=0)
y_unlabeled_new = np.delete(y_unlabeled, query_indices, axis=0)

print(f"Queried {len(X_queried)} samples for labeling.")
print(f"New labeled set size: {len(X_labeled_new)}")
print(f"New unlabeled pool size: {len(X_unlabeled_new)}")

# e. Retrain the model
model.fit(X_labeled_new, y_labeled_new)
retrained_preds = model.predict(X_test)
retrained_accuracy = accuracy_score(y_test, retrained_preds)

print(f"\nRetrained Model Accuracy: {retrained_accuracy:.4f}")
print(f"Accuracy Improvement: {retrained_accuracy - initial_accuracy:.4f}")
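
The starter code runs a single cycle. The "Iterate the Loop" step could be sketched as follows, reusing the same simulated dataset and Least Confidence query. The function name `run_active_learning`, the `n_rounds` stopping rule, and `max_iter=1000` (chosen only to reduce the chance of convergence warnings) are assumptions made for this illustration; a real project would stop on budget, target score, or a plateau, and would send queried samples to annotators instead of reading `y_pool`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

def run_active_learning(X_lab, y_lab, X_pool, y_pool, X_test, y_test,
                        query_size=20, n_rounds=5):
    """Repeat predict -> score -> query -> label -> retrain for up to n_rounds."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X_lab, y_lab)
    history = [accuracy_score(y_test, model.predict(X_test))]

    for _ in range(n_rounds):
        if len(X_pool) == 0:
            break  # nothing left to query
        k = min(query_size, len(X_pool))

        # Least Confidence: 1 - highest predicted class probability
        uncertainty = 1 - np.max(model.predict_proba(X_pool), axis=1)
        query_indices = np.argsort(uncertainty)[-k:]

        # Simulated labeling: move the queried samples from the pool to the labeled set
        X_lab = np.concatenate([X_lab, X_pool[query_indices]])
        y_lab = np.concatenate([y_lab, y_pool[query_indices]])
        X_pool = np.delete(X_pool, query_indices, axis=0)
        y_pool = np.delete(y_pool, query_indices, axis=0)

        # Retrain on the expanded labeled set and record test accuracy
        model.fit(X_lab, y_lab)
        history.append(accuracy_score(y_test, model.predict(X_test)))

    return model, history, len(X_lab), len(X_pool)

# Same simulated data and splits as the starter code
X, y = make_classification(n_samples=1000, n_features=20, n_informative=5,
                           n_redundant=5, n_classes=3, random_state=42)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_labeled, X_unlabeled, y_labeled, y_unlabeled = train_test_split(
    X_train_full, y_train_full, test_size=0.95, random_state=42)

model, history, n_labeled, n_pool = run_active_learning(
    X_labeled, y_labeled, X_unlabeled, y_unlabeled, X_test, y_test)

for round_number, accuracy in enumerate(history):
    print(f"Round {round_number}: test accuracy = {accuracy:.4f}")
print(f"Final labeled set size: {n_labeled}, remaining pool size: {n_pool}")
```

Accuracy is not guaranteed to rise every round. To judge whether the query strategy is paying off, compare this curve against the same loop with randomly chosen indices.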