Build a Fraud Detection Pipeline with Imbalanced-Learn & XGBoost

Build a baseline fraud detection pipeline for imbalanced data. This guide combines SMOTE-Tomek resampling from imbalanced-learn with an XGBoost classifier to model rare events such as fraud, giving you a starting point to adapt to real-world data.

Intermediate · 30 min · 5 steps

The play

Install Dependencies
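The pipeline needs three libraries: imbalanced-learn for resampling, scikit-learn for data generation, splitting, and metrics, and xgboost for the classifier. Install them with `pip install imbalanced-learn scikit-learn xgboost`; NumPy comes along as a dependency of scikit-learn.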
Simulate Imbalanced Data
Fraud data is naturally imbalanced. We'll use scikit-learn's `make_classification` to create a synthetic dataset with a 99:1 ratio of non-fraud to fraud cases. Real fraud rates are often lower still, but 1% is skewed enough to show the problem.
Apply SMOTE-Tomek Resampling
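To handle the severe class imbalance, we use SMOTE-Tomek. SMOTE (Synthetic Minority Over-sampling TEchnique) creates synthetic fraud examples by interpolating between existing minority samples and their nearest minority-class neighbors. Tomek links are pairs of nearest neighbors that belong to opposite classes; removing them cleans up the class boundary. Resample the training split only, so the test set keeps the original class distribution.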
Train XGBoost Classifier
With a balanced training set, we can now train the classifier. We use XGBoost, a gradient boosting library known for strong results in competitions and on tabular tasks such as fraud detection, and fit it on the resampled training data only.
Evaluate Model Performance
Finally, evaluate the model. For imbalanced problems, overall accuracy is misleading. Instead, we use a classification report, focusing on the precision, recall, and F1-score for the minority class (label '1') to understand the model's true effectiveness at catching fraud.
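
To see why accuracy misleads at a 99:1 ratio, here is a minimal sketch that computes the same minority-class figures a classification report shows, using plain Python. The helper name `minority_class_metrics` and both sets of toy predictions are hypothetical, made up for illustration; they are not output from the pipeline in this guide.

```python
def minority_class_metrics(y_true, y_pred, positive=1):
    """Return accuracy plus precision, recall and F1 for the `positive` label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)

    accuracy = correct / len(pairs)
    # Guard against division by zero when nothing is predicted or present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy ground truth: 990 legitimate transactions followed by 10 fraud cases
y_true = [0] * 990 + [1] * 10

# Hypothetical baseline that never flags anything as fraud
always_legit = [0] * 1000
baseline = minority_class_metrics(y_true, always_legit)

# Hypothetical detector: 4 false alarms, 8 of the 10 fraud cases caught
detector_pred = [1] * 4 + [0] * 986 + [1] * 8 + [0] * 2
detector = minority_class_metrics(y_true, detector_pred)

for name, scores in [("baseline", baseline), ("detector", detector)]:
    print(name, {k: round(v, 3) for k, v in scores.items()})
```

The do-nothing baseline reaches 99% accuracy while catching no fraud at all, which is why the fraud-class precision, recall, and F1 are the numbers to watch.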

Starter code

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier

# 1. Simulate imbalanced data (e.g., 1% fraud cases)
print("--- Generating imbalanced dataset ---")
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],
    flip_y=0,
    random_state=42
)
print(f"Original dataset shape: {X.shape}")
print(f"Original class distribution: {np.bincount(y)}\n")

# Split original data to create a hold-out test set for final evaluation
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# 2. Apply SMOTE-Tomek resampling to the training set
print("--- Applying SMOTE-Tomek resampling ---")
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train_orig, y_train_orig)
print(f"Resampled dataset shape: {X_resampled.shape}")
print(f"Resampled class distribution: {np.bincount(y_resampled)}\n")

# 3. Train an XGBoost classifier on the resampled data
print("--- Training XGBoost model ---")
# use_label_encoder is deprecated and no longer needed in recent XGBoost releases, so it is omitted
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_resampled, y_resampled)
print("Model training complete.\n")

# 4. Evaluate the model on the original, unseen test data
print("--- Evaluating model on original test set ---")
y_pred = model.predict(X_test_orig)

print("Classification Report:")
# Labels are sorted as 0, 1, so target_names maps 0 -> non-fraud and 1 -> fraud
print(classification_report(y_test_orig, y_pred, target_names=['non-fraud (0)', 'fraud (1)']))
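
To show what the resampling step does conceptually, the sketch below implements SMOTE-style interpolation in plain NumPy. It is a simplified illustration, not imbalanced-learn's implementation: the function name `smote_like_oversample`, its parameters, and the toy data are assumptions made for this example, and the Tomek-link cleaning half of SMOTE-Tomek is not shown.

```python
import numpy as np


def smote_like_oversample(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours.

    Illustrative only: brute-force neighbour search, minority class only.
    """
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between all minority samples
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbour

    # Indices of the k nearest minority neighbours of each sample, shape (n, k)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # For each new row: pick a base sample and one of its k neighbours
    base_idx = rng.integers(0, n, size=n_new)
    nn_idx = neighbours[base_idx, rng.integers(0, k, size=n_new)]

    # Place the new row at a random position on the segment between the two
    gaps = rng.random((n_new, 1))
    base = X_minority[base_idx]
    return base + gaps * (X_minority[nn_idx] - base)


# Toy minority class: 20 points in 2-D (made-up data for illustration)
toy_rng = np.random.default_rng(42)
fraud_points = toy_rng.normal(loc=3.0, scale=0.5, size=(20, 2))

synthetic = smote_like_oversample(fraud_points, n_new=80)
print(synthetic.shape)
```

Every synthetic row lies on a line segment between two real minority samples, so the new points stay inside the region the fraud class already occupies instead of being exact duplicates.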