Article
fraud-detection, imbalanced-learning, smote, xgboost, scikit-learn, machine-learning, python
Build a Fraud Detection Pipeline with Imbalanced-Learn & XGBoost
Implement a core fraud detection pipeline for imbalanced data. This guide uses SMOTE-Tomek resampling from imbalanced-learn and an XGBoost classifier to model rare events such as fraud, giving you a starting point to adapt for real-world applications.
intermediate | 30 min | 5 steps
The play
- Install Dependencies: The pipeline relies on three libraries for handling class imbalance and modeling. Install imbalanced-learn, scikit-learn, and xgboost to get started.
- Simulate Imbalanced Data: Fraud data is naturally imbalanced. We'll use scikit-learn's `make_classification` to create a synthetic dataset with a 99:1 ratio of non-fraud to fraud cases, which is a realistic starting point.
- Apply SMOTE-Tomek Resampling: To handle the severe class imbalance, we use SMOTE-Tomek. SMOTE (Synthetic Minority Over-sampling TEchnique) creates new synthetic fraud examples by interpolating between existing ones, while Tomek-link removal drops pairs of opposite-class points that are each other's nearest neighbors. This combination balances the training data and cleans up the class boundary. Resample only the training split so the test set keeps the original distribution.
- Train XGBoost Classifier: With the training data balanced, we can now train a classifier. We use XGBoost, a gradient boosting library known for strong performance in competitions and on fraud detection tasks, and fit it on the resampled data.
- Evaluate Model Performance: Finally, evaluate the model on the untouched test set. For imbalanced problems, overall accuracy is misleading: a model that labels every transaction as non-fraud already scores about 99% on this data. Instead, we use a classification report, focusing on the precision, recall, and F1-score for the minority class (label '1') to see how well the model actually catches fraud.
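To make the resampling step more concrete, here is a minimal NumPy sketch of the two ideas behind SMOTE-Tomek. It is an illustration under simplifying assumptions, not imbalanced-learn's implementation: the helper names `smote_like_sample` and `is_tomek_link` are hypothetical, the toy data is made up, and real SMOTE picks a random neighbor among the k nearest minority-class points rather than taking one neighbor as given.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_like_sample(x, neighbor, rng):
    """Return a synthetic point on the segment between x and a minority-class neighbor."""
    lam = rng.random()  # interpolation weight, uniform in [0, 1)
    return x + lam * (neighbor - x)

def is_tomek_link(X, y, i, j):
    """True if points i and j have different labels and are each other's nearest neighbor."""
    if y[i] == y[j]:
        return False
    # Pairwise Euclidean distances; ignore self-distances on the diagonal.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dist, np.inf)
    return dist[i].argmin() == j and dist[j].argmin() == i

# Toy 2-D data (made up): three non-fraud points (0) and two fraud points (1).
X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [5.2, 5.0], [9.0, 9.0]])
y = np.array([0, 0, 0, 1, 1])

# A synthetic fraud point somewhere between the two real fraud points.
synthetic = smote_like_sample(X[3], X[4], rng)
print("Synthetic fraud point:", synthetic)

# Points 2 (non-fraud) and 3 (fraud) sit almost on top of each other.
tomek = is_tomek_link(X, y, 2, 3)
print("Points 2 and 3 form a Tomek link:", bool(tomek))
```

In the toy data, points 2 and 3 are the kind of ambiguous cross-class pair that Tomek-link cleaning targets, while the synthetic point always lies on the line between the two real fraud points.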
Starter code
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from imblearn.combine import SMOTETomek
from xgboost import XGBClassifier
# 1. Simulate imbalanced data (e.g., 1% fraud cases)
print("--- Generating imbalanced dataset ---")
X, y = make_classification(
    n_samples=10000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.99, 0.01],
    flip_y=0,
    random_state=42
)
print(f"Original dataset shape: {X.shape}")
print(f"Original class distribution: {np.bincount(y)}\n")
# Split original data to create a hold-out test set for final evaluation
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# 2. Apply SMOTE-Tomek resampling to the training set
print("--- Applying SMOTE-Tomek resampling ---")
smt = SMOTETomek(random_state=42)
X_resampled, y_resampled = smt.fit_resample(X_train_orig, y_train_orig)
print(f"Resampled dataset shape: {X_resampled.shape}")
print(f"Resampled class distribution: {np.bincount(y_resampled)}\n")
# 3. Train an XGBoost classifier on the resampled data
print("--- Training XGBoost model ---")
# use_label_encoder is deprecated in recent XGBoost releases, so it is omitted here
model = XGBClassifier(eval_metric='logloss', random_state=42)
model.fit(X_resampled, y_resampled)
print("Model training complete.\n")
# 4. Evaluate the model on the original, unseen test data
print("--- Evaluating model on original test set ---")
y_pred = model.predict(X_test_orig)
print("Classification Report:")
# Focus on the 'fraud (1)' row: precision, recall, and F1 for the minority class
print(classification_report(y_test_orig, y_pred, target_names=['non-fraud (0)', 'fraud (1)']))
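The evaluation step reads per-class metrics rather than accuracy. As a small worked example of why, the sketch below scores a hypothetical "model" that never flags fraud on a made-up hold-out set with the same 1% fraud rate used in this guide (2,000 transactions, 20 of them fraud). The labels are illustrative, not output from the pipeline above.

```python
# Hypothetical hold-out set: 2,000 transactions, 1% fraud (made-up labels).
y_true = [1] * 20 + [0] * 1980

# A baseline that labels every transaction as non-fraud.
y_baseline = [0] * len(y_true)

# Count fraud cases that were caught (true positives) and missed (false negatives).
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_baseline))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_baseline))

accuracy = sum(t == p for t, p in zip(y_true, y_baseline)) / len(y_true)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f}, fraud recall={recall:.2f}")
# prints "accuracy=0.99, fraud recall=0.00"
```

The baseline reaches 99% accuracy while catching no fraud at all, which is why the fraud row of the classification report is the number to watch. Precision for the fraud class is undefined here because the baseline makes no positive predictions.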