Tags: anomaly-detection, outlier-detection, python, pyod, machine-learning, data-science, isolation-forest, monitoring

Detect Outliers in Your Data with PyOD

Use the PyOD Python library to quickly identify anomalies in your datasets. This guide walks you through training an Isolation Forest model, a tree-based algorithm for outlier detection, on synthetic sample data.

Beginner, 15 min, 4 steps

The play

Install PyOD
Get started by installing the PyOD library and its dependencies using pip. PyOD is a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data.
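Installation is typically a single `pip install pyod` in your environment. As a quick sanity check afterwards, the sketch below uses only the Python standard library to report whether the package is importable and which version is installed. The helper name `installed_version` is hypothetical and not part of PyOD.

```python
from importlib import metadata, util


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    # find_spec returns None when the top-level package cannot be imported
    if util.find_spec(package) is None:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("pyod")
print(f"pyod {version}" if version else "pyod not found; try: pip install pyod")
```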
Generate Sample Data
For a reproducible example, use PyOD's built-in data generation utility. This creates a dataset with a known number of outliers, allowing you to verify your model's performance.
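To make "a dataset with a known number of outliers" concrete, here is a minimal standard-library sketch of the general idea: inliers drawn from a Gaussian cluster, outliers drawn uniformly over a wider range, with labels recorded as the data is built. This is an illustrative stand-in, not PyOD's implementation. The function name `make_toy_data` and the cluster parameters are assumptions chosen for the example.

```python
import random


def make_toy_data(n_samples=200, contamination=0.1, n_features=2, seed=42):
    """Build a labelled toy dataset: label 0 = inlier, label 1 = outlier."""
    rng = random.Random(seed)
    n_outliers = int(n_samples * contamination)
    n_inliers = n_samples - n_outliers
    # Inliers: a tight Gaussian cluster (centre and spread are arbitrary choices)
    inliers = [[rng.gauss(2.0, 0.5) for _ in range(n_features)]
               for _ in range(n_inliers)]
    # Outliers: spread uniformly over a much wider box
    outliers = [[rng.uniform(-4.0, 4.0) for _ in range(n_features)]
                for _ in range(n_outliers)]
    X = inliers + outliers
    y = [0] * n_inliers + [1] * n_outliers
    return X, y


X, y = make_toy_data()
print(f"{len(X)} samples, {sum(y)} labelled outliers")
```

Because the labels are known up front, any detector's predictions can be compared against `y` directly.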
Train an Isolation Forest Model
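Initialize and train an Isolation Forest (IForest) model. IForest is an efficient algorithm that works by isolating anomalies instead of profiling normal data points: a point that random splits separate from the rest after only a few cuts is scored as more anomalous. The 'contamination' parameter is your estimate of the proportion of outliers in the data; PyOD uses it to set the score threshold that turns anomaly scores into outlier labels.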
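The isolation idea can be illustrated with a deliberately simplified one-dimensional sketch: repeatedly pick a random split point, keep only the side containing the point of interest, and count how many splits it takes until the point is alone. This is not how PyOD or scikit-learn implement the algorithm (real forests subsample the data, build full trees, and normalise path lengths). The function names below are hypothetical.

```python
import random


def isolation_depth(point, data, rng, max_depth=50):
    """Count random splits needed before `point` is alone in its partition."""
    depth = 0
    current = list(data)
    while len(current) > 1 and depth < max_depth:
        lo, hi = min(current), max(current)
        if lo == hi:
            break  # remaining values are identical and cannot be split
        split = rng.uniform(lo, hi)
        # Keep only the side of the split that contains `point`
        if point < split:
            current = [v for v in current if v < split]
        else:
            current = [v for v in current if v >= split]
        depth += 1
    return depth


def mean_depth(point, data, n_trees=200, seed=0):
    """Average the isolation depth over many random split sequences."""
    rng = random.Random(seed)
    return sum(isolation_depth(point, data, rng) for _ in range(n_trees)) / n_trees


rng = random.Random(42)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [8.0]  # 8.0 is the planted outlier
typical = sorted(data)[len(data) // 2]  # a point near the middle of the cluster

outlier_depth = mean_depth(8.0, data)
typical_depth = mean_depth(typical, data)
print(f"outlier: {outlier_depth:.1f} splits, typical point: {typical_depth:.1f} splits")
```

The planted outlier is isolated in noticeably fewer splits on average than a point inside the cluster, which is the signal an isolation forest converts into an anomaly score.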
Predict and Evaluate
Use the trained model to predict outlier labels on your data. The model assigns a label of '1' to anomalies and '0' to inliers. You can also get a raw anomaly score for each data point, where a higher score means a more anomalous point.
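The relationship between raw scores, the contamination estimate, and the final labels can be sketched in plain Python. The example flags roughly the top `contamination` fraction of scores as outliers, then compares the result with known labels. The scores and labels are made up for illustration, the helper names are hypothetical, and the simple rank-based cutoff is a stand-in for the percentile threshold a library would compute.

```python
def label_by_contamination(scores, contamination):
    """Flag roughly the top `contamination` fraction of scores as outliers (1)."""
    n_outliers = max(1, round(contamination * len(scores)))
    # The threshold is the n-th highest score; higher score = more anomalous
    threshold = sorted(scores, reverse=True)[n_outliers - 1]
    labels = [1 if s >= threshold else 0 for s in scores]
    return labels, threshold


def precision_recall(y_true, y_pred):
    """Compute precision and recall for the outlier class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative anomaly scores and ground-truth labels (not real model output)
scores = [0.10, 0.12, 0.95, 0.11, 0.09, 0.13, 0.80, 0.10, 0.14, 0.12]
y_true = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]

y_pred, threshold = label_by_contamination(scores, contamination=0.2)
precision, recall = precision_recall(y_true, y_pred)
print(f"threshold={threshold}, precision={precision}, recall={recall}")
```

Here one of the two flagged points is a true outlier and one true outlier is missed. A contamination estimate fixes how many points get flagged, not whether they are the right ones.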

Starter code

import numpy as np
import matplotlib.pyplot as plt
from pyod.models.iforest import IForest
from pyod.utils.data import generate_data

# 1. Generate sample data
# contamination is the expected proportion of outliers in the data
contamination = 0.1
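# train_only=True returns just (X_train, y_train); without it a test split is returned as well
X_train, y_train = generate_data(n_train=200, n_features=2, contamination=contamination, train_only=True, random_state=42)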

# 2. Train the Anomaly Detection model (Isolation Forest)
print("Training Isolation Forest model...")
clf = IForest(contamination=contamination, random_state=42)
clf.fit(X_train)

# 3. Get predictions
# The 'labels_' attribute contains the binary classification (0: inlier, 1: outlier)
y_pred = clf.labels_

# The 'decision_scores_' attribute contains the raw anomaly score for each sample
scores = clf.decision_scores_

# 4. Report and visualize results
num_outliers = np.sum(y_pred)
print(f"\nNumber of outliers detected: {num_outliers}")
# Count the ground-truth outliers from the labels rather than re-deriving them from contamination
print(f"Actual number of outliers: {int(np.sum(y_train))}")

# Separate inliers and outliers for plotting
inliers = X_train[y_pred == 0]
outliers = X_train[y_pred == 1]

# Create a scatter plot
plt.figure(figsize=(10, 6))
plt.scatter(inliers[:, 0], inliers[:, 1], c='blue', label='Inliers')
plt.scatter(outliers[:, 0], outliers[:, 1], c='red', marker='x', label='Outliers')

# Add circles around the ground truth outliers for verification
ground_truth_outliers = X_train[y_train == 1]
plt.scatter(ground_truth_outliers[:, 0], ground_truth_outliers[:, 1], 
            facecolors='none', edgecolors='lime', s=150, linewidths=2, label='Ground Truth Outliers')

plt.title('Anomaly Detection with Isolation Forest')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.grid(True)
print("\nDisplaying plot... Close the plot window to exit.")
plt.show()