
Tags: shap, xai, explainable-ai, feature-importance, model-interpretability, scikit-learn, python, data-science

Explain ML Models with SHAP Feature Importance

Use the SHAP (SHapley Additive exPlanations) library to understand your model's predictions. This pack shows how to compute SHAP values and visualize which features most influence the model's output, both globally across a dataset and for individual predictions.

Intermediate | 30 min | 6 steps

The play

Install Dependencies
Install the necessary libraries. You'll need `shap` for the core logic, `scikit-learn` for a model and dataset, `pandas` for the DataFrame the dataset is loaded into, and `matplotlib` for plotting.
Train a Model to Explain
To explain a model, you first need a model. We'll train a simple RandomForestRegressor on the California Housing dataset. This provides the prediction function that SHAP will analyze.
Create a SHAP Explainer
Instantiate a SHAP Explainer object. For tree-based models like RandomForest or XGBoost, use `shap.TreeExplainer`, which exploits the tree structure to compute exact SHAP values much faster than model-agnostic explainers. The explainer takes the trained model as input.
Calculate SHAP Values
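Use the explainer to calculate SHAP values for a set of data (e.g., your test set). For a single-output regressor like this one, the result has one row per sample and one column per feature, and each entry is that feature's contribution to that sample's prediction.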
Visualize Global Importance
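Generate a summary plot to see which features are most important overall. In the beeswarm plot, each point is one sample's SHAP value for a feature, colored by that feature's value, so it shows the spread and direction of effects rather than only the average magnitude a simple bar chart gives.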
Explain an Individual Prediction
Use a force plot to understand a single prediction. This shows which features pushed the model's output higher (red) or lower (blue) than the baseline (the model's average prediction, available as `explainer.expected_value`) for a specific data point.

Starter code

import shap
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt

def run_shap_analysis():
    """Loads data, trains a model, and generates SHAP feature importance plots."""
    # 1. Load data and train a model
    print("Loading California Housing dataset and training RandomForestRegressor...")
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("Model training complete.")

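    # 2. Create a SHAP explainer and calculate values.
    # Computing values for the full test set can be slow for a large forest;
    # if so, use a smaller sample of X_test here and in the plots below.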
    print("Calculating SHAP values...")
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # 3. Build the global summary plot (beeswarm); it is displayed in step 5
    print("Generating global feature importance plot (beeswarm)...")
    plt.figure(1)
    shap.summary_plot(shap_values, X_test, show=False)
    plt.title('Global Feature Importance (Beeswarm)')
    plt.tight_layout()
    
    # 4. Build the summary plot as a bar chart; it is displayed in step 5
    print("Generating global feature importance plot (bar)...")
    plt.figure(2)
    shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
    plt.title('Global Feature Importance (Bar)')
    plt.tight_layout()

    # 5. Show plots
    print("Displaying plots. Close plot windows to exit.")
    plt.show()

if __name__ == "__main__":
    # To run this script: pip install shap scikit-learn matplotlib pandas
    run_shap_analysis()