Article
shap, xai, explainable-ai, feature-importance, model-interpretability, scikit-learn, python, data-science
Explain ML Models with SHAP Feature Importance
Use the SHAP (SHapley Additive exPlanations) library to understand your model's predictions. This pack shows how to calculate and visualize which features have the most impact on model output for both global behavior and individual predictions.
intermediate · 30 min · 6 steps
The play
- Install Dependencies: Install the necessary libraries. You'll need `shap` for the core logic, `scikit-learn` for a model and dataset, and `matplotlib` for plotting.
- Train a Model to Explain: To explain a model, you first need a model. We'll train a simple RandomForestRegressor on the California Housing dataset. This provides the prediction function that SHAP will analyze.
- Create a SHAP Explainer: Instantiate a SHAP Explainer object. For tree-based models like RandomForest or XGBoost, `shap.TreeExplainer` is highly optimized. The explainer takes the trained model as input.
- Calculate SHAP Values: Use the explainer to calculate SHAP values for a set of data (e.g., your test set). This computes the contribution of each feature to each individual prediction.
- Visualize Global Importance: Generate a summary plot to see which features are most important overall. The beeswarm plot shows the distribution of SHAP values for each feature, providing more detail than a simple bar chart.
- Explain an Individual Prediction: Use a force plot to understand a single prediction. This shows which features pushed the model's output higher (red) or lower (blue) than the baseline for a specific data point.
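To make "the contribution of each feature" and "the baseline" concrete, here is a minimal brute-force sketch of the idea behind Shapley values, written in plain Python without the `shap` library. It assumes a hypothetical three-feature `toy_model` and a single all-zero reference row as the baseline. SHAP itself averages over background data and uses much faster algorithms, so treat this as an illustration of the concept, not of how `shap.TreeExplainer` works internally.

```python
from itertools import permutations


def shapley_values(predict, x, reference):
    """Exact Shapley values for one row `x`, relative to a single reference row.

    Features are switched from their reference value to their real value one
    at a time; each feature's contribution is the change in the prediction it
    causes, averaged over every possible ordering of the features.
    """
    n = len(x)
    phi = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(reference)
        previous = predict(current)
        for j in order:
            current[j] = x[j]            # reveal feature j
            updated = predict(current)
            phi[j] += updated - previous
            previous = updated
    return [total / len(orderings) for total in phi]


def toy_model(row):
    """Hypothetical model: one additive term plus one interaction term."""
    a, b, c = row
    return 2.0 * a + b * c


reference = [0.0, 0.0, 0.0]   # assumed baseline input
x = [1.0, 2.0, 3.0]           # the row being explained

baseline = toy_model(reference)
phi = shapley_values(toy_model, x, reference)

# Additivity: baseline + sum of contributions equals the prediction for x.
print("baseline:", baseline)
print("contributions:", phi)
print("prediction:", toy_model(x))
```

In this toy case the interaction term `b * c` is split evenly between `b` and `c`, and the contributions plus the baseline add up to the prediction. That additive decomposition is what a force plot draws for a single row.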
Starter code
import shap
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt


def run_shap_analysis():
    """Loads data, trains a model, and generates SHAP feature importance plots."""
    # 1. Load data and train a model
    print("Loading California Housing dataset and training RandomForestRegressor...")
    X, y = fetch_california_housing(return_X_y=True, as_frame=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(n_estimators=100, random_state=42)
    model.fit(X_train, y_train)
    print("Model training complete.")

    # 2. Create a SHAP explainer and calculate values
    print("Calculating SHAP values...")
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X_test)

    # 3. Generate and show the global summary plot (beeswarm)
    print("Generating global feature importance plot (beeswarm)...")
    plt.figure(1)
    shap.summary_plot(shap_values, X_test, show=False)
    plt.title('Global Feature Importance (Beeswarm)')
    plt.tight_layout()

    # 4. Generate and show the summary plot (bar chart)
    print("Generating global feature importance plot (bar)...")
    plt.figure(2)
    shap.summary_plot(shap_values, X_test, plot_type="bar", show=False)
    plt.title('Global Feature Importance (Bar)')
    plt.tight_layout()

    # 5. Show plots
    print("Displaying plots. Close plot windows to exit.")
    plt.show()


if __name__ == "__main__":
    # To run this script: pip install shap scikit-learn matplotlib pandas
    run_shap_analysis()