python, recommendation-system, collaborative-filtering, svd, surprise-library, machine-learning, personalization

Build a Movie Recommender with Collaborative Filtering

Learn to predict user ratings using Collaborative Filtering. This guide uses Python's Surprise library to build a simple movie recommender system based on user-item interaction data, a core technique for personalization.

beginner · 15 min · 5 steps

The play

Install the Surprise Library
First, set up your environment by installing `scikit-surprise`, a Python library for building and analyzing recommender systems (it is imported as `surprise`). It provides ready-to-use prediction algorithms such as SVD and k-NN for Collaborative Filtering.
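A quick way to confirm the install worked is to check that the package can be found. The sketch below assumes the package was installed with `pip install scikit-surprise` and is imported under the name `surprise`, as in the starter code; the helper name `is_surprise_installed` is made up for illustration.

```python
import importlib.util


def is_surprise_installed() -> bool:
    """Return True if the `surprise` package can be found in this environment."""
    return importlib.util.find_spec("surprise") is not None


if is_surprise_installed():
    print("surprise is available")
else:
    print("surprise not found; try: pip install scikit-surprise")
```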
Load a Dataset
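Load a standard dataset to work with. The Surprise library includes built-in datasets like MovieLens. We'll load the `ml-100k` dataset, which contains 100,000 ratings from 943 users on 1,682 movies. If the data is not already on disk, the first call to `Dataset.load_builtin` offers to download it.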
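To make the shape of the data concrete: each rating in `ml-100k` is a row of user ID, item ID, rating, and timestamp, separated by tabs. The sketch below parses a few made-up rows in that layout with plain Python. The sample values and the `parse_ratings` helper are illustrative only; they are not taken from the dataset or from Surprise, which does this parsing for you.

```python
from typing import List, NamedTuple


class Rating(NamedTuple):
    user_id: str   # raw IDs are kept as strings, as in the starter code
    item_id: str
    rating: float


# Made-up rows in a "user<TAB>item<TAB>rating<TAB>timestamp" layout.
SAMPLE = "1\t10\t4\t880000000\n1\t20\t2\t880000100\n2\t10\t5\t880000200\n"


def parse_ratings(text: str) -> List[Rating]:
    """Parse tab-separated rating rows, ignoring the timestamp column."""
    ratings = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        user_id, item_id, rating, _timestamp = line.split("\t")
        ratings.append(Rating(user_id, item_id, float(rating)))
    return ratings


ratings = parse_ratings(SAMPLE)
n_users = len({r.user_id for r in ratings})
print(len(ratings), "ratings from", n_users, "users")
```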
Select a Collaborative Filtering Algorithm
Choose a model-based Collaborative Filtering algorithm. We'll use Singular Value Decomposition (SVD), a matrix factorization technique that uncovers latent factors in the user-item interaction matrix to predict missing ratings.
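To show the idea behind matrix factorization, here is a minimal pure-Python sketch that learns a small vector of latent factors for each user and item by stochastic gradient descent. It is a simplified illustration, not Surprise's `SVD` implementation: it leaves out details such as bias terms, and the toy ratings, `n_factors`, learning rate, and regularization values are all assumptions chosen for the example.

```python
import random


def predict(p, q, user, item):
    """Predicted rating: dot product of the user's and item's factor vectors."""
    return sum(pf * qf for pf, qf in zip(p[user], q[item]))


def train_mf(ratings, n_factors=2, n_epochs=200, lr=0.01, reg=0.02, seed=0):
    """Learn factor vectors so that predict(p, q, u, i) approximates each known rating."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    # Small random starting values for every latent factor.
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}
    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(p, q, u, i)
            for f in range(n_factors):
                pu, qi = p[u][f], q[i][f]
                # Gradient step on the squared error with L2 regularization.
                p[u][f] += lr * (err * qi - reg * pu)
                q[i][f] += lr * (err * pu - reg * qi)
    return p, q


def mse(p, q, ratings):
    """Mean squared error over the known ratings."""
    return sum((r - predict(p, q, u, i)) ** 2 for u, i, r in ratings) / len(ratings)


# Toy (user, item, rating) triples, invented for this example.
toy_ratings = [
    ("alice", "m1", 5), ("alice", "m2", 4), ("alice", "m3", 1),
    ("bob", "m1", 4), ("bob", "m2", 5), ("bob", "m4", 1),
    ("carol", "m3", 5), ("carol", "m4", 4), ("carol", "m1", 1),
]

p0, q0 = train_mf(toy_ratings, n_epochs=0)  # untrained factors, for comparison
p, q = train_mf(toy_ratings)
print("MSE before training:", mse(p0, q0, toy_ratings))
print("MSE after training: ", mse(p, q, toy_ratings))
print("Predicted rating for bob on m3:", predict(p, q, "bob", "m3"))
```

The last line predicts a rating that is missing from the toy data, which is exactly what the recommender needs: an estimate for a pair the user never rated.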
Train and Evaluate the Model
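Use cross-validation to train the model and assess its accuracy. This splits the data into folds, training on all but one fold and testing on the held-out fold in turn, which gives a more robust estimate of performance than a single split. We'll report the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE); lower is better for both.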
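The two ideas in this step, splitting into folds and scoring with RMSE, can be sketched in plain Python. The helper names `rmse` and `k_fold_indices` are made up for this example, and the `cross_validate` call in the starter code handles the splitting for you; this only shows the arithmetic.

```python
import math


def rmse(true_ratings, predicted_ratings):
    """Root Mean Squared Error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(true_ratings, predicted_ratings)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def k_fold_indices(n_samples, n_folds):
    """Yield (train, test) index lists; each sample lands in exactly one test fold."""
    indices = list(range(n_samples))
    for fold in range(n_folds):
        test = indices[fold::n_folds]  # every n_folds-th sample, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


print("RMSE:", rmse([4, 3, 5], [3, 3, 5]))
for train, test in k_fold_indices(10, 5):
    print("train:", train, "test:", test)
```

Because RMSE squares each error before averaging, one large miss costs more than several small ones, which is why it is often reported next to MAE.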
Generate Predictions & Recommendations
Once the model is trained on the full dataset, you can predict a rating for any user-item pair. More practically, you can generate a list of top-N recommendations for a specific user.
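Recommendations only make sense for items a user has not rated yet, so the candidate set is every user-item pair missing from the training data. A minimal sketch of that idea on toy data follows; `unseen_pairs` is a hypothetical helper, and in the starter code the equivalent work is done by `trainset.build_anti_testset()`.

```python
def unseen_pairs(ratings):
    """Return every (user, item) pair that has no rating in `ratings`."""
    seen = {(u, i) for u, i, _ in ratings}
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    return [(u, i) for u in users for i in items if (u, i) not in seen]


# Toy (user, item, rating) triples, invented for this example.
toy_ratings = [("u1", "a", 5), ("u1", "b", 3), ("u2", "b", 4), ("u2", "c", 2)]
candidates = unseen_pairs(toy_ratings)
print(candidates)  # [('u1', 'c'), ('u2', 'a')]
```

The number of candidate pairs grows with users times items, so on real data this step is usually the slowest part of producing top-N lists.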

Starter code

from collections import defaultdict
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate

def get_top_n_recommendations(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions."""
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n

# 1. Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')

# 2. Use the SVD algorithm for Collaborative Filtering
algo = SVD()

# 3. Run 5-fold cross-validation and print results
print("Running 5-fold cross-validation...")
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)

# 4. Train the algorithm on the whole dataset and make predictions
trainset = data.build_full_trainset()
algo.fit(trainset)

# 5. Predict a rating for a specific user and item
# (r_ui is optional: the true rating, if known, printed next to the estimate)
uid = str(196)  # raw user id
iid = str(302)  # raw item id
pred = algo.predict(uid, iid, r_ui=4, verbose=True)

# 6. Generate top-10 recommendations for each user
# First, predict ratings for all (user, item) pairs that are NOT in the training set.
# Note: this scores every unrated pair, so it can take a while and use a lot of memory.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n_recommendations(predictions, n=10)

# Print the recommendations for a specific user
user_id_to_show = '196'
print(f"\nTop 10 recommendations for user {user_id_to_show}:")
for iid, rating in top_n[user_id_to_show]:
    print(f"  Item ID: {iid}, Predicted Rating: {rating:.2f}")