Article
python, recommendation-system, collaborative-filtering, svd, surprise-library, machine-learning, personalization
Build a Movie Recommender with Collaborative Filtering
Learn to predict user ratings using Collaborative Filtering. This guide uses Python's Surprise library to build a simple movie recommender system based on user-item interaction data, a core technique for personalization.
beginner · 15 min · 5 steps
The play
- Install the Surprise Library: First, set up your environment by installing `scikit-surprise` (for example, with `pip install scikit-surprise`), a Python library for building and analyzing recommender systems. It provides ready-to-use prediction algorithms for Collaborative Filtering, such as SVD and k-NN.
- Load a Dataset: Load a standard dataset to work with. The Surprise library can download built-in datasets such as MovieLens. We'll load the `ml-100k` dataset, which contains 100,000 ratings from 943 users on 1,682 movies.
- Select a Collaborative Filtering Algorithm: Choose a model-based Collaborative Filtering algorithm. We'll use Singular Value Decomposition (SVD), a matrix factorization technique that uncovers latent factors in the user-item interaction matrix and uses them to predict missing ratings.
- Train and Evaluate the Model: Use cross-validation to train the model and assess its accuracy. The data is split into folds; the model is trained on all but one fold and tested on the held-out fold, rotating until every fold has been used for testing. Averaging the results gives a more robust performance estimate than a single split. We'll measure the Root Mean Squared Error (RMSE), where lower is better.
- Generate Predictions & Recommendations: Once the model is trained on the full dataset, you can predict a rating for any user-item pair. More practically, you can generate a list of top-N recommendations for a specific user.
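To make the "latent factors" idea in the steps above more concrete, here is a minimal pure-Python sketch of a biased matrix factorization trained with stochastic gradient descent, in the spirit of what an SVD-style recommender does. It is not Surprise's implementation: the toy ratings, the function names (`train_mf`, `predict`, `rmse`, `baseline_rmse`), and the hyperparameters are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical toy (user, item, rating) triples; this is NOT MovieLens data.
RATINGS = [
    ("u1", "i1", 5.0), ("u1", "i2", 4.0), ("u1", "i3", 1.0),
    ("u2", "i1", 4.0), ("u2", "i2", 5.0), ("u2", "i4", 1.0),
    ("u3", "i3", 5.0), ("u3", "i4", 4.0), ("u3", "i1", 1.0),
    ("u4", "i3", 4.0), ("u4", "i4", 5.0), ("u4", "i2", 2.0),
]


def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Fit r_hat = mu + b_u + b_i + q_i . p_u by SGD (illustrative settings)."""
    rng = random.Random(seed)
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu, bi, p, q = {}, {}, {}, {}
    for u, i, _ in ratings:
        if u not in p:
            bu[u] = 0.0
            p[u] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
        if i not in q:
            bi[i] = 0.0
            q[i] = [rng.gauss(0, 0.1) for _ in range(n_factors)]
    for _ in range(n_epochs):
        for u, i, r in ratings:
            dot = sum(pf * qf for pf, qf in zip(p[u], q[i]))
            err = r - (mu + bu[u] + bi[i] + dot)
            # Gradient steps with L2 regularization on biases and factors.
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return mu, bu, bi, p, q


def predict(model, u, i):
    """Estimate a rating; unknown users or items fall back to mean plus known biases."""
    mu, bu, bi, p, q = model
    est = mu + bu.get(u, 0.0) + bi.get(i, 0.0)
    if u in p and i in q:
        est += sum(pf * qf for pf, qf in zip(p[u], q[i]))
    return min(5.0, max(1.0, est))  # clip to an assumed 1-5 rating scale


def rmse(model, ratings):
    """Root Mean Squared Error of the model's estimates on the given ratings."""
    se = sum((r - predict(model, u, i)) ** 2 for u, i, r in ratings)
    return math.sqrt(se / len(ratings))


def baseline_rmse(ratings):
    """RMSE of always predicting the global mean, for comparison."""
    mu = sum(r for _, _, r in ratings) / len(ratings)
    return math.sqrt(sum((r - mu) ** 2 for _, _, r in ratings) / len(ratings))


model = train_mf(RATINGS)
print(f"baseline RMSE (global mean): {baseline_rmse(RATINGS):.3f}")
print(f"training RMSE (factorized):  {rmse(model, RATINGS):.3f}")
print(f"estimate for unseen pair u1 -> i4: {predict(model, 'u1', 'i4'):.2f}")
```

Note that the RMSE printed here is measured on the same ratings used for training, so it flatters the model. Cross-validation, as described in the fourth step, measures error on held-out ratings instead.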
Starter code
from collections import defaultdict
from surprise import Dataset, SVD
from surprise.model_selection import cross_validate
def get_top_n_recommendations(predictions, n=10):
    """Return the top-N recommendations for each user from a set of predictions."""
    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    # Then sort the predictions for each user and keep the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n
# 1. Load the movielens-100k dataset
data = Dataset.load_builtin('ml-100k')
# 2. Use the SVD algorithm for Collaborative Filtering
algo = SVD()
# 3. Run 5-fold cross-validation and print results
print("Running 5-fold cross-validation...")
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
# 4. Train the algorithm on the whole dataset and make predictions
trainset = data.build_full_trainset()
algo.fit(trainset)
# 5. Predict a rating for a specific user and item
uid = str(196) # raw user id
iid = str(302) # raw item id
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
# 6. Generate top-10 recommendations for each user
# First, predict ratings for all pairs (user, item) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n_recommendations(predictions, n=10)
# Print the recommendations for a specific user
user_id_to_show = '196'
print(f"\nTop 10 recommendations for user {user_id_to_show}:")
for iid, rating in top_n[user_id_to_show]:
    print(f" Item ID: {iid}, Predicted Rating: {rating:.2f}")
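Step 6 in the starter code scores every (user, item) pair missing from the training set, which is well over a million predictions on `ml-100k`. When only one user's list is needed, a lighter approach is to score just that user's unrated items. The sketch below assumes `algo` is a fitted Surprise algorithm and `trainset` is the object returned by `build_full_trainset()`. The helper name `top_n_for_user` is hypothetical and is not part of Surprise.

```python
import heapq


def top_n_for_user(algo, trainset, raw_uid, n=10):
    """Hypothetical helper: rank only the items this one user has not rated."""
    # Surprise works with inner ids internally; to_inner_uid raises
    # ValueError if the raw user id was not part of the trainset.
    inner_uid = trainset.to_inner_uid(raw_uid)
    # trainset.ur maps an inner user id to a list of (inner item id, rating).
    seen = {inner_iid for inner_iid, _ in trainset.ur[inner_uid]}
    candidates = (
        trainset.to_raw_iid(inner_iid)
        for inner_iid in trainset.all_items()
        if inner_iid not in seen
    )
    # predict() takes raw ids and returns an object with an `est` field.
    scored = ((raw_iid, algo.predict(raw_uid, raw_iid).est) for raw_iid in candidates)
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])


# Usage with the objects from the starter code (not executed here):
# for iid, est in top_n_for_user(algo, trainset, "196", n=10):
#     print(f" Item ID: {iid}, Predicted Rating: {est:.2f}")
```

Using `heapq.nlargest` avoids sorting the full candidate list when `n` is small, though for a catalog of this size the difference is minor.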