Article
content-based-recommendation, recommendation-system, python, scikit-learn, tf-idf, cosine-similarity, nlp
Build a Content-Based Recommender with TF-IDF
Learn to build a Content-Based Recommendation system. This pack shows how to use TF-IDF to convert item descriptions into feature vectors and then use cosine similarity to find and recommend similar items based on their content.
beginner · 15 min · 4 steps
The play
- Prepare Item Data: First, gather your data. For a Content-Based Recommendation system, you need items and their descriptive content (e.g., product descriptions, movie genres, article text). We will use a simple list of movies and their genre descriptions.
- Vectorize Content with TF-IDF: Convert the text descriptions into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency). This creates a numerical 'fingerprint' for each item: a word scores higher the more often it appears in that item's description and the rarer it is across all descriptions. These vectors are what the similarity calculation operates on.
- Compute Cosine Similarity: Calculate the cosine similarity between every pair of item vectors in the TF-IDF matrix. This produces a new square matrix where each cell (i, j) contains a score representing how similar item i is to item j, from 0 (no shared terms) to 1 (same direction). This similarity matrix is the core of our recommender.
- Generate Recommendations: Finally, build a function that takes an item's title, finds its similarity scores against all other items, and returns the top N most similar items. Because the function only reads from the pre-computed similarity matrix, each Content-Based Recommendation lookup is fast.
Starter code
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# 1. Prepare Data
data = [
    {'title': 'The Matrix', 'description': 'sci-fi action cyberpunk'},
    {'title': 'Blade Runner', 'description': 'sci-fi noir cyberpunk'},
    {'title': 'John Wick', 'description': 'action thriller neo-noir'},
    {'title': 'Casablanca', 'description': 'romance drama war'},
    {'title': 'When Harry Met Sally', 'description': 'romance comedy'},
    {'title': 'Speed', 'description': 'action thriller'}
]
df = pd.DataFrame(data)
# 2. Vectorize Content
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'])
# 3. Compute Similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
# 4. Create Recommendation Function
# Map each movie title to its row index so a title can be looked up directly
indices = pd.Series(df.index, index=df['title']).drop_duplicates()
def get_content_recommendations(title, cosine_sim=cosine_sim, df=df):
    # Get the index of the movie that matches the title
    if title not in indices:
        return f"Movie '{title}' not found."
    idx = indices[title]

    # Pair every other movie's index with its similarity to this movie,
    # skipping the movie itself so it is never recommended back
    sim_scores = [(i, score) for i, score in enumerate(cosine_sim[idx]) if i != idx]

    # Sort the movies by similarity score, highest first
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the 3 most similar movies
    sim_scores = sim_scores[:3]

    # Get the movie indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the titles of the top 3 most similar movies
    return df['title'].iloc[movie_indices]
# --- Example Usage ---
liked_movie = 'The Matrix'
recommendations = get_content_recommendations(liked_movie)
print(f"Because you liked '{liked_movie}', you might also like:")
print(recommendations)
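The play describes returning the top N items, while the starter function fixes N at 3. One possible way to make N configurable and to return the scores alongside the titles is sketched below. The helper name `recommend_top_n`, the `top_n` parameter, and the trimmed four-movie table are assumptions made for this illustration, not part of the original starter code.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# A trimmed copy of the movie data, used only to keep this sketch self-contained.
movies = pd.DataFrame({
    'title': ['The Matrix', 'Blade Runner', 'John Wick', 'Speed'],
    'description': ['sci-fi action cyberpunk', 'sci-fi noir cyberpunk',
                    'action thriller neo-noir', 'action thriller'],
})

# Pre-compute the item-to-item similarity matrix once.
sim = cosine_similarity(TfidfVectorizer().fit_transform(movies['description']))

def recommend_top_n(title, top_n=2):
    """Hypothetical helper: return the top_n most similar titles with their scores."""
    matches = movies.index[movies['title'] == title]
    if len(matches) == 0:
        # Unknown title: return an empty table rather than a message string.
        return pd.DataFrame(columns=['title', 'score'])
    idx = matches[0]

    # Similarity of every movie to the query, with the query row itself removed.
    scores = pd.Series(sim[idx], index=movies.index).drop(idx)
    top = scores.sort_values(ascending=False).head(top_n)
    return pd.DataFrame({'title': movies.loc[top.index, 'title'], 'score': top})

print(recommend_top_n('Speed', top_n=2))
```

Returning an empty DataFrame for an unknown title keeps the return type consistent, so callers can check `.empty` instead of testing whether they received a string.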