Article

content-based-recommendation, recommendation-system, python, scikit-learn, tf-idf, cosine-similarity, nlp

Build a Content-Based Recommender with TF-IDF

Learn to build a content-based recommendation system. This pack shows how to use TF-IDF to convert item descriptions into feature vectors, then use cosine similarity to find and recommend items with similar content.

beginner · 15 min · 4 steps

The play

Prepare Item Data
First, gather your data. For a Content-Based Recommendation system, you need items and their descriptive content (e.g., product descriptions, movie genres, article text). We will use a simple list of movies and their genre descriptions.
Vectorize Content with TF-IDF
Convert the text descriptions into numerical vectors using TF-IDF (Term Frequency-Inverse Document Frequency). Each item gets a numerical 'fingerprint' in which a word's weight rises with how often it appears in that item's description and falls with how many other descriptions contain it. These vectors are what the similarity calculation operates on.
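To make the weighting concrete, here is a minimal sketch on a made-up three-item corpus (the `docs` list is an assumption for illustration, not part of the pack's data). It relies on scikit-learn's default settings, which smooth the IDF term and scale each row to unit length, so the exact numbers depend on those defaults.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: "action" appears in every item, the other terms in one each
docs = ["action thriller", "action comedy", "action drama"]

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

# vocabulary_ maps each learned term to its column in the matrix
vocab = vectorizer.vocabulary_

# Weights for the first document ("action thriller")
row0 = matrix[0].toarray().ravel()
action_weight = row0[vocab["action"]]
thriller_weight = row0[vocab["thriller"]]

# With the default norm='l2', every row has Euclidean length 1
row_norms = np.linalg.norm(matrix.toarray(), axis=1)

print("matrix shape:", matrix.shape)
print("action weight:", round(float(action_weight), 3))
print("thriller weight:", round(float(thriller_weight), 3))
```

The term shared by every item ("action") ends up with a lower weight than the term unique to the first item ("thriller"), which is the behavior the similarity step depends on.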
Compute Cosine Similarity
Calculate the cosine similarity between every pair of item vectors in the TF-IDF matrix. This produces a square matrix where cell (i, j) holds a score for how similar item i is to item j; because TF-IDF weights are non-negative, the scores run from 0 (no shared terms) to 1 (same term proportions). This similarity matrix is the core of our recommender.
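Cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below computes it by hand for one pair and compares it with scikit-learn's `cosine_similarity`; the `vectors` array and the `cosine` helper are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical item vectors (rows), standing in for rows of a TF-IDF matrix
vectors = np.array([
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 2.0],
])

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Items 0 and 1 share one of their two terms: 1 / (sqrt(2) * sqrt(2)) = 0.5
manual = cosine(vectors[0], vectors[1])

# Passing a single matrix compares every row with every other row
sim = cosine_similarity(vectors)

print("manual (0, 1):", round(manual, 3))
print("sklearn (0, 1):", round(float(sim[0, 1]), 3))
print("sklearn (0, 2):", round(float(sim[0, 2]), 3))  # no shared terms
```

The resulting matrix is symmetric with ones on the diagonal, since every item is identical to itself.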
Generate Recommendations
Finally, build a function that takes an item's title, looks up its similarity scores against all other items, and returns the top N most similar items. Because the similarity matrix is computed once up front, each recommendation is just a row lookup and a sort.
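The selection step can be sketched on its own with NumPy, independent of titles or DataFrames. The helper name `top_n_similar` and the `toy_sim` matrix are hypothetical; the idea is to sort one row of the similarity matrix and remove the query item by index rather than assuming it sorts first.

```python
import numpy as np

def top_n_similar(sim_matrix, query_idx, n=3):
    """Return indices of the n items most similar to query_idx, best first."""
    scores = sim_matrix[query_idx]
    # Negate the scores so argsort (ascending) yields descending similarity
    order = np.argsort(-scores, kind="stable")
    # Remove the query item explicitly; it may not be first if scores tie
    order = order[order != query_idx]
    return order[:n].tolist()

# Hypothetical 4-item similarity matrix (symmetric, ones on the diagonal)
toy_sim = np.array([
    [1.0, 0.8, 0.1, 0.4],
    [0.8, 1.0, 0.3, 0.2],
    [0.1, 0.3, 1.0, 0.6],
    [0.4, 0.2, 0.6, 1.0],
])

print(top_n_similar(toy_sim, query_idx=0, n=2))  # [1, 3]
```

Asking for more items than exist simply returns every other item, so the caller does not need a separate bounds check.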

Starter code

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# 1. Prepare Data
data = [
    {'title': 'The Matrix', 'description': 'sci-fi action cyberpunk'},
    {'title': 'Blade Runner', 'description': 'sci-fi noir cyberpunk'},
    {'title': 'John Wick', 'description': 'action thriller neo-noir'},
    {'title': 'Casablanca', 'description': 'romance drama war'},
    {'title': 'When Harry Met Sally', 'description': 'romance comedy'},
    {'title': 'Speed', 'description': 'action thriller'}
]
df = pd.DataFrame(data)

# 2. Vectorize Content
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['description'])

# 3. Compute Similarity
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)

# 4. Create Recommendation Function
# Map each movie title to its row index (keep the first row if a title repeats)
indices = pd.Series(df.index, index=df['title'])
indices = indices[~indices.index.duplicated(keep='first')]

def get_content_recommendations(title, cosine_sim=cosine_sim, df=df, top_n=3):
    # Get the index of the movie that matches the title
    if title not in indices:
        return f"Movie '{title}' not found."
    idx = indices[title]

    # Pair every movie index with its similarity score to the chosen movie
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Drop the chosen movie itself by index; relying on it sorting first
    # breaks when another movie has an identical description (score 1.0)
    sim_scores = [pair for pair in sim_scores if pair[0] != idx]

    # Sort the remaining movies from most to least similar
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Keep the indices of the top_n most similar movies
    movie_indices = [i[0] for i in sim_scores[:top_n]]

    # Return the titles of the top_n most similar movies
    return df['title'].iloc[movie_indices]

# --- Example Usage ---
liked_movie = 'The Matrix'
recommendations = get_content_recommendations(liked_movie)

print(f"Because you liked '{liked_movie}', you might also like:")
print(recommendations)
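One detail worth checking in the starter code is how hyphenated genre labels are tokenized. scikit-learn's default token pattern matches runs of two or more word characters, so 'sci-fi' is split into 'sci' and 'fi' and 'neo-noir' into 'neo' and 'noir'. The sketch below compares the default with a whitespace-only pattern; the `token_pattern=r"\S+"` choice is an assumption about what you might want, not something the pack prescribes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two descriptions taken from the starter data
descriptions = ["sci-fi action cyberpunk", "action thriller neo-noir"]

# Default tokenizer: splits on hyphens and other non-word characters
default_vocab = set(TfidfVectorizer().fit(descriptions).vocabulary_)

# Alternative (an assumption for illustration): treat each whitespace-separated
# chunk as one token, so hyphenated genres stay intact
whole_vocab = set(TfidfVectorizer(token_pattern=r"\S+").fit(descriptions).vocabulary_)

print("default:", sorted(default_vocab))
print("whitespace-only:", sorted(whole_vocab))
```

Which behavior is better depends on the data: splitting lets 'neo-noir' partially match 'noir', while keeping tokens whole treats them as unrelated genres.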