Exploring Movie Recommendation Systems using Python

Srikar V
10 min read · May 15, 2024


Hey everyone! After a bit of a hiatus, I’m back with another blog post. For those who might be new here, I’m Srikar V, a Computer Science student hailing from Bangalore, India. Today, I’m excited to delve into the fascinating world of recommendation systems by taking you through my journey of building a simple movie recommender system using data from the MovieLens website, courtesy of GroupLens Research.

Recommendation Systems

Let’s kick things off with a quick definition of recommendation systems. These algorithms sift through large datasets, uncovering patterns in them to deliver personalized recommendations to users. There are three main types:

  1. Popularity-Based Recommender: This one’s straightforward. It suggests items based on their overall popularity among users, disregarding individual preferences.
  2. Content-Based Recommender: Here, recommendations are tailored to match a user’s past likes by analyzing item attributes like genre, keywords, or metadata.
  3. Collaborative Filtering: This method leverages user-item interaction data to identify patterns and make personalized suggestions. It can be user-based, focusing on similar users’ preferences, or item-based, focusing on similar items that a user has interacted with.

In my project, I’ve dabbled in all these recommendation systems, each offering its unique approach to the task.

Approaches Implemented

Let’s take a closer look at how I’ve applied these methods to our movie dataset:

  1. Popularity-Based Recommender: This approach recommends movies based on their overall popularity, considering factors like average ratings and the number of user reviews.
  2. Content-Based Recommender: By analyzing movie genres, this approach suggests similar movies to a given one, catering to specific genre preferences.
  3. Item-Based Collaborative Filtering: This complex method recommends movies based on the preferences of users who’ve rated a given movie highly. It involves identifying similar users, filtering their liked movies, and making recommendations based on those preferences.

Code Implementation

  • I’ve implemented these recommendation systems in Python, using pandas and scikit-learn libraries for data manipulation and analysis.
  • To enhance user experience, I’ve used ipywidgets to create interactive widgets for each recommender type.
  • Cosine similarity serves as the metric for computing similarities between movies or users.

About Dataset

  • GroupLens Research has collected and made available rating data sets from the MovieLens website (http://movielens.org). The data sets were collected over various periods, depending on the size of the set.
  • The data consists of 105339 ratings applied over 10329 movies. The average rating is 3.5, and the minimum and maximum ratings are 0.5 and 5 respectively. There are 668 users who have given their ratings for 149532 movies.
  • Dataset Link Kaggle: https://www.kaggle.com/datasets/ayushimishra2809/movielens-dataset?resource=download

Code

In this section, I’ll be explaining the code for the movie recommender systems I’ve built using Python.

Data Reading and Manipulation

  • The dataset consists of two CSV files: movies.csv and ratings.csv.
  • movies.csv: contains movieId, title, and genres.
[Figure: movies.csv]
  • ratings.csv: contains userId, movieId, rating (out of 5), and timestamp.
[Figure: ratings.csv]

In movies.csv, all of a movie's genres are clubbed together in a single column, which makes it hard to perform operations on the DataFrame. To work around this, I created a new column for every unique genre. If a movie belongs to a particular genre, the respective column gets a value of 1; otherwise it gets a 0.
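
The genre-splitting step described above could be sketched as follows. This is a minimal illustration rather than the exact code from my notebook: it assumes the genres are separated by a pipe character (`|`), uses a small hand-made stand-in for movies.csv, and relies on pandas' `str.get_dummies` to build the 0/1 columns.

```python
import pandas as pd

# Toy stand-in for movies.csv (assumed columns: movieId, title, genres)
movies = pd.DataFrame({
    "movieId": [1, 2, 3],
    "title": ["Toy Story (1995)", "Jumanji (1995)", "Heat (1995)"],
    "genres": [
        "Adventure|Animation|Children",
        "Adventure|Children|Fantasy",
        "Action|Crime|Thriller",
    ],
})

# Split the pipe-separated string into one 0/1 column per unique genre
genre_dummies = movies["genres"].str.get_dummies(sep="|")

# Attach the new genre columns next to the original columns
movies = pd.concat([movies, genre_dummies], axis=1)

# List of genre names, e.g. for populating a dropdown later
unique_genres = sorted(genre_dummies.columns)

print(movies.drop(columns=["genres"]))
```

Each movie now has a 1 in the columns of the genres it belongs to and a 0 everywhere else, which is the shape the later filtering and similarity steps expect.
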

[Figure: updated movies_df]

1. Popularity-Based Recommendations

  • As briefly explained above, for popularity-based recommendations we need the popularity of the movies.
  • In this case, it would be the average ratings of each movie by different users and the number of users that have rated that specific movie.
  • The average ratings and the count can be considered as parameters of popularity for a particular movie.
  • To get the average ratings and the count we have created a new data frame called popularity using the below code:
import pandas as pd

# average rating and number of ratings for each movie (flat column names: 'mean', 'count')
rating_avg_count = ratings.groupby('movieId')['rating'].agg(['mean', 'count'])
rating_avg_count

# create a new DataFrame, popularity, by merging movies with the per-movie statistics
# (flat columns avoid merging a single-level frame with MultiIndex columns, which newer pandas versions reject)
popularity = pd.merge(left=movies, right=rating_avg_count, left_on='movieId', right_index=True)

popularity = popularity.rename(columns={'title': 'Movie Title', 'mean': 'Average Movie Rating', 'count': 'Number of Reviews'})

Function:

  • description: generates popularity-based recommendations depending on input genre.
  • params: genre - specific genre, threshold - minimum number of reviews, nums - number of recommendations.
  • returns: Dataframe with top nums recommendations.
# function to generate recommendations based on the popularity of the movie
def popularity_recommender(genre, threshold, nums):
    result = popularity[popularity[genre] == 1] # filter movies with the given genre
    result = result[result['Number of Reviews'] > threshold] # keep movies with more than the minimum number of reviews
    result = result.sort_values(by='Average Movie Rating', ascending=False)[:nums] # sort movies by average rating and select top n
    return result
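
To see why the review threshold matters, here is a small self-contained run of the same function on made-up data. The movie titles and numbers below are invented purely for illustration and do not come from the MovieLens dataset.

```python
import pandas as pd

# Hypothetical miniature of the `popularity` DataFrame (values are made up)
popularity = pd.DataFrame({
    "Movie Title": ["A", "B", "C", "D"],
    "Comedy": [1, 1, 1, 0],
    "Average Movie Rating": [4.8, 4.1, 3.9, 4.9],
    "Number of Reviews": [2, 150, 300, 500],
})

def popularity_recommender(genre, threshold, nums):
    result = popularity[popularity[genre] == 1]               # movies in the genre
    result = result[result["Number of Reviews"] > threshold]  # enough reviews
    return result.sort_values(by="Average Movie Rating", ascending=False)[:nums]

# Movie A has the best comedy rating but only 2 reviews, so it is filtered out;
# movie D is highly rated but is not a comedy.
top = popularity_recommender("Comedy", threshold=100, nums=2)
print(top[["Movie Title", "Average Movie Rating", "Number of Reviews"]])
```

Without the threshold, a movie with a handful of very high ratings would outrank movies that many users have rated well.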

IpyWidget:

import ipywidgets as widgets
from IPython.display import display, Markdown

# Function to handle button click for popularity-based recommendations
def handle_popularity(event):
    with recommendation_list:
        recommendation_list.clear_output()
        genre = genre_dropdown.value
        threshold = review_threshold_input.value
        nums = recommendation_count_input.value
        recommendations = popularity_recommender(genre, threshold, nums)[['Movie Title', 'genres', 'Average Movie Rating', 'Number of Reviews']] # select relevant columns
        display(recommendations) # Display the recommendations

# Widgets for popularity-based recommendations
genre_dropdown = widgets.Dropdown(
    options=unique_genres,
    description='Select genre:'
)

genre_dropdown.style.description_width = '200px'
genre_dropdown.layout.width = '400px'

review_threshold_input = widgets.IntText(
    description='Minimum reviews threshold:'
)

review_threshold_input.style.description_width = '200px'
review_threshold_input.layout.width = '400px'

recommendation_count_input = widgets.IntText(
    description='Number of recommendations:'
)

recommendation_count_input.style.description_width = '200px'
recommendation_count_input.layout.width = '400px'

recommendation_list = widgets.Output() # Output widget to display recommendations

popularity_button = widgets.Button(description='Generate Recommendations')

popularity_button.layout.width = '200px'

popularity_button.on_click(handle_popularity)
popularity_widgets = widgets.VBox([genre_dropdown, review_threshold_input, recommendation_count_input, popularity_button])

display(Markdown("## Popularity-Based Recommendations"), popularity_widgets, recommendation_list) # Display the widgets and the output widget

[Figure: IpyWidget Popularity-based]

2. Content-Based Recommendations

  • Content-based recommendations suggest movies that are similar to a given movie. Here, similarity is measured on the movies' genres.
  • Since the input is a movie title, we need to be able to look that title up in the dataset, so I’ve built a simple search engine using `TfidfVectorizer`:
import re
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# function to clean the title of the movie (keep only letters, digits and spaces)
def clean_title(title):
    return re.sub("[^a-zA-Z0-9 ]", "", title)

# apply the clean_title function to the movie titles
movies["Clean Title"] = movies['title'].apply(clean_title)

# tfidf vectorizer for the movie titles to generate a matrix of TF-IDF values
vectorizer = TfidfVectorizer(ngram_range=(1,2)) # ngram_range -> lets the vectorizer use not only single words but also pairs of consecutive words (bigrams)

tfidf = vectorizer.fit_transform(movies['Clean Title'])

# function to search for similar titles based on input using cosine similarity between the TF-IDF matrix and the input title
def search_title(title):
    title = clean_title(title) # clean the input title
    query_vec = vectorizer.transform([title]) # transform the input title into a TF-IDF vector
    similarity = cosine_similarity(query_vec, tfidf).flatten() # calculate the cosine similarity between the input title and all movie titles
    indices = similarity.argsort()[::-1][:5] # sort the similarity scores in descending order and select the top 5 indices
    results = movies.iloc[indices] # get the rows corresponding to the top 5 indices
    return results #[['Clean Title', 'genres']]

[Figure: search engine]
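
As a quick sanity check, the same search idea can be run end to end on a handful of titles. This is only a sketch on a toy DataFrame; the `top_n` parameter is a hypothetical addition for the demo (the function above always returns five rows).

```python
import re
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the movies DataFrame
movies = pd.DataFrame({
    "title": ["Toy Story (1995)", "Toy Story 2 (1999)", "Heat (1995)", "Jumanji (1995)"],
})

def clean_title(title):
    # keep only letters, digits and spaces
    return re.sub("[^a-zA-Z0-9 ]", "", title)

movies["Clean Title"] = movies["title"].apply(clean_title)

# unigrams and bigrams of the cleaned titles
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(movies["Clean Title"])

def search_title(title, top_n=2):
    # top_n is a hypothetical parameter added for this demo
    query_vec = vectorizer.transform([clean_title(title)])
    similarity = cosine_similarity(query_vec, tfidf).flatten()
    indices = similarity.argsort()[::-1][:top_n]
    return movies.iloc[indices]

results = search_title("toy story")
print(results["title"].tolist())
```

Only the two Toy Story titles share any words with the query, so they are the ones returned; the other titles have a similarity of zero.
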
  • To measure similarity between movies, we build a cosine similarity matrix using only the genre columns. Because the genre columns contain only 0s and 1s, the cosine similarity ranges from 0.0 (no genres in common) to 1.0 (identical genre sets).
# find cosine similarity of the movie's genres
movies_similarity = cosine_similarity(movies.drop(['title','movieId','genres','Clean Title'],axis=1))
[Figure: cosine similarity matrix]
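
To make the scores concrete: the cosine similarity of two vectors a and b is their dot product divided by the product of their lengths, a·b / (|a| |b|). The sketch below works this out for three made-up genre vectors; the genre columns and movies are illustrative assumptions, not rows from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Columns (assumed for this example): Action, Adventure, Children, Comedy
genre_vectors = np.array([
    [0, 1, 1, 0],  # movie 0: Adventure, Children
    [0, 1, 1, 1],  # movie 1: Adventure, Children, Comedy
    [1, 0, 0, 0],  # movie 2: Action
])

# Pairwise similarity between every pair of rows (3 x 3 matrix)
sim = cosine_similarity(genre_vectors)

# Same value for movies 0 and 1, computed by hand from the formula
a, b = genre_vectors[0], genre_vectors[1]
manual = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(sim.round(3))
print(round(float(manual), 3))
```

Movies 0 and 1 share two genres, giving 2 / (√2 · √3) ≈ 0.816, while movie 2 shares no genre with either and scores 0.
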
  • Now we combine the search engine with the similarity matrix. The search engine’s top result gives us the index of the input movie; from that movie's row in the similarity matrix we then pick the highest-scoring movies, as many as the user asked for.

Function:

  • description: generates recommendations based on the similarity of the movie’s genres with input movie genre
  • params: movie_id - index of the movie's row as returned by search_title (not the movieId column), nums - number of recommendations
  • returns: Dataframe with top nums recommendations
# function to generate recommendations based on the similarity of the movie's genres
# note: movie_id is used as a row position, so this assumes movies has the default 0..n-1 index
def content_recommender(movie_id, nums):
    similar_movies_indices = movies_similarity[movie_id].argsort()[::-1][:nums] # indices of the nums most similar movies to the input movie
    similar_movies = movies.loc[similar_movies_indices].copy() # get the rows for those indices (copy so we can add a column safely)
    similar_movies["score"] = movies_similarity[movie_id][similar_movies_indices] # attach the similarity scores
    return similar_movies[['Clean Title', 'genres', 'score']] # the input movie itself scores 1.0 and can appear here

# Function to handle button click for content-based recommendations
def handle_content(event):
    with recommendation_list:
        recommendation_list.clear_output()
        title = movie_title_input.value
        nums = recommendation_count_input.value
        results = search_title(title)
        movie_id = results.index[0] # get the row index of the first (best) search result

        recommendations = content_recommender(movie_id, nums) # get the recommendations for the movie
        display(recommendations) # Display the recommendations

IpyWidget:

# Widgets for content-based recommendations
movie_title_input = widgets.Text(
    description='Enter movie title:'
)

recommendation_list = widgets.Output() # Output widget to display recommendations

movie_title_input.style.description_width = '200px'
movie_title_input.layout.width = '400px'

recommendation_count_input = widgets.IntText(
    description='Number of recommendations:'
)

recommendation_count_input.style.description_width = '200px'
recommendation_count_input.layout.width = '400px'

content_button = widgets.Button(description='Generate Recommendations')
content_button.layout.width = '200px'
content_button.on_click(handle_content)
content_widgets = widgets.VBox([movie_title_input, recommendation_count_input, content_button])

display(Markdown("## Content-Based Recommendations"), content_widgets, recommendation_list) # Display the widgets and the output widget

[Figure: content-based recommender]

3. Collaborative Filtering

  • I’ve implemented item-based collaborative filtering because we expect a recommender to suggest movies that are rated highly by users who also watched and liked the movie you like.
  • This is fairly involved, so I’ll explain it step by step:

Step 1: Find the users who watched the input movie and rated it higher than 4. This gives us the users who liked the same movie you did.

# find similar users who watched the input movie and rated it higher than 4
similar_users_new = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique()

Step 2: Since we are after other movies and not just the input movie, collect all the movies that these similar users rated above 4.

# find the other movies that the similar users have watched and rated higher than 4
similar_users_recs = ratings[(ratings["userId"].isin(similar_users_new)) & (ratings["rating"] > 4)]["movieId"]

Step 3: To narrow this down, compute the fraction of similar users who liked each movie and keep only the movies liked by more than 10% of the similar users.

similar_users_recs = similar_users_recs.value_counts() / len(similar_users_new)  # fraction of similar users that liked each movie

similar_users_recs = similar_users_recs[similar_users_recs > 0.10] # keep the movies liked by more than 10% of the similar users

Step 4: To judge how distinctive these movies are, we also look at everyone who rated any of them higher than 4, and compute for each movie the fraction of those users who liked it.

# all ratings above 4 for the movies that the similar users liked
all_users = ratings[(ratings["movieId"].isin(similar_users_recs.index)) & (ratings["rating"] > 4)]

# for each movie, the fraction of those users (not just the similar ones) that rated it higher than 4
all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique())

Step 5: Concatenate the two fractions (similar users and all users) for each movie and take their ratio. We want movies with a large gap between the two: not movies that are generally liked by everyone (`all`), but movies that the `similar` users like disproportionately.

# concatenate the fractions for similar users and all users that liked the movies
rec_percentages = pd.concat([similar_users_recs, all_users_recs], axis=1)
rec_percentages.columns = ["similar", "all"]

# finding the ratio of similar and all percentages
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

Step 6: We must sort the ratios in descending order and select the top n recommendations.
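
A minimal sketch of this last step on its own, using a made-up `rec_percentages` table in the shape Step 5 produces (the movie IDs and fractions here are invented for illustration):

```python
import pandas as pd

# Hypothetical output of Step 5: one row per movieId
rec_percentages = pd.DataFrame(
    {"similar": [1.0, 0.5, 0.3], "all": [0.8, 0.1, 0.3]},
    index=pd.Index([1, 50, 260], name="movieId"),
)

# Ratio of "liked by similar users" to "liked by everyone"
rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"]

# Sort by score, highest first, and keep the top n rows
nums = 2
top_n = rec_percentages.sort_values("score", ascending=False).head(nums)
print(top_n)
```

Movie 50 is liked by half of the similar users but only a tenth of everyone, so it ranks first even though movie 1 is liked by more similar users in absolute terms.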

Function:

  • description: generates recommendations using item-based collaborative filtering by identifying similar users and selecting movies watched by them
  • params: movie_id - movieId extracted from search_title, nums - number of recommendations
  • returns: Dataframe with top nums recommendations
def item_collaborative_recommender(movie_id, nums):
    similar_users_new = ratings[(ratings["movieId"] == movie_id) & (ratings["rating"] > 4)]["userId"].unique() # find similar users who watched the input movie and rated it higher than 4
    similar_users_recs = ratings[(ratings["userId"].isin(similar_users_new)) & (ratings["rating"] > 4)]["movieId"] # find the other movies that the similar users have watched and rated higher than 4

    similar_users_recs = similar_users_recs.value_counts() / len(similar_users_new) # fraction of similar users that liked each movie
    similar_users_recs = similar_users_recs[similar_users_recs > 0.10] # keep the movies liked by more than 10% of the similar users

    all_users = ratings[(ratings["movieId"].isin(similar_users_recs.index)) & (ratings["rating"] > 4)] # all ratings above 4 for the movies that the similar users liked
    all_users_recs = all_users["movieId"].value_counts() / len(all_users["userId"].unique()) # for each movie, the fraction of those users that rated it higher than 4

    rec_percentages = pd.concat([similar_users_recs, all_users_recs], axis=1) # concatenate the fractions for similar users and all users
    rec_percentages.columns = ["similar", "all"] # rename the columns

    rec_percentages["score"] = rec_percentages["similar"] / rec_percentages["all"] # ratio of the similar and all fractions

    rec_percentages = rec_percentages.sort_values("score", ascending=False) # sort the ratio (score) in descending order

    return rec_percentages.head(nums).merge(movies, left_index=True, right_on="movieId")[["title","genres","score"]]

IpyWidget:

def handle_item_collaborative(event):
    with recommendation_list:
        recommendation_list.clear_output()
        title = movie_input.value
        results = search_title(title)
        movie_id = results.iloc[0]["movieId"]
        nums = recommendation_num_input.value
        display(item_collaborative_recommender(movie_id, nums))

movie_input = widgets.Text(
    description='Movie Title:',
    disabled=False
)

recommendation_list = widgets.Output()

movie_input.style.description_width = '200px'
movie_input.layout.width = '400px'

recommendation_num_input = widgets.IntText(
    description='Number of Recommendations:'
)

recommendation_num_input.style.description_width = '200px'
recommendation_num_input.layout.width = '400px'

item_collaborative_button = widgets.Button(description='Generate Recommendations')
item_collaborative_button.layout.width = '200px'
item_collaborative_button.on_click(handle_item_collaborative)

item_collaborative_widgets = widgets.VBox([movie_input, recommendation_num_input, item_collaborative_button])

display(Markdown('## Item-Based Collaborative Recommendations'), item_collaborative_widgets, recommendation_list)

[Figure: collaborative filtering]

Conclusion

In conclusion, diving into the realm of recommendation systems has been an enriching experience for me. From understanding the intricacies of different algorithms to implementing them in Python, this project has provided valuable insights into the world of data science and machine learning.

Whether you’re a movie buff seeking personalized recommendations or a data enthusiast exploring the nuances of recommendation systems, I hope this blog post has sparked your curiosity and inspired you to delve deeper into this fascinating field.

Until next time, happy coding!
