Non-personalized recommenders systems is a system where everyones gets to see the same set of recommendation. The site doesn't logs in user details and transaction info, and the site assumes that every user is same and displays a list of general recommendations.
To explain this I am going to use the MovieLens data sets that were collected by the GroupLens Research Project at the University of Minnesota.
This data set consists of:
* 100,000 ratings (1-5) from 943 users on 1682 movies.
* Each user has rated at least 20 movies.
* Simple demographic info for the users (age, gender, occupation, zip)
I have made available their dataset and the ipython notebook used for this exercise in my github page.
import pandas as pd
import numpy as np
from plotly import tools
import plotly.plotly as py
from plotly.graph_objs import *
import datetime
dfdata=pd.read_csv('data/ml-100k/u.data' , sep="\t" , names=['userid', 'itemid', 'rating', 'timestamp'])
dfdata.head()
| userid | itemid | rating | timestamp |
| --- | --- | --- | --- | --- |
| 0 | 196 | 242 | 3 | 881250949 |
| 1 | 186 | 302 | 3 | 891717742 |
| 2 | 22 | 377 | 1 | 878887116 |
| 3 | 244 | 51 | 2 | 880606923 |
| 4 | 166 | 346 | 1 | 886397596 |
The fields in this dataset are the userid, itemid (movie) that a paerticular user has rating, his rating and the corresponding time at which the user has rated the movie. Please note that the timestamps are unix seconds since 1/1/1970 UTC. Lets create a fundtion that converts this information into actual datetime object
def epoch_to_datetime(timeinsecs):
return datetime.datetime.fromtimestamp(timeinsecs)
Converting the timestamps into the corresponing datetime object
dfdata['timestamp']=dfdata['timestamp'].apply(lambda x:epoch_to_datetime(x))
dfdata.head()
| userid | itemid | rating | timestamp |
| --- | --- | --- | --- | --- |
| 0 | 196 | 242 | 3 | 1997-12-04 07:55:49 |
| 1 | 186 | 302 | 3 | 1998-04-04 11:22:22 |
| 2 | 22 | 377 | 1 | 1997-11-06 23:18:36 |
| 3 | 244 | 51 | 2 | 1997-11-26 21:02:03 |
| 4 | 166 | 346 | 1 | 1998-02-01 21:33:16 |
dfdata.info()
Int64Index: 100000 entries, 0 to 99999
Data columns (total 4 columns):
userid 100000 non-null int64
itemid 100000 non-null int64
rating 100000 non-null int64
timestamp 100000 non-null datetime64[ns]
dtypes: datetime64[ns](1), int64(3)
memory usage: 3.8 MB
Now lets read the u.item data that contains the corresponding movie names for the movie id (itemid)
Information about the items (movies); this is a tab separated list of
| movie id | movie title | release date | video release date |
IMDb URL | unknown | Action | Adventure | Animation |
Children's | Comedy | Crime | Documentary | Drama | Fantasy |
Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi |
Thriller | War | Western |
The last 19 fields are the genres, a 1 indicates the movie is of that genre, a 0 indicates it is not; movies can be in several genres at once. The movie ids are the ones used in the u.data data set.
dfmovie=pd.read_csv('data/ml-100k/u.item' , sep='|' , names=['itemid','movie_title','release_date','video_release_date','imdb_url','unknown','action','advanture','animation','children','comedy','crime','documentary','drama','fantasy','film-noir','horror','musical','mystery','romance','scifi','thriller','war','western'],parse_dates=True)
dfmovie.head()
| itemid | movie_title | release_date | video_release_date | imdb_url | unknown | action | advanture | animation | children | ... | fantasy | film-noir | horror | musical | mystery | romance | scifi | thriller | war | western |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | Toy Story (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | GoldenEye (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | Four Rooms (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3 | 4 | Get Shorty (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | Copycat (1995) | 01-Jan-1995 | NaN | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
5 rows × 24 columns
For the sake of this exercise I am just using the movie title and itemid (movie id), lets ignore other columns
dfmovie=dfmovie[['itemid','movie_title']]
dfmovie.head()
| itemid | movie_title |
| --- | --- | --- |
| 0 | 1 | Toy Story (1995) |
| 1 | 2 | GoldenEye (1995) |
| 2 | 3 | Four Rooms (1995) |
| 3 | 4 | Get Shorty (1995) |
| 4 | 5 | Copycat (1995) |
Lets merge all these data into our original dataframe so that we also have all the movie information alongside.
df=pd.merge(dfdata,dfmovie,on='itemid')
df.head()
| userid | itemid | rating | timestamp | movie_title |
| --- | --- | --- | --- | --- | --- |
| 0 | 196 | 242 | 3 | 1997-12-04 07:55:49 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 1997-10-01 16:06:30 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 1998-01-03 20:37:51 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 1997-11-09 21:03:55 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 1997-10-10 10:16:33 | Kolya (1996) |
Now say we are entering a website like Netflix without logging in and not providing any user-specific information. For non-personalized recommendation, the question is what movies do we recommend for the users to watch. There are many ways to approach this, lets take a stab at it one by one.
## MEAN RATING
In this approach we calculate the mean mean rating for each movie, order with the highest rating listed first, and submit the top rated movies.
grouped=df.groupby('movie_title')
grouped.aggregate(np.mean)['rating'].sort(ascending=False,inplace=False).head(20)
/Users/aukauk/anaconda3/envs/python2/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning:
sort is deprecated, use sort_values(inplace=True) for for INPLACE sorting
movie_title
Marlene Dietrich: Shadow and Light (1996) 5.000000
Prefontaine (1997) 5.000000
Santa with Muscles (1996) 5.000000
Star Kid (1997) 5.000000
Someone Else's America (1995) 5.000000
Entertaining Angels: The Dorothy Day Story (1996) 5.000000
Saint of Fort Washington, The (1993) 5.000000
Great Day in Harlem, A (1994) 5.000000
They Made Me a Criminal (1939) 5.000000
Aiqing wansui (1994) 5.000000
Pather Panchali (1955) 4.625000
Anna (1996) 4.500000
Everest (1998) 4.500000
Maya Lin: A Strong Clear Vision (1994) 4.500000
Some Mother's Son (1996) 4.500000
Close Shave, A (1995) 4.491071
Schindler's List (1993) 4.466443
Wrong Trousers, The (1993) 4.466102
Casablanca (1942) 4.456790
Wallace & Gromit: The Best of Aardman Animation (1996) 4.447761
Name: rating, dtype: float64
avg_rating_df=grouped.aggregate(np.mean)['rating'].sort(ascending=False,inplace=False).to_frame().reset_index()
/Users/aukauk/anaconda3/envs/python2/lib/python2.7/site-packages/ipykernel/__main__.py:1: FutureWarning:
sort is deprecated, use sort_values(inplace=True) for for INPLACE sorting
avg_rating_df.head()
| movie_title | rating |
| --- | --- | --- |
| 0 | Marlene Dietrich: Shadow and Light (1996) | 5 |
| 1 | Prefontaine (1997) | 5 |
| 2 | Santa with Muscles (1996) | 5 |
| 3 | Star Kid (1997) | 5 |
| 4 | Someone Else's America (1995) | 5 |
Say I want to recommend all the users in the front landing page the movies with the average rating of 5 stars, I would recommend the following movies.
avg_rating_df[avg_rating_df['rating']==5]
| movie_title | rating |
| --- | --- | --- |
| 0 | Marlene Dietrich: Shadow and Light (1996) | 5 |
| 1 | Prefontaine (1997) | 5 |
| 2 | Santa with Muscles (1996) | 5 |
| 3 | Star Kid (1997) | 5 |
| 4 | Someone Else's America (1995) | 5 |
| 5 | Entertaining Angels: The Dorothy Day Story (1996) | 5 |
| 6 | Saint of Fort Washington, The (1993) | 5 |
| 7 | Great Day in Harlem, A (1994) | 5 |
| 8 | They Made Me a Criminal (1939) | 5 |
| 9 | Aiqing wansui (1994) | 5 |
Something seems obviously wrong (apologies if your faviourite movies is among in this list), but most of these movies seems not that popular compared to several other good movies I have seen.
To dig more closer, lets get an histogram of how many people have rated these movies that show the average rating of 5 stars.
movie_list=avg_rating_df[avg_rating_df['rating']==5]['movie_title']
num_raters=[]
for movie in movie_list:
num_raters.append(len(df[df['movie_title']==movie]))
print num_raters
data = [
Histogram (
x=num_raters
)
]
py.iplot(data, filename = 'Histogram_People_Rating_A_Movie')
[1, 3, 2, 3, 1, 1, 2, 1, 1, 1]
This is exactly where our problem is. The problem is that all of these movies are rated by either 1, 2 or 3 users and they have all given 5 stars to it. That doesnt make these movies great and we would rather shows the movies that have been rated by many number of users and also with highest ratings on our recommendations. This is the issue with a lot of non-personalized recommenders.
## SCALES AND NORMALIZATION
Take a look at the following link : https://janav.wordpress.com/2013/09/21/non-personalized-recommenders/
The author explains amazon.com's non-personalized recommendations that appear in section "customers who bought this item also bought" when users tries to look for a product.
How would it look like when we want to recommend movies instead of amazon product.
To illustrates this, I have taken out the following 5 movies in the movielens dataset.
movies=['Star Wars (1977)','Toy Story (1995)','GoodFellas (1990)','12 Angry Men (1957)', 'Alien 3 (1992)','']
movies
['Star Wars (1977)',
'Toy Story (1995)',
'GoodFellas (1990)',
'12 Angry Men (1957)',
'Alien 3 (1992)',
'']
dfr=df[df['movie_title'].isin(movies)]
dfr.head()
| userid | itemid | rating | timestamp | movie_title |
| --- | --- | --- | --- | --- | --- |
| 3397 | 308 | 1 | 4 | 1998-02-17 09:28:52 | Toy Story (1995) |
| 3398 | 287 | 1 | 5 | 1997-09-26 21:21:28 | Toy Story (1995) |
| 3399 | 148 | 1 | 4 | 1997-10-16 09:30:11 | Toy Story (1995) |
| 3400 | 280 | 1 | 4 | 1998-04-04 06:33:46 | Toy Story (1995) |
| 3401 | 66 | 1 | 3 | 1997-12-31 12:48:44 | Toy Story (1995) |
dfr=dfr.pivot_table(values='rating', index='userid', columns='movie_title')
dfr.head()
| movie_title | 12 Angry Men (1957) | Alien 3 (1992) | GoodFellas (1990) | Star Wars (1977) | Toy Story (1995) |
| --- | --- | --- | --- | --- | --- |
| userid |
| --- | --- | --- | --- | --- | --- |
| 1 | 5 | NaN | 4 | 5 | 5 |
| 2 | NaN | NaN | NaN | 5 | 4 |
| 4 | NaN | NaN | NaN | 5 | NaN |
| 5 | NaN | NaN | NaN | 4 | 4 |
| 6 | 4 | NaN | 4 | 4 | 4 |
Lets convert all nans to 0 and all other ratings to 1s.
def convert_to_zeros_ones(x):
if x > 0:
return 1
else:
return 0
dfr=dfr.applymap(lambda x:convert_to_zeros_ones(x))
dfr.head()
| movie_title | 12 Angry Men (1957) | Alien 3 (1992) | GoodFellas (1990) | Star Wars (1977) | Toy Story (1995) |
| --- | --- | --- | --- | --- | --- |
| userid |
| --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 1 | 1 | 1 |
| 2 | 0 | 0 | 0 | 1 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 |
| 5 | 0 | 0 | 0 | 1 | 1 |
| 6 | 1 | 0 | 1 | 1 | 1 |
Say I like "12 Angry Men" and I want to watch this movie. In order for the non-personalized recommender system to recommender other movies that people have watched based on this movie, we use the following formulae.
Score(X, Y) = Total People who watched X and Y / Total People who watched X
Total People who watched X are: (non-null entries)
dfr['12 Angry Men (1957)'].sum()
125
Total People who watched X and Y:
y_vals=['Star Wars (1977)','Toy Story (1995)','GoodFellas (1990)', 'Alien 3 (1992)']
score_dict={}
for y in y_vals:
x_and_y=pd.DataFrame(dfr['12 Angry Men (1957)']*dfr[y]).sum()
score_dict[y]=x_and_y.values*1./dfr['12 Angry Men (1957)'].sum()
score_dict
{'Alien 3 (1992)': array([ 0.208]),
'GoodFellas (1990)': array([ 0.48]),
'Star Wars (1977)': array([ 0.872]),
'Toy Story (1995)': array([ 0.648])}
Hence based on the highest scores, we would recommend 'Star Wars (1977)' and 'Toy Story (1995)' for the user to watch.
### BANANA TRAP
We want to make sure we aren't in banana trap here. What is a banana trap?
In a grocery store most of the customers will buy bananas. If someone buys a razor and a banana then you cannot tell that the purchase of a razor influenced the purchase of banana. Hence we need to adjust the formula to handle this case as well. The modified version is
(Total People who watched X and Y / Total People who watched X) /
(Total People who did not watch X but watched Y / Total People who did not watch X)
We have the numberator, we just need to calculate the denominator. Lets create a seperate column for people who did not watch 12 Angry Men.
'''convert 1s to 0s and vice versa for 12 angry men'''
def ones_to_zeros(x):
if x==1:
return 0
else:
return 1
dfr['Not_12_Angry_Men']=dfr['12 Angry Men (1957)'].apply(lambda x:ones_to_zeros(x))
dfr.head()
| movie_title | 12 Angry Men (1957) | Alien 3 (1992) | GoodFellas (1990) | Star Wars (1977) | Toy Story (1995) | Not_12_Angry_Men |
| --- | --- | --- | --- | --- | --- | --- |
| userid |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 1 | 0 | 1 | 1 | 1 | 0 |
| 2 | 0 | 0 | 0 | 1 | 1 | 1 |
| 4 | 0 | 0 | 0 | 1 | 0 | 1 |
| 5 | 0 | 0 | 0 | 1 | 1 | 1 |
| 6 | 1 | 0 | 1 | 1 | 1 | 0 |
score_dict={}
for y in y_vals:
x_and_y=pd.DataFrame(dfr['12 Angry Men (1957)']*dfr[y]).sum()
tot_x=dfr['12 Angry Men (1957)'].sum()
notx_and_y=pd.DataFrame(dfr['Not_12_Angry_Men']*dfr[y]).sum()
tot_notx=dfr['Not_12_Angry_Men'].sum()
score_dict[y]=(x_and_y.values*1./tot_x)/(notx_and_y*1./tot_notx)
score_dict
{'Alien 3 (1992)': 0 1.576865
dtype: float64, 'GoodFellas (1990)': 0 1.622169
dtype: float64, 'Star Wars (1977)': 0 1.032051
dtype: float64, 'Toy Story (1995)': 0 0.97986
dtype: float64}
This completely changes the game. Our normalized recommendation after adjusting for banana trap are GoodFellas (1990) and Alien 3 (1992) . This clearly shows that almost a lot of users had rated on the Toy Story and Star Wars.
# DAMPED MEANS
df.head()
| userid | itemid | rating | timestamp | movie_title |
| --- | --- | --- | --- | --- | --- |
| 0 | 196 | 242 | 3 | 1997-12-04 07:55:49 | Kolya (1996) |
| 1 | 63 | 242 | 3 | 1997-10-01 16:06:30 | Kolya (1996) |
| 2 | 226 | 242 | 5 | 1998-01-03 20:37:51 | Kolya (1996) |
| 3 | 154 | 242 | 3 | 1997-11-09 21:03:55 | Kolya (1996) |
| 4 | 306 | 242 | 5 | 1997-10-10 10:16:33 | Kolya (1996) |
grouped=df.groupby(['movie_title'])
dm_df=grouped['rating'].agg([np.sum, np.mean]).reset_index()
dm_df['count_ratings']=grouped.size().reset_index()[0]
dm_df.head()
| movie_title | sum | mean | count_ratings |
| --- | --- | --- | --- | --- |
| 0 | 'Til There Was You (1997) | 21 | 2.333333 | 9 |
| 1 | 1-900 (1994) | 13 | 2.600000 | 5 |
| 2 | 101 Dalmatians (1996) | 317 | 2.908257 | 109 |
| 3 | 12 Angry Men (1957) | 543 | 4.344000 | 125 |
| 4 | 187 (1997) | 124 | 3.024390 | 41 |