New recommendation engine

Description of the movie recommendation algorithm made by Tomasz Kociumaka, Jakub Tlałka and Filip Pacanowski

Introduction

I'm going to present a draft of methods we used to calculate movies and users similiarity functions which are used by our program. All of the described ideas need to be looked through and improved in order for them to be really useful in practice.

F[i][j] - movie similiarity function taking i and j movies and returning real number in interval [0,1)
U[i][j] - user similiarity function taking i and j user and returning real number in interval [0,1)
   
The smaller the value of functions is, the higher the similiarity is. Particularly F[i][i]=0 , U[i][i]=0

Phase 1 - calculating the U function

For given two users - u1,u2, we consider two parameters. How similiar are the sets of movies rated by them and what is the average difference between those rates. The parameters are calculated by the specified funcions and combined with weights into result. When comparing users whose numbers of ratings differ significantly, we avoid situations when the fact that the set of movies rated by u1 is nearly subset of those rated by u2 causes them to be very similiar. In such a situation it is better to base more on the marks difference. Generally, the bigger the sets of rated movies are, the more important is the average rating difference.

Phase 2 - calculating the F function

For given two movies f1,f2 we again calculate two parameters. Using the U function we try to tell how similiar are the sets of users who have rated the movies. We take some amount of pairs (u1,u2) where u1 rated f1 and u2 rated f2 and basing on value of U[u1][u2] we increase the first parameter. Together with it, we take the average difference of those ratings and combine them into a result. Similiarly to the phase 1, the significance of marks difference increases with the size of the examined set of users.

Phase 3 - improving the F function

In this phase we try to make the F function more similiar to a metric. For any pair of movies (a,b) we look through the movies similiar to both of them. For such a movie - c, we change the value of F[a][b] basing on F[a][c] and F[b][c]. The significance of this phase is limited in order not to spoil the results. One of the examples of its good influence on the results can be the Pirates of the Caribbean movies. We observed that the difference between parts 1 and 3 is too big comparing to the differences between 1 and 2 or 2 and 3. We suppose it is caused by the ratings difference which originated from the different receptions of those parts. Based on the tests, we decided to include it in our algorithm.

Considering other factors

Basing on the information about the cast of the movies, we try to improve the F function a bit. We make simple modifications taking into consideration the number of actors staring both of the given movies or the existence of the same director. It is a well known fact(we have also observed it in our previous results), that users ratings are highly influenced by the cast, so it is reasonable to condider it in the recommendations.

Recommending algorithm

We used the calculated U and F functions in our recommending algorithm. It was invented in a hurry and therefore it is very simple and needs a lot of modifications. For a given user - u we take the set of s(it may be a constant or some percent of number of movies rated by u) movies rated best by u - S. For each of them we choose s most similiar movies to them and add them to a set of propositions - P. Then for each element of P, we calculate how much would it probably suit the taste of u. We check how similiar is the movie to the elements of S and how similiar to u are the users who have rated this film and multiply those factors into a final result. The smaller it is, the more suitable the movie is for u. We finally sort the P set and recommend the top positions on such a sorted list.

Summary

Methods we have used, are usually very simple, most of the times they are just some kinds of average functions. It is necessary to modify parameters used, as they haven't been tested enough yet. We don't know if our algorithm is already better than the one used on the Filmaster as it is not easy to test. The whole program including calculation of U and F functions and the recommendations for all the users should not last more than the minute on the given data size(about 50 000 marks). While it is not problematic to speed up the duration of our methods, we find it hard to
decrease the memory size it uses. I have decribed the idea in its original shape. However, we have already realised some weak points that need to be changed. We are going to implement some changes and test the algorithm more thoroughly in the future.

Second approach presently used on Filmaster as guess algorithm 2 based on ideas developed by the teams participating in Netflix Competition

Different view on the problem

Imagine a matrix R where each row stands for a user and columns represent movies. Every entry of form u x m contains a real number in a <1,10> range which is rating given to a movie m by user u. In our case most of the entries are empty and we have to fill them with guesses based on the entries we already have. In order to do it we use method similiar to singular value decomposition(svd) of this matrix. It means that we want to factorize it into two matrixes U: u_num x f_num and M: f_num x m_num (where f_num is arbitrarily chosed integer, u_num is number of users and m_num - number of movies) such that they product - R' resembles R (every nonempty entry in R has its equivalent in R' and the average difference between them is as small as possible). R' is our guessing matrix, so every entry in it is our prediction of (user,movie) rating.

Does it make sense?

When we analyse movies we have to notice, that they have a number of different features. Every single feature of a movie can also be rated, ie dialogs or the violence level. And every user has a different attitude to those features. One may enjoy violence, the other might not stand it. So we can also rate users' preferences for each feature. Here we get our U and M matrixes where f_num dimension is the number of features we consider. Notice that in R' every guessed rating (u,m) is now of form: sum of U[u][f]*M[f][m] for every feature f. So the higher the preference for feature f is and the higher is its level in a movie, the higher guessed rating will be. We also need to scale it so it would be in a <1,10> range. So all we have to do is to find the possibly best U and M matrixes.

Counting U and M matrixes

There are many svd algorithms most of which are not optimal enough for our needs. Instead we use some kind of heuristic using a teaching approach. We start with both matrixes U,M filled with the same unsignificantly small real number in each entry. We will now modify it separately for each feature. We begin with teaching the first feature and then after a certain number of teaching cycles we proceed to the second etc. In a teaching cycle we compute the actual R' matrix and then take all nonempty entries (u,m) in R matrix and for each of them  we use a given formula:
err = lrate * (R[u][m]-R'[u][m])
U[u][f] += err*M[f][m]
M[f][m] += err*U[u][f]
where lrate is a constant real number and f is a number of presently computed feature.
We notice that when our guessed rating is too low, err will be positive so the entries in U and M will increase. In the other case, err is negative and entries decrease. During the next teaching cycles the err absolute value decreases, as our guess ratings are getting closer to the real ones. In the end going through more cycles does not have significant effect so we proceed to the next feature.
This simple formula happens to give quite satisfactory results. The RMSE for the entries used in teaching can be lowered significantly, but in order to test our algorithm correctness, we have to separate a set of nonempty entries that won't be used in algorithm and count RMSE on this set.

How is it implemented

Computing is divided into three parts:
a) In film20/recommendations/bot_helper.py there is prepare_data function. It gets all necessary data from the database, the users and movies sets and the ratings (separately the rated ones, and those to be guessed). It creates text files in a temp_data folder containing teaching set for the algorithm, the query sets containing entries in the database we want to compute(ratings and rating_comparators), and additional data for the algorithm. As to the query set, we can specify whether we want to compute scores just for new users or for everybody. It is easy to modify it, so it would count only ratings that weren't computed before, or not to count rating_comparators.
b) The c++ program - count_recommendations is the implementation of a teaching algorithm. It computes the U and M matrixes using a method described above, and then computes entries of R' specified in a query set made by the above prepare_data function and writes them into the sql file - data_update.sql as the sql commands updating the database with new values of guess_rating_alg2 (or rating_comparators.score2).
c) The data_update.sql file is dumped into the database.
Everything is launched in the run_compute_probable_scores.sh file

What is to be done

The most important issue is to accelerate the process of dumping data into a database. Secondly change the algorithm computing score2 in rating_comparators as it gives bad results or abandon it and use the previous method only.

Enter labels to add to this page:
Please wait 
Looking for a label? Just start typing.
  1. Sep 01, 2009

    Borys Musielak says:

    Short status update: the second approach is implemented on DEV and waiting to be...

    Short status update: the second approach is implemented on DEV and waiting to be deployed on PROD after making some performance optimizations (updating the rataing table takes forever on prod): http://jira.filmaster.org/browse/FLM-169