This documents the approach presently used on Filmaster as "guess algorithm 2" based on ideas developed by the teams participating in Netflix Competition
Imagine a matrix R where each row stands for a user and columns represent movies. Every entry of form u x m contains a real number in a <1,10> range which is rating given to a movie m by user u. In our case most of the entries are empty and we have to fill them with guesses based on the entries we already have. In order to do it we use method similiar to singular value decomposition(svd) of this matrix. It means that we want to factorize it into two matrixes U: u_num x f_num and M: f_num x m_num (where f_num is arbitrarily chosed integer, u_num is number of users and m_num - number of movies) such that they product - R' resembles R (every nonempty entry in R has its equivalent in R' and the average difference between them is as small as possible). R' is our guessing matrix, so every entry in it is our prediction of (user,movie) rating.
Does it make sense?
When we analyse movies we have to notice, that they have a number of different features. Every single feature of a movie can also be rated, ie dialogs or the violence level. And every user has a different attitude to those features. One may enjoy violence, the other might not stand it. So we can also rate users' preferences for each feature. Here we get our U and M matrixes where f_num dimension is the number of features we consider. Notice that in R' every guessed rating (u,m) is now of form: sum of U[u][f]*M[f][m] for every feature f. So the higher the preference for feature f is and the higher is its level in a movie, the higher guessed rating will be. We also need to scale it so it would be in a <1,10> range. So all we have to do is to find the possibly best U and M matrixes.
Counting U and M matrixes
There are many svd algorithms most of which are not optimal enough for our needs. Instead we use some kind of heuristic using a teaching approach. We start with both matrixes U,M filled with the same unsignificantly small real number in each entry. We will now modify it separately for each feature. We begin with teaching the first feature and then after a certain number of teaching cycles we proceed to the second etc. In a teaching cycle we compute the actual R' matrix and then take all nonempty entries (u,m) in R matrix and for each of them we use a given formula:
err = lrate * (R[u][m]-R'[u][m])
U[u][f] += err*M[f][m]
M[f][m] += err*U[u][f]
where lrate is a constant real number and f is a number of presently computed feature.
We notice that when our guessed rating is too low, err will be positive so the entries in U and M will increase. In the other case, err is negative and entries decrease. During the next teaching cycles the err absolute value decreases, as our guess ratings are getting closer to the real ones. In the end going through more cycles does not have significant effect so we proceed to the next feature.
This simple formula happens to give quite satisfactory results. The RMSE for the entries used in teaching can be lowered significantly, but in order to test our algorithm correctness, we have to separate a set of nonempty entries that won't be used in algorithm and count RMSE on this set.
How is it implemented
Computing is divided into three parts:
a) In film20/recommendations/bot_helper.py there is prepare_data function. It gets all necessary data from the database, the users and movies sets and the ratings (separately the rated ones, and those to be guessed). It creates text files in a temp_data folder containing teaching set for the algorithm, the query sets containing entries in the database we want to compute(ratings and rating_comparators), and additional data for the algorithm. As to the query set, we can specify whether we want to compute scores just for new users or for everybody. It is easy to modify it, so it would count only ratings that weren't computed before, or not to count rating_comparators.
b) The c++ program - count_recommendations is the implementation of a teaching algorithm. It computes the U and M matrixes using a method described above, and then computes entries of R' specified in a query set made by the above prepare_data function and writes them into the sql file - data_update.sql as the sql commands updating the database with new values of guess_rating_alg2 (or rating_comparators.score2).
c) The data_update.sql file is dumped into the database.
Everything is launched in the run_compute_probable_scores.sh file
What is to be done
The most important issue is to accelerate the process of dumping data into a database. Update: this has been greatly improved in FLM-169. Secondly change the algorithm computing score2 in rating_comparators as it gives bad results or abandon it and use the previous method only.
Get the code!
The code of the described movie recommendation engine is free software, AGPLv3-licensed and available in bitbucket: http://bitbucket.org/filmaster/filmaster-test/src/tip/count_recommendations.cpp