Description of the movie recommendation algorithm made by Tomasz Kociumaka, Jakub Tlałka and Filip Pacanowski
Introduction
I'm going to present a draft of methods we used to calculate movies and users similiarity functions which are used by our program. All of the described ideas need to be looked through and improved in order for them to be really useful in practice.
F[i][j] - movie similiarity function taking i and j movies and returning real number in interval [0,1)
U[i][j] - user similiarity function taking i and j user and returning real number in interval [0,1)
The smaller the value of functions is, the higher the similiarity is. Particularly F[i][i]=0 , U[i][i]=0
Phase 1 - calculating the U function
For given two users - u1,u2, we consider two parameters. How similiar are the sets of movies rated by them and what is the average difference between those rates. The parameters are calculated by the specified funcions and combined with weights into result. When comparing users whose numbers of ratings differ significantly, we avoid situations when the fact that the set of movies rated by u1 is nearly subset of those rated by u2 causes them to be very similiar. In such a situation it is better to base more on the marks difference. Generally, the bigger the sets of rated movies are, the more important is the average rating difference.
Phase 2 - calculating the F function
For given two movies f1,f2 we again calculate two parameters. Using the U function we try to tell how similiar are the sets of users who have rated the movies. We take some amount of pairs (u1,u2) where u1 rated f1 and u2 rated f2 and basing on value of U[u1][u2] we increase the first parameter. Together with it, we take the average difference of those ratings and combine them into a result. Similiarly to the phase 1, the significance of marks difference increases with the size of the examined set of users.
Phase 3 - improving the F function
In this phase we try to make the F function more similiar to a metric. For any pair of movies (a,b) we look through the movies similiar to both of them. For such a movie - c, we change the value of F[a][b] basing on F[a][c] and F[b][c]. The significance of this phase is limited in order not to spoil the results. One of the examples of its good influence on the results can be the Pirates of the Caribbean movies. We observed that the difference between parts 1 and 3 is too big comparing to the differences between 1 and 2 or 2 and 3. We suppose it is caused by the ratings difference which originated from the different receptions of those parts. Based on the tests, we decided to include it in our algorithm.
Considering other factors
Basing on the information about the cast of the movies, we try to improve the F function a bit. We make simple modifications taking into consideration the number of actors staring both of the given movies or the existence of the same director. It is a well known fact(we have also observed it in our previous results), that users ratings are highly influenced by the cast, so it is reasonable to condider it in the recommendations.
Recommending algorithm
We used the calculated U and F functions in our recommending algorithm. It was invented in a hurry and therefore it is very simple and needs a lot of modifications. For a given user - u we take the set of s(it may be a constant or some percent of number of movies rated by u) movies rated best by u - S. For each of them we choose s most similiar movies to them and add them to a set of propositions - P. Then for each element of P, we calculate how much would it probably suit the taste of u. We check how similiar is the movie to the elements of S and how similiar to u are the users who have rated this film and multiply those factors into a final result. The smaller it is, the more suitable the movie is for u. We finally sort the P set and recommend the top positions on such a sorted list.
Summary
Methods we have used, are usually very simple, most of the times they are just some kinds of average functions. It is necessary to modify parameters used, as they haven't been tested enough yet. We don't know if our algorithm is already better than the one used on the Filmaster as it is not easy to test. The whole program including calculation of U and F functions and the recommendations for all the users should not last more than the minute on the given data size(about 50 000 marks). While it is not problematic to speed up the duration of our methods, we find it hard to
decrease the memory size it uses. I have decribed the idea in its original shape. However, we have already realised some weak points that need to be changed. We are going to implement some changes and test the algorithm more thoroughly in the future.