[EM] IMDb top 250 list with PR

Mon Feb 13 05:25:09 PST 2023

On 11.02.2023 22:27, Toby Pereira wrote:
> I've long thought that the IMDb top 250 list 
> https://www.imdb.com/search/title/?groups=top_250&sort=user_rating would 
> be more interesting if it was done using a sequential PR method. That 
> way you would likely get more of people's absolute favourite films 
> rather than those with broad appeal, but which might not be right at the 
> top of people's lists. You're more likely to see differents genres 
> represented etc.

I think I wrote a post about more or less this use of proportional 
representation, but I just can't seem to find it. I said that I was 
looking for a method that would work on very large numbers of candidates 
and voters, and a reply suggested the use of SPAV/SPRV.

In my case, I was thinking about recommender systems (which is another 
subject of interest of mine); the simplest recommender systems take the 
current user's preferences (watched movies/ratings) and determines a 
similarity to other users', then does a weighted sum of those 
preferences to create suggestions for the users in question.

In essence, the other users' preferences become weighted rated or 
Approval votes (depending on whether the system gathers ratings or just 
thumbs up/down). There are other tricky parts when the recommendations 
are implicit (e.g. they consist of movies the user has watched rather 
than explicit thumbs up/down) because it's much harder to determine 
whether a user didn't watch because he didn't know about the movie, or 
because he knew he wouldn't like it.

In any case, summing the ratings predictions by other users becomes a 
kind of approval/range voting process. But the output is often "hey, you 
liked Casino Royale; here I will suggest every other Bond movie". So 
proportional representation makes a lot of sense.  But since the 
database could potentially be extremely large (both in terms of 
users/voters and items/candidates), STV can't work, and methods that 
aren't polytime (e.g. Schulze STV) are right out. So SPAV/SPRV does seem 
the right type of voting method to use in such a context.

I implemented SPAV on a basic recommender system for another site, and I 
think it provided more varied answers, but because "users who liked what 
you liked also liked" type reasoning can't get more subtle relations, 
the quality is still limited: I got a sort of "cycling behavior". E.g. 
suppose a user had watched Casino Royale and Interstellar, SPRV/SPAV 
would then produce something like [Spy movie] [Spy movie] [Spy movie] 
[Sci-fi] [Sci-fi] [Sci-fi] [Spy movie] [Spy movie] [Sci-fi] [Sci-fi]...

Which is better than just all spy movies all the time, but could still 
be improved upon :-)

Current recommender systems are considerably more advanced than the 
nearest-neighbor type above, but I think they miss an active learning 
component, the lack of which leads them to be either too uninteresting 
(recommending the same thing over and over), or have too little 
information to say anything at all (thus recommending mainly things that 
appeals to everybody).

Xing et al.'s paper, "Enhancing collaborative filtering music 
recommendation by balancing exploration and exploitation", 
https://archives.ismir.net/ismir2014/paper/000140.pdf, gives one way to 
do this. They focus on the problem that recommending the same thing over 
and over leads to the user getting bored and suggest a Bayesian approach.

Another approach could be having an interactive system where the user 
can choose between different groups or clusters of items depending on 
his interest; and when one group is chosen, the system then produces 
subgroups within that group. E.g. first having a choice between spy 
movies and sci-fi, and then Carré-type vs Mission Impossible-type spy 
movies, or between space fantasy and near-future sci-fi (e.g. Star Wars 
vs Gravity).

It might even be possible to combine Bayesian low rank matrix completion 
with Thompson sampling to get a more comprehensive sort of active 
learning. But I'm not sure if BLRMC is just a neat trick to make low 
rank matrix completion (relevant for latent factor type recommenders) 
into a maximum posteriori estimation problem, or if the distribution 
itself can be used. And it might be much too hard to sample anyway.

Whew, this got a bit off-topic. But I guess I'm saying... recommender 
systems are not a trivial matter either :-)

-km