Distributions of rankings

A few postings ago, Andrew wondered about the shape of the long tail. OneEyedMan's comment reminded us that the extensive NetFlixPrize dataset contains almost half a million users' ratings of almost 20,000 movies. It's an excellent playground, although I was told that the data was corrupted.

So, I was happy to notice Ilya Grigorik's analysis of the distributions in the dataset. In particular, users' average scores seem to be centered around 3.8 (on a scale from 1 to 5), which suggests that people do try to watch movies they like. But the uneven distribution of score variance across users indicates that one could model the type of user, perhaps with a mixture model.
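To make the mixture-model idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the ratings are synthetic (two made-up user types, "steady" and "variable"), the helper names `per_user_stats` and `fit_two_gaussians` are hypothetical, and a two-component Gaussian mixture fitted by EM is only one crude choice for modeling per-user variance, which is non-negative. It is not the method used in Ilya Grigorik's analysis.

```python
import math
import random
from statistics import mean, pvariance

def per_user_stats(ratings):
    """Map each user to (mean score, score variance)."""
    return {u: (mean(s), pvariance(s)) for u, s in ratings.items()}

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def fit_two_gaussians(xs, iters=100):
    """EM for a two-component 1-D Gaussian mixture (illustrative only)."""
    xs = sorted(xs)
    half = len(xs) // 2
    # Initialize the two means from the lower and upper halves of the data.
    mu = [mean(xs[:half]), mean(xs[half:])]
    sigma = [max(math.sqrt(pvariance(xs)), 1e-3)] * 2
    weight = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(p)
            resp.append([pk / total for pk in p] if total > 0 else [0.5, 0.5])
        # M-step: re-estimate weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weight[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)
    return weight, mu, sigma

# Synthetic data (an assumption, not the NetFlix data): 100 "steady" users
# who mostly give a 4, and 100 "variable" users who use the whole 1-5 scale.
rng = random.Random(0)
ratings = {}
for i in range(100):
    ratings[f"steady{i}"] = rng.choices([3, 4, 5], weights=[1, 8, 1], k=50)
    ratings[f"variable{i}"] = rng.choices([1, 2, 3, 4, 5], k=50)

stats = per_user_stats(ratings)
variances = [var for _, var in stats.values()]
weight, mu, sigma = fit_two_gaussians(variances)
print("component weights:", weight)
print("component means of per-user variance:", mu)
```

On real data one would replace the synthetic `ratings` dictionary with the actual per-user score lists; whether two components are enough, or whether a Gaussian is an appropriate shape for the variances, would have to be checked against the data.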

I must also note that NetFlix users have an incentive to score movies even with lukewarm scores, which moderates the above distribution. On most internet sites that allow users to rate content, the extreme scores (1 or 5) are overrepresented: some people make the effort to write a review only when they are very unhappy and want to punish someone, or when they are very happy and want to reward the work or recommend it to others.

Another interesting source of rating distributions is the Interactive Fiction Competition results page: it has numerous histograms of scores for individual IF works.
