December 23, 2006
Distributions of rankings
A few postings ago, Andrew wondered about the shape of the long tail. OneEyedMan's comment reminded us that the extensive NetFlix Prize dataset contains ratings by almost half a million users of almost 20,000 movies. It's an excellent playground, although I was told that the data was corrupted.
So I was happy to notice Ilya Grigorik's analysis of the distributions in the dataset. In particular, users' average ratings seem to be centered at 3.8 (on a scale from 1 to 5), indicating that people do try to watch movies they like. But the uneven distribution of score variance across users suggests that one could model the type of user, perhaps with a mixture model.
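To make the mixture-model idea concrete, here is a minimal sketch in Python. It is not Ilya's analysis and it does not touch the real dataset: the ratings are synthetic, the two user types ("steady" and "spread") are invented for illustration, and the function names are hypothetical. It computes each user's mean and variance of scores, then fits a two-component Gaussian mixture to the per-user variances with a plain EM loop.

```python
import math
import random
from collections import defaultdict
from statistics import mean, pvariance


def per_user_stats(ratings):
    """Map (user_id, score) pairs to {user_id: (mean score, score variance)}."""
    by_user = defaultdict(list)
    for user, score in ratings:
        by_user[user].append(score)
    return {u: (mean(s), pvariance(s)) for u, s in by_user.items()}


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def fit_two_gaussians(xs, iterations=100):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, sigmas), with component 0 started on the
    lower half of the data and component 1 on the upper half.
    """
    xs = sorted(xs)
    half = len(xs) // 2
    mus = [mean(xs[:half]), mean(xs[half:])]
    sigmas = [max(math.sqrt(pvariance(xs)), 1e-3)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p) or 1e-300  # guard against underflow to zero
            resp.append([pk / total for pk in p])
        # M-step: re-estimate weight, mean and spread of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue
            weights[k] = nk / len(xs)
            mus[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, xs)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-3)
    return weights, mus, sigmas


def synthetic_ratings(n_users=200, n_ratings=50, seed=0):
    """Made-up data: even-numbered users rate almost everything 4,
    odd-numbered users spread their scores evenly over 1..5."""
    rng = random.Random(seed)
    out = []
    for user in range(n_users):
        if user % 2 == 0:
            scores, probs = [3, 4, 5], [0.1, 0.8, 0.1]
        else:
            scores, probs = [1, 2, 3, 4, 5], [0.2] * 5
        out.extend((user, s) for s in rng.choices(scores, weights=probs, k=n_ratings))
    return out


stats = per_user_stats(synthetic_ratings())
variances = [var for _, var in stats.values()]
weights, mus, sigmas = fit_two_gaussians(variances)
print("component weights:", weights)
print("component mean variances:", mus)
```

On real ratings one would probably want more than two components and a distribution better suited to variances than a Gaussian; the two-Gaussian choice here only keeps the sketch short.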
I must also note that NetFlix users have an incentive to score movies even with lukewarm scores, which moderates the distribution described above. On most internet sites that allow users to rate content, the extreme scores (1 or 5) are overrepresented: some people make the effort to write a review only when they are very unhappy and want to punish someone, or when they are very happy and want to reward or recommend the work to others.
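One simple way to check a site for this effect is to measure what share of its ratings sit at the ends of the scale. The sketch below is only an illustration: `extreme_share` is a hypothetical helper, and the two score lists are invented, not taken from NetFlix or any other site.

```python
from collections import Counter


def extreme_share(scores, low=1, high=5):
    """Fraction of scores equal to the lowest or highest value on the scale."""
    counts = Counter(scores)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no scores given")
    return (counts[low] + counts[high]) / total


# Invented examples: a site where only the angry and the delighted bother
# to review, and one where lukewarm scores are also recorded.
review_only_site = [1, 1, 1, 2, 4, 5, 5, 5, 5, 5]
rate_everything_site = [2, 3, 3, 3, 4, 4, 4, 4, 5, 1]

print("review-only site:", extreme_share(review_only_site))
print("rate-everything site:", extreme_share(rate_everything_site))
```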
Another interesting source of rating distributions is the Interactive Fiction Competition results page: it has numerous histograms of scores for individual IF works.
Posted by Aleks at December 23, 2006 1:19 PM