December 2006 Archives

2006 retrospective

| 2 Comments

Checking on progress:

1. I finished a few.

2. I was not particularly successful here. Easier said than done, I suppose.

3. I'd give myself a solid B on this one.

4. The guys at the bike shop did this one. Also fixed my front brake, which apparently was hanging on by literally one strand of cable.

5. Yep.

6. No comment.

7. Nope. But did get a paper accepted on my struggles in this area.

8. Not yet. This one's a push to 2007.

9. Don't recall.

10. Nope. But I didn't get any further behind, either. Gotta work harder recording the amusing incidents.

11. Didn't do so well on this one (even though, by the look of it, it seems like the easiest of resolutions).

12. Enough unfinished business here that I don't think I need anything new for 2007. Well, ok, here's one: I'd like to go to the movies at least once.

from Mark Liberman here.

I ran across this interesting interview with Mark "Smiley" Glickman. He discusses the Glicko and Glicko-2 rating systems which are based on dynamic Bayesian models. From a statistical perspective, some of the most interesting discussion comes near the middle of the interview where he discusses the chess federation's ongoing project to monitor average ratings, and the challenge of comparing ratings of people in different years. The bit at the very end is also interesting--it reminds me of the claim I once heard that a chess player, if given the option of being a better player or having a higher rating, would choose the higher rating. One of the difficulties of numerical ratings or rankings is that people can take them too seriously, and Glickman discusses this.

High-dimensional data analysis

| 1 Comment

I came across this talk by David Donoho (see also here for more detail) from 2000. I was disappointed to see that he scooped me on the phrase "blessing of dimensionality" but I guess this is not such an obscure idea.

More interesting are the different perspectives that one can have on high-dimensional data analysis. Donoho's presentation (which certainly is still relevant six years later) focuses on computational approaches to data analysis and says very little about models. Bayesian methods are not mentioned at all (the closest is slide #44, on Hidden Components, but no model is specified for the components themselves). It's good that there are statisticians working on different problems using such different methods.

Donoho also discusses Tukey's ideas of exploratory data analysis and explains why Tukey's approach of separation from mathematics no longer makes sense. I agree with Donoho on this, although perhaps from a different statistical perspective: my take on exploratory data analysis is that (a) it can be much more powerful when used in conjunction with models, and (b) as we fit increasingly complicated models, it will become more and more helpful to use graphical tools (of the sort associated with "exploratory data analysis") to check these models. As a latter-day Tukey might say, "with great power comes great responsibility." See this paper and this paper for more on this.

I was also trying to understand the claim on page 14 of Donoho's presentation that the fundamental roadblocks of data analysis are "only mathematical." From my own experiences and struggles (for example, here), I'd interpret this from a Bayesian perspective as a statement that the fundamental challenge is coming up with reasonable classes of models for large problems and large datasets--models that are structured enough to capture important features of the data but not so constrained as to restrict the range of reasonable inferences. (For a non-Bayesian perspective, just replace the word "model" with "method" in the previous sentence.)

Diana S. Grigsby-Toussaint writes,

I recently came across your "Statistical Modeling, Causal Inference, and Social Science" website in my attempt to determine the best analysis for my research. As there were some inquiries about whether GEE is a better approach than multilevel modeling, I was hoping you could help with my dilemma.

I am interested in neighborhood (defined as census tract) influences on childhood diabetes risk in the city of Chicago. Although I have a little over 1200 cases, ~40% of my tracts have only 1 case, and the average number of cases per tract is 5. GEE has been suggested as the better approach to HLM, but I am not getting much support for this option....any suggestions for the best approach or articles that might provide some insight?

My quick response: see here.

My longer response:

Bruce McCullough writes,

The probability of getting brain cancer is determined by the number of younger siblings. So claim some scientists, according to an article published in the current issue of The Economist.

I have ordered your book so that I can read more about controlling for intermediate outcomes, but I am not yet confident enough to tackle it myself. Perhaps you might blog this?

I'll give my thoughts, but first here's the scientific paper (by Altieri et al. in the journal Neurology), and here are the key parts of the news article that Bruce forwarded:

Distributions of rankings

| No Comments

A few postings ago, Andrew wondered about the shape of the long tail. OneEyedMan's comment reminded us that the extensive NetFlixPrize dataset contains information about almost half a million users' ratings on almost 20000 movies. It's an excellent playground, although I was told that the data was corrupted.

So, I was happy to notice Ilya Grigorik's analysis of the distributions of the dataset. In particular, the average user seems to be centered at 3.8 (on a scale from 1-5), indicating that people do try to watch movies they like. But the uneven distribution of score variance across users indicates that one could model the type of user, perhaps with a mixture model:

I must also note that NetFlix users have an incentive to score movies even with lukewarm scores, which moderates the above distribution. On most internet sites that allow users to rank content, the extreme scores (1 or 5) are overrepresented: some people make the effort to write a review only when they are very unhappy and want to punish someone, or when they are very happy and want to reward or recommend the work to others.

Another interesting source of rating distributions is the Interactive Fiction Competition results page: it has numerous histograms of scores for individual IF works.

Just in time for the holidays

| 2 Comments

Aleks forwarded this along:

atomic-energy-lab-01.jpg

The entries themselves were pretty funny, but I also liked the comment on the atomic energy kit entry by the guy with "a comfortable six-figure salary." Maybe if he'd had a little less radiation exposure as a child, he'd have a comfortable seven-figure salary by now . . .

Playing around a bit with the income-voting data (see here for a couple pictures and links to our paper, or here for an example of journalistic confusion, or here for lots more), we made the following maps, which show our estimates of what would have happened in the 2000 election if they had only counted the votes of people in different income categories:

rb by income.png

I came across this article by the late Leo Breiman from 1997, "No Bayesians in foxholes." It's fun sometimes to go back and see what people were saying nearly a decade ago. This one is particularly interesting because it presents a strongly anti-Bayesian position which used to be common in statistics (see, for example, various JRSS-B discussions during the 70s and 80s) but you don't really hear about anymore. Breiman wrote:

The Current Index of Statistics lists all statistics articles published since 1960 by author, title, and key words. The CIS includes articles from a multitude of journals in various fields—medical statistics, reliability, environmental, econometrics, and business management, as well as all of the statistics journals. Searching under anything that contained the word “data” in 1995–1996 produced almost 700 listings. Only eight of these mentioned Bayes or Bayesian, either in the title or key words. Of these eight, only three appeared to apply a Bayesian analysis to data sets, and in these, there were only two or three parameters to be estimated.

Actually, our toxicology paper appeared in the Journal of the American Statistical Association in 1996--how could Breiman have missed that one (our model had 90 parameters, and the paper had a detailed discussion of why the prior distribution was needed in order to get reasonable results)? Was he restricting himself to papers with "data" in their keywords? Putting "data" as a keyword in an applied statistics paper is something like putting "physics" as a keyword in a physics paper!

OK, OK . . .

I read this article by Jim Berger. I agree with much of it, except that I think he unnecessarily privileges certain improper prior distributions. More and more, I'm thinking it makes sense to have noninformative (or weakly-informative) prior distributions that are proper but vague (see here, for example). In addition, I think Berger's approach would be improved by model checking. Objective Bayesian analysis can be so much more effective when its worst excesses are curbed via model checking. See this paper from the International Statistical Review for some theory and Chapter 6 of our Bayesian book for some examples.

Bill Harris has a fun little calculation of a conditional probability using three different data sources. Could be a good example for teaching intro probability or basic Bayesian inference.

Here's the abstract for a talk by Dr. Albert Tarantola, Institut de Physique du Globe de Paris:

While the conventional way for making inferences from observations goes through the use of conditional probabilities (via de Bayes identity), there is an alternative. It consists in introducing some new definitions in Probability Theory (image and reciprocal image of a probability, intersection of two probabilities), that are accompanied by a compatibility property. The resulting theory is simple, accepts a clear Bayesian interpretation, and naturally incorporates the Popperian notion of falsification (for us, falsification of models, not of theories). The applications of the theory in the domain of inverse problems shall be discussed.

Unfortunately I can't make the talk. I can't figure out what he's saying in the abstract, but the topic interests me. If anybody knows more about this, please let me know.

P.S. Brian Borchers writes,

Tarantola has been writing about Bayesian approaches to geophysical inverse problems for some time. He has recently (2005) published a book on inverse problem theory (Inverse Problem Theory and Methods for Model Parameter Estimation, SIAM 2005) that you might find interesting.

The "image of a probability" doesn't appear in the SIAM book, but it is the topic of Tarantola's new book, "Mapping of Probabilities". You can download a draft (or at least the first two chapters) from his web site at http://www.ipgp.jussieu.fr/~tarantola/

Jimmy's weight over time

| 1 Comment

What do I like best about this graph?

jimmy.jpg

Keith pointed me to these pictures on Curtis McMullen's webpage. My favorite is this one (which I previously saw on Yuval's door at Berkeley):

biketracks.gif

(Although I can't figure out why it's classified under Topology.)

There's lots of other cool stuff there, including this cascade of bifurcations:

Stephen Jessee sent me this paper (joint work with Doug Rivers) on party identification and voting. He writes,

I [Jessee] show that most people do in fact have some level of policy ideology that has an important effect on their voting behavior. The influence of party identification, however, is also quite strong. Judging from the baseline of Downsian policy voting, I show that independents, even those with lower levels of political sophistication, perform quite well on average, and engage in essentially unbiased spatial policy voting. Partisans of similar levels of sophistication, by contrast, are systematically pushed away from more rational decision rules and seem to be making biased choices in translating their policy preferences into vote choices. On the whole, it seems clear that party identification operates more as a systematic bias than a profitable heuristic.

Continuing, Jessee writes,

Interpreting p-values

| 5 Comments

Dan Goldstein made this amusing graph:

pval.gif

in discussing this paper.

Columbia Science Fellows

| 2 Comments

Chris Wiggins pointed me to this interesting-looking book by Sarah Igo:

Americans today "know" that a majority of the population supports the death penalty, that half of all marriages end in divorce, and that four out of five prefer a particular brand of toothpaste. Through statistics like these, we feel that we understand our fellow citizens. But remarkably, such data--now woven into our social fabric--became common currency only in the last century. Sarah Igo tells the story, for the first time, of how opinion polls, man-in-the-street interviews, sex surveys, community studies, and consumer research transformed the United States public. . . . Tracing how ordinary people argued about and adapted to a public awash in aggregate data, she reveals how survey techniques and findings became the vocabulary of mass society--and essential to understanding who we, as modern Americans, think we are.

As a survey researcher, this looks interesting to me. Parochially, I'm reminded of our own observation that in the 1950s it was more rational to answer a Gallup poll than to vote. Nowadays, most of us are participants as well as consumers of surveys.

Sarah Croco writes:

Radford is speaking in the statistics seminar on Monday 11 Dec (noon at 903 Social Work Bldg, for you locals):

Constructing Efficient MCMC Methods Using Temporary Mapping and Caching

I [Radford] describe two general methods for obtaining efficient Markov chain Monte Carlo methods - temporarily mapping to a new space, which may be larger or smaller than the original, and caching the results of previous computations for re-use. These methods can be combined to improve efficiency for problems where probabilities can be quickly recomputed when only a subset of `fast' variables have changed. In combination, these methods also allow one to effectively adapt tuning parameters, such as the stepsize of random-walk Metropolis updates, without actually changing the Markov chain transitions used, thereby avoiding the issue that changing the transitions could undermine convergence to the desired distribution. Temporary mapping and caching can be applied in many other ways as well, offering a wide scope for development of useful new MCMC methods.

This reminds me of a general question I have about simulation algorithms (and also about optimization algorithms): should we try to have a toolkit of different algorithms and methods and put them together in different ways for different problems, or does it make sense to think about a single super-algorithm that does it all?

Usual hedgehog/fox reasoning would lead me to prefer a toolkit to a superalgorithm, but the story is not so simple. For one thing, insight can be gained by working within a larger framework. For example, we used to think of importance sampling, Gibbs, and Metropolis as three different algorithms, but they can all be viewed as special cases of Metropolis (see my 1992 paper, although I guess the physicists had been long aware of this). Anyway, Radford keeps spewing out new algorithms (I mean "spew" in the best sense, of course), and I wonder where he thinks this is all heading.

P.S. The talk was great; slides are here.
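
The remark above that Gibbs can be viewed as a special case of Metropolis can be checked numerically on a toy problem. The sketch below is only an illustration, not anything from Radford's talk or the 1992 paper: it assumes the Metropolis-Hastings form of the acceptance ratio (with an asymmetric proposal), and the 2x2 joint distribution and all function names are made up. It proposes a new value of x1 from its conditional distribution given x2, which is what a Gibbs update does, and computes the acceptance ratio for every possible move.

```python
from itertools import product

# Hypothetical joint distribution over two binary variables (x1, x2).
joint = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 0.4}


def conditional_x1(new_x1, x2):
    """p(x1 = new_x1 | x2) under the joint table above."""
    marginal_x2 = joint[(0, x2)] + joint[(1, x2)]
    return joint[(new_x1, x2)] / marginal_x2


def mh_ratio_for_gibbs_move(current, proposed):
    """Metropolis-Hastings ratio when the proposal is the Gibbs conditional for x1."""
    (x1, x2), (y1, y2) = current, proposed
    assert x2 == y2  # a Gibbs update of x1 leaves x2 unchanged
    forward = conditional_x1(y1, x2)   # q(proposed | current)
    backward = conditional_x1(x1, x2)  # q(current | proposed)
    return (joint[proposed] * backward) / (joint[current] * forward)


# Evaluate the ratio for every (current x1, proposed x1, fixed x2) combination.
ratios = [
    mh_ratio_for_gibbs_move((x1, x2), (y1, x2))
    for x1, y1, x2 in product((0, 1), repeat=3)
]
print(ratios)
```

Every ratio comes out equal to 1 up to floating-point rounding, so a Gibbs proposal would always be accepted. This covers only the Gibbs half of the claim; importance sampling is not treated here.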

Redblue for locals

| 2 Comments

I'm speaking Monday 11 Dec, 4:30pm at the City University of New York Graduate Center (365 Fifth Avenue at 34th Street). Location is room 9204, on the 9th floor (not the Kimmel Center, room 907, which is what I'd posted earlier). Here's the announcement, here's the paper, here's the graph:

superplot_var_slopes_annen_2000.png

and here's the abstract:

This is just pitiful . . . from an ad in the New Yorker:

The New Yorker Humor on the slopes at Beaver Creek Featuring Dennis Miller

Need a lift this winter? Join the laughter in Beaver Creek, Colorado. On the last weekend in February, The New Yorker Promotion Department's Humor on the Slopes event fills the famed destination with three days of highly elevated comedy. Resort to laughter with performances by comedians including Dennis Miller, appearances by New Yorker cartoonists, a comedy-film sneak preview, and much more.

Yeah, yeah, I know, that's how they can afford to pay Ian Frazier and the rest of the gang. But still . . .

On the other hand, Harold Ross was from Aspen so maybe it all makes sense.

P.S. Yes, "Dennis Miller" is in boldface in the original.

P.P.S. Typo fixed.

Jose von Roth writes,

The long tail and the fat head

| 6 Comments

There's a lot of talk about the long tail--for example, there are a zillion books out there, each selling a few copies a week (hey--those are my books out there in the tail!), and a zillion blogs out there, each getting a few hits a day (hey--that's our blog out there...). We're no longer in the era of mass consumption, etc etc.

I was wondering: who are the consumers of the long-tail items? I'd conjecture that the people who buy books in the long tail are, on average, buyers of many books. Similarly, I'd conjecture that the rarefied few who read our blog read many other blogs as well. In contrast, the average buyer of a bestseller such as The Shangri-La Diet might not be buying so many books, and, similarly, the average reader of BoingBoing might be reading not so many blogs.

Or maybe I'm wrong on this, I don't know. I'm picturing a scatterplot, with one dot per book (or blog), the x-axis showing the number of buyers (or readers), and the y-axis showing the average number of books bought (or blogs read) per week by people who bought that book (or read that blog). Or maybe there's a better way of looking at this.

The question is: is the "long tail" being driven by a "fat head" of mega-consumers?
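
The scatterplot described above could be computed from purchase records along these lines. This is a minimal sketch under stated assumptions: the buyer and book names are invented, the `long_tail_points` helper is hypothetical, and the "per week" time window is ignored (each record just says that a buyer bought a book at some point).

```python
from collections import defaultdict

# Hypothetical purchase records: (buyer, book). All names are made up.
purchases = [
    ("ann", "bestseller"), ("bob", "bestseller"), ("cat", "bestseller"),
    ("dan", "bestseller"),
    ("dan", "niche_a"), ("dan", "niche_b"), ("dan", "niche_c"),
    ("eve", "niche_a"), ("eve", "niche_b"),
]


def long_tail_points(records):
    """Return {book: (number of buyers, mean books bought by those buyers)}."""
    books_by_buyer = defaultdict(set)
    buyers_by_book = defaultdict(set)
    for buyer, book in records:
        books_by_buyer[buyer].add(book)
        buyers_by_book[book].add(buyer)

    points = {}
    for book, buyers in buyers_by_book.items():
        # y-coordinate: average number of distinct books owned by this book's buyers.
        mean_books = sum(len(books_by_buyer[b]) for b in buyers) / len(buyers)
        points[book] = (len(buyers), mean_books)
    return points


points = long_tail_points(purchases)
for book, (n_buyers, mean_books) in sorted(points.items()):
    print(book, n_buyers, mean_books)
```

In this made-up data the bestseller sits at the lower right (many buyers, who buy few books) and the niche titles at the upper left, which is the pattern the "fat head" conjecture would predict. Real data could of course look different.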

Swivel: Web 2.0 and Data

| 4 Comments

The latest craze on the internet is the migration of applications from desktop to the Web. The latest is "Swivel": the internet archive for data, something I have written about before. While there is not much to be seen at the site, TechCrunch has some intriguing snapshots:

I guess that one can upload the data, access data that others have posted, and perform some simple types of analysis. It might not sound like much, but having a database of data will remove the need for people to provide summaries of it. Anyone interested in the problem can perform the summaries for himself. This will make data analysis much more approachable than before. This can also become competition to existing spreadsheet and statistical software, and a platform for deploying recent research: it is often frustrating for a researcher in statistical methodology how difficult it is to actually enable users to benefit from the most recent advances in the research sphere.

RSS

| 2 Comments

I am sometimes asked how to get an RSS feed from this blog. I've been told you can do it here.

Sample size and self-efficiency

| 1 Comment

Xiao-Li Meng is speaking this Friday 2pm in the biostatistics seminar (14th Floor, Room 240, Presbyterian Hospital Bldg, 622 West 168th Street). Here's the abstract:

One of the most frequently asked questions in statistical practice, and indeed in general quantitative investigations, is "What is the size of the data?" A common wisdom underlying this question is that the larger the size, the more trustworthy are the results. Although this common wisdom serves well in many practical situations, sometimes it can be devastatingly deceptive. This talk will report two of such situations: a historical epidemic study (McKendrick, 1926) and the most recent debate over the validity of multiple imputation inference for handling incomplete data (Meng and Romero, 2003). McKendrick's mysterious and ingenious analysis of an epidemic of cholera in an Indian village provides an excellent example of how an apparently large sample study (e.g., n=223), under a naive but common approach, turned out to be a much smaller one (e.g., n<40) because of hidden data contamination. The debate on multiple imputations reveals the importance of the self-efficiency assumption (Meng, 1994) in the context of incomplete-data analysis. This assumption excludes estimation procedures that can produce more efficient results with less data than with more data. Such procedures may sound paradoxical, but they indeed exist even in common practice. For example, the least-squared regression estimator may not be self-efficient when the variances of the observations are not constant. The morale of this talk is that in order for the common wisdom "the larger the better" be trusted, we not only need to assume that data analyst knows what s/he is doing (i.e., an approximately correct analysis), but more importantly that s/he is performing an efficient, or at least self-efficient, analysis.

This reminds me of the blessing of dimensionality, in particular Scott de Marchi's comments and my reply here. I'm also reminded of the time at Berkeley when I was teaching statistical consulting, and someone came in with an example with 21 cases and 16 predictors. The students in the class all thought this was a big joke, but I pointed out that if they had only 1 predictor, it wouldn't seem so bad. And having more information should be better. But, as Xiao-Li points out (and I'm interested to hear more in his talk), it depends what model you're using.

I'm also reminded of some discussions about model choice. When considering the simpler or the more complicated model, I'm with Radford that the complicated model is better. But sometimes, in reality, the simple model actually fits better. Then the problem, I think, is with the prior distribution (or, equivalently, with estimation methods such as least squares that correspond to unrealistic and unbelievable prior distributions that do insufficient shrinkage).
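
The point in the abstract that a procedure can be more efficient with less data is easy to see in a toy case. The sketch below is my own illustration, not Xiao-Li's example, and it uses "self-efficient" only in the loose sense given in the abstract. It assumes independent observations of a common mean with known, unequal variances (the values 1 and 100 are made up) and compares the unweighted mean, which is what least squares gives here, with the inverse-variance weighted mean.

```python
def variance_of_unweighted_mean(variances):
    """Variance of the plain average of independent observations."""
    n = len(variances)
    return sum(variances) / n**2


def variance_of_weighted_mean(variances):
    """Variance of the inverse-variance weighted average."""
    return 1.0 / sum(1.0 / v for v in variances)


# Hypothetical setup: one precise observation, then one very noisy one is added.
precise_only = [1.0]
with_noisy = [1.0, 100.0]

print(variance_of_unweighted_mean(precise_only))  # 1.0
print(variance_of_unweighted_mean(with_noisy))    # 25.25
print(variance_of_weighted_mean(with_noisy))      # slightly below 1
```

Adding the noisy observation makes the unweighted mean much worse (variance 25.25 instead of 1), while the weighted mean improves slightly. The extra data help only if the analysis uses them appropriately.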

Gueorgi pointed us to this website on name frequencies. Gueorgi writes,

Given a first and last name, it estimates the number of people in the US with the same name. They take the data from the 1990 Census and make an assumption that the first and last name are uncorrelated. There is a brief section on accuracy here. It might be a bit silly, but at least provides an easy way to look up of Census name frequencies (assuming their scripts work correctly). >From a research perspective, if such a website proves popular, perhaps one could use the same basic idea and produce better estimates by including first x last name correlation, and maybe add the functionality to collect user data like basic demographics, etc. to use with "how many x's you know" surveys.

Wow, the ">From" in his email really takes me back . . .

Anyway, for first names, I prefer the Baby Name Voyager, which has time series data and cool pink-and-blue graphics, but it is convenient to have the last names too. By assuming independence, I think this will overestimate the number of people named "John Smith" and underestimate the number of people named "Kevin O'Donnell." (I once looked up John Smith in the white pages and found that, indeed, it's less common than you'd expect from independence. Which makes sense, since if you're named Smith, you'll probably avoid the obvious "John." Unless it's a family name, or unless you have a sense of humor, I suppose.)

But Matt comments:

Also, I think this might be a good tool for teaching undergrads. In my class we just covered the basic rules of probability and I tried to get across the idea of independence of events. A name like Jose Cruz provides a good examples of things that are not independent.

I'm down with that. And it could be a cool class project to do some checking of phone directories. The violation of independence is reminiscent of the dentists named Dennis.
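
For a class exercise, the independence calculation discussed above might be sketched like this. I don't know how the website's scripts actually work, so this is just the textbook calculation: the function names are hypothetical and every number below is invented for illustration, not taken from the Census.

```python
def expected_count(population, p_first, p_last):
    """Expected number of people with a full name if first and last names are independent."""
    return population * p_first * p_last


def dependence_ratio(observed_count, population, p_first, p_last):
    """Observed count divided by the independence prediction.

    A ratio below 1 means the combination is rarer than independence predicts
    (the "John Smith" pattern); above 1 means it is more common
    (the "Jose Cruz" pattern).
    """
    return observed_count / expected_count(population, p_first, p_last)


# Invented numbers: a directory of 1,000,000 people, a first name held by 3%
# of them, a last name held by 1%, and 240 listings with both.
population = 1_000_000
predicted = expected_count(population, p_first=0.03, p_last=0.01)
ratio = dependence_ratio(240, population, p_first=0.03, p_last=0.01)
print(predicted, ratio)
```

With these made-up inputs the independence prediction is about 300 people, so 240 observed listings give a ratio of about 0.8. Students could fill in real counts from a phone directory and see which names fall above or below 1.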

Overdispersed Poisson regression

| 3 Comments

Manuel Spínola from the Instituto Internacional en Conservación y Manejo de Vida Silvestre at the Universidad Nacional in Heredia, Costa Rica, writes,

Charitable giving and defaults

| 5 Comments

I was thinking more about a framework for understanding these findings by Arthur Brooks on the rates at which different groups give to charity. Some explanations are "conservatives are nicer than liberals" or "conservatives have more spare cash than liberals" or "conservatives believe in charity as an institution more than liberals." (My favorite quote on this is "I'd give to charity, but they'd spend it all on drugs.")

But . . . although I think there's truth to all of the above explanations, I think some insight can be gained by looking at this another way. Lots of research shows that people are likely to take the default option (see here and here for some thoughts on the topic). The clearest examples are pension plans and organ donations, both of which show lots of variation and also show people's decisions strongly tracking the default options.

For example, consider organ donation: over 99% of Austrians and only 12% of Germans consent to donate their organs after death. Are Austrians so much nicer than Germans? Maybe so, but a clue is that Austria has a "presumed consent" rule (the default is to donate) and Germany has an "explicit consent" rule (the default is to not donate). Johnson and Goldstein find huge effects of the default in organ donations, and others have found such default effects elsewhere.

Implicit defaults?

My hypothesis, then, is that the groups that give more to charity, and that give more blood, have defaults that more strongly favor this giving. Such defaults are generally implicit (excepting situations such as religions that require tithing), but to the extent that the U.S. has different "subcultures," they could be real. We actually might be able to learn more about this with our new GSS questions, where we ask people how many Democrats and Republicans they know (in addition to asking their own political preferences).

Does this explanation add anything, or am I just pushing things back from "why do people vary in how much they give" to "why is there variation in defaults"? I think something is gained, actually, partly because, to the extent the default story is true, one could perhaps increase giving by working on the defaults, rather than trying directly to make people nicer. Just as, for organ donation, it would probably be more effective to change the default rather than to try to convince people individually, based on current defaults.

Boris pointed me to this book by Arthur Brooks, who looked at statistics on charitable giving from several surveys between 1996 and 2004. Some findings:

On average, religious people are far more generous than secularists with their time and money. This is not just because of giving to churches—religious people are more generous than secularists towards explicitly non-religious charities as well. They are also more generous in informal ways, such as giving money to family members, and behaving honestly.

The nonworking poor—those on public assistance instead of earning low wages—give at lower levels than any other group. Meanwhile, the working poor in America give a larger percentage of their incomes to charity than any other income group, including the middle class and rich.

A religious person is 57% more likely than a secularist to help a homeless person.

Conservative households in America donate 30% more money to charity each year than liberal households.

If liberals gave blood like conservatives do, the blood supply in the U.S. would jump by about 45%.

I have a few quick thoughts:

1. These findings are interesting partly because they don't fit into any simple story: conservatives are more generous, and upper-income people are more conservative [typo fixed; thanks, Dan], but upper-income people give less than lower-income people. Such a pattern is certainly possible--in statistical terms, corr(X,Y)>0, corr(Y,Z)>0, but corr(X,Z)<0--but it's interesting.

2. Since conservatives are (on average) richer than liberals, I'd like to see the comparison of conservative and liberal donations made as a proportion of income rather than in total dollars.

3. I wonder how the blood donation thing was calculated. Liberals are only 25% of the population, so it's hard to imagine that increasing their blood donations could increase the total blood supply by 45%.

4. The religious angle is interesting too. I'd like to look at how that interacts with religion and ideology.

5. It would also be interesting to see giving as a function of total assets. Income can fluctuate, and you might expect (or hope) that people with more assets would give more.
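
Points 1 and 3 above can be checked with a little arithmetic. The sketch below is a rough illustration under stated assumptions, not a reanalysis of Brooks's data. The only number taken from the discussion is the 25% liberal share; the correlation values, the other population shares, the donation rates, and both function names are hypothetical.

```python
def is_valid_correlation_triple(r_xy, r_yz, r_xz):
    """True if three pairwise correlations can coexist for some X, Y, Z.

    For a 3x3 correlation matrix this requires each entry to lie in [-1, 1]
    and the determinant to be nonnegative.
    """
    in_range = all(-1.0 <= r <= 1.0 for r in (r_xy, r_yz, r_xz))
    det = 1.0 + 2.0 * r_xy * r_yz * r_xz - r_xy**2 - r_yz**2 - r_xz**2
    return in_range and det >= 0.0


def relative_increase(shares, rates, group, target_rate):
    """Fractional rise in total donations if `group` gave at `target_rate`."""
    total = sum(shares[g] * rates[g] for g in shares)
    gain = shares[group] * (target_rate - rates[group])
    return gain / total


# Point 1: the +, +, - sign pattern is possible when correlations are weak,
# but not when they are strong (the 0.3 and 0.9 values are made up).
weak_ok = is_valid_correlation_triple(0.3, 0.3, -0.3)
strong_ok = is_valid_correlation_triple(0.9, 0.9, -0.9)

# Point 3, extreme case A: liberals (25%) give no blood at all and everyone
# else gives at the conservative rate.
two_group = relative_increase(
    shares={"liberal": 0.25, "other": 0.75},
    rates={"liberal": 0.0, "other": 1.0},
    group="liberal", target_rate=1.0,
)

# Point 3, extreme case B: liberals give nothing and moderates give at half
# the conservative rate (the 40%/35% split is hypothetical).
three_group = relative_increase(
    shares={"liberal": 0.25, "moderate": 0.40, "conservative": 0.35},
    rates={"liberal": 0.0, "moderate": 0.5, "conservative": 1.0},
    group="liberal", target_rate=1.0,
)
print(weak_ok, strong_ok, two_group, three_group)
```

In case A the largest possible jump is one third, well short of 45%. Getting to roughly 45% in case B takes liberals giving nothing and moderates giving much less than conservatives, which is why it would help to see how the figure was actually calculated.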

We're looking forward to getting into these data and making some plots. (Boris suggested the secret weapon.)

P.S. Bruce McCullough points out Jim Lindgren's comments here on the study, questioning Brooks's reliance on some of his survey data.

P.P.S. Also see here for more of my thoughts.
