Recently in Miscellaneous Statistics Category

1. Understanding the 'Russian Mortality Paradox' in Central Asia: Evidence from Kyrgyzstan

Short answer: alcohol and suicide.

2. Lumberjacks as a counterexample to the idea of a "risk premium"

They take lots of risks and don't get paid well for it.

3. Cell size and scale

This is a visualization you won't want to miss.

4. Three guys named Matt

5. The political philosophy of the private eye

A genre that was rendered obsolete in 1961 (but nobody realizes it).

Jay Kaufman writes:

Monday 2 Nov, 5-6:30pm at the Methodology Institute, LSE. No link to the seminar on the webpage, so I'll give you the information here:

Why we (usually) don't worry about multiple comparisons

Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise.

Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the $p$-values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.
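Here is a minimal sketch of the contrast described above, on simulated data. The function name `partial_pool`, the known-variance setup, and the method-of-moments estimate of the group-level variance are all assumptions made for illustration; they are not the models from the paper, and the sketch ignores uncertainty in the estimated hyperparameters.

```python
import numpy as np

def partial_pool(y, sigma):
    """Empirical-Bayes partial pooling for estimates y_j ~ N(theta_j, sigma^2)."""
    y = np.asarray(y, dtype=float)
    mu_hat = y.mean()
    # Method-of-moments estimate of the group-level variance, floored at zero.
    tau2_hat = max(y.var(ddof=1) - sigma**2, 0.0)
    # Fraction of each estimate's distance from the overall mean that is kept.
    keep = tau2_hat / (tau2_hat + sigma**2)
    theta_hat = mu_hat + keep * (y - mu_hat)   # centers shift toward each other
    post_sd = np.sqrt(keep) * sigma            # intervals get narrower, not wider
    return theta_hat, post_sd

rng = np.random.default_rng(0)
J, sigma, tau = 50, 1.0, 0.5                   # low group-level variation
theta = rng.normal(0.0, tau, size=J)           # simulated true group effects
y = rng.normal(theta, sigma)                   # one noisy estimate per group

theta_hat, post_sd = partial_pool(y, sigma)
mu_hat = y.mean()

# How many 95% intervals exclude the overall mean under each approach?
unpooled_excl = int(np.sum(np.abs(y - mu_hat) > 1.96 * sigma))
pooled_excl = int(np.sum(np.abs(theta_hat - mu_hat) > 1.96 * post_sd))
print("unpooled:", unpooled_excl, "partially pooled:", pooled_excl)
```

In this simplified setup the partially pooled count can never exceed the unpooled one, because the estimates shrink by a factor of `keep` while the interval half-widths shrink only by its square root.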

This work is joint with Jennifer Hill and Masanao Yajima.

(Here's a video version of a related talk that I gave at a meeting on statistics and neuroscience.)

P.S. My talk briefly touches upon some work done by a researcher at the London School of Economics!

P.P.S. I'm speaking at LSE on Tuesday also (on a different topic).

P.P.P.S. I'll be speaking again a couple times in London later in the academic year, but on other topics. All my talks there will be different.

AT has a blog

Here.

This news article has made a bit of a splash: Seth Borenstein sent around a temperature time series to four statisticians--just sending the numbers without saying where they came from--and the statisticians uniformly concluded that there were no consistent temperature declines over time:

"If you look at the data and sort of cherry-pick a micro-trend within a bigger trend, that technique is particularly suspect," said John Grego, a professor of statistics at the University of South Carolina.

I don't have anything to add on the temperature series--there's only so much you can learn from a context-free data analysis, and I don't think anyone would want to take this particular set of blind statistical analyses as being at all informative about the science. But there's more going on here.

Winston Churchill said that sometimes the truth is so precious, it must be attended by a bodyguard of lies. Similarly, for a model to be believed, it must, except in the simplest of cases, be accompanied by similar models that either give similar results or, if they differ, do so in a way that can be understood.

In statistics, we call these extra models "scaffolding," and an important area of research (I think) is incorporating scaffolding and other tools for confidence-building into statistical practice. So far we've made progress in developing general methods for building confidence in iterative simulations, debugging Bayesian software, and checking model fit.

My idea for formalizing scaffolding is to think of different models, or different versions of a model, as living in a graph, and to consider operations that move along the edges of this graph of models, both as a way to improve fitting efficiency and as a way to better understand models by making informative comparisons. The graph of models connects to some fundamental ideas in statistical computation, including parallel tempering and particle filtering.

P.S. I want to distinguish scaffolding from model selection or model averaging. Model selection and averaging address the problem of uncertainty in model choice. The point of scaffolding is that we would want to compare our results to simpler models, even if we know that our chosen model is correct. Models of even moderate complexity can be extremely difficult to understand on their own.

Keith points me to this article by Gretchen Chapman and Jingjing Liu:

Previous research has demonstrated that Bayesian reasoning performance is improved if uncertainty information is presented as natural frequencies rather than single-event probabilities. A questionnaire study of 342 college students replicated this effect but also found that the performance-boosting benefits of the natural frequency presentation occurred primarily for participants who scored high in numeracy. This finding suggests that even comprehension and manipulation of natural frequencies requires a certain threshold of numeracy abilities, and that the beneficial effects of natural frequency presentation may not be as general as previously believed.

Sounds interesting. Unfortunately the article has no killer graph to make the point. In psychology, the killer graph often takes the form of a plot with two lines that cross, thus demonstrating the interaction of interest. Maybe Chapman and Liu could do this for their next article.

P.S. I gotta say, it would be pretty cool to be named "Jingjing." Sort of a Boutros Boutros or Mike Michaelson thing going on here.

My talk in Lyon on Monday

Some computational and modeling issues for hierarchical models

How can we fit a complex statistical model and have confidence in our results? There are several challenges, including (a) setting up models that are complicated enough to reflect the aspects of reality that we want to study, (b) regularization or partial pooling to get stable estimates for the resulting large number of parameters, (c) actually fitting the model (in Bayesian terms, getting a point estimate or posterior simulations), (d) checking the fit of the model to data, (e) attaining confidence that the fitting procedure is bug-free, and (f) understanding the fitted model. We discuss these in the context of nonnested varying-intercept, varying-slope multilevel logistic regression models that we have been using to estimate public opinion in demographic and geographic subgroups of the U.S. population.

Mon 26 Oct, 9.30 on the ground floor of the Latarjet building at International Agency for Research on Cancer (IARC), 150 Cours Albert Thomas, Lyon. This is where Martyn Plummer (the JAGS guy) works.

Le casse-tête des petits effets

An Encyclopedia of Probability

Carl Bialik reports on a website called the Book of Odds (really, as Carl points out, these are probabilities, not odds, but that's not such a problem because, at least to me, probabilities are much more understandable than odds anyway). It's pretty cool. I could give some examples here but I encourage you to just go to the site yourself and check it out. One thing I really like is that it gives the source of every number: right on the page it gives the country and date of the information, then you can click to get the details. Awesome.

The only thing that bothers me a little bit about the site is that it is almost too professional. When something's that slick, I worry about whether I can trust them.

In contrast, Nate Silver's website is respected but not particularly attractive. And the NameVoyager is just the coolest thing in the world, and, yes, it's professional and it's commercial--that's fine--but it doesn't have the suspicion-inducing hyper-professionalism of the Book of Odds. Seeing the all-so-appealing photo of the bright-eyed oldsters illustrating the "Will you live to be 100?" item that's currently featured on the site's main page, I just think--this is too slick to be trusted. (In case you're wondering, their data say that a randomly-chosen 90-year-old has only a 1-in-9 chance of living to 100. Actually, they say 1 in 8.85, but you know what I think about extra decimal points.)

In some way I prefer the charmingly and unabashedly commercial OK Cupid site to the Book of Odds, which looks so, so commercial but claims only purely altruistic goals. I just don't know what to think.

Anyway, whatever the true story happens to be, it's great stuff. Fun to browse, and a great teaching tool too, I'd think. Enjoy.

Fernando Hoces De La Guardia writes:

Last night we did the traditional first year econ phd student's skit nite @ Penn.

One particular thing that I noticed was that we had less public that what the upper years told us to be prepared for.

Somebody suggested that it was due to Passover and Good Friday. My immediate reaction was "science & religion don't go usually together". By this I meant a prior of mine that the fraction of religious people is a lot less within a scientific discipline than among the rest of the population.

Two things pop out of my head this morning:

- in which data base can I check that prior?

- if true, are economists more religious than other scientists?

My reply: Usually people look these things up at the General Social Survey, which has a convenient web interface. Good luck!

The economy isn't going so well, but there are some interesting possibilities here at Columbia University. One such option that you should be thinking about is the Earth Institute Fellowship, which pays well, includes a research stipend, and puts you in an exciting interdisciplinary community of faculty and postdoctoral researchers. The Earth Institute at Columbia brings in several postdocs each year--it's a two-year gig--and some of them have been statisticians (recently, Kenny Shirley and Leontine Alkema). We're particularly interested in statisticians who have research interests in development and public health. It's fine--not just fine, but ideal--if you are interested in statistical methods also. The EI postdoc can be a place to do interesting work and begin a research career.

If you're a statistician who's interested in this fellowship, feel free to contact me directly--you have to apply to the Earth Institute directly (see link above), but I'm happy to give you advice about whether your goals fit into our program. It's important to me, and to others in the EI, to have statisticians involved in our research.

Deadline for applications is 1 Dec.

Denis Cote writes:

I am reviewing a paper using logistic regression and I am uncertain about the way they coded their inputs.

They have different ordinal variables coming from self-report questions. For example, "self-perceived health" with its answer choices: excellent, very good, good, fair, poor.

Or weight coded as underweight, normal, overweight and obese. They entered the answers as categorical-binary variables (unsure about the precise coding).

Shouldn't they have kept a single ordinal variable? What would be the best practice with ordinal variables?

I think I would not have asked this question if I hadn't read and applied your 2 standard deviations technique!

My reply:

Carlisle Rainey writes:

In an earlier blog post, you suggest: "...do a global search-and-replace to change 'DV' to 'outcome' and to change 'OLS' to 'linear regression'." Would you provide a quick explanation why or point me somewhere to find the answer myself?

My reply:

1. I don't like the term "dependent variable" because of confusion with dependence of random variables. To me, "outcome" makes it clearer that you are choosing which variables to use as predictors and which as outcome. "Predictee" would be ok too, I guess.

2. "OLS" focuses on the optimization task; "linear regression" focuses on the model. I think the model is more important than how it's estimated. To put it another way, "OLS" generalizes to weighted least squares, least absolute deviation, etc. "Linear regression" generalizes to logistic regression, nonlinear regression, etc. I find the latter set of generalizations more important and interesting.

This came in the email:

Lazy ways of modeling proportions

Andrew Therriault writes:

I'm creating a model of issue emphasis in political campaigns as a product of public opinion (so candidates choose what to discuss strategically based on which issue will help them most), and the data I'm using combines candidates' ad spending (coded by issue) with the public's issue positions in the candidates' districts. Thus far, I've used percentage of ad spending per issue for each candidate as my DV in OLS and tobit models. I know that this specification is not optimal, though, because of the correlation between each candidate's observations (since they are constrained to sum to 100).

See here for a link to a statistical consultant. I don't know the guy, but his credentials seem strong and he has a nice-looking website and what seems like a good philosophy.

Response to two-slit discussion

Thanks for all the comments. I responded here. To summarize briefly:

1. Many people commented that the laws of probability work just fine in quantum mechanics, you just have to include the act of measurement in your model: there is no latent joint distribution that exists out there to be passively measured.

I agree, but my point was that when we apply probability theory to analyze surveys, experiments, observational studies, etc., we typically do assume a joint distribution and we typically do treat the act of measurement as a direct application of conditional probability. If classical probability theory (which we use all the time in poli sci, econ, psychometrics, astronomy, etc.) needs to be generalized to apply to quantum mechanics, that makes me wonder if it should be generalized for other applications too.

2. Some commenters discussed work in political science and psychometrics in which researchers are working on generalized probability models, inspired by quantum probability, to do statistical data analysis. Looks like it could be interesting.

P.S. Just to clarify further: I know more physics than most statisticians do, but that's not a lot, and I certainly don't think I have anything useful to say about quantum mechanics beyond what Richard Feynman (or, for that matter, Bill Jefferys) has written already. Where I do have expertise is in the application of probability models to diverse applied fields. And what I'm wondering is whether it would be appropriate to generalize the usual probability models there, just as it is necessary to do for quantum mechanics.

This is all standard physics. Consider the two-slit experiment--a light beam, two slits, and a screen--with y being the place on the screen that lights up. For simplicity, think of the screen as one-dimensional. So y is a continuous random variable.

Consider four experiments:

1. Slit 1 is open, slit 2 is closed. Shine light through the slit and observe where the screen lights up. Or shoot photons through one at a time, it doesn't matter. Either way you get a distribution, which we can call p1(y).

2. Slit 1 is closed, slit 2 is open. Same thing. Now we get p2(y).

3. Both slits are open. Now we get p3(y).

4. Now run experiment 3 with detectors at the slits. You'll find out which slit each photon goes through. Call the slit x. So x is a discrete random variable taking on two possible values, 1 or 2. Assuming the experiment has been set up symmetrically, you'll find that Pr(x=1) = Pr(x=2) = 1/2.

You can also record y, thus you can get p4(y), and you can also observe the conditional distributions, p4(y|x=1) and p4(y|x=2). You'll find that p4(y|x=1) = p1(y) and p4(y|x=2) = p2(y). You'll also find that p4(y) = (1/2) p1(y) + (1/2) p2(y). So far, so good.

The problem is that p4 is not the same as p3. Heisenberg's uncertainty principle: putting detectors at the slits changes the distribution of the hits on the screen.

This violates the laws of conditional probability, in which you have random variables x and y, and in which p(x|y) is the distribution of x if you observe y, p(y|x) is the distribution of y if you observe x, and so forth.
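To spell out the conflict in symbols, here is a short sketch. The complex amplitudes $\psi_1$ and $\psi_2$ are notation introduced here (with $p_1 \propto |\psi_1|^2$ and $p_2 \propto |\psi_2|^2$); they do not appear in the description above, and normalizing constants are left out.

```latex
% If x and y shared a single joint distribution, summing over x would give
p(y) = \sum_{x=1}^{2} \Pr(x)\, p(y \mid x)
     = \tfrac{1}{2}\, p_1(y) + \tfrac{1}{2}\, p_2(y)
     = p_4(y).

% With both slits open and no detectors, amplitudes add rather than probabilities:
p_3(y) \propto \left| \psi_1(y) + \psi_2(y) \right|^2
       = |\psi_1(y)|^2 + |\psi_2(y)|^2
         + 2\, \mathrm{Re}\!\left[ \psi_1^{*}(y)\, \psi_2(y) \right].
```

The cross term is the interference pattern. It is what keeps $p_3$ from equaling the mixture $p_4$.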

A dissenting argument (that doesn't convince me)

To complicate matters, Bill Jefferys writes:

As to the two slit experiment, it all depends on how you look at it. Leslie Ballentine wrote an article a number of years ago in The American Journal of Physics, in which he showed that conditional probability can indeed be used to analyze the two slit experiment. You just have to do it the right way.

I looked at the Ballentine article and I'm not convinced. Basically he's saying that the reasoning above isn't a correct application of probability theory because you should really be conditioning on all information, which in this case includes the fact that you measured or did not measure a slit. I don't buy this argument. If the probability distribution changes when you condition on a measurement, this doesn't really seem to be classical "Boltzmannian" probability to me.

In standard probability theory, the whole idea of conditioning is that you have a single joint distribution sitting out there--possibly there are parts that are unobserved or even unobservable (as in much of psychometrics)--but you can treat it as a fixed object that you can observe through conditioning (the six blind men and the elephant). Once you abandon the idea of a single joint distribution, I think you've moved beyond conditional probability as we usually know it.

And so I think I'm justified in pointing out that the laws of conditional probability are false. This is not a new point with me--I learned it in college, and obviously the ideas go back to the founders of quantum mechanics. But not everyone in statistics knows about this example, so I thought it would be useful to lay it out.

What I don't know is whether there are any practical uses to this idea in statistics, outside of quantum physics. For example, would it make sense to use "two-slit-type" models in psychometrics, to capture the idea that asking one question affects the response to others? I just don't know.

Zbicyclist writes:

Doug Rivers's discussion about weighting and bias reminds me of a common trick I don't see very much, and that is using effective sample size as a planning/tracking tool. ESS is (Sum of the weights)^2 / Sum of (weights^2).

If all n cases are weighted equally, ESS is n. Otherwise, ESS < n.

Because I've instituted this as a tracking signal and as a planning tool, I've sometimes been asked to justify it; the clearest explication is Kish, Leslie (1992) Weighting for unequal Pi. Journal of Official Statistics, 8, 183-200, which makes sense, since I pulled the ESS formula from my class notes from when I took sampling from Kish.

My question is this: given that this is such a simple tracking and planning metric, why does it seem so hard to find in the literature?

My reply: I dunno. I use this formula all the time, but I usually just derive it myself from scratch when I need it. In general, survey sampling books are weak on this stuff. The other thing is that this formula in general overestimates the sampling variance, I think. It's a formula for variance with unequal-probability sampling (as indicated by the title of the Kish article). When weights are constructed using poststratification (as is standard, see for example any of my many articles on the topic), the sampling variance will be lower.
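The effective sample size formula quoted above is short enough to write out directly. This is just a sketch of that one formula; the function name `effective_sample_size` and the toy weights are made up for illustration.

```python
import numpy as np

def effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / sum of (weights^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

# Equal weights: ESS equals the number of cases.
ess_equal = effective_sample_size(np.ones(100))

# One case weighted four times as heavily as the other four:
# (1+1+1+1+4)^2 / (1+1+1+1+16) = 64 / 20
ess_unequal = effective_sample_size([1, 1, 1, 1, 4])

print(ess_equal, ess_unequal)
```

Rescaling all the weights by a constant leaves the result unchanged, so it does not matter whether the weights are normalized to sum to 1, to n, or to the population size.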

Listen to me on the radio

Apparently the BBC has a radio show all about statistics. It's broadcast at 1330 on Friday afternoons and repeated at 2000 on Sundays on Radio 4. I taped the interview a few weeks ago; I wonder how much of it they'll use. The interviewer, Tim Harford, was excellent. I just hope they cut some of the part near the end when I got too relaxed and started to babble.

The topic was my article (with David Weakliem), Of Beauty, Sex, and Power.

P.S. I took a listen. They did a pretty good job of cutting and pasting my rambling into coherent sequences of sentences. I still sound pretty dry and professorial, I'm afraid, but at least it's to the point.

Doug writes:

Probability sampling is a great invention, but rhetoric has overtaken reality here. Both of the probability samples in this study had large amounts of nonresponse, so that the real selection probability--i.e., the probability of being selected by the surveyor and the respondent choosing to participate--is not known. Usually a fairly simple nonresponse model is adequate, but the accuracy of the estimates depends on the validity of the model, as it does for non-probability samples. Nonresponse is a form of self-selection. All of us who work with non-probability samples should spend our efforts trying to improve the modeling and methods for dealing with the problem, instead of pretending it doesn't exist.

Good stuff. Read the whole thing. Doug was, along with me and several others, an advisor on the recent report on National Election Study weighting.

Steve Farmer writes:

I am working on a hierarchical logistic regression model in SAS using the GLIMMIX function. I really liked your average predictive comparisons Figure 21.7 from your book, "Data Analysis Using Regression and Multilevel/Hierarchical Models" (2007, p 473). I wondered if you could direct me to the easiest way to accomplish this analysis short of doing it manually. Is there a command or macro you could direct me to in SAS or STATA?

My reply: Some version of this may actually be available in Stata, although my guess is that it would be based on a single central point rather than averaging over all the data as we do in our book (and, before that, in my article with Pardoe). I've been planning to put it in an R package sometime but have never gotten around to it.

I received the following email:

I was hoping if you could take a moment to counsel me on a problem that I'm having trying to calculate correct confidence intervals (I'm actually using a bootstrap method to simulate 95%CIs). . . . [What follows is a one-page description of where the data came from and the method that was used.]

My reply:

Without following all the details, let me make a quick suggestion which is that you try simulating your entire procedure on a fake dataset in which you know the "true" answer. You can then run your procedure and see if it works there. This won't prove anything but it will be a way of catching big problems, and it should also be helpful as a convincer to others.

If you want to carry this idea further, try to "break" your method by coming up with fake data that causes your procedure to give bad answers. This sort of simulation-and-exploration can be the first step in a deeper understanding of your method.
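Here's a rough sketch of what that fake-data check could look like for a percentile bootstrap interval. Everything specific is an assumption for illustration: the helper names `bootstrap_ci` and `coverage`, the normal fake data, the sample size, and the choice of the mean as the statistic. The correspondent's actual procedure would take the place of `stat` and `simulate`.

```python
import numpy as np

def bootstrap_ci(data, stat, n_boot=300, level=0.95, rng=None):
    """Percentile bootstrap interval for stat(data)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(data)
    idx = rng.integers(0, n, size=(n_boot, n))      # resample with replacement
    boot = np.array([stat(data[i]) for i in idx])
    alpha = 1.0 - level
    lo, hi = np.quantile(boot, [alpha / 2, 1 - alpha / 2])
    return lo, hi

def coverage(true_value, simulate, stat, n_reps=200, seed=1):
    """Fraction of fake datasets whose interval contains the known truth."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_reps):
        data = simulate(rng)                        # fake data, truth is known
        lo, hi = bootstrap_ci(data, stat, rng=rng)
        hits += int(lo <= true_value <= hi)
    return hits / n_reps

true_mean = 2.0
cov = coverage(true_mean,
               simulate=lambda rng: rng.normal(true_mean, 1.0, size=50),
               stat=np.mean)
print(f"fraction of nominal 95% intervals covering the truth: {cov:.2f}")
```

To try to "break" the method, swap in a `simulate` function that produces small, skewed, or heavy-tailed samples, or a statistic such as a maximum, and see whether the coverage moves away from the nominal level.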

And then I got another, unrelated email from somebody else:

I am working on a mixed treatment comparison of treatments for non-small cell lung cancer. I am doing the analysis in two parts in order to estimate treatment effects (i.e. log hazard ratios) and absolute effects (by projecting the log hazard ratios onto a baseline treatment scale parameter; the baseline treatment times to event are assumed to arise from a Weibull distribution. . . . .[What follows is a one-page description of the model, which was somewhat complicated by constraints on some of the variance parameters] . . . I can get my analysis to run with constraints imposed on the treatment specific prior distributions for PFS and OS, and on the population log hazard ratios for PFS and OS. However, my problem is that the constraint does not appear to be doing anything and the results are similar to what I obtain without imposing the constraint. This is not what I expect . . .

My reply:

Sometimes the data are strong enough that essentially no information is supplied by external constraints. You can, to some extent, check how important this is for your problem by simulating some fake data from a setting similar to yours and then seeing whether your method comes close to reproducing the known truth. You can look at point estimates and also the coverage of posterior intervals.
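As a toy version of the first point, here is a sketch using a conjugate normal model with known data variance, where an informative prior stands in for the external constraint. This is an assumption for illustration only and is far simpler than the Weibull model in the email. The helper `posterior_mean_sd` and all the numbers are made up.

```python
import numpy as np

def posterior_mean_sd(ybar, n, sigma, prior_mean, prior_sd):
    """Posterior for a normal mean with known sigma and a normal prior."""
    prec = 1.0 / prior_sd**2 + n / sigma**2
    mean = (prior_mean / prior_sd**2 + n * ybar / sigma**2) / prec
    return mean, prec ** -0.5

rng = np.random.default_rng(4)
truth, sigma = 1.0, 1.0            # the known "true" answer for the fake data
shifts = {}
for n in (10, 10_000):
    ybar = rng.normal(truth, sigma, size=n).mean()
    # Informative prior (standing in for the constraint) versus a vague one.
    tight, _ = posterior_mean_sd(ybar, n, sigma, prior_mean=0.0, prior_sd=1.0)
    vague, _ = posterior_mean_sd(ybar, n, sigma, prior_mean=0.0, prior_sd=100.0)
    shifts[n] = abs(tight - vague)
    print(n, round(tight, 4), round(vague, 4))
```

With the small sample the informative prior visibly moves the estimate. With the large sample the two answers nearly coincide, which is the pattern the correspondent describes.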

The National Election Study is hugely important in political science, but, as with just about all surveys, it has problems of coverage and nonresponse. Hence, some adjustment is needed to generalize from sample to population.

Matthew DeBell and Jon Krosnick wrote this report summarizing some of the choices that have to be made when considering adjustments for future editions of the survey. The report was put together in consultation with several statisticians and political scientists: Doug Rivers, Martin Frankel, Colm O'Muircheartaigh, Charles Franklin, and me. Survey weighting isn't easy, and this sort of report is just about impossible to write--you can't help leaving things out. They did a good job, though, and it's great to have this stuff put down in an official way, so that people can work off of it when going forward.

It's a lot harder to write a procedure for general use than to do a single analysis oneself.

Some corrections

I have a few corrections to add to the report that unfortunately didn't make it into the final version (no doubt because of space limitations):

MICE, by Stef van Buuren and others, is an R package with many similarities to our "mi". Actually, MICE came first, and it's recently been updated. Should be available as an R download very soon. It would be interesting to see how the programs differ. We should probably look more carefully into cases where they give different results.

Lee Wilkinson sends along this cool paper demonstrating a simple but effective automatic classifier:

Linf is a classifier that was designed to address the curse of dimensionality and polynomial complexity by using projection, binning, and covering in a sequential framework. For class-labeled points in high-dimensional space, Linf employs computationally-efficient methods to construct 2D projections and sets of rectangular regions on those projections that contain points from only one class. Linf organizes these sets of projections and regions into a decision list for scoring new data points.

Linf is not a hybrid or modification of existing classifiers; it employs a new covering algorithm. The accuracy of Linf on widely-used benchmark datasets is comparable to the accuracy of competitive classifiers and, in some important cases, exceeds the accuracy of competitors. Its computational complexity is sub-linear in number of instances and number of variables and quadratic in number of classes.

I also like the article's delightfully understated conclusion. After a page of bullet points on the virtues of their method, the authors write:

Given these distinctive features and its fundamental differences from other classifiers, Linf is a candidate for inclusion in portfolios of classifiers.

If only all of us could be so modest.

P.S. Table 1 and Figure 5 shouldn't be in alphabetical order, and I think Figure 6 would work better as a parallel coordinates plot. These are pretty minor comments, but Lee is an authority on statistical graphics so I hold him to a high standard.

Ian Stevenson writes:

I received the following email:

Aaron Gullickson writes:

John Q. writes:

Shane Murphy writes:

Dumpin' the data in raw

Benjamin Kay writes:

I just finished the Stata Journal article you wrote. In it I found the following quote: "On the other hand, I think there is a big gap in practice when there is no discussion of how to set up the model, an implicit assumption that variables are just dumped raw into the regression."

I saw James Heckman (famous econometrician and labor economist) speak on Friday, and he mentioned that using test scores in many kinds of regressions is problematic, because the assignment of a score is somewhat arbitrary even if the order was not. He suggested that positive, monotonic transformations of scores contain the same information and lead to different standard errors if, in your words, one just "dumped into the regression". It was somewhat of a throwaway remark, but considering it longer, I imagine he means that a difference of test scores need have no constant effect. The remedy he suggested was to recalibrate exam scores such that they have some objective meaning. For example, a mechanics exam scored between one and a hundred, one can pass (65) only if they successfully rebuild the engine in the time allotted, but better scores indicate higher quality or faster speed. In this example one might change it to a binary variable of passing or not, an objective testing of a set of competencies. However, doing that clearly throws away information.

Do you or the readers of Statistical Modeling, Causal Inference, and Social Science blog have any advice here? The transformation of the variable is problematic and the critique of transformations on using it raw seems a serious one, but the act of narrowly mapping it onto a set of objective discrete skills seems to destroy lots of information. Percentile ranks on exams might be a substitute for the raw scores in many cases, but introduces other problems like in comparisons between groups.

My reply: Heckman's suggestion sounds like it would be good in some cases but it wouldn't work for something like the SAT which is essentially a continuous measure. In other cases, such as estimated ideal point measures for congressmembers, it can make sense to break a single continuous ideal-point measure into two variables: political party (a binary variable: Dem or Rep) and the ideology score. This gives you the benefits of discretization without the loss of information.

In chapter 4 of ARM we give a bunch of examples of transformations, sometimes on single variables, sometimes combining variables, sometimes breaking up a variable into parts. A lot of information is coded in how you represent a regression function, and it's criminal to just take the data as they appear in the Stata file and just dump them in raw. But I have the horrible feeling that many people either feel that it's cheating to transform the variables, or that it doesn't really matter what you do to the variables, because regression (or matching, or difference-in-differences, or whatever) is a theorem-certified bit of magic.
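To see that the scale of a predictor matters even when the ordering is untouched, here is a small simulated sketch. The data-generating process (an outcome that is linear in the log of the score), the helper `slope_t_stat`, and all the numbers are assumptions for illustration. They are not taken from ARM or from Heckman's talk.

```python
import numpy as np

def slope_t_stat(x, y):
    """t-statistic for the slope in a least-squares fit of y on x."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - 2)          # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)          # covariance of coefficients
    return beta[1] / np.sqrt(cov[1, 1])

rng = np.random.default_rng(2)
score = rng.uniform(1, 100, size=500)              # fake raw test scores
outcome = np.log(score) + rng.normal(0, 0.5, size=500)

t_raw = slope_t_stat(score, outcome)               # score dumped in raw
t_log = slope_t_stat(np.log(score), outcome)       # monotonic transformation

# The transformation keeps every examinee in the same rank order.
same_order = np.array_equal(np.argsort(score), np.argsort(np.log(score)))
print(same_order, round(t_raw, 1), round(t_log, 1))
```

Both codings carry the same ordinal information, but the fitted regressions and their uncertainties differ, so the choice of scale is part of the model.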

Just quaid, part 2

Christopher Beam's recent news article on QALYs includes this amazing quote:

QALYs also assume that a year lived by an 80-year-old is worth less than one lived by a 20-year-old. But that's not accurate, says Dana Goldman of the RAND Corp. "It's not taking into account hope, not taking into account the chance of living to see your daughter's wedding, it's not getting at the extra value we put on the end of life." Yes, the U.S. health care system has to rein in costs, says Goldman, but "QALY is not ready for prime time."

Maybe this guy is being taken out of context, but . . . "the chance of living to see your daughter's wedding"??? There's always individual variation; that doesn't mean you can't try to capture averages.

Alan Bergland writes:

I am a graduate student studying evolutionary biology at Brown University. I am writing you with what I think is a simple question, but I cannot seem to find an answer I feel comfortable with.

I am trying to test a planned contrast using posterior distributions from a mixed model (the mixed model is calculated in lme4, and the simulations in arm). The model is fairly complicated, but at the end of the day, there are two fixed effect treatments with two levels each that I am interested in. Let's call these fixed effects "treatment A" (with levels A and a) and "treatment B" (with levels B and b). I am interested in the interaction between treatment A and treatment B, but have a specific hypothesis about the form of that interaction I would like to test. Specifically, I would like to test if ab is less than Ab & aB=AB.

As you and Jennifer Hill suggest in your Multilevel/Hierarchical models book (p. 20), I could test if ab

Once I can calculate the probability that Ab=AB, would it be reasonable to calculate the probability that (ab is less than Ab & aB=AB) as Pr(ab is less than Ab)*Pr(aB=AB)?

My reply:

1. Don't use arm's sim() function for lmer() objects. The current version is wrong; we're fixing it now, and the replacement should be available in about a month.

2. I don't recommend testing if aB=AB. At least in the sorts of problems I work on, no two comparisons are exactly equal. I think it makes more sense to estimate the relevant comparison, get the confidence interval, and make a graph. You could also do things like calculate the posterior probability (based on simulations) that ab < AB & |aB - AB|

Kobi forwarded this on, I don't know anything about it but it looks like it could be interesting:

The American Statistical Association organizes a program in which young researchers can submit writing samples and get comments from statisticians who are more experienced writers. I agreed to participate in this program, as long as the authors were willing to have their articles and my comments posted here.

I'm going to start with my general advice after reading and commenting on the two articles sent to me. I think this advice should be of interest to nearly all the readers of this blog. Then I'll link to the articles and give some detailed comments.

General advice

Both the papers sent to me appear to have strong research results. Now that the research has been done, I'd recommend rewriting both articles from scratch, using the following template:

1. Start with the conclusions. Write a couple pages on what you've found and what you recommend. In writing these conclusions, you should also be writing some of the introduction, in that you'll need to give enough background so that general readers can understand what you're talking about and why they should care. But you want to start with the conclusions, because that will determine what sort of background information you'll need to give.

2. Now step back. What is the principal evidence for your conclusions? Make some graphs and pull out some key numbers that represent your research findings which back up your claims.

3. Back one more step, now. What are the methods and data you used to obtain your research findings?

4. Now go back and write the literature review and the introduction.

5. Moving forward one last time: go to your results and conclusions and give alternative explanations. Why might you be wrong? What are the limits of applicability of your findings? What future research would be appropriate to follow up on these loose ends?

6. Write the abstract. An easy way to start is to take the first sentence from each of the first five paragraphs of the article. This probably won't be quite right, but I bet it will be close to what you need.

7. Give the article to a friend, ask him or her to spend 15 minutes looking at it, then ask what they think your message was, and what evidence you have for it. Your friend should read the article as a potential consumer, not as a critic. You can find typos on your own time, but you need somebody else's eyes to get a sense of the message you're sending.

It's been a dramatic month: A month ago, a coalition of some of the leading teams qualified for the $1 million grand prize for improving the accuracy of the movie-recommending model by more than 10%. But the competition would close 30 days afterward, in case someone else was able to improve upon the result. This happened less than a day before the deadline, courtesy of The Ensemble, an enormous team composed of 23 previously separate teams and individuals. Of course, most of the progress towards the victory came from models making use of new significant patterns in the data, such as that of time.

The development of an ensemble from many separate teams was another accomplishment, and the GPT's inclusion rules provide some insight into the process: "shares" of the winnings were distributed based on how much a contribution was able to improve the result in terms of percentage points. Simon Owens describes what it was like to participate in The Ensemble.

Bayesian statistics always works with ensembles: the posterior is a weighted average of all models, the weight being based on the fit of each model times the prior quality of the model. There are some additional Bayesian elements that could be a part of future competitions, such as Bayesian scoring functions.
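As a rough illustration of the weighting described above, here is a minimal Python sketch of model averaging in which each model's weight is proportional to its prior weight times its fit. The function names and all the numbers are hypothetical, and "fit" is assumed here to mean the log marginal likelihood of each model; this is a sketch under those assumptions, not the procedure used by any Netflix Prize team.

```python
import math


def posterior_model_weights(log_marginal_likelihoods, prior_weights):
    """Return weights proportional to prior weight times marginal likelihood."""
    # Work on the log scale and subtract the maximum for numerical stability.
    log_unnorm = [
        log_ml + math.log(prior)
        for log_ml, prior in zip(log_marginal_likelihoods, prior_weights)
    ]
    largest = max(log_unnorm)
    unnorm = [math.exp(value - largest) for value in log_unnorm]
    total = sum(unnorm)
    return [value / total for value in unnorm]


def averaged_prediction(predictions, weights):
    """Weighted average of the individual models' predictions."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical inputs: three models, ordered from simplest to most complex.
log_ml = [-10.0, -11.0, -13.0]   # assumed fit of each model (log scale)
prior = [0.5, 0.3, 0.2]          # assumed prior weights, simpler models favored
preds = [3.0, 3.5, 4.0]          # each model's prediction for one new case

weights = posterior_model_weights(log_ml, prior)
bma_prediction = averaged_prediction(preds, weights)
```

The averaged prediction always lies between the smallest and largest individual predictions, and when every model fits equally well the weights reduce to the prior weights.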

In the past I was asked to contrast Occam's razor with the Epicurean principle. Occam's razor is the Bayesian prior, or the yang principle: simpler models have greater a priori weight (because we tend to economize that what is useful). Occam's razor goes back to Aristotle, who wrote "For the more limited, if adequate, is always preferable," and "For if the consequences are the same, it is always better to assume the more limited antecedent" in his Physics. We mathematically express it as the prior.

The Epicurean principle is the yin, mathematically expressed as the integral over the model space. Ensembles go back to Epicurus' letter to Herodotus: "When, therefore, we investigate the causes of [...] phenomena, [...] we must take into account the variety of ways in which analogous occurrences happen within our experience." Thus, Bayesian statistics combines the yin and the yang, balancing the pursuit of simplicity with the limitations of uncertainty.

[7/31/09: Added a link to Simon Owens' interview with The Ensemble.]

That modeling feeling


It goes like this: there's something you want to estimate and you have some data. Maybe, to take my favorite recent example, you want to break down support for school vouchers by religion, ethnicity, income, and state (or maybe you'd like to break it down even further, but you have to start somewhere).

Or maybe you want to estimate the difference between how rich and poor people vote, by state, over several decades--but you're lazy and all you want to work with are the National Election Studies, which only have a couple thousand respondents, at most, in any year, and don't even cover all the states.

Or maybe you want to estimate the concentration of cat allergen in a bunch of dust samples, while simultaneously estimating the calibration curve needed to get numerical estimates, all in the presence of contamination that screws up your calibration.

Or maybe you want to identify the places in the United States where it's cost-effective to test your house for radon gas--and the data you have across the country are 80,000 noisy measurements, 5,000 accurate measurements, and some survey data and geological information.

Or maybe you want to understand how perchloroethylene is absorbed in the body--a process that is active at the time scale of minutes and also weeks--given only a couple dozen measurements on each of a few people.

Or maybe you want to get a picture of brain activity given indirect measurements from a big clanking physical device encircling a person's head.

Or maybe you want to estimate what might have happened in past elections had the Democrats or Republicans received 1% more, or 2% more, or 3% more, of the vote.

Or maybe . . . or maybe . . .

What all these examples have in common is some data--not enough, never enough!--and a vague sense arising in my mind of what the answer should look like. Not exactly what it would look like--for example, I did not in any way anticipate the now-notorious pattern of vouchers being more popular among rich white Catholics and evangelicals and among poor blacks and Hispanics (maybe I should've anticipated it; I'm not proud of the level of ignorance that I had that allowed this finding to surprise me, I'm just stating the facts)--but what it could look like. Or, maybe it would be more accurate to say, various things that wouldn't look right, if I were to see them.

And the challenge is to get from point A to point B. So, you throw model after model at the problem, method after method, alternating between quick-and-dirty methods that get you nowhere, and elaborate models that give uninterpretable, nonsensical results. Until finally you get close. Actually, what happens is that you suddenly solve the problem! Unexpectedly, you're done! And boy is the result exciting. And you do some checking, fit to a different dataset maybe, or make some graphs showing raw data and model estimates together, or look carefully at some of the numbers, and you realize you have a problem. And you stare at your code for a long long time and finally bite the bullet, suck it up and do some active debugging, fake-data simulation, and all the rest. You code your quick graphs as diagnostic plots and build them into your procedure. And you go back and do some more modeling, and you get closer, and you never quite return to the triumphant feeling you had earlier--because you know that, at some point, the revolution will come again and with new data or new insights you'll have to start over on this problem, but, for now, yes, yes, you can stop, you can step back and put in the time--hours, days!--to make pretty graphs, you can bask in the successful solution of a problem. You can send your graphs out there and let people take their best shot. You've done it.

But, not so deep inside you, that not-so-still and not-so-small voice reminds you of the compromises you've made, the data you've ignored, the things you just don't know if you believe. You want to do more, but that will require more computing, more modeling, more theory. Yes, more theory. More understanding of what these things called models do. Because, just like storybook characters take on a life of their own, just like Gollum wouldn't die and Frank Bascombe comes up with wisecracks all on his own, and Ramona Quimby won't stay down even if you try to make her, and so on and so on and so on, just like these characters, each with his or her internal logic, so any statistical model worth fitting also has its internal logic, mathematical properties latent in its form but, Turing-machine-like, impossible to anticipate before applying it to data--not just "real data" (how I hate that phrase), but data from live problems. And then comes Statistical Theory--the good kind, the kind that tells us what our models can and cannot do, when they can bend with the data and when they snap. (Did you know that doubly-integrated white noise can't really turn corners? I didn't, until I tried to fit such a model to data that went up, then down.) And you do your best with your Theory, and your simulations, and even your computing (yuck!). But you move on. And you hope that when it's time to come back to this problem, you'll have some better models at hand, things like splines and time series cross sectional models, and you'll have a programming and modeling environment where you can just write down latent factors and have them interact, and you'll be able to include three-way interactions, and four-way interactions, and . . . and . . . you hope that in ten years you'll be fitting the models that, ten years ago, you thought you'd be fitting in five years. And you take a rest. You write up what you found and you write up exactly what you did (not always so easy to do). 
And a new question comes along. You want a quick answer. You try putting together available data in a simple way. You try some weighting. But you don't believe your answer. You need more data. You need more model. You get to work.

That's how it feels, from the inside.

Freedom House is currently seeking individuals with demonstrated professional experience to work with civil society organizations in Egypt through the International Executive Volunteers (IEV) program for 3 months beginning in September 2009.

Volunteers must have a minimum of five years of relevant professional experience, the ability to commit to 3 months of service, and a resourceful, innovative personality. Previous overseas experience, particularly in Egypt and in the Middle East and North Africa is preferred.

Statistician/Polling Specialist
A statistician/polling specialist has been requested to provide support in the preparation and analysis of survey methodology and questionnaire data. Tasks will include designing work plans, managing logistics, reporting results to targeted groups, and developing relationships with key constituencies. Additional knowledge or expertise is needed for volunteer management - recruiting, retaining, and training for key projects. Arabic language skills are preferred but not required.

Impact factors


A bunch of years ago, I published an article (using some of the material in my Ph.D. thesis) in the Journal of Cerebral Blood Flow and Metabolism. It's ranked as the #25 journal in neuroscience, and has a pretty crappy impact factor of 5.7.

By comparison, the impact factors of the top statistics journals a few years ago were:
JASA 1.6, JRSS 1.5, Ann Stat 1.3, Ann Prob 0.9, Biometrika 1.8, Biometrics 1.1, Stat Sci 2.0, Technometrics 1.3.

So now you know why statisticians don't like impact factors.

Daniel Egan sent me a link to an article, "Standardized or simple effect size: What should be reported?" by Thom Baguley, that recently appeared in the British Journal of Psychology. Here's the abstract:

It is regarded as best practice for psychologists to report effect size when disseminating quantitative research findings. Reporting of effect size in the psychological literature is patchy -- though this may be changing -- and when reported it is far from clear that appropriate effect size statistics are employed. This paper considers the practice of reporting point estimates of standardized effect size and explores factors such as reliability, range restriction and differences in design that distort standardized effect size unless suitable corrections are employed. For most purposes simple (unstandardized) effect size is more robust and versatile than standardized effect size. Guidelines for deciding what effect size metric to use and how to report it are outlined. Foremost among these are: (i) a preference for simple effect size over standardized effect size, and (ii) the use of confidence intervals to indicate a plausible range of values the effect might take. Deciding on the appropriate effect size statistic to report always requires careful thought and should be influenced by the goals of the researcher, the context of the research and the potential needs of readers.

Egan writes:

I run into the problem of reporting coefficients all the time, mostly in the context of presenting effects to non-statisticians. While my audiences are generally bright, the obvious question always asked is "which of these is the biggest effect?" The fact that a sex dummy has a large numerical point estimate relative to number-of-purchases is largely irrelevant - it's because sex's range is tiny compared to other covariates. But moreover, sex is irrelevant to "policy-making" - we can't change a person's sex! So what we're interested in is the viable range over which we could influence an independent variable, and the second-order likely effect upon the dependent. So two questions: 1. For pedagogical effect, is there any way of getting around these problems? How can we communicate the effects to non-statisticians easily (and think someone who has exactly 10 minutes to understand your whole report) 2. Is there any easy way to infer the elasticity of the effect - i.e. how much can we change the dependent, by attempting to exogenously change one of the independents? While I know that I could design the experiment to do this, I work in far more observational data - and this "effect" size is really what matters the most.

My quick reply to Egan is to refer to my article with Iain Pardoe on average predictive comparisons, where we discuss some of these concerns.

I also have some thoughts on the Baguley article:

Daljit Dhadwal writes:

On the Ask Metafilter site, someone asked the following:

How does statistical analysis differ when analyzing the entire population rather than a sample? I need to do some statistical analysis on legal cases. I happen to have the entire population rather than a sample. I'm basically interested in the relationship between case outcomes and certain features (e.g., time, the appearance of certain words or phrases in the opinion, the presence or absence of certain issues). Should I do anything different than I would if I were using a sample? For example, is a p-value meaningful in this kind of case?

My reply:

This is a question that comes up a lot. For example, what if you're running a regression on the 50 states. These aren't a sample from a larger number of states; they're the whole population.

To get back to the question at hand, it might be that you're thinking of these cases as a sample from a larger population that includes future cases as well. Or, to put it another way, maybe you're interested in making predictions about future cases, in which case the relevant uncertainty comes from the year-to-year variation. That's what we did when estimating the seats-votes curve: we set up a hierarchical model with year-to-year variation estimated from a separate analysis. (Original model is here, later version is here.)

So, one way of framing the problem is to think of your "entire population" as a sample from a larger population, potentially including future cases. Another frame is to think of there being an underlying probability model. If you're trying to understand the factors that predict case outcomes, then the implicit full model includes unobserved factors (related to the notorious "error term") that contribute to the outcome. If you set up a model including a probability distribution for these unobserved outcomes, standard errors will emerge.

After finding the Howard Wainer interview, I looked up the entire series of Profiles in Research published by the Journal of Educational and Behavioral Statistics. I don't have much to say about most of these interviews: some of these people I'd never heard of, and I don't really have much research overlap with the others. Probably I have the most overlap with R. D. Bock, who's done a lot of work on multilevel modeling, but, for whatever reason, his stories didn't grab my interest.

But I was curious about the interview with Arthur Jensen. I've never met him--he gave a talk at the Berkeley statistics department once when I was there, but for some reason I wasn't able to attend the talk. But I've heard of him. As the interviewers (Daniel Robinson and Howard Wainer) state:

Our article (by Yu-Sung, Jennifer, Masanao, and myself, and based also on work with Kobi, Grazia, and Peter Messeri) will be appearing in the Journal of Statistical Software, in a special issue on missing-data imputation. Here's the abstract:

Our mi package in R has several features that allow the user to get inside the imputation process and evaluate the reasonableness of the resulting models and imputations. These features include: flexible choice of predictors, models, and transformations for chained imputation models; binned residual plots for checking the fit of the conditional distributions used for imputation; and plots for comparing the distributions of observed and imputed data in one and two dimensions. In addition, we use Bayesian models and weakly informative prior distributions to construct more stable estimates of imputation models. Our goal is to have a demonstration package that (a) avoids many of the practical problems that arise with existing multivariate imputation programs, and (b) demonstrates state-of-the-art diagnostics that can be applied more generally and can be incorporated into the software of others.

We've made lots of improvements since listing the package last year (here). There's still a lot more work to do, in many different directions (including multilevel models, nonignorable models, the self-cleaning oven, and making the program run faster in all sorts of ways), and we keep improving it. But it's good to have something out there.

To actually get the R package, just open your R window, click on Packages, Install packages, and grab mi.

Sometimes people think it's a disaster when you have more predictors than data points, but I always point out that, no, it's better to have 9 predictors than just 1 or 2. After all, if you really wanted just 1 or 2, you could just throw out most of your data!

Nate's chart is excellent, especially the ordering of the candidates in order of the percent favoring resignation:

sanford2.PNG

I also like the gratuitous exclamation marks which add fun value without actually making the graph any harder to read. The key reason this works is that Nate wisely did not fill in the blank squares with "No!"s.

My only comments are:

Of Beauty, Sex, and Power


Our article has appeared in The American Scientist. (Here's a link to the full article; hit control-plus to make the font more readable.) I highly recommend it for your introductory (or advanced) statistics classes. We start with a silly story of a flawed statistical analysis of sex ratios that managed to sneak into a serious scientific journal, then discuss general issues of how to interpret inconclusive statistical findings (including a brief analysis of data from People Magazine's 50 Most Beautiful People lists), and then loop back and discuss the statistical reasons that exaggerated claims can get amplified by the news media.

20096592237373-2009-07GelmanF4.jpg

The article begins as follows:

Greg Mankiw writes:

The next time you hear someone cavalierly point to international comparisons in life expectancy as evidence against the U.S. healthcare system, you should be ready to explain how schlocky that argument really is.

He points to the following claim by Gary Becker:

National differences in life expectancies are a highly imperfect indicator of the effectiveness of health delivery systems. For example, life styles are important contributors to health, and the US fares poorly on many life style indicators, such as incidence of overweight and obese men, women, and teenagers. To get around such problems, some analysts compare not life expectancies but survival rates from different diseases. The US health system tends to look pretty good on these comparisons.

Becker cites a study that finds that the U.S. does better than Europe in cancer survival rates and in the availability of hip and knee replacements and cataract surgery.

It makes a lot of sense to think of health as multidimensional, so that some countries can do better in life expectancy while others do better in hip replacements and cancer survival.

But I disagree with Mankiw's claim that it's "schlocky" to compare life expectancy. If the U.S. really is spending lots more per person on health care and really getting less in life expectancy compared to other countries . . . that seems like relevant information.

I want to explore the distinction between self-experimentation and formal experimentation in the context of a recent discussion on Seth's blog.

The story begins with two people who found, via self-experimentation, how to make their acne go away:

A student . . . had gone on a camping trip and found that her acne went away. At first she thought it was the sunshine; but then, by self-experimentation, she discovered that the crucial change was that she had stopped using soap to wash her face.
A friend of Seth writes: "I started "washing" my face with water about a month ago, and [now] my face is acne free and soft as a pair of brand new UGG boots. [He had had acne for years.]"

In the comments section, someone writes:

While it would be nice to think that all we have to do to get rid of acne is stop using those expensive cleansers and just use water - this is just anecdotal evidence you present. It would require a large clinical trial to be conclusive.

Seth replies that informal experimentation is cheaper and faster than more formal clinical trials. Also, different things might work for different people, so whether or not a treatment has been evaluated in a large study, it might make sense to test it yourself--especially for something such as acne or weight loss that is not an urgent concern.

This got me thinking . . . what are the benefits (if any) of a formal controlled trial? In statistics, we usually frame these benefits by comparing to observational studies. The big risk in an observational study is that the treatment and control groups will differ in important ways (as in the famous hormone replacement therapy story). Is this worth the cost? Maybe. Sometimes.

A related issue is bias, a word which I am using in the conversational rather than the statistical sense. For example, how would you want to evaluate the risks and effectiveness of a new drug that was developed by a pharmaceutical company at the cost of millions of dollars? I'd be suspicious of an observational study: even if conducted by professionals, there just seem to be too many ways for things to be biased.

In Seth's acne example, there is no financial source of bias. And, as Seth points out, the test is free to apply on yourself. If I had a kid with acne, I'd give it a try and do an experiment--which means trying the soap and no-soap conditions on different days (or different weeks, or months) and measuring and recording acne levels. One thing I've gathered from Seth's work is that there are big benefits to be gained by doing self-experimentation with careful measurement and record keeping, rather than simply trying different things and trying to remember what works.

On the other hand, yeah, I'm skeptical about Seth's acne claims, and I think a larger study would be more likely to convince me. But I don't think it would have to be expensive. All Seth (or somebody) needs is to set up a protocol for deciding when to wash with soap or water and a protocol for measuring acne, then he could get a bunch of volunteers to flip coins and try it. This blog has a few thousand readers, and Seth's diet forum has thousands of participants, so it shouldn't be so hard to find people to do this. I'm not so interested in acne myself, but according to Seth (and others, I assume), "acne really matters," so maybe it's worth giving this a try.

The American Statistical Association awarded the 2009 Excellence in Statistical Reporting Award to Sharon Begley of Newsweek. From the official announcement:

The above remark, which came in the midst of my discussion of an analysis of Iranian voting data, illustrates a gap--nay, a gulf--in understanding between statisticians and (many) nonstatisticians, one of whom commented that my quote "makes it sound that [I] have not a shred of a clue what a p-value is."

Perhaps it's worth a few sentences of explanation.

Benford's law is an amusing mathematical pattern in which the first digits of randomly sampled numbers tend to have a distribution in which 1 is the most common first digit, followed by 2, then 3, and so forth. It's the distribution of digits that arises from numbers that are sampled uniformly on a logarithmic scale.
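To make the pattern concrete, here is a small Python sketch. Under Benford's law the probability that the first digit is $d$ is $\log_{10}(1 + 1/d)$, and the simulation below draws numbers uniformly on a logarithmic scale and tabulates their first digits. The function names, the sample size, and the choice of six decades are my own assumptions for illustration.

```python
import math
import random


def benford_probabilities():
    """Benford's law: P(first digit = d) = log10(1 + 1/d) for d = 1..9."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def first_digit(x):
    """First nonzero digit of a positive number, read off its decimal string."""
    return int(str(x).lstrip("0.")[0])


def simulate_first_digit_shares(n, decades=6, seed=0):
    """Sample n numbers uniformly on a log scale over a whole number of
    decades, then return the share of draws starting with each digit."""
    rng = random.Random(seed)
    counts = {d: 0 for d in range(1, 10)}
    for _ in range(n):
        x = 10 ** rng.uniform(0, decades)  # log10(x) is uniform on [0, decades]
        counts[first_digit(x)] += 1
    return {d: count / n for d, count in counts.items()}


expected = benford_probabilities()
observed = simulate_first_digit_shares(20000)
for d in range(1, 10):
    print(d, round(expected[d], 3), round(observed[d], 3))
```

The expected share for a leading 1 is $\log_{10} 2$, about 30%, falling to under 5% for a leading 9. Real data such as vote counts are not generated this way, so this only illustrates the benchmark distribution, not a test of any election.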

In our Teaching Statistics book, Deb and I describe a classroom demonstration where we show how Benford's law applies to street addresses sampled randomly from the telephone book. In a more serious vein, Walter Mebane has written about the application of Benford's law to vote counts.

In the past several days, a few people have asked me about applying these ideas to the recent Iranian election. Today, Stephane Reissfelder pointed me to an article by Boudewijn Roukema, which states:

The results of the 2009 Iranian presidential election presented by the Iranian Ministry of the Interior (MOI) are analysed based on Benford's Law and an empirical variant of Benford's Law. The null hypothesis that the vote count distributions satisfy these distributions is rejected at a significance of p < 0.007, based on the presence of 41 vote counts for candidate K that start with the digit 7, compared to an expected 21.2-22 occurrences expected for the null hypothesis. A less significant anomaly suggested by Benford's Law could be interpreted as an overestimate of candidate A's total vote count by several million votes. Possible signs of further anomalies are that the logarithmic vote count distributions of A, R, and K are positively skewed by 4.6, 5.8, and 2.5 standard errors in the skewness respectively, i.e. they are inconsistent with a log-normal distribution with $p \sim 4 \times 10^{-6}$, $7 \times 10^{-9}$, and $1.2 \times 10^{-2}$ respectively. M's distribution is not significantly skewed.

I don't buy it. First off, the whole first-digit-of-7 thing seems irrelevant to me. Second, the sample size is huge, so a p-value of 0.007 isn't so impressive. After all, we wouldn't expect the model to really be true with actual votes. It's just a model! Finally, I don't see why we should be expecting distributions to be lognormal.

Maybe there's something I'm missing here, but that's my quick take. This is not to say that I think the election was fair, or rigged, or whatever--I have absolutely zero knowledge on that matter--just that I don't find this analysis convincing of anything. I will say, though, that Roukema deserves credit for presenting the analysis clearly.

P.S. In response to comments: let me emphasize that I'm not saying that I think nothing funny was going on in the election. As I wrote, I'm commenting on the statistics, I don't know the facts on the ground. To move my comments in a more constructive direction (I hope), let me pull out this useful comment from Roukema's article: "One possible method to test whether this is just an odd fluke would be to check the validity of the vote counts for candidate K in the voting areas where the official number of votes for K starts with the digit 7." Further investigation could be a good thing here.

I did not find Roukema's argument convincing; that does not mean that I consider it a bad thing that the article was written. The article is a first draft of an analysis; it might end up leading to nothing, or it might be unconvincing as it stands now but lead to some important breakthroughs. We can see what further analysis turns up. Again, my verdict is not a Yes or a No, it's an "I'm not convinced."

The defining values


From Flat Earth News:

You could argue that every profession has its defining value. For carpenters, it might be accuracy: a carpenter who isn't accurate shouldn't be a carpenter. For diplomats, it might be loyalty: they can lie and spy and cheat and pull all sorts of dirty tricks, and as long as they are loyal to their government, they are doing their job. For journalists, the defining value is honesty--the attempt to tell the truth. That is our primary purpose. All that we do--all that is said about us--must flow from the single source of truth-telling.

What is the defining value of statisticians?

P.S. My favorite of the responses below is Mike Anderson's:

Separate the signal from the noise, then look at the noise for more signals.

I like this because (a) it acknowledges the presence of "noise" (that is, variation) but (b) recognizes that the "signal" is what's most important.

Mandelbrot on taxonomy


Taxonomies are fractal with, at any node, some number of branches (typically one or two major branches and several minor ones). Here's a fascinating article by Benoit Mandelbrot from 1955 on models of taxonomic structures. Great stuff. The article was published in Information Theory--3rd London Symposium, ed. Colin Cherry, and is hard to find online. At least it was until now.

mandelbrot2.png

David VandenBos writes:

I stumbled upon your blog a few weeks ago . . . However, a good amount of your technical articles go over my head because of my lack of statistics education/training/experience. Do you have any basic reading suggestions for learning applied statistics? My organization captures tons of info and safely tucks it away into databases, but I'm really interested in learning how to get it out and make use of it.

Does anybody have any suggestions? I like my book with Jennifer but maybe there's something more basic to start with? There's also this online book on statistical graphics by Rafe Donahue which is actually fun to read.

P.S. I don't think any of the usual intro stat books would be good here. I think they focus too much on conventional topics and not enough on applied statistics. Not really the fault of these books: they're designed for the undergraduate curriculum, not for practitioners.

Google Fusion Tables

Google just launched a pre-alpha "Fusion Tables". The visualization capability is okay, the interface is not fully stable, but the cool thing is the ability to merge two tables, something I've spent a lot of time doing manually in the past, or with ad-hoc scripts.

Here's an example where I merge their GDP table with a disease table. I need to pick the "WHO Regions/Country" in the right column, so that both tables get aligned:

fusion-tables.png

Afterwards, I can do a scatter plot of GDP rank (X) with child mortality/1000 (Y):

gdp-child-mortality.png

So, high GDP makes child mortality less likely, but not always, and it's not a correlation.

Even if Fusion Tables is pre-alpha, the table fusion capability makes it immediately useful. The collaboration features look cool, but it will take some time to get them to work right. Then we'll have proper horizontal collaboration.

As I've discussed here on occasion, I like to standardize continuous regression inputs by dividing by two standard deviations. That way the rescaled variables each have sd of 1/2, which is approximately the same sd as any binary predictor, allowing the coefficients to be interpreted together.

Standardizing is often thought of as a stupid sort of low-rent statistical technique, beneath the attention of "real" statisticians and econometricians, but I actually like it, and I think this 2 sd thing is pretty cool.

As Aleks pointed out, however, standardizing based on the data is not strictly Bayesian, because the interpretation of the model parameters then depends on the sample data. As we discussed, a more fully Bayesian approach would be to think of the scale for standardization as an unknown parameter to itself be estimated from the data.

P.S. Recall that "inputs" are not the same as "predictors."

P.P.S. I scale by 2 sd to be consistent with 0/1 predictors. In retrospect, I wish I'd scaled by 1 sd and then coded binary predictors as -1 and 1 to be consistent. This would've been simpler overall. But I think it's too late now.
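
To make the rescaling concrete, here is a minimal sketch in Python. The data and the helper name `rescale_2sd` are made up for illustration; the only thing taken from the text is the rule itself (center, then divide by two standard deviations).

```python
import statistics

def rescale_2sd(values):
    """Center a continuous input and divide by two standard deviations."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample sd (n - 1 denominator)
    return [(v - mean) / (2 * sd) for v in values]

# Hypothetical data, for illustration only.
age = [23, 35, 41, 52, 60, 68, 74, 29]
female = [0, 1, 1, 0, 1, 0, 1, 0]  # binary predictor left on its 0/1 scale

age_z = rescale_2sd(age)

# The rescaled input has sd 1/2 by construction.
print(round(statistics.stdev(age_z), 3))
# A balanced 0/1 predictor also has (population) sd 1/2.
print(round(statistics.pstdev(female), 3))
```

With a less balanced binary predictor the match is only approximate: a 0/1 variable with proportion p has sd sqrt(p(1-p)), which is 0.5 at p = 0.5 and about 0.4 at p = 0.2.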

Statistics police?

The Numbers Guy has an article titled This U.K. Sheriff Cites Officials for Serious Statistical Violations, and a corresponding blog post:

Mobilized by distressingly low levels of public trust in official statistics, the U.K. government is embarking on a daring, and possibly unique, experiment. With broad support, Parliament in 2007 approved the formation of the U.K. Statistics Authority, a group with the budget, authority and independence to question other government agencies on the numbers they release to the public. [...]

The agency's task is a delicate one. If it uncovers reams of faulty data that might have been used in crafting public policy, Britons' fraying faith in public institutions could be further eroded.

Interesting. A truth-assurance agency would be a good thing, and it could also be useful for validating the truthfulness of other statements that often get twisted by marketing. We might finally be making progress with the problems that Josiah Stamp identified many years ago.

1. Coalitions, voting power, and political instability.

Thurs 4 Jun, 3:30pm, Kane Hall 210 at the University of Washington. Part of the Math Across Campus series.

We shall consider two topics involving coalitions and voting. Each topic involves open questions both in mathematics (probability theory) and in political science.
(1) Individuals in a committee or election can increase their voting power by forming coalitions. This behavior yields a prisoner's dilemma, in which a subset of voters can increase their power, while reducing average voting power for the electorate as a whole. This is an unusual form of the prisoner's dilemma in that cooperation is the selfish act that hurts the larger group. The result should be an ever-changing pattern of coalitions, thus implying a potential theoretical explanation for political instability.
(2) In an electoral system with fixed coalition structure (such as the U.S. Electoral College, the United Nations, or the European Union), people in different states will have different voting power. We discuss some flawed models for voting power that have been used in the past, and consider the challenges of setting up more reasonable mathematical models involving stochastic processes on trees or networks.


2. Culture wars, voting and polarization: divisions and unities in modern American politics.

Fri 5 Jun, 9:45am, Kane Hall 225 at the University of Washington. Part of the 10th anniversary celebration of the Center for Statistics and the Social Sciences.

On the night of the 2000 presidential election, Americans sat riveted in front of their televisions as polling results divided the nation's map into red and blue states. Since then the color divide has become a symbol of a culture war that thrives on stereotypes--pickup-driving red-state Republicans who vote based on God, guns, and gays; and elitist, latte-sipping blue-state Democrats who are woefully out of touch with heartland values. But how does this fit into other ideas about America being divided between the haves and the have-nots? Is political polarization real, or is the real concern the perception of polarization? We address these questions using results from our own research and that of others.


3. Creating structured and flexible models: some open problems.

Mon 8 Jun, 11am, Fairmont Lounge, St. John's College, 2111 Lower Mall, University of British Columbia. Statistics Department seminar.

A challenge in statistics is to construct models that are structured enough to be able to learn from data but not be so strong as to overwhelm the data. We introduce the concept of "weakly informative priors" which contain important information but less than may be available for the given problem at hand. We also discuss some related problems in developing general models for taxonomies and deep interactions. We consider how these ideas apply to problems in social science and public health. If you don't walk out of this talk a Bayesian, I'll eat my hat.


4. Red state, blue state, rich state, poor state.

Mon 8 Jun, 3pm, Fairmont Lounge, St. John's College, 2111 Lower Mall, University of British Columbia. Statistics Department seminar.


If you come to any of these, please ask lots of questions!

P.S. I've never spoken at UBC, but I have given a couple of talks in the statistics department at UW. The first time was twenty years ago. The talk went OK, I think--it was on medical imaging--but I did a horrible thing by leading off with a joke. I could probably get away with that now, but it didn't go over well then. In my defense, the joke was related to the topic of the talk. But it was a pretty bad joke. The second talk was about twelve years ago. The topic was model checking in spatial statistics. I think it went fine, but I recall that there was one spatial statistics expert in the audience who was disappointed at how simple my model was. It worked ok for what we were doing, though.

To paraphrase Bill James, the alternative to doing statistics is not "not doing statistics," it's "doing bad statistics."

Some people bemoan the excessive quantitative nature of academic political science nowadays. I certainly agree that there's room for nonquantitative work, but you also want to have some people who know their way around numbers. Or else you'll end up with this sort of horrible non-analysis by David Runciman of U.S. elections. What's striking about Runciman's article--and he's a well-respected political theorist, I'm sure--is that he relies on statistics all over the place. He just doesn't know what he's talking about--and, even worse, doesn't seem to know that he doesn't know.

I mouth off all the time about things I don't know about. But at least when I go on about Karl Popper, for example, I ground it in my own experience as a researcher, I don't just spout off in general.

Anyway, my point is not to pick on Runciman for a year-old article that he probably whipped off in a couple of hours and maybe already regrets. I'm just using it as an example of how people who don't know statistics are doomed to rely on statistics all the same.

Just as Bill James pointed out how fans who hate sabermetrics (and all it stands for) were forming all sorts of misinformed opinions based on batting averages and the like.

Here.

Stephen Senn quips: "A theoretical statistician knows all about measure theory but has never seen a measurement whereas the actual use of measure theory by the applied statistician is a set of measure zero."

Which reminds me of Lucien Le Cam's reply when I asked him once whether he could think of any examples where the distinction between the strong law of large numbers (convergence with probability 1) and the weak law (convergence in probability) made any difference. Le Cam replied, No, he did not know of any examples. Le Cam was the theoretical statistician's theoretical statistician, so there's your answer.

The other comment of Le Cam's that I remember was his comment when I showed him my draft of Bayesian Data Analysis. I told him I thought that chapter 5 (on hierarchical models) might especially interest him. A few days later I asked him if he'd taken a look, and he said, yes, this stuff wasn't new, he'd done hierarchical models back when he'd been an applied Bayesian back in the 1940s.

A related incident occurred when I gave a talk at Berkeley in the early 90s in which I described our hierarchical modeling of votes. One of my senior colleagues--a very nice guy--remarked that what I was doing was not particularly new; he and his colleagues had done similar things for one of the TV networks at the time of the 1960 election.

At the time, these comments irritated me. But, from the perspective of time, I now think that they were probably right. Our work in chapter 5 of Bayesian Data Analysis is--to put it in its best light--a formalization or normalization of methods that people had done in various particular examples and mathematical frameworks. (Here I'm using "normalization" not in the mathematical sense of multiplying a function by a constant so that it sums to 1, but in the sociological sense of making something more normal.) Or, to put it another way, we "chunked" hierarchical models, so that future researchers (including ourselves) could apply them at will, allowing us to focus on the applied aspects of our problems rather than on the mathematics.

To put it another way: why did Le Cam's hierarchical Bayesian work in the 1940s and my other colleague's work in the 1960s not lead to more widespread use of these methods? Because these methods were not yet normalized--there was not a clear separation between the math, the philosophy, and the applications.

To focus on a more specific example, consider the method of multilevel regression and poststratification ("Mister P"), which Tom Little and I wrote about in 1997, then David Park, Joe Bafumi and I picked back up in 2004, and then finally took off with the series of articles by Jeff Lax and Justin Phillips (see here and here). This is a lag of over 10 years, but really it's more than that: when Tom and I sent our article to the journal Survey Methodology back in 1996, the reviews said basically that our article was a good exposition of a well-known method. Well-known, but it took many many steps before it became normalized.

Edo Airoldi writes:

We have two postdoctoral fellowships available for up-to three years, with competitive salary and travel support. The two postdocs are expected to contribute to active research projects, including:

1. Development of statistical methodology, algorithms, and theory for analyzing complex graphs and dynamical systems,

2. Analysis of coordinated regulatory mechanisms driving the cell cycle, metabolism, and environmental responses in yeast, bacteria, and cancer systems,

3. Analysis of signaling and metabolic pathways in dynamic chemical contexts, and

4. Development of dynamic models of coalition formation and stability.

Handy statistical lexicon

These are all important methods and concepts related to statistics that are not as well known as they should be. I hope that by giving them names, we will make the ideas more accessible to people:

Mister P: Multilevel regression and poststratification.

The Secret Weapon: Fitting a statistical model repeatedly on several different datasets and then displaying all these estimates together.

The Superplot: Line plot of estimates in an interaction, with circles showing group sizes and a line showing the regression of the aggregate averages.

The Folk Theorem: When you have computational problems, often there's a problem with your model.

The Pinch-Hitter Syndrome: People whose job it is to do just one thing are not always so good at that one thing.

Weakly Informative Priors: What you should be doing when you think you want to use noninformative priors.

P-values and U-values: They're different.

Conservatism: In statistics, the desire to use methods that have been used before.

WWJD: What I think of when I'm stuck on an applied statistics problem.

Theoretical and Applied Statisticians, how to tell them apart: A theoretical statistician calls the data x, an applied statistician says y.

The Fallacy of the One-Sided Bet: Pascal's wager, lottery tickets, and the rest.

Alabama First: Howard Wainer's term for the common error of plotting in alphabetical order rather than based on some more informative variable.

The USA Today Fallacy: Counting all states (or countries) equally, forgetting that many more people live in larger jurisdictions, and so you're ignoring millions and millions of Californians if you give their state the same space you give Montana and Delaware.

Second-Order Availability Bias: Generalizing from correlations you see in your personal experience to correlations in the population.

The "All Else Equal" Fallacy: Assuming that everything else is held constant, even when it's not gonna be.

The Self-Cleaning Oven: A good package should contain the means of its own testing.

The Taxonomy of Confusion: What to do when you're stuck.

The Blessing of Dimensionality: It's good to have more data, even if you label this additional information as "dimensions" rather than "data points."

Scaffolding: Understanding your model by comparing it to related models.

I know there are a bunch I'm forgetting; can youall refresh my memory, please? Thanks.

P.S. No, I don't think I can ever match Stephen Senn in the definitions game.

Data.gov

Hal Daume pointed me to this. I haven't tried it out yet, but it looks like the right idea.

Traffic map update

Commenters pointed out that the map to which I linked yesterday actually shows the number of people entering each station, not, as implied by the visual structure of the map, the traffic on the subway lines between the stations. I agree with the commenters that line width doesn't seem like a good way to show information that is at the station level. Better to use differently-sized circles or something like that.

But this sets up a fun statistical problem: estimate the traffic on the subway lines given the data on the number of people entering each station (along with any other available data, and whatever modeling assumptions are needed to complete the picture). I guess there must be people at the transportation dept. doing this sort of thing, but I wouldn't be surprised if they're using deterministic solve-for-x algorithms that could be improved by a more statistical approach.

P.S. Richard Clegg writes in:

As you surmised this is a well-studied problem. Actually in the field of road transport this would be broken into two separate but related problems -- the origin-demand matrix estimation problem (given a set of observations what set of demands from origin to destination best explain them) and the related traffic assignment problem (given an origin demand matrix and a network with limited capacity on links how does one assign traffic onto network links).

In particular the traffic assignment problem has some attractive statistical properties if certain assumptions are made.

I replied:

About 25 yrs ago I worked on finite-element methods for thermal models, so I figured the mathematics would be similar. As noted on blog, I suspect that inclusion of some stochastic elements to the problem could improve things as well as extend the range of problems to which these methods could be applied.

And Clegg added the following:

For the origin-demand matrix problem there are a variety of approaches both frequentist and Bayesian -- I am far from an expert here (but hope to be more expert soon since I am involved with a grant proposal on the subject which I am hoping will be funded). For the traffic assignment problem there are a number of approaches, "deterministic" and "stochastic" to varying degrees. In the stochastic approach you make certain assumptions about how users disperse across routes of different costs (by assuming an error distribution on the user's perception of route costs -- as it turns out, a Gumbel distribution often produces "nice" answers). There are even the so-called "doubly stochastic" problems where the demand from each origin to each destination is assumed to have a distribution and then users perceive routes imperfectly according to another distribution. If you google "Stochastic user equilibrium" you will find more about the problem than you ever wanted to know.
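
One reason the Gumbel assumption gives "nice" answers is that independent Gumbel perception errors lead to closed-form (multinomial logit) route shares. The sketch below illustrates that on a toy problem; the route costs, the sensitivity parameter `theta`, and the function names are all hypothetical, and congestion effects (costs depending on flow) are ignored, so this is a single route-choice step rather than a full stochastic user equilibrium.

```python
import math
import random

def logit_route_shares(costs, theta=1.0):
    """Closed-form shares implied by i.i.d. Gumbel perception errors."""
    weights = [math.exp(-theta * c) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]

def standard_gumbel(rng):
    """Draw from a standard Gumbel distribution by inverting its CDF."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return -math.log(-math.log(u))

def simulated_route_shares(costs, theta=1.0, n_users=100_000, seed=0):
    """Each simulated user picks the route with the best perceived utility."""
    rng = random.Random(seed)
    counts = [0] * len(costs)
    for _ in range(n_users):
        perceived = [-theta * c + standard_gumbel(rng) for c in costs]
        counts[perceived.index(max(perceived))] += 1
    return [k / n_users for k in counts]

# Hypothetical travel costs (say, minutes) on three competing routes.
costs = [10.0, 12.0, 15.0]
closed_form = logit_route_shares(costs, theta=0.5)
simulated = simulated_route_shares(costs, theta=0.5)
print([round(p, 3) for p in closed_form])
print([round(p, 3) for p in simulated])
```

The two printed lists should agree up to simulation noise, which is the sense in which the Gumbel choice is convenient: no simulation is needed to get the shares.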

Sounds good. I also expect there's some room for improvement using hierarchical modeling.

I've made this point before, but I just received an email on the topic and so I thought I'd point youall to section 3.3 of this article of mine from 2003 where I make the argument in detail.

This article--A Bayesian Formulation of Exploratory Data Analysis and Goodness-of-fit Testing--is one of my favorites. It also features:
- A potted history of Bayesian inference (section 2.1)
- The first published definition (I think) of U-values and P-values (section 2.3)
- A model-checking perspective on the problem of degenerate estimates for mixture models (section 3.1)
- Why this isn't all obvious (section 5)

The article is based on a presentation I gave a year earlier at a conference. It was supposed to appear in the proceedings volume, but it was late, and the conference organizer was so annoyed he refused to include it. So I published it in the International Statistical Review instead. A year later I published a related article, Exploratory Data Analysis for Complex Models, as a discussion paper in the Journal of Computational and Graphical Statistics. That second article is more coherent, but personally I prefer the International Statistical Review article because it covers so many little topics that don't fit into existing theories of inference. I think of these examples as analogous to the quantum anomalies that toppled classical physics around 1900. In this case, what I want to topple is classical Bayesian inference--by which I mean Bayesian theory that does not include model building and model checking.

Groves rules out use of sampling in 2010 census:

President Barack Obama's pick to lead the Census Bureau on Friday ruled out the use of statistical sampling in the 2010 head count, seeking to allay GOP concerns that he might be swayed to put politics over science. Robert M. Groves, a veteran survey researcher from the University of Michigan, also testified during his confirmation hearing that he remains worried about fixing a persistent undercount of hard-to-reach populations . . . Census officials have already acknowledged that tens of millions of residents in dense urban areas -- about 14 percent of the U.S. population -- are at high risk of being missed because of language problems and an economic crisis that has displaced homeowners.

My comments:

I have a great respect for Bob Groves, and I would trust his decisions on what to do with the Census more than I would trust my own.

Bob's statement that "there is simply no time to prepare for it" seems eminently reasonable to me, especially given the cost constraints under which the census operates. On the statistical merits of the issue, I'm pretty sure that adjusted numbers would be better than unadjusted numbers. The census people know what they're doing, and there are known problems of nonresponse, and, for anything where I care about the damn answer, I'd use their adjusted estimates over the raw numbers.

As a social scientist, I hope the census bureau could release two sets of numbers, one unadjusted for political reasons and one adjusted for those of us who want the most accurate inferences possible.

That said, I'm ignoring a possible indirect effect of adjusting the numbers: If people know that the census will do adjustment, maybe they'll be less likely to participate in the enumeration in the first place. It's hard to measure such an effect and, hey, it might be important. I don't know.

I'm not thinking so much of individuals deciding whether to respond to the census, but rather of the decisions of local jurisdictions, where various spending formulas depend on population. For example, if it's known that the census won't be adjusted, then I'd expect the government of New York City to put a lot of effort into convincing people to participate. If it is known that the census will be adjusted, then there'd be a lot less motivation for localities to do what it takes to boost participation.

Conditional on the data already being collected, you'd definitely want to make statistical adjustments; it's a tougher call to decide on this ahead of time. Also, if you know for sure you won't be adjusting, this will affect the effort you put into collecting the data in different places. So if you're not going to adjust, you might as well make that decision right away.

P.S. To expand on this slightly, I think any debates over census adjustments are fundamentally political debates, not statistical disagreements. The scientific consensus on adjustment is pretty easy (although people can argue about the details of implementation, as noted by Lawrence in comments below). It's the political consensus that's difficult, as there are clear winners and losers. With a lack of political consensus, all you need is a little bit of dust and confusion in the air to give a sense of a lack of scientific consensus, which then gets piped back in to justify inaction in the political process.

Andrew Grogan-Kaylor writes:

Graphs of subway ridership

Recently on Gothamist, there was a post about this site. It depicts subway ridership since 1905, as measured at each subway stop (by annual recorded entries).

I wish that the graphs were click-able to enlarge them; though it's fun to look at this way, it's tough to compare the graphs with that tiny size. You can zoom in slightly to display the station names, which is nice. It seems as though this is still a work in progress on some fronts, however: you can't zoom in or out too far or you lose the map altogether.

Survey analysis in R

There's a lot of good stuff here (from Thomas Lumley). It's all classical stuff--no small-area estimation, no Mister P, etc.--but the classical stuff is still pretty useful. The "survey" package for R looks pretty good; in particular, it allows you to specify the survey design, which is a big step beyond simply specifying survey weights.

I'd also like to recommend Sharon Lohr's book from 1999. When's the second edition coming out?

Ian Ayres refers to the research by Brett Pelham, Matthew Mirenberg, and John Jones that people are likely to have names that are related to their occupations, places of birth, etc. Pelham et al. write:

Taken together, the names Jerry and Walter have an average frequency of 0.416%, compared with a frequency of 0.415% for the name Dennis. Thus, if people named Dennis are more likely than people named Jerry or Walter to work as dentists, this would suggest that people named Dennis do, in fact, gravitate toward dentistry. A nationwide search focusing on each of these specific first names revealed 482 dentists named Dennis, 257 dentists named Walter, and 270 dentists named Jerry.

In his blog, Ayres referred to this finding but wrote:

To be honest, I [Ayres] am not fully persuaded that either of these results is true.

I think that Ayres is saying this because the effect sounds so large: Even if there really were something going on, could it really explain the difference between 482 and 257, nearly a factor of 2?

Let me repost a simple conditional probability calculation that might put Ayres's mind at ease:

There were 482 dentists in the United States named Dennis, as compared to only about 260 that would be expected simply from the frequencies of Dennises and dentists in the population. On the other hand, the 222 "extra" Dennis dentists are only a very small fraction of the 620,000 Dennises in the country; this name pattern thus is striking but represents a small total effect. If we assume that 222 of these Dennises are "extra" dentists--choosing the profession just based on their name--that gives 222/620000 = .036% of Dennises choosing their career using this rule. I can certainly believe that the naming effect could be as high as .036%.
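
The arithmetic can be checked in a few lines. The three input numbers are the ones quoted above; the variable names are just for illustration.

```python
observed_dennis_dentists = 482
expected_dennis_dentists = 260   # approximate expected count, as quoted in the text
dennises_in_us = 620_000

# "Extra" Dennis dentists beyond what name and profession frequencies predict.
extra = observed_dennis_dentists - expected_dennis_dentists

# Share of all Dennises who would have to be choosing dentistry because of their name.
share = extra / dennises_in_us

print(f"{extra} extra dentists, {share:.4%} of Dennises")
# prints: 222 extra dentists, 0.0358% of Dennises
```

So the count of Dennis dentists nearly doubles relative to expectation, yet the implied share of Dennises affected is a few hundredths of one percent.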

What percentage of people pick their job based on their name?

And here is my quick calculation that approximately 1% of Americans choose their career based on their first name:

James Heckman recently posted this article, which is based on a paper from 1980. (This sort of thing happens; for example, I just published an article based on work from 1986.) Heckman's tongue-in-cheek article begins:

This paper uses data available from the National Opinion Research Center's (NORC) survey on religious attitudes and powerful statistical methods to evaluate the effect of prayer on the attitude of God toward human beings.

He sets up a model for the intensity of prayer, given its effectiveness. The key assumption is as follows:

Accept on faith that the conditional density of x [the intensity of prayer in the population] given y [God's attitude arrayed on a scale ranging from 0 to 1] is of the form g(x|y) = a(y) exp(xy).

That is, the higher y is, the more prayer we'd see, which makes sense. (Heckman labels the function a(y) as "unknown," but, unless I'm missing something, a(y) is a normalizing constant that can be calculated in closed form by integrating exp(xy) over x. Perhaps this mistake, if it is one, can be caught before the article appears in press.)
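
As a sketch of the closed-form claim, suppose (this is an assumption, not something stated in the passage quoted above) that prayer intensity x is scaled to lie in [0, 1]. Some bounded range is needed, since for y > 0 the integral of exp(xy) over all x ≥ 0 diverges. Under that assumption the normalizing constant follows directly from requiring g(x|y) to integrate to 1 over x:

```latex
\[
  1 = \int_0^1 a(y)\, e^{xy}\, dx
    = a(y)\, \frac{e^{y} - 1}{y}
  \quad\Longrightarrow\quad
  a(y) = \frac{y}{e^{y} - 1} \quad (y > 0),
  \qquad a(0) = 1 .
\]
```

A different support for x would change the formula but not the conclusion that a(y) is determined by the model rather than unknown.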

Given the reasonable enough model above, Heckman points out that you can differentiate the density of x and learn something about the distribution of y, the effectiveness of prayer.

What does it all mean?

Of course Heckman is joking, but it appears he might be making a more serious point when he comments:

Provided conditional density (1) is assumed, we do not need to observe a variable in order to compute its conditional expectation with respect to another variable whose density can be estimated. For example, one can extend current empirical work in a variety of areas of economics to estimate the effect of income on happiness or the effect of income inequality on democracy.

I don't think this is literally an issue. True, all four of the variables Heckman mentions--income, happiness, income inequality, and democracy--can only be measured with error, but certainly they can be (and are) measured when they are studied empirically.

But I got a little worried that maybe there's something more going on here, some reason I should be giving a little less credence to studies linking economics to psychology and political science. Is Heckman implying that those cross-disciplinary studies have, at bottom, no more foundation than his argument on the effectiveness of prayer?

So I went back to Heckman's article to try to find the flaw in the reasoning. (By "flaw," I don't mean that Heckman was making a mistake; rather, I'm speaking of the hidden logical flaw that makes the reasoning flow, just as in those mathematical arguments where you "prove" 1=0 by means of a series of algebraic expressions that include a division-by-zero.)

Rereading carefully, I found the flaw. I actually think this article would be a good one for a take-home exam in a theoretical statistics class. I'll give the answer below.

Maybe because I spend so much time working with numbers, I'm as interested in the process of statistics as in its outcomes.

A couple months ago I told you about my struggles with the GDP of Russia and how I had inadvertently become entangled with the question.

More recently, I heard about Dick Morris's claim that, "In the last five months, according to the Federal Reserve Board, the money supply in the United States has increased by 271 percent."

271%??? Where did that implausible-looking number come from? Bill Peterson traced this to a 27.1% (note the decimal point) annualized rate of growth in M1 reported on a Federal Reserve website. So it sounded like a simple case of innumeracy (compounded by some partisan foolishness on Morris's part that, I argued here, doesn't do the Republican Party any favors).
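
To see how far apart the two readings are, here is a rough calculation. The compounding convention used to un-annualize the rate is an assumption (the source of the 27.1% figure may annualize differently), so both a compounded and a simple version are shown.

```python
annual_rate = 0.271   # the 27.1% annualized M1 growth rate quoted above
months = 5

# Growth over five months if the annual rate compounds (assumed convention).
compounded = (1 + annual_rate) ** (months / 12) - 1

# Growth over five months under simple proportional scaling (alternative convention).
simple = annual_rate * months / 12

# What a literal "271 percent increase" would mean for the level of the money supply.
claimed_multiple = 1 + 2.71

print(f"compounded five-month growth: {compounded:.1%}")
print(f"simple five-month growth:     {simple:.1%}")
print(f"a 271% increase multiplies the level by {claimed_multiple:.2f}")
```

Either way the five-month growth implied by a 27.1% annualized rate is on the order of 10 to 11 percent, while a 271% increase would mean the money supply ending up at nearly four times its starting level.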

But then an anonymous commenter wrote, "Dick Morris was referring to the Federal Reserve Adjusted Monetary Base which did, in fact, grow by a multiple between 2.5x-3x in the five months spanning October, 2008 through March, 2009." The commenter provided a couple of links and concluded,

In short, Dick Morris is right and you are wrong. I believe it is called a cruel irony when you publicly mock someone's intelligence only to find out subsequently that they are correct and you, well, you stepped in it.

I've made mistakes before and so it hardly shocked me that I got something wrong again! Apparently I'd been too quick to believe the Chance News entry that had gotten me started on this. In retrospect, it seemed pretty silly that I was so quick to trust the zero-budget Chance News while disparaging the well respected newspaper, The Hill (where Morris's column had originally appeared).

At this point, I really wanted to see the "271%" so I could issue a full-throated retraction. Unfortunately (or, maybe, fortunately, in the sense that it led to this story), when I followed the links supplied by the commenter, I could not find a 271% growth in the money supply anywhere! Which led back to the original puzzle of where the number came from. Was it simply a mis-transcribing of the 27.1%, or was there something else going on?

I was reminded of a legal consulting project I once worked on, where the statistician on the other side had done an analysis which I had then replicated, getting completely different results. But I didn't feel confident about my own claims until I tracked down how the other guy had done it wrong. It took me 2 hours to get the correct answer myself and to check it to my satisfaction [amusingly, I first typed "statisfaction" there], and 6 hours to get into the problem in enough depth to figure out what the other statistician had done wrong. (I bill by the hour so I remember these time totals. And, believe me, the other guy billed lots lots more than 8 hours to get his wrong answer!)

OK, back to Dick Morris's 271%. The latest insight came from Robert Waldmann, who commented as follows:

I [Waldmann] think I understand how he missed the damned dot, overlooked the concept of "annualised" and decided to call a 271% increase "tripling" not "almost quadrupling".

He mixed up H and M1. The monetary base has roughly tripled I think (and if I'm wrong well Morris is ignorant too).

If he didn't know about money multipliers, the money supply process, fractional reserve banking and my mother's maiden name (all equally certain) he might think this meant the money supply tripled. So he sends his long-suffering research assistant to find the proof that the money supply tripled. The poor unfortunate guy came back with the number which Morris misread due to the fact that "He puts ideology first and the [data] a distant second."

This story has the ring of truth to it: the research assistant was sent to do an impossible task, and Morris's ideology blocked him from realizing the mistake. (And, presumably, nobody edits his column at The Hill.) Interesting.

I remain ignorant regarding the money supply. One of the few things I remember from economics class in 11th grade is that "the money supply" is not well defined because of the presence of nonmonetary assets such as stocks, bonds, real estate, etc., as well as checking accounts and the like.

P.S. I'm still waiting for the anonymous commenter to come back to me with more data. I still think it's possible that there's a 271% in there (or, at least, "a multiple between 2.5x-3x," as the commenter claimed) that makes sense and that I just didn't know where to look.

Popularity (of a sort)


Morgan Ryan sent me this quote:

"Perhaps the most perplexing part of the study lay in the attitude of the statisticians, who showed no enthusiastic confidence in their own figures. They should have reached certainty, but they talked like other men who knew less. The method did not result in faith. . . . at last, a scholar, fresh in arithmetic and ignorant of algebra, fell into a superstitious terror of complexity as the sink of facts." --Education of Henry Adams

Wisdom from the Meng


Here [link fixed]. I love this stuff.

Dan Kahan writes:

Hi. I'm wondering if you -- and readers of your blog -- have a take on how to preempt the mistake of construing overlapping confidence intervals as indicating that distinct predictors (e.g, two treatments in an experiment) do not have a significantly different effect. See Schenker, N. & Gentleman, J.F. On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals. Am. Stat. 55, 182-186 (2001). The mistake is common enough to make me fret about using really nice bar plots w/ CIs when the CIs overlap. One can always point out in the text that it is a mistake to see overlapping CIs as indicating lack of a significant difference, and then report the relevant difference of the relevant point estimates & the CI associated with *that* difference, but since having to use additional text to explain how to interpret a figure undermines the whole point of using a figure, I'm wondering if there are better reporting or graphic-display strategies I'm unaware of.

My reply: I'm not as worried as you might expect by this, as statistical significance is pretty arbitrary anyway. I'm more worried about people not realizing that the difference between "significant" and "not significant" is not itself statistically significant. Ultimately, if there's a particular comparison you want people to make, you have to make it yourself, and if there are any comparisons that you don't want people to make, it's best to explicitly tell them not to do it.
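Both points can be checked with a little arithmetic. Here is a minimal Python sketch; the numbers are invented for illustration (they are not from Kahan's experiments or from Schenker and Gentleman), and it assumes independent, approximately normal estimates with the usual 1.96 cutoff.

```python
import math

def ci95(est, se):
    """Approximate 95% interval for a normally distributed estimate."""
    return (est - 1.96 * se, est + 1.96 * se)

def z_for_difference(est_a, se_a, est_b, se_b):
    """z-score for (b - a), assuming the two estimates are independent."""
    return (est_b - est_a) / math.sqrt(se_a ** 2 + se_b ** 2)

# Case 1 (hypothetical numbers): the two intervals overlap, yet the
# difference between the estimates is more than 1.96 standard errors from 0.
ci_a = ci95(10.0, 1.0)
ci_b = ci95(13.0, 1.0)
intervals_overlap = ci_a[1] > ci_b[0]
z_case1 = z_for_difference(10.0, 1.0, 13.0, 1.0)

# Case 2 (hypothetical numbers): one estimate is "significant" and the other
# is not, yet the difference between them is not itself significant.
z_first = 25.0 / 10.0
z_second = 10.0 / 10.0
z_case2 = z_for_difference(10.0, 10.0, 25.0, 10.0)

print("overlap:", intervals_overlap, "z for difference:", round(z_case1, 2))
print("z values:", z_first, z_second, "z for difference:", round(z_case2, 2))
```

The standard error of a difference is sqrt(se_a^2 + se_b^2), which is smaller than se_a + se_b; that is why eyeballing overlap is a more conservative check than testing the difference directly.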

The Ph.D. students of Columbia's statistics department have arranged a one-day conference centered on student research, in memory of Minghui Yu, our student who tragically died last year. Conference information is here, and the schedule of speakers and topics is here. It looks like an interesting mix of topics.

Alex Frankel sent in this:

A professor at Oxford University and his team have perfected a model whereby they can calculate whether the relationship will succeed. In a study of 700 couples, Professor James Murray, a maths expert, predicted the divorce rate with 94 per cent accuracy. His calculations were based on 15-minute conversations between couples who were asked to sit opposite each other in a room on their own and talk . . . Professor Murray and his colleagues recorded the conversations and awarded each husband and wife positive or negative points depending on what was said. Partners who showed affection, humour or happiness as they talked were given the maximum points, while those who displayed contempt or belligerence received the minimum. . . .

I looked up James Murray and couldn't find any article describing these results; 94% accuracy sounds pretty good to me, but it's difficult to make any comment based only on news reports. It appears, though, that Murray's main home is the University of Washington, not Oxford--at least, there seems to be a lot more info on Murray at UW than at Oxford--and he's cowritten a book on The Mathematics of Marriage, so this isn't a new area for him.

There must be a bit of a discussion of this sort of thing in the clinical psychology literature? Perhaps this would be a good topic for teaching logistic regression forecasting, better than our usual boring examples.

One thing about the news report puzzled me, though; at the end, it says:

The forecast of who would get divorced in his study of 700 couples over 12 years was 100 per cent correct, he said. But "what reduced the accuracy of our predictions was those couples who we thought would stay married and unhappy actually ended up getting divorced".

Huh?? If the accuracy was 100%, then what does he mean by "what reduced the accuracy of our predictions"? Were they hoping for 110%?

Self-experimentation


Jimmy sent this along:

Still, Mr. Perry wondered whether caffeine would help him. When he retired from rowing last July, he decided to do a randomized, blinded, placebo-controlled experiment on himself.

Strangled by data?


A frustrated ex-Googler writes:
Yes, it's true that a team at Google couldn't decide between two blues, so they're testing 41 shades between each blue to see which one performs better. I had a recent debate over whether a border should be 3, 4 or 5 pixels wide, and was asked to prove my case. I can't operate in an environment like that. I've grown tired of debating such miniscule design decisions. There are more exciting design problems in this world to tackle.

So, Google observes people and their clicks to determine the color or line thickness. When your software phones back every time it is used, it's like having a microphone or camera in a car that detects every mistake, or that measures the response time.

It is easy to optimize the line thickness, but it's more difficult to optimize the overall design of the study. When your working day has 16 hours, and you spend 15 of them on optimization, there is not much time left for new designs.

The analogy carries over to statistical practice: your model is only as good as the data you're using. And the data, while plentiful and accurate, might be preventing you from solving the problem, looking for keys under the lamp post. Methodology can often be just as constraining as the data.

Over the past few decades, most policy programs were focused on remediation based on easily measured demographic variables, such as age, gender, income, race, education, ideology, ability - at the expense of variables that are harder to model and measure, such as honor, talent, potential, trustworthiness, motivation.

Following my skeptical discussion of their article on the probability of a college basketball team winning after being ahead or behind by one point at halftime, Jonah Berger and Devin Pope sent me a long and polite email (with graph attached!) defending their analysis. I'll put it all here, followed by my response. I'm still skeptical on some details, but I think that some of the confusion can be dispelled with a minor writing change, where they make clear that their 6.6% estimate is a comparison to a model.

Berger and Pope's first point was a general discussion about their methods:

John Shonder pointed me to this discussion by Justin Wolfers of this article by Jonah Berger and Devin Pope, who write:

In general, the further individuals, groups, and teams are ahead of their opponents in competition, the more likely they are to win. However, we show that through increasing motivation, being slightly behind can actually increase success. Analysis of over 6,000 collegiate basketball games illustrates that being slightly behind increases a team's chance of winning. Teams behind by a point at halftime, for example, actually win more often than teams ahead by one. This increase is between 5.5 and 7.7 percentage points . . .

This is an interesting thing to look at, but I think they're wrong. To explain, I'll start with their data, which are 6572 NCAA basketball games where the score differential at halftime is within 10 points. Of the subset of these games with one-point gaps at halftime, the team that's behind won 51.3% of the time. To get a standard error on this, I need to know the number of such games; let me approximate this by 6572/10=657. The s.e. is then .5/sqrt(657)=0.02. So the simple empirical estimate with +/- 1 standard error bounds is [.513 +/- .02], or [.49, .53]. Hardly conclusive evidence!
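The back-of-the-envelope calculation above is easy to reproduce. This sketch uses only the numbers quoted in the text; the count of one-point games is the same rough 6572/10 approximation, not the actual count from the paper.

```python
import math

n_games = 6572                # games within 10 points at halftime (from the text)
n_one_point = n_games / 10    # rough approximation used above, not the real count
p_hat = 0.513                 # share of one-point games won by the trailing team

# Conservative binomial standard error, taking p close to 0.5
se = 0.5 / math.sqrt(n_one_point)

lower, upper = p_hat - se, p_hat + se
z = (p_hat - 0.5) / se        # distance from a coin flip, in standard errors

print(round(se, 3), round(lower, 2), round(upper, 2), round(z, 2))
```

The quantity z is the gap between 51.3% and 50% measured in standard errors, which is the comparison the next paragraph relies on.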

Given this tiny difference of less than 1 standard error, how could they claim that "being slightly behind increases a team's chance of winning . . . by between 5.5 and 7.7 percentage points"?? The point estimate looks too large (6.6 percentage points rather than 1.3) and the standard error looks too small.

What went wrong? A clue is provided by this picture:

Halfscore.jpg

As some of Wolfers's commenters pointed out, this graph is slightly misleading because all the data points on the right side are reflected on the left. The real problem, though, is that what Berger and Pope did is to fit a curve to the points on the right half of the graph, extend this curve to 0, and then count that as the effect of being slightly behind.

This is wrong for a couple of reasons.

First, scores are discrete, so even if their curve were correct, it would be misleading to say that being behind increases your chance of winning by 6.6 percentage points. Being behind by one point takes you from a 50% chance of winning (at a differential of 0, the way they set up the data) to 51% (+/- 2%). Even taking the numbers at face value, you're talking about 1 percentage point, not their claimed 5 or more.

Second, their analysis is extremely sensitive to their model. Looking at the picture above--again, focusing on the right half of the graph--I would think it would make more sense to draw the regression line a bit above the point at 1. That would be natural but it doesn't happen here because (a) their model doesn't even try to be consistent with the point at 0, and (b) they do some ridiculous overfitting with a 5th-degree polynomial. Don't even get me started on this sort of thing.

What would I do?

I'd probably start with a plot similar to their graph above, but coding score differential consistently as "home team score minus visiting team score." Then each data point would represent a different set of games, and one could fit a line and see what comes out. And I'd fit linear functions (on the logit scale), not 5th-degree polynomials. And I'd get more data! The big issue, though, is that we're talking about maybe a 1% effect, not a 7% effect, which makes the whole thing a bit less exciting.
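Here is a rough sketch of what "a line on the logit scale" could look like in code. Everything in it is an assumption for illustration: the counts are placeholders generated from made-up coefficients (TRUE_A, TRUE_B), not Berger and Pope's data, and the fitting routine is a plain Newton's-method logistic regression written out by hand so that it needs no libraries.

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_logit_line(diffs, games, wins, iterations=25):
    """Fit P(home win) = logistic(a + b * diff) to grouped counts
    by Newton's method on the binomial log-likelihood."""
    a, b = 0.0, 0.0
    for _ in range(iterations):
        g_a = g_b = 0.0              # gradient components
        h_aa = h_ab = h_bb = 0.0     # entries of the (negative) Hessian
        for d, n, w in zip(diffs, games, wins):
            p = logistic(a + b * d)
            resid = w - n * p
            weight = n * p * (1.0 - p)
            g_a += resid
            g_b += resid * d
            h_aa += weight
            h_ab += weight * d
            h_bb += weight * d * d
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system H * step = gradient
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    return a, b

# Placeholder inputs (hypothetical, NOT real game data): halftime differential
# coded as home score minus visiting score, with expected win counts generated
# from assumed coefficients so we can check that the fit recovers them.
TRUE_A, TRUE_B = 0.15, 0.12
diffs = list(range(-10, 11))
games = [300.0] * len(diffs)
wins = [n * logistic(TRUE_A + TRUE_B * d) for d, n in zip(diffs, games)]

a_hat, b_hat = fit_logit_line(diffs, games, wins)
print(round(a_hat, 3), round(b_hat, 3))
```

With real data, the interesting check would be whether the observed proportions at -1 and +1 sit noticeably off this two-parameter line, rather than reading an effect off the extrapolation of a high-degree polynomial.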

P.S. It's cool that Berger and Pope tried to do this analysis. I also appreciate that they attempted to combine sports data with a psychological experiment, in the spirit of the (justly) celebrated hot-hand paper. I like that they cited Hal Stern. And, even discounting their exaggerated inferences, it's perhaps interesting that teams up by 1 point at halftime don't do better. This is just what happens when studies get publicized before peer review. Or, to put it another way, the peer review is happening right now! I've put enough first-draft mistakes on my own blogs that I can't hold it against others when they do the same.

P.P.S. Update here.

$7,600 (World Bank 2007)

$9,100 (World Bank 2007)

$14,700 (PPP adjusted, World Bank 2007)

$4,500 (World Bank 2006)

$7,600 or $14,400 (gross national income: "Atlas method" or "purchasing power parity," World Bank 2007)

$12,600 (IMF 2008), $9,100 (World Bank 2007), or $12,500 (CIA 2008)

$2,637 in 2000 US dollars (World Bank 2007); that's $3,200 in 2007 dollars

$2,621 (World Bank 2006) or $8,600 (IMF)

Sure, I realize these statistics cannot be calculated exactly, and, sure, I realize there are definitional issues within a country and choices to be made when converting to other currencies. Still, there's a lot of variation here!

At the very least, this is a good example for a statistics, economics, or political science class to illustrate the difficulties of measurement.

P.S. See here (scroll down to item 3) for why we've been looking this up.

In the context of a discussion of rich and poor voters in the U.S. and other countries, Matthew Yglesias posted this graph from our Red State, Blue State book:

fig7.4.png

The commenters raised several issues that I'd like to clarify here. (In particular, it looks like we miscoded some of the GDP per capita numbers, which doesn't affect our conclusions but is a bit embarrassing.)

1. The meaning of the graph

Chris Wiggins points us to this announcement for a conference next year:

Simulation has greatly advanced climate science, but not sufficiently to the profit of theory and understanding. How can simulation better advance climate science and what mathematical issues does this raise? Our hypothesis is that the development of climate science (i.e., theory and understanding) will be best served by focusing computational and intellectual resources on model and data hierarchies. By bringing together physicists, mathematicians, statisticians, engineers and climate-scientists, and focusing on several themes that reach across scales and scientific methodologies, our program will provide a framework for advancing our use of hierarchical methods in our attempt to understand the climate system.

There will be an active program of research activities, seminars and workshops throughout the March 8 - June 11, 2010 period and core participants will be in residence at IPAM for fourteen weeks. The program will open with tutorials, and will be punctuated by four major workshops and a culminating workshop.

This all makes sense to me, although, given the topic, I'm surprised that no statisticians seem to be involved. Lots of potential for interesting models and graphs.

Andy Sutter writes:

It's been a while (~2 years?) since I was last reading your blog semi-regularly and submitted a comment or two, but I was reading something today that made me recall those days.

At the time, I was curious about why social scientists present data as charts of regression coefficients, since I'd never seen such a presentation in the physical sciences.

I'm working on a project involving the evaluation of social service innovations, and the other day one of my colleagues remarked that in many cases, we really know what works, the issue is getting it done. This reminded me of a fascinating article by Atul Gawande on the use of checklists for medical treatments, which in turn made me think about two different paradigms for improving a system, whether it be health, education, services, or whatever.

The first paradigm--the one we're taught in statistics classes--is of progress via "interventions" or "treatments." The story is that people come up with ideas (perhaps from fundamental science, as we non-biologists imagine is happening in medical research, or maybe from exploratory analysis of existing data, or maybe just from somebody's brilliant insight), and then these get studied (possibly through randomized clinical trials, but that's not really my point here; my real focus is on the concept of the discrete "intervention"), and then some ideas are revealed to be successful and some are not (with allowances taken for multiple testing or hierarchical structure in the studies), and the successful ideas get dispersed and used widely. There's then a secondary phase in which interventions can get tested and modified in the wild.

The second paradigm, alluded to by my colleague above, is that of the checklist. Here the story is that everyone knows what works, but for logistical or other reasons, not all these things always get done. Improvement occurs when people are required (or encouraged or bribed or whatever) to do the 10 or 12 things that, together, are known to improve effectiveness. This "checklist" paradigm seems much different than the "intervention" approach that is standard in statistics and econometrics.

The two paradigms are not mutually exclusive. For example, the items on a checklist might have had their effectiveness individually demonstrated via earlier clinical trials--in fact, maybe that's what got them on the checklist in the first place. Conversely, the procedure of "following a checklist" can itself be seen as an intervention and be evaluated as such.

And there are other paradigms out there, such as the self-experimentation paradigm (in which the generation and testing of new ideas go together) and the "marketplace of ideas" paradigm (in which more efficient systems are believed to evolve and survive through competitive pressures).

I just think it's interesting that the intervention paradigm, which is so central to our thinking in statistics and econometrics (not to mention NIH funding), is not the only way to think about process improvement. A point that is obvious to nonstatisticians, perhaps.

The mysteries of the spam filter


I just received an email from "info@googlelotto.com" with subject line, "Your email just won £500,000 British Pounds in our anniversary promo." This email went into my inbox; it did not get caught by the spam filter.

What I wanna know is, if "Your email just won £500,000 British Pounds in our anniversary promo" isn't spam, what is???

"Perpetually Statistically Curious" writes:

Say you have two variables, Y1 and Y2, whose correlation depends on the value of a third dichotomous variable, X. Now say you take the absolute value of the difference between Y1 and Y2, and regress that absolute difference on the dichotomous (indicator) variable, X. My sense is that the expected value of the coefficient for the variable X in the regression would be related in a deterministic way with the gap between the correlations between Y1 and Y2 at the different values of X. But how?

This comes up in research on identical and fraternal twins, where the chief research interest is in the degree of similarity on some trait between identical twins relative to similarity on some trait between fraternal twins.
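One way to get a feel for the question, under assumptions that are not in the correspondent's message: suppose that, within each value of X, Y1 and Y2 are bivariate normal with equal means and a common standard deviation sigma. Then Y1 - Y2 is normal with mean 0 and variance 2 sigma^2 (1 - rho), so E|Y1 - Y2| = 2 sigma sqrt((1 - rho)/pi). The least-squares coefficient on a 0/1 predictor is just the difference in group means, so under these assumptions its expected value is the difference of that expression at the two correlations. The sketch below checks the formula by simulation; the two correlation values are hypothetical.

```python
import math
import random

def expected_abs_diff(rho, sigma=1.0):
    """E|Y1 - Y2| for bivariate normal Y1, Y2 with equal means,
    common standard deviation sigma, and correlation rho."""
    return 2.0 * sigma * math.sqrt((1.0 - rho) / math.pi)

def simulated_abs_diff(rho, n, rng, sigma=1.0):
    """Monte Carlo estimate of E|Y1 - Y2| under the same assumptions."""
    total = 0.0
    for _ in range(n):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        y1 = sigma * z1
        y2 = sigma * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        total += abs(y1 - y2)
    return total / n

# Hypothetical correlations, e.g. X = 0 for fraternal and X = 1 for identical twins
rho_x0, rho_x1 = 0.5, 0.9

# Expected regression coefficient on X = difference in group means of |Y1 - Y2|
coef_theory = expected_abs_diff(rho_x1) - expected_abs_diff(rho_x0)

rng = random.Random(0)
coef_sim = (simulated_abs_diff(rho_x1, 20000, rng)
            - simulated_abs_diff(rho_x0, 20000, rng))

print(round(coef_theory, 3), round(coef_sim, 3))
```

If the variances or means differ between the two groups, the coefficient mixes those differences with the difference in correlations, so the mapping is deterministic only under the stated assumptions.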

I'm always yammering on about the difference between significant and non-significant, etc. But the other day I heard a talk where somebody made an even more basic error: He showed a pattern that was not statistically significantly different from zero and he said it was zero. I raised my hand and said something like: It's not _really_ zero, right? The data you show are consistent with zero but they're consistent with all sorts of other patterns too. He replied, no, it really is zero: look at the confidence interval.

Grrrrrrr.

Eric Loken writes:

Last week the New York Times published an article on a possible Obama effect on test scores of black test takers. . . . The authors claim that they gave a short academic aptitude type test to black and white test-takers. When they administered the test last summer, they noted a difference between average scores for blacks and whites. However, after (now) President Obama had received his party's nomination and given his acceptance speech, the difference in scores disappeared. The theory is that Obama's rise has had a positive motivating influence on test taking performance.

Eric then gives some background:

Now that we're on the topic of econometrics . . . somebody recommended to me a book by Deirdre McCloskey. I can't remember who gave me this recommendation, but the name did ring a bell, and then I remembered I wrote some other things about her work a couple years ago. See here.

And, because not everyone likes to click through, here it all is again:

Hey, this looks cool!


Visualization and Control in Insect Flight

Atilla Bergou, Physics Department, Cornell University

Insects have a 100 million year head-start on us in learning how to fly. Thus, we have a lot to learn from them. Currently, one of the greatest challenges in this study is the accurate measurement, characterization and visualization of the motions of these animals. Recent advances in high-speed videography have allowed us to begin exploiting techniques from computer vision which hold immense promise to resolve these problems. In this talk, I will show our efforts in incorporating ideas from computer vision and physics to study the complex motion of an insect's wing. This motion is due not only to muscular activation but also to fluid, inertial, and elastic forces. Thus, it may be that not all aspects of the wing motion are actively controlled by the insect. We ask whether changes in the wing orientation of flying fruit flies are actuated by insect muscles, or if their wings turn over passively like a falling leaf. By applying a three-dimensional reconstruction technique to high-speed films of freely flying fruit flies, we are able to capture their intricate motion at a level of detail that has previously been impossible. We extract the detailed wing kinematics of flies using a novel motion tracking algorithm, compute the forces acting on the wings and infer whether flapping flight is possible without pitching control.

The talk is 3pm Wed 4 Feb CESPR 414 Sindeband East.

Letting the side down


The phone just rang. I picked it up and heard: "Could I speak with the youngest female in the household who is eligible to vote?" My reply: "Sorry, we're busy."

I'm either a traitor to my profession by not participating in a poll, or a contributor by increasing the problem of missing data.

Ed Vul, Christine Harris, Piotr Winkielman, and Harold Pashler wrote an article where:

1. They point out that correlations reported in fMRI medical imaging studies are commonly overstated because researchers tend to report only the highest correlations, or only those correlations that exceed some threshold.

2. They suggest that these statistical problems are leading researchers, and the general public, to overstate the connections between social behaviors and specific brain patterns.

After posting on this article, I received a bunch of comments and questions as well as some responses:

This article by Jabbi, Keysers, Singer, and Stephan argues that, because brain imaging researchers adjust their p-values and significance thresholds for multiple comparisons (the thousands of voxels in a brain image), their statistical methods don't have the problems that Vul et al. claimed.

This reply by Vul to the Jabbi et al. article. Here Vul argues that adjustment of significance levels does not stop the selected correlations themselves from being too high. I found Vul's argument here to be convincing. Multiple comparisons methods control the rate of false alarms in a setting where true effects are zero--but I don't see that to be relevant to the imaging setting, where differences are not in fact zero. Lots of things affect blood flow in the brain, and we would never expect the average scans of two different groups of people to be the same.

This article by Lieberman, Berkman, and Wager, who defend social neuroscience and argue the following:

1. They accept Vul et al.'s point 1 above (correlations are overstated) but present some evidence that the correlations aren't as overstated as Vul et al. might fear.

2. They disagree with the implied claim that the overstated correlations have distorted scientists' understanding of social neuroscience research.

3. They object to Vul et al.'s focusing on social neuroscience, given that the same statistical issues arise in all sorts of brain imaging studies.

4. They point out some specific areas where Vul et al. mischaracterized the data-analytic methods used in this field.

I think Lieberman et al. make some good points, but, as Vul et al. point out, researchers often do use correlations to summarize their results. And, even if said correlations survived a multiple-comparisons analysis, readers might interpret these at face value without understanding the selection issue. So all this shake-out is probably a good thing, especially where correlation estimates are being compared to each other.

My thoughts

First off, I haven't worked seriously in medical imaging for nearly 20 years and have only one published paper in the area, so my comments are mostly informed by my perspective on general statistical issues, as well as my own experience thinking about estimation of effect sizes in studies with low statistical power.

Regarding the singling-out of social neuroscience, I see the point of Lieberman et al. I was thinking that maybe one reason for this is that in social neuroscience it's perhaps more difficult to get external validation in the way that might be more possible in other areas of neuroscience where there is some measurement in the blood or whatever that can be taken. I'm not sure about this, just a conjecture.

It's hard for me to believe that the approach based on separate analyses of voxels and p-values is really the best way to go. The null hypothesis of zero correlations isn't so interesting. What's really of interest is the pattern of where the differences are in the brain.

Related to this point is that, when trying to understand differences in brain processing between different sorts of people (or between people doing different tasks), the maximum correlation among voxels is ultimately not what you're looking for. That is why researchers summarize using regions of interest (as in p.7 of the Lieberman et al. article). Vul et al. were correct to warn about overinterpretation of correlations that have been selected as the maximum: the naive reader can see such correlations (and accompanying scatterplots) and think that certain personality traits are more predictable from brain scans than they actually are.
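The selection effect is easy to see in a toy simulation. This is only a sketch under simplifying assumptions that are mine: every "voxel" has the same modest true correlation with the trait, voxels are simulated as independent replications, and the sample size and voxel count are arbitrary. It is not a model of any real imaging study.

```python
import math
import random

def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_voxel_correlations(true_rho, n_subjects, n_voxels, rng):
    """Sample correlations for many voxels that all share the same true
    correlation with the trait (each voxel an independent replication)."""
    results = []
    for _ in range(n_voxels):
        trait = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        signal = [true_rho * t + math.sqrt(1.0 - true_rho ** 2) * rng.gauss(0.0, 1.0)
                  for t in trait]
        results.append(pearson(trait, signal))
    return results

TRUE_RHO = 0.3          # hypothetical true correlation, identical for all voxels
rng = random.Random(1)
corrs = simulate_voxel_correlations(TRUE_RHO, n_subjects=16, n_voxels=500, rng=rng)

mean_corr = sum(corrs) / len(corrs)   # roughly recovers the true value
max_corr = max(corrs)                 # what gets reported if you pick the best voxel

print(round(mean_corr, 2), round(max_corr, 2))
```

The average over all voxels stays near the true value, while the selected maximum is far above it, and no adjustment of the significance threshold changes that.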

I think the way forward will be to go beyond correlations and the horrible multiple-comparisons framework, which causes so much confusion. Vul et al. and Lieberman et al. both point out that classical multiple comparisons adjustments do not eliminate the systematic overstatement of correlations. A hierarchical Bayes approach (using some sort of mixture for the population of pixel differences, ideally modeled hierarchically with pixels grouped within regions of interest) would help here.
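To show the flavor of partial pooling, here is a deliberately simple sketch. It is not the mixture model described above: it assumes a plain normal-normal hierarchy with known standard error, known group-level mean mu, and known group-level standard deviation tau, and all the numbers are hypothetical.

```python
def partial_pool(estimates, se, mu, tau):
    """Posterior means under y_j ~ N(theta_j, se^2), theta_j ~ N(mu, tau^2),
    with se, mu, and tau treated as known."""
    precision_data = 1.0 / se ** 2
    precision_prior = 1.0 / tau ** 2
    weight = precision_data / (precision_data + precision_prior)
    return [mu + weight * (y - mu) for y in estimates]

# Hypothetical raw estimates for three regions, one of them suspiciously large
raw = [0.9, 0.1, -0.2]
pooled = partial_pool(raw, se=0.3, mu=0.2, tau=0.15)

for y, theta in zip(raw, pooled):
    print(y, round(theta, 2))
```

When the group-level variation tau is small relative to the standard error, the weight on each raw estimate is small and the extreme values are pulled strongly toward the common mean, which is exactly where selected maxima are most misleading.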

And now for some amateur psychologizing (unsupported by any statistical analysis, correlational or other)

I suspect that one of the motivations of Vul et al. in writing their article was frustration at too-good-to-be-true numbers which they felt led to exaggerated claims of neuro super-science.

Conversely, I suspect one of the frustrations of Lieberman et al. is that they are doing a lot more than correlations and fishing expeditions--they're running experiments to test theories in psychology, they're trying to synthesize results from many different labs. And from that perspective it must be frustrating for them to see a criticism (featured in the popular press) that is so focused on correlation, which is really the least of their concerns.

It also seems that both sides were irritated by what they saw as giddy press coverage: on one side, claims of dramatic breakthroughs in understanding the biological basis of behavior and personality; on the other, claims of a dramatic Emperor-has-no-clothes debunking. As scientists, most of us welcome press coverage--after all, we think this work is important and we'd like others to know about it--but . . . fawning press coverage of something that we think is wrong--that's just annoying.

P.S. Wager is a friend--he teaches in the psychology department here--but I don't think my personal knowledge has hindered my evaluation here.

P.P.S. I ran the above by various people involved and they gave some helpful clarifications. But I've probably left in a couple of sloppy statements here and there.
