May 2008 Archives
The New Yorker has a circulation of a million and this has a circulation of zero (rounding to the nearest million). The winner and finalists are just so, so, so, so much funnier than anything the New Yorker ever has for these things. Not even close.

Winner:
"Fuck, it's the dream again. I'm on trial, surrounded by tiny non-whale mammals, and I don't even know what I'm being tried for. Wake up, don't worry, you'll wake up, wake up." — Snazzy Spazz
Finalists:
"Did I mention that my client's last name is Kennedy?" —Kathy H
"Objection! ANOTHER request for production of documents? Your honor, I believe this discovery process is excessive. And frankly, my client has been made to jump through too many hoops already!" —Stevo Darkly
See index here.
P.S. I linked to this a couple years ago but it's worth mentioning again. I guess I should put it in the "Cultural" links on the blog.
From Joseph Kang and Joseph Schafer:
When outcomes are missing for reasons beyond an investigator’s control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse-probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.
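To make the three kinds of estimate in the abstract concrete, here is a minimal Python sketch. It is not Kang and Schafer's code and does not reproduce their simulation. The data-generating model, the function names, and the shortcut of plugging in the true response probabilities instead of a fitted propensity model are all assumptions made for illustration.

```python
import numpy as np

def design(x):
    """Design matrix with an intercept and one covariate."""
    return np.column_stack([np.ones_like(x), x])

def regression_estimate(x, y, r):
    """Fit a linear model to the observed cases and average its predictions over everyone."""
    beta, *_ = np.linalg.lstsq(design(x)[r], y[r], rcond=None)
    return float(np.mean(design(x) @ beta))

def ipw_estimate(y, r, p):
    """Normalized inverse-probability-weighted mean of the observed outcomes."""
    w = r / p  # weight is 0 for missing cases, 1/p for observed ones
    return float(np.sum(w * np.where(r, y, 0.0)) / np.sum(w))

def dr_estimate(x, y, r, p):
    """Regression prediction plus an inverse-probability-weighted residual correction."""
    beta, *_ = np.linalg.lstsq(design(x)[r], y[r], rcond=None)
    m = design(x) @ beta
    resid = np.where(r, y - m, 0.0)
    return float(np.mean(m + (r / p) * resid))

# Hypothetical simulation: the population mean of y is 2 by construction.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-(0.5 + x)))   # true response probabilities (assumed known here)
r = rng.random(n) < p                  # True where the outcome is observed

naive = float(np.mean(y[r]))           # ignores the missingness mechanism
print("naive:", naive)
print("regression:", regression_estimate(x, y, r))
print("IPW:", ipw_estimate(y, r, p))
print("DR:", dr_estimate(x, y, r, p))
```

In this toy setup both models happen to be correct, so all three adjusted estimates should land near 2 while the naive mean of the observed cases is biased upward. The abstract's point is about the harder case where neither model is right.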
Mark Levy pointed me to this. I don't know anything about this area of research, but if true, it's just an amazing, amazing example of the importance of measurement error:
The 20th century warming trend is not a linear affair. The iconic climate curve, a combination of observed land and ocean temperatures, has quite a few ups and downs, most of which climate scientists can easily associate with natural phenomena such as large volcanic eruptions or El Nino events. But one such peak has confused them a hell of a lot. The sharp drop in 1945 by around 0.3 °C - no less than 40% of the century-long upward trend in global mean temperature - seemed inexplicable. There was no major eruption at the time, nor is anything known of a massive El Nino that could have caused the abrupt drop in sea surface temperatures. The nuclear explosions over Hiroshima and Nagasaki are estimated to have had little effect on global mean temperature. Besides, the drop is only apparent in ocean data, but not in land measurements.
Now scientists have found – not without relief - that they have been fooled by a mirage.
This is funny. Ubs takes us from "Since 1916, no Democrat has won the White House without winning West Virginia" to "No Democrat has won the White House without winning Missouri since 1824."
The sad thing is that I've seen reputable social scientists do analyses with data over several decades including "state effects," i.e., coefficients for states that don't vary over time. Going back to 1916 is sillier than most, but not all, such things I've seen. The trouble is that people have been brainwashed into thinking that something called "fixed effects" solves all their problems, so they turn their brains off.
Beyond this, predicting the winner doesn't make much sense, given that you're counting all the elections that have been essentially tied; see point 5 here.
This just came in the email:
As you are likely aware, this past February one of the two Uninterruptible Power Supply (UPS) systems in the Science Center data center failed completely as the result of a short circuit. . . .
It's a good thing they had two of them!
Here's Joe Bafumi's graph predicting House vote shares from pre-election polls in midterm elections:

(See here for the big version of the graph.)
I imagine there's something similar going on in presidential years.
A political scientist wrote in with a question that actually comes up a lot, having to do with hard-to-estimate group-level variance and correlation parameters in multilevel models. The short answer is, when these things are hard to estimate, it's often because there's not much information about them in the data, and sometimes you can get away with just setting these parameters to zero and making your life easier.
Anyway, here's his question:
Note that . . . Interestingly . . . Obviously . . . It is clear that . . . It is interesting to note that . . . very . . . quite . . . of course . . . Notice that . . .
I'm sure there are many more that I've forgotten. Most of youall probably know about most of these, but I don't know that many people know to avoid "very."
Commenting on our comparison of 1896 and 2000, Jonathan Rodden sent in this graph of Democratic vote share vs. population density in congressional districts from 1952-1996:

As Jonathan noted, the pattern of high-density areas voting strongly Democratic is relatively new. (But I don't buy the way his lines curve up on the left; I suspect that's an unfortunate artifact of using quadratic fits rather than something like lowess or spline.) Also there seems to be some weird discretization going on in the population densities for the early years in his data.
P.S. I don't like that the graphs go below 0 and above 1, but that's probably a Stata default. I don't hold it against Jonathan--after all, he made a graph for me for free--but I do think that better defaults are possible.
I led a 4-hour workshop on teaching statistics at the Association for Psychological Science meeting yesterday. Here's the powerpoint--I didn't actually get through all of it, because we spent nearly half the time in group discussions.
Here's the book, and here are other thoughts on teaching from the blog.
In response to the comments on his forthcoming book, Paul Murrell writes:
John Sides has a graph showing that, for the past twenty years, Jews have been giving 70-80% of their votes to Democratic presidential candidates. From our forthcoming Red State, Blue State book, here are some data going back to 1968 (from the National Election Study):

Perhaps also of interest is how this relates to religious attendance. More frequent attenders are more likely to vote Republican, but the pattern varies by denomination. Here's what was happening in 2004 (as estimated from the Annenberg pre-election survey):

The graph for 2000 looks similar except that the line for Jews was flat in that year.
Why care about Jewish voters?
The underlying question, though, is why should we care about a voting bloc that represents only 2% of the population (and even if Jews turn out at a 50% higher rate than others, that would still be only 3% of the voters), most of whom are in non-battleground states such as New York, California, and New Jersey? Even in Florida, Jews are less than 4% of the population. I think a lot of this has to be about campaign contributions and news media influence. But, if so, the relevant questions have to do with intensity of opinions among elite Jews rather than aggregates.
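The turnout arithmetic in the parenthetical can be checked in a couple of lines of Python. The function name is made up for illustration, and the calculation assumes everyone outside the group turns out at the same baseline rate.

```python
def share_of_voters(pop_share, relative_turnout):
    """Share of all voters coming from a group that makes up `pop_share` of the
    population and turns out at `relative_turnout` times the rate of everyone else."""
    group = pop_share * relative_turnout
    rest = (1.0 - pop_share) * 1.0
    return group / (group + rest)

# 2% of the population, turning out at a 50% higher rate than others
share = share_of_voters(0.02, 1.5)
print(f"{share:.1%}")
```

The share is 0.03/1.01, which is just under 3% because the group's extra turnout also raises the total number of voters slightly.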
P.S. This sort of concern is not restricted to Jews, of course. Different minority groups exercise political power in different ways. I just thought it was worth pointing out that this isn't a pure public opinion issue but rather something with more indirect pathways.
Eric Loken writes,
Criteria Corp is a company doing employee testing (basically psychometrics meets on-demand assessment). We're also going to blog on various issues relating to psychometrics and analyses of testing data. We're starting slowly on the blog front, but a few days ago we did one on employment tests for the NFL. A few scholars have argued that the NFL's use of the Wonderlic (a cognitive aptitude measure) is silly as it shows no connection to performance. But we showed that for quarterbacks, once you condition on some minimal amount of play, the correlation between aptitude and performance was as high as r = .5...which is quite strong. It's the common case of regression gone bad when people don't recognize that the predictor has a complex relationship to the outcome. There are many reasons why a quarterback doesn't play much; so at the low end of the outcome, the prediction is poor and the variance widely dispersed. But there are fewer reasons for success, and if the predictor is one of them, then it will show a better association at the high end.
Here's their blog, and here's Eric's football graph:

P.S. The graph would look better with the following simple fixes:
1. Have the y-axis start at 0. "-2000.00" passing yards is just silly.
2. Label the y-axis 0, 5000, 10000. "10000.00" is just silly. Who counts hundredths of yards?
3. Label the x-axis at 10, 20, 30, 40. Again, who cares about "10.00"?
I've complained about R defaults, but the defaults on whatever program created the above plot are much worse! (I do like the color scheme, though. Setting the graph off in gray is a nice touch.)
Constantine Frangakis, Ming An, and Spyridon Kotsovilis write:
Problem: suppose we conduct a study of known design (e.g. completely random sample) to measure *just a scalar* (say income, gene expression example from Rafael Irizarry), and suppose we get full response. Question: what data do we actually observe? Answer: we observe an infinite dimensional variable, which can carry extra information about how we analyze the scalar (say to estimate the population mean). Logic:
Steven Levitt's blog is great, but . . . shouldn't it be Monica Das Gupta who deserves the hero treatment? Here are Das Gupta's graphs:


This isn't news at all--Das Gupta's graphs came out at least a year ago. Shouldn't the scientist who was correct all along--and published the data to show it--get more of the credit?
P.S. I published a false theorem once myself (and an erratum note a few years later, when the error was pointed out to me), but I'd hate to think this is "incredibly rare" behavior.
P.P.S. And many other errors get caught before publication.
When I took science in 9th grade, I remember being disturbed by a gap in the story. From one direction, we were told about atoms and subatomic particles and how they clustered into molecules. From the other, we were told about cells--single-celled animals and single human cells, then multicelled animals, then larger things such as jellyfish, etc., building up to people. We even talked about the parts of a cell--nucleus, axons, cilia, etc.
But we never were given the link between molecules and cells. And what really bothered me was that there was never even any recognition of the gap. This was really too bad, because long molecules are cool--there are proteins shaped like hooks that grab onto other molecules, etc. But it was either atoms or cells, nothing in between.
I was thinking about this recently after reading two blog entries by Steven Levitt. Here he writes that rich people aren't really so much richer than poor people because rich people pay more for "fancy cars, expensive wine, etc." This confuses me because I thought that, under the usual principles of economics, we should assume that fancy cars, etc., are worth their price--otherwise competitors would come into the market and sell them for less. Levitt's related point is that the narrowing of the gap between rich and poor can be credited to Wal-Mart. I can see how this could be true, but once again I'm confused, because I thought standard economic theory said that if Wal-Mart didn't exist, someone would invent it. I have an uncomfortable feeling here that economics is sometimes telling us that things are inevitable (the law of supply and demand) and other times is celebrating unique organizations such as Wal-Mart.
I'm not saying that economists are wrong on this--clearly, supply and demand are powerful forces, and it's also clear that organizations such as Toyota or Bell Labs or, for that matter, City Harvest, can make a difference. Marketing is an art, and just as, if Picasso had never been born, there would still be abstract art but there would be no Picassos, I can well imagine that in a different world, there would be no Wal-Mart, and maybe Americans would all be paying fifteen cents more each for peanut butter, or whatever.
But . . . I'm still disturbed by the lack of connection that is made between the fundamental principles of economics (under which $5,000 worth of expensive wine has the same value as $5,000 worth of Cheetos) and the sort of technocratic reasoning (the kind of thing that makes me, as a statistician, happy) where you try to assign a cost to each thing.
Really this applies to economics, or "freakanomics," in general: For example, you can do some data analysis to see if sumo wrestlers are cheating, or you can just say that sumo wrestling supplies an entertainment niche and leave it to the wrestlers to figure out how to optimally collude. Either sort of analysis is ok, but I rarely see them juxtaposed--it's typically one or the other, and the conclusions seem to depend a lot on which mode of analysis is chosen.
I don't think there are any easy answers here--to borrow a physics analogy, a stable economy is necessarily at a phase transition, entrepreneurs can't repeal the law of supply and demand, and conversely "supply and demand" don't mean squat if nobody's there to take advantage of opportunities, etc. But I think there can be trouble if you can pull out a macro or a micro argument and not always see the connection between them.
P.S. This problem is not at all unique to economics. For example, some political scientists (such as myself) study public opinion and others study strategic bargaining among political actors. And we tend to work in parallel, even though of course these concepts interact. I study voters' attitudes on issues and where they stand compared to the Democrats and Republicans, whereas Thomas Ferguson studies campaign contributions by major industries. It's all part of the same big picture but it's hard to put it all together in one place.
And I'm not saying this to criticize Levitt: he has interesting things to say both in the "big picture" sense and in detailed technical analyses. I just think there's a big gap there that's not often acknowledged.
After I remarked here on the notorious rudeness of one of the frequent posters to the R listserv, several commenters agreed with me (for example, Richard Morey wrote, "I've found that the R-help forums are legendary for the rude poster(s)"), but Nick Cox writes,
problems of clashing styles and expectations are generic to all technical lists that I've ever heard of that are not selective about membership. . . . The problem is a political question, not a technical one. The question is what to do when people, for whatever reason, do not follow the standards laid down for proper behaviour in a group, a discussion list, and one that they willingly join. . . . The solution being complained about is that some people -- usually "senior" people on a list with recognisable expertise -- are very firm in reminding posters of poor questions about the need to be much more precise about what their difficulty is, to read the documentation, etc. As this advice is very much part of the guidelines that people are asked to follow, it seems disingenuous, if not hypocritical, to complain when those people are trying their level best to maintain the standards of the list, exactly as advertised.
Cox's comments are interesting--and they suggest that when I and others think the R posters are particularly rude, we just don't have much experience with large listservs--but I actually want to take this in a different direction.
In my previous entry, I wrote, "I think it's ok for you to just post your questions and ignore that one [rude] person if he replies to you." Personally, I find rudeness from strangers unpleasant, even on a listserv, but I recognize that's just the way things are, and that's why I advise people to post to the list anyway.
Why is rudeness so upsetting, and how does it relate to altruism?
The more interesting question, perhaps, is why does this sort of rudeness hurt so much? Even though, as a logical matter, we should be able to ignore this rudeness, it actually hurts enough to dissuade people from posting.
The other aspect of listserv rudeness that intrigues me is that posting answers to a listserv is basically an act of altruism. People don't get paid to do it, they don't get academic credit, and in fact if they're rude they don't even get a lot of respect for it (at least, not as much respect as they might deserve, given their contributions). So, it's not enough of an answer to describe rude posters as assholes--if they were real assholes, they wouldn't be posting at all, right? They're sort of like those legendary caring-but-firm teachers who put a huge amount of effort into helping each individual student, but show it in this crusty, sarcastic, tough-guy fashion. I guess I can see, following Cox's argument above, that rude posters are serving the greater good by hurting people's feelings. I just think it's impressive that people would be so altruistic as to do this.
P.S. Posting on a blog is similar but less altruistic. For example, yes, I answer people's questions here but I also get a chance to promote my own work, which I don't see being done so much on the R listserv.
P.P.S. Given all the comments below, I fear I wasn't being clear enough in my own views here. I am being serious above, not sarcastic. Although certain listserv posters can be abrasive and even rude, I really do feel they are altruistic in providing all this free help on the list. Stylistic issues aside, I think that people who give help in this way are performing a valuable service, and I appreciate this, both for myself (on the occasions that I use the list) and on behalf of students and others whom I refer to the list.
Reading Paul Murrell's new book made me think some more about debugging. Jennifer and I discuss debugging in Chapter 19 of our book. I'll give Paul's recommended steps for debugging and then my comments. Paul's advice:
Paul Murrell's new book begins as follows:
The basic premise of this book is that scientists are required to perform many tasks with data other than statistical analyses. A lot of time and effort is usually invested in getting data ready for analysis: collecting the data, storing the data, transforming and subsetting the data, and transferring the data between different operating systems and applications. Many scientists acquire data management skills in an ad hoc manner, as problems arise in practice. In most cases, skills are self-taught or passed down, guild-like, from master to apprentice. This book aims to provide a more structured and more complete introduction to the skills required for managing data.
This seems like a great idea, although it makes me think the title should say "data management" rather than "data technologies." Also, I hope that he clarifies that "data" does not simply mean "raw data." We often spend a lot of time working with structured data, in which the structuring (multilevel structures, missing data patterns, time series and spatial structure, etc.) is an important part of the data that is often obscured in traditional computer representations of data objects. As we've recently discussed, even something as simple as a variable constrained to lie in the range [0,1] is not usually stored as such on the computer.
I like the book a lot and think every statistician should have a copy. I have some comments on Section 2.3.4, on Debugging Code, which I'll place in a separate blog entry. For now, here are some little comments on various parts of the book:
We're more likely to listen to expensive advice. Dog bites man, or is there something I'm missing here?
P.S. I know from personal experience that if you raise your consulting fee high enough, people do start to choke on the price and either say no or reduce the number of hours they want. Both of which are good outcomes from my perspective.
Juan-Jose Gibaja-Martins writes,
In our recent research we have been working with data on per capita GDP, Human Development Index, etc in european countries and we are interested in performing a Principal Components Analysis (PCA). So far we have been weighting our data -based on the idea that we should allow a bigger weight to a bigger economy- My question is: does it make sense to weight our data -for instance with the population or the GDP of each country- or should we keep all the weights equal to 1? Is it advisable to use weighted PCA? If so, when?
Oddly enough, a related question came up in our quantitative political science research group a couple weeks ago: Rebecca and Matt were talking about a cross-national study that had something like 1000 respondents from each of several European countries, and the question was how to weight the results to account for the fact that some countries are larger than others. A simple data analysis would implicitly count all countries equally, which doesn't sound right.
My solution (as usual) is multilevel modeling plus poststratification: in short, you only need to worry about these weights to the extent that the estimand of interest varies from country to country. Rather than weighting the data, my recommendation is to use a multilevel model to get estimates for all the countries and then to poststratify to get continent-wide estimates, if these are desired.
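The poststratification step itself is just a population-weighted average of the country-level estimates. Here is a minimal Python sketch. The three countries and all the numbers are made up, and the per-country estimates are simply assumed to have come out of a multilevel model, which is not shown.

```python
def poststratify(estimates, populations):
    """Population-weighted average of per-country estimates."""
    total = sum(populations[c] for c in estimates)
    return sum(estimates[c] * populations[c] for c in estimates) / total

# Made-up numbers for three hypothetical countries (not real data)
estimates = {"A": 0.40, "B": 0.50, "C": 0.70}      # assumed multilevel-model estimates
populations = {"A": 80.0, "B": 10.0, "C": 10.0}    # populations in millions

unweighted = sum(estimates.values()) / len(estimates)
poststratified = poststratify(estimates, populations)
print("counting countries equally:", unweighted)
print("poststratified:", poststratified)
```

With these numbers the equal-weights average is about 0.53 and the poststratified one is about 0.44. The two differ only because the estimates vary across countries, which is the sense in which you only need to worry about weights when the estimand varies.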
Marco Morales sent me this paper of his with Rene Bautista:
Exit polls are seldom used for voting behavior research despite the advantages they have compared to pre and post-election surveys. Exit polls reduce potential biases and measurement errors on reported vote choice and other political attitudes related to the time in which measurements are taken. This is the result of collecting information from actual voters only minutes after the vote has been cast. Among the main reasons why exit polls are not frequently used by scholars it is the time constraints that must be placed on the interviews, i.e., short questionnaires, severely limiting the amount of information obtained from each respondent. This paper advances a combination of an appropriate data collection design along with adequate statistical techniques to allow exit polls to overcome such a restriction without jeopardizing data quality. This mechanism implies the use of Planned Missingness designs and Multiple Imputation techniques. The potential advantages of this design applied to voting behavior research are illustrated empirically with data from the 2006 Mexican election.
This sounds cool. I'd only add that all surveys are "planned missingness." That's what makes it a survey rather than a census. Also I want to take a look at their data and see if their results are consistent with what we found in our analysis of the 2006 Mexican presidential election.
I'm not the only one who gets frustrated about such things.
This article by Tim Harford reminds me of an example I used to give in my decision analysis class:
When I was younger, people used to complain about candy bars getting smaller and smaller. (For example, Stephen Jay Gould has a graph in one of his books showing the size of the standard Hershey bar declining from 2 ounces in 1965 gradually down to 1.2 ounces in 1980, and for that matter I can recall tunafish cans gradually declining from 8 ounces to 6 ounces.) And I remember going to the candy machine with my quarter and picking out the candy bar that was heaviest--I don't remember which one--even if it wasn't my favorite flavor, to get the most value for the money.
But now I realize that, rationally, candymakers should charge more for smaller candy bars. The joy from eating the candy is basically discrete--I'll get essentially no more joy from a 1.7-ounce bar than from a 1.4-ounce bar. But the larger bar will be worse for my health (no big deal if I eat just one, but with some cumulative effect if I eat one every day, similarly with the sodas and so forth). And, given the well-known fact that nobody can eat just part of a candy bar, I get more net utility from the small bar, thus they should charge more.
See here for a link to a research study on this.
John Cook has an interesting story here. I agree with his concerns. It's hard enough to reproduce my own analysis, let alone somebody else's. This comes up sometimes when I'm revising a paper or including an old analysis in a book: I just can't put all the data back together again, or I have an old Fortran program linked to S-Plus that won't run in R, or whatever, and so I have to patch something together with whatever graphs I have available.
Also, when consulting, I've sometimes had to reconstruct the other side's analysis, and it can be tough sometimes to figure out exactly what they did.
The Applied Statistics Center at Columbia University invites applications for a post-doctoral fellowship focusing on missing data and Bayesian inference. Eligible candidates have received a PhD in Statistics or related field. We will also consider candidates who have received appropriate quantitative training while earning a PhD in the social, behavioral or health/medical sciences. The successful candidate will be responsible for helping to build an innovative multiple imputation program including development of new models and diagnostics. Accordingly some expertise in Bayesian statistics is necessary. Also, strong programming experience in R is a must and facility with C++ is strongly encouraged. The position will also involve working with social science and health research practitioners, so the ability to perform interdisciplinary work is necessary.
The principal investigators on this project are Andrew Gelman, Jennifer Hill, and Peter Messeri, and we're also working with Angela Aidala, Jane Waldfogel, Irv Garfinkel, and other researchers in social science and public health.
The Applied Statistics Center is an exciting research environment at Columbia involving many students, faculty, and postdocs in dozens of methodological and applied research projects.
The fellowship is for one year but may be extended by mutual agreement and contingent on funding considerations. Salary is commensurate with degree and experience.
Please submit the following materials electronically to Juli Simon Thomas: your letter of application, curriculum vitae, writing sample, programming sample, three letters of recommendation, and a two- to three-page research statement describing your research interests.
Background
Some of our previous papers in this area include the following:
Links have been fixed (thanks for pointing out, Jeremiah).
In a comment here, Sean writes, "I would [be] more interested in an in-depth discussion of the statistical challenges of climate modeling than in the political angles of the question." I respect that opinion, but I think it makes more sense for me to write about what I know more about, which in this case is patterns in public opinion.
Brandon Keim writes,
Over the last year and a half, the number of Americans who believe the Earth is warming has dropped. The decline is especially precipitous among Republicans: in January 2007, 62 percent accepted global warming, compared to just 49 percent now. . . . The confounding part: among college-educated poll respondents, 19 percent of Republicans believe that human activities are causing global warming, compared to 75 percent of Democrats. But take that college education away and Republican believers rise to 31 percent while Democrats drop to 52 percent. That strikes me [Keim] as deeply weird. I don't even have a snarky quip, much less an explanation.
This does seem a bit weird: you might think that college grads are more likely to go with the scientific consensus on global warming, or you might think that college grads would be more skeptical, but it seems funny that it would go one way for Democrats and the other for Republicans.
Things became clearer when I looked at the graph (which was thoughtfully presented next to Keim's article):

Among college grads, there is a big partisan divide between Democrats and Republicans. Among non-graduates, the differences are smaller. This is completely consistent with research that shows that people with more education are on average more politically polarized (see, for example, figure 9a of my paper with Delia). Basically, higher educated Democrats are more partisan Democrats, and higher educated Republicans are more partisan Republicans. On average, educated people are more tuned in to politics and more likely to align their views with their political attitudes.
From this perspective, it's really not about the scientific community at all, it's just a special case of the general phenomenon of elites being more politically polarized (a phenomenon that we discuss in chapter 8 of our forthcoming book, and which is related to divisions between red and blue states).
P.S. I followed the link from Andrew Sullivan. And here's the detailed Pew report (and, remember, Pew gives out raw data!).
I've written up my rant more seriously here. Here's the new abstract:
Bayesian inference is one of the more controversial approaches to statistics. The fundamental objections to Bayesian methods are twofold: on one hand, Bayesian methods are presented as an automatic inference engine, and this raises suspicion in anyone with applied experience. The second objection to Bayes comes from the opposite direction and addresses the subjective strand of Bayesian inference. This article presents a series of objections to Bayesian inference, written in the voice of a hypothetical anti-Bayesian statistician. The article is intended to elicit elaborations and extensions of these and other arguments from non-Bayesians and responses from Bayesians who might have different perspectives on these issues.
And here's how the article concludes:
In the decades since this work and Box and Tiao’s and Berger’s definitive books on Bayesian inference and decision theory, the debates have shifted from theory toward practice. But many of the fundamental disputes remain and are worth airing on occasion, to see the extent to which modern developments in Bayesian and non-Bayesian methods alike can inform the discussion.
In answer to many of the earlier commenters: yes, I have replies for the criticisms. But I didn't want to put them here because I worried that they would inhibit the flow of discussion that I'd like to see come from this article. I will post my replies at some point (at which time I'm sure they'll be a disappointment, after all the hype).
I mentioned this in class all the time this semester so I thought I should share it with the rest of you. The folk theorem is this: When you have computational problems, often there's a problem with your model. I think this could be phrased more pithily--I'm not so good in the pithiness department--but in any case it's been true in my experience.
Also relevant to the discussion is this paper from 2004 on parameterization and Bayesian modeling, which makes a related point:
Progress in statistical computation often leads to advances in statistical modeling. For example, it is surprisingly common that an existing model is reparameterized, solely for computational purposes, but then this new configuration motivates a new family of models that is useful in applied statistics. One reason why this phenomenon may not have been noticed in statistics is that reparameterizations do not change the likelihood. In a Bayesian framework, however, a transformation of parameters typically suggests a new family of prior distributions.
I was thinking more about axes that extend beyond the possible range of the data, and I realized that it's not simply an issue of software defaults but something more important, and interesting, which is the way in which graphics objects are stored on the computer.
R (and its predecessor, S) is designed to be an environment for data analysis, and its graphics functions are focused on plotting data points. If you're just plotting a bunch of points, with no other information, then it makes sense to extend the axes beyond the extremes of the data, so that all the points are visible. But then, if you want, you can specify limits to the graphing range (for example, in R, xlim=c(0,1), ylim=c(0,1)). The defaults for these limits are the range of the data.
What R doesn't allow, though, are logical limits: the idea that the space of the underlying distribution is constrained. Some variables have no constraints, others are restricted to be nonnegative, others fall between 0 and 1, others are integers, and so forth. R (and, as far as I know, other graphics packages) just treats data as lists of numbers. You also see this problem with discrete variables; for example when R is making a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn't know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., making it nearly impossible to read sometimes.
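To make the histogram point concrete, here is a minimal sketch (in Python rather than R, and not a description of any existing package) of what "knowing" that a variable is integer-valued could buy you: bin edges that sit halfway between the integers, so each value gets its own bar. The function names are made up for illustration.

```python
from collections import Counter


def integer_bin_edges(values):
    """Bin edges centered on the integers: lo-0.5, lo+0.5, ..., hi+0.5."""
    lo, hi = min(values), max(values)
    return [lo - 0.5 + i for i in range(hi - lo + 2)]


def integer_counts(values):
    """One (value, count) pair per integer in the observed range, zeros included."""
    counts = Counter(values)
    return [(v, counts.get(v, 0)) for v in range(min(values), max(values) + 1)]


responses = [1, 2, 2, 3, 3, 3, 4, 5, 5]
print(integer_bin_edges(responses))
print(integer_counts(responses))
```

With edges at the half-integers there is no ambiguity about which bar a 2 or a 3 belongs to, which is the whole problem with bins that run from 1 to 2, 2 to 3, and so on.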
What I think would be better is for every data object to have a "type" attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc. If the type is not specified (i.e., NULL), it could default to unconstrained continuous (thus reproducing what's in R already). Graphics functions could then be free to use the type; for example, if a variable is constrained, one of the plotting options (perhaps the default, perhaps not) would be to have the constraints specify the plotting range.
Lots of other benefits would flow from this, I think, and that's why we're doing this in "mi" and "autograph". But the basic idea is not limited to any particular application; it's a larger point that data are not just a bunch of numbers; they come with structure.
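Here is a rough sketch of the "type attached to every data object" idea, written in Python for illustration. Everything in it is hypothetical: the class name, the type labels (borrowed from the list above), and the `plot_limits` method are assumptions of mine, not the interface of "mi" or "autograph". An unspecified type falls back to unconstrained continuous, and any side without a logical bound falls back to the range of the data.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

# Hypothetical type labels mapped to (lower, upper) logical bounds.
# None means "no constraint on that side".
LOGICAL_RANGES = {
    "continuous": (None, None),
    "nonnegative continuous": (0.0, None),
    "continuous between 0 and 1": (0.0, 1.0),
    "binary": (0.0, 1.0),
}


@dataclass
class TypedVariable:
    values: Sequence[float]
    kind: Optional[str] = None  # None is treated as unconstrained continuous

    def plot_limits(self) -> Tuple[float, float]:
        """Use the logical bounds where they exist, the data range otherwise."""
        lower, upper = LOGICAL_RANGES[self.kind or "continuous"]
        lo = min(self.values) if lower is None else lower
        hi = max(self.values) if upper is None else upper
        return (lo, hi)


vote_share = TypedVariable([0.42, 0.55, 0.61], kind="continuous between 0 and 1")
print(vote_share.plot_limits())
```

A plotting function could then ask the variable for its limits instead of guessing from the numbers, and the user would still be free to override them.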
The discussion here of graphics defaults inspired me to collect this list of defaults in R graphics that I don't like. In no particular order:
- Axes that extend below 0 or above 1
- Tick marks that are too big. They're ok on the windows graphics device, but when I make my graphs using postscript(), I have to set tck=-.02 so that they're not so big.
- Axis labels that are too far from the axes
- Axis numbers that are spaced too closely together
- A horrible system of cryptic graphics parameters ("mgp", "mar", "xaxs", "xaxt", etc)
- Too much space on the outside of the graph. This becomes a real problem when many graphs are put on the page. This can be corrected using mar, but it's a pain, and lots of people don't know about this and just use the default settings (which is why bad defaults are a problem).
I'm sure I could make my own functions to do this but I haven't ever gotten around to doing this; I just copy code from old examples.
There are also things that I have to do by hand but should be done automatically (yes, I know that means I should write my own functions . . .), in particular, labeling individual lines directly on a graph rather than with a legend.
P.S. Yes, I know R is free so I shouldn't complain . . .
You never know what you'll find in the Dining and Wine section nowadays.
John Sides posts these useful graphs:
As John writes, "The party loyalty of Democrats has been increasing over time and has essentially hovered at 90% since 1992. (And Republicans are similarly loyal to the Republican nominee.)" Here's the story from the 2000 election:
To which I'd also add this (from this paper with Joe Bafumi and David Park):

This shows the improvement in prediction given party ID and also demographics and political ideology.
The short story: voters are more predictable than they themselves realize.
P.S. John's graphs are fine, but the y-axis shouldn't go below 0 or above 100%.
Longhai Li did a really cool Ph.D. thesis (under the supervision of Radford Neal) on computing for models with deep interactions. The website containing everything about this software, including the R packages, documentation, and references, is here and here. Here's a quick description (from the website):
This R package is used in two situations. The first is to predict the next outcome based on the previous states of a discrete sequence. The second is to classify a discrete response based on a number of discrete covariates. In both situations, we use Bayesian logistic regression models that consider the high-order interactions. The time arising from using high-order interactions is reduced greatly by our compression technique that represents a group of original parameters as a single one in MCMC step. In this version, we use log-normal prior for the hyperparameters. When it is used for the second situation --- classification, we consider the full set of interaction patterns up to a specified order.
And here's the research paper (by Longhai and Radford). I wonder if they've achieved some of my goals in wanting weakly informative priors for models with interactions. That Cauchy thing rings a bell.
P.S. to Longhai: I don't recommend keeping your software in two places. Won't it be a pain to keep both sites up-to-date? Or maybe it's done automatically, I don't know.
Alex Reed writes,
Hey, this looks interesting . . . Inderjeet Mani, of Mitre Corporation, will be here on Monday, speaking on Interpreting Fictional Narrative: Crossing Some Ancient Frontiers. Here's the abstract:
While progress has been made on computational understanding of the flow of time in non-fictional genres, there has been little attention paid to time in literary texts. I will discuss a new project that examines the intersection between computational linguistics and narratology. I argue that understanding time in fiction requires not only the construction of timelines, but also a grasp of how characters, and readers' attitudes towards them, evolve. Accordingly, one needs to represent the goals and outcomes of characters' actions, superimposing a model of plot as an additional layer on top of the timeline. The theory models narrative progression in terms of changes in an ideal reader's emotional reactions to particular characters as the plot unfolds. In addition to examining samples from well-known literary works, I will discuss progress to date on an annotation scheme for plot and character evaluations.
Perhaps someone from the Classics department can come and comment on how this relates to the ancient theories of tragedy, comedy, etc. I also wonder how his theories work with explicitly time-organized fiction such as that of Jonathan Coe and Richard Ford (and I guess we could throw Wordsworth in there too), as compared to more straightforward narrative.
The talk will be 1:30 PM, Monday May 12th in the Back Open Conference Area of the CS Building. (enter the CS Building within Mudd and ask the receptionist to direct you back).
Michael Franc looked at Federal Election Commission data on campaign contributions and found some interesting things:
Through May 1, the Democratic presidential field has suctioned up a cool $5.7 million from the more than 4,000 donors who list their occupation as “CEO.” The Republicans’ take was only $2.3 million. Chief financial officers, general counsels, directors, and chief information officers also break the Democrats’ way by more than two-to-one margins. . . .
I'm not actually sure where these numbers come from. When I queried the FEC database (looking up "ceo" from 01/01/2008 to 05/01/2008), the total contributions (not just for presidential candidates) were only $45,124. So I must be doing something wrong here in my query. In any case, I guess it makes sense that most of the contributions have gone to Democrats so far, since (a) the Democratic primary has been much more competitive than the Republican, and (b) the Democrats are favored to win this year.
Franc continues:
In this upside-down campaign season when populist GOP campaigners like John McCain and Mike Huckabee surprised the pundits with their primary victories or, in the case of Ron Paul, their fundraising prowess, it almost makes sense that the party of the country club set has been winning the fundraising race among the common man. . . . This trend extends to the saloons, where the Democrats carry the bartenders and the Republicans the waitresses. . .
The bit about the bartenders and waitresses caught my eye. But when I looked it up, I found no contributions from either group this year. Going to the entire database, I did find some "waitress" contributions between 1998 and 2005, but they were mostly to Democrats. Also a few bartender contributions since 1998, again mostly to Democrats. So I'm not really sure about that. I emailed Franc to ask for his data source so I hope to learn more.
Setting aside the data difficulties, I think Franc makes an important point in the conclusion of his article:
I dream of a day when a journalist such as Ezra Klein, when seeing a graph such as this from Rob Goodspeed,

will immediately say, Hey! Why are these items in alphabetical order? That just confuses things. (It's not like they need to be in alphabetical order so that we can look up "faith" in the index or whatever.)
I have no substantive comment on the graph except that it seems unfair to McCain in that his page has fewer total words, which as displayed in the graph makes him look less substantive overall. I mean, maybe it's just a choice for him to focus on just a few issues.
P.S. I'm not knocking Goodspeed, who put in the work to make the graph, or Klein, who went to the trouble of finding it. I'm just saying that in the ideal world, an irrelevantly alphabetized graph would JUMP OUT OF THE PAGE as something not quite right, in the way that a typo or grammatical error does now. But, hey, my job is education, right? So here's my try.
P.P.S. Howard Wainer has called this the Alabama First error and wrote an article on the topic in Chance in 2001.
My favorite statistics demonstration is the one with the bag of candies. I've elaborated upon it since including it in the Teaching Statistics book and I thought these tips might be useful to some of you.
Preparation
Buy 100 candies of different sizes and shapes and put them in a bag (the plastic bag from the store is fine). Get something like 20 large full-sized candy bars, 20 or 30 little things like mini Snickers bars and mini Peppermint Patties. And then 50 or 60 really little things like tiny Tootsie Rolls, lollipops, and individually-wrapped Life Savers. Count and make sure it's exactly 100.
You also need a digital kitchen scale that reads out in grams.
Also bring a sealed envelope inside of which is a note (details below). When you get into the room, unobtrusively put the note somewhere, for example between two books on a shelf or behind a window shade.
Setup
Hold up the bag of candy and the scale and write the following on the board:
Each pair of students should:
1. Pull 5 candies out of the bag
2. Weigh the candies
3. Write down the weight
4. Put the candies back in the bag!!
5. Pass the scale and bag to your neighbors
6. Silently multiply the weight of the 5 candies by 20.
(And, as Frank Morgan told me once, remember to read aloud everything you write on the board. Don't write silently.)
The students should work in pairs. Explain that their goal is to estimate the total weight of all the candies in the bag. They can choose their 5 candies using any method--systematic sampling, random sampling, whatever. Whichever pair guesses closest to the true weight gets the whole bag!
Demonstrate how to zero the scale, give the scale and the bag of candies to a pair of students in the front row, and let them go.
Action
The demo will proceed silently while the rest of the class proceeds. So do whatever you were going to do in class. Take a look to make sure the scale and bag are moving slowly through the room. After about 30 or 40 minutes, it will reach the back and the students will be done.
At this point, ask the pairs, one at a time, to call out their estimates. Write them on the board. They will be numbers like 3080, 2400, 4340, and so forth. Once all the numbers are written, make a crude histogram (for example, bins from 2000-3000 grams, 3000-4000, 4000-5000, etc.). This represents the sampling distribution of the estimates.
Now call up two students from the class (but not from the same pair) to look at all the estimates. Ask them what their best guess is, having seen this information. Ask the class if they agree with these two students. Now give the bag to the two students in the front of the room and have them weigh it.
Punch line
The weight of all 100 candies will be something like 1658. It's always, always, always lower than all of the individual guesses on the board. Write this true weight as a vertical bar on the histogram that you've drawn. This is a great way to illustrate the concepts of bias and standard error of an estimator.
Now call out to the students who are sitting near where you hid the envelope: "Um, uh, what's that over there . . . is it an envelope??? Really? What's inside? Could you open it up?" A student opens it and reads out what's written on the sheet inside: "Your guesses are all too high!"
Aftermath
Now's the time to talk about sampling. Large candies are easy to see and to grab, while small candies fall through the gaps between the large ones and end up at the bottom of the bag. You can draw analogies to doing a random sample by going to the mall or by sending out an email survey and seeing who responds. Ask: How could you do a random sample? It won't be obvious to the students that the way to do a random sample is to number each of the candies from 1 to 100 and pick numbers at random. Also, as noted above, this is an example you can use later in the semester to illustrate bias and standard error.
P.S. My feeling about describing these demos is the same as what Penn and Teller say about why they show audiences how they do their tricks: it's even cooler when you know how it works.
P.P.S. Remember--it's crucial that the candies in the bag be of varying sizes, with a few big ones and lots of little ones!
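If you want to see the bias before trying the demo on a live class, here is a small simulation sketch in Python. The candy weights (50, 15, and 5 grams) and the assumption that students grab candies with probability proportional to weight are inventions for illustration, not measurements; real students are probably biased in messier ways. The sketch compares that size-biased grabbing with a true simple random sample, using the same "weigh 5, multiply by 20" estimate.

```python
import random
from statistics import mean

# Hypothetical bag: 20 full-sized bars, 30 minis, 50 tiny candies (grams).
bag = [50] * 20 + [15] * 30 + [5] * 50


def size_biased_sample(weights, k, rng):
    """Draw k candies without replacement, each draw proportional to weight."""
    remaining = list(weights)
    picked = []
    for _ in range(k):
        i = rng.choices(range(len(remaining)), weights=remaining, k=1)[0]
        picked.append(remaining.pop(i))
    return picked


def simple_random_sample(weights, k, rng):
    """Draw k candies without replacement, all equally likely."""
    return rng.sample(weights, k)


def estimate_total(sample, bag_size):
    """Scale the sample weight up to the whole bag (5 candies -> times 20)."""
    return sum(sample) * bag_size / len(sample)


def simulate(sampler, weights, n_pairs, rng, k=5):
    """One estimate of the total weight per simulated pair of students."""
    return [estimate_total(sampler(weights, k, rng), len(weights))
            for _ in range(n_pairs)]


rng = random.Random(2008)
grabbed = simulate(size_biased_sample, bag, 2000, rng)
srs = simulate(simple_random_sample, bag, 2000, rng)

print("true total:", sum(bag))
print("mean estimate, size-biased grabbing:", round(mean(grabbed)))
print("mean estimate, simple random sample:", round(mean(srs)))
```

The size-biased estimates center well above the true total, while the simple random sample is centered near the truth but still noisy. That gap is the bias, and the spread of either set of estimates is what the histogram on the board is showing.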
When you leave a voice mail, please say your name and phone number slowly and clearly. Thank you.
Ben Goldacre links to this article by Christopher Murray et al.:
Rick Romell of the Milwaukee Journal Sentinel pointed me to the National Highway Traffic Safety Administration’s data on fatal crashes. Rick writes,
In 2006, for example, NHTSA classified 17,602 fatal crashes as being alcohol-related and 25,040 as not alcohol-related. In most of the crashes classified as alcohol-related, no actual blood-alcohol-concentration test of the driver was conducted. Instead, the crashes were determined to be alcohol-related based on multiple imputation. If I read NHTSA’s reports correctly, multiple imputation is used to determine BAC in about 60% of drivers in fatal crashes.
He goes on to ask, "Can actual numbers be accurately estimated when data are missing in 60% of the cases?" and provides this link to the imputation technique the agency now uses and this link to an NHTSA technical report on the transition to the currently-used technique.
My quick thought is that the imputation model isn't specifically tailored to this problem and I'm sure it's making some systematic mistakes, but I figure that the NHTSA people know what they're doing, and if the imputed values made no sense, they would've done something about it. That said, it would be interesting to see some confidence-building exercises to give a sense that the imputations make sense. (Or maybe they did this already; I didn't look at the report in detail.)
Benjamin Kay points to this:
I came across this paper by Sanford Gordon, Catherine Hafer, and Dimitri Landa, who write:
Do individuals give political contributions simply because they derive an expressive or other consumption benefit from doing so? Or are they attempting to influence policy outcomes? If the consumption view is correct, then political donations are just another means by which citizens participate in the political process (unequal to be sure), and need not imply improper or undemocratic influence. In contrast, donation decisions that are driven by an investment motivation, especially when they are made on behalf of small but economically powerful minority interests, naturally raise concerns about the possibility of an undemocratic exchange of policy for dollars. We [Gordon et al.] propose a strategy to distinguish investment and consumption motives for political contributions by examining the behavior of individual corporate executives. If executives expect contributions to yield policies beneficial to company interests, those whose compensation varies directly with corporate earnings should contribute more than those whose compensation comes largely from salary alone. We find a robust relationship between giving and the sensitivity of pay to company performance, and show that the intensity of this relationship varies across groups of executives in ways that are consistent with instrumental giving but not with alternative, taste-based, accounts. Together with earlier findings, our results suggest that contributions are often best understood as purchases of "good will" whose returns, while positive in expectation, are contingent and rare.
The empirical part of the paper looks cool--I have no experience looking at this sort of data and so can't really say anything beyond "it's cool." (Well, I will say that I'd like to see a scatterplot to make it clear at a glance what their data are saying.) But I do have some thoughts on the general framework. They consider political contributions as "consumption" or "investment"--which, as far as I know, follows the mainstream of the discipline, but I have a problem with this approach.
I just don't really see the clear distinction between "consumption" and "investment" in this context.
If someone is contributing from an "expressive or other consumption benefit," presumably this person is giving to the candidate whose policies he or she favors. (Perhaps there are some people who give to the other side for reputational reasons, for example an oil company executive who happens to be a Democrat might give to a Republican so he won't stand out in the crowd, or a college professor might donate to Obama to fit in, even if he's actually a McCain supporter. Or maybe it could go the other way too, that someone would donate $20 to the other side just to get a reputation for being unorthodox. But I imagine this sort of thing represents only a very tiny minority of contributions.) Conversely, someone who's donating as an investment probably thinks that his or her candidate is good for the country as a whole. As the authors note, the translation of unequal financial resources to unequal political resources is a potential distortion of the democratic process--I just don't understand this distinction, especially in light of the fact that voting and small-dollar political contributions are rational to the extent that the voter or contributor believes that his or her preferred candidate will benefit the general good.
The 2008 Democratic primary brings to mind a similar contest in 1972, where an experienced champion faced an exciting young challenger. I'm speaking, of course, of the world chess championship, where Bobby Fischer, down 2 games to zero, destroyed Boris Spassky and unequivocally established himself as the best player in the world.

The Clinton-Obama contest has led to confusion: Obama has basically won the election in the sense of being on track to get more than half of the delegates. In that case, how can Hillary Clinton retain the support of 40% of Democrats nationwide? And how did she manage to win Pennsylvania?
John Sides presents some data backing up the standard political science view that news blips are not so important in determining election outcomes in two-candidate races.
Ubs links to a Wall Street Journal column by John Yoo on problems with the Democrats' presidential nominating procedure. Before going into the details of how Yoo makes a botch of election history, Ubs writes, "I'm not accusing Yoo of being ignorant of history. I know he's a well-educated man, and his words in this column strongly suggest he knows exactly what he's talking about. In spite of that, he somehow manages to turn history upside-down so that it seems to mean exactly the opposite. How one does that, other than out of ignorance, I don't know. Outright deceit? A lawyerly disregard for anything but advocacy? I'm definitely accusing him of something, I'm just not sure what."
My take on this is slightly different: I'm guessing that Yoo is like a lot of people who, once they take a side on an issue, quickly slip toward the assumption that all the facts automatically support their position. As a statistician, I'd like to think I'm particularly aware of the general issue of discordant evidence. (To take Yoo's example, just because a particular nominating system might be bad, you don't have to think that it's bad in all cases--this is what seems to have led him astray in his discussion of the 1824 election, as Ubs discusses in detail.) In contrast, a lawyer may be trained more to brush aside or not even notice details that contradict his main story. Perhaps this is even more true of a lawyer such as Yoo who is famous for writing opinions that are kept secret.
The unwillingness to accept discordant evidence is not unique to lawyers, of course. Hal Stern once told me about how, in the classic book on racetrack betting, Dr. Z's examples were set up so his system always won. As Hal pointed out, no system will win all the time--all that's required is that it beat the track's 18% edge or whatever--but in a narrative it's disturbing to see counterexamples (unless they're clearly swallowed up into an "it's all right at the end" narrative).
Anyway, that's just a longwinded way of saying that I don't think Yoo was necessarily being deceptive or malicious here. First, I think he probably is somewhat ignorant of the details of elections from the early 1800s (after all, so am I, and I'm a political science professor specializing in American politics); second, he can be falling into the unfortunate but common habit of just assuming that his argument, if correct, must hold in 100% of cases.
But, why?
The more interesting question to me, though, is something that Ubs doesn't ask, which is why did Yoo write this Wall Street Journal column at all? With all his notoriety, wouldn't he be better off keeping his head down rather than writing partisan articles that bring his name further to attention? After all, he's not an expert on elections (at least, I can't find any research by him on the topic), so presumably he could've recommended that someone else write that article. Why would he stick his head up like this and make himself a target?
Here my theory is that Yoo has fully gone through the mirror at this point and has emerged as a political activist. As an academic researcher, you have to be careful of what you say, lest it affect the reputation of your scholarly efforts. Thus the endless qualifications that I and others resort to in all our published work.
To elaborate further: I'm not talking about mistakes. Researchers of all levels of ability make mistakes. Yoo's example seems different--the issue is not so much that he made some errors in his column, but that he stuck his neck out by writing a column on a topic where he's not an expert, and then made the mistakes. It just seems so unnecessary to me.
But, and here the metaphor of the "looking glass" comes in: All of us who are applied researchers have mirror images in the public sphere, where our work--or distorted versions of our work--becomes more widely known. Many of us want to publicize our work--to write Wall Street Journal op-eds, as it were--partly just to make our work more widely known, partly to present our work the way we think it should be presented, and partly to position ourselves to be more likely to promote our future work. But in doing that we have to protect our research reputations. At some point, though, the publicity or advocacy becomes the point, rather than the research itself. For Yoo, perhaps his reputation as a researcher is so politicized at this point that there's nothing left to protect. At this point, he might as well go for it and develop a name for himself as a freelance editorial-page writer?
As a researcher, I envy newspaper columnists' opportunity to have their writings immediately read by millions of people. At the same time, I assume they envy my ability to spend as much time on in-depth research projects as I would like. On the occasions that I try to write something for a broad readership, I'm careful to protect my viability (as Bill Clinton might say) as a researcher. I wonder if Yoo has decided that the choice has already been made for him.

