November 2008 Archives

More on red/blue/rich/poor in 2008

| 3 Comments

After this, here's more, again from exit poll crosstabs that Jared pulled off the CNN website:

Difference in McCain vote share, comparing people in each state with family incomes over and under $50,000 (thus, states that are high on this graph are those where richer people were much more Republican than poorer people):

incomevoting3.png

The same graph, but for whites only (following Larry Bartels's suggestion):

incomevoting4.png

As before, the states are colored as red or blue where McCain or Obama won by more than 10% of the two-party vote, and purple for the states in between.

Lots of interesting patterns here.

Fitting a model with constraints

| 6 Comments

Chris Chatham writes:

I am using multilevel logistic regression to model individuals' abilities to 'stop' a planned motor movement (my binary outcome), based on the delay between the beginning of the trial and the occurrence of the stop signal (my input with 4 different values). As an a priori assumption, I'd like to specify that the fitted model predict perfect 'stopping' when the stop signal is provided at 0 delay and no 'stopping' whatsoever when the signal is provided at each individual's maximum reaction time. While these particular delays were not tested, the assumptions are sound; my question is how I can ensure that the model fits these assumptions without including some arbitrary number of made-up observations in my dataset?

My reply: As a person who has difficulty in suppressing motor movements, I'm interested in your example. Getting to the statistics, I can't understand enough of what you're saying to give a direct answer, but more generically I'd say try to avoid any hard constraints such as zero intercepts or sharp cutoffs at maximum reaction time. I'd first fit the model straight with no such constraints. Then if the constraints are consistent with your inferences, you could consider setting certain parameters to zero (perhaps in this case you'd be setting main effects to zero while letting interactions vary).

Bill Harris writes:

Could it be that liberals / Democrats are more likely found in regions of high population density and that conservatives are more likely found in regions of low population density?

Is it credible that low population density encourages people to think they control their own destiny (there are fewer around to help them) and that there are few limits to their growth (those in Alaska probably run up against fewer obvious limits to growth than those in Connecticut)? Or does the causality work the other way? Are conservatives generally attracted to areas of lower population density?

Is it credible that high population density encourages people to think of the limits to growth and the need to get along with others in ways that seem liberal or even socialistic to their low-density fellow citizens?

Could that explain why Europe, for example, tends to be more liberal than the US and more likely to support environmental concerns? Could that explain why the coasts of the US (at least in the cities) seem more liberal than the middle of the country?

My reply:

Yes, I think there may be something to this. You might want to look at the work of Jonathan Rodden, who has been looking into the persistent liberalism of city-dwellers.

Following all the rules

| No Comments

I had this book sitting on my shelf for a while--I must've bought it used at some point--called Keys to the City: Tales of a New York City Locksmith. I recently brought it with me on a trip and read it on a plane. The book was ok and had blurbs from David Sedaris and the New York Times, but . . .

At his fun and informative website, Sam Wang writes:

The right tool for thinking about this is the statistics of the binomial distribution, which describes the distribution of all possible outcomes in a two-choice situation with fixed probability p.

I know that many people think this, but after years of work in this area I have concluded that the binomial distribution is essentially never appropriate for studying elections.

It's an old, old story but always worth hearing again, this time from Kevin Carey:

Name that tune . . . in 8 words

| No Comments

I was in a bookstore the other day and picked up Richard Ford's most recent sequel to The Sportswriter, opened to a random page, and read the following sentence:

I'm eager to go, though still light-headed.

That's just so Bascombe! It's amazing how one sentence selected at random captures the style so well.

Red/blue/rich/poor: 2008 update

| 4 Comments

In our book, we discussed how the rich-state, poor-state divide was larger among the rich than the poor--or, to put it another way, how rich people in states such as Mississippi are much more Republican than poor people in Mississippi, but rich people in Connecticut do not vote so differently from poor people in Connecticut.

What happened in 2008? From the exit poll data at the CNN website, we get:

3states1.png

On the logarithmic scale:

3states2.png

The x-positions of these lines are in different places because Mississippi and Connecticut got small samples and CNN didn't post the percentages for some of the extreme categories which had small n's.

Here are three states ranging from Texas (strongly Republican) to Florida (battleground) to California (strongly Democratic). Texas actually has a higher per-capita income than Florida, but here are the exit poll data in any case:

3states3.png

The more systematic thing to do is to look at all 50 states. In each, I took McCain's share of the two-party vote for each income category where we had data, then regressed it on the category numbers (which we originally numbered 1 through 8 and then standardized to have mean 0 and standard deviation 0.5). I then plotted these regression coefficients on a graph along with state income:
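As a rough illustration of the per-state regression described above, here is a minimal Python sketch. The vote shares are made up for illustration (they are not the CNN exit poll numbers), and it assumes the category numbers are standardized with their unweighted mean and standard deviation, which the post does not specify.

```python
import numpy as np

# Income categories as numbered in the post (1 = lowest, 8 = highest).
categories = np.arange(1, 9)

# Assumption: standardize with the unweighted mean and sd of the category
# numbers, so that z has mean 0 and standard deviation 0.5.
z = (categories - categories.mean()) / (2 * categories.std())

# Hypothetical McCain two-party shares for one state.
# NaN marks a category for which no percentage was posted.
mccain_share = np.array([np.nan, 0.35, 0.42, 0.48, 0.55, 0.60, 0.63, np.nan])

# Regress the share on the standardized category, using reported categories only.
observed = ~np.isnan(mccain_share)
slope, intercept = np.polyfit(z[observed], mccain_share[observed], deg=1)

print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")
```

Because z has standard deviation 0.5, a one-unit change in z spans two standard deviations of income category, which is why the slope can be read roughly as the rich-minus-poor difference in vote share.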

incomevoting1.png

The y-scale of the graph roughly represents McCain's vote share among the rich minus his share among the poor, within the state. We see the familiar pattern from our book, that the association of rich with Republican holds everywhere but is strongest in poor states. The states are colored as red or blue where McCain or Obama won by more than 10% of the two-party vote, and purple for the states in between.

But there's a potential problem here, as illustrated by the Mississippi-Connecticut pattern above. The data from Mississippi are more at the low end of income, and the data from Connecticut are more at the high end. We already know that the relation between income and Republican voting flattens out at higher incomes, and so maybe Connecticut's flat slope arises just because we're taking its numbers from the flatter part of the curve.

To correct for this, for each state we take the regression plotted above, then we fit the same regression to the same range of incomes from the national exit poll, then we add back in the full regression of the national poll using all eight income categories. The result is a quick estimate of what the entire difference between rich and poor would be in the state, if we were to have sufficient data from all eight income categories within each state.
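One way to read this correction is: adjusted slope = state slope, minus the national slope over the same income range, plus the national slope over all eight categories. The sketch below follows that reading; the function names and all the numbers are hypothetical, and the original analysis may have differed in its details.

```python
import numpy as np

def slope(z, share):
    """Least-squares slope of vote share on standardized income category,
    using only the categories where a share was reported (non-NaN)."""
    ok = ~np.isnan(share)
    return np.polyfit(z[ok], share[ok], deg=1)[0]

def adjusted_slope(z, state_share, national_share):
    """State slope, minus the national slope over the state's observed
    income range, plus the national slope over all categories."""
    ok = ~np.isnan(state_share)
    national_restricted = np.where(ok, national_share, np.nan)
    return (slope(z, state_share)
            - slope(z, national_restricted)
            + slope(z, national_share))

categories = np.arange(1, 9)
z = (categories - categories.mean()) / (2 * categories.std())

# Hypothetical national shares that flatten out at higher incomes.
national = np.array([0.30, 0.36, 0.42, 0.47, 0.51, 0.53, 0.54, 0.55])

# Hypothetical state with data only in the upper income categories.
state = np.array([np.nan, np.nan, np.nan, 0.40, 0.42, 0.43, 0.44, 0.44])

print(f"raw state slope:      {slope(z, state):.3f}")
print(f"adjusted state slope: {adjusted_slope(z, state, national):.3f}")
```

A sanity check on this reading: a state whose reported shares exactly match the national numbers over its observed range gets an adjusted slope equal to the full national slope.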

And here's the result:

incomevoting2.png

A few of the southern states on the left part of the graph have high rich-poor voting differences (even after controlling for the range of incomes where the comparisons were being made), but the overall pattern of rich and poor states isn't so strong.

Further thoughts:

1. Larry Bartels comments that if you only look at whites, the rich voter, poor voter pattern is similar in rich and in poor states. So one of our main findings from the Red State, Blue State book from the 2000 and 2004 elections did not persist in 2008.

2. Boy do I want the raw exit poll data so I don't have to screw around with these artificial missing data problems.

3. I also want some pre-election poll data. The exit polls were so screwed up this year, I don't fully trust anything based on exit poll data alone.

Here's something else from Frank Morgan:

The Prime Number theorem says that the probability P(x) that a large integer x is prime is about 1/log x. At about age 16 Gauss apparently conjectured this estimate after studying tables of primes. Greg Martin suggested to me [Morgan] the following heuristic way to approach the same conjecture, which appeared in my Math Chat column on August 19, 1999:

Suppose that there is a nice probability function P(x) that a large integer x is prime. As x increases by \Delta x = 1, the new potential divisor x is prime with probability P(x) and divides future numbers with probability 1/x. Hence P gets multiplied by (1 - P/x), so

\Delta P = - P^2/x,

or roughly

P' = - P^2/x.

The general solution to this differential equation is P(x) = 1/log cx.

Interesting. Every now and then when I've been stuck in a boring meeting, I've amused myself by trying to come up with a heuristic derivation of the prime number theorem but never with any success.
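For readers who want the last step of the heuristic spelled out, the differential equation can be solved by separating variables. This is a worked sketch added for clarity, not part of Morgan's column; C and c are arbitrary constants of integration.

```latex
\begin{align*}
\frac{dP}{dx} &= -\frac{P^2}{x} \\
-\frac{dP}{P^2} &= \frac{dx}{x} \\
\frac{1}{P} &= \log x + C = \log(cx), \qquad c = e^{C} \\
P(x) &= \frac{1}{\log(cx)}
\end{align*}
```

For large x, log(cx) = log x + log c is dominated by log x, which matches the 1/log x estimate in the prime number theorem.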

Frank Morgan is a wonderful teacher. I took a course from him in college and was impressed by his ability to help students of varying ability levels. (This was MIT so I guess the abilities were all on the high end, but still I think this is a challenge in any group.)

A few years ago I invited Frank to come to give a seminar on teaching to the mathematics and statistics departments at Columbia. One message I got from his talk was that much of teaching success comes from hard work. For example, every semester Frank would put the names and photos of his students on flash cards and memorize who was who.

This sort of thing was impressive to me. Any expert can demonstrate how great he is, but it takes someone very special to convey that anyone could achieve that level of success by just working hard.

That said, hard work is not enough. For example, statistics T.A.'s often spend dozens of hours preparing elaborate handouts for their students; this is almost always (in my impression) a waste of time. Better to adapt to the textbook, I think.

Anyway, I noticed this note by Frank on how he helps students prepare for their senior presentations:

At Williams every senior math major chooses a faculty advisor and gives a 35/40-minute colloquium talk. Since we currently have over fifty senior majors, this keeps us pretty busy, but we think it well worth the effort.

Here is how I like my advisees to prepare, starting a month before the talk and consulting with me every day or two . . .

Every day or two . . . that's impressive!

Blog upgrade from MT 3.3 to MT 4.2

| 4 Comments

We have upgraded the blog software from MT 3.3 to MT 4.2. There might be some hiccups, but we hope to have it operational as quickly as possible. Let us know if there are any problems!

Visualizing election polls

| 3 Comments

A colleague points me to these supremely ugly pie-like graphs by Richard Riesenfeld and Geoff Draper. On the other hand, who am I to say they're ugly? I'm sympathetic to the goal of "exposing complex relationships that are not obvious by usual methods of statistical analysis." And it's hard to argue with "Eighty-eight percent said they enjoyed using the software and 71 percent completed all the tasks without errors." I've certainly never performed such an evaluation of my own graphical methods, instead relying, Tufte-like, on my introspective judgment.

The score

| 3 Comments

Occasionally I post comments here on other people's books or articles, and sometimes I email the authors to get their feedback. Here's the score:

Responded:

John Clute
Richard Florida
Malcolm Gladwell
Sander Greenland
Daniel Gross
Mickey Kaus
Paul Krugman
Andrew Leonard
John Lott
Jay Nordlinger
Andrew Oswald
Ed Park
Steve Sailer
John Seabrook
Nassim Taleb
Josh Tenenbaum

Did not respond:

Robert Frank
Satoshi Kanazawa
George Packer
Russ Alan Price
David Runciman

I think I've missed a few here (in both categories). Also, some people I'm still waiting to hear from, and some respond but not in a useful way.

P.S. I just noticed: all these people are male (and most are white)! I'll have to diversify a bit!

Political engagement on the web

| No Comments

The Compete Blog (which posts a wealth of interesting data charts mined from monitoring web surfers) posted statistics about the proportion of web surfers who visit political websites:

webpolitics.png

Colorado, Connecticut and New Jersey are at the top. Colorado was a battleground state.

A question about the youth vote

| 1 Comment

Shivaji Sondhi writes:

I had a question for you about the youth vote. What are its ethnic and red/blue composition? The reason I ask is that I was trying to integrate the apparently growing Democratic dominance in this segment with various other beliefs I have seen expressed, e.g.,

a) that red states have larger fertility (affordable family formation or whatever)

b) that families have an impact on the political beliefs of children (more than educators, as educators insist - at least at the college level, I haven't really seen a discussion of school teachers) which would then provide a mechanism for (a) to affect voting share to the right of the spectrum

c) that the minorities form a growing share of the young which would tilt the playing field to the left.

My reply:

1. I don't yet have raw survey data. The exit polls on the web do break down the vote by age and race. Among blacks, Obama won about the same among all age groups. Among Hispanics, Obama did 8% better among the young than the old, and among whites, Obama did 14% better among the young than the old.

But . . . if you believe the exit polls (which I don't, completely), there was an interaction between age and race: many more of the young voters were ethnic minorities. Among blacks and Hispanics, there were three times as many under-30's as over-65's. (By comparison, among whites, there were more old voters than young voters.)

So the age effect partly arose from lots of young ethnic minorities coming out to vote.

2. People do tend to vote like their parents--children of Republicans are, on average, more likely to vote Republican--but cohort effects go on top of this. The recent economy and George W. Bush's approval ratings aren't likely to make the Republican Party popular with young people--especially those who are ethnic minorities. Any differences in birth rates between states are small compared to these big political swings, which are not just about Obama; see this graph from 2006:

27-4.gif

Malcolm Gladwell recounts the story of Sidney Weinberg, a kid who grew up in the slums of Brooklyn around 1900 and rose to become the head of Goldman Sachs and well-connected rich guy extraordinaire. Gladwell conjectures that Weinberg's success came not in spite of but because of his impoverished background:

Why did [his] strategy work . . . it's hard to escape the conclusion that . . . there are times when being an outsider is precisely what makes you a good insider.

Later, he continues:

It’s one thing to argue that being an outsider can be strategically useful. But Andrew Carnegie went farther. He believed that poverty provided a better preparation for success than wealth did; that, at root, compensating for disadvantage was more useful, developmentally, than capitalizing on advantage.

At some level, there's got to be some truth to this: you learn things from the school of hard knocks that you'll never learn in the Ivy League, and so forth. But . . . there are so many more poor people than rich people out there. Isn't this just a story about a denominator? Here's my hypothesis:


Pr (success | privileged background) >> Pr (success | humble background)

# people with privileged background << # of people with humble background


Multiply these together, and you might find that many extremely successful people have humble backgrounds, but it does not mean that being an outsider is actually an advantage.
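To see how the denominator can dominate, here is a small arithmetic sketch in Python. All four numbers are invented purely for illustration; nothing here is estimated from data.

```python
# Hypothetical inputs, chosen only to illustrate the argument.
p_success_privileged = 0.01    # Pr(success | privileged background)
p_success_humble = 0.001       # Pr(success | humble background)
n_privileged = 1_000_000       # number of people with privileged background
n_humble = 50_000_000          # number of people with humble background

# Expected number of very successful people from each group.
successes_privileged = p_success_privileged * n_privileged
successes_humble = p_success_humble * n_humble

# Share of very successful people who come from humble backgrounds.
share_humble = successes_humble / (successes_humble + successes_privileged)

print(f"successful, privileged background: {successes_privileged:,.0f}")
print(f"successful, humble background:     {successes_humble:,.0f}")
print(f"share from humble backgrounds:     {share_humble:.0%}")
```

With these made-up numbers, most of the very successful come from humble backgrounds even though each privileged person is ten times as likely to succeed.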

Here's more from Gladwell's article:

The NY Times has a good article on the state of recommender systems: "If You Liked This, Sure to Love That". This is a description of one of the problems:

But his progress had slowed to a crawl. [...] Bertoni says it’s partly because of “Napoleon Dynamite,” an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and tries to predict how any given Netflix user will rate them, he’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” he’s off by an average of 1.2 stars.

The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. [...] It’s the type of quirky entertainment that tends to be either loved or despised.

And here is the stunning conclusion by fortunately anonymous computer scientists:


Some computer scientists think the “Napoleon Dynamite” problem exposes a serious weakness of computers. They cannot anticipate the eccentric ways that real people actually decide to take a chance on a movie.

Actually, computers do quite a good job modeling probability distributions for the more eccentric and unpredictable among us. Yes, the humble probability distribution, the centuries-old staple of statisticians, is enough to model eccentricity! The problem is that Netflix makes it hard to use sophisticated models: the scoring function is the antiquated and not just pre-Bayesian but actually pre-probabilistic root mean squared error, or RMSE. For all practical purposes, the square root in RMSE is a monotonic transformation that won't affect the ranking of recommender models, and we can drop it outright.

So, if one looked at the distribution of ratings for Napoleon Dynamite on Amazon, it has high variance:
napoleondynamite.png

On the other hand, Lethal Weapon 4 ratings have lower variance:
lethalweapon4.png

If we use the average number of stars as the context-ignorant unpersonalized predictor (which I've discussed before), ND will give you a mean squared pain of 3.8, and LW4 will give you a mean squared pain of 2.7. Now, your model might choose not to make recommendations with controversial movies - but this won't help you on the Netflix Prize - you're forced to make errors even when you know you're making them. (R)MSE is pre-probabilistic: it gives no advantage to a probabilistic model that's aware of its own uncertainty.
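The link between polarization and error can be made concrete: when you predict every rating with the average, the mean squared error is just the variance of the ratings. Here is a minimal Python sketch with two made-up rating histograms (these are not the Amazon counts shown above, and the function name is hypothetical).

```python
import numpy as np

stars = np.arange(1, 6)

def mse_of_mean_predictor(counts):
    """Mean squared error from predicting every rating with the average rating."""
    counts = np.asarray(counts, dtype=float)
    mean = np.average(stars, weights=counts)
    return np.average((stars - mean) ** 2, weights=counts)

# Hypothetical counts of 1-, 2-, 3-, 4- and 5-star ratings.
polarizing = [40, 10, 10, 10, 30]   # loved or despised
consensus = [5, 10, 20, 40, 25]     # most people roughly agree

print(f"polarizing movie MSE: {mse_of_mean_predictor(polarizing):.2f}")
print(f"consensus movie MSE:  {mse_of_mean_predictor(consensus):.2f}")
```

No constant prediction can beat the mean under squared error, so for the polarizing histogram a large error is unavoidable without personalization, however well the model understands that the movie is polarizing.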

The Earth Institute is looking for applicants for its postdoctoral fellows program, and if you're doing statistics you can work with me. It's a highly competitive program, deadline is 1 December so apply now:

Thanh Nguyen writes:

Could you tell me what is the difference between "uncertainty" and "ignorance" in this theory [of belief functions]? Some authors define "ignorance" as the "uncommitted belief" which is assigned to the whole frame of discernment, others define it as the difference between Plausibility and Belief (Pl() - Bel()). Some authors define value assigned by Belief function for elements as "uncertainty".

I don't know. All I know about belief functions is in my article about the boxer, the wrestler, and the coin flip, which is actually a writeup of something I did 20 years ago. So no new thoughts, unfortunately.

The Future of Data Analysis

| 4 Comments

Introduction A few days ago I was trying to explain the benefits of the Bayesian approach to a physicist who didn't care about the religion of truth and inference but primarily about solving a particular detection problem in particle physics. The probabilistic approach is rather standard and requires little persuasion, but the Bayesian aspect is a level further than the probabilistic approach. So what is the benefit of the Bayesian approach? This posting will attempt to provide several reasons, from the most obvious to the least.

Frequentist Probability Probability is easily justified as a very elegant way of dealing with uncertainty in cases and variables. But probability is not observed directly but instead inferred - as are the parameters, in contrast to observable predictors and outcomes. Frequentists state that the probability should be measured through the gold standard of an infinite sequence of observations, and question the benefit of the Bayesian approach while criticizing the fact that inferring a parameter Bayesianly can yield worse accuracy than their favored method of "estimators" - and a bad prior can totally mess up inference. So why not use estimators if their asymptotic properties are good and the methodology is often simpler than Bayes?

Overfitting Dividing the number of positive outcomes by the number of all outcomes to estimate the probability of the positive outcome is a very simple estimator: it's easy to have enough data to calculate this. But most interesting questions are not as simple: it is not interesting to calculate the probability of getting cancer, and the probability of getting cancer given smoking also requires removing the obvious effect of age. All these additional variables make a model more complicated, and the number of parameters greater. Without care and attention the model can start hallucinating properties that aren't there. The problem is shown in the following picture:

why-bayes.png

If your modeling problem is in the green area, you can happily use estimators or maximum likelihood. If you're entering the yellow area and want to retain some generalization power, you need some sort of regularization, epitomized by L1 and L2 regularization, AIC, feature selection or support vector machines. So why shouldn't we just regularize?

Priors Priors are how a Bayesian would perform regularization. After seeing a large number of regression problems from medical domains, we can safely assign a prior distribution to the size of a regression coefficient, as we have done in our paper. But then, what is the advantage over regularization? A prior is just a distribution of what the parameters should be over a particular category of problems! Isn't this a nice way to formulate regularization?
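One standard instance of this correspondence is that L2 (ridge) regularization of a linear regression gives the same estimate as the posterior mode under independent zero-mean Gaussian priors on the coefficients, with penalty equal to the noise variance divided by the prior variance. The Python sketch below illustrates that with simulated data; the Gaussian prior and all the numbers are assumptions chosen for the example, not the prior used in the paper mentioned above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: few observations relative to the number of coefficients.
n, p = 20, 10
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:2] = [1.0, -0.5]
y = X @ true_beta + rng.normal(scale=1.0, size=n)

sigma2 = 1.0         # assumed noise variance
tau2 = 0.25          # assumed prior variance of each coefficient
lam = sigma2 / tau2  # the equivalent L2 penalty

# Unregularized least squares (the maximum likelihood estimate).
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Posterior mode under beta_j ~ Normal(0, tau2), which equals the ridge estimate.
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

print("norm of least-squares estimate:", round(float(np.linalg.norm(beta_ols)), 3))
print("norm of posterior mode:        ", round(float(np.linalg.norm(beta_map)), 3))
```

The prior pulls the coefficients toward zero, and stating it as a distribution makes explicit where the amount of shrinkage comes from: what we believe about coefficient sizes across similar problems.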

Model Uncertainty The crux of Bayes is in using probability to represent the uncertainty about the Platonic - the model, its parameters, the probability. The Bayesian approach truly starts paying a dividend when there is uncertainty in models and parameters, when we have insufficient data to accurately fit the model. Even if an estimator could rather accurately match the predictions obtained by a posterior, the variance in the posterior allows us to understand when the model can't be fit. To the best of my knowledge, no other methodology can automatically detect such problems.

Another problem that Andrew identified is that there might be situations where the data doesn't match the model very well - and even though there might be lots of data and a relatively simple model - it just doesn't fit, and the posterior will be vague.

Language of Modeling WinBUGS is an example of a higher-level modeling language. Programming languages have been celebrated for improving programmers' productivity: they do not require the programmer to think in terms of individual statements such as SET or JMP but in terms of functions, procedures, loops. Similarly, with Bayesian models we no longer have to think in terms of derivatives and fitting algorithms, but in terms of parameters having distributions and tied together in models. The Gibbs sampler is a general-purpose fitter and proto-compiler. Of course, it's not nearly as efficient as a hand-written optimizer, but in the future tools like the Hierarchical Bayes Compiler (HBC) will create custom fitters given a higher-level specification of the model.

Summary The primary value of the Bayesian paradigm is its formal elegance which allows automation of key problems: probability takes care of unpredictability in phenomena, priors help prevent overfitting by providing outside experience (AI practitioners would refer to it as background knowledge), the use of model uncertainty helps determine the reliability of predictions, and applied Bayesians are beginning to develop model compilers!

Future The theory and practice of data analysis is currently all mixed up among a number of overlapping disciplines: (applied/mathematical/geo/medical/...)statistics, machine learning, data mining, (econo/psycho/bio)metrics, bioinformatics. All of them pursue the same problems with different but qualitatively similar tools, lacking the scale to build tools that would help them get to the next level. It is important to disentangle them. The future of data analysis should lie on these four fronts:


  1. reliable compilers and samplers that will work with large databases, provide reliable sampling (see BUGS, HBC - empowered by the new generation of programming languages such as Haskell)
  2. internet databases intended to manage background knowledge and related data sets, where the same variable and the same phenomenon appear in multiple tables, allowing priors to be based on more than a single data set. Research should be presented as raw data in a standardized form, not as reports and aggregates that prevent others from building on top of the finished work. Too many people are working on the same problems but not sharing the data because of an unsolved issue of the rights of the collectors of data who can only gain credit for publications (see FreeBase, Machine Learning Repository, Trendrr, Swivel, OECD.Stat)
  3. visualization & modeling environments that make it easier to clean and transform data, experiment with models, to present insights, to reduce the amount of time needed to turn data into a model that can be communicated. (see R Project, Processing, Gapminder)
  4. interpretable modeling is important to bring formal models closer to human intuition. It is still not clear how to express the importance of a predictor for the outcome - the regression coefficient is close, yet often confusing. With more powerful modeling frameworks, it is going to be possible to focus on this - worrying not about what one can fit, but instead about model choice, model selection, model language, visual language.

What do you think? What links did we miss?

John Seabrook writes:

There is also little consensus among researchers about what causes psychopathy. Considerable evidence, including several large-scale studies of twins, points toward a genetic component. Yet psychopaths are more likely to come from neglectful families than from loving, nurturing ones.

I'm confused here. If there's a big genetic component, wouldn't it stand to reason that parents of psychopaths are more likely to be neglectful and less likely to be loving and nurturing? So why the "Yet" in the quote above? Or is there something I'm missing?

P.S. in response to commenters: Yes, I agree that it's possible for psychopathy to be largely genetic without parents of psychopaths being much more likely to be neglectful.

What I didn't understand was Seabrook's implication that this would be surprising, the idea that if (a) a trait is genetically linked, and (b) a trait can be (somewhat) predicted by parental behavior, that the combination of (a) and (b) should be considered puzzling. By default, I'd think (a) and (b) would go together.

Kevin Denny writes:

Depressive symptoms are significantly higher amongst left-handed men. While 19% of right handed men report experiencing depressive symptoms for at least a two week period, the figure for left handed men is almost 25%. For women the corresponding percentages are 33% and 36% respectively but the difference is not statistically significant.

The analysis is of "a new large population survey from twelve European countries," a random sample of 27000 non-institutionalized people aged 50 and older. Handedness was classified based on self-reporting, and depression is measured using standard questions. Of the sample, about 7% of men and 6% of women were classified as left-handed.

My only suggestion (beyond reporting fewer significant digits in the tables) is to rescale the depression scale by dividing by two standard deviations; this would allow the coefficients to be interpretable on the same scale as those for the binary outcome (see Table 2).
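Here is a minimal Python sketch of that rescaling, using a simulated score rather than the survey data (the 0-12 range and the function name are assumptions made for the example). The idea is that an evenly split 0/1 variable has standard deviation 0.5, so dividing a continuous variable by two standard deviations puts it on a comparable scale.

```python
import numpy as np

def divide_by_two_sd(x):
    """Rescale a variable by dividing by two of its standard deviations."""
    x = np.asarray(x, dtype=float)
    return x / (2 * x.std())

rng = np.random.default_rng(1)

# Hypothetical depression scores on a 0-12 scale (simulated, not survey data).
depression_score = rng.integers(0, 13, size=1000)
rescaled = divide_by_two_sd(depression_score)

# An evenly split binary variable for comparison: sd 0.5, so two sd equals 1.
binary = np.array([0, 1] * 500)

print("sd of rescaled score:      ", round(float(rescaled.std()), 3))
print("two sd of balanced binary: ", round(float(2 * binary.std()), 3))
```

After rescaling, a one-unit difference in the score corresponds to two standard deviations, which is roughly comparable to the 0-to-1 difference in a binary variable.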

Ben Lauderdale writes:

I [Ben] had this map [see below] on my door for the last week. Based on exactly the same calculation using constant 95% black support and census-proportional representation. The white counties are the ones whose census names didn't match properly with the names used in the library(maps) package in R, I was too lazy to fix them.

ben1.png

Cool. I'd only suggest using light gray rather than heavy black lines between counties; the map as it is overemphasizes the county borders, I think. But I respect his laziness; there's always time later to fix the details.

Ben continues:

[Below are] the state-by-state county share plots for the lower 49, Obama vote share as a function of black population share. V.O. Key's observation that whites who live near blacks in southern states are less positively inclined towards them is *still* visible in several states.

ben2.png

The circle areas are proportional to county voter turnout. (The biggest circle is L.A. county in California, and so forth.)

Ben also had this comment about his map:

It reminded me of something Bob Putnam would say every time someone presented an empirical talk in our Center for the Study of Democratic Politics series during the year he was a fellow here at Princeton: "You should include miles to the Canadian border as a variable in your regression, it is the most important proxy for political culture in America!" At least in the eastern half of the country, he has a point.

Except for New Hampshire and Vermont, I think.

P.S. For graphics enthusiasts, here are some earlier graphs that I gave the thumbs-down on before Ben came up with the 50 plots above:

A colleague was asking for suggestions for teaching a course in the comic novel. Beyond the obvious (Waugh, Wodehouse, Roth, Nabokov), I thought of Our Man in Havana, by Graham Greene. Twain is another obvious call, except that his funny novels are also serious. The funniest non-serious thing I know of by Twain is Adam's Diary, but that's just a short story. We also discussed End Zone by Don DeLillo. And I've also heard that Gulliver's Travels is pretty good; I've never read it. I also think much of The Sportswriter and Independence Day by Richard Ford are hilarious, but I don't think they'd be classified as comic novels.

My latest thought is Little Children by Tom Perrotta. It's an excellent book, though not a great work of art, and that's the point: when teaching a class, maybe it's better to have something where the seams show a little.

P.S. See comments below. Also, Bridget Jones's Diary. And some kids' humor book: not something like Lemony Snicket that's supposed to be good, but something more lowbrow such as Goosebumps or Captain Underpants, to get a sense of what people think is funny. Also, something funny but completely non-novel-like, for example Chris Rock's book. Students can compare how the comic novels differ from the quick jokes.

Drew Conway pointed me to this:

The article entitled, "Bayesian Analysis for Intelligence: Some Focus on the Middle East," was written by Nicholas Schweitzer . . . JIOX provides no information on the essay's origins, but . . . it appears to be a declassified CIA piece written sometime in the 1970's (note mentions of Presidents Asad and Sadat, and Prime Minister Rabin on page one). . . . Schweitzer concludes that in general the Bayesian technique was able to more quickly predict "non-events" (i.e., when no hostilities would occur among Middle Eastern nations) than analysts using only their expertise and intuitions. The research design included no baseline for comparison to an actual event; therefore, we are left wondering if the Bayesian technique described here would be able to predict when something will actually happen. Despite this obvious shortcoming, it is very encouraging to observe the level of sophistication being implemented by CIA analysts some thirty-odd years ago.

I actually participated a couple years ago in an (unclassified) meeting on Bayesian analysis for military intelligence, so I know that these ideas are still out there. My only comment, regarding the Bayesian issue per se, is that the key to good statistical methods is typically making use of relevant information; non-Bayesian methods can also be effective if they can be adapted to use the info that goes into a Bayesian procedure.

Too loose

| 1 Comment

Will Wilkinson interviewed me for Bloggingheads today, and it was a disaster. I was too relaxed and I treated it as a conversation rather than a formal presentation or interview. As a result, I did too much b.s.-ing and too much conversational yapping, and not enough presentation of our research findings. I also said a bunch of things that are interesting or funny in informal conversation but probably come off as obnoxious or off-the-cuff in an interview that can be viewed interactively.

It's too bad, because my Red State, Blue State presentation is fun and informative, and I think the radio interviews I've done (with lengths ranging from 5 minutes to an hour) have gone well also. The three things that threw me off:
1. I've met Will before and I felt comfortable with him, hence too relaxed. Will was an excellent interviewer and gave me many opportunities to explain things; it wasn't his fault that I spouted off too much.
2. I've already spoken with Will about the book and so it was hard for me to remember to start from scratch--the audience won't necessarily be familiar with it.
3. Seeing my image in front of me while I was talking made me extra-focused on not twitching--always a bad thing. In a face-to-face or telephone interview, I usually forget about the twitching after a minute or so. Trying to suppress it takes a lot of mental effort that would be better used to think about my responses.

It would've been better to have some written talking points in front of me to keep me focused. The funny thing is, I did that for my early radio interviews but as I got more used to the format, I started speaking more off the cuff and it was going fine. This was just an interview too far. I had fun while it was happening, but afterward I realized what had gone wrong.

Anyway, it felt good to get this off my chest.

Someone went to our radon site and asked:

I'm thinking of mitigating my basement radon of 7.75 pci/l. It's a parcel slab with a crawl space. Why can't I just install an exhaust fan in the basement? Instead of PVC piping, drilling into the slab and sucking out the air underneath the membrane in the crawl space, etc. I have a high efficiency furnace with a fresh air inlet that wouldn't create negative pressure.

Phil's reply:

I said I wouldn't do more posts on the election, but . . . Eric Rauchway merged our provisional county data with Census numbers on %black and made some graphs, which I played with a little to get the following:

eric1.png

Percent black acts as a floor on Obama's vote share; beyond that, it predicts his vote better in some regions than others.

But really there are two things going on. First, Obama's getting nearly all the black vote; second, depending on the region, whites are voting differently in places with more or fewer African Americans.

Then I had a thought. Obama got 96% of the black vote. If he got 96% in every county--which can't be far from the truth--then we can use simple algebra to figure out his share of the non-black vote in every county. If B is the proportion black in the county and X is the (unknown) Obama vote share among non-blacks, then, for each county,

obama.vote = 0.96*B + X*(1-B)

And so

X = (obama.vote - 0.96*B) / (1 - B)

This is only an approximation--for one thing, it assumes turnout rates are the same among blacks and others--but it can't be too far off, I think. And it leads to the following graph:

eric2.png

(Lowess lines are shown in blue.) None of this is a huge surprise: outside the south, places with more African Americans tend to be liberal urban areas where people of other ethnicities also vote for Democrats; in the south, many African Americans live in counties where the whites are very conservative.

Notes:
1. These graphs are non-blacks, not whites. Some of the variation has to be explainable by the presence of other minority groups.
2. For a few of the southern counties, our estimates of X are negative; that just means that Obama got less than 96% of the black vote there, or there was differential turnout, or some combination of these.
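
The algebra above is easy to apply county by county. Here is a minimal Python sketch, assuming (as in the text) that Obama got 96% of the black vote everywhere and that turnout was the same across groups. The function name and the example numbers are made up for illustration; they are not taken from the county data.

```python
def nonblack_obama_share(obama_vote, prop_black, black_support=0.96):
    """Solve obama_vote = black_support*B + X*(1 - B) for X.

    obama_vote    -- Obama's share of the vote in the county (0 to 1)
    prop_black    -- B, the proportion black in the county (0 to 1)
    black_support -- assumed Obama share among black voters (0.96 in the text)
    """
    if not 0 <= prop_black < 1:
        raise ValueError("prop_black must be in [0, 1)")
    return (obama_vote - black_support * prop_black) / (1 - prop_black)


# Hypothetical counties: (Obama share, proportion black)
examples = [(0.55, 0.10), (0.45, 0.40), (0.35, 0.40)]
for vote, b in examples:
    print(vote, b, round(nonblack_obama_share(vote, b), 3))
```

The last hypothetical county gives a negative X, which is the situation described in note 2: the 96% assumption (or the equal-turnout assumption) cannot hold there.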

Sign in the Chicago L:

Soliciting and Gambling are Prohibited on CTA Vehicles

I had no idea this would be a concern.

Christian Robert writes:

Subject: launch of the Foundation's 2009 postdoc campaign

The Fondation Sciences Mathématiques de Paris is offering fifteen postdoctoral positions in mathematics and theoretical computer science. Lasting one year, possibly renewable, these positions are to be filled starting October 1, 2009, in the research laboratories affiliated with the Foundation.

The call for applications for the postdoctoral program will be open from October 31 to December 17, 2008, on the Foundation's website and in English.

He says they pay well, too! And if you do statistics, maybe you can work with me next year...

John Kastellec made this graph of seats and votes in 2006 and 2008. For each year, the dot is what actually happened and the line is our estimated seats-votes curve based on modeling from the previous election year.

sv1.png

The Democrats did well in both years, but they didn't get as many seats as we would've expected, given their vote share. As I've already discussed, the Democrats' 56% share of the average district vote was pretty impressive, a 5.7 percentage point gain since 2004:

adv.png

But the Democrats performed less well than expected in converting votes to seats. This explains to me why Charlie Cook et al. felt that the Democrats' performance was disappointing. At the level of voters, however (and of public opinion), the party did fine in congressional voting.

I just love these stories.

Michael Herron sent me this article-in-progress by Jonathan Chapman, Jeffrey Lewis, and himself on residual votes in the 2008 Minnesota Senate race. They conclude:

In the Minnesota Senate case there is no doubt that the number of residual votes dwarfs the margin that separates Coleman from Franken. We show using a combination of precinct voting returns from the 2006 and 2008 General Elections that patterns in Senate race residual votes are consistent with, one, the presence of a large number of Democratic-leaning voters, in particular African-American voters, who appear to have deliberately skipped voting in the Coleman-Franken Senate contest and, two, the presence of a smaller number of Democratic-leaning voters who almost certainly intended to vote validly in the Senate race but for some reason did not do so. . . . At present, though, the data available suggest that the recount will uncover many of the former and that, of the latter, a majority will likely prove to be supportive of Franken.

Computational Finance with R

| 3 Comments

Jan Vecer is co-organizing a conference here at Columbia on 4 Dec on computational finance with R. Registration information is at the link.

Modeling growth

| 4 Comments

Charles Williams writes,

In a number of your examples in the multilevel modeling book you use growth as an outcome. I'm doing this in a study of firm growth in the cellular industry. In this setting, we need to control for firm size since firm's propensity to grow is definitely affected by its size. Someone suggested to me that I may have correlation between the size variable and the error term, since size is effectively in the denominator of the growth variable. They suggested using just the numerator of the growth term (subscribers added) as the outcome, since the denominator will be controlled for in the regression.

Have you run into this? Do you agree that there is a potential for bias in using size as a regressor for growth?

My reply: Yes, it makes sense to control for size (at the beginning of the study) in your regressions, probably on the log scale. I'd still use the ratio as an outcome because I think it would help the coefficients be more directly interpretable (which is a virtue in itself and also helps with efficiency if you have a hierarchical or Bayesian model).

"Not a few" = 6?

| 5 Comments

In a discussion of the historic nature of Barack Obama's election, Christopher Hitchens writes, "there were not a few elected black American representatives 40 years ago."

This claim surprised me, so I looked it up. In 1968, there were 5 African Americans in the House of Representatives and 1 in the Senate. This sounds like only "a few" to me! Was Hitchens just confused here, or am I missing something?

P.S. Somebody pointed out that there were black state and local officeholders as well. I guess it all turns on what is meant by "not a few." Blacks were certainly a very low percentage of all U.S. elected officials back then.

Information and application instructions are posted on the ETS Web site at http://www.ets.org/research/fellowships.html. The deadline for applying for the summer internship and postdoctoral fellowship programs is February 1, 2009. The deadlines for applying for the Harold Gulliksen program are December 1, 2008 for the preliminary nomination materials and February 1, 2009 for the final application materials.

Tyler Cowen's recent remark against team players reminded me of my paper a few years ago, Forming Voting Blocs and Coalitions as a Prisoner's Dilemma: A Possible Theoretical Explanation for Political Instability:

Individuals in a committee or election can increase their voting power by forming coalitions. This behavior is shown here to yield a prisoner's dilemma, in which a subset of voters can increase their power, while reducing average voting power for the electorate as a whole. This is an unusual form of the prisoner's dilemma in that cooperation is the selfish act that hurts the larger group. Under a simple model, the privately optimal coalition size is approximately 1.4 times the square root of the number of voters. When voters' preferences are allowed to differ, coalitions form only if voters are approximately politically balanced. We propose a dynamic view of coalitions, in which groups of voters choose of their own free will to form and disband coalitions, in a continuing struggle to maintain their voting power. This is potentially an endogenous mechanism for political instability, even in a world where individuals' (probabilistic) preferences are fixed and known.

Cool jargon, huh? Here's a pretty picture from the article:

pris2.png

And here's a schematic of the reasoning:

pris1.png

Richard Florida writes:

The critical feature of the creative economy is that it makes place the fundamental feature of politics, culture, and economics.

This isn't literally true, at least not in Alabama and Mississippi, where whites went 8 to 1 for McCain and blacks went something like 25 to 1 for Obama. But I think what Florida means is that place is more important than it used to be within demographically defined subgroups of the population (in particular, upper-middle-class whites).

The question is: how to state this hypothesis carefully, how to test it, and how to understand where (in space and time) it's largely true and where it's not. This is an important research project, I think.

In yesterday's blog entry I looked at the swing in congressional voting nationally (House Democrats gained 5.7%, on average, compared to 2004) and by state (compared to 2004, House Democrats gained in nearly every state). My graphs elicited several interesting comments including this from Steve Sailer:

Perhaps the reason that the GOP House losses of seats were considered not so bad compared to 2006 was because in 2008 the Democrats ran up huge turnouts in black-represented Congressional districts, which were already all Democratic?

Let's look at some district-by-district swings, starting in 2002:

congswings.png

Here, I'm excluding uncontested elections and those in which the challenger got less than 10% of the vote; dots indicate incumbents running for reelection, circles are open seats, and red points are those with black representatives as of 2008. (I just pulled the names off the Congressional Black Caucus website and didn't try to go back to earlier years on this.)

What happened? Overall, the Democrats gained a bit in 2004, a lot in 2006, and some in 2008. But we knew that (see the time series plot in the blog entry linked above). We also see a bit of scatter. Beyond this, yes, there are some patterns. In 2006, the Democrats particularly gained in Republican areas--see how those dots in the lower left of the second graph are way above the 45-degree line? In 2008, the swing is more uniform. (In addition, the black Democrats did pretty well in 2008 compared to 2006, but it doesn't seem like a big part of the story.)

Returning to the "How well did the Democrats actually do in 2008" question, I think that one problem is that people are comparing Obama's vote to Kerry's vote but then comparing the congressional Democrats in 2008 to the congressional Democrats in 2006. I think it's more appropriate to compare 2008 to 2004 in both cases. As Paul Krugman put it, "Maybe the reason people don’t see this is that the Democratic House gains were spread over two elections."

P.S. This is about it for now, I think. Time to return to regular statistics posting.

The Department of Statistics at Columbia University invites applications for an Assistant Professor position, commencing Fall 2009. A PhD in statistics or a related field and commitment to high quality research and teaching in statistics and/or probability are required. Outstanding candidates in all areas are strongly encouraged to apply. You should apply before December 1, 2008.

This is cool stuff (by Jeff Lax and Justin Phillips).

Mark Schmitt writes:

The long election cycle featured as many theories about how the election would turn out as there were presidential candidates in those first debates in 2007. Let's give some of the theories a post-final-exam assessment.

He discusses a bunch of things here, but the one that interests me the most is:

Economic Determinism: B. Some political scientists and economists like to remind us that for all the Palin jokes and PUMAs and debate gaffes, elections are pretty simple -- a good economy benefits the party in power; a bad economy creates a change election. There are various models that, ignoring all polls, aggregate and weight economic data to predict the outcome. The best known model is that of Yale's Ray Fair, which predicted an Obama victory with 51.9 percent of the vote, off by just a percentage point. Other models were also accurate.

My comment: Regarding the political science theories, I think "economic determinism" is a bit strong. These models do have other predictors and they also acknowledge error. Also, I know that Ray Fair did this stuff early on, but nowadays I think that political scientists such as Bob Erikson, Chris Wlezien, Doug Hibbs, Jim Campbell, and Larry Bartels are the more serious researchers in this area. If you want to read a whole book about the topic, I recommend Steven Rosenstone's Forecasting Presidential Elections from 1983. "Economic determinism" may look kind of simplistic, but I think the work of Rosenstone and his successors captures important truths.

Voter turnout update

| 1 Comment

Michael McDonald posts his updated estimate of voter turnout. Here's the updated graph:

turnout.png

And here are McDonald's comments. They are interesting from the standpoint of statistical inference as well as politically:

My [McDonald's] revised national turnout rate for those eligible to vote is 61.2% or 130.4 million ballots cast for president. This represents an increase of 1.1 percentage points over the 60.1% turnout rate of 2004. . . .

Postdoctoral opportunity with the Earth Institute

| No Comments

The Earth Institute is looking for applicants for its postdoctoral fellows program:

There's an idea going around that the Democrats turned in a disappointing performance in Congressional races this year. For example, a politically-minded friend of mine of the liberal persuasion wrote: "The election was good news, although the Democrats did not do quite as well in the Senate and House as I expected. Obama did not have very long coattails--given how anti-Republican Americans are these days."

Some of the pros say this too; for example, Charlie Cook writes, "given the strength of the top of the ticket nationally, one might have thought that the victory would have been more vertically integrated. . . . what happened down-ballot was not proportional to what happened at the top."

And Mickey Kaus attributes this to moderate ticket-splitters who, expecting that Obama would win, decided to support Republicans in Congress: "swing voters compensated for the bold, hopeful risk they took on Obama (including for overcoming any race prejudice) by gravitating back toward Republicans in their local Senate and House races."

The only trouble with this theory is that it's not supported by the data. Obama won 53% of the two-party vote, congressional Democrats averaged 56%. The average swing of 5.7% from Democratic congressional candidates in 2004 to Dems in 2008 was actually greater than the popular vote swing of 4.5% from Kerry to Obama.

Let's look at what happened state by state. Here I'm plotting the swing in average district vote in each state, comparing the congressional elections of 2004 to those of 2008, ordering the states by Kerry's share in 2004:

swings1.png

The horizontal blue line shows the average swing of 5.7%. The Democrats gained in nearly every state, with, unsurprisingly, some big swings in some of the small states that have only one or two congressional districts.

Now let's compare this to the state-by-state swing in the presidential vote:

swings2.png

Obama beat Kerry nearly everywhere, fairly uniformly with only a few exceptions--we knew that--but my point here is that Obama's swings weren't quite as large, on average, as the state congressional delegations'.

If you want, you can look at both swings at once:

swings3.png

In the states in the upper left of this graph, the Democrats improved more in the congressional than in the presidential vote; the states in the lower right are those where the Obama-Kerry swing was greater than the Democrats' swing in House races.

There are a lot more states in the upper left than in the lower right. Each state has its own story--for example, I wouldn't attribute Don Young's squeaker in Alaska to Barack Obama's coattails--but given the graphs above, I think it's hard to make the case that, overall, the voters were saying No to the Democrats in Congress. On the contrary, congressional Democrats averaged 56% of the vote--their best showing since 1976 (and far more than the Republicans' 52% in 1994).

Here's the story in a map:

swingmap.png

For some historical perspective, here are the Democrats' two-party vote share in presidential elections and average two-party vote in congressional elections since 1946:

adv.png

Presidential voting has been much more volatile than congressional voting (incumbency and all that). This makes the Democrats' 5.7-point gain over two elections even more impressive.

Summary

I think Charlie Cook was closer to the mark when he wrote, "The political environment and momentum that Democrats seemed to have in recent months may have led to an unrealistic set of expectations. In this, perhaps we pundits share some blame." I don't think it makes a lot of sense to consider Obama's 53% "enormously impressive" and congressional Democrats' 56% a disappointment.

The data demolish the idea that voters in 2008 were pulling the lever for Barack but not for the Dems overall (not for "Nancy Pelosi," if you will).

Notes

1. I thank John Kastellec and Jared Lander for gathering the data and sharing their thoughts.

2. I'm counting uncontested House candidates at 75% of the vote (see our earlier article for discussion of this and similar technical issues).

3. We use average district vote rather than total vote because congressional vote totals vary a lot, and we're trying to assess national public opinion (as judged, for example, in Kaus's quote above).

4. The Democrats won resoundingly; this means that the voters preferred them to the alternative; it does not necessarily mean the voters want the specific policies proposed by the Democrats. Recall the Democrats' surprising lack of popular success after 1976 and the Republicans' struggles after their 1994 sweep.

5. I'm talking about public opinion here, not campaign strategy. I'm sure that Democratic leaders were disappointed in their party's performance in key congressional races, especially given their immense financial resources this year. At the level of public opinion, though, the Democrats in Congress outperformed Obama overall and in 38 states--and their swing beat Obama's overall and in 32 states--so I think you'd be hard pressed to argue that the voters were balancing toward the Republicans in congressional voting. This is not to say that the voters have given the Democrats a blank check, but it really was a Democratic swing, not an Obama swing.
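
Notes 2 and 3 describe a specific calculation, so here is a rough Python sketch of it: average the Democratic two-party share across districts, crediting an unopposed winner with 75% of the vote. The function names, the data layout, and the three example districts are hypothetical, made up to show the mechanics; the actual analysis is described in the earlier article.

```python
def average_district_vote(districts, uncontested_share=0.75):
    """Average Democratic two-party vote share across districts.

    districts -- list of (dem_votes, rep_votes) tuples; a zero on one side
                 marks an uncontested race
    uncontested_share -- share credited to an unopposed winner (0.75 in note 2)
    """
    shares = []
    for dem, rep in districts:
        if dem == 0 and rep == 0:
            raise ValueError("district has no two-party votes")
        if rep == 0:                      # uncontested Democrat
            shares.append(uncontested_share)
        elif dem == 0:                    # uncontested Republican
            shares.append(1 - uncontested_share)
        else:
            shares.append(dem / (dem + rep))
    return sum(shares) / len(shares)


def total_vote_share(districts):
    """Democratic share of all two-party votes cast, for comparison."""
    dem_total = sum(d for d, _ in districts)
    rep_total = sum(r for _, r in districts)
    return dem_total / (dem_total + rep_total)


# Made-up districts: two contested races and one unopposed Democrat
districts = [(120000, 80000), (90000, 110000), (150000, 0)]
print(average_district_vote(districts))
print(total_vote_share(districts))
```

In this toy example the two summaries differ because the unopposed race contributes all of its votes to the Democratic total but only a 75% share to the district average, which is the reason given in note 3 for preferring the average district vote.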

Big city Barack

| 17 Comments

This note by Nate inspired me to check the vote swings by county population. I don't have the urban/suburban/rural status of counties in an easily grabbable form (maybe Boris has these and can send to me) and so as something quick I plotted vote swing vs. county population. Actually, I don't have county population right here either and so I used total number of votes in the county in 2004. Many of the large-population counties are urban (such as Los Angeles, the largest); others are major suburban counties. Anyway, here's what we see:

swingspop.png

The blue line is the lowess curve fit to the data. There's a lot of variation--county size is not such a good predictor of swing--but there is indeed a pattern of bigger Obama swings in larger counties. (The counties are already ordered by size so there's no need to use larger circles to indicate larger counties as I did in the plots of county income posted earlier.)

To understand this better, let's break up the data by region of the country. Also, since we're at it, let's look at swings in the past couple of elections as well.

Here are the swings broken up by region of the country for the past few elections. The left column shows 1996/2000, the middle column shows 2000/2004, and the right column shows 2004/2008.

swingspop_more.png

What do we see?
1. The large-county/small-county differential in Obama's gains was particularly strong in the south and did not occur at all in the northeast. For example, Obama won 84% of the two-party vote in Philadelphia--but Kerry got 80% there four years ago. This 4% swing was about the same as Obama's swing nationally. Part of the issue here is that Obama had almost no room for improvement in these places.

2. The pattern of Democrats improving more in large-population counties is not unique to 2008. Gore did (relatively) well in big counties in all regions in 2000.

I got ahold of the county-level election returns from 2008 (as of a few days ago, so lots of precincts missing, but that's what I have to go with for now) and crosstabbed it with county income, dividing the counties into poorest, middle, and upper third, with cutpoints set so that approximately one-third of the U.S. population is in each category.
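
For anyone who wants to try this kind of split, here is a minimal Python sketch of population-weighted thirds: counties are sorted by income and the cutpoints fall where cumulative population crosses one-third and two-thirds of the total. The function name, the tuple layout, the rule of classifying a county by its population midpoint, and the four example counties are all assumptions for illustration, not the code used for the graphs.

```python
def income_thirds(counties):
    """Label each county 'poor', 'middle', or 'rich'.

    counties -- list of (name, income, population) tuples
    Counties are sorted by income and cut where cumulative population
    crosses 1/3 and 2/3 of the total, so each group holds roughly a third
    of the people (not a third of the counties).
    """
    total = sum(pop for _, _, pop in counties)
    labels = {}
    cumulative = 0
    for name, income, pop in sorted(counties, key=lambda c: c[1]):
        # Where the middle of this county sits in the income-ordered population
        midpoint = (cumulative + pop / 2) / total
        if midpoint < 1 / 3:
            labels[name] = "poor"
        elif midpoint < 2 / 3:
            labels[name] = "middle"
        else:
            labels[name] = "rich"
        cumulative += pop
    return labels


# Made-up counties: (name, income, population)
counties = [("A", 30000, 200), ("B", 35000, 200), ("C", 45000, 400), ("D", 70000, 400)]
print(income_thirds(counties))
```

With real data the groups will only be approximately equal in population, since a large county cannot be split across categories.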

What happened in lower-income, middle-income, and rich America?

countyswings.png

Obama did better than Kerry in all three graphs, but he did most uniformly better in the rich counties. (In this and subsequent graphs, the area of the circle is proportional to the number of voters in that county in 2004. It turns out that Obama did the worst, compared to Kerry, in low-population poor counties, so the graphs actually look a bit different if you plot all counties with equal-sized circles.)

These patterns are new to 2008. Checking the corresponding plots from 2000/2004 and 1996/2000, we don't see much of anything different comparing poor, middle-income, and rich counties.

The next step is to break things up by region of the country. Here's what we see:

swings2008.png

In the midwest and west, Obama outperformed Kerry in all sorts of counties. In the northeast, Obama did just a bit better than Kerry (who had that northeastern home-state advantage). In the south, Obama did almost uniformly better in rich counties, also did well in middle-income counties (although less so in Republican-leaning areas), and basically showed no improvement from Kerry in poor counties.

So, region and income are both part of the story here, as we already know from those maps of vote swing by county. These scatterplots are another way to look at it.

What happened in the two previous elections?

Let's take a look at the swings from 2000 to 2004:

Carlin and Louis, third edition

| 3 Comments

Brad Carlin and Tom Louis recently came out with a third edition of their book, originally called Bayes and Empirical Bayes Methods for Data Analysis with a plain green cover, now called Bayesian Methods for Data Analysis with a red cover with graphs on it. In title and appearance they are thus converging to our book. They even use the "Bugs code" and "R code" marginal notation that is in my book with Jennifer (see Carlin and Louis, page 178, for example).

What's fun, though, is how different their book is from ours. I highly recommend that anyone interested in Bayesian statistics buy their book as well as Bayesian Data Analysis. This review focuses on the features of Brad's and Tom's book that differ from ours.

bayesglm in Matlab?

| 3 Comments

I received the following email:

My post-election interview with Kathleen Dunn on Wisconsin Public Radio. It was fun. I blame Zacky for all my coughing.

Steve Sailer writes:

Based on the extremely similar results in 2000 and 2004, I [Sailer] had invented a novel and ambitious theory explaining why American states vote in differing proportions for Republican or Democratic candidates. My Affordable Family Formation theory isn't about who wins nationally, it's about how, given a particular national level of support, which states will be solid blue (Democrat), which ones purple (mixed), and which ones solid red (Republican). . . .

I have to say I prefer a college freshman's plot to yours, Andrew. Although, you did hack it together at 3am after strolling around Grant Park. And drawing the y axis from 0 is a mistake which you didn't make, too.

I wrote here that the red/blue map was not redrawn; it was more of a national partisan swing.

In comment #39 to that entry, Scott de B. wrote: "How else would you define 'redrawing the red/blue map' other than 'a nationwide partisan swing'? By your definition, Reagan didn’t redraw the national map, but if Mondale’s lone state in 1984 had been Alabama instead of Minnesota, he would have."

My first response is that, yes, Obama's national swing was important, but it didn't much change the relative positions of the states. Let's see what happened in 1980/1984:

1980_1984.png

and in 1976/1980 (newly added):

1976_1980.png

These changes were indeed less uniform in their swings as compared to 2004/2008.

I graded this week's homeworks (from chapter 12 of ARM). When I write homework problems, I think about what they will be like to do. I don't think about what they will be like to grade. I'll try to write better homework problems in future books.

Henry posted some great links to voter turnout data and discussions of the topic by Michael McDonald. Henry's graph is here.

Just for fun, I decided to redisplay the information; here is my version:

turnout.png

I've updated it with the latest estimate as of 9 Nov 2008.

Key differences between my graph and Henry's:

1. I go back to 1948, Henry starts at 1980.
2. My y-range goes from 45% to 65%; Henry goes all the way from 0 to 100.
3. Henry's graph labels every election; I label every 20 years.
4. Henry's graph is in gray with many black horizontal lines and a blue line with data; mine is black and white with a line and with data points indicated by dots.

Items 1 and 2 above are the most important, I think: by showing a shorter time range and compressing the y range, Henry makes the changes look less impressive. I understand the rationale for including the whole y-range here, but in this case, since changes are being discussed, and a 5% change is, historically, a big deal, I prefer my graph. I did extend the y-scale out to the [45%,65%] range, though, because I wanted to give a little bit of perspective; it would somehow seem misleading for the data to cover the entire y-range in this case.

In any case, I'm not trying to criticize Henry here; making graphs is just something I like to do, and something I like to think about.

P.S. Below is my (updated) R code, for those of you who want to play at home:

Software update question

| 6 Comments

I was writing something in MS Word on the election and I suddenly noticed . . . all my instances of Obama had that red underline, indicating a misspelling. Oddly enough, neither McCain nor Kerry was flagged in this way. I wonder how long this will last. . .

P.S. "Palin" is also flagged by the spell-checker but "Biden" is ok.

P.P.S. Movable Type is currently flagging Obama, Palin, and Biden, but not McCain or Kerry.

Location, location, location

| 3 Comments

Yes, I'm a nerd. Yes, I'm sitting in a hotel room at my computer typing in data (too early to have anything in downloadable form) and doing scatterplots and regressions. But the hotel room is in Chicago.

I was just in Grant Park . . . it was pretty cool but I couldn't actually hear anything. So I went back to my hotel room and crunched some numbers.

Here are the take-home points:
1. The election was pretty close.
2. As with previous Republican candidates, McCain did better among the rich than the poor. But the pattern has changed among the highest-income categories.
3. The gap between young and old has increased--a lot. But there was no massive turnout among young voters.
4. Obama gained the most among ethnic minorities.
5. The red/blue map was not redrawn; it was more of a national partisan swing.
6. The pre-election polls did well, both for the national vote and for the states.

Here's the full story (with graphs!).

This is sort of silly but I couldn't resist doing a couple hours of programming today. . . . I took Nate Silver's latest simulations and computed the forecast of the national election (popular vote and electoral vote), conditional on various scenarios as of 7pm Eastern time.

The states whose polls close earliest are Virginia, Indiana, Georgia, South Carolina, and Kentucky (and also Vermont, which I'll ignore because of its atypicality).

I worked out a few scenarios, such as the five early states going as expected, McCain doing 5 points better than expected in those states, Obama doing 5 points better in those states, McCain winning Virginia, etc. Also some pretty pictures. For next election I want an interactive widget so people can really play at home, but these offline calculations are a start.
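
The basic idea of conditioning on a scenario can be sketched in a few lines of Python: keep only the simulations consistent with the early returns and summarize the national outcome in that subset. The data layout, the function name, and the three simulations below are hypothetical stand-ins; they are not Nate Silver's actual output or the code behind the linked article.

```python
def conditional_forecast(sims, condition):
    """Summarize the simulations that satisfy an early-returns scenario.

    sims      -- list of dicts, one per simulation, holding Obama's state
                 vote shares under 'states' plus his 'national_vote' and
                 'electoral_votes' (an assumed layout)
    condition -- function taking one simulation and returning True/False
    """
    kept = [s for s in sims if condition(s)]
    if not kept:
        return None  # the scenario never occurs in these simulations
    n = len(kept)
    return {
        "n_sims": n,
        "mean_national_vote": sum(s["national_vote"] for s in kept) / n,
        "mean_electoral_votes": sum(s["electoral_votes"] for s in kept) / n,
        "prob_obama_wins": sum(s["electoral_votes"] >= 270 for s in kept) / n,
    }


# Three made-up simulations, for illustration only
sims = [
    {"states": {"VA": 0.52, "IN": 0.49}, "national_vote": 0.53, "electoral_votes": 350},
    {"states": {"VA": 0.49, "IN": 0.47}, "national_vote": 0.50, "electoral_votes": 265},
    {"states": {"VA": 0.48, "IN": 0.46}, "national_vote": 0.51, "electoral_votes": 280},
]

# Scenario: McCain wins Virginia
mccain_wins_va = conditional_forecast(sims, lambda s: s["states"]["VA"] < 0.5)
print(mccain_wins_va)
```

A scenario such as "McCain does 5 points better than expected in the early states" would just be a different `condition` function, and with a real set of simulations a narrow scenario may leave few draws, which is worth checking through `n_sims`.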

See here for details, or here for the longer article.

A bootstrap by another name

| 3 Comments

Yes, there are topics other than the U.S. election . . . Richard Sperling writes:

I'm having a little problem discerning the difference(s) between the parametric bootstrap and Monte Carlo simulation. I'd appreciate it if you would clarify the distinction.

This reminds me of grad school, when Raghu said that in the future, instead of saying "I took a sample of size n from a normal distribution" or whatever, he'd say "I took a bootstrap of size n . . ." and it would sound so much cooler.
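
For readers with the same question, one common way to put the distinction is that a parametric bootstrap is Monte Carlo simulation from a model whose parameters were estimated from the data. The Python sketch below illustrates that reading with a normal model; the function names and the simulated "observed" data are assumptions for illustration.

```python
import random
import statistics


def monte_carlo_means(mu, sigma, n, n_sims, rng):
    """Monte Carlo: simulate sample means from a normal with given parameters."""
    return [statistics.mean([rng.gauss(mu, sigma) for _ in range(n)])
            for _ in range(n_sims)]


def parametric_bootstrap_means(data, n_sims, rng):
    """Parametric bootstrap: estimate the parameters from the data, then
    simulate replicated datasets of the same size from the fitted normal."""
    mu_hat = statistics.mean(data)
    sigma_hat = statistics.stdev(data)
    return monte_carlo_means(mu_hat, sigma_hat, len(data), n_sims, rng)


rng = random.Random(2008)
data = [rng.gauss(10, 2) for _ in range(50)]   # stand-in for observed data
boot = parametric_bootstrap_means(data, 1000, rng)
print("bootstrap s.e. of the mean:", statistics.stdev(boot))
```

The only difference between the two functions is where the parameters come from, which is why the two terms are so easy to confuse.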

2000/2004

| 1 Comment

I just realized that our maps of states won by Republicans and Democrats by income group (see here, for example, also recently posted by Matthew Yglesias) are from 2000, not from 2004. We also mislabeled these in Plate 3 of the Red State, Blue State book. My bad. Here are the maps and scatterplots based on exit polls in 2004:

6graphs2004.png

Not so different from 2000 (especially when you look at the scatterplots), with the most notable difference being Kerry's strength in New England.

John Kastellec sends in this blog entry by Jay Nordlinger, entitled "Dept. of Enduring Myths":

I’ve just come back from a weekend in Vermont — and here’s how I understand it: Modestly off people — “real Vermonters,” as some people say — are voting for McCain and Palin. Comfortably off people, such as those who own ski chalets, are voting for Obama and Biden. And the following has been frequently noted about the city of my residence, New York: The rich are voting Democratic. And those who work for them — driving cars, cleaning rooms, and so on — are voting Republican.

Yet, when I was growing up, the Republican party was always called the party of the rich, and it still suffers from that label. Over and over, that which I was taught is contradicted by the evidence of my lived experience.

Here are the results from the 2000 and 2004 exit polls:

newengland.png

At a national level, Republicans did much better among the rich than the poor. In New England, the relation between income and voting is weak, with richer voters being slightly more likely to vote Republican. We'll have to see what happens in 2008.

P.S. As statisticians we're taught to rely less on our lived experience and on impressions from a weekend visit to Vermont, and more on random-sample survey data. And that's what I'm doing here. But I have to admit that in many areas of my professional life (for example, in considering strategies for teaching and for research), I rely pretty much only on my lived experience and on the research equivalents of weekend visits to Vermont. Somehow, for things that affect me directly, statistical principles become less important. So I can see how, for a political journalist such as Nordlinger, it can be difficult to discount one's personal impressions. Nonetheless, I hope he can do so.

From S. V. Subramanian and Jessica Perkins.

subu.png

P.S. See John's comment below. He seems to have a good point. More here from Steve Kass.

My bad in not screening this more carefully before posting. In defense of Subramanian and Perkins, they sent me the paper and it was my idea to blog it. They were planning all along to do more systematic analysis of the raw data (which they haven't yet received).

At Red State, Blue State it's about politics, here at Statistical Modeling it's about survey sampling. Was it all based on a sample of size 6?

Rationality of voting, again

| 3 Comments

Dear Mr. Leonard,

A colleague pointed me to your article about our paper on why it is rational to vote. I'm glad you think our article is "pretty funny." We try to be entertaining even in our most serious writings. I agree with your comment that "we don't need a rational choice framework to provide a reason for participating in the process." And, in a world where nobody was making rational choice arguments, our article might not be necessary. But with prominent economic writers such as Steven Levitt telling people that it's irrational to vote, we think our article offers a useful corrective.

Beyond this, we are making a point which I believe you overlooked, which is that if you _are_ voting for rational reasons, then what is rational is to be voting for (perceived) social benefits, not for your own pocketbook. It is indeed irrational to vote if the gain that you're expecting is a potential $300 tax cut or better health insurance for yourself or whatever. But it is _not_ necessarily irrational to vote if your goal is to help the country as a whole.

Yours,
Andrew Gelman

P.S. If you're interested, our longer research article on rational voting is here.

Deception blog

| 1 Comment

I've linked to this before, but it's worth a reminder. Maybe one reason this stuff interests me is that I'm so bad at deception myself.
