Results matching "R"

I've always wanted to write something for the Wichita Eagle . . .

P.S. My proposed title was, "What's the Matter with Kansas? Nothing--and the data prove it." I don't mind the revision but I would always always write "data" as plural!

Mathematics.

Statistics.

Some differences:

- Tao uses more words. This makes sense: he's busy explaining this stuff to himself as well as to his readers. To a statistician, these ideas are so basic that it's hard for us to really elaborate. (Also, I had a word limit.)

- Tao emphasizes that a confidence interval is not a probability interval. In my experience, confidence intervals are always treated as probability intervals anyway, so I don't spend time with the distinction.

- I emphasize that a poll is a snapshot, not a forecast.

- Tao says that the number of polled voters is fixed in advance. I don't think this is exactly true, what with nonresponse.

- Tao fills his blog entry with Wikipedia links. Wikipedia is ok but I'm not so thrilled with it; I'm happy with people looking things up in it if they want but I won't encourage it.

But we're basically saying the same thing. I like how I put it, but I'm sure a lot of people prefer Tao's style. Luckily there's room on the web for both!
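P.S. For anyone who wants the calculation itself, here's a quick sketch in R of the standard poll margin-of-error computation that both posts are describing. The 55 percent and n = 1000 are made-up numbers, not from any particular poll:

p_hat <- 0.55   # observed support in the poll
n <- 1000       # sample size
se <- sqrt(p_hat * (1 - p_hat) / n)
round(p_hat + c(-1, 1) * 1.96 * se, 3)   # roughly 0.52 to 0.58
# Remember: this is a snapshot of opinion at the time of the poll, not a forecast.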

"Data analysis" or "data synthesis?"

See discussion here.

ANOVA and the mixed-model muddle

Rick DeShon writes,

As I read through your discussion paper on the analysis of variance published in the Annals of Statistics in 2005, I became a bit confused about the connections between your notion of parameter batches and prior work on the topic of fixed and random effects. Specifically, I wonder how your approach connects to Nelder's "great mixed model muddle"?

My talks in Toronto

I finished Personal Days

Great ending. And, now that it's over, it reminds me even more of Jonathan Coe. Just one thing is bugging me now: what did the people in that office actually do for work? I mean, I know that it's on purpose that we're not told, but I'm still curious.

Personal Days: The Penultimate Post

I just started the last section of Ed Park's Personal Days--this final section appears to be a long rambling letter of the unreliable-narrator type, such as the one that concludes The Rotters' Club--which reminds me of a particularly asinine passage in the incredibly overrated Gödel, Escher, Bach, which for some horrible reason I remember after nearly thirty years, where Hofstadter writes about how, when you read a book, you know you're coming to the end, which affects your expectations, unlike in real-life stories or in a movie of indeterminate length, where the end can come as a surprise. The natural solution for a book would be to pad it with an indeterminate number of extra pages--not completely blank, of course, that would be too obvious, but with sentences that are clearly different from the main story. Hofstadter fatuously concluded that this would be impossible: to be convincing, the fake story would have to be close enough to the real one that, essentially, it would be part of the main narrative. But that's completely wrong: it would be easy enough to just have an only barely related story at the end, and then when the main story really did end, for example on page 240, the author could just have a paragraph saying, "This is the end of the story. The rest is padding," or something like that. I mean, you're not expecting the reader to look too carefully at the end matter: either it's really part of the book and the reader wouldn't want to lose the suspense, or it's fake matter, in which case the reader would still like to preserve the suspense of the story's actual length.

But that's not what I was planning to write about. What does Personal Days remind me of (besides it being a remake of Then We Came to the End)? The similarly alphabetically structured Kafkaesque office-nightmare story Forlesen, for one thing. Although, oddly enough, Gene Wolfe was a Republican when he wrote that story, I think. The focus is different, though: the office takes up almost all of Forlesen's lifetime, but his family is ultimately what is central and nobody in the office is real to him; in Personal Days, only the office is real; the characters have no families.

My favorite things in Personal Days so far are the management-speak in the Jilliad and the goofy three-syllable restaurant names.

I pretty much couldn't keep the characters straight, even when I was reading the book. But I suspect this is part of the point. We'll see how I feel when I'm all done.

P.S. I am still training myself in writing with precision: two paragraphs above where it says "My favorite things," I originally had the sloppier "The best things." On the other hand, editing a blog entry is almost the definition of a waste of time. On the other other hand, I like to think this keeps me in practice for more important writing efforts.

P.P.S. I think I am ideally qualified to use the term Kafkaesque, having never read anything by Kafka except the first two pages of that story they give you to read in high school, where Gregor Samsa wakes up as a bug. I've read too much Orwell to be comfortable with "Orwellian."

P.P.P.S. Can blogs do hypertext? The Hofstadter digression in the first paragraph above belonged just where it did, but it's a distraction from my main points. I'd like to be able to enter it as some sort of clickable sidebar (without going to the trouble of setting it up as its own blog entry, which I just don't want to do).

Our Cato event from last month will be on Book TV on C-Span2 this Sat, 18 Oct, 7pm, and Mon, 20 Oct, 6am. My presentation has gotten a bit slicker since then, but it's still good stuff, and you also get interesting discussions by Brink Lindsey and Michael McDonald.

Hey, I was right!

See here.

Ed Park is a Democrat

I'm about halfway through Personal Days and I'm pretty sure Ed Park is a Democrat. Or something like that, maybe a Green party member or whatever, but certainly not a Republican. Why? Is it just statistical reasoning (he's a youngish writer who lives in NYC)? I think it's more than that: there's something about the book that screams "Democrat." Not that a Republican wouldn't make fun of corporate culture, but it would be done in a more affectionate, Christopher Buckley-style way.

I'm not saying every artistic-type writer is a Democrat. For example, I don't know anything about David Foster Wallace's politics, but based on what I've read of him, he could've been a Republican. He probably wasn't, but he had that elitist thing going on.

David Mamet, he's a famous Democrat-turned-Republican, but I think it's fair to say that all along he could've been either. Updike's in the middle of the road, Gore Vidal is to the left of the Democrats but I could picture him as a Republican, sort of. . . .

OK, this is getting pretty pointless. . . clearly it's getting too close to the election for me . . . I'll have to finish Personal Days and tell Jeff whether I recommend it. Caroline read one page and said, hey, isn't this just like that other book you read about those people in an office? I said, yeah, but it's a great theme, surely big enough to hold two good books. I showed her the scene with Grime's typos, I'd been laughing aloud at that, but she didn't quite see the point. Perhaps it was only funny after the pages and pages of implicit setup.

Howard Wainer writes:

On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.

The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them fast.

Tyler McCormick, Matt Salganik, and Tian Zheng just wrote this article on using the scale-up method to estimate the size of people's social networks using responses to questions such as "How many people do you know named Kevin?" They build upon earlier work by Bernard, Killworth, McCarty et al. and Zheng, Salganik, and Gelman. This new paper is great; it takes these methods from the "cool" stage to the "useful" stage.
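To give a flavor of the approach: the basic scale-up estimate of a respondent's network size divides the total number of people they report knowing across a set of subpopulations by the total size of those subpopulations, scaled up to the whole country. Here's a toy version in R; all the counts are made up, and this is the simple Killworth-style estimator that the new paper builds on, not its refinement:

N <- 3e8   # approximate U.S. population
pop <- c(Kevin = 1.7e6, Karen = 1.5e6, Keith = 0.9e6)   # hypothetical subpopulation sizes
y <- c(Kevin = 4, Karen = 3, Keith = 2)   # one respondent's "how many do you know?" answers
d_hat <- N * sum(y) / sum(pop)            # estimated network size, here about 660
d_hat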

Red Blue at NYU

I'll be speaking Tues 14 Oct (that's tomorrow) 10am on Red State, Blue State at NYU, at 802 Kimmel Center, 60 Washington Square South. Pat Egan will discuss, and then there will be time for discussion. The talk will be open to the public.

I recently became aware of two papers by David van Dyk on a new approach to Gibbs sampling using incompatible conditional distributions. This seems similar to the parameter expansion or redundant parameter idea developed by C. Liu, J. Liu, Meng, Rubin, van Dyk, and others, but perhaps a bit more generalizable and thus usable in routine problems.

Here's the theoretical paper (with Taeyoung Park).

And here's the more applied paper (which has a logistic regression example), with Hosung Kang.

This looks great, although I'm still not sure exactly how to apply this to our problems. Maybe we're getting closer, though...

Feedback

Aleks sent along this article that suggests that debate-watchers are influenced by crowd noise and feedback graphics:

So-called Bayesian methods

Seth points me to these papers:

John P. A. Ioannidis, Effect of Formal Statistical Significance on the Credibility of Observational Associations, Am. J. Epidemiol. 2008 168: 374-383.

Hormuzd A. Katki, Invited Commentary: Evidence-based Evaluation of p Values and Bayes Factors. Am. J. Epidemiol. 2008 168: 384-388.

John P. A. Ioannidis, The Author Responds to "Evaluating p Values and Bayes Factors", Am. J. Epidemiol. 2008 168: 389-390.

I do not, do not, do not have the energy now to comment on these. Let me just say that what is labeled in the above articles as "Bayesian" is not the only way to do Bayesian statistics. I refer you to Bayesian Data Analysis for exposition of what I consider the more reasonable Bayesian approach, which is based on modeling rather than hypothesis testing and never involves computing the posterior probability that the null hypothesis is true.

I can't stop people from doing these other things and I wouldn't even try. But I would like them to be aware of this other, more direct approach. This paper may also help.

Whassup with the white working class?

A colleague asks,

How do you deal with the following from Alan Abramowitz and Ruy Teixeira's Brookings paper:
Indeed, just how far the Democrat party fell in the white working class' eyes over this time period can be seen by comparing the average white working class (whites without a four year college degree) vote for the Democrats in 1960-64 (55 percent) to their average vote for the Democrats in 1968-72 (35 percent). That's a drop of 20 points. The Democrats were the party of the white working class no longer. . . . Al Gore . . . lost white working class voters in the 2000 election by 17 points. And the next Democratic presidential candidate, John Kerry, did even worse, losing these voters by a whopping 23 points in 2004. One could reasonably ascribe the worsening deficit for Democrats in 2004 to the role of national security and terrorism after 9/11 but the very sizeable 2000 deficit cannot be explained on that basis. Apparently, the successes of the Clinton years, which included a strong economy that delivered solid real wage growth for the first time since 1973, did not succeed in restoring the historic bond between the white working class and the Democrats.

My reply: When you slice things by income, you see a clear pattern of Republicans doing better among the rich of all races (except maybe Asians, but I don't particularly trust those numbers what with small sample size):

[Figure national.png: Republican vote share by income, shown separately by ethnic group]

Compared to earlier years, Democrats have lost among less well-educated voters and gained among the more educated voters, but their income profile hasn't changed so much. As E.J. Dionne has noted, the Democrats' strength among well educated voters is strongest among those with household incomes below $75,000--"the incomes of teachers, social workers, nurses, and skilled technicians, not of Hollywood stars, bestselling authors, or television producers, let alone corporate executives."

So a quick answer is that I don't necessarily see a machinist, say, as having more street-cred than a social worker with a graduate degree who makes the same amount of money. As Larry Bartels has pointed out, it's not so easy to identify exactly what is meant by "working class." There have been changes, but remember that the difference in voting between rich and poor has been as large in the past 10 years as it's ever been; see page 47 of the red-blue book. Yes, it's different rich and poor people than before, but it's still there. It's a mistake to think there was a past golden era of class-based voting. Geographic factors were important in voting decades ago, and they are now as well.

See here for my earlier comments on the Teixeira and Abramowitz article.

Finally, David Park made this graph of the trend since the 1950s of the rich-poor voting gap (the difference between Republican vote share among the upper third of income, minus the Republican vote share among the lower third) in Presidential elections. The gray dots represent all voters, the black dots represent whites only (yes, I know, they should be white dots...).

[Figure whites.png: rich-poor voting gap in presidential elections since the 1950s; gray dots: all voters, black dots: whites only]

The rich-poor voting gap among whites has in recent elections been a bit below its 1970s-1990s peak, but it's far from zero. And, what with increasing diversity in the minority population, it's not so clear that "whites" is as useful a category as it once was.

P.S. More here.

A new cost of living index

Boris passed this along. We've struggled with cost of living indexes (see here, here and here), so maybe this will be helpful.

Red-blue roundtable

Here's a fun discussion (still developing, it'll be going through Thursday, I think) on red and blue America, featuring pollster John Zogby, journalist Bill Bishop, consultant Valdis Krebs, and myself, moderated by Tom Nissley at Amazon.com.

My strategy is to make my points using graphs.

Macartan Humphreys's paper on coalitions

I gotta read this article:

The game theoretic study of coalitions focuses on settings in which commitment technologies are available to allow groups to coordinate their actions. Analyses of such settings focus on two questions. First, what are the implications of the ability to make commitments and form coalitions for how games are played? Second, given that coalitions can form, which coalitions should we expect to see forming? I [Humphreys] examine classic cooperative and new noncooperative game theoretic approaches to answering these questions. Classic approaches have focused especially on the first question and have produced powerful results. However, these approaches suffer from a number of weaknesses. New work attempts to address these shortcomings by modeling coalition formation as an explicitly noncooperative process. This new research reintroduces the problem of coalitional instability characteristic of cooperative approaches, but in a dynamic setting. Although in some settings, classic solutions are recovered, in others this new work shows that outcomes are highly sensitive, not only to bargaining protocols, but also to the forms of commitment that can be externally enforced. This point of variation is largely ignored in empirical research on coalition formation. I close by describing new agendas in coalitional analysis that are being opened up by this new approach.

And also this. And then relate all this to my research on coalition formation as a prisoner's dilemma.

Head over to the Red State, Blue State blog for my post on my new measure of Senator Barack Obama's (and other prominent IL Democrats') ideology from his service as an Illinois state senator (from Hyde Park). It comes from a new research project of mine on state legislative ideology.

Amazon, U.S.A.

Amazon.com has this cool website showing which sorts of political books people are buying in which states:

[Figure amazon.png: Amazon's map of political book purchases by state]

What struck me was the similarity of this to the "voting patterns of the rich" map from our book:

[Figure 3maps.png: the 'voting patterns of the rich' maps from the book]

I wonder what data from Wal-Mart would look like. Maybe like the lower of the two maps? I'm not sure, though, since, even at Wal-Mart, buyers of political books are more politically active and thus maybe more like "rich people" in their red-blue divisions.

There's a lot going on for those of you in the NY/NJ area.

1. On Monday morning I'm doing an activity on the Electoral College. But you can't come to that unless you're a 4th grader in Zacky's school.

2. Monday 4.30pm at room 801 International Affairs Building (at Columbia), I'm speaking on Red State, Blue State in an event cosponsored by the Columbia Journalism School, with discussions by Nicholas Lemann and Thomas Edsall and moderated by Sharyn O'Halloran.

3. Monday 7pm at the Princeton Club in midtown Manhattan, I'm speaking and signing books. You can only go to this one if you're a member of the club, I think.

4. Tuesday 4.30pm at Robertson Hall at the Woodrow Wilson School at Princeton University, there's an event sponsored by the New York and New Jersey chapters of the American Association for Public Opinion Research, featuring Joe Lenski, Chris Achen, Larry Hugick, and myself. After the panel there will be lots of time for informal discussion as well.

Bayes, Bayesians

I can't remember who said this first, and I can't remember if I've already put this on the blog, but the following definition may be helpful:

Every statistician uses Bayesian inference when it is appropriate (that is, when there is a clear probability model for the sampling of parameters). A Bayesian statistician is someone who will use Bayesian inference for all problems, even when it is inappropriate.

I am a Bayesian statistician myself (for the usual reason that, even when inappropriate, Bayesian methods seem to work well).

(The above is perhaps inspired by the saying that any fool can convict a guilty man; what distinguishes a great prosecutor is the ability to convict an innocent man.)

Cool historical maps

Hey, see here for info on a site that has cool interactive electoral vote maps with good historical details. Here's the map for the most important of all presidential elections:

[Figure 1860.png: interactive electoral map of the 1860 presidential election]

Walter de la Mare was a statistician

Cool.

She writes "sox" instead of "socks." What's that all about? Is this an accepted alternative spelling? (I wouldn't quite recommend the book, but it is also interesting in other ways.)

Why do swing states matter?

Hey, I got quoted in the Weekly Reader! Much cooler than the Annals of Statistics.

This is funny. It reminds me of when I was asked to help design a study, and I told the researcher I was upset to be involved in the design. Why? Because the #1 thing that statisticians like to say is, "Sorry, the analysis is really difficult because you screwed up the design." So, if you ask me to help with the design, I lose my best alibi!

"Beyond 'Fixed Versus Random Effects'"

Jeff pointed me to this paper by Brandon "not Larry" Bartels on using multilevel modeling for time series cross-sectional data. I agree with Bartels's recommendations, which are:

- Use a multilevel model to allow intercepts to vary by groups. This is more reliable than estimating intercepts by least squares or not allowing the intercepts to vary at all.
- Also allow slopes to vary. (Bartels doesn't emphasize this so strongly but I think this is important advice also.)
- Include as group-level predictors the group-level averages of important individual-level predictors. This will in many settings capture some of the otherwise unexplained group-level variation, as Joe Bafumi and I discuss.

Bartels also recommends representing individual-level predictors by their deviation from group averages. This is ok but I don't think it's necessary; it depends on the context. For example, if you have a predictor that is 1 if you're African American and 0 otherwise, I wouldn't want to subtract that from its state average. In that case you'd be better off including the individual predictor and state % African American as two predictors in the model. In other settings, Bartels's recommendation to center the predictor within each group makes more sense. Either way, this doesn't affect his main recommendation: fit a multilevel model, including the group averages of important predictors as well.
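For concreteness, here's how these recommendations might look in R with lmer. The data are simulated, and this is just a sketch of the model specification, not Bartels's own code:

library(lme4)
set.seed(1)
d <- data.frame(group = rep(1:20, each = 30))
d$x <- rnorm(nrow(d))
d$y <- 1 + 0.5 * d$x + rnorm(20, sd = 0.7)[d$group] + rnorm(nrow(d))
d$x_bar <- ave(d$x, d$group)   # group-level average of the individual predictor
d$x_dev <- d$x - d$x_bar       # within-group deviation, if you want Bartels's centering

fit <- lmer(y ~ x + x_bar + (1 + x | group), data = d)            # varying intercepts and slopes
fit_c <- lmer(y ~ x_dev + x_bar + (1 + x_dev | group), data = d)  # centered version

Either parameterization includes the group averages as group-level predictors, which is the main point.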

Individual and group-level predictors

Finally, I recommend my 2006 Technometrics paper, "Multilevel (hierarchical) modeling: what it can and cannot do," which begins:

Multilevel (hierarchical) modeling is a generalization of linear and generalized linear modeling in which regression coefficients are themselves given a model, whose parameters are also estimated from data. We illustrate the strengths and limitations of multilevel modeling through an example of the prediction of home radon levels in U.S. counties. The multilevel model is highly effective for predictions at both levels of the model, but could easily be misinterpreted for causal inference.

In particular, see the discussion in Section 2.4 of my paper on the interpretation of a group-level predictor. You have to be careful about calling such coefficients "effects" or interpreting them causally.

Just to let you know things are busy around here . . .

Juan Morales writes:

I am currently fitting a multilevel model to data of fruit removal rates which I model using binomial distributions for the number of removed fruit out of total fruits available. I would like to estimate the proportion of variance explained and the amount of pooling at each level (tree, forest stand and so on). You show how to do such things for the radon example and mention that something similar could be done for generalized linear models using deviances. Has this been done somewhere?

My reply: R-squared for multilevel linear models is discussed in our book and in my paper with Pardoe. I think it would make sense to do this with logistic regression also (perhaps using the latent variable formulation with residual s.e. of 1.6, as Jennifer and I discuss in chapter 5). But I haven't done it yet. A good research paper, I think!
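As a rough sketch of one way this could go (simulated stand-in data for the fruit-removal setup, and treat the 1.6 latent-residual trick as an approximation, not a worked-out method):

library(lme4)
set.seed(1)
stand_eff <- rep(rnorm(10, 0, 0.8), each = 5)   # 10 stands, 5 trees each
tree_eff <- rnorm(50, 0, 0.5)
d <- data.frame(stand = factor(rep(1:10, each = 5)), tree = factor(1:50))
d$available <- rpois(50, 40) + 1
d$removed <- rbinom(50, d$available, plogis(-0.5 + stand_eff + tree_eff))

fit <- glmer(cbind(removed, available - removed) ~ 1 + (1 | tree) + (1 | stand),
             family = binomial, data = d)
vc <- as.data.frame(VarCorr(fit))
v <- setNames(vc$vcov, vc$grp)
v["latent residual"] <- 1.6^2   # latent-variable trick from Gelman and Hill, chapter 5
round(v / sum(v), 2)            # approximate share of latent variance at each level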

Evaluating multi-site interventions

Rajeev sends a link to this paper on hierarchical modeling for evaluating multi-site interventions:

This article discusses the evaluation of programs implemented at multiple sites. Two frequently used methods are pooling the data or using fixed effects (an extreme version of which estimates separate models for each site). The former approach ignores site effects. The latter incorporates site effects but lacks a framework for predicting the impact of subsequent implementations of the program (e.g., would a new implementation resemble Riverside?). I present a hierarchical model that lies between these two extremes. Using data from the Greater Avenues for Independence demonstration, I demonstrate that the model captures much of the site-to-site variation of the treatment effects but has less uncertainty than estimating the treatment effect separately for each site. I also show that when predictive uncertainty is ignored, the treatment impact for the Riverside sites is significant, but when predictive uncertainty is considered, the impact for these sites is insignificant. Finally, I demonstrate that the model extrapolates site effects with reasonable accuracy when the site being predicted does not differ substantially from the sites already observed. For example, the San Diego treatment effects could have been predicted based on their site characteristics, but the Riverside effects are consistently underpredicted.

Seems like a good idea to me. Remember, interactions are important!
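Here's a simulated-data sketch (mine, not the paper's) of the basic idea: site-specific treatment effects that are partially pooled, plus the extra uncertainty you face when predicting the effect at a new site:

library(lme4)
set.seed(2)
d <- data.frame(site = factor(rep(1:8, each = 100)))
d$treat <- rbinom(nrow(d), 1, 0.5)
site_eff <- rep(rnorm(8, mean = 0.5, sd = 0.3), each = 100)   # true site-level effects
d$y <- rnorm(nrow(d), 0.2 + site_eff * d$treat)

fit <- lmer(y ~ treat + (treat | site), data = d)
fixef(fit)["treat"]   # average treatment effect across sites
ranef(fit)$site       # partially pooled site-level deviations

# predictive distribution of the effect at a *new* site, which is where
# the extra predictive uncertainty in the abstract comes from:
beta <- fixef(fit)["treat"]
tau <- attr(VarCorr(fit)$site, "stddev")["treat"]
quantile(rnorm(10000, beta, tau), c(0.025, 0.975))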

What am I reading now?

Marshal Zeringue asked, and I replied:

Mavericks of the past

Phil Klinkner writes:

History doesn’t repeat itself, the saying goes, but it does rhyme.

When I was about 9 years old, I read just about every book of fairy tales in the library. 398.2 in the Dewey decimal system, I remember it well.

Exciting 1% shift!

Brendan Nyhan offers this amusing example of a newspaper hyping poll noise. From the LA Times:

Registered voters who watched the debate preferred Obama, 49% to 44%, according to the poll taken over three days after the showdown in Oxford, Miss.

That is a small gain from a week ago, when a survey of the same voters showed the Democratic candidate with a 48% to 45% edge.

A small gain, indeed.
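A quick back-of-the-envelope calculation in R shows why: with plausible sample sizes, the standard error on a change in margins is larger than the change itself. I'm assuming independent samples of n = 1000 each; the story says it was a panel of the same voters, which would shrink this somewhat, but the point stands:

n <- 1000
se_margin <- function(p1, p2, n) sqrt((p1 * (1 - p1) + p2 * (1 - p2) + 2 * p1 * p2) / n)
se_now <- se_margin(0.49, 0.44, n)      # se of the 5-point margin
se_before <- se_margin(0.48, 0.45, n)   # se of the 3-point margin
sqrt(se_now^2 + se_before^2)            # se of the 2-point "gain": about 0.043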

Interactive graphs of polls

Blogs as places?

Henry Farrell referred here to his blog as a "place." Which seemed funny to me because I think of a blog as a "thing." Henry replied:

That's the way that I [Henry] think about blogs (or at least group blogs and blogs with comments) - places where people meet up, chat, form communities, drift away from each other etc.

My analogy was blog-as-newspaper, the self-publishing idea, and I'm not used to thinking of a newspaper, or even a listserv, as a place. I think there is an aspect of the analogy that I'm still missing.

P.S. See Mark Liberman's thoughts in his blog here.

See here for my failed attempt to construct a political conspiracy theory around Lehman Brothers and the financial crisis.

My blog discussion with Eyal Shahar (see comments #3 and onward here) reminded me of a persistent challenge I face when talking with outsiders about Bayesian statistics.

Laura Wattenberg writes, "in baby naming as in so many parts of life, style, not values, is the guiding light."

Models for cumulative probabilities

Dan Lakeland writes:

I am working with some biologists on a model for time-to-response for animals under certain conditions. The model(s) ultimately are defined in terms of a differential equation that relates a (hidden) concentration of a metabolic product to the (cumulative) probability that an animal will respond within a given time by changing its behavior.
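I don't know the specifics of Lakeland's model, but here's a generic sketch in R of the structure he describes: a hidden concentration C(t) from a differential equation, driving a hazard whose integral gives the cumulative response probability. The functional forms and parameter values are invented for illustration:

library(deSolve)
rhs <- function(t, state, parms) {
  with(as.list(c(state, parms)), {
    dC <- k_in - k_out * C   # hidden metabolite concentration
    dP <- (1 - P) * b * C    # hazard proportional to concentration
    list(c(dC, dP))
  })
}
out <- ode(y = c(C = 0, P = 0), times = seq(0, 24, 0.1),
           func = rhs, parms = c(k_in = 1, k_out = 0.3, b = 0.05))
tail(out)   # P(t) is the cumulative probability of a response by time t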

Maguffin as stone soup

Here.

What's up with Kazakhstan?

Chris Zorn pointed me to this graph and asked for my thoughts. I replied that I'd seen worse, but the use of two dimensions doesn't help, and the comparison to the GDP of Kazakhstan is just weird. I mean, who has any idea what is the GDP of Kazakhstan??

Chris replied,

I'm teaching first-year Ph.D. methods in PoliSci this term, and we have a feature called "Graph of the Day," where -- for five minutes or so at the beginning of every class -- the students all look at and comment on some graph from a paper, the press, etc. I used this one yesterday, and the response (from people with a grand total of three weeks of graduate education) was identical: "What's up with Kazakhstan?", and "Isn't a reference point supposed to be *non-obscure*?"

Mellow liberals and jumpy conservatives

Jamie points out this interesting article by Douglas Oxley et al. that appeared in Science last month. Here's the abstract:

Although political views have been thought to arise largely from individuals' experiences, recent research suggests that they may have a biological basis. We present evidence that variations in political attitudes correlate with physiological traits. In a group of 46 adult participants with strong political beliefs, individuals with measurably lower physical sensitivities to sudden noises and threatening visual images were more likely to support foreign aid, liberal immigration policies, pacifism, and gun control, whereas individuals displaying measurably higher physiological reactions to those same stimuli were more likely to favor defense spending, capital punishment, patriotism, and the Iraq War. Thus, the degree to which individuals are physiologically responsive to threat appears to indicate the degree to which they advocate policies that protect the existing social structure from both external (outgroup) and internal (norm-violator) threats.

I myself am extremely sensitive to sudden noises, so make of that what you will . . . Seriously, though, this seems related to John Jost's work on personality profiles and political affiliation.

Drew Linzer's poll tracker

Here. See here for my thoughts.

Things I saw while waiting for the train

When the sign says the train will be 0:05 late, it won't be 0:05 late. If it were going to be 0:05 late, they wouldn't say anything at all. In reality it will be 0:30 late. But they won't say it will be 0:30 late, because that would mean the train will be 1:30 late.

I got off a good line when I got on the train. I stepped in, saw a retirement-age couple already seated, and asked, Philadelphia? They said, yeah, that's where they're going, but they're not sure either, they hope they're in the right place. I said, yeah, I think this is right. (Pause) And, if there's one thing last week's news has taught us, it's that you can trust a guy in a suit.

That got a laff.

Ads in the Newark train station

A big picture of a hot-dog guy holding a mustard-slathered beauty, next to the words: You Want Cancer With That? Medical research shows hot dogs increase your risk of cancer... [An ad for some law firm that's suing food manufacturers.]

The Retreat at Princeton
Inpatient Alcohol and Drug Treatment for Executives & Professionals
www.RetreatAtPrinceton.com

P.S. I have no idea why, but this particular entry seems to attract a lot of spam!

The following is my discussion of the article, "Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations" by H. Rue, S. Martino and N. Chopin, for the Journal of the Royal Statistical Society:

Statisticians often discuss the virtues of simple models and procedures for extracting a simple signal from messy noise. But in my own applied research I constantly find myself in the opposite situation: fitting models that are simpler than I would like—models that clearly miss important features of the data and, more importantly, important features of the underlying system I am modeling—because of computational limitations.

Red-blue on Wisconsin Public Radio

I'll be talking about Red State, Blue State on the Kathleen Dunn show on Wisconsin Public Radio tomorrow (Tues 23 Sept), from 10-11 Central Time (that's 11-12 Eastern Time). For the second half of the show, you can call in with questions!

Here.

Red State, Blue State in Philadelphia

I'll be speaking on the Red State, Blue State book this Monday (22 Sept) at 4:30pm at the University of Pennsylvania. It'll be at the Annenberg School for Communication, Room 109. The address is 3620 Walnut Street, Philadelphia, PA. This is your chance to ask questions and also to meet some interesting people: the talk is cosponsored by the departments of Statistics, Biostatistics, and Political Science as well as the Annenberg School.

Larry Bartels writes about "the contemporary electoral landscape, which is less volatile and more partisan than it has been at any time in the past half-century or more." Larry's presentation is clean and well illustrated by graphs, adding nicely to earlier discussion of this topic by John Sides.

Larry also has some comments about the problems that can occur when a historian is "moonlighting as a political scientist." Which reminds me of my own rants:

Every four years, some hardworking and enterprising journalists do some digging around in the political science literature, talk with some people who sound like they know what they're talking about, and then resurface to tell the world about the counterintuitive finding that the Electoral College actually benefits voters in large states.

Well, as I like to say to my social science students: Just 'cos it's counterintuitive, that don't make it true.

The Electoral College benefits voters in swing states, and it slightly benefits voters in small states (on average). Large states are not benefited (except when they happen to be swing states such as Ohio or Florida, but we knew that already).

See here for the fuller discussion.

I just wanted to put this out here to get out in front of the discussion. So that if any of you do see this argument floating around, youall can shoot it down before it fully takes off...
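One way to see why swing status matters so much more than size: compare the chance of an exact tie in states of different sizes and closeness. A two-line calculation in R (turnout numbers invented, and this ignores the separate question of how often a given state is pivotal in the electoral college):

tie_prob <- function(n, p) dbinom(n / 2, n, p)   # chance of an exact tie among n voters
tie_prob(2e5, 0.50)   # small swing state: about 1.8e-3
tie_prob(1e6, 0.50)   # large swing state: about 8.0e-4
tie_prob(1e6, 0.52)   # large but leaning state: essentially zero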

Student-t conference

A conference celebrating the 100th birthday of W.S. "Student" Gosset's "The Probable Error of a Mean" and three other classic papers:

Graph of voter turnout by age

Here's a pretty picture (from Charles Franklin, link from John Sides):

[Figure Turnoutbyagecitizens.png: voter turnout by age, from Charles Franklin]

What a great graph! I won't be picky, but if I were, I'd make the following suggestions:
- Bigger numbers on the axes--as is, they're hard to read.
- Add percentage signs on the y-axis.
- Label age every 20 years rather than every 10.
- Put the "80-84" age group at 82 (rather than 80), and put the "85 and up" group at 88 (rather than 85).
- Pick colors other than red and blue.

Dipankar Bandyopadhyay writes:

I am currently running an autologistic regression model where I have some fixed effects and also spatial (autologistic) terms. Is there any recommendation from you on the appropriate choice of prior on the variance when I put a normal prior on the regression coefficients? I mean, do you recommend a folded-t, or a half-Cauchy, or a uniform over the traditional inverse gamma, and in such a case, where can I get the WinBUGS code to put folded-t or half-Cauchy/half-normal priors?
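I'll leave the full answer to the priors paper, but one standard construction from the parameter-expansion literature (a sketch; check it against your own model): a half-Cauchy with scale A is the absolute value of a normal divided by the square root of an independent chi-squared draw with 1 degree of freedom, which is easy to code in WinBUGS as two stochastic nodes. You can convince yourself in R:

set.seed(1)
A <- 2.5   # prior scale, your choice
draws <- abs(rnorm(1e5, 0, A) / sqrt(rgamma(1e5, 0.5, rate = 0.5)))  # Gamma(1/2,1/2) is chi^2_1
qqplot(draws, abs(rcauchy(1e5, 0, A)), log = "xy"); abline(0, 1)
# In WinBUGS the same trick is: xi ~ dnorm(0, appropriate precision),
# tau.eta ~ dgamma(0.5, 0.5), and sigma <- abs(xi) / sqrt(tau.eta)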

I've blogged about this before but it's worth mentioning again as a good teaching example. The site How Many of Me purports to estimate how many people in the U.S. have any particular name. But it can give wrong results; as "the other Craig Newmark" noted, it said there was only one of him, and there are actually at least two.

What the site actually does is to plug in estimates of the frequency of the first name and the frequency of the last name and assume independence. The results can be wrong.

This could be a great example for teaching probability. Three questions: first, how can you check that the site really is assuming independence; second, how many people does the site assume are in the U.S.; third, how could you do better?

1. How can you check that the site really is assuming independence? We'll check four names and see how many it says there are of each:

Rebecca Schwartz: 171
Rebecca Smith: 6600
Mary Schwartz: 1047
Mary Smith: 40941

Calculate the ratios: 6600/171 = 39, 40941/1047 = 39. Check.

Actually, to one more digit, the ratios are 38.6 and 39.1. Why the difference? Shouldn't they be exactly the same? Playing around with the last digits reveals that it can't be simple rounding error. Maybe some internal rounding error in the calculations? (Perhaps another good lesson for the class?) Hmm, let me go back and check. Number of Mary Schwartzes: 1047. Check. Number of Mary Smiths? 40491. Uh oh, I'd transposed the digits when copying the number. Now the ratios agree, to within rounding error. (The code sketch at the end of this post reproduces these checks.)

The website is definitely assuming independence. I have no doubt that there are some Mary Schwartzes out there, but there's no way that the frequencies of Marys among Smiths and Schwartzes are exactly identical.

2. How many people does the site assume are in the U.S.? The site says there are 4,024,977 people in the U.S. with the first name Mary, 3,069,846 people in the U.S. with the last name Smith, and 40,491 Mary Smiths. 4024977*3069846/40491 = 305 million. So that's what they're assuming.

3. How could you do better? Phone books are an obvious start. They don't have everybody and there are other sampling difficulties involved (for example, a telephone that's under the name of only one person in the family, leaving the others unlisted) but it would give you some clear information about how large are the discrepancies from independence.

And, a bonus:

4. A bad idea (which might be tried by a naive instructor who doesn't get the point): Using this to teach the chi-squared test for statistical independence. This is a bad idea for two reasons: first, the data in HowManyofMe.com are not a sample under statistical independence; they are exactly statistically independent (a/b=c/d) and so a chi-squared test is beside the point. Second, for real data the point is not whether they could be explained by statistical independence--they can't--but how large the discrepancy is. This can be expressed using probabilities or odds ratios or whatever but not by the magnitude or the p-value of a chi-squared test. (If you want to use this example to illustrate chi-squared, this is the point you'd have to make.)
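If you want to hand this to a class, the checks above fit in a few lines of R:

counts <- c(rebecca_schwartz = 171, rebecca_smith = 6600,
            mary_schwartz = 1047, mary_smith = 40491)
counts["rebecca_smith"] / counts["rebecca_schwartz"]   # 38.6
counts["mary_smith"] / counts["mary_schwartz"]         # 38.7: equal, up to the site's rounding
4024977 * 3069846 / 40491   # implied population size: about 305 million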

P.S. I've never met the other Andrew Gelman, but I did once meet someone who lives down the street from him (in New Jersey).

The biggest problem in model selection?

A student writes:

One of my favorite subjects is model selection. I have read some papers in this field and know that it is widely used in almost every subfield of statistics. I have studied some basic and traditional criteria such as AIC, BIC, and Cp. The idea is to set a consistent optimal criterion, which is usually not easy when the dimensionality is high. But my question is: what is the biggest problem, and why is it so hard?

Jim Manzi says yes, and he has some data. He says that in 46 out of 48 states, there's a positive correlation between a county's neighborhood-level inequality and its vote for Kerry.

P.S. Also see interesting thoughts in the comments section below.

P.P.S. This paper by Mark Frank also seems relevant to the discussion. Frank writes:

For many states, the share of income held by the top decile experienced a prolonged period of stability after World War II, followed by a substantial increase in inequality during the 1980s and 1990s. This paper also presents an examination of the long-run relationship between income inequality and economic growth. Our findings indicate that the long-run relationship between inequality and growth is positive in nature and driven principally by the concentration of income in the upper end of the income distribution.

P.P.P.S. See also the graphs here (from chapter 5 of the Red State, Blue State book).

Bob Carpenter on Extreme Programming

Bob Carpenter writes the following regarding Extreme Programming, focusing specifically on some of my struggles in statistical computing:

Well, it's a bit extreme. I think you'll find better overall advice in Hunt and Thomas's Pragmatic Programmer without all the silver bullet rhetoric. I wouldn't bother with the Agile development stuff, but that's the currently trendy descendant of this whole line of thinking.

Having said that, there are several good take-away messages from the Extreme Programming (XP) process, but I don't believe, as its proponents do, that you need to do everything their way.

I love pair programming -- it's not only a great way to learn for a novice/expert or expert/expert pair, it's a great way to keep quality high by keeping each other honest and it's a great way to catch bugs early on. But if you follow the XP advice to the letter, that's the only way you'd program, which is impractical in most groups.

You should start using unit testing, which I believe your group refers to as a "self-cleaning oven" (though don't keep the tests in the same file as the code).

Research coding is both good and bad for XP. The specifications tend to move around even more than in the usual XP project, because much of the time you don't even know if what you're trying to do is possible before you start.

And you definitely need version control, which Masanao and Yu-Sung set up for you through RForge.

With some R and BUGS under my belt, I really wish it were easier to stick to the don't-repeat-yourself (DRY) principle. All of the R code I've seen could use much more modularity.

Finally, you should get into refactoring; it's really what you're trying to do with BayesGLM, though it may be easier to refactor GLM first if you're going to start from scratch.
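To make the unit-testing advice concrete, here's a minimal example of what it can look like in base R (my sketch, not Bob's; in practice you'd put the tests in their own file and use a testing package):

inv_logit <- function(x) 1 / (1 + exp(-x))

test_inv_logit <- function() {
  stopifnot(abs(inv_logit(0) - 0.5) < 1e-10)             # known value
  stopifnot(inv_logit(-Inf) == 0, inv_logit(Inf) == 1)   # limits
  stopifnot(all(diff(inv_logit(seq(-5, 5, 0.1))) > 0))   # monotone
  cat("inv_logit: all tests passed\n")
}
test_inv_logit()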

My talk at Harvard on Wed 17 Sept

I'll be speaking on Red State, Blue State this Wed, 17 Sept, 12-1:30, in the Government Dept at Harvard. It's at 1737 Cambridge St., Room K-354. If you live in the Boston area, this is your chance to come and ask your questions and give your suggestions.

Fast sparse regression and classification

Aleks points me to this paper by Jerry Friedman on non-Bayesian regularization methods. I'd also recommend our Bayesian approach (see this Annals of Applied Statistics paper). Once you're going to assume a probability model for the data (a likelihood), it's a pretty small step to include prior information as well. But read Friedman's paper in any case. He focuses more on computational issues than we do. There are really two parallel literatures.
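For anyone who wants to try the Bayesian version, here's a minimal sketch using bayesglm() from the arm package with its weakly informative Cauchy priors; the data are simulated, just to show the call:

library(arm)
set.seed(1)
n <- 200
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 2 * x1))
fit <- bayesglm(y ~ x1 + x2, family = binomial,
                prior.scale = 2.5, prior.df = 1)   # independent Cauchy(0, 2.5) priors
display(fit)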

Heinlein's fan mail solution

Phil pointed me to this. Very nice. I have to get someone to program my emailer to do something similar...

Taxonomy of confusion

Rachel writes that she gave our students (it's a grad class in applied statistics, based on the Gelman and Hill book) what she thought of as a "Taxonomy of Confusion"... types of things they might be confused about and what they should do before asking the T.A.:

1. statistics-related questions that are prerequisite to the course--get an Intro to Stats book, don't ask the T.A. unless you really must.

2. statistics-related questions that are part of the course--read the book, ask a friend, then ask the T.A.

3. you know what you want statistically but you don't know the name of the function--google "R standard deviation" or write the function yourself... if you can't find it, ask a friend then ask the T.A.

4. you know the function's name but you can't figure out how it works: type help(sd), then ask a friend then ask the T.A.

5. you wrote code but you get error messages: DEBUG using tips like print things out, break the computation into smaller steps (we should do more on this later).

6. you wrote code and it doesn't do what you think it should do, but there are no errors: DEBUG (more on this later).

This just seemed hilarious to me. Maybe it was the deadpan tone.

More on interactions

Bruce McCullough writes:

Don't know if you're aware of this, but if you need more evidence for the primacy of interaction effects, data mining is a great place to look. My degree is in economics. I was taught to use interaction effects as a test for nonlinearity, and that was about it.

My data mining experience of the past few years has taught me that interaction effects can be neglected at my own peril. A wonderful paper that illustrates this is "Variable selection in data mining: Building a predictive model for bankruptcy," by Dean P. Foster and Robert A. Stine in the Journal of the American Statistical Association (2000). The usual linear regression doesn't work. The model with lots of interactions works very well.
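A two-minute simulation in R makes McCullough's point (my toy example, not Foster and Stine's bankruptcy model): when the truth includes an interaction, the main-effects-only regression fits poorly and the interaction model fits well.

set.seed(1)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + d$x1 + d$x2 + 2 * d$x1 * d$x2 + rnorm(n)
fit_main <- lm(y ~ x1 + x2, data = d)   # ignores the interaction
fit_int <- lm(y ~ x1 * x2, data = d)    # with many predictors, try y ~ .^2
c(main = summary(fit_main)$r.squared, interaction = summary(fit_int)$r.squared)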


In response to something Robin Hanson wrote on his blog (sorry I can't find the exact link, I think it was at the end of July, 2008), I wrote:

If you're in D.C., you should stop by. . . . I'm speaking in the statistics department at George Washington University on the topic of interactions. Here's the powerpoint and here's the abstract:

As statisticians and practitioners, we all know about interactions but we tend to think of them as an afterthought. We argue here that interactions are fundamental to statistical models. We first consider treatment interactions in before-after studies, then more general interactions in regressions and multilevel models. Using several examples from our own applied research, we demonstrate the effectiveness of routinely including interactions in regression models. We also discuss some of the challenges and open problems involved in setting up models for interactions.

The talk will be today, Wed 10 Sept, at 3pm at 1957 E Street, Room 212. If you don't know where that is, you can call the department (202-994-6356) and they should be able to give you directions.

Tomorrow (Thurs) I'll be speaking with Boris at noon at the Cato Institute on Red State, Blue State. It's not too late to sign up for that.

Those silly voters

Rick Shenkman reminds us that voters are "grossly ignorant" about many issues. Now that the Cold War is over, we don't have to worry about voters not knowing about "throw weights" and such, but I think it's still probably a bad thing that "six in ten young people (aged 18 to 24) could not find Iraq on the map," that people overestimate by a factor of 50 the percentage of the federal budget that is spent on foreign aid, and so forth.

Bayesian computation in Java?

John Payne writes:

I am writing a Java program to do ecosystem modeling and we wish to use Bayesian MCMC methods for parameter estimation. I am interested in finding flexible, customizable Bayesian MCMC code that can be called in Java. I looked at documentation for JAGS, BUGS, and also Gregory Warnes's Hydra program (which is in Java). I have been unable to get a reply from Dr. Warnes but it seems Hydra is no longer being supported. As far as I can tell, BUGS is written in Component Pascal, which I am ignorant about. I have never tried JAGS; would you have any advice as to which avenue would be the most fruitful to pursue?

My reply: Right now, I think JAGS is probably the best way to go. But others might have other suggestions here.

David Frum responded at his blog to my graph-laden comments on his New York Times article.

Frum emphasizes the difference between looking at county-level inequality as compared to state-level inequality. He also makes the point that inequality (at the state and county level) is often associated with big cities. Interesting stuff.

Frum also mentions Missouri, which is one of the states where richer counties favor the Democrats. Richer counties also lean Democratic in Nebraska, and most of the western and northeastern states (see pages 68-70 of the book), but in Indiana, South Dakota, Wisconsin, New Jersey, and most of the South, it goes the other way, with richer counties being more Republican. (I showed this in the map of Texas in my previous blog entry.) The patterns really do look different in different parts of the country, and Missouri is not like Texas in this respect. In any case, I haven't crunched the numbers on county-level inequality, and I agree with Frum that the patterns within a state can differ from those between states. Individually, richer Americans still lean Republican, but location matters a lot also.

Frum's facts and fallacies

David Frum, author of "Comeback: Conservatism That Can Win Again," wrote an op-ed in the New York Times yesterday that has some interesting insights but also suffers from some of the usual confusions about rich and poor, Democrats and Republicans. Overall I think Frum has some interesting things to say, but I want to point out a couple of places where I think he may have been misled by focusing too strongly on the D.C. metropolitan area.

I read Richard Cook's biography of Alfred Kazin recently. It was surprisingly interesting--I say "surprisingly" because Kazin didn't live a particularly eventful life. I wanted to read the book in the first place because I like a lot of Kazin's writing and I wanted to understand how the pieces fit together. One thing I learned is that his sister married Daniel Bell. Not that the book featured any interesting anecdotes about Bell; still, it was satisfying to see the map filled in. I was struck by how financially precarious Kazin's life was. After the late 1930s, he was never poor, but it was a long time before he had a permanent job. There definitely seems to be a conceptual divide between those of us with steady jobs (the sort that pay us even if we're not really working) and people who start each year from a baseline of zero income and have to earn every penny. (Well, I guess Kazin had book royalties, but I don't suppose that was enough to pay the rent.)

My favorite writings of Kazin's are his book reviews, especially of post-1950 literature, which is what I'm most likely to have read and to be able to relate to. (I just can't get into that Henry James stuff.) I'd love to read more of that. I have a collection of his reviews that came out around 1962, and it's excellent. (Not perfect; he sometimes irritates me with a smug, all-knowing attitude of condescension, but most of the time it's interesting. For example, I got a lot out of his essay on John P. Marquand, even though Kazin is less of a fan of Marquand than I am. My take on this: Marquand made it look so easy that his skills were hard to appreciate until decades later, when nobody has come along to replace him.)

Cook takes a lot from the memoir, "What I Saw at the Fair," that Kazin's third wife, Ann Birstein, published a few years after Kazin's death. I went to the library and picked it up and gave it a quick read. She was still mighty angry at Kazin, even to the end, when she found out that he'd sold a collection of letters, including many from her, to the New York Public Library. It's gotta be a weird feeling to go to the library and come across your own decades-old letters.

"What I Saw at the Fair" is readable and interesting, but running through it is a funny idea--I'd call it pre-modern--that people's true essences are reflected in their physical appearance. Character after character is introduced as ugly or beautiful, and almost always this is an indicator to the inner being. This strategy works for Dickens, and in addition I'm willing to believe that there's some correlation between inner and outer beauty (especially given that both are in the eye of the beholder). But I know enough people to know that any such pattern is far from universally true. In reading Birstein's memoir, I was continually wondering whether she really believed that beautiful people are nicer, that ugly people compensated by being nasty, that Hannah Arendt was really "a Nazi," Along the same lines, she disparages Norman Mailer's machismo because his penis was small.

But what really struck me about Birstein's memoir is that she strongly identifies herself as a writer--she's published several novels--and she knew lots of writers and intellectuals, including Sylvia Plath, Bernard Malamud, Saul Bellow, and the aforementioned Daniel Bell--but she expresses no interest in any of their writings. Birstein's anecdotes about these people are interesting, but I'm surprised to see no discussion of their literature or their ideas. Perhaps this is her revenge on them for ignoring her writing all these years. In any case, I think she missed an opportunity. It would be like writing a book that takes place in the Giants' locker room and not talking about football. Birstein identifies being a writer with fiction writing and thinks it's funny that Kazin called himself a writer when he was only a critic. "A Walker in the City" has some beautiful phrases and images, but to me it doesn't read as smoothly as a good novel or even as smoothly as good criticism.

Kazin told Birstein that he couldn't love her if she weren't a writer (or something like that; I don't recall the exact wording), but he didn't show much respect for her actual writing. But maybe that has to do with Kazin's career as a critic of classic writing. If he was comparing to Mark Twain, Edith Wharton, etc., then it's no surprise that Birstein came off second best.

This brings me to a more general point. Birstein appeared to evaluate people based on their looks (or, perhaps, retroactively evaluated people's looks based on how much she liked them). Kazin perhaps evaluated Birstein unfairly because, as a writer, she was no Saul Bellow. Do we all do this sometimes? I evaluate statisticians based on their ability--not necessarily technical ability (although that's part of it) but more on whether they "get it" and can solve problems. And the statisticians who are really good at this? I like them as people, almost without exception. Conversely, I get irritated by statisticians who can't do it--especially those who seem to pump up bad ideas or disparage good ideas--I tend to think of them as lesser on a personal level. Some of this is legitimate, I think--part of being a good person is to recognize how one can be most helpful to others--but I probably lean too far in this direction. Even people who are nearly universally disliked, if they're good statisticians, I'll give them the benefit of the doubt. But if I don't like their ideas, it's hard to avoid disliking them.

For a sillier example, I remember reading in Susan Cheever's memoir that John Cheever rated people based on how strong were the drinks they served. Higher alcohol content = better person. And in playing pickup frisbee, I think that, on average, you'll be more liked as a person if you're a better frisbee player. (Although maybe in basketball it goes the other way...)

P.S. What happened to Kazin's first wife? After they broke up, she didn't want to get back together with him--a reasonable enough decision, especially considering how his life proceeded in the years after--but then I was mildly curious what happened with her after that, and the biography didn't say.

Prior elicitation in dynamic models

Nick Firoozye writes:

Burn-in Man

Mike McLaughlin writes:

I was wondering about MCMC burn-in and whether the oft-cited emphasis on this in the literature might not be a bit overstated.

My thought was that the chain is Markovian. In a Metropolis (or Metropolis-Hastings) context, once you establish the scale of the proposal distribution(s), successful burn-in gets you only a starting location inside the posterior -- nothing else is remembered, by definition! However, there is nothing really special about this particular starting point; it would have been just as valid had it been your initial guess and the burn-in would then have been superfluous. Moreover, the sampling phase will eventually reach the far outskirts of the posterior, often a lot more extreme than the sampling starting location, yet it will still (collectively) describe the posterior correctly. This implies that *any* valid starting point is just as good as any other, burn-in or no burn-in.

The only circumstance that I can think of in which a burn-in would be essential is in the case in which prior support regions for the parameters are not all jointly valid (inside the joint posterior), if that is even possible given the min/max limits set for the priors. Am I missing something?

My response: What you're missing is that any inference from a finite number of simulations is an approximation.

Consider an extreme example in which your sample takes independent draws from a N(mu,sigma^2) distribution, but you pick a starting value of X. The average of n simulations will then have the value, in expectation, of (1/n)X+ ((n-1)/n)mu (instead of the correct value of mu). If, for example, X=100 and n=100, you're in trouble! But a burn-in of 1 will solve all your problems in this example. (And in this example, n=100 would work just fine for most purposes.) True, if you draw a few gazillion simulations, the initial values will be forgotten, but why run a few zillion simulations if you don't have to? That will just slow down your important work.
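In R, the example looks like this:

set.seed(1)
mu <- 0; X <- 100; n <- 100
chain <- c(X, rnorm(n - 1, mu, 1))   # "chain" of independent draws plus a bad start
mean(chain)       # about 1: biased by (1/n) * X, as in the formula above
mean(chain[-1])   # a burn-in of 1 fixes it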

More generally, your starting values will persist for a while, basically as long as it takes for your chains to mix. If your starting values persist for a time T, then they will pollute your inferences for a time of order T; you could have stopped the simulations by then had you discarded some early draws.

P.S. See here for a different perspective, from Charlie Geyer. For the reasons stated above, I don't agree with what he writes, but you can read for yourself.

P.P.S. In my example above, you might say that it would be ok if you were to just start at the center of the distribution. One difficulty, though, is that you don't know where the center of the distribution is before you've done your simulations. More realistically, we start from estimates +/- uncertainty as estimated from some simpler model that was easier to fit.

Red-blue event at Cato in Washington, D.C.

Boris and I will be speaking at the Cato Institute in Washington, D.C., next Thurs (11 Sept) at noon on our Red State, Blue State book (also written with David Park, Joe Bafumi, and Jeronimo Cortina). The event will be moderated by Will Wilkinson; see the description here of the event on his blog.

All are welcome to come, but you should register online for the event. We'll be having a panel discussion with Michael McDonald (Brookings Institution and George Mason University) and Brink Lindsey of the Cato Institute. I'm curious what they have to say about our work, especially some of the stuff at the end of chapter 9 about the connections between public opinion and policy.

Dopey anti-doping tests?

Jim points me to this article by Don Berry, which argues that studies of doping in sports often don't correctly perform probability calculations.
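The core of the probability argument is just Bayes' rule with a base rate. A sketch in R, with all the numbers invented rather than taken from Berry's article:

prevalence <- 0.03    # fraction of athletes actually doping
sensitivity <- 0.95   # P(positive test | doping)
specificity <- 0.99   # P(negative test | clean)
(sensitivity * prevalence) /
  (sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
# about 0.75: even a quite accurate test leaves real doubt after one positive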

Tyler Cowen discusses the possibility of economics prodigies. I refer him and his commenters to Dick De Veaux's saying, "Math is like music, statistics is like literature." You can decide yourself where economics is or should stand in this spectrum. I will say, though, that it can take decades to develop a good idea, just because you can be busy doing other things.
