## November 21, 2008

### The Denominator, or, Is it an advantage to have a humble background?

Malcolm Gladwell recounts the story of Sidney Weinberg, a kid who grew up in the slums of Brooklyn around 1900 and rose to become the head of Goldman Sachs and well-connected rich guy extraordinaire. Gladwell conjectures that Weinberg's success came not in spite of but because of his impoverished background:

Why did [his] strategy work . . . it's hard to escape the conclusion that . . . there are times when being an outsider is precisely what makes you a good insider.

Later, he continues:

It’s one thing to argue that being an outsider can be strategically useful. But Andrew Carnegie went farther. He believed that poverty provided a better preparation for success than wealth did; that, at root, compensating for disadvantage was more useful, developmentally, than capitalizing on advantage.

At some level, there's got to be some truth to this: you learn things from the school of hard knocks that you'll never learn in the Ivy League, and so forth. But . . . there are so many more poor people than rich people out there. Isn't this just a story about a denominator? Here's my hypothesis:

Pr(success | privileged background) >> Pr(success | humble background)

Number of people with privileged background << number of people with humble background

Multiply these together, and you might find that many extremely successful people have humble backgrounds, but it does not mean that being an outsider is actually an advantage.
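To make the multiplication concrete, here is a toy calculation; every rate and population size in it is invented purely for illustration:

```python
# Invented illustrative numbers, not real data:
p_priv, n_priv = 0.01, 1_000_000          # privileged: high success rate, small group
p_humble, n_humble = 0.0005, 100_000_000  # humble: low success rate, huge group

successes_priv = p_priv * n_priv          # about 10,000 successes
successes_humble = p_humble * n_humble    # about 50,000 successes

# A randomly chosen privileged person is 20x more likely to succeed,
# yet most successful people still come from humble backgrounds,
# simply because the humble denominator is 100x larger.
```

Under these made-up numbers, five of every six extremely successful people have humble backgrounds, even though humble origins remain a 20-fold disadvantage for any individual.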

Weinberg was decoupled from the business establishment in the same way, and that seems to have been a big part of what drew executives to him. The chairman of General Foods avowed, “Sidney is the only man I know who could ever say to me in the middle of a board meeting, as he did once, ‘I don’t think you’re very bright,’ and somehow give me the feeling that I’d been paid a compliment.” That Weinberg could make a rebuke seem like a compliment is testament to his charm. That he felt free to deliver the rebuke in the first place is testament to his sociological position. You can’t tell the chairman of General Foods that he’s an idiot if you were his classmate at Yale. But you can if you’re Pincus Weinberg’s son from Brooklyn. Truthtelling is easier from a position of cultural distance.

Is this really true? My guess is that it's not so hard to tell your Yale classmate you think he's not very bright, if you say it in a charming way. College fraternity guys like to jokingly insult each other, no?

### Netflix Prize scoring function isn't Bayesian

The NY Times has a good article on the state of recommender systems: "If You Liked This, Sure to Love That". This is a description of one of the problems:

But his progress had slowed to a crawl. [...] Bertoni says it’s partly because of “Napoleon Dynamite,” an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and tries to predict how any given Netflix user will rate them, he’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” he’s off by an average of 1.2 stars.

The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. [...] It’s the type of quirky entertainment that tends to be either loved or despised.

And here is the stunning conclusion by fortunately anonymous computer scientists:

Some computer scientists think the “Napoleon Dynamite” problem exposes a serious weakness of computers. They cannot anticipate the eccentric ways that real people actually decide to take a chance on a movie.

Actually, computers do quite a good job modeling probability distributions for the more eccentric and unpredictable among us. Yes, the humble probability distribution, that centuries-old staple of statisticians, is enough to model eccentricity! The problem is that Netflix makes it hard to use sophisticated models: the scoring function is the antiquated, not just pre-Bayesian but actually pre-probabilistic, root mean squared error (RMSE). For all practical purposes, the square root in RMSE is a monotonic transformation that won't affect the ranking of recommender models, so we can drop it outright.

So, if one looks at the distribution of ratings for Napoleon Dynamite on Amazon, it has high variance:

On the other hand, Lethal Weapon 4 ratings have lower variance:

If we use the average number of stars as the context-ignorant, unpersonalized predictor (which I've discussed before), ND will give you a mean squared pain of 3.8, and LW4 a mean squared pain of 2.7. Now, your model might choose not to make recommendations for controversial movies, but this won't help you in the Netflix Prize: you're forced to make errors even when you know you're making them. (R)MSE is pre-probabilistic: it gives no advantage to a probabilistic model that's aware of its own uncertainty.
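To see why a constant "predict the mean" strategy is punished more on polarizing titles, here's a sketch with invented rating distributions (not the actual Amazon or Netflix data): the expected squared error of always predicting the mean is just the variance of the rating distribution.

```python
import numpy as np

stars = np.arange(1, 6)

# Invented distributions over 1-5 stars (illustrative only):
polarized = np.array([0.35, 0.05, 0.05, 0.10, 0.45])  # love-it-or-hate-it, ND-style
peaked    = np.array([0.05, 0.10, 0.30, 0.40, 0.15])  # mild consensus, LW4-style

def mse_of_mean(p):
    """Expected squared error when you always predict the mean rating."""
    mean = (stars * p).sum()
    return (p * (stars - mean) ** 2).sum()  # = variance of the distribution

print(mse_of_mean(polarized))  # large unavoidable error (about 3.29)
print(mse_of_mean(peaked))     # much smaller error (about 1.05)
```

The square root in RMSE doesn't change this comparison, since it's monotonic; no amount of modeling skill lets a constant predictor escape the variance of a polarized title.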

Posted by Aleks Jakulin at 1:27 PM | Comments (6) | TrackBack

## November 20, 2008

### Still another 10 days to apply for an Earth Institute postdoc

The Earth Institute is looking for applicants for its postdoctoral fellows program, and if you're doing statistics you can work with me. It's a highly competitive program; the deadline is December 1, so apply now:

Postdoctoral Fellows Program in Sustainable Development at The Earth Institute

The Earth Institute at Columbia University is the world’s leading academic center for the study, teaching, and implementation of sustainable development. It builds on excellence in the core disciplines—earth sciences, biological sciences, engineering sciences, social sciences, and health sciences—and stresses cross-disciplinary approaches to complex problems.

Through research, training, and global partnerships, The Earth Institute mobilizes science and technology to advance sustainable development and address environmental degradation, placing special emphasis on the needs of the world’s poor.

The Earth Institute seeks applications from innovative postdoctoral candidates or recent Ph.D., M.D., and J.D. recipients interested in a broad range of issues in sustainable development.

The Postdoctoral Fellows Program in Sustainable Development provides scholars who have a foundation in one of the Institute’s core disciplines the opportunity to acquire the cross-disciplinary expertise and breadth needed to address critical issues in the field of sustainable development, including reducing poverty, hunger, disease, and environmental degradation. Those who have developed cross-disciplinary approaches during graduate studies will find numerous opportunities to engage in leading research programs that challenge their skills.

Candidates for the Postdoctoral Fellows Program should submit a proposal for research that would contribute to the goal of global sustainable development. This could take the form of participating in and contributing to an existing multidisciplinary Earth Institute project, an extension of an existing project, or a new project that connects existing Institute expertise in novel ways. Candidates should identify their desired small multidisciplinary mentoring team, i.e., two or more senior faculty members or research scientists/scholars at Columbia with whom they would like to work during their fellowship.

For detailed information on The Earth Institute, its research centers, programs, and affiliated Columbia University departments, please visit http://www.earthinstitute.columbia.edu

Fellowships will ordinarily be granted for a period of 24 months.

Application forms should be completed online at http://fellows.ei.columbia.edu/2009/

Applications submitted by December 1, 2008, will be considered for fellowships starting in the summer or fall of 2009.

Research Director, OARP
rricobelli@ei.columbia.edu
The Earth Institute at Columbia University
B-16 Hogan Hall, MC 3277
New York, NY 10025
Program e-mail: fellows@ei.columbia.edu

Columbia University is an affirmative action/equal opportunity employer.
Minorities and women are encouraged to apply.

## November 19, 2008

### Genetically-influenced traits running in families

John Seabrook writes:

There is also little consensus among researchers about what causes psychopathy. Considerable evidence, including several large-scale studies of twins, points toward a genetic component. Yet psychopaths are more likely to come from neglectful families than from loving, nurturing ones.

I'm confused here. If there's a big genetic component, wouldn't it stand to reason that parents of psychopaths are more likely to be neglectful and less likely to be loving and nurturing? So why the "Yet" in the quote above? Or is there something I'm missing?

P.S. in response to commenters: Yes, I agree that it's possible for psychopathy to be largely genetic without parents of psychopaths being much more likely to be neglectful.

What I didn't understand was Seabrook's implication that this would be surprising, the idea that if (a) a trait is genetically linked, and (b) a trait can be (somewhat) predicted by parental behavior, that the combination of (a) and (b) should be considered puzzling. By default, I'd think (a) and (b) would go together.

## November 13, 2008

### Modeling growth

Charles Williams writes,

In a number of your examples in the multilevel modeling book you use growth as an outcome. I'm doing this in a study of firm growth in the cellular industry. In this setting, we need to control for firm size, since a firm's propensity to grow is definitely affected by its size. Someone suggested to me that I may have correlation between the size variable and the error term, since size is effectively in the denominator of the growth variable. They suggested using just the numerator of the growth term (subscribers added) as the outcome, since the denominator will be controlled for in the regression.

Have you run into this? Do you agree that there is a potential for bias in using size as a regressor for growth?

My reply: Yes, it makes sense to control for size (at the beginning of the study) in your regressions, probably on the log scale. I'd still use the ratio as an outcome because I think it would help the coefficients be more directly interpretable (which is a virtue in itself and also helps with efficiency if you have a hierarchical or Bayesian model).
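A minimal sketch of that specification with simulated firm data (all numbers invented): keep the growth ratio as the outcome, on the log scale, and control for log initial size.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Simulated firms: initial subscriber counts, and a growth ratio that
# (by construction) declines with size, with elasticity -0.05.
size0 = rng.lognormal(mean=10, sigma=1, size=n)
log_growth = 0.5 - 0.05 * np.log(size0) + rng.normal(0, 0.2, n)

# Outcome: log of the growth ratio; control: log initial size.
X = np.column_stack([np.ones(n), np.log(size0)])
beta, *_ = np.linalg.lstsq(X, log_growth, rcond=None)
intercept, size_coef = beta
# size_coef recovers roughly -0.05: larger firms grow proportionally less.
```

The coefficient on log size recovers the built-in elasticity, while the outcome stays directly interpretable as proportional growth.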

## November 12, 2008

### Fellowship and internship programs at the Educational Testing Service

Information and application instructions are posted on the ETS Web site at http://www.ets.org/research/fellowships.html. The deadline for applying for the summer internship and postdoctoral fellowship programs is February 1, 2009. The deadlines for applying for the Harold Gulliksen program are December 1, 2008 for the preliminary nomination materials and February 1, 2009 for the final application materials.

## November 11, 2008

### Job opening--come here and work down the hall from me

The Department of Statistics at Columbia University invites applications for an Assistant Professor position, commencing Fall 2009. A PhD in statistics or a related field and commitment to high quality research and teaching in statistics and/or probability are required. Outstanding candidates in all areas are strongly encouraged to apply. You should apply before December 1, 2008.

The department currently consists of 20 faculty members, 35 PhD students, and over 100 MS students. The department has been expanding rapidly and, like the University itself, is an extraordinarily vibrant academic community. For further information about the department and our activities, centers, research areas, and curricular programs, please go to our web page at: http://www.stat.columbia.edu.

Inquiries may be made to dk@stat.columbia.edu .

Review of applications will begin December 1, 2008. Applications received after this date may be considered until the position is filled or the search is closed. Columbia University is an Equal Opportunity/Affirmative Action employer.

## October 30, 2008

### More on scaling regression inputs

Tom Knapp writes:

I have four questions and one correction about your article about scaling regression inputs in Statistics in Medicine:
1. In your party identification example you show that division by two standard deviations reversed the relative magnitudes of some regression coefficients. Near the end of your paper, with respect to Itani et al., you say "dividing by one (rather than two) standard deviation will lead the reader to understate the importance of these continuous inputs". Is that always the case?

2. How did you get your paper published in SIM, given that the only reference to medicine is in two of those last three examples?

3. In the text accompanying Figure 2 you say "the coefficient for the interaction of income and ideology is now higher than the coefficient for race [black]". If I'm reading the data in that figure correctly I think you meant to say that the coefficient for parents.party is now higher.

4. On page 2866 you say that log transformations are not appropriate for Likert scales. Do you have a reference for that claim? I think Likert scales are inappropriate for linear regression analysis in general and require the use of ordinal regression analysis.

5. On page 2868 you have a brief paragraph regarding the ability of experienced practitioners to interpret the regression coefficients in the top half of Figure 2. I guess I qualify (I taught statistics for 41 years), and I usually interpret regression coefficients by eyeballing the associated t's or p's. Why didn't you provide same? I calculated all of the t's for the unscaled coefficients; for black and for parents.party I got -5.76 and 16.33, respectively, so parents.party is the stronger predictor. [Incidentally, you probably should have reported another place or two for the data in Figure 2, since the coefficient and the standard error for age squared are both 0.00.]

My reply: First off, it's a thrill to get a comment from someone who taught statistics for 41 years! I've been doing it for barely half as long. To get to specifics:

1. Dividing by 1 sd is roughly comparable to a binary predictor being coded as +/- 1. Dividing by 2 sd is roughly comparable to a binary predictor being coded as 0/1. The 0/1 coding is much more common (at least, in the examples that I've seen), which is why I chose the 2 sd scaling.

2. I think it got rejected by 2 other places; I can't quite remember where. But each time I made major improvements.

3. Yes, that's right. D'oh!

4. I'm not so bothered by treating a 1-5 or 1-10 scale linearly, on the assumption that the difference between 1 and 2 is approximately the same as the difference between 3 and 4, or whatever. I'm working on a research project to use Bayesian methods to bridge between the extremes of pure linearity and pure ordered-categorical models.

5. That's a good point. Ordering by statistical significance is not the same as ordering by importance, but it would've been a good idea to discuss this in the article.
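A small simulation (invented data, plain-numpy OLS) illustrating the point in item 1: dividing a continuous input by two standard deviations puts its coefficient on roughly the same "typical low-to-high change" scale as a 0/1 binary input's coefficient.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Simulated inputs: a continuous predictor (sd 10) and an independent
# 0/1 binary predictor, with effects built into y by construction.
x = rng.normal(50, 10, n)
z = rng.integers(0, 2, n).astype(float)
y = 0.1 * x + 1.0 * z + rng.normal(0, 1, n)

def slope(u):
    """OLS slope of y on u, with an intercept."""
    X = np.column_stack([np.ones(n), u])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

b_raw = slope(x)                  # ~0.1 per raw unit: looks negligible
b_2sd = slope(x / (2 * x.std()))  # ~2.0 per 2-sd (low-to-high) change
b_bin = slope(z)                  # ~1.0 for the 0-to-1 flip
# b_2sd and b_bin are now on comparable scales; b_raw is not.
```

Dividing by one sd instead would give a coefficient of about 1 per sd, which corresponds to only half a typical low-to-high change and so understates the input's importance relative to the binary predictor.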

## October 20, 2008

### Two countries separated by a common language

Some differences:

- Tao uses more words. This makes sense: he's busy explaining this stuff to himself as well as to his readers. To a statistician, these ideas are so basic that it's hard for us to really elaborate. (Also, I had a word limit.)

- Tao emphasizes that a confidence interval is not a probability interval. In my experience, confidence intervals are always treated as probability intervals anyway, so I don't spend time with the distinction.

- I emphasize that a poll is a snapshot, not a forecast.

- Tao says that the number of polled voters is fixed in advance. I don't think this is exactly true, what with nonresponse.

- Tao fills his blog entry with Wikipedia links. Wikipedia is ok but I'm not so thrilled with it; I'm happy with people looking things up in it if they want but I won't encourage it.

But we're basically saying the same thing. I like how I put it, but I'm sure a lot of people prefer Tao's style. Luckily there's room on the web for both!

### "Data analysis" or "data synthesis?"

See discussion here.

## October 14, 2008

### Don't Ask, Don't Tell: The New Rules of the SAT and College Admissions

Howard Wainer writes:

On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.

The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them fast.

Among their conclusions was that schools should consider making their admissions "SAT optional," that is, allowing their applicants to submit their SAT/ACT scores if they wish, but not making them mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.

Howard continues:

Has the admissions process been hampered in schools that have instituted an SAT optional policy?

The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969. Bowdoin is a small, highly competitive liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose not to submit their SAT scores. . . .

As it turns out the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin. In fact the correlation between grades and SAT scores was 12% higher for those who didn't submit them than for those who did.

So not having this information does not improve the academic performance of Bowdoin's entering class — on the contrary it diminishes it. Why would a school opt for such a policy? Why is less information preferred to more? . . .

We see that if all of the students in Bowdoin's entering class had their SAT scores included the average SAT at Bowdoin would sink from 1323 to 1288, and instead of being second among these six schools they would have been tied for next to last. Since mean SAT scores are a key component in school rankings, a school can game those rankings by allowing their lowest scoring students to not be included in the average. I believe that Bowdoin's adoption of this policy pre-dates US News and World Report's rankings, so that was unlikely to have been their motivation, but I cannot say the same for schools that have chosen such a policy more recently.

Interesting. Howard has some data showing that, unsurprisingly, the students who don't supply their SAT scores are mostly (but, interestingly, not always) those scoring lower:

(I don't find the y-axis on this graph very helpful, but that's another story.)

So what's the deal? Who are those kids with 1450 SAT's who aren't submitting their scores?

This reminds me . . .

Sound psychometric (i.e., statistical) principles tell us that, if an applicant has taken a test multiple times, we should use his or her average score. But for our PhD admissions, we generally take the higher score. I understand our psychological reasons for doing this--we want to think the best of a person--but, statistically, it seems like a bad idea.
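A quick simulation of why (all numbers invented): if each attempt is true ability plus noise, the average of several attempts is unbiased, while the best of several attempts is biased upward by selection on noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: true ability 1400, each attempt adds noise with sd 50.
true_ability, noise_sd, n_attempts = 1400.0, 50.0, 3
scores = true_ability + noise_sd * rng.standard_normal((100_000, n_attempts))

avg_score = scores.mean(axis=1).mean()  # average of attempts: ~1400, unbiased
best_score = scores.max(axis=1).mean()  # best of attempts: ~1442, inflated
```

The roughly 40-point inflation here is pure selection on noise; nothing about the applicant changed between attempts.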

Cool.

## October 1, 2008

### Applied Statistics Center Monthly Update for October 2008

Just to let you know things are busy around here . . .

Contents:
- Featured Research from ASC Fellows
- Seminars Coming Soon
- News

********************************************

Featured Research from ASC Fellows:

- Pablo Pinto (Political Science) writes that he's working on a project titled The Politics of Investment, with Santiago M. Pinto:
"In this project we try to establish whether foreign direct investment (FDI) reacts to changing political conditions in host countries. More specifically, we explore the existence of partisan cycles in FDI investment performance, which should be reflected in different patterns of investment at the industrial level. While there has been extensive work on the effects of policy decisions (trade and tax policy in particular) on aggregate FDI flows (Feldstein, Hines, & Hubbard 1995; Hines 2001), we find that the link between partisanship and investment performance has not been duly explored in the literature. The first paper of this series, The Politics of Investment: Partisanship and the Sectoral Allocation of FDI, was published in the June 2008 issue of Economics & Politics. We are currently working on several extensions: Partisanship, Imperfect Capital Mobility and the Sectoral Allocation of FDI; Partisan Governments, Wages and Employment."

- Want your work to be the Featured Research in next month's newsletter? Let us know! Send an email to ejs2130@columbia.edu

***

Seminars Coming Soon:

Quantitative Political Science seminar
- Date: October 2, 2008
- Speaker: Robert Erikson, Kelly Rader and Pablo Pinto, Columbia Political Science
- Topic: TBA

- Date: October 16, 2008
- Speaker: Amy Lerman, Princeton Politics
- Topic: TBA

Quantitative Methods in the Social Sciences seminar
- Date: October 1, 2008
- Speaker: Jennifer Booher-Jennings, Columbia
- Topic: Beyond High Stakes Tests: Teacher Effects on Other Educational Outcomes

- Date: October 15, 2008
- Speaker: Margot Jackson, Princeton
- Topic: TBA

Statistics Seminar
- Date: October 6, 2008
- Speaker: Dr. Adam A. Szpiro, Department of Biostatistics, University of Washington
- Topic: TBA

- Date: October 13, 2008
- Speaker: Dr. Ingemar Nåsell, Royal Institute of Technology, Stockholm
- Topic: On Persistence of Endemic Infections

- Date: October 20, 2008
- Speaker: Dr. Hernando Ombao, Brown University
- Topic: TBA

- Date: October 27, 2008
- Speaker: Dr. David Brillinger, Statistics Department, University of California, Berkeley
- Topic: TBA

Applied Mathematics Colloquium
- Date: October 7, 2008
- Speaker: Jason Fleischer, Princeton University
- Topic: Towards Optical Hydrodynamics

- Date: October 14, 2008
- Speaker: Sorin Tanase-Nicola, University of Michigan
- Topic: TBA

- Date: October 16, 2008
- Speaker: Misha Chertkov, LANL
- Topic: Belief Propagation and Beyond

- Date: October 21, 2008
- Speaker: Paul Francois, Rockefeller University
- Topic: TBA

- Date: October 28, 2008
- Speaker: Surya Ganguli, Keck Center, UCSF
- Topic: TBA

Applied Microeconomics seminar
- Date: October 1, 2008
- Speaker: Joshua Goodman
- Topic: TBA

- Date: October 8, 2008
- Speaker: Erzo Luttmer
- Topic: What Good is Wealth Without Health? The Effect of Health on the Marginal Utility of Consumption

- Date: October 15, 2008
- Speaker: Rajeev Cherukupalli
- Topic: TBA

- Date: October 22, 2008
- Speaker: Tumer Kaplan
- Topic: TBA

- Date: October 29, 2008
- Speaker: Amitabh Chandra
- Topic: TBA

Econometrics workshop
- Date: October 9, 2008
- Speaker: Arthur Lewbel, Boston College
- Topic: TBA

- Date: October 16, 2008
- Speaker: Peter Reinhardt Hansen, Stanford
- Topic: TBA

- Date: October 23, 2008
- Speaker: Eric Ghysels, North Carolina
- Topic: TBA

Econometrics colloquium
- Date: October 1, 2008
- Speaker: Dennis Kristensen, Columbia
- Topic: Testing Conditional Factor Models

- Date: October 8, 2008
- Speaker: Richard Davis, Columbia
- Topic: Structural Break Estimation in Time Series: Theory and Practice

- Date: October 15, 2008
- Speaker: Pierre Andre Chiappori, Columbia
- Topic: TBA

- Date: October 22, 2008
- Speaker: Yinghua He, Columbia
- Topic: Estimating School Choice Problem under Boston Mechanism as a Bayesian Game

## September 28, 2008

### Exciting 1% shift!

Brendan Nyhan offers this amusing example of a newspaper hyping poll noise. From the LA Times:

Registered voters who watched the debate preferred Obama, 49% to 44%, according to the poll taken over three days after the showdown in Oxford, Miss.

That is a small gain from a week ago, when a survey of the same voters showed the Democratic candidate with a 48% to 45% edge.

A small gain, indeed.

## September 25, 2008

### Models for cumulative probabilities

Dan Lakeland writes:

I am working with some biologists on a model for time-to-response for animals under certain conditions. The model(s) ultimately are defined in terms of a differential equation that relates a (hidden) concentration of a metabolic product to the (cumulative) probability that an animal will respond within a given time by changing its behavior.
Now mostly, in my experience, statistical models are models for averages, or particular quantiles of the dataset (medians etc). Most models attempt to predict something (like time to response) from something else (like say measured amounts of a drug). In this case, rather than predicting individual response times, we're trying to predict shape of a distribution from measured exposure to a certain environment.

In this case, we are tempted to use some measure of the goodness of fit to try to guess what is going on internally within the animal. For ease of computation, I'm fitting this model with maximum likelihood methods initially (a Bayesian approach may come later if time allows).

What is your opinion on model selection methods in this type of scenario? Your book index has "model selection and why we avoid it," which sounds unhelpful, but the section on model selection was actually more helpful than the index implied. Is there anything you can add in this context?

My reply: I'm not quite sure what your question is, but maybe, if I can translate it into the social-science examples with which I'm more familiar, I can imagine you're doing something like predicting what percentage of people will respond a certain way to an advertisement, or how low a price would have to be before half the people would buy something. Framed that way, these sorts of models are pretty common. In section 6.8 of ARM, we discuss the relation between certain models for individuals and for groups.

## September 18, 2008

### Student-t conference

A conference celebrating the 100th birthday of W.S. "Student" Gosset's "The Probable Error of a Mean" and three other classic papers:

The Harvard University Department of Statistics presents:

"Quintessential Contributions: Celebrating Major Birthdays of Statistical Ideas and Their Inventors"
When: Saturday, September 27, 2008
Where: Radcliffe Gym, 18 Mason Street, Cambridge, MA

*Celebrating the 65th birthday of Donald B. Rubin and the 30th birthday of his "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse"
**Invited Speaker: Fritz Scheuren

*Celebrating the 70th birthday of Carl N. Morris and the 25th birthday of his "Parametric Empirical Bayes Inference: Theory and Applications"
**Invited Speaker: Andrew Gelman

*** Catered Chinese Lunch ***

*Celebrating the 85th birthday of Herman Chernoff and the 35th birthday of his "The Use of Faces to Represent Points in K-Dimensional Space Graphically"
**Invited Speaker: Steve Wang

*Celebrating the 100th birthday of W.S. "Student" Gosset's "The Probable Error of a Mean"
**Invited Speaker: Stephen Stigler

*Student Presentation: "From Student to students"

*** "Student t Party" at Cambridge Queen's Head Pub***

Registration fee is waived for ALL current and former Harvard affiliates (students, staff and faculty), speakers and specially invited guests. Non-Harvard Affiliates' $50.00 registration fee includes beverages, food, and full symposium. See below for payment information. To register, provide the following information by September 15, 2008, to : Name: ______________________________________ Harvard affiliation, if any: _______________ E-mail address: ____________________________ Lunch desired: Yes _______ or No _______ Please make checks payable to: Harvard University Statistics Department and mail to me at the address below. Payment will also be accepted at the door, but we must have a YES or NO reply no later than September 15, 2008, regardless of registration status, to obtain an accurate headcount for catering purposes (attendees will receive marked name tags indicating lunch was reserved). Posted by Andrew at 8:22 PM | Comments (2) | TrackBack ## September 15, 2008 ### The biggest problem in model selection? A student writes: One of my most favorite subjects is model-selection. I have read some papers in this field and know that it is so widely used in almost every sub-field in statistics. I have studied some basic and traditional criterion such as AIC, BIC and CP. The idea is to set a consistent optimal criterion, usually it's not easy when the dimensionality is high, but my question is, what is the biggest problem and why it is so hard? Also I heard that this field has some relations to non-parameter statistics and linear model theories, but as an undergraduate student, I do not know any specific connections between them. I am working in a laboratory in biostatistics; are there any related problems in this field? My reply: In my opinion, the biggest difficulty is that AIC etc. are all approximations, not actual out-of-sample errors. The attempt to calculate out-of-sample errors leads to cross-validation which has its own problems. 
Some sort of general theory/methods for cross-validation would be good. I'm sure people are working on this but I don't think we're there yet. Regarding your final question: sure, just about every statistical method has biological applications. In this case, you're comparing different models you might want to fit. Posted by Andrew at 5:16 PM | Comments (14) | TrackBack ## September 11, 2008 ### More on interactions Bruce McCullough writes: Don't know if you're aware of this, but if you need more evidence for the primacy of interaction effects, data mining is a great place to look. My degree is in economics. I was taught to use interaction effects as a test for nonlinearity, and that was about it. My data mining experience of the past few years has taught me that interaction effects can be neglected at my own peril. A wonderful paper that illustrates this is "Variable selection in data mining: Building a predictive model for bankruptcy," by Dean P. Foster and Robert A. Stine in the Journal of the American Statistical Association (2000). The usual linear regression doesn't work. The model with lots of interactions works very well. Posted by Andrew at 10:27 PM | Comments (1) | TrackBack ## September 10, 2008 ### My talks this week in D.C.: today (Wed.) at George Washington University, Thurs. at the Cato Institute If you're in D.C., you should stop by. . . . I'm speaking in the statistics department at George Washington University on the topic of interactions. Here's the powerpoint and here's the abstract: As statisticians and practitioners, we all know about interactions but we tend to think of them as an afterthought. We argue here that interactions are fundamental to statistical models. We first consider treatment interactions in before-after studies, then more general interactions in regressions and multilevel models. Using several examples from our own applied research, we demonstrate the effectiveness of routinely including interactions in regression models. 
We also discuss some of the challenges and open problems involved in setting up models for interactions. The talk will be today, Wed 10 Sept, at 3pm at 1957 E Street, Room 212. If you don't know where that is, you can call the department (202-994-6356) and they should be able to give you directions. Tomorrow (Thurs) I'll be speaking with Boris at noon at the Cato Institute on Red State, Blue State. It's not too late to sign up for that. Posted by Andrew at 12:39 AM | Comments (3) | TrackBack ## September 3, 2008 ### Non-Aristotelian logic and municipal government That header got your attention, huh?? John Hull writes: Reading an article on "non-Aristotelean" logic, where P(A) is my confidence of A being true, I found (on page 10) the equation P(B=>C)=P(B[AND]C)/P(B). Since I work in municipal government, an obvious interpretation of this is the following: My confidence that if a person thinks the world is "flat" then they are dangerously stupid is the same as my confidence that a person believes the world is "flat" and is dangerously stupid, divided by my confidence that a person thinks the world is "flat." Setting aside the fact that when people's welfare is in the balance, I tend to become rather passionate and use rather strong language, I simply cannot wrap my head around this idea. For example, my confidence that if it's a lion, then it eats gazelles equals my confidence that it's is a lion eating a gazelle, divided by my confidence that it's is a lion. The left side of that equation is a (near) certainty — lions eat gazelles — but the right-hand side of the equation...how do I even begin to establish my confidence it's a lion, let alone the rest of it? Can you make this more understandable? Any help will be appreciated. My response: 1. I find the if-then connection to Aristotelian logic confusing. 
I'd prefer to start with probabilities as first principles, and then interpret conditional probabilities Bayesianly or, equivalently, in a frequentist way as the long-run proportion of cases in a "reference set." (The choice of reference set is equivalent to the choice of what to condition on in a Bayesian calculation.) We discuss this in chapter 1 of Bayesian Data Analysis.

2. Right now, I'm realizing how nonintuitive many principles of probability are to some people. See the discussion here, where one of the commenters wants to assign a zero probability to an event (a tied congressional election) because it has never happened yet. That sounds commonsensical--but not if p = 1/80,000 and n = 20,000.

Posted by Andrew at 12:42 AM | Comments (4) | TrackBack

### Melding statistics with engineering?

Dan Lakeland writes:

I recently enrolled as a PhD student in a civil engineering program. My interest could be described as the application of data and risk analysis to engineering modelling, design methods, and decision making. The field is pretty ripe, and infrastructure risk analysis is a common topic these days, but the simulations and statistical approaches taken so far have been a bit unsatisfactory. For example, people studying the impact of bridge failures during earthquakes on the local economy might assume a constant cost per person-hour of delay throughout the rebuild period, or people might build statistical models of the probability of building collapse, but I would call them pretty much prior distributions, not really based on much data, or based on a finite element computer model of the physics of a single model building. I think the application of data to engineering is, bizarrely, a rather new field--or at least in a renaissance.
Back in the 50s or earlier they used to do lots of tests and generate graphical nomographs of the results (like the Moody chart for fluid-flow friction factors), but these days the emphasis is on detailed finite element analyses, which tell you exactly how some model will perform but don't deal at all with the difference between your model assumptions and reality. I'm attaching an article that I'm reading for an earthquake soil mechanics class, which shows pretty much the state of the art of applications of (Bayesian) statistics to engineering. A CPT test is a test where they push a cone on the end of a long rod into the ground and measure the pressure being applied to the cone as a function of depth.

Another paper I've read uses artificial neural networks to predict the shear capacity of reinforced concrete beams. Engineers typically don't like ANN-type approaches because they're data oriented and don't have explanatory power in terms of physics. On the other hand, the ANN model, because it's based on data, is a much better fit to real performance than the existing physics-based models.

I wonder if you might comment in your blog on melding statistics with engineering, especially how we can use data together with deterministic models and build better engineering decision rules, both for everyday engineering and for dealing with social investment decisions such as building code requirements for extreme events like earthquakes, hurricanes, and so forth. What decision theory books or articles do you know of that might be useful and relevant to this field?

My reply: I've long thought of statistics as a branch of engineering rather than a science. To me, statistics is all about building tools to solve problems. On the other hand, departments of Operations Research and Industrial Engineering tend to focus on probability theory rather than applied statistics, so I think we need our own departments.
Getting to your specific question: yes, I know what you're talking about. Back in high school and college I spent a few summers working in a lab, programming finite element methods. Ultimately this was all statistical, but I didn't see that at the time. I imagine there's been a huge amount of work in this area in the past 25 years, with iterative methods for refining grid boxes and so forth. It would be a fun area to work in, but I suspect it would be an effort to translate it into statistical language.

It seems to me that engineers and physicists work very hard at solving particular problems, which are often big and difficult. Statisticians develop general tools for easy problems (e.g., logistic regression), which is a different sort of challenge. I think there's great potential for putting these perspectives together, but I'm not quite clear where to start. I've seen some articles in statistics journals addressing your concerns, but I haven't been so impressed by what I've seen there. Probably a better strategy is to start with the engineering literature and add uncertainty to that.

Posted by Andrew at 12:17 AM | Comments (8) | TrackBack

## August 25, 2008

### Dependent and independent variables

Regarding the question of what to call x and y in a regression (see comments here), David writes, "The semantics are ugly, and don't really add much, because we are concerned with the relation of one to the other, not what they themselves are."

I agree that the semantics don't really add much, but they can subtract, I think! First off, the words "dependent" and "independent" sound similar and can lead to confusion in conversation. Second, as commenter Infz noted, people confuse "independent variables" with statistical independence, leading to the incorrect view that multiple regression requires the predictors to be independent.
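That last point is easy to demonstrate numerically. Here is a minimal sketch (simulated data of my own invention, not from any example above) showing that ordinary least squares recovers the right coefficients even when the predictors are strongly correlated:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two strongly correlated predictors (correlation about 0.8).
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

# True model: y = 1 + 2*x1 - 3*x2 + noise.
y = 1 + 2 * x1 - 3 * x2 + rng.normal(size=n)

# Ordinary least squares; no independence among predictors required.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.round(beta, 2))  # close to [1, 2, -3] despite the correlation
```

Collinearity among predictors inflates the standard errors of the estimates, but it does not bias them; that is exactly the distinction the "independent variables" terminology blurs.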
I agree, though, that the term "parameter" can be confusing: sometimes it's something you can vary and sometimes it's something you can estimate. And I've already discussed how "marginal" has opposite meanings in statistics and in economics.

Posted by Andrew at 10:49 AM | Comments (3) | TrackBack

## August 23, 2008

### "The method of multiple correlation"

Someone writes:

I was reading Harold Gulliksen's *Theory of Mental Tests* (1950), and on pp. 327-329 it describes a process for solving a set of equations of the form y = b1x1 + b2x2 + ... + bnxn so as to minimize the least-squares error. Sounds like regression. But this procedure claims to account for the correlation between all the x variables. He calls it "the method of multiple correlation." Why don't we use this procedure all the time, instead of standard regression, which assumes independence of the independent variables?

My reply: I haven't ever heard of this before, but it sounds to me just like multiple regression (which does not assume independence of the x-variables). This confusion of terminology is one reason why I don't like to use the term "independent variables." I prefer to call them "predictors."

Posted by Andrew at 4:09 PM | Comments (13) | TrackBack

## August 20, 2008

### Interactions

I have mixed feelings about this picture and accompanying note from Jeremy Freese, who writes:

Key findings in quantitative social science are often interaction effects in which the estimated "effect" of a continuous variable on an outcome for one group is found to differ from the estimated effect for another group. . . . Interaction effects are notorious for being much easier to publish than to replicate, partly because it is easy for researchers to forget (?) how they tested many dozens of possible interactions before finding one that is statistically significant and can be presented as though it was hypothesized by the researchers all along. . . .
There are so many ways of dividing a sample into subgroups, and there are so many variables in a typical dataset that have low correlation with an outcome, that it is inevitable that there will be all kinds of little pockets of high correlation for some subgroup just by chance.

I take his point, and indeed I've written myself about the perils of fishing for statistical significance in a pond full of weak effects (uh, ok, let's shut down that metaphor right there). And I even cite Freese in my article. On the other hand, I'm also on record as saying that interactions are important (see also here). I guess my answer is that interactions are important, but we should look for them where they make sense. Jeremy's graph reproduced above doesn't really give enough context. Also, remember that the correlation between before and after measurements will be higher among controls than among treated units.

Posted by Andrew at 12:25 AM | Comments (4) | TrackBack

## August 7, 2008

### Whassup with Bart?

I've seen Jennifer Hill and Ed George give great talks on Bayesian additive regression trees. It looked awesome. So why haven't these papers appeared anywhere? All I can find are preprints.

Posted by Andrew at 11:08 AM | Comments (7) | TrackBack

## August 3, 2008

### The mythical Gaussian distribution and population differences

There was a dynamic discussion on gender differences in performance a few days ago. Many interesting points were raised, but most of them regarded differences in models (variance, mean) rather than differences in distributions. One of the comments referred to the Project TALENT database from 1960, one of the most exhaustive datasets of its type. I have been unhappy for quite some time that papers do not show the actual data. For that reason I wrote a small plotting program that allows visual comparisons of histograms. The plentiful TALENT data makes it possible to avoid binning or kernel smoothing.
Here are some plots: the pink histogram is for girls, the blue one for boys, and where the pink and blue overlap, there is grey. It is interesting to observe the skew, which might indicate incentives, learning curves, or unbalanced tests. One of the most striking examples of skew is the difference in reading comprehension between the Catholic/Protestant and Jewish populations, but I also list mechanical reasoning.

Project TALENT's data is from 1960, so things might have changed since then. Nowell & Hedges discuss some trends from 1960-1994. In the end, let me reiterate that this posting does not make any statements about the causality of these differences--I am merely providing the data as such. The only assumptions were that the missing values can be dropped (boys were overrepresented in this respect) and that both underlying populations are comparable (no systematic effects with respect to extraneous biases such as age). I did NOT observe boys being overrepresented at the low end of the spectrum for mathematics scores--but this could easily happen if one isn't careful about throwing out the missing values coded with "-1" (5.4% among boys, 4.4% among girls).

Posted by Aleks Jakulin at 6:47 PM | Comments (7) | TrackBack

### Classifying Olympic athletes as male or female, leading to a comment about the recognition of uncertainty in life

I read an interesting op-ed by Jennifer Finney Boylan about the classification of Olympic athletes as male or female. Apparently, they're now checking the sex of athletes based on physical appearance and blood samples. This should be an improvement over the simple chromosome test, which can label a woman as a man because she has a Y chromosome even if she is developmentally and physically female. But then Boylan writes:

Most efforts to rigidly quantify the sexes are bound to fail. For every supposedly unmovable gender marker, there is an exception. There are women with androgen insensitivity, who have Y chromosomes.
There are women who have had hysterectomies, women who cannot become pregnant, women who hate makeup, women whose object of affection is other women.

I'm starting to lose the thread here. Nobody is talking about excluding from Olympic competition women who have had hysterectomies or cannot become pregnant, right? And lesbians are allowed to compete too, no? And makeup might be required for the Miss America competition but not for athletes. Boylan continues:

So what makes someone female then? . . . The only dependable test for gender is the truth of a person's life . . . The best judge of a person's gender is what lies within her, or his, heart.

Would this really work? It just seems like a recipe for cheating--for Olympic teams in authoritarian countries to take some of their outstanding-but-not-quite-Olympic-champion-caliber male athletes and tell them to live like women. It doesn't seem so fair to female athletes from the U.S., for example, to have to compete with any guy in the world who happens to be willing to say, for the purposes of the competition, that in his heart he feels like a woman.

Why do I mention this in a statistics blog? I think people are often uncomfortable with ambiguity. Boylan correctly notes that sex tests can have problems and that there is no perfect rule, but then she jumps to the recommendation that there be no rules at all.

Posted by Andrew at 1:24 PM | Comments (4) | TrackBack

## August 2, 2008

### Let computers do the surveys!

The WSJ reports that people are more likely to provide socially acceptable answers to survey questions about themselves when interviewed by a person (or even an avatar!) than when responding to an automated survey system or a recording. Such questions relate to politics, hygiene, exercise, health, and so on. The research is helping refine polling at a university phone center nearby.
Activity at the center, which sits in a former school building, picks up around dinnertime when the staff makes calls for university-run surveys from a warren of cubicles. The questioners are asked to speak in even tones, reading from scripts. No one is allowed to say, "How are you?" in case the person on the other end had a bad day. The interviewers don't laugh; they don't want people to treat this as a social call. They are allowed only neutral responses such as "I see" or "Hmm."

There are some interesting demonstrations at Harvard's Implicit project.

Posted by Aleks Jakulin at 6:01 AM | Comments (6) | TrackBack

## August 1, 2008

### Rube Goldberg statistics?

Kenneth Burman writes:

Some modern, computer-intensive data analysis methods may look to the non-statistician (or even ourselves) like the equivalent of the notorious Rube Goldberg device for accomplishing an intrinsically simple task. Whereas some variants of the bootstrap or cross-validation might fit this situation, mostly the risk of this humiliation is to be found in MCMC-based Bayesian methods. I [Burman] am not at all against such methods. I am only wondering if, to "outsiders" (who may already have a negative impression of statistics and statisticians), these methods may appear like a Rube Goldberg device. You have parameters, likelihoods, hierarchies of fixed effects, random effects, hyperparameters, then Markov chain Monte Carlo with tuning and burn-in, followed by long "chains" of random variables, with possible thinning for lag correlations and concerns about convergence to an ergodic state. And after all that, newly "armed" with a "sample" of 100,000 (or more) numbers from a mysterious posterior probability distribution, you proceed to analyze these new "data" (where did the real data go? now you have more numbers than you started with for actual data) by more methods, simple (a mean) or complex (smoothing using kernel density methods, and then pulling off the mode).
All OK to a suitably trained statistician, but might we be in for ridicule and misunderstanding from the public? If such a charge were leveled at us ("you guys are doing Rube Goldberg statistics"), how would we respond, given that the "complaint" comes from people with little or no statistics training? Of course, such folks may not be capable of generating such a critique, but could still realize they have no idea what the statistician is doing to the data to get answers. It does us no good if the public thinks our methods are Rube Goldberg in nature.

Interesting question. I'll respond in a few days. But in the meantime, would any of you like to give your thoughts?

Posted by Andrew at 1:10 AM | Comments (13) | TrackBack

## July 28, 2008

### "An ounce of replication..."

I was looking through this old blog entry and found an exchange I like enough to repost. Raymond Hubbard and R. Murray Lindsay wrote:

An ounce of replication is worth a ton of inferential statistics.

I questioned this, writing:

More data are fine, but sometimes it's worth putting in a little effort to analyze what you have. Or, to put it more constructively, the best inferential tools are those that allow you to analyze more data that have already been collected.

Seth questioned my questioning, writing:

I'd like to hear more about why you don't think an ounce of replication is worth a ton of inferential statistics. That has been my experience. The value of inferential statistics is that they predict what will happen. Plainly another way to figure out what will happen is to do it again.

To which I replied:

I'm not sure how to put replication and inferential statistics on the same scale . . . but a ton is 32,000 times an ounce. To put it in dollar terms, I think that in many contexts, $32,000 of data analysis will tell me more than $1 worth of additional data. Often the additional data are already out there but haven't been analyzed.
I think it's fun to take this sort of quotation literally and see where it leads. It's a rhetorical strategy that I think works well for me as a statistician.

Posted by Andrew at 12:10 AM | Comments (5) | TrackBack

## July 26, 2008

### NYT vs WSJ on gender issues

Aleks sends in a striking example of a news story presented in two completely different ways:

I [Aleks] was looking at the NYT and WSJ today, and one particular discrepancy struck me. The NYT story, "Math Scores Show No Gap for Girls," by Tamar Lewin, says:

Three years after the president of Harvard, Lawrence H. Summers, got into trouble for questioning women’s “intrinsic aptitude” for science and engineering — and 16 years after the talking Barbie doll proclaimed that “math class is tough” — a study paid for by the National Science Foundation has found that girls perform as well as boys on standardized math tests. . . . “Now that enrollment in advanced math courses is equalized, we don’t see gender differences in test performance,” said Marcia C. Linn of the University of California, Berkeley, a co-author of the study. “But people are surprised by these findings, which suggests to me that the stereotypes are still there.” . . . Although boys in high school performed better than girls in math 20 years ago, the researchers found, that is no longer the case. . . . The researchers looked at the average of the test scores of all students, the performance of the most gifted children and the ability to solve complex math problems. They found, in every category, that girls did as well as boys. . . .

The NYT story had absolutely no mention of the girl/boy variance whatsoever. Compare to the WSJ version (girl/boy variance in the headline), "Boys' Math Scores Hit Highs and Lows," by Keith Winstein:

Girls and boys have roughly the same average scores on state math tests, but boys more often excelled or failed, researchers reported.
The fresh research adds to the debate about gender difference in aptitude for mathematics, including efforts to explain the relative scarcity of women among professors of science, math and engineering. In the 1970s and 1980s, studies regularly found that high-school boys tended to outperform girls. But a number of recent studies have found little difference. . . . [The recent study] didn't find a significant overall difference between girls' and boys' scores. But the study also found that boys' scores were more variable than those of girls. More boys scored extremely well -- or extremely poorly -- than girls, who were more likely to earn scores closer to the average for all students. . . . The study found that boys are consistently more variable than girls, in every grade and in every state studied. That difference has "been a concern over the years," said Marcia C. Linn, a Berkeley education professor and one of the study's authors. "People didn't pay attention to it at first when there was a big difference" in average scores, she said. But now that girls and boys score similarly on average, researchers are taking notice, she said.

Here's some context from a few years back (I looked it up, because I wasn't sure exactly what Summers said, and the NYT article referred to him). From the NYT a few years ago:

Dr. Summers cited research showing that more high school boys than girls tend to score at very high and very low levels on standardized math tests, and that it was important to consider the possibility that such differences may stem from biological differences between the sexes. Dr. Freeman said, "Men are taller than women, that comes from the biology, and Larry's view was that perhaps the dispersion in test scores could also come from the biology."

What's amazing is that the two newspapers quote the same researcher but with two nearly opposite points.
I assume she made both points to both newspapers, but the NYT reporter ran with the "stereotypes are still there" line and the WSJ reporter ran with "researchers are taking notice." It must be frustrating to Linn to have only part of her story reported in each place. (Yeah, yeah, I know that newspapers have space constraints. It still must be frustrating.)

Posted by Andrew at 4:20 PM | Comments (19) | TrackBack

## July 16, 2008

### The American (League) Dynasty

Every year, the best players (or at least many of the best players) from Major League Baseball's American League play their counterparts in the National League in the All-Star Game. They played last night; the American League won in the 15th inning. Here's who won, from 1965 (when I was born) to the present, with 1965 at the left and 2008 at the right:

NNNNNNNNNNNNNNNNNNANNANAAAAAANNNAAAAATAAAAAA

The "T" indicates a tie (in 2002): unlike regular games, there is no requirement that the All-Star Game continue until somebody wins, and pitchers are reluctant to pitch too many innings and potentially hurt themselves.

I was born into an era in which the National League won every game. Now, the American League wins (or at least doesn't lose) every game. This is happening in a sport where even bad teams beat good teams occasionally, so it's really mystifying. It would be possible to explain a small edge for one league or the other that persists for a few years --- the league with the best pitcher will have an advantage, for example, and that pitcher can play year after year --- but these effects can't come close to explaining the long runs in favor of one league or the other. Predicting next year's winner to be the same as this year's winner would have correctly predicted 80% of the games in my lifetime... and that's if we pretend the National League won the tie game in 2002. (If we pretend the American League won it, it's 84%.)
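Those two percentages can be recomputed directly from the letter string above; here's a quick check:

```python
# Win sequence from 1965 to 2008 (N = National, A = American, T = the 2002 tie).
seq = "NNNNNNNNNNNNNNNNNNANNANAAAAAANNNAAAAATAAAAAA"

def persistence(s, tie_counts_as):
    """Fraction of years correctly predicted by 'same winner as last year'."""
    s = s.replace("T", tie_counts_as)
    hits = sum(a == b for a, b in zip(s, s[1:]))
    return hits / (len(s) - 1)

print(round(persistence(seq, "N"), 2))  # 0.79 if the National League "won" the tie
print(round(persistence(seq, "A"), 2))  # 0.84 if the American League "won" it
```

Under independent coin flips, the same prediction rule would be right only about half the time, which is one way to quantify how un-coin-flip-like this record is.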
What would be a reasonable statistical model for baseball All-Star games, and why isn't it something close to coin flips?

Posted by Phil at 3:19 PM | Comments (14) | TrackBack

## July 14, 2008

### Thoughts on new statistical procedures for age-period-cohort analyses

Posted by Andrew at 9:53 AM | Comments (0) | TrackBack

## July 11, 2008

### Guernsey McPearson's Statistical Menagerie

Here are some hilarious (if you're a statistician) sketches from Stephen Senn:

Robustnik
"These are the three laws of robustics. First law: get a computer. Second law: get a bigger computer. Third law: what you really need is a much bigger computer." Favourite reading: I Robust, by Isaac Azimuth.

Frequency Freak
"Did you randomise? OK: so far so good. Now what would you have said if the third value from the left had been the second from the right? Hold on a minute. Are you sure you haven't looked at this question before?" Favourite reading: Casino Royale.

Bog Bayesian
"All you need is Bayes. It's the answer to everything. If only Adolf and Neville could have exchanged utility functions at Munich we could have saved the world a whole lot of bother round about the middle of the last century." Favourite reading: The Hindsight Saga.

Subset Surfer
"OK, so the egg's rotten but parts of it are excellent." Favourite reading: Europe on $5 a day.

Gibbs Sampler
" First catch your likelihood. Take one Super Cray, a linear congruential generator, any prior you like and if the whole thing isn't done to a turn within three days my name's not Gary Rhodes." Favourite reading: Mrs Beaton

Complete Consultant
" First we test the randomisation. Then we look for homogeneity between centres. Then we run the Shapiro-Wilks over it and if you like we'll throw in a Kolmogorov-Smirnov at no extra cost. Then we test for homogeneity of variance and look for outliers and even if that's OK we'll do a Mann-Whitney anyway just to be on the safe side. All this will be fully documented in a report with our company logo on every page." Favourite reading: The Whole Earth Catalogue.

Mr Mathematics
"I just don't see the problem. All you have to do is define the null hypothesis precisely, define the alternative hypothesis precisely, choose your type I error rate and use the most powerful test." Favourite reading: Brave New World.

Bootstrapper
"Look, this is the way to build the football team of the future. You choose a player. You put him back in the pool. You choose again. Do that long enough and if you don't eventually get a team which has Becks in it three times my name's not Sven Goran Erikson." Favourite reading: Bradley's Shakksperrr.

Unconditional Inferencer
"It's true that all the engines are on fire and the captain has just died from a heart attack but there's no need to worry because averaged over all flights air travel is very safe." Favourite reading: Grimm's Fairy Tales

And many more:

Data Explorer "Wow! It's all too beautiful. I mean, Man, the colours, the shapes and those rotations and dig those projections. It's like Lucy in the Sky with Diamonds meets the Walrus and the Eggman." Favourite reading: The Glass Bead Game.

Third Degree Bayesian
"Look, there is no way I am letting you out of this room until you give me a prior. Have you heard of the jackknife? Yes? Well this is a thumbscrew." Favourite reading: Justine.

Mr Megabyte
"Just you wait till virtual reality hits the statistical computing scene. The only thing holding us back is that we have been mentally crippled by having been brought up to use pencil and paper. In the third millennium we will all have statistical processing chips implanted behind our ears. Books are a thing of the past." Favourite video: Farenheit 451.

Absolute Abacus
"Of course, no real statistical techniques worth talking about have been discovered since 1962. I grant you that in the occasional difficult case you might wish to use an electronic computer but not everyone wants to travel down to Manchester each time they need to calculate something." Favourite reading: The Anglo Saxon Chronicle.

Tabulator
"What you really need to do is understand the field of application thoroughly, become familiar with every data point, check each one against original records and present the whole thing with some simple graphs and tables. All this probability rubbish is just a conspiracy got up by a bunch of mathematicians who don't even understand the first thing about data." Favourite reading: The Little House on the Prairie.

Mrs P
"Now now. Nursey won't go away until you've filled this bottle. And if you don't produce something soon you'll never grow up to get published. Now, would a nice cup of t help?" Favourite reading: Winnie the Pooh.

Whew! Just copying this made me feel good.

## June 27, 2008

### Beer, quality control, and Student's t distribution

John Cook's theory of why the t distribution was discovered at a brewery:

Beer makers pride themselves on consistency while wine makers pride themselves on variety. That’s why you’ll never hear beer fans talk about a “good year” the way wine connoisseurs do. Because they value consistency, beer makers invest more in extensive statistical quality control than wine makers do.

(On the other hand, Seth thinks that "ditto foods" are so late-twentieth-century, and that lack of uniformity in taste is ultimately healthier.)

## June 26, 2008

### The End of Theory: The Data Deluge Makes the Scientific Method Obsolete

Drew Conway pointed me to this article by Chris Anderson talking about the changes in statistics and, by implication, in science, resulting from the ability of Google and others to sift through zillions of bits of information. Anderson writes, "The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all."

Conway is skeptical, pointing out that in some areas--for example, the study of terrorism--these databases don't exist. I have a few more thoughts:

1. Anderson has a point--there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important.

2. That said, if you look at the end result of an analysis, it is often a simple comparison of the "treatment A is more effective than treatment B" variety. In that case, no matter how large your sample size, you'll still have to worry about issues of balance between treatment groups, generalizability, and all the other reasons why people say things like, "correlation is not causation" and "the future is different from the past."

3. Faster computing gives the potential for more modeling along with more data processing. Consider the story of "no pooling" and "complete pooling," leading to "partial pooling" and multilevel modeling. Ideally our algorithms should become better at balancing different sources of information. I suspect this will always be needed.
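To make the pooling story concrete, here's a toy sketch of all three estimators. It assumes the between-group and within-group standard deviations (tau and sigma, my notation) are known, whereas a real multilevel model would estimate them from the data:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy setup: 8 groups whose true means scatter around a grand mean of 0.
tau, sigma = 1.0, 2.0                    # between- and within-group sd (assumed known)
n_j = np.array([5, 5, 10, 10, 20, 20, 50, 50])   # observations per group
true_means = rng.normal(0.0, tau, size=8)
ybar = np.array([rng.normal(m, sigma / np.sqrt(n)) for m, n in zip(true_means, n_j)])

no_pool = ybar                            # each group estimated on its own
complete_pool = np.full(8, ybar.mean())   # one shared estimate for every group

# Partial pooling: shrink each group mean toward the overall mean,
# shrinking small groups more than large ones.
w = n_j / (n_j + sigma**2 / tau**2)       # w ≈ [0.56, 0.56, 0.71, 0.71, 0.83, 0.83, 0.93, 0.93]
partial_pool = w * ybar + (1 - w) * ybar.mean()
```

As n_j grows, w approaches 1 and the partial-pooling estimate approaches the no-pooling one: denser data, less reliance on the model, which is the tradeoff above in miniature.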

## June 25, 2008

### The popularity of statistics?

Jennifer pointed me to this site, which states that "white people hate math" but "are fascinated by 'the power of statistics' since the math has already been done for them." I'd like to believe this is true (the part about white people liking statistics, not the part about the math having already been done for them), but I'm skeptical. Everywhere I've ever taught, there have been a lot more math majors than stat majors, and I'm pretty sure this is true among the subset of students who are white. But it might be true that the business majors, the poli sci majors, the English majors, etc.--not to mention the people who don't go to college at all--prefer statistics to mathematics. Actually, I think most of these people should prefer statistics to mathematics. But I fear that a more likely reaction would be something like, "math is cool, statistics is boring."

P.S. I looked further down, and this "Stuff White People Like" site is just weird. "With few exceptions, white people are actually fond of almost any dictator not named Hitler"?? Huh? I mean, I can see that the site is a parody, but this is just weird.

## June 23, 2008

### Diagnostics for multivariate imputations: getting inside the black box

Random imputation is a flexible and useful way to handle missing data (see chapter 25 for a quick overview), but it's typically treated as a black box. This is partly a result of confusion over statistical theory. Structural assumptions such as "missingness at random" cannot be checked from data--this is a fundamental difficulty--but it does not mean that imputations cannot be checked. In our recent paper, Kobi Abayomi, Mark Levy, and I do the following:

We consider three sorts of diagnostics for random imputations: displays of the completed data, which are intended to reveal unusual patterns that might suggest problems with the imputations; comparisons of the distributions of observed and imputed data values; and checks of the fit of observed data to the model that is used to create the imputations. We formulate these methods in terms of sequential regression multivariate imputation, which is an iterative procedure in which the missing values of each variable are randomly imputed conditionally on all the other variables in the completed data matrix. We also consider a recalibration procedure for sequential regression imputations. We apply these methods to the 2002 environmental sustainability index, which is a linear aggregation of 64 environmental variables on 142 countries.

The article has some pretty pictures (and some ugly pictures too; hey, we're not perfect). I don't know how directly useful these methods are; I think of them as providing "proof of concept" that model checking for imputations is possible at all, and I'm hoping this will spur lots of work by many researchers in the area. Ultimately I'd like people (or computer programs) to check their imputations just as they currently check their regression models.
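To give a flavor of the simplest such diagnostic--comparing the distributions of observed and imputed values--here's a toy sketch (made-up data and a bare-bones random regression imputation, not the sequential procedure from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y depends on x, and y is more likely to be missing when x is large.
n = 1000
x = rng.normal(0, 1, n)
y = 2.0 + 1.5 * x + rng.normal(0, 1, n)
p_miss = 1 / (1 + np.exp(-(x - 0.5)))
missing = rng.random(n) < p_miss
obs = ~missing

# Random regression imputation: fit y ~ x on observed cases and impute each
# missing y with a draw from the fitted model (not just the point prediction).
b, a = np.polyfit(x[obs], y[obs], 1)  # slope, intercept
resid_sd = np.std(y[obs] - (a + b * x[obs]))
y_completed = y.copy()
y_completed[missing] = a + b * x[missing] + rng.normal(0, resid_sd, missing.sum())

# Diagnostic: compare observed and imputed distributions.  Here the imputed
# values *should* be shifted upward (missing cases tend to have larger x),
# so a difference between the two distributions is not by itself a red flag.
print("observed: mean %.2f, sd %.2f" % (y[obs].mean(), y[obs].std()))
print("imputed:  mean %.2f, sd %.2f" % (y_completed[missing].mean(), y_completed[missing].std()))
```

The point of the last comment is the substantive one: the diagnostic is about understanding *why* the two distributions differ, not about demanding that they match.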

## June 16, 2008

### Friday the 13th study

Apparently, Friday the 13th is not unlucky, according to Dutch researchers: link to article.

I would like to see a parallel psychological study, to see if people are more careful on Friday the 13th, go out less, drive less (or just shorter distances) - and if people considering criminal activity hold off until the next day. I also wonder if there is an upswing in the types of "bad luck" they chose to survey on Saturday the 14th...

## June 12, 2008

### Some thoughts on the saying, "All models are wrong, but some are useful"

J. Michael Steele explains why he doesn't like the above saying (which, as he says, is attributed to statistician George Box). Steele writes, "Whenever you hear this phrase, there is a good chance that you are about to be sold a bill of goods."

He considers a street map of Philadelphia as an example of a model:

If I say that a map is wrong, it means that a building is misnamed, or the direction of a one-way street is mislabeled. I never expected my map to recreate all of physical reality, and I only feel ripped off if my map does not correctly answer the questions that it claims to answer. My maps of Philadelphia are useful. Moreover, except for a few that are out-of-date, they are not wrong.

Actually, my guess is that his maps are wrong, in that there probably are a couple of streets that are mislabeled in some way. Street maps are updated occasionally (even every year), but streets get changed, and not every change is captured in an update. I expect there are a few places where Steele's map has mistakes. (But I doubt it's like those old tourist street maps of Soviet cities which, I've been told, had lots of intentional errors to make it harder for people to actually find their way around too well.) In any case, I take his general point, which is that a street map could be exactly correct, to the resolution of the map.

Statistical models of the sort that I typically use are different in being generative: that is, they are stochastic prescriptions for creating data. As such, they can typically never be proven wrong (except in special cases; for example, a binary regression model can't produce a data value of 0.6). The saying, "all models are wrong," is helpful because it is not completely obvious: the wrongness usually can't be proved directly, except in special cases.

Recall the saying that a chi-squared test is a "measure of sample size." With a small sample size, you won't be able to reject even a silly model, and with a huge sample size, you'll be able to reject any statistical model you might possibly want to use (at least in the social and environmental sciences, where I do most of my work). This is a simple point, and I can see how Steele can be irritated by people making a big point about it . . . .
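The "measure of sample size" point can be checked in a few lines (a toy example with made-up category probabilities, using scipy's chisquare):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# A null model that is slightly wrong: we test "all four categories equally
# likely" when the true probabilities are off by a percentage point or two.
true_p = [0.26, 0.25, 0.25, 0.24]

def chisq_p(n):
    counts = rng.multinomial(n, true_p)
    return stats.chisquare(counts).pvalue  # null hypothesis: uniform probabilities

p_small_n = chisq_p(200)        # small sample: the (wrong) null survives
p_huge_n = chisq_p(2_000_000)   # huge sample: the same null is demolished
print("n = 200:       p = %.3f" % p_small_n)
print("n = 2,000,000: p = %.2g" % p_huge_n)
```

Same wrong model both times; only the sample size changed.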

But, the trouble is, many people don't realize that all models are wrong. They want to make statements such as, "The probability is 0.74 that the logistic regression model with predictors A, B, and D is correct." This is not the sort of statement I ever want to say.

The point of posterior predictive checking (see chapter 6 of Bayesian Data Analysis, or chapter 8 in our regression book for a less explicitly Bayesian treatment) is to use numerical and graphical summaries to understand what aspects of the data are captured by the model and what aspects are not. The goal is not to check whether the model is "wrong"--after all, all models are wrong--but to see how well it fits. I agree with Steele that external validation is good too.
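Here's a minimal sketch of such a check (a deliberately simplified model: I hold sigma fixed at its sample value rather than drawing it from its full posterior, and the data are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# Observed data are skewed, but we (wrongly) fit a normal model.
y = rng.exponential(1.0, size=200)
n, ybar, s = len(y), y.mean(), y.std(ddof=1)

# Posterior predictive check with test statistic T(y) = sample skewness.
def skew(v):
    return np.mean(((v - v.mean()) / v.std()) ** 3)

t_obs = skew(y)
t_rep = []
for _ in range(1000):
    mu = rng.normal(ybar, s / np.sqrt(n))  # approximate posterior draw for mu
    y_rep = rng.normal(mu, s, size=n)      # replicated dataset under the model
    t_rep.append(skew(y_rep))

# Posterior predictive p-value: how often are replications as skewed as the
# data?  A value near 0 or 1 flags an aspect of the data the model misses.
p = np.mean(np.array(t_rep) >= t_obs)
print("observed skewness %.2f, posterior predictive p-value %.3f" % (t_obs, p))
```

The normal model can't reproduce the skewness of the data, and the check says so; the point is to learn *which* aspects of the data the model fails to capture, not to declare the model "wrong."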

## June 9, 2008

### Doing statistics the Dunson way: nonparametric statistics for the 21st century

David Dunson has an article in a book that is coming out on Nonparametric Bayes in Practice. I think David's work is great but I keep encountering it in separate research articles and never in a single place that explains when to use each sort of model. I'll have to read the article in detail, but it seems like a good start. I suggested to David that he write a book but he pointed out that nobody reads books. But do people read articles in handbooks? I don't know. I guess what's really needed is a convenient software implementation for all of it. In the meantime, this article seems like the place to go.

## May 30, 2008

### Demystifying double robustness: "in at least some settings, two wrong models are not better than one"

When outcomes are missing for reasons beyond an investigator’s control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse-probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.
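To fix ideas, here is a minimal sketch of the three kinds of estimates being compared (a toy simulation in which, unlike the paper's scenario, both working models happen to be correctly specified; the DR form shown is the standard augmented-IPW estimator, and the propensities are taken as known rather than estimated):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# Simulated population with E[y] = 1.0 (since E[x] = 0).
x = rng.normal(0, 1, n)
y = 1.0 + 2.0 * x + rng.normal(0, 1, n)

# Missingness depends on the covariate; r = 1 means y is observed.
p_true = 1 / (1 + np.exp(-(0.5 + x)))
r = rng.random(n) < p_true

# Outcome model: regression of y on x among responders.
b, a = np.polyfit(x[r], y[r], 1)
yhat = a + b * x

# Propensity model: known here for simplicity; in practice it would be
# estimated, e.g. with a logistic regression of r on x.
phat = p_true

mu_reg = yhat.mean()                                          # regression prediction
mu_ipw = np.mean(np.where(r, y / phat, 0.0))                  # inverse-probability weighting
mu_dr = np.mean(yhat + np.where(r, (y - yhat) / phat, 0.0))   # doubly robust (AIPW)
print("regression %.3f, IPW %.3f, DR %.3f (truth 1.0)" % (mu_reg, mu_ipw, mu_dr))
```

The interesting behavior the paper studies--what happens when both working models are somewhat wrong and some estimated propensities are small--is exactly what this well-specified toy does not exhibit.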

### Post-World War II cooling a mirage

Mark Levy pointed me to this. I don't know anything about this area of research, but if true, it's just an amazing, amazing example of the importance of measurement error:

The 20th century warming trend is not a linear affair. The iconic climate curve, a combination of observed land and ocean temperatures, has quite a few ups and downs, most of which climate scientists can easily associate with natural phenomena such as large volcanic eruptions or El Nino events.

But one such peak has confused them a hell of a lot. The sharp drop in 1945 by around 0.3 °C - no less than 40% of the century-long upward trend in global mean temperature - seemed inexplicable. There was no major eruption at the time, nor is anything known of a massive El Nino that could have caused the abrupt drop in sea surface temperatures. The nuclear explosions over Hiroshima and Nagasaki are estimated to have had little effect on global mean temperature. Besides, the drop is only apparent in ocean data, but not in land measurements.

Now scientists have found – not without relief - that they have been fooled by a mirage.

The mysterious post-war ocean cooling is a glitch, a US-British team reports in a paper in this week’s Nature. What most climate researchers were convinced was real is in fact “the result of uncorrected instrumental biases in the sea surface temperature record,” they write. Here is an editor’s summary.

How come? Almost all sea temperature measurements during the Second World War were from US ships. The US crews measured the temperature of the water before it was used to cool the ship's engines. When the war was over, British ships resumed their own measurements, but unlike the Americans they measured the temperature of water collected with ordinary buckets. Wind blowing past the buckets as they were hauled on board slightly cooled the water samples. The 1945 temperature drop is nothing other than the result of the sudden but uncorrected change from warm US measurements to cooler UK measurements, the team found.

Whaaa...?

The article (by Quirin Schiermeier) continues:

That’s a rather trivial explanation for a long-standing conundrum, so why has it taken so long to find out? Because identifying the glitch was less simple than it might appear, says David Thompson of Colorado State University in Fort Collins. The now-digitized logbooks of neither US nor British ships contain any information on how the sea surface temperature measurements were taken, he says. Only when consulting maritime historians did it occur to him where to search for the source of the faintly suspected bias. Our news story here has more.

Scientists can now correct for the overlooked discontinuity, which will alter the character of mid-twentieth century temperature variability. In a News and Views article here (subscription required) Chris Forest and Richard Reynolds lay out why this will not affect conclusions about an overall 20th century warming trend.

And there's more:

But it may not be the last uncorrected instrument bias in the record. The increasing number of measurements from automated buoys, which in the 1970s began to replace ship-based measurements, has potentially led to an underestimation of recent sea surface temperature warming.
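The mechanism is easy to mimic in a toy simulation (all numbers below are hypothetical, chosen only to echo the reported 0.3 °C step; this is not the Nature paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(4)
years = np.arange(1935, 1956)

# True sea surface temperature anomaly: a gentle warming trend, no 1945 drop.
true_sst = 0.01 * (years - 1935)

# Hypothetical measurement biases: engine-intake readings (used through the
# war) run warm, bucket readings (resumed afterward) run slightly cool.
bias = np.where(years < 1945, 0.25, -0.05)
measured = true_sst + bias + rng.normal(0, 0.02, len(years))

drop = measured[years == 1945][0] - measured[years == 1944][0]
print("apparent 1944 -> 1945 change: %.2f C" % drop)
```

The measured series shows a sharp step down at 1945 even though the underlying series has nothing but a smooth trend: an uncorrected change of instrument masquerading as climate.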

## May 23, 2008

### Quarterbacks and psychometrics

Eric Loken writes,

Criteria Corp is a company doing employee testing (basically psychometrics meets on-demand assessment). We're also going to blog on various issues relating to psychometrics and analyses of testing data. We're starting slowly on the blog front, but a few days ago we did one on employment tests for the NFL. A few scholars have argued that the NFL's use of the Wonderlic (a cognitive aptitude measure) is silly as it shows no connection to performance. But we showed that for quarterbacks, once you condition on some minimal amount of play, the correlation between aptitude and performance was as high as r = .5, which is quite strong. It's the common case of regression gone bad when people don't recognize that the predictor has a complex relationship to the outcome. There are many reasons why a quarterback doesn't play much; so at the low end of the outcome, the prediction is poor and the variance widely dispersed. But there are fewer reasons for success, and if the predictor is one of them, then it will show a better association at the high end.
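Eric's selection story is easy to mimic in a toy simulation (all numbers hypothetical; the point is only the qualitative pattern):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5000

aptitude = rng.normal(0, 1, n)
quality = 0.5 * aptitude + rng.normal(0, 1, n)  # performance depends partly on aptitude
# Many reasons unrelated to aptitude (injuries, depth chart, coaching
# decisions) keep most players off the field, piling them up at the low
# end of the outcome regardless of how smart they are.
benched = rng.random(n) < 0.7
yards = np.where(benched, rng.normal(0, 0.1, n), quality)

r_all = np.corrcoef(aptitude, yards)[0, 1]
r_played = np.corrcoef(aptitude[~benched], yards[~benched])[0, 1]
print("correlation in full sample %.2f; conditioned on playing %.2f" % (r_all, r_played))
```

In the full sample the predictor looks weak, because the low end of the outcome is dominated by reasons that have nothing to do with aptitude; condition on actually playing and the association strengthens considerably.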

Here's their blog, and here's Eric's football graph:

P.S. The graph would look better with the following simple fixes:
1. Have the y-axis start at 0. "-2000.00" passing yards is just silly.
2. Label the y-axis 0, 5000, 10000. "10000.00" is just silly. Who counts hundredths of yards?
3. Label the x-axis at 10, 20, 30, 40. Again, who cares about "10.00"?
I've complained about R defaults, but the defaults on whatever program created the above plot are much worse! (I do like the color scheme, though. Setting the graph off in gray is a nice touch.)

### A question of infinite dimensions

Constantine Frangakis, Ming An, and Spyridon Kotsovilis write:

Problem: suppose we conduct a study of known design (e.g. completely random sample) to measure *just a scalar* (say income, gene expression example from Rafael Irizarry), and suppose we get full response. Question: what data do we actually observe? Answer: we observe an infinite dimensional variable, which can carry extra information about how we analyze the scalar (say to estimate the population mean).

Logic:

1. Suppose we believe that if we had applied the same measurement device on all the population, then we would have some non-response. That would then mean that in the actual sample we got, the mere fact that we observed *all the data* is actual information and means that we got a non-representative sample of the population (just from the responders).

2. If we believe that (1) can be true, then we should worry. Reversely, if we do not worry, it implies we believe (1) is false. But there is no measurement device that is a priori guaranteed to work for all units, so we must worry.

3. The key issue now is that we usually think that, by incorporating the indicator of observation in a new column in the data, we believe we have fully described what we observed. But I suggest we have not. This is because we can iterate the logic of (1) now on the "new data": the fact ={that we observed that we had full responses} is also a nontrivial observation, as long as it is measured with a device that can sometimes be fallible. But when we iterate this logic we conclude that we actually observe an infinite sequence of variables.

This is very similar to Gödel's incompleteness argument, applied to statistics, if we treat a measurement device as a Turing machine. Its practical implication is that it is extremely important to understand the *variation in how* exactly each and every measurement was made, because that variation is extra information **even if (and not only if) we observe all measurements!**

I didn't really follow, so I asked Constantine to clarify. He wrote:

Here is an example of the first level.

1. Setting: suppose we are studying the income Y of a city's population, and in truth Y follows a log-normal distribution and we know that. We are to conduct measurements on units, with a measurement device (e.g., an interviewer) that can *possibly* give no response (if it gives a response, we assume it is true).

2. Data: we now conduct a simple random sample, and with the measurement *device* we use, we get 100% response in the sample. Also, say with the data we get an MLE of the median of pr(Y) of $54K, and an MLE of SD(log(income)) (among the full-response sample) of 0.43, or MLE(SD(income)) = $27K.

Question: Should we worry about non-response even if we got full response?
Answer: The answer is YES, because we would get a DIFFERENT RESULT than $54K under some consideration of non-response, EVEN IF WE GOT FULL RESPONSE.

3. Example: What can that consideration be, and what answer could we get? Suppose that if the *same measurement device had been applied to all the population*, we would have gotten 20% non-response (R=0). Moreover, suppose that this non-response depends on the outcome in the sense that the ratio of the median income among responders versus non-responders is 0.7, which occurs because all incomes < median respond, but a random 60% of the incomes > median respond. Suppose also we know this--this gives a model for pr(R|Y). What is the MLE now, with the same log-normal model but also the pr(R|Y) model? It is $60K. It is significant to note that by the above MLEs, I mean no randomness, in the following sense: under the outcome model pr(Y) in part 1 and the pr(R|Y) model in part 3, we have:
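Constantine's two numbers are internally consistent, as a quick check shows (using his stated response model and SD(log income) = 0.43; everyone below the population median responds and a random 60% above it do, so the responders' median sits where 0.5/(0.5 + 0.3) of responder mass accumulates, i.e., at the population's 40th percentile):

```python
import numpy as np
from scipy.stats import norm

sigma = 0.43          # SD of log income in the example
true_median = 60.0    # $60K, the nonresponse-adjusted MLE in the example

# Responders' median = the population quantile where F(v) = 0.4,
# since responder mass below the true median is 0.5 out of a total 0.8.
responders_median = true_median * np.exp(sigma * norm.ppf(0.4))
print("responders' median: $%.0fK" % responders_median)
```

This reproduces the $54K figure: a sample that happens to show full response can still, under a plausible model for the device, point to a different answer than the naive one.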

## April 14, 2008

### p-values blah blah blah

Karl Ove Hufthammer points me to this paper by Raymond Hubbard and R. Murray Lindsay, "Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing."

I agree that p-values are a problem, but not quite for the same reasons as Hubbard and Lindsay do. I was thinking about this a couple days ago when talking with Jeronimo about FMRI experiments and other sorts of elaborate ways of making scientific connections. I have a skepticism about such studies that I think many scientists share: the idea that a questionable idea can suddenly become scientific by being thrown in the same room with gene sequencing, MRIs, power-law fits, or other high-tech gimmicks. I'm not completely skeptical--after all, I did my Ph.D. thesis on medical imaging--but I do have this generalized discomfort with these approaches.

Consider, for example, the notorious implicit association test, famous for being able to "assess your conscious and unconscious preferences" and tell if you're a racist. Or consider the notorious "baby-faced politicians lose" study.

From a statistical point of view, I think the problem is with the idea that science is all about rejecting the null hypothesis. This is what researchers in psychology learn, and I think it can hinder scientific understanding. In the "implicit association test" scenario, the null hypothesis is that people perceive blacks and whites identically; differences from the null hypothesis can be interpreted as racial bias. The problem, though, is that the null hypothesis can be wrong in so many different ways.

To return to the main subject, an alarm went off in my head when I read the following sentence in the abstract to Hubbard and Lindsay's paper: "p values exaggerate the evidence against [the null hypothesis]." We're only on page 1 (actually, page 69 of the journal, but you get the idea) and already I'm upset. In just about any problem I've studied, the null hypothesis is false; we already know that! They describe various authoritative-seeming Bayesian articles from the past several decades, but all of them seem to be hung up on this "null hypothesis" idea. For example, they include the notorious Jeffreys (1939) quote: "What the use of P implies … is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure." OK, sure, but I don't believe that the hypothesis "may be true." The question is whether the data are informative enough to reject the model.

Any friend of the secret weapon is a friend of mine

OK, now the positive part. I agree with just about all the substance of Hubbard and Lindsay's recommendations and follow them in practice: interval estimates, not hypothesis tests; and comparing intervals of replications (the "secret weapon"). More generally, I applaud and agree with their effort to place repeated studies in a larger context; ultimately, I think this leads to multilevel modeling (also called meta-analysis in the medical literature).
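For what it's worth, here's a minimal sketch of the secret weapon (simulated yearly surveys with a common true slope; in a real analysis each fit would of course come from that year's actual data, and the estimates would be plotted side by side with their error bars):

```python
import numpy as np

rng = np.random.default_rng(7)

# Fit the same regression separately to each replication (here, hypothetical
# yearly surveys) and report estimate +/- 1 standard error for each, instead
# of running one pooled hypothesis test.
estimates, std_errs = [], []
for year in range(2000, 2008):
    n = 400
    x = rng.normal(0, 1, n)
    y = 0.3 * x + rng.normal(0, 1, n)   # same true slope (0.3) each year
    b, a = np.polyfit(x, y, 1)
    resid_sd = np.std(y - (a + b * x))
    se = resid_sd / (np.sqrt(n) * np.std(x))  # standard error of the slope
    estimates.append(b)
    std_errs.append(se)
    print("%d: slope %.2f +/- %.2f" % (year, b, se))
```

Seeing eight intervals in a row conveys both the consistency of the estimate and its uncertainty at a glance, which is exactly the kind of larger context that a single p-value hides.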

P.S. This is minor, but I'm vaguely offended by referring to Ronald Fisher as "Sir" Ronald Fisher in an American journal. We don't have titles here! I guess it's better than calling him Lord Fisher-upon-Tyne or whatever.

P.P.S. I don't know if I agree that "An ounce of replication is worth a ton of inferential statistics." More data are fine, but sometimes it's worth putting in a little effort to analyze what you have. Or, to put it more constructively, the best inferential tools are those that allow you to analyze more data that have already been collected.

## April 13, 2008

### R.I.P. Minghui Yu

Rachel wrote this note about our Ph.D. student who unexpectedly and tragically died recently.

## April 7, 2008

### Comment on "What are you going to do with your Ph.D. in Statistics?" conference

The conference consisted of two panels discussing various aspects of the working life of statisticians. The statisticians on the first panel were all currently working in academia, while the statisticians on the second panel were all working in industry.

If we want to pursue a career in academia, research should be something we enjoy. Panelists mentioned that teaching, while often a burden, should not be something that makes our lives miserable. As Eric Bradlow said, the remuneration for being an academic is not enough compensation for hating teaching and being miserable.

The panelists agreed it was important to work in a department where the people valued and respected the research that you did. The panelists' research ideas came from a number of different sources, including collaborators, seminars, and conferences (and they encouraged us to attend the latter two).

The panelists' discussions reminded me that perhaps the most important aspect of choosing a potential academic department is finding a good fit. An important part of working life (I think) is being valued and finding collaborators, not only in the department you work in, but also in other departments around campus.

Industry Panel

Communication is a big part of working in industry. Although teaching students is not usually required, consulting with collaborators and colleagues is. There is not as much flexibility in industry as in academia (research must be in the company's interests); however, the compensation is usually much better.

All industry panelists agreed that statisticians must be excited by data. Many of the big companies (such as Google, AT&T, etc.) have an abundance of data. In order to thrive in these environments, data should challenge and excite you.

The reception after the conference was a good chance to meet and talk with the panelists and ask questions about jobs in both academia and industry. It was a good time for me (as a postdoc) to evaluate what direction I hope to take my statistics career. Congratulations should go to the Columbia post-graduate statistics students for organizing such a successful conference.

## April 4, 2008

### A dismal theorem?

James Annan writes,

I wonder if you would consider commenting on Marty Weitzman's "Dismal Theorem", which purports to show that all estimates of what he calls a "scaling parameter" (climate sensitivity is one example) must be long-tailed, in the sense of having a pdf that decays as an inverse polynomial and not faster. The conclusion he draws is that using a standard risk-averse loss function gives an infinite expected loss, and always will for any amount of observational evidence.

I looked up Weitzman and found this paper, "On Modeling and Interpreting the Economics of Catastrophic Climate Change," which discusses his "dismal theorem." I couldn't bring myself to put in the effort to understand exactly what he was saying, but I caught something about posterior distributions having fat tails. That's true--this is a point made in many Bayesian statistics texts, including ours (chapter 3) and many that came before us (for example, Box and Tiao). With any finite sample, it's hard to rule out the hypothesis of a huge underlying variance. (Fundamentally, the reason is that, if the underlying distribution truly does have fat tails, it's possible for them to be hidden in any reasonable sample. It's that Black Swan thing all over again.) I think that Weitzman is making some deeper technical point, and I'm sure I'm disappointing Annan by not having more to say on this . . .
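To illustrate the textbook fat-tails point (just the chapter-3 fact that, with unknown mean and variance, the posterior predictive distribution for a normal observation is a t distribution, whose density decays like an inverse polynomial rather than exponentially; this is not Weitzman's theorem itself):

```python
from scipy import stats

# Compare tail probabilities of a normal distribution and a t distribution
# with 3 degrees of freedom (the kind of fat-tailed posterior predictive
# distribution that arises after only a few observations).
for k in [3, 5, 8]:
    print("P(|X| > %d):  normal %.1e   t with 3 df %.1e"
          % (k, 2 * stats.norm.sf(k), 2 * stats.t.sf(k, df=3)))
```

Far out in the tail, the t puts many orders of magnitude more probability on extreme events than the normal does, which is why expected losses under strongly risk-averse loss functions can blow up.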

More

Searching on the web, I found this article by William Nordhaus criticizing Weitzman's reasoning. Unfortunately, Nordhaus's article just left me more confused: he kept talking about a utility function of the form U(c) = (1-c^(1-a))/(1-a), which doesn't seem to be relevant to the climate change example. Or to any other example, for that matter. Attempting to model risk aversion with a utility function--that's so 1950s, dude! It's all about loss aversion and uncertainty aversion nowadays. This isn't Nordhaus's fault--he seems to be working off of Weitzman's model--but it's hard for me to know how to evaluate any of this stuff if it's based on this sort of model.

Also, I don't buy Nordhaus's argument on page 4 that you can deduce our implicit value of non-extinction by looking at how much the U.S. government spends on avoiding asteroid impacts. This reminds me of the sorts of comparisons people do, things like total spending on cosmetics or sports betting compared to cancer research. I already know that we spend money on short-term priorities--I wouldn't use that to make broad claims about the "negative utility of extinction."

Back to Weitzman's paper

I find abbreviations such as DT (for the "dismal theorem") and GHG (for greenhouse gases) to be distracting. I don't know if this is fair of me. I don't mind U.S. or FBI or EPA or other common abbreviations, but I find it really annoying to read a phrase such as, "Phrased differently, is DT an economics version of an impossibility theorem which signifies that there are fat-tailed situations where economic analysis is up against a strong constraint on the ability of any quantitative analysis to inform us without committing to a VSL-like parameter and an empirical CBA framework that is based upon some explicit numerical estimates of the miniscule [sic] probabilities of all levels of catastrophic impacts up to absolute disaster?" The concepts are tricky enough as it is without me having to try to flip back and find out what is meant by DT, VSL, and CBA. But, if Weitzman were to spell out all the words, would the other economists think he's some sort of rube? I just don't know the rules here.

On page 37, near the end of the paper, Weitzman writes, "A so-called Integrated Assessment Model (hereafter IAM) . . ." I was reminded of Raymond Chandler's advice for writers: "When in doubt, have a man come through the door with a gun in his hand." Or, in this case, an abbreviation. Never let your readers relax, that's my motto.

I'm not sure how to think about the decision analysis questions. For example, Weitzman writes, "Should we have foregone the industrial revolution because of the GHGs it generated?" But I don't think that foregoing the industrial revolution was ever a live option.

P.S. I have to admit, "miniscule" sounds right. It begins with "mini," after all.

## April 3, 2008

### More data beats better algorithms

Boris sent along this. I can't comment on the examples used there, but I agree with the general point that it's good to use more data. To get back to algorithms, what I'd say is that one important feature of a good algorithm is that it allows you to use more data. Traditional statistical methods based on independent, identically distributed observations can have difficulty incorporating diverse data, whereas more modern methods have more ways in which data can be input.

## March 30, 2008

### Some thoughts on connections between biostatistics and statistics, prompted by an announcement for a meeting that I won't be able to attend

This looks interesting. Yi Li writes of a panel discussion at the Harvard biostatistics department. My own thoughts are below; first here's Li's description. There's some good stuff:

1) Should Biostatistics continue to be a separate discipline from Statistics? Should Departments of Biostatistics and Statistics merge? In other words, are we seeing a convergence of biostatistics and statistics? Biostatisticians develop statistical methodology, statisticians are getting involved in biological/clinical data. Even at Harvard we are considering moving closer to Cambridge, and some say that the move might lead to the eventual merge of the biostat and stat departments. What are your thoughts on the division between the disciplines of stat and biostat in general, whether it is widening or closing, and how it may affect our careers and career choices, especially for starting faculty, postdocs, students?

2) How should stats/biostats as a field respond to the increasing development of statistical and related methods by non-statisticians, in particular computer scientists?

It strikes me [Li] to some extent that statisticians get involved in applied problems in a rather arbitrary fashion based on haphazard personal connections and whether the statistician's personal methodological research fits with the applied problem. Are statisticians sufficiently involved in the most important scientific problems in the world today (at least of those that could make use of statistical methods) and if not, is there some mechanism that could be developed by which we as a profession can make our expertise available to the scientists tackling those problems?

3) How do we close the gap between the sophistication of methods developed in our field and the simplicity of methods actually used by many (if not most) practitioners? Some scientific communities use sophisticated statistical tools and are up to date with the newest developments. Examples are clinical trials, brain imaging, genomics. Other communities routinely use the simplest statistical tools, such as single two-sample tests. Examples are experimental biology and chemistry, cancer imaging, and many other fields outside statistics. How do we explain this gap and what can we do to close it?

4) What makes a statistical methodology successful? Some modern statistical methods have gotten to be very well known in the scientific world, even though they are not usually part of any basic statistics course for non-statisticians. The best examples might be the bootstrap, regression trees, wavelet thresholding. Even Kaplan-Meier and Cox model are not in elementary stat books! But most statistical methods, even when they are good enough to be published in a good statistical journal, might get referenced a few times within the statistics literature and then forgotten, never making it outside the statistics community. What makes a statistical methodology gain widespread popularity?

5) Where should computational biology and bioinformatics sit in relation to biostatistics, both at Harvard and elsewhere? Should these subjects be taught as part of cross-department programs of which biostat is a part, or should they be housed within an expanded biostat department?

6) Terry Speed recently published an IMS column entitled "statistics without probability". He stated that "... the most prominent features of the data were systematic. Roughly speaking, we saw in the data very pronounced effects associated with almost every feature of the microarray assay, clear spatial, temporal, reagent and other effects, all dwarfing the random variation. These were what demanded our immediate attention, not the choice of two-sample test statistic or the multiple testing problem. What did we do? We simply did things that seemed sensible, to remove, or at least greatly reduce, the impact of these systematic effects, and we devised methods for telling whether or not our actions helped, none of it based on probabilities. In the technology-driven world that I now inhabit, I have seen this pattern on many more occasions since then. Briefly stated, the exploratory phase here is far more important than the confirmatory... How do we develop the judgement, devise the procedures, and pass on this experience? I don't see answers in my books on probability-based statistics."

My thoughts:

1) I think there are advantages to having two departments but they should certainly coordinate with each other. Here at Columbia, people are hired in one department or the other and nobody in the other department even hears about it, and we also have a biostatistics group in the psychiatry department. The trouble is, everybody's so busy. One idea is to have each department have a person whose job ("committee assignment") is to keep track of what's happening in the sister department and then report back to the others. There are just so many opportunities for collaboration and shared work with students and faculty, it's a shame to not take advantage.

2) I'm not supposed to go around saying that computer scientists are smarter than statisticians, but I think it's ok for me to say that computer science is great, and I welcome that field's involvement in statistical problems. I don't know that we have to "respond" in any way except by cross-listing courses and updating the curriculum every now and then.

Li makes an excellent point about statisticians getting involved in problems "in a rather arbitrary fashion based on haphazard personal connections." One way to do better, I think, is to post all the collaborative projects in an easy-to-hash format so that people can get involved in projects that best suit them. We're starting that here with our Applied Statistics Center but we have a ways to go, even at Columbia. At the very least, I recommend that other universities follow our path and start listing things.

3) There's a need for more research into simple methods. Simple doesn't have to mean stupid. Beyond that, I'm in favor of "closing the gap" one application at a time. But maybe that's not the most efficient way, given that millions of scientific papers are published each year.

4) I think applied Bayesian methods are "very well known in the scientific world, even though they are not usually part of any basic statistics course for non-statisticians." I'm surprised Li didn't mention Bayesian methods in the list: this suggests that the first step is for statisticians and biostatisticians to become aware of the important methods in our own fields!

To answer the question more generally, I think for a method to gain widespread popularity it needs to give people answers that they want, and ideally be easy to use and theoretically justified. One reason Xiao-Li, Hal, and I wrote our paper on posterior predictive checking was to place this very useful method in a theoretical framework with theta, y, and y.rep.

5) I have no opinion on this one.

6) It's funny that Terry Speed said this because, when I used to teach at Berkeley, I heard lots of people in the statistics department say that sort of thing. But at the same time they would teach extremely theoretical courses and discourage the Ph.D. students from learning about applied methods (outside of a few specific statistics-heavy fields such as biology). I don't think they were aware of statistical methods that bridge between science and theory. The Bayesian approach is one way (at least, we try to do this in our book), but lots of non-Bayesian methods focus on systematic effects also. Consider all the work in economics on program evaluation and causal inference. In our recent book on regression, Jennifer and I emphasize the importance of the deterministic part of the model. I can't say that we yet have a method to "develop the judgement, devise the procedures, and pass on this experience"--but we've definitely advanced beyond the 1950s-style "choice of two-sample test statistic or the multiple testing problem." So I don't think things are as bad as Terry thinks, at least not in social science!

When, where, who

The panel discussion will be on Thursday, 3 April, from 2:00 to 3:30 in Kresge 213 (at the Harvard School of Public Health in Boston), and it will be led by Brad Efron, Colin Begg, and David Harrington. It sounds like fun (it reminds me a bit of our symposium on statistical consulting), but I don't know how they expect to cover all of that in only an hour and a half!

## March 26, 2008

### Crime data bonanza!!!

Mike Maltz writes,

A New Data Set Available through Ohio State University’s Criminal Justice Research Center

So you think you know how to analyze time series! Well, how would you like to test your mettle on over 400,000 time series, each with up to 540 data points? The time series in question are monthly data from 1960-2004, for over 17,000 police departments, for seven crime types (murder, rape, robbery, aggravated assault, burglary, larceny, vehicle theft), as well as their sum (the so-called Crime Index), and an additional 19 subcategories – e.g., robbery with a gun, knife, personal weapons (hands, feet, etc.), or other; attempted rape; auto, truck or bus, or other vehicle theft. Or you can just view the data in different cities over time and see whether it rises and falls with various tides (unemployment, immigration, poverty, age or ethnicity distribution, etc., whatever your pet theory is). I [Maltz] have put all of the files and a plotting utility (so you can see each agency’s crime history) in a zipped file. Download it from http://sociology.osu.edu/mdm/UCR1960-2004.zip.

The data consist of monthly counts of these crimes reported by police departments throughout the country to the FBI as part of its Uniform Crime Reporting (UCR) Program. Since reporting to the UCR Program is entirely voluntary, some agencies are less than diligent in doing so, but for the most part they comply. However, major gaps still remain; for a discussion of these gaps, see “Bridging Gaps in Police Crime Data,” published by the US Bureau of Justice Statistics. Under a series of grants from the US National Institute of Justice, Harry Weiss, a graduate student here at OSU, and I cleaned the data as best we could.

Some of the gaps are just inadvertent (or, as statisticians would say, MCAR, missing completely at random). These can usually be filled in using relatively simple algorithms. The more significant problems, however, are those that are not gaps but “underestimates,” as when the City of Atlanta was bidding (successfully) for the Olympics and lowered its crime statistics in a more, shall we say, “hands-on” way (see http://www.cnn.com/2004/US/South/02/20/atlanta.police.audit.ap/index.html); New York, Philadelphia and Boca Raton also have had their own reporting scandals (http://query.nytimes.com/gst/fullpage.html?res=9F06E2D91F38F930A3575BC0A96E958260); and according to the creator of HBO’s “The Wire,” Baltimore is even better at it (http://www.huffingtonpost.com/david-simon/the-wires-final-s_b_91926.html):

"In Baltimore, where over the last twenty years Times Mirror and the Tribune Company have combined to reduce the newsroom by forty percent, all of the above stories pretty much happened. A mayor was elected governor while his police commanders made aggravated assaults and robberies disappear.

"... It would not have been easy for a veteran police reporter to pull all the police reports in the Southwestern District and find out just how robberies fell so dramatically, to track each individual report through staff review and find out how many were unfounded and for what reason, or to develop a stationhouse source who could tell you about how many reports went unwritten on the major's orders, or even further -- to talk to people in that district who tried to report armed robberies and instead found themselves threatened with warrant checks or accused of drug involvement or otherwise intimidated into dropping the matter."

Not all cities manipulate crime statistics. Even so, you might want to get rid of all of your preconceptions of how to deal with these data. It’s for that reason that a plotting utility is the centerpiece of the data set. You have to look at the data, not just throw it into the computerized maw and let Stata or SAS or SPSS give you some p values. By visually inspecting the data, you might see what the effect of a new policy, or police chief, or law has on crime. You might compare different cities with different characteristics. Whatever you do, it’s a relatively new data set that hasn’t yet been used much at all, so you’re getting in on the ground floor.
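Maltz's point about "relatively simple algorithms" for the truly random (MCAR) gaps can be made concrete: for an isolated hole in a monthly count series, plain linear interpolation between the adjacent observed months is a reasonable first pass. Here is a minimal sketch; the counts are made up, and real UCR cleaning is considerably more involved:

```python
def fill_gaps(series):
    """Linearly interpolate interior None gaps in a monthly count series."""
    out = list(series)
    i = 0
    while i < len(out):
        if out[i] is None:
            j = i
            while j < len(out) and out[j] is None:
                j += 1
            if 0 < i and j < len(out):          # interior gap: interpolate
                lo, hi = out[i - 1], out[j]
                for k in range(i, j):
                    out[k] = lo + (hi - lo) * (k - i + 1) / (j - i + 1)
            i = j                                # leading/trailing gaps stay None
        else:
            i += 1
    return out

print(fill_gaps([10, None, None, 16, 20]))   # prints [10, 12.0, 14.0, 16, 20]
```

Of course this only makes sense for gaps that really are random; applied to the deliberately suppressed counts Maltz describes, it would quietly paper over exactly the problem you need to see.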

### Data

Aleks writes:

From here, see this. It could be used as a foundation on which to build additional analysis functionality.

Here are some examples of interactive statistics on the web. But few things compare to the venerable b-course.

## March 25, 2008

### Incredible Illinois, or fun with percentages that can be larger than 100

Tyler Cowen links to a calculation by Tom Elia that "of Sen. Obama's 711,000 popular-vote lead, 650,000 -- or more than 90% of the total margin -- comes from Sen. Obama's home state of Illinois, with 429,000 of that lead coming from his home base of Cook County." This is interesting, but it's more a comment on how close the (meaningless) total popular-vote count is than a reflection of something funny going on in Cook County.

Put it another way. Suppose Obama's total margin were only 111,000 votes instead of 711,000. Then his 650,000-vote margin in Illinois would represent a whoppin 586% of the total margin, and Cook County would represent 386% of the total margin! But wait, how can a part be 386% of the whole??

What I'm sayin is, the "90%" and "60%" figures are misleading because, when written as "a percent of the total margin," it's natural to quickly envision them as percentages that are bounded by 100%. There is a total margin of victory that the individual state margins sum to, but some margins are positive and some are negative. If the total happens to be near zero, then the individual pieces can appear to be large fractions of the total, even possibly over 100%.

I'm not saying that Tom Elia made any mistakes, just that, in general, ratios can be tricky when the denominator is the sum of positive and negative parts. In this particular case, the margins were large but not quite over 100%, which somehow gives the comparison more punch than it deserves, I think.
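To make the arithmetic concrete, here is the calculation behind both the actual and the hypothetical figures, using the margins quoted above:

```python
il_margin, cook_margin = 650_000, 429_000   # Obama's margins in IL and Cook County

for total_margin in (711_000, 111_000):     # actual total, then the hypothetical
    print(f"total margin {total_margin:,}: "
          f"IL share = {il_margin / total_margin:.0%}, "
          f"Cook share = {cook_margin / total_margin:.0%}")
```

With the 711,000 total, the shares look like ordinary percentages (91% and 60%); shrink the total to 111,000 and the very same margins blow up to 586% and 386%, which is the whole point.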

P.S. Elia's comment that "Sen. Obama's 429,000-vote margin in Cook County alone is larger than the winning margin of either candidate in any state" is more directly interpretable because it's a difference, not a ratio. Obama won Illinois by a 32-percentage-point landslide. (By comparison, Clinton won New York with a 17-point margin and California [typo fixed] with a 9-point margin.)

### Peeking behind the curtain, or, What's (not) the matter with Portugal?

This is pretty embarrassing, but I think it's better to tell all, if for no other reason than to make others aware of the challenges of working with data . . .

OK, so we're reanalyzing some data from the Comparative Study of Electoral Systems, basically replicating some findings of Huber and Stanig but including additional countries and with some slightly different coding of political parties.

We have two key graphs.

First, for each country, we compute the difference between rich and poor in voting for the conservative party or parties. This graph (not shown here) reveals that the rich-poor gap in the United States is larger than most of the other (mostly European) countries in the sample.

For our second graph, we fit a model predicting conservative vote given income and religious attendance. For each country, the three lines show estimated conservative vote (compared to the national average) as a function of individual income, among people who attend religious services frequently (solid line), occasionally (light line), and never (dashed line).

The countries are ordered by increasing per-capita GDP. On the bottom line is the United States, with its familiar pattern of religious attendance mattering more for the rich than the poor. As you can see, religious people vote for conservative parties in many countries--Americans are far from unique in that way.

Wha...?

But whassup with Portugal? It's the only country where the religious vote less conservatively than the secular--the lines go in the wrong order! We asked some experts what was going on, and we were told that the split between the center-left Socialist Party and the center-right Social Democratic Party does not track the direction or degree of religiosity, and that party competition in Portugal is basically non-ideological.

But, then, why the big difference between religious and secular in our data? Well, we were also told that the data for Portugal are probably crappy. So we figured we'd just remove Portugal from our graph and add a note explaining why we excluded it, based on concerns about the data and some comments about the party structure there. But then we looked at the data again . . .

It turned out the problem was in the name of one party (the Popular Party)--it had an extra comma in its name and when we read in the data, we mistakenly counted it as a different party. Whoops! (Or, as Mezzanine-era Nicholson Baker would say, Whoop!)

Here's the corrected figure:

Yeah, yeah, I know, we better check all the party names carefully now.

P.S. I guess we could make the case that we were being Bayesian, in checking the results that contradicted our prior distribution. In this case, the prior wasn't really that religion always is associated with conservative voting, but rather that the countries followed some smooth distribution. Actually, when I first noticed the problem with Portugal, I assumed the data were ok and that there was some Portugal-specific story, perhaps a left-wing church-based party. (Yes, I'm sure that comment reveals my ignorance of Portugal, but that's the point here.) I was looking for the magic x-variable that explained the unexplained variation. In this case, the x-factor was a coding error...

P.P.S. More here.

## March 13, 2008

### The "all else equal fallacy"

I like John Tierney's New York Times column (for example, here), but sometimes he goes over the top in counterintuitiveness.

Here, for example, Tierney writes about someone who says, "in some circumstances it’s better to drive than to walk. . . . If you walk 1.5 miles, Mr. Goodall calculates, and replace those calories by drinking about a cup of milk, the greenhouse emissions connected with that milk (like methane from the dairy farm and carbon dioxide from the delivery truck) are just about equal to the emissions from a typical car making the same trip. . . . Michael Bluejay, who’s done some number-crunching at BicycleUniverse.info, says that walking is actually worse than driving if you replace the calories with food in the standard American diet and if the car gets more than 24 miles per gallon. . . ."

This is interesting to me because these guys are making a classic statistical error, I think, which is to assume that all else is held constant. This is the error that also leads people to misinterpret regression coefficients causally. (See chapters 9 and 10 of our book for discussion of this point.) In this case, the error is to assume that the walker and the driver will be making the same trip. In general, the driver will take longer trips--that's one of the reasons for having a car, that you can easily take longer trips. Anyway, my point is not to get into a long discussion of transportation pricing, just to point out that this seemingly natural calculation is inappropriate because of its mistaken assumption that you can realistically change one predictor while leaving all the others constant.

As we like to say, it's a great classroom example.

P.S. More here (also see discussion in the comments below).

## March 12, 2008

### Specifying a distribution from the mean and quantiles, or, just in case you thought this blog was nothing but square footage and Starbucks

David Kane writes,

What is the best way to simulate from a distribution for which you know only the 5th, 50th and 95th percentile along with the mean? In particular, I want to estimate the value for a different percentile (usually around the 40th) and associated confidence interval. I assume that the distribution is "smooth" and unimodal. For background, see here.
If you don't want to read all that, the short version is that I want to see if socioeconomic diversity has increased at Williams College over the last decade. (You may be interested in the same thing about Columbia.) It isn't easy to measure "inequality," of course, so for starters I just want to estimate what has happened at the 20th percentile. Williams has about 2000 students. So, I want to estimate the family income of the 400th poorest family.

Williams only has data on students who request financial aid. But that covers almost all the families in the bottom 1/3 of the distribution. Williams, like most colleges, does not want to give out much data. However, recent debate in Congress has resulted in Williams and other rich schools publishing some relevant data. Unfortunately, it isn't exactly what I want, hence my question.

To be concrete, Williams tells us, for each year since 1998, how many students are on aid and what the mean and the 5th, 50th and 95th percentiles of family income are for those students. But the number of students on aid has increased so the location of the 40th percentile for the entire student body (not just those on aid) is in a different location in the aided students distribution each year.

If you were given only two quantiles, I'd recommend that you just pick a reasonable 2-parameter distributional family, solve for the two parameters, and go from there (and do a sensitivity analysis considering other families). With 3 quantiles to fit, I'd say to take a 3-parameter skewed family (although I'm not quite sure what I'd actually use). But 3 quantiles and a mean . . . fitting a 4-parameter family seems silly, and fitting a 2-parameter or 3-parameter family using least squares doesn't sound quite right either.

The right thing to do, I think, is to have some model over distribution space, probably centered on some reasonable three-parameter family but with error. I'm not quite sure the best way to do this; maybe work with the cdf and transform the uniform. I wouldn't be surprised if there's a reasonable solution out there; it seems like a fun problem to work on.

Or, if I wanted an answer and was in a hurry, I'd try various curves that go thru the 5th, 50th, 95th and then play around until they match the mean correctly.
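The two-quantile version of this has a closed form if you take the lognormal as the "reasonable 2-parameter distributional family": the median pins down one parameter and any second quantile pins down the other, and the reported mean then serves as the sensitivity check. A sketch with made-up income quantiles (not the actual Williams numbers):

```python
from math import exp, log
from statistics import NormalDist

# Made-up family-income quantiles (NOT the actual Williams data)
q50, q95 = 60_000, 180_000

# If income is lognormal, log-income is normal with median log(q50),
# so the two parameters fall out of the 50th and 95th percentiles:
z95 = NormalDist().inv_cdf(0.95)            # about 1.645
mu = log(q50)
sigma = (log(q95) - mu) / z95

# Checks against the other reported numbers: implied 5th percentile and mean
implied_q05 = exp(mu + NormalDist().inv_cdf(0.05) * sigma)
implied_mean = exp(mu + sigma ** 2 / 2)

# The quantity Kane actually wants: an arbitrary percentile, e.g. the 40th
q40 = exp(mu + NormalDist().inv_cdf(0.40) * sigma)
print(implied_q05, implied_mean, q40)
```

If the implied 5th percentile and mean land far from the reported ones, that's evidence the lognormal is the wrong family--which is where the 3-parameter skewed families, or the model-over-distribution-space idea, would come in.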

## March 4, 2008

### Starbucks/Walmart update

Alex F. commented here about problems with our Starbucks and Walmart data. Elizabeth Kaplan, who collected the data for me, replied:

Yeah, Walmart was a bit of a pain to find the locations for, as you cannot search just by state on their website as you can for Starbucks. In order to find the locations I relied on the yellow-page results. Even though I looked through to eliminate double postings for Walmarts with the same address, after I looked into it again tonight, it appears the yellow pages dramatically overrepresented the number of Walmarts per state. I have attached the correct data. All of these numbers come from this website (http://www.walmartfacts.com/StateByState/), which I was unable to locate before.

As for the Starbucks data, that should be correct, as I got it straight from their website. The one thing is that they don't list all affiliate stores (that is, stores not owned and operated by the company). There is no reliable source of data on affiliate stores by state, and obviously the yellow pages are not a good source. So the data I sent to you just includes Starbucks-owned-and-operated stores.

Also for population I used the 2006 Census Bureau estimates.

This sort of thing happens all the time to me, so I certainly don't think Elizabeth should feel too bad about this. I'm just glad that Alex noticed and pointed out the problem. Anyway, here are the corrected maps:

and scatterplot:

And also, following Seth's suggestion, the scatterplot on the log scale:

And, following Kaiser's suggestion, a reparameterization showing people per store (rather than stores per million people):

## February 22, 2008

### Real statistics and folk statistics: modeling mental models

I was lucky to see most of the talk that Josh Tenenbaum gave in the psychology department a couple weeks ago. He was talking about some experiments that he, Charles Kemp, and others have been doing to model people's reasoning about connectedness of concepts. For example, they give people a bunch of questions about animals (is a robin more like a sparrow than a lion is like a tiger, etc.), and then they use this to construct an implicit tree structure of how people view animals. (The actual experiments were interesting and much more sophisticated than simply asking about analogies; I'm just trying to give the basic idea.) Here's a link to some of this work.

My quick thought was that Tenenbaum, Kemp, et al. were using real statistics to model people's "folk statistics" (by which I mean the mental structures that people use to model the world). I have a general sense that folk statistical models are more typically treelike or even lexicographical, whereas reality (for social phenomena) is more typically approximately linear and additive. (I'm thinking here of Robyn Dawes's classic paper on the robust beauty of additive models, and similar work on clinical vs. statistical prediction.) Anyway, the method is interesting. I wondered whether, in the talk, Tenenbaum might have been slightly blurring the distinction between normative and descriptive, in that people might actually think in terms of discrete models, but actual social phenomena might be better modeled by continuous models. So, in that sense, even if people are doing approximate Bayesian inference in their brains, it's not quite the Bayesian inference I would do, because people are working with a particular set of discrete, even lexicographic, models, which are not what I suspect are good descriptions of most of the phenomena I study (although they might work for problems such as classifying ostriches, robins, platypuses, etc.).

Near the end of his talk, Tenenbaum did give an example where the true underlying structure was Euclidean rather than tree-like (it was a series of questions about the similarity of U.S. cities), and, indeed, there he could better model people's responses using an underlying two-dimensional model (roughly but not exactly corresponding to the latitude-longitude positions of the cities) than a tree model, which didn't fit so well.

I sent Tenenbaum my above comment about real and folk statistics, and he replied:

I'd expect that for either the real world or the mind's representations of the world, some domains would be better modeled in a more discrete way and others in a more continuous way. In some cases those will match up - I talked about these correspondences towards the end of the talk, not sure if you were still there - while in other cases they might not. It would be interesting to think about both kinds of errors: domains which our best scientific understanding suggests are fundamentally continuous while the naive mind treats them as more discrete, and domains which our best scientific understanding suggests are discrete while the naive mind treats them as more continuous. I expect both situations exist.

Also, the "naive mind" is quite an idealization here. The kind of mental representation that someone adopts, and in particular whether it's more continuous or discrete, is likely to vary with expertise, culture, and other experiential factors.

I think the discrete/continuous distinction is a big one in statistics and not always recognized. Sometimes when people argue about Bayes/frequentist or parametric/nonparametric or whatever, I think the real issue is discrete/continuous. And I wouldn't be surprised if this is true in psychology (for example, in my sister's work on how children think about essentialism).

Tenenbaum replied to this with:

While the focus for most of my talk emphasized tree-structured representations, towards the end I talked about a broader perspective, looking at how people might use different forms of representations to make inferences about different domains. Even the trees have a continuous flavor to them, like phylogenetic trees in biology: edge length in the graph matters for how we define the prior over distributions of properties on objects.

On a less serious note . . .

This reminds me of all sorts of things from children's books, such as pictures of animals that include "chicken" and "bird" as separate and parallel categories, or stories in which talking cats and dogs go fishing and catch and eat real fish! (The most bizarre of all these, to me, are the Richard Scarry stories in which the sentient characters include a cat, a dog, and a worm, and they go fishing. My naive view of the "great chain of being" would put fish above worms, but I guess Scarry had a different view.)

## February 20, 2008

### Using simulation to do statistical theory

We were looking at some correlations--within each state, the correlations between income and different measures of political ideology--and we wanted to get some sense of sampling variability. I vaguely remembered that the sample correlation has a variance of approximately 1/n--or was that 0.5/n, I couldn't remember. So I did a quick simulation:


```r
> corrs <- rep(NA, 1000)
> for (i in 1:1000) corrs[i] <- cor(rnorm(100), rnorm(100))
> mean(corrs)
[1] -0.0021
> sd(corrs)
[1] 0.1
```


Yeah, 1/n, that's right: with n = 100, a variance of 1/n corresponds to a standard deviation of 0.1, which is just what the simulation shows. That worked well. It was quicker and more reliable than looking it up in a book.

## February 15, 2008

### Linear regression is not dead, and please don't call it OLS

Lee Sigelman writes,

In the latest issue of The Political Methodologist, James S. Krueger and Michael S. Lewis-Beck examine the current standing of the time-honored but oft-dismissed-as-passe ordinary least squares regression model in political science research. . . . Krueger and Lewis-Beck report that . . . The OLS regression model accounted for 31% of the statistical methods employed in these articles. . . . “Less sophisticated” statistical methods — those that would ordinarily be covered before OLS in a methods course — accounted for 21% of the entries. . . . Just one in six or so of the articles that reported an OLS-based analysis went on to report a “more sophisticated” one as well. . . . OLS is not dead. On the contrary, it remains the principal multivariate technique in use by researchers publishing in our best journals. Scholars should not despair that possession of quantitative skills at an OLS level (or less) bars them from publication in these top outlets.

I have a few thoughts on this:

1. I don't like the term OLS ("ordinary least squares"). I prefer the term "linear regression" or "linear model." Least squares is an optimization problem; what's important (in the vast majority of cases I've seen) is the model. For example, if you still do least squares but you change the functional form of the model so it's no longer linear, that's a big deal. But if you keep the linearity and change to a different optimization problem (for example, least absolute deviation), that generally doesn't matter much. It might change the estimate, and that's fine, but it's not changing the key part of the model.

2. I like simple methods. Gary and I once wrote a paper that had no formulas, no models, only graphs. It had 10 graphs, many made of multiple subgraphs. (Well, we did have one graph that was based on some fitted logistic regressions--an early implementation of the secret weapon--but the other 9 didn't use models at all.) And, contrary to Cosma's comment on John's entry, our findings were right, not just published. The purpose of the graphical approach was not simply to convey results to the masses, and certainly not because it was all that we knew how to do. It just seemed like the best way to do this particular research. Since then, we've returned to some of these ideas using models, but I think we learned a huge amount from these graphs (along with others that didn't make it into the paper).

3. Sometimes simple methods can be justified by statistical theory. I'm thinking here of our approach of splitting a predictor at the upper quarter or third and the lower quarter or third. (Although, see the debate here.)

4. Other times, complex models can be more robust than simple models and easier to use in practice. (Here I'm thinking of bayesglm.)

5. Sometimes it helps to run complicated models first, then when you understand your data well, you can carefully back out a simple analysis that tells the story well. Conversely, after fitting a complicated model, you can sometimes make killer graphs.
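Point 1 above can be checked directly: keep the linear model, swap least squares for least absolute deviations, and the estimates barely move. A sketch on simulated data (the LAD fit here uses iteratively reweighted least squares, which is just one convenient way to solve that optimization problem, not the only one):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # true line: 1 + 2x
X = np.column_stack([np.ones(n), x])

# Least squares: one optimization problem for the linear model
beta_ls = np.linalg.lstsq(X, y, rcond=None)[0]

# Least absolute deviations: same linear model, different loss,
# fit by iteratively reweighted least squares
beta = beta_ls.copy()
for _ in range(50):
    w = 1.0 / np.maximum(np.abs(y - X @ beta), 1e-6)
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)
beta_lad = beta

print(beta_ls, beta_lad)   # two nearly identical (intercept, slope) pairs
```

With heavier-tailed errors the two estimates would separate a bit more, but either way the key part of the model--the linearity--is unchanged.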

## February 11, 2008

Justin Wolfers presents this graph that he (along with Eric Bradlow, Shane Jensen, and Adi Wyner) made comparing the career trajectory of Roger Clemens to other comparable pitchers:

The point is that Clemens did unexpectedly well in the later part of his career (better earned run average, allowed fewer walks+hits) compared to other pitchers with long careers. This in turn suggests that maybe performance-enhancing drugs made a difference. Justin writes:

To be clear, we don’t know whether Roger Clemens took steroids or not. But to argue that somehow the statistical record proves that he didn’t is simply dishonest, incompetent, or both. If anything, the very same data presented in the report — if analyzed properly — tends to suggest an unusual reversal of fortune for Clemens at around age 36 or 37, which is when the Mitchell Report suggests that, well, something funny was going on.

I can't comment on the steroids thing at all, but I will say that I'd like more information than is in the graphs. For one thing, Clemens is clearly not a typical pitcher and never has been. At the very least, you'd like to see the comparison of his trajectory with all the other individual trajectories, not simply the average. For another, the graphs above seem to rely way too much on the quadratic fit. At least for the average of all the other pitchers, why not show the actual averages? Far be it from me to criticize this analysis (especially since I am friends with all four of the people who did it!)--this is just a recreational activity, and I'm sure these guys have better things to do than correct ERAs for A.L./N.L. effects, etc.--but I think you do want to have some comparisons of the entire distribution, as well as a sense of how much the "unusual reversal around ages 36 or 37" is an artifact of the fitted model.

P.S. to Justin, Eric, Shane, and Adi: Now youall have permission to be picky about my analyses in return. . . .

P.P.S. Nathan made this plot showing data from the 16 most recent Hall of Fame pitchers.

## February 8, 2008

Gary sent along this news article from the Syracuse Post-Standard:

Dead heat: Obama and Clinton split the Syracuse vote 50-50

by Mike McAndrew

In the city of Syracuse, the strangest thing happened in Tuesday's Democratic presidential primary.

Sen. Hillary Clinton and Sen. Barack Obama received the exact same number of votes, according to unofficial Board of Election results.

Clinton: 6,001.

Obama: 6,001.

"Wow, that is odd," said Jay Biba, Clinton's Central New York campaign coordinator. "I never heard of that in my life."

The odds of Clinton and Obama tying were less than one in 1 million, said Syracuse University mathematics Professor Hyune-Ju Kim.

"It's almost impossible," said Kim, who analyzed the statewide and citywide votes.

Lisa Daly, Obama's Syracuse campaign coordinator, said she thought a mistake had been made when she was first told the tally by the Board of Elections.

What are the chances of it happening?

"Good thing it wasn't a mayor's race," quipped Grant Reeher, a political science professor at Syracuse University's Maxwell School of Citizenship and Public Affairs.

A total of 12,346 votes were cast for Democrats in the city. Four other Democrats also received votes: John Edwards, 114; Dennis Kucinich, 113; Bill Richardson, 90; and Joe Biden, 27.

The tie is likely to be broken when elections officials recanvass the voting machines and add in the absentee and affidavit votes.

But for now, it's all even.

Update

The story The Post-Standard broke about Sen. Hillary Clinton and Sen. Barack Obama battling to a tie vote in the city of Syracuse was being posted Thursday on internet sites across the country.

Clinton and Obama each received 6,001 votes in Syracuse in the unofficial Board of Elections results. A total of 12,346 votes were cast in the city.

After doing a statistical analysis for The Post-Standard, Syracuse University mathematics professor Hyune-Ju Kim noted that the odds of Clinton and Obama getting the exact same amount of votes in Syracuse was less than one in 1 million.

To come to that conclusion, Kim factored in the state-wide and city-wide results in the Democratic primary.

Elaborating on Thursday, she noted: "The 'almost impossible' odds are obtained when we assume the Syracuse voter distribution follows the New York state distribution. Since it is almost impossible to observe what we have observed, statistically we can conclude that the Syracuse voter distribution is significantly different from the New York state distribution."

There would be less than one in 1 million chance of a tie occurring between Clinton and Obama in voting by a randomly selected group of 12,346 New York Democratic voters, she said.

Not to pick on some harried mathematics professor who'd probably rather be out proving theorems, but . . . of course Syracuse voters are not a randomly selected group of New Yorkers. You don't need a statistical test to see that. Regarding the probability of an exact tie: I don't think that's so low: a quick calculation might say that either Clinton or Obama could've received between, say, 5000 and 7000 votes, giving something like a 1/2000 chance of an exact tie. That's gotta be the right order of magnitude.
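That back-of-the-envelope number is easy to sanity-check. Under the made-up assumption that Clinton's underlying share of the two-candidate vote is roughly normal around 1/2 with enough spread that her total could plausibly land anywhere in the 5,000-7,000 range, the chance of an exact 6,001-6,001 tie works out to the same order of magnitude as the 1/2000 guess, and nowhere near one in a million:

```python
import math

def binom_pmf(k, n, p):
    # exact binomial pmf, computed in log space to avoid underflow
    logpmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
              + k * math.log(p) + (n - k) * math.log(1 - p))
    return math.exp(logpmf)

n = 12_002   # two-candidate votes cast in Syracuse
k = 6_001    # a dead-even split

# Made-up assumption: Clinton's underlying share p is roughly normal
# around 1/2 with sd 0.04, so her vote total plausibly lands anywhere
# in about the 5,000-7,000 range.
mu, sd = 0.5, 0.04
grid = [mu + sd * (i / 100) for i in range(-400, 401)]
weights = [math.exp(-0.5 * ((p - mu) / sd) ** 2) for p in grid]

p_tie = (sum(w * binom_pmf(k, n, p) for p, w in zip(grid, weights))
         / sum(weights))
print(round(1 / p_tie))   # roughly 1 in 1,200
```

Fiddle with the assumed spread and the answer moves around, but it stays in the one-in-a-few-thousand range, not one in a million.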

Anyway, I know this is silly--as pointed out in the article, it doesn't matter if there's a tie in Syracuse anyway. This might make a good classroom example, though. (See also here and here for more on the probability of a tied election.)

## February 6, 2008

### It's all over but the normalizin'

Ted Dunning writes:

You advocated recently [article to appear in Statistics in Medicine] the normalization of variables to have average deviation of 1/2 in order to match that of a {0,1} binary variable.

This recommendation will disturb lots of people for obvious reasons which may make your recommendation sell better.

But have you considered normalizing the binary variable to {-1, 1} instead of {0,1} before adjusting the mean to zero? This has the same effect but leaves larger communities happier, particularly because much of the applied modeling community has always normalized their binary variables to this range.

My reply: I actually went back and forth on this for a while. In most of the regression analyses in political science, economics, sociology, epidemiology, etc., that I've seen, it's standard to code binary variables as 0/1. But, yeah, the other way to go would've been to standardize by dividing by one standard deviation and then recommend coding binary variables as +/-1. Maybe that would've been a better idea. I was trying to decide which way would disturb people less, but maybe I guessed wrong!
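For concreteness, here's the arithmetic of the two conventions (simulated data; the variable names are just for illustration):

```python
import random
import statistics

random.seed(1)
x = [random.gauss(50, 10) for _ in range(10_000)]   # a continuous input
z = [random.choice([0, 1]) for _ in range(10_000)]  # a 0/1 binary input

mean_x = statistics.mean(x)
sd_x = statistics.pstdev(x)

# Convention in the paper: center, then divide by TWO standard
# deviations, so the rescaled continuous input has sd 1/2 -- comparable
# to a balanced 0/1 binary input, whose sd is sqrt(0.5 * 0.5) = 0.5.
x_two_sd = [(xi - mean_x) / (2 * sd_x) for xi in x]

# Dunning's alternative: divide by ONE sd, and code the binary input as
# -1/+1, which has sd 1 when balanced.
x_one_sd = [(xi - mean_x) / sd_x for xi in x]
z_pm1 = [2 * zi - 1 for zi in z]
```

Either way, the continuous and binary inputs end up on a common scale; the choice is which scale.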

## February 3, 2008

### Convergent interviewing and respondent-driven sampling

Bill Harris writes,

I stumbled across this project today and thought it might be related to a comment I posted last summer here.
I'm curious whether Bob Dick's convergent interviewing predates RDS; I'm pretty sure I first learned convergent interviewing from Bob around 1992. I have a book by him, Rigor Without Numbers, that discusses convergent interviewing as well. While my third edition is copyright 1999, it says that the first version was delivered at the XVIIIth Annual Meeting of Australian Social Psychologists, Greenmount, Queensland, 12-14 May 1989.

For more online on convergent interviewing, see here.

Matt responds:

I was not familiar with convergent interviewing and it does seem that it precedes RDS; as far as I know, the first paper about RDS was published by Doug Heckathorn in 1997. But it also seems that the methods are very different. RDS is designed to make population proportion estimates (e.g., what percentage of drug injectors in New York City have HIV?), while convergent interviewing seems designed for qualitative research. Also, in convergent interviewing it seems that the researcher chooses whom to interview next (and so can do this in a purposive way), but in RDS the choice of who gets recruited is made by the participants themselves, not the researcher, and in fact RDS estimation only works if people recruit randomly from their friends (i.e., no purposive choice). There may be insights that practitioners of both methods can learn from each other, but those connections aren't clear to me right now. On the other hand, sometimes these connections pop up in mysterious ways, so this idea might be helpful in the future.

What I saw as qualitatively similar between MCMC and convergent interviewing is the notion that you draw a sample in ways that seek to maximize the information you gather from it, avoiding getting stuck in parts of the population that have very little to contribute to the items of interest, as you might with a purely random sample.

I seem to recall it being said somewhere that one can select the next people to interview in CI by asking the current pair of respondents (i.e., on an iterative basis) who is the person most unlike them who is also in some sense mainstream. As I haven't had time to do much with your paper yet, I can't speak to RDS except via your claim that it relates to MCMC.

As for the intent, I think what you say is correct, although Bob Dick may wish to offer his views: CI is focused on qualitative research, and so you're more likely to surface a broad spectrum of answers but have no estimate of relative frequency of those answers in the population.

## January 31, 2008

### Debate over categorizing continuous variables

In a comment to an entry linking to my paper on splitting a predictor at the upper quarter or third and the lower quarter or third, MV links to this article by Frank Harrell on problems caused by categorizing continuous variables:

1. Loss of power and loss of precision of estimated means, odds, hazards, etc.

2. Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases

. . .

12. A better approach that maximizes power and that only assumes a smooth relationship is to use a restricted cubic spline . . .

I agree that it is typically more statistically efficient to use continuous predictors. But, if you are discretizing, our paper shows why it can be much more efficient to use three groups (thus, comparing "high" vs. "low", excluding "middle"), rather than simply dichotomizing into high/low.

As discussed in the paper, we specify the cutpoints based on the proportion of data in each category of the predictor, x. We're not estimating the cutpoints based on the outcome, y. (This handles points 7, 8, 9, and 10 of the Harrell article.)

We're not assuming that the regression function is flat within intervals or discontinuous between intervals. We're just making direct summaries and comparisons. That's actually the point of our paper, that there are settings where these direct comparisons can be more easily interpretable.

Just to be clear: I'm not recommending that discrete parameters be used for articles in the New England Journal of Medicine or whatever, in an area where regression is a well understood technique. I completely agree with Harrell that it's generally better to keep variables as continuous rather than try to get cute with discretization. On the other hand, when you have your results, it can be helpful to explain them with direct comparisons. The point of our paper is that, if you're going to do such direct comparisons, it's typically efficient to do upper and lower third or quarter, rather than upper and lower half.
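Here's a back-of-the-envelope version of the efficiency claim, under the simplifying assumptions that x is uniform on [0, 1] and the relationship is linear (my own sketch, not a calculation from the paper). Comparing the mean of y in the top and bottom fraction f of the data gives a slope estimate whose variance is roughly proportional to 2 / (f * (1 - f)^2): each group mean averages n*f observations, and the between-group gap in mean x is (1 - f).

```python
import math

def se_factor(f):
    """Standard error of the high-vs-low comparison, up to a constant
    common to all f, for x uniform on [0, 1]."""
    return math.sqrt(2 / f) / (1 - f)

se_halves = se_factor(1 / 2)  # dichotomize at the median: 4.0
se_thirds = se_factor(1 / 3)  # upper vs lower thirds: ~3.67, smaller
```

Under these assumptions f = 1/3 is exactly optimal; for other predictor distributions the optimum lands between roughly a quarter and a third.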

## January 29, 2008

### Robust t-distribution priors for logistic regression coefficients

Bill DuMouchel wrote:

I recently came across your paper, "A default prior distribution for logistic and other regression models," where you suggest the student-t as a prior for the coefficients. My application involves drug safety data and very many predictors (hundreds or thousands of drugs might be associated with an adverse event in a database). Rather than a very weakly informative prior, I would prefer to select the t-distribution scale parameter (call it tau) to shrink the coefficients toward 0 (or toward some other value in a fancier model) as much as can be justified by the data. So I want to fit a simple hierarchical model where tau is estimated. Is there an easy modification of your algorithm to adjust tau at every iteration and to ensure convergence to the MLE of tau (or maximum posterior estimate if we add a prior for tau)? And do you know of any arguments for why regularization by cross-validation would really be any better than fitting tau by a hierarchical model, especially if the goal is parameter interpretation rather than pure prediction?

I replied:

We also have a hierarchical version that does what you want, except that the distribution for the coeffs is normal rather than t. (I couldn't figure out how to get the EM working for a hierarchical t model. The point is that the EM for the t model uses the formulation of a t as a mixture of normals, i.e., it's essentially already a hierarchical normal.)

We're still debugging the hierarchical version, hope to have something publicly available (as an R package) soon.

Regarding your question about cross-validation: yes, I think a hierarchical model would be better. The point of the cross-validation in our paper was to evaluate priors for unvarying parameters, which would not be modeled hierarchically.
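The mixture-of-normals formulation mentioned above is easy to check by simulation: a t draw is a normal draw divided by the square root of an independent gamma-distributed variance factor.

```python
import math
import random
import statistics

random.seed(3)
nu = 6.0  # degrees of freedom (an arbitrary illustrative choice)

def t_via_normal_mixture():
    """One t_nu draw via the scale-mixture representation: a N(0, 1)
    draw divided by the square root of an independent
    Gamma(nu/2, rate=nu/2) variance factor (which has mean 1)."""
    z = random.gauss(0.0, 1.0)
    w = random.gammavariate(nu / 2, 2 / nu)  # args: shape, scale (= 1/rate)
    return z / math.sqrt(w)

draws = [t_via_normal_mixture() for _ in range(100_000)]
# The t_nu variance is nu / (nu - 2) = 1.5 for nu = 6.
```

It's exactly this representation that makes the EM work: conditional on the variance factors, the t model is just a weighted normal model.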

Bill then wrote:

I did have my heart set on a hierarchical model for the t rather than the normal, because I wanted to avoid overshrinking very large coefficients while still "tuning" the prior scale parameter to the data empirically. (Although my worry about overshrinking might be less urgent if I use prior information to create "batches" that can have their own centers of shrinkage, as in your in-progress hierarchical bayesglm program.)

Lee Edlefsen and I [Bill D.] are working on a drug adverse events dataset with about 3 million rows and three thousand predictors, using logistic regression and some extensions of LR, and with thousands of different response events to fit. Plus, the potential non-repeatability of MCMC results would be a real turnoff for the FDA regulators and pharma industry researchers.

An EM question

I have a question for Chuanhai or Xiao-Li or someone like that: is it possible to do EM with two levels of latent variables in the model? In the usual formulation, there are data y, latent parameters z, and hyperparameters theta, and EM gives you the maximum likelihood (or posterior mode) estimate of theta, conditional on y and averaging over z. This can commonly be done fairly easily because z commonly has (or can be approximated with) a simple distribution given y and theta. This scenario describes regression with fixed Student-t priors, or regression with normal priors with unknown mean and variance.

But what about regression with t priors with unknown center and scale? There are now two levels of latent variables. Can an EM, or approximate EM, be constructed here? As Bill and I discussed in our emails, Gibbs is great, and it's much easier to set up and program than EM, but it's harder to debug. There's something nice about a deterministic algorithm, especially if it's built with bells and whistles that go off when something goes wrong.
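In the one-level case (a fixed t prior with known scale), the EM really is just a few lines; here's a toy version for the simplest possible setting, estimating the center of t-distributed data (made-up numbers):

```python
def t_center_em(y, nu=4.0, s=1.0, iters=50):
    """EM for the center of a t_nu distribution with known scale s,
    using the t-as-mixture-of-normals formulation: the E-step computes
    weights w_i = (nu + 1) / (nu + r_i^2) from the scaled residuals
    r_i, and the M-step takes the weighted mean."""
    mu = sum(y) / len(y)  # start at the sample mean
    for _ in range(iters):
        w = [(nu + 1) / (nu + ((yi - mu) / s) ** 2) for yi in y]
        mu = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    return mu

data = [-1.2, 0.1, 0.3, 0.4, 0.5, 25.0]  # one gross outlier
robust_center = t_center_em(data)   # stays near the bulk of the data
naive_mean = sum(data) / len(data)  # dragged toward 25 (about 4.18)
```

The E-step downweights points far from the current center, which is exactly the robustness we want from the t; the open question above is what happens when the scale (and center) of the prior must themselves be estimated, adding a second level of latent variables.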

## January 28, 2008

### Some Recent Progress in Simple Statistical Methods

Simple methods are great, and "simple" doesn't always mean "stupid" . . .

Here's the mini-talk I gave a couple days ago at our statistical consulting symposium. It's cool stuff: statistical methods that are informed by theory but can be applied simply and automatically to get more insights into models and more stable estimates. All the methods described in the talk derived from my own recent applied research.

For more on the methods, see the full-length articles:

Scaling regression inputs by dividing by two standard deviations

A default prior distribution for logistic and other regression models

Splitting a predictor at the upper quarter or third and the lower quarter or third

A message for the graduate students out there

Research is fun. Just about any problem has subtleties when you study it in depth (God is in every leaf of every tree), and it's so satisfying to abstract a generalizable method out of a solution to a particular problem.

P.S. On the other hand, many of Tukey's famed quick-and-dirty statistical methods don't seem so great to me anymore. They were quick in the age of pencil-and-paper computation, and sometimes dirty in the sense of having unclear or contradictory theoretical foundations. (In particular, his stem-and-leaf plots and his methods for finding gaps and clusters in multiple comparisons seem particularly silly from the perspective of the modern era, however clever and useful they may have been at the time he proposed them.)

P.P.S. Don't get me wrong, Tukey was great, I'm not trying to shoot him down. I wrote the above P.S. just to remind myself of the limitations of simple methods, that even the great Tukey tripped up at times.

## January 25, 2008

### Kaiser Fung on business statistics in practice

Here's Kaiser Fung's presentation at our consulting mini-symposium. It was interesting to hear about the challenges of in-house consulting at Sirius Satellite Radio.

### Rindskopf’s Rules for Statistical Consulting

Our statistical consulting mini-symposium yesterday was great. I wish we'd been able to video it. There was lively discussion of the connections between statistical consulting and research, and the different aspects of consulting in academic, corporate, and legal environments.

I'll be posting everyone's slides. Here's David Rindskopf's contribution:

Rindskopf’s Rules for Statistical Consulting

Some of these rules are universal, while others apply only in particular situations: Informal academic consulting, formal academic consulting, or professional consulting. Hopefully the context will be apparent for each rule.

Communication with the Client:

(1) In the beginning, mostly (i) listen and (ii) ask questions that guide the discussion.

(2) Your biggest task is to get the client to discuss the research aims clearly; next is design, then measurement, and finally statistical analysis.

(3) Don’t give recommendations until you know what the problem is. Premature evaluation of a consulting situation is a nasty disease with unpleasant consequences.

(4) Don’t believe the client about what the problem is. Example: If the client starts by asking “How do I do a Hotelling’s T?” (or any other procedure), never believe (without strong evidence) that he/she really needs to do a Hotelling’s T.

Exception: If a person stops you in the hall and says “Have you got a minute?” and asks how to do Hotelling’s T, tell them and hope they’ll go away quickly and not be able to find you later. (I’ve had this happen, and if I ask enough questions I inevitably find that it’s the wrong test, answers the wrong question, and is for the wrong type of data.)

Adapting to the Client and His/Her Field

(5) Assess the client’s level of knowledge of measurement, research design, and statistics, and talk at an appropriate level. Make adjustments as you gain more information about your client.

(6) Sometimes the “best” or “right” statistical procedure isn’t really the best for a particular situation. The client may not be able to do a complicated analysis, or understand and write up the results correctly. Journals may reject papers with newer methods (I know it’s hard to believe, but it happens in many substantive journals). In these cases you have to be prepared to do more “traditional” analyses, or use methods that closely approximate the “right” ones. (Turning lemons into lemonade: Use this as an opportunity to write a tutorial for the best journal in their field. The next study can then use this method.) A similar perspective is represented in the report of the APA Task Force on Statistical Significance; see their report: Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.

Professionalism (and self-protection)

(7) If you MUST do the right (complicated) analysis, be prepared to do it, write a few tutorial paragraphs on it for the journal (and the client), and write up the results section.

(8) Your goal is to solve your client’s problems, not to criticize. You can gently note issues that might prevent you from giving as complete a solution as desired. Corollary: Your purpose is NOT to show how brilliant you are; keep your ego in check.

Time Estimation, Charging for Your Time, etc.

(9) If a person stops you in the hall and asks if you have a minute, make him/her stand on one leg while asking the question and listening to your answer. If they ask for five minutes, it’s really a half-hour they need (or more).

(10) Corollary: Don’t charge by the job unless you really know what you’re doing or are really desperate. Not only do people (including you) underestimate how long it will take, but (a la Parkinson’s Law) the job will expand to include everything that comes into the client’s mind as the job progresses. If you think you know enough, write down all of the tasks, estimate how much time each will take, and double it. Also let the client know that if they make changes they’ll pay extra (Examples: “Whoops, I left out some data; can you redo the analyses?”, or “Let’s try a crosstab by astrological sign, and favorite lotto number, and...”)

(11) Charge annoying people a higher hourly rate. If you don’t want to work for them at all, charge them twice your usual rate to discourage them from hiring you (at least if they do hire you, you’ll be rewarded well.)

Resources

http://www.amstat.org/sections/cnsl/index.html ASA section on consulting
http://www.amstat.org/sections/cnsl/BooksJournals.html Their guide to books and journals on statistics

Boen, J. R., and Zahn, D. A. (1982). The Human Side of Statistical Consulting. Lifetime Learning Publications.

Cabrera, J., and McDougall, A. (2002). Statistical Consulting. Springer-Verlag.

Derr, J. (2000). Statistical Consulting: A Guide to Effective Communication. Pacific Grove, CA: Duxbury Press. ISBN 0-534-36228-1.

Chatfield, C. (1988). Problem Solving: A Statistician's Guide. Chapman & Hall.

Taplin, R. H. (2003). Teaching statistical consulting before statistical methodology. Australian & New Zealand Journal of Statistics, 45 (2), 141-152. Contains a good reference list on statistical consulting.

## January 24, 2008

### Statistical consulting mini-symposium TODAY (Thurs)!

Mini-Symposium: Statistical Consulting

When: January 24, 2008, from 3pm to 5pm

Where: Applied Statistics Playroom*

Sponsored by the New York City chapter of the American Statistical Association and the Columbia University Statistics Department, ISERP, and Applied Statistics Center.

Agenda

* Before 3pm: Casual conversation. This is a good time to meet new people or catch up with others.

* 3pm to 5pm:

o Brief lecture by Andrew Gelman: Some Recent Progress in Simple Statistical Methods.

o Panel discussion on statistical consulting with Naihua Duan (New York State Psychiatric Institute), Mimi Kim (Albert Einstein College of Medicine), Eva Petkova (New York University), Andrew Gelman (Columbia University), Kaiser Fung (Sirius Satellite Radio), and David Rindskopf (CUNY Graduate Center).

o The panel members will speak briefly, discuss questions, and facilitate a general discussion about statistical consulting.

* After 5pm: End of the formal part of the symposium. People can continue a group discussion, leave, or break into smaller groups.

Topics to be discussed include:

* Providing statistical solutions within the range of understandability;
* Handling the trade-offs between doing the analyses yourself and teaching others to perform all or parts of the analyses themselves;
* Managing expectations and building long-term relationships;
* Deciding how much to cater to the norm within disciplines;
* Balancing the goals of co-authorship in conjunction with money-making.

* The Applied Statistics Playroom is 707 International Affairs Building, Columbia University, at 118 St. and Amsterdam Ave., near the 116 St. #1 train. Snacks will be provided.

P.S. See here, here, and here for slides of some of the presentations.

## January 15, 2008

### Statistics postdoc at Michigan

I hate to advertise the competition, but this looks like it could be interesting:

Postdoctoral Position in Methodology Institute for Social Research University of Michigan Unit: The Quantitative Methodology Program

Date Announced: 01/15/2008

Qualified individuals will have a Ph.D. in statistics or biostatistics with a demonstrated interest in the social, behavioral or health sciences, or a Ph.D. in the social, behavioral or health sciences with very strong methodological skills and interests. Postdoctoral researchers will collaborate with Dr. Susan Murphy as part of a large NIH-funded project focused on the advancement and dissemination of statistical methodology involving individually tailored treatments (known as dynamic treatment regimes or adaptive treatment strategies) related to research on the prevention and treatment of substance abuse. This project also involves collaboration with researchers at the Methodology Center at Pennsylvania State University and clinical scientists at a variety of Universities and research centers.

Successful applicants will have exceptional resources to facilitate their research, including access to administrative and software support staff, any required hardware and software, and travel funding for at least one scientific conference annually. Experience with statistical programming and computer simulations is desirable. The position is for one year, with excellent possibility of re-funding for at least one additional year. The salary and benefits associated with this position are competitive.

Review of applications will begin immediately and continue until the positions are filled. Send a letter of application indicating research interests, career goals, and experience, a curriculum vita, and three letters of professional reference to: Susan Murphy, Quantitative Methodology Program, Institute for Social Research, University of Michigan, Ann Arbor, MI 48106-1248. For more information, contact Rhonda Moats (rmoats@umich.edu). The University of Michigan is committed to affirmative action, equal opportunity and the diversity of its workforce.

If you apply for the job, tell 'em you heard about it here!

### A question about causal inference and a question about variable selection

Lingzhou Michael Xue writes in with two questions:

1) Is it possible to generalize the Rubin Causal Model? In my undergraduate research project, I discovered that researchers in almost every field, ranging from biology and medicine design to social science, focus on decoding network-level causality. However, these publications often lack solid statistical foundations for defining causality and for doing causal analysis. On the other hand, I have learned a lot from the Rubin Causal Model, and also from powerful tools such as instrumental variables and propensity scores. Yet this kind of causal inference is limited to one-variable causality. Is it possible to generalize it to deal with interaction causality? My intuition says this is pretty difficult, but I am still curious about the possibility of generalizing Rubin's model.

2) Some works on Bayesian variable selection? Recently we have witnessed fruitful and interesting research on variable selection, which has even drawn Terence Tao's attention. What is more interesting, most work in this area relies on penalized learning, i.e., the frequentist perspective, while I believe that a Bayesian approach might bring us a more reasonable framework, as it so often does. Could you kindly point me to some works on Bayesian variable selection?

1. Rubin's causal model allows for interactions. Interactions between treatment and pre-treatment predictors fit in automatically with no complications at all, except that the goal is no longer to estimate an average treatment effect, you now want to estimate the effect conditional on predictors. If you have interactions between different treatment factors, it just complexifies the potential outcomes. I agree, though, that when the treatment is continuous, the potential outcomes need to be modeled, which brings Rubin's framework closer to classical regression and instrumental variables.
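A toy illustration of how interactions fit in (made-up potential outcomes): each unit carries both potential outcomes, and the estimand becomes the treatment effect conditional on a pre-treatment predictor rather than a single average effect.

```python
# Each unit: (group, y under control, y under treatment). The
# treatment effect y(1) - y(0) is allowed to differ by group.
units = [
    ("young", 1.0, 3.0),
    ("young", 2.0, 4.0),
    ("old",   1.0, 1.5),
    ("old",   2.0, 2.5),
]

def conditional_effect(group):
    """Average treatment effect within one level of the predictor."""
    effects = [y1 - y0 for g, y0, y1 in units if g == group]
    return sum(effects) / len(effects)

effect_young = conditional_effect("young")  # 2.0
effect_old = conditional_effect("old")      # 0.5
```

Nothing in the potential-outcomes bookkeeping changes; only the summary reported does.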

2. I'm not a big fan of variable selection. I prefer continuous model expansion: keeping all the variables in the model and controlling them with an informative prior distribution or hierarchical model.
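To illustrate the contrast (a toy sketch, not from any particular paper): with a normal prior on a coefficient, the posterior mode is a ridge-style estimate that shrinks smoothly toward zero instead of the coefficient being either kept or dropped. Assume a single centered predictor, no intercept, noise sd sigma, and prior beta ~ N(0, tau^2):

```python
def shrunken_slope(x, y, tau=0.5, sigma=1.0):
    """Posterior-mode slope under the prior beta ~ N(0, tau^2):
    the least-squares numerator over a penalized denominator."""
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    sxx = sum(xi * xi for xi in x)
    return sxy / (sxx + sigma**2 / tau**2)

x = [0.0, 1.0, 2.0, 3.0]
y = [0.1, 0.9, 2.1, 2.9]
ols = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
shrunk = shrunken_slope(x, y)  # pulled toward zero, never exactly zero
```

A hierarchical model goes one step further and estimates tau from the data, letting related coefficients borrow strength from one another.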

### A sighting of the unicorn

Richard Barker sent in this photograph and the following note:

Matt just pointed me to your article: You can load a die but you can't bias a coin. You might be interested in the attached, a photo of a bent NZ 50c coin that I had pressed in the Physics lab here a few years ago because I got bored using flat coins in classroom demonstrations where everyone knows what Pr(heads) is. Fortunately that particular style of coin is no longer legal tender so I am unlikely to be prosecuted for defacing her Majesty's coinage.

In discussing this with Matt this afternoon we conjured up a counterexample where the coin is completely pressed into a sphere. Then it has Pr(heads) = 1. If the pressing is not quite complete it will be a little less than one, so we claim the statement in the title of your article is not true. We think you can bias a coin.

When I first bent the coin I did some experiments letting the coin land on the ground. On soft carpet it was not obviously biased but it was on a hard surface. On hard surfaces, most of the time it bounces up and starts spinning on its edge. When this happens it then always lands heads up.

Yeah, sure, he's right. We were thinking of weighting a coin, but if you bend it enough, then it is no longer set to land "heads" for half of its rotation. And bouncing, sure, then anything can happen. We were always assuming you catch it in the air!

Finally, we were addressing the concept of the "biased coin," which, by analogy to the "loaded die," looks just like a regular coin but actually has probabilities other than 50/50 when caught in the air. In that sense, the bent coin is not a full counterexample since it clearly looks funny.

## January 14, 2008

### What to learn in your statistics Ph.D. program?

Cosma Shalizi (of the CMU statistics dept) and I had an exchange about the role of measure theory in the statistics Ph.D. program. I have to admit I'm not quite sure what "measure theory" is but I think it's some sort of theoretical version of calculus of real variables. I had commented that we're never sure what to do with our qualifying exam, and Cosma wrote,

I think we have a pretty good measure-theoretic probability course, and I wish more of our students went on to take the non-required sequel on stochastic processes (because that's the one I usually teach). I do think it's important for statisticians to understand that material, but I also think it's actually easier for us to teach someone how a martingale works than it is to teach them to be interested in scientific questions and to not get a freaked-out "but what do I calculate?" response when confronted with an open research problem. Here it's been suggested that we replace our qualifying exams with having the student prepare a written review of some reasonably-live topic from the literature and take an oral exam on it, which would be more work for us but come a lot closer to testing what the students actually need to know.

I replied,

I agree that it's hard to teach how to think like a scientist, or whatever. But I don't think of the alternatives as "measure theory vs. how-to-think-like-a-scientist" or even "measure theory vs. statistics". I think of it as "measure theory vs. economics" or "measure theory vs. CS" or "measure theory vs. poli sci" or whatever. That is, sure, all other things being equal, it's better to know measure theory (or so I assume, not ever having really learned it myself, which didn't stop me from proving 2 published theorems, one of which is actually true). But, all other things being equal, it's better to know economics (by this, I mean economics, not necessarily econometrics), and all other things being equal, it's better to know how to program. Etc. I don't see why measure theory gets to be the one non-statistical topic that gets privileged as being so required that you get kicked out of the program if you can't do it.

Cosma then shot back with:

I also don't think of the alternatives as "measure theory vs. how-to-think-like-a-scientist" or even "measure theory vs. statistics". My feeling --- I haven't, sadly, done a proper experiment! --- is that it's easier to, say, take someone whose math background is shaky and teach them how a generating-class argument works in probability than it is to take someone who is very good at doing math homework problems and teach them the skills and attitudes of independent research.

You say, "I think of it as "measure theory vs. economics" or "measure theory vs. CS" or "measure theory vs. poli sci" or whatever." I'm more ambitious; I want our students to learn measure-theoretic probability, and scientific programming, and whatever substantive field they need for doing their research, and, of course, statistical theory and methods and data analysis. Because I honestly think that if someone is going to engage in building stochastic models for parts of the world, they really ought to understand how probability _works_, and that is why measure theory is important, rather than for its own sake. (I admit to some background bias towards the probabilist's view of the world.) At the same time it seems to me a shame (to use no stronger word) if someone, in this day and age, gets a ph.d. in statistics and doesn't know how to program beyond patching together scripts in R.

P.S. I think measure theory should be part of the Ph.D. statistics curriculum but I don't think it should be a required part of the curriculum. Not unless other important topics such as experimental design, sample surveys, statistical computing and graphics, stochastic modeling, etc etc are required also. It's sad to think of someone getting a Ph.D. in statistics and not knowing how to work with mixed discrete/continuous variables (see Nicolas's comment below) but it seems equally sad to see Ph.D.'s who don't know what Anova is, who don't know the basic principles of experimental design (for example, that it's more effective to double the effect size than to double the sample size), who don't know how to analyze a cluster sample, and so forth.

Unfortunately, not all students can do everything, and any program only gets some finite number of applicants. If you restrict your pool to those who want to do (or can put up with) measure theory, you might very well lose some who could be excellent statistical researchers. It would be sort of like not admitting Shaq to your basketball program because he can't shoot free throws.
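The design principle mentioned in the P.S. above (that doubling the effect size beats doubling the sample size) can be checked directly: the z statistic scales linearly in the effect but only as the square root of n.

```python
import math

def z_stat(effect, n, sigma=1.0):
    """z statistic for estimating a mean with known sd sigma:
    effect / (sigma / sqrt(n))."""
    return effect / (sigma / math.sqrt(n))

base = z_stat(0.2, 100)           # z = 2.0
double_effect = z_stat(0.4, 100)  # z = 4.0: twice the z statistic
double_n = z_stat(0.2, 200)       # z ~ 2.83: only a sqrt(2) gain
```

So, other things being equal, effort spent strengthening the treatment or sharpening the measurement pays off faster than effort spent recruiting more subjects.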

## January 10, 2008

### Record your sleep with Dream Recorder?

Aleks pointed me to this website:

Dream Recorder is the ideal companion for your nights, helping you better understand this third of our lives spent in bed. Dream recollection, sleep hygiene, curiosity: you will find your own reasons for using this software of a new kind. Night after night, Dream Recorder keeps records of your sleep profiles. It provides statistics and gives you the possibility to annotate your dream records with notes or keywords. . . .

Dream Recorder uses the difference between successive reconstructed images to compute the quantity of motion (see image on the right). The quantity of motion is reflected by the colored bar graph. High peaks mean motion; very low peaks are just at the detection noise base level. Dream periods are lit up by spotlights. Normal sleep is represented by the dark blue shades. Deep sleep has no lights or shading. Night events are displayed under the timeline, here a dream feedback followed by a voice recording.
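The "quantity of motion" they describe is presumably simple frame differencing; here's my guess at the idea (not the product's actual code):

```python
def motion_quantity(prev_frame, frame):
    """Sum of absolute pixel differences between two successive
    frames, treating each frame as a flat list of grayscale values."""
    return sum(abs(a - b) for a, b in zip(prev_frame, frame))

still = motion_quantity([10, 10, 10], [10, 10, 10])   # 0: no motion
moving = motion_quantity([10, 10, 10], [30, 5, 10])   # 25: a motion peak
```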

Seth would love this (I assume).

## January 4, 2008

### Free advice is worth what you pay for it

Someone writes,

I am currently looking at different grad school stats programs. I have a BA in Psychology (U. Southern California), but I am really interested in stats. I loved my stats classes in college but I was a bit of a naive wallflower back then and did not think to change course and pursue stats more, even though it was the favorite part of my degree. After I graduated, I worked as a research assistant where my PI quickly learned that I was happiest talking about and running the stats for her various projects. I worked with her for close to two years, then moved and now I'm a public school science teacher.
I know I have a passion for the subject, my problem is that whenever I look at the requirements for stats grad programs, I see that I am severely lacking in the math requirements. In your opinion, am I wasting my time looking at these programs when all I have is a Psych degree and a passion to learn more? Would it even be possible for an admissions board to look past my lack of math classes in college, taking into account my research experience, and possibly admit me on the condition that I complete the prerequisites? What course of action do you think would be best for my situation?

1. I want to be encouraging, because I think that the field of statistics should have more people who want to do statistics, as opposed to people who studied math and aren't really sure what to do next.

2. I expect you should be able to get into a good program, if they can be convinced that you can learn the math that's necessary. I think a lot of places take GRE scores pretty seriously, so if your GRE's aren't so great, you might need to take some math and some probability theory to make the case that you can do it. The research experience should help. I also think that being able to program computers is as important as being able to do math, but most stat programs probably don't agree with me there.

3. Another approach is to get a PhD in an applied field such as psychology or education or political science and focus on statistics (i.e., "methods"). The trick here is to go to a place where you can work with someone who's doing good interdisciplinary work so that you don't end up just doing out-of-date statistics. That's true in a regular statistics department also. Lots of really bad PhD theses get done (actually, I've advised some of these myself, but I'm trying to improve).

4. Perhaps someone will comment with better advice?

## January 1, 2008

### NASA data released for analysis

Via a Slashdot entry, I heard that NASA has released data from a survey they did from 2001 to 2004. They surveyed pilots, and apparently a lot of the responses did not reflect well on NASA, so the data was going to be destroyed. They changed their minds, and now the data has been posted for analysis - no one has really done a great job analyzing the data yet, so if anyone is interested... For the data, see the link here.

## December 11, 2007

### Why use count-data models (and don't talk to me about BLUE)

Someone who wishes to remain anonymous writes,

I have a question for you. Count models are to be used when you have outcome variables that are non-negative integers. Yet, OLS is BLUE even with count data, right? I don't think we make any assumptions with least squares about the nature of the data, only about the sampling, exogenous regressors, rank K, etc. So why technically should we use count if OLS is BLUE even with count data?

1. I don't really care if something is BLUE (best linear unbiased estimator) because (a) why privilege linearity, and (b) unbiasedness ain't so great either. See Bayesian Data Analysis for discussion of this point (look at "unbiased" in the index), also this paper for a more recent view.

2. Least squares is fine with count data, it's even usually ok with binary data. (This is commonly known, and I'm sure it's been written in various places but I don't happen to know where.) For prediction, though, you probably want something that predicts on the scale of the data, which would mean discrete predictions for count data. Also, a logarithmic link makes sense in a lot of applications (that is, log E(y) is a linear function of x), and you can't take the log of 0, which is a good reason to use a count data model.
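To make point 2 concrete, here's a small simulation with made-up numbers: counts are generated with a multiplicative (log-link) mean, then fit by ordinary least squares, which happily predicts negative counts at the low end of the predictor, exactly the situation where a count model earns its keep.

```python
import math
import random

random.seed(1)

def rpois(lam):
    # Knuth's Poisson sampler (fine for modest rates)
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Counts whose mean is multiplicative in x: E(y) = exp(a + b*x)
a, b = 0.5, 1.2
xs = [random.uniform(-2, 2) for _ in range(500)]
ys = [rpois(math.exp(a + b * x)) for x in xs]

# Closed-form ordinary least squares fit of y on x
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
beta = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
       sum((x - mx) ** 2 for x in xs)
alpha = my - beta * mx

# OLS predicts a negative count at the low end of x;
# a log-link count model cannot do that.
print("OLS prediction at x = -2:", alpha + beta * (-2))
```

The OLS coefficients are perfectly fine as a linear summary; it's only on the scale of the data (discrete, non-negative) that the least-squares predictions go wrong.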

### Question about principal component analysis

Andrew Garvin writes,

The CFNAI is the first principal component of 85 macroeconomic series - it is supposed to function as a measure of economic activity. My contention is that the first principal component of a diverse set of series will systematically overweight a subset of the original series, since the series with the highest weightings are the ones that explain the most variance. As an extreme example, say we do PCA on 100 series, where 99 of them are identical - then the first PC will just be the series that is replicated 99 times.
This is not necessarily a bad thing, but consider the CFNAI - most of the highly weighted series are from a) industrial production numbers, or b) non-farm payroll numbers. On the other hand, the series with relatively small weightings are very diverse. As I see it then, using the first principal component is not so much a measure of 'economic activity', but rather, 'economic activity as primarily measured by industrial production and NFP'. Now, if I thought a priori that industrial production and NFP explained most of what was happening in economic activity, then this would not be such a bad outcome. However, it seems to me that the whole point of using PCA instead of an equal-weighting is that we are naive about the true weightings of the various series composing our indicator - and so PCA conveniently gives us the most appropriate weightings. So, to me, PCA only works as a weighting strategy if we already have some idea of what the weights should be, which defeats the purpose of using PCA in the first place.

My question then is: Do you see this as a problem? a) If so, would you mind suggesting ways to deal with this problem, or perhaps pointing me to some reading material that might discuss this issue? b) If not, I would be curious to know what the flaw is in my argument above.

My reply: Hey--you've hit on something I know nothing about! My usual inclination is not to use principal components but rather to just take simple averages or total scores (see pages 69 and 295 of my book with Hill), on the theory that this will work about as well as anything else. But in that case you certainly have to be careful about keeping too many duplicate measures of the same thing. My impression was that principal component analysis gets around that problem, actually.

My more general advice is to check your reasoning by simulating some fake data under various hypothesized models (including examples where you have several near-duplicate series) and then see what the statistical procedure does.
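Following that fake-data suggestion, here is a sketch of the near-duplicate-series scenario (all numbers invented), using power iteration to pull out the first principal component:

```python
import math
import random

random.seed(0)
T = 200
common = [random.gauss(0, 1) for _ in range(T)]

# Nine near-duplicate series plus one independent series
series = [[c + random.gauss(0, 0.1) for c in common] for _ in range(9)]
series.append([random.gauss(0, 1) for _ in range(T)])
k = len(series)

# Center each series and form the k x k covariance matrix
centered = []
for s in series:
    m = sum(s) / T
    centered.append([v - m for v in s])
cov = [[sum(centered[i][t] * centered[j][t] for t in range(T)) / T
        for j in range(k)] for i in range(k)]

# Power iteration for the leading eigenvector = first-PC loadings
v = [1.0] * k
for _ in range(200):
    w = [sum(cov[i][j] * v[j] for j in range(k)) for i in range(k)]
    norm = math.sqrt(sum(x * x for x in w))
    v = [x / norm for x in w]

# The nine duplicated series soak up essentially all the weight
print("first-PC loadings:", [round(x, 3) for x in v])
```

The duplicated series each get a loading near 1/3 while the independent series gets essentially zero, which is exactly the overweighting Garvin describes.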

## December 7, 2007

### Statistical consulting mini-symposium next month

See here.

Manuel Spínola writes,

Many people (or at least some) in ecology question the validity of transforming a variable to perform a statistical test. I guess they are worried that the transformed variable is not what they intended to measure. Is that a fair criticism? Is it valid to transform a variable, let's say with a logarithmic transformation?

Does it mean that if I found a meaningful (biologically significant) relationship of a transformed explanatory variable with a response variable, that relationship exists but was not "visible" before the transformation? I am not clear on how to explain a relationship based on a transformed variable to, let's say, a layman. For example, if log(forest cover) has an influence on a bird species' abundance, what should I say to a manager regarding the effect of forest cover on the species' abundance?

My reply: This is discussed in more detail in chapter 4 of our recent book. The short answer is that predictive relationships can be nonlinear and can have interactions. Once you consider that things might be nonlinear, the transformation is arbitrary. But in many settings transformations can allow for simpler models, for example a log transformation changing a multiplicative to a linear relation. All is fine as long as you don't take the assumptions too seriously. To specifically answer your question: yes, use a log transformation, then explain the predictive relation by making a graph of y vs. x.
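A quick illustration with fabricated forest-cover data: after regressing abundance on log(cover), the fitted coefficient translates directly into statements like "doubling forest cover adds about beta*log(2) birds," which is often the easiest way to explain the transformed model to a manager.

```python
import math
import random

random.seed(2)

# Invented data: abundance increases with log(forest cover)
cover = [random.uniform(5, 95) for _ in range(200)]           # percent cover
abund = [3.0 + 2.0 * math.log(c) + random.gauss(0, 1) for c in cover]

# Least-squares fit of abundance on log(cover)
lx = [math.log(c) for c in cover]
n = len(lx)
mx, my = sum(lx) / n, sum(abund) / n
beta = sum((x - mx) * (y - my) for x, y in zip(lx, abund)) / \
       sum((x - mx) ** 2 for x in lx)

# On the log scale, a proportional change has the same effect everywhere:
print("predicted change in abundance if cover doubles:", beta * math.log(2))
```

The key selling point of the log transformation is that last line: "doubling cover" has the same predicted effect whether you start from 10% or 40% cover.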

## December 3, 2007

### Exploratory data analysis course

Aleks noticed this interesting-looking course:

Course Contents Predictive Analytics and Exploratory Data Mining

* the relationship between predictive analytics and exploratory data mining
* the role of graphics in exploratory analysis
* complexity in a PowerPoint world
* the analyst's dilemma

Working with Unstructured Data

* data streams versus structured data
* social network analysis as a solution to unstructured problems
* statistical mechanics of network analyses
* predicting with a network
* complex networks versus reductionism

Exploratory Data Mining and Predictive Models

* exploratory data mining success
* predictive modeling methods
* logistic regression
* decision trees
* neural networks
* the truth about neural networks
* comparing and contrasting predictive modeling methods
* model structure and impact on exploratory results
* graphical review of model results
* multi-dimensional graphics

Exploratory Predictive Modeling

* initial data screening
* elements of an exploratory script
* developing complex predictive models for exploratory efforts
* identifying important variables
* analyzing variables, domains, and clusters
* graphical review of models and data

Exploratory Findings

* extracting new hypotheses (exploratory findings) from the predictive model
* building confidence with the exploratory findings
* recognizing and overcoming impediments to acceptance by the target audience

Remind me again why we teach classes on boring topics like "categorical data analysis" . . .

## November 26, 2007

### What to teach if you only have three weeks, and suggestions for the ten most interesting and accessible quantitative papers in political science

Frank Di Traglia writes,

I'm going to be teaching a three-week, introductory statistics course for local high school students next summer, and wanted to ask for your advice. I have two questions in particular.

First, I doubt that three weeks will be enough time to teach the usual Statistics 101 course. If you had only three weeks, what would you skip and what would you emphasize?

Second, since next year is an election year, I thought it might be fun to build the course around substantive examples from political science. Although I've enjoyed many of your poly-sci papers, my own background is not in this area (I did my masters in Statistics, and am currently pursuing a PhD in Economics). What would you consider to be the ten most interesting and accessible quantitative papers in this field?

1. It's gotta depend on how many hours per week you have! To consider the larger question, I'm unsatisfied with the usual intro stat course (including what I've taught) because it comes across as a disconnected set of topics. As I wrote here about the "sampling distribution of the sample mean":

The hardest thing to teach in any introductory statistics course is the sampling distribution of the sample mean, a topic that is at the center of the typical intro-stat-class-for-nonmajors. All of probability theory builds up to it, and then this sample mean is used over and over again for inferences for averages, paired and unpaired differences, and regression. This is the standard sequence, as in the books by Moore and McCabe, and De Veaux et al.

The trouble is, most students don't understand it. I'm not talking about proving the law of large numbers or central limit theorem--these classes barely use algebra and certainly don't attempt rigorous proofs. No, I'm talking about the derivations that lead to the sample mean of independent, identical measurements having a distribution with mean equal to the population mean, and sd equal to the sd of an individual measurement divided by the square root of n.

This is key, but students typically don't understand the derivation, don't see the point of the result, and can't understand it when it gets applied to examples.

What to do about this? I've tried teaching it really carefully, devoting more time to it, etc.--nothing works. So here's my proposed solution: de-emphasize it. I'll still teach the sampling distribution of the sample mean, but now just as one of many topics, rather than the central topic of the course. In particular, I will not treat statistical inference for averages, differences, etc., as special cases or applications of the general idea of the sampling distribution of the sample mean. Instead, I'll teach each inferential topic on its own, with its own formula and derivation. Of course, they mostly won't follow the derivations, but then at least if they're stuck on one of them, it won't muck up their understanding of everything else.

Given these thoughts, my first suggestion would be for you to indeed focus on one particular thing, for example public opinion, and focus your course on that. Have the students download raw data from polls and do some analyses (maybe using JMP-in). This is what Bob Shapiro does when he teaches intro stats here.

2. If you'd rather do something closer to standard statistics, I'd recommend focusing on sampling, experimentation, and observational studies. You can do one week of each--in each week, they first do an in-class demo (a survey in week 1, an experiment in week 2, an obs study in week 3), then they together do something larger. I have some examples in my book with Deb, but I can't say I've worked out all the details of such a course. It's easier to talk about it than to do it.

3. The ten most interesting and accessible quantitative papers in political science? That's a good question. Of my own papers, these are the most accessible, I think: Why are American Presidential election campaign polls so variable when votes are so predictable?, Voting, fairness, and political representation, Voting as a rational choice: why and how people vote to improve the well-being of others, A catch-22 in assigning primary delegates, Rich state, poor state, red state, blue state: What's the matter with Connecticut? I also like the paper, Methodology as ideology: mathematical modeling of trench warfare, even though it's not really statistical.

I wouldn't include all of these in a top-ten list, but I'd include at least one! Beyond this, perhaps the blog readers have some suggestions?

## November 21, 2007

### Quantitative Methods for Negotiating Trades in Pro Sports

I recently had some thoughts about negotiating trades in the NBA. Specifically, I heard that the Lakers and the Bulls were having daily discussions about a trade involving Kobe Bryant, for at least a week; that seemed like a long time to me. Was this week-long series of conversations productive and/or necessary? Are there no quantitative methods for structuring trade negotiations that could have been used to save these teams some time and energy? I've outlined a potential solution, which can probably be improved using methods from the literature on (1) statistical models for rankings and (2) bargaining and negotiating.

The only similar setting that I can think of is when opposing lawyers eliminate potential jurors from a jury pool (they call these "peremptory challenges"). Does anyone know of another situation in which opposing agents rank items and eventually must agree on a compromise? Maybe there is something in the bargaining literature.

The statistical question of interest is this: What is the percentile of the rank (for each team) of the jointly optimal trade? (That is, the last trade that remains on both lists after eliminations are made). It would be nice if, in the pro sports example, both teams could improve significantly. This would probably only happen in an "apples for oranges" type of trade. Some preliminary work in the Lakers-Bulls example shows that the jointly optimal trade is in the 47th percentile for the Lakers and the 48th percentile for the Bulls - not too great for either team. A bunch of assumptions were made in this example, though, so it's probably not too informative right now. If a probability model is used to generate the two sets of rankings, then the pair of percentiles of the jointly optimal trade, (p_1,p_2), would be a random variable of interest.
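One way to sketch the elimination procedure (assuming, purely for illustration, that both teams rank the candidate trades independently at random and alternately strike their worst remaining trade):

```python
import random

random.seed(4)

def joint_percentiles(n_trades):
    # Each team's rank of trade t (0 = best); independent and random here
    rank_a = list(range(n_trades)); random.shuffle(rank_a)
    rank_b = list(range(n_trades)); random.shuffle(rank_b)
    alive = set(range(n_trades))
    turn = 0
    while len(alive) > 1:
        ranks = rank_a if turn == 0 else rank_b
        alive.remove(max(alive, key=lambda t: ranks[t]))  # strike worst remaining
        turn = 1 - turn
    t = alive.pop()
    pct = lambda r: 100.0 * (n_trades - r) / n_trades     # higher = better
    return pct(rank_a[t]), pct(rank_b[t])

results = [joint_percentiles(100) for _ in range(1000)]
avg_a = sum(r[0] for r in results) / len(results)
avg_b = sum(r[1] for r in results) / len(results)
print("average percentile of the surviving trade:", round(avg_a, 1), round(avg_b, 1))
```

One small but cheering fact about this procedure: every strike by a team removes a trade that team likes less than everything still alive, so with 100 candidate trades the survivor is guaranteed to sit in at least the top half of both teams' rankings.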

## November 16, 2007

### Political Neuroscience

A piece by Brandon Keim in Wired points out some issues in the fMRI brain-politics study on reactions to presidential candidates discussed in a recent NYT op-ed. For example,

Let's look closer, though, at the response to Edwards. When looking at still pictures of him, "subjects who had rated him low on the thermometer scale showed activity in the insula, an area associated with disgust and other negative feelings." How many people started out with a low regard for Edwards? We aren't told. Maybe it was everybody, in which case the findings might conceivably be extrapolated to the swing voter population of the United States. But maybe it was just five or ten voters, of whom one or two had such strong feelings of disgust that it skewed the average. What about the photographs? Was he sweating and caught in flashbulb glare that would make anyone's picture look disgusting? How did the disgust felt towards Edwards compare to that felt towards other candidates? How well do scientists understand the insula's role in disgust -- better, I hope, than they understand the Romney-activated amygdala, which is indeed associated with anxiety, but also with reward and general feelings of arousal?

(And don't forget "Baby-faced politicians lose" on this blog.)

## November 12, 2007

Josh Menke writes,

I saw that you had commented on adjusted plus/minus statistics for basketball in a few of your blog entries [see also here]. I've been working on a Bayesian version of the model used by Dan Rosenbaum, and wondered if I could ask you a question.

I wanted to be able to update the posterior after each sequence of game play between substitutions, so I decided to use the standard exact inference update for a normal-normal Bayesian linear regression model. If you're familiar with Chris Bishop's recent book, Pattern Recognition and Machine Learning, the updating equations for this are 3.50 and 3.51 on page 153. I felt OK with using a normal prior based on some past research I did in multiplayer game match-making with Shane Reese at BYU. The tricky part comes with using exact inference for updating the posterior. The updating method is very sensitive to the prior covariance matrix. I start with a diagonal covariance matrix, and if the initial player variances I choose are too high, the +/- estimates can go to infinity after several updates. I thought this was related to the data sparsity causing an ill-conditioned update matrix, but I thought I'd ask in case you'd had any experience with this type of problem.

Have you dealt with an issue like this before? If I set the prior variances low enough, I get reasonable results, and the ordering of the final ranking is fairly robust to changes in the prior. It's just the estimation process itself that doesn't "feel" as robust as I'd prefer, so I don't know that I trust the adjusted values (final coefficients) to be meaningful.

I don't think I can use MCMC in this situation either because trying to get 100,000 samples using 38,000+ data points and 400+ parameters feels intractable to me. I could be wrong there as well since I suppose I only need to include the current players in each match-up within the log likelihood. But it would still take quite a bit of time.

It would also be nice to go with the sequential updating version if possible since I could provide adjusted +/- values instantly after each game, if not after each match-up.

1. I'd try the scaled inverse Wishart prior distribution as described in my book with Hill. This allows the correlations to be estimated from data in a way that still allows you to provide a reasonable amount of information about the scale parameters.

2. I'd go with the estimation procedure that gives reasonable estimates, then do some posterior predictive checks, as described in chapter 6 of Bayesian Data Analysis. (Sorry for always referencing myself; it's just the most accessible reference for me!) This should give you some sense of the aspects of the data that are not captured well by the model.

3. Finally, you can simulate some fake data from your model and check that your inferential procedure gives reasonable estimates. Cook, Rubin, and I discussed a formal way of doing this, but you can probably do it informally and still build some confidence in your method.
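For reference, the sequential version of the normal-normal update Josh mentions (Bishop's eqs. 3.50-3.51) can be written in a few lines. This toy sketch (known noise precision, two parameters, invented data) accumulates the natural parameters S_inv and b = S_inv * m, which avoids inverting the covariance at every step and sidesteps some of the conditioning trouble he describes:

```python
import random

random.seed(5)

# Sequential normal-normal update for Bayesian linear regression with known
# noise precision, accumulating natural parameters instead of covariances.
beta_noise = 1.0 / 0.5 ** 2              # known noise precision (1/sigma^2)
prior_var = 10.0                          # diagonal prior variance
S_inv = [[1.0 / prior_var, 0.0], [0.0, 1.0 / prior_var]]
b = [0.0, 0.0]                            # prior mean is zero

def inv2(M):
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

true_w = (1.5, -2.0)                      # invented "true" coefficients
for _ in range(300):
    x = random.uniform(-1, 1)
    t = true_w[0] + true_w[1] * x + random.gauss(0, 0.5)
    phi = (1.0, x)                        # basis: intercept and slope
    for i in range(2):                    # one-observation precision update
        for j in range(2):
            S_inv[i][j] += beta_noise * phi[i] * phi[j]
        b[i] += beta_noise * t * phi[i]

S = inv2(S_inv)
m = [S[i][0] * b[0] + S[i][1] * b[1] for i in range(2)]
print("posterior mean:", [round(w, 2) for w in m])  # approaches (1.5, -2.0)
```

The precision-form update is exactly equivalent to the batch posterior, so the order of the match-ups doesn't matter; what does matter, as Josh found, is the prior variance sitting on the diagonal at the start.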

## November 10, 2007

### Survey weighting is a mess

Dave Judkins writes, regarding my Struggles with Survey Weighting and Regression Modeling paper,

I am hoping you might be able to clarify a point in your approach. How does a variable like number of phone lines in the house get used in equation 5? (Given that N.pop and X.pop are not available.) Does your work in Section 3 apply only to X variables with known population distributions?

My student and I are working on how to deal with these "non-census variables." The Bayesian answer is that you need to know the N's for the crosstabs of these non-census variables and the census variables. Since only the census variables are known, the relative N's for the non-census variables are unknown; treating them as random variables, they need a prior distribution, etc etc. Inference is done on these by making an assumption about selection probability (e.g., that households with multiple phones are twice as likely to be picked, and households with intermittent service are half as likely to be picked, compared to households with one phone line). My conjecture is that if you have a simple flat prior on the unknown multinomial probabilities, this reduces to some sort of inverse-probability-weighting analysis, and that maybe one can do better using a more structured prior (i.e., a hierarchical model). But for now it's all talk and no action from me on this!

Dave replied with some information on how they adjust for non-Census variables at Westat and links to this recent paper of his on sample-based raking, work that started around 1987:

Judkins, D., Nadimpalli, V. and Adeshiyan, S. (2005). Replicate control totals. Proceedings of the Section on Survey Research Methods of the American Statistical Association, pp 3167-3171.

Which reminds me of this 2001 paper by Cavan, Jonathan, and myself on poststratifying without population-level information.
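For readers unfamiliar with raking: it is iterative proportional fitting, alternately scaling rows and columns of the sample crosstab until both sets of control totals are matched. A toy sketch with invented numbers:

```python
# Raking (iterative proportional fitting) on a toy 2x2 table: alternately
# scale rows and columns of the sample crosstab until both margins match
# the known control totals. All numbers are made up.
sample = [[40.0, 60.0], [80.0, 20.0]]   # e.g., phone lines (1, 2+) by region
row_totals = [150.0, 50.0]              # known population margins
col_totals = [120.0, 80.0]

for _ in range(50):
    for i in range(2):                  # match the row margins
        s = sum(sample[i])
        sample[i] = [v * row_totals[i] / s for v in sample[i]]
    for j in range(2):                  # match the column margins
        s = sample[0][j] + sample[1][j]
        for i in range(2):
            sample[i][j] *= col_totals[j] / s

print([[round(v, 1) for v in row] for row in sample])
```

After convergence the adjusted table matches both margins while preserving the sample's interaction structure, which is why raking is the standard tool when only the margins (and not the full crosstab) are known.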

## November 8, 2007

### Those people who go around telling you not to do posterior predictive checks

I started to post this item on posterior predictive checks and then realized I had already posted it several months ago! Memories (including my own) are short, though, so here it is again:

A researcher writes,

I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don't quite understand, and was wondering if you have perhaps run into this criticism. Here's the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process and a nonhomogeneous Poisson process, in which the rate is a function of time (e.g., a loglinear model for the rate, which I'll call lambda).

I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square statistic, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.

If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a "Bayesian p-value." The model with the larger p-value is a better fit, just like my frequentist friends are used to.

Now to the criticism. An anonymous reviewer suggests this approach is weakened by "using the observed data twice." Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don't see how this is a problem, in the sense that empirical Bayes is problematic to some because it uses the data first to estimate a prior distribution, then again to update that prior. I am also not interested in "degrees of freedom" in the usual sense associated with MLEs either.

I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.

My thoughts:

1. My first thought is that the safest choice is the nonhomogeneous process since it includes the homogeneous as a special case (in which the rate has zero variance over time). This can be framed as a modeling problem in which the variance of the rate is an unknown parameter which must be nonnegative. If you have a particular parametric model (e.g., log(rate(t))=a+b*x(t)+epsilon(t), where epsilon(t) has mean 0 and sd sigma), then the homogeneous model is the special case b=sigma=0.

2. From this perspective, I'd rather just estimate a, b, and sigma and see to what extent the data are consistent with b=0 and sigma=0.

3. I agree with you that "using the data twice" is not usually such a big deal. It's a n vs. n-k sort of thing.

4. I'm not thrilled with the approach of picking the model with the larger p-value. There are lots of reasons this might not work so well. I'd prefer (a) fitting the larger model and taking a look at the inferences for a, b, and sigma; and maybe (b) fitting the smaller model and computing a chi-squared statistic to see if this model can be rejected from the data.
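The chi-square discrepancy check itself is easy to set up outside WinBUGS too. Here is a sketch with invented data from a drifting rate, deliberately fit (wrongly) as a homogeneous Poisson process with a conjugate Gamma prior; the realized-vs-replicated comparison then flags the misfit:

```python
import math
import random

random.seed(6)

def rpois(lam):
    # Knuth's Poisson sampler
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

# Invented data from a NONhomogeneous process (rate drifts upward),
# fit as a homogeneous Poisson with a conjugate Gamma prior on lambda
true_rates = [1.0 + 1.0 * t for t in range(20)]
y = [rpois(r) for r in true_rates]

a0, b0 = 1.0, 1.0                           # Gamma(shape, rate) prior
a_post, b_post = a0 + sum(y), b0 + len(y)   # conjugate posterior

def chisq(data, lam):
    return sum((d - lam) ** 2 / lam for d in data)

draws, exceed = 2000, 0
for _ in range(draws):
    lam = random.gammavariate(a_post, 1.0 / b_post)  # posterior draw
    y_rep = [rpois(lam) for _ in y]                  # replicated data
    if chisq(y_rep, lam) >= chisq(y, lam):
        exceed += 1

p_value = exceed / draws
print("posterior predictive p-value:", p_value)      # small here: misfit detected
```

Note that this p-value is used to check whether the constant-rate model can reproduce the spread in the data, not (per point 4 above) to pick whichever model happens to score higher.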

Still more . . .

And here's more on the ever-irritating topic of people who go around telling you not to use posterior predictive checks because they "use the data twice." Grrrrr... The posterior predictive distribution is p(y.rep|y). It's fully Bayesian. Period.

## November 7, 2007

### To our loyal readers . . .

Sorry for all the red-state, blue-state stuff. We'll be giving you more statistics-as-usual and miscellaneous social science soon . . . In the meantime, you can read these papers:

Using redundant parameterizations to fit hierarchical models (with Zaiying Huang, David van Dyk, and John Boscardin; to appear in JCGS)

Weight loss, self-experimentation, and web trials: a conversation (with Seth Roberts; to appear in Chance)

Manipulating and summarizing posterior simulations using random variable objects (with Jouni Kerman; to appear in Statistics and Computing)

## November 6, 2007

### Statistical challenges in estimating small effects

My immediate reaction is that we won't get people away from these mistakes as long as we talk in terms of "statistical significance" and even power, since these concepts are just too subtle for most people to understand, and they distract from the real issues. Somewhat influenced by others, I spend quite a bit of time eradicating the term "statistical significance" from colleagues' papers. I suspect that as long as the world sees statistical analysis as dividing "findings" into positives and negatives then the nonsense will keep flowing, so an important step in dealing with this is to change the terminology. In your example you seem to be arguing too much on his ground by focussing on the fact that although he data-dredged a significant p-value, your p-value is not significant. (So the ignorant editor or reader may see it as technical squabbling between statisticians rather than being forced to deal with the real issues about precision of estimation or lack of information.)

I agree entirely that the problem is with the framework of effects as true/false, but this is the very framework that "statistical significance" is built around and your article makes that concept very central by continually referring to "what if the effect is not statistically significant?" etc. I think the focus should be on how dangerous it is to overinterpret small studies with vast imprecision, and I'm not sure why this can't be clarified by sticking to the precision (or information) concept. I still haven't looked again at your Type S and Type M but on the face of it wonder if they may just confuse by adding more layers. Statistical significance gets it wrong because it focuses on null hypotheses (usually artificial), but when you say Type S it almost sounds similar in that you are thinking of truth/falsity with respect to the sign, rather than uncertainty about effects...?

My big point in considering Type S errors is to move beyond the idea of hypotheses being true or false (that is, to move beyond the idea of comparisons being exactly zero), but John has a point, that I still have to decide how to think about statistical significance. The problem is that, from the Bayesian perspective, you can simply ignore statistical significance entirely and just make posterior statements like Pr (theta_1 > theta_2 | data) = 0.8 or whatever, but such statements seem silly given that you can easily get impressive-seeming probabilities like 80% by chance.
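That last point is easy to check by simulation (flat prior, known standard errors, everything invented): when two effects are truly equal, "impressive" posterior probabilities like 0.8 show up in about a fifth of experiments.

```python
import math
import random

random.seed(7)
Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Two effects that are truly EQUAL, each estimated with known standard error.
# With a flat prior, Pr(theta_1 > theta_2 | data) = Phi(difference / its sd).
se, sims, high = 1.0, 10000, 0
for _ in range(sims):
    est1 = random.gauss(0, se)      # true theta_1 = theta_2 = 0
    est2 = random.gauss(0, se)
    if Phi((est1 - est2) / (se * math.sqrt(2.0))) > 0.8:
        high += 1

# About a fifth of pure-noise experiments yield an "80% posterior probability"
print("share with Pr(theta_1 > theta_2 | data) > 0.8:", high / sims)
```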

## November 1, 2007

### A statistician does web analytics

I sometimes play with Google Analytics to see the number of daily visitors on our blog and where they are coming from. The charts of daily visits look a bit like this:

Clearly, there is an upwards trend, but the influence of the day of the week messes everything up. I exported the data into a text file, and typed a line into R:

The trend component shows what I am really interested in: the trough of summer, followed by a relatively consistent rising trend. Every now and then another site will refer to our blog, temporarily increasing the traffic, and Andrew's cool voting plots are responsible for the latest spike.

Setting the stl function's t.window parameter to 14, 21 or more will smooth the trend a bit more. The model is imperfect because new visitors do come in bursts, but leave more slowly. Perhaps we should do a better Bayesian model for time series decomposition, unless someone else has already done this.
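Since the R call itself got lost in the page transfer, here is a rough Python stand-in for what an stl-style decomposition does, on fabricated visit counts: a centered one-week moving average for the trend, then day-of-week averages of the detrended series for the seasonal component.

```python
import random
import statistics

random.seed(8)

# Fabricated daily visits: rising trend + day-of-week pattern + noise
weekday_effect = [30, 25, 20, 15, 10, -40, -60]        # Mon..Sun, weekend dips
visits = [200 + 0.5 * t + weekday_effect[t % 7] + random.gauss(0, 10)
          for t in range(20 * 7)]

# Trend: a centered one-week moving average cancels the weekly cycle
def weekly_trend(y, t):
    lo, hi = max(0, t - 3), min(len(y), t + 4)
    return statistics.fmean(y[lo:hi])

trend = [weekly_trend(visits, t) for t in range(len(visits))]

# Seasonal: average detrended value for each day of the week
detrended = [v - tr for v, tr in zip(visits, trend)]
seasonal = [statistics.fmean(detrended[d::7]) for d in range(7)]
print("estimated day-of-week effects:", [round(s, 1) for s in seasonal])
```

Widening the moving-average window plays the same role as increasing stl's t.window: more smoothing of the trend at the cost of responsiveness to genuine bursts of traffic.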

Posted by Aleks Jakulin at 3:15 PM

## October 31, 2007

Nick Firoozye writes,

I [Firoozye] wanted to point your attention to the following podcast by Ian Ayres on Supercrunchers, where he shows himself an enthusiastic (if perhaps a bit naïve) proponent of the statistical method. Entertaining, definitely. One thing though that I thought you might be interested in is Russ Roberts’ (the interviewer's) own skepticism over the econometric method, which I think probably warrants a response. It may be that Roberts’ own view is due to his now-Austrian economics slant (i.e., somewhat anti-formalist approach) or perhaps to the fact that mainstream econometrics is a frequentist pursuit and one might question the honesty of the results as a consequence.

I don't really have much to add here, except that the problem noted by Roberts (it's hard to know whether to believe a statistical study) is even more of a problem with non-statistical empirical studies (i.e., anecdotes). I think Roberts might be overstating the problem because he is focusing on issues where he already had a strong personal opinion even before seeing data analyses. (He mentions the examples of concealed handguns and anti-theft devices on cars.) But there are a lot of areas where we have only weak opinions which can indeed be swayed by data (see here for some examples). These cases are important in their own right and also can serve as benchmarks for the success of statistical analysis, so that we can trust good analyses more when they're applied to tougher problems. This is one way that applied statistics proceeds, by exemplary analyses of problems that might not be hugely important on their own terms but serve as useful templates. Consider, for example, the book by Snedecor and Cochran: it's full of examples on agricultural field trials. Sure, these are important, but these methods have been useful in so many other fields. This is a great example, actually: Snedecor and his colleagues worked on agricultural trials because they cared about the results--these were not "toy examples" or thought experiments--and the resulting methods endured.

## October 29, 2007

### Distinguishing association from causation

I was pointed to Distinguishing Association from Causation: A Background for Journalists (there is also a PDF version). Here is my summary of their executive summary:

• Scientific studies that show an association between a factor and a health effect do not necessarily imply that the factor causes the health effect.

• Randomized trials are studies in which human volunteers are randomly assigned to receive either the agent being studied or an inactive placebo, usually under double-blind conditions.

• The findings of animal experiments may not be directly applicable to the human situation because of genetic, anatomic, and physiologic differences between species and/or because of the use of unrealistically high doses.

• In vitro experiments are useful for defining and isolating biologic mechanisms but are not directly applicable to humans.

• The findings from observational epidemiologic studies are directly applicable to humans, but the associations detected in such studies are not necessarily causal.

• Useful, time-tested criteria for determining whether an association is causal include:

  • Temporality. For an association to be causal, the cause must precede the effect.

  • Strength. Scientists can be more confident in the causality of strong associations than weak ones.

  • Dose-response. Responses that increase in frequency as exposure increases are more convincingly supportive of causality than those that do not show this pattern.

  • Consistency. Relationships that are repeatedly observed by different investigators, in different places, circumstances, and times, are more likely to be causal.

  • Biological plausibility. Associations that are consistent with the scientific understanding of the biology of the disease or health effect under investigation are more likely to be causal.

• Studies that include appropriate statistical analysis and that have been published in peer-reviewed journals carry greater weight than those that lack statistical analysis and/or have been announced in other ways.

• Claims of causation should never be made lightly.

But all this isn't about causation vs. association; it's about better studies or worse studies. Association and causation are not binary categories. Instead, there is a continuum: from simple models on observational data (correlation between two variables), through more sophisticated models on observational data that include covariates (regression, structural equation models), through yet more sophisticated models on observational data that take sample selection bias into consideration (Rubin's propensity score approach), to often simple models on controlled data (randomized experiments). But the mysterious causal "truth" is still out there. Philosophers these days aren't even convinced that the notion of causality is powerful enough as a model of reality.
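The middle of that continuum can be made concrete. Here is a minimal sketch of inverse-propensity weighting, one simple version of the propensity-score idea, on made-up observational data; the numbers, and the stratification on a single binary covariate, are purely illustrative:

```python
# Toy observational data: covariate x (0/1) affects both treatment and outcome.
# (Hypothetical numbers, chosen only to illustrate the adjustment.)
data = [
    # (x, treated, y)
    (0, 0, 1.0), (0, 0, 1.2), (0, 0, 0.9), (0, 1, 2.1),
    (1, 0, 3.0), (1, 1, 4.2), (1, 1, 3.9), (1, 1, 4.1),
]

# Step 1: estimate the propensity score P(treated | x) within each stratum of x.
def propensity(x):
    group = [t for (xi, t, _) in data if xi == x]
    return sum(group) / len(group)

# Step 2: inverse-propensity-weighted means for treated and control units.
def ipw_mean(treated_flag):
    num = den = 0.0
    for x, t, y in data:
        e = propensity(x)
        w = 1 / e if treated_flag else 1 / (1 - e)
        if t == treated_flag:
            num += w * y
            den += w
    return num / den

effect = ipw_mean(1) - ipw_mean(0)  # adjusted treatment-control difference
```

In this toy example the naive treated-minus-control difference is 2.05, but weighting by the estimated propensity scores pulls the adjusted difference down to about 1.07, because the covariate is associated with both treatment and outcome.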

In the past, I've often unfairly complained about studies after having read misleading journalistic reports, so this report is a timely one. But since the report was paid for by large pharmaceutical corporations, people may wonder whether there is bias or some sort of an agenda in it.

My quick impression is that they're promoting best practices in statistical methodology, practices that all these companies subscribe to. But there could be greater use of cheaper observational studies with better modeling (such as the propensity score approach, or even just better regression modeling) in place of expensive randomized experiments, and society might be better off as a result. Moreover, there is the issue of statistical versus practical significance. What do you think?
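On that last point, a quick back-of-the-envelope calculation (the effect size and noise level are hypothetical) shows how a practically negligible effect becomes "statistically significant" once the sample is large enough:

```python
import math

# A clinically trivial effect: the treatment lowers mean blood pressure by
# 0.1 mmHg (hypothetical numbers), with individual standard deviation 10 mmHg.
effect, sd = 0.1, 10.0

def z_score(n_per_arm):
    se = sd * math.sqrt(2 / n_per_arm)  # standard error of a two-arm difference
    return effect / se

# Small trial: nowhere near significance.  Huge trial: "highly significant",
# yet the effect is the same negligible 0.1 mmHg either way.
z_small = z_score(100)        # ~0.07
z_huge = z_score(4_000_000)   # ~14
```

The effect estimate's practical importance does not change with n; only our certainty about its size does, which is exactly the statistical-versus-practical distinction.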

Posted by Aleks Jakulin at 3:56 PM | Comments (11) | TrackBack

### Anova

Cari Kaufman writes,

I am writing a paper on using Gaussian processes for Bayesian functional ANOVA, and I'd like to draw some connections to your 2005 Annals paper. In my own work I've chosen to use a 1-1 reparameterization of the cell means, that is, to constrain the levels within each factor. But I am intrigued by your use of exchangeable levels for all factors, and I'm hoping you can take a few minutes to help me clarify your motivation for this decision. Since not all parameters are estimable under the unconstrained model, don't you encounter problems with mixing when the sums of the levels trade off with the grand mean? It seems in many situations it's advantageous to have an orthogonal design matrix, especially when the observed levels correspond to all possible levels in the population. Do you have any thoughts on this you can share?

I should say I found the paper very useful, especially your graphical representation of the variance components. I also like your distinction between the superpopulation and finite population variances, which helped me clarify what happens when generalizing to functional responses. Basically, we can share information across the domain to estimate the superpopulation variances by having a stationary Gaussian process prior, but the finite population variances can differ over the domain, which gives some nice insight into where various sources of variability are important. (At the moment I'm working with climate modellers, who can really use maps of where various sources of variability show up in their output.)

My reply: I'm not quite sure what the question is, but I think you're pointing out the redundant parameterization issue, that if we specify all levels of a factor, and then have other crosscutting or nested factors (or even just a constant term), then the linear parameters are not all identifiable. I would deal with this issue by fitting the large, nonidentified model and then summarizing using the relevant finite-population summaries. We discuss this a bit in Sections 19.4-19.5 and Chapters 21-22 of our new book.

A couple notes on this:

1. Mixing of the Gibbs sampler can be slow on the original, redundant parameter space but fast on the transformed space, which is what we really care about. Also, things work better with proper priors. My new thing is weakly informative priors, which don't include all your prior information but act to regularize your inferences and keep the algorithms in a reasonable space where they can converge faster. The orthogonality that you want can come in this lower-dimensional summary.

2. The redundant-parameter model is identified, if only weakly, as long as we use proper prior distributions on the variance parameters. In Bayesian Data Analysis and in my 2005 Anova paper, I was using flat prior distributions on these "sigma" parameters. But since then I've moved to proper priors, or, in the Anova context, hierarchical priors. See this paper for more information, including an example in Section 6 of the hierarchical model for the variance parameters.
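A toy calculation (hypothetical numbers, plain Python) shows why the redundancy is harmless once you look at finite-population summaries: shifting all the level effects by a constant and absorbing the shift into the grand mean leaves the cell means, and the standard deviation of the centered effects, unchanged:

```python
import statistics

# Redundant one-way ANOVA parameterization: grand mean mu plus an effect
# alpha_j for *every* level, with no sum-to-zero constraint.
mu = 5.0
alpha = [1.0, -0.5, 2.0, -2.5]   # hypothetical level effects

def cell_means(mu, alpha):
    return [mu + a for a in alpha]

# Move along the nonidentified direction: alpha_j + c, mu - c.
c = 3.0
mu_shift, alpha_shift = mu - c, [a + c for a in alpha]
assert cell_means(mu, alpha) == cell_means(mu_shift, alpha_shift)

# The finite-population summary (sd of the centered effects) is also invariant:
def finite_pop_sd(effects):
    center = statistics.mean(effects)
    return statistics.stdev(e - center for e in effects)
```

Any inference that is a function of the cell means, or of the centered effects, is the same under every point in the nonidentified ridge, which is why fitting the large model and summarizing afterward works.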

## October 22, 2007

### Survey weighting and regression modeling

I [Mike] have one specific question about your article in Statistical Science on weighting and multilevel regression models: do the regression results in Table 1 use the procedure you describe in Section 1? That is, does the model include interactions between X and z, or does it use design variables with main effects for the relation (y on z) of interest and simply report the coefficient for y on z? I couldn’t really tell, but perhaps I missed something.

I guess I have another question: on page 157 in the last full paragraph you state that it is not clear why a simple linear regression of y on z in the entire population would be of interest. That implies that it is not of interest. The first line of 1.4 discusses the regression of y on z. If we had all the data in the population, would we not simply compute the simple linear regression parameter estimates and report those as the relationship between y and z (assuming linearity)? If not, what are we trying to estimate with the E(y|z) function? I understand that it would be more interesting to look at y on z and X if we had tons of data, but that did not appear to be the motivation at the start of 1.4.

Related to this, I see that the population proportions of men and women enter into equation (4) through Bayes’ theorem because you don’t have many people of a single height. In the second example (page 158) you might have E(male|white=1) etc. from population data, such as census data in the geographical area. You could use that, couldn’t you, instead of the proportions white among males in the sample and then Bayes’ theorem?

Finally, about implementing this idea, perhaps we need groups of statisticians inside federal agencies to build recommendations for multilevel models for various outcomes and relationships among variables in place of (or in addition to) the survey statisticians developing complicated weights? What do you think?

1. The details are given in the second column of p.158. The model does not include interactions, and we just use the coefficient of z.

2. My point on p.157 that you noted is that, once you consider an additional predictor in the model, you have to consider that the regression of y on z might not be linear. In which case, yes, you can certainly create some summary such as the slope that you'd get by regressing y on z given all the data--but it's not clear why you'd want it. The E(y|z) function is still clearly defined, though, even if nonlinear. There's a paper by Korn and Graubard in the American Statistician several years ago that discusses this point.

3. For equation (4), even if you had many people at any single height, you'd want to adjust using the population dist of men and women, to correct for differential nonresponse rates. In the Social Indicators Survey example, yes, we poststratified using census numbers.

4. Yes, I think that what is needed is a set of worked examples showing how the hierarchical modeling can work. Once we have the examples, we can have guidelines. But I don't really have the examples yet--note the "struggles" in the title!
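The poststratification adjustment in point 3 can be sketched in a few lines. This is a minimal illustration with made-up response rates, stratum means, and census shares:

```python
# Hypothetical survey: women respond at a higher rate than men, so the raw
# sample is 70% female even though the census says the population is 52% female.
sample = {
    "female": {"n": 700, "mean_y": 6.0},   # within-stratum sample means
    "male":   {"n": 300, "mean_y": 4.0},
}
census_share = {"female": 0.52, "male": 0.48}

# Raw (unweighted) estimate simply averages all respondents:
n_total = sum(s["n"] for s in sample.values())
raw = sum(s["n"] * s["mean_y"] for s in sample.values()) / n_total

# Poststratified estimate reweights each stratum to its known census share:
post = sum(census_share[g] * sample[g]["mean_y"] for g in sample)
```

The raw mean (5.4 here) overweights women because they responded at a higher rate; poststratifying to the census shares pulls the estimate down to 5.04, correcting for the differential nonresponse.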

## October 19, 2007

### Is an Oxford degree worth the parchment it's printed on?

I recently graduated from a U.S. university with a degree in Political Science and Applied Mathematics. At the moment, I'm starting out at Oxford where I'm studying statistics. While I've always been interested in politics and statistics, I didn't start to combine the two until my last year of college, and even then, only on occasion. . . . I saw your recent posting for post-docs at Columbia's Applied Statistics Center and thought about how much I would love that job, or one like it, at some point in the future. The practical question is this: I have been given a great opportunity to study at Oxford, but there is a question as to how much American institutions value Oxford degrees. I'm currently on track to get a master's degree followed by a doctorate in statistics. However, some old advisors are strongly discouraging me from pursuing a DPhil (Oxford's PhD) and instead think that I should get an American PhD in Political Science or Economics. While there are of course other factors in this decision, I was hoping you might have some advice. Would an Oxford DPhil be competitive for a job like the one you posted? Do you think I would need more substantial qualifications to teach in statistics or political science in the States?

1. We should be having these postdocs for the indefinite future, so I encourage you to apply in a few years. The top new PhD's in applied statistics can get good academic jobs right after graduating, but I think you can learn a lot in a postdoc position, especially ours, which is interdisciplinary but with a core in statistical methods.

The other cool thing about a postdoc (compared to a faculty position, or for that matter compared to admissions to college or a graduate program) is that you're hired based on what you can do, not based on how "good" you are in some vaguely defined sense. I like to hire people who know how to fit models and communicate with other researchers, and my postdocs have included a psychologist, an economist, and a computer scientist, along with several statisticians.

2. I have no sense of how Oxford degrees are valued. I would assume it has the same value as a degree at an American university. Oxford statistics has some great people, including Chris Holmes, Tom Snijders, and Brian Ripley. Recommendations from these guys would carry a lot of weight, at least in a statistics department. More important, you can probably do something interesting when you're in grad school and also learn some useful skills.

3. You also ask about getting a Ph.D. in statistics or political science or economics. My general impression is that, to teach in a department of X, it helps to have a Ph.D. in X. But some people can do a lot of statistics in a poli sci or econ dept, or vice versa. My other impression is that econ is a cartel. The individual econ professors I know are, without exception, nice people and excellent colleagues who do interesting and important research. But the field as a whole seems so competitive, I would think it could be an unpleasant setting to be in, academically. Statistics (and, to a lesser extent, political science) seems much less competitive to me. Substantively, much of the interesting and important work in applied economics is statistical, and my impression is you'd be better prepared to do the best work there if you come at it from a statistical background.

4. Update: I mentioned this to a colleague and he said that, if you're interested in getting an academic job in the U.S., it isn't a bad idea to spend a year or two at a top U.S. department so people get to know you. (This doesn't contradict my point 1 above.)

P.S. The student replies,

I was not expecting your negative view of economics, however. My interest in the field has (naturally) been on the applied side, more as a potential combination of political science and statistics than anything else, and I gave it as a potential PhD option merely to add more diversity to the list.

My reply: No, I think economics (and economists) are great. I'm just not sure I'd recommend an academic career in economics, since I think you can do similar work in other fields without the intense competitive atmosphere. But that's just my impression as an outsider. In any case, I'm a big fan of the work that's being done in economics, sociology, psychology, and various other social sciences (along with political science and statistics, of course).

## October 7, 2007

### Statistics in the real world

Here's an interesting and informative rant I received recently in the email:

This document is a consultant’s report to the Traverse City Convention & Visitor’s Bureau, quoted — literally photocopied into — a market analysis for an application for an approx. 270,000-square-foot shopping center. The full report is here. On page 6 of the .pdf, we are told the following:

“After extensive evaluation and testing of these variables [that possibly determine tourist visitor volume to Grand Traverse County] for their predictive ability, the Consultant determined there are three variables with statistically significant associations. These are population in Grand Traverse County, Gross Domestic Product (GDP), and the External Event dummy variable.

“The Consultant found GDP [national, not regional or local] alone is a significant predictor however [sic] it does not hold up in association with either Grand Traverse Population or the External Event dummy variable.”

The Consultant then goes on to run a regression using GT population and the dummy, but not GDP. The resulting equation has an adjusted R-square of .95, and F=87.0. While GT pop has a t-value=10.9 & p=.000012, the dummy isn’t significant (p=0.3). The Consultant thus takes GT population projections out to 2025 to forecast annual tourist visits for that time frame.

That seems rather sketchy to me. Correct me because I’m likely wrong, but the Consultant basically said that 95% of the variation in annual tourist visits was due to (predicted by) county population, and then used population projections to forecast future tourist visits. And even though GDP was a significant variable, she used population instead, with no explanation why. (Or, none that I can find.) GDP and population were apparently the only two significant variables (though we don’t know how population held up if she removed the insignificant dummy from the specification) of the host of variables she tested; e.g., DoD/military contracts, even though our military presence is limited to a couple Coast Guard helicopters. (And her regression is based on about 10 data points.)

Surely, local population can’t be the driver of tourist visits. It does seem reasonable that population is driven by tourism, since people who visit here might end up wanting to move here, no? That seems to be a questionable variable for trying to forecast tourism in the future, when at least one other significant variable, GDP, is available — even if that was found by data mining as well.
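This kind of spurious fit is easy to reproduce: regress one trending series on another and the R-squared will be high over ten annual observations regardless of any causal link. A quick sketch with made-up numbers:

```python
# Two series that merely share a time trend (made-up numbers): annual tourist
# visits and county population (thousands) over ten years.
years = list(range(10))
population = [77 + 1.5 * t for t in years]            # steady growth
visits = [1000 + 40 * t + noise for t, noise in
          zip(years, [5, -8, 3, 12, -6, 9, -11, 4, -2, 7])]

# Ordinary least squares of visits on population.
n = len(years)
xbar = sum(population) / n
ybar = sum(visits) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(population, visits))
sxx = sum((x - xbar) ** 2 for x in population)
slope = sxy / sxx
resid_ss = sum((y - (ybar + slope * (x - xbar))) ** 2
               for x, y in zip(population, visits))
total_ss = sum((y - ybar) ** 2 for y in visits)
r_squared = 1 - resid_ss / total_ss
# r_squared comes out close to 1 purely because both series trend upward;
# it says nothing about population *causing* visits.
```

With so few data points and a shared trend, a near-perfect fit is the expected outcome, not evidence of a causal driver.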

I wish I could say this is typical, but in my experience, local units of government, &c., pay money for analyses even more questionable than what I just presented. For example, the market study in which the above was quoted reports consumer demand in 2005 as $194,896,255 less than supply. Setting aside the problems this claim has in view of economic theory, the values labeled “demand” and “supply” are consumer expenditures and retail sales: retailers sold approx. $195 million more than consumers purchased. And there is no explanation of why this is; in 2005, within a 50-mile radius, consumers spent $1,371,392 on “News Dealers and Newsstands,” while retail sales in the same category were $0, and there is no explanation of that $1.4-million gap!

Well, I [my correspondent] guess there’s no real point to this email other than to complain, and shouting at the sky is getting me a lot of strange looks. I’ll close by just asking you to ask your students to get involved in their communities, and at the very least, act as bullshit detectors and raise their voices when something smells.

This certainly doesn't surprise me: I've seen worse from paid statistical consultants on court cases, including one from a consultant (nobody I've ever met or know personally in any way) who reportedly was paid hundreds of thousands of dollars for his services. The key problems seem to be:

1. Statistics is hard, and not many people know how to do it.

2. The people who need statistical analysis don't always know where to look.

Posted by Andrew at 12:32 AM | Comments (8) | TrackBack

## October 6, 2007

### If one person asks, others might be interested too . . .

Shane Murphy writes,

I am a graduate student in political science (interested in economics as well), and I was reading your recent blog posts about significance testing, and the problems common for economists doing statistics. Do you know of and recommend any books to students learning econometrics or statistics for social science?
Also, just in case your answer is your own book, "Data Analysis Using Regression and Multilevel/Hierarchical Models," is this book an appropriate way to learn "econometrics" (which is just statistics for economists, right?)?

My reply: Yes, I do recommend my book with Jennifer Hill. I also think it's the right book to learn applied statistics for economics. However, within economics, "econometrics" usually means something more theoretical, I think. You could take a look at a book such as Wooldridge's, which presents the theory pretty clearly.

Posted by Andrew at 8:28 PM | Comments (6) | TrackBack

## October 5, 2007

### More on significance testing in economics

After I posted this discussion of articles by McCloskey, Ziliak, Hoover, and Siegler, I received several interesting comments, which I'll address below. The main point I want to make is that the underlying problem--inference for small effects--is hard, and this is what drives much of the struggle with statistical significance. See here for more discussion of this point.

**Statisticians and economists not talking to each other**

Scott Cunningham wrote, surprised that I'd not heard of these papers before:

I wasn't expecting anything like what you wrote. I live in a bubble, and just assumed you were familiar with the papers, because in grad school, whenever I presented results and said something was significant (meaning statistically significant), I would *always* get someone else responding, "but is it _economically_ significant?"--meaning, at minimum, is the result basically a very precisely measured no effect? The McCloskey/Ziliak stuff was constantly being thrown at you by the less quantitatively inclined people (that set is growing smaller all the time), and I forgot for a moment that those papers probably didn't generate much interest outside economics.

I live in a bubble too, just a different bubble than Scott's.
He and others might be interested in this article by Dave Krantz on the null hypothesis testing controversy in psychology. Dave begins his article with:

This article began as a review of a recent book, What If There Were No Significance Tests? . . . The book was edited and written by psychologists, and its title was well designed to be shocking to most psychologists. The difficulty in reviewing it for [statisticians] is that the issue debated may seem rather trivial to many statisticians. The very existence of two divergent groups of experts, one group who view this issue as vitally important and one who might regard it as trivial, seemed to me an important aspect of modern statistical practice.

As noted above, I don't think the issue is trivial, but it is true that I can't imagine an article such as McCloskey and Ziliak's appearing in a statistical journal.

**Rational addiction**

Scott also writes,

BTW, the rational addiction literature is a reference to Gary Becker and Kevin Murphy's research program that applies price theory to seemingly "non-market phenomena," such as addiction. Rational choice would seem to break down as a useful methodology when applied to something like addiction. Becker and Murphy have a seminal paper on this from 1988. It's been an influential paper in the area of health economics, as numerous papers have followed it by estimating various price elasticities of demand, as well as testing the more general theory.

My reply to this: Yeah, I figured as much. It's probably a great theory. But, ya know what? If Becker and Murphy want to get credit for being bold, transgressive, counterintuitive, etc. etc., the flip side is that they have to expect outsiders like me to think their theory is pretty silly. As I noted in my previous entry, there's certainly rationality within the context of addiction (e.g., wanting to get a good price on cigarettes), but "rational addiction" seems to miss the point.
Hey, I'm sure I'm missing the key issue here, but, again, it's my privilege as a "civilian" to take what seems a more commonsensical position here and leave the counterintuitive pyrotechnics to the professional economists.

**The paradigmatic example in economics is program evaluation?**

Mark Thoma "disagreed mildly" with my claim that the null hypothesis of zero coefficient is essentially always false. Mark wrote:

I don't view the "paradigmatic example in economics" to be program evaluation. We do some of that, but much of what econometricians do is test the validity of alternative theories, and in those contexts the hypothesis of a zero coefficient can make sense. For example, New Classical models imply that expected changes in the money supply should not impact real variables. Thus, a test of a zero coefficient on expected money in an equation with a real activity variable as the dependent variable is a test of the validity of the New Classical model's prediction. These tests require sharp distinctions between models, i.e., finding variables that can impact other variables in one theory but not another, and that's something we try hard to find, but when such sharp distinctions exist I believe classical hypothesis tests have something useful to contribute.

Hmmm . . . I'll certainly defer to Mark on what is or is not the paradigmatic example in economics. I can believe that theory testing is more central. I'll also agree that important theories do have certain coefficients set to zero. I doubt, however, that in actual economic data such coefficients really would be zero (or, to be more precise, that coefficient estimates would asymptote to zero as sample sizes increase).
To wander completely out of my zone of competence and comment on Mark's money supply example: I'm assuming this is somewhat of an equilibrium theory, and short-term fluctuations in expected money supply could affect individual actors in the economy, which could then create short-term outcomes, which would show up in the data in some way (and then maybe, in good "normal science" fashion, be explained in a reasonable way to preserve the basic New Classical model). What I'm saying is: in the statistics, I don't think you'd really be seeing zero, and I don't think the Type 1 / Type 2 error framework is relevant.

**Getting better? And a digression on generic seminar questions**

Justin Wolfers writes that "the meaningless statements of statistical rather than economic significance are declining." Yeah, I think things must be getting better. Many years ago, Gary told me that his generic question to ask during seminars was, "What are your standard errors?" Apparently in poli sci, that used to stop most speakers in their tracks. We've now become much more sophisticated--in a good way, I think.

(By the way, it's good to have a few of these generic questions stored up, in case you fall asleep or weren't paying attention during the talk. My generic questions include: "Could you simulate some data from your fitted model and see if they look like your observed data?" and "How many data points would you have to remove for your effect estimate to go away?")

Justin uses a lot of bold type in his blog entries. What's with that? Maybe a good idea? I use bold for section headings, but he uses them all over the place.

**Sports examples**

Also, since I'm responding to Justin, let me comment on his use of sports as examples in his classes. I do this too--heck, I even wrote a paper on golf putting, and I've never even played macro-golf--but, as people have noted on occasion, you have to be careful with such examples because they exclude many people who aren't interested in the topic.
(And, unlike examples in biology, or economics, or political science, it's harder to make the case that it's good for the students' general education to become more familiar with the statistics of basketball or whatever.) So: keep the sports examples, but be inclusive.

Posted by Andrew at 8:37 PM | Comments (2) | TrackBack

### Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler

Scott Cunningham writes,

Today I was rereading Deirdre McCloskey and Ziliak's JEL paper on statistical significance, and then reading for the first time their detailed response to a critic who challenged their original paper. I was wondering what opinion you had about this debate. Are statistical significance and Fisher tests of significance as maligned and problematic as McCloskey and Ziliak claim? In your professional opinion, what is the proper use of seeking to scientifically prove that a result is valid and important?

The relevant papers are:

McCloskey and Ziliak, "The Standard Error of Regressions," Journal of Economic Literature, 1996.

Ziliak and McCloskey, "Size Matters: The Standard Error of Regressions in the American Economic Review," Journal of Socio-Economics, 2004.

Hoover and Siegler, "Sound and Fury: McCloskey and Significance Testing in Economics," Journal of Economic Methodology, 2008.

McCloskey and Ziliak, "Signifying Nothing: Reply to Hoover and Siegler."

My comments:

1. I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects--positive for some people, negative for others--but not averaging to exactly zero in the population.)
From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, "while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance." Certainly, I'd rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.

2. Hoover and Siegler's discussion of the logic of significance tests (Section 2.1) is standard but, I believe, wrong. They talk all about Type 1 and Type 2 errors, which are irrelevant for the reasons described in point 1 above.

3. I agree with most of Hoover and Siegler's comments in their Section 2.4, in particular with the idea that the goal in statistical inference is often not to generalize from a sample to a specific population, but rather to learn about a hypothetical larger population, for example generalizing to other schools, other years, or whatever. Some of these concerns can best be handled using multilevel models, especially when considering different possible generalizations. This is most natural in time-series cross-sectional data (where you can generalize to new units, new time points, or both) but also arises in other settings. For example, in our analyses of electoral systems and redistricting plans, we were careful to set up the model so that our probability distribution generalized to other possible elections in existing congressional districts, not to hypothetical new districts drawn from a common population.
4. Hoover and Siegler's Section 2.5, while again standard, is I think mistaken in ignoring Bayesian approaches, which limits their "specification search" approach to the two extremes of least squares or setting coefficients to zero. They write, "Additional data are an unqualified good thing, which never mislead." I'm not sure if they're being sarcastic here or serious, but if they're being serious, I disagree. Data can indeed mislead on occasion.

Later Hoover and Siegler cite a theorem that states "as the sample size grows toward infinity and increasingly smaller test sizes are employed, the test battery will, with a probability approaching unity, select the correct specification from the set. . . . The theorem provides a deep justification for search methodologies . . . that emphasize rigorous testing of the statistical properties of the error terms." I'm afraid I disagree again--not about the mathematics, but about the relevance, since, realistically, the correct specification is not in the set, and the specification that is closest to the ultimate population distribution should end up including everything. A sieve-like approach seems more reasonable to me, in which more complex models are considered as the sample size increases. But then, as McCloskey and Ziliak point out, you'll have to resort to substantive considerations to decide whether various terms are important enough to include in the model. Statistical significance or other purely data-based approaches won't do the trick.

Although I disagree with Hoover and Siegler in their concerns about Type 1 error etc., I do agree with them that it doesn't pay to get too worked up about model selection and its distortion of results--at least in good analyses. I'm reminded of my own dictum that multiple comparisons adjustments can be important for bad analyses but are not so important when an appropriate model is fit.
I agree with Hoover and Siegler that it's worth putting in some effort in constructing a good model, and not worrying if said model was not specified before the data were seen.

5. Unfortunately my copy of McCloskey and Ziliak's original article is not searchable, but if they really said, "all the usual econometric problems have been solved"--well, hey, that's putting me out of a job, almost! Seriously, there are lots of statistical (thus, I assume, econometric) problems that are still open, most notably in how to construct complex models on large datasets, as well as more specific technical issues such as adjustments for sample surveys and observational studies, diagnostics for missing-data imputations, models for time-series cross-sectional data, etc etc etc.

6. I'm not familiar enough with the economics to comment much on the examples, but the study of smoking seems pretty wacky to me. First there is a discussion of "rational addiction." Huh?? Then Ziliak and McCloskey say "cigarette smoking may be addictive." Umm, maybe. I guess the jury is still out on that one . . . . OK, regarding "rational addiction," I'm sure some economists will bite my head off for mocking the concept, so let me just say that presumably different people are addicted in different ways. Some people are definitely addicted in the real sense that they want to quit but they can't; perhaps others are addicted rationally (whatever that means). I could imagine fitting some sort of mixture model or varying-parameter model. I could imagine some sort of rational addiction model as a null hypothesis or straw man. I can't imagine it as a serious model of smoking behavior.

7. Hoover and Siegler must be correct that economists overwhelmingly understand that statistical and practical significance are not the same thing. But Ziliak and McCloskey are undoubtedly also correct that most economists (and others) confuse these all the time.
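To see the distinction in numbers, here is a toy calculation (my numbers, purely illustrative, not from any of the papers under discussion): a z-statistic scales with sqrt(n), so a substantively negligible effect can be "statistically significant" in a huge sample, while a substantively large effect can fail to reach significance in a small one.

```python
import math

# Toy illustration: statistical significance tracks sample size, not importance.
def zstat(effect, sd, n):
    """z-statistic for a sample mean: effect divided by its standard error."""
    return effect / (sd / math.sqrt(n))

# A negligible effect in a huge sample: "statistically significant."
z_big_n = zstat(effect=0.005, sd=1.0, n=1_000_000)

# A large effect in a tiny sample: "not statistically significant."
z_small_n = zstat(effect=0.5, sd=1.0, n=9)

print(round(z_big_n, 1), round(z_small_n, 1))  # 5.0 1.5
```

The first estimate clears any conventional threshold despite being substantively tiny; the second misses the 1.96 cutoff despite being a hundred times larger.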
They have the following quote from a paper by Angrist: "The alternative tests are not significantly different in five out of nine comparisons (p<0.02), but the joint test of coefficient equality for the alternative estimates of theta_t leads to rejection of the null hypothesis of equality." This indeed does not look like good statistics. Similar issues arise in the specific examples. For instance, Ziliak and McCloskey describe how Becker, Grossman, and Murphy summarize their results in terms of t-ratios of 5.06, 5.54, etc., which indeed miss the point a bit. But Hoover and Siegler point out that Becker et al. also present coefficient estimates and interpret them on relevant scales. So they make some mistakes but present some things reasonably.

8. People definitely don't understand that the difference between significant and not significant is not itself statistically significant.

9. Finally, what does this say about the practice of statistics (or econometrics)? Does it matter at all, or should we just be amused by the gradually escalating verbal fireworks of the McCloskey/Ziliak/Hoover/Siegler exchange? In answer to Scott's original questions, I do think that statistical significance is often misinterpreted, but I agree with Hoover and Siegler's attitude that statistical significance tells you about the uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference and also from expanding models to include treatment interactions.

P.S. See here for more.

Posted by Andrew at 7:07 AM | Comments (10) | TrackBack

### An explanation of hypothesis testing

Aleks came across this somewhere.

Posted by Andrew at 6:43 AM | Comments (1) | TrackBack

## October 4, 2007

### Postdoc openings for Fall 2008

We'll be considering applications for more postdocs in the Applied Statistics Center.
As far as I can tell, this is the best statistics postdoctoral position out there: you get to work with fun, interesting people on exciting projects and make a difference in a variety of fields. You'll be part of an active and open community of students, faculty, and other researchers. It's a great way for a top Ph.D. graduate to get started in research without getting overwhelmed right away by the responsibilities of a faculty position. If this job had existed when I got my own Ph.D. way back when, I would've taken it. Just email me your application letter, c.v., and your papers, and have three letters of reference emailed to me. We will hire 0, 1, or more people depending on who applies and how they fit in with our various ongoing and planned projects in statistical methods, computation, and applications in social science, public health, engineering, and other areas.

Posted by Andrew at 9:05 AM | Comments (8) | TrackBack

## October 1, 2007

### Question on hypothesis testing

Mike Frank writes,

Hi, I'm a graduate student at MIT in Brain and Cognitive Sciences. I'm an avid reader of your blog and user of your textbook, and so I thought I would email you this question in the hopes you have thoughts on it. I'm in a strange position in my research in that I do a lot of Bayesian modeling of cognitive processes but then end up doing standard psychology experiments to test predictions of the models, where I have to use simpler frequentist statistical methods (which are standard in psychology; it's hard to publish without them) to analyze those data. The basic question is how to compare binomial data from two different conditions in an experiment when there are multiple datapoints from each individual in each condition (so the trials are not independent). The simplest option seemed to me to be to use a chi-square test (e.g., compare 54/56 trials correct in one condition with 43/56 trials correct in the other, aggregating across participants).
But I'm told this practice violates the independence assumption of the test. I'm not sure I totally understand why this is a problem here, but that may be a separate question entirely. In contrast, what most psychologists do is calculate a percentage correct for each individual and then do a paired t-test between the two sets of means. But I've read that using standard ANOVAs or t-tests on this type of binomial data violates the assumption of normal distribution of the data and is invalid (and can lead to bad inferences in many situations). More sophisticated people have recommended using a GLM with a logit link function so that it is appropriate to binomial data and then making it a mixed model which can include individuals (subjects) as a random effect. But if I have multiple comparisons between conditions that differ qualitatively (e.g., not along some particular continuum), it seems like I would need to run the GLM on different pairs of conditions and look for a significant effect of condition in each case, and that doesn't seem particularly elegant either (although at least more appropriate). What I'd really like is just a simple hypothesis test like a chi-square or t-test but appropriate to the form of the data.

My reply: The t-test comparing the means is correct. You compute the mean for each person and then do a person-level analysis. Thus, you're not actually using the total percentage correct, you're using the mean and sd across people for each condition. The chi-squared test is not so interpretable because it doesn't give you differences in proportions, it only gives you a hard-to-interpret p-value. Logistic regression is also fine.

Posted by Andrew at 12:42 PM | Comments (2) | TrackBack

## September 28, 2007

### Statistical consultants

Jeff Miller pointed me to his website. He offers statistical consulting. I'll also use this occasion to refer you again to Rahul.
Posted by Andrew at 1:42 AM | Comments (0) | TrackBack

## September 25, 2007

### Context is important: a question I don't know how to answer, leading to general thoughts about consulting

Someone writes in with a question that I can't answer but which reminds me of a general point about interactions between statisticians and others.

It seems that there should be an easy answer to this question yet I cannot find a satisfactory one (aka one that satisfies my committee).

1) I have a vector of values V and I find its median, m.

2) I feed vector V to a simulation and it returns a value for each iteration of the simulation, r_t, which is stacked into a vector R.

3) I take the mean of R, which is r_bar.

I want to now be able to compare m and r_bar. I want to be able to say if they are statistically different. V cannot be assumed to be normal, and the simulation is stochastic, but not random. Currently I am constructing a confidence interval around r_bar as: r_bar +/- 1.96*sd(R). But this does not seem right considering I cannot assume normality of the original "data," nor can I assume the simulation amounts to "random sampling." What would you recommend?

My reply: I'm sorry but I don't understand what you are asking. My generic advice is that it's hard to solve such problems without having more information on the context. This happens all the time to statisticians: people try to help us out by giving us what is essentially a probability problem, stripping out all content. But almost always a good answer depends on what these random variables actually represent. Once when I was teaching at Chicago I overheard a discussion of some students and faculty about some consulting problem that had come in, something about estimating the probability of a sequence of successive "heads" in some specified number of coin flips.
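(The generic coin-flip question itself, by the way, has a clean dynamic-programming answer; here's a sketch in Python. The point of the story, though, is that an answer like this, stripped of context, wasn't what the client actually needed.)

```python
def prob_run(n, k, p=0.5):
    """Probability of at least one run of k consecutive heads in n flips,
    where each flip comes up heads with probability p."""
    # state[j] = P(no run of length k so far, and the current trailing
    # run of heads has length exactly j), for j = 0, ..., k-1.
    state = [1.0] + [0.0] * (k - 1)
    for _ in range(n):
        new = [0.0] * k
        new[0] = (1 - p) * sum(state)   # a tail resets the trailing run
        for j in range(1, k):
            new[j] = p * state[j - 1]   # a head extends the run
        state = new
    return 1 - sum(state)               # complement of "no run ever appears"

print(prob_run(2, 2))  # 0.25: only HH contains a run of 2 heads in 2 flips
```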
It turned out to be for the goal of computing a p-value, and looking into the example more, it also became clear that this was not a very good analysis of these particular data.

A tough balance

It's a tough balance being a statistician: we can't go to the experimenters and start bossing them around--we have to respect what they're saying--but we also have to extract from them what is really important and cause them to question their statistical assumptions. I've seen statistical consultants err in both directions: either basically ignoring the client and trying to cram every problem into some narrow methodological framework, or taking all the client's words too seriously and becoming a technician, applying an inappropriate method without question.

Posted by Andrew at 9:49 AM | Comments (3) | TrackBack

## September 24, 2007

### Measuring interpersonal influence

David Nickerson gave a wonderful talk at our quantitative political science seminar last week. He described three different experiments he did, and it was really cool. Here's the paper, and here are Alex's comments on it. I've never really done an experiment. I like the idea but somehow I've never gotten organized to do one. I want to, though. I feel like an incomplete statistician as things currently stand.

Posted by Andrew at 12:50 AM | Comments (1) | TrackBack

## September 20, 2007

### Control only for the covariates that matter

The NY Times published an awful article, "25th Anniversary Mark Elusive for Many Couples," that deserves a comment. Here is a quote:

Among men over 15, the percentage who have never been married was 45 percent for blacks, 39 percent for Hispanics, 33 percent for Asians and 28 percent for whites. Among women over 15, it was 44 percent for blacks, 30 percent for Hispanics, 23 percent for Asians and 22 percent for whites.

No wonder! The median age for whites in the US in 2000 was 37.7, for Asians 32.7, for blacks 30.2 and for Hispanics 25.8.
11 years of age difference should make a difference when it comes to the probability of having been married, no? While they didn't control for age here, they did unnecessarily control for sex in this highly uninformative table-of-many-numbers: The gross JPEG artifacts that blur the fonts are theirs, not mine: they should have known to use PNG or GIF for figures with lots of text. Does anyone gain any insight from the difference between women's and men's probabilities other than noise? A similar nonsensical control appeared in "Men with younger women have more children," where the difference in optimum age difference between men (6) and women (4) is purely a statistical artifact if you go and read the paper. Yuck. I wouldn't have posted this if it hadn't made it to the 6th place of most emailed articles in the past 24 hours.

In summary, when displaying the data, control for things only when 1) you need to remove a known effect, or 2) controlling for things tells you something you didn't know before. And use graphs, not tables! And educate journalists about the basics of statistics!

Posted by Aleks Jakulin at 5:31 PM | Comments (3) | TrackBack

## September 19, 2007

### The past, present, and future of statistics

This is going to be a letdown after this grand title . . . . Lingzhou Xue writes,

I just read recently the talk titled "The Future of Statistics" from Bradley Efron. Actually, I see some enlightening ideas but also fall a little puzzled. In this talk, Efron first gave a simple review of the rapid development of statistics last century. He is humorous to comment that "The history of statistics in the Twentieth Century is the surprising and wonderful story of a ragtag collection of numerical methods coalescing into a central vehicle for scientific discovery". After this humor is just what puzzles me and what I really hope your instructions and ideas.
Efron cited a simple example to illustrate the limitations of classical statistics in the model selection problems and also exploit a figurative comment that "History seems to be repeating itself: we've returned to an era of ragtag heuristics propelled with energy but with no guiding direction." Finally, he presented a helpful instruction that "During such time it pays to concentrate on basics and not tie oneself too closely to any one technology."

My reply: Efron is an interesting example of a leading statistical researcher who has developed and used a diverse set of tools, most notably model-based empirical Bayes and the nonparametric bootstrap and permutation tests. So he, more than most, is justified in seeing statistics as being extremely successful without needing a guiding direction. In the hedgehog/fox distinction, he's a fox. It's hard for me to make generalizations about the field of statistics since there are so many different strands. I guess some sort of analysis based on papers and citation counts would give a clue. I guess it is true that statistics in the 1950s, like politics in the 1950s, had a unity that we didn't see before and don't see today. 1950s-style statistics was limited, but it was all people had, and so they used it well. It broke down when it got overwhelmed with data.

Posted by Andrew at 2:10 AM | Comments (1) | TrackBack

## September 14, 2007

### Most science studies appear to be tainted by sloppy analysis

Boris pointed me to this article by Robert Lee Hotz:

We all make mistakes and, if you believe medical scholar John Ioannidis, scientists make more than their fair share. By his calculations, most published research findings are wrong. . . . "There is an increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims," Dr. Ioannidis said. "A new claim about a research finding is more likely to be false than true."
The hotter the field of research the more likely its published findings should be viewed skeptically, he determined. Take the discovery that the risk of disease may vary between men and women, depending on their genes. Studies have prominently reported such sex differences for hypertension, schizophrenia and multiple sclerosis, as well as lung cancer and heart attacks. In research published last month in the Journal of the American Medical Association, Dr. Ioannidis and his colleagues analyzed 432 published research claims concerning gender and genes. Upon closer scrutiny, almost none of them held up. Only one was replicated. . . .

Ioannidis attributes this to "messing around with the data to find anything that seems significant," and that's probably part of it. The other part is that, even if all statistics are done according to plan, the estimates that survive significance testing will tend to be large--this is what we call "Type M error." See here for more discussion.

Posted by Andrew at 3:57 PM | Comments (2) | TrackBack

### Opportunity for a survey expert in NYC

Meg Lamm writes,

I and the market research firm I work for (Nancy Dodd Research) are looking for someone to hire who could meet with me a few times to train me in using SPSS and/or Excel for doing data analysis, and in survey writing techniques. If you or anyone you know might be interested, please contact me at meg@nancydodd.com or 212-366-1526.

Posted by Andrew at 7:06 AM | Comments (0) | TrackBack

## September 11, 2007

### Poststratification on variables that are not fully observed

Seth Wayland writes,

In Chapter 14.1 of your new book, the example uses only predictors for which you have census data at the state level. In the poststratification step, you just plug the values of those covariates into the model, and voilà, you have an estimate for that poststratification cell!
What about including further individual-level predictors in the model to account for probability of selection, such as household size and number of phones in the household, or even an individual-level predictor that might improve the model? How do you then calculate the estimate for each poststratification cell?

My response: yes, this is something we are struggling with. The long answer is that we would treat the population distribution of all the predictors, Census and non-Census variables (those desirable individual-level predictors which are only observed in the sample and not in the population), as unknown. We'd give it all a big fat prior distribution and do Bayesian inference. This sounds like a lot but I think it's doable using regression models with interactions. We're working on this now, starting with simple models with just one non-Census variable. The closest we've come so far is this paper with Cavan and Jonathan on poststratification without population inference (see blog entry here).

The short answer is that it should be possible to do a quick-and-dirty version of the above plan, estimating the joint distribution of Census and non-Census variables using point estimates for the distribution of non-Census variables given the Census variables, based on weighting using the survey data within each Census poststratification cell. This is only an approximation because it ignores uncertainty (for example, if a particular cell includes 4 people in single-phone households and 3 people in multiple-phone households, the weighted totals become 4 and 1.5, so the quick-and-dirty approach would use the point estimate of 4/(4+1.5) as the proportion in single-phone households in that cell, ignoring the uncertainty arising from sampling variability).
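To spell out the arithmetic of that quick-and-dirty point estimate, here is the example cell in a few lines of Python (the weight of 1/2 for people in multiple-phone households is my assumption, standing in for whatever inverse-selection-probability weight the survey would actually use):

```python
# Hypothetical cell: 4 respondents in single-phone households (weight 1)
# and 3 respondents in multiple-phone households (assumed weight 1/2).
respondents = [("single", 1.0)] * 4 + [("multi", 0.5)] * 3

w_single = sum(w for group, w in respondents if group == "single")  # 4.0
w_total = sum(w for _, w in respondents)                            # 4.0 + 1.5 = 5.5

# Point estimate of the proportion in single-phone households in this cell,
# ignoring sampling variability (the approximation noted above).
p_single = w_single / w_total
print(round(p_single, 3))  # 0.727
```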
I do think that this (the quick version, then the full version) is ultimately the way to go, since the poststratification strategy allows us to model the data and get small-area estimates, such as state-level opinions from national polls. As is often the case, the challenge in statistics is to include all relevant information (from the Census as well as the survey, and maybe also from other surveys), and to do this while setting up a model that is structured enough to take advantage of all these data but not so structured that it overwhelms this information.

Posted by Andrew at 7:36 AM | Comments (0) | TrackBack

### Statistical consulting

My former student Rahul Dodhia (actually Dave Krantz's former student, but I was on his committee and we did write a paper together) lives in Seattle and is a full-time statistical consultant. Here's the website for his company. He actually took the statistical consulting course from me at Columbia so I'm probably to blame for this! Anyway, I haven't actually seen him consult recently, but I expect he's doing a good job. He's located in Seattle but also works remotely.

Posted by Andrew at 2:42 AM | Comments (0) | TrackBack

## August 31, 2007

### A rant on the virtues of data mining

I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what the data would suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null. The response from several social scientists has been rather unappreciative, along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING!"
Partially as a result of this, and of the failures of government technology-monitoring programs, data mining researchers are trying to rename themselves to sound more politically unbiased. Of course I'm doing data mining, and I'm proud of it. Because data mining is summarization: it's journalism, it's surveying, it's mapping. That's where one gets ideas and impressions. Of course what data mining found isn't "true". The models underlying data mining are most definitely not "true". But a mean is informative even if the distribution isn't symmetric. The "scientific" approach corresponds to picking The One and Only Holy Hypothesis. Then you collect the data. Then you fit the model and verify whether it works or not. Then you write a paper. The good thing about the "scientific" approach is that you don't have to think, and that you need very little common sense. But real science is curiosity and pursuit of improved understanding of the world, not mindlessly following algorithms that can be taught even to imbeciles.

Let me analyze where the problem lies. There is data D. And there are multiple models M. In confirmatory data analysis (CDA), high prior probabilities are assigned to a single model and its negation (the null): so it is very easy to establish which of the two is better. In exploratory data analysis (EDA) and data mining, the prior over models is relatively flat. Yes, there are models underlying EDA too: if you rotate your scatter plot in three dimensions to get a good view of the phenomenon, your parameters are the rotations and you're doing kernel density estimation with your eyes. When you see a fit, you stop and save the snapshot. The problem is that no model in particular sticks out, so it's hard to establish the best one. Yes, it's hard to establish what "truth" is. "Truth" is the domain of religion. "Model", "data" and "evidence" are the domain of science.
Many of the hypotheses generated by people from theory might be understood as deserving higher prior probability: after all, they are based on experience. In turn, a flat prior includes many models that are unlikely. For that matter, one should use a bit of common sense in interpreting EDA results: because the prior was flat, if something looks fishy, subtract a little bit from it and study it in more detail. On the other hand, if you don't see something you think you should, add a little and study it in more detail. A CDA that tells you everything you already knew doesn't deserve a paper. But it's better to just eyeball the results with an implicit prior in your mind than to try to cook up a complex prior that will do the same. But once you've found a surprise, throw all the CDA you've got at it.

Posted by Aleks Jakulin at 2:10 PM | Comments (17) | TrackBack

## August 29, 2007

### The most beautiful people in the world . . . and a request for a favor (see the very bottom of this entry)

Ralph Blair sent this in. It's so horrible that I have to put it in the continuation part of the blog entry. I recommend you all stop reading right here. Stop . . . It's not too late!!!!!!!!!!! OK, here it is. No, no, no... (Here's the technical article explaining the statistical flaws in this stuff.) Mistakes are made all the time, of course, but it doesn't help when they are tied to wacky political agendas. The news article begins:

The Beautiful Person club is an exclusive one, and entry brings much - fame, wealth ... and daughters. Think of the most beautiful couples in the world - they all have daughters. Tom Cruise and Katie Holmes? Check. Denise Richards and Charlie Sheen? Check. Brangelina and Bennifer? Check and check.

Actually, we looked up a few years of People Magazine's 50 most beautiful people, and they were as likely as anyone else to have boys: One way to calibrate our thinking about Kanazawa's results is to collect more data.
Every year, People magazine publishes a list of the fifty most beautiful people, and, because they are celebrities, it is not difficult to track down the sexes of their children, which we did for the years 1995–2000. As of 2007, the 50 most beautiful people of 1995 had 32 girls and 24 boys, or 57.1% girls, which is 8.6 percentage points higher than the population frequency of 48.5%. This sounds like good news for the hypothesis. But the standard error is 0.5/sqrt(56) = 6.7%, so the discrepancy is not statistically significant. Let’s get more data. The 50 most beautiful people of 1996 had 45 girls and 35 boys: 56.2% girls, or 7.8 percentage points more than in the general population. Good news! Combining with 1995 yields 56.6% girls—8.1 percentage points more than expected—with a standard error of 4.3%, tantalizingly close to statistical significance. Let’s continue to get some confirming evidence. The 50 most beautiful people of 1997 had 24 girls and 35 boys—no, this goes in the wrong direction, let’s keep going . . . For 1998, we have 21 girls and 25 boys, for 1999 we have 23 girls and 30 boys, and the class of 2000 has had 29 girls and 25 boys. Putting all the years together and removing the duplicates, such as Brad Pitt, People’s most beautiful people from 1995 to 2000 have had 157 girls out of 329 children, or 47.7% girls (with standard error 2.8%), a statistically insignificant 0.8 percentage points lower than the population frequency. So nothing much seems to be going on here. But if statistically insignificant effects with a standard error of 4.3% were considered acceptable, we could publish a paper every two years with the data from the latest “most beautiful people.”

I don't blame the reporter (Maxine Shen) for this: it's natural to believe something that's been published in a book and a scientific journal. Perhaps, though, someone could send a note to whoever reviews this sort of book so that the errors won't be propagated indefinitely?
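For anyone who wants to check the arithmetic, here it is in a few lines of Python, using the counts quoted above and the rough conservative binomial standard error 0.5/sqrt(n):

```python
import math

def pct_and_se(girls, total):
    """Percent girls, and the conservative standard error 0.5/sqrt(n), in percent."""
    return 100 * girls / total, 100 * 0.5 / math.sqrt(total)

p95, se95 = pct_and_se(32, 56)                # 1995: 57.1%, se 6.7%
p9596, se9596 = pct_and_se(32 + 45, 56 + 80)  # 1995-96 combined: 56.6%, se 4.3%
p_all, se_all = pct_and_se(157, 329)          # all years, duplicates removed

print(round(p_all, 1), round(se_all, 1))  # 47.7 2.8
```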
Posted by Andrew at 1:23 AM | Comments (13) | TrackBack

### R-squared: useful or evil?

I had the following email exchange with Gary King.

Me: I know you hate R-squared and you hate standardization; nonetheless you might like this paper and this one. I've found the standardization idea, in particular, very helpful--I've been using it on many applications recently.

Gary: If R-sq were used as a data summary only, I'd have no objection (as an aside, I think 'data summary', which has good uses, often just means 'it's just description so don't bother me, anything goes!'). Instead, it is used as a measure of the quality or success or correctness or validity of the model, which is usually nuts.

Me: I agree with you there. By "data summary," I more precisely mean something that inherently depends on the design of the data collection. Thus, keep the model the same but spread out the x's, and R-squared goes up. But the model doesn't change. Similarly, "stat. signif." changes as you change the sample size.

Gary: Spreading out the x's is changing the model. Also, you can write down two equivalent models where the data give identical inferences about all the key parameters, but R2 can differ drastically. That's not abt the model or data summaries.

Me: What's in the model depends where you draw the line. If you have a dose-response model of the form y=f(x) + error, and you're interested in f, then I don't consider x part of the model; you set x to get a good estimate of f. At the other extreme, you can define anything as part of the model. Even sample size is part of the model if you consider it as a random variable. But I see what you're talking about. You're talking about comparing models. I'm not particularly interested in comparing models. I'm using R^2 to understand a single model (in particular, the way in which a particular dataset is informative about that single model). If I were to compare models, I'd do it directly.
Gary: So if you want to compare models, you wouldn't use R^2. But almost all uses of R^2 in the literature are about comparisons of some kind, even when implicit (the R^2 indicates that my model is better than yours! etc.). Anyway, I agree that it shouldn't be used to compare models, altho one (perhaps the only one?) valid use of R^2 is to compare two models or two specifications so long as they have the same dependent variable. The problem with R^2 is comparisons of data or model or anything when Y changes. The problem is identifying the question R^2 is the optimal answer to.

That's enough for now, I'm sure...

Posted by Andrew at 12:26 AM | Comments (1) | TrackBack

## August 27, 2007

### Using experts' ranges

Doug McNamara writes,

I am preparing for my first year as a graduate student at the University of Maryland in their Department of Measurement, Evaluation and Statistics. I've been reading your blog for a few months, and thought I would finally ask a question. So, here it is: I have some data on the number of terrorist/insurgent troops in a country. For some of the cases, the data could not be directly measured; instead, experts on the country in question were surveyed. For these survey responses, the dataset provides a range of possible values for the number of troops, with the range usually representing the high and low estimates (rounded to the nearest thousand). For instance, experts have assigned a range of 10,000-15,000 for the number of UNITA troops in Angola in 1989. So, the question is, how do I go about assigning an actual value in those situations where there is a range? Initially, I was thinking about simply using the mean of the high and low values, but I know nothing about the distribution of expert opinions. Alternatively, I could simply assign a random value within the range. A third option would be to run three tests—one where I only use the low values, one where I use the high values and a third where I use the median/random value approach.
I should mention I would like to assign a single value for the simple purpose of running a t-test to see if there is a difference in the average number of troops when the group is foreign funded or not.

My reply: Considering this as a statistical problem, you could treat the actual number as missing data and then use a rounded-data likelihood (as in Exercise 3.5 of Bayesian Data Analysis). In your case, however, I'd probably just use the average (or the geometric mean) of the range. I wouldn't take these ranges very seriously: in general, experts are notorious for giving estimates where the truth falls outside the range of their guesses. So I don't see you getting anything special from looking at the high and low values as if they were actually upper and lower bounds.

Posted by Andrew at 12:02 AM | Comments (2) | TrackBack

## August 24, 2007

### Average predictive comparisons for models with nonlinearity, interactions, and variance components

How do you summarize logistic regressions and other nonlinear models? The coefficients are only interpretable on a transformed scale. One quick approach is to divide logistic regression coefficients by 4 to convert onto the probability scale--that works for probabilities near 1/2--and another approach is to compute changes with other predictors held at average values (as we did for Figure 8 in this paper). A more general strategy is to average over the distribution of the data--this will make more sense, especially with discrete predictors. Iain Pardoe and I wrote a paper on this which will appear in Sociological Methodology:

In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty).
In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels. We fit three models: a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; a hierarchical model with varying intercepts only; and a non-hierarchical model that ignores the multilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models, as well as illustrating the relative size of varying county effects.

The next step is to program it in general in R.

Posted by Andrew at 12:32 AM | Comments (1) | TrackBack

## August 22, 2007

### Percent Changes?

Benjamin Kay writes about a problem that seems simple but actually is not:

I've come across a pair of problems in my work into which you may have some insight. I am looking at the percentage change in earnings per share (EPS) of various large American companies over a 3-year period. I am interested in comparing how other attributes influence the median value of earnings per share. For example, it might be that high-paying companies have higher EPS growth than low-paying ones.
I am aware that this model might not fully take advantage of the data, but I'm preparing it for an audience with limited statistical education. The problems occur in ranking percentages. If you calculate the percentage change as (New - Old)/Old, there are two major problems: 1) anything with a denominator near zero explodes; 2) companies that go from negative to positive EPS appear to have negative growth rates. For example, a company going from -$1 to $1 EPS gives ($1 - (-$1)) / (-$1) = -200%.

The first problem is seemingly intractable as long as I am using percent changes, but I cannot use dollar changes because that ignores the issue of scale. A company with 100 shares and $100 in earnings has $1 EPS, and one with 20 shares (and the same earnings) has $5 EPS. If both companies double their earnings to $200, they've performed identically. However, in absolute changes the former shows a $1 change and the latter a $5 change. I'm stuck with what to do here; maybe there is another measure of change that I haven't considered, or another way of doing this entirely.

One thing I've considered for the second problem is taking the absolute value of the percent change for companies whose EPS changes sign. That seems equivalent to claiming that a change from $1 to $3 EPS is equivalent to a change from -$1 to $1. Is that a standard approach to treating percent changes? Are there any other assumptions lurking underneath when doing this?

Is there a classic reference to doing order statistic work like this on percentile data?

My reply: this is an important problem that comes up all the time. The percent-change approach is even worse than you suggest, because it will blow up as the denominator approaches zero. Similar problems arise with marginal cost-benefit ratios, LD50 in logistic regression (see chapter 3 of Bayesian Data Analysis for an example), instrumental variables, and the Fieller-Creasy problem in theoretical statistics. I've actually been planning for a while to write a paper on estimation of ratios where the denominator can be positive or negative.

In general, the story is that the ratio completely changes in interpretation when the denominator changes sign (as you illustrated in your example). But yeah, dollar values can't be right either. I have a couple questions for you:

a. How important are the signs to you? For example, if a given company changes from -$1 to $1, is that more important to you than a change from $1 to $3, or from $3 to $5?

b. For any given company, do you want to use the same scaling for all three years? I imagine the answer is Yes (so you don't have to worry about oddities such as the fact that an increase of 25% followed by a decrease of 25% does not bring things back to the initial value).
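The parenthetical point is quick to verify numerically; here is a one-line check with made-up numbers:

```python
start = 100.0
after_up = start * 1.25       # +25%
after_down = after_up * 0.75  # then -25%
# 100 -> 125 -> 93.75: not back where we started
```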

One approach might be to rescale based on some relevant all-positive variable such as total revenue. I'm sure many other good options are available, once you get away from trying to rescale based on a variable that can be positive or negative.
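As a rough sketch of that suggestion (the function name and the revenue figure are hypothetical, purely for illustration): dividing the EPS change by a strictly positive quantity avoids both the exploding denominator and the misleading sign flip.

```python
def scaled_change(old_eps, new_eps, revenue_per_share):
    """Change in EPS, scaled by an all-positive quantity
    rather than by old EPS (which may be near zero or negative)."""
    return (new_eps - old_eps) / revenue_per_share

# A company that moves from -$1 to +$1 EPS:
naive = (1 - (-1)) / (-1)          # -2.0, i.e. -200%: the sign is misleading
scaled = scaled_change(-1, 1, 20)  # +0.10: positive, as the improvement deserves
```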

## August 21, 2007

### Ken Rice on conditioning in 2x2 tables

At the bottom of this entry I wrote that the so-called Fisher exact test for categorical data does not make sense. Ken Rice writes:

It turns out the standard conditional likelihood argument (which to me always looked prima facie contrived and artificial) is in fact exactly what you get from a carefully considered random-effects approach.
There are some nice symmetries in the random-effects prior; effectively it forces the same prior beliefs for cases and controls. It also has a nice non-parametric property--effectively one only specifies the first few moments of the prior--and it's most attractive in e.g. matched-pair studies.

Naturally, where one had good backing for a 'bespoke' prior, the conditional approach isn't going to beat it, but as a default I believe it's acceptable, and does actually make some sense.

Ken's paper is here, and here's an entertaining powerpoint [link fixed] on the topic.

Larger models that reduce to particular smaller models

I'll have to digest all this before I have any comments. Except that it reminds me of something similar with models of censored and truncated data. Truncation can be considered as a generalization of censoring where the number of censored cases is unknown. Thus, to do a full Bayesian inference for truncated data you need a prior distribution on the number of censored cases. It turns out that, for a particular choice of prior distribution, the truncation model reduces to the censoring model. We discuss this in chapter 7 of Bayesian Data Analysis (second edition) and section 2 of this paper from 2004. As I wrote then:

once we consider the model expansion, it reveals the original truncated-data likelihood as just one possibility in a class of models. Depending on the information available in any particular problem, it could make sense to use different prior distributions for N and thus different truncated-data models. It is hard to return to the original "state of innocence" in which N did not need to be modeled.

The same issue arises when considering additional input variables in a regression model.

Also

Here is Ken's conference poster:

## August 13, 2007

### Medians?

Jeff noticed this news article by Gina Kolata:

EVERYONE knows men are promiscuous by nature. It's part of the genetic strategy that evolved to help men spread their genes far and wide. The strategy is different for a woman, who has to go through so much just to have a baby and then nurture it. She is genetically programmed to want just one man who will stick with her and help raise their children.

Surveys bear this out. In study after study and in country after country, men report more, often many more, sexual partners than women.

One survey, recently reported by the federal government, concluded that men had a median of seven female sex partners. Women had a median of four male sex partners. Another study, by British researchers, stated that men had 12.7 heterosexual partners in their lifetimes and women had 6.5.

But there is just one problem, mathematicians say. It is logically impossible for heterosexual men to have more partners on average than heterosexual women. Those survey results cannot be correct.
...

Jeff's response: MEDIANS??!!

Indeed, there's no reason the two distributions should have the same median. I gotta say, it's disappointing that the reporter talked to mathematicians rather than statisticians. (Next time, I'd recommend asking David Dunson for a quote on this sort of thing.) I'm also surprised that they considered that respondents might be lying but not that they might be using different definitions of sex partner. Finally, it's amusing that the Brits report more sex partners than Americans, contrary to stereotypes.
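The point about medians is easy to demonstrate with toy numbers: in a closed heterosexual population the totals (and hence the means) must agree exactly, yet the medians are free to differ. The counts below are invented purely for illustration.

```python
import statistics

# Every pairing adds one partner to one man and one woman,
# so the totals (and hence the means) must match exactly.
men = [1, 1, 1, 1, 16]    # a few men report many partners
women = [4, 4, 4, 4, 4]

same_mean = statistics.mean(men) == statistics.mean(women)    # True
medians = (statistics.median(men), statistics.median(women))  # (1, 4)
```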

## August 10, 2007

### A question about transformation in regression

Alban Z writes,

I am seeking your view on a concept: transforming a dependent variable to make it normally distributed before doing a regression. For situations where common strategies like logarithm or square-root transformations do not help in making a variable (close to) normally distributed, some of the literature suggests using the so-called *inverse normal transformation: The transformation involves ranking the observations on the dependent variable, and then matching the percentile of each observation to the corresponding percentile of the standard normal distribution. Each observation is then replaced with the corresponding z-score from the standard normal distribution. When there are ties, percentiles are averaged across all ties.*

What are your thoughts about the above procedure? Do you recommend using it?

My reply: I do not recommend transforming to make a variable have a particular distribution. Additivity and linearity of the model are more important. We discuss the issue further in chapter 4 of our new book. See also here.
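For reference, the rank-based procedure described in the question can be sketched as follows. This is a plain-Python illustration with a hypothetical helper name (using the simplest rank/(n+1) percentile convention), not a recommended analysis step:

```python
from statistics import NormalDist

def inverse_normal_transform(y):
    """Rank-based inverse normal transform: map each value's
    (tie-averaged) percentile rank to a standard-normal z-score."""
    n = len(y)
    sorted_y = sorted(y)
    avg_rank = {}
    for v in set(y):
        first = sorted_y.index(v) + 1          # 1-based rank of first occurrence
        count = sorted_y.count(v)
        avg_rank[v] = first + (count - 1) / 2  # average rank across ties
    nd = NormalDist()
    return [nd.inv_cdf(avg_rank[v] / (n + 1)) for v in y]

z = inverse_normal_transform([3, 1, 4, 1, 5])  # the two 1s get identical z-scores
```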

## August 8, 2007

### Modeling on the original or log scale

Shravan writes,

Here is a typical problem I keep running into. I'm analyzing eyetracking data of the sort you have already seen in the polarity paper. Specifically, I am analyzing re-reading times at a particular word as a function of some experimental conditions that I will call c1 and c2. I expect an effect of c1 and c2, and an interaction. I get it when I analyze on raw reading times (milliseconds) but get only the interaction when I analyze on the log RTs. The logs' residuals are normally distributed and the raw RTs' are not. I am inclined to trust the log RTs more because of the normal residuals (theory, however, is more in line with raw RT-based results). But reviewers keep insisting I analyze on untransformed (raw) reading times, and your book also advises the reader to ignore residuals.

My reply:

1. The log scale makes more sense to me. On the other hand, the last time I analyzed eye-tracking data was 17 years ago, and I didn't know anything about the experimental setup even then!

2. If an interaction might be important, I'd include it in the model. Then if its coefficient isn't statistically significant, you can say that.

3. Hey, we do look at residuals in our book! Take a look at Chapter 5.

4. I wouldn't pick the model based on normality of the residuals. As we discuss in the book, the distribution of the residuals is the least important aspect of the model.

## July 24, 2007

### The difference between ...

Bruce McCullough points out this blog entry from Eric Falkenstein:

Recently the Wall Street Journal has had several articles about estrogen's link to heart disease in women, highlighting a recent New England Journal of Medicine article showing that it lowers the risk of arterial sclerosis. Then last week, the Journal ran a story on how the Women's Health Initiative (WHI) misread the data by focusing on the increased heart-attack risk for women over 70 while neglecting the lowered rate of heart attack for women under 60 (after the WHI's 2002 report arguing that estrogen therapy actually raised heart disease--the opposite sign to previous findings--hormone sales plummeted 30%). The WHI shot back in a letter to the WSJ, arguing that they stand by their interpretation of the data, which they think is somewhat mixed, and that, in their words, the difference in heart disease between the older and younger groups (one up, one down!) is not 'statistically significant'. If the difference isn't statistically significant, I can't see how the old cohort can be thought to have a higher than average risk (e.g., if the sample estimate for the old is +14% and for the young -30%, and if the difference is noise, the +14% is certainly noise). As Paul Feyerabend argued, there are no definitive tests in science, as people just ignore evidence that goes against them, emphasizing the consistent results.

I don't really have anything new to say about the Women's Health Initiative, but I did want to point this out, since it's an interesting reminder of the difficulty of using statistical significance as a measure of effect size.

Just a couple of weeks ago I was meeting with some people who were doing a health study where effect A was positive and not statistically significant, effect B was negative and not stat signif., but the difference was stat. signif. They had another comparison in their study where A was positive and stat signif, B was negative and not signif, and the difference was not stat signif. They were struggling to figure out how to explain all these things. Rather than give some sort of "multiple comparisons correction" answer, I suggested the opposite: to graphically display all their comparisons of interest in a big grid, to get a better understanding of what their study said. Then they could go further and fit a model if they want.
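The patterns described above are easy to reproduce with made-up estimates and standard errors: two effects can each be non-significant while their difference is clearly significant.

```python
import math

# Hypothetical estimates and standard errors:
a, se_a = 1.5, 1.0   # effect A: z = 1.5, not significant at the 5% level
b, se_b = -1.5, 1.0  # effect B: z = -1.5, not significant either
diff_z = (a - b) / math.sqrt(se_a**2 + se_b**2)
# diff_z = 3/sqrt(2), about 2.12 > 1.96: the difference IS significant
```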

Falkenstein also writes,

Estrogen therapy helps women with symptoms of menopause, including hot flashes and bone loss, but also depression, wrinkles, vaginal dryness, and lower sexual desire. Though not mentioned in the WSJ articles, I think the latter issues are what really bother the WHI. Women's groups are fond of coming up with pretexts to desexualize women...

I don't know enough about the WHI to answer this one, but I imagine that they want to be extra careful when assessing estrogen therapy, given the problems with earlier recommendations on this.

## July 10, 2007

### Convergent interviewing and Markov chain simulation

Bill Harris writes,

MCMC is a technique to sample higher-dimensional spaces efficiently. By using Markov chains to select the next sample point, MCMC gathers information about important parts of that space when purely random sampling would likely fail to hit any points of interest.

Convergent interviewing is a way to select the
next person or people to interview and the next questions to use when gathering information from a group of people. It "combines some of the features of structured and unstructured interviews, and uses a systematic process to refine the information collected."

In particular, people are selected by a simple process:

Decide the person "most representative" of the population. She will be the first person interviewed. Then nominate the person "next most representative, but in other respects as unlike the first person as possible"; then the person "next most representative, but unlike the first two" ... And so on. This sounds "fuzzy"; but in practice most people use it quite easily.

Each person is asked largely "content-free" questions on the general topic at hand. Probe questions are added to later questions to test the extent of apparent agreement between people and to explain apparent disagreements.

At first glance, there seems to be a metaphorical similarity between the two processes, as both seek to extract desired information from a high-dimensional space in reasonable time with a guided sampling process that may or may not converge.

I sometimes wonder if there might not even be a deeper connection, although I'm not sufficiently educated in Gibbs sampling and the like yet to be able to test that conjecture.

My response: Regarding MCMC, there has been some stuff written on "antithetical sampling" (I think that's what they call it) where there is a deliberate effort to make new samples different from earlier samples. There's also hybrid sampling, or Hamiltonian dynamics sampling, which Radford Neal has written about (extending methods that have been used in computational physics), which tries to move faster through parameter space.

Regarding convergent interviewing, the key idea seems to be the technique of the interview itself. (I can give my own disclaimer here which is that I've never done a personal interview of this sort, so I'm just speculating based on books I've read and conversations I've had with experts.) The sampling method seems fine. In practice the real worry is getting people who are too much alike, thus an extra effort is made to get people who are different. This makes sense to me. In practice I suspect it probably won't be better than sampling random people from the population (unless n is really small), but in many settings you can't really get a random sample, so it sounds like a good idea to intentionally diversify. Another approach is to use network sampling and use statistical methods to correct for sampling biases (as in Heckathorn et al.'s work here).
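The idea Andrew calls "antithetical sampling" (the usual textbook term is "antithetic variates") can be sketched in a few lines: pair each uniform draw u with 1 - u, so the paired evaluations are negatively correlated and their average has lower variance for monotone integrands. This is a generic illustration with invented details, not the hybrid/Hamiltonian methods mentioned above.

```python
import random
import statistics

random.seed(0)

def mc_estimates(f, n, antithetic=False):
    """Per-draw Monte Carlo estimates of E[f(U)], U ~ Uniform(0, 1)."""
    out = []
    for _ in range(n):
        u = random.random()
        out.append((f(u) + f(1 - u)) / 2 if antithetic else f(u))
    return out

f = lambda u: u * u  # E[f(U)] = 1/3
plain = mc_estimates(f, 10_000)
anti = mc_estimates(f, 10_000, antithetic=True)
# the antithetic per-draw variance is far smaller than the plain one
```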

## July 6, 2007

### Job opening in Vienna

I got this in the mail the other day. It's at the Institute for Tourism and Leisure Studies. Pretty cool (perhaps)! Certainly something unusual. I'm sure Bayesian data analysis, multilevel modeling, and statistical graphics will be useful in this job . . .

## June 25, 2007

### Animated MDS convergence

A few days ago we had quite a discussion on multidimensional scaling. While everyone agreed that initialization is important with non-convex problems, minimizing some objective function is more appealing than using the initial placement as the prior, except in special circumstances such as iterative scaling. For the objective-function approach, one can regularize the stress function, and it is also possible to use the prior to shrink towards geographic positions.

The untidy initial placement approach is sufficient, however, to provide a visualization as we travel from the initial placement towards the final placement. Namely, the clinal pattern in the final placement is only one of the things we can learn: the migrations of points and the resulting stresses are just as interesting in providing insight about the differences between the simple uniform geographic diffusion model and the real distribution of genes in Europe.

I also visualize the stress (at the top), and the strongest attraction/repulsion vectors.

The Python source code is now available: if you agree to post a link to all derivative work in the comments of this entry, you can click here [ZIP].

Posted by Aleks Jakulin at 6:10 PM | Comments (4) | TrackBack

## June 20, 2007

### Overview of Missing Data Methods

We came across an interesting paper on missing data by Nicholas J. Horton and Ken P. Kleinman, comparing statistical methods and related software for fitting incomplete-data regression models.

Here is the abstract:

Missing data are a recurring problem that can cause bias or lead to inefficient analyses. Statistical methods to address missingness have been actively pursued in recent years, including imputation, likelihood, and weighting approaches. Each approach is more complicated when there are many patterns of missing values, or when both categorical and continuous random variables are involved. Implementations of routines to incorporate observations with incomplete variables in regression models are now widely available. We review these routines in the context of a motivating example from a large health services research dataset. While there are still limitations to the current implementations, and additional efforts are required of the analyst, it is feasible to incorporate partially observed values, and these methods should be used in practice.

This is quite a thorough review. The authors refer to the different packages already available. One thing we noticed is that there is nothing on diagnostics (see here for more on diagnostics of imputation). This paper should help us improve the "mi" package.

Also the appendix of the paper can be found here.

## June 19, 2007

### A universal dimension reduction tool?

Igor Carron reports on a paper by Richard Baraniuk and Michael Wakin, "Random projections of smooth manifolds," that is billed as a universal dimension reduction tool. That sounds pretty good.

I'm skeptical about the next part, though, as described by Carron, a method for discovering the dimension of a manifold. This is an ok mathematical problem, but in the problems I work on (mostly social and environmental statistics), the true dimension of these manifolds is infinity, so there's nothing to discover. Rather than a statement like, "'We've discovered that congressional roll-call votes fall on a 3-dimensional space" or "We estimate the dimensionality of roll-call voting to be 3" or even "We estimate the dimensionality to be 3 +/- 2", I prefer a statement like, "We can explain 90% of the variance with three dimensions" or "We can correctly predict 93% of the votes using a three-dimensional model" or whatever.

## June 18, 2007

### Is significance testing all bad?

Dan Goldstein quotes J. Scott Armstrong:

About two years ago, I [Armstrong] was a reasonable person who argued that tests of statistical significance were useful in some limited situations. After completing research for “Significance tests harm progress in forecasting” in the International Journal of Forecasting, 23 (2007), 321-327, I have concluded that tests of statistical significance should never be used.

Here's a link to Armstrong's paper, and here's a link to his rejoinder to discussion.

My thoughts:

It has been rare that I've found significance tests to be useful, but when they have, it has been as a way to get a sense of the ways in which a model does not fit data, to give direction on where the model can be improved; see Chapter 6 of Bayesian Data Analysis.

For a specific example in which I found significance tests useful, see Section 2.6 of our new book. I emailed Armstrong and am interested to see if he agrees that significance testing was appropriate in that case. I suppose I agree that, ultimately, confidence intervals and effect-size estimates would be appropriate even in this example, but the significance testing was relatively simple and clear, so I was happy with it.

I was also reminded that the difference between "significant" and "not significant" is not itself statistically significant.

## June 11, 2007

### "Missing at random" and "distinct parameters"

Etienne Rivot sends in a question about models for missing data. The issues are subtle and I think could be of general interest (since we all have missing data!) These issues are covered in Chapter 7 of Bayesian Data Analysis, but it always helps to see these theoretical ideas in the context of a specific example.

Rivot writes:

I [Rivot] am currently writing a paper to be submitted to a fisheries review, and one of the referees raised a problem with the treatment of the missing data in our model. In the first version of the manuscript, we argued that the missing-data generating process was "ignorable" (because we argued that the two conditions--1) missing at random; 2) distinct parameters--were verified). But the referee argues that the "distinct parameters" condition was NOT verified. I would greatly appreciate your opinion on this. Please find below a short description of the model and of the problem:

The problem
--------------------------------
Objective : estimate the number of fish, say N, in a particular site in a river
Method : successive removal method via electrofishing
Data :
- site i=1,...,n
- C1 : capture at the first pass (fish are captured by means of electro-fishing)
sampling equation : C1(i) ~ Binom(N(i),p(i))
- C2 : capture at the second pass (the same experiment with the same capture probability p(i))
sampling equation : C2(i) ~ Binom(N(i)-c1(i),p(i))
Hierarchical Bayesian model :
- priors on p(i) and N(i) have a hierarchical structure across sites i=1,...,n

The missing data problem :
Sometimes the population size N(i) is so low that the result of the first pass is C1(i) = 0. In that case, the field crew often do not perform the second pass. Then C2(i) = NA.

The argumentation of the referee :
The "distinct parameters" condition IS NOT verified, because the probability of missing data at site i depends upon the population size N(i). Indeed, the smaller the population size N(i), the greater the probability of obtaining C1 = 0 at the first pass, and the greater the probability of missing data at the second pass. Thus the parameters of the missing-data generating process (i.e., the probability of missing data C2 = NA at site i) are NOT independent of the parameters of the data generating process (i.e., N(i)).

Our argumentation (but we are maybe wrong) :
The "distinct parameters" condition IS verified, because the probability of missing data at site i ONLY depends upon the observed value C1(i): if C1(i) = 0, then the probability of having missing data is greater. From that point of view, the parameters of the missing-data generating process (i.e., the probability of missing data C2 = NA at site i) depend only upon the observations C1(i) and can thus be considered independent of the unobserved latent variable N(i).
--------------------------------

Who is wrong? Is it only a question of "point of view"?

My reply: Certainly the fact that C1=0 provides information about N. But, yes, if C1 is observed, and only C1 determines whether C2 is measured, then you are missing at random. However, you say that the field crew "often do not perform the second pass." If, for example, they are more likely to perform the second pass when they believe there are fish at the site, then the data are _not_ missing at random.

But that has nothing to do with "distinct parameters." The "distinct parameters" issue is different. Here, the two sets of parameters that must be "distinct" are:

(a) The usual parameters of your model: N, p, and your hierarchical parameters;

(b) The parameters of your missingness model: in this case, the probability that C2 is measured, if C1=0.

Violation of "distinct parameters" would occur if the parameters in (a) and (b) are dependent in their prior distribution; that is, if the proportion of missing cases is informative about N,p,etc. I would see no reason to believe this is the case.

So I think you're right.
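A quick simulation of the setup (all numbers invented) makes the missing-at-random point concrete: under the stated field protocol, the missingness indicator is a deterministic function of the observed C1, which is exactly the MAR condition.

```python
import random

random.seed(1)
sites = []
for _ in range(1000):
    N = random.randint(0, 5)  # invented small population sizes
    p = 0.5                   # capture probability, same for both passes
    C1 = sum(random.random() < p for _ in range(N))
    # field protocol: skip the second pass exactly when C1 = 0
    C2 = None if C1 == 0 else sum(random.random() < p for _ in range(N - C1))
    sites.append((N, C1, C2))

missing = [s for s in sites if s[2] is None]
# every missing C2 coincides with the observed event C1 = 0,
# so missingness depends on the data only through observed values
```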

## June 7, 2007

### Percentage of missing observations, leading to a suggestion for a research paper based on grabbing available datasets

Jacob Felson writes,

Here is a question I was hoping you might address (on your blog?) It has to do with the distribution of missing data in a dataset. What is the relationship between the total percentage missing observations, and the number of observations left after listwise deletion? Given the percentage of missing observations in the data matrix, what is the expected number of observations after listwise deletion?

To take an arbitrary example, say you have 10 variables and 100 cases, 1000 total observations. 20% of the observations are missing. What is the expected number of observations after listwise deletion?

I was wondering whether any one does research on this kind of thing. For example, what would a specific pattern of missing data (say, measured as the number of listwise deleted cases relative to the number of expectation of such cases, given total missing observations) say about the administration of a dataset, and what would it say about the probability that the data are missing at random?

My reply: I wasn't familiar with the term "listwise deletion," but it seems to be the same thing as "complete-case analysis" (a more descriptive term, I think). Anyway, I'm not quite sure why the question arises since I'd think you could answer it directly with any particular dataset. For your example, if you have 100 units with 10 observations each and 20% missing, one extreme is that the same 20 units are missing all the observations (in which case you'd lose 20% with listwise deletion); in another extreme, everybody is missing at least one variable, in which case you can't do listwise deletion at all! It depends on the pattern of what's missing.

If you want a more quantitative answer, one approach would be to trawl the internet for a few thousand datasets and, for each, count the number of units, number of observations, number of completely observed units, and number of missing observations--and see what patterns arise empirically. It could make for an interesting paper.
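The extreme case and the benchmark of independent random missingness are easy to check by simulation, using the sizes from the example (with independent missingness, the expected number of complete cases is 100 × 0.8^10, about 10.7):

```python
import random

random.seed(42)
n_units, n_vars, p_miss = 100, 10, 0.2

# Extreme: the same 20 units account for all the missing observations
rows = [[None] * n_vars if i < 20 else [1] * n_vars for i in range(n_units)]
complete_extreme = sum(all(v is not None for v in row) for row in rows)  # 80

# Benchmark: each observation missing independently with probability 0.2
rows = [[None if random.random() < p_miss else 1 for _ in range(n_vars)]
        for _ in range(n_units)]
complete_random = sum(all(v is not None for v in row) for row in rows)
# expected value: 100 * 0.8**10, roughly 11 complete cases
```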

## June 5, 2007

### Applied Probability Day

Hey, this looks interesting:

The Center for Applied Probability at Columbia University presents the:

14th Annual Applied Probability Day

Friday June 8th, 2007
9:00AM-6:00PM

Room 303
S.W. Mudd Building
500 West 120th Street
Columbia University, New York City

REGISTRATION IS FREE, all are very welcome

SPEAKERS & SCHEDULE:

8:30 - 9:00 Coffee and Danish

9:00-9:15 Opening Remarks, Chris Heyde, Director of CAP

9:15-10:00 Kevin Glazebrook (Lancaster U., UK)

* General Notions of Indexability and Dynamic Allocation of Individuals to Teams Providing Service

10:00-10:45 Jose Blanchet (Harvard U.)

* Rare-event Analysis and Simulation of Heavy-tailed Systems

10:45-11:15 Coffee Break

11:15-12:00 Jim Fill (Johns Hopkins U.)

* A (Minor) Miracle: Diagonalization of a Bose-Einstein Markov Chain

12:00-2:00 Lunch

2:00-2:45 Marco Avellaneda (Courant Institute, NYU)

* Power-Law Price Impact Models and Stock Pinning on Option Expiration Dates

2:45-3:30 Steve Lalley (U. Chicago)

* Spatial Epidemics: Critical Behavior

3:30-4:00 Coffee Break

4:00-4:45 Les Servi (Lincoln Labs, MIT)

* Methods of Maritime Tracking: An Overview from an Applied Probability/Operations Research Perspective

4:45-6:00 Wine and Cheese Reception

For further information please go to our web site: http://www.cap.columbia.edu/

### Logistic regression and glm defaults

When you run glm in R, the default is linear regression. To do logistic, you need a slightly awkward syntax, e.g.,

fit.1 <- glm (vote ~ income, family=binomial(link="logit"))

This isn't so bad, but it's a bit of a pain when doing routine logistic regressions. We should alter glm() so that it first parses the y-variable and then, by default, chooses logit if it's binary, linear if it's continuous, and overdispersed Poisson if it's nonnegative integers with more than two categories, maybe ordered logit for some other cases. If you don't like defaults, you can always specify the model explicitly (as you currently must do anyway).
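The proposed default rule could be sketched as a simple dispatcher. This is illustrative Python pseudologic with a hypothetical helper name, not the actual R implementation (which would live in glm()/bayesglm()):

```python
def default_family(y):
    """Pick a default GLM family by inspecting the response,
    mirroring the rule proposed above (hypothetical helper)."""
    values = set(y)
    if values <= {0, 1}:
        return "binomial(logit)"        # binary outcome -> logit
    if all(isinstance(v, int) and v >= 0 for v in values):
        return "overdispersed poisson"  # nonnegative counts, >2 categories
    return "gaussian"                   # continuous outcome -> linear regression

fam = default_family([0, 1, 1, 0])
```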

Aliasing the standard glm() function is awkward so we'll start by putting these defaults into bayesglm() (which I think we should also call bglm() for ease of typing) in the "arm" package in R. The categorization of the variables fits in well with what we're doing in the "mi" package for missing-data imputation.

## June 4, 2007

### When to worry about heavy-tailed error distributions

Hal Daume writes,

I hope you don't mind an unsolicited question about your book. I'm working with someone in our chemistry department right now on a regression problem, and he's a bit worried about the "normality of errors" assumption. You state (p.46):
The regression assumption that is generally the least important is that the errors are normally distributed... Thus, in contrast to many regression textbooks, we do not recommend diagnostics of the normality of regression residuals.

Can you elaborate on this? In particular, what if the true error distribution is heavy-tailed? Could this not cause (significant?) problems for the regression? Do you have any references that support this claim?

My response: It depends what you want to do with the model. If you're making predictions, the error model is certainly important. If you're estimating regression coefficients, the error distribution is less important since it averages out in the least-squares estimate. The larger point is that nonadditivity and nonlinearity are big deals because they change how we think about the model, whereas the error term is usually not so important. At least that's the way it's gone in the examples I've worked on.

To get slightly more specific, when modeling elections I'll occasionally see large outliers, perhaps explained by scandals, midterm redistrictings, or other events not included in my model. I recognize that my intervals for point predictions from the normal regression model will be a little too narrow, but this hasn't been a big concern for me. I'd have to know more about your chemistry example to see how I'd think about the error distribution there.
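A quick simulation (a sketch, not from the original exchange) illustrates the point about coefficients: even with heavy-tailed t_3 errors, the least-squares slope averages out to the true value, while prediction intervals built on normality would be too narrow.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims, true_slope = 100, 2000, 2.0
x = np.linspace(0, 1, n)

slopes = []
for _ in range(n_sims):
    # heavy-tailed errors: t with 3 degrees of freedom instead of normal
    y = 1.0 + true_slope * x + rng.standard_t(df=3, size=n)
    slope, _intercept = np.polyfit(x, y, 1)
    slopes.append(slope)

print(np.mean(slopes))   # close to the true slope of 2.0
```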

## May 29, 2007

### Mathematics and truth

Stan Salthe pointed me to the article "Why Mathematical Models Just Don't Add Up." It's something every quantitative modeler should read. Let me provide some snippets:

The predictions about artificial beaches involve obviously absurd models. The reason coastal engineers and geologists go through the futile exercise is that the federal government will not authorize the corps to build beaches without a calculation of the cost-benefit ratio, and that requires a prediction of the beaches' durability. Although the model has no discernible basis in reality, it continues to be cited around the world because no other model even attempts to answer that important question. [...] In spite of the fact that qualitative models produce better results, our society as a whole remains overconfident about quantitative modeling. [...] We suggest applying the embarrassment test. If it would be embarrassing to state out loud a simplified version of a model's parameters or processes, then the model cannot accurately portray the process. [...] A scientist who stated those assumptions in a public lecture would be hooted off the podium. But buried deep within a model, such absurdities are considered valid.

While the article is quite extreme in its derision of quantitative models, plugs the book the authors wrote, and employs easy rhetoric by citing a few failures while ignoring many successes, it is right that quantitative models are overrated in our society, especially in domains that involve complex systems. The myriad unrealistic and often silly assumptions are hidden beneath layers of obtuse mathematics.

Statistics and probability were attempts to deal with the failure of deterministic mathematical models, and Bayesian statistics is a further attempt to manage the uncertainty arising from not knowing what the right model is. Moreover, a vague posterior is a clear signal that you don't have enough data to make predictions.

Someone once derided philosophers by saying that first they stir up the dust, and then they complain that they cannot see: they are taking too many things into consideration, and this prevents them from coming up with a working model that will predict anything. One does have to simplify to make any prediction, and philosophers are good at criticizing the simplifications. Finally, even false models are known to yield good results, as we are reminded by that old joke:

An engineer, a statistician, and a physicist went to the races one Saturday and laid their money down. Commiserating in the bar after the race, the engineer says, “I don’t understand why I lost all my money. I measured all the horses and calculated their strength and mechanical advantage and figured out how fast they could run. . . ”

The statistician interrupts him: “. . . but you didn’t take individual variations into account. I did a statistical analysis of their previous performances and bet on the horses with the highest probability of winning. . . ”

“. . . so if you’re so hot, why are you broke?” asks the engineer. But before the argument can grow, the physicist takes out his pipe and they get a glimpse of his well-fattened wallet. Obviously here is a man who knows something about horses. They both demand to know his secret.

“Well,” he says, between puffs on the pipe, “first I assumed all the horses were identical, spherical and in vacuum. . . ”

Posted by Aleks Jakulin at 9:32 AM | Comments (7) | TrackBack

## May 25, 2007

### Treating discrete variables as if they were continuous

Francesca Vandrola writes,

Reading several papers published recently in political science journals (e.g., Journal of Politics, Political Behavior), I find quite consistently that:

I) Authors have discrete variables such as

- Income, say, measured as follows: (1 = $15,000 or under, 2 = $15,001–$25,000, 3 = $25,001–$35,000, 4 = $35,001–$50,000, 5 = $50,001–$65,000, 6 = $65,001–$80,000, 7 = $80,001–$100,000, 8 = over $100,000)

- External efficacy, say, measured as follows: an index that sums responses from four questions: “People like me don’t have any say about what the government does”, “I don’t think public officials care much what people like me think”, “How much do you feel that having elections makes the government pay attention to what the people think?”, and “Over the years, how much attention do you feel the government pays to what the people think when it decides what to do?”. The first two questions are coded 0 = agree, 0.5 = neither, and 1 = disagree. The third and fourth questions are coded 1 = a good deal, 0.5 = some, and 0 = not much.

- Church attendance, say, measured as an index of religious attendance, 1 = never/no religious preference, 2 = a few times a year, 3 = once or twice a month, 4 = almost every week, and 5 = every week.

And so on and so forth.

II) Authors include the above variables in their models (as explanatory variables) as if they were continuous. Why? I see the point of not including some of the variables as categorical predictors (say, if the variable has 9 categories), but I am less clear on there being a good rationale for some other cases. Wouldn't it be preferable, especially if there are sufficient observations, to include some of those predictors in a categorical fashion? Maybe they will indeed behave like an index and have a linear effect... but maybe they won't.

Variables commonly behave monotonically, in which case linearity can be a good approximation--even if the scale of the predictor is somewhat arbitrary. But then it makes sense to check the residuals to see if the linearity is strongly violated. The alternative--modeling everything categorically--is fine, but that requires work in building and interpreting the model, effort that might be better spent elsewhere.

## May 19, 2007

### More on the prosecutor's fallacy and the defense fallacy

I sent this Mark Buchanan article to Bruce Levin (coauthor, with Michael Finkelstein, of Statistics for Lawyers), who pointed me to this discussion from their book of what he calls the prosecutor's fallacy and the defense fallacy. Interesting stuff.

## May 18, 2007

### Multilevel logistic regression

Enrique Pérez Campuzano writes,

I'm using a multilevel logistic model to predict the probabilities of internal migration in Mexico. My level 1 units are persons and my level 2 units are cities. I ran some analyses on a small sample of my dataset in R using lmer, as you do. The model, I think, fits well, but I was wondering if the assumption of 50/50 for probabilities can be applied when the distribution of the response variable is not 50/50, or when the cutoff point is not 50%. In my data, only 0.7% of the population moved from one city to another. With this kind of data, can we use the "divide by 4" rule for the estimation of probabilities? Or can we change the cutoff point to improve the coefficients in the lmer command (for example, logistic regression in SPSS allows one to change the cutoff point)?

Second, as I said, this small model is only the beginning. My real data are around 10,000,000 individuals from a special sample of the Mexican Census. Is there any problem with estimation for samples of this size? I think the computational problem is more or less solved, because the University of California at San Diego allows me to run the analysis at its supercomputing center.

My response:

1. That's right, when the probabilities are far from 1/2, you can't use the divide-by-4 rule. See Section 5.7 of our book for discussions of how to compute the appropriate probability changes.

2. I'm pretty sure that R will choke on datasets of size 10 million; you might try Stata. (We have some example code in Appendix C on fitting multilevel models in Stata and other languages.) Another way to look at it is, if you have 10 million data points, you can probably start by fitting a model separately to different subsets of the data--that's how I'd probably go. You can fit the separate multilevel models for different demographic groups, then postprocess in some way, for example using plots or second-level regressions fit to the estimated coefficients.
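On point 1, a quick numeric sketch (with a made-up coefficient) shows why divide-by-4 fails at a 0.7% base rate: the slope of the logistic curve with respect to a predictor is beta*p*(1-p), which equals beta/4 only at p = 1/2.

```python
beta = 0.8          # hypothetical logistic regression coefficient
p_half = 0.5        # where the divide-by-4 rule applies
p_rare = 0.007      # roughly the migration rate in the question

slope_at_half = beta * p_half * (1 - p_half)   # = beta/4 = 0.2
slope_at_rare = beta * p_rare * (1 - p_rare)   # ~0.0056, about 36x smaller

print(slope_at_half, slope_at_rare)
```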

## May 15, 2007

### More hoops

Carl Bialik has followed up on the NBA bias study. (See here for my earlier comments.) There was some interesting back-and-forth, and the problem had an interesting feature in that different parties were performing different analyses on different datasets. My only added comment is that I mentioned this stuff to Carl Morris yesterday and he pointed out that it would be interesting to see who were the players who were fouled against. For (almost) every foul there is a "foulee", and you'd think that racism would manifest itself most as fouls called on a white guy playing against a black player or on a black guy facing a white player.

Alan Agresti has written some papers motivating the (y+1)/(n+2) estimate, instead of the raw y/n estimate, for probabilities. (Here we're assuming n independent tries with y successes.)

The obvious problem with y/n is that it gives deterministic estimates (p=0 or 1) when y=0 or y=n. It's also tricky to compute standard errors at these extremes, since sqrt(p(1-p)/n) gives zero, which can't in general be right. The (y+1)/(n+2) formula is much cleaner. Agresti and his collaborators did lots of computations and simulations to show that, for a wide range of true probabilities, (y+1)/(n+2) is a better estimate, and the confidence intervals using this estimate have good coverage properties (generally better than the so-called exact test; see Section 3.3 of this paper for my fulminations against those misnamed "exact tests").

The only worry is . . .

The only place where (y+1)/(n+2) will go wrong is if n is small and the true probability is very close to 0 or 1. For example, if n=10 and p is 1 in a million, then y will almost certainly be zero, and an estimate of 1/12 is much worse than the simple 0/10.

However, I doubt that would happen much: if p might be 1 in a million, you're not going to estimate it with a n=10 experiment. For example, I'm not going to try ten 100-foot golf putts, miss all of them, and then estimate my probability of success as 1/12.
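Here's the boundary case in miniature (a quick sketch):

```python
import math

n, y = 10, 0                      # ten tries, zero successes
raw = y / n                       # 0: a deterministic estimate
raw_se = math.sqrt(raw * (1 - raw) / n)        # 0, which can't be right
adj = (y + 1) / (n + 2)                        # 1/12, pulled toward 1/2
adj_se = math.sqrt(adj * (1 - adj) / (n + 2))  # a sensible nonzero se

print(raw, raw_se, adj, adj_se)
```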

Conclusion

Yes, (y+1)/(n+2) is a better default estimate than y/n.

## May 14, 2007

### Missing data imputation updating one variable at a time

Masanao discovered this interesting paper by Ralf Munnich and Susanne Rassler. From the abstract:

In this paper we [Munnich and Susanne Rassler] discuss imputation issues in large-scale data sets with different scaled variables laying special emphasis on binary variables. Since fitting a multivariate imputation model can be cumbersome, univariate specifications are proposed which are much easier to perform. The regression-switching or chained equations Gibbs sampler is proposed and possible theoretical shortcomings of this approach are addressed as well as data problems.

A simulation study is done based on the data of the German Microcensus, which is often used to analyse unemployment. Multiple imputation, raking, and calibration techniques are compared for estimating the number of unemployed in different settings. We find that the logistic multiple imputation routine for binary variable, in some settings, may lead to poor point as well as variance estimates. To overcome possible shortcomings of the logistic regression imputation, we derive a multiple imputation matching algorithm which turns out to work well.

This is important stuff. They refer to the packages Mice and Iveware, which have inspired our new and improved (I hope) "mi" package; it is more flexible than these predecessors. Unfortunately, with this flexibility comes the possibility of more problems, so it's good to see this sort of research. I like the paper a lot (except for the ugly color figures on pages 12-16!).

## May 9, 2007

### Stories about the qualifying exam

Yves told me he's doing his qualifying exam, which brought back memories from 20 years ago. At one point we had a fire alarm. Before exiting the building, I went into the office and took out all the booklets of exams I'd been working on--our exam was a two-week-long take-home. Just on the off chance there actually was a fire, I didn't want my papers to burn up!

Among other things, I learned logistic regression from a problem of Bernie Rosner, and I got stuck on a very hard, but simple-looking, problem from Fred Mosteller on dog learning, an example that I ended up returning to again and again, most recently in our new multilevel modeling book. There was also a problem by Peter Huber--he was notorious for using the same problem year after year--it featured a 10th-generation photocopy of a very long article on crystallography, I think, along with a suggestion that the data be analyzed using robust methods. Like everyone else, I skipped that one. But I did spend three days on a problem from Art Dempster on belief functions--it was so much work but I really felt I needed to do it.

We students all took this so seriously. The perspective from the faculty side is much different, since a key part of the exam is to evaluate the students, not just to give them an exciting challenge. Also, we've had problems over the years with cheating, so it's been difficult to have take-home exams. Finally, I heard a rumor that our students were told not to worry, there won't be anything Bayesian on Columbia's qualifying exam this year. Say it ain't so!!

## May 4, 2007

For decades I've been reading about 1/f noise and have been curious what it sounds like. I've always been meaning to write a little program to generate some and then play it on a speaker, but I've never put the effort into figuring out exactly how to do it. But now with Google . . . there must be some 1/f noise out there already, right?

A search on "1/f noise" yielded this little snippet of 1/f noise simulated by Paul Bourke from a deterministic algorithm. It sounded pretty cool, like what I'd imagine computer music to sound like.

I did a little more searching on the web; it was easy to find algorithms and code for generating 1/f ("pink") noise but surprisingly difficult to find actual sounds. I finally found this 15-second snippet, which sounded like ocean waves, not like computer music at all! (You can compare to the sample of white noise, which indeed sounds like irritating static.)

I also found this online 1/f noise generator from a physics class at Berkeley. It works, and it also shows the amplitude series and spectrum. It too sounds like ocean waves. I'm disappointed--I liked the computer music. What's the deal?
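For what it's worth, here is one way the little program could go (a sketch in Python with numpy, using one standard recipe among several): shape white noise in the frequency domain so that spectral power falls off as 1/f.

```python
import numpy as np

def pink_noise(n, seed=0):
    """Generate approximate 1/f ('pink') noise by scaling each Fourier
    amplitude of white noise by 1/sqrt(f), so power falls off as 1/f."""
    rng = np.random.default_rng(seed)
    spectrum = np.fft.rfft(rng.standard_normal(n))
    freqs = np.fft.rfftfreq(n)
    freqs[0] = freqs[1]              # avoid division by zero at DC
    spectrum /= np.sqrt(freqs)
    samples = np.fft.irfft(spectrum, n)
    return samples / np.abs(samples).max()   # scale to [-1, 1] for playback
```

Writing the result to a .wav file (for example with the standard-library wave module) and playing it should give the ocean-wave sound; Bourke's snippet presumably sounds different because it comes from a deterministic algorithm rather than filtered noise.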

## May 2, 2007

### Doing surveys through the web

Aleks pointed me to this site.

## April 29, 2007

### Too much information?

Aleks sent me the link to this site. Seth might like it--except that it seems to be set up only to monitor data, not to record experiments.

## April 18, 2007

### We Don't Quite Know What We are Talking About When We Talk About Volatility

Following up (sort of) on my comments on The Black Swan . . .

In their paper, Dan Goldstein and Nassim Taleb write: "Finance professionals, who are regularly exposed to notions of volatility, seem to confuse mean absolute deviation with standard deviation, causing an underestimation of 25% with theoretical Gaussian variables. In some fat tailed markets the underestimation can be up to 90%. The mental substitution of the two measures is consequential for decision making and the perception of market variability."

This interests me, partly because I've recently been thinking about summarizing variation by the mean absolute difference between two randomly sampled units (in mathematical notation, E(|x_i - x_j|)), because that seems like the clearest thing to visualize. Fred Mosteller liked the interquartile range, but that's a little too complicated for me; also, I like to do some actual averaging, not just medians, which miss some important information. I agree with Goldstein and Taleb that there's not necessarily any good reason for using the sd (except for mathematical convenience in the Gaussian model).
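The relevant constants are easy to check by simulation (a sketch): for a Gaussian, the mean absolute deviation is sqrt(2/pi), about 0.80 times the sd (hence the roughly 25% underestimation if you read one measure as the other), while the expected absolute difference between two random draws, E(|x_i - x_j|), is 2/sqrt(pi), about 1.13 times the sd.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)     # sd = 1 by construction
y = rng.standard_normal(1_000_000)     # an independent second draw

mad = np.abs(x - x.mean()).mean()      # mean absolute deviation, ~0.798
gap = np.abs(x - y).mean()             # E|x_i - x_j|, ~1.128

print(mad, gap, x.std() / mad)         # the last ratio is ~1.25
```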

## April 16, 2007

### Some thoughts on the sociology of statistics

One thing that bugs me is that there seems to be so little model checking done in statistics. As I wrote in this referee report,

I'd like to see some graphs of the raw data, along with replicated datasets from the model. The paper admirably connects the underlying problem to the statistical model; however, the Bayesian approach requires a lot of modeling assumptions, and I'd be a lot more convinced if I could (a) see some of the data and (b) see that the fitted model would produce simulations that look somewhat like the actual data. Otherwise we're taking it all on faith.

But why, if this is such a good idea, do people not do it? I don't buy the cynical answer that people don't want to falsify their own models. My preferred explanation might be called sociological, and it goes as follows: We're often told to check model fit. But suppose we fit a model, write a paper, and check the model fit with a graph. If the fit is OK, then why bother with the graph: the model is OK, right? If the fit shows problems (which, realistically, it should, if you think hard enough about how to make your model-checking graph), then you'd better not include the graph in the paper, or the reviewers will reject it, saying that you should fix your model. And once you've fit the better model, there's no need for the graph.

The result is: (a) a bloodless view of statistics in which only the good models appear, leaving readers in the dark about all the steps needed to get there; or, worse, (b) statisticians (and, in general, researchers) not checking the fit of their model in the first place, so that neither the original researchers nor the readers of the journal learn about the problems with the model.

One more thing . . .

You might say that there's no reason to bother with model checking since all models are false anyway. I do believe that all models are false, but for me the purpose of model checking is not to accept or reject a model, but to reveal aspects of the data that are not captured by the fitted model. (See chapter 6 of Bayesian Data Analysis for some examples.)

## April 13, 2007

### Statistical inefficiency = bias, or, Increasing efficiency will reduce bias (on average), or, There is no bias-variance tradeoff

Statisticians often talk about a bias-variance tradeoff, comparing a simple unbiased estimator (for example, a difference in differences) to something more efficient but possibly biased (for example, a regression). There's commonly the attitude that the unbiased estimate is a better or safer choice. My only point here is that, by using a less efficient estimate, we are generally choosing to estimate fewer parameters (for example, estimating an average incumbency effect over a 40-year period rather than estimating a separate effect for each year or each decade). Or estimating an overall effect of a treatment rather than separate estimates for men and women. If we do this--make the seemingly conservative choice not to estimate interactions--we are implicitly estimating these interactions at zero, which is not unbiased at all!
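A toy calculation (with numbers invented purely for illustration) makes the implicit estimate explicit:

```python
# Hypothetical true treatment effects by subgroup:
effect_men, effect_women = 0.3, 0.7
interaction = effect_women - effect_men        # truly 0.4

pooled = (effect_men + effect_women) / 2       # the "safe" overall estimate
# Reporting only the pooled effect implicitly estimates the interaction
# at zero -- a bias of -0.4, so the conservative choice is not unbiased.
implicit_interaction_estimate = 0.0
bias = implicit_interaction_estimate - interaction

print(pooled, bias)
```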

I'm not saying that there are any easy answers to this; for example, see here for one of my struggles with interactions in an applied problem--in this case (estimating the effect of incentives in sample surveys), we were particularly interested in certain interactions even though they could not be estimated precisely from data.

(Also posted at Overcoming Bias.)

## April 12, 2007

A correspondent writes:

Wanted to add to my comment on the Black Swan review... but didn't want to hang people in public.

You mentioned... (Mosteller and Wallace made a similar point in their Federalist Papers book about how they don't trust p-values less than 0.01 since there can always be unmodeled events. Saying p<0.01 is fine, but please please don't say p<0.00001 or whatever.) which is a terrific point!

I had a related experience just last week when attending a seminar. Some guys were modeling some marketing information and showed ranges of coefficients from the set of regressions, arguing that everything was significant. At the bottom of the table, it read: "Adjusted R-sq = 0.001".

I had to check my glasses. I thought I was hallucinating. That line didn't seem to faze anyone else. The audience was asking modeling questions--why didn't you model it this way or that, etc. I turned around and asked my neighbor: were you bothered by an R-sq of 0.1%? His answer was, "I have seen 0.001 or lower for panel data."

Now I'm not an expert in panel data analysis. But I am shocked, shocked, that apparently such models are allowable in academia. Pray tell me not!

I don't know what to say. In theory, R^2 can be as low as you want, but I have to admit I've never seen something like 0.001.
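A quick simulation (a sketch) shows how this can happen honestly at panel-data scale: a real but minuscule effect gives a tiny R^2 yet an enormous t-statistic.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000                            # panel-data-sized sample
x = rng.standard_normal(n)
y = 0.03 * x + rng.standard_normal(n)    # true slope tiny relative to noise

slope = np.cov(x, y)[0, 1] / x.var()
resid = y - y.mean() - slope * (x - x.mean())
r_squared = 1 - resid.var() / y.var()
t_stat = slope / (resid.std() / (x.std() * np.sqrt(n)))

print(round(r_squared, 4), round(t_stat, 1))   # R^2 near 0.001, t near 30
```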

## April 3, 2007

### n = 35

Ronggui Huang from the Department of Sociology at Fudan University writes,

Recently, my mentor and I collected data in about 35 neighborhoods, surveying 30 residents in each neighborhood. I would like to study the effects of neighborhood-level characteristics, so after data collection I aggregated the data to the neighborhood level. In other words, I have just 35 sample points. With such a small sample size (35 neighborhoods), what statistical methods can I use to analyze the data? It seems that most statistical methods are based on large-sample theory.

My quick answer is that, from the standpoint of classical statistical theory, 35 is a large sample! You could also do a multilevel model if you want. But I'd be careful about the causal interpretations (you wrote "effects" above)--you're probably limited on what you can learn causally unless you can frame what you're doing as a "natural experiment" (for a start, see chapters 9 and 10 of our new book).

P.S. I imagine things have changed quite a bit at Fudan in the years since Xiao-Li was there.

## April 2, 2007

### Pseudo-failures to replicate

Prakash Gorroochurn from our biostat dept wrote this paper discussing the fact that, even if a study finds statistical significance, its replication might not be statistically significant--even if the underlying effect is real.

This is an important point, which can also be understood using the usual rule of thumb that to have 80% power for 95% significance, your true effect size needs to be 2.8 se's from zero. Thus, if you have a result that's barely statistically significant (2 se's from zero), it's likely that the true effect is less than 2.8 se's from zero, and so you shouldn't be so sure you'll see a statistically significant replication. As Kahneman and Tversky found, however, our intuitions lead us to (wrongly) expect replication of statistical significance.
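The rule of thumb is easy to verify numerically (a sketch using the normal approximation):

```python
import math

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def power(effect_in_ses, z_crit=1.96):
    """Two-sided power: chance the estimate lands beyond +/- z_crit se's."""
    return (1 - norm_cdf(z_crit - effect_in_ses)
            + norm_cdf(-z_crit - effect_in_ses))

print(round(power(2.8), 2))   # 0.8: the 80%-power rule of thumb
print(round(power(2.0), 2))   # 0.52: a barely significant result replicates
                              # as "significant" only about half the time
```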

Prakash's paper is also related to our point about the difference between significance and non-significance.

## March 30, 2007

### More on "The difference between 'significant' and 'not significant' is not itself statistically significant"

Following up on the remark here, Ben Jann writes,

This just sprang to my mind: Do you remember the 2005 paper on oxytocin and trust by Kosfeld et al. in Nature? It has been in the news. I think they made the same mistake. The study contains a "Trust experiment" and a "Risk experiment". Because the oxytocin effect was significant in the Trust experiment but not in the Risk experiment, Kosfeld et al. see their hypothesis confirmed that oxytocin increases trust, but not the readiness to bear risks in general. However, this is not a valid conclusion, since they did not test the difference in effects. Such a test would most likely not turn out to be significant (at least if performed on the aggregate level, as were the other tests in the paper; the test might be significant if using the individual-level experimental data). (Furthermore, note that there is an error in Figure 2a: there should be an additional hollow 0.10 relative-frequency bar at transfer 10.)

## March 26, 2007

### The difference between "significant" and "not significant" is not itself statistically significant

Ben Jann writes,

I came across your paper on the difference between significant and non-significant results. My experience is that this misinterpretation is made in about every second talk I hear and also appears in a lot of published work.

Let me point you to an example in a prominent sociological journal. The references are:

Wu, Xiaogang and Yu Xie. 2003. "Does the Market Pay Off? Earnings Returns to Education in Urban China." American Sociological Review 68:425-42.

and the comment

Jann, Ben. 2005. "Earnings Returns to Education in Urban China: A Note on Testing Difference among Groups." American Sociological Review 70:860-864.

## March 19, 2007

### Curve fitting on the web

We once collected the following data from a certain chemical process:

The curve looks smooth and could be governed by some meaningful physical law. However, what would be a good model? There are probably quite a number of physical laws that would fit the observed data very well. Wouldn't it be nice if a piece of software would examine a large number of known physical laws and check them against this data? ZunZun.com is such a piece of software, and it runs directly from the browser. After plugging my data in, ZunZun gave me a ranked list of functions that fit it, and the best ranked was the "Gaussian peak with offset" (y = a * e^(-0.5 * (x-b)^2 / c^2) + d):

Number two was "Sigmoid with offset" (y = a / (1.0 + e^(-(x-b)/c)) + d).
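As an illustration of the winning functional form (with synthetic stand-in data, since the original measurements aren't reproduced here), the same fit can be done with scipy:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian_peak(x, a, b, c, d):
    """ZunZun's top-ranked form: Gaussian peak with offset."""
    return a * np.exp(-0.5 * (x - b) ** 2 / c ** 2) + d

# Synthetic data resembling a smooth peaked process (true params 2, 5, 1.5, 0.5)
rng = np.random.default_rng(3)
x = np.linspace(0, 10, 50)
y = gaussian_peak(x, 2.0, 5.0, 1.5, 0.5) + rng.normal(0, 0.05, x.size)

params, _cov = curve_fit(gaussian_peak, x, y, p0=[1.0, 4.0, 1.0, 0.0])
print(np.round(params, 1))   # recovers roughly [2.0, 5.0, 1.5, 0.5]
```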

All in all, ZunZun may help you find a good nonlinear model when all you have is data.

Posted by Aleks Jakulin at 4:06 PM | Comments (8) | TrackBack

## March 15, 2007

### Meta-analysis of music reviews?

Parsefork is an aggregator of music reviews, which reminds me of Metacritic. In plain words, it collects reviews of music from different sources and presents them as an easily navigable and searchable web database. The main shift in the past few years is the abundance of specialized structured databases available on the internet. Back in the 90's, most internet "databases" were based on text.

This data would be a great setting for multilevel modeling, as each review refers to an artist, an album, a piece, a magazine, and a reviewer. There are quite a few things that multilevel modeling could do for such websites. Here is an example of the reviews for a single album, Happy Songs for Happy People:

Posted by Aleks Jakulin at 10:11 AM | Comments (1) | TrackBack

## March 6, 2007

### Classifying classifiers

Here's an interesting classification (from John Langford) of statistical methods from a machine learning perspective. The only thing that bothers me is the conflation of statistical principles with computational methods. For example, the table lists "Bayesian learning," "graphical/generative models," and "gradient descent" (among others) as separate methods. But gradient descent is an optimization algorithm, while Bayesian decision analysis is an approach that tells you what to optimize (given assumptions). And I don't see how he can distinguish graphical/generative models from "pure Bayesian systems." Bayesian models are almost always generative, no? This must be a language difference between CS and statistics.

Anyway, it's an interesting set of comparisons. And, as Aleks points out, we're trying our best to reduce the difficulties of the Bayesian approach (in particular, the difficulty of setting up models that are structured enough to capture real patterns but weak enough to learn from the data).

## March 5, 2007

### Same topics, different languages

Aleks sent along this announcement:

Call for participation

The AAAI-2007 Workshop on Evaluation Methods for Machine Learning II

Description:
============
The purpose of this workshop is to continue the lively and interesting debate started last year at the AAAI 2006 workshop on Evaluation Methods for Machine learning. The previous workshop was successful on the following points:

- It established that the current means of evaluating learning algorithms has some serious drawbacks.
- It established that there are many important properties of algorithms that should be measured, requiring more than a single evaluation metric.
- It established that algorithms must be tested under many different conditions.
- It established that the UCI data sets do not reflect the variety of domains to which algorithms are applied in practice.

For this year's workshop, we are intending to address, in a more specific fashion, some of the topics that were raised at last year's workshop and some new ones. Last year's participants are invited to submit papers reflecting the evolution of their views within the year, or new ideas. Researchers, or practitioners, interested in this issue who did not get a chance to submit a paper last year are invited to do so this year.

Topics:
=======
We invite position papers and technical papers addressing three main topics, and their subtopics:

Evaluation Metrics:
* The efficacy of existing evaluation metrics
* The need for new metrics
* Useful metrics from the other fields
* Single number summaries versus curves
* New performance visualization methods

Statistical Issues:
* Cross-validation vs. bootstrap
* Bias vs. variance
* Parametric vs. non-parametric
* The power of tests
* Sampling methods
* Multiple comparisons

Data Sets:
* The (over) use of community repositories such as UCI
* Concept Drift
* Synthetic, or semi-synthetic, data sets

Format and Attendance:
======================
The workshop will consist of:

Invited talks: We are planning to have several invited speakers. Some, from outside of our research community, will be able to criticize our accepted practices from an external point of view. Some, from inside our community, will discuss how we could improve on our current practices.

Panel Discussion: Our invited speakers will be asked to engage each other on the various issues surrounding the problem of evaluation in Machine Learning at the end of the workshop. The audience will be strongly encouraged to participate in the discussion.

Presentations: The papers accepted to the workshop will be presented throughout the day between the various invited talks. Papers will be grouped by theme, in order to facilitate discussion at the end of each session.

Workshop attendance is open to the public and is estimated at 20-25 attendees. Priority will be given to those active participants in the workshop (paper authors or speakers).

Submission:
===========
Authors are invited to submit papers on the topics outlined above or other related issues. Both position and technical reports will be considered for this workshop. To promote a lively event, with plenty of discussion, the organizers are very interested in papers taking strong positions on the issues listed above. Workshop papers should not exceed 6 pages using the AAAI Style. Submissions should be made electronically in PDF or Postscript format and should be sent (no later than April 1, 2007) by email to William Elazmeh:

welazmeh@site.uottawa.ca

Resubmissions and position papers (of at least 1 page) are welcome.

Important Dates:
================
* Submission deadline: April 1, 2007
* Notification date: April 25, 2007
* The workshop date: July 22 or 23, 2007.
* AAAI-07 conference is to be held on July 22-26, 2007, in Vancouver, British Columbia, Canada.

Organizers:
===========
Chris Drummond
NRC Institute for Information Technology, Canada
Email: chris.drummond@nrc-cnrc.gc.ca

William Elazmeh (main contact)
School of Information Technology and Engineering, University of Ottawa,
800 King Edward Ave., room 5-105, Ottawa, Ontario, K1N 6N5 Canada
Telephone: +1 (613) 562 5800 Extension 6699
Fax.: +1 (613) 562 5664
Email: welazmeh@site.uottawa.ca

Nathalie Japkowicz
School of Information Technology and Engineering
Email: nat@site.uottawa.ca

Sofus A. Macskassy
Fetch Technologies
United States of America
Email: sofmac@fetch.com

Program Committee:
==================
Nitesh Chawla, University of Notre Dame, USA
Peter Flach, University of Bristol, UK
Alek Kotcz, Microsoft Research, USA
Nicolas Lachiche, Louis Pasteur University, France
Sylvain Letourneau, National Research Council, Canada
Dragos Margineantu, The Boeing Company, Phantom Works, USA
Stan Matwin, University of Ottawa, Canada
Kiri Wagstaff, NASA Jet Propulsion Laboratory, USA
Bianca Zadrozny, Fluminense Federal University, Brazil

Workshop Website:
=================
http://www.site.uottawa.ca/~welazmeh/conferences/AAAI-07/workshop/

## February 14, 2007

### Here's a paragraph from a recent referee report that I could really attach to just about every review I ever do

Finally, I'd like to see some graphs of the raw data, along with replicated datasets from the model. The paper admirably connects the underlying problem to the statistical model; however, the Bayesian approach requires a lot of modeling assumptions, and I'd be a lot more convinced if I could (a) see some of the data and (b) see that the fitted model would produce simulations that look somewhat like the actual data. Otherwise we're taking it all on faith.

### Coin flips and super bowls

A colleague heard this and commented that I considered various alternative possibilities, for example ten straight heads or ten straight tails, but not other possibilities such as alternation between heads and tails. He also wondered if I was too confident in saying I could correctly identify the true and fake sequences and suggested I could report my posterior probability of getting the correct identification.

The interview was taped, cut, and put back together before airing. The main effect of this was to make me sound more coherent and poised than I actually am, but a couple of points did get lost.

First, I actually did mention situations when I've guessed wrong, either because the real coin flips happened to have a lot of alternation, or because the kids creating the fake flips actually knew the secret and created long runs.

Second, even beyond it being a 1/250 chance of 10 straight heads, 10 straight tails, 10 AFC coin-flip wins, or 10 NFC coin-flip wins, there have also been over 30 sequences of 10 super bowls--and 30 chances at a 1/250 shot isn't so extreme at all. I actually did think of the case of alternation of heads/tails, or wins/losses, but it seemed to me that such a pattern might not be noticed as exceptional, so I didn't mention it.
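
A quick back-of-the-envelope version of that arithmetic (a sketch; the 1/250 figure is the same rounding as in the post):

```python
# Chance that a sequence of 10 fair coin flips comes out all heads, all
# tails, all AFC wins, or all NFC wins: 4 * (1/2)**10 = 1/256, roughly
# the 1/250 figure above.
p_streak = 4 * (1 / 2) ** 10

# With 30-odd sequences of 10 consecutive super bowls to look at, the
# chance of at least one such streak somewhere is not small at all.
p_at_least_one = 1 - (1 - p_streak) ** 30

print(round(p_streak, 5))        # 0.00391, about 1/256
print(round(p_at_least_one, 2))  # 0.11, about an 11% chance
```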

So, yes, I agree that this sort of classical significance-testing reasoning includes a lot of uncomfortable psychologizing and speculation, and indeed it's a motivation for Bayesian inference. But really I think of this more as a probability example than a statistics example, and typically the discrimination between real and fake sequences is clear enough.

I'd be happy to report a posterior probability--the ideal approach would be to actually do the demo a few hundred times in various classrooms (you can have students work in small groups, rather than simply dividing the class in half, so as to get several fake sequences per class--but, for reasons discussed in our book, I wouldn't have the students do the sequences individually), and when I get a pair of sequences, one real and one fake, give my subjective judgment as to which is the fake one, along with a numerical measure of my strength of certainty (e.g., on a 1-10 scale). Then, with a few hundred examples, I could easily fit a model to calibrate my certainties onto a probability scale.
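
A sketch of what that calibration might look like, with entirely invented data and a deliberately crude gradient-ascent logistic fit (a real analysis would use a standard logistic-regression routine):

```python
import numpy as np

# Invented data: 300 real-vs-fake judgments, each with a 1-10 certainty
# score and an indicator of whether the identification was correct.
rng = np.random.default_rng(0)
certainty = rng.integers(1, 11, size=300).astype(float)
# Simulate "correct" with probability rising in certainty (assumed form).
true_prob = 1 / (1 + np.exp(-(0.5 * certainty - 2.0)))
correct = rng.binomial(1, true_prob)

# Crude fit of logit(P(correct)) = a + b * certainty by gradient ascent.
a, b = 0.0, 0.0
for _ in range(5000):
    p = 1 / (1 + np.exp(-(a + b * certainty)))
    a += 0.01 * np.mean(correct - p)
    b += 0.001 * np.mean((correct - p) * certainty)

# Calibrated probability of a correct call at a reported certainty of 8.
p8 = 1 / (1 + np.exp(-(a + b * 8.0)))
```

The fitted curve maps each point on the 1-10 scale to an estimated probability of a correct identification, which is exactly the calibration described above.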

Finally, I didn't get a chance to credit Phil Stark, who told Deb and me about this demo. I did, however, tell them that the example is well known in statistics teaching.

P.S. The demo is described in this book and also in this article.

## February 7, 2007

### Models for compositional data

Joe asks about models for compositional data. We have an example in this paper on toxicology from 1996 (also briefly covered in the "nonlinear models" chapter in BDA). In the paper, it's discussed in the last paragraph of p.1402 and continues on to p.1403.

## January 30, 2007

### Dr. Bob Sports

Here's another one from Chance News (scroll down to "Hot streaks rarely last"): John Gavin refers to an interesting article from the Wall Street Journal by Sam Walker about a guy called Dr. Bob who has a successful football betting record:

YEAR WIN/LOSS/TIE %
1999 49-31-1 61%
2000 47-25-0 65%
2001 35-28-0 56%
2002 49-44-3 53%
2003 46-55-2 46%
2004 55-34-1 62%
2005 51-21-2 71%
2006 45-34-3 57%

Gavin writes,

What separates Mr. Stoll from other professionals, and makes him so frightening to bookmakers, is that he distributes his bets to the public, for a fee. All that pandemonium on Thursdays was no coincidence: that's the day Mr. Stoll sends an email to his subscribers telling them which college football teams to bet on the following weekend. This makes it very difficult for bookmakers to maintain a balanced book.

His website (http://www.drbobsports.com/) discusses the tools he uses to analyze football games: a mathematical model to project how many points each team is likely to score in a coming matchup. He makes unapologetic use of terms like variances, square roots, binomials, and standard deviations. Much of his time is spent making tiny adjustments. If a team lost 12 yards on a running play, he checks the game summary to make sure it wasn't a botched punt. He compensates for the strength of every team's opponents. It takes him eight hours just to calculate a rating he invented to measure special teams. Trivial as this seems, Mr. Stoll says the extra work makes his predictions 4% better.

I have nothing to add except that I always think this sort of thing is cool.

## January 24, 2007

### Data and Web 2.0: "many eyes" from IBM

Jeff Heer reports that IBM has released their Many Eyes platform for browser-based data analysis. I have already written about Swivel, and there is another similar system called Data 360. However, Many Eyes seems to be the most impressive of the three, with very clean visualizations and numerous types of graphs, including, for example, social networks and maps.

Posted by Aleks Jakulin at 10:38 AM | Comments (1) | TrackBack

## January 18, 2007

Jeff Lax pointed me to this online article by Jeanna Bryner:

Higher education tied to memory problems later, surprising study finds

Going to college is a no-brainer for those who can afford it, but higher education actually tends to speed up mental decline when it comes to fumbling for words later in life.

Participants in a new study, all more than 70 years old, were tested up to four times between 1993 and 2000 on their ability to recall 10 common words read aloud to them. Those with more education were found to have a steeper decline over the years in their ability to remember the list, according to a new study detailed in the current issue of the journal Research on Aging. . . .

As Jeff pointed out, they only consider the slope and not the intercept. Perhaps the college graduates knew more words at the start of the study?

Here's a link to the study by Dawn Alley, Kristen Southers, and Eileen Crimmins. Looking at the article, we see "an older adult with 16 years of schooling or a college education scored about 0.4 to 0.8 points higher at baseline than a respondent with only 12 years of education." Based on Figures 1 and 2 of the paper, it looks like higher-educated people know more words at all ages, hence the title of the news article seems misleading.
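
To see Jeff's point numerically (all numbers invented for illustration, except that the 0.6-point head start sits in the study's reported 0.4-0.8 range):

```python
# If the better-educated group starts higher but declines faster, it can
# still score higher at every age covered by the study.
baseline_gap = 0.6              # head start in recall score (invented)
extra_decline_per_year = 0.02   # steeper decline per year (invented)

# Years until the two trajectories would cross -- far past the study's
# 1993-2000 observation window.
years_to_crossover = baseline_gap / extra_decline_per_year  # about 30
```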

The figures represent summaries of the fitted models. I'd like to see graphs of the raw data (for individual subjects in the study and for averages). It's actually pretty shocking to me that in a longitudinal analysis, such graphs are not shown.

## January 4, 2007

### Measuring model fit

Ahmed Shihab writes,

I have a quick question on clustering validation.

I am interested in the problem of measuring how well a given data set fits a proposed GMM [Gaussian mixture model]. As opposed to the notion of comparing models, this "validation" idea asserts that a GMM already represents a specific mixture of distributions, it already represents an absolute, so we can find out directly if the data fits that representation or not.

In fuzzy clustering, such validity measures abound. But it struck me that in the probabilistic world of GMMs our only measure is the actual sum of probabilities given by the GMM. The closer it is to one, the better. However, if the sum is, say, 0.69, it can be misleading; when the clusters do not match in population, the bigger cluster, even though it fits badly, adds substantially to the overall probability score, so the overall impression is that there is a good fit.

My response: I don't have much experience with these models, but I recommend simulating replicated datasets from the fitted model and comparing them (visually, and using numerical summaries) to the observed data, as discussed in Chapter 6 of Bayesian Data Analysis.

My other comment is that clusters typically represent choices rather than underlying truth. For an extreme example, consider a simulation of 10,000 points from a unit bivariate normal distribution. This can certainly be considered as a single cluster, but it can also be divided into 50 or 100 or 200 little clusters (e.g., via k-means or any other clustering algorithm). Depending on the purpose, any of these choices can be useful. But if you have a generative model, then you can check it by comparing replications to the data.
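
A minimal sketch of that replication check, with an invented two-component mixture standing in for the fitted GMM and the standard deviation as the (arbitrary) test statistic:

```python
import numpy as np

rng = np.random.default_rng(1)

# "Fitted" mixture: weights, means, sds -- stand-ins for real estimates.
weights, means, sds = [0.7, 0.3], [0.0, 4.0], [1.0, 0.5]

def simulate(n):
    comp = rng.choice(2, size=n, p=weights)
    return rng.normal(np.array(means)[comp], np.array(sds)[comp])

observed = simulate(500)  # here we pretend this is the real data
obs_stat = np.std(observed)

# Simulate replicated datasets from the fitted model and see where the
# observed statistic falls among the replications.
rep_stats = np.array([np.std(simulate(500)) for _ in range(1000)])
p_value = np.mean(rep_stats >= obs_stat)
```

If the observed statistic sits far in the tail of the replicated statistics, that particular aspect of the data is not captured by the fitted mixture; in practice one would also overlay histograms of the replications against the data.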

## January 2, 2007

### Data Analysis Using Regression and Multilevel/Hierarchical Models

Our book is finally out! (Here's the Amazon link) I don't have much to say about the book here beyond what's on its webpage, which has some nice blurbs as well as links to the contents, index, teaching tips, data for the examples, errata, and software.

But I wanted to say a little about how the book came to be.

When I spoke at Duke in 1997--two years after the completion of the first edition of Bayesian Data Analysis--Mike West asked me when my next book was coming out. At this time, I was teaching statistical modeling and data analysis to the Ph.D. statistics students and was realizing that there were all sorts of things that I had thought were common knowledge--though they were not really written in any book--that the students were struggling with. These skills included:

- Simple model building--for example, taking the logarithm when appropriate, building regression models by combining predictors rather than simply throwing them in straight from the raw data file.
- Simulation--for example, I had an assignment to forecast legislative elections from 1986 by district, using the 1984 data as a predictor, then to use this model to predict 1988, then get an estimate and confidence interval for the number of seats won by the Democrats in 1988. This is straightforward using regression and simulation, but none of the students even thought of doing simulation. They all tried to do it with point predictions, thus getting wrong results.
- Hypothesis testing as a convenient applied tool. For example, looking at the number of boys and girls born in each month over two years in a city, and using a chi^2 test to check for evidence of over- or under-dispersion.
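
The dispersion check in the last item can be sketched as follows (monthly counts invented for illustration):

```python
import numpy as np

# Invented data: births per month over 24 months, and boys among them,
# simulated here with a constant true proportion of boys.
rng = np.random.default_rng(2)
n = rng.integers(300, 400, size=24)   # total births per month (invented)
boys = rng.binomial(n, 0.512)         # boys per month

# Chi-squared statistic: sum_i (y_i - n_i*p_hat)^2 / (n_i*p_hat*(1-p_hat)).
# Under a constant proportion this is roughly chi^2 with 23 df (mean 23).
p_hat = boys.sum() / n.sum()
expected = n * p_hat
chi2 = np.sum((boys - expected) ** 2 / (n * p_hat * (1 - p_hat)))
# Values far above 23 would suggest overdispersion (month-to-month
# variation in the proportion); values far below, underdispersion.
```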

And a bunch of other things, including the use of regression in causal inference, how randomized experiments work, practical model checking, discrete models other than the logit, etc etc. (Students from back then will recall the examples from the homeworks: the elections, the chickens, the dogs, the TV show, etc., most of which have made their way into the new book.) No book covered this stuff. I tried Snedecor and Cochran, but this just described methods without much explanation. Cox and Snell's Applied Statistics book looked good, but students got nothing out of it--I think that book is great for people who already know applied statistics but not so useful for people who are trying to learn the topic.

So I thought I'd write a book called "Introduction to Data Analysis," a prequel to Bayesian Data Analysis, with all the important things that I thought students should already know before getting into serious modeling. (I also had plans to discuss the steps of practical data analysis, including how to set up a problem, and a bunch of other things that I can't remember. This never led anywhere, but at some point I'd like to pick that up again.) I took some notes and thought occasionally about how to put the book together.

The next step came in 2002, when I was talking with Hal Stern and he suggested that "Intro to Data Analysis" (or, as he put it, "All about Andy") wasn't enough of a unifying principle. We discussed it and came up with the idea of structuring the book around regression. I liked this idea, especially given Gary King's comment from several years earlier that stat books tend to spend lots of time on simple models that aren't so useful. I loved the idea of starting with regression right away, rather than mucking around with all those silly iid models.

(Just as an aside: I really really hate when textbooks give inference for iid Poisson data. I don't think I've ever seen such a thing: multiple observations of Poisson data with a constant mean. Somebody will probably correct me on this, but I think it just doesn't happen. I have to admit that we do give this model in BDA, but we immediately follow it with the more realistic model of varying exposures.)

Anyway, back to the book: starting with the good stuff is definitely the way to go. I tried to follow the book-writing rule of "tell 'em what they don't know." It's supposed to be a "good parts" version (as William Goldman would say) of regression. It's still pretty long, but that's because regression has lots of good parts. Having Jennifer as a collaborator helped a lot, giving a second perspective on everything in addition to her special expertise in causal inference.

### The 1/4-power transformation

I like taking logs, but sometimes you can't, because the data have zeroes. So sometimes I'll take the square root (as, for example, for the graphs on the cover of our book). But often the square root doesn't seem to go far enough. Also, it lacks the direct interpretation of the log. Peter Hoff told me that Besag prefers the 1/4 power. This seems like it might make sense in practice, although I don't really have a good way of thinking about it--except maybe to think that if the dynamic range of the data isn't too large, it's pretty much like taking the log. But then why not the 1/8 power? Maybe then you get weird effects near zero? I haven't ever really seen a convincing treatment of this problem, but I suspect there's some clever argument that would yield some insight.

Here's my first try: a set of plots of various powers (1/8, 1/4, 1/2, and 1, i.e., eighth root, fourth root, square root, and linear), each plotted along with the logarithm, from 0 to 50.

OK, here's the point. Suppose you have data that mostly fall in the range [10, 50]. Then the 1/4 power (or the 1/8 power) is a close fit to the log, which means that the coefficients in a linear model on the 1/4-power or 1/8-power scale can be interpreted as multiplicative effects.

On this scale, the difference between either of these powers and the log occurs at the low end of the scale. As x goes to 0, the log goes to negative infinity, while the powers go to zero. The big difference between the 1/8 power and the 1/4 power is that the x-points near 0 are mapped much farther away from the rest of the data under the 1/4 power than under the 1/8 power.

An argument in favor of the 1/4-power transformation thus goes as follows:

First, the 1/4 power maps closely to the log over a reasonably large range of data (a factor of 5, for example from 10 to 50). Thus, an additive model on the 1/4-power scale approximately corresponds to a multiplicative model, which is typically what we want. (In contrast, the square-root does not map so well, and so a model on that scale is tougher to interpret.)

Second, on the 1/4-power scale, the x-points near zero map reasonably close to the main body of points. So we're not too worried about these values being unduly influential in our analysis. (In contrast, the 1/8-power takes x=0 and puts it so far away from the other data that I could imagine it messing up the model.)

Could this argument be made more formally? I've never actually used the 1/4 power, but I'm wondering if maybe it really is a good idea.
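
Here is one way to check the first step of the argument numerically--a sketch in which "closeness to the log" is measured by the residuals from the best linear fit in log x, scaled by the range of the transformed data:

```python
import numpy as np

# Over x in [10, 50], how closely can a linear function of log(x)
# reproduce x**p for various powers p?
x = np.linspace(10, 50, 400)
log_x = np.log(x)

def max_rel_resid(p):
    y = x ** p
    # Least-squares fit y ~ a + b*log(x); np.polyfit returns [slope, intercept].
    b, a = np.polyfit(log_x, y, 1)
    resid = y - (a + b * log_x)
    return np.max(np.abs(resid)) / (y.max() - y.min())

# The lower the power, the more closely x**p tracks the log on [10, 50].
r = {p: max_rel_resid(p) for p in (1/8, 1/4, 1/2, 1)}
```

On this measure the 1/8 power hugs the log even more tightly than the 1/4 power, so the case for 1/4 over 1/8 has to rest on the second step of the argument: the behavior near zero.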

P.S. Just to clear up a few things: the purpose of a transformation, as I see it, is to get a better fit and better interpretability for an additive model. Students often are taught that transformations are about normality, or about equal variance, but these are chump change compared to additivity and interpretability.

## December 29, 2006

### Chess ratings: some thoughts on the Glicko system

I ran across this interesting interview with Mark "Smiley" Glickman. He discusses the Glicko and Glicko-2 rating systems which are based on dynamic Bayesian models. From a statistical perspective, some of the most interesting discussion comes near the middle of the interview where he discusses the chess federation's ongoing project to monitor average ratings, and the challenge of comparing ratings of people in different years. The bit at the very end is also interesting--it reminds me of the claim I once heard that a chess player, if given the option of being a better player or having a higher rating, would choose the higher rating. One of the difficulties of numerical ratings or rankings is that people can take them too seriously, and Glickman discusses this.

## December 28, 2006

### High-dimensional data analysis

I came across this talk by David Donoho (see also here for more detail) from 2000. I was disappointed to see that he scooped me on the phrase "blessing of dimensionality" but I guess this is not such an obscure idea.

More interesting are the different perspectives that one can have on high-dimensional data analysis. Donoho's presentation (which certainly is still relevant six years later) focuses on computational approaches to data analysis and says very little about models. Bayesian methods are not mentioned at all (the closest is slide #44, on Hidden Components, but no model is specified for the components themselves). It's good that there are statisticians working on different problems using such different methods.

Donoho also discusses Tukey's ideas of exploratory data analysis and discusses why Tukey's approach of separation from mathematics no longer makes sense. I agree with Donoho on this, although perhaps from a different statistical perspective: my take on exploratory data analysis is that (a) it can be much more powerful when used in conjunction with models, and (b) as we fit increasingly complicated models, it will become more and more helpful to use graphical tools (of the sort associated with "exploratory data analysis") to check these models. As a latter-day Tukey might say, "with great power comes great responsibility." See this paper and this paper for more on this.

I was also trying to understand the claim on page 14 of Donoho's presentation that the fundamental roadblocks of data analysis are "only mathematical." From my own experiences and struggles (for example, here), I'd interpret this from a Bayesian perspective as a statement that the fundamental challenge is coming up with reasonable classes of models for large problems and large datasets--models that are structured enough to capture important features of the data but not so constrained as to restrict the range of reasonable inferences. (For a non-Bayesian perspective, just replace the word "model" with "method" in the previous sentence.)

## December 23, 2006

### Distributions of rankings

A few postings ago, Andrew wondered about the shape of the long tail. OneEyedMan's comment reminded us that the extensive NetFlixPrize dataset contains information about almost half a million users' ratings on almost 20000 movies. It's an excellent playground, although I was told that the data was corrupted.

So, I was happy to notice Ilya Grigorik's analysis of the distributions in the dataset. In particular, the average user's ratings seem to be centered at 3.8 (on a scale of 1-5), indicating that people do try to watch movies they like. But the uneven distribution of score variance across users indicates that one could model the type of user, perhaps with a mixture model.

I must also note that NetFlix users have an incentive to score movies even with lukewarm scores, which moderates the above distribution. On most internet sites that allow users to rank content, the extreme scores (1 or 5) are overrepresented: some people make the effort to write a review only when they are very unhappy and want to punish someone, or when they are very happy and want to reward or recommend the work to others.

Another interesting source of rating distributions is the Interactive Fiction Competition results page: it has numerous histograms of scores for individual IF works.

Posted by Aleks Jakulin at 1:19 PM | Comments (0) | TrackBack

## December 13, 2006

### Interpreting p-values

Dan Goldstein made this amusing graph in discussing this paper.

## December 7, 2006

### Calibration accuracy for binary-data prediction

Jose von Roth writes,

I am creating a logistic model on 230 cases (4 categorical explanatory variables; about 25% of the cases are 1s, and 75% are 0s in the dependent variable), and I get an accuracy of 65%. As a further validation, I bootstrapped the 230-case sample 1000 times (with replacement) and ran the obtained model through those 1000 samples, getting accuracies in the range of 57% to 68%. Is that an acceptable validation method? Or is bootstrapping "without" replacement and fewer cases better? Or is this kind of validation in general wrong? (The problem is that I have no test sample.)

I'm a little confused here. How can you get an accuracy of 65% when simply predicting 0 all the time gives an accuracy of 75%? This doesn't sound like such a great model...
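
The arithmetic behind that baseline, for the record:

```python
# With about 25% ones and 75% zeros, the trivial rule "always predict 0"
# already beats the reported model.
n = 230
ones = 58              # roughly 25% of the 230 cases
zeros = n - ones       # 172 cases

baseline_accuracy = zeros / n   # accuracy of always predicting 0
model_accuracy = 0.65

print(round(baseline_accuracy, 3))          # 0.748
print(model_accuracy < baseline_accuracy)   # True
```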

## December 5, 2006

### Swivel: Web 2.0 and Data

The latest craze on the internet is the migration of applications from the desktop to the Web. The newest entrant is "Swivel": the internet archive for data, something I have written about before. While there is not much to be seen at the site yet, TechCrunch has some intriguing snapshots:

I gather that one can upload data, access data that others have posted, and perform some simple types of analysis. It might not sound like much, but having a shared database of data will remove the need for people to provide summaries of it: anyone interested in a problem can compute the summaries for themselves. This will make data analysis much more approachable than before. It could also become competition for existing spreadsheet and statistical software, and a platform for deploying recent research: it is often frustrating for a researcher in statistical methodology how difficult it is to actually enable users to benefit from the most recent advances.

Posted by Aleks Jakulin at 9:51 AM | Comments (4) | TrackBack

### Sample size and self-efficiency

Xiao-Li Meng is speaking this Friday 2pm in the biostatistics seminar (14th Floor, Room 240, Presbyterian Hospital Bldg, 622 West 168th Street). Here's the abstract:

One of the most frequently asked questions in statistical practice, and indeed in general quantitative investigations, is "What is the size of the data?" A common wisdom underlying this question is that the larger the size, the more trustworthy are the results. Although this common wisdom serves well in many practical situations, sometimes it can be devastatingly deceptive. This talk will report two such situations: a historical epidemic study (McKendrick, 1926) and the most recent debate over the validity of multiple-imputation inference for handling incomplete data (Meng and Romero, 2003). McKendrick's mysterious and ingenious analysis of an epidemic of cholera in an Indian village provides an excellent example of how an apparently large-sample study (e.g., n=223), under a naive but common approach, turned out to be a much smaller one (e.g., n<40) because of hidden data contamination. The debate on multiple imputation reveals the importance of the self-efficiency assumption (Meng, 1994) in the context of incomplete-data analysis. This assumption excludes estimation procedures that can produce more efficient results with less data than with more data. Such procedures may sound paradoxical, but they indeed exist even in common practice. For example, the least-squares regression estimator may not be self-efficient when the variances of the observations are not constant. The moral of this talk is that in order for the common wisdom "the larger the better" to be trusted, we not only need to assume that the data analyst knows what s/he is doing (i.e., an approximately correct analysis), but more importantly that s/he is performing an efficient, or at least self-efficient, analysis.

This reminds me of the blessing of dimensionality, in particular Scott de Marchi's comments and my reply here. I'm also reminded of the time at Berkeley when I was teaching statistical consulting, and someone came in with an example with 21 cases and 16 predictors. The students in the class all thought this was a big joke, but I pointed out that if they had only 1 predictor, it wouldn't seem so bad. And having more information should be better. But, as Xiao-Li points out (and I'm interested to hear more in his talk), it depends what model you're using.

I'm also reminded of some discussions about model choice. When considering the simpler or the more complicated model, I'm with Radford that the complicated model is better. But sometimes, in reality, the simple model actually fits better. Then the problem, I think, is with the prior distribution (or, equivalently, with estimation methods such as least squares that correspond to unrealistic and unbelievable prior distributions that do insufficient shrinkage).

## November 30, 2006

### The difference between significant and non-significant is not itself statistically significant

One of my favorite points arose in the seminar today. A regression was fit to data from the 2000 Census and the 1990 Census, a question involving literacy of people born in the 1958-1963 period. The result was statistically significant for 2000 and zero for 1990, which didn't seem to make sense, since presumably these people were not changing much in literacy between the ages of 30 and 40. The resolution of this apparent puzzle was . . . the difference between the estimates was not itself statistically significant! (The estimate was something like 0.003 (s.e. 0.001) for 2000, and 0.000 (s.e. 0.002) for 1990, so both data points were consistent with an estimate of 0.002, for example.) But at first sight, it really looked like there was a problem.
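
The arithmetic, using the rough numbers quoted above:

```python
import math

# The two estimates differ by 0.003, but the standard error of the
# difference combines both standard errors.
est_2000, se_2000 = 0.003, 0.001
est_1990, se_1990 = 0.000, 0.002

diff = est_2000 - est_1990
se_diff = math.sqrt(se_2000 ** 2 + se_1990 ** 2)
z = diff / se_diff

print(round(se_diff, 5))  # 0.00224
print(round(z, 2))        # 1.34 -- well under 2, so the difference
                          # between "significant" and "zero" is itself
                          # not statistically significant.
```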

P.S. This was a small effect, as can be seen from the fact that it was barely statistically significant even though it came from the largest dataset one could imagine: the Chinese census. Well, just a 1% sample of the Chinese census, but still . . .

### The state of the art in missing-data imputation

Missing-data imputation is to statistics as statistics is to research: a topic that seems specialized, technical, and boring--until you find yourself working on a practical problem, at which point it briefly becomes the only thing that matters, the make-it-or-break-it step needed to ensure some level of plausibility in your results.

Anyway, Grazia pointed me to this paper by a bunch of my friends (well, actually two of my friends and a bunch of their colleagues). I think it's the new state of the art, so if you're doing any missing-data imputation yourself, you might want to take a look and borrow some of their ideas. (Of course, I also recommend chapter 25 of our forthcoming book.)

P.S. Yes, their tables should be graphs. But, hey, nobody's perfect.

## November 22, 2006

### Named variables

I always tell students to give variables descriptive names, for example, define a variable "black" that equals 1 for African-Americans and 0 for others, rather than a variable called "race" where you can't remember how it's defined. The problem actually came up in a talk I went to a couple of days ago: a regression included a variable called "sex", and nobody (including the speaker) knew whether it was coding men or women.

P.S. Yet another example occurred a couple days later in a different talk (unfortunately I can't remember the details).

P.P.S. I corrected the coding mistake in the first version of the entry.

P.P.P.S. Check out Keith's story in the comments.

### Categorizing continuous variables

Jose von Roth writes,

I am working on a project where I have to create some logistic models.

I am trying to categorize continuous variables in the most efficient way. Do you know of some rule of thumb I could apply based on the total n?

My sample size is relatively small: 230 cases, each with 7 variables, 3 of them continuous (which have to be categorized). Because of the sample size, I found a recommendation to bootstrap the data, say, 1000 times, run logistic models for each of those 1000 samples, compute the average beta coefficients based on the average P values, and use those means as my final beta coefficients. Is that a valid approach to improve the correctness of the models?

Once I have my logistic models, can I rely on the accuracy rates if I test them on, say, another 1000 bootstrapped samples (with replacement)? Or is there any other validation technique you would suggest?

Finally, the idea of this project is to classify cases into 5 categories (calculating the probability that case "x" belongs to category "y"). I was considering using multicategory logistic models, but I am not sure whether it would be better to create 5 single dichotomous logistic models instead. Which approach is statistically more accurate? I get different overall accuracy rates, with the multicategory model always much higher than the 5 dichotomous models.

My response: first, it's not clear whether you're trying to predict these 7 variables or use them as predictors. In either case, I don't see why you want to divide the continuous variables into categories; I'd recommend just keeping them continuous. And if you do divide a predicted variable into 5 categories, I'd start with simple linear regression, only using the more complicated ordered categorical models if you really have to (as revealed, for example, by residual plots).

## November 14, 2006

### A course in search engines, also a comment about classes that meet once per week

Dragomir Radev (who's visiting Columbia this year from the University of Michigan) is teaching this interesting-looking course this spring on search engine technology. He said that one of the assignments will be to build your own search engine, and another might be to build a machine that can figure out what you're typing just by listening to the sound of your keystrokes. It looks pretty cool--too bad it's offered at 6pm, so I can't make it.

On a barely related note: the class meets once per week for two hours. I know that people often find this sort of once-a-week schedule convenient, but it's been my impression that research shows that people learn better from more frequently-scheduled classes. I usually do twice a week. I'd prefer meeting three times per week, but Columbia doesn't seem to do much of that anymore. According to Dave Krantz, when he was a student at Yale way back when, some of the courses would be Mon Wed Fri, and others would be Tues Thurs Sat. That's a bit too rigorous for my taste, but I wouldn't mind teaching some Mon Wed Fri courses.

P.S. Drago clarifies:

I [Drago] didn't say that I would give an assignment about recognizing what is typed. I said that this could be a nice course project.

The actual assignments that I am planning to give will be chosen from this list.

- build a search engine
- build a spam recognition system
- perform a text mining analysis of a large data set such as the Enron mail corpus or the Netflix movie recommendation data set
- perform an analysis of a large networked data set such as a snapshot of the web graph (e.g., write code to compute pagerank).

I would imagine that each student can only do 2 or 3 of them.

Then each student will also have to propose an independent course project where the sky is the limit. Some ideas include cross-lingual information retrieval, text summarization, question answering, identifying online communities, information propagation in blogs, etc.

## November 9, 2006

### A convenient random number generator

Adam Sacarny noticed some dice in my office today and we came up with a good idea for a web-accessible random number generator: put a die in a clear plastic box attached to something that can shake it. Train a video camera on the box, and pipe the feed to some digit-recognition software. Then whenever someone clicks a button on the website, it shakes the box and reads off the number. (We can use one of those 20-sided dice with each digit written twice, so we get a random number between 0 and 9.) Pretty convenient, huh?
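A quick simulation (Python; the die here is just `random.choice`, standing in for the physical box-shaker) confirms that a 20-sided die labeled with each digit twice gives uniform digits 0-9:

```python
import random
from collections import Counter

# A 20-sided die with each digit 0-9 written twice
faces = list(range(10)) * 2   # 20 faces, each digit appears exactly twice

random.seed(1)
rolls = [random.choice(faces) for _ in range(100000)]
counts = Counter(rolls)

# Each digit should come up about 10% of the time
freqs = {d: counts[d] / len(rolls) for d in range(10)}
```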

I'd like to set this up but I assume somebody's done it already. In any case it's not nearly as cool as that program that figures out what you're typing by listening to the sounds of the keystrokes.

## October 30, 2006

### Dartmouth postdoc in applied statistics

Joe Bafumi writes,

Dartmouth College seeks applicants for a post-doctoral fellowship in the area of applied statistics. Dartmouth is developing a working group in applied statistics, and the fellowship constitutes one part of this new initiative. The applied statistics fellow will be in residence at Dartmouth for the 2007-2008 academic year, will teach one 10-week introductory course in basic statistics, and will be expected to further his or her research agenda during his or her time at Dartmouth. Research specialty is open, but applicants should have sufficient interdisciplinary interest so as to be willing to engage different fields of study that rely on quantitative techniques. The fellow will receive a competitive salary with a research account. Dartmouth is an EO/AA employer and the college encourages applications from women and minority candidates. Applications will be reviewed on a rolling basis. Applicants should send a letter of interest, two letters of recommendation, one writing sample, and a CV to Michael Herron, Department of Government, 6108 Silsby Hall, Hanover, NH 03755.

This looks interesting to me. I suggested to Joe that they also invite visitors to come for a few days at a time to become actively involved in the research projects going on at Dartmouth.

## October 11, 2006

Paul Mason writes,

I have been trying to follow the Statistical Modeling, Causal Inference, and Social Science blog. I have had a continuing interest in statistical testing as an ex-Economics major and follower of debates in the philosophy of science. But I am finding it heavy going. Could you point me to (or post) some material for the intelligent general reader?

I'd start with our own Teaching Statistics: A Bag of Tricks, which I think would be interesting to learners as well. And I have a soft spot for our new book on regression and multilevel modeling. But perhaps others have better suggestions?

## October 5, 2006

### More thoughts on publication bias

Neal writes,

Thanks for bringing up the most interesting piece by Gerber and Malhotra and the Drum comment.

My own take is perhaps a bit less sinister but more worrisome than Drum's interpretation of the results. The issue is how "tweaking" is interpreted. Imagine a preliminary analysis that shows a key variable to have a standard error as large as its coefficient (in a regression). Many people would simply stop the analysis at that point. Now consider getting a coefficient one and a half times its standard error (or 1.6 times its standard error). We all know it is not hard at that point to try a few different specifications and find one that gives a magic p-value just under .05, hence earning the magic star. And of course the magic star seems critical for publication.

Thus I think the problem lies with journal editors and reviewers who love that magic star, and also with authors who think it matters whether t is 1.64 or 1.65. Journal editors could (and should) correct this.

When Political Analysis went quarterly we got it about a third right. Our instructions are:

"In most cases, the uncertainty of numerical estimates is better conveyed by confidence intervals or standard errors (or complete likelihood functions or posterior distributions), rather than by hypothesis tests and p-values. However, for those authors who wish to report "statistical significance," statistics with probability levels of less than .001, .01, and .05 may be flagged with 3, 2, and 1 asterisks, respectively, with notes that they are significant at the given levels. Exact probability values may always be given. Political Analysis follows the conventional usage that the unmodified term "significant" implies statistical significance at the 5% level. Authors should not depart from this convention without good reason and without clearly indicating to readers the departure from convention."

Would that I had had the guts to drop "In most cases" and stop after the first sentence. And even better would have been to simply demand a confidence interval.

Most (of the few) people I talk with have no difficulty distinguishing "insignificant" from "equals zero," but Jeff Gill in his "The Insignificance of Null Hypothesis Significance Testing" (Political Research Quarterly, 1999) has a lot of examples showing I do not talk with a random sample of political scientists. Has the world improved since 1999?

BTW, since you know my obsession with what Bayes can or cannot do to improve life: this whole issue is, in my mind, the big win for Bayesians. Anything that lets people not get excited or depressed depending on whether a CI (er, HPD credible region) is (-.01, 1.99) or (.01, 2.01) has to be good.

My take on this: I basically agree. In many fields, you need that statistical significance--even if you have to try lots of tests to find it.
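Neal's point about trying a few specifications can be illustrated with a crude simulation, under the simplifying (and unrealistic) assumption that different specifications give independent tests; in practice the specifications are correlated, so the inflation is smaller than this, but the direction is the same:

```python
import math
import random

random.seed(2)

def two_sided_p(z):
    # two-sided p-value for a standard normal test statistic
    return math.erfc(abs(z) / math.sqrt(2))

n_sims, n_specs = 10000, 5
hits = 0
for _ in range(n_sims):
    # crude stand-in: 5 independent null specifications per paper
    ps = [two_sided_p(random.gauss(0, 1)) for _ in range(n_specs)]
    if min(ps) < 0.05:
        hits += 1

rate = hits / n_sims   # roughly 1 - 0.95**5 = 0.23, not the nominal 0.05
```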

## September 22, 2006

### Hypothesis testing and conditional inference

Ben Hansen sent me this paper by Donald Pierce and Dawn Peters, about which he writes:

I [Ben] stumbled on the attached paper recently, which puts forth some interesting ideas relevant to whether very finely articulated ancillary information should be conditioned upon or coarsened. These authors' views are clearly that it should be coarsened, and I have the impression the higher-order-asymptotics/conditional inference people favor that conclusion.

The background on this is as follows:

1. I remain confused about conditioning and testing. I hate the so-called exact test (except for experiments that really have the unusual design of conditioning on both margins; see Section 3.3 of this paper from the International Statistical Review).

2. So I'd like to just abandon conditioning and ancillarity entirely. The principles I'd like to hold (following chapters 6 and 7 of BDA) are to do fully Bayesian inference (conditional on a model) and then use predictive checking (based on the design of data collection) to check the fit.

3. But when talking with Ben on the matter, I realized I was still confused. Consider the example of a survey where we gather a simple random sample of size n, fit a normal distribution, and then test for skewness (using the standard test statistic: the sample third moment, divided by the sample variance to the 3/2 power). The trick is that, in this example, the sample size is determined by a coin flip: if heads, n=20; if tails, n=2000. Based on my general principles (see immediately above), the reference distribution for the skewness test will be a mixture of the n=20 and the n=2000 distributions. But this seems a little strange: for example, what if we see n=2000--shouldn't we make use of that information in our test?

4. In this particular example, I think I can salvage my principles by considering a two-dimensional test statistic, where the first dimension is that skewness measure and the second dimension is n. Then the decision to "condition on n" becomes a cleaner (to me) decision to use a particular one-dimensional summary of the two-dimensional test statistic when comparing to the reference distribution.

Anyway, I'm still not thrilled with my thinking here, so perhaps the paper by Pierce and Peters will help. Of course, I don't really care about getting "exact" p-values or anything like that, but I do want a general method of comparing data to replications from the assumed model.
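The setup in items 3 and 4 can be sketched directly (Python, using the coin-flip design and the sample sizes from the example): the reference distribution keeps n alongside the skewness statistic, so "conditioning on n" is just a choice of which replications to compare against.

```python
import random

def skewness(x):
    # sample third moment divided by sample variance to the 3/2 power
    n = len(x)
    m = sum(x) / n
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5

random.seed(3)
# Reference distribution implied by the design: flip a coin for n,
# then draw a normal sample of that size and compute the statistic.
# Keeping n alongside the statistic makes it a two-dimensional summary.
reference = []
for _ in range(400):
    n = 20 if random.random() < 0.5 else 2000
    sample = [random.gauss(0, 1) for _ in range(n)]
    reference.append((skewness(sample), n))

# "Conditioning on n" = comparing only to replications with matching n
ref_small = [s for s, n in reference if n == 20]
ref_large = [s for s, n in reference if n == 2000]
```

The n=20 replications are far more variable than the n=2000 ones, which is exactly why the unconditional mixture feels strange once n is observed.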

## September 21, 2006

### Something fishy in political science p-values, or, it's tacky to set up quantitative research in terms of "hypotheses"

A commenter pointed out this note by Kevin Drum on this cool paper by Alan Gerber and Neil Malhotra on p-values in published political science papers. They find that there are surprisingly many papers with results that are just barely statistically significant (t=1.96 to 2.06) and surprisingly few that are just barely not significant (t=1.85 to 1.95). Perhaps people are fudging their results or selecting analyses to get significance. Gerber and Malhotra's analysis is excellent--clean and thorough.

Just one note: the finding is interesting, and I love the graphs, but, as Gerber and Malhotra note,

We only examined papers that listed a set of hypotheses prior to presenting the statistical results. . . .

I think it's kind of tacky to state a formal "hypothesis," especially in a social science paper, partly because, in much (most?) of my research, the most interesting finding was not anything we'd hypothesized ahead of time. (See here for some favorite examples.) I think there's a problem with the whole mode of research that focuses on "rejecting hypotheses" using statistical significance, and so I'm sort of happy to see that Gerber and Malhotra have noticed a problem with studies formulated in this way.

Slightly related

In practice, t-statistics are rarely much more than 2. Why? Because, if they're much more than 2, you'll probably subdivide the data (e.g., look at effects among men and among women) until subsample sizes are too small to learn much. Knowing this can affect experimental design, as I discuss in my paper, "Should we take measurements at an intermediate design point?"

## September 20, 2006

### Position in educational psychology

Allan Cohen writes,

The Ed Psych Department at UGA has an opening for an Assistant or Associate Professor in educational measurement and statistics (see attached). We are looking for someone who is a strong statistician with an interest in pursuing psychometric research. This position is somewhat unique in that it is a tenure track position with only a one-course teaching requirement each of two semesters. The remainder of the time is to be spent doing research and working on testing issues in the Georgia Center for Assessment. Summer support is also provided by the Georgia Center for Assessment.

See here for more.

## September 11, 2006

### Multiple imputation has reached the "spam" level of ubiquity

I once asked Don Rubin if he was miffed that some of his best ideas, including the characterization of missing-data processes ("missing completely at random," "missing at random," etc.) and multiple imputation are commonly mentioned without citing him at all. He said that he actually considers it a compliment that these ideas are so accepted that they need no citation. Along those lines, he'd probably be happy to know that we're now getting unsolicited emails of the following sort:

Dear Colleague:

On Nov. 10-11 in New York City, I [unsolicited emailer] will be presenting my 2-day course on Missing Data. This course provides an in-depth look at modern methods for handling missing data, with particular emphasis on maximum likelihood and multiple imputation. These methods have been demonstrated to be markedly superior to conventional methods like listwise deletion or single imputation, while at the same time resting on less stringent assumptions.

While the course is applications oriented, I also explain the conceptual underpinnings of these new methods in some detail. Maximum likelihood is illustrated with two programs, Amos and LEM. Multiple imputation is demonstrated with two SAS procedures (MI and MIANALYZE) and two Stata commands (ICE and MICOMBINE).

The course will be held at the . . . Hotel . . . Guest rooms are available at the hotel at a special rate.

You can get more detailed information at . . .

If you'd prefer not to get announcements like this in the future, please reply to this e-mail and ask to be removed from the list.

## September 6, 2006

### Series of p-values

A finance professor writes,

I am currently working on a project and am looking for a test. Unfortunately, none of my colleagues can answer my question. I have a series of regressions of the form Y= a + b1*X1 + b2*X2. I am attempting to test whether the restriction b1=b2 is valid over all regressions. So far, I have an F-test based on the restriction for each regression, and also the associated p-value for each regression (there are approximately 600 individual regressions). So far, so good.

Is there a way to test whether the restriction is valid "on average"? I had thought of treating the p-values as uniformly distributed and testing them against a null hypothesis that the mean p-value is some level (e.g., 5%).

I figure that there should be a better way. I recall someone saying that a sum of uniformly distributed random variates is distributed chi-squared (or was that a sum of squared uniforms?). In either case, I can't find a reference.

My response: if the key question is comparing b1 to b2, I'd reparameterize as follows: y = a + B1*z1 + B2*z2 + error, where z1 = (X1+X2)/2 and z2 = (X1-X2)/2 (as discussed here). Now you're comparing B2 to zero, which is more straightforward: no need for F-tests, you can just look at the confidence interval for B2 in each case. And you can work with estimated regression coefficients (which are clean) rather than p-values (which are ugly).
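A quick simulated check of this reparameterization (Python, with made-up data; the identity holds exactly for least squares, since z1 and z2 span the same space as X1 and X2):

```python
import random

def ols2(y, x1, x2):
    # least-squares slopes for y = a + b1*x1 + b2*x2,
    # via demeaning and an explicit 2x2 normal-equations solve
    n = len(y)
    my, m1, m2 = sum(y) / n, sum(x1) / n, sum(x2) / n
    yc = [v - my for v in y]
    c1 = [v - m1 for v in x1]
    c2 = [v - m2 for v in x2]
    s11 = sum(a * a for a in c1)
    s22 = sum(a * a for a in c2)
    s12 = sum(a * b for a, b in zip(c1, c2))
    s1y = sum(a * b for a, b in zip(c1, yc))
    s2y = sum(a * b for a, b in zip(c2, yc))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

random.seed(5)
n = 200
X1 = [random.gauss(0, 1) for _ in range(n)]
X2 = [random.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.8 * a + 0.5 * b + random.gauss(0, 1) for a, b in zip(X1, X2)]

b1, b2 = ols2(y, X1, X2)                   # original parameterization
z1 = [(a + b) / 2 for a, b in zip(X1, X2)]
z2 = [(a - b) / 2 for a, b in zip(X1, X2)]
B1, B2 = ols2(y, z1, z2)                   # reparameterized

# B2 = b1 - b2 exactly, so a test of B2 = 0 is a test of b1 = b2
```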

At this point I'd plot the estimates and se's vs. some group-level explanatory variable characterizing the 600 regressions. (That's the "secret weapon.") More formal steps would include running a regression of the estimated B2's on relevant group-level predictors. (Yes, if you have 600 cases, you certainly must have some group-level predictors.) And the next step, of course, is a multilevel model. But at this point I think you've probably already solved your immediate problem.
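P.S. As for the half-remembered reference in the question: the usual result is Fisher's method for combining independent p-values. If p is uniform under the null, then -2*log(p) is chi-squared with 2 degrees of freedom, so the sum over k p-values is chi-squared with 2k degrees of freedom. A quick simulated check:

```python
import math
import random

# Fisher's method: if p ~ Uniform(0,1) under the null, then -2*log(p)
# is chi-squared with 2 df, so for k independent p-values the sum
# -2 * sum(log(p_i)) is chi-squared with 2k df.
random.seed(4)
k = 600                 # e.g., one p-value per regression
p_values = [random.random() for _ in range(k)]   # simulated null p-values
fisher_stat = -2 * sum(math.log(p) for p in p_values)
# Under the null this has mean 2k = 1200 and sd sqrt(4k), about 49
```

But as noted above, I'd rather work with the estimates themselves than combine 600 p-values.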

## August 30, 2006

### Problems in a study of girl and boy births, leading to a point about the virtues of collaboration

I was asked by a reporter to comment on a paper by Satoshi Kanazawa, "Beautiful parents have more daughters," which is scheduled to appear in the Journal of Theoretical Biology.

As I have already discussed, Kanazawa's earlier papers ("Engineers have more sons, nurses have more daughters," "Violent men have more sons," and so on) had a serious methodological problem in that they controlled for an intermediate outcome (total number of children). But the new paper fixes this problem by looking only at first children (see the footnote on page 7).

Unfortunately, the new paper still has some problems. Physical attractiveness (as judged by the survey interviewers) is measured on a five-point scale, from "very unattractive" to "very attractive." The main result (from the bottom of page 8) is that 44% of the children of surveyed parents in category 5 ("very attractive") are boys, as compared to 52% of children born to parents from the other four attractiveness categories. With a sample size of about 3000, this difference is statistically significant (2.44 standard errors away from zero). I can't confirm this calculation because the paper doesn't give the actual counts, but I'll assume it was done correctly.

Choice of comparisons

Not to be picky, but it seems somewhat arbitrary to pick out category 5 and compare it to categories 1-4. Why not compare 4 and 5 ("attractive" or "very attractive") to 1-3? Even more natural (from my perspective) would be to run a regression of the proportion of boys on attractiveness. Using the data in Figure 1 of the paper:

> library(arm)  # for display()
> attractiveness <- c(1, 2, 3, 4, 5)
> percent.boys <- c(50, 56, 50, 53, 44)
> display(lm(percent.boys ~ attractiveness))
lm(formula = percent.boys ~ attractiveness)
               coef.est coef.se
(Intercept)    55.10    4.56
attractiveness -1.50    1.37
n = 5, k = 2
residual sd = 4.35, R-Squared = 0.28

So, having a boy child is negatively correlated with attractiveness, but this is not statistically significant. (Weighting by the approximate number of parents in each category, from Figure 2, does not change this result.) It would not be surprising to see a correlation of this magnitude, even if the sex of the child were purely random.
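For anyone who wants to check the numbers, the same fit can be reproduced from the closed-form least-squares formulas (here in Python, using the percentages read off Figure 1 as above):

```python
import math

attractiveness = [1, 2, 3, 4, 5]
percent_boys = [50, 56, 50, 53, 44]   # read off Figure 1 of the paper

n = len(attractiveness)
mx = sum(attractiveness) / n
my = sum(percent_boys) / n
sxx = sum((x - mx) ** 2 for x in attractiveness)
sxy = sum((x - mx) * (y - my) for x, y in zip(attractiveness, percent_boys))
sst = sum((y - my) ** 2 for y in percent_boys)

slope = sxy / sxx                        # -1.5
intercept = my - slope * mx              # 55.1
resid = [y - (intercept + slope * x)
         for x, y in zip(attractiveness, percent_boys)]
ssr = sum(r ** 2 for r in resid)
sigma = math.sqrt(ssr / (n - 2))         # residual sd, about 4.35
se_slope = sigma / math.sqrt(sxx)        # about 1.37
r_squared = 1 - ssr / sst                # about 0.28
```

The slope is about one standard error from zero, matching the display() output.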

But what about the comparison of category 5 with categories 1-4? Well, again, this is one of many comparisons that could have been made. I see no reason from the theory of sex ratios (admittedly, an area on which I am no expert) to pick out this particular comparison. Given the many comparisons that could be done, it is not such a surprise that one of them is statistically significant at the 5% level.

Measuring attractiveness?

I have little to say about the difficulties of measuring attractiveness except that, according to the paper, interviewers in the survey seem to have assessed the attractiveness of each participant three times over a period of several years. I would recommend using the average of these three judgments as a combined attractiveness measure. General advice is that if there is an effect, it should show up more clearly if the x-variable is measured more precisely. I don't see a good reason to use just one of the three measures.

Reporting of results

The difference reported in this study was 44% compared to 52%--you could say that the most attractive parents in the study were 8 percentage points more likely than the others to have girls. Or you could say that they were .08/.52 = 15% less likely to have boys. But on page 9 of the paper, it says, "very attractive respondents are about 26% less likely to have a son as the first child." This crept up to 36% in this news article, which was cited by Stephen Dubner on the Freakonomics blog.

Where did the 26% come from? Kanazawa appears to have run a logistic regression of sex of child on an indicator for whether the parent was judged to be very attractive. The logistic regression coefficient was -0.31. Since the probabilities are near 0.5, the right way to interpret the coefficient is to divide it by 4: -0.31/4=-0.08, thus an effect of 8 percentage points (which is what we saw above). For some reason, Kanazawa exponentiated the coefficient: exp(-0.31)=0.74, then took 0.74-1=-0.26 to get a result of 26%. That calculation is inappropriate (unless there is something I'm misunderstanding here). But, of course, once it slipped past the author and the journal's reviewers, it would be hard for a reporter to pick up on it.
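The competing calculations are easy to check (Python; the 0.52 baseline is the proportion of boys among categories 1-4 given above):

```python
import math

coef = -0.31         # reported logistic regression coefficient
base_rate = 0.52     # proportion of boys among attractiveness categories 1-4

# Divide-by-4 rule: near p = 0.5, a logit coefficient of b changes the
# probability by about b/4
shift = coef / 4                          # about -0.08: 8 percentage points

# Exact version: shift on the logit scale and transform back
logit = math.log(base_rate / (1 - base_rate))
p_attractive = 1 / (1 + math.exp(-(logit + coef)))   # about 0.44

# The paper's calculation: exp(-0.31) - 1 = -0.27 is a drop in the
# odds, which is not the same as a drop in the probability
odds_change = math.exp(coef) - 1
```

The exact logistic calculation recovers the 44% seen in the raw data, confirming that the divide-by-4 approximation (8 points) is the right reading of the coefficient here.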

Coauthors have an incentive to catch mistakes

I'm disappointed that Kanazawa couldn't find a statistician in the Interdisciplinary Institute of Management where he works who could have checked his numbers (and also advised him against the bar graph display in his Figure 1, as well as advised him about multiple hypothesis testing). Just to be clear on this: we all make mistakes, I'm not trying to pick on Kanazawa. I think we can all do better by checking our results with others. Maybe the peer reviewers for the Journal of Theoretical Biology should've caught these mistakes, but in my experience there's no substitute for adding someone on as a coauthor, who then has a real incentive to catch mistakes.

Summary

Kanazawa is looking at some interesting things, and it's certainly possible that the effects he's finding are real (in the sense of generalizing to the larger population). But the results could also be reasonably explained by chance. I think a proper reporting of Kanazawa's findings would be that they are interesting, and compatible with his biological theories, but not statistically confirmed.

My point in discussing this article is not to be a party pooper or to set myself up as some sort of statistical policeman or to discourage innovative work. Having had this example brought to my attention, I was curious enough to follow it up, and then I wanted to share my newfound understanding with others. Also, this is a great example of multiple hypothesis testing for a statistics class.

## August 24, 2006

### Updated paper on "The difference between 'significant' and 'not significant' is not itself statistically significant"

Hal Stern updated our paper, "The difference between 'significant' and 'not significant' is not itself statistically significant," to include this example of sexual preference and birth order. Here's the abstract of our paper:

It is common to summarize statistical comparisons by declarations of statistical significance or non-significance. Here we discuss one problem with such declarations, namely that changes in statistical significance are often not themselves statistically significant. By this, we are not merely making the commonplace observation that any particular threshold is arbitrary---for example, only a small change is required to move an estimate from a 5.1% significance level to 4.9%, thus moving it into statistical significance. Rather, we are pointing out that even large changes in significance levels can correspond to small, non-significant changes in the underlying variables.

The error we describe is conceptually different from other oft-cited problems---that statistical significance is not the same as practical importance, that dichotomization into significant and non-significant results encourages the dismissal of observed differences in favor of the usually less interesting null hypothesis of no difference, and that any particular threshold for declaring significance is arbitrary. We are troubled by all of these concerns and do not intend to minimize their importance. Rather, our goal is to bring attention to what we have found is an important but much less discussed point. We illustrate with a theoretical example and two applied examples.
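The theoretical point fits in a two-line calculation (the numbers here are illustrative, not taken from the paper, and the two estimates are assumed independent):

```python
import math

# Illustrative numbers: two studies estimate the same quantity
est1, se1 = 25.0, 10.0    # z = 2.5: "statistically significant"
est2, se2 = 10.0, 10.0    # z = 1.0: "not significant"

# The difference between the two estimates is itself noisy:
diff = est1 - est2
se_diff = math.sqrt(se1 ** 2 + se2 ** 2)   # about 14.1
z_diff = diff / se_diff                    # about 1.06: not significant
```

One estimate clears the significance threshold and the other does not, yet the difference between them is well within noise.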

The full paper is here, and here are some more of my thoughts on statistical significance.

## August 3, 2006

### Analyzing choice data

Mathis Schulte writes,

We have collected 3 waves of survey data from 80 teams of approximately 12 people each. Each team has a formally designated leader. At Time 3, we asked an open ended question: "If you had to choose one member from your team (not including yourself) to be your team's new leader, who would you choose? Please write this person's first and last name in the space below."

We regard this as a measure of emergent leadership and want to predict the number of 'votes' (or nominations) a person receives from his/her team members with predictors at multiple levels (characteristics of voter, votee, and team). But clearly, there's negative interdependence among the team members' votes (i.e., the more votes one team member gets, the fewer votes other team members can get).

What can we do to use 'votes' as our DV?

That's a good question. My first thought is that 12 is enough people in a group that you can pretty much ignore the correlation and just do the analysis. Strictly speaking, it's an unordered multinomial outcome, and there are models for these, but maybe it's simplest to start with something like a logistic regression predicting the probability that person i picks person j. If you model #votes as an outcome, I'd be sure to use an overdispersed model (rather than a straight binomial or Poisson). You also might want to use the jackknife or bootstrap (on the 80 groups) to get standard errors.

The problem is so open-ended, though, that I expect there are a lot of good solutions that you'd only think of after playing with the data.

## August 1, 2006

### Comparing multinomial regressions

Lenore Fahrig writes,

I have two multinomial logistic models meant to explain the same data set. The two models have different predictor variables but they have the same number of predictor variables (2 each). Can I use the difference in deviance between the two models to compare them?

This sort of question comes up a lot. My quick answer is to include all four predictors in the model, or to combine them in some way (for example, sometimes it makes sense to reparameterize a pair of related predictors by considering their average and their difference). I can see why it can be useful to look at the improvement in fit from adding a predictor or two, but I don't see the use in comparing models with different predictors. (I mean, I see how one can learn interesting things from this sort of comparison, but I don't see the point in a formal statistical test of it, since I would think of your two original models as just the starting points to something larger.)

### Red Baron debunked?

The legend of Manfred von Richthofen, aka the Red Baron, has taken a knock. The victories notched up by him and other great flying aces of the first world war could have been down to luck rather than skill.

Von Richthofen chalked up 80 consecutive victories in aerial combat. His success seems to suggest exceptional skill, as such a tally is unlikely to be down to pure luck.

However, Mikhail Simkin and Vwani Roychowdhury of the University of California at Los Angeles think otherwise. They studied the records of all German fighter pilots of the first world war and found a total of 6745 victories, but only about 1000 "defeats", which included fights in which pilots were killed or wounded.

The imbalance reflects, in part, that pilots often scored easy victories against poorly armed or less manoeuvrable aircraft, making the average German fighter pilot's rate of success as high as 80 per cent. Statistically speaking, at least one pilot could then have won 80 aerial fights in a row by pure chance.

The analysis also suggests that while von Richthofen and other aces were in the upper 30 per cent of pilots by skill, they were probably no more special than that. "It seems that the top aces achieved their victory scores mostly by luck," says Roychowdhury.

I'm still confused. (6745/7745)^80 = .000016, or 1 in 60,000. Still seems pretty good to me. I mean, with these odds I wouldn't put my money on Snoopy, that's for sure.
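The arithmetic, for anyone checking (this treats every fight as an independent draw at the pooled German success rate, ignoring differences among pilots and in numbers of fights flown):

```python
# Pooled record of all German fighter pilots in the study
wins, losses = 6745, 1000
p_win = wins / (wins + losses)     # about 0.87

# Chance that one given pilot wins 80 straight fights at that rate
p_80_straight = p_win ** 80        # about 1.6e-5, roughly 1 in 60,000
```

So even at an 87% per-fight success rate, an 80-fight streak is a 1-in-60,000 event for any single pilot; whether that is "mostly luck" depends on how many pilot-careers of that length there were to draw from.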

## July 31, 2006

### A nearly generic referee report

I just reviewed a paper for a statistics journal. My review included the following sentences which maybe I should just put in my .signature file:

The main weakness of the paper is that it does not have any examples. This makes it hard to follow. As an applied statistician, I would like an example for two reasons: (a) I would like to know how to apply the method, and (b) it is much easier for me to evaluate the method if I can see it in an example. I would prefer an example that has relevance for the author of the paper (rather than a reanalysis of a "classic" dataset), but that is just my taste.

Lest you think I'm a grouch, let me add that I liked the paper and recommended acceptance. (Also, I know that I do not always follow my own rules, having analyzed the 8 schools example to death and having even on occasion reanalyzed data from Snedecor and Cochran's classic book.)

## July 27, 2006

### Homosexuality and the number of older brothers and sisters, or, the difference between "significant" and "not significant" is not itself statistically significant

This paper, "Biological versus nonbiological older brothers and men's sexual orientation," by Anthony Bogaert, appeared recently in the Proceedings of the National Academy of Sciences and was picked up by several news organizations, including Scientific American, New Scientist, Science News, and the CBC. As the Science News article put it,

The number of biological older brothers correlated with the likelihood of a man being homosexual, regardless of the amount of time spent with those siblings during childhood, Bogaert says. No other sibling characteristic, such as number of older sisters, displayed a link to male sexual orientation.

I was curious about this--why older brothers and not older sisters? The article referred back to this earlier paper by Blanchard and Bogaert from 1996, which had this graph:

and this table:

Here's the key quote from the paper:

Significant beta coefficients differ statistically from zero and, when positive, indicate a greater probability of homosexuality. Only the number of biological older brothers reared with the participant, and not any other sibling characteristic including the number of nonbiological brothers reared with the participant, was significantly related to sexual orientation.

The entire conclusion seems to be based on a comparison of significance with nonsignificance, even though the differences do not appear to be significant. (One can't quite be sure--it's a regression analysis and the different coefficient estimates are not independent, but based on the picture I strongly doubt the differences are significant.) In particular, the difference between the coefficients for brothers and sisters does not appear to be significant.

As I have discussed elsewhere, the difference between "significant" and "not significant" is not itself statistically significant. But should I be such a hard-liner here? As Andrew Oswald pointed out, innovative research can have mistakes, but that doesn't mean it should be discarded. And given my Bayesian inclinations, I should be the last person to discard a finding (in this case, the difference between the average number of older brothers and the average number of older sisters) just because it's not statistically significant.

But . . . but . . . yes, the data are consistent with the hypothesis that only the number of older brothers matters. But the data are also consistent with the hypothesis that only the birth order (i.e., the total number of older siblings) matters. (At least, so I suspect from the graph and the table.) Given that the 95% confidence level is standard (and I'm pretty sure the paper wouldn't have been published without it), I think the rule should be applied consistently.

To put it another way, the news articles (and also bloggers; see here, here, and here) just take this finding at face value.

Let me try this one more time: Bogaert's conclusions might very well be correct. He did not make a big mistake (as was done, for example, in the article discussed here). But I think he should be a little less sure of his conclusions, since his data appear to be consistent with the simpler hypothesis that it's birth order, not #brothers, that's correlated with being gay. (The paper did refer to other studies replicating the findings, but when I tracked down the references I didn't actually see any more data on the brothers vs. sisters issue.)

Warning: I don't know what I'm talking about here!

This is a tricky issue because I know next to nothing about biology, so I'm speaking purely as a statistician here. Again, I'm not trying to slam Bogaert's study, I'm just critical of the unquestioning acceptance of the results, which I think derives from an error about comparing statistical significance.

## July 25, 2006

### Fred Mosteller

Frederick Mosteller passed away yesterday. He was a leader in applied statistics and statistical education and was a professor of statistics at Harvard for several decades. Here is a brief biography by Steve Fienberg, and here are my own memories of being Fred's T.A. in the last semester that he taught statistics. I did not know Fred well, but I consider him an inspiration in my work in applied statistics and statistical education.

## July 13, 2006

### "Invariant to coding errors"

I was just fitting a model and realizing that some of the graphs in my paper were all wrong--we seem to have garbled some of the coding of a variable in R. (It can happen, especially in multilevel models when group indexes get out of order.) But the basic conclusion didn't actually change. This flashed me back to when Gary and I were working on our seats-votes stuff (almost 20 years ago!), and we joked that our results were invariant to bugs in the code.

## July 11, 2006

### Counting churchgoers

In googling for "parking lot Stolzenberg," I came across a series of articles in the American Sociological Review on the measurement of church attendance in the United States--an interesting topic in its own right and also a great example for teaching the concept of total survey error in a sampling class. The exchange begins with an article by C. Kirk Hadaway, Penny Long Marler, and Mark Chaves in 1993:

Characterizations of religious life in the United States typically reference poll data on church attendance. Consistently high levels of participation reported in these data suggest an exceptionally religious population, little affected by secularizing trends. This picture of vitality, however, contradicts other empirical evidence indicating declining strength among many religious institutions. Using a variety of data sources and data collection procedures, we estimate that church attendance rates for Protestants and Catholics are, in fact, approximately one-half the generally accepted levels.

The tables in the paper are really ugly (as are nearly all tables) and the graph makes the mistake of listing cities alphabetically (rather than in some meaningful order), but the paper is interesting.

Then in 1998 came follow-up articles by Theodore Caplow, Michael Hout and Andrew Greeley, and Robert Woodberry, along with a reply by Hadaway, Marler, and Chaves. The discussion concludes with an article by Tom Smith and an article by Stanley Presser and Linda Stinson, who report,

Compared to conventional interviewer-administered questions about attendance at religious services, self-administered items and time-use items should minimize social desirability pressures. In fact, they each reduce claims of weekly attendance by about one-third. This difference in measurement approach does not generally affect associations between attendance and demographic characteristics. It does, however, alter the observed trend in religious attendance over time: In contrast to the almost constant attendance rate recorded by conventional interviewer-administered items, approaches minimizing social desirability bias reveal that weekly attendance has declined continuously over the past three decades. These results provide support for the hypothesis that America has become more secularized, and they demonstrate the role of mode of administration in reducing measurement error.

Lots to think about here, both substantively and methodologically. I plan to use this as an example in my survey sampling class this fall.

## July 10, 2006

### Statistical consulting

I'm sometimes asked if I can recommend a statistical consultant. Rahul Dodhia is a former student (and coauthor of this paper on statistical communication) who, after getting his Ph.D. in psychology, has worked at different places including NASA, Ebay, and Amazon. He does statistical consulting; see here. I also have some colleagues on the Columbia faculty who do consulting. Rahul's the one with the website, though.

## July 5, 2006

### Is this blog too critical of innovative research?

Andrew Oswald writes,

I read your post on Kanazawa. I don't know whether his paper is correct, but I wanted to say something slightly different. Here is my concern.

The whole spirit of your blog would have led, in my view, to a rejection of the early papers arguing that smoking causes cancer (because, your eloquent blog might have written around 1953 or whenever it was exactly, smoking is endogenous). That worries me. It would have led to many extra people dying.

I can tell that you are a highly experienced researcher and intellectually brilliant chap but the slightly negative tone of your blog has a danger -- if I may have the temerity to say so. Your younger readers are constantly getting the subtle message: A POTENTIAL METHODOLOGICAL FLAW IN A PAPER MEANS ITS CONCLUSIONS ARE WRONG. Such a sentence is, as I am sure you would say, quite wrong. And one could then talk about type one and two errors, and I am sure you do in class.

Your blog is great. But I often think this.

I appreciate it is a fine distinction.

In economics, rightly or wrongly, referees are obsessed with thinking of some potential flaw in a paper. I teach my students that those obsessive referees would have, years ago, condemned many hundreds of thousands of smokers to death.

I replied as follows:

I agree completely with your point, of course; in fact, I have had colleagues in the past who used to specialize in criticizing applied statistical work, without ever suggesting constructive alternatives.

In my defense, I think that my blog often features studies with potential methodological problems, and I think I am judicious in considering these. For example:

- I had an extensive discussion of the advantages and disadvantages of Seth Roberts's results from self-experimentation. I think I made it clear that I thought it plausible that his findings would replicate in others, but I can't be sure.

- Regarding your study, perhaps I was being too harsh by lumping you with some other studies that controlled for total #kids. But I did provide a constructive solution (to run the "intent-to-treat analysis", not controlling for total #kids), so I don't think I was rejecting your conclusions, just pointing out a way to get more confidence in them.

- The Kanazawa paper is of lower quality, I think. I say this partly because such huge coefficients are extremely scientifically implausible to me. I'm not in the business of going around trashing random papers (with millions of scientific papers published each year, what's the point?); I only noticed this one because it had been "Slashdotted." The point here is not that "a potential methodological flaw means its conclusions are wrong" but rather that its conclusions are highly scientifically implausible, and it has a huge methodological flaw.

In general, I try to be constructive in my comments (and I certainly hope I wasn't rude in my comments on your paper); I just found the Kanazawa paper particularly irritating because they seemed so confident in their results. For a more typical way in which I comment on a paper, see here.

Finally, I disagree that I would've rejected the argument in 1953 that smoking causes cancer. The main point I'd like to make here is that my comments on Kanazawa's paper (and, to a lesser extent, on yours) involved the potential methodological flaw of controlling for an intermediate outcome. In a situation with outcome y, treatment T, intermediate outcome z, and pre-treatment variables X, the standard approach would be to regress y on T,X, but what you did was to regress (or, in some way, model) y on z,T,X, and look at the coefficient of T. This can be a real problem (even if it didn't make a big difference in your particular example), and the #kids example is interesting because it seems so natural to subdivide the analysis by #kids, but it can lead to problems. Also, there's a simple check here, which is to take z out of the model.

In contrast, with the smoking analysis, you're talking about the problem that T is endogenous. This is a different problem, with no simple solution, and often an observational study with an endogenous treatment is the best you can do. To criticize studies because T is endogenous would shoot down almost all observational studies, and I'm not trying to do that.

So although I agree with the spirit of your comment (that one should have a sense of proportion in one's criticisms), I think that I've actually been ok in the specifics. I suspect it's a problem with tone rather than content.

In retrospect, Oswald's comments clearly hit a nerve, since my reply was longer than his original message! In any case, he replied as follows:

I agree that it is tiresome and dangerous when researchers sound like they are unreasonably confident. Perhaps the reason that I'm gentler-spirited than I was when young is that I have seen the relentless pressure to publish sound work that will get you a small pay rise each year. Publishing our Daughters work seemed the right thing to do once Nick and I had seen the same pattern in German data. Until then, we sat on the finding. But I am conscious that, because you can be wrong and people will notice, making any unusual claim is scary.

I feel that, sometimes without being conscious of it, a lot of applied researchers prefer to work -- I am going to use loose language here -- on an issue that is dull but is 99.9% probably true rather than something deeply iconoclastic and potentially important that is 90.0% likely to be true. They do this partly, perhaps, because subconsciously they very much fear being exposed as having made an error in their conclusion. Whether or not that is rational for an individual researcher, and very likely it is, there is also, I think, a case for believing that society needs risky iconoclasm in a very deep sense and to get it that society can live with some medium-run errors (because the profession will go on to correct those reasonably quickly). I shall try to think through whether there is a way to make precise my intuition here that, in a society where scholars' scientific reputations may be damaged by one false conclusion, individual risk-aversion may be suboptimally high -- from a society's standpoint. Of course if I'm right there is a convex function of some kind at the bottom of all this.

## June 23, 2006

### Jay Goodliffe's comments on standardizing regression inputs

Jay Goodliffe writes,

I recently read your paper on scaling coefficients that you posted on the PolMeth site. I hope you don't mind if I send a comment/question on your manuscript.

I usually like to include some sort of "substantive significance" table after the regular tables to report something like first differences. I have also thought recently about how to compare relative effects of variables when some variables are binary and others are not.

My current approach is to code all binary variables with the modal category as 0, set all variables to their median, and then see how the predicted dependent variable changes when each independent variable is moved to the 90th percentile, one at a time. This approach makes it easy to specify the "baseline" observation, so there are no .5 Female voters, which occurs if all variables are set to the mean instead. There are, of course, some problems with this. First, you need all of the binary variables to have at least 10% of the observations in each category. Second, it's not clear this is the best way to handle skewed variables. But it is similar in kind to what you are suggesting.

My comment is that your approach may not always work so well for skewed variables. With such variables, the range mean +/- s.d. will be beyond the range of observed data. Indeed, in your NES example, Black is such a variable. In linear models, this does not matter since you could use the range [mean, mean + 2 s.d.] and get the same size effect. But it might matter in non-linear models, since it matters what the baseline is. And there is something less...elegant in saying that you are moving Black from -0.2 to 0.5, rather than 0 to 1.

My question is: You make some comments in passing that you prefer to present results graphically. Could you give me a reference to something that shows your preferred practice?

Thanks.

--Jay

P.S. I've used tricks from the _Teaching Statistics_ book in my undergraduate regression class.

To start with, I like anyone who uses our teaching tricks, and, to answer the last question first, here's the reference to my preferred practice on making graphs instead of tables.

On to the more difficult questions: There are really two different issues that Jay is talking about:

1. What's a reasonable range of variation to use in a regression input, so as to interpret how much of its variation translates into variation in y?

2. How do you summarize regressions in nonlinear models, such as logistic regression?

For question 1, I think my paper on scaling by dividing by two sd's provides a good general answer: in many cases, a range of 2 sd's is a reasonable low-to-high range. It works for binary variables (if p is not too far from .5) and also for many continuous variables (where the mean-sd is a low value, and the mean+sd is a high value). For this interpretation of standardized variables, it's not so important that the range be mean +/- 1 sd; all that matters is the total range. (I agree that it's harder to interpret the range for a binary variable where p is close to 0 or 1 (for example, the indicator for African American), but in these cases, I don't know that there's any perfect range to pick--going from 0 to 1 seems like too much, overstating the changes that could reasonably be expected--and I'm happy with 2 sd's as the choice.)

For question 2, we have another paper just on the topic of these predictive comparisons. The short answer is that, rather than picking a single center point to make comparisons, we average over all of the data, considering each data point in turn as a baseline for comparisons. (I'll have to post a blog entry on this paper too....)
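Jay's "median baseline, move to the 90th percentile" summary is easy to sketch for a linear model. Here is a minimal Python version; the `percentile` helper and all data and coefficients are invented for illustration:

```python
# Sketch of Jay's first-difference summary for a linear model:
# set every input to its median (binary inputs coded so the modal
# category is 0), then move each input, one at a time, to its 90th
# percentile and record the change in the prediction.

def percentile(values, p):
    """Nearest-rank percentile of a list (crude, for illustration)."""
    v = sorted(values)
    idx = max(0, min(len(v) - 1, round(p / 100 * (len(v) - 1))))
    return v[idx]

# hypothetical inputs: age (continuous) and female (binary, modal category = 0)
data = {
    "age": [22, 31, 35, 40, 44, 50, 58, 63, 70, 77],
    "female": [0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
}
coefs = {"age": 0.02, "female": 0.50}  # hypothetical regression coefficients

baseline = {k: percentile(v, 50) for k, v in data.items()}
# in a linear model, the predicted change is just coef * shift
first_diffs = {
    k: coefs[k] * (percentile(data[k], 90) - baseline[k]) for k in data
}
print(first_diffs)
```

Note the point Jay makes: the baseline is an actual realizable observation (a 44-year-old male here), with no ".5 female" voters as you would get by setting everything to its mean.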

## June 21, 2006

### Standardizing regression inputs by dividing by two standard deviations

Interpretation of regression coefficients is sensitive to the scale of the inputs. One method often used to place input variables on a common scale is to divide each variable by its standard deviation. Here we propose dividing each variable by two times its standard deviation, so that the generic comparison is with inputs equal to the mean +/- 1 standard deviation. The resulting coefficients are then directly comparable for untransformed binary predictors. We have implemented the procedure as a function in R. We illustrate the method with a simple public-opinion analysis that is typical of regressions in social science.

Here's the paper, and here's the R function.

Standardizing is often thought of as a stupid sort of low-rent statistical technique, beneath the attention of "real" statisticians and econometricians, but I actually like it, and I think this 2 sd thing is pretty cool.
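Here's a minimal sketch of the procedure in Python (the function name is mine; the paper's actual implementation is an R function):

```python
import math

def standardize_2sd(x):
    """Center a variable and divide by two standard deviations."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((xi - mean) ** 2 for xi in x) / (n - 1))
    return [(xi - mean) / (2 * sd) for xi in x]

# For a binary input with p near .5, a 0-to-1 change on the original
# scale becomes roughly a 1-unit change after dividing by 2 sd's, which
# is why the rescaled coefficients of continuous inputs are directly
# comparable to those of untransformed binary predictors.
binary = [0, 1] * 50          # p = .5
z = standardize_2sd(binary)
gap = max(z) - min(z)
print(round(gap, 3))          # close to 1
```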

## June 13, 2006

### Encyclopedias, statistical resources, references

Through Robert Israel's sci.math posting I've found an excellent online resource, especially for many statistical topics: the Encyclopaedia of Mathematics. It seems better than the two better-known options, Wolfram MathWorld and Wikipedia. Another valuable resource, Quantlets, has several interesting books and tutorials, especially on the more financially oriented topics; while some materials are restricted, much of it is easily accessible. Finally, I have been impressed by Computer-Assisted Statistics Teaching - while it is introductory in nature, the nifty Java applets make it worth registering.

Posted by Aleks Jakulin at 10:09 PM | Comments (1) | TrackBack

## June 12, 2006

### Statistical Data on the Web

Preparing data for use, converting it, cleaning it, leafing through thick manuals that explain the variables, and asking collaborators for clarifications all take a lot of our time. The rule of thumb in data mining is that 80% of the time is spent on preparing the data. Also, it is often painful to read bad summaries of interesting data in papers when one would want to examine the data directly and do the analysis for oneself.

While there are many repositories of data on the web, they are not very sophisticated: usually there is a ZIP file with the data in some format that has yet to be figured out. Today I stumbled upon the Virtual Data System, which provides an open source implementation of a data repository that enables one to view variables, see the distribution of their values in the data, and perform certain types of filtering, all through the browser. An example can be seen at the Henry A. Murray Research Archive - click on the Files tab and then on the Variable Information button. Moreover, the system enables one to cite data much as one would cite a paper.

A similar idea developed for publications a few years earlier is GNU EPrints, which is a system of repositories of technical reports and papers that almost anyone can set up. Having used EPrints, I was frustrated by the inability to move data from one repository to another, to have some sort of a search system that would span several repositories, to have integration with search and retrieval tools such as Citeseer.

But regardless of the problems, such things are immensely useful parts of the now-developing scientific infrastructure on the internet. It would do wonders if even 5% of the money that goes into the antiquated library system were channelled into the development of such alternatives.

Posted by Aleks Jakulin at 3:12 PM | Comments (1) | TrackBack

David Berri very nicely gave detailed answers to my four questions about his research in basketball-metrics. Below are my questions and Berri's responses.

AG: 1. Reading Gladwell's article, I assume that Berri et al. are doing regression analysis, i.e., estimating player abilities as a linear combination of individual statistics. I have the same question that Bill James asked in the context of baseball statistics: why restrict to linear functions? A function of the form A*B/C (that's what James used in his runs created formula, or more fully, something like (A1 + A2 +...)*(B1 + B2 +...)/C) could make more sense.

DB: Before we discuss alternative approaches to what we have done we first need to establish what we exactly did. I would emphasize that the book does not present any math. Still, I think the models are described in enough detail that one can follow what we did -- and did not do.

For now, let me offer a brief description. It is important to note at the outset the motivation behind the model. We are economists interested in using the productivity data generated by the NBA to answer questions we think are interesting. To do this research, one first needs to make sense of the data. The NBA tracks a collection of statistics to measure player performance, but the statistics are not easily understood. For example, points, rebounds, and assists all seem important, but what is each stat's relative value? To answer this question, it makes sense to use the tool economists most often employ, regression analysis. But how one builds the regression is not entirely straightforward.

In 1999 I published a paper which presented an early effort. The basic model employed in The Wages of Wins improves upon this effort on a number of dimensions. It is a simpler approach, it is more accurate, and I think, more theoretically sound.

The specific model described in the book -- and I would emphasize again the word described since there is no math in our book -- begins with a very simple regression. Specifically, wins are regressed on both offensive and defensive efficiency -- where offensive efficiency is defined as points divided by possessions employed and defensive efficiency is defined as points surrendered divided by possessions acquired.

Now, that one regression is just the beginning of the story. Assists, blocked shots, and personal fouls are not part of any element of offensive or defensive efficiency. That does not mean, though, that these three factors don't matter. To get the value of these statistics, though, one needs to craft additional regressions. Of these, the regression designed to determine the impact of assists was easily the most difficult to construct.

I would emphasize that one approach often taken in the study of NBA performance is to attempt to logically derive the value of the statistics. We do not take this approach, but instead rely entirely on regression analysis. In other words, the relative value of each statistic is determined by the regressions and the data. The trick, of course, is defining the regressions correctly (which I think we do).

One last observation...In an end note in the book it is noted that the results one derives from the model based on offensive and defensive efficiency can also be derived from the model presented in the 1999 paper, if one makes a few modifications to the earlier work. That you get the same results with a different formulation suggests that the findings are fairly robust.
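The first step Berri describes can be sketched as follows. This is my own illustration, not Berri's code: all team numbers are invented, wins follow a made-up linear relationship, and the fit is plain least squares.

```python
# Sketch: regress wins on offensive and defensive efficiency, where
# offensive efficiency = points per possession employed and defensive
# efficiency = points allowed per possession acquired. All numbers invented.

def ols(X, y):
    """Least squares via the normal equations (Gaussian elimination)."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
         for r in range(k)]
    b = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    for col in range(k):                           # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * k
    for r in range(k - 1, -1, -1):                 # back substitution
        coef[r] = (b[r] - sum(A[r][c] * coef[c]
                              for c in range(r + 1, k))) / A[r][r]
    return coef

# hypothetical season numbers for six teams
off_eff = [1.00, 1.04, 1.08, 1.10, 1.02, 1.06]   # points per possession employed
def_eff = [1.05, 1.00, 1.03, 1.06, 1.08, 1.01]   # points allowed per possession acquired
wins = [41 + 90 * (o - d) for o, d in zip(off_eff, def_eff)]  # invented relationship

X = [[1.0, o, d] for o, d in zip(off_eff, def_eff)]
intercept, b_off, b_def = ols(X, wins)
print(round(b_off, 1), round(b_def, 1))  # offense helps; points allowed hurt
```

The individual-statistic regressions Berri mentions would then be layered on top of this, valuing each box-score statistic by its effect on the two efficiencies.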

AG: 2. Have Berri et al. looked at the plus-minus statistic, which is "the difference in how the team plays with the player on court versus performance with the player off court"? When I started reading Gladwell's article, I thought he was going to talk about the plus-minus statistic, actually.

DB: We talk about plus-minus in the book briefly. The Wins Produced model is not a plus-minus approach, at least not in the way that term has been defined with respect to the NBA. The Wins Produced model utilizes the standard statistics the NBA tracks for its players and we find that these statistics do allow one to measure each player's contributions to team wins. I plan on commenting on plus-minus in more detail later on, but for now, I would say I think both plus-minus and the Wins Produced model are valid approaches and often (although not always) come to similar conclusions.

AG: 3. I'm concerned about Gladwell's causal interpretation of regression coefficients. I don't know what was in the analysis of all-star voting, but if you run a regression including points scored and also rebounds, turnovers, etc., then the coefficient for "points scored" is implicitly comparing two players with different points scored but identical numbers of rebounds, assists, etc.--i.e., "holding all else constant." But that is not the same as answering what happens "if a rookie increases his scoring by ten per cent." If a rookie increases his scoring by 10%, I'd guess he'd get more playing time (maybe I'm wrong on this, I'm just guessing here), thus more opportunities for rebounds, steals, etc. Just to be clear here: I'm not knocking the descriptive regression. In particular, you can play with it to model what might happen if players are switched in and out of teams (as long as you think carefully about issues such as playing time, I suppose). I'm just sensitive to mistakenly-causal interpretations of regression coefficients--the idea that you can change one variable while holding all else constant.

DB: If you have two rookies, equal in every way except one has 10% more points, then the one with more points will have 23% more voting points. That's how I read the coefficient.

I think the key result, and this is found in studies of different decisions, is that how many points a player scores dominates decision making in the NBA. Studies of what determines a player's salary, factors that cause a player to be cut from a team, and the coaches' voting for the All-Rookie team all indicate that points scored is the most important factor. Rebounds, turnovers, steals, and shooting efficiency determine wins, but do not have as much impact on player evaluation.

AG: 4. Gladwell's article is subtitled, "When it comes to athletic prowess, don't believe your eyes," and he writes, "We see Allen Iverson, over and over again, charge toward the basket, twisting and turning and writhing through a thicket of arms and legs of much taller and heavier men--and all we learn is to appreciate twisting and turning and writhing. We become dance critics, blind to Iverson's dismal shooting percentage and his excessive turnovers, blind to the reality that the Philadelphia 76ers would be better off without him." But it seems here that the problem is not that people are ignoring the statistics, but that they're using the wrong (or overly simplified) statistics. After all, he points out in the first paragraph of his article that Iverson has led the league in scoring and steals, and his team has done well. Even if he didn't look cool flying to the basket, Iverson might have gotten recognition from these statistics, right? This is a point that Bill James made (with regard to batting average in Fenway Park, ERA in Dodger Stadium, etc.): people can overinterpret statistics in isolation.

DB: Let me try and shed additional light on what Gladwell was saying by taking you through a story from Gladwell's work. There is a great story in Blink (Gladwell's latest book) where he talks about medical doctors trying to determine if someone is having a heart attack the moment the person arrives in the emergency room with chest pains. If the doctor said yes and he was wrong, the patient would tie up staff and space unnecessarily. If the doctor said no and he was wrong, the patient could be sent home and be in very serious trouble. According to Gladwell, given the importance of the decision, doctors would look at everything. Unfortunately, much of what they considered was not important to the decision. A cardiologist named Lee Goldman developed a simple algorithm which found that only three factors truly mattered. And furthermore -- and this is the interesting result -- doctors who looked at everything could not come close to the accuracy of the simple algorithm.

After reading his piece in the New Yorker, I think Gladwell looks at our algorithm the same way as he sees the algorithm designed to predict heart attacks. People in the NBA try and look at everything in evaluating players. So they watch every move the player makes on the court trying to figure out who is good and who is bad. Watching, though, is biased towards the dramatic, which often is scoring. Although we show that scoring is of course important, scoring itself is not always evaluated correctly and often the non-scoring actions are just as important. Specifically, with respect to scoring the issue is not really how much you score, but how you score. If you score inefficiently, you might look good on the court, but you are not helping your team win very much. Furthermore, non-scoring factors like turnovers, steals, and rebounds, which may not stand out when you just watch a player, really impact outcomes. In the end a player with high scoring totals, and in Gladwell's words, good dance moves, can easily lead to an incorrect evaluation if all you do is watch the player. Perhaps decision-makers would be better off first looking at the numbers, and then watching the player to be sure that the numbers the player posted in the past are the numbers you will likely see in the future. In other words, start with the numbers which tell you how productive the player has been. Then watch the players to see if you can figure out why that productivity is happening.

## June 11, 2006

### Survey weights and poststratification

A question came in which relates to an important general problem in survey weighting. Connie Wang from Novartis pharmaceutical in Whippany, New Jersey, writes,

I read your article "Struggles with survey weighting and regression modeling" and I have a question on sampling weights.

My question is on a stratified sample design to study revenues of different companies. We originally stratified the population (about 1000 companies) into 11 strata based on 9 attributes (large/small Traffic Volume, large/small Size of the Company, with/without Host/Remote Facilities, etc.) which could impact revenues, and created a sampling plan. In this sampling plan, samples were selected from within each stratum in PPS method (probabilities proportionate to size), and we computed the sampling weights (inverse of probability of selection) for all samples in all strata. In this sampling plan, sampling weights for different samples in the same stratum may not be the same since samples were drawn from within each stratum not in SRS (simple random sample) but in PPS/census.

After all samples were drawn, we want to estimate the revenue of large/medium/small size of companies (the above sampling plan was created for another purpose) respectively. We poststratified all samples based on size of the company only. Obviously, this re-classification was totally different from the original stratifying based on 9 attributes. Let's assume that there are x samples falling into large company group under this re-classification. Obviously, these x samples could be from any original stratum or several original strata since this poststratification was regardless of the original stratifying. e.g., some samples in original stratum h could fall into large company group under this re-classification and other samples in original stratum h could fall into medium company group under this re-classification. Now my question is: are the original stratified sampling weights valid for the large company group from poststratifying? e.g., can we multiply the revenue of each sample by its original stratified sampling weight for all x samples in the large company group poststratified and add up all products to get the total revenue of the large company group (subtotal)?

I guess the sample weights are not valid for the post-stratified large company group, given the check rule for sample weights: add up all sample weights in a stratum (poststratified) and see if the sum is equal to the stratum size (subpopulation). So, we cannot multiply the revenue of each sampling unit by its original sampling weight to get the total revenue.

My thinking is that the original sampling weights need to be adjusted (rescaled) to get new sampling weights for the post-stratified new strata, and then the revenue of each sampling unit can be multiplied by its new sampling weight to get the total revenue. Another reason to adjust for sampling weights is non-responses.

My reply:

1. You use PPS sampling. Usually PPS is used in cluster sampling, where the first stage is PPS and the second stage samples a fixed number of units per cluster, so that the unit-level sampling is equal probability of selection. (See, e.g., these references for more on these issues.) So if you're just doing PPS, I don't know that you'll need unit-level weights (or, if you do, they shouldn't vary much if the survey was designed well).

2. The full solution to your problem is to poststratify on the 2-way table of the 11 design strata crossed with the 3 size categories--that's 33 post-strata in all. If you have enough data, you can simply do full poststratification--that is, get a separate estimate for each of the 33 cells, and then add up the rows of the table to get estimates for each of the 3 size categories.

3. Another option, which might work better if the sample size is smaller, is to rake: poststratify on your 11 strata, then on your 3 size categories, and iterate this a couple times so that your weighted sample matches the population in both these dimensions.

You certainly shouldn't need to take a new sample (unless the original sample has too small a sample size in one or more of your post-strata of interest).
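For readers unfamiliar with raking, here is a toy sketch (all margins and weights invented): alternately rescale the weights to match each set of margins until both agree.

```python
# Toy raking (iterative proportional fitting): adjust cell weights so the
# weighted sample matches known margins in both dimensions.

def rake(cells, row_targets, col_targets, iters=50):
    w = [row[:] for row in cells]
    for _ in range(iters):
        for r, target in enumerate(row_targets):        # match row margins
            s = sum(w[r])
            w[r] = [x * target / s for x in w[r]]
        for c, target in enumerate(col_targets):        # match column margins
            s = sum(w[r][c] for r in range(len(w)))
            for r in range(len(w)):
                w[r][c] *= target / s
    return w

# rows: 3 design strata; columns: small / large companies (invented counts)
sample_weights = [[20.0, 30.0], [25.0, 15.0], [10.0, 40.0]]
row_targets = [60.0, 50.0, 40.0]     # known stratum sizes
col_targets = [70.0, 80.0]           # known size-class totals

w = rake(sample_weights, row_targets, col_targets)
print([round(sum(row), 1) for row in w])                             # ~row targets
print([round(sum(w[r][c] for r in range(3)), 1) for c in range(2)])  # ~col targets
```

In Connie's problem the rows would be the 11 design strata and the columns the 3 size categories; the raked weights then pass the "weights sum to the stratum size" check in both dimensions.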

P.S. Connie sent me a new version of her question so I altered her wording above as requested.

## June 7, 2006

### World Cup simulation

Mouser sent along this link to an applet that simulates World Cup outcomes.

## June 6, 2006

### Take logit coefficients and divide by approximately 1.6 to get probit coefficients

[See update at end of this entry.]

Jeff Lax pointed me to the book, "Discrete choice methods with simulation" by Kenneth Train as a useful reference for logit and probit models as they are used in economics. The book looks interesting, but I have one question. On page 28 of his book (go here and click through to page 28), Train writes, "the coefficients in the logit model will be √1.6 times larger than those for the probit model . . . For example, in a mode choice model, suppose the estimated cost coefficient is −0.55 from a logit model . . . The logit coefficients can be divided by √1.6, so that the error variance is 1, just as in the probit model. With this adjustment, the comparable coefficients are −0.43 . . ."

This confused me, because I've always understood the conversion factor to be 1.6 (i.e., the variance scales by 1.6^2, so the coefficients themselves scale by 1.6). I checked via a little simulation in R:

```r
library("arm")  # for invlogit() and display()
n <- 100
x <- rnorm(n)
a <- 1.3
b <- -0.55
y <- rbinom(n, 1, invlogit(a + b*x))

M1 <- glm(y ~ x, family=binomial(link="logit"))
display(M1)
## glm(formula = y ~ x, family = binomial(link = "logit"))
##             coef.est coef.se
## (Intercept)  0.88     0.22
## x           -0.44     0.24
## n = 100, k = 2
## residual deviance = 118.6, null deviance = 122.2 (difference = 3.6)

M2 <- glm(y ~ x, family=binomial(link="probit"))
display(M2)
## glm(formula = y ~ x, family = binomial(link = "probit"))
##             coef.est coef.se
## (Intercept)  0.54     0.13
## x           -0.26     0.14
## n = 100, k = 2
## residual deviance = 118.6, null deviance = 122.2 (difference = 3.5)

-.44/-.26
## [1] 1.69
```

I did it a few more times and got different results, but always between 1.6 and 1.8 (which is consistent with the literature, e.g., Amemiya, 1981).

Train also refers to a factor of pi^2/6, which is the variance of a single utility in the logit model (so that the difference has a variance of pi^2/3; see p.39 of his book here). This pi^2/3 is a variance, so its square root needs to be taken, hence pi/√3=1.8, which is indeed the sd of the unit logistic distribution. However, as Amemiya (1981) and others have noted, the logistic distribution function actually fits better to the normal, over most of the range of the curve, if we scale by 1.6 rather than 1.8. But, in any case, it's 1.6, not √1.6. Anyway, I think that's right.
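
Amemiya's observation is easy to check numerically: compare the logistic curve, scaled by each candidate factor, against the normal CDF over a grid. A sketch (standard library only; the grid and tolerance are arbitrary choices):

```python
import math

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

xs = [i / 100.0 for i in range(-400, 401)]
scale_amemiya = 1.6
scale_sd = math.pi / math.sqrt(3.0)  # ~1.81, matching the logistic sd exactly
max_16 = max(abs(logistic(scale_amemiya * x) - phi(x)) for x in xs)
max_sd = max(abs(logistic(scale_sd * x) - phi(x)) for x in xs)
```

On this grid the factor 1.6 gives a noticeably smaller maximum discrepancy than pi/sqrt(3), which is the point of Amemiya's recommendation: matching the shape of the curve beats matching the standard deviation.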

P.S.

I talked with Dr. Train and we realized that we're talking about two different (although related) models. I'm working with logit/probit for binary outcomes, or ordered logit/probit for multinomial outcomes, in which there's a single latent variable (with logistic(0,1) or normal(0,1) error term). Train is working with a utility model in which each alternative has its own independent error term (extreme-value or normal(0,1)), so that the difference in two utilities is either logistic(0,1) or normal(0,2). Hence the sqrt(2) difference in our sd's. The parameterization/model I use is more common in statistics and, I believe, in econometric analysis of discrete data (e.g., Maddala's book), but I can see that Train's parameterization/model would make sense in settings with different random utility for each person and each outcome.

Train clarifies:

These are not two different parameterizations of the same model, with one parameterization being more common than the other. They are two different models, each with its own parameterization that is common for that model.
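
The sqrt(2) bookkeeping can be checked by simulation: the difference of two independent extreme-value (Gumbel) errors is logistic, with sd pi/sqrt(3) ~ 1.81, while the difference of two normal(0,1) errors has sd sqrt(2) ~ 1.41. A quick sketch (standard library only; sample size and seed are arbitrary):

```python
import math
import random

random.seed(1)
n = 200_000

def gumbel():
    # Standard Gumbel (type-I extreme value) draw via the inverse CDF
    return -math.log(-math.log(random.random()))

def sd(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

sd_logistic = sd([gumbel() - gumbel() for _ in range(n)])
sd_normal = sd([random.gauss(0, 1) - random.gauss(0, 1) for _ in range(n)])
```

The two simulated sds sit near 1.81 and 1.41 respectively, which is exactly the discrepancy between the two parameterizations discussed above.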

## June 5, 2006

### How can and should we interpret regression models of basketball?

I read Malcolm Gladwell's article in the New Yorker about the book, "The Wages of Wins," by David J. Berri, Martin B. Schmidt, and Stacey L. Brook. Here's Gladwell:

Weighing the relative value of fouls, rebounds, shots taken, turnovers, and the like, they’ve created an algorithm that, they argue, comes closer than any previous statistical measure to capturing the true value of a basketball player. The algorithm yields what they call a Win Score, because it expresses a player’s worth as the number of wins that his contributions bring to his team. . . .

In one clever piece of research, they analyze the relationship between the statistics of rookies and the number of votes they receive in the All-Rookie Team balloting. If a rookie increases his scoring by ten per cent—regardless of how efficiently he scores those points—the number of votes he’ll get will increase by twenty-three per cent. If he increases his rebounds by ten per cent, the number of votes he’ll get will increase by six per cent. . . . Every other factor, like turnovers, steals, assists, blocked shots, and personal fouls—factors that can have a significant influence on the outcome of a game—seemed to bear no statistical relationship to judgments of merit at all. Basketball’s decision-makers, it seems, are simply irrational.

I have a few questions about this, which I'm hoping that Berri et al. can help out with. (A quick search found this blog that they are maintaining.) I should also take a look at their book, but first some questions:

1. Reading Gladwell's article, I assume that Berri et al. are doing regression analysis, i.e., estimating player abilities as a linear combination of individual statistics. I have the same question that Bill James asked in the context of baseball statistics: why restrict to linear functions? A function of the form A*B/C (that's what James used in his runs created formula, or more fully, something like (A1 + A2 +...)*(B1 + B2 +...)/C) could make more sense.

2. Have Berri et al. looked at the plus-minus statistic, which is "the difference in how the team plays with the player on court versus performance with the player off court"? (See here for some references to this, also here and here.) When I started reading Gladwell's article, I thought he was going to talk about the plus-minus statistic, actually.

3. I'm concerned about Gladwell's causal interpretation of regression coefficients. I don't know what was in the analysis of all-star voting, but if you run a regression including points scored and also rebounds, turnovers, etc., then the coefficient for "points scored" is implicitly comparing two players with different points scored but identical numbers of rebounds, assists, etc.--i.e., "holding all else constant." But that is not the same as answering the what happens "if a rookie increases his scoring by ten per cent." If a rookie increases his scoring by 10%, I'd guess he'd get more playing time (maybe I'm wrong on this, I'm just guessing here), thus more opportunities for rebounds, steals, etc.

Just to be clear here: I'm not knocking the descriptive regression. In particular, you can play with it to model what might happen if players are switched in and out of teams (as long as you think carefully about issues such as playing time, I suppose). I'm just sensitive to mistakenly-causal interpretations of regression coefficients--the idea that you can change one variable while holding all else constant.

4. Gladwell's article is subtitled, "When it comes to athletic prowess, don’t believe your eyes," and he writes, "We see Allen Iverson, over and over again, charge toward the basket, twisting and turning and writhing through a thicket of arms and legs of much taller and heavier men—and all we learn is to appreciate twisting and turning and writhing. We become dance critics, blind to Iverson’s dismal shooting percentage and his excessive turnovers, blind to the reality that the Philadelphia 76ers would be better off without him." But it seems here that the problem is not that people are ignoring the statistics, but that they're using the wrong (or overly simplified) statistics. After all, he points out in the first paragraph of his article that Iverson has led the league in scoring and steals, and his team has done well. Even if he didn't look cool flying to the basket, Iverson might have gotten recognition from these statistics, right? This is a point that Bill James made (with regard to batting average in Fenway Park, ERA in Dodger Stadium, etc.): people can overinterpret statistics in isolation.

### Useful statistics material at UCLA

Rafael pointed me toward some great stuff at the UCLA statistics website, including a page on Multilevel modeling that's full of great stuff. (No link yet to our forthcoming book, but I'm sure that will change...) It would also benefit from a link to R's lmer() function (in the lme4 package).

Fixed and random (whatever that means)

One funny thing is that they link to an article on "distinguishing between fixed and random effects." Like almost everything I've ever seen on this topic, this article treats the terms "random" and "fixed" as if they have a precise, agreed-upon definition. People don't seem to be aware that these terms are used in different ways by different people. (See here for five different definitions that have been used.)

del.icio.us isn't so delicious

At the top of UCLA's multilevel modeling webpage is a note saying, "The links on this are being superseded by this link: Statistical Computing Bookmarks". I went to this link. Yuck! I like the original webpage better. I suppose the del.icio.us page is easier to maintain, so it's probably worth it, but it's too bad it's so much uglier.

## June 2, 2006

### Building a statistics department

Aleks sent me these slides by Jan de Leeuw describing the formation of the UCLA statistics department. Probably not of much interest unless you're a grad student or professor somewhere, but it's fascinating to me, partly because I know the people involved and partly because I admire the UCLA stat dept's focus on applied and computational statistics. In particular, they divide the curriculum into "Theoretical", "Applied", and "Computational". I think that's about right, and, to me, much better than the Berkeley-style division into "Probability", "Theoretical", and "Applied". Part of this is that you make do with what you have (Berkeley has lots of probabilists, UCLA has lots of computational people), but I think that it's a better fit to how statistics is actually practiced.

It's also interesting that much of their teaching is done by continuing lecturers and senior lecturers, not by visitors, adjuncts, and students. I'm not sure what to think about this. One of the difficulties with hiring lecturers is that the hiring and evaluation itself should be taken seriously, which really means that experienced teachers should be doing the evaluation. So I imagine that getting this started could be a challenge.

I also like the last few slides, on Research:

Centers

We went from a mathematics model (one faculty member in an office with pencils and yellow pads) to a science model (one or more faculty members with graduate students in labs). Centers are autonomous, support their graduate students, and are associated with specialized courses.

-- Center for Applied Statistics (Berk, Schoenberg, De Leeuw)

-- Center for Statistical Computing (Hansen)

-- Center for Image and Vision Sciences (Zhu, Yuille, Wu)

-- Center for the Teaching of Statistics (Gould and the Teaching Faculty)

-- Laboratory of Statistical Genomics (Sabatti)

-- Studio of Bio-data Refining and Dimension Reduction (Li)

Lessons

-- Pay attention to Campus Initiatives (Bioinformatics, Computing, UCLA in LA).

-- Link with large interdisciplinary projects (Embedded Networks, Institute for the Environment).

-- PI’s autonomy and reponsibility. Federal model.

-- Use a very wide definition of statistics.

-- Preprints, Digital Library, E-Journals.

I think this approach could work well even with a Berkeley-type department that is strong on probability. I like the idea of offering opportunities to students rather than telling them all what to do.

## June 1, 2006

### Comparing models

Jonathan Zaozao Zhang writes,

For the dataset in my research, I am currently trying to compare the fit of a linear model (y=a+bx) and a nonlinear model y=(a0+a1*x)/(1-a2*x).

The question is: For the goodness of fit, can I compare R-squared values? (I doubt it... Also, the nls command in R does not give an R-squared value for the nonlinear regression.) If not, why not? And what would be a common goodness-of-fit measure that can be used for such a comparison?

My response: first off, you can compare the models using the residual standard deviations. R^2 is ok too, since that's just based on the residual sd divided by the data sd. Data sd is same in 2 models (since you're using the same dataset), so comparing R^2 is no different than comparing residual sd.

Even simpler, I think, is to note that model 2 includes model 1 as a special case. If a2=0 in model 2, you get model 1. So you can just fit model 2 and look at the confidence interval for a2 to get a sense of how close you are to model 1.

Continuing on this theme, I'd graph the fitted model 2 as a curve of E(y) vs. x, showing a bunch of lines indicating inferential uncertainty in the fitted regression curve. Then you can see the fitted model and related possibilities, and see how close it is to linear.
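
A sketch of this workflow in Python (scipy's curve_fit standing in for R's nls; the data are simulated from the linear model, so a2's interval should hug zero):

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(42)
x = rng.uniform(0.0, 1.0, 200)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.3, 200)  # simulated truth is linear: a2 = 0

def model2(x, a0, a1, a2):
    return (a0 + a1 * x) / (1.0 - a2 * x)

# Model 1: ordinary linear least squares
b1, b0 = np.polyfit(x, y, 1)
sd1 = np.sqrt(np.sum((y - (b0 + b1 * x)) ** 2) / (len(x) - 2))

# Model 2: nonlinear least squares (the nls analogue)
popt, pcov = curve_fit(model2, x, y, p0=[1.0, 2.0, 0.0])
sd2 = np.sqrt(np.sum((y - model2(x, *popt)) ** 2) / (len(x) - 3))

# Approximate 95% interval for a2: how far are we from the linear model?
a2_se = np.sqrt(pcov[2, 2])
a2_lo, a2_hi = popt[2] - 2 * a2_se, popt[2] + 2 * a2_se
```

Because model 2 nests model 1, the two residual sds will be nearly identical on linear data, and the interval (a2_lo, a2_hi) plays the role of the confidence interval for a2 described above.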

## May 26, 2006

### The importance of groups of variables

The first thing we do when looking at the summaries of a regression model is to identify which variables are the most important. In that respect, we should distinguish the importance of a variable on its own from the importance of a variable as part of a group of variables. Information theory in combination with statistics allows us to quantify the amount of information each variable provides on its own, and how much the information provided by two variables overlaps.

I will use the notion of the nomogram from two days ago to explain this, on the same example of a customer walking into a bank and applying for credit. The bank has to decide whether to accept or reject the credit application. Let us focus on two variables, credit duration and credit amount. We can perform logistic regression for only one variable at a time, but display the effect functions on the same nomogram. This looks as follows:

In fact, this is a visualization of the naive Bayes classifier, using a loess smoother as a way of obtaining the conditional probability densities P(y|x). But regardless of that, we can see a relatively smooth, almost-linear increase in risk, both with increasing duration and with increasing credit amount. In that respect, both variables seem to be about equally good, although duration is better, partly because the credit-amount distribution is skewed, so that the big effects for large credits are somewhat infrequent.

But this is not the right way of doing regression: we have to model both variables at the same time. As the scatter plot shows, they are not independent:

This plot also seems to show that both of them are of comparable predictive power. But now consider the nomogram of the logistic regression model:


The coefficient for the credit amount has shrunk considerably! This would hold even if we performed Bayesian logistic regression and took the posterior mean as a summary of the correlated coefficients. Why was it the credit amount that shrank, and not the duration? I find the resolution of the logistic regression model somewhat arbitrary, in the spirit of "winner takes all".

A different interpretation is to use mutual and interaction information to clarify what is going on. Consider this summary:

The meaning is as follows:

• Duration alone explains 2.64% of the entropy of the risk.
• Credit amount alone explains 2.12% of the entropy of the risk.
• There is a 1.03% overlap between the information both of them provide.

Conditional mutual information indicates how much one variable tells us about the risk once we control for the other variable. In this case, duration would explain 2.64-1.03=1.61% of risk entropy had we controlled for credit amount, and credit amount would explain 2.12-1.03=1.09% of risk entropy had we controlled for the duration.

The only problem with this approach is that one needs to construct a reliable joint model of all three variables at the same time to be able to estimate these information quantities.
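
A minimal sketch of these information quantities on a made-up discretized dataset (the records are invented for illustration, not the credit data; standard library only):

```python
import math
from collections import Counter

def entropy(counts):
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values() if c)

# Made-up records: (duration_level, amount_level, risk)
data = ([(0, 0, 0)] * 30 + [(0, 0, 1)] * 5 + [(1, 1, 0)] * 10 +
        [(1, 1, 1)] * 15 + [(0, 1, 0)] * 8 + [(1, 0, 1)] * 2)

H_y = entropy(Counter(y for _, _, y in data))

def mi(key):
    """I(X; Y) = H(X) + H(Y) - H(X, Y)."""
    H_x = entropy(Counter(key(r) for r in data))
    H_xy = entropy(Counter((key(r), r[2]) for r in data))
    return H_x + H_y - H_xy

I_dur = mi(lambda r: r[0])           # duration alone
I_amt = mi(lambda r: r[1])           # amount alone
I_both = mi(lambda r: (r[0], r[1]))  # both jointly
overlap = I_dur + I_amt - I_both     # >0: redundancy, <0: synergy
```

Dividing each quantity by H_y gives the "percent of risk entropy explained" numbers quoted above.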

Posted by Aleks Jakulin at 2:19 PM

## May 24, 2006

### Nomograms

Regression coefficients are not very pleasant to look at when listed in a table. Moreover, the value of the coefficient is not what really matters. What matters is the value of the coefficient multiplied by the value of the corresponding variable: this is the actual "effect" that contributes to the value of the outcome, or, with logistic regression, toward the log-odds. With this approach, it is no longer necessary to scale variables prior to regression. A nomogram is a visualization method based on this idea.
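
The idea is simply to display the products beta_j * x_j rather than the raw beta_j. A toy sketch (the coefficients and applicant values are invented for illustration):

```python
import math

# Hypothetical fitted logistic-regression coefficients for credit risk
coefs = {"intercept": -2.0, "duration_months": 0.03, "amount_k": 0.10}
applicant = {"duration_months": 24, "amount_k": 15}

# A nomogram plots these per-variable effects, not the raw coefficients
effects = {k: coefs[k] * applicant[k] for k in applicant}
log_odds = coefs["intercept"] + sum(effects.values())
p_risk = 1.0 / (1.0 + math.exp(-log_odds))
```

Because each effect is on the same log-odds scale regardless of the variable's units, no prior rescaling of the variables is needed to compare them.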

An example of a logistic regression model of credit risk, visualized with a nomogram is below:

We can see that the coefficients for the nominal (factor) variables are grouped together on the same line. The intercept is implicit through the difference between the dashed line and the 0.5 probability axis below. The error bars for individual parameters are not shown, but they could be if we desired. We can easily see whether a particular variable increases or decreases the perception of risk for the bank.

It is also possible to display similar effect functions for nonlinear models. For example, consider this model for predicting the survival of a horse, depending on its body temperature and its pulse:

As soon as the temperature or pulse deviates from the "standard operating conditions" of 50 heart beats per minute and 38 degrees Celsius, the risk of death increases drastically.

Both graphs were created with Orange. Martin Mozina and colleagues have shown how nomograms can be rendered for the popular "naive" Bayes classifier (which is an aggregate of a bunch of univariate logistic/multinomial regressions). A year later we demonstrated how nomograms can be used to visualize support vector machines, even for nonlinear kernels, by extending nomograms toward generalized additive models and including support for interactions. Frank Harrell has an implementation of nomograms within R as a part of his Design package (function nomogram).

Posted by Aleks Jakulin at 9:29 AM

## May 15, 2006

### Statistical anomaly, or is something going on?

According to the Washington Post, "Florida had seen just 17 confirmed fatal alligator attacks in the previous 58 years. In less than a week, there appears to have been three."

How strange is this? 17 attacks in 58 years is about 0.3 attacks per year. But there are a lot more people in Florida now than there were 58 years ago or even 25 years ago, so we might expect the rate of attacks to be increasing with time (though I don't know if that's true). Suppose the current "true" rate is 1 attack per year, or 0.02 per week. In that case, given that there is one attack, the probability of having two more within a week would seem to be roughly 0.0004. (We look at the question this way because we are interested in the question "what is the probability of having three attacks in the same week", not "what is the probability of having three attacks in this particular week.")

So according to this crude estimate, 3 fatal alligator attacks in a week is indeed an extremely unlikely occurrence, although not spectacularly so. There could be some important modifiers, though. For instance, maybe there is a seasonality to alligator attacks, either because alligators are more aggressive at some times of year or because more people do risky activities during some times of the year.
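
For the record, the Poisson version of this back-of-the-envelope calculation (the 1-per-year rate is the assumed value from above):

```python
import math

r = 1.0 / 52  # assumed rate: one attack per year, expressed per week
# Given one attack, P(at least two more in the following week)
# under a Poisson model: 1 - P(0) - P(1)
p_two_more = 1.0 - math.exp(-r) * (1.0 + r)
```

This comes out around 2 in 10,000, the same order of magnitude as the crude 0.0004 figure (which treats the two further attacks as independent one-week events).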

On a related subject, I'm never exactly sure how to think of these "freakish coincidence" stories. We'd be having this same sort of discussion if instead there had been three fatal bear attacks in a week in Oregon, or three fatal shark attacks in Hawaii, or three fatal mountain lion attacks in California, or whatever, so maybe a "three fatal something attacks in a week" story shouldn't be all that rare. On the other hand, it does seem worthwhile trying to figure out if anything has changed in Florida that would make gator attacks a lot more common, because 3 in a week sure seems exceptional.

## May 14, 2006

### Pretty graph, could be made even prettier

Here's a pretty graph (from Steven Levitt, who says "found on the web" but I don't know the original source):

This is a good one for your stat classes. My only suggestions:

1. Get rid of the dual-colored points. What's that all about? One color per line, please! As Tufte might say, this is a pretty graph on its own, it doesn't need to get all dolled up. Better to reveal its natural beauty through simple, tasteful attire.

2. Normalize each month's data by the #days in the month. Correcting for the "30 days hath September" effect will give a smoother and more meaningful graph.

3. Something wacky happened with the y-axis: the "6" is too close to the "7". Actually, I think it would be fine to just label the axis at 6, 8, 10,... Not that it was necessarily worth the effort to do it in this particular case, just thinking about this one to illustrate general principles. Ideally, the graphing software would make smart choices here.

4. (This takes a bit more work, but...) consider putting +/- 1 s.e. bounds on the hockey-player data. Hmm, I can do it right now....761/12 = 63, so we're talking about relative errors of approximately 1/sqrt(63)=1/8, so the estimates are on the order of 8% +/- 1%.
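
Suggestions 2 and 4 are easy to mechanize. A sketch (the monthly counts here are invented; only the 761 total matches the post's arithmetic):

```python
import calendar
import math

# Hypothetical monthly birth counts for 761 hockey players
counts = {1: 84, 2: 77, 3: 80, 4: 68, 5: 66, 6: 62,
          7: 58, 8: 57, 9: 55, 10: 54, 11: 52, 12: 48}
total = sum(counts.values())
days = {m: calendar.monthrange(2001, m)[1] for m in counts}

# Suggestion 2: percent of players born per day of each month,
# removing the "30 days hath September" artifact
pct_per_day = {m: 100.0 * counts[m] / total / days[m] for m in counts}

# Suggestion 4: rough +/- 1 s.e. on each month's share, in percent
se_pct = {m: 100.0 * math.sqrt(counts[m]) / total for m in counts}
```

With roughly 63 players per month, each standard error lands near the 1% figure computed above.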

P.S. See Junk Charts for more.

## May 10, 2006

### Richard Berk's book on regression analysis

I just finished reading Dick Berk's book, "Regression analysis: a constructive critique" (2004). It was a pleasure to read, and I'm glad to be able to refer to it in our forthcoming book. Berk's book has a conversational format and talks about the various assumptions required for statistical and causal inference from regression models. I was disappointed that the book used fake data--Berk discussed a lot of interesting examples but then didn't follow up with the details. For example, Section 2.1.1 brought up the Donohue and Levitt (2001) example of abortion and crime, and I was looking forward to Berk's more detailed analysis of the problem--but he never returned to the example later in the book. I would have learned more about Berk's perspective on regression and causal inference if he were to apply it in detail to some real-data examples. (Perhaps in the second edition?)

p. xiv: Berk writes, "We could all sit at our desks and perform hypothetical experiments in our heads all day, and science would not advance one iota." This is true of most of us, I'm sure . . . but Einstein advanced science by his hypothetical experiments on relativity theory. So it is possible!

p.19 has a nice quote: "If the goal is to do more than describe the data on hand, information must be introduced that cannot be contained in the data themselves."

Figure 3.3: This is described as a "bad fit," but it's just a case of high residual variance. The model fits fine, but sigma is large. Perhaps a distinction worth making. (More generally, I like that Berk focuses on the mean--the deterministic part of the regression model--rather than the errors. Most statistics texts seem to make the mistake of talking on and on about the distribution of the errors and the variance function, but it's the deterministic part that's generally most important.)

I also like that in Section 3.6, Berk presents transformations as a way to get linearity (not normality or equal-variance, which are typically much less important). Again, an important practical point that is lost in many more mathematical books.

Figure 4.5: I don't really understand this picture: it's a plot of (hypothetical?) data of students' grade point averages vs. SAT scores, and the discussion says that "larger positive errors around the line tend to have larger SAT scores." But I don't understand what is meant by "errors" here or why the regression line is not drawn to go through the data.

p.68: Berk writes, "The null hypothesis is either true or false." Actually, I'd go further: in any problem I've ever worked on, the null hypothesis is false. Mathematically, the null hyp in regression is that some beta equals 0, and in social and environmental science, the true beta (as would be seen by gathering data from a very large population) is never exactly zero. I do think that hyp testing can be useful--for example, it can tell you that you can be very sure that beta>0 or that beta<0, and whether the data are sufficient to estimate beta precisely--but we know ahead of time that beta != 0.

Sections 5.2.2 and 10.5.1: There's an extensive discussion here of "response schedules" but I don't quite understand what's being said here. A full example with data would help.

On p. 92, there's a discussion of estimating the effects of prison sentences, and Berk seems to be saying that this can't be done because you can't simultaneously manipulate the length of sentence and the age at which a prisoner is released. But I don't see why this is a problem: say, for example, that you have a bunch of 20-year-old prisoners, and through a randomized intervention, some are released at age 25 and some at age 40. You can look at a bivariate outcome: crimes committed ages 25-40, and crimes committed ages 40+. The treatment will have a clear effect on the first outcome (of course, there are cost-benefit issues as to whether the treatment is worth it), but you can certainly compare the crimes age 40+ for the two groups.

Chapter 5 has lots of discouraging examples. It would be good to see some success stories. (Parochially, I can point to this and this as particularly clean examples of causal inference from observational data, but lots more is out there.) I also think the discussion of models would be strengthened by some discussion of interactions (in the causal setting, that would correspond to treatments that are more effective for some groups than others). This is also an active research area (see here).

p.99: In the discussion of matching, it would unify things to point out that matching followed by regression can be more effective than either alone (this was Don Rubin's Ph.D. thesis, published in article form as Rubin, 1973).

In Chapter 8 there's some discussion of stepwise regression etc. It would be also helpful to discuss other methods of combining predictors, for example adding them up to create "scores." Also, when mentioning categorical predictors, that's a good place to put a pointer to multilevel models.

I agree with Berk's skepticism in Chapter 9 about traditional "regression diagnostics." In my experience, outliers and nonnormality are not the key concerns, and what's more important is to get a sense of what the model is doing and how it is being fit to the data.

In Chapter 10, he refers to multilevel modeling as "relatively recent." Actually, it's been around since the early 1950s in animal breeding and since the early 1970s in education. These are two fields where one encounters many small replicated datasets.

I also think Berk is too skeptical about multilevel models. I think he needs to apply equal skepticism to the alternative, which is to include categorical predictors and fit by least squares. This least-squares alternative has bad statistical properties, makes it difficult to fit varying slopes and include group-level predictors, becomes even messier when fitting logistic regressions to sparse data, and is in fact a special case of multilevel modeling where the group-level variance is set to infinity. So, yes, I agree that multilevel modeling does not solve the problem of causality (see here), but it can be pretty useful for model fitting.

I have similar comments for Berk's discussion of meta-analysis. Numerical and graphical combination of information can be helpful, and multilevel meta-analysis is a way of doing this and separating out the different sources of variation as they appear in the data.

Finally, there's a quote on page 204 disparaging the method (which I like) of hypothesizing a model, and then when it is rejected by the data, of replacing or improving the model. I think the iteration of modeling/fitting/checking/re-modeling is extremely useful (here, I'm influenced by the writings of Jaynes and Box, as well as my own experiences). The quote says something about how if you "stick your neck out" to assume a model, then your head will be cut off. But I don't think that's quite right. I'll make an assumption, knowing that it's false, and ready to replace or refine it as indicated by the data.

## May 8, 2006

### Vegas, baby, Vegas

I am in Vegas for a couple of days, to give a talk. A few observations:
0. I'm shocked -- shocked! -- to find gambling going on here.
1. Some casinos have blackjack tables that pay 3:2 for a blackjack; others pay 6:5. As far as I could tell, there are no other differences in the rules at these casinos. Since blackjacks are not all that rare --- about 5% of hands --- one would think that players would just walk across the street to a casino that offers the better odds. Maybe some players do, but many do not.
2. Card counting would take considerable practice, even using the "high-low" method that just keeps a running count of the difference in the number of high versus low cards that have been dealt. In a few minutes of trying to keep track while watching a game -- much less playing it -- it was very easy to accidentally add rather than subtract, or miss a card, and thus mess up the count. Don't count on paying for your Vegas vacation this way, no matter how good a statistician you are, at least unless you practice for a few hours first.
3. Actual quote from an article called "You can bet on it: How do you beat the casino?", by Larry Grossman, in the magazine "What's On, the Las Vegas Magazine": "For a slot player the truly prime ingredient to winning is the luck factor. All you have to do is be in the right place at the right time. No other factor is as meaningful as luck when you play the slots." Forsooth!
4. Also from the Grossman "article": "Certain games played in the casino can be beaten, and not just in the short term. The winnable ones are these: blackjack, sports betting, race betting and live poker." Seems reasonable and maybe correct.
5. The toilets in "my" hotel/casino (the oh-so-classy Excalibur) seem to waste water almost gratuitously. I can understand why they want swimming pools, and giant fountains that spray water into the desert air...but does having a really water-wasting _toilet_ convey some sort of feeling of luxury?
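
Observation 1 can be quantified with one line of arithmetic: the 6:5 table costs the player the blackjack frequency times the payout gap, per hand (the ~5% figure is the one in the post; the exact frequency depends on the number of decks):

```python
p_blackjack = 0.0475    # blackjacks come "about 5% of hands"
payout_gap = 1.5 - 1.2  # 3:2 pays 1.5x the bet; 6:5 pays only 1.2x
cost_per_hand = p_blackjack * payout_gap  # extra house edge, as a fraction of the bet
```

That's roughly 1.4% of every bet, several times the baseline house edge of a well-played blackjack game, which makes the walk across the street look very worthwhile.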

## May 1, 2006

### Marginal and marginal

Statistics and economics have similar, but not identical, jargon, which overlaps in various confusing ways (consider "OLS," "endogeneity," "ignorability," etc., not to mention the implicit assumptions about the distributions of error terms in models).

To me, the most interesting bit of terminological confusion is that the word "marginal" has opposite meanings in statistics and economics. In statistics, the margin (as in "marginal distribution") is the average or, in mathematical terms, the integral. In economics, the margin (as in "marginal cost") is the change or, in mathematical terms, the derivative. Things get more muddled because statisticians talk about the marginal effect of a variable in a regression (using "margin" as a derivative, in the economics sense), and econometricians work with marginal distributions (in the statistical sense). I've never seen any confusion in any particular example, but it can't be a good thing for one word to have two opposite meanings.

P.S. I assume that the derivation of "margin," in both senses, is from the margin of a table, in which case either interpretation makes sense: you can compute sums or averages and place them on the margin, or you can imagine the margin to represent the value at the next value of x, in which case the change to get there is the "marginal effect."
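
The two senses fit in one toy table (numbers invented): statistics' margin sums over a dimension, while economics' margin differences along it.

```python
import numpy as np

# f(x, y) tabulated on a 3x3 grid
table = np.array([[1.0, 2.0, 4.0],
                  [2.0, 4.0, 8.0],
                  [3.0, 6.0, 12.0]])

stat_margin = table.sum(axis=1)       # statistics: sum y out, one total per row of x
econ_margin = np.diff(table, axis=0)  # economics: change as x moves to its next value
```

Both results are literally things one could write in the margin of the table, which is consistent with the derivation conjectured above.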

## April 27, 2006

### Statistical significance as a guide to future data collection

The vigorous discussion here on hypothesis testing made me think a bit more about the motivations for significance tests. Basically, I agree with Phil's comment that there's no reason to do a hypothesis test in this example (comparing the average return from two different marketing strategies)--it would be better to simply get a confidence interval for each strategy and for their difference. But even with confidence intervals, it's natural to look at whether the difference is "statistically significantly" different from zero.

As Phil noted, from a straight decision-analytic perspective, significance testing does not make sense. You pick the strategy that you think will do better--it doesn't matter whether the difference is statistically significant. (And the decision problem is what to do, not what "hypothesis" to "accept.")

But the issue of statistical significance--or, perhaps better put, the uncertainty in the confidence interval--is relevant to the decision of whether to gather more data. The more uncertainty, the more that can be learned by doing another experiment to compare the treatments. But if the treatments are clearly statistically significantly different, you already know which is better and there's no need to gather more data.

In reality, this is a simplification--if you really can clearly distinguish the 2 treatments, then it makes sense to look at subsets of the data (for example, which does better in the northeast and which does better in the midwest), and continue stratifying until there's too much uncertainty to go further. But, anyway, that's how I see the significance issue here--it's relevant to decisions about data collection.
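
A minimal sketch of this view of the decision problem (simulated returns and a plain normal-theory interval, purely for illustration): pick the strategy with the higher mean either way, and read the interval's width as a guide to whether more data collection would help.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated returns from two marketing strategies (made-up numbers).
a = rng.normal(loc=10.0, scale=5.0, size=200)
b = rng.normal(loc=11.0, scale=5.0, size=200)

diff = b.mean() - a.mean()
se = np.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
lo, hi = diff - 1.96 * se, diff + 1.96 * se

# Decision: go with the higher mean, significant or not. The interval's
# width, not whether it excludes zero, is what says how much another
# round of data collection could still teach you.
print(f"estimated difference: {diff:.2f}, 95% interval: ({lo:.2f}, {hi:.2f})")
```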

## April 25, 2006

### Direct mail marketing problem

Gordon Dare writes with an interesting example of the kind of statistics question that comes up all the time but isn't in the textbooks:

I took survey sampling with you a few years ago. I would be grateful if you could help me with a hypothesis test problem. I work in a direct mail company and would like to know if the difference in results in two groups is significant. Here are the details:

group 1: 50,000 customers mailed a catalog. Of these, 100 purchased, and the mean of their spend was $50 with a standard deviation of $10.

group 2: 50,000 customers mailed a catalog. Of these, 120 purchased, and the mean of their spend was $55 with a standard deviation of $11.

I did extensive research but am still confused as to what N I should use to test if there is a significant difference between these two means ($50 and $55). Is it the 50,000 or the 100? Should I use $0 for the non-buyers to calculate the mean and SD? If so, with so many non-buyers, how should I cater for such a skewed distribution, or is this not an issue?

My response: The quick answer is to use all 50,000 customers in each group (counting the non-buyers as zeroes). Skewness really isn't an issue given that you have over 100 nonzeroes in each group. You could also do more elaborate analyses, considering the purchasing decision and the average purchase separately, but the quick summary would be to just use the total.

Posted by Andrew at 12:00 AM | Comments (8) | TrackBack

## April 20, 2006

### Bayes pays

Date: Wed, 19 Apr 2006 13:46:11 -0400
From: patrick.e.flanagan@census.gov
Subject: Job Opportunity at Bureau of the Census

The job listed below may be of interest to those with a Bayesian background.
Job: DSMD-2006-0018 & DSMD-2006-0019
Job Position: 1529 Mathematical Statistician
Job Description: Mathematical Statistician, GS-1529-14, Leader of the Small Area and Population Estimates Methods Staff in the Demographic Statistical Methods Division
Job grade: GS-14
Close Date: 04/27/2006
Salary Range: $91,407.00 - $118,828.00
1 vacancy at Washington DC Metro Area, DC (Suitland, MD)

Follow this link to the job listed above: https://jobs.quickhire.com/scripts/doc.exe/rundirect?Org=1&Job=6365

This position is also listed with the USAJobs website http://www.usajobs.opm.gov as Control Number 642688

Patrick Flanagan
Assistant Division Chief
Demographic Statistical Methods Division
Bureau of the Census

## April 18, 2006

### A wallful of data

Rafael pointed me to this link: I can't quite figure out what's on that wall, but I wonder if it could be ordered a bit so that anomalies would show up more (as in the cover image of Bayesian Data Analysis).

## April 10, 2006

### Log transformations and generalized linear models

Gregor writes,

I would like to hear your opinion on Paul Johnson's comments here, where this link is provided. In this note Paul states that:

The GLM really is different than OLS, even with a normally distributed dependent variable, when the link function g is not the identity. Using OLS with manually transformed data leads to horribly wrong parameter estimates. Let y_i be the dependent variable with mean \mu. OLS estimates E(g(y_i)) = b_0 + b_1 x_i, but the GLM estimates g(E(y_i)) = b_0 + b_1 x_i.

This also applies to the log transformation. So the following two approaches are not the same:

glm(log(y) ~ x, family = gaussian(link = "identity"))
glm(y ~ x, family = gaussian(link = "log"))

The difference is that the first approach log transforms the observed values, while the second one log transforms the expected value.

My reply: Yeah, that's right.
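
The distinction is just Jensen's inequality: the mean of the logs is not the log of the mean. A minimal numerical check in Python (simulated lognormal data standing in for a real regression, so a sketch rather than Gregor's actual R calls):

```python
import numpy as np

rng = np.random.default_rng(1)

# All-positive outcomes with multiplicative errors: log(y) ~ Normal(1, 0.5).
y = np.exp(rng.normal(loc=1.0, scale=0.5, size=100_000))

# What a regression on log(y) targets: E(log y) -> here, 1.0.
mean_of_logs = np.log(y).mean()

# What a log-link GLM on y targets: log E(y) -> here, 1.0 + 0.5**2 / 2 = 1.125.
log_of_mean = np.log(y.mean())

print(mean_of_logs)  # about 1.0
print(log_of_mean)   # about 1.125 -- the two models estimate different things
```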
Usually I'd just take the log of the data, because, for all-positive outcomes, it typically makes sense to consider effects and errors as multiplicative (that is, additive on the log scale). And on the log scale you won't get negative predictions. But another way to look at it is that the 2 models are very similar, with the key difference being the relation between the predicted value and the variance. In some problems, you won't want to pick either model; instead you can model the variance as a power law, with the power estimated from the data. This is done in serial dilution assays; see here, for example.

P.S. I answer all of Gregor's questions because they are interesting. Also, he gave us literally zillions of comments on our forthcoming book.

## April 5, 2006

### Bubble graphs

Junk Charts features these examples of pretty but not-particularly-informative bubble plots from the New York Times. If you like those, you'll love this beautiful specimen that Jouni sent me awhile ago. As Junk Charts puts it, "perhaps the only way to read their intention is to see them as decorated data tables, in other words, as objets d'art rather than data displays."

I'll have to say, though, the bubble charts do display qualitative comparisons better than tables do. But, yeah, "real" graphs (dot plots, line plots, etc.) would be better.

## April 4, 2006

### NCAA men's basketball tournament...place your bets

I'm not much of a sports fan, but I enjoy reading "King Kaufman's Sports Daily" at Salon.com. (I think Kaufman's column may be only available to "Premium" (paid) subscribers.)
For the past few years, Kaufman has tracked the performance of self-styled sports "experts" in predicting the outcome of the National Collegiate Athletic Association's (NCAA) men's basketball tournament, which begins with 64 teams that are selected by committee and supposedly represent (more or less) the best teams in the country, and ends with a single champion via a single-elimination format. Many people wager on the outcome of the tournament --- not just who will become champion, but the entire set of game outcomes --- by entering their predictions into a "pool" from which the winner is rewarded.

Last year and this year, Kaufman included the "predictions" of his son Buster (now three years old). Last year Buster flipped a coin for every outcome, and did not perform well; this year, he followed a modified strategy that is essentially a way of sampling from a prior distribution derived from the official team rankings created by the NCAA selection committee.

The Pool o' Experts features a roster of national typists and chatterers, plus you, the unwashed hordes as represented by the CBS.SportsLine.com users' bracket, and my son, Buster, the coin-flippinest 3-year-old in the Milky Way.

[Kaufman later explains: "Buster's coin-flipping strategy was modified again this year. Essentially, he picked all huge favorites, flipped toss-up games, and needed to flip tails twice to pick the upsets in between. Write me for details if this interests you, but think really hard before you do that, and maybe call your therapist."]

To answer the inevitable question: Yes, Buster really exists. When football season comes in five months and he's still 3, I'll get letters saying it seems like he's been 3 for about two years, which says something about how we perceive the inexorable crawl of time. But I don't know what.

Anyway, correct predictions earn 10 points in the first round, 20 in the second, and 40, 80, 120 and 160 for subsequent rounds.
Note that Buster's strategy is to assign a win probability of 1 to a team that is a "huge favorite" based on pre-tournament seeding, a probability of 0.75 to a team that is a strong favorite, and a probability of 0.5 to a team that is playing another team of approximately equal seeding.

So, how did the experts do?

Because of [a particularly arrogant and ridiculous prediction several years ago] I'm always interested to see how Sports Illustrated does in the Pool o' Experts. Generally, it doesn't do well. And this year was no different -- except, interestingly, in the case of Mandel. He won, and his co-workers lost. Here are the final standings of the 2006 Pool o' Experts, the winner of which is entitled to dinner at my house, home cooking not implied. The winner is also not notified, the better to avoid having to award the prize:

1. Stewart Mandel, Sports Illustrated, 920
2. Gregg Doyel, CBS.SportsLine, 880
3. King Kaufman, Salon, 780
4. Tony Kornheiser, Washington Post, 760
5. Buster, Coinflip Quarterly, 740 (2)
6. WhatIfSports.com simulation, 720
7. Yoni Cohen, FoxSports.com, 690
8. Luke Winn, Sports Illustrated, 680
9. NCAA Selection Committee, 630
10. Seth Davis, Sports Illustrated, CBS, 590
11. CBS.SportsLine.com users, 550 (1)
12. Tony Mejia, CBS.SportsLine.com, 530 (1)
13. Grant Wahl, Sports Illustrated, 490 (3)

Notes
1. Denotes past champion
2. Denotes 3-year-old
3. Just wanted to say hi

The "NCAA Selection Committee" did not make a formal prediction, but (as indicated in the list above) would implicitly have finished in 9th place if one assumes that they would pick their favorite in each game. Buster the coin-flipper, whose predictions were essentially one "realization" of the tournament using the NCAA rankings and a crude way of performing the simulation, beat the NCAA and indeed beat most of the competitors.
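
The pick-the-favorites-versus-random-realization question can be explored by simulation. A minimal sketch (hypothetical win probabilities standing in for the seedings, first-round games only, so an illustration rather than a full bracket):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setup: 32 first-round games, favorite wins each with
# probability 0.75 (a stand-in for the seedings), 20,000 simulated pools.
n_games, p_fav, n_sims = 32, 0.75, 20_000

# outcomes[s, g] is True when the favorite actually won game g in sim s.
outcomes = rng.random((n_sims, n_games)) < p_fav

# Strategy A: always pick the favorite.
score_a = outcomes.sum(axis=1)

# Strategy B (Buster-style): sample one realization from the assumed
# probabilities, i.e. pick each favorite with probability 0.75.
picks_b = rng.random((n_sims, n_games)) < p_fav
score_b = (picks_b == outcomes).sum(axis=1)

print("mean correct, pick-favorites:", score_a.mean())  # about 24
print("mean correct, Buster-style:", score_b.mean())    # about 20
print("P(Buster beats pick-favorites):", (score_b > score_a).mean())
```

Under this toy model, picking favorites maximizes the expected number of correct picks; but winning a large pool rewards being both right and different from the other entries, which is why a randomized entry is not obviously a bad idea.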
At first thought it seems that the best approach to trying to win a pool such as this is to pick the favorite in every game (as in the "NCAA Selection Committee" results): after all, if the seedings are correct then all-favorites-win is the single most likely outcome (with perhaps a one-in-a-million chance of occurring). But is this really the best strategy? Is there another strategy that would (1) beat pick-the-favorites more than half the time, or (2) have a better chance of winning the pool? A couple of other things: (3) Is the scoring system for evaluating the performance (described near the middle of this entry) reasonable? And (4), as Andrew has previously pointed out, there's no such thing as a "weighted coin."

Posted by Phil at 12:33 PM | Comments (6) | TrackBack

### A reference for multiple regression

Someone who goes by the name "mr_unix" writes:

In your blog entry, there is reference to a public document called "Reference Manual on Scientific Evidence." I downloaded it and tried to read the chapter on multiple regression by Rubinfeld. I couldn't read it because the footnotes were so copious and interfering. I removed them with awk and reformatted the chapter to resemble an ordinary (!) monograph. Some people will benefit from it, I think. It's attached in MS Word format.

Here it is. I don't agree with everything in it, but it's basically reasonable.

## March 30, 2006

### Tables of regression coefficients

Andrew Sutter writes,

I'm a practicing attorney with a background in physics. I've recently begun reading a lot of papers by economists, on topics like intellectual property, economic development and world trade. I'm pretty comfortable reading many technical papers in engineering and physical sciences for my work, so I'm not a quantitative basket case. Nonetheless, I am stupefied by some of the literary conventions of this new genre.
Whoever decided it was a great idea to present one's findings in the form of acres and acres of regression coefficient tables? (Using dense, opaque clusters of capital letters as captions for the rows or columns is another odd notion, but easier to handle.) Why not use pictures? Of course, from what I've seen of your blog, I think I may be preaching to the choir here (or to the bishop, so to speak).

Aside from my puzzlement at why social scientists like to spill so many digits, I'm stuck with the more practical problem of how to interpret these tables. I have a basic idea of what a regression is, and what correlation is. Is there some shortcut to interpreting these tables for someone who (i) is unlikely to run a regression himself in the foreseeable future, and (ii) doesn't have time to wrestle with a 600-page econometrics textbook? Is the gist of it that, if I want to find what the authors will claim to be the biggest effects, I should look for the coefficients with the biggest absolute values and the tightest significance levels? Should I worry about the second row of numbers in each box, which is usually in parentheses, but that sometimes represents the t-value, other times the standard error, or something else? Is it safe to ignore those on a first reading?

If there isn't such an easy answer, then can you recommend any concise (say, under 300 pages), real-world-oriented book that could give me this background -- especially at a price below the now-traditional $94 quantum for sucking money from students? Or can you point me toward some webpage or paper that has been designed as a guide for the similarly perplexed? (Most Internet resources I've found so far are geared to those who will be running regressions themselves, and/or relate to a particular piece of software, and/or are very abstract.)

I look forward someday to digging into the epistemological and interpretive questions you discuss on your blog. But in the meantime, I’d first like to know what these economists are saying in their fantasy models of patent licensing and such.

Is mine a vain hope?

My response: for a quick start, the importance of a predictor, within the context of the multiple regression, is basically the absolute value of the coefficient, multiplied by the magnitude of a typical change in that variable. (For example, many predictors are simply 0 or 1, but if a predictor runs from 1 to 7, then a typical change might be 4, going from 2 to 6.) The standard errors tell you the uncertainty in the coefficients but not their importance. That said, things are more complicated. For example, two predictors can be jointly important, even if neither looks so big alone. And models typically include interactions, so that a single input variable can enter into many predictors.
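
A hypothetical numerical version of that rule of thumb (the coefficients and scales are invented for illustration):

```python
# Hypothetical regression output: coefficients and a "typical change"
# for each predictor (1 for a 0/1 indicator, 4 for a 1-7 scale, etc.).
coefs = {"incumbency": 8.0, "ideology": -1.5, "spending": 0.002}
typical_change = {"incumbency": 1.0, "ideology": 4.0, "spending": 2000.0}

# Importance = |coefficient| x typical change in the predictor.
importance = {k: abs(coefs[k]) * typical_change[k] for k in coefs}

print(importance)  # {'incumbency': 8.0, 'ideology': 6.0, 'spending': 4.0}
```

Note that the tiny-looking spending coefficient is not the least important once the scale of the variable is taken into account.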

Why do they use tables instead of graphs? An economist (or sociologist) could answer better than me, but my guess is a mixture of:

- Tables are standard, and econ journals are pretty conservative regarding graphical presentation.
- Graphs take more work than tables to produce.
- Students learn what they are taught, thus the tabular format is self-perpetuating.
- When graphics are studied in statistics, it's usually in the context of plotting raw data, not plotting estimates.
- Numbers are unambiguous and objective. Perhaps there's a fear that you could cheat or mislead using graphics. Hence, the econ papers I've seen with graphs also have tables. They don't seem to trust graphs on their own.

## November 16, 2005

### Creeping alphabetism

Here's an example of how the principles of statistical graphics can be relevant for displays that, at first glance, do not appear to be statistical. Below is a table, from a Language Log entry by Benjamin Zimmer, of instances of phrases of the form "He eats, drinks, sleeps X" (where the three verbs, along with X, can be altered). I'll present Zimmer's table and then give my comment.

Here's the table:

My comment

The verb sequences are presented alphabetically. I'd rather see them in time order. This would give me a better sense of how the patterns have changed over the years.

## November 15, 2005

### I agree completely

I agree completely with this Junk Charts entry, which presents two examples (from the Wall Street Journal on 9 Nov 2005) of bar graphs that become much much more readable when presented as line graphs. The trends are clearer, the comparisons are clearer, and the graphs themselves need much less explaining. Here's the first graph (and its improvement):

And here's the second:

- First graph should be inflation-adjusted.
- Both graphs could cover a longer time span.
- Axis labels on second graph could be sparer (especially the x-axis, which could be labeled every 5 years).
- I'd think seriously about having the second graph go from 0 to 100% with shading for the three categories (as in Figure 10 on page 451 of this paper, which is one of my favorites, because we tell our story entirely through statistical graphics).
- The graphs could be black-and-white. I mean, color's fine but b&w is nice because it reproduces in all media. The lines are so clearly separated in each case that no shading or dotting would be needed.

P.S. See here for a link to some really pretty graphs.

## November 10, 2005

### Transforming variables that can be positive, negative, or zero

Sy Spilerman writes,

I am interested in the effect of log(family wealth) on some dependent variable, but I have negative and zero wealth values. I could add a constant to family wealth so that all values are positive. But I think that families with zero and negative values may behave differently from positive wealth families. Suppose I do the following: Decompose family wealth into three variables: positive wealth, zero wealth, and negative wealth, as follows:

- positive wealth, coded as ln(wealth) where family wealth is positive, and 0 otherwise;
- zero wealth, coded 1 if the family has zero wealth, and 0 otherwise;
- negative wealth, coded as ln(absolute value of wealth) if family wealth is negative, and 0 otherwise,

and then use this coding as right side variables in a regression. It seems to me that this coding would permit me to obtain the separate effects of these three household statuses on my dependent variable (e.g., educational attainment of offspring). Do you see a problem with this coding? A better suggestion?

Yes, you could do it this way. I think then you'd want to include values very close to zero (for example, anything less than $100 or maybe$1000 in absolute value) as zero. But yes, this should work ok. Another option is to just completely discretize it, into 10 categories, say.

Any other suggestions out there? This problem arises occasionally, and I've seen some methods that seem very silly to me (for example, adding a constant to all the data and then taking logs). Obviously the best choice of method will depend on details of the application, but it is good to have some general advice too.
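
Spilerman's three-variable coding, with the near-zero threshold from the reply folded in, can be sketched as follows (numpy, with a made-up $1000 cutoff; the function name and threshold are just for illustration):

```python
import numpy as np

def code_wealth(wealth, threshold=1000.0):
    """Decompose wealth into three regression inputs: log of positive
    wealth, an at-or-near-zero indicator, and log of |negative wealth|.
    Values within +/- threshold count as zero, per the advice above."""
    w = np.asarray(wealth, dtype=float)
    pos = np.zeros_like(w)
    neg = np.zeros_like(w)
    pos[w > threshold] = np.log(w[w > threshold])
    neg[w < -threshold] = np.log(-w[w < -threshold])
    zero = (np.abs(w) <= threshold).astype(float)
    return pos, zero, neg

# A family with $50,000; one with nothing; one $20,000 in debt; one with $500.
pos, zero, neg = code_wealth([50_000, 0, -20_000, 500])
print(pos, zero, neg)
```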

## November 9, 2005

### More on Um, also on the implementation of start-at-zero

I [Mark Liberman] was mostly trying to see whether a new database search program was working. I knew that men have been said to use filled pauses like "uh" more than women, and it made sense to me that disfluency would increase with age, so I generated the data for the first plot and took a look. I think you're right that I should have started the plot from 0, but I wasn't sure what I'd see, and thought that the qualitative effects if any would be clearer with a narrower range of values plotted.

Then I wondered about "um", and still had a few minutes, so I ginned up the data for the second plot and took a look at it. I was quite surprised to see the opposite age effect, and somewhat surprised to see the inverted sex effect, so I quickly looked up the standard papers on the subject and banged out a post.

Actually what I did was to add a bit of verbiage around the .html notes (with embedded graphs) that I'd been making for myself.

I've attached the first plot that I made in that session, showing the female/male ratio for a number of words that I thought might show a difference. The X axis is the (log) count of the word (mean of counts for male and female speakers), and the y axis is the (log) ratio of female/male counts. The plotted words are too small, but I wasn't sure how much they would overlap...

If I can find another spare hour or two, I'm going to check out whether southerners really talk slower than northerners.

And here's Mark's new plot:

Here's the full version. (I don't know how to fit it all on the blog page.)

P.S. In his new plots (see here), Mark uses a 2x2 grid and extends the y-axis to 0. To be really picky, I'd suggest making 0 a "hard boundary." In R you can do this using 'yaxs="i"' in the plot() call, but then the top boundary will be "hard" also, so that you have to use ylim to extend the range (e.g., ylim=c(0,1.05*max(y))). What I should really do is write a few R functions to encode my default graphing preferences so that I don't need to do this crap every time I make a graph.

## November 8, 2005

### Uh . . . um . . .

Mark Liberman posted some interesting summaries of telephone speech records from the Linguistic Data Consortium. He writes:

I [Mark Liberman] took a quick look at demographic variation in the frequency of the filled pauses conventionally written as "uh" and "um". For technical reasons that I won't go into here, I used the frequency of the definite article "the" as the basis for comparison. Thus I selected a group of speakers (e.g. men aged 60-69), counted how often they were transcribed as saying "uh", and to normalize that count (since the number of people in each category was different) I divided by the number of times the same speakers were transcribed as saying "the".

He also did "Um":

1. I like the clear axis labels and titles, and even more importantly, that the lines are labeled directly (rather than using different dotted lines and a key). Good labeling is important--I do it even for the little graphs I'm making in my own research when exploring data or model fits.
2. I would've used blue for boys and pink for girls--easier to remember--although perhaps Mark was purposely trying to be non-stereotypical.
3. My biggest change would have been to (a) put the 2 graphs on a common scale, and (b) make them smaller, and put them next to each other. Smaller graphs allow us to see more at once, and see patterns that can be more obscure when we are forced to scroll back and forth between multiple plots. In R, I do par(mfrow=c(2,2)) as a default.
4. I would have the bottom of each graph go to 0, since that's a natural baseline (the zero-uh and zero-um level that we might all like to try to reach!). There's been some debate about the "start-at-zero rule" but I usually favor it in a situation such as this, where it doesn't require much extension of the axis.

Anyway, Mark's blog entry has much more on this interesting data source.

P.S.

Caroline says "emmm" instead of "ummm." Is this standard among native Spanish speakers?

P.P.S.

See here and here for more.

## November 4, 2005

### "Anything worth doing is worth doing shittily": missing-data edition

In another example of the paradox of importance, a colleague writes:

In other news, I am about to use the "hot deck" method to do some imputation. I considered using one of the more sophisticated and generally better methods instead, but hey, I'm on a deadline, plus there are many other sources of error that will be larger than the ones I'm introducing. It's the same old story/justification for using linear models, normal models, assuming iid errors, etc.

## November 1, 2005

### Judge Alito and the use of statistics in racial discrimination cases, well no, actually a technical point about hypothesis testing in 2-way tables

Jim Greiner has an interesting note on the use of statistics in racial discrimination cases. As both a lawyer and a statistician, Jim has a more complete perspective on these issues than most people have. I won't comment on the substance of Jim's comments (basically, he claims that the statistical analyses in these cases, on both sides, are so crude that judges can pretty much ignore the quantitative evidence when making their decisions) since I know nothing about the case in question. But I do have a technical point, which in fact has nothing really to do with racial discrimination and everything to do with statistical hypothesis testing.

Jim writes,

The facts of the specific case, which concerned the potential use of race in peremptory challenges in a death penalty trial, are less important than Judge Alito's approach to statistics and the burden of proof.

Schematically, the facts of the case follow this pattern: Party A has the burden of proof on an issue concerning race. Party A produces some numbers that look funny, meaning instinctively unlikely in a race-neutral world, but conducts no significance test or other formal statistical analysis. The opposing side, Party B, doesn't respond at all, or if it does respond, it simply points out that a million different factors could explain the funny-looking numbers. Party B does not attempt to show that such innocent factors actually do explain the observed numbers, just that they could, and that Party A has failed to eliminate all such alternative explanations.

. . .

Is there a middle way? Perhaps. In the above situation, what about requiring some sort of significance test from Party A, but not one that eliminates alternative explanations? In the specific facts of Riley, the number-crunching necessary for "some sort of significance test" is the statistical equivalent of riding a tricycle: a two-by-two hypergeometric with row totals of 71 whites and 8 blacks, column totals of 31 strikes and 48 non-strikes, and an observed value of 8 black strikes yields a p-value of 0.

OK, now my little technical comment. I don't think the hypergeometric distribution is appropriate, since it conditions on both margins. The relevant margin to condition on is the number of whites and blacks, since that was determined before the lawyers got to the problem. That's how it works in a hypothesis-testing framework in which p-values represent the probability of various hypothetical alternatives (this is the framework I like; it can be interpreted classically or Bayesianly). To put it another way, the so-called Fisher exact test isn't really "exact" at all.

This is just a rant I go on occasionally; it really has nothing to do with Jim's note except that it reminded me of the issue. For the fuller version of this argument, see Section 3.3 of my paper on Bayesian goodness-of-fit testing in the International Statistical Review. Also, Jasjeet Sekhon recently wrote a paper on the same topic.

For Jim's specific example, I'd be happy just doing a chi-squared test with 1 degree of freedom. His calculation is fine too--the hypergeometric is a reasonable approximation to a Bayesian posterior p-value with noninformative prior distribution.
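
Both calculations can be reproduced with the standard library for Jim's table (8 of 8 blacks struck, 23 of 71 whites struck); a minimal sketch:

```python
from math import comb

# Jim's table: 79 potential jurors (8 black, 71 white), 31 struck in
# total, and all 8 blacks were among those struck.
N, n_black, n_strikes, k = 79, 8, 31, 8

# The hypergeometric ("Fisher exact") calculation, conditioning on both
# margins: probability of striking all 8 blacks by chance.
p_hyper = comb(n_black, k) * comb(N - n_black, n_strikes - k) / comb(N, n_strikes)
print(p_hyper)  # about 3e-4

# The chi-squared statistic with 1 degree of freedom, computed from
# observed and expected counts (no continuity correction).
obs = [[8, 0], [23, 48]]          # rows: black, white; cols: struck, not
row = [sum(r) for r in obs]
col = [obs[0][j] + obs[1][j] for j in range(2)]
total = sum(row)
chi2 = sum((obs[i][j] - row[i] * col[j] / total) ** 2 / (row[i] * col[j] / total)
           for i in range(2) for j in range(2))
print(chi2)  # about 13.8 -- far beyond the usual 3.84 cutoff at 1 df
```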

P.S.

## October 28, 2005

I was lucky enough to be a T.A. for Fred Mosteller in his final year of teaching introductory statistics at Harvard. He had taught for 30 years and told us that in different years he emphasized different material--he never knew what aspect of the course they would learn the most from, so each year he focused on what interested him the most.

Anyway, every week he would take his three T.A.'s to lunch to talk about how the course was going and just to get us talking about things. One day he asked us what we thought about some issue of education policy--I don't remember what it was, but I remember that we each gave our opinions. Fred then told us that, as statisticians, people are interested in our statistical expertise, not in our opinions. So in a professional context we should be giving answers about sampling, measurement, experimentation, data analysis, and so forth--not our off-the-cuff policy opinions, which are not what people were coming to us for.

I was thinking of this after reading David Kane's comment on Sam's link to an article about the book, The Bell Curve. David asked me (or Sam) to tell us what we really think about The Bell Curve. I can't speak for Sam, but I wouldn't venture to give an opinion considering that I haven't read the book. I'd like to think I'm qualified to make judgments about it, if I were to spend the effort to follow all the arguments--but it would take a lot of time, and my impression is that a bunch of scientists have already done so (and have come to various conclusions on the topic). I would imagine that I might be inclined to study the issue further if I were involved in a study evaluating educational policies, for example, but it hasn't really come up in any of my own research. (I did think that James Flynn's article on a related topic was interesting, but I don't even really know what the key points of The Bell Curve are, so I wouldn't presume to comment.)

Over the years, I've been distressed to see statisticians and other academic researchers quoted as "experts" in the news media, even on subjects way out of their areas of expertise. It takes work to become an expert on a topic. Teaching classes in probability and statistics isn't always enough. As a reaction to this, I've several times said no to media requests on things that I'm not an expert on. (For example, when asked to go on TV to comment on something about the state lottery, I forwarded them to Clotfelter and Cook, two economists at Duke who wrote an excellent book on the topic.) Standards for blogs are lower than for TV, but still . . .

## October 25, 2005

### Stat/Biostat Departments

I wish there were more connections between statistics departments and biostatistics departments. I've been working with survival data recently, and it's made me realize another gaping hole in my statistical knowledge base. It's also made me realize that I wish I knew more biostatisticians. And I'm one of the lucky ones, really, because Columbia has a biostatistics department and I do know some people there.

Often when statistics and biostatistics departments don't have close connections, it's for understandable reasons. When I was in graduate school at Harvard, for example, the statistics and biostatistics departments were (still are, I guess) separated by the Charles River, and it took a 45-minute bus ride to travel between the two. I almost never made that trip. Still, there are some great people in the Harvard Biostatistics Department and I'm sure I could have benefited from working with or taking classes from them. Here at Columbia, the biostatistics department is a subway ride away from the statistics department, and if you take the 1 train then there's that awful subway elevator to contend with (how on earth is that not a fire hazard?).

Lots of universities don't have both statistics and biostatistics departments; of the ones that do, only some have close connections. I just wish that were the rule rather than the exception.

## October 19, 2005

### My talks at Swarthmore next week

Monday talk (for general audience):

Mathematical vs. statistical models in social science

Mathematical arguments can give insights into social phenomena but, paradoxically, tend to give qualitative rather than quantitative predictions. In contrast, statistical models, which often look messier, can introduce new insights. We give several examples of interesting, but flawed, mathematical models, for phenomena including political representation, trench warfare, the rationality of voting, and the electoral benefits of moderation. We consider ways in which these models can be improved. We also discuss more generally why mathematical models might be appealing and why they commonly run into problems.

Tuesday talk (for math/stat majors and other interested parties):

Coalitions, voting power, and political instability

We shall consider two topics involving coalitions and voting. Each topic involves open questions both in mathematics (probability theory) and in political science.
(1) Individuals in a committee or election can increase their voting power by forming coalitions. This behavior yields a prisoner's dilemma, in which a subset of voters can increase their power, while reducing average voting power for the electorate as a whole. This is an unusual form of the prisoner's dilemma in that cooperation is the selfish act that hurts the larger group. The result should be an ever-changing pattern of coalitions, thus implying a potential theoretical explanation for political instability.
(2) In an electoral system with fixed coalition structure (such as the U.S. Electoral College, the United Nations, or the European Union), people in different states will have different voting power. We discuss some flawed models for voting power that have been used in the past, and consider the challenges of setting up more reasonable mathematical models involving stochastic processes on trees or networks.
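The voting-power claim in topic (1) can be checked by brute-force enumeration. Here's a sketch (my own illustration, not material from the talk), assuming nine independent fair voters under simple majority rule, where voters 0-2 may form a bloc that casts all three ballots for the bloc's internal majority:

```python
from itertools import product

N = 9  # nine voters, majority rule, independent 50/50 preferences

def outcome(prefs, bloc=None):
    """Winner (True or False) when members of `bloc` (a set of voter
    indices) all cast their ballots for the bloc's internal majority."""
    votes = list(prefs)
    if bloc:
        internal = 2 * sum(prefs[i] for i in bloc) > len(bloc)
        for i in bloc:
            votes[i] = internal
    return 2 * sum(votes) > len(votes)

def pivot_probs(bloc=None):
    """For each voter, P(flipping their preference flips the outcome),
    enumerated exactly over all 2^N preference profiles."""
    counts = [0] * N
    for prefs in product([False, True], repeat=N):
        base = outcome(prefs, bloc)
        for i in range(N):
            flipped = prefs[:i] + (not prefs[i],) + prefs[i + 1:]
            if outcome(flipped, bloc) != base:
                counts[i] += 1
    return [c / 2 ** N for c in counts]

solo = pivot_probs()                     # every voter: 70/256 ~ 0.273
coalition = pivot_probs(bloc={0, 1, 2})  # members ~ 0.391, others ~ 0.156
# The three bloc members each gain power, but average power across all
# nine voters drops: cooperation is the selfish act that hurts the group.
```

The exact numbers depend on the (assumed) setup, but the qualitative pattern is the point: the bloc members gain at the expense of everyone else, and of the average.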

If people want to read anything beforehand, here's some stuff for the first talk:

http://www.stat.columbia.edu/~gelman/research/unpublished/trench.doc
http://www.stat.columbia.edu/~gelman/research/unpublished/rational_final5.pdf
http://www.stat.columbia.edu/~gelman/research/published/chance.pdf

and here's some stuff for the second talk:

http://www.stat.columbia.edu/~gelman/research/published/blocs.pdf
http://www.stat.columbia.edu/~gelman/research/published/STS027.pdf
http://www.stat.columbia.edu/~gelman/research/published/gelmankatzbafumi.pdf

## October 13, 2005

### Jobs, jobs, jobs

Statisticians continue to be in demand, especially those who are interested in social science and policy applications. From Susan Paddock at RAND:

As a leading public policy organization focusing on quantitative research, RAND offers an exciting setting for a statistician, with opportunities to collaborate on multidisciplinary teams; to conduct research on statistical methods; to consult; and to teach. Research projects address issues in a variety of disciplines, including health, national security, criminal and civil justice, education, and population and regional studies. RAND projects typically pose novel statistical challenges in design, sampling, measurement, modeling, analysis, and computing. In addition, our group of 17 Ph.D. statisticians and six Masters-level statisticians provides a collegial and stimulating environment. We welcome recent graduates as well as senior candidates.

Here's the ad. The statisticians at RAND do interesting work on important problems.

## October 12, 2005

### Akaike is cool

Today I came across a paper in my files, "On a limiting process which asymptotically produces f^{-2} spectral density" from 1962 by Hirotugu Akaike (most famous for his information criterion). The paper has a great opening paragraph:

In the recent papers in which the results of the spectral analyses of roughnesses of runways or roadways are reported, the power spectral densities of approximately the form f^{-2} (f: frequency) are often treated. This fact directed the present author to the investigation of the limiting process which will provide the f^{-2} form under fairly general assumptions. In this paper a very simple model is given which explains a way how the f^{-2} form is obtained asymptotically. Our fundamental model is that the stochastic process, which might be considered to represent the roughness of the runway, is obtained by alternative repetitions of roughening and smoothing. We can easily get the limiting form of the spectrum for this model. Further, by taking into account the physical meaning of roughening and smoothing we can formulate the conditions under which this general result assures that the f^{-2} form will eventually take place.

It's a cool paper, less than 5 pages long. Something about this reminds me of Mandelbrot's early papers on taxonomy and Pareto distributions, written about the same time.

## October 10, 2005

### The ethics of consulting for the tobacco industry

Don Rubin published an article in 2002 on "The ethics of consulting for the tobacco industry." Here's the article, and here's the abstract:

This article describes how and why I [Rubin] became involved in consulting for the tobacco industry. I briefly discuss the four relatively distinct statistical topics that were the primary focus of my work, all of which have been central to my published academic research for over three decades: missing data; causal inference; adjustment for covariates in observational studies; and meta-analysis. To me [Rubin], it is entirely appropriate to present the application of this academic work in a legal setting.

My thoughts:

I respect what Don is saying here--I don't think he'd do this sort of consulting without thinking it through. At the same time, I think there are a couple of complications not mentioned in his article.

1. Don writes, "When I was first contacted by a tobacco lawyer, I was very reluctant to consult for them, for the standard ‘politically correct’ reasons..." I think this is a bit glib. "Political correctness" refers to attempts to restrict speech or ideology that is deemed offensive. Tobacco companies, on the other hand, actually make cigarettes, which actually do give people cancer. Now, I'm not saying that it's immoral to work for tobacco companies, or to supply cigarettes to people who want them, or even that it's immoral to advertise cigarettes or whatever--but to dismiss this as "political correctness" minimizes the issues here, I think.

2. Later in the article, Don presents the ethical dilemma as whether to give testimony that is scientifically valid but supports cigarette companies. In his article, he makes a convincing case that, in his analysis, the facts did not support the claims made by the anti-tobacco lawsuits.

I would tend to accept Don's reasoning that, once he has studied the issue, it is ethical for him to call the science as he sees it, even if that means he is supporting tobacco companies in a lawsuit. (If I had close personal experience with lung-cancer victims--or with tobacco farmers--this would probably affect my views on this, but that's another story.) However, there's another decision point that Don didn't spend much time on, which is his decision to work on the problem at all.

Setting aside any questions about the morality of working on the tobacco case, there is still the "opportunity cost" argument: what would have Don done if he had not worked so hard for years on this problem? Perhaps he could have made further strides in the theory of statistical modeling and causal inference, or perhaps he could have been working on an application with direct benefit (for example, collaborating with psychologists or drug designers on improved therapies or treatments). Given his involvement in the case, it is appropriate that Don did his best job as a scientist, but this still raises the question of whether he should have been involved at all.

Just to be clear: I don't think that Don was immoral in working on this problem. In one of his books, Bill James said, "I'm not a public utility" or something like that, and, similarly, Don should have the freedom to work on problems as he sees fit. I am not at all criticizing his ethical choices. I'm just commenting on his published article on the ethical choices. My impression from talking with Don is that he did make some progress on causal inference in the context of the tobacco study, and that one reason he worked on the topic is that it gave him the opportunity to think seriously about these problems. As I once noted, the most advanced statistical methods are often used in low-stakes problems, so it is good to see some of the most modern methods of causal inference used in this high-stakes dispute.

In any case, this article would be a great discussion-starter for a course on statistics in public health or social science. Ethical discussions in statistics can get into ruts (for example, questions of the morality of randomized clinical trials), and this article looks at a slightly different ethical dilemma that can face statisticians.

I'm curious what Chad would think of Rubin's article. (Here's my earlier discussion of Chad's work on ethics and statistics.)

## October 6, 2005

### God is in every leaf of every tree

In a recent article in the New York Review of Books, Freeman Dyson quotes Richard Feynman:

No problem is too small or too trivial if we really do something about it.

This reminds me of the saying, "God is in every leaf of every tree," which I think applies to statistics in that, whenever I work on any serious problem in a serious way, I find myself quickly thrust to the boundaries of what existing statistical methods can do. Which is good news for statistical researchers, in that we can just try to work on interesting problems and the new theory/methods will be motivated as needed. I could give a zillion examples of times when I've thought, hey, a simple logistic regression (or whatever) will do the trick, and before I know it, I realize that nothing off-the-shelf will work. Not that I can always come up with a clean solution (see here for something pretty messy). But that's the point--doing even a simple problem right is just about never simple. Even with our work on serial dilution assays, which is I think the cleanest thing I've ever done, it took us about 2 years to get the model set up correctly.

As the saying goes, anything worth doing is worth doing shittily.

## October 5, 2005

### Are smaller schools better? Another example of artifacts in rank data

Alex Tabarrok writes, regarding the example in Section 3 of this paper,

Another nice illustration of the importance of weighting comes from high-stakes schemes that reward schools for improving test scores. North Carolina, for example, gives significant monetary awards to schools that raise their grades the most over the year. The smallest decile of schools has been awarded the highest honors (top 25 in the state) 27% of the time, while schools in the largest decile have received that honor only about 1% of the time. Students (and parents) are naturally led to believe that small schools are better. But just as with the cancer data, the worst schools also come from the smallest decile. The reason, of course, is the same as with the cancer data: small changes in incoming student cohorts make the variance of the score changes much larger at the smaller schools. There are some nice graphs and discussion in

Kane, T. and D. O. Staiger. 2002. The Promise and Pitfalls of Using Imprecise School Accountability Measures. Journal of Economic Perspectives 16 (4):91-114.

It's scary to think of policies being implemented based on the fallacy of looking at the highest-ranking cases and ignoring sample size. But most of my students every year get the cancer-rate example wrong--that's one reason it's a good example!--so I guess it's not a surprise that policymakers can make the mistake too. And even though people point out the error, it can be hard to get the message out. (For example, Kane and Staiger hadn't heard of my paper with Phil Price on the topic, and until recently, I hadn't heard of Kane and Staiger's paper either.)
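The artifact is easy to reproduce in simulation. Here's a minimal sketch (my own numbers, not Kane and Staiger's): give every school students drawn from the same population, so "improvement" is pure noise, and then rank schools by the change in their mean score:

```python
import random

random.seed(1)

def simulate_changes(n_schools=500, small=25, large=400):
    """Year-to-year change in mean score for schools whose students are
    all drawn from the same N(0, 1) population; half the schools are
    small, half large."""
    changes = []
    for s in range(n_schools):
        size = small if s % 2 == 0 else large
        year1 = sum(random.gauss(0, 1) for _ in range(size)) / size
        year2 = sum(random.gauss(0, 1) for _ in range(size)) / size
        changes.append((size, year2 - year1))
    return changes

changes = simulate_changes()
by_gain = sorted(changes, key=lambda sc: sc[1], reverse=True)
top25, bottom25 = by_gain[:25], by_gain[-25:]
# Small schools dominate BOTH tails: the "most improved" and "most
# declined" lists are nearly all small schools, purely because their
# mean scores are noisier (sd of the change is sqrt(2/size)).
```

Nothing about school quality went into the simulation at all; the ranking artifact comes entirely from the denominator.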

## September 12, 2005

### Is dimensionality a blessing or a curse?

Scott de Marchi writes, regarding the "blessing of dimensionality":

One of my students forwarded your blog, and I think you've got it wrong on this topic. More data does not always help and this has been shown in numerous applications -- thus the huge lit on the topic. Analytically, the reason is simple. Just for an example, assume your loss function is MSE; then, the uniquely best estimator is E(Y | x) -- i.e., the conditional mean of Y at each point X. The reason one cannot do this in practice is that as the size of your parameter space increases, you never have enough data to span the space. Even if you change the above to a neighborhood around each x, the volume of this hypercube gets really, really ugly for any value of the neighborhood parameter. The only way out of this is to make arbitrary restrictions on functional form, etc. or derive a feature space (thus "tossing out" data, in a sense).

As I said, there are a huge number of applications where more is not better. One example is face recognition: increasing granularity or pixel depth doesn't help. Instead, one must run counter to your intuition and throw out most of the data by deriving a feature space. And face recognition still doesn't work all that well, despite decades of research.

There are a number of other issues -- in your comments on 3 "good" i.v.'s and 197 "bad" ones, you have to take the issue of overfitting much more seriously than you do.
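De Marchi's hypercube point can be made concrete with one line of arithmetic (my illustration, not his): for data uniform on the unit hypercube [0,1]^d, an axis-aligned sub-cube capturing a fraction p of the data must have edge length p^(1/d), which approaches the full range of each variable as d grows:

```python
def edge_needed(p, d):
    """Edge length of an axis-aligned sub-cube of [0,1]^d whose volume
    (= expected fraction of uniform data captured) is p."""
    return p ** (1 / d)

# Edge length needed to cover just 10% of the data:
for d in (1, 2, 10, 100):
    print(d, round(edge_needed(0.1, d), 3))
# d=1: 0.1; d=2: 0.316; d=10: 0.794; d=100: 0.977 -- the "local"
# neighborhood spans nearly the entire range of every coordinate.
```

So in high dimensions there is no such thing as a small neighborhood that still contains data, which is exactly why naive conditional-mean estimation breaks down.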

My reply: Ultimately, it comes down to the model. If the model is appropriate, then Bayesian inference should deal appropriately with the extra information. After all, discarding most of the information is itself a particular model, and one should be able to do better with shrinkage.

That said, the off-the-shelf models we use to analyze data can indeed choke when you throw too many variables at them. Least-squares is notorious that way, but even hierarchical Bayes isn't so great when a large number of parameters have structure. I think that better models for interactions are out there for us to find (see here for some of my struggles; also see the work of Peter Hoff, Mark Handcock, and Adrian Raftery in sociology, or Yingnian Wu in image analysis). But they're not all there yet. So, in the short term, yes, more dimensions can entail a struggle.

Regarding the problem with 200 predictors: my point is that I never have 200 unstructured predictors. If I have 200 predictors, there will be some substantive context that will allow me to model them.

## September 9, 2005

### Interval-scaled variables

A correspondent writes:

While Eric Yu and I were mooshing around with the current issues I'm dealing with in my research, I came up with a question I wanted to ask you. To some extent this was stimulated by reading a textbook about IRT, where it is emphasized that IRT allows us to posit equal intervals between test items when we order them by difficulty. A benefit of this is that it produces a mathematically tractable interval scale.

It occurred to me that in social statistics we highly value, and get good mileage out of, those interval-scale data that we possess, such as age, education, and income. But -- here's my question -- these data are not truly interval in substantive nature. With income, for example, in a given living environment, say NYC, the interval between an income of 15,000 a year and 20,000 a year is much larger than the interval between 150,000 and 155,000 a year. When it comes to age, one year is for social, psychological, and political purposes a bigger interval from 18 to 19 than from 38 to 39. When it comes to education, certain years of schooling are years of great transformation, like first grade, and others are years of less dramatic change in one's socially or politically relevant capabilities, like fifth or sixth grade.

It follows that every dollar, year, or grade in school is not equal to every other dollar, year, or grade in school in predicting outcomes in attitudes, social behavior, or political behavior. Has anybody noticed this fact and endeavored to, so to speak, rescale some of these variables, perhaps by using available empirical evidence (which might be different in different social environments) to weight different intervals in the scale so that they lose what is now a deceptive quality of equality?

My reply: There are ordered logit and probit models that allow categories (e.g., income categories) to be ordered, but with spacing estimated from the data. These are standard tools in generalized linear modeling. There are also nonparametric versions that will transform a continuous response (e.g., income) to "stretch out" parts of the scale and shrink others. And of course there are simpler tools like log and square-root transformations that will stretch and shrink the scale. Finally, for variables like age, one can also include non-monotonic transformations (e.g., maybe the young and old are more liberal, and the middle-aged are more conservative), which can be done by discretizing the scale and using indicator variables.
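The simplest of these rescalings is already standard practice: model log(income) rather than income. A quick check of the questioner's own example (my arithmetic) shows the log scale encodes exactly the intuition that a $5,000 difference means more at the bottom of the income distribution:

```python
import math

# Two $5,000 intervals on the raw dollar scale, compared on the log scale.
low_step = math.log(20_000) - math.log(15_000)     # log(4/3),  ~0.29
high_step = math.log(155_000) - math.log(150_000)  # log(31/30), ~0.03
print(low_step / high_step)  # the low-income step is roughly 9x larger
```

Estimating the spacing from data (as ordered logit/probit cutpoints do) generalizes this idea: instead of committing to the log curve, you let the outcome variable tell you how far apart the categories really are.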

## September 6, 2005

### Next Monday's CS colloquium

This sounds interesting, and it's highly statistical:

Organizing the world's information (the world is bigger than you think!)
Craig Neville-Manning, Engineering Director and Senior Research Scientist, Google Inc.
Columbia University Computer Science 2005 Distinguished Lecture Series
Monday, September 12, 2005, 11:00 a.m. - 12:15 p.m.
Schapiro Center, Davis Auditorium, 4th Floor CEPSR

ABSTRACT:
Google indexes over 10 billion documents, including web pages, images, scanned books, video, satellite and aerial photos, maps, scholarly articles, and business listings. It makes this information available on the internet and to wireless devices. We help individuals organize their personal data: email and local documents. Using natural language processing and optimization techniques, we place ads on behalf of advertisers on search results pages and on content pages across the internet. Doing all of this at scale with consistent speed and accuracy pushes the boundaries of