November 21, 2008
The Denominator, or, Is it an advantage to have a humble background?
Malcolm Gladwell recounts the story of Sidney Weinberg, a kid who grew up in the slums of Brooklyn around 1900 and rose to become the head of Goldman Sachs and well-connected rich guy extraordinaire. Gladwell conjectures that Weinberg's success came not in spite of but because of his impoverished background:
Why did [his] strategy work . . . it's hard to escape the conclusion that . . . there are times when being an outsider is precisely what makes you a good insider.
Later, he continues:
It’s one thing to argue that being an outsider can be strategically useful. But Andrew Carnegie went farther. He believed that poverty provided a better preparation for success than wealth did; that, at root, compensating for disadvantage was more useful, developmentally, than capitalizing on advantage.
At some level, there's got to be some truth to this: you learn things from the school of hard knocks that you'll never learn in the Ivy League, and so forth. But . . . there are so many more poor people than rich people out there. Isn't this just a story about a denominator? Here's my hypothesis:
Pr (success | privileged background) >> Pr (success | humble background)
# people with privileged background << # of people with humble background
Multiply these together, and you might find that many extremely successful people have humble backgrounds, but it does not mean that being an outsider is actually an advantage.
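To see how the denominator alone could produce this pattern, here is a minimal sketch. All four numbers are made up purely for illustration (they are assumptions, not estimates of anything); the only point is that a group with a much lower success rate can still supply most of the successes if it is large enough.

```python
# Hypothetical numbers, chosen only to illustrate the denominator effect.
n_privileged = 1_000_000      # people with privileged backgrounds (assumed)
n_humble = 50_000_000         # people with humble backgrounds (assumed)
p_success_privileged = 0.01   # Pr(success | privileged), assumed
p_success_humble = 0.001      # Pr(success | humble), assumed to be 10x lower

# Expected number of successful people from each group.
successes_privileged = n_privileged * p_success_privileged
successes_humble = n_humble * p_success_humble

# Among the successful, what fraction came from humble backgrounds?
share_humble = successes_humble / (successes_humble + successes_privileged)

print(f"Successes from privileged backgrounds: {successes_privileged:,.0f}")
print(f"Successes from humble backgrounds:     {successes_humble:,.0f}")
print(f"Share of successes that are humble:    {share_humble:.0%}")
```

With these assumed inputs most successful people come from humble backgrounds even though each privileged person is ten times as likely to succeed.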
Here's more from Gladwell's article:
Weinberg was decoupled from the business establishment in the same way, and that seems to have been a big part of what drew executives to him. The chairman of General Foods avowed, “Sidney is the only man I know who could ever say to me in the middle of a board meeting, as he did once, ‘I don’t think you’re very bright,’ and somehow give me the feeling that I’d been paid a compliment.” That Weinberg could make a rebuke seem like a compliment is testament to his charm. That he felt free to deliver the rebuke in the first place is testament to his sociological position. You can’t tell the chairman of General Foods that he’s an idiot if you were his classmate at Yale. But you can if you’re Pincus Weinberg’s son from Brooklyn. Truthtelling is easier from a position of cultural distance.
Is this really true? My guess is that it's not so hard to tell your Yale classmate you think he's not very bright, if you say it in a charming way. College fraternity guys like to jokingly insult each other, no?
Posted by Andrew at 2:32 PM | Comments (12) | TrackBack
Netflix Prize scoring function isn't Bayesian
NY Times has a good article on the state of recommender systems: "If You Liked This, Sure to Love That". This is a description of one of the problems:
But his progress had slowed to a crawl. [...] Bertoni says it’s partly because of “Napoleon Dynamite,” an indie comedy from 2004 that achieved cult status and went on to become extremely popular on Netflix. It is, Bertoni and others have discovered, maddeningly hard to determine how much people will like it. When Bertoni runs his algorithms on regular hits like “Lethal Weapon” or “Miss Congeniality” and tries to predict how any given Netflix user will rate them, he’s usually within eight-tenths of a star. But with films like “Napoleon Dynamite,” he’s off by an average of 1.2 stars. The reason, Bertoni says, is that “Napoleon Dynamite” is very weird and very polarizing. [...] It’s the type of quirky entertainment that tends to be either loved or despised.
And here is the stunning conclusion by fortunately anonymous computer scientists:
Some computer scientists think the “Napoleon Dynamite” problem exposes a serious weakness of computers. They cannot anticipate the eccentric ways that real people actually decide to take a chance on a movie.
Actually, computers do quite a good job of modeling probability distributions for the more eccentric and unpredictable among us. Yes, the humble probability distribution, the centuries-old staple of statisticians, is enough to model eccentricity! The problem is that Netflix makes it hard to use sophisticated models: the scoring function is the antiquated root mean squared error (RMSE), which is not just pre-Bayesian but actually pre-probabilistic. For all practical purposes, the square root in RMSE is a monotonic transformation that won't affect the ranking of recommender models, so we can drop it outright.
So, if one looked at the distribution of ratings for Napoleon Dynamite on Amazon, it has high variance:

On the other hand, Lethal Weapon 4 ratings have lower variance:

If we use the average number of stars as the context-ignorant unpersonalized predictor (which I've discussed before), ND will give you mean squared pain of 3.8, and LW4 will give you the mean squared pain of 2.7. Now, your model might choose not to make recommendations with controversial movies - but this won't help you on Netflix Prize - you're forced to make errors even when you know you're making them. (R)MSE is pre-probabilistic: it gives no advantage to a probabilistic model that's aware of its own uncertainty.
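A minimal sketch of this point, using two hypothetical star histograms (these are invented counts, not the Amazon data): when the prediction is a single number scored by squared error, the best you can do is report the mean, and the error you are charged is the variance of the ratings, however well you understand the shape of the distribution.

```python
def mean_rating(counts):
    """Mean star rating; counts[k] is the number of ratings with k+1 stars."""
    n = sum(counts)
    return sum(c * stars for stars, c in enumerate(counts, start=1)) / n

def mse_of_constant(counts, guess):
    """Mean squared error of predicting `guess` for every rating."""
    n = sum(counts)
    return sum(c * (stars - guess) ** 2
               for stars, c in enumerate(counts, start=1)) / n

# Hypothetical histograms of 1- to 5-star ratings (assumed for illustration).
polarizing = [40, 5, 5, 10, 40]      # loved or despised
crowd_pleaser = [5, 10, 25, 40, 20]  # most ratings near the middle-high range

for name, counts in [("polarizing", polarizing), ("crowd pleaser", crowd_pleaser)]:
    m = mean_rating(counts)
    # Predicting the mean gives an MSE equal to the variance of the ratings.
    print(f"{name}: mean = {m:.2f}, MSE of predicting the mean = "
          f"{mse_of_constant(counts, m):.2f}")
```

A model that knows the polarizing movie is bimodal still has to submit one number, and no single number beats the mean under this score.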
Posted by Aleks Jakulin at 1:27 PM | Comments (6) | TrackBack
November 20, 2008
Still another 10 days to apply for an Earth Institute postdoc
The Earth Institute is looking for applicants for its postdoctoral fellows program, and if you're doing statistics you can work with me. It's a highly competitive program, deadline is 1 December so apply now:
Postdoctoral Fellows Program in Sustainable Development at The Earth Institute
The Earth Institute at Columbia University is the world’s leading academic center for the study, teaching, and implementation of sustainable development. It builds on excellence in the core disciplines—earth sciences, biological sciences, engineering sciences, social sciences, and health sciences—and stresses cross-disciplinary approaches to complex problems.
Through research, training, and global partnerships, The Earth Institute mobilizes science and technology to advance sustainable development and address environmental degradation, placing special emphasis on the needs of the world’s poor.
The Earth Institute seeks applications from innovative postdoctoral candidates or recent Ph.D., M.D., and J.D. recipients interested in a broad range of issues in sustainable development.
The Postdoctoral Fellows Program in Sustainable Development provides scholars who have a foundation in one of the Institute’s core disciplines the opportunity to acquire the cross-disciplinary expertise and breadth needed to address critical issues in the field of sustainable development, including reducing poverty, hunger, disease, and environmental degradation. Those who have developed cross-disciplinary approaches during graduate studies will find numerous opportunities to engage in leading research programs that challenge their skills.
Candidates for the Postdoctoral Fellows Program should submit a proposal for research that would contribute to the goal of global sustainable development. This could take the form of participating in and contributing to an existing multidisciplinary Earth Institute project, an extension of an existing project, or a new project that connects existing Institute expertise in novel ways. Candidates should identify their desired small multidisciplinary mentoring team, i.e., two or more senior faculty members or research scientists/scholars at Columbia with whom they would like to work during their fellowship.
For detailed information on The Earth Institute, its research centers, programs, and affiliated Columbia University departments, please visit http://www.earthinstitute.columbia.edu
Fellowships will ordinarily be granted for a period of 24 months.
More information on the Postdoctoral Fellows Program is available at http://www.earthinstitute.columbia.edu/postdoc
Application forms should be completed online at http://fellows.ei.columbia.edu/2009/
Applications submitted by December 1, 2008, will be considered for fellowships starting in the summer or fall of 2009.
For more information, contact:
Rita Ricobelli Corradi
Research Director, OARP
rricobelli@ei.columbia.edu
The Earth Institute at Columbia University
B-16 Hogan Hall, MC 3277
2910 Broadway
New York, NY 10025
Program e-mail: fellows@ei.columbia.edu
Columbia University is an affirmative action/equal opportunity employer.
Minorities and women are encouraged to apply.
Posted by Andrew at 9:42 PM | Comments (0) | TrackBack
November 19, 2008
Genetically-influenced traits running in families
There is also little consensus among researchers about what causes psychopathy. Considerable evidence, including several large-scale studies of twins, points toward a genetic component. Yet psychopaths are more likely to come from neglectful families than from loving, nurturing ones.
I'm confused here. If there's a big genetic component, wouldn't it stand to reason that parents of psychopaths are more likely to be neglectful and less likely to be loving and nurturing? So why the "Yet" in the quote above? Or is there something I'm missing?
P.S. in response to commenters: Yes, I agree that it's possible for psychopathy to be largely genetic without parents of psychopaths being much more likely to be neglectful.
What I didn't understand was Seabrook's implication that this would be surprising, the idea that if (a) a trait is genetically linked, and (b) a trait can be (somewhat) predicted by parental behavior, that the combination of (a) and (b) should be considered puzzling. By default, I'd think (a) and (b) would go together.
Posted by Andrew at 2:06 PM | Comments (6) | TrackBack
November 13, 2008
Modeling growth
Charles Williams writes,
In a number of your examples in the multilevel modeling book you use growth as an outcome. I'm doing this in a study of firm growth in the cellular industry. In this setting, we need to control for firm size since a firm's propensity to grow is definitely affected by its size. Someone suggested to me that I may have correlation between the size variable and the error term, since size is effectively in the denominator of the growth variable. They suggested using just the numerator of the growth term (subscribers added) as the outcome, since the denominator will be controlled for in the regression.
Have you run into this? Do you agree that there is a potential for bias in using size as a regressor for growth?
My reply: Yes, it makes sense to control for size (at the beginning of the study) in your regressions, probably on the log scale. I'd still use the ratio as an outcome because I think it would help the coefficients be more directly interpretable (which is a virtue in itself and also helps with efficiency if you have a hierarchical or Bayesian model).
Posted by Andrew at 9:54 AM | Comments (4) | TrackBack
November 12, 2008
Fellowship and internship programs at the Educational Testing Service
Information and application instructions are posted on the ETS Web site at http://www.ets.org/research/fellowships.html. The deadline for applying for the summer internship and postdoctoral fellowship programs is February 1, 2009. The deadlines for applying for the Harold Gulliksen program are December 1, 2008 for the preliminary nomination materials and February 1, 2009 for the final application materials.
Posted by Andrew at 9:08 PM | Comments (0) | TrackBack
November 11, 2008
Job opening--come here and work down the hall from me
The Department of Statistics at Columbia University invites applications for an Assistant Professor position, commencing Fall 2009. A PhD in statistics or a related field and commitment to high quality research and teaching in statistics and/or probability are required. Outstanding candidates in all areas are strongly encouraged to apply. You should apply before December 1, 2008.
The department currently consists of 20 faculty members, 35 PhD students, and over 100 MS students. The department has been expanding rapidly and, like the University itself, is an extraordinarily vibrant academic community. For further information about the department and our activities, centers, research areas, and curricular programs, please go to our web page at: http://www.stat.columbia.edu.
All applications must be uploaded through our online site at http://academicjobs.columbia.edu/applicants/Central?quickFind=50827
Inquiries may be made to dk@stat.columbia.edu .
Review of applications will begin December 1, 2008. Applications received after this date may be considered until the position is filled or the search is closed. Columbia University is an Equal Opportunity/Affirmative Action employer.
Posted by Andrew at 9:30 PM | Comments (4) | TrackBack
October 30, 2008
More on scaling regression inputs
Tom Knapp writes:
I have four questions and one correction about your article about scaling regression inputs in Statistics in Medicine:
1. In your party identification example you show that division by two standard deviations reversed the relative magnitudes of some regression coefficients. Near the end of your paper, with respect to Itani et al., you say "dividing by one (rather than two) standard deviation will lead the reader to understate the importance of these continuous inputs". Is that always the case?
2. How did you get your paper published in SIM, given that the only reference to medicine is in two of those last three examples?
3. In the text accompanying Figure 2 you say "the coefficient for the interaction of income and ideology is now higher than the coefficient for race [black]". If I'm reading the data in that figure correctly I think you meant to say that the coefficient for parents.party is now higher.
4. On page 2866 you say that log transformations are not appropriate for Likert scales. Do you have a reference for that claim? I think Likert scales are inappropriate for linear regression analysis in general and require the use of ordinal regression analysis.
5. On page 2868 you have a brief paragraph regarding the ability of experienced practitioners to interpret the regression coefficients in the top half of Figure 2. I guess I qualify (I taught statistics for 41 years), and I usually interpret regression coefficients by eyeballing the associated t's or p's. Why didn't you provide same? I calculated all of the t's for the unscaled coefficients; for black and for parents.party I got -5.76 and 16.33, respectively, so parents.party is the stronger predictor. [Incidentally, you probably should have reported another place or two for the data in Figure 2, since the coefficient and the standard error for age squared are both 0.00]
My reply: First off, it's a thrill to get a comment from someone who taught statistics for 41 years! I've been doing it for barely half as long. To get to specifics:
1. Dividing by 1 sd is roughly comparable to a binary predictor being coded as +/- 1. Dividing by 2 sd is roughly comparable to a binary predictor being coded as 0/1. The 0/1 coding is much more common (at least, in the examples that I've seen), which is why I chose the 2 sd scaling.
2. I think it got rejected by 2 other places; I can't quite remember where. But each time I made major improvements.
3. Yes, that's right. D'oh!
4. I'm not so bothered by treating a 1-5 or 1-10 scale linearly, on the assumption that the difference between 1 and 2 is approximately the same as the difference between 3 and 4, or whatever. I'm working on a research project to use Bayesian methods to bridge between the extremes of pure linearity and pure ordered-categorical models.
5. That's a good point. Ordering by statistical significance is not the same as ordering by importance, but it would've been a good idea to discuss this in the article.
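The comparison in point 1 can be checked numerically. The sketch below uses a balanced binary predictor and a made-up continuous input (`age` is hypothetical, and `rescale_2sd` is just an illustrative helper name): a 0/1 variable with equal groups has standard deviation 0.5, so a continuous input divided by two standard deviations ends up on the same scale.

```python
from statistics import fmean, pstdev

def rescale_2sd(values):
    """Center a numeric input and divide by two standard deviations."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / (2 * s) for v in values]

binary_01 = [0, 1] * 50       # balanced predictor coded 0/1
binary_pm1 = [-1, 1] * 50     # the same predictor coded -1/+1
age = [18 + (i * 7) % 60 for i in range(100)]  # hypothetical continuous input

sd_age = pstdev(age)
print("sd of 0/1 coding:          ", pstdev(binary_01))
print("sd of -1/+1 coding:        ", pstdev(binary_pm1))
print("sd of age divided by 1 sd: ", pstdev([a / sd_age for a in age]))
print("sd of age divided by 2 sd: ", pstdev(rescale_2sd(age)))
```

The match is exact only when the binary predictor is split 50/50; for unbalanced predictors the standard deviation is below 0.5 and the comparison is approximate.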
Posted by Andrew at 4:38 PM | Comments (0) | TrackBack
October 20, 2008
Two countries separated by a common language
Some differences:
- Tao uses more words. This makes sense: he's busy explaining this stuff to himself as well as to his readers. To a statistician, these ideas are so basic that it's hard for us to really elaborate. (Also, I had a word limit.)
- Tao emphasizes that a confidence interval is not a probability interval. In my experience, confidence intervals are always treated as probability intervals anyway, so I don't spend time with the distinction.
- I emphasize that a poll is a snapshot, not a forecast.
- Tao says that the number of polled voters is fixed in advance. I don't think this is exactly true, what with nonresponse.
- Tao fills his blog entry with Wikipedia links. Wikipedia is ok but I'm not so thrilled with it; I'm happy with people looking things up in it if they want but I won't encourage it.
But we're basically saying the same thing. I like how I put it, but I'm sure a lot of people prefer Tao's style. Luckily there's room on the web for both!
Posted by Andrew at 10:05 PM | Comments (3) | TrackBack
"Data analysis" or "data synthesis?"
See discussion here.
Posted by Andrew at 8:01 PM | Comments (1) | TrackBack
October 14, 2008
Don't Ask, Don't Tell: The New Rules of the SAT and College Admissions
Howard Wainer writes:
On September 22, 2008, the New York Times carried the first of three articles about a report, commissioned by the National Association for College Admission Counseling, that was critical of the current college admission exams, the SAT and the ACT. The commission was chaired by William R. Fitzsimmons, the dean of admissions and financial aid at Harvard.
The report was reasonably wide-ranging and drew many conclusions while offering alternatives. Although well-meaning, many of the suggestions only make sense if you say them fast.
Among their conclusions were that schools should consider making their admissions "SAT optional," that is allowing their applicants to submit their SAT/ACT scores if they wish, but they should not be mandatory. The commission cites the success that pioneering schools with this policy have had in the past as proof of concept.
Howard continues:
Has the admissions process been hampered in schools that have instituted an SAT optional policy?
The first reasonably competitive school to institute such a policy was Bowdoin College, in 1969. Bowdoin is a small, highly competitive liberal arts college in Brunswick, Maine. A shade under 400 students a year elect to matriculate at Bowdoin, and roughly a quarter of them choose not to submit their SAT scores. . . .
As it turns out the SAT scores for the students who did not submit them would have accurately predicted their lower performance at Bowdoin. In fact the correlation between grades and SAT scores was 12% higher for those who didn't submit them than for those who did.
So not having this information does not improve the academic performance of Bowdoin's entering class — on the contrary it diminishes it. Why would a school opt for such a policy? Why is less information preferred to more? . . .
We see that if all of the students in Bowdoin's entering class had their SAT scores included the average SAT at Bowdoin would sink from 1323 to 1288, and instead of being second among these six schools they would have been tied for next to last. Since mean SAT scores are a key component in school rankings, a school can game those rankings by allowing their lowest scoring students to not be included in the average. I believe that Bowdoin's adoption of this policy pre-dates US News and World Report's rankings, so that was unlikely to have been their motivation, but I cannot say the same for schools that have chosen such a policy more recently.
Interesting. Howard has some data showing that, unsurprisingly, the students who don't supply their SAT are mostly (but, interestingly, not always) those scoring lower:

(I don't find the y-axis on this graph very helpful, but that's another story.)
So what's the deal? Who are those kids with 1450 SAT's who aren't submitting their scores?
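As a rough check on the averages Howard quotes, one can back out the mean score the non-submitters must have had. This is only a back-of-the-envelope sketch: the 1323 and 1288 figures come from the passage above, but treating "roughly a quarter" as exactly 0.25 is my assumption.

```python
# Figures quoted above: reported mean (submitters only) and full-class mean.
mean_submitters = 1323
mean_all = 1288
# "Roughly a quarter" do not submit; 0.25 is an assumed round value.
frac_nonsubmitters = 0.25

# Weighted average: mean_all = (1 - f) * mean_submitters + f * mean_nonsubmitters.
# Solve for the non-submitters' mean.
mean_nonsubmitters = (
    mean_all - (1 - frac_nonsubmitters) * mean_submitters
) / frac_nonsubmitters
gap = mean_submitters - mean_nonsubmitters

print(f"Implied mean SAT of non-submitters: {mean_nonsubmitters:.0f}")
print(f"Gap between submitters and non-submitters: {gap:.0f} points")
```

Under that assumption the non-submitters average well over a hundred points lower, which makes the occasional 1450 in that group all the more curious.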
This reminds me . . .
Sound psychometric (i.e., statistical) principles tell us that, if an applicant has taken a test multiple times, we should use his or her average score. But for our PhD admissions, we generally take the higher score. I understand our psychological reasons for doing this--we want to think the best of a person--but, statistically, it seems like a bad idea.
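Here is a small simulation of why taking the highest score is statistically questionable. It is a sketch under simple assumptions that I am making up for illustration: every applicant has the same true score, each sitting adds independent normal noise, and the noise level and number of sittings are arbitrary.

```python
import random
from statistics import fmean

def simulate(n_applicants=20_000, n_tests=3, noise_sd=50.0, seed=0):
    """Compare the average and the maximum of repeated test scores.

    Every simulated applicant has the same true score, so any departure
    from that value is measurement error rather than ability.
    """
    rng = random.Random(seed)
    true_score = 700.0  # hypothetical true score
    err_mean, err_max = [], []
    for _ in range(n_applicants):
        scores = [rng.gauss(true_score, noise_sd) for _ in range(n_tests)]
        err_mean.append(fmean(scores) - true_score)
        err_max.append(max(scores) - true_score)
    # Average error of each summary across all simulated applicants.
    return fmean(err_mean), fmean(err_max)

bias_mean, bias_max = simulate()
print(f"Average bias of the mean score: {bias_mean:+.1f}")
print(f"Average bias of the max score:  {bias_max:+.1f}")
```

The average of the sittings is centered on the true score, while the maximum is pushed upward, and by more for applicants who take the test more often.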
Posted by Andrew at 10:54 PM | Comments (8) | TrackBack
October 2, 2008
Walter de la Mare was a statistician
Cool.
Posted by Andrew at 10:17 PM | Comments (0) | TrackBack
October 1, 2008
Applied Statistics Center Monthly Update for October 2008
Just to let you know things are busy around here . . .
Contents: Featured Research from ASC Fellows; Seminars Coming Soon; News
********************************************
Featured Research from ASC Fellows:
- Pablo Pinto (Political Science) writes that he's working on a project titled The Politics of Investment, with Santiago M. Pinto:
"In this project we try to establish whether foreign direct investment (FDI) reacts to changing political conditions in host countries. More specifically, we explore the existence of partisan cycles in FDI investment performance, which should be reflected in different patterns of investment at the industrial level. While there has been extensive work on the effects of policy decisions (trade and tax policy in particular) on aggregate FDI flows (Feldstein, Hines, & Hubbard 1995; Hines 2001), we find that the link between partisanship and investment performance has not been duly explored in the literature. The first paper of this series, The Politics of Investment: Partisanship and the Sectoral Allocation of FDI, was published in the June 2008 issue of Economics & Politics. We are currently working on several extensions: Partisanship, Imperfect Capital Mobility and the Sectoral Allocation of FDI; Partisan Governments, Wages and Employment."
- Want your work to be the Featured Research in next month's newsletter? Let us know! Send an email to ejs2130@columbia.edu
***
Seminars Coming Soon:
Quantitative Political Science seminar
- Date: October 2, 2008
- Speaker: Robert Erikson, Kelly Rader and Pablo Pinto, Columbia Political Science
- Topic: TBA
- Date: October 16, 2008
- Speaker: Amy Lerman, Princeton Politics
- Topic: TBA
- For more information, see http://applied.stat.columbia.edu/quantpoliscisem.php
Quantitative Methods in the Social Sciences seminar
- Date: October 1, 2008
- Speaker: Jennifer Booher-Jennings, Columbia
- Topic: Beyond High Stakes Tests: Teacher Effects on Other Educational Outcomes
- Date: October 15, 2008
- Speaker: Margot Jackson, Princeton
- Topic: TBA
- For more information, see http://www.iserp.columbia.edu/calendar/
Statistics Seminar
- Date: October 6, 2008
- Speaker: Dr. Adam A. Szpiro, Department of Biostatistics, University of Washington
- Topic: TBA
- Date: October 13, 2008
- Speaker: Dr. Ingemar Nåsell, Royal Institute of Technology, Stockholm
- Topic: On Persistence of Endemic Infections
- Date: October 20, 2008
- Speaker: Dr. Hernando Ombao, Brown University
- Topic: TBA
- Date: October 27, 2008
- Speaker: Dr. David Brillinger, Statistics Department, University of California, Berkeley
- Topic: TBA
- For more information, see http://www.stat.columbia.edu/pop-up-pages/seminars_semester_schedule.html
Applied Mathematics Colloquium
- Date: October 7, 2008
- Speaker: Jason Fleischer, Princeton University
- Topic: Towards Optical Hydrodynamics
- Date: October 14, 2008
- Speaker: Sorin Tanase-Nicola, University of Michigan
- Topic: TBA
- Date: October 16, 2008
- Speaker: Misha Chertkov, LANL
- Topic: Belief Propagation and Beyond
- Date: October 21, 2008
- Speaker: Paul Francois, Rockefeller University
- Topic: TBA
- Date: October 28, 2008
- Speaker: Surya Ganguli, Keck Center, UCSF
- Topic: TBA
- For more information, see http://www.apam.columbia.edu/pages/news/Seminars/applied_mathematics_colloquium.html
Applied Microeconomics seminar
- Date: October 1, 2008
- Speaker: Joshua Goodman
- Topic: TBA
- Date: October 8, 2008
- Speaker: Erzo Luttmer
- Topic: What Good is Wealth Without Health? The Effect of Health on the Marginal Utility of Consumption
- Date: October 15, 2008
- Speaker: Rajeev Cherukupalli
- Topic: TBA
- Date: October 22, 2008
- Speaker: Tumer Kaplan
- Topic: TBA
- Date: October 29, 2008
- Speaker: Amitabh Chandra
- Topic: TBA
- For more information, see http://www4.gsb.columbia.edu/finance/seminars/appliedmicro
Econometrics workshop
- Date: October 9, 2008
- Speaker: Arthur Lewbel, Boston College
- Topic: TBA
- Date: October 16, 2008
- Speaker: Peter Reinhardt Hansen, Stanford
- Topic: TBA
- Date: October 23, 2008
- Speaker: Eric Ghysels, North Carolina
- Topic: TBA
- For more information, see http://www.columbia.edu/~dk2313/Workshop.htm
Econometrics colloquium
- Date: October 1, 2008
- Speaker: Dennis Kristensen, Columbia
- Topic: Testing Conditional Factor Models
- Date: October 8, 2008
- Speaker: Richard Davis, Columbia
- Topic: Structural Break Estimation in Time Series: Theory and Practice
- Date: October 15, 2008
- Speaker: Pierre Andre Chiappori, Columbia
- Topic: TBA
- Date: October 22, 2008
- Speaker: Yinghua He, Columbia
- Topic: Estimating School Choice Problem under Boston Mechanism as a Bayesian Game
- For more information, see http://www.columbia.edu/~dk2313/Lunch-Seminar.htm
Posted by Andrew at 9:03 PM | Comments (0) | TrackBack
September 28, 2008
Exciting 1% shift!
Brendan Nyhan offers this amusing example of a newspaper hyping poll noise. From the LA Times:
Registered voters who watched the debate preferred Obama, 49% to 44%, according to the poll taken over three days after the showdown in Oxford, Miss.
That is a small gain from a week ago, when a survey of the same voters showed the Democratic candidate with a 48% to 45% edge.
A small gain, indeed.
Posted by Andrew at 9:55 PM | Comments (2) | TrackBack
September 25, 2008
Models for cumulative probabilities
Dan Lakeland writes:
I am working with some biologists on a model for time-to-response for animals under certain conditions. The model(s) ultimately are defined in terms of a differential equation that relates a (hidden) concentration of a metabolic product to the (cumulative) probability that an animal will respond within a given time by changing its behavior.
Now mostly, in my experience, statistical models are models for averages, or particular quantiles of the dataset (medians etc). Most models attempt to predict something (like time to response) from something else (like say measured amounts of a drug). In this case, rather than predicting individual response times, we're trying to predict shape of a distribution from measured exposure to a certain environment.
In this case, we are tempted to use some measure of the goodness of fit to try to guess what is going on internally within the animal. For ease of computation, I'm fitting this model with maximum likelihood methods initially (a Bayesian approach may come later if time allows).
What is your opinion on model selection methods in this type of scenario? Your book index has "model selection and why we avoid it" which sounds unhelpful, but the section on model selection was actually more helpful than the index implied. Is there anything you can add in this context?
My reply: I'm not quite sure what your question is, but maybe, if I can translate it into the social-science examples with which I'm more familiar, I can imagine you're doing something like predicting what percentage of people will respond a certain way to an advertisement, or how low a price would have to be before half the people would buy something. Framed that way, these sorts of models are pretty common. In section 6.8 of ARM, we discuss the relation between certain models for individuals and for groups.
Posted by Andrew at 9:02 PM | Comments (2) | TrackBack
September 18, 2008
Student-t conference
A conference celebrating the 100th birthday of W.S. "Student" Gosset's "The Probable Error of a Mean" and three other classic papers:
The Harvard University Department of Statistics presents:
"Quintessential Contributions: Celebrating Major Birthdays of Statistical Ideas and Their Inventors"
When: Saturday, September 27, 2008
Where: Radcliffe Gym, 18 Mason Street, Cambridge, MA
*Celebrating the 65th birthday of Donald B. Rubin and the 30th birthday of his "Multiple Imputations in Sample Surveys: A Phenomenological Bayesian Approach to Nonresponse"
**Invited Speaker: Fritz Scheuren
*Celebrating the 70th birthday of Carl N. Morris and the 25th birthday of his "Parametric Empirical Bayes Inference: Theory and Applications"
**Invited Speaker: Andrew Gelman
*** Catered Chinese Lunch ***
*Celebrating the 85th birthday of Herman Chernoff and the 35th birthday of his "The Use of Faces to Represent Points in K-Dimensional Space Graphically"
**Invited Speaker: Steve Wang
*Celebrating the 100th birthday of W.S. "Student" Gosset's "The Probable Error of a Mean"
**Invited Speaker: Stephen Stigler
*Student Presentation: "From Student to students"
*** "Student t Party" at Cambridge Queen's Head Pub ***
Registration fee is waived for ALL current and former Harvard affiliates (students, staff and faculty), speakers and specially invited guests. Non-Harvard Affiliates' $50.00 registration fee includes beverages, food, and full symposium. See below for payment information.
To register, provide the following information by September 15, 2008, to
: Name: ______________________________________
Harvard affiliation, if any: _______________
E-mail address: ____________________________
Lunch desired: Yes _______ or No _______
Please make checks payable to: Harvard University Statistics Department and mail to me at the address below. Payment will also be accepted at the door, but we must have a YES or NO reply no later than September 15, 2008, regardless of registration status, to obtain an accurate headcount for catering purposes (attendees will receive marked name tags indicating lunch was reserved).
Posted by Andrew at 8:22 PM | Comments (2) | TrackBack
September 15, 2008
The biggest problem in model selection?
A student writes:
One of my favorite subjects is model selection. I have read some papers in this field and know that it is widely used in almost every sub-field of statistics. I have studied some basic and traditional criteria such as AIC, BIC and CP. The idea is to set a consistent optimal criterion, which is usually not easy when the dimensionality is high, but my question is, what is the biggest problem and why is it so hard?
Also I heard that this field has some relations to non-parameter statistics and linear model theories, but as an undergraduate student, I do not know any specific connections between them. I am working in a laboratory in biostatistics; are there any related problems in this field?
My reply: In my opinion, the biggest difficulty is that AIC etc. are all approximations, not actual out-of-sample errors. The attempt to calculate out-of-sample errors leads to cross-validation which has its own problems. Some sort of general theory/methods for cross-validation would be good. I'm sure people are working on this but I don't think we're there yet. Regarding your final question: sure, just about every statistical method has biological applications. In this case, you're comparing different models you might want to fit.
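To make the distinction concrete, here is a minimal sketch comparing the two kinds of summary on simulated data. Everything in it is an illustrative assumption: the data are generated from a straight line plus noise, the two candidate models are a constant and a simple linear regression, and AIC is computed in its Gaussian form up to an additive constant. AIC adjusts the in-sample fit with a penalty; leave-one-out cross-validation refits the model with each point held out and measures the prediction error directly.

```python
import math
import random

def fit_constant(xs, ys):
    """Fit a constant (the mean of y); returns a prediction function."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_line(xs, ys):
    """Fit simple linear regression in closed form; returns a predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def aic(fit, n_coefs, xs, ys):
    """Gaussian AIC up to an additive constant: n*log(RSS/n) + 2k."""
    predict = fit(xs, ys)
    rss = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    k = n_coefs + 1  # regression coefficients plus the error variance
    return len(xs) * math.log(rss / len(xs)) + 2 * k

def loo_mse(fit, xs, ys):
    """Leave-one-out cross-validated mean squared prediction error."""
    errors = []
    for i in range(len(xs)):
        predict = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        errors.append((ys[i] - predict(xs[i])) ** 2)
    return sum(errors) / len(errors)

rng = random.Random(1)
xs = [float(i) for i in range(20)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]  # simulated data

for name, fit, n_coefs in [("constant", fit_constant, 1), ("linear", fit_line, 2)]:
    print(f"{name:8s}  AIC = {aic(fit, n_coefs, xs, ys):7.2f}   "
          f"LOO MSE = {loo_mse(fit, xs, ys):6.2f}")
```

In an easy case like this the two approaches agree on the ranking; the difficulties mentioned above show up when models are close, data are structured, or there are many candidate models.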
Posted by Andrew at 5:16 PM | Comments (14) | TrackBack
September 11, 2008
More on interactions
Bruce McCullough writes:
Don't know if you're aware of this, but if you need more evidence for the primacy of interaction effects, data mining is a great place to look. My degree is in economics. I was taught to use interaction effects as a test for nonlinearity, and that was about it.
My data mining experience of the past few years has taught me that interaction effects can be neglected at my own peril. A wonderful paper that illustrates this is "Variable selection in data mining: Building a predictive model for bankruptcy," by Dean P. Foster and Robert A. Stine in the Journal of the American Statistical Association (2000). The usual linear regression doesn't work. The model with lots of interactions works very well.
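A toy version of this, far simpler than the bankruptcy model in the paper and not taken from it: in the hypothetical data below the outcome depends only on the product of two inputs, so a main-effects regression explains nothing and adding the interaction explains everything. The design is balanced and coded -1/+1, which makes the predictor columns orthogonal, so the least-squares coefficients can be computed one column at a time (the `ols_orthogonal` helper relies on that assumption and is not a general regression routine).

```python
from itertools import product

# Balanced 2x2 design coded -1/+1, replicated five times. The outcome is
# driven purely by the interaction (hypothetical data for illustration).
rows = list(product([-1, 1], repeat=2)) * 5
y = [3.0 + 2.0 * a * b for a, b in rows]

def ols_orthogonal(columns, y):
    """Least-squares coefficients when the columns are mutually orthogonal.

    In that special case each coefficient is sum(x*y) / sum(x*x), so no
    matrix inversion is needed.
    """
    return [sum(x * t for x, t in zip(col, y)) / sum(x * x for x in col)
            for col in columns]

def r_squared(columns, coefs, y):
    """Proportion of variance in y explained by the fitted values."""
    fitted = [sum(b * col[i] for b, col in zip(coefs, columns))
              for i in range(len(y))]
    mean_y = sum(y) / len(y)
    ss_res = sum((t - f) ** 2 for t, f in zip(y, fitted))
    ss_tot = sum((t - mean_y) ** 2 for t in y)
    return 1 - ss_res / ss_tot

ones = [1] * len(rows)
x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
x1x2 = [a * b for a, b in zip(x1, x2)]

main_only = [ones, x1, x2]
with_interaction = [ones, x1, x2, x1x2]

r2_main = r_squared(main_only, ols_orthogonal(main_only, y), y)
r2_inter = r_squared(with_interaction, ols_orthogonal(with_interaction, y), y)
print(f"R^2, main effects only: {r2_main:.2f}")
print(f"R^2, with interaction:  {r2_inter:.2f}")
```

Real data are never this clean, but the example shows how a model without interactions can miss structure that is entirely predictable.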
Posted by Andrew at 10:27 PM | Comments (1) | TrackBack
September 10, 2008
My talks this week in D.C.: today (Wed.) at George Washington University, Thurs. at the Cato Institute
If you're in D.C., you should stop by. . . . I'm speaking in the statistics department at George Washington University on the topic of interactions. Here's the powerpoint and here's the abstract:
As statisticians and practitioners, we all know about interactions but we tend to think of them as an afterthought. We argue here that interactions are fundamental to statistical models. We first consider treatment interactions in before-after studies, then more general interactions in regressions and multilevel models. Using several examples from our own applied research, we demonstrate the effectiveness of routinely including interactions in regression models. We also discuss some of the challenges and open problems involved in setting up models for interactions.
The talk will be today, Wed 10 Sept, at 3pm at 1957 E Street, Room 212. If you don't know where that is, you can call the department (202-994-6356) and they should be able to give you directions.
Tomorrow (Thurs) I'll be speaking with Boris at noon at the Cato Institute on Red State, Blue State. It's not too late to sign up for that.
Posted by Andrew at 12:39 AM | Comments (3) | TrackBack
September 3, 2008
Non-Aristotelian logic and municipal government
That header got your attention, huh?? John Hull writes:
Reading an article on "non-Aristotelean" logic, where P(A) is my confidence of A being true, I found (on page 10) the equation P(B=>C)=P(B[AND]C)/P(B). Since I work in municipal government, an obvious interpretation of this is the following:
My confidence that if a person thinks the world is "flat" then they are dangerously stupid is the same as my confidence that a person believes the world is "flat" and is dangerously stupid, divided by my confidence that a person thinks the world is "flat." Setting aside the fact that when people's welfare is in the balance, I tend to become rather passionate and use rather strong language, I simply cannot wrap my head around this idea. For example, my confidence that if it's a lion, then it eats gazelles equals my confidence that it's a lion eating a gazelle, divided by my confidence that it's a lion. The left side of that equation is a (near) certainty -- lions eat gazelles -- but the right-hand side of the equation...how do I even begin to establish my confidence it's a lion, let alone the rest of it?
Can you make this more understandable? Any help will be appreciated.
My response:
1. I find the if-then connection to Aristotelian logic confusing. I'd prefer to start with probabilities as first principles, and then interpret conditional probabilities Bayesianly or, equivalently, in a frequentist way as the long-run proportion of cases in a "reference set." (The choice of reference set is equivalent to the choice of what to condition on in a Bayesian calculation.) We discuss this in chapter 1 of Bayesian Data Analysis.
2. Right now, I'm realizing how nonintuitive many principles of probability are to some people. See this discussion here where one of the commenters wants to assign a zero probability to an event (that of a tied congressional election) because it has never happened yet. That sounds commonsensical--but not if p=1/80,000 and n=20,000.
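The arithmetic behind that last remark can be checked in a few lines. This is a sketch that takes the quoted p and n at face value and assumes the elections are independent with a common tie probability, which is a simplification.

```python
import math

p = 1 / 80_000   # per-election tie probability quoted above
n = 20_000       # number of elections observed, as quoted above

# Assumption: elections are independent with the same tie probability.
prob_no_ties = (1 - p) ** n          # chance of seeing zero ties in n elections
poisson_approx = math.exp(-n * p)    # the usual approximation exp(-n*p)

print(f"P(no ties in {n} elections) = {prob_no_ties:.3f}")
print(f"Poisson approximation       = {poisson_approx:.3f}")
```

Under these assumptions the expected number of ties is n*p = 0.25, so observing none is the most likely outcome by a wide margin, and a record of zero ties says very little against a small positive probability.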
Posted by Andrew at 12:42 AM | Comments (4) | TrackBack
Melding statistics with engineering?
Dan Lakeland writes:
I recently enrolled as a PhD student in a civil engineering program. My interest could be described as the application of data and risk analysis to engineering modelling, design methods, and decision making. The field is pretty ripe, and infrastructure risk analysis is a common topic these days, but the simulations and statistical approaches taken so far have been a bit unsatisfactory. For example people studying the impact of bridge failures during earthquakes on the local economy might assume a constant cost per person-hour of delay throughout the rebuild period, or people might build statistical models of probability of building collapse, but I would call them pretty much prior distributions, not really based on much data, or based on a finite element computer model of the physics of a single model building.
I think the application of data to engineering is bizarrely a rather new field. Or at least in a renaissance. Back in the 50s or earlier they used to do lots of tests, and generate graphical nomographs of the results (like the Moody chart for fluid flow friction factors), but these days the emphasis is on detailed finite element analyses, which tell you exactly how some model will perform, but don't deal at all with the difference between your model assumptions and reality. I'm attaching an article that I'm reading for an earthquake soil mechanics class, which shows pretty much the state of the art of applications of (Bayesian) statistics to engineering. A CPT test is a test where they push a cone on the end of a long rod into the ground and measure the pressure being applied to the cone as a function of depth. Another paper I've read uses artificial neural networks to predict the shear capacity of reinforced concrete beams. Engineers typically don't like ANN type approaches because they're data oriented and don't have explanatory power in terms of physics. On the other hand, the ANN model, because it's based on data, is a much better fit to real performance than the existing physics based models.
I wonder if you might comment in your blog on melding statistics with engineering, especially how we can use data together with deterministic models, and build better engineering decision rules, both for everyday engineering, as well as for dealing with social investment decisions such as building code requirements for extreme events like earthquakes, hurricanes, and so forth.
What decision theory books or articles do you know of that might be useful and relevant to this field?
My reply
I've long thought of statistics as a branch of engineering rather than a science. To me, statistics is all about building tools to solve problems. On the other hand, departments of Operations Research and Industrial Engineering tend to focus on probability theory rather than applied statistics, so I think we need our own departments.
Getting to your specific question: yes, I know what you're talking about. Back in high school and college I spent a few summers working in a lab programming finite element methods. Ultimately this was all statistical, but I didn't see that at the time. I imagine there's been a huge amount of work in this area in the past 25 years, with iterative methods for refining grid boxes and so forth. It would be a fun area to work in. But I suspect it would be an effort to translate it into statistical language.
It seems to me that engineers and physicists work very hard at solving particular problems, which are often big and difficult. Statisticians develop general tools for easy problems (e.g., logistic regression), which is a different sort of challenge. I think there's great potential for putting these perspectives together but I'm not quite clear where to start. I've seen some articles in statistics journals addressing your concerns but I haven't been so impressed by what I've seen there. Probably a better strategy is to start with the engineering literature and add uncertainty to that.
Posted by Andrew at 12:17 AM | Comments (8) | TrackBack
August 25, 2008
Dependent and independent variables
Regarding the question of what to call x and y in a regression (see comments here), David writes, "The semantics are ugly, and don't really add much, because we are concerned with the relation of one to the other, not what they themselves are."
I agree that the semantics don't really add much, but they can subtract, I think! First off, the words "dependent" and "independent" sound similar and can lead to confusion in conversation. Second, as commenter Infz noted, people confuse "independent variables" with statistical independence, leading to the incorrect view that multiple regression requires the predictors to be independent.
I agree, though, that the term "parameter" can be confusing; sometimes it's something that you can vary and sometimes it's something you can estimate. And I've already discussed how "marginal" has opposite meanings in statistics and in economics.
Posted by Andrew at 10:49 AM | Comments (3) | TrackBack
August 23, 2008
"The method of multiple correlation"
Someone writes:
I was reading Harold Gulliksen's /Theory of Mental Tests/ (1950), and on pp. 327-329 it describes a process for solving a set of equations of the form
y = b1x1 + b2x2 + ... + bnxn
so as to minimize the least square error. Sounds like regression. But this procedure claims to account for the correlation between all the x variables. He calls it "the method of multiple correlation".
Why don't we use this procedure all the time, instead of standard regression, which assumes independence of the independent variables?
My reply: I haven't ever heard of this before. But it sounds to me just like multiple regression (which does not assume independence of the x-variables). This confusion of terminology is one reason why I don't like to use the term "independent variables." I prefer to call them "predictors."
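Here is a small simulated illustration of that point; the data, coefficients, and variable names are made up for the example. Ordinary least squares on two strongly correlated predictors recovers the coefficients of an equation of the form y = b1x1 + b2x2, while regressing on each predictor separately does not.

```python
import numpy as np

# Simulated data (an assumption for illustration): true b1 = 2, b2 = -1.
rng = np.random.default_rng(1)
n = 2_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # strongly correlated with x1
y = 2.0 * x1 - 1.0 * x2 + rng.normal(scale=0.5, size=n)

# Multiple regression: least squares on both predictors at once (no intercept,
# matching the form of the equation above).
X = np.column_stack([x1, x2])
b_joint, *_ = np.linalg.lstsq(X, y, rcond=None)

# For contrast: regress y on each predictor by itself.
b_separate = np.array([x1 @ y / (x1 @ x1),
                       x2 @ y / (x2 @ x2)])

print("corr(x1, x2) :", round(float(np.corrcoef(x1, x2)[0, 1]), 2))
print("joint fit    :", b_joint.round(2))
print("one at a time:", b_separate.round(2))
```

The joint fit accounts for the correlation between the predictors automatically; it is the one-at-a-time approach that implicitly needs them to be uncorrelated.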
Posted by Andrew at 4:09 PM | Comments (13) | TrackBack
August 20, 2008
Interactions
I have mixed feelings about this picture

and accompanying note of Jeremy Freese, who writes:
Key findings in quantitative social science are often interaction effects in which the estimated “effect” of a continuous variable on an outcome for one group is found to differ from the estimated effect for another group. . . . Interaction effects are notorious for being much easier to publish than to replicate, partly because it is easy for researchers to forget (?) how they tested many dozens of possible interactions before finding one that is statistically significant and can be presented as though it was hypothesized by the researchers all along. . . . There are so many ways of dividing a sample into subgroups, and there are so many variables in a typical dataset that have low correlation with an outcome, that it is inevitable that there will be all kinds of little pockets for high correlation for some subgroup just by chance.
I take his point, and indeed I've written myself about the perils of fishing for statistical significance in a pond full of weak effects (uh, ok, let's shut down that metaphor right there). And I even cite Freese in my article.
On the other hand, I'm also on record as saying that interactions are important (see also here).
I guess my answer is that interactions are important, but we should look for them where they make sense. Jeremy's graph reproduced above doesn't really give enough context. Also, remember that the correlation between before and after measurements will be higher among controls than among treated units.
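One mechanism that would produce that last pattern is a treatment effect that varies from unit to unit, which adds variation to the "after" measurements of treated units only. The simulation below is a sketch under that assumption, with made-up numbers; it is not a reconstruction of Jeremy's graph or of any real study.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
noise_sd = 0.5

# Controls: "after" is just "before" plus measurement noise.
before_control = rng.normal(size=n)
after_control = before_control + rng.normal(scale=noise_sd, size=n)

# Treated: same setup, plus a treatment effect that differs by unit (assumed).
before_treated = rng.normal(size=n)
effect = rng.normal(loc=1.0, scale=1.0, size=n)
after_treated = before_treated + effect + rng.normal(scale=noise_sd, size=n)

r_control = float(np.corrcoef(before_control, after_control)[0, 1])
r_treated = float(np.corrcoef(before_treated, after_treated)[0, 1])
print(f"controls: r = {r_control:.2f}   treated: r = {r_treated:.2f}")
```

If the treatment effect were the same constant for everyone, the two correlations would match, so the gap here comes entirely from the assumed variation in effects.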
Posted by Andrew at 12:25 AM | Comments (4) | TrackBack
August 7, 2008
Whassup with Bart?
I've seen Jennifer Hill and Ed George give great talks on Bayesian additive regression trees. It looked awesome. So why haven't these papers appeared anywhere? All I can find are preprints.
Posted by Andrew at 11:08 AM | Comments (7) | TrackBack
August 3, 2008
The mythical Gaussian distribution and population differences
There was a dynamic discussion on gender differences in performance a few days ago. Many interesting points were raised, but most of them regarded differences in models (variance, mean), rather than differences in distributions.
One of the comments referred to the Project TALENT database from 1960. It's one of the most exhaustive datasets of its type.
I have been unhappy for quite some time because papers do not show the actual data. For that reason I wrote a small plotting program that allows visual comparisons of histograms. The plentiful TALENT data makes it possible to avoid binning or kernel smoothing. Here are some plots:


The pink histogram is for girls, the blue one for boys, and where the pink and blue overlap, there is grey.
It is interesting to observe the skew, which might indicate incentives, learning curves or unbalanced tests. One of the most striking examples of skew is the difference in reading comprehension between the Catholic/Protestant and Jewish populations, but I also list mechanical reasoning:


Project TALENT's data is from 1960, so things might have changed since then. Nowell & Hedges discuss some trends from 1960-1994.
In the end, let me reiterate that this posting does not make any statements about the causality of these differences - I am merely providing the data as such. The only assumptions were that the missing values can be dropped (boys were overrepresented in this respect) and that both underlying populations are comparable (no systematic effects with respect to extraneous biases such as age).
I did NOT observe boys being overrepresented on the low end of the spectrum for mathematics scores - but this could easily happen if one isn't careful throwing out the missing values coded with "-1" (5.4% among boys, 4.4% among girls).
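To show how the "-1" codes can distort the low end, here is a toy sketch. The scores are invented for illustration and are not Project TALENT data; the cutoff of 10 is also arbitrary.

```python
import numpy as np

MISSING = -1   # sentinel code for a missing score, as described above

# Hypothetical scores for illustration only, not Project TALENT data.
scores = np.array([12, 30, -1, 25, 41, -1, 18, 36, 27, -1])

observed = scores[scores != MISSING]             # drop the sentinel first
naive_low_share = float(np.mean(scores < 10))    # sentinel counted as a low score
clean_low_share = float(np.mean(observed < 10))  # share of real scores below 10

print("share below 10, sentinel left in:", naive_low_share)
print("share below 10, sentinel dropped:", clean_low_share)
```

Because the missing-value rate differs between the groups, leaving the sentinel in would pile extra mass at the bottom of one histogram more than the other.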
Posted by Aleks Jakulin at 6:47 PM | Comments (7) | TrackBack
Classifying Olympic athletes as male or female, leading to a comment about the recognition of uncertainty in life
I read an interesting op-ed by Jennifer Finney Boylan about classification of Olympic athletes as male or female. Apparently, they're now checking the sex of athletes based on physical appearance and blood samples. This should be an improvement over the simple chromosome test which can label a woman as a man because she has a Y chromosome, even if she is developmentally and physically female. But then Boylan writes:
Most efforts to rigidly quantify the sexes are bound to fail. For every supposedly unmovable gender marker, there is an exception. There are women with androgen insensitivity, who have Y chromosomes. There are women who have had hysterectomies, women who cannot become pregnant, women who hate makeup, women whose object of affection is other women.
I'm starting to lose the thread here. Nobody is talking about excluding from Olympic competition women who have had hysterectomies or cannot become pregnant, right? And lesbians are allowed to compete too, no? And makeup might be required for the Miss America competition but not for athletes. Boylan continues:
So what makes someone female then? . . . The only dependable test for gender is the truth of a person’s life . . . The best judge of a person’s gender is what lies within her, or his, heart.
Would this really work? This just seems like a recipe for cheating, for Olympic teams in authoritarian countries to take some of their outstanding-but-not-quite-Olympic-champion caliber male athletes and tell them to live like women. It doesn't seem so fair to female athletes from the U.S., for example, to have to compete with any guy in the world who happens to be willing to say, for the purposes of the competition, that, in his heart, he feels like a woman.
Why do I mention this in a statistics blog?
I think people are often uncomfortable with ambiguity. Boylan correctly notes that sex tests can have problems and that there is no perfect rule, but then she jumps to the recommendation that there be no rules at all.
Posted by Andrew at 1:24 PM | Comments (4) | TrackBack
August 2, 2008
Let computers do the surveys!
WSJ reports that people are more likely to provide socially-acceptable answers to survey questions about themselves when interviewed by a person (or even an avatar!) than when responding to an automated survey system or a recording. Such questions relate to politics, hygiene, exercise, health, and so on.
The research is helping refine polling at a university phone center nearby. Activity at the center, which sits in a former school building, picks up around dinnertime when the staff makes calls for university-run surveys from a warren of cubicles. The questioners are asked to speak in even tones, reading from scripts. No one is allowed to say, "How are you?" in case the person on the other end had a bad day. The interviewers don't laugh; they don't want people to treat this as a social call. They are allowed only neutral responses such as "I see" or "Hmm."

There are some interesting demonstrations at Harvard's Implicit project.
Posted by Aleks Jakulin at 6:01 AM | Comments (6) | TrackBack
August 1, 2008
Rube Goldberg statistics?
Kenneth Burman writes:
Some modern, computer intensive, data analysis methods may look to the non-statistician (or even ourselves) like the equivalent of the notorious Rube Goldberg device for accomplishing an intrinsically simple task. Whereas some variants of the bootstrap or cross validation might fit this situation, mostly the risk of this humiliation is to be found in MCMC-based Bayesian methods. I [Burman] am not at all against such methods. I am only wondering if, to “outsiders” (who may already have a negative impression of statistics and statisticians), these methods may appear like a Rube Goldberg device. You have parameters, likelihoods, hierarchies of fixed effects, random effects, hyper-parameters, then Markov chain Monte Carlo with tuning and burn-in followed by long “chains” of random variables, with possible thinning for lag-correlations, concerns about convergence to an ergodic state. And after all that, newly “armed” now with a “sample” of 100,000 (or more) numbers from a mysterious posterior probability distribution you proceed to analyze these new “data” (where did the real data go? – now you have more numbers than you started with for actual data) by more methods, simple (a mean) or complex (smoothing using kernel density methods, and then pull off the mode). All OK to a suitably trained statistician, but might we be in for ridicule and misunderstanding from the public? If such a charge were leveled at us (“you guys are doing Rube Goldberg statistics”) how would we respond, given the “complaint” comes from people with little or no statistics training? Of course, such folks may not be capable of generating such a critique, but could still realize they have no idea what the statistician is doing to the data to get answers. It does us no good if the public thinks our methods are Rube Goldberg in nature.
Interesting question. I'll respond in a few days. But in the meantime, would any of you like to give your thoughts?
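For anyone who hasn't seen the machinery Burman lists, here is about the smallest version of it I can write down: a random-walk Metropolis sampler for a single normal mean, with burn-in, thinning, and a posterior mean computed from the draws. Everything specific here (the synthetic data, the Normal(0, 10^2) prior, the step size, the chain length) is an assumption chosen for illustration, and the model is deliberately one where the exact answer is available for comparison.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=1.5, scale=1.0, size=50)   # synthetic data, sd treated as known (= 1)

def log_posterior(mu):
    """Unnormalized log posterior for the mean mu."""
    log_prior = -0.5 * mu**2 / 10.0**2           # prior: mu ~ Normal(0, 10^2)
    log_lik = -0.5 * np.sum((data - mu) ** 2)    # likelihood: y_i ~ Normal(mu, 1)
    return log_prior + log_lik

n_iter, burn_in, thin, step = 20_000, 2_000, 10, 0.5
mu = 0.0
current_lp = log_posterior(mu)
chain = np.empty(n_iter)
for t in range(n_iter):
    proposal = mu + step * rng.normal()          # symmetric random-walk proposal
    proposal_lp = log_posterior(proposal)
    # Metropolis rule: accept with probability min(1, posterior ratio).
    if np.log(rng.uniform()) < proposal_lp - current_lp:
        mu, current_lp = proposal, proposal_lp
    chain[t] = mu

draws = chain[burn_in::thin]                     # drop burn-in, then keep every 10th draw

# Closed-form posterior mean for this conjugate model (prior mean is 0).
exact_mean = data.sum() / (len(data) + 1 / 10.0**2)
print(f"MCMC posterior mean  = {draws.mean():.3f}")
print(f"exact posterior mean = {exact_mean:.3f}")
```

In a problem this simple the sampler is pure Rube Goldberg, since the closed form does the job in one line; the same loop earns its keep only when no closed form exists.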
Posted by Andrew at 1:10 AM | Comments (13) | TrackBack
July 28, 2008
"An ounce of replication..."
I was looking through this old blog entry and found an exchange I like enough to repost. Raymond Hubbard and R. Murray Lindsay wrote,
An ounce of replication is worth a ton of inferential statistics.
I questioned this, writing:
More data are fine, but sometimes it's worth putting in a little effort to analyze what you have. Or, to put it more constructively, the best inferential tools are those that allow you to analyze more data that have already been collected.
Seth questioned my questioning, writing:
I'd like to hear more about why you don't think an ounce of replication is worth a ton of inferential statistics. That has been my experience. The value of inferential statistics is that they predict what will happen. Plainly another way to figure out what will happen is to do it again.
To which I replied:
I'm not sure how to put replication and inferential statistics on the same scale . . . but a ton is 32,000 times an ounce. To put it in dollar terms, for example, I think that in many contexts, $32,000 of data analysis will tell me more than $1 worth of additional data. Often the additional data are already out there but haven't been analyzed.
I think it's fun to take this sort of quotation literally and see where it leads. It's a rhetorical strategy that I think works well for me, as a statistician.
Posted by Andrew at 12:10 AM | Comments (5) | TrackBack
July 26, 2008
NYT vs WSJ on gender issues
Aleks sends in a striking example of a news story presented in two completely different ways:
I [Aleks] was looking at the NYT and WSJ today, and one particular discrepancy struck me. The NYT story, "Math Scores Show No Gap for Girls," by Tamar Lewin, says:
Three years after the president of Harvard, Lawrence H. Summers, got into trouble for questioning women’s “intrinsic aptitude” for science and engineering -- and 16 years after the talking Barbie doll proclaimed that “math class is tough” -- a study paid for by the National Science Foundation has found that girls perform as well as boys on standardized math tests. . . . “Now that enrollment in advanced math courses is equalized, we don’t see gender differences in test performance,” said Marcia C. Linn of the University of California, Berkeley, a co-author of the study. “But people are surprised by these findings, which suggests to me that the stereotypes are still there.” . . . Although boys in high school performed better than girls in math 20 years ago, the researchers found, that is no longer the case. . . . The researchers looked at the average of the test scores of all students, the performance of the most gifted children and the ability to solve complex math problems. They found, in every category, that girls did as well as boys. . . .
The NYT story had absolutely no mention of the girl/boy variance whatsoever. Compare to the WSJ version (girl/boy variance in the headline), "Boys' Math Scores Hit Highs and Lows," by Keith Winstein:
Girls and boys have roughly the same average scores on state math tests, but boys more often excelled or failed, researchers reported. The fresh research adds to the debate about gender difference in aptitude for mathematics, including efforts to explain the relative scarcity of women among professors of science, math and engineering.
In the 1970s and 1980s, studies regularly found that high-school boys tended to outperform girls. But a number of recent studies have found little difference. . . . [The recent study] didn't find a significant overall difference between girls' and boys' scores. But the study also found that boys' scores were more variable than those of girls. More boys scored extremely well -- or extremely poorly -- than girls, who were more likely to earn scores closer to the average for all students. . . . The study found that boys are consistently more variable than girls, in every grade and in every state studied. That difference has "been a concern over the years," said Marcia C. Linn, a Berkeley education professor and one of the study's authors. "People didn't pay attention to it at first when there was a big difference" in average scores, she said. But now that girls and boys score similarly on average, researchers are taking notice, she said.
Here's some context from a few years back (I looked it up, because I wasn't sure exactly what Summers said, and the NYT article referred to him). From the NYT a few years ago:
Dr. Summers cited research showing that more high school boys than girls tend to score at very high and very low levels on standardized math tests, and that it was important to consider the possibility that such differences may stem from biological differences between the sexes. Dr. Freeman said, "Men are taller than women, that comes from the biology, and Larry's view was that perhaps the dispersion in test scores could also come from the biology."
What's amazing is that the two newspapers quote the same researcher but with two nearly opposite points. I assume she made both points to both newspapers, but the NYT reporter ran with the "stereotypes are still there" line and the WSJ reporter ran with "researchers are taking notice." It must be frustrating to Linn to have only part of her story reported in each place. (Yeah, yeah, I know that newspapers have space constraints. It still must be frustrating.)
Posted by Andrew at 4:20 PM | Comments (19) | TrackBack
July 16, 2008
The American (League) Dynasty
Every year, the best players (or at least many of the best players) from Major League Baseball's American League play their counterparts in the National League in the All-Star Game. They played last night; the American League won in the 15th inning. Here's who won, from 1965 (when I was born) to the present, with 1965 at the left and 2008 at the right.
NNNNNNNNNNNNNNNNNNANNANAAAAAANNNAAAAATAAAAAA
The "T" indicates a tie (in 2002): unlike regular games, there is no requirement that the All-Star Game continue until somebody wins, and pitchers are reluctant to pitch too many innings and potentially hurt themselves.
I was born into an era in which the National League won every game. Now, the American League wins (or, at least, doesn't lose) every game. This is happening in a sport where even bad teams beat good teams occasionally, so it's really mystifying. It would be possible to explain a small edge for one league or the other, that persists for a few years --- the league with the best pitcher will have an advantage, for example, and that pitcher can play year after year --- but these effects can't come close to explaining the long runs in favor of one team or another. Predicting next year's winner to be the same as this year's winner would have correctly predicted 80% of the games in my lifetime...and that's if we pretend the National League won the tie game in 2002. (If we pretend the American League won it, it's 84%).
What would be a reasonable statistical model for baseball All-Star games, and why isn't it something close to coin flips?
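As a starting point for that question, the "same as last year" prediction rule can be scored directly from the string of results above, and compared with what independent fair coin flips would give. This is only a sketch: the results are transcribed from the string in this post, the function names are mine, and the coin-flip calculation relies on the fact that, for independent fair flips, each year-to-year agreement is itself an independent 50/50 event.

```python
from math import comb

# Results 1965-2008 as transcribed from the post: N = National, A = American, T = tie.
RESULTS = "N" * 18 + "ANNANAAAAAANNNAAAAATAAAAAA"

def persistence_hits(results):
    """Count years in which the winner matched the previous year's winner."""
    return sum(prev == cur for prev, cur in zip(results, results[1:]))

def tail_prob_fair_coin(n_pairs, hits):
    """P(at least `hits` agreements in n_pairs) if games were independent fair coin flips."""
    return sum(comb(n_pairs, k) for k in range(hits, n_pairs + 1)) / 2**n_pairs

n_pairs = len(RESULTS) - 1   # number of year-to-year predictions
for tie_as in "NA":
    hits = persistence_hits(RESULTS.replace("T", tie_as))
    print(f"tie counted as {tie_as}: {hits}/{n_pairs} = {hits / n_pairs:.0%}, "
          f"coin-flip tail probability = {tail_prob_fair_coin(n_pairs, hits):.1e}")
```

The tail probabilities come out tiny either way the tie is counted, which is another way of saying that pure coin flipping is a poor description of this series; the calculation says nothing about what the right model is.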
Posted by Phil at 3:19 PM | Comments (14) | TrackBack
July 14, 2008
Thoughts on new statistical procedures for age-period-cohort analyses
Posted by Andrew at 9:53 AM | Comments (0) | TrackBack
July 11, 2008
Guernsey McPearson's Statistical Menagerie
Here are some hilarious (if you're a statistician) sketches from Stephen Senn:
Robustnik
"These are the three laws of robustics. First law: get a computer. Second law: get a bigger computer. Third law: what you really need is a much bigger computer." Favourite reading: I Robust, by Isaac Azimuth.
Frequency Freak
"Did you randomise? OK: so far so good. Now what would you have said if the third value from the left had been the second from the right. Hold on a minute. Are you sure you haven't looked at this question before?" Favourite reading: Casino Royale.
Bog Bayesian
"All you need is Bayes. It's the answer to everything. If only Adolf and Neville could have exchanged utility functions at Munich we could have saved the world a whole lot of bother round about the middle of the last century." Favourite reading: The Hindsight Saga.
Subset Surfer
"OK, so the egg's rotten but parts of it are excellent." Favourite reading: Europe on $5 a day.
Gibbs Sampler
"First catch your likelihood. Take one Super Cray, a linear congruential generator, any prior you like and if the whole thing isn't done to a turn within three days my name's not Gary Rhodes." Favourite reading: Mrs Beaton
Complete Consultant
"First we test the randomisation. Then we look for homogeneity between centres. Then we run the Shapiro-Wilks over it and if you like we'll throw in a Kolmogorov-Smirnov at no extra cost. Then we test for homogeneity of variance and look for outliers and even if that's OK we'll do a Mann-Whitney anyway just to be on the safe side. All this will be fully documented in a report with our company logo on every page." Favourite reading: The Whole Earth Catalogue.
Mr Mathematics
"I just don't see the problem. All you have to do is define the null hypothesis precisely, define the alternative hypothesis precisely, choose your type I error rate and use the most powerful test." Favourite reading: Brave New World.
Bootstrapper
"Look, this is the way to build the football team of the future. You choose a player. You put him back in the pool. You choose again. Do that long enough and if you don't eventually get a team which has Becks in it three times my name's not Sven Goran Erikson." Favourite reading: Bradley's Shakksperrr.
Unconditional Inferencer
"It's true that all the engines are on fire and the captain has just died from a heart attack but there's no need to worry because averaged over all flights air travel is very safe." Favourite reading: Grimm's Fairy Tales
And many more:
Data Explorer
"Wow! It's all too beautiful. I mean, Man, the colours, the shapes and those rotations and dig those projections. It's like Lucy in the Sky with Diamonds meets the Walrus and the Eggman." Favourite reading: The Glass Bead Game.
Third Degree Bayesian
"Look there is no way I am letting you out of this room until you give me a prior. Have you heard of the jackknife? Yes? Well this is a thumbscrew." Favourite reading: Justine.
Mr Megabyte
"Just you wait till virtual reality hits the statistical computing scene. The only thing holding us back is that we have been mentally crippled by having been brought up to use pencil and paper. In the third millennium we will all have statistical processing chips implanted behind our ears. Books are a thing of the past." Favourite video: Farenheit 451.
Absolute Abacus
"Of course, no real statistical techniques worth talking about have been discovered since 1962. I grant you that in the occasional difficult case you might wish to use an electronic computer but not everyone wants to travel down to Manchester each time they need to calculate something." Favourite reading: The Anglo Saxon Chronicle.
Tabulator
"What you really need to do is understand the field of application thoroughly, become familiar with every data point, check each one against original records and present the whole thing with some simple graphs and tables. All this probability rubbish is just a conspiracy got up by a bunch of mathematicians who don't even understand the first thing about data." Favourite reading: The Little House on the Prairie.
Mrs P
"Now now. Nursey won't go away until you've filled this bottle. And if you don't produce something soon you'll never grow up to get published. Now, would a nice cup of t help?" Favourite reading: Winnie the Pooh.
Whew! Just copying this made me feel good.
Posted by Andrew at 12:45 PM | Comments (2) | TrackBack
June 27, 2008
Beer, quality control, and Student's t distribution
John Cook's theory of why the t distribution was discovered at a brewery:
Beer makers pride themselves on consistency while wine makers pride themselves on variety. That’s why you’ll never hear beer fans talk about a “good year” the way wine connoisseurs do. Because they value consistency, beer makers invest more in extensive statistical quality control than wine makers do.
(On the other hand, Seth thinks that "ditto foods" are so late-twentieth-century, and that lack of uniformity in taste is ultimately healthier.)
Posted by Andrew at 5:00 PM | Comments (5) | TrackBack
June 26, 2008
The End of Theory: The Data Deluge Makes the Scientific Method Obsolete
Drew Conway pointed me to this article by Chris Anderson talking about the changes in statistics and, by implication, in science, resulting from the ability of Google and others to sift through zillions of bits of information. Anderson writes, "The new availability of huge amounts of data, along with the statistical tools to crunch these numbers, offers a whole new way of understanding the world. Correlation supersedes causation, and science can advance even without coherent models, unified theories, or really any mechanistic explanation at all."
Conway is skeptical, pointing out that in some areas--for example, the study of terrorism--these databases don't exist. I have a few more thoughts:
1. Anderson has a point--there is definitely a tradeoff between modeling and data. Statistical modeling is what you do to fill in the spaces between data, and as data become denser, modeling becomes less important.
2. That said, if you look at the end result of an analysis, it is often a simple comparison of the "treatment A is more effective than treatment B" variety. In that case, no matter how large your sample size, you'll still have to worry about issues of balance between treatment groups, generalizability, and all the other reasons why people say things like, "correlation is not causation" and "the future is different from the past."
3. Faster computing gives the potential for more modeling along with more data processing. Consider the story of "no pooling" and "complete pooling," leading to "partial pooling" and multilevel modeling. Ideally our algorithms should become better at balancing different sources of information. I suspect this will always be needed.
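To make point 3 concrete, here is a minimal sketch of the three estimates for a set of group means. It assumes a normal model in which the within-group sd (`sigma`) and between-group sd (`tau`) are treated as known, and it plugs in the grand mean for the population mean; a real multilevel fit would estimate all of these from the data. The function name and arguments are illustrative and do not come from any package.

```python
import statistics

def pooled_estimates(groups, sigma, tau):
    """Return (no_pooling, complete_pooling, partial_pooling) estimates of group means.

    groups: list of lists of observations, one inner list per group.
    sigma:  within-group standard deviation (assumed known in this sketch).
    tau:    between-group standard deviation (assumed known in this sketch).
    """
    all_obs = [y for g in groups for y in g]
    grand_mean = statistics.fmean(all_obs)            # complete pooling: one mean for everyone
    no_pool = [statistics.fmean(g) for g in groups]   # no pooling: each group on its own
    partial = []
    for g, ybar in zip(groups, no_pool):
        w_data = len(g) / sigma ** 2   # precision of the group's own mean
        w_group = 1.0 / tau ** 2       # precision of the group-level distribution
        # Precision-weighted average of the group mean and the grand mean
        partial.append((w_data * ybar + w_group * grand_mean) / (w_data + w_group))
    return no_pool, [grand_mean] * len(groups), partial

no_pool, complete, partial = pooled_estimates(
    [[0.0, 2.0], [4.0, 6.0, 5.0, 5.0]], sigma=1.0, tau=1.0
)
print(no_pool, complete, partial)
```

The partial-pooling estimate always lies between the other two, and the group with fewer observations is pulled more strongly toward the grand mean. As data within each group become denser, the partial-pooling estimate approaches the no-pooling one, which is the tradeoff described in point 1.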
Posted by Andrew at 2:51 AM | Comments (13) | TrackBack
June 25, 2008
The popularity of statistics?
Jennifer pointed me to this site, which states that "white people hate math" but "are fascinated by 'the power of statistics' since the math has already been done for them." I'd like to believe this is true (the part about white people liking statistics, not the part about the math having already been done for them) but I'm skeptical. Everywhere I've ever taught, there have been a lot more math majors than stat majors, and I'm pretty sure this is true among the subset of students who are white. But it might be true that the business majors, the poli sci majors, the English majors, etc.--not to mention the people who don't go to college at all--prefer statistics to mathematics. Actually, I think most of these people should prefer statistics to mathematics. But I fear that a more likely reaction would be something like, "math is cool, statistics is boring."
P.S. I looked further down, and this "Stuff White People Like" site is just weird. "With few exceptions, white people are actually fond of almost any dictator not named Hitler"?? Huh? I mean, I can see that the site is a parody, but this is just weird.
Posted by Andrew at 2:20 PM | Comments (9) | TrackBack
June 23, 2008
Diagnostics for multivariate imputations: getting inside the black box
Random imputation is a flexible and useful way to handle missing data (see chapter 25 for a quick overview), but it's typically taken as a black box. This partly is a result of confusion over statistical theory. Structural assumptions such as "missingness at random" cannot be checked from data--this is a fundamental difficulty--but this does not mean that imputations cannot be checked. In our recent paper, Kobi Abayomi, Mark Levy, and I do the following:
We consider three sorts of diagnostics for random imputations: displays of the completed data, which are intended to reveal unusual patterns that might suggest problems with the imputations, comparisons of the distributions of observed and imputed data values and checks of the fit of observed data to the model that is used to create the imputations. We formulate these methods in terms of sequential regression multivariate imputation, which is an iterative procedure in which the missing values of each variable are randomly imputed conditionally on all the other variables in the completed data matrix. We also consider a recalibration procedure for sequential regression imputations. We apply these methods to the 2002 environmental sustainability index, which is a linear aggregation of 64 environmental variables on 142 countries.
The article has some pretty pictures (and some ugly pictures too; hey, we're not perfect). I don't know how directly useful these methods are; I think of them as providing "proof of concept" that model checking for imputations is possible at all, and I'm hoping this will spur lots of work by many researchers in the area. Ultimately I'd like people (or computer programs) to check their imputations just as they currently check their regression models.
Posted by Andrew at 12:04 AM | Comments (2) | TrackBack
June 16, 2008
Friday the 13th study
Apparently, Friday the 13th is not unlucky, according to Dutch researchers: link to article.
I would like to see a parallel psychological study, to see if people are more careful on Friday the 13th, go out less, drive less (or just shorter distances) - and if people considering criminal activity hold off until the next day. I also wonder if there is an upswing in the types of "bad luck" they chose to survey on Saturday the 14th...
Posted by Juli at 3:17 PM | Comments (5) | TrackBack
June 12, 2008
Some thoughts on the saying, "All models are wrong, but some are useful"
J. Michael Steele explains why he doesn't like the above saying (which, as he says, is attributed to statistician George Box). Steele writes, "Whenever you hear this phrase, there is a good chance that you are about to be sold a bill of goods."
He considers a street map of Philadelphia as an example of a model:
If I say that a map is wrong, it means that a building is misnamed, or the direction of a one-way street is mislabeled. I never expected my map to recreate all of physical reality, and I only feel ripped off if my map does not correctly answer the questions that it claims to answer. My maps of Philadelphia are useful. Moreover, except for a few that are out-of-date, they are not wrong.
Actually, my guess is that his maps are wrong, in that there probably are a couple of streets that are mislabeled in some way. Street maps are updated occasionally (even every year), but streets get changed, and not every change is captured in an update. I expect there are a few places where Steele's map has mistakes. (But I doubt it's like those old tourist street maps of Soviet cities which, I've been told, had lots of intentional errors to make it harder for people to actually find their way around too well.) In any case, I take his general point, which is that a street map could be exactly correct, to the resolution of the map.
Statistical models of the sort that I typically use are different in being generative: that is, they are stochastic prescriptions for creating data. As such, they can typically never be proven wrong (except in special cases, for example a binary regression model can't produce a data value of 0.6). The saying, "all models are wrong," is helpful because it is not completely obvious: outside of such special cases, a model's wrongness can't be proved directly.
Recall the saying that a chi-squared test is a "measure of sample size." With a small sample size, you won't be able to reject even a silly model, and with a huge sample size, you'll be able to reject any statistical model you might possibly want to use (at least in the social and environmental sciences, where I do most of my work). This is a simple point, and I can see how Steele can be irritated by people making a big point about it . . . .
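Here is a minimal sketch of the "measure of sample size" point, using only the standard library. It assumes a two-category outcome whose true proportion (0.51) differs slightly from the model's (0.50), and it plugs in expected counts rather than simulated ones, so that the only thing changing is n. The function name and the specific proportions are made up for illustration.

```python
import math

def chi2_gof_two_cells(p_true, p_model, n):
    """Chi-squared goodness-of-fit statistic and p-value for a two-category outcome.

    Uses the expected observed counts n * p_true (no sampling noise), so the
    result isolates the effect of sample size. With two cells there is 1 degree
    of freedom, and the chi-squared(1) upper tail at x is erfc(sqrt(x / 2)).
    """
    observed = [n * p_true, n * (1 - p_true)]
    expected = [n * p_model, n * (1 - p_model)]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, math.erfc(math.sqrt(stat / 2))

for n in (100, 10_000, 1_000_000):
    stat, p = chi2_gof_two_cells(p_true=0.51, p_model=0.50, n=n)
    print(f"n={n:>9,d}  chi2={stat:8.2f}  p={p:.3g}")
```

The misfit is the same in all three rows; the statistic simply grows in proportion to n, so the same slightly wrong model goes from comfortably "not rejected" to overwhelmingly rejected.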
But, the trouble is, many people don't realize that all models are wrong. They want to make statements such as, The probability is 0.74 that the logistic regression model with predictors A, B, and D is correct. This is not the sort of statement I ever want to say.
The point of posterior predictive checking (see chapter 6 of Bayesian Data Analysis, or chapter 8 in our regression book for a less explicitly Bayesian treatment) is to use numerical and graphical summaries to understand what aspects of the data are captured by the model and what aspects are not. The goal is not to check whether the model is "wrong"--after all, all models are wrong--but to see how well it fits. I agree with Steele that external validation is good too.
Posted by Andrew at 12:24 AM | Comments (9) | TrackBack
June 9, 2008
Doing statistics the Dunson way: nonparametric statistics for the 21st century
David Dunson forwarded me this article for a book that is coming out on Nonparametric Bayes in Practice. I think David's work is great but I keep encountering it in separate research articles and never in a single place which explains when to use each sort of model. I'll have to read the article in detail, but it seems like a good start. I suggested to David that he write a book but he pointed out that nobody reads books. But do people read articles in handbooks? I don't know. I guess what's really needed is a convenient software implementation for all of it. In the meantime, this article seems like the place to go.
Posted by Andrew at 2:17 AM | Comments (3) | TrackBack
May 30, 2008
Demystifying double robustness: "in at least some settings, two wrong models are not better than one"
From Joseph Kang and Joseph Schafer:
When outcomes are missing for reasons beyond an investigator’s control, there are two different ways to adjust a parameter estimate for covariates that may be related both to the outcome and to missingness. One approach is to model the relationships between the covariates and the outcome and use those relationships to predict the missing values. Another is to model the probabilities of missingness given the covariates and incorporate them into a weighted or stratified estimate. Doubly robust (DR) procedures apply both types of model simultaneously and produce a consistent estimate of the parameter if either of the two models has been correctly specified. In this article, we show that DR estimates can be constructed in many ways. We compare the performance of various DR and non-DR estimates of a population mean in a simulated example where both models are incorrect but neither is grossly misspecified. Methods that use inverse-probabilities as weights, whether they are DR or not, are sensitive to misspecification of the propensity model when some estimated propensities are small. Many DR methods perform better than simple inverse-probability weighting. None of the DR methods we tried, however, improved upon the performance of simple regression-based prediction of the missing values. This study does not represent every missing-data problem that will arise in practice. But it does demonstrate that, in at least some settings, two wrong models are not better than one.
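The three kinds of estimate described in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not Kang and Schafer's code: the function names are hypothetical, the fitted outcome predictions `m_hat` and response propensities `pi_hat` are taken as given rather than estimated, and the doubly robust form shown is one common version (regression prediction plus an inverse-probability-weighted residual correction) out of the many constructions the paper says are possible.

```python
def regression_estimate(m_hat):
    """Mean of the outcome-model predictions over the whole sample."""
    return sum(m_hat) / len(m_hat)

def ipw_estimate(y, r, pi_hat):
    """Inverse-probability-weighted mean of the observed outcomes (normalized weights)."""
    num = sum(yi / pi for yi, ri, pi in zip(y, r, pi_hat) if ri)
    den = sum(1.0 / pi for ri, pi in zip(r, pi_hat) if ri)
    return num / den

def dr_estimate(y, r, m_hat, pi_hat):
    """One doubly robust form: predictions plus weighted residuals from responders."""
    total = 0.0
    for yi, ri, mi, pi in zip(y, r, m_hat, pi_hat):
        total += mi                      # regression prediction for every unit
        if ri:
            total += (yi - mi) / pi      # correction using observed residuals
    return total / len(r)

# Toy inputs: y is None where the outcome is missing (r = 0).
y = [1.0, None, 3.0, None]
r = [1, 0, 1, 0]
m_hat = [1.0, 2.0, 3.0, 4.0]
pi_hat = [0.5, 0.5, 0.5, 0.5]
print(regression_estimate(m_hat), ipw_estimate(y, r, pi_hat), dr_estimate(y, r, m_hat, pi_hat))
```

Dividing by `pi_hat` is where the sensitivity noted in the abstract comes from: a responder with a small estimated propensity gets a very large weight, in the weighted estimate and in the doubly robust correction alike.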
Posted by Andrew at 12:40 PM | Comments (1) | TrackBack
Post-World War II cooling a mirage
Mark Levy pointed me to this. I don't know anything about this area of research, but if true, it's just an amazing, amazing example of the importance of measurement error:
The 20th century warming trend is not a linear affair. The iconic climate curve, a combination of observed land and ocean temperatures, has quite a few ups and downs, most of which climate scientists can easily associate with natural phenomena such as large volcanic eruptions or El Nino events. But one such peak has confused them a hell of a lot. The sharp drop in 1945 by around 0.3 °C - no less than 40% of the century-long upward trend in global mean temperature - seemed inexplicable. There was no major eruption at the time, nor is anything known of a massive El Nino that could have caused the abrupt drop in sea surface temperatures. The nuclear explosions over Hiroshima and Nagasaki are estimated to have had little effect on global mean temperature. Besides, the drop is only apparent in ocean data, but not in land measurements.
Now scientists have found – not without relief - that they have been fooled by a mirage.
The mysterious post-war ocean cooling is a glitch, a US-British team reports in a paper in this week’s Nature. What most climate researchers were convinced was real is in fact “the result of uncorrected instrumental biases in the sea surface temperature record,” they write. Here is an editor’s summary. How come? Almost all sea temperature measurements during the Second World War were from US ships. The US crews measured the temperature of the water before it was used to cool the ships engine. When the war was over, British ships resumed their own measurements, but unlike the Americans they measured the temperature of water collected with ordinary buckets. Wind blowing past the buckets as they were hauled on board slightly cooled the water samples. The 1945 temperature drop is nothing else than the result of the sudden but uncorrected change from warm US measurements to cooler UK measurements, the team found.
Whaaa...?
The article (by Quirin Schiermeier) continues:
That’s a rather trivial explanation for a long-standing conundrum, so why has it taken so long to find out? Because identifying the glitch was less simple than it might appear, says David Thompson of the State University of Colorado in Boulder. The now digitized logbooks of neither US nor British ships contain any information on how the sea surface temperature measurements were taken, he says. Only when consulting maritime historians it occurred to him where to search for the source of the faintly suspected bias. Our news story here has more. Scientists can now correct for the overlooked discontinuity, which will alter the character of mid-twentieth century temperature variability. In a News and Views article here (subscription required) Chris Forest and Richard Reynolds lay out why this will not affect conclusions about an overall 20th century warming trend.
And there's more:
But it may not be the last uncorrected instrument bias in the record. The increasing number of measurements from automated buoys, which in the 1970s begun to replace ship-based measurements, has potentially led to an underestimation of recent sea surface temperature warming.
Posted by Andrew at 12:22 AM | Comments (9) | TrackBack
May 23, 2008
Quarterbacks and psychometrics
Eric Loken writes,
Criteria Corp is a company doing employee testing (basically psychometrics meets on-demand assessment). We're also going to blog on various issues relating to psychometrics and analyses of testing data. We're starting slowly on the blog front, but a few days ago we did one on employment tests for the NFL. A few scholars have argued that the NFL's use of the Wonderlic (a cognitive aptitude measure) is silly as it shows no connection to performance. But we showed that for quarterbacks, once you condition on some minimal amount of play, the correlation between aptitude and performance was as high as r = .5...which is quite strong. It's the common case of regression gone bad when people don't recognize that the predictor has a complex relationship to the outcome. There are many reasons why a quarterback doesn't play much; so at the low end of the outcome, the prediction is poor and the variance widely dispersed. But there are fewer reasons for success, and if the predictor is one of them, then it will show a better association at the high end.
Here's their blog, and here's Eric's football graph:

P.S. The graph would look better with the following simple fixes:
1. Have the y-axis start at 0. "-2000.00" passing yards is just silly.
2. Label the y-axis 0, 5000, 10000. "10000.00" is just silly. Who counts hundredths of yards?
3. Label the x-axis at 10, 20, 30, 40. Again, who cares about "10.00"?
I've complained about R defaults, but the defaults on whatever program created the above plot are much worse! (I do like the color scheme, though. Setting the graph off in gray is a nice touch.)
Posted by Andrew at 9:12 AM | Comments (6) | TrackBack
A question of infinite dimensions
Constantine Frangakis, Ming An, and Spyridon Kotsovilis write:
Problem: suppose we conduct a study of known design (e.g. completely random sample) to measure *just a scalar* (say income, gene expression example from Rafael Irizarry), and suppose we get full response. Question: what data do we actually observe? Answer: we observe an infinite dimensional variable, which can carry extra information about how we analyze the scalar (say to estimate the population mean).
Logic:
1. Suppose we believe that if we had applied the same measurement device on all the population, then we would have some non-response. That would then mean that in the actual sample we got, the mere fact that we observed *all the data* is actual information and means that we got a non-representative sample of the population (just from the responders).
2. If we believe that (1) can be true, then we should worry. Reversely, if we do not worry, it implies we believe (1) is false. But there is no measurement device that is a priori guaranteed to work for all units, so we must worry.
3. The key issue now is that we usually think that, by incorporating the indicator of observation in a new column in the data, we believe we have fully described what we observed. But I suggest we have not. This is because we can iterate the logic of (1) now on the "new data": the fact ={that we observed that we had full responses} is also a nontrivial observation, as long as it is measured with a device that can sometimes be fallible. But when we iterate this logic we conclude that we actually observe an infinite sequence of variables.
This is very much similar to Godel's argument of incompleteness, applied to statistics if we treat a measurement device as a Turing machine. Its practical implication is that it is extremely important to understand the *variation in how* exactly each and every measurement was made because that variation is extra information **even if (and not only if) we observe all measurements !! **
I didn't really follow, so I asked Constantine to clarify. He wrote:
Here is an example of the first level.
1. Setting: suppose we are studying the income Y of a city's population, and in truth Y follows a log-normal distribution and we know that. We are to conduct measurements on units, with a measurement device (e.g., an interviewer) that can *possibly* give no response (if it gives response, we assume it is true).
2. Data: we now conduct a simple random sample, and with the measurement *device* we use, we get 100% response in the sample. Also, say with the data we get an MLE{median pr(Y)}=$54K and the MLE(SD(log(income)) i.e. among all response sample)=0.43, or MLE(SD(income))=$27K;
Question: Should we worry about non-response even if we got full response ?
Answer: The answer is YES, because we would get a DIFFERENT RESULT than $54K under some consideration of non-response, EVEN IF WE GOT FULL RESPONSE.
3. Example: What can that consideration be, and what answer could we get ?
Suppose that if the *same measurement device had been applied to all the population*, we would have gotten 20% non response (R=0). Moreover, suppose that this nonresponse depends on the outcome in the sense that the ratio of the median income among responders versus non-responders is .7, which occurs because all incomes < median respond, but a random 60% of the incomes > median respond. Suppose also we know this - this gives a model for pr(R|Y).
What is the MLE now, with the same log normal model but also the pr(R|Y) model ? It is $60K. It is significant to note that by the above MLEs, I mean no randomness, in the following sense: under the outcome model pr(Y) in part 1 and the pr(R|Y) model in part 3 (the pr(R|Y) model), we have:
A) the median of the true distribution of Y is $60K, but
B) the median of the distribution pr(Y|R=1) is $54K. So, the MLE of $54K if we get full response is not a happenstance but the value we expect to get if we ignore part 3.
Since A and B differ, it matters whether we consider the observation of {the fact that we observed all the data} as important information.
The new observation I am making here is that this is not complete - we have to be considering (otherwise we are making assumptions) the observation that we observed that we observed ...., and this can be iterated to infinity.
The key results are that
Result 1) from a plan to measure just a scalar, we are actually observing infinite variables; and
Result 2) there is a bound (like Heisenberg's uncertainty bound) of how much of this information we can actually use.
I still don't really understand what Constantine is saying here, but he's a smart guy, so I'm passing this along in case it interests any of you out there.
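The first-level numerical example, at least, can be checked by simulation. Here is a minimal sketch using only the standard library. It draws incomes from a log-normal distribution with true median $60K, lets everyone at or below the median respond and a random 60% of those above it respond, and compares the median among responders with the true median. The log-scale sd of 0.44 is my assumption: the example does not state the true value, and 0.44 is chosen so that the sd of log income among responders comes out near the 0.43 quoted above. The function name and seed are arbitrary.

```python
import math
import random
import statistics

def simulate_responder_median(n=200_000, true_median=60_000.0, sigma_log=0.44,
                              p_respond_high=0.6, seed=1):
    """Simulate the response mechanism pr(R|Y) from the example.

    Returns (population median, median among responders, response rate).
    sigma_log is an assumed value, not one given in the original example.
    """
    rng = random.Random(seed)
    mu_log = math.log(true_median)   # the median of a log-normal is exp(mu)
    incomes, responders = [], []
    for _ in range(n):
        y = rng.lognormvariate(mu_log, sigma_log)
        incomes.append(y)
        # Everyone at or below the true median responds;
        # above it, a random 60% respond.
        if y <= true_median or rng.random() < p_respond_high:
            responders.append(y)
    return (statistics.median(incomes),
            statistics.median(responders),
            len(responders) / n)

pop_median, resp_median, rate = simulate_responder_median()
print(f"population median: {pop_median:,.0f}")
print(f"responder median:  {resp_median:,.0f}")
print(f"response rate:     {rate:.2f}")
```

Under these assumptions the response rate is about 80% and the responder median lands several thousand dollars below the true median, in line with the gap between A and B. This covers only the first level of the argument; it says nothing about the infinite iteration or the bound in Results 1 and 2.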
Posted by Andrew at 8:37 AM | Comments (3) | TrackBack
May 16, 2008
Irreproducible analysis
John Cook has an interesting story here. I agree with his concerns. It's hard enough to reproduce my own analysis, let alone somebody else's. This comes up sometimes when revising a paper or when including an old analysis in a book, that I just can't put all the data back together again, or I have an old Fortran program linked to S-Plus that won't run in R, or whatever, and so I have to patch something together with whatever graphs I have available.
Also, when consulting, I've sometimes had to reconstruct the other side's analysis, and it can be tough sometimes to figure out exactly what they did.
Posted by Andrew at 9:33 PM | Comments (4) | TrackBack
May 6, 2008
Can you trust a dataset where more than half the values are missing?
Rick Romell of the Milwaukee Journal Sentinel pointed me to the National Highway Traffic Safety Administration’s data on fatal crashes. Rick writes,
In 2006, for example, NHTSA classified 17,602 fatal crashes as being alcohol-related and 25,040 as not alcohol-related. In most of the crashes classified as alcohol-related, no actual blood-alcohol-concentration test of the driver was conducted. Instead, the crashes were determined to be alcohol-related based on multiple imputation. If I read NHTSA’s reports correctly, multiple imputation is used to determine BAC in about 60% of drivers in fatal crashes.
He goes on to ask, "Can actual numbers be accurately estimated when data are missing in 60% of the cases?" and provides this link to the imputation technique the agency now uses and this link to an NHTSA technical report on the transition to the currently-used technique.
My quick thought is that the imputation model isn't specifically tailored to this problem and I'm sure it's making some systematic mistakes, but I figure that the NHTSA people know what they're doing, and if the imputed values made no sense, they would've done something about it. That said, it would be interesting to see some confidence-building exercises to give a sense that the imputations make sense. (Or maybe they did this already; I didn't look at the report in detail.)
Posted by Andrew at 12:02 AM | Comments (6) | TrackBack
May 5, 2008
Martian inferences
Benjamin Kay points to this:
But I [Nick Bostrom] hope that our Mars probes discover nothing. It would be good news if we find Mars to be sterile. Dead rocks and lifeless sands would lift my spirit. Conversely, if we discovered traces of some simple, extinct life-form--some bacteria, some algae--it would be bad news. If we found fossils of something more advanced, perhaps something that looked like the remnants of a trilobite or even the skeleton of a small mammal, it would be very bad news. The more complex the life-form we found, the more depressing the news would be. I would find it interesting, certainly--but a bad omen for the future of the human race.What has all this got to do with finding life on Mars? Consider the implications of discovering that life had evolved independently on Mars (or some other planet in our solar system). That discovery would suggest that the emergence of life is not very improbable. If it happened independently twice here in our own backyard, it must surely have happened millions of times across the galaxy. This would mean that the Great Filter is less likely to be confronted during the early life of planets and therefore, for us, more likely still to come.
If we discovered some very simple life-forms on Mars, in its soil or under the ice at the polar caps, it would show that the Great Filter must come somewhere after that period in evolution. This would be disturbing, but we might still hope that the Great Filter was located in our past. If we discovered a more advanced life-form, such as some kind of multicellular organism, that would eliminate a much larger set of evolutionary transitions from consideration as the Great Filter. The effect would be to shift the probability more strongly against the hypothesis that the Great Filter is behind us. And if we discovered the fossils of some very complex life-form, such as a vertebrate-like creature, we would have to conclude that this hypothesis is very improbable indeed. It would be by far the worst news ever printed.
Benjamin Kay writes,
It has a logical claim in it that seems similar to the doomsday argument you presented in Doomsday and Bayes but in a different context. I [Kay] found it an interesting example of inference from very limited data. Perhaps it is right for your blog.
It somehow reminds me of the story about the person who brings a bomb on the plane with him to protect against terrorism because, hey, think how low the odds are of having two bombs on the plane. Also, the bit about vertebrates reminded me of the Great Chain of Being.
Posted by Andrew at 12:32 AM | Comments (0) | TrackBack
April 30, 2008
Another salvo in the ongoing battle over standardizing regression coefficients
Sander Greenland doesn't like the automatic rescaling of regression coefficients (for example, my pet idea of scaling continuous inputs to have a standard deviation of 0.5, to put them on a scale comparable to binary predictors) because he prefers interpretable units (years, meters, kilograms, whatever). Also he points out that data-based rescaling (such as I recommend) creates problems in comparing models fit to different datasets.
OK, fine. I see his points. But let's go out into the real world, where people load data into the computer and fit models straight out of the box. (That's "out of the box," not "outside the box.")
Here's something I saw recently, coefficients (and standard errors) from a fitted regression model:
coefficient for "per-capita GDP": -.079 (.170)
coefficient for "secondary school enrollment": -.001 (.006)
Now you tell me that these have easy interpretations. Sure. I'd rather have seen these standardized. Then I'd be better able to interpret the results. Nobody's stopping you from doing a more careful rescaling, a la Greenland, but that's not the default we're starting from.
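For anyone who wants to try the default rescaling, here is a minimal sketch using only the standard library. The function names are made up for illustration. The first rescales a continuous input to have mean 0 and sd 0.5; the second converts an already-fitted coefficient and standard error to that scale, which requires knowing the sd of the input in the data (not reported for the model above, so no attempt is made to convert those numbers).

```python
import statistics

def rescale_2sd(x):
    """Center x and divide by two standard deviations, so the result has sd 0.5."""
    m = statistics.fmean(x)
    s = statistics.stdev(x)
    return [(xi - m) / (2 * s) for xi in x]

def rescale_coefficient(coef, se, sd_x):
    """Convert a coefficient (and its standard error) on x's original scale
    to the scale where x has been divided by 2 * sd_x."""
    factor = 2 * sd_x
    return coef * factor, se * factor

z = rescale_2sd([1.0, 2.0, 3.0, 4.0, 5.0])   # toy input values
print(statistics.stdev(z))
```

A coefficient on the rescaled input is then the expected change in the outcome for a two-sd change in the predictor, roughly a move from low to high, which is what makes it comparable to the coefficient on a binary predictor. Greenland's objection still applies: the sd comes from the data at hand, so rescaled coefficients from different datasets are not directly comparable.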
Posted by Andrew at 12:18 AM | Comments (11) | TrackBack
April 25, 2008
70,000 Assyrians
One of my favorite instances of numeracy in literature is William Saroyan's story, "70,000 Assyrians," which I read in the collection, Bedside Tales. The story is typical charming early-Saroyan: it starts out with him down-and-out, waiting on line for a cheap haircut, then he converses with the barber, asking if he, like Saroyan, is Armenian. No, he replies, he's Assyrian. Saroyan says how sad it is that the Assyrians, like the Armenians, no longer have their own country, but that they can hope for better. The barber says, sadly, that the Assyrians cannot even hope, because they have been so depleted, there are only 70,000 of them left in the world.
This is the numeracy: 70,000 is a large number, a huge number of people. It's crowds and crowds and crowds--enough for an entire society, and then some. But not enough for a country, or not enough in a hostile part of the world where other people are busy trying to wipe you out. The idea that 70,000 is a lot, but not enough--that's numeracy. People can be numerate with dollars--for example, $70,000 is a lot of money but it can't buy you a nice apartment in Manhattan--but it's my impression and others' that people have more difficulty with other sorts of large numbers. That's why this Saroyan story made an impression on me.
Posted by Andrew at 12:00 AM | Comments (2) | TrackBack
April 14, 2008
p-values blah blah blah
Karl Ove Hufthammer points me to this paper by Raymond Hubbard and R. Murray Lindsay, "Why P Values Are Not a Useful Measure of Evidence in Statistical Significance Testing."
I agree that p-values are a problem, but not quite for the same reasons as Hubbard and Lindsay do. I was thinking about this a couple days ago when talking with Jeronimo about FMRI experiments and other sorts of elaborate ways of making scientific connections. I have a skepticism about such studies that I think many scientists share: the idea that a questionable idea can suddenly become scientific by being thrown in the same room with gene sequencing, MRIs, power-law fits, or other high-tech gimmicks. I'm not completely skeptical--after all, I did my Ph.D. thesis on medical imaging--but I do have this generalized discomfort with these approaches.
Consider, for example, the notorious implicit association test, famous for being able to "assess your conscious and unconscious preferences" and tell if you're a racist. Or consider the notorious "baby-faced politicians lose" study.
From a statistical point of view, I think the problem is with the idea that science is all about rejecting the null hypothesis. This is what researchers in psychology learn, and I think it can hinder scientific understanding. In the "implicit association test" scenario, the null hypothesis is that people perceive blacks and whites identically; differences from the null hypothesis can be interpreted as racial bias. The problem, though, is that the null hypothesis can be wrong in so many different ways.
To return to the main subject, an alarm went off in my head when I read the following sentence in the abstract to Hubbard and Lindsay's paper: "p values exaggerate the evidence against [the null hypothesis]." We're only on page 1 (actually, page 69 of the journal, but you get the idea) and already I'm upset. In just about any problem I've studied, the null hypothesis is false; we already know that! They describe various authoritative-seeming Bayesian articles from the past several decades, but all of them seem to be hung up on this "null hypothesis" idea. For example, they include the notorious Jeffreys (1939) quote: "What the use of P implies … is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred. This seems a remarkable procedure." OK, sure, but I don't believe that the hypothesis "may be true." The question is whether the data are informative enough to reject the model.
Any friend of the secret weapon is a friend of mine
OK, now the positive part. I agree with just about all the substance of Hubbard and Lindsay's recommendations and follow them in practice: interval estimates, not hypothesis tests; and comparing intervals of replications (the "secret weapon"). More generally, I applaud and agree with their effort to place repeated studies in a larger context; ultimately, I think this leads to multilevel modeling (also called meta-analysis in the medical literature).
P.S. This is minor, but I'm vaguely offended by referring to Ronald Fisher as "Sir" Ronald Fisher in an American journal. We don't have titles here! I guess it's better than calling him Lord Fisher-upon-Tyne or whatever.
P.P.S. I don't know if I agree that "An ounce of replication is worth a ton of inferential statistics." More data are fine, but sometimes it's worth putting in a little effort to analyze what you have. Or, to put it more constructively, the best inferential tools are those that allow you to analyze more data that have already been collected.
Posted by Andrew at 11:00 AM | Comments (8) | TrackBack
April 13, 2008
R.I.P. Minghui Yu
Rachel wrote this note about our Ph.D. student who unexpectedly and tragically died recently.

Posted by Andrew at 12:58 PM | Comments (3) | TrackBack
April 7, 2008
Comment on "What are you going to do with your Ph.D. in Statistics?" conference
The conference consisted of two panels discussing various aspects of the working life of statisticians. The statisticians on the first panel were all currently working in academia, while the statisticians on the second panel were all working in industry.
Academic Panel
If we want to pursue a career in academia, research should be something we enjoy. Panelists mentioned that teaching, while often a burden, should not be something that makes our lives miserable. As Eric Bradlow said, the remuneration for being an academic is not enough compensation for hating teaching and being miserable.
The panelists agreed it was important to work in a department where the people valued and respected the research that you did. The panelists' research ideas came from a number of different sources, including collaborators, seminars and conferences (and they encouraged us to attend the latter two).
The panelists' discussions reminded me that perhaps the most important aspect in choosing potential academic departments is finding a good fit. An important part of working life (I think) is being valued and finding collaborators, not only in the department you work in, but also in other departments around campus.
Industry Panel
Communication is a big part of working in industry. Although teaching students is not usually required, consulting with collaborators and colleagues is. There is not as much flexibility in industry as in academia (research must be in the company's interests); however, the compensation is usually much better.
All industry panelists agreed that statisticians must be excited by data. Many of the big companies (such as Google, AT&T, etc.) have an abundance of data. To thrive in these environments, you should find data challenging and exciting.
The reception after the conference was a good chance to meet and talk with the panelists and ask questions about jobs in both academia and industry. It was a good time for me (as a postdoc) to evaluate what direction I hope to take my statistics career. Congratulations should go to the Columbia post-graduate statistics students for organizing such a successful conference.
Posted by Matthew at 11:43 AM | Comments (0) | TrackBack
April 4, 2008
A dismal theorem?
James Annan writes,
I wonder if you would consider commenting on Marty Weitzman's "Dismal Theorem", which purports to show that all estimates of what he calls a "scaling parameter" (climate sensitivity is one example) must be long-tailed, in the sense of having a pdf that decays as an inverse polynomial and not faster. The conclusion he draws is that using a standard risk-averse loss function gives an infinite expected loss, and always will for any amount of observational evidence.
I looked up Weitzman and found this paper, "On Modeling and Interpreting the Economics of Catastrophic Climate Change," which discusses his "dismal theorem." I couldn't bring myself to put in the effort to understand exactly what he was saying, but I caught something about posterior distributions having fat tails. That's true--this is a point made in many Bayesian statistics texts, including ours (chapter 3) and many that came before us (for example, Box and Tiao). With any finite sample, it's hard to rule out the hypothesis of a huge underlying variance. (Fundamentally, the reason is that, if the underlying distribution truly does have fat tails, it's possible for them to be hidden in any reasonable sample. It's that Black Swan thing all over again.) I think that Weitzman is making some deeper technical point, and I'm sure I'm disappointing Annan by not having more to say on this . . .
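To make the fat-tails point concrete, here is the standard textbook case (an illustration only, not Weitzman's model). Assume $y_1, \dots, y_n$ are independent $N(\mu, \sigma^2)$ with the noninformative prior $p(\mu, \sigma^2) \propto \sigma^{-2}$, and write $\bar{y}$ and $s^2$ for the sample mean and the sample variance (with $n-1$ in the denominator). The predictive distribution for a new observation $\tilde{y}$ is then a Student-t:

$$
\tilde{y} \mid y \sim t_{n-1}\!\left(\bar{y},\ \left(1 + \tfrac{1}{n}\right) s^2\right),
\qquad
p(\tilde{y} \mid y) \propto \left[\, 1 + \frac{(\tilde{y} - \bar{y})^2}{(n-1)\left(1 + \tfrac{1}{n}\right) s^2} \,\right]^{-n/2}.
$$

The tails fall off like $|\tilde{y}|^{-n}$, that is, polynomially rather than exponentially, for any finite $n$. More data make the tails thinner but never normal.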
More
Searching on the web, I found this article by William Nordhaus criticizing Weitzman's reasoning. Unfortunately, Nordhaus's article just left me more confused: he kept talking about a utility function of the form U(c) = (1-c^(1-a))/(1-a), which doesn't seem to be relevant to the climate change example. Or to any other example, for that matter. Attempting to model risk aversion with a utility function--that's so 1950s, dude! It's all about loss aversion and uncertainty aversion nowadays. This isn't Nordhaus's fault--he seems to be working off of Weitzman's model--but it's hard for me to know how to evaluate any of this stuff if it's based on this sort of model.
Also, I don't buy Nordhaus's argument on page 4 that you can deduce our implicit value of non-extinction by looking at how much the U.S. government spends on avoiding asteroid impacts. This reminds me of the sorts of comparisons people do, things like total spending on cosmetics or sports betting compared to cancer research. I already know that we spend money on short-term priorities--I wouldn't use that to make broad claims about the "negative utility of extinction."
Back to Weitzman's paper
I find abbreviations such as DT (for the "dismal theorem") and GHG (for greenhouse gases) to be distracting. I don't know if this is fair of me. I don't mind U.S. or FBI or EPA or other common abbreviations, but I find it really annoying to read a phrase such as, "Phrased differently, is DT an economics version of an impossibility theorem which signifies that there are fat-tailed situations where economic analysis is up against a strong constraint on the ability of any quantitative analysis to inform us without committing to a VSL-like parameter and an empirical CBA framework that is based upon some explicit numerical estimates of the miniscule [sic] probabilities of all levels of catastrophic impacts up to absolute disaster?" The concepts are tricky enough as it is without me having to try to flip back and find out what is meant by DT, VSL, and CBA. But, if Weitzman were to spell out all the words, would the other economists think he's some sort of rube? I just don't know the rules here.
On page 37, near the end of the paper, Weitzman writes, "A so-called Integrated Assessment Model (hereafter IAM) . . ." I was reminded of Raymond Chandler's advice for writers: "When in doubt, have a man come through the door with a gun in his hand." Or, in this case, an abbreviation. Never let your readers relax, that's my motto.
I'm not sure how to think about the decision analysis questions. For example, Weitzman writes, "Should we have foregone the industrial revolution because of the GHGs it generated?" But I don't think that foregoing the industrial revolution was ever a live option.
P.S. I have to admit, "miniscule" sounds right. It begins with "mini," after all.
Posted by Andrew at 2:39 AM | Comments (15) | TrackBack
April 3, 2008
More data beats better algorithms
Boris sent along this. I can't comment on the examples used there, but I agree with the general point that it's good to use more data. To get back to algorithms, what I'd say is that one important feature of a good algorithm is that it allows you to use more data. Traditional statistical methods based on independent, identically distributed observations can have difficulty incorporating diverse data, whereas more modern methods have more ways in which data can be input.
Posted by Andrew at 12:58 AM | Comments (3) | TrackBack
March 30, 2008
Some thoughts on connections between biostatistics and statistics, prompted by an announcement for a meeting that I won't be able to attend
This looks interesting. Yi Li writes of a panel discussion at the Harvard biostatistics department. My own thoughts are below; first here's Li's description. There's some good stuff:
1) Should Biostatistics continue to be a separate discipline from Statistics? Should Departments of Biostatistics and Statistics merge? In other words, are we seeing a convergence of biostatistics and statistics? Biostatisticians develop statistical methodology, statisticians are getting involved in biological/clinical data. Even at Harvard we are considering moving closer to Cambridge, and some say that the move might lead to the eventual merge of the biostat and stat departments. What are your thoughts on the division between the disciplines of stat and biostat in general, whether it is widening or closing, and how it may affect our careers and career choices, especially for starting faculty, postdocs, students?
2) How should stats/biostats as a field respond to the increasing development of statistical and related methods by non-statisticians, in particular computer scientists?
It strikes me [Li] to some extent that statisticians get involved in applied problems in a rather arbitrary fashion based on haphazard personal connections and whether the statistician's personal methodological research fits with the applied problem. Are statisticians sufficiently involved in the most important scientific problems in the world today (at least of those that could make use of statistical methods) and if not, is there some mechanism that could be developed by which we as a profession can make our expertise available to the scientists tackling those problems?
3) How do we close the gap between the sophistication of methods developed in our field and the simplicity of methods actually used by many (if not most) practitioners? Some scientific communities use sophisticated statistical tools and are up to date with the newest developments. Examples are clinical trials, brain imaging, genomics. Other communities routinely use the simplest statistical tools, such as single two-sample tests. Examples are experimental biology and chemistry, cancer imaging, and many other fields outside statistics. How do we explain this gap and what can we do to close it?
4) What makes a statistical methodology successful? Some modern statistical methods have gotten to be very well known in the scientific world, even though they are not usually part of any basic statistics course for non-statisticians. The best examples might be the bootstrap, regression trees, wavelet thresholding. Even Kaplan-Meier and Cox model are not in elementary stat books! But most statistical methods, even when they are good enough to be published in a good statistical journal, might get referenced a few times within the statistics literature and then forgotten, never making it outside the statistics community. What makes a statistical methodology gain widespread popularity?
5) Where should computational biology and bioinformatics sit in relation to biostatistics, both at Harvard and elsewhere. Should these subjects be taught as part of cross-department programs of which biostat is a part or should they be housed within an expanded biostat department?
6) Terry Speed recently published an IMS column entitled "statistics without probability". He stated that "... the most prominent features of the data were systematic. Roughly speaking, we saw in the data very pronounced effects associated with almost every feature of the microarray assay, clear spatial, temporal, reagent and other effects, all dwarfing the random variation. These were what demanded our immediate attention, not the choice of two-sample test statistic or the multiple testing problem. What did we do? We simply did things that seemed sensible, to remove, or at least greatly reduce, the impact of these systematic effects, and we devised methods for telling whether or not our actions helped, none of it based on probabilities. In the technology-driven world that I now inhabit, I have seen this pattern on many more occasions since then. Briefly stated, the exploratory phase here is far more important than the confirmatory... How do we develop the judgement, devise the procedures, and pass on this experience? I don't see answers in my books on probability-based statistics."
My thoughts:
1) I think there are advantages to having two departments but they should certainly coordinate with each other. Here at Columbia, people are hired in one department or the other and nobody in the other department even hears about it, and we also have a biostatistics group in the psychiatry department. The trouble is, everybody's so busy. One idea is to have each department have a person whose job ("committee assignment") is to keep track of what's happening in the sister department and then report back to the others. There are just so many opportunities for collaboration and shared work with students and faculty, it's a shame to not take advantage.
2) I'm not supposed to go around saying that computer scientists are smarter than statisticians, but I think it's ok for me to say that computer science is great, and I welcome that field's involvement in statistical problems. I don't know that we have to "respond" in any way except by cross-listing courses and updating the curriculum every now and then.
Li makes an excellent point about statisticians getting involved in problems "in a rather arbitrary fashion based on haphazard personal connections." One way to do better, I think, is to post all the collaborative projects in an easy-to-hash format so that people can get involved in projects that best suit them. We're starting that here with our Applied Statistics Center but we have a ways to go, even at Columbia. At the very least, I recommend that other universities follow our path and start listing things.
3) There's a need for more research into simple methods. Simple doesn't have to mean stupid. Beyond that, I'm in favor of "closing the gap" one application at a time. But maybe that's not the most efficient way, given that millions of scientific papers are published each year.
4) I think applied Bayesian methods are "very well known in the scientific world, even though they are not usually part of any basic statistics course for non-statisticians." I'm surprised Li didn't mention Bayesian methods in the list: this suggests that the first step is for statisticians and biostatisticians to become aware of the important methods in our own fields!
To answer the question more generally, I think for a method to gain widespread popularity it needs to give people answers that they want, and ideally be easy to use and theoretically justified. One reason Xiao-Li, Hal, and I wrote our paper on posterior predictive checking was to place this very useful method in a theoretical framework with theta, y, and y.rep.
5) I have no opinion on this one.
6) It's funny that Terry Speed said this because, when I used to teach at Berkeley, I heard lots of people in the statistics department say that sort of thing. But at the same time they would teach extremely theoretical courses and discourage the Ph.D. students from learning about applied methods (outside of a few specific statistics-heavy fields such as biology). I don't think they were aware of statistical methods that bridge between science and theory. The Bayesian approach is one way (at least, we try to do this in our book) but lots of non-Bayesian methods focus on systematic effects also. Consider all the work in economics on program evaluation and causal inference. In our recent book on regression, Jennifer and I emphasize the importance of the deterministic part of the model. I can't say that we yet have a method to "develop the judgement, devise the procedures, and pass on this experience"--but we've definitely advanced beyond the 1950s-style "choice of two-sample test statistic or the multiple testing problem." So I don't think things are as bad as Terry thinks, at least not in social science!
When, where, who
The panel discussion will be on Thurs 3 Apr from 2-3.30 in Kresge 213 (at the Harvard School of Public Health in Boston), and it will be led by Brad Efron, Colin Begg and David Harrington. It sounds like fun (it reminds me a bit of our symposium on statistical consulting), but I don't know how they expect to cover all of that in only an hour and a half!
Posted by Andrew at 12:09 AM | Comments (2) | TrackBack
March 26, 2008
Crime data bonanza!!!
Mike Maltz writes,
A New Data Set Available through Ohio State University’s Criminal Justice Research Center
So you think you know how to analyze time series! Well, how would you like to test your mettle on over 400,000 time series, each with up to 540 data points? The time series in question are monthly data from 1960-2004, for over 17,000 police departments, for seven crime types (murder, rape, robbery, aggravated assault, burglary, larceny, vehicle theft), as well as their sum (the so-called Crime Index), and an additional 19 subcategories – e.g., robbery with a gun, knife, personal weapons (hands, feet, etc.), or other; attempted rape; auto, truck or bus, or other vehicle theft. Or you can just view the data in different cities over time and see whether it rises and falls with various tides (unemployment, immigration, poverty, age or ethnicity distribution, etc., whatever your pet theory is). I [Maltz] have put all of the files and a plotting utility (so you can see each agency’s crime history) in a zipped file. Download it from http://sociology.osu.edu/mdm/UCR1960-2004.zip.
The data consist of monthly counts of these crimes reported by police departments throughout the country to the FBI as part of its Uniform Crime Reporting (UCR) Program. Since reporting to the UCR Program is entirely voluntary, some agencies are less than diligent in doing so, but for the most part they comply. However, major gaps still remain; for a discussion of these gaps, see “Bridging Gaps in Police Crime Data,” published by the US Bureau of Justice Statistics. Under a series of grants from the US National Institute of Justice, Harry Weiss, a graduate student here at OSU, and I cleaned the data as best we could.
Some of the gaps are just inadvertent (or, as statisticians would say, MCAR, missing completely at random). These can usually be filled in using relatively simple algorithms. The more significant problems, however, are those that are not gaps but “underestimates,” as when the City of Atlanta was bidding (successfully) for the Olympics and lowered its crime statistics in a more, shall we say, “hands-on” way (see http://www.cnn.com/2004/US/South/02/20/atlanta.police.audit.ap/index.html); New York, Philadelphia and Boca Raton also have had their own reporting scandals (http://query.nytimes.com/gst/fullpage.html?res=9F06E2D91F38F930A3575BC0A96E958260); and according to the creator of HBO’s “The Wire,” Baltimore is even better at it (http://www.huffingtonpost.com/david-simon/the-wires-final-s_b_91926.html):
"In Baltimore, where over the last twenty years Times Mirror and the Tribune Company have combined to reduce the newsroom by forty percent, all of the above stories pretty much happened. A mayor was elected governor while his police commanders made aggravated assaults and robberies disappear.
"... It would not have been easy for a veteran police reporter to pull all the police reports in the Southwestern District and find out just how robberies fell so dramatically, to track each individual report through staff review and find out how many were unfounded and for what reason, or to develop a stationhouse source who could tell you about how many reports went unwritten on the major's orders, or even further -- to talk to people in that district who tried to report armed robberies and instead found themselves threatened with warrant checks or accused of drug involvement or otherwise intimidated into dropping the matter."
Not all cities manipulate crime statistics. Even so, you might want to get rid of all of your preconceptions of how to deal with these data. It’s for that reason that a plotting utility is the centerpiece of the data set. You have to look at the data, not just throw it into the computerized maw and let Stata or SAS or SPSS give you some p values. By visually inspecting the data, you might see what the effect of a new policy, or police chief, or law has on crime. You might compare different cities with different characteristics. Whatever you do, it’s a relatively new data set that hasn’t yet been used much at all, so you’re getting in on the ground floor.
Posted by Andrew at 4:40 PM | Comments (5) | TrackBack
Data
Aleks writes:
From here, see this. It could be used as a foundation to latch additional analysis functionality on top of it.
Here are some examples of interactive statistics on the web. But few things compare to the venerable b-course.
Posted by Andrew at 9:39 AM | Comments (0) | TrackBack
March 25, 2008
Incredible Illinois, or fun with percentages that can be larger than 100
Tyler Cowen links to a calculation by Tom Elia that "of Sen. Obama's 711,000 popular-vote lead, 650,000 -- or more than 90% of the total margin -- comes from Sen. Obama's home state of Illinois, with 429,000 of that lead coming from his home base of Cook County." This is interesting, but it's more a comment on how close the (meaningless) total popular vote count is, than a reflection of something funny going on in Cook County.
Put it another way. Suppose Obama's total margin was only 111,000 votes instead of 711,000. Then his 650,000 vote margin in Illinois would represent a whoppin 580% of the total margin, and Cook County would represent 390% of the total margin! But wait, how can a part be 390% of the whole??
What I'm sayin is, the "90%" and "60%" figures are misleading because, when written as "a percent of the total margin," it's natural to quickly envision them as percentages that are bounded by 100%. There is a total margin of victory that the individual state margins sum to, but some margins are positive and some are negative. If the total happens to be near zero, then the individual pieces can appear to be large fractions of the total, even possibly over 100%.
I'm not saying that Tom Elia made any mistakes, just that, in general, ratios can be tricky when the denominator is the sum of positive and negative parts. In this particular case, the margins were large but not quite over 100%, which somehow gives the comparison more punch than it deserves, I think.
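Here is the arithmetic above as a short Python sketch. The Illinois and Cook County margins and the two totals are the figures quoted in this post (the 111,000 total is my hypothetical); the function name `share_of_total` is just a label for illustration.

```python
def share_of_total(part, total):
    """Return part as a percentage of total.

    Nothing bounds this by 100 when the total is a sum of positive
    and negative pieces that nearly cancel.
    """
    return 100.0 * part / total

illinois_margin = 650_000
cook_margin = 429_000

# Actual total margin quoted above, then the hypothetical smaller one.
for total_margin in (711_000, 111_000):
    print(
        total_margin,
        round(share_of_total(illinois_margin, total_margin)),
        round(share_of_total(cook_margin, total_margin)),
    )
```

With the actual total the two shares come out at about 91% and 60%. With the hypothetical total they are several hundred percent each, even though the Illinois and Cook County margins themselves have not changed at all.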
P.S. Elia's comment that "Sen. Obama's 429,000-vote margin in Cook County alone is larger than the winning margin of either candidate in any state" is more directly interpretable because it's a difference, not a ratio. Obama won Illinois by a 32-percentage-point landslide. (By comparison, Clinton won New York with a 17-point margin and California [typo fixed] with a 9-point margin.)
Posted by Andrew at 2:59 AM | Comments (5) | TrackBack
Peeking behind the curtain, or, What's (not) the matter with Portugal?
This is pretty embarrassing, but I think it's better to tell all, if for no other reason than to make others aware of the challenges of working with data . . .
OK, so we're reanalyzing some data from the Comparative Study of Electoral Systems, basically replicating some findings of Huber and Stanig but including additional countries and with some slightly different coding of political parties.
We have two key graphs.
First, for each country, we compute the difference between rich and poor in voting for the conservative party or parties. This graph (not shown here) reveals that the rich-poor gap in the United States is larger than most of the other (mostly European) countries in the sample.
For our second graph, we fit a model predicting conservative vote given income and religious attendance. For each country, the three lines show estimated conservative vote (compared to the national average) as a function of individual income, among people who attend religious services frequently (solid line), occasionally (light line), and never (dashed line).

The countries are ordered by increasing per-capita GDP. On the bottom line is the United States, with its familiar pattern of religious attendance mattering more for the rich than the poor. As you can see, religious people vote for conservative parties in many countries--Americans are far from unique in that way.
Wha...?
But whassup with Portugal? The only country where the religious vote in a less conservative way than the secular--the lines go in the wrong order! We asked some experts what was going on, and we were told that the center-left Socialist Party and the center-right Social Democratic Party seem to be resistant to the direction or degree of religiosity, and that the party competition in Portugal is basically non-ideological.
But, then, why the big difference between religious and secular in our data? Well, we were also told that the data for Portugal are probably crappy. So we figured we'd just remove Portugal from our graph and add a note why we excluded it, based on concerns about data and some comments about the party structure there. But then we looked at the data again . . .
It turned out the problem was in the name of one party (the Popular Party)--it had an extra comma in its name and when we read in the data, we mistakenly counted it as a different party. Whoops! (Or, as Mezzanine-era Nicholson Baker would say, Whoop!)
Here's the corrected figure:

Yeah, yeah, I know, we better check all the party names carefully now.
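For what it's worth, a check along the following lines can flag this kind of problem. This is only a sketch in Python, not the code used for the analysis: the list of labels is made up for illustration, and `normalize` and `near_duplicates` are hypothetical helper names.

```python
import re
from collections import defaultdict

def normalize(name):
    """Lowercase a party label, drop punctuation, and collapse whitespace."""
    cleaned = re.sub(r"[^\w\s]", "", name.lower())
    return " ".join(cleaned.split())

def near_duplicates(labels):
    """Group raw labels that collapse to the same normalized form."""
    groups = defaultdict(set)
    for label in labels:
        groups[normalize(label)].add(label)
    # Keep only the normalized names that arrived under more than one spelling.
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}

# Made-up labels; the stray comma mimics the coding error described above.
labels = ["Popular Party", "Popular, Party", "Socialist Party", "Social Democratic Party"]
print(near_duplicates(labels))
```

Any group that comes back with more than one raw spelling deserves a look before the party-level summaries are computed. It won't catch everything (abbreviations and translations would need a lookup table), but it is cheap.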
P.S. I guess we could make the case that we were being Bayesian, in checking the results that contradicted our prior distribution. In this case, the prior wasn't really that religion always is associated with conservative voting, but rather that the countries followed some smooth distribution. Actually, when I first noticed the problem with Portugal, I assumed the data were ok and that there was some Portugal-specific story, perhaps a left-wing church-based party. (Yes, I'm sure that comment reveals my ignorance of Portugal, but that's the point here.) I was looking for the magic x-variable that explained the unexplained variation. In this case, the x-factor was a coding error...
P.P.S. More here.
Posted by Andrew at 12:15 AM | Comments (5) | TrackBack
March 21, 2008
Poll and survey faqs
From the American Association for Public Opinion Research.
Posted by Andrew at 3:50 PM | Comments (4) | TrackBack
March 13, 2008
The "all else equal fallacy"
I like John Tierney's New York Times column (for example, here), but sometimes he goes over the top in counterintuitiveness.
Here, for example, Tierney writes about someone who says, "in some circumstances it’s better to drive than to walk. . . . If you walk 1.5 miles, Mr. Goodall calculates, and replace those calories by drinking about a cup of milk, the greenhouse emissions connected with that milk (like methane from the dairy farm and carbon dioxide from the delivery truck) are just about equal to the emissions from a typical car making the same trip. . . . Michael Bluejay, who’s done some number-crunching at BicycleUniverse.info, says that walking is actually worse than driving if you replace the calories with food in the standard American diet and if the car gets more than 24 miles per gallon. . . ."
This is interesting to me because these guys are making a classic statistical error, I think, which is to assume that all else is held constant. This is the error that also leads people to misinterpret regression coefficients causally. (See chapters 9 and 10 of our book for discussion of this point.) In this case, the error is to assume that the walker and the driver will be making the same trip. In general, the driver will take longer trips--that's one of the reasons for having a car, that you can easily take longer trips. Anyway, my point is not to get into a long discussion of transportation pricing, just to point out that this seemingly natural calculation is inappropriate because of its mistaken assumption that you can realistically change one predictor, leaving all the others constant.
As we like to say, it's a great classroom example.
P.S. More here (also see discussion in the comments below).
Posted by Andrew at 4:52 PM | Comments (13) | TrackBack
March 12, 2008
Continuation on a theme...

Posted by Juli at 5:20 PM | Comments (0) | TrackBack
Specifying a distribution from the mean and quantiles, or, just in case you thought this blog was nothing but square footage and Starbucks
David Kane writes,
What is the best way to simulate from a distribution for which you know only the 5th, 50th and 95th percentile along with the mean? In particular, I want to estimate the value for a different percentile (usually around the 40th) and associated confidence interval. I assume that the distribution is "smooth" and unimodal. For background, see here.
If you don't want to read all that, the short version is that I want to see if socioeconomic diversity has increased at Williams College over the last decade. (You may be interested in the same thing about Columbia.) It isn't easy to measure "inequality," of course, so for starters I just want to estimate what has happened at the 20th percentile. Williams has about 2000 students. So, I want to estimate the family income of the 400th poorest family.
Williams only has data on students who request financial aid. But that covers almost all the families in the bottom 1/3 of the distribution. Williams, like most colleges, does not want to give out much data. However, recent debate in Congress has resulted in Williams and other rich schools publishing some relevant data. Unfortunately, it isn't exactly what I want, hence my question.
To be concrete, Williams tells us, for each year since 1998, how many students are on aid and what the mean and the 5th, 50th and 95th percentiles of family income are for those students. But the number of students on aid has increased so the location of the 40th percentile for the entire student body (not just those on aid) is in a different location in the aided students distribution each year.
My reply:
If you were given only two quantiles, I'd recommend that you just pick a reasonable 2-parameter distributional family, solve for the two parameters, and go from there (and do a sensitivity analysis considering other families). With 3 quantiles to fit, I'd say to take a 3-parameter skewed family (although I'm not quite sure what I'd actually use). But 3 quantiles and a mean . . . fitting to a 4-parameter family seems silly, and fitting to a 2-parameter or 3-parameter family using least squares doesn't sound quite right either.
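The two-quantile version is easy to sketch. Below is a minimal Python example that assumes a lognormal family (my choice for illustration; any reasonable 2-parameter family would do, which is what the sensitivity analysis is for). The income figures are invented, not Williams data, and `fit_lognormal` and `lognormal_quantile` are hypothetical helper names.

```python
from math import exp, log
from statistics import NormalDist

STD_NORMAL = NormalDist()

def fit_lognormal(p1, q1, p2, q2):
    """Solve for (mu, sigma) of a lognormal whose p1 and p2 quantiles are q1 and q2."""
    z1 = STD_NORMAL.inv_cdf(p1)
    z2 = STD_NORMAL.inv_cdf(p2)
    # log(quantile) = mu + sigma * z, so two quantiles give two linear equations.
    sigma = (log(q2) - log(q1)) / (z2 - z1)
    mu = log(q1) - sigma * z1
    return mu, sigma

def lognormal_quantile(p, mu, sigma):
    """Quantile function of the fitted lognormal."""
    return exp(mu + sigma * STD_NORMAL.inv_cdf(p))

# Invented numbers: median income 60,000 and 95th percentile 180,000.
mu, sigma = fit_lognormal(0.50, 60_000, 0.95, 180_000)
print(lognormal_quantile(0.40, mu, sigma))  # estimated 40th percentile
print(exp(mu + sigma**2 / 2))               # implied mean of the fitted curve
```

Comparing the implied mean in the last line with the reported mean is a quick check on the chosen family: if they are far apart, the family is probably the wrong shape.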
The right thing to do, I think, is to have some model over distribution space, probably centered on some reasonable three-parameter family but with error. I'm not quite sure the best way to do this; maybe work with the cdf and transform the uniform. I wouldn't be surprised if there's a reasonable solution out there; it seems like a fun problem to work on.
Or, if I wanted an answer and was in a hurry, I'd try various curves that go thru the 5th, 50th, 95th and then play around until they match the mean correctly.
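The in-a-hurry approach might look something like this in Python. It is a rough sketch under strong assumptions: the quantile curve is taken to be piecewise linear through the 5th, 50th, and 95th percentiles, the lowest income `q0` is fixed by hand, and the top endpoint is the one knob adjusted to hit the mean. The numbers are invented and the function names are hypothetical.

```python
from bisect import bisect_right

def upper_endpoint_for_mean(q5, q50, q95, mean, q0=0.0):
    """Solve for the top of a piecewise-linear quantile curve so its mean matches."""
    # The mean is the integral of the quantile curve over [0, 1],
    # which for a piecewise-linear curve is a sum of trapezoid areas.
    known = 0.05 * (q0 + q5) / 2 + 0.45 * (q5 + q50) / 2 + 0.45 * (q50 + q95) / 2
    q100 = 2 * (mean - known) / 0.05 - q95
    if q100 < q95:
        raise ValueError("mean too small for this curve family; try another shape")
    return q100

def quantile(p, knots):
    """Linearly interpolate the curve; knots is a sorted list of (prob, value)."""
    probs = [prob for prob, _ in knots]
    i = min(max(bisect_right(probs, p), 1), len(knots) - 1)
    (p0, v0), (p1, v1) = knots[i - 1], knots[i]
    return v0 + (v1 - v0) * (p - p0) / (p1 - p0)

# Invented numbers: 5th, 50th, 95th percentiles and the mean of family income.
q5, q50, q95, mean = 20_000, 60_000, 180_000, 85_000
q100 = upper_endpoint_for_mean(q5, q50, q95, mean)
knots = [(0.0, 0.0), (0.05, q5), (0.50, q50), (0.95, q95), (1.0, q100)]
print(q100, quantile(0.40, knots))
```

One thing this makes obvious: in this particular family the mean only moves the upper tail, so the estimated 40th percentile depends on the three quantiles alone. A smoother family would let the mean pull on the interior of the curve too, which is why it is worth trying more than one shape.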
Posted by Andrew at 12:28 AM | Comments (6) | TrackBack
March 4, 2008
Starbucks/Walmart update
Alex F. commented here about problems with our Starbucks and Walmart data. Elizabeth Kaplan, who collected the data for me, replied:
Yeah Walmart was a bit of a pain to find the locations for as you can not search just by state on their website, like for Starbucks. In order to find the locations I relied on the yellow page results. Even though I looked through to eliminate double postings for walmarts with the same address, after I looked into it again tonight, it appears the yellow pages dramatically over represented the number of walmarts per state. I have attached the correct data. All of these numbers come from this website (http://www.walmartfacts.com/StateByState/) which I was unable to locate before.
As far as the data for starbucks that should be correct as I got it straight from their website. The one thing is that they don't list all affiliate stores (that is stores not owned and operated by the company). There is no reliable source of data on affiliate stores by state, and obviously the yellow pages are not a good source. So the data I sent to you just includes Starbucks owned and operated stores.
Also for population I used the 2006 Census Bureau estimates.
This sort of thing happens all the time to me, so I certainly don't think Elizabeth should feel too bad about this. I'm just glad that Alex noticed and pointed out the problem. Anyway, here are the corrected maps:


and scatterplot:

And also, following Seth's suggestion, the scatterplot on the log scale:

And, following Kaiser's suggestion, a reparameterization showing people per store (rather than stores per million people):

Posted by Andrew at 12:41 AM | Comments (27) | TrackBack
February 22, 2008
Real statistics and folk statistics: modeling mental models
I was lucky to see most of the talk that Josh Tenenbaum gave in the psychology department a couple weeks ago. He was talking about some experiments that he, Charles Kemp, and others have been doing to model people's reasoning about connectedness of concepts. For example, they give people a bunch of questions about animals (is a robin more like a sparrow than a lion is like a tiger, etc.), and then they use this to construct an implicit tree structure of how people view animals. (The actual experiments were interesting and much more sophisticated than simply asking about analogies; I'm just trying to give the basic idea.) Here's a link to some of this work.
My quick thought was that Tenenbaum, Kemp, et al. were using real statistics to model people's "folk statistics" (by which I mean the mental structures that people use to model the world). I have a general sense that folk statistical models are more typically treelike or even lexicographical, whereas reality (for social phenomena) is more typically approximately linear and additive. (I'm thinking here of Robyn Dawes's classic paper on the robust beauty of additive models, and similar work on clinical vs. statistical prediction.) Anyway, the method is interesting. I wondered whether, in the talk, Tenenbaum might have been slightly blurring the distinction between normative and descriptive, in that people might actually think in terms of discrete models, but actual social phenomena might be better modeled by continuous models. So, in that sense, even if people are doing approximate Bayesian inference in their brains, it's not quite the Bayesian inference I would do, because people are working with a particular set of discrete, even lexicographic, models, which are not what I suspect are good descriptions of most of the phenomena I study (although they might work for problems such as classifying ostriches, robins, platypuses, etc.).
Near the end of his talk, Tenenbaum did give an example where the true underlying structure was Euclidean rather than tree-like (it was a series of questions about the similarity of U.S. cities), and, indeed, there he could better model people's responses using an underlying two-dimensional model (roughly but not exactly corresponding to the latitude-longitude positions of the cities) than a tree model, which didn't fit so well.
I sent Tenenbaum my above comment about real and folk statistics, and he replied:
I'd expect that for either the real world or the mind's representations of the world, some domains would be better modeled in a more discrete way and others in a more continuous way. In some cases those will match up - I talked about these correspondences towards the end of the talk, not sure if you were still there - while in other cases they might not. It would be interesting to think about both kinds of errors: domains which our best scientific understanding suggests are fundamentally continuous while the naive mind treats them as more discrete, and domains which our best scientific understanding suggests are discrete while the naive mind treats them as more continuous. I expect both situations exist.
Also, the "naive mind" is quite an idealization here. The kind of mental representation that someone adopts, and in particular whether it's more continuous or discrete, is likely to vary with expertise, culture, and other experiential factors.
My reply:
I think the discrete/continuous distinction is a big one in statistics and not always recognized. Sometimes when people argue about Bayes/frequentist or parametric/nonparametric or whatever, I think the real issue is discrete/continuous. And I wouldn't be surprised if this is true in psychology (for example, in my sister's work on how children think about essentialism).
Tenenbaum replied to this with:
While the focus for most of my talk emphasized tree-structured representations, towards the end I talked about a broader perspective, looking at how people might use different forms of representations to make inferences about different domains. Even the trees have a continuous flavor to them, like phylogenetic trees in biology: edge length in the graph matters for how we define the prior over distributions of properties on objects.
I'll buy that.
On a less serious note . . .
This reminds me of all sorts of things from children's books, such as pictures of animals that include "chicken" and "bird" as separate and parallel categories, or stories in which talking cats and dogs go fishing and catch and eat real fish! (The most bizarre of all these, to me, are the Richard Scarry stories in which the sentient characters include a cat, a dog, and a worm, and they go fishing. My naive view of the "great chain of being" would put fish above worms, but I guess Scarry had a different view.)
Posted by Andrew at 12:50 AM | Comments (4) | TrackBack
February 20, 2008
Using simulation to do statistical theory
We were looking at some correlations--within each state, the correlations between income and different measures of political ideology--and we wanted to get some sense of sampling variability. I vaguely remembered that the sample correlation has a variance of approximately 1/n--or was that 0.5/n, I couldn't remember. So I did a quick simulation:
> corrs <- rep (NA, 1000)
> for (i in 1:1000) corrs[i] <- cor (rnorm(100),rnorm(100))
> mean(corrs)
[1] -0.0021
> sd(corrs)
[1] 0.1
Yeah, 1/n, that's right. That worked well. It was quicker and more reliable than looking it up in a book.
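The same check can be repeated for several sample sizes to see the 1/n pattern directly. Here is a rough Python sketch of that idea (not the code used above; the function names, the seed, and the choice of sample sizes are all assumptions made for illustration). It simulates correlations of independent normal draws and prints the simulated standard deviation next to 1/sqrt(n):

```python
import math
import random


def sample_correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_corr_sd(n, reps=1000, seed=0):
    """Mean and sd of the sample correlation of two independent normal samples of size n."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) so reruns agree
    corrs = []
    for _ in range(reps):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        corrs.append(sample_correlation(xs, ys))
    mean = sum(corrs) / reps
    sd = math.sqrt(sum((c - mean) ** 2 for c in corrs) / (reps - 1))
    return mean, sd


# Compare the simulated sd with 1/sqrt(n) for a few sample sizes.
for n in (25, 100, 400):
    mean, sd = simulate_corr_sd(n)
    print(n, round(sd, 3), round(1 / math.sqrt(n), 3))
```

If the variance really is about 1/n, the two printed columns should track each other closely, with the sd halving each time n is quadrupled.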
Posted by Andrew at 12:34 AM | Comments (4) | TrackBack
February 15, 2008
Linear regression is not dead, and please don't call it OLS
Lee Sigelman writes,
In the latest issue of The Political Methodologist, James S. Krueger and Michael S. Lewis-Beck examine the current standing of the time-honored but oft-dismissed-as-passe ordinary least squares regression model in political science research. . . . Krueger and Lewis-Beck report that . . . The OLS regression model accounted for 31% of the statistical methods employed in these articles. . . . “Less sophisticated” statistical methods — those that would ordinarily be covered before OLS in a methods course — accounted for 21% of the entries. . . . Just one in six or so of the articles that reported an OLS-based analysis went on to report a “more sophisticated” one as well. . . . OLS is not dead. On the contrary, it remains the principal multivariate technique in use by researchers publishing in our best journals. Scholars should not despair that possession of quantitative skills at an OLS level (or less) bars them from publication in these top outlets.
I have a few thoughts on this:
1. I don't like the term OLS ("ordinary least squares"). I prefer the term "linear regression" or "linear model." Least squares is an optimization problem; what's important (in the vast majority of cases I've seen) is the model. For example, if you still do least squares but you change the functional form of the model so it's no longer linear, that's a big deal. But if you keep the linearity and change to a different optimization problem (for example, least absolute deviation), that generally doesn't matter much. It might change the estimate, and that's fine, but it's not changing the key part of the model.
2. I like simple methods. Gary and I once wrote a paper that had no formulas, no models, only graphs. It had 10 graphs, many made of multiple subgraphs. (Well, we did have one graph that was based on some fitted logistic regressions--an early implementation of the secret weapon--but the other 9 didn't use models at all.) And, contrary to Cosma's comment on John's entry, our findings were right, not just published. The purpose of the graphical approach was not simply to convey results to the masses, and certainly not because it was all that we knew how to do. It just seemed like the best way to do this particular research. Since then, we've returned to some of these ideas using models, but I think we learned a huge amount from these graphs (along with others that didn't make it into the paper).
3. Sometimes simple methods can be justified by statistical theory. I'm thinking here of our approach of splitting a predictor at the upper quarter or third and the lower quarter or third. (Although, see the debate here.)
4. Other times, complex models can be more robust than simple models and easier to use in practice. (Here I'm thinking of bayesglm.)
5. Sometimes it helps to run complicated models first, then when you understand your data well, you can carefully back out a simple analysis that tells the story well. Conversely, after fitting a complicated model, you can sometimes make killer graphs.
Posted by Andrew at 11:25 AM | Comments (7) | TrackBack
February 11, 2008
More discreteness, please
Justin Wolfers presents this graph that he (along with Eric Bradlow, Shane Jensen, and Adi Wyner) made comparing the career trajectory of Roger Clemens to other comparable pitchers:

The point is that Clemens did unexpectedly well in the later part of his career (better earned run average, allowed fewer walks+hits) compared to other pitchers with long careers. This in turn suggests that maybe performance-enhancing drugs made a difference. Justin writes:
To be clear, we don’t know whether Roger Clemens took steroids or not. But to argue that somehow the statistical record proves that he didn’t is simply dishonest, incompetent, or both. If anything, the very same data presented in the report — if analyzed properly — tends to suggest an unusual reversal of fortune for Clemens at around age 36 or 37, which is when the Mitchell Report suggests that, well, something funny was going on.
I can't comment on the steroids thing at all, but I will say that I'd like more information than is in the graphs. For one thing, Clemens is clearly not a typical pitcher and never has been. At the very least, you'd like to see the comparison of his trajectory with all the other individual trajectories, not simply the average. For another, the graphs above seem to be relying way too much on the quadratic fit. At least for the average of all the other pitchers, why not show the actual averages? Far be it from me to criticize this analysis (especially since I am friends with all four of the people who did it!)--this is just a recreational activity, and I'm sure these guys have better things to do than correct ERAs for A.L./N.L. effects, etc.--but I think you do want to have some comparisons of the entire distribution, as well as a sense of how much the "unusual reversal around ages 36 or 37" is an artifact of the fitted model.
P.S. to Justin, Eric, Shane, and Adi: Now youall have permission to be picky about my analyses in return. . . .
P.P.S. Nathan made this plot showing data from the 16 most recent Hall of Fame pitchers.
Posted by Andrew at 12:40 AM | Comments (11) | TrackBack
February 8, 2008
Dead heat
Gary sent along this news article from the Syracuse Post-Standard:
Dead heat: Obama and Clinton split the Syracuse vote 50-50
by Mike McAndrew
In the city of Syracuse, the strangest thing happened in Tuesday's Democratic presidential primary.
Sen. Hillary Clinton and Sen. Barack Obama received the exact same number of votes, according to unofficial Board of Election results.
Clinton: 6,001.
Obama: 6,001.
"Wow, that is odd," said Jay Biba, Clinton's Central New York campaign coordinator. "I never heard of that in my life."
The odds of Clinton and Obama tying were less than one in 1 million, said Syracuse University mathematics Professor Hyune-Ju Kim.
"It's almost impossible," said Kim, who analyzed the statewide and citywide votes.
Lisa Daly, Obama's Syracuse campaign coordinator, said she thought a mistake had been made when she was first told the tally by the Board of Elections.
What are the chances of it happening?
"Good thing it wasn't a mayor's race," quipped Grant Reeher, a political science professor at Syracuse University's Maxwell School of Citizenship and Public Affairs.
A total of 12,346 votes were cast for Democrats in the city. Four other Democrats also received votes: John Edwards, 114; Dennis Kucinich, 113; Bill Richardson, 90; and Joe Biden, 27.
The tie is likely to be broken when elections officials recanvass the voting machines and add in the absentee and affidavit votes.
But for now, it's all even.
Update
The story The Post-Standard broke about Sen. Hillary Clinton and Sen. Barack Obama battling to a tie vote in the city of Syracuse was being posted Thursday on internet sites across the country.
Clinton and Obama each received 6,001 votes in Syracuse in the unofficial Board of Elections results. A total of 12,346 votes were cast in the city.
After doing a statistical analysis for The Post-Standard, Syracuse University mathematics professor Hyune-Ju Kim noted that the odds of Clinton and Obama getting the exact same amount of votes in Syracuse was less than one in 1 million.
To come to that conclusion, Kim factored in the state-wide and city-wide results in the Democratic primary.
Elaborating on Thursday, she noted: "The "almost impossible" odd is obtained when we assume the Syracuse voter distribution follows the New York state distribution. Since it is almost impossible to observe what we have observed, statistically we can conclude that Syracuse voter distribution is significantly different from the New York state distribution."
There would be less than one in 1 million chance of a tie occurring between Clinton and Obama in voting by a randomly selected group of 12,346 New York Democratic voters, she said.
Not to pick on some harried mathematics professor who'd probably rather be out proving theorems, but . . . of course Syracuse voters are not a randomly selected group of New Yorkers. You don't need a statistical test to see that. Regarding the probability of an exact tie: I don't think that's so low: a quick calculation might say that either Clinton or Obama could've received between, say, 5000 and 7000 votes, giving something like a 1/2000 chance of an exact tie. That's gotta be the right order of magnitude.
Anyway, I know this is silly--as pointed out in the article, it doesn't matter if there's a tie in Syracuse anyway. This might make a good classroom example, though. (See also here and here for more on the probability of a tied election.)
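The quick calculation above can be written out as a few lines of Python. This is only a sketch of the back-of-the-envelope reasoning: it assumes that one candidate's count is equally likely to be any whole number in the plausible range (5000 to 7000 here), which is a crude stand-in for real uncertainty about the vote. The function name is made up for illustration.

```python
def uniform_tie_probability(low, high, total_two_way_votes):
    """Chance of an exact tie if one candidate's count is equally likely
    to be any integer from low to high (a deliberately crude assumption)."""
    if total_two_way_votes % 2 != 0:
        return 0.0  # an odd combined total cannot split evenly
    tie_count = total_two_way_votes // 2
    if not (low <= tie_count <= high):
        return 0.0  # a tie would fall outside the plausible range
    return 1.0 / (high - low + 1)


# Syracuse: Clinton and Obama had 6,001 + 6,001 votes between them.
p = uniform_tie_probability(5000, 7000, 6001 + 6001)
print(f"about 1 in {round(1 / p)}")
```

Widening or narrowing the plausible range changes the answer proportionally, which is why this is only an order-of-magnitude argument.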
Posted by Andrew at 6:28 PM | Comments (9) | TrackBack
February 6, 2008
It's all over but the normalizin'
Ted Dunning writes:
You advocated recently [article to appear in Statistics in Medicine] the normalization of variables to have average deviation of 1/2 in order to match that of a {0,1} binary variable.
This recommendation will disturb lots of people for obvious reasons which may make your recommendation sell better.
But have you considered normalizing the binary variable to {-1, 1} instead of {0,1} before adjusting the mean to zero? This has the same effect but leaves larger communities happier, particularly because much of the applied modeling community has always normalized their binary variables to this range.
My reply: I actually went back and forth on this for a while. In most of the regression analyses in political science, economics, sociology, epidemiology, etc., that I've seen, it's standard to code binary variables as 0/1. But, yeah, the other way to go would've been to standardize by dividing by 1 sd and then give the recommendation to code binary variables as +/- 1. Maybe that would've been a better idea. I was trying to decide which way would disturb people less, but maybe I guessed wrong!
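To see that the two conventions line up, here is a small Python sketch. The data are made up, the `rescale` helper is a hypothetical name, and the binary variable is assumed to be split evenly between its two values (the case in which a 0/1 variable has standard deviation exactly 1/2).

```python
import statistics


def rescale(values, sd_multiple):
    """Center at the mean and divide by sd_multiple standard deviations."""
    center = statistics.fmean(values)
    spread = statistics.pstdev(values)
    return [(v - center) / (sd_multiple * spread) for v in values]


continuous = [23.0, 35.0, 41.0, 52.0, 64.0, 70.0]  # made-up values, e.g. ages
binary_01 = [0, 1, 0, 1, 1, 0]                     # evenly split 0/1 coding
binary_pm1 = [2 * b - 1 for b in binary_01]        # same variable coded -1/+1

# Dividing by 2 sd puts the continuous input on the scale of a 0/1 variable ...
print(statistics.pstdev(rescale(continuous, 2)), statistics.pstdev(binary_01))
# ... while dividing by 1 sd matches the -1/+1 coding.
print(statistics.pstdev(rescale(continuous, 1)), statistics.pstdev(binary_pm1))
```

Either pairing puts continuous and binary inputs on a comparable scale; the choice is about which coding of the binary variable readers are used to.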
Posted by Andrew at 12:36 AM | Comments (4) | TrackBack
February 3, 2008
Convergent interviewing and respondent-driven sampling
Bill Harris writes,
I stumbled across this project today and thought it might be related to a comment I posted last summer here.
I'm curious if Bob Dick's convergent interviewing perhaps predates RDS; I'm pretty sure I first learned convergent interviewing from Bob around 1992. I have a book by him, Rigor Without Numbers, that talks about convergent interviewing, as well. While my third edition is copyright 1999, it says that the first version was delivered at the XVIIIth Annual Meeting of Australian Social Psychologists, Greenmount, Queensland, 12-14 May 1989.
For more online on convergent interviewing, see here.
Matt responds:
I was not familiar with convergent interviewing and it does seem that it precedes RDS; as far as I know the first paper about RDS was published by Doug Heckathorn in 1997. But, it also seems that the methods are very different. RDS is designed to make population proportion estimates (e.g. What percentage of drug injectors in New York City have HIV?) while it seems that convergent interviewing is designed for qualitative research. Also, in convergent interviewing it seems that the researcher chooses who to interview next (and so can do this in a purposive way), but in RDS the choice of who gets recruited is made by the participants themselves, not the researcher, and in fact RDS estimation only works if people recruit randomly from their friends (i.e. no purposive choice). There may be insights that practitioners of both methods can learn from each other, but those connections aren't clear to me right now. On the other hand, sometimes these connections pop up in mysterious ways, so this idea might be helpful in the future.
Bill adds:
What I saw as qualitatively similar between MCMC and convergent interviewing is the notion that you draw a sample in ways that seeks to maximize the information you gather from your sample, avoiding getting stuck in parts of the population that have very little to contribute to items of interest, as you might with a purely random sample.
I seem to recall it being said somewhere that one can select the next people to interview in CI by asking the current pair of respondents (i.e., on an iterative basis) who is the person most unlike them who is also in some sense mainstream. As I haven't gotten time to do much with your paper yet, I can't speak to RDS except via your claim that it relates to MCMC.
As for the intent, I think what you say is correct, although Bob Dick may wish to offer his views: CI is focused on qualitative research, and so you're more likely to surface a broad spectrum of answers but have no estimate of relative frequency of those answers in the population.
Posted by Andrew at 11:26 AM | Comments (1) | TrackBack
January 31, 2008
Debate over categorizing continuous variables
In a comment to an entry linking to my paper on splitting a predictor at the upper quarter or third and the lower quarter or third, MV links to this article by Frank Harrell on problems caused by categorizing continuous variables:
1. Loss of power and loss of precision of estimated means, odds, hazards, etc.
2. Categorization assumes that the relationship between the predictor and the response is flat within intervals; this assumption is far less reasonable than a linearity assumption in most cases
. . .
12. A better approach that maximizes power and that only assumes a smooth relationship is to use a restricted cubic spline . . .
My reply:
I agree that it is typically more statistically efficient to use continuous predictors. But, if you are discretizing, our paper shows why it can be much more efficient to use three groups (thus, comparing "high" vs. "low", excluding "middle"), rather than simply dichotomizing into high/low.
As discussed in the paper, we specify the cutpoints based on the proportion of data in each category of the predictor, x. We're not estimating the cutpoints based on the outcome, y. (This handles points 7, 8, 9, and 10 of the Harrell article.)
We're not assuming that the regression function is flat within intervals or discontinuous between intervals. We're just making direct summaries and comparisons. That's actually the point of our paper, that there are settings where these direct comparisons can be more easily interpretable.
Just to be clear: I'm not recommending that discrete parameters be used for articles in the New England Journal of Medicine or whatever, in an area where regression is a well understood technique. I completely agree with Harrell that it's generally better to keep variables as continuous rather than try to get cute with discretization. On the other hand, when you have your results, it can be helpful to explain them with direct comparisons. The point of our paper is that, if you're going to do such direct comparisons, it's typically efficient to do upper and lower third or quarter, rather than upper and lower half.
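For a rough sense of the efficiency claim, here is a Python sketch under simplifying assumptions that are mine rather than taken from the paper's text: the predictor is normally distributed, the true relationship is linear with constant error variance, and the slope is estimated by dividing the difference in mean outcome between the upper and lower groups by the difference in mean predictor. Under those assumptions the large-sample efficiency relative to the full linear regression works out to 2*phi(z)^2/f, where f is the fraction of data in each extreme group, z is the corresponding standard normal cutoff, and phi is the standard normal density. The function name is made up.

```python
from statistics import NormalDist


def efficiency_vs_regression(fraction):
    """Large-sample efficiency, relative to linear regression, of comparing the
    upper and lower `fraction` of a normally distributed predictor
    (assumes a linear relationship and constant error variance)."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1 - fraction)  # boundary of the upper group
    return 2 * std_normal.pdf(cutoff) ** 2 / fraction


# Upper/lower halves versus thirds versus quarters.
for label, fraction in (("halves", 1 / 2), ("thirds", 1 / 3), ("quarters", 1 / 4)):
    print(label, round(efficiency_vs_regression(fraction), 2))
```

Under these assumptions, dropping the middle of the data loses less than dichotomizing does, because the extreme groups are much further apart in the predictor; none of the splits matches the regression that uses the continuous predictor.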
Posted by Andrew at 4:27 PM | Comments (2) | TrackBack
January 29, 2008
Robust t-distribution priors for logistic regression coefficients
Bill DuMouchel wrote:
I recently came across your paper, "A default prior distribution for logistic and other regression models," where you suggest the student-t as a prior for the coefficients. My application involves drug safety data and very many predictors (hundreds or thousands of drugs might be associated with an adverse event in a database). Rather than a very weakly informative prior, I would prefer to select the t-distribution scale parameter (call it tau) to shrink the coefficients toward 0 (or toward some other value in a fancier model) as much as can be justified by the data. So I want to fit a simple hierarchical model where tau is estimated. Is there an easy modification of your algorithm to adjust tau at every iteration and to ensure convergence to the MLE of tau (or maximum posterior estimate if we add a prior for tau)? And do you know of any arguments for why regularization by cross-validation would really be any better than fitting tau by a hierarchical model, especially if the goal is parameter interpretation rather than pure prediction?
I replied:
We also have a hierarchical version that does what you want, except that the distribution for the coeffs is normal rather than t. (I couldn't figure out how to get the EM working for a hierarchical t model. The point is that the EM for the t model uses the formulation of a t as a mixture of normals, i.e., it's essentially already a hierarchical normal.)
We're still debugging the hierarchical version, hope to have something publicly available (as an R package) soon.
Regarding your question about cross-validation, yes, I think a hierarchical model would be better. The point of the cross-validation in our paper was to evaluate priors for unvarying parameters which would not be modeled hierarchically.
Bill then wrote:
I did have my heart set on a hierarchical model for t rather than normal, because I wanted to avoid over shrinking very large coefficients while still "tuning" the prior scale parameter to the data empirically. (Although my worry about over shrinking might be less urgent if I use prior information to create "batches" that can have their own centers of shrinkage, as in your in-progress hierarchical bayesglm program.)
Lee Edlefsen and I [Bill D.] are working on a drug adverse events dataset with about 3 million rows and three thousand predictors, using logistic regression and some extensions of LR, and with thousands of different response events to fit. Plus the potential non repeatability of MCMC results would be a real turnoff for the FDA regulators and pharma industry researchers.
An EM question
I have a question for Chuanhai or Xiao-Li or someone like that: is it possible to do EM with two levels of latent variables in the model? In the usual formulation, there are data y, latent parameters z, and hyperparameters theta, and EM gives you the maximum likelihood (or posterior mode) estimate of theta, conditional on y and averaging over z. This can commonly be done fairly easily because z commonly has (or can be approximated with) a simple distribution given y and theta. This scenario describes regression with fixed Student-t priors, or regression with normal priors with unknown mean and variance.
But what about regression with t priors with unknown center and scale? There are now two levels of latent variables. Can an EM, or approximate EM, be constructed here? As Bill and I discussed in our emails, Gibbs is great, and it's much easier to set up and program than EM, but it's harder to debug. There's something nice about a deterministic algorithm, especially if it's built with bells and whistles that go off when something goes wrong.
Posted by Andrew at 12:32 AM | Comments (6) | TrackBack
January 28, 2008
Some Recent Progress in Simple Statistical Methods
Simple methods are great, and "simple" doesn't always mean "stupid" . . .
Here's the mini-talk I gave a couple days ago at our statistical consulting symposium. It's cool stuff: statistical methods that are informed by theory but can be applied simply and automatically to get more insights into models and more stable estimates. All the methods described in the talk derived from my own recent applied research.
For more on the methods, see the full-length articles:
Scaling regression inputs by dividing by two standard deviations
A default prior distribution for logistic and other regression models
Splitting a predictor at the upper quarter or third and the lower quarter or third
A message for the graduate students out there
Research is fun. Just about any problem has subtleties when you study it in depth (God is in every leaf of every tree), and it's so satisfying to abstract a generalizable method out of a solution to a particular problem.
P.S. On the other hand, many of Tukey's famed quick-and-dirty statistical methods don't seem so great to me anymore. They were quick in the age of pencil-and-paper computation, and sometimes dirty in the sense of having unclear or contradictory theoretical foundations. (In particular, his stem-and-leaf plots and his methods for finding gaps and clusters in multiple comparisons seem particularly silly from the perspective of the modern era, however clever and useful they may have been at the time he proposed them.)
P.P.S. Don't get me wrong, Tukey was great, I'm not trying to shoot him down. I wrote the above P.S. just to remind myself of the limitations of simple methods, that even the great Tukey tripped up at times.
Posted by Andrew at 3:02 AM | Comments (10) | TrackBack
January 25, 2008
Kaiser Fung on business statistics in practice
Here's Kaiser Fung's presentation at our consulting mini-symposium. It was interesting to hear about the challenges of in-house consulting at Sirius Satellite Radio.
Posted by Andrew at 5:40 PM | Comments (0) | TrackBack
Rindskopf’s Rules for Statistical Consulting
Our statistical consulting mini-symposium yesterday was great. I wish we'd been able to video it. There was lively discussion of the connections between statistical consulting and research, and the different aspects of consulting in academic, corporate, and legal environments.
I'll be posting everyone's slides. Here's David Rindskopf's contribution:
Rindskopf’s Rules for Statistical Consulting
Some of these rules are universal, while others apply only in particular situations: Informal academic consulting, formal academic consulting, or professional consulting. Hopefully the context will be apparent for each rule.
Communication with the Client:
(1) In the beginning, mostly (i) listen and (ii) ask questions that guide the discussion.
(2) Your biggest task is to get the client to discuss the research aims clearly; next is design, then measurement, and finally statistical analysis.
(3) Don’t give recommendations until you know what the problem is. Premature evaluation of a consulting situation is a nasty disease with unpleasant consequences.
(4) Don’t believe the client about what the problem is. Example: If the client starts by asking “How do I do a Hotelling’s T?” (or any other procedure), never believe (without strong evidence) that he/she really needs to do a Hotelling’s T.
Exception: If a person stops you in the hall and says “Have you got a minute?” and asks how to do Hotelling’s T, tell them and hope they’ll go away quickly and not be able to find you later. (I’ve had this happen, and if I ask enough questions I inevitably find that it’s the wrong test, answers the wrong question, and is for the wrong type of data.)
Adapting to the Client and His/Her Field
(5) Assess the client’s level of knowledge of measurement, research design, and statistics, and talk at an appropriate level. Make adjustments as you gain more information about your client.
(6) Sometimes the “best” or “right” statistical procedure isn’t really the best for a particular situation. The client may not be able to do a complicated analysis, or understand and write up the results correctly. Journals may reject papers with newer methods (I know it’s hard to believe, but it happens in many substantive journals). In these cases you have to be prepared to do more “traditional” analyses, or use methods that closely approximate the “right” ones. (Turning lemons into lemonade: Use this as an opportunity to write a tutorial for the best journal in their field. The next study can then use this method.) A similar perspective is represented in the report of the APA Task Force on Statistical Inference; see their report: Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Professionalism (and self-protection)
(7) If you MUST do the right (complicated) analysis, be prepared to do it, write a few tutorial paragraphs on it for the journal (and the client), and write up the results section.
(8) Your goal is to solve your client’s problems, not to criticize. You can gently note issues that might prevent you from giving as complete a solution as desired. Corollary: Your purpose is NOT to show how brilliant you are; keep your ego in check.
Time Estimation, Charging for Your Time, etc.
(9) If a person stops you in the hall and asks if you have a minute, make him/her stand on one leg while asking the question and listening to your answer. If they ask for five minutes, it’s really a half-hour they need (or more).
(10) Corollary: Don’t charge by the job unless you really know what you’re doing or are really desperate. Not only do people (including you) underestimate how long it will take, but (a la Parkinson’s Law) the job will expand to include everything that comes into the client’s mind as the job progresses. If you think you know enough, write down all of the tasks, estimate how much time each will take, and double it. Also let the client know that if they make changes they’ll pay extra (Examples: “Whoops, I left out some data; can you redo the analyses?”, or “Let’s try a crosstab by astrological sign, and favorite lotto number, and...”)
(11) Charge annoying people a higher hourly rate. If you don’t want to work for them at all, charge them twice your usual rate to discourage them from hiring you (at least if they do hire you, you’ll be rewarded well.)
Resources
http://www.amstat.org/sections/cnsl/index.html ASA section on consulting
http://www.amstat.org/sections/cnsl/BooksJournals.html Their guide to books and journals on statistics
Boen, J.R. and Zahn, D.A. (1982) The Human Side of Statistical Consulting, Lifetime Learning Publications.
Javier Cabrera and Andrew McDougall. (2002). Statistical Consulting. Springer-Verlag.
Janice Derr. (2000). Statistical Consulting: A Guide to Effective Communication. Pacific Grove CA: Duxbury Press, 200 pages, ISBN:0-534-36228-1.
Christopher Chatfield (1988). Problem solving: A statistician's guide, Chapman & Hall.
Taplin R.H. (2003). Teaching statistical consulting before statistical methodology. Australian & New Zealand Journal of Statistics, Volume 45, Number 2, June 2003, 141-152. Contains a good reference list on statistical consulting.
Posted by Andrew at 5:29 PM | Comments (4) | TrackBack
January 24, 2008
Statistical consulting mini-symposium TODAY (Thurs)!
Mini-Symposium: Statistical Consulting
When: January 24, 2008, from 3pm to 5pm
Where: Applied Statistics Playroom*
Sponsored by the New York City chapter of the American Statistical Association and the Columbia University Statistics Department, ISERP, and Applied Statistics Center.
Agenda
* Before 3pm: Casual conversation. This is a good time to meet new people or catch up with others.
* 3pm to 5pm:
o Brief lecture by Andrew Gelman: Some Recent Progress in Simple Statistical Methods.
o Panel discussion on statistical consulting with Naihua Duan (New York State Psychiatric Institute), Mimi Kim (Albert Einstein College of Medicine), Eva Petkova (New York University), Andrew Gelman (Columbia University), Kaiser Fung (Sirius Satellite Radio), and David Rindskopf (CUNY Graduate Center).
o The panel members will speak briefly, discuss questions, and facilitate a general discussion about statistical consulting.
* After 5pm: End of the formal part of the symposium. People can continue a group discussion, leave, or break into smaller groups.
Topics to be discussed include:
* Providing statistical solutions within the range of understandability;
* Handling the trade-offs between doing the analyses yourself and teaching others to perform all or parts of the analyses themselves;
* Managing expectations and building long-term relationships;
* Deciding how much to cater to the norm within disciplines;
* Balancing the goals of co-authorship in conjunction with money-making.
* The Applied Statistics Playroom is 707 International Affairs Building, Columbia University, at 118 St. and Amsterdam Ave., near the 116 St. #1 train. Snacks will be provided.
P.S. See here, here, and here for slides of some of the presentations.
Posted by Andrew at 6:30 AM | Comments (3) | TrackBack
January 15, 2008
Statistics postdoc at Michigan
I hate to advertise the competition, but this looks like it could be interesting:
Postdoctoral Position in Methodology
Institute for Social Research
University of Michigan
Unit: The Quantitative Methodology Program
Date Announced: 01/15/2008
Qualified individuals will have a Ph.D. in statistics or biostatistics with a demonstrated interest in the social, behavioral or health sciences, or a Ph.D. in the social, behavioral or health sciences with very strong methodological skills and interests. Postdoctoral researchers will collaborate with Dr. Susan Murphy as part of a large NIH-funded project focused on the advancement and dissemination of statistical methodology involving individually tailored treatments (known as dynamic treatment regimes or adaptive treatment strategies) related to research on the prevention and treatment of substance abuse. This project also involves collaboration with researchers at the Methodology Center at Pennsylvania State University and clinical scientists at a variety of Universities and research centers.
Successful applicants will have exceptional resources to facilitate their research, including access to administrative and software support staff, any required hardware and software, and travel funding for at least one scientific conference annually. Experience with statistical programming and computer simulations is desirable. The position is for one year, with excellent possibility of re-funding for at least one additional year. The salary and benefits associated with this position are competitive.
Review of applications will begin immediately and continue until the positions are filled. Send a letter of application indicating research interests, career goals, and experience, a curriculum vita, and three letters of professional reference to: Susan Murphy, Quantitative Methodology Program, Institute for Social Research, University of Michigan, Ann Arbor, MI 48106-1248. For more information, contact Rhonda Moats (rmoats@umich.edu). The University of Michigan is committed to affirmative action, equal opportunity and the diversity of its workforce.
If you apply for the job, tell 'em you heard about it here!
Posted by Andrew at 4:34 PM | Comments (0) | TrackBack
A question about causal inference and a question about variable selection
Lingzhou Michael Xue writes in with two questions:
1) Possible to Generalize the Rubin Causal Model? In my undergraduate research project, I have discovered that almost every subject focuses on decoding network-level causality, in fields ranging from Biology and Medicine Design to even Social Science. However, these publications obviously lack solid statistical foundations on the definition of causality and how to do causal analysis. On the other hand, I have been enlightened a lot by the Rubin Causal Model, and also by powerful tools such as instrumental variables and propensity scores. Yet these causal inference methods are limited to one-variable causality. Is it possible to generalize them to deal with interaction causality? From my intuition, it seems pretty difficult to do this. I am still curious about the possibility of generalizing Rubin's Model.
2) Some works on Bayesian Variable Selection? Recently, we have witnessed fruitful and interesting research on variable selection, which has even drawn Terence Tao's attention. What is more interesting, most work in this area relies on penalized learning, i.e., takes the frequentist perspective. I believe that the Bayesian approach might bring us a more reasonable framework, just as it always has. Could you kindly show me some works on Bayesian variable selection?
My reply:
1. Rubin's causal model allows for interactions. Interactions between treatment and pre-treatment predictors fit in automatically with no complications at all, except that the goal is no longer to estimate an average treatment effect, you now want to estimate the effect conditional on predictors. If you have interactions between different treatment factors, it just complexifies the potential outcomes. I agree, though, that when the treatment is continuous, the potential outcomes need to be modeled, which brings Rubin's framework closer to classical regression and instrumental variables.
2. I'm not a big fan of variable selection. I prefer continuous model expansion: keeping all the variables in the model and controlling them with an informative prior distribution or hierarchical model.
Posted by Andrew at 1:54 AM | Comments (1) | TrackBack
A sighting of the unicorn
Richard Barker sent in this photograph and the following note:

Matt just pointed me to your article: You can load a die but you can't bias a coin. You might be interested in the attached, a photo of a bent NZ 50c coin that I had pressed in the Physics lab here a few years ago because I got bored using flat coins in classroom demonstrations where everyone knows what Pr(heads) is. Fortunately that particular style of coin is no longer legal tender so I am unlikely to be prosecuted for defacing her Majesty's coinage.
In discussing this with Matt this afternoon we conjured up a counterexample where the coin is completely pressed into a sphere. Then it has Pr(heads) = 1. If the pressing is not quite complete it will be a little less than one, so we claim the statement in the title of your article is not true. We think you can bias a coin.
At about 300 flips it looks as though Pr(Heads) is about 0.55.
When I first bent the coin I did some experiments letting the coin land on the ground. On soft carpet it was not obviously biased but it was on a hard surface. On hard surfaces, most of the time it bounces up and starts spinning on its edge. When this happens it then always lands heads up.
Yeah, sure, he's right. We were thinking of weighting a coin, but if you bend it enough, then it is no longer set to land "heads" for half of its rotation. And bouncing, sure, then anything can happen. We were always assuming you catch it in the air!
Finally, we were addressing the concept of the "biased coin," which, by analogy to the "loaded die," looks just like a regular coin but actually has probabilities other than 50/50 when caught in the air. In that sense, the bent coin is not a full counterexample since it clearly looks funny.
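As a rough check on how much 300 flips can tell us, here is a small Python sketch. It assumes independent flips and takes the reported 300 and 0.55 at face value; the normal approximation and the two-standard-error interval are assumptions for illustration, not anything from Richard's note.

```python
import math

n_flips = 300   # approximate number of flips reported in the note
p_hat = 0.55    # approximate observed proportion of heads

# Standard error of a proportion, assuming independent flips
se = math.sqrt(p_hat * (1 - p_hat) / n_flips)

# Rough interval: estimate plus or minus two standard errors
lower, upper = p_hat - 2 * se, p_hat + 2 * se

# Distance of the estimate from a fair coin, in standard errors
z = (p_hat - 0.5) / se

print(f"se = {se:.3f}, interval = ({lower:.2f}, {upper:.2f}), z = {z:.1f}")
```

Under these assumptions the estimate sits a bit under two standard errors from 0.5, so the interval just barely covers a fair coin.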
Posted by Andrew at 12:38 AM | Comments (0) | TrackBack
January 14, 2008
What to learn in your statistics Ph.D. program?
Cosma Shalizi (of the CMU statistics dept) and I had an exchange about the role of measure theory in the statistics Ph.D. program. I have to admit I'm not quite sure what "measure theory" is but I think it's some sort of theoretical version of calculus of real variables. I had commented that we're never sure what to do with our qualifying exam, and Cosma wrote,
I think we have a pretty good measure-theoretic probability course, and I wish more of our students went on to take the non-required sequel on stochastic processes (because that's the one I usually teach). I do think it's important for statisticians to understand that material, but I also think it's actually easier for us to teach someone how a martingale works than it is to teach them to be interested in scientific questions and to not get a freaked out, "but what do I calculate?" response when confronted with an open research problem. Here it's been suggested that we replace our qualifying exams with having the student prepare a written review of some reasonably-live topic from the literature and take an oral exam on it, which would be more work for us but come a lot closer to testing what the students actually need to know.
I replied,
I agree that it's hard to teach how to think like a scientist, or whatever. But I don't think of the alternatives as "measure theory vs. how-to-think-like-a-scientist" or even "measure theory vs. statistics". I think of it as "measure theory vs. economics" or "measure theory vs. CS" or "measure theory vs. poli sci" or whatever. That is, sure, all other things being equal, it's better to know measure theory (or so I assume, not ever having really learned it myself, which didn't stop me from proving 2 published theorems, one of which is actually true). But, all other things being equal, it's better to know economics (by this, I mean economics, not necessarily econometrics), and all other things being equal, it's better to know how to program. Etc. I don't see why measure theory gets to be the one non-statistical topic that gets privileged as being so required that you get kicked out of the program if you can't do it.
Cosma then shot back with:
I also don't think of the alternatives as "measure theory vs. how-to-think-like-a-scientist" or even "measure theory vs. statistics". My feeling --- I haven't, sadly, done a proper experiment! --- is that it's easier to, say, take someone whose math background is shaky and teach them how a generating-class argument works in probability than it is to take someone who is very good at doing math homework problems and teach them the skills and attitudes of independent research.
You say, "I think of it as "measure theory vs. economics" or "measure theory vs. CS" or "measure theory vs. poli sci" or whatever." I'm more ambitious; I want our students to learn measure-theoretic probability, and scientific programming, and whatever substantive field they need for doing their research, and, of course, statistical theory and methods and data analysis. Because I honestly think that if someone is going to engage in building stochastic models for parts of the world, they really ought to understand how probability _works_, and that is why measure theory is important, rather than for its own sake. (I admit to some background bias towards the probabilist's view of the world.) At the same time it seems to me a shame (to use no stronger word) if someone, in this day and age, gets a ph.d. in statistics and doesn't know how to program beyond patching together scripts in R.
P.S. I think measure theory should be part of the Ph.D. statistics curriculum but I don't think it should be a required part of the curriculum. Not unless other important topics such as experimental design, sample surveys, statistical computing and graphics, stochastic modeling, etc etc are required also. It's sad to think of someone getting a Ph.D. in statistics and not knowing how to work with mixed discrete/continuous variables (see Nicolas's comment below) but it seems equally sad to see Ph.D.'s who don't know what Anova is, who don't know the basic principles of experimental design (for example, that it's more effective to double the effect size than to double the sample size), who don't know how to analyze a cluster sample, and so forth.
Unfortunately, not all students can do everything, and any program only gets some finite number of applicants. If you restrict your pool to those who want to do (or can put up with) measure theory, you might very well lose some who could be excellent statistical researchers. It would be sort of like not admitting Shaq to your basketball program because he can't shoot free throws.
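The parenthetical about effect size versus sample size can be checked with a line of arithmetic. Here is a minimal Python sketch, assuming a simple comparison of two equal-sized groups with a common standard deviation; the function name z_ratio and all the numbers are hypothetical, chosen only for illustration.

```python
import math

def z_ratio(effect, sigma, n_per_group):
    """Effect divided by its standard error for a two-group comparison."""
    se = sigma * math.sqrt(2.0 / n_per_group)
    return effect / se

base = z_ratio(effect=1.0, sigma=5.0, n_per_group=50)
double_effect = z_ratio(effect=2.0, sigma=5.0, n_per_group=50)
double_n = z_ratio(effect=1.0, sigma=5.0, n_per_group=100)

print(f"baseline z           = {base:.2f}")
print(f"double the effect    = {double_effect:.2f}")
print(f"double the sample    = {double_n:.2f}")
```

Doubling the effect doubles the ratio of estimate to standard error, while doubling the sample size multiplies it only by the square root of 2.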
Posted by Andrew at 12:33 AM | Comments (16) | TrackBack
January 10, 2008
Record your sleep with Dream Recorder?
Aleks pointed me to this website:
Dream Recorder is the ideal companion of your nights, allowing you to understand better this third of our life spent in bed. Dream recollection, sleep hygiene, curiosity, you will find your own reasons for using this software of a new kind. Nights after nights, Dream Recorder keeps records of your sleep profiles. It provides statistics and give you the possibility to annotate your dream records with notes or keywords. . . .

Dream Recorder uses the difference between successive reconstructed images for computing the quantity of motion (see image on the right). Quantity of motions are reflected by the colored bar graph. High peaks mean motions. Very low peaks are just in the detection noise base level. Dream periods are lit up by spotlights. Normal sleeps are represented by the dark blue shades. Deep sleeps have no lights nor shading. Night events are displayed under the timeline, here a dream feedback followed by a voice recording.
Seth would love this (I assume).
Posted by Andrew at 12:07 AM | Comments (1) | TrackBack
January 4, 2008
Free advice is worth what you pay for it
Someone writes,
I am currently looking at different grad school stats programs. I have a BA in Psychology (U. Southern California), but I am really interested in stats. I loved my stats classes in college but I was a bit of a naive wallflower back then and did not think to change course and pursue stats more, even though it was the favorite part of my degree. After I graduated, I worked as a research assistant where my PI quickly learned that I was happiest talking about and running the stats for her various projects. I worked with her for close to two years, then moved and now I'm a public school science teacher.
I know I have a passion for the subject, my problem is that whenever I look at the requirements for stats grad programs, I see that I am severely lacking in the math requirements. In your opinion, am I wasting my time looking at these programs when all I have is a Psych degree and a passion to learn more? Would it even be possible for an admissions board to look past my lack of math classes in college, taking into account my research experience, and possibly admit me on the condition that I complete the prerequisites? What course of action do you think would be best for my situation?
My reply:
1. I want to be encouraging, because I think that the field of statistics should have more people who want to do statistics, as opposed to people who studied math and aren't really sure what to do next.
2. I expect you should be able to get into a good program, if they can be convinced that you can learn the math that's necessary. I think a lot of places take GRE scores pretty seriously, so if your GRE's aren't so great, you might need to take some math and some probability theory to make the case that you can do it. The research experience should help. I also think that being able to program computers is as important as being able to do math, but most stat programs probably don't agree with me there.
3. Another approach is to get a PhD in an applied field such as psychology or education or political science and focus on statistics (i.e., "methods"). The trick here is to go to a place where you can work with someone who's doing good interdisciplinary work so that you don't end up just doing out-of-date statistics. That's true in a regular statistics department also. Lots of really bad PhD theses get done (actually, I've advised some of these myself, but I'm trying to improve).
4. Perhaps someone will comment with better advice?
Posted by Andrew at 2:32 AM | Comments (12) | TrackBack
January 1, 2008
NASA data released for analysis
Via a Slashdot entry, I heard that NASA has released data from a survey they did from 2001 to 2004. They surveyed pilots, and apparently a lot of the responses did not reflect well on NASA, so the data was going to be destroyed. They changed their minds, and now the data has been posted for analysis - no one has really done a great job analyzing the data yet, so if anyone is interested... For the data, see the link here.
Posted by Juli at 2:51 PM | Comments (5) | TrackBack
December 11, 2007
Why use count-data models (and don't talk to me about BLUE)
Someone who wishes to remain anonymous writes,
I have a question for you. Count models are to be used when you have outcome variables that are non-negative integers. Yet, OLS is BLUE even with count data, right? I don't think we make any assumptions with least squares about the nature of the data, only about the sampling, exogenous regressors, rank K, etc. So why technically should we use count if OLS is BLUE even with count data?
My reply:
1. I don't really care if something is BLUE (best linear unbiased estimator) because (a) why privilege linearity, and (b) unbiasedness ain't so great either. See Bayesian Data Analysis for discussion of this point (look at "unbiased" in the index), also this paper for a more recent view.
2. Least squares is fine with count data; it's even usually ok with binary data. (This is commonly known, and I'm sure it's been written in various places but I don't happen to know where.) For prediction, though, you probably want something that predicts on the scale of the data, which would mean discrete predictions for count data. Also, a logarithmic link makes sense in a lot of applications (that is, log E(y) is a linear function of x), and you can't take the log of an observed count of 0, which is a good reason to use a count-data model rather than least squares on log(y).
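To make the comparison concrete, here is a minimal Python sketch on simulated counts. It fits a straight line by least squares and a Poisson regression with a log link (log E(y) linear in x) to the same fake data. Everything here is an assumption for illustration: the coefficients true_a and true_b, the range of x, the helper draw_poisson, and the hand-rolled Newton iteration, which stands in for whatever regression software you would normally use.

```python
import math
import random

random.seed(1)

def draw_poisson(mean):
    """Draw one Poisson variate (Knuth's multiplication method; fine for small means)."""
    limit, k, prod = math.exp(-mean), 0, random.random()
    while prod > limit:
        k += 1
        prod *= random.random()
    return k

n = 500
true_a, true_b = 0.5, 0.8  # hypothetical coefficients on the log scale
x = [random.uniform(-2, 2) for _ in range(n)]
y = [draw_poisson(math.exp(true_a + true_b * xi)) for xi in x]

# Least squares fit of y on x
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
ols_slope = sxy / sxx
ols_intercept = y_bar - ols_slope * x_bar

# Poisson regression with log link, fit by Newton's method on the log-likelihood
a, b = math.log(y_bar), 0.0
for _ in range(25):
    mu = [math.exp(a + b * xi) for xi in x]
    g0 = sum(yi - mi for yi, mi in zip(y, mu))               # gradient, intercept
    g1 = sum((yi - mi) * xi for xi, yi, mi in zip(x, y, mu))  # gradient, slope
    h00 = sum(mu)                                            # information matrix entries
    h01 = sum(mi * xi for xi, mi in zip(x, mu))
    h11 = sum(mi * xi * xi for xi, mi in zip(x, mu))
    det = h00 * h11 - h01 * h01
    a += (h11 * g0 - h01 * g1) / det
    b += (h00 * g1 - h01 * g0) / det

ols_at_low_x = ols_intercept + ols_slope * (-2)
poisson_at_low_x = math.exp(a + b * (-2))
print(f"Poisson fit: a = {a:.2f}, b = {b:.2f}")
print(f"prediction at x = -2: least squares {ols_at_low_x:.2f}, Poisson {poisson_at_low_x:.2f}")
```

In this setup the straight line predicts a negative count at the low end of x, while the log-link model stays positive by construction. That is a statement about prediction on the scale of the data, not about BLUE.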
Posted by Andrew at 9:46 AM | Comments (6) | TrackBack
Question about principal component analysis
Andrew Garvin writes,
The CFNAI is the first principal component of 85 macroeconomic series - it is supposed to function as a measure of economic activity. My contention is that the first principal component of a diverse set of series will systematically overweight a subset of the original series, since the series with the highest weightings are the ones that explain the most variance. As an extreme example, say we do PCA on 100 series, where 99 of them are identical - then the first PC will just be the series that is replicated 99 times.
This is not necessarily a bad thing, but consider the CFNAI - most of the highly weighted series are from a) industrial production numbers, or b) non-farm payroll numbers. On the other hand, the series with relatively small weightings are very diverse. As I see it then, using the first principal component is not so much a measure of 'economic activity', but rather, 'economic activity as primarily measured by industrial production and NFP'. Now, if I thought a priori that industrial production and NFP explained most of what was happening in economic activity, then this would not be such a bad outcome. However, it seems to me that the whole point of using PCA instead of an equal-weighting is that we are naive about the true weightings of the various series composing our indicator - and so PCA conveniently gives us the most appropriate weightings. So, to me, PCA only works as a weighting strategy if we already have some idea of what the weights should be, which defeats the purpose of using PCA in the first place.
My question then is: Do you see this as a problem? a) If so, would you mind suggesting ways to deal with this problem, or perhaps pointing me to some reading material that might discuss this issue? b) If not, I would be curious to know what the flaw is in my argument above.
My reply: Hey--you've hit on something I know nothing about! My usual inclination is not to use principal components but rather to just take simple averages or total scores (see pages 69 and 295 of my book with Hill), on the theory that this will work about as well as anything else. But in that case you certainly have to be careful about keeping too many duplicate measures of the same thing. My impression was that principal component analysis gets around that problem, actually.
My more general advice is to check your reasoning by simulating some fake data under various hypothesized models (including examples where you have several near-duplicate series) and then see what the statistical procedure does.
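Here is a minimal Python sketch of that kind of fake-data check, under assumptions that are entirely made up for illustration: five near-duplicate series built from one common factor, three unrelated series, and the first principal component taken from the correlation matrix. To keep it self-contained it uses plain power iteration for the leading eigenvector instead of a library PCA routine.

```python
import math
import random

random.seed(2)
n_obs, n_dup, n_indep = 300, 5, 3

# Fake data: n_dup noisy copies of one common factor plus n_indep unrelated series
factor = [random.gauss(0, 1) for _ in range(n_obs)]
series = [[f + random.gauss(0, 0.1) for f in factor] for _ in range(n_dup)]
series += [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_indep)]

def standardize(values):
    """Rescale a series to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

z = [standardize(s) for s in series]
k = len(z)
corr = [[sum(u * v for u, v in zip(z[i], z[j])) / n_obs for j in range(k)]
        for i in range(k)]

# Power iteration for the leading eigenvector of the correlation matrix
loadings = [1.0] * k
for _ in range(200):
    new = [sum(corr[i][j] * loadings[j] for j in range(k)) for i in range(k)]
    norm = math.sqrt(sum(v * v for v in new))
    loadings = [v / norm for v in new]

for name, w in zip(["dup"] * n_dup + ["indep"] * n_indep, loadings):
    print(f"{name:5s} loading = {w:+.3f}")
```

With this setup the first component should put nearly all of its weight on the near-duplicate block. Changing n_dup, n_indep, or the noise level is a quick way to see how sensitive the weights are to duplication.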
Posted by Andrew at 12:49 AM | Comments (5) | TrackBack
December 7, 2007
Statistical consulting mini-symposium next month
Posted by Andrew at 4:33 PM | Comments (5) | TrackBack
Questions about transformations
Manuel Spínola writes,
Many people (or at least some) in ecology question the validity of transforming a variable to perform a statistical test. I guess they are worried that the transformed variable is not what they intended to measure. Is that a fair criticism? Is it valid to transform a variable, let's say, with a logarithmic transformation?
Does it mean that if I found a meaningful (biologically significant) relationship between a transformed explanatory variable and a response variable, that relationship exists but was not "visible" before the variable transformation? I am not clear on how to explain a relationship based on a transformed variable to, let's say, a straw man. For example, if log(forest cover) has an influence on a bird species' abundance, what should I say to a manager regarding the effect of forest cover on the species' abundance?
My reply: This is discussed in more detail in chapter 4 of our recent book. The short answer is that predictive relationships can be nonlinear and can have interactions. Once you consider that things might be nonlinear, the transformation is arbitrary. But in many settings transformations can allow for simpler models, for example a log transformation changing a multiplicative to a linear relation. All is fine as long as you don't take the assumptions too seriously. To specifically answer your question: yes, use a log transformation, then explain the predictive relation by making a graph of y vs. x.
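Here is a small Python sketch of the forest-cover example, using invented numbers. It assumes abundance is linear in log(forest cover), fits that by least squares, and then reports predictions on the original scale of forest cover, which is the tabular version of graphing y vs. x. The coefficients true_a and true_b, the noise level, and the range of cover are all hypothetical.

```python
import math
import random

random.seed(3)
true_a, true_b = 2.0, 3.0   # hypothetical intercept and slope on log(cover)
cover = [random.uniform(5, 95) for _ in range(200)]   # percent forest cover
abundance = [true_a + true_b * math.log(c) + random.gauss(0, 1) for c in cover]

# Least squares of abundance on log(cover)
log_cover = [math.log(c) for c in cover]
n = len(cover)
lx_bar, y_bar = sum(log_cover) / n, sum(abundance) / n
sxy = sum((lx - lx_bar) * (y - y_bar) for lx, y in zip(log_cover, abundance))
sxx = sum((lx - lx_bar) ** 2 for lx in log_cover)
b_hat = sxy / sxx
a_hat = y_bar - b_hat * lx_bar

def predict(c):
    """Predicted abundance at forest cover c (in percent)."""
    return a_hat + b_hat * math.log(c)

for c in (10, 20, 40, 80):
    print(f"cover {c:2d}% -> predicted abundance {predict(c):.2f}")
```

In a model of this form, each doubling of forest cover adds the same amount, b_hat * log(2), to predicted abundance. That is one plain-language way to describe the fitted relation to a manager.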
Posted by Andrew at 11:54 AM | Comments (1) | TrackBack
December 3, 2007
Exploratory data analysis course
Aleks noticed this interesting-looking course:
Course Contents
Predictive Analytics and Exploratory Data Mining
* the relationship between predictive analytics and exploratory data mining
* the role of graphics in exploratory analysis
* complexity in a PowerPoint world
* the analyst's dilemma
Working with Unstructured Data
* data streams versus structured data
* social network analysis as a solution to unstructured problems
* statistical mechanics of network analyses
* predicting with a network
* complex networks versus reductionism
Exploratory Data Mining and Predictive Models
* exploratory data mining success
* predictive modeling methods
* logistic regression
* decision trees
* neural networks
* the truth about neural networks
* comparing and contrasting predictive modeling methods
* model structure and impact on exploratory results
* graphical review of model results
* multi-dimensional graphics
Exploratory Predictive Modeling
* initial data screening
* elements of an exploratory script
* developing complex predictive models for exploratory efforts
* identifying important variables
* analyzing variables, domains, and clusters
* graphical review of models and data
Exploratory Findings
* extracting new hypotheses (exploratory findings) from the predictive model
* building confidence with the exploratory findings
* recognizing and overcoming impediments to acceptance by the target audience
Remind me again why we teach classes on boring topics like "categorical data analysis" . . .
Posted by Andrew at 11:48 AM | Comments (0) | TrackBack
November 26, 2007
What to teach if you only have three weeks, and suggestions for the ten most interesting and accessible quantitative papers in political science
Frank Di Traglia writes,
I'm going to be teaching a three-week, introductory statistics course for local high school students next summer, and wanted to ask for your advice. I have two questions in particular.
First, I doubt that three weeks will be enough time to teach the usual Statistics 101 course. If you had only three weeks, what would you skip and what would you emphasize?
Second, since next year is an election year, I thought it might be fun to build the course around substantive examples from political science. Although I've enjoyed many of your poly-sci papers, my own background is not in this area (I did my masters in Statistics, and am currently pursuing a PhD in Economics). What would you consider to be the ten most interesting and accessible quantitative papers in this field?
My reply:
1. It's gotta depend on how many hours per week you have! To consider the larger question, I'm unsatisfied with the usual intro stat course (including what I've taught) because it comes across as a disconnected set of topics. As I wrote here about the "sampling distribution of the sample mean":
The hardest thing to teach in any introductory statistics course is the sampling distribution of the sample mean, a topic that is at the center of the typical intro-stat-class-for-nonmajors. All of probability theory builds up to it, and then this sample mean is used over and over again for inferences for averages, paired and unpaired differences, and regression. This is the standard sequence, as in the books by Moore and McCabe, and De Veaux et al.
The trouble is, most students don't understand it. I'm not talking about proving the law of large numbers or central limit theorem--these classes barely use algebra and certainly don't attempt rigorous proofs. No, I'm talking about the derivations that lead to the sample mean of an average of independent, identical measurements having a distribution with mean equal to the population mean, and sd equal to the sd of an individual measurement, divided by the square root of n.
This is key, but students typically don't understand the derivation, don't see the point of the result, and can't understand it when it gets applied to examples.
What to do about this? I've tried teaching it really carefully, devoting more time to it, etc.--nothing works. So here's my proposed solution: de-emphasize it. I'll still teach the sampling distribution of the sample mean, but now just as one of many topics, rather than the central topic of the course. In particular, I will not treat statistical inference for averages, differences, etc., as special cases or applications of the general idea of the sampling distribution of the sample mean. Instead, I'll teach each inferential topic on its own, with its own formula and derivation. Of course, they mostly won't follow the derivations, but then at least if they're stuck on one of them, it won't muck up their understanding of everything else.
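For readers who want to see the result rather than derive it, here is a short Python simulation sketch of the claim in the quoted passage: the sample mean of n independent measurements has mean equal to the population mean and sd equal to the individual sd divided by the square root of n. The normal population and the particular values of pop_mean, pop_sd, n, and n_sims are assumptions picked for illustration.

```python
import math
import random
import statistics

random.seed(4)
pop_mean, pop_sd = 50.0, 10.0   # hypothetical population mean and sd
n, n_sims = 25, 4000            # sample size and number of simulated samples

# Draw many samples of size n and record each sample mean
sample_means = []
for _ in range(n_sims):
    sample = [random.gauss(pop_mean, pop_sd) for _ in range(n)]
    sample_means.append(sum(sample) / n)

mean_of_means = statistics.mean(sample_means)
sd_of_means = statistics.pstdev(sample_means)
theoretical_sd = pop_sd / math.sqrt(n)

print(f"mean of sample means = {mean_of_means:.2f} (population mean {pop_mean})")
print(f"sd of sample means   = {sd_of_means:.2f} (sigma / sqrt(n) = {theoretical_sd:.2f})")
```

Changing n and rerunning gives a feel for the square-root rule without any algebra.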
Given these thoughts, my first suggestion would be for you to indeed focus on one particular thing, for example public opinion, and focus your course on that. Have the students download raw data from polls and do some analyses (maybe using JMP-in). This is what Bob Shapiro does when he teaches intro stats here.
2. If you'd rather do something closer to standard statistics, I'd recommend focusing on sampling, experimentation, and observational studies. You can do one week of each--in each week, they first do an in-class demo (a survey in week 1, an experiment in week 2, an obs study in week 3), then they together do something larger. I have some examples in my book with Deb, but I can't say I've worked out all the details of such a course. It's easier to talk about it than to do it.
3. The ten most interesting and accessible quantitative papers in political science? That's a good question. Of my own papers, these are the most accessible, I think: Why are American Presidential election campaign polls so variable when votes are so predictable?, Voting, fairness, and political representation, Voting as a rational choice: why and how people vote to improve the well-being of others, A catch-22 in assigning primary delegates, Rich state, poor state, red state, blue state: What's the matter with Connecticut? I also like the paper, Methodology as ideology: mathematical modeling of trench warfare, even though it's not really statistical.
I wouldn't include all of these in a top-ten list, but I'd include at least one! Beyond this, perhaps the blog readers have some suggestions?
Posted by Andrew at 12:49 AM | Comments (2) | TrackBack
November 21, 2007
Quantitative Methods for Negotiating Trades in Pro Sports
I recently had some thoughts about negotiating trades in the NBA. Specifically, I heard that the Lakers and the Bulls were having daily discussions about a trade involving Kobe Bryant, for at least a week; that seemed like a long time to me. Was this week-long series of conversations productive and/or necessary? Are there no quantitative methods for structuring trade negotiations that could have been used to save these teams some time and energy? I've outlined a potential solution, which can probably be improved using methods from the literature on (1) statistical models for rankings and (2) bargaining and negotiating.
My idea is this: First, construct a list of all possible trades to be made between two teams, which would involve closely examining the entire roster of each team, accounting for salary cap restrictions that preclude certain trades, and including or excluding specific key players. Then, instruct each team to rank these possible trades from the most desirable trade to the least desirable trade. These two sets of rankings will naturally be each other's approximate inverse, because a very good trade for one team will most likely be a very bad trade for the other team (Kobe Bryant for Chris Duhon, anyone?). Lastly, the negotiations consist of each team taking turns eliminating the lowest-ranked trade from their respective lists, until the two lists have only one trade in common. If this trade is rejected by either team, then - and here comes the part that I think could be powerful - no trade can be made between these two teams until at least one of them changes their rankings. It is a framework that could be used to, at the very least, save two teams some time when negotiating a trade.
The only similar setting that I can think of is when opposing lawyers eliminate potential jurors from a jury pool (they call these "peremptory challenges"). Does anyone know of another situation in which opposing agents rank items and eventually must agree on a compromise? Maybe there is something in the bargaining literature.
The statistical question of interest is this: What is the percentile of the rank (for each team) of the jointly optimal trade? (That is, the last trade that remains on both lists after eliminations are made). It would be nice if, in the pro sports example, both teams could improve significantly. This would probably only happen in an "apples for oranges" type of trade. Some preliminary work in the Lakers-Bulls example shows that the jointly optimal trade is in the 47th percentile for the Lakers and the 48th percentile for the Bulls - not too great for either team. A bunch of assumptions were made in this example, though, so it's probably not too informative right now. If a probability model is used to generate the two sets of rankings, then the pair of percentiles of the jointly optimal trade, (p_1,p_2), would be a random variable of interest.
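Here is a minimal Python sketch of one reading of the elimination procedure, together with a toy probability model for the rankings. The assumptions are mine and only illustrative: each team keeps its own list and drops its own lowest-ranked remaining trade on its turn, the first team moves first, and the two teams' valuations are noisy mirror images of each other. The function names negotiate and percentile are hypothetical.

```python
import random

def negotiate(rank_a, rank_b):
    """rank_a, rank_b: trade ids ordered best-to-worst for each team.
    Teams alternate dropping their own worst remaining trade until exactly
    one trade is left on both lists. Returns that trade id."""
    list_a, list_b = list(rank_a), list(rank_b)
    turn = 0
    while True:
        common = set(list_a) & set(list_b)
        if len(common) == 1:
            return common.pop()
        # The team whose turn it is drops its current lowest-ranked trade
        (list_a if turn % 2 == 0 else list_b).pop()
        turn += 1

def percentile(ranking, trade):
    """Percent of the other trades that this team ranks below the given trade."""
    return 100.0 * (len(ranking) - 1 - ranking.index(trade)) / (len(ranking) - 1)

# Toy probability model: what is good for one team is mostly bad for the other
random.seed(5)
m = 200
base = [random.gauss(0, 1) for _ in range(m)]
value_a = [v + random.gauss(0, 0.5) for v in base]
value_b = [-v + random.gauss(0, 0.5) for v in base]
rank_a = sorted(range(m), key=lambda t: value_a[t], reverse=True)
rank_b = sorted(range(m), key=lambda t: value_b[t], reverse=True)

deal = negotiate(rank_a, rank_b)
p_1, p_2 = percentile(rank_a, deal), percentile(rank_b, deal)
print(f"surviving trade {deal}: percentiles (p_1, p_2) = ({p_1:.0f}, {p_2:.0f})")
```

Rerunning with different seeds, or with less mirrored valuations, gives a simulated distribution of (p_1, p_2). With exactly inverse rankings of five trades, this version settles on the middle trade for both teams.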
Posted by Kenny at 8:00 AM | Comments (7) | TrackBack
November 16, 2007
Political Neuroscience
A piece by Brandon Keim in Wired points out some issues in the fMRI brain-politics study on reactions to presidential candidates discussed in a recent NYT op-ed. For example,
Let's look closer, though, at the response to Edwards. When looking at still pictures of him, "subjects who had rated him low on the thermometer scale showed activity in the insula, an area associated with disgust and other negative feelings." How many people started out with a low regard for Edwards? We aren't told. Maybe it was everybody, in which case the findings might conceivably be extrapolated to the swing voter population of the United States. But maybe it was just five or ten voters, of whom one or two had such strong feelings of disgust that it skewed the average. What about the photographs? Was he sweating and caught in flashbulb glare that would make anyone's picture look disgusting? How did the disgust felt towards Edwards compare to that felt towards other candidates? How well do scientists understand the insula's role in disgust -- better, I hope, than they understand the Romney-activated amygdala, which is indeed associated with anxiety, but also with reward and general feelings of arousal?
(And don't forget "Baby-faced politicians lose" on this blog.)
Posted by jeff at 3:46 PM | Comments (1) | TrackBack
November 12, 2007
Bayesian Adjusted Plus/Minus Statistics in Basketball
Josh Menke writes,
I saw that you had commented on adjusted plus/minus statistics for basketball in a few of your blog entries [see also here]. I've been working on a Bayesian version of the model used by Dan Rosenbaum, and wondered if I could ask you a question.
I wanted to be able to update the posterior after each sequence of game play between substitutions, so I decided to use the standard exact inference update for a normal-normal Bayesian linear regression model. If you're familiar with Chris Bishop's recent book, Pattern Recognition and Machine Learning, the updating equations for this are 3.50 and 3.51 on page 153. I felt OK with using a normal prior based on some past research I did in multiplayer game match-making with Shane Reese at BYU. The tricky part comes with using exact inference for updating the posterior. The updating method is very sensitive to the prior covariance matrix. I start with a diagonal covariance matrix, and if the initial player variances I choose are too high, the +/- estimates can go to infinity after several updates. I thought this was related to the data sparsity causing an ill-conditioned update matrix, but I thought I'd ask in case you'd had any experience with this type of problem.
Have you dealt with an issue like this before? If I set the prior variances low enough, I get reasonable results, and the ordering of the final ranking is fairly robust to changes in the prior. It's just the estimation process itself that doesn't "feel" as robust as I'd prefer, so I don't know that I trust the adjusted values (final coefficients) to be meaningful.
I don't think I can use MCMC in this situation either because trying to get 100,000 samples using 38,000+ data points and 400+ parameters feels intractable to me. I could be wrong there as well since I suppose I only need to include the current players in each match-up within the log likelihood. But it would still take quite a bit of time.
It would also be nice to go with the sequential updating version if possible since I could provide adjusted +/- values instantly after each game, if not after each match-up.
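For readers without the book at hand, here is a sketch of the update being described, in the standard conjugate form for a normal prior N(m_0, S_0) on the coefficient vector. The symbols are assumptions for illustration: Phi is the design matrix for the newly observed stints, t is the vector of outcomes, and beta is the noise precision. Check the details against the cited equations.

```latex
\begin{aligned}
\mathbf{S}_N^{-1} &= \mathbf{S}_0^{-1} + \beta\,\boldsymbol{\Phi}^{\mathsf T}\boldsymbol{\Phi},\\
\mathbf{m}_N &= \mathbf{S}_N\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\,\boldsymbol{\Phi}^{\mathsf T}\mathbf{t}\right).
\end{aligned}
```

In sequential use, the posterior (m_N, S_N) from one stint serves as the prior for the next. With a diagonal S_0 whose entries are very large, S_0^{-1} is close to zero and the posterior precision is close to beta Phi^T Phi, which is rank-deficient when few distinct lineups have been observed. That is consistent with the ill-conditioning explanation offered above, though it does not confirm it.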
My reply:
1. I'd try the scaled inverse Wishart prior distribution as described in my book with Hill. This allows the correlations to be estimated from data in a way that still allows you to provide a reasonable amount of information about the scale parameters.
2. I'd go with the estimation procedure that gives reasonable estimates, then do some posterior predictive checks, as described in chapter 6 of Bayesian Data Analysis. (Sorry for always referencing myself; it's just the most accessible reference for me!) This should give you some sense of the aspects of the data that are not captured well by the model.
3. Finally, you can simulate some fake data from your model and check that your inferential procedure gives reasonable estimates. Cook, Rubin, and I discussed a formal way of doing this, but you can probably do it informally and still build some confidence in your method.
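A minimal sketch of the informal fake-data check in point 3, using a toy normal-mean model as a stand-in for the plus/minus regression. The model, the function names, and all numerical settings are assumptions for illustration. The idea is to draw a "true" parameter from the prior, simulate data, compute the posterior, and record how often the 95% posterior interval contains the truth.

```python
import random
from statistics import NormalDist


def posterior_normal_mean(ys, sigma, prior_mean, prior_sd):
    """Conjugate posterior for theta: y_i ~ N(theta, sigma^2), theta ~ N(prior_mean, prior_sd^2)."""
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(ys) / sigma ** 2
    post_prec = prior_prec + data_prec
    post_mean = (prior_prec * prior_mean + sum(ys) / sigma ** 2) / post_prec
    return post_mean, post_prec ** -0.5


def coverage(n_sims=2000, n_obs=5, sigma=2.0, prior_mean=0.0, prior_sd=1.0,
             level=0.95, seed=1):
    """Fraction of fake datasets whose posterior interval covers the true theta."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(n_sims):
        theta = rng.gauss(prior_mean, prior_sd)               # "true" value from the prior
        ys = [rng.gauss(theta, sigma) for _ in range(n_obs)]  # fake data given theta
        m, s = posterior_normal_mean(ys, sigma, prior_mean, prior_sd)
        if m - z * s <= theta <= m + z * s:
            hits += 1
    return hits / n_sims


cov95 = coverage()
print(cov95)
```

If the inferential procedure is coded correctly, the coverage should be close to the nominal level. A clear departure suggests a bug or a numerical problem in the updating step.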
Posted by Andrew at 12:23 AM | Comments (2) | TrackBack
November 10, 2007
Survey weighting is a mess
Dave Judkins writes, regarding my Struggles with Survey Weighting and Regression Modeling paper,
I am hoping you might be able to clarify a point in your approach. How does a variable like number of phone lines in the house get used in equation 5? (Given that N.pop and X.pop are not available.) Does your work in Section 3 apply only to X variables with known population distributions?
My reply:
My student and I are working on how to deal with these "non-census variables." The Bayesian answer is that you need to know the N's for the crosstabs of these non-census variables and the census variables. Since only the census variables are known, the relative N's for the non-census variables are unknown: they are random variables, they need a prior distribution, etc. etc. Inference is done on these by making an assumption about selection probability (e.g., that households with multiple phones are twice as likely to be picked, and households with intermittent service are half as likely to be picked, compared to households with one phone line). My conjecture is that if you have a simple flat prior on the unknown multinomial probabilities, this reduces to some sort of inverse-probability-weighting analysis, and that maybe one can do better using a more structured prior (i.e., a hierarchical model). But for now it's all talk and no action from me on this!
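As a rough illustration of the inverse-probability-weighting calculation mentioned in the conjecture (not the Bayesian model itself), the sketch below uses the selection probabilities from the example in the text. The sample counts and the function name are hypothetical.

```python
def ipw_shares(sample_counts, relative_selection_prob):
    """Estimate population shares by weighting each sampled unit by 1 / selection probability."""
    weighted = {k: sample_counts[k] / relative_selection_prob[k] for k in sample_counts}
    total = sum(weighted.values())
    return {k: w / total for k, w in weighted.items()}


# Hypothetical sample counts by phone-service category.
sample_counts = {"multiple_lines": 40, "one_line": 50, "intermittent": 10}

# Relative selection probabilities from the example in the text:
# multiple lines twice as likely, intermittent service half as likely.
relative_selection_prob = {"multiple_lines": 2.0, "one_line": 1.0, "intermittent": 0.5}

shares = ipw_shares(sample_counts, relative_selection_prob)
print(shares)
```

Here the multiple-line households are 40% of the sample but only about 22% of the estimated population, and the intermittent-service households move from 10% to about 22%.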
Dave replied with some information on how they adjust for non-Census variables at Westat and links to this recent paper of his on sample-based raking, work that started around 1987:
Judkins, D., Nadimpalli, V. and Adeshiyan, S. (2005). Replicate control totals. Proceedings of the Section on Survey Research Methods of the American Statistical Association, pp 3167-3171.
Which reminds me of this 2001 paper by Cavan, Jonathan, and myself on poststratifying without population-level information.
Posted by Andrew at 9:16 PM | Comments (0) | TrackBack
November 8, 2007
Those people who go around telling you not to do posterior predictive checks
I started to post this item on posterior predictive checks and then I realized I had already posted it several months ago! Memories (including my own) are short, though, so here it is again:
A researcher writes,
I have made use of the material in Ch. 6 of your Bayesian Data Analysis book to help select among candidate models for inference in risk analysis. In doing so, I have received some criticism from an anonymous reviewer that I don't quite understand, and was wondering if you have perhaps run into this criticism. Here's the setting. I have observable events occurring in time, and I need to choose between a homogeneous Poisson process, and a nonhomogeneous Poisson process, in which the rate is a function of time (e.g., a loglinear model for the rate, which I'll call lambda).
I could use DIC to select between a model with constant lambda and one where the log of lambda is a linear function of time. However, I decided to try to come up with an approach that would appeal to my frequentist friends, who are more familiar with a chi-square test against the null hypothesis of constant lambda. So, following your approach in Ch. 6, I had WinBUGS compute two posterior distributions. The first, which I call the observed chi-square, subtracts the posterior mean (mu[i] = lambda[i]*t[i]) from each observed value, squares this, and divides by the mean. I then add all of these values up, getting a distribution for the total. I then do the same thing, but with draws from the posterior predictive distribution of X. I call this the replicated chi-square statistic.
If my putative model has good predictive validity, it seems that the observed and replicated distributions should have substantial overlap. I called this overlap (calculated with the step function in WinBUGS) a "Bayesian p-value." The model with the larger p-value is a better fit, just like my frequentist friends are used to.
Now to the criticism. An anonymous reviewer suggests this approach is weakened by "using the observed data twice." Well, yes, I do use the observed data to estimate the posterior distribution of mu, and then I use it again to calculate a statistic. However, I don't see how this is a problem, in the sense that empirical Bayes is problematic to some because it uses the data first to estimate a prior distribution, then again to update that prior. I am also not interested in "degrees of freedom" in the usual sense associated with MLEs either.
I am tempted to just write this off as a confused reviewer, but I am not an expert in this area, so I thought I would see if I am missing something. I appreciate any light you can shed on this problem.
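The calculation described in this email can be sketched as follows for the homogeneous model, with a conjugate gamma posterior standing in for the WinBUGS fit. The counts, the exposure times, the Gamma(0.5, 0.0001) prior, and all function names are assumptions for illustration, not taken from the researcher's analysis.

```python
import math
import random


def draw_poisson(rng, mu):
    """Poisson draw by Knuth's multiplication method; adequate for the small means used here."""
    limit = math.exp(-mu)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def chi_square(xs, mus):
    """Sum over observations of (x - mu)^2 / mu."""
    return sum((x - mu) ** 2 / mu for x, mu in zip(xs, mus))


def bayesian_p_value(xs, ts, prior_shape=0.5, prior_rate=1e-4, n_draws=4000, seed=1):
    """Pr(replicated chi-square >= observed chi-square | data) under a constant-rate model."""
    rng = random.Random(seed)
    shape = prior_shape + sum(xs)   # gamma posterior for lambda (conjugate update)
    rate = prior_rate + sum(ts)
    exceed = 0
    for _ in range(n_draws):
        lam = rng.gammavariate(shape, 1.0 / rate)  # gammavariate takes (shape, scale)
        mus = [lam * t for t in ts]
        x_rep = [draw_poisson(rng, mu) for mu in mus]
        if chi_square(x_rep, mus) >= chi_square(xs, mus):
            exceed += 1
    return exceed / n_draws


# Hypothetical counts with an obvious upward trend, one unit of exposure each.
counts = [1, 2, 4, 8, 16, 32]
exposure = [1.0] * len(counts)
p_value = bayesian_p_value(counts, exposure)
print(p_value)
```

With counts that trend this strongly, the constant-rate model yields a p-value near zero: replicated datasets almost never look as dispersed as the observed one.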
My thoughts:
1. My first thought is that the safest choice is the nonhomogeneous process since it includes the homogeneous as a special case (in which the variation in the rate has zero variance over time). This can be framed as a modeling problem in which the variance of the rate is an unknown parameter which must be nonnegative. If you have a particular parametric model (e.g., log(rate(t))=a+b*x(t)+epsilon(t), where epsilon(t) has mean 0 and sd sigma), then the homogeneous model is a special case like a=b=sigma=0.
2. From this perspective, I'd rather just estimate a,b,sigma and see to what extent the data are consistent with a=0, b=0, sigma=0.
3. I agree with you that "using the data twice" is not usually such a big deal. It's an n vs. n-k sort of thing.
4. I'm not thrilled with the approach of picking the model with the larger p-value. There are lots of reasons this might not work so well. I'd prefer (a) fitting the larger model and taking a look at the inferences for a, b, and sigma; and maybe (b) fitting the smaller model and computing a chi-squared statistic to see if this model can be rejected from the data.
Still more . . .
And here's more on the ever-irritating topic of people who go around telling you not to use posterior predictive checks because they "use the data twice." Grrrrr... The posterior predictive distribution is p(y.rep|y). It's fully Bayesian. Period.
Posted by Andrew at 6:51 AM | Comments (4) | TrackBack
November 7, 2007
To our loyal readers . . .
Sorry for all the red-state, blue-state stuff. We'll be giving you more statistics-as-usual and miscellaneous social science soon . . . In the meantime, you can read these papers:
Using redundant parameterizations to fit hierarchical models (with Zaiying Huang, David van Dyk, and John Boscardin; to appear in JCGS)
Weight loss, self-experimentation, and web trials: a conversation (with Seth Roberts; to appear in Chance)
Manipulating and summarizing posterior simulations using random variable objects (with Jouni Kerman; to appear in Statistics and Computing)
Posted by Andrew at 12:30 AM | Comments (1) | TrackBack
November 6, 2007
Statistical challenges in estimating small effects
John Carlin had some comments on my paper with Weakliem:
My immediate reaction is that we won't get people away from these mistakes as long as we talk in terms of "statistical significance" and even power, since these concepts are just too subtle for most people to understand, and they distract from the real issues. Somewhat influenced by others, I spend quite a bit of time eradicating the term "statistical significance" from colleagues' papers. I suspect that as long as the world sees statistical analysis as dividing "findings" into positives and negatives then the nonsense will keep flowing, so an important step in dealing with this is to change the terminology. In your example you seem to be arguing too much on his ground by focussing on the fact that although he data-dredged a significant p-value, your p-value is not significant. (So the ignorant editor or reader may see it as technical squabbling between statisticians rather than being forced to deal with the real issues about precision of estimation or lack of information.)
I agree entirely that the problem is with the framework of effects as true/false, but this is the very framework that "statistical significance" is built around and your article makes that concept very central by continually referring to "what if the effect is not statistically significant?" etc. I think the focus should be on how dangerous it is to overinterpret small studies with vast imprecision, and I'm not sure why this can't be clarified by sticking to the precision (or information) concept. I still haven't looked again at your Type S and Type M but on the face of it wonder if they may just confuse by adding more layers. Statistical significance gets it wrong because it focuses on null hypotheses (usually artificial), but when you say Type S it almost sounds similar in that you are thinking of truth/falsity with respect to the sign, rather than uncertainty about effects...?
My big point in considering Type S errors is to move beyond the idea of hypotheses being true or false (that is, to move beyond the idea of comparisons being exactly zero), but John has a point, that I still have to decide how to think about statistical significance. The problem is that, from the Bayesian perspective, you can simply ignore statistical significance entirely and just make posterior statements like Pr (theta_1 > theta_2 | data) = 0.8 or whatever, but such statements seem silly given that you can easily get impressive-seeming probabilities like 80% by chance.
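To see how easily an 80% posterior probability can arise by chance, here is a small simulation under assumptions that are added for illustration: a flat prior on the difference theta_1 - theta_2, a normal estimate with known standard error, and a true difference of exactly zero. Under these assumptions Pr(theta_1 > theta_2 | data) equals the normal CDF evaluated at estimate/se, which is uniformly distributed, so about 40% of studies should report a probability of at least 0.8 in one direction or the other.

```python
import random
from statistics import NormalDist


def share_of_impressive_posteriors(n_sims=20000, threshold=0.8, seed=1):
    """Fraction of null studies with Pr(theta_1 > theta_2 | data) >= threshold or <= 1 - threshold."""
    rng = random.Random(seed)
    std_normal = NormalDist()
    impressive = 0
    for _ in range(n_sims):
        z = rng.gauss(0.0, 1.0)            # estimate / se when the true difference is zero
        prob_positive = std_normal.cdf(z)  # flat-prior posterior Pr(theta_1 > theta_2 | data)
        if prob_positive >= threshold or prob_positive <= 1.0 - threshold:
            impressive += 1
    return impressive / n_sims


share = share_of_impressive_posteriors()
print(share)
```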
Posted by Andrew at 11:48 AM | Comments (2) | TrackBack
November 1, 2007
A statistician does web analytics
I sometimes play with Google Analytics to see the number of daily visitors on our blog and where they are coming from. The charts of daily visits look a bit like this:

Clearly, there is an upwards trend, but the influence of the day of the week messes everything up. I exported the data into a text file, and typed a line into R:
plot(stl(ts(read.table('visitors'),frequency=7),s.window="periodic"))
The trend component shows what I am really interested in: the trough of summer, followed by a relatively consistent rising trend. Every now and then another site will refer to our blog, temporarily increasing the traffic, and Andrew's cool voting plots are responsible for the latest spike.
Setting the stl function's t.window parameter to 14, 21 or more will smooth the trend a bit more. The model is imperfect because new visitors do come in bursts, but leave more slowly. Perhaps we should do a better Bayesian model for time series decomposition, unless someone else has already done this.
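For readers who want to see what a trend-plus-day-of-week decomposition does, here is a minimal sketch in Python. It uses a classical centered moving average rather than the loess-based procedure behind R's stl, and it runs on synthetic, noise-free data. The function name, the weekday effects, and the trend are all assumptions for illustration.

```python
def decompose_weekly(series, period=7):
    """Classical decomposition: centered moving-average trend plus a mean weekday effect.

    Assumes an odd period so the moving-average window is symmetric.
    """
    half = period // 2
    n = len(series)
    trend = [None] * n  # undefined at the edges, where the window does not fit
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period
    buckets = [[] for _ in range(period)]
    for t in range(half, n - half):
        buckets[t % period].append(series[t] - trend[t])
    seasonal = [sum(b) / len(b) for b in buckets]
    center = sum(seasonal) / period
    seasonal = [s - center for s in seasonal]  # weekday effects constrained to sum to zero
    return trend, seasonal


# Synthetic daily visits: linear growth plus a fixed weekday pattern (sums to zero).
weekday_effect = [-30, 10, 15, 12, 8, 5, -20]
visits = [200 + 0.5 * t + weekday_effect[t % 7] for t in range(56)]

trend, seasonal = decompose_weekly(visits)
print(seasonal)
```

Because each 7-day window contains every weekday exactly once, the weekday effects cancel in the moving average and the trend is recovered cleanly. With real traffic data, bursts of referrals would leak into both components, which is the imperfection noted above.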
Posted by Aleks Jakulin at 3:15 PM | Comments (1) | TrackBack
October 31, 2007
Skepticism about empirical studies?
Nick Firoozye writes,
I [Firoozye] wanted to point your attention to the following podcast by Ian Ayres on Supercrunchers, where he shows himself an enthusiastic (if perhaps a bit naïve) proponent of the statistical method. Entertaining, definitely. One thing though that I thought you might be interested in is Russ Roberts’ (the interviewer's) own skepticism over the econometric method, which I think probably warrants a response. It may be that Roberts’ own view is due to his now-Austrian economics slant (i.e., somewhat anti-formalist approach) or perhaps to the fact that mainstream econometrics is a frequentist pursuit and one might question the honesty of the results as a consequence.
I don't really have much to add here, except that the problem noted by Roberts (it's hard to know whether to believe a statistical study) is even more of a problem with non-statistical empirical studies (i.e., anecdotes). I think Roberts might be overstating the problem because he is focusing on issues where he already had a strong personal opinion even before seeing data analyses. (He mentions the examples of concealed handguns and anti-theft devices on cars.) But there are a lot of areas where we have only weak opinions which can indeed be swayed by data (see here for some examples). These cases are important in their own right and also can serve as benchmarks for the success of statistical analysis, so that we can trust good analyses more when they're applied to tougher problems. This is one way that applied statistics proceeds, by exemplary analyses of problems that might not be hugely important on their own terms but serve as useful templates. Consider, for example, the book by Snedecor and Cochran: it's full of examples on agricultural field trials. Sure, these are important, but these methods have been useful in so many other fields. This is a great example, actually: Snedecor and his colleagues worked on agricultural trials because they cared about the results--these were not "toy examples" or thought experiments--and the resulting methods endured.
Posted by Andrew at 7:56 AM | Comments (1) | TrackBack
October 29, 2007
Distinguishing association from causation
I was pointed to Distinguishing Association from Causation: A Background for Journalists (there is also a PDF version). Here is my summary of their executive summary:
- Scientific studies that show an association between a factor and a health effect do not necessarily imply that the factor causes the health effect.
- Randomized trials are studies in which human volunteers are randomly assigned to receive either the agent being studied or an inactive placebo, usually under double-blind conditions.
- The findings of animal experiments may not be directly applicable to the human situation because of genetic, anatomic, and physiologic differences between species and/or because of the use of unrealistically high doses.
- In vitro experiments are useful for defining and isolating biologic mechanisms but are not directly applicable to humans.
- The findings from observational epidemiologic studies are directly applicable to humans, but the associations detected in such studies are not necessarily causal.
- Useful, time-tested criteria for determining whether an association is causal include:
- Temporality. For an association to be causal, the cause must precede the effect.
- Strength. Scientists can be more confident in the causality of strong associations than weak ones.
- Dose-response. Responses that increase in frequency as exposure increases are more convincingly supportive of causality than those that do not show this pattern.
- Consistency. Relationships that are repeatedly observed by different investigators, in different places, circumstances, and times, are more likely to be causal.
- Biological plausibility. Associations that are consistent with the scientific understanding of the biology of the disease or health effect under investigation are more likely to be causal.
- Studies that include appropriate statistical analysis and that have been published in peer-reviewed journals carry greater weight than those that lack statistical analysis and/or have been announced in other ways.
- Claims of causation should never be made lightly.
But all this isn't about causation vs association, it's about better studies or worse studies. Association and causation are not binary categories. Instead, there is a continuum from simple models on observational data (correlation between two variables), through more sophisticated models on observational data that include covariates (regression, structural equation models), through yet more sophisticated models on observational data that take sample selection bias into consideration (Rubin's propensity score approach), to often simple models on controlled data (randomized experiments). But the mysterious causal "truth" is still out there. If one talks to philosophers these days, they're not even happy with the notion of causality as being powerful enough as a model of reality.
In the past, I've often unfairly complained about studies after having read misleading journalistic reports, so this report is a timely one. But because the report was paid for by large pharma corporations, people may wonder whether there is bias or some sort of agenda in it.
My quick impression is that they're promoting the best practices in statistical methodology that all these companies subscribe to. But there could be greater use of cheaper observational studies with better modeling (such as employing the propensity score approach, or even just better regression modeling) compared to expensive randomized experiments, and society might be better off as a result. Moreover, there is the issue of statistical versus practical significance. What do you think?
Posted by Aleks Jakulin at 3:56 PM | Comments (11) | TrackBack
Anova
Cari Kaufman writes,
I am writing a paper on using Gaussian processes for Bayesian functional ANOVA, and I'd like to draw some connections to your 2005 Annals paper. In my own work I've chosen to use a 1-1 reparameterization of the cell means, that is, to constrain the levels within each factor. But I am intrigued by your use of exchangeable levels for all factors, and I'm hoping you can take a few minutes to help me clarify your motivation for this decision. Since not all parameters are estimable under the unconstrained model, don't you encounter problems with mixing when the sums of the levels trade off with the grand mean? It seems in many situations it's advantageous to have an orthogonal design matrix, especially when the observed levels correspond to all possible levels in the population. Do you have any thoughts on this you can share?
I should say I found the paper very useful, especially your graphical representation of the variance components. I also like your distinction between the superpopulation and finite population variances, which helped me clarify what happens when generalizing to functional responses. Basically, we can share information across the domain to estimate the superpopulation variances by having a stationary Gaussian process prior, but the finite population variances can differ over the domain, which gives some nice insight into where various sources of variability are important. (At the moment I'm working with climate modellers, who can really use maps of where various sources of variability show up in their output.)
My reply: I'm not quite sure what the question is, but I think you're pointing out the redundant parameterization issue, that if we specify all levels of a factor, and then have other crosscutting or nested factors (or even just a constant term), then the linear parameters are not all identifiable. I would deal with this issue by fitting the large, nonidentified model and then summarizing using the relevant finite-population summaries. We discuss this a bit in Sections 19.4-19.5 and Chapters 21-22 of our new book.
A couple notes on this:
1. Mixing of the Gibbs sampler can be slow on the original, redundant parameter space but fast on the transformed space, which is what we really care about. Also, things work better with proper priors. My new thing is weakly informative priors which don't include all your prior information but act to regularize your inferences and keep the algorithms in a reasonable space where they can converge faster. The orthogonality that you want can come in this lower-dimensional summary.
2. The redundant-parameter model is identified, if only weakly, as long as we use proper prior distributions on the variance parameters. In Bayesian Data Analysis and in my 2005 Anova paper, I was using flat prior distributions on these "sigma" parameters. But since then I've moved to proper priors, or, in the Anova context, hierarchical priors. See this paper for more information, including an example in Section 6 of the hierarchical model for the variance parameters.
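A toy sketch of the idea of fitting the redundant model and then reporting finite-population summaries. The "draws" below are faked to mimic the trade-off between the grand mean and the level effects along the nonidentified direction; they are not output from a real Gibbs sampler, and the cell means and function name are hypothetical.

```python
import random


def finite_population_summaries(mu, alphas):
    """Map a redundant (mu, alpha_1..alpha_J) draw to identified quantities:
    the adjusted grand mean mu + mean(alpha) and the centered effects alpha_j - mean(alpha)."""
    shift = sum(alphas) / len(alphas)
    return mu + shift, [a - shift for a in alphas]


rng = random.Random(1)
cell_means = [3.0, 5.0, 10.0]                  # hypothetical cell means mu + alpha_j
grand = sum(cell_means) / len(cell_means)

# Faked raw draws: mu and the alphas wander together by an arbitrary constant c,
# leaving every cell mean mu + alpha_j unchanged.
raw_draws = []
for _ in range(5):
    c = rng.gauss(0.0, 100.0)
    raw_draws.append((grand + c, [m - grand - c for m in cell_means]))

summaries = [finite_population_summaries(mu, alphas) for mu, alphas in raw_draws]
for mu_adj, alpha_adj in summaries:
    print(round(mu_adj, 6), [round(a, 6) for a in alpha_adj])
```

The raw mu values differ wildly from draw to draw, while the transformed summaries are stable. This is the sense in which mixing can be poor on the redundant space and fine on the quantities of interest.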
Posted by Andrew at 12:18 AM | Comments (1) | TrackBack
October 22, 2007
Survey weighting and regression modeling
Mike Larsen asks,
I [Mike] have one specific question about your article in Statistical Science on weighting and multi-level regression models: do the Table 1 regression results use the procedure you describe in section 1? That is, does it include interactions between X and z in the model, or does it use design variables with main effects for the relation (y on z) of interest and simply report the coefficient for y on z? I couldn’t really tell, but perhaps I missed something.
I guess I have another question: on page 157 in the last full paragraph you state that it is not clear why a simple linear regression of y on z in the entire population would be of interest. That implies that it is not of interest. The first line of 1.4 discusses the regression of y on z. If we had all the data in the population, would we not simply compute the simple linear regression parameter estimates and report those as the relationship between y and z (assuming linearity)? If not, what are we trying to estimate with the E(y|z) function? I understand that it would be more interesting to look at y on z and X if we had tons of data, but that did not appear to be the motivation at the start of 1.4.
Related to this, I see that the population proportions of men and women enter into equation (4) through Bayes’ theorem because you don’t have many people of a single height. In the second example (page 158) you might have E(male|white=1) etc. from population data, such as census data in the geographical area. You could use that, couldn’t you, instead of the proportions white among males in the sample and then Bayes’ theorem?
Finally, about implementing this idea, perhaps we need groups of statisticians inside federal agencies to build recommendations for multilevel models for various outcomes and relationships among variables in place of (or in addition to) the survey statisticians developing complicated weights? What do you think?
My reply:
1. The details are given in the second column of p.158. The model does not include interactions, and we just use the coefficient of z.
2. My point on p.157 that you noted is that, once you consider an additional predictor in the model, you have to consider that the regression of y on z might not be linear. In which case, yes, you can certainly create some summary such as the slope that you'd get by regressing y on z given all the data--but it's not clear why you'd want it. The E(y|z) function is still clearly defined, though, even if nonlinear.
There's a paper by Korn and Graubard in the American Statistician several years ago that discusses this point.
3. For equation (4), even if you had many people at any single height, you'd want to adjust using the population dist of men and women, to correct for differential nonresponse rates. In the Social Indicators Survey example, yes, we poststratified using census numbers.
4. Yes, I think that what is needed is a set of worked examples showing how the hierarchical modeling can work. Once we have the examples, we can have guidelines. But I don't really have the examples yet--note the "struggles" in the title!
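The poststratification adjustment mentioned in point 3 can be sketched in a few lines: estimate the outcome within each cell, then average the cell estimates using population shares rather than sample shares. All numbers and names below are hypothetical and chosen only to show the direction of the correction when one group is over-represented among respondents.

```python
def poststratify(cell_estimates, population_counts):
    """Population-weighted average of cell estimates: sum_j N_j * theta_j / sum_j N_j."""
    total = sum(population_counts.values())
    return sum(cell_estimates[c] * population_counts[c] for c in cell_estimates) / total


# Hypothetical survey in which women respond at a higher rate than men.
sample_mean_by_sex = {"men": 180.0, "women": 160.0}
sample_sizes = {"men": 30, "women": 70}
population_counts = {"men": 49, "women": 51}   # e.g., census shares per 100 people

raw_mean = (sum(sample_mean_by_sex[c] * sample_sizes[c] for c in sample_sizes)
            / sum(sample_sizes.values()))
adjusted_mean = poststratify(sample_mean_by_sex, population_counts)
print(raw_mean, adjusted_mean)
```

The unweighted sample mean is pulled toward the over-represented group. The poststratified estimate corrects for that, provided the within-cell estimates are themselves reasonable.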
Posted by Andrew at 10:42 PM | Comments (3) | TrackBack
October 19, 2007
Is an Oxford degree worth the parchment it's printed on?
I received the following email:
I recently graduated a U.S. university with a degree in Political Science and Applied Mathematics. At the moment, I'm starting out at Oxford where I'm studying statistics. While I've always been interested in politics and statistics, I didn't start to combine the two until my last year of college, and even then, only on occasion. . . . I saw your recent posting for post-docs at Columbia's Applied Statistics Center and thought about how much I would love that job, or one like it, at some point in the future. The practical question is this: I have been given a great opportunity to study at Oxford, but there is a question as to how much American institutions value Oxford degrees. I'm currently on track to get a master's degree followed by a doctorate in statistics. However, some old advisors are strongly discouraging me from pursuing a DPhil (Oxford's PhD) and instead think that I should get an American PhD in Political Science or Economics. While there are of course other factors in this decision, I was hoping you might have some advice. Would an Oxford DPhil be competitive for a job like the one you posted? Do you think I would need more substantial qualifications to teach in statistics or political science in the States?
My reply:
1. We should be having these postdocs for the indefinite future, so I encourage you to apply in a few years. The top new PhD's in applied statistics can get good academic jobs right after graduating, but I think you can learn a lot in a postdoc position, especially ours, which is interdisciplinary but with a core in statistical methods.
The other cool thing about a postdoc (compared to a faculty position, or for that matter compared to admissions to college or a graduate program) is that you're hired based on what you can do, not based on how "good" you are in some vaguely defined sense. I like to hire people who know how to fit models and to communicate with other researchers, and my postdocs have included a psychologist, an economist, and a computer scientist, along with several statisticians.
2. I have no sense of how Oxford degrees are valued. I would assume it has the same value as a degree at an American university. Oxford statistics has some great people, including Chris Holmes, Tom Snijders, and Brian Ripley. Recommendations from these guys would carry a lot of weight, at least in a statistics department. More important, you can probably do something interesting when you're in grad school and also learn some useful skills.
3. You also ask about getting a Ph.D. in statistics or political science or economics. My general impression is that, to teach in a department of X, it helps to have a Ph.D. in X. But some people can do a lot of statistics in a poli sci or econ dept, or vice versa. My other impression is that econ is a cartel. The individual econ professors I know are, without exception, nice people and excellent colleagues who do interesting and important research. But the field as a whole seems so competitive, I would think it could be an unpleasant setting to be in, academically. Statistics (and, to a lesser extent, political science) seems much less competitive to me. Substantively, much of the interesting and important work in applied economics is statistical, and my impression is you'd be better prepared to do the best work there if you come at it from a statistical background.
4. Update: I mentioned this to a colleague and he said that, if you're interested in getting an academic job in the U.S., it isn't a bad idea to spend a year or two at a top U.S. department so people get to know you. (This doesn't contradict my point 1 above.)
P.S. The student replies,
I was not expecting your negative view of economics, however. My interest in the field has (naturally) been on the applied side, more as a potential combination of political science and statistics than anything else, and I gave it as a potential PhD option merely to add more diversity to the list.
My reply: No, I think economics (and economists) are great. I'm just not sure I'd recommend an academic career in economics, since I think you can do similar work in other fields without the intense competitive atmosphere. But that's just my impression as an outsider. In any case, I'm a big fan of the work that's being done in economics, sociology, psychology, and various other social sciences (along with political science and statistics, of course).
Posted by Andrew at 9:18 AM | Comments (4) | TrackBack
October 7, 2007
Statistics in the real world
Here's an interesting and informative rant I received recently in the email:
This document is a consultant’s report to the Traverse City Convention & Visitor’s Bureau, quoted -- literally photocopied into -- a market analysis for an application for an approx. 270,000 square foot shopping center. The full report is here. On page 6 of the .pdf, we are told the following:
“After extensive evaluation and testing of these variables [that possibly determine tourist visitor volume to Grand Traverse County] for their predictive ability, the Consultant determined there are three variables with statistically significant associations. These are population in Grand Traverse County, Gross Domestic Product (GDP), and the External Event dummy variable.
“The Consultant found GDP [national, not regional or local] alone is a significant predictor however [sic] it does not hold up in association with either Grand Traverse Population or the External Event dummy variable.”
The Consultant then goes on to run a regression using GT population and the dummy, but not GDP. The resulting equation has an adjusted R-square of .95, and F=87.0. While GT pop has a t-value=10.9 & p=.000012, the dummy isn’t significant (p=0.3). The Consultant thus takes GT population projections out to 2025 to forecast annual tourist visits for that time frame.
That seems rather sketchy to me. Correct me because I’m likely wrong, but the Consultant basically said that 95% of the variation in annual tourist visits was due to (predicted by) county population, and then used population projections to forecast future tourist visits. And even though GDP was a significant variable, she used population instead, with no explanation why. (Or, none that I can find.) GDP and population were apparently the only two significant variables (though we don’t know how population held up if she removed the insignificant dummy from the specification) of the host of variables she tested; e.g., DoD/military contracts, even though our military presence is limited to a couple Coast Guard helicopters. (And her regression is based on about 10 data points.)
Surely, local population can’t be the driver of tourist visits. It does seem reasonable that population is driven by tourism, since people who visit here might end up wanting to move here, no? That seems to be a questionable variable for trying to forecast tourism in the future, when at least one other significant variable, GDP, is available — even if that was found by data mining as well.
I wish I could say this is typical, but in my experience, local units of government, &c., pay money for analyses even more questionable than what I just presented. For example, the market study in which the above was quoted reports consumer demand in 2005 $194,896,255 less than supply. Setting aside the problems this claim has in view of economic theory, the values labeled “demand” and “supply” are consumer expenditures and retail sales: retailers sold approx. $195 million more than consumers purchased. And there is no explanation of why this is; in 2005, within a 50-mile radius, consumers spent $1,371,392 on “News Dealers and Newsstands,” while retail sales in the same category were $0, and there is no explanation of that $1.4-million gap!
Well, I [my correspondent] guess there’s no real point to this email other than to complain, and shouting at the sky is getting me a lot of strange looks. I’ll close by just asking you to ask your students to get involved in their communities, and at the very least, act as bullshit detectors and raise their voices when something smells.
This certainly doesn't surprise me: I've seen worse from paid statistical consultants on court cases, including one from a consultant (nobody I've ever met or know personally in any way) who reportedly was paid hundreds of thousands of dollars for his services.
The key problems seem to be:
1. Statistics is hard, and not many people know how to do it.
2. The people who need statistical analysis don't always know where to look.
Posted by Andrew at 12:32 AM | Comments (8) | TrackBack
October 6, 2007
If one person asks, others might be interested too . . .
Shane Murphy writes,
I am a graduate student in political science (interested in economics as well), and I was reading your recent blog posts about significance testing, and the problems common for economists doing statistics. Do you know of and recommend any books to students learning econometrics or statistics for social science? Also, just in case your answer is your own book, "Data Analysis Using Regression and Multilevel/Hierarchical Models," is this book an appropriate way to learn "econometrics" (which is just statistics for economists, right?)?
My reply: Yes, I do recommend my book with Jennifer Hill. I also think it's the right book to learn applied statistics for economics. However, within economics, "econometrics" usually means something more theoretical, I think. You could take a look at a book such as Wooldridge's, which presents the theory pretty clearly.
Posted by Andrew at 8:28 PM | Comments (6) | TrackBack
October 5, 2007
More on significance testing in economics
After I posted this discussion of articles by McCloskey, Ziliak, Hoover, and Siegler, I received several interesting comments, which I'll address below. The main point I want to make is that the underlying problem--inference for small effects--is hard, and this is what drives many of the struggles with statistical significance. See here for more discussion of this point.
Statisticians and economists not talking to each other
Scott Cunningham wrote, surprised that I'd not heard of these papers before:
I wasn't expecting anything like what you wrote. I live in a bubble, and just assumed you were familiar with the papers, because in grad school, whenever I presented results and said something was significant (meaning statistically significant), I would *always* get someone else responding, "but is it _economically_ significant" meaning, at minimum, is the result basically a very precisely measured no effect? The McCloskey/Ziliak stuff was constantly being thrown at you by the less quantitatively inclined people (that set is growing smaller all the time), and I forgot for a moment that those papers probably didn't generate much interest outside economics.
I live in a bubble too, just a different bubble than Scott's. He and others might be interested in this article by Dave Krantz on the null hypothesis testing controversy in psychology. Dave begins his article with:
This article began as a review of a recent book, What If There Were No Significance Tests? . . . The book was edited and written by psychologists, and its title was well designed to be shocking to most psychologists. The difficulty in reviewing it for [statisticians] is that the issue debated may seem rather trivial to many statisticians. The very existence of two divergent groups of experts, one group who view this issue as vitally important and one who might regard it as trivial, seemed to me an important aspect of modern statistical practice.
As noted above, I don't think the issue is trivial, but it is true that I can't imagine an article such as McCloskey and Ziliak's appearing in a statistical journal.
Rational addiction
Scott also writes,
BTW, the rational addiction literature is a reference to Gary Becker and Kevin Murphy's research program that applies price theory to seemingly "non-market phenomenon", such as addiction. Rational choice would seem to break down as a useful methodology when applied to something like addiction. Becker and Murphy have a seminal paper on this from 1988. It's been an influential paper in the area of health economics, as numerous papers have followed it by estimating various price elasticities of demand, as well as to test the more general theory regarding the theory.
My reply to this: Yeah, I figured as much. It's probably a great theory. But, ya know what? If Becker and Murphy want to get credit for being bold, transgressive, counterintuitive, etc etc., the flip side is that they have to expect outsiders like me to think their theory is pretty silly. As I noted in my previous entry, there's certainly rationality within the context of addiction (e.g., wanting to get a good price on cigarettes), but "rational addiction" seems to miss the point. Hey, I'm sure I'm missing the key issue here, but, again, it's my privilege as a "civilian" to take what seems a more commonsensical position here and leave the counterintuitive pyrotechnics to the professional economists.
The paradigmatic example in economics is program evaluation?
Mark Thoma "disagreed mildly" with my claim that the null hypothesis of zero coefficient is essentially always false. Mark wrote:
I don't view the "paradigmatic example in economics" to be program evaluation. We do some of that, but much of what econometricians do is test the validity of alternative theories and in those contexts the hypothesis of a zero coefficient can make sense. For example, New Classical models imply that expected changes in the money supply should not impact real variables. Thus, a test of a zero coefficient on expected money in an equation with a real activity as the dependent variable is a test of the validity of the New Classical model's prediction. These tests require sharp distinctions between models, i.e. to find variables that can impact other variables in one theory but not another, and that's something we try hard to find, but when such sharp distinctions exist I believe classical hypothesis tests have something useful to contribute.
Hmmm . . . . I'll certainly defer to Mark on what is or is not the paradigmatic example in economics. I can believe that theory testing is more central. I'll also agree that important theories do have certain coefficients set to zero. I doubt, however, that in actual economic data, such coefficients really would be zero (or, to be more precise, that coefficient estimates would asymptote to zero as sample sizes increase). To wander completely out of my zone of competence and comment on Mark's money supply example: I'm assuming this is somewhat of an equilibrium theory, and short-term fluctuations in expected money supply could affect individual actors in the economy, which could then create short-term outcomes, which would show up in the data in some way (and then maybe, in good "normal science" fashion, be explained in a reasonable way to preserve the basic New Classical model). What I'm saying is: in the statistics, I don't think you'd really be seeing zero, and I don't think the Type 1 / Type 2 error framework is relevant.
Getting better? And a digression on generic seminar questions
Justin Wolfers writes that "the meaningless statements of statistical rather than economic significance are declining." Yeah, I think things must be getting better. Many years ago, Gary told me that his generic question to ask during seminars was, "What are your standard errors?" Apparently in poli sci, that used to stop most speakers in their tracks. We've now become much more sophisticated--in a good way, I think. (By the way, it's good to have a few of these generic questions stored up, in case you fall asleep or weren't paying attention during the talk. My generic questions include: "Could you simulate some data from your fitted model and see if they look like your observed data?" and "How many data points would you have to remove for your effect estimate to go away?")
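The first of those generic questions can be sketched in a few lines. What follows is only a minimal illustration, not anyone's actual analysis: the data are made up (a straight line plus one wild point), the model is an ordinary least-squares line with normal errors, and the function names fit_line and replicate_check are hypothetical. The idea is to simulate replicated datasets from the fitted model and ask how often they reproduce some feature of the observed data, here the maximum.

```python
import random
import statistics


def fit_line(x, y):
    """Least-squares intercept, slope, and residual sd for y = a + b*x."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r ** 2 for r in resid) / (len(x) - 2)) ** 0.5
    return a, b, sigma


def replicate_check(x, y, stat=max, n_sims=1000, seed=0):
    """Share of simulated datasets whose statistic is at least the observed one."""
    rng = random.Random(seed)
    a, b, sigma = fit_line(x, y)
    observed = stat(y)
    hits = 0
    for _ in range(n_sims):
        # Simulate a replicated dataset from the fitted line plus normal noise.
        y_rep = [rng.gauss(a + b * xi, sigma) for xi in x]
        if stat(y_rep) >= observed:
            hits += 1
    return hits / n_sims


# Made-up data: a nearly straight line with one wild value at the end.
x = list(range(10))
y = [1.0, 2.1, 2.9, 4.2, 5.0, 5.8, 7.1, 8.0, 9.1, 25.0]
share = replicate_check(x, y)
```

A small share means the fitted model rarely generates a maximum as large as the one observed, which is a hint that the straight-line-plus-normal-noise model is missing something about these data.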
Justin uses a lot of bold type in his blog entries. What's with that? Maybe a good idea? I use bold for section headings, but he uses them all over the place.
Sports examples
Also, since I'm responding to Justin, let me comment on his use of sports as examples in his classes. I do this too--heck, I even wrote a paper on golf putting, and I've never even played macro-golf--but, as people have noted on occasion, you have to be careful with such examples because they exclude many people who aren't interested in the topic. (And, unlike examples in biology, or economics, or political science, it's harder to make the case that it's good for the students' general education to become more familiar with the statistics of basketball or whatever.) So: keep the sports examples, but be inclusive.
Posted by Andrew at 8:37 PM | Comments (2) | TrackBack
Significance testing in economics: McCloskey, Ziliak, Hoover, and Siegler
Scott Cunningham writes,
Today I was rereading Deirdre McCloskey and Ziliak's JEL paper on statistical significance, and then reading for the first time their detailed response to a critic who challenged their original paper. I was wondering what opinion you had about this debate. Is statistical significance and Fisher tests of significance as maligned and problematic as McCloskey and Ziliak claim? In your professional opinion, what is the proper use of seeking to scientifically prove that a result is valid and important?
The relevant papers are:
McCloskey and Ziliak, "The Standard Error of Regressions," Journal of Economic Literature 1996.
Ziliak and McCloskey, "Size Matters: The Standard Error of Regressions in the American Economic Review," Journal of Socio-Economics 2004.
Hoover and Siegler, "Sound and Fury: McCloskey and Significance Testing in Economics," Journal of Economic Methodology, 2008.
McCloskey and Ziliak, "Signifying Nothing: Reply to Hoover and Siegler."
My comments:
1. I think that McCloskey and Ziliak, and also Hoover and Siegler, would agree with me that the null hypothesis of zero coefficient is essentially always false. (The paradigmatic example in economics is program evaluation, and I think that just about every program being seriously considered will have effects--positive for some people, negative for others--but not averaging to exactly zero in the population.) From this perspective, the point of hypothesis testing (or, for that matter, of confidence intervals) is not to assess the null hypothesis but to give a sense of the uncertainty in the inference. As Hoover and Siegler put it, "while the economic significance of the coefficient does not depend on the statistical significance, our certainty about the accuracy of the measurement surely does. . . . Significance tests, properly used, are a tool for the assessment of signal strength and not measures of economic significance." Certainly, I'd rather see an estimate with an assessment of statistical significance than an estimate without such an assessment.
2. Hoover and Siegler's discussion of the logic of significance tests (section 2.1) is standard but, I believe, wrong. They talk all about Type 1 and Type 2 errors, which are irrelevant for the reasons described in point 1 above.
3. I agree with most of Hoover and Siegler's comments in their Section 2.4, in particular with the idea that the goal in statistical inference is often not to generalize from a sample to a specific population, but rather to learn about a hypothetical larger population, for example generalizing to other schools, other years, or whatever. Some of these concerns can best be handled using multilevel models, especially when considering different possible generalizations. This is most natural in time-series cross-sectional data (where you can generalize to new units, new time points, or both) but also arises in other settings. For example, in our analyses of electoral systems and redistricting plans, we were careful to set up the model so that our probability distribution generalized to other possible elections in existing congressional districts, not to hypothetical new districts drawn from a common population.
4. Hoover and Siegler's Section 2.5, while again standard, is I think mistaken in ignoring Bayesian approaches, which limits their "specification search" approach to the two extremes of least squares or setting coefficients to zero. They write, "Additional data are an unqualified good thing, which never mislead." I'm not sure if they're being sarcastic here or serious, but if they're being serious, I disagree. Data can indeed mislead on occasion.
Later Hoover and Siegler cite a theorem that states "as the sample size grows toward infinity and increasingly smaller test sizes are employed, the test battery will, with a probability approaching unity, select the correct specification from the set. . . . The theorem provides a deep justification for search methodologies . . that emphasize rigorous testing of the statistical properties of the error terms." I'm afraid I disagree again--not about the mathematics, but about the relevance, since, realistically, the correct specification is not in the set, and the specification that is closest to the ultimate population distribution should end up including everything. A sieve-like approach seems more reasonable to me, where more complex models are considered as the sample size increases. But then, as McCloskey and Ziliak point out, you'll have to resort to substantive considerations to decide whether various terms are important enough to include in the model. Statistical significance or other purely data-based approaches won't do the trick.
Although I disagree with Hoover and Siegler in their concerns about Type 1 error etc., I do agree with them that it doesn't pay to get too worked up about model selection and its distortion of results--at least in good analyses. I'm reminded of my own dictum that multiple comparisons adjustments can be important for bad analyses but are not so important when an appropriate model is fit. I agree with Hoover and Siegler that it's worth putting in some effort in constructing a good model, and not worrying if said model was not specified before the data were seen.
5. Unfortunately my copy of McCloskey and Ziliak's original article is not searchable, but if they really said, "all the usual econometric problems have been solved"--well, hey, that's putting me out of a job, almost! Seriously, there are lots of statistical (thus, I assume econometric) problems that are still open, most notably in how to construct complex models on large datasets, as well as more specific technical issues such as adjustments for sample surveys and observational studies, diagnostics for missing-data imputations, models for time-series cross-sectional data, etc etc etc.
6. I'm not familiar enough with the economics to comment much on the examples, but the study of smoking seems pretty wacky to me. First there is a discussion of "rational addiction." Huh?? Then Ziliak and McCloskey say "cigarette smoking may be addictive." Umm, maybe. I guess the jury is still out on that one . . . .
OK, regarding "rational addiction," I'm sure some economists will bite my head off for mocking the concept, so let me just say that presumably different people are addicted in different ways. Some people are definitely addicted in the real sense that they want to quit but they can't, perhaps others are addicted rationally (whatever that means). I could imagine fitting some sort of mixture model or varying-parameter model. I could imagine some sort of rational addiction model as a null hypothesis or straw man. I can't imagine it as a serious model of smoking behavior.
7. Hoover and Siegler must be correct that economists overwhelmingly understand that statistical and practical significance are not the same thing. But Ziliak and McCloskey are undoubtedly also correct that most economists (and others) confuse these all the time. They have the following quote from a paper by Angrist: "The alternative tests are not significantly different in five out of nine comparisons (p<0.02), but the joint test of coefficient equality for the alternative estimates of theta.t leads to rejection of the null hypothesis of equality." This indeed does not look like good statistics.
Similar issues arise in the specific examples. For instance, Ziliak and McCloskey describe where Becker, Grossman, and Murphy summarize their results in terms of t-ratios of 5.06, 5.54, etc, which indeed miss the point a bit. But Hoover and Siegler point out that Becker et al. also present coefficient estimates and interpret them on relevant scales. So they make some mistakes but present some things reasonably.
8. People definitely don't understand that the difference between significant and not significant is not itself statistically significant.
9. Finally, what does this say about the practice of statistics (or econometrics)? Does it matter at all, or should we just be amused by the gradually escalating verbal fireworks of the McCloskey/Ziliak/Hoover/Siegler exchange? In answer to Scott's original questions, I do think that statistical significance is often misinterpreted but I agree with Hoover and Siegler's attitude that statistical significance tells you about your uncertainty of your inferences. The biggest problem I see in all this discussion is the restriction to simple methods such as least squares. When uncertainty is an issue, I think you can gain a lot from Bayesian inference and also from expanding models to include treatment interactions.
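Point 8 above is easy to check with arithmetic. Here is a small sketch using hypothetical numbers (two independent estimates, 25 and 10, each with standard error 10; none of these values come from the papers under discussion). One estimate clears the conventional threshold and the other does not, yet the difference between them is well within noise.

```python
import math


def z_score(estimate, std_error):
    """Estimate divided by its standard error."""
    return estimate / std_error


# Hypothetical estimates and standard errors, chosen only for illustration.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

# Standard error of the difference, assuming the two estimates are independent.
diff = est_a - est_b
se_diff = math.sqrt(se_a ** 2 + se_b ** 2)

z_a = z_score(est_a, se_a)        # 2.5: "significant" at the 5% level
z_b = z_score(est_b, se_b)        # 1.0: "not significant"
z_diff = z_score(diff, se_diff)   # about 1.06: the difference is not significant
```

The independence assumption matters: with correlated estimates the standard error of the difference would change, but the general warning stays the same.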
P.S. See here for more.
Posted by Andrew at 7:07 AM | Comments (10) | TrackBack
An explanation of hypothesis testing
Aleks came across this somewhere.
Posted by Andrew at 6:43 AM | Comments (1) | TrackBack
October 4, 2007
Postdoc openings for Fall 2008
We'll be considering applications for more postdocs in the Applied Statistics Center. As far as I can tell, this is the best statistics postdoctoral position out there: you get to work with fun, interesting people on exciting projects and make a difference in a variety of fields. You'll be part of an active and open community of students, faculty, and other researchers. It's a great way for a top Ph.D. graduate to get started in research without getting overwhelmed right away by the responsibilities of a faculty position. If this job had existed when I got my own Ph.D. way back when, I would've taken it.
Just email me your application letter, c.v., and your papers and have three letters of reference emailed to me. We will hire 0, 1, or more people depending on who applies and how they fit in with our various ongoing and planned projects in statistical methods, computation, and applications in social science, public health, engineering, and other areas.
Posted by Andrew at 9:05 AM | Comments (8) | TrackBack
October 1, 2007
Question on hypothesis testing
Mike Frank writes,
Hi, I'm a graduate student at MIT in Brain and Cognitive Sciences. I'm an avid reader of your blog and user of your textbook and so I thought I would email you this question in the hopes you have thoughts on it. I'm in a strange position in my research in that I do a lot of Bayesian modeling of cognitive processes but then end up doing standard psychology experiments to test predictions of the models where I have to use simpler frequentist statistical methods (which are standard in psychology, hard to publish without them) to analyze those data.
The basic question is how to compare binomial data from two different conditions in an experiment when there are multiple datapoints from each individual in each condition (so the trials are not independent). The simplest option seemed to me to be to use a chi-square test (e.g., compare 54/56 trials correct in one condition with 43/56 trials correct in the other, aggregating across participants). But I'm told this practice violates the independence assumption of the test. I'm not sure I totally understand why this is a problem here, but that may be a separate question entirely. In contrast, what most psychologists do is calculate a percentage correct for each individual and then do a paired t-test between the two sets of means. But I've read that using standard ANOVAs or t-tests on this type of binomial data violates the assumption of normal distribution of the data and is invalid (and can lead to bad inferences in many situations).
More sophisticated people have recommended using a GLM with a logit link function so that it is appropriate to binomial data and then making it a mixed model which can include individuals (subjects) as a random effect. But if I have multiple comparisons between conditions that differ qualitatively (e.g., not along some particular continuum), it seems like I would need to run the GLM on different pairs of conditions and look for a significant effect of condition in each case, and that doesn't seem particularly elegant either (although at least more appropriate). What I'd really like is just a simple hypothesis test like a chi-square or t-test but appropriate to the form of the data.
My reply:
The t-test comparing the means is correct. You compute the mean for each person and then do a person-level analysis. Thus, you're not actually using the total percentage correct, you're using the mean and sd across people for each condition.
The chi-squared test is not so interpretable because it doesn't give you differences in proportions, it only gives you a hard-to-interpret p-value.
Logistic regression is also fine.
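To make the person-level analysis concrete, here is a minimal sketch. The per-person counts are hypothetical: they are invented so that the totals match the 54/56 and 43/56 in the question (7 participants, 8 trials each, which is an assumption, since the question gives only the aggregates). The function name paired_t is also just for illustration. Each person contributes one proportion per condition, and the test works on the within-person differences.

```python
import math
import statistics


def paired_t(cond_a, cond_b):
    """Paired t statistic on per-person differences.

    Returns (mean difference, standard error, t statistic, degrees of freedom).
    """
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean_diff, se, mean_diff / se, n - 1


# Hypothetical breakdown: number correct out of 8 trials for each of 7 people.
trials = 8
correct_a = [8, 8, 7, 8, 8, 7, 8]   # sums to 54
correct_b = [6, 7, 5, 7, 6, 6, 6]   # sums to 43

# One proportion per person per condition; the person is the unit of analysis.
prop_a = [c / trials for c in correct_a]
prop_b = [c / trials for c in correct_b]

mean_diff, se, t_stat, df = paired_t(prop_a, prop_b)
```

The output is a difference in proportions with a standard error, which is easier to interpret than a bare p-value. The t statistic would then be compared with a t distribution on df degrees of freedom.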
Posted by Andrew at 12:42 PM | Comments (2) | TrackBack
September 28, 2007
Statistical consultants
Jeff Miller pointed me to his website. He offers statistical consulting. I'll also use this occasion to refer you again to Rahul.
Posted by Andrew at 1:42 AM | Comments (0) | TrackBack
September 25, 2007
Context is important: a question I don't know how to answer, leading to general thoughts about consulting
Someone writes in with a question that I can't answer but which reminds me of a general point about interactions between statisticians and others.
It seems that there should be an easy answer to this question yet I cannot find a satisfactory one (aka one that satisfies my committee).
1) I have a vector of values V and I find the median of it m.
2) I feed vector V to a simulation and it returns a value for each iteration of the simulation r_t which is stacked into a vector R.
3) I take the mean of R which is r_bar.
I want to now be able to compare m and r_bar. I want to be able to say if they are statistically different.
V cannot be assumed to be normal, and the simulation is stochastic, but not random.
Currently I am constructing a confidence interval around r_bar as:
r_bar +/- 1.96*sd(R)
But this does not seem right considering I cannot assume normality of the original "data" nor can I assume the simulation amounts to "random sampling".
What would you recommend?
My reply: I'm sorry but I don't understand what you are asking. My generic advice is that it's hard to solve such problems without having more information on the context. This happens all the time to statisticians, that people try to help us out by giving us what is essentially a probability problem, stripping out all content. But almost always a good answer depends on what these random variables actually represent.
Once when I was teaching at Chicago I overheard a discussion of some students and faculty about some consulting problem that had come in, something about estimating the probability of a sequence of successive "heads" in some specified number of coin flips. It turned out to be for the goal of computing a p-value, and looking into the example more, it also became clear that this was not a very good analysis of these particular data.
A tough balance
It's a tough balance being a statistician: we can't go to the experimenters and start bossing them around--we have to respect what they're saying--but we also have to extract from them what is really important and cause them to question their statistical assumptions. I've seen statistical consultants err in both directions: either basically ignoring the client and trying to cram every problem into some narrow methodological framework, or taking all the client's words too seriously and becoming a technician, applying an inappropriate method without question.
Posted by Andrew at 9:49 AM | Comments (3) | TrackBack
September 24, 2007
Measuring interpersonal influence
David Nickerson gave a wonderful talk at our quantitative political science seminar last week. He described three different experiments he did, and it was really cool. Here's the paper, and here are Alex's comments on it.
I've never really done an experiment. I like the idea but somehow I've never gotten organized to do one. I want to, though. I feel like an incomplete statistician as things currently stand.
Posted by Andrew at 12:50 AM | Comments (1) | TrackBack
September 20, 2007
Control only for the covariates that matter
NY Times published an awful article 25th Anniversary Mark Elusive for Many Couples that deserves a comment. Here is a quote:
Among men over 15, the percentage who have never been married was 45 percent for blacks, 39 percent for Hispanics, 33 percent for Asians and 28 percent for whites. Among women over 15, it was 44 percent for blacks, 30 percent for Hispanics, 23 percent for Asians and 22 percent for whites.
No wonder! The median age for whites in the US in 2000 was 37.7, for Asians 32.7, for blacks 30.2 and for Hispanics 25.8. 11 years of age difference should make a difference when it comes to the probability of having been married, no?
While they didn't control for age here, they did unnecessarily control for sex in this highly uninformative table-of-many-numbers:

The gross JPEG artifacts that blur the fonts are theirs, not mine: they should have known to use PNG or GIF for figures with lots of text. Does anyone gain any insight from the difference between women and men's probability other than noise? A similar nonsensical control appeared in Men with younger women have more children where the difference in optimum age difference between men (6) and women (4) is purely a statistical artifact if you go and read the paper.
Yuck. I wouldn't have posted this if it hadn't made it to 6th place among the most emailed articles in the past 24 hours.
In summary, when displaying data, control for things when 1) you need to remove a known effect, or 2) controlling for things tells you something you didn't know before. And use graphs, not tables! And educate journalists about the basics of statistics!
Posted by Aleks Jakulin at 5:31 PM | Comments (3) | TrackBack
September 19, 2007
The past, present, and future of statistics
This is going to be a letdown after this grand title . . . . Lingzhou Xue writes,
I just read recently the talk titled "The Future of Statistics" from Bradley Efron. Actually, I see some enlightening ideas but also fall a little puzzled. In this talk, Efron first gave a simple review of the rapid development of statistics last century. He is humorous to comment that "The history of statistics in the Twentieth Century is the surprising and wonderful story of a ragtag collection of numerical methods coalescing into a central vehicle for scientific discovery". After this humor is just what puzzles me and what I really hope your instructions and ideas. Efron cited a simple example to illustrate the limitations of classical statistics in the model selection problems and also exploit a figurative comment that "History seems to be repeating itself: we've returned to an era of ragtag heuristics propelled with energy but with no guiding direction."
Finally, he presented an helpful instructions that "During such time it pays to concentrate on basics and not tie oneself too closely to any one technology."
My reply: Efron is an interesting example of a leading statistical researcher who has developed and used a diverse set of tools, most notably model-based empirical Bayes and nonparametric bootstrap and permutation tests. So he, more than most, is justified in seeing statistics as being extremely successful without needing a guiding direction. In the hedgehog/fox distinction, he's a fox.
It's hard for me to make generalizations about the field of statistics since there are so many different strands. I guess some sort of analysis based on papers and citation counts would give a clue. I guess it is true that statistics in the 1950s, like politics in the 1950s, had a unity that we didn't see before and don't see today. 1950s-style statistics was limited but it was all people had and so they used it well. It broke down when it got overwhelmed with data.
Posted by Andrew at 2:10 AM | Comments (1) | TrackBack
September 14, 2007
Most science studies appear to be tainted by sloppy analysis
Boris pointed me to this article by Robert Lee Hotz:
We all make mistakes and, if you believe medical scholar John Ioannidis, scientists make more than their fair share. By his calculations, most published research findings are wrong. . . . "There is an increasing concern that in modern research, false findings may be the majority or even the vast majority of published research claims," Dr. Ioannidis said. "A new claim about a research finding is more likely to be false than true." The hotter the field of research the more likely its published findings should be viewed skeptically, he determined.
Take the discovery that the risk of disease may vary between men and women, depending on their genes. Studies have prominently reported such sex differences for hypertension, schizophrenia and multiple sclerosis, as well as lung cancer and heart attacks. In research published last month in the Journal of the American Medical Association, Dr. Ioannidis and his colleagues analyzed 432 published research claims concerning gender and genes.
Upon closer scrutiny, almost none of them held up. Only one was replicated. . . .
Ioannidis attributes this to "messing around with the data to find anything that seems significant," and that's probably part of it. The other part is that, even if all statistics are done according to plan, the estimates that survive significance testing will tend to be large--this is what we call "Type M error." See here for more discussion.
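A quick simulation shows how this happens. This is only a sketch under assumed numbers: a true effect of 2 measured with a standard error of 8 (both hypothetical), a normal sampling distribution, and a two-sided 5% test. The function name exaggeration_ratio is made up for this example. Among the replications that reach significance, the code compares the average absolute estimate with the true effect.

```python
import random
import statistics


def exaggeration_ratio(true_effect, std_error, n_sims=100_000, seed=1):
    """Simulate estimates and return (exaggeration ratio, share significant).

    The ratio is the mean absolute estimate among significant results,
    divided by the true effect.
    """
    rng = random.Random(seed)
    significant = []
    for _ in range(n_sims):
        est = rng.gauss(true_effect, std_error)
        # Keep only estimates that pass a two-sided 5% significance test.
        if abs(est) > 1.96 * std_error:
            significant.append(abs(est))
    ratio = statistics.fmean(significant) / true_effect
    return ratio, len(significant) / n_sims


# Hypothetical setting: small true effect, noisy measurement.
ratio, power = exaggeration_ratio(true_effect=2.0, std_error=8.0)
```

In this setting any estimate that passes the test has to exceed 1.96 * 8 = 15.68 in absolute value, so every published-looking result overstates the true effect of 2 by a factor of nearly 8 or more, with no data manipulation involved.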
Posted by Andrew at 3:57 PM | Comments (2) | TrackBack
Opportunity for a survey expert in NYC
Meg Lamm writes,
I and the market research firm I work for (Nancy Dodd Research) are looking for someone to hire who could meet with me a few times to train me in using SPSS and/or Excel for doing data analysis, and in survey writing techniques. If you or anyone you know might be interested, please contact me at meg@nancydodd.com or 212-366-1526.
Posted by Andrew at 7:06 AM | Comments (0) | TrackBack
September 11, 2007
Poststratification on variables that are not fully observed
Seth Wayland writes,
In Chapter 14.1 of your new book, the example uses only predictors for which you have census data at the state level. In the poststratification step, you just plug the values of those covariates into the model, and voilà, you have an estimate for that poststratification cell! What about including further individual level predictors in the model to account for probability of selection such as household size and number of phones in the household, or even an individual-level predictor that might improve the model? How do you then calculate the estimate for each poststratification cell?
My response: yes, this is something we are struggling with.
The long answer is that we would treat the population distribution of all the predictors, Census and non-Census variables (those desirable individual-level predictors which are only observed in the sample and not in the population), as unknown. We'd give it all a big fat prior distribution and do Bayesian inference. This sounds like a lot but I think it's doable using regression models with interactions. We're working on this now, starting with simple models with just one non-Census variable. The closest we've come so far is this paper with Cavan and Jonathan on poststratification without population inference (see blog entry here).
The short answer is that it should be possible to do a quick-and-dirty version of the above plan, estimating the joint distribution of Census and non-Census variables using point estimates for the distribution of non-Census variables given the Census variables, based on weighting using the survey data within each Census post-stratification cell. This is only an approximation because it ignores uncertainty (for example, if a particular cell includes 4 people in single-phone households and 3 people in multiple-phone households, the weighted totals become 4 and 1.5, so the quick-and-dirty approach would use the point estimate of 4/(4+1.5) as the proportion in single-phone households in that cell, ignoring the uncertainty arising from sampling variability).
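As a rough sketch of the quick-and-dirty point estimate for the cell in the example, the single-phone share is the weighted single-phone total divided by the weighted total for the whole cell. The weight of 1/2 for multiple-phone households (twice the chance of being reached) is an assumption made here to match the weighted totals of 4 and 1.5, and `weighted_share` is just an illustrative helper.

```python
def weighted_share(counts, weights):
    """Point estimate of each category's share within one poststratification cell."""
    totals = {k: counts[k] * weights[k] for k in counts}
    grand_total = sum(totals.values())
    return {k: totals[k] / grand_total for k in totals}

# Respondents in one cell, by number of phone lines in the household.
counts = {"single_phone": 4, "multiple_phone": 3}
# Assumed inverse-probability weights: two lines, twice the chance of selection.
weights = {"single_phone": 1.0, "multiple_phone": 0.5}

shares = weighted_share(counts, weights)
print(shares)
```

This gives roughly 0.73 for single-phone households. With only 7 respondents in the cell, that number is obviously noisy, which is exactly the uncertainty the quick version ignores.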
I do think that this (the quick version, then the full version) is ultimately the way to go, since the poststratification strategy allows us to model the data and get small-area estimates, such as state-level opinions from national polls.
As is often the case, the challenge in statistics is to include all relevant information (from the Census as well as the survey, and maybe also from other surveys), and to do this while setting up a model that is structured enough to take advantage of all these data but not so structured that it overwhelms this information.
Posted by Andrew at 7:36 AM | Comments (0) | TrackBack
Statistical consulting
My former student Rahul Dodhia (actually Dave Krantz's former student, but I was on his committee and we did write a paper together) lives in Seattle and is a full-time statistical consultant. Here's the website for his company. He actually took the statistical consulting course from me at Columbia so I'm probably to blame for this! Anyway, I haven't actually seen him consult recently, but I expect he's doing a good job. He's located in Seattle but also works remotely.
Posted by Andrew at 2:42 AM | Comments (0) | TrackBack
August 31, 2007
A rant on the virtues of data mining
I view data analysis as summarization: use the machine to work with large quantities of data that would otherwise be hard to deal with by hand. I am also curious about what the data would suggest, and open to suggestions. Automated model selection can be used to list a few hypotheses that stick out of the crowd: I was not using model selection to select anything, but merely to be able to quantify how much a hypothesis sticks out from the morass of the null.
The response from several social scientists has been rather unappreciative along the following lines: "Where is your hypothesis? What you're doing isn't science! You're doing DATA MINING!"
Of course, I'm doing data mining, and I'm proud of it. Because data mining is summarization, it's journalism, it's surveying, it's mapping. That's where one gets ideas and impressions from. Of course what data mining found isn't "true". The models underlying data mining are most definitely not "true". But a mean is informative even if the distribution isn't symmetric.
The "scientific" approach corresponds to picking The One and Only Holy Hypothesis. Then you collect the data. Then you fit the model and verify whether it works or not. Then you write a paper. The good thing about the "scientific" approach is that you don't have to think, and that you need very little common sense. But real science is curiosity and pursuit of improved understanding of the world, not mindlessly following algorithms that can be taught even to imbeciles.
Let me analyze where the problem lies. There is data D. And there are multiple models M. In confirmatory data analysis (CDA) high prior probabilities are assigned to a single model and its negative (null): so it is very easy to establish which of the two is better. In exploratory data analysis (EDA) and data mining the prior over models is relatively flat. Yes, there are models underlying EDA too: if you rotate your scatter plot in three dimensions to get a good view of the phenomenon, your parameters are the rotations and you're doing kernel density estimation with your eyes. When you see a fit, you stop and save the snapshot. The problem is that no model in particular sticks out, so it's hard to establish the best one. Yes, it's hard to establish what "truth" is. "Truth" is the domain of religion. "Model", "data" and "evidence" are the domain of science.
Many of the hypotheses generated by people from theory might be understood as deserving higher prior probability: after all they are based on experience. In turn, a flat prior includes many models that are unlikely. For that matter, one should use a bit of common sense interpreting EDA results: because the prior was flat, if something looks fishy, subtract a little bit from it and study it in more detail. On the other hand, if you don't see something you think you should, add a little and study it in more detail. A CDA that tells you everything you've already known doesn't deserve a paper. But it's better to just eyeball the results with an implicit prior in your mind than to try to cook up a complex prior that will do the same. But once you've found a surprise, throw all the CDA you've got at it.
Posted by Aleks Jakulin at 2:10 PM | Comments (17) | TrackBack
August 29, 2007
The most beautiful people in the world . . . and a request for a favor (see the very bottom of this entry)
Ralph Blair sent this in. It's so horrible that I have to put it in the continuation part of the blog entry. I recommend you all stop reading right here.
Stop . . . It's not too late!!!!!!!!!!!
OK, here it is. No, no, no... (Here's the technical article explaining the statistical flaws in this stuff.) Mistakes are made all the time, of course, but it doesn't help when they are tied to wacky political agendas.
The news article begins:
The Beautiful Person club is an exclusive one, and entry brings much - fame, wealth ... and daughters. Think of the most beautiful couples in the world - they all have daughters. Tom Cruise and Katie Holmes? Check. Denise Richards and Charlie Sheen? Check. Brangelina and Bennifer? Check and check.
Actually, we looked up a few years of People Magazine's 50 most beautiful people, and they were as likely as anyone else to have boys:
One way to calibrate our thinking about Kanazawa’s results is to collect more data. Every year, People magazine publishes a list of the fifty most beautiful people, and, because they are celebrities, it is not difficult to track down the sexes of their children, which we did for the years 1995–2000.
As of 2007, the 50 most beautiful people of 1995 had 32 girls and 24 boys, or 57.1% girls, which is 8.6 percentage points higher than the population frequency of 48.5%. This sounds like good news for the hypothesis. But the standard error is 0.5/sqrt(56) = 6.7%, so the discrepancy is not statistically significant. Let’s get more data.
The 50 most beautiful people of 1996 had 45 girls and 35 boys: 56.2% girls, or 7.8% more than in the general population. Good news! Combining with 1995 yields 56.6% girls—8.1% more than expected—with a standard error of 4.3%, tantalizingly close to statistical significance. Let’s continue to get some confirming evidence.
The 50 most beautiful people of 1997 had 24 girls and 35 boys—no, this goes in the wrong direction, let’s keep going . . . For 1998, we have 21 girls and 25 boys, for 1999 we have 23 girls and 30 boys, and the class of 2000 has had 29 girls and 25 boys.
Putting all the years together and removing the duplicates, such as Brad Pitt, People’s most beautiful people from 1995 to 2000 have had 157 girls out of 329 children, or 47.7% girls (with standard error 2.8%), a statistically insignificant 0.8 percentage points lower than the population frequency. So nothing much seems to be going on here. But if statistically insignificant effects with a standard error of 4.3% were considered acceptable, we could publish a paper every two years with the data from the latest “most beautiful people.”
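The arithmetic in the passage above is easy to reproduce. Here is a short sketch using the counts quoted in the text and the conservative standard error 0.5/sqrt(n); the helper `summarize` and the z-scores it feeds are added here for illustration.

```python
from math import sqrt

POPULATION_GIRLS = 0.485  # population frequency of girls quoted in the text

def summarize(girls, boys):
    """Return (proportion of girls, conservative standard error 0.5/sqrt(n))."""
    n = girls + boys
    return girls / n, 0.5 / sqrt(n)

samples = {
    "1995": (32, 24),
    "1996": (45, 35),
    "1995-96 combined": (32 + 45, 24 + 35),
    "1995-2000, duplicates removed": (157, 329 - 157),
}

for label, (girls, boys) in samples.items():
    p, se = summarize(girls, boys)
    z = (p - POPULATION_GIRLS) / se
    print(f"{label}: {100 * p:.1f}% girls, se {100 * se:.1f}%, z = {z:.2f}")
```

None of the z-scores reaches 2, and the pooled estimate sits slightly below the population frequency.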
I don't blame the reporter (Maxine Shen) for this: it's natural to believe something that's been published in a book and a scientific journal. Perhaps, though, someone could send a note to whoever reviews this sort of book so that the errors won't be propagated indefinitely??
Posted by Andrew at 1:23 AM | Comments (13) | TrackBack
R-squared: useful or evil?
I had the following email exchange with Gary King.
Me: I know you hate R-squared and you hate standardization; nonetheless you might like this paper and this one. I've found the standardization idea, in particular, very helpful--I've been using it on many applications recently.
Gary: If R-sq were used as a data summary only, I'd have no objection (as an aside, I think 'data summary' which has good uses, often just means 'it's just description so don't bother me, anything goes!'). Instead, it is used as a measure of the quality or success or correctness or validity of the model, which is usually nuts.
Me: I agree with you there. By "data summary," I more precisely mean something that inherently depends on the design of the data collection. Thus, keep the model the same but spread out the x's, and R-squared goes up. But the model doesn't change. Similarly, "stat. signif." changes as you change the sample size.
Gary: Spreading out the x's is changing the model. Also, you can write down two equivalent models where the data give identical inferences about all the key parameters, but R2 can differ drastically. That's not abt the model or data summaries.
Me: What's in the model depends where you draw the line. If you have a dose-response model of the form y=f(x) + error, and you're interested in f, then I don't consider x part of the model; you set x to get a good estimate of f. At the other extreme, you can define anything as part of the model. Even sample size is part of the model if you consider it as a random variable. But I see what you're talking about. You're talking about comparing models. I'm not particularly interested in comparing models. I'm using R^2 to understand a single model (in particular, the way in which a particular dataset is informative about that single model). If I were to compare models, I'd do it directly.
Gary: So if you want to compare models, you wouldn't use R^2. But almost all uses of R^2 in the literature are about comparisons of some kind, even when implicit (the R^2 indicates that my model is better than yours! etc.). Anyway, I agree that it shouldn't be used to compare models, altho one (perhaps the only one?) valid use of R^2 is to compare two models or two specifications so long as they have the same dependent variable. The problem with R^2 is comparisons of data or model or anything when Y changes. The problem is identifying the question R^2 is the optimal answer to.
That's enough for now, I'm sure...
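For readers who want to see the "spread out the x's" point from the exchange above, here is a small simulation sketch. The slope, error sd, sample size, and the two ranges of x are arbitrary choices made for illustration; the regression line y = slope * x + error is the same in both runs and only the design changes.

```python
import random

def r_squared(xs, ys):
    """R^2 from a least-squares fit of y on x (the squared correlation)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy ** 2 / (sxx * syy)

def simulate(x_halfwidth, n=500, slope=1.0, sigma=1.0, seed=0):
    """Draw x uniformly on [-x_halfwidth, x_halfwidth]; y = slope * x + noise."""
    rng = random.Random(seed)
    xs = [rng.uniform(-x_halfwidth, x_halfwidth) for _ in range(n)]
    ys = [slope * x + rng.gauss(0.0, sigma) for x in xs]
    return r_squared(xs, ys)

r2_narrow = simulate(x_halfwidth=1.0)
r2_wide = simulate(x_halfwidth=5.0)
print(f"narrow x: R^2 = {r2_narrow:.2f}; wide x: R^2 = {r2_wide:.2f}")
```

In expectation R^2 here is slope^2 var(x) / (slope^2 var(x) + sigma^2), which is about 0.25 for the narrow design and about 0.89 for the wide one, with the same slope and the same error variance.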
Posted by Andrew at 12:26 AM | Comments (1) | TrackBack
August 27, 2007
Using experts' ranges
Doug McNamara writes,
I am preparing for my first year as a graduate student at the University of Maryland in their Department of Measurement, Evaluation and Statistics. I've been reading your blog for a few months, and thought I would finally ask a question. So, here it is:
I have some data on number of terrorist/insurgent troops in a country. For some of the cases, the data could not be directly measured; instead, experts on the country in question were surveyed. For these survey responses, the dataset provides a range of possible values for number of troops, with the range usually representing the high and low estimates (rounded to the nearest thousand). For instance, experts have assigned a range of 10,000-15,000 for number of UNITA troops in Angola in 1989.
So, the question is, how do I go about assigning an actual value to those situations where there is a range? Initially, I was thinking about simply using the mean between the high and low values, but I know nothing about the distribution of expert opinions. Alternatively, I could simply assign a random value within the range. A third option would be to run three tests—one where I only use the low values, one where I use the high values and a third where I use the median/random value approach.
I should mention I would like to assign a single value for the simple purpose of running a t-test to see if there is a difference in average number of troops when the group is foreign funded or not.
My reply: Considering this as a statistical problem, you could treat the actual number as missing data and then use a rounded-data likelihood (as in Exercise 3.5 of Bayesian Data Analysis). In your case, however, I'd probably just use the average (or the geometric mean) of the range. I wouldn't take these ranges very seriously: in general, experts are notorious for giving estimates where the truth falls outside the range of their guesses. So I don't see you getting anything special from looking at the high and low values as if they were actually upper and lower bounds.
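For the Angola range quoted in the question, the two simple summaries mentioned in the reply work out as follows. This is only a sketch, and `midpoints` is a name made up for this example.

```python
from math import sqrt

def midpoints(low, high):
    """Arithmetic and geometric means of an expert range (both bounds positive)."""
    return (low + high) / 2, sqrt(low * high)

arithmetic, geometric = midpoints(10_000, 15_000)
print(f"arithmetic: {arithmetic:.0f}, geometric: {geometric:.0f}")
```

The two differ by only a couple of percent for a range this narrow; the geometric mean matters more when the high and low values differ by a large factor.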
Posted by Andrew at 12:02 AM | Comments (2) | TrackBack
August 24, 2007
Average predictive comparisons for models with nonlinearity, interactions, and variance components
How do you summarize logistic regressions and other nonlinear models? The coefficients are only interpretable on a transformed scale. One quick approach is to divide logistic regression coefficients by 4 to convert to the probability scale--that works for probabilities near 1/2--and another approach is to compute changes with other predictors held at average values (as we did for Figure 8 in this paper). A more general strategy is to average over the distribution of the data--this will make more sense, especially with discrete predictors. Iain Pardoe and I wrote a paper on this which will appear in Sociological Methodology:
In a predictive model, what is the expected difference in the outcome associated with a unit difference in one of the inputs? In a linear regression model without interactions, this average predictive comparison is simply a regression coefficient (with associated uncertainty). In a model with nonlinearity or interactions, however, the average predictive comparison in general depends on the values of the predictors. We consider various definitions based on averages over a population distribution of the predictors, and we compute standard errors based on uncertainty in model parameters. We illustrate with a study of criminal justice data for urban counties in the United States. The outcome of interest measures whether a convicted felon received a prison sentence rather than a jail or non-custodial sentence, with predictors available at both individual and county levels. We fit three models: a hierarchical logistic regression with varying coefficients for the within-county intercepts as well as for each individual predictor; a hierarchical model with varying intercepts only; and a non-hierarchical model that ignores the multilevel nature of the data. The regression coefficients have different interpretations for the different models; in contrast, the models can be compared directly using predictive comparisons. Furthermore, predictive comparisons clarify the interplay between the individual and county predictors for the hierarchical models, as well as illustrating the relative size of varying county effects.
The next step is to program it in general in R.
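In the meantime, here is a much-simplified Python sketch of the idea for a logistic regression with a single input. The coefficients and data values are hypothetical, and this version only averages over the observed x values; it ignores the averaging over other inputs and over uncertainty in the parameters that the paper handles.

```python
from math import exp

def invlogit(z):
    return 1.0 / (1.0 + exp(-z))

def average_predictive_comparison(intercept, slope, xs, delta=1.0):
    """Mean change in Pr(y=1) when x moves to x + delta, averaged over the observed xs."""
    diffs = [invlogit(intercept + slope * (x + delta)) - invlogit(intercept + slope * x)
             for x in xs]
    return sum(diffs) / len(diffs)

# Hypothetical fitted model and data, for illustration only.
intercept, slope = -1.0, 0.8
xs = [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0, 3.0]

upper_bound = slope / 4  # divide-by-4 rule: the steepest slope of the logistic curve
apc = average_predictive_comparison(intercept, slope, xs)
print(f"divide-by-4: {upper_bound:.3f}, average predictive comparison: {apc:.3f}")
```

The divide-by-4 number is an upper bound on the per-unit change in probability, so the averaged comparison comes out smaller whenever some of the data sit away from probabilities near 1/2.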
Posted by Andrew at 12:32 AM | Comments (1) | TrackBack
August 22, 2007
Percent Changes?
Benjamin Kay writes about a problem that seems simple but actually is not:
I've come across a pair of problems in my work into which you may have some insight. I am looking at the percentage change in earnings per share (EPS) of various large American companies over a 3 year period. I am interested in doing comparisons of how other attributes influence the median value of earnings per share. For example, it might be that high paying companies have higher EPS growth than low paying ones. I am aware that this model might not fully take advantage of the data but I'm preparing it for an audience with limited statistical education.
The problems occur in ranking percentages. If you calculate percentages as (New - Old)/Old then there are two major problems:
1) Anything near zero explodes
2) Companies which go from negative to positive EPS appear to have negative growth rates: ($1 - (-$1)) / (-$1) = -200%
The first problem is seemingly intractable as long as I am using percent changes, but I cannot use dollar changes because it ignores the issue of scale. A company with 100 shares and $100 in earnings has $1 EPS, and one with 20 shares (and the same earnings) has $5 EPS. If both companies double their earnings to $200 dollars, they've performed identically. However, in absolute changes the former shows $1 change and the latter $5. I'm stuck with what to do here, maybe there is another measure of change that I haven't considered or another way of doing this entirely.
One thing I've considered for the second problem is taking the absolute value of companies whose EPS changes sign. That seems equivalent to claiming that a change from $1 to $3 EPS is equivalent to a -$1 to $1 change in EPS. Is that a standard approach to treating percent changes? Are there any other assumptions lurking underneath when doing this?
Is there a classic reference to doing order statistic work like this on percentile data?
My reply: this is an important problem that comes up all the time. The percent-change approach is even worse than you suggest, because it will blow up if the denominator approaches zero. Similar problems arise with marginal cost-benefit ratios, LD50 in logistic regression (see chapter 3 of Bayesian Data Analysis for an example), instrumental variables, and the Fieller-Creasy problem in theoretical statistics. I've actually been planning for a while to write a paper on estimation of ratios where the denominator can be positive or negative.
In general, the story is that the ratio completely changes in interpretation when the denominator changes sign (as you illustrated in your example). But yeah, dollar values can't be right either. I have a couple questions for you:
a. How important are the signs to you? For example, if a given company changes from -$1 to $1, is that more important to you than a change from $1 to $3, or from $3 to $5?
b. For any given company, do you want to use the same scaling for all three years? I imagine the answer is Yes (so you don't have to worry about funny things happening such as an increase of 25%, followed by a decrease of 25%, does not bring things to the initial value).
One approach might be to rescale based on some relevant all-positive variable such as total revenue. I'm sure many other good options are available, once you get away from trying to rescale based on a variable that can be positive or negative.
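Here is a small sketch of the problems discussed above and of the rescaling idea. The revenue-per-share figure of $20 is a hypothetical value chosen only to illustrate an always-positive denominator, and both function names are made up for this example.

```python
def percent_change(old, new):
    """Naive percent change, (new - old) / old."""
    return 100.0 * (new - old) / old

def scaled_change(old, new, scale):
    """Change expressed relative to an always-positive scale, e.g. revenue per share."""
    if scale <= 0:
        raise ValueError("scale must be positive")
    return 100.0 * (new - old) / scale

# EPS moves from -$1 to $1: an improvement, but the naive formula reports a decline.
print(percent_change(-1.0, 1.0))
# Old EPS near zero: the ratio explodes.
print(percent_change(0.01, 1.0))
# An increase of 25% followed by a decrease of 25% does not return to 100.
print(100.0 * 1.25 * 0.75)
# Same -$1 to $1 move, scaled by a hypothetical revenue per share of $20.
print(scaled_change(-1.0, 1.0, scale=20.0))
```

With a positive denominator the sign of the result always matches the direction of the change, and nothing blows up as EPS passes through zero.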
Posted by Andrew at 12:57 AM | Comments (5) | TrackBack
August 21, 2007
Ken Rice on conditioning in 2x2 tables
At the bottom of this entry I wrote that the so-called Fisher exact test for categorical data does not make sense. Ken Rice writes:
It turns out the standard conditional likelihood argument (which to me always looked prima facie contrived and artificial) is in fact exactly what you get from a carefully considered random-effects approach.
There are some nice symmetries in the random-effects prior, effectively it forces the same prior beliefs for cases and controls. It also has a nice non-parametric property - effectively one only specifies the first few moments of the prior, it's most attractive in e.g. matched pair studies.
Naturally, where one had good backing for a 'bespoke' prior, the conditional approach isn't going to beat it, but as a default I believe it's acceptable, and does actually make some sense.
Ken's paper is here, and here's an entertaining powerpoint [link fixed] on the topic.
Larger models that reduce to particular smaller models
I'll have to digest all this before I have any comments. Except that it reminds me of something similar with models of censored and truncated data. Truncation can be considered as a generalization of censoring where the number of censored cases is unknown. Thus, to do a full Bayesian inference for truncated data you need a prior distribution on the number of censored cases. It turns out that, for a particular choice of prior distribution, the truncation model reduces to the censoring model. We discuss this in chapter 7 of Bayesian Data Analysis (second edition) and section 2 of this paper from 2004. As I wrote then:
once we consider the model expansion, it reveals the original truncated-data likelihood as just one possibility in a class of models. Depending on the information available in any particular problem, it could make sense to use different prior distributions for N and thus different truncated-data models. It is hard to return to the original “state of innocence” in which N did not need to be modeled.
The same issue arises when considering additional input variables in a regression model.
Also
Here is Ken's conference poster:
Posted by Andrew at 7:43 AM | Comments (4) | TrackBack
August 13, 2007
Medians?
Jeff noticed this news article by Gina Kolata:
EVERYONE knows men are promiscuous by nature. It's part of the genetic strategy that evolved to help men spread their genes far and wide. The strategy is different for a woman, who has to go through so much just to have a baby and then nurture it. She is genetically programmed to want just one man who will stick with her and help raise their children.
Surveys bear this out. In study after study and in country after country, men report more, often many more, sexual partners than women.
One survey, recently reported by the federal government, concluded that men had a median of seven female sex partners. Women had a median of four male sex partners. Another study, by British researchers, stated that men had 12.7 heterosexual partners in their lifetimes and women had 6.5.
But there is just one problem, mathematicians say. It is logically impossible for heterosexual men to have more partners on average than heterosexual women. Those survey results cannot be correct.
...
Jeff's response: MEDIANS??!!
Indeed, there's no reason the two distributions should have the same median. I gotta say, it's disappointing that the reporter talked to mathematicians rather than statisticians. (Next time, I'd recommend asking David Dunson for a quote on this sort of thing.) I'm also surprised that they considered that respondents might be lying but not that they might be using different definitions of sex partner. Finally, it's amusing that the Brits report more sex partners than Americans, contrary to stereotypes.
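A toy example makes the point concrete. In the made-up population below (five men, five women, ten heterosexual partnerships), every partnership adds one to a man's count and one to a woman's count, so the means must agree, but the medians do not.

```python
from collections import Counter
from statistics import mean, median

men = ["M1", "M2", "M3", "M4", "M5"]
women = ["W1", "W2", "W3", "W4", "W5"]
# Hypothetical partnerships: one woman (W1) is partnered with every man.
partnerships = [
    ("M1", "W1"), ("M2", "W1"), ("M3", "W1"), ("M4", "W1"), ("M5", "W1"),
    ("M1", "W2"), ("M2", "W3"), ("M3", "W4"), ("M4", "W5"), ("M5", "W5"),
]

men_counts = Counter(m for m, _ in partnerships)
women_counts = Counter(w for _, w in partnerships)
men_partners = [men_counts[m] for m in men]
women_partners = [women_counts[w] for w in women]

print("means:  ", mean(men_partners), mean(women_partners))
print("medians:", median(men_partners), median(women_partners))
```

Both means are 2, but the median is 2 for the men and 1 for the women, because the women's distribution is more skewed. No lying or differing definitions are needed to get different medians.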
Posted by Andrew at 10:25 AM | Comments (24) | TrackBack
August 10, 2007
A question about transformation in regression
Alban Z writes,
I am seeking your view on some concept. This is about transforming a dependent variable to make it normally distributed before doing a regression. For situations where common strategies like logarithm transformation, taking square root .... do not help in making a variable (close to) normally distributed, some of the literature suggests using the so called *Inverse normal transformation: The transformation involves ranking the observations in the dependent variable, and then matching the percentile of each observation to the corresponding percentile in the standard normal distribution. Using the resulting percentiles, each observation is replaced with the corresponding z-score from the standard normal distribution. When there are ties, percentiles are averaged across all ties*.
What are your thoughts about the above procedure? Do you recommend using it?
My reply: I do not recommend transforming to make a variable have a particular distribution. Additivity and linearity of the model are more important. We discuss the issue further in chapter 4 of our new book. See also here.
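For readers who have not seen the procedure described in the question, here is a minimal sketch of what it does (not an endorsement of it). The percentile convention (rank - 0.5)/n is an assumption made here, since the quoted description does not specify one, and the data values are made up.

```python
from statistics import NormalDist

def inverse_normal_transform(values):
    """Replace each value by the standard-normal z-score of its percentile.

    Percentiles are (rank - 0.5) / n, an assumed convention; tied values
    share the average of their percentiles.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    percentiles = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block to cover every observation tied with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Ranks start+1 .. end+1 average to (start + end) / 2 + 1.
        avg_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            percentiles[order[k]] = (avg_rank - 0.5) / n
        start = end + 1
    return [NormalDist().inv_cdf(p) for p in percentiles]

z = inverse_normal_transform([3.1, 0.2, 0.2, 47.0, 5.5])
print([round(v, 2) for v in z])
```

Note that the outlying 47.0 ends up only modestly above 5.5 on the transformed scale: the transformation keeps the ordering and throws away the spacing, which is one reason to prefer thinking about additivity and linearity instead.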
Posted by Andrew at 10:17 AM | Comments (1) | TrackBack
August 8, 2007
Modeling on the original or log scale
Shravan writes,
Here is a typical problem I keep running into. I'm analyzing eyetracking data of the sort you have already seen in the polarity paper. Specifically, I am analyzing re-reading times at a particular word as a function of some experimental conditions that I will call c1 and c2. I expect an effect of c1 and c2, and an interaction. I get it when I analyze on raw reading times (milliseconds) but get only the interaction when I analyze on the log RTs. The logs' residuals are normally distributed and the raw RTs' are not. I am inclined to trust the log RTs more because of the normal residuals (theory, however, is more in line with raw RT-based results). But reviewers keep insisting I analyze on untransformed (raw) reading times, and your book also advises the reader to ignore residuals.
My reply:
1. The log scale makes more sense to me. On the other hand, the last time I analyzed eye-tracking data was 17 years ago, and I didn't know anything about the experimental setup even then!
2. If an interaction might be important, I'd include it in the model. Then if its coefficient isn't statistically significant, you can say that.
3. Hey, we do look at residuals in our book! Take a look at Chapter 5.
4. I wouldn't pick the model based on normality of the residuals. As we discuss in the book, the distribution of the residuals is the least important aspect of the model.
Posted by Andrew at 7:39 AM | Comments (3) | TrackBack
July 24, 2007
The difference between ...
Bruce McCullough points out this blog entry from Eric Falkenstein:
Recently the Wall Street Journal has had several articles about estrogen's link to heart disease in women, highlighting a recent New England Journal of Medicine article showing that it lowers risk of arterial sclerosis. Then last week, the Journal did a story concentrating on how the Women's Health Initiative (WHI) misread the data by focusing on the increased heart attack risk for women over 70, while neglecting the lowered rate of heart attack for women under 60 (since the WHI's 2002 report arguing that estrogen therapy actually raised heart disease--opposite sign to previous findings--hormone sales plummeted 30%). The WHI shot back in a letter to the WSJ, arguing they stand by their interpretation of the data, which they think is somewhat mixed, and in their words, the differences in heart disease between the older and younger (one up, one down!) is not 'statistically significant'. If the difference isn't statistically significant, I can't see how the old cohort can be thought to have a higher than average risk (eg, if the sample estimate for the old is +14%, for the young, -30%, if the difference is noise, the +14% is certainly noise). As Paul Feyerabend argued, there are no definitive tests in science, as people just ignore evidence that goes against them, emphasizing the consistent results.
I don't really have anything new to say about the Women's Health Initiative but I did want to point this out since it's an interesting reminder about the difficulty of using statistical significance as a measure of effect size.
Just a couple of weeks ago I was meeting with some people who were doing a health study where effect A was positive and not statistically significant, effect B was negative and not stat signif., but the difference was stat. signif. They had another comparison in their study where A was positive and stat signif, B was negative and not signif, and the difference was not stat signif. They were struggling to figure out how to explain all these things. Rather than give some sort of "multiple comparisons correction" answer, I suggested the opposite: to graphically display all their comparisons of interest in a big grid, to get a better understanding of what their study said. Then they could go further and fit a model if they want.
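Both patterns are easy to produce with ordinary numbers. The sketch below uses made-up estimates and standard errors (not the numbers from that health study) and assumes the two estimates are independent, so the standard error of the difference is sqrt(se_A^2 + se_B^2).

```python
from math import sqrt

def z_score(estimate, se):
    return estimate / se

def compare(a, se_a, b, se_b):
    """z-scores for A, B, and A - B, assuming independent estimates."""
    diff_se = sqrt(se_a ** 2 + se_b ** 2)
    return z_score(a, se_a), z_score(b, se_b), z_score(a - b, diff_se)

def significant(z, threshold=1.96):
    return abs(z) > threshold

# Made-up numbers chosen to reproduce the two patterns described above.
scenario_1 = compare(15, 10, -15, 10)  # A and B not significant; the difference is
scenario_2 = compare(25, 10, -1, 15)   # A significant; B and the difference are not
for name, zs in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    print(name, [(round(z, 2), significant(z)) for z in zs])
```

Nothing paradoxical is happening in either case; the significance labels are just a coarse summary of estimates and standard errors, which is why plotting all the comparisons is more informative than sorting them into significant and not.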
Falkenstein also writes,
Estrogen therapy helps women with symptoms of menopause, including hot flashes, bone loss, but also depression, wrinkles, vaginal dryness, and lower sexual desire. Though not mentioned in the WSJ articles, I think the latter issues are what really bother the WHI. Women's groups are fond of coming up with pretexts to desexualize women...
I don't know enough about the WHI to answer this one, but I imagine that they want to be extra careful when assessing estrogen therapy, given the problems with earlier recommendations on this.
Posted by Andrew at 10:54 AM | Comments (2) | TrackBack
July 10, 2007
Convergent interviewing and Markov chain simulation
Bill Harris writes,
MCMC is a technique to sample higher-dimensional spaces efficiently. By using Markov chains to select the next sample point, MCMC gathers information about important parts of that space when purely random sampling would likely fail to hit any points of interest.
Convergent interviewing is a way to select the next person or people to interview and the next questions to use when gathering information from a group of people. It "combines some of the features of structured and unstructured interviews, and uses a systematic process to refine the information collected."
In particular, people are selected by a simple process:
Decide the person "most representative" of the population. She will be the first person interviewed. Then nominate the person "next most representative, but in other respects as unlike the first person as possible"; then the person "next most representative, but unlike the first two" ... And so on. This sounds "fuzzy"; but in practice most people use it quite easily.
Each person is asked largely "content-free" questions on the general topic at hand. Probe questions are added to later questions to test the extent of apparent agreement between people and to explain apparent disagreements.
At first glance, there seems to be a metaphorical similarity between the two processes, as both seek to extract desired information from a high-dimensional space in reasonable time with a guided sampling process that may or may not converge.
I sometimes wonder if there might not even be a deeper connection, although I'm not sufficiently educated in Gibbs sampling and the like yet to be able to test that conjecture.
My response: Regarding MCMC, there has been some stuff written on "antithetical sampling" (I think that's what they call it) where there is a deliberate effort to make new samples different from earlier samples. There's also hybrid sampling, or Hamiltonian dynamics sampling, which Radford Neal has written about (extending methods that have been used in computational physics), which tries to move faster through parameter space.
Regarding convergent interviewing, the key idea seems to be the technique of the interview itself. (I can give my own disclaimer here which is that I've never done a personal interview of this sort, so I'm just speculating based on books I've read and conversations I've had with experts.) The sampling method seems fine. In practice the real worry is getting people who are too much alike, thus an extra effort is made to get people who are different. This makes sense to me. In practice I suspect it probably won't be better than sampling random people from the population (unless n is really small), but in many settings you can't really get a random sample, so it sounds like a good idea to intentionally diversify. Another approach is to use network sampling and use statistical methods to correct for sampling biases (as in Heckathorn et al.'s work here).
Posted by Andrew at 9:54 PM | Comments (5) | TrackBack
July 6, 2007
Job opening in Vienna
I got this in the mail the other day. It's at the Institute for Tourism and Leisure Studies. Pretty cool (perhaps)! Certainly something unusual. I'm sure Bayesian data analysis, multilevel modeling, and statistical graphics will be useful in this job . . .
Posted by Andrew at 12:39 AM | Comments (0) | TrackBack
June 25, 2007
Animated MDS convergence
A few days ago we had quite a discussion on multidimensional scaling. While everyone agreed that initialization is important with non-convex problems, minimizing some objective function is more appealing than using initial placement for the prior, except in appealing circumstances such as iterative scaling. For the objective function approach, one can regularize the stress function, and it is also possible to use the prior to shrink towards geographic positions.
The untidy initial placement approach is sufficient, however, to provide a visualization as we travel from the initial placement towards the final placement. Namely, the clinal pattern in the final placement is only one of the things we can learn: the migrations of points and the resulting stresses are just as interesting in providing insight about the differences between the simple uniform geographic diffusion model and the real distribution of genes in Europe.

I also visualize the stress (at the top), and the strongest attraction/repulsion vectors.
The Python source code is now available: if you agree to post a link to all derivative work in the comments of this entry, you can click here [ZIP].
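To make the idea of "traveling" from the initial placement to the final placement concrete, here is a minimal sketch. It is not the code linked above: it assumes a plain raw-stress objective (the sum of squared differences between layout distances and target dissimilarities), fixed-step gradient descent, and made-up data. The function names, step size, and number of steps are all hypothetical choices for illustration.

```python
import numpy as np

def pairwise_distances(X):
    """Euclidean distance matrix between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def raw_stress(X, D):
    """Sum over pairs i<j of (layout distance - target dissimilarity)^2."""
    d = pairwise_distances(X)
    iu = np.triu_indices(len(X), k=1)
    return ((d[iu] - D[iu]) ** 2).sum()

def mds_path(X0, D, steps=200, lr=0.01):
    """Gradient descent on raw stress, keeping every intermediate layout.

    Assumes no two points ever coincide exactly (the gradient divides
    by the layout distances).
    """
    X = X0.copy()
    path, stresses = [X.copy()], [raw_stress(X, D)]
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]
        d = np.sqrt((diff ** 2).sum(axis=-1))
        np.fill_diagonal(d, 1.0)       # avoid 0/0 on the diagonal
        coef = (d - D) / d             # > 0: pair too far apart, < 0: too close
        np.fill_diagonal(coef, 0.0)
        grad = 2.0 * (coef[:, :, None] * diff).sum(axis=1)
        X = X - lr * grad
        path.append(X.copy())
        stresses.append(raw_stress(X, D))
    return path, stresses

# Made-up example: the "true" layout is known, and the initial placement
# is a noisy version of it (standing in for geographic positions).
rng = np.random.default_rng(0)
target = rng.normal(size=(8, 2))
D = pairwise_distances(target)
X0 = target + rng.normal(scale=0.5, size=target.shape)

path, stresses = mds_path(X0, D)
print("initial stress:", stresses[0])
print("final stress:  ", stresses[-1])
```

Each entry of `path` is one frame of the animation, and `stresses` is the curve one would draw at the top. The per-pair terms in `coef` play the role of the attraction/repulsion forces.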
Posted by Aleks Jakulin at 6:10 PM | Comments (4) | TrackBack
June 20, 2007
Overview of Missing Data Methods
We came across an interesting paper on missing data by Nicholas J. Horton and Ken P. Kleinman. The paper compares statistical methods and related software for fitting incomplete-data regression models.

Here is the abstract:
Missing data are a recurring problem that can cause bias or lead to inefficient analyses. Statistical methods to address missingness have been actively pursued in recent years, including imputation, likelihood, and weighting approaches. Each approach is more complicated when there are many patterns of missing values, or when both categorical and continuous random variables are involved. Implementations of routines to incorporate observations with incomplete variables in regression models are now widely available. We review these routines in the context of a motivating example from a large health services research dataset. While there are still limitations to the current implementations, and additional efforts are required of the analyst, it is feasible to incorporate partially observed values, and these methods should be used in practice.
This is quite a thorough review. The authors refer to different packages already available. One thing we noticed is that there is nothing on diagnostics (see here for more on diagnostics of imputation). This paper should help us improve the "mi" package.
Also the appendix of the paper can be found here.
Posted by Masanao at 6:42 PM | Comments (5) | TrackBack
June 19, 2007
A universal dimension reduction tool?
Igor Carron reports on a paper by Richard Baraniuk and Michael Wakin, "Random projections of smooth manifolds," that is billed as a universal dimension reduction tool. That sounds pretty good.
I'm skeptical about the next part, though, as described by Carron, a method for discovering the dimension of a manifold. This is an ok mathematical problem, but in the problems I work on (mostly social and environmental statistics), the true dimension of these manifolds is infinity, so there's nothing to discover. Rather than a statement like, "We've discovered that congressional roll-call votes fall on a 3-dimensional space" or "We estimate the dimensionality of roll-call voting to be 3" or even "We estimate the dimensionality to be 3 +/- 2", I prefer a statement like, "We can explain 90% of the variance with three dimensions" or "We can correctly predict 93% of the votes using a three-dimensional model" or whatever.
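For what it's worth, a statement of the form "we can explain X% of the variance with k dimensions" is easy to compute. Here is a minimal sketch using a singular value decomposition of the centered data matrix. The function name and the simulated data (three latent dimensions plus noise) are assumptions made purely for illustration; this is not roll-call data.

```python
import numpy as np

def variance_explained(X, k):
    """Share of total variance captured by the first k principal components."""
    Xc = X - X.mean(axis=0)                    # center each column
    s = np.linalg.svd(Xc, compute_uv=False)    # singular values, largest first
    var = s ** 2                               # proportional to component variances
    return var[:k].sum() / var.sum()

# Simulated data: 100 rows, 20 columns, driven by 3 latent dimensions plus noise.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 3))
loadings = rng.normal(size=(3, 20))
X = latent @ loadings + rng.normal(scale=0.5, size=(100, 20))

for k in (1, 2, 3, 4):
    print(k, "dimensions:", round(variance_explained(X, k), 3))
```

The point is that the summary is indexed by k: one reports how much is explained at each k rather than declaring a single "true" dimension.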
Posted by Andrew at 6:42 AM | Comments (5) | TrackBack
June 18, 2007
Is significance testing all bad?
Dan Goldstein quotes J. Scott Armstrong:
About two years ago, I [Armstrong] was a reasonable person who argued that tests of statistical significance were useful in some limited situations. After completing research for “Significance tests harm progress in forecasting” in the International Journal of Forecasting, 23 (2007), 321-327, I have concluded that tests of statistical significance should never be used.
Here's a link to Armstrong's paper, and here's a link to his rejoinder to discussion.
My thoughts:
It has been rare that I've found significance tests to be useful, but when they have, it has been as a way to get a sense of the ways in which a model does not fit data, to give direction on where the model can be improved; see Chapter 6 of Bayesian Data Analysis.
For a specific example in which I found significance tests useful, see Section 2.6 of our new book. I emailed Armstrong and am interested to see if he agrees that significance testing was appropriate in that case. I suppose I agree that, ultimately, confidence intervals and effect size estimates would be appropriate even in this example, but the significance testing was relatively simple and clear so I was happy with it.
I was also reminded that the difference between "significant" and "not significant" is not itself statistically significant.
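A quick numerical illustration of that last point, using made-up numbers and assuming the two estimates are independent with normal sampling distributions: one estimate is 25 with standard error 10, the other is 10 with standard error 10.

```python
import math

def z_score(estimate, std_error):
    """Estimate divided by its standard error."""
    return estimate / std_error

def difference(est_a, se_a, est_b, se_b):
    """Difference of two independent estimates and its standard error."""
    return est_a - est_b, math.sqrt(se_a ** 2 + se_b ** 2)

# Made-up estimates and standard errors, for illustration only.
est_a, se_a = 25.0, 10.0
est_b, se_b = 10.0, 10.0

z_a = z_score(est_a, se_a)
z_b = z_score(est_b, se_b)
diff, se_diff = difference(est_a, se_a, est_b, se_b)
z_diff = z_score(diff, se_diff)

print("z for A:", round(z_a, 2))             # exceeds 1.96
print("z for B:", round(z_b, 2))             # below 1.96
print("z for A minus B:", round(z_diff, 2))  # below 1.96
```

The first estimate clears the conventional 5% threshold and the second does not, yet the difference between them (15 with standard error about 14) is nowhere near significant.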
Posted by Andrew at 3:32 PM | Comments (1) | TrackBack
June 11, 2007
"Missing at random" and "distinct parameters"
Etienne Rivot sends in a question about models for missing data. The issues are subtle and I think could be of general interest (since we all have missing data!) These issues are covered in Chapter 7 of Bayesian Data Analysis, but it always helps to see these theoretical ideas in the context of a specific example.
Rivot writes:
I [Rivot] am currently writing a paper to be submitted in a fisheries review and one of the referees raised a problem in the treatment of the missing data in our model. In a first version of the manuscript, we argue that the missing data generating process was "ignorable" (because we argue that the 2 conditions - 1) missing at random ; 2) distinct parameters - were verified). But the referee argues that the "distinct parameters" condition was NOT verified. I would greatly appreciate your opinion about that. Please find below a short description of the model and of the problem :
The problem
--------------------------------
Objective : estimate the number of fish, say N, in a particular site in a river
Method : successive removal method via electrofishing
Data :
- site i=1,...,n
- C1 : capture at the first pass (fish are captured by means of electro-fishing)
sampling equation : C1(i) ~ Binom(N(i),p(i))
- C2 : capture at the second pass (the same experiment with the same capture probability p(i))
sampling equation : C2(i) ~ Binom(N(i)-c1(i),p(i))
Hierarchical Bayesian model :
- priors on p(i) and N(i) have a hierarchical structure across sites i=1,...,n
The missing data problem :
Sometimes, the population size N(i) is so low that the result of the first pass is C1(i) = 0. In that case, the field crew often do not perform the second pass. Then C2(i) = NA.
The argumentation of the referee :
The "distinct parameters" condition IS NOT verified because the probability of a missing data at the site i depends upon the population size N(i). Indeed, the smaller the population size N(i), the greater the probability of obtaining C1 = 0 at the first pass, and the greater the probability of a missing data at the second pass. Then, the parameters of the missing data generating process (i.e. the probability of a missing data C2 = NA at site i) are NOT independent of the parameters of t
