
Someone who wishes to remain anonymous writes:

I was consulting with an academic group, and I noticed and pointed out what I believed was a clear and obvious mistake in something they were planning to publish on. Now we can always be wrong, but here the mistake was mathematical and had recently been published by a well-known author--not me. So I am very sure about it, and after the consulting came to an end I sent a very blunt email along with a warning that they would not want their error to be pointed out in a letter to the editor. They were arguing that the error was subtle and maybe not really an error, or at least not one that was widely understood to be an error. (If they accept that it is an error, they will have to develop a new method.) Unfortunately/fortunately, I noticed that a recent paper of theirs was listed in the references of a statistical methods paper I was just asked to review. I checked the paper, and the error is there, with not even a warning about any uncertainty or mixed opinions about there possibly being an error. There was no signed non-disclosure clause. What should I do now?

If I can't find someone else to write the letter to the editor, I will write it myself, but I am wondering how often others run into this and what ideas they have.

P.S. To preserve anonymity, some of the details have been falsified in unimportant ways.

My reply:

Many years ago I was involved in a project--not as a coauthor but just as a very peripheral member of a research group--where the data collection didn't go so well, the experimenters got a lot fewer participants than they had hoped for, and when all was said and done, there didn't seem to be much difference between treated and control groups. We were all surprised--the treatment made a lot of sense and we all thought it would work. After it didn't, we had lots of theories as to why that could've been, but we didn't really know. One other thing: there were before and after measures (of course), and both the treatment and the control groups showed strong (and statistically significant) improvement.

I drifted away from this project, but later I heard that the leader of the study had published a paper using just the treatment-group data. Now, if all you had were the treatment data--if it were a simple before/after study--that would be fine: there would be questions about causality, but you go with what you've got. Here, though, there was a control group, and it had shown the same improvement.

I didn't follow up on this, partly because I don't know the full story here. The last I saw of the data, there wasn't much going on, but it may very well be that completely proper data analysis led to cleaner results which were ultimately publishable. To say it again for emphasis: I don't really know what was happening here; I only heard some things second-hand. In any collaboration, it's good to have a certain level of communication, trust, and sense of common purpose which just wasn't there in that project at all.

Anyway, back to the original question: I don't see why you can't write and submit the letter to the editor yourself. First run it by the authors of the article to see what they say, then send it to the journal. That would seem like the best option for all concerned. Ideally it won't be perceived as a moral issue but just as a correction. As the author of a published paper with a false theorem, I'm all in favor of corrections!

A while ago, this blog had a discussion of short English words that have no rhymes. We've all heard of "purple" (which, in fact, rhymes with the esoteric but real word hirple) and "orange" in this context, but there are others. This seems a bit odd, which I guess is why some of these words are famous for having no rhyme. Naively, and maybe not so naively, one might expect that at least some new words would be created to take advantage of the implied gaps in the gamut of two-syllable words. Is there something that prevents new coinages from filling the gaps? Why do we have blogs and vegans and wikis and pixels and iPods, but not merkles and rilvers and gurples?

I have a hypothesis, which is more in the line of idle speculation. Perhaps some combinations are automatically disfavored because they interfere with rapid processing of the spoken language. I need to digress for just a moment to mention a fact that supposedly baffled early workers in speech interpretation technology: in spoken language, there are no pauses or gaps between words. If you say a typical sentence --- let's take the previous sentence for example --- and then play it back really slowly, or look at the sound waveform on the screen, you will find that there are no gaps between most of the words. It's not "I -- need -- to -- digress," it's "Ineed todigressforjusta momento..." Indeed, unless you make a special effort to enunciate clearly, you may well use the final "t" in "moment" as the "t" in "to": most people wouldn't say the t twice. But with all of these words strung together, how is it that our minds are able to separate and interpret them, and in fact to do this unconsciously most of the time, to the extent that we feel like we are hearing separate words?

My thought --- and, as I said, it is pure speculation --- is that perhaps there is an element of "prefix coding" in spoken language, or at least in spoken English (but presumably others too). "Prefix coding" is the assignment of a code such that no symbol in the code is the start (prefix) of another symbol in the code. Hmm, that sentence only means something if you already know what it means. Try this. Suppose I want to compose a language based on only two syllables, "ba" and "fee". Using a prefix code, it's possible to come up with a rule for words in this language, such that I can always tell where one word ends and the next begins, even with no gaps between words. ("Huffman coding" provides the most famous way of doing this.) For instance, suppose I have words bababa, babafee, feeba, bafee, and feefeefee. No matter how I string these together, it turns out there is only one possible breakdown into words: babafeefeebabafeefeefeefeefeebabababa can only be parsed one way, so there's no need for word breaks. In fact, as soon as you reach the end of one word, you know you have done so; no need to "go backwards" from later in the message, to try out alternative parses.
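
Just to make the "only one possible breakdown" claim concrete, here is a little greedy decoder in R. This is my own toy code, nothing standard; it relies only on the prefix property, which guarantees that at most one word can match the front of the remaining string at any step:

# Toy greedy decoder for the ba/fee language above.  Because no word is a
# prefix of another, at most one word matches the front of the string at
# each step, so the parse is unique and never needs to backtrack.
words <- c("bababa", "babafee", "feeba", "bafee", "feefeefee")
decode <- function(s) {
  out <- character(0)
  while (nchar(s) > 0) {
    w <- words[startsWith(s, words)]  # prefix property: at most one match
    if (length(w) != 1) stop("string is not decodable")
    out <- c(out, w)
    s <- substring(s, nchar(w) + 1)   # chop off the matched word and repeat
  }
  out
}
decode("babafeefeebabafeefeefeefeefeebabababa")
# [1] "babafee"   "feeba"     "bafee"     "feefeefee" "feeba"     "bababa"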

English doesn't quite work like this. For example, the syllable string see-thuh-car-go-on-the-ship can be interpreted as "see the cargo on the ship" or "see the car go on the ship". But it took me several tries to come up with that example! To a remarkable degree, you don't need pauses between the words, especially if the sentence also has to make sense.

So, maybe words that rhyme with "circle" or "empty" are disfavored because they would interfere with the quasi-"prefix coding" character of the language? Suppose there were a word "turple," for example. It would start with a "tur" sound, which is one of the more common terminal sounds in English (center, mentor, enter, renter, rater, later...). A string of syllables that contains "blah-blah-en-tur-ple-blah" could be split in more than one place...maybe that's a problem. Of course, you'll say: but there are other words that start with "tur," so why don't those cause a problem, why just "turple"? But there aren't all that many other common "tur" words --- surprisingly few, actually --- turn, term, terminal. "Turple" would be the worst, when it comes to parsing, because its second syllable --- pul --- is a common starting syllable in rapidly spoken English (where many pl words, like please and plus and play, start with an approximation of the sound).

So...perhaps I'm proposing nonsense, or perhaps I'm saying something that has been known to linguists forever, but that's my proposal: some short words tend to evolve out of the language because they interfere with our spoken language interpretation.

Too clever by, hmmm, about 5% a year

Coblogger John Sides quotes a probability calculation by Eric Lawrence that, while reasonable on a mathematical level, illustrates a road-to-error-is-paved-with-good-intentions sort of attitude that bothers me, and that I see a lot of in statistics and quantitative social science.

I'll repeat Lawrence's note and then explain what bothers me.

Here's Lawrence:

In today's Wall Street Journal, Nate Silver of 538.com makes the case that most people are "horrible assessors of risk." . . . This trickiness can even trip up skilled applied statisticians like Nate Silver. This passage from his piece caught my [Lawrence's] eye:
"The renowned Harvard scholar Graham Allison has posited that there is greater than a 50% likelihood of a nuclear terrorist attack in the next decade, which he says could kill upward of 500,000 people. If we accept Mr. Allison's estimates--a 5% chance per year of a 500,000-fatality event in a Western country (25,000 causalities per year)--the risk from such incidents is some 150 times greater than that from conventional terrorist attacks."

Lawrence continues:

Here Silver makes the same mistake that helped to lay the groundwork for modern probability theory. The idea that a 5% chance a year implies a 50% chance over 10 years suggests that in 20 years, we are certain that there will be a nuclear attack. But . . . the problem is analogous to the problem that confounded the Chevalier de Méré, who consulted his friends Pascal and Fermat, who then derived several laws of probability. . . . A way to see that this logic is wrong is to consider a simple die roll. The probability of rolling a 6 is 1/6. Given that probability, however, it does not follow that the probability of rolling a 6 in 6 rolls is 1. To follow the laws of probability, you need to factor in the probability of rolling 2 6s, 3 6s, etc.

So how can we solve Silver's problem? The simplest way turns the problem around and solves for the probability of not having a nuclear attack. Then, preserving the structure of yearly probabilities and the decade range, the problem becomes P(no nuclear attack in ten years) = .5 = some probability p raised to the 10th power. After we muck about with logarithms and such, we find that our p, which denotes the probability of an attack not occurring each year, is .933, which in turn implies that the annual probability of an attack is .067.

But does that make a difference? The difference in probability is less than .02. On the other hand, our revised annual risk is a third larger. . . .
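
For what it's worth, Lawrence's arithmetic is easy to check in R:

# Lawrence's back-calculation: if P(no attack in ten years) = 0.5 and the
# years are independent, the annual no-attack probability p solves p^10 = 0.5.
p <- 0.5^(1/10)   # 0.933
1 - p             # annual attack probability: 0.067, vs. Silver's 0.05
# And the naive "5% a year, so certainty in 20 years" reading would actually
# give, under independence,
1 - 0.95^20       # 0.64, not 1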

OK, so Lawrence definitely means well; he's gone to the trouble to write this explanatory note and even put in some discussion of the history of probability theory. And this isn't a bad teaching example. But I don't like it here. The trouble is that there's no reason at all to think of the possibility of a nuclear terrorist attack as independent in each year. One could, of course, go the next step and try a correlated probability model--and, if the correlations are positive, this would actually increase the probability in any given year--but that misses the point too. Silver is making an expected-value calculation, and for that purpose, it's exactly right to divide by ten to get a per-year estimate. Beyond this, Allison's 50% has got to be an extremely rough speculation (to say the least), and I think it confuses rather than clarifies matters to pull out the math. Nate's approximate calculation does the job without unnecessary distractions. Although I guess Lawrence's comment illustrates that Nate might have done well to include a parenthetical aside to explain himself to sophisticated readers.

This sort of thing has happened to me on occasion. For example, close to 20 years ago I gave a talk on some models of voting and partisan swing. To model votes that were between 0 and 1, we first did a logistic transformation. After the talk, someone in the audience--a world-famous statistician whom I respect a lot (but who doesn't work in social science)--asked about the transformation. I replied that, yeah, I didn't really need to do it: nearly all the vote shares were between 0.2 and 0.8, and the logit was close to linear in that range; we just did logit to be on the safe side. [And, actually, in later versions of this research, we ditched the logit as being a distraction that hindered the development of further sophistication in the aspects of the model that really did matter.] Anyway, my colleague responded to my response by saying, no, he wasn't saying I should use untransformed data. Rather, he was asking why I hadn't used a generalized linear model; after all, isn't that the right thing to do with discrete data? I tried to explain that, while election data are literally discrete (there are no fractional votes), in practice we can think of congressional election data as continuous. Beyond this, a logit model would have an irrelevant-because-so-tiny sqrt(p(1-p)/n) error term, which would require me to add an error term to the model anyway, which would basically take me back to the model I was already starting with. This point completely passed him by, and I think he was left with the impression that I was being sloppy. Which I wasn't, at all. In retrospect, I suppose a slide on this point would've helped; I'd just assumed that everyone in the audience would automatically understand the irrelevance of discrete-data models to elections with hundreds of thousands of votes. I was wrong, and I hadn't appreciated the accumulation of insights that any of us gather when working within an area of application, insights which aren't so immediately available to outsiders--especially when they're coming into the room thinking of me (or Nate Silver, as above) as an "applied statistician" who might not understand the mathematical subtleties of probability theory.

P.S. Conflict-of-interest note: I post on Sides's blog and on Silver's blog, so I'm conflicted in all directions! On the other hand, neither of them pays me (nor does David Frum, for that matter; as a blogger, I'm doing my part to drive down the pay rates for content providers everywhere), so I don't think there's a conflict of interest as narrowly defined.

More on health care performance and cost

Frank Hansen writes:

Life expectancy is an outcome with possibly many confounding factors other than cost, such as genes or lifestyle.

I used the 2007 OECD data on system resources to construct a health care system score. Here is the graph of that score against per capita cost:

Objects of the class "Foghorn Leghorn"

[Image: Foghorn Leghorn]

The other day I saw some kids trying to tell knock-knock jokes. The only one they really knew was the one that goes: Knock knock. Who's there? Banana. Banana who? Knock knock. Who's there? Banana. Banana who? Knock knock. Who's there? Orange. Orange who? Orange you glad I didn't say banana?

Now that's a fine knock-knock joke, among the best of its kind, but what interests me here is that it's clearly not a basic k-k; rather, it's an inspired parody of the form. For this to be the most famous knock-knock joke--in some circles, the only knock-knock joke--seems somehow wrong to me. It would be as if everybody were familiar with Duchamp's Mona-Lisa-with-a-moustache while never having heard of Leonardo's original.

Here's another example: Spinal Tap, which lots of people have heard of without being familiar with the hair-metal acts that inspired it.

The poems in Alice's Adventures in Wonderland and Through the Looking Glass are far far more famous now than the objects of their parody.

I call this the Foghorn Leghorn category, after the Warner Brothers cartoon rooster ("I say, son . . . that's a joke, son") who apparently was based on a famous radio character named Senator Claghorn. Claghorn has long been forgotten, but, thanks to reruns, we all know about that silly rooster.

And I think "Back in the USSR" is much better known than the original "Back in the USA."

Here's my definition: a parody that is more famous than the original.

Some previous cultural concepts

Objects of the class "Whoopi Goldberg"

Objects of the class "Weekend at Bernie's"

Retirements

Benjamin Kay writes:

I wonder if you saw Bruce Reed's article "The Year of Running Dangerously: In a tough economy, incumbency is the one job nobody wants," about the recent flurry of retirement announcements in the Senate but also less well publicized ones in the House. My understanding is that there is a known effect on retirement from census-driven redistricting. We also happen to be in a census year, but I haven't read any journalists discussing that as a factor. Do you have any insight into the relative explanatory decomposition of partisan politics, redistricting-related concerns, and simple economy-driven unpopularity in these retirement decisions?

My reply: Retirement rates definitely go up in redistricting years (see, for example, figure 3 here), but that would be 2012, not 2010, I believe. The census is this year, but I don't think they're planning to redraw the district lines in time for the 2010 elections.

In any case, I imagine that somebody has studied retirement rates, to see if they tend to go up in marginal seats in bad economies. Overall, retirement rates are about the same in marginal seats as in any others (at least, that's what Gary and I found when we looked at it in the late 1980s), but I could imagine that things vary by year. The data are out there, so I imagine somebody has studied this.

P.S. I've never understood why anybody would want to retire from a comfortable white-collar job. But now, spending a year on sabbatical and relaxing most of the time, I can really understand the appeal of retirement.

Damn!

I learned from a completely reliable source that the letter to the editor I published in the Journal of Theoretical Biology was largely in error.

I have to admire the thousands of anonymous Wikipedians who catch the mistakes of poseurs such as myself, and I'm looking forward to the forthcoming correction in the journal. The editors of JTB must be pretty embarrassed to have published a letter that was so wrong--but I guess it makes sense that they could be bamboozled by a statistician from a fancy Ivy League college.

Oh well, at least it's better that I learn the error of my ways now than that I live the rest of my life under the illusion that I knew what I was doing all this time!

P.S. I wish the anonymous Wikipedia editor had contacted me directly regarding my mistakes. As it is, I may never learn exactly how my criticisms have been "already addressed and corrected."

P.P.S. Due to the dynamic nature of Wikipedia, the above is now out of date. (Latest version is here.) Take a look at the comments below.

I'm giving a short course (actually, more like a series of lectures) in Leuven on 17 Feb.

Influential statisticians

Seth lists the statisticians who've had the biggest effect on how he analyzes data:

1. John Tukey. From Exploratory Data Analysis I [Seth] learned to plot my data and to transform it. A Berkeley statistics professor once told me this book wasn't important!

2. John Chambers. Main person behind S. I [Seth] use R (open-source S) all the time.

3. Ross Ihaka and Robert Gentleman. Originators of R. R is much better than S: Fewer bugs, more commands, better price.

4. William Cleveland. Inventor of loess (local regression). I [Seth] use loess all the time to summarize scatterplots.

5. Ronald Fisher. I [Seth] do ANOVAs.

6. William Gosset. I [Seth] do t tests.

My data analysis is 90% graphs, 10% numerical summaries (e.g., means) and statistical tests (e.g., ANOVA). Whereas most statistics texts are about 1% graphs, 99% numerical summaries and statistical tests.

I think this list is pretty reasonable, but I have a few comments:

1. Just to let youall know, I wasn't the Berkeley prof who told Seth that EDA wasn't important. I've even published an article about EDA. That said, Tukey's book isn't perfect. I mean, really, who cares about the January temperature in Yuma?

2, 3. I agree that S and R are hugely important. But if they hadn't been invented, maybe we'd just be using APL or Matlab?

4. Cleveland also made important contributions to statistical graphics.

5. I've written an article about Anova too, but at this point I think of Fisher's version of Anova as an excellent lead-in to hierarchical models and not such a great tool in itself. I think that psychology researchers will be better off when they forget about sums of squares, mean squares, and F tests, and instead focus on coefficients, variance components, and scale parameters.

6. I don't really do t-tests.

P.S. I wouldn't even try to make my own list. As a statistician myself, I've been influenced by so many many statisticians that any such list would run to the hundreds of names. I suppose if I had to make such a list about which statisticians have had the biggest effect on how I analyze data, it might go something like:

1. Rubin: He taught me applied statistics and clearly has had the largest influence on me (and, maybe, on many readers of my books)

2. Laplace/Lindley/etc.: The various pioneers of hierarchical modeling and applied Bayesian statistics

3. Gauss: Least squares, error models, etc etc

4. Cleveland: Crisp, clean graphics for data analysis. Although maybe if Cleveland had never existed, I'd have picked this up from somewhere else

5. Fisher: He's gotta be there, since he's had such a big influence on the statistical practice of the twentieth century

6. Jaynes: Not the philosophy-of-Bayes stuff, but just one bit--an important bit--in his book where he demonstrated the principle of setting up a model, taking it really seriously, looking hard to see where it doesn't fit the data, and then looking deeply at the misfit to see what it reveals about how the model could be improved.

But I'm probably missing some big influences that I'm forgetting right now.

Good timing

I was at the store today and bought some pants. Everything was 50-75% off, so I bought four pairs instead of just two. Then I came home and read this. Cool!

P.S. I'm amused by the level of passion in many of Tyler's blog commenters. Although, really, who am I to talk, given that I get passionate on the subject of hypothesis tests for contingency tables.

New kinds of spam

I got the following bizarre email, subject-line "scienceblogs.com/appliedstatistics/":

Hi,

After looking at your website, it is clear that you share the same concerns about Infections as we do here at Infection.org. Our website is dedicated to sharing the various up to date information regarding Infection, and we would love to share it to you and your readers.

I would like to discuss possible partnership opportunities with you. Please contact me if you are interested. Thank you.

June Smith
Assistant Editor
Infection.org
June.Infection@gmail.com

Just the "Assistant Editor," huh? I'm assuming that when Instapundit and Daily Kos got this email, it came directly from the top. On the other hand, the real nobody-blogs probably get a request from the Deputy Assistant Editor or an intern or someone like that. . . .

P.S. Spam is a kind of infection, so in this way I guess the message makes a certain kind of sense.

The newest way to slam a belief you disagree with--or maybe it's not so new--is to call it "religious." For example, "Market Fundamentalism is a quasi-religious faith that unregulated markets will somehow always produce the best possible results," and so is global warming ("The only difference between the religious right and the religious left is that the religious right worships a man, and the religious left worships . . . Mother Nature"). As is evidence-based medicine ("as religious as possible . . . just another excuse, really--to sneer at people"). And then there's the religion of Darwinism.

I encountered an extreme example of this sort of thing recently, from columnist Rod Dreher, who writes disapprovingly of "(Climate) science as religion"--on a religious website called Beliefnet (which has, under the heading "My Faith," the options Christianity, Buddhism, Catholic, Hinduism, Mormon, Judaism, Islam, Holistic, and Angels). Dreher actually appears to be a supporter of climate science here; he's criticizing a dissent-suppressing attitude that he sees, not the actual work that's being done by the scientists in the field.

Maybe it's time to retire use of the term "religion" to mean "uncritical belief in something I disagree with." Now that actual religious people are using the term in this way, it would seem to have no meaning left.

Background

Perhaps I'm a little sensitive about this because back when I started doing statistics, people often referred to Bayesianism as a religion. At one point, when I was doing work on Bayesian predictive checking, one of my (ultra-classical) colleagues at Berkeley said that he was not a Bayesian, but that if he were, he'd go the full subjective route--so he didn't understand what I was doing.

One of my Berkeley colleagues who studied probability--really, a brilliant guy--commented once that "of course" he was a Bayesian, but he was puzzled by how Bayesian inference worked in an example he'd seen. My feeling was: Bayes is a method, not a religion! Can't we evaluate it based on how it works?

And, a few years ago, someone from the computer science department came over and gave a lecture in the stat dept at Columbia. His talk was fascinating, but he irritated me by saying how his method gave all the benefits of Bayesian inference "without having to believe in it." I don't believe in logistic regression either, but it sure is useful!

Update on Universities and $

We discussed this here; Kevin Carey replies here.

70 Years of Best Sellers

I ran across this book, by Alice Payne Hackett, in the library last month: it lists the bestselling fiction and nonfiction books for every year from 1895 through 1965 (the year the book was written) and also the books that have sold the most total copies during that period. The nonfiction lists start in 1917. It's fun to read these, but the classifications confuse me a bit: apparently Charlie Brown, Pogo, and the Bible are considered nonfiction. You could maybe make an argument for one or two of these, but it's hard to imagine that anyone could put all three of them in the nonfiction category!

Hackett writes:

The authors who have had the most titles on the seventy annual lists are Mary Roberts Rinehart with eleven, Sinclair Lewis with ten, Zane Grey and Booth Tarkington with nine each, and Louis Bromfield, Winston Churchill (the American novelist), George Barr McCutcheon, Gene Stratton Porter, Frank Yerby, Edna Ferber, Daphne du Maurier, and John Steinbeck with eight each.

And here are the top selling books in the United States during 1895-1965:

The Pocket Book of Baby and Child Care (The Common Sense Book of Baby and Child Care), by Benjamin Spock, 1946 . . . 19 million copies sold
Better Homes and Gardens Cook Book, 1930 . . . 11 million
Pocket Atlas, 1917 . . . 11 million
Peyton Place, by Grace Metalious, 1956 . . . 10 million
In His Steps, by Charles Monroe Sheldon, 1897 . . . 8 million
God's Little Acre, by Erskine Caldwell, 1933 . . . 8 million
Betty Crocker's New Picture Cookbook, 1950 . . . 7 million
Gone With the Wind, by Margaret Mitchell, 1937 . . . 7 million
How to Win Friends and Influence People, by Dale Carnegie, 1937 . . . 6.5 million
Lady Chatterley's Lover, by D. H. Lawrence, 1932 . . . 6.5 million
101 Famous Poems, compiled by R. J. Cook, 1916 . . . 6 million
English-Spanish, Spanish-English Dictionary, compiled by Carlos Castillo and Otto F. Bond, 1948 . . . 6 million
The Carpetbaggers, by Harold Robbins, 1961 . . . 5.5 million
Profiles in Courage, by John F. Kennedy, 1956 . . . 5.5 million
Exodus, by Leon Uris, 1958 . . . 5.5 million
Roget's Pocket Thesaurus, 1923 . . . 5.5 million
I, the Jury, by Mickey Spillane, 1947 . . . 5.5 million
To Kill a Mockingbird, by Harper Lee, 1960 . . . 5.5 million
The Big Kill, by Mickey Spillane, 1951 . . . 5 million
Modern World Atlas, 1922 . . . 5 million
The Wonderful Wizard of Oz, by L. Frank Baum, 1900 . . . 5 million
The Catcher in the Rye, by J. D. Salinger, 1951 . . . 5 million
My Gun is Quick, by Mickey Spillane, 1950 . . . 5 million
One Lonely Night, by Mickey Spillane, 1951 . . . 5 million
The Long Wait, by Mickey Spillane, 1951 . . . 5 million
Kiss Me, Deadly, by Mickey Spillane, 1952 . . . 5 million
Tragic Ground, by Erskine Caldwell, 1944 . . . 5 million
30 Days to a More Powerful Vocabulary, by Wilfred J. Funk and Norman Lewis, 1942 . . . 4.5 million
Vengeance is Mine, by Mickey Spillane, 1950 . . . 4.5 million
The Pocket Cook Book, by Elizabeth Woody, 1942 . . . 4.5 million
Return to Peyton Place, by Grace Metalious, 1959 . . . 4.5 million
Never Love a Stranger, by Harold Robbins, 1948 . . . 4.5 million
Thunderball, by Ian Fleming, 1965 . . . 4 million
1984, by George Orwell, 1949 . . . 4 million
The Ugly American, by William J. Lederer and Eugene L. Burdick, 1958 . . . 4 million
A Message to Garcia, by Elbert Hubbard, 1898 . . . 4 million
Hawaii, by James A. Michener, 1959 . . . 4 million

This is so much fun, just typing these in. I hardly know when to stop. OK, here are the next few:

Journeyman, by Erskine Caldwell, 1935 . . . 4 million
The Greatest Story Ever Told, by Fulton Oursler, 1949 . . . 4 million
Kids Say the Darndest Things!, by Art Linkletter, 1957 . . . 4 million
The Radio Amateur's Handbook, 1926 . . . 4 million
Diary of a Young Girl, by Anne Frank, 1952 . . . 3.5 million
From Here to Eternity, by James Jones, 1951 . . . 3.5 million
Goldfinger, by Ian Fleming, 1959 . . . 3.5 million
Lolita, by Vladimir Nabokov, 1958 . . . 3.5 million
Trouble in July, by Erskine Caldwell, 1940 . . . 3.5 million
Lost Horizon, by James Hilton, 1935 . . . 3.5 million
Butterfield 8, by John O'Hara, 1935 . . . 3.5 million
The American Woman's Cook Book, ed. by Ruth Berolzheimer, 1939 . . . 3.5 million
Duel in the Sun, by Niven Busch, 1944 . . . 3.5 million
Georgia Boy, by Erskine Caldwell, 1943 . . . 3.5 million
Four Days, by American Heritage and U.P.I., 1964 . . . 3.5 million

And those are all the ones that, as of 1965, had at least 3.5 million recorded sales. (Hackett, annoyingly, reports sales figures to the last digit (for example, 19,076,822 for Dr. Spock), but I've rounded, following the rules of good taste.) Of all these, I'd say that five would or could be considered great literature (Chatterley, Catcher in the Rye, 1984, From Here to Eternity, and Lolita). Not such a bad total, considering all the possibilities. I've read ten of the books on the above list (not counting the Betty Crocker cookbook, which is where I got my recipe for biscuits), with From Here to Eternity being my favorite. I once tried to read Mockingbird but with no success. Of all the books above that I haven't read, I'd guess that I'd enjoy the John O'Hara the most. Also, Kids Say the Darndest Things, which someone (Phil?) once told me actually is very funny. Most of the books on the list sound vaguely familiar, but many only vaguely. For example, I recall the name "Erskine Caldwell" but know nothing about his books beyond what I can imagine from the titles. Maybe I once read something on his work in a compilation of reviews by Edmund Wilson, or something like that? "Duel in the Sun" was made into a movie, Mickey Spillane was famous for suspense thrillers where the hero shot the girl, Kennedy "authored" rather than wrote Profiles in Courage. And so on.

I had a great time just going through the titles and authors. Here's the end of the list, all the books that are listed as having sold a (meaninglessly precise) 1,000,000 copies:

Anthony Adverse, by Hervey Allen, 1933
Brave Men, by Ernie Pyle, 1944
Etiquette, by Emily Post, 1922
The Fire Next Time, by James Baldwin, 1963
A Heap O'Livin', by Edgar Guest, 1916
Little Black Sambo, by Helen Bannerman, 1899
The Moneyman, by Thomas B. Costain, 1947
Pollyanna Grows Up, by Eleanor H. Porter, 1915
Short Story Masterpieces, edited by Robert Penn Warren and Albert Erskine, 1958
The Simple Life, by Charles Wagner, 1901
Stiletto, by Harold Robbins, 1960
Twixt Twelve and Twenty, by Pat Boone, 1957
The Web of Days, by Edna L. Lee, 1947
Will Rogers, by Patrick J. O'Brien, 1935
Youngblood Hawke, by Herman Wouk, 1962

These really are obscure; most of them I'd never heard of before. Seeing that Pat Boone title reminds me of "Pimples and Other Problems, by Mary Robinson," a title which I made up for an assignment in high school in which we were supposed to list and summarize 20 books that we had read. For some reason, my friends and I were getting bored as we got near the end, and so we filled out our lists with made-up books. Fictional fiction, as it were. Although I seem to recall that the Robinson book was more of a book of nonfictional reminiscences, in the manner of Erma Bombeck but with an adolescent focus. We all thought it was funny that we padded our lists, but now that I'm a teacher myself, I realize what a pain it is to grade papers. Our teacher probably flipped through our assignments at lightspeed.

The writer with the most titles on the combined list (of all books selling at least a million copies, in all editions) was Erle Stanley Gardner, with 91 (!). It's funny, I don't remember seeing any of these in the public library when I was a kid. I wonder if the librarians considered them too hard-boiled for public consumption. His bestseller was The Case of the Lucky Legs, which came out in 1934 and, through 1965, had sold 3,499,948 copies. The Case of the Sulky Girl, from 1933, had sold 3.2 million copies, and it continues from there.

This is just so much fun. The analogy, I suppose, is with the weekly movie grosses that I've been told are now a national obsession (maybe not anymore?) or the TV ratings, which I remember reading about regularly in the newspaper thirty years ago. Back in 1965, books had some of the central position that movies (and video games?) have now in our culture. (TV seems to have come in and gone out; lots and lots of people watch TV, but I don't get a sense that people care too much anymore what are the top 10 shows in the Nielsens.) Movies are OK, but I'm still much more interested in books, which is one reason I so much enjoyed flipping through 70 Years of Best Sellers. (A sequel, "80 Years . . .," came out 10 years later, but that seems to be the end of the line.) The book concludes with a list of references, various books and articles about bestsellers, many of which look like they'd be fun to read.

P.S. This list is fun too. The numbers are much larger (it has A Tale of Two Cities, at 200 million copies, as the bestselling book not published by the government or a religious group, with the Boy Scout Handbook, Lord of the Rings, and one of the Harry Potter books following up). The numbers on this Wikipedia list come from all different sources and I'm sure that some are overestimates; beyond this, I guess that lots and lots of books have been sold in the forty years since 1965. The Wikipedia list is admittedly incomplete; for example, it doesn't include the aforementioned Perry Mason in its list of bestselling series. It does, however, note that "the Perry Rhodan series has sold more than 1 billion copies." I'd never heard of this one at all, but, following the link, I see that it's some sort of German serial, which I guess puts it in the same category as Mexican comic books and other things that I've vaguely heard about but have never seen. Once you start thinking about things like that--books that blur the boundary between literature and pop entertainment--I guess you can pile up some big numbers.

P.P.S. What are the bestselling statistics books? (Not counting introductory textbooks, which don't quite count, since students don't quite choose to buy them.) The first ones I can think of are Statistical Methods for Research Workers, Snedecor and Cochran, and Feller volume 1 (counting all editions in each case), but these were all published long ago and probably had most of their sales back in the days before book sales were so high (sales for all books are continuing to increase, in advance of the big crash that's coming some day soon). When thinking of total sales, maybe I should be thinking of books that have come out more recently. Exploratory Data Analysis? The Visual Display of Quantitative Information (yes, that's a statistics book)? I wouldn't quite count Freakonomics or Supercrunchers or Fooled by Randomness or the collected works of Malcolm Gladwell (or, for that matter, Red State, Blue State); these books are all about statistics, but I wouldn't quite call them "statistics books." Generalized Linear Models, maybe? Everybody has that one, but in lifetime sales maybe it's not in the Snedecor and Cochran class. I'd hate to think that the all-time #1 is How to Lie with Statistics (or, worse, Statistics for Dummies), but maybe so. Or maybe there's something huge and obvious that I'm forgetting?

And the all-time #1 political science book is, what, Machiavelli? Or Hobbes, maybe? At least until Sarah Palin's memoir comes out.

Different sorts of survey bias

Fascinating blog by Nate Silver on different ways a survey organization can be biased (or not). Issues of question wording, and of which questions to ask in a survey, come up from time to time.

Recidivism statistics

From the news today:


A man charged with trying to kill a Danish cartoonist was arrested last year in an alleged plot to harm U.S. Secretary of State Hillary Clinton, officials said. [...] The suspect was one of four people arrested last summer in Nairobi in an alleged plot to harm Clinton during her tour of African countries, the newspaper Politiken reported. The suspect was released from a Kenyan jail in September because of a lack of evidence and returned to Denmark, where he had been living, Sky News reported Sunday.

Just a few days ago, CNN reported:

That announcement led to questions about how many other former Guantanamo detainees may be planning to carry out terrorist attacks.

Pentagon officials have not released updated statistics on recidivism, but the unclassified report from April says 74 individuals, or 14 percent of former detainees, have turned to or are suspected of having turned to terrorism activity since their release.

Of the more than 530 detainees released from the prison between 2002 and last spring, 27 were confirmed to have engaged in terrorist activities and 47 were suspected of participating in a terrorist act, according to Pentagon statistics cited in the spring report.


More at Wikisource.

These rates are actually lower than those for the general prison population, where about 65% of prisoners are expected to be rearrested within 3 years. The numbers seem lower in recent years, about 58%. More at Wikipedia.

The Jewish Factor in Blue States

David Verbeeten writes:

A half-decade of blogging

We started this blog in October, 2004, as a way for people in my research group to share ideas, and for us to promote and elicit comments on our work. I soon came to regret that we hadn't started a year or so earlier; it appears that, up to 2003, all the blogs linked to all the other blogs. Starting in 2004 or so, the bigtime blogs mostly stopped updating their blogroll. (Luckily for us, Marginal Revolution didn't get the memo.) On the other hand, we benefited from late entry in having a sense of what we wanted the blog to be like. If I'd started blogging in 2002 or 2003, I suspect I would've been like just about everybody else and spewed out my political opinions on everything. By the end of 2004, I'd seen enough blogs that did that, and I realized we could make more of a contribution by keeping it more focused, keeping political bloviating, sarcasm, and academic gossip to a minimum.

Just to be clear: I'm not slamming those other kinds of blogs. Political opinions are great, and I think we really can learn from seeing ideas we agree with (or disagree with) expressed well and with emotion. Sarcasm is great too; it's what makes a peanut-butter-and-sandpaper sandwich worth eating, or something like that. And, hey, I love academic gossip; it's even more fun than reading about celebrities. These just aren't the best ways for me personally to contribute to the discourse.

When I started with the blog, I figured that if we were ever low on material, I could just link to my old articles, one at a time. But it's rarely come to this; in fact, I don't always get around to blogging my new articles and books right when they come out. The #1 freebie is that things I used to put in one-on-one emails or in referee reports, I now put on the blog so everyone can see them. Much more efficient, I think. The only bad thing about the blog--other than the time it takes up--is that now I get occasional emails from people informing me of developments in the sociobiology of human sex ratios. A small price to pay, I'd say.

MCMC model selection question

Robert Feyerharm writes in with a question, then I give my response. Youall can play along at home by reading the question first and then guessing what I'll say. . . .

I have a question regarding model selection via MCMC I'd like to run by you if you don't mind.

One of the problems I face in my work involves finding best-fitting logistic regression models for public health data sets typically containing 10-20 variables (= 2^10 to 2^20 possible models). I've discovered several techniques for selecting variables and estimating beta parameters in the literature, for example the reversible jump MCMC.

However, RJMCMC works by selecting a subset of candidate variables at each point. I'm curious: as an alternative to trans-dimensional jumping, would it be feasible to use MCMC to simultaneously select variables and beta values from among all of the variables in the parameter space (not just a subset), using the regression model's AIC to determine whether to accept or reject the betas at each candidate point?

Using this approach, a variable would be dropped from the model if its beta parameter value settles sufficiently close to zero after N iterations (say, -.05 < βk < .05). There are a few issues with this approach: Since the AIC isn't a probability density, the Metropolis-Hastings algorithm could not be used here as far as I know. Also, AIC isn't a continuous function (it "jumps" to a lower/higher value when the number of model variables decreases/increases), and therefore a smoothing function is required in the vicinity of βk=0 to ensure that the MCMC algorithm properly converges. I've run a few simulations, and this "backwards elimination" MCMC seems to work, although it converges to a solution very slowly.

Anyways, if you have time I would greatly appreciate any input you may have. Am I rehashing an idea that has already been considered and rejected by MCMC experts?
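
For readers who want to play along at home, here is a toy version of the scheme Feyerharm describes--my own sketch on simulated data, not his code. One standard way around his density objection is to treat exp(-AIC/2) as an unnormalized target, which makes an ordinary Metropolis accept/reject step legal:

# Sketch of an AIC-guided random-walk search (simulated data, my assumptions).
set.seed(1)
n <- 500
X <- matrix(rnorm(n * 5), n, 5)                 # five candidate predictors
y <- rbinom(n, 1, plogis(X[,1] - 0.5 * X[,2]))  # only the first two matter

aic <- function(beta) {
  eta <- X %*% beta
  k <- sum(abs(beta) > 0.05)                    # Feyerharm's near-zero cutoff
  -2 * sum(y * eta - log1p(exp(eta))) + 2 * k   # logistic log-lik + penalty
}

beta <- rep(0, 5)
for (iter in 1:20000) {
  prop <- beta + rnorm(5, sd = 0.05)            # random-walk proposal
  # accept with probability min(1, exp((AIC_old - AIC_new)/2))
  if (log(runif(1)) < (aic(beta) - aic(prop)) / 2) beta <- prop
}
round(beta, 2)  # coefficients on the irrelevant predictors hover near zero

Note that the 2*k penalty jumps whenever a coefficient crosses the 0.05 cutoff--exactly the discontinuity he's worried about--so this sketch, with no burn-in and no smoothing, is only an illustration of the idea.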

Simple ain't easy

Let's start 2010 with an item having nothing to do with statistical modeling, causal inference, or social science.

Jenny Diski writes:

'This is where we came in' is one of those idioms, like 'dialling' a phone number, which has long since become unhooked from its original practice, but lives on in speech habits like a ghost that has forgotten the why of its haunting duties. The phrase is used now to indicate a tiresome, repetitive argument, a rant, a bore. But throughout my [Diski's] childhood in the 1950s and into the 1970s, it retained its full meaning: it was time to leave the cinema - although, exceptionally, you might decide to stay and see the movie all over again - because you'd seen the whole programme through. It seems very extraordinary now, and I don't know how anyone of my generation or older ever came to respect cinema as an art form, but back then almost everyone wandered into the movies whenever they happened to get there, or had finished their supper or lunch, and then left when they recognised the scene at which they'd arrived. Often, one person was more attentive than the other, and a nudge was involved: 'This is where we came in.' . . .

Interesting. It's been a while since I've come to a movie in the middle and sat through the next showing until reaching the point where I came in. Maybe this is not allowed anymore?

The real reason I wanted to discuss Diski's article, though, was because of an offhand remark she makes, dissing an academic author's attempt to write for a popular audience:

Skerry isn't really one to let go of jargon. In the preface he explains how to read his book, not as most books are doomed to be read, from beginning to end, but differently and 'in keeping with the multiplicity of voices that make up the text'. It gets quite scary: 'The temporal structure of these chapters goes from the present-tense narrative of my research trip in Chapter 1 to the achronological, "cubist" structure of Chapter 3 . . .

"Skerry" sounds like the name of a fictional character, but he's actually the author of the book under review.

My real point, though, is that I suspect that Skerry was not intentionally writing in jargon; it's just hard to write clearly. Harder than many readers realize, and maybe harder than professional writer Diski realizes. My guess is that Skerry was trying his best but he just doesn't know any better.

I had a similar discussion with Seth on this a while ago (sorry, I can't find the link to the discussion), where he was accusing academics of deliberately writing obscurely, to make their work seem deeper than it really is, and I was replying that we'd all like to write clearly but it's not so easy to do so.

There are some fundamental difficulties here, the largest of which, I think, is that the natural way to explain a confusing point is to add more words--but if you add too many words, it's hard to follow the underlying idea. Especially given that writing is one-dimensional; you can't help things along with intonation, gestures, and facial expressions. (There's the smiley-face and its cousin, the gratuitous exclamation point (which happened to be remarked upon by Alan Bennett in that same issue of the LRB), but that's slim pickings considering all the garnishes available for augmenting face-to-face spoken conversation.)

P.S. Here's my advice on how to write research articles. I don't really get into the jargon thing. Writing clearly and with minimal jargon is so difficult that I wasn't ready to try to give advice on the topic.

Normative vs. descriptive

Following a link from Rajiv Sethi's blog, I encountered this blog by Eilon Solan, who writes:

One of the assumptions of von-Neumann and Morgenstern's utility theory is continuity: if the decision maker prefers outcome A to outcome B to outcome C, then there is a number p in the unit interval such that the decision maker is indifferent between obtaining B for sure and a lottery that yields A with probability p and C with probability 1-p.

When I [Solan] teach von-Neumann and Morgenstern's utility theory I always provide criticism of their axioms. The criticism of the continuity axiom that I use is the case where the utility of C is minus infinity: C is death. In that case, one cannot find any p that would make the decision maker indifferent between the above two lotteries.

The funny thing is, this is an example I've used (see section 6 of this article from 1998) to demonstrate that you can, completely reasonably, put dollars and lives on the same scale. As I wrote:

We begin this demonstration by asking the students what is the dollar value of their lives---how much money would they accept in exchange for being killed? They generally answer that they would not be killed for any amount of money. Now flip it around: suppose you have the choice of (a) your current situation, or (b) a probability p of dying and a probability (1-p) of gaining $1. For what value of p are you indifferent between (a) and (b)? Many students will answer that there is no value of p; they always prefer (a). What about p=10^{-12}? If they still prefer (a), let them consider the following example.

To get a more precise value for p, it may be useful to consider a gain of $1000 instead of $1 in the above decision. To see that $1000 is worth a nonnegligible fraction of a life, consider that people will not necessarily spend that much for air bags for their cars. Suppose a car will last for 10 years; the probability of dying in a car crash in that time is of the order of 10*40,000/280,000,000 (the number of car crash deaths in ten years divided by the U.S. population), and if an air bag has a 50% chance of saving your life in such a crash, this gives a probability of about 7*10^{-4} that the bag will save your life. Once you have modified this calculation to your satisfaction (for example, if you do not drive drunk, the probability of a crash should be adjusted downward) and determined how much you would pay for an air bag, you can put money and your life on a common utility scale. At this point, you can work your way down to the value of $1 (as illustrated in a different demonstration). This can all be done with a student volunteer working at the blackboard and the other students making comments and checking for coherence.

The student discussions can be enlightening. For example, one student, Julie, was highly risk averse: when given the choice between (a) the current situation, and (b) a 0.000 01 probability of dying and a 0.999 99 probability of gaining $10,000, she preferred (a). Another student in the class pointed out that 0.000 01 is approximately the probability of dying in a car crash in any given three-week period. After correcting for the fact that Julie does not drive drunk, and that she drives less than the average American, perhaps this is her probability of dying in a car crash, with herself as a driver, in the next six months. By driving, she is accepting this risk; is the convenience of being able to drive for six months worth $10,000 to her?

This demonstration is especially interesting to students because it shows that they really do put money and lives on a common scale, whether they like it or not.
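
To make the air-bag arithmetic concrete, here it is in R; the $500 willingness-to-pay is a number I've made up purely for illustration:

# The air bag calculation from the demonstration above.
p_death <- 10 * 40000 / 280e6  # P(die in a car crash over ten years of driving)
p_saved <- 0.5 * p_death       # air bag saves your life in half of fatal crashes
p_saved                        # about 7e-4, as in the text
# If you'd pay at most, say, $500 for the air bag (a made-up number):
500 / p_saved                  # implied value of a life: about $700,000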

So . . . is this a violation of the continuity axiom, or not? In a way, it is, because people's stated preferences in these lotteries do not satisfy the axiom. In a way, it's not, because people can be acting in a way consistent with the axiom without realizing it. From this perspective, the axiom (and the associated mathematics) are valuable because they give us an opportunity to confront our inconsistencies.

In that sense, the opposition isn't really normative vs. descriptive, but rather descriptive in two different senses.

(Regular readers of this blog will know that I have big problems with the general use of utility theory in either the normative or the descriptive sense, but that's another story. Here I'm talking about a circumscribed problem where I find utility theory to be helpful.)

End-of-the-Year Altruists

I've been picking on the Freakonomics blog a lot recently, while occasionally adding the qualifier that in general it's great. What you see here is the result of selection bias: when the Freakonomics blog has material of its usual high quality, I don't have much to add, and when there's material of more questionable value, I notice and sometimes comment on it.

In this end-of-the-year spirit, though, I'd like to point to this entry by Stephen Dubner on altruism, which to my mind captures many of the Freakonomics strengths: it's an engaging, topical, and thought-provoking article on an important topic, and it discusses some current research in economics.

And this is as good a way as any for me to end another year of blogging.

Coethnicity

My colleague Macartan Humphreys recently came out with a book, Coethnicity (with James Habyarimana, Daniel Posner, and Jeremy Weinstein), which addresses the question of why public services and civic cooperation tend to be worse in areas with more ethnic variation. To put it another way: people in homogeneous areas work well together, whereas in areas of ethnic diversity, there's a lot less cooperation.

I'll give my comments; then, at the end, I've posted a response from Macartan.

From one perspective, this one falls into the "duh" category. Of course, we cooperate with people who are more like us! But it's not so simple. Macartan and his colleagues discuss and discard a number of reasonable-sounding explanations before getting to their conclusion, which is that people of the same ethnic group are more able to enforce reciprocity and thus are more motivated to cooperate with each other.

But, looking at it another way, I wonder whether it's actually true that people in homogeneous societies cooperate more. I think of the U.S. as pretty ethnically diverse, compared to a lot of much more disorganized places. One question is what counts as ethnicity. Fifty or a hundred years ago in the U.S., I think we'd be talking about Irish, English, Italians, etc., as different ethnic groups, but now they'd pretty much all count as white. To what extent is noncooperation not just the product of ethnic diversity but also a contributor to its continuation?

Macartan and his collaborators address some of these issues in their concluding chapter, and I'm sure there's a lot more about this in the literature. This is an area of political science that I know almost nothing about. When a researcher such as myself writes a book in American politics, we don't have to explain much--our readers are already familiar with the key ideas. Comparative politics, though, is a mystery to the general reader such as myself.

I should say something about the methods used by Macartan and his collaborators. They went to a city in Uganda, told people about their study, and performed little psychology/economics experiments on a bunch of volunteers. Each experiment involved some task or choice involving cooperation or the distribution of resources, and they examined the results by comparing people, and pairs of people, by ethnicity, to see where and how people of the same or different ethnic groups worked together in different ways.

One thing that was cool about this study, and which reminded me of research I've seen in experimental psychology, was that they did lots of little experiments to tie up loose ends and to address possible loopholes. Just for example, see the discussion on pages 137-139 of how they rule out the possibility that their findings could be explained by collusion among study participants.

I was also thinking about the implications of their findings for U.S. politics. (Macartan has told me that he doesn't understand how there can be a whole subfield of political science specializing in American politics, but he told me that he'll accept "Americanists" by thinking of us as comparative politics scholars who happen to be extremely limited in what we study.) The authors allude to research by Robert Putnam and others comparing civic behavior in U.S. communities of varying ethnic homogeneity, but I also wonder about public opinion at the national level, not just local cooperation but also to what extent people feel that "we're all in this together" and to what extent people evaluate policies and candidates based on how they affect their ethnic group (however defined). I'm also interested in the sometimes-vague links between ethnicity and geography, for example the idea that being a Southerner (in the U.S.) or a Northerner (in England) seems like an ethnic identity. Even within a city, different neighborhoods have different identities.

If I haven't made the point clear enough already, I think the book is fascinating, and it looks like it will open the door to all sorts of interesting new work as well.

Complicated categories

From a letter by Caroline Williamson of Brunswick, Australia, in the London Review of Books:

Ange Mlinko repeats the rumour that Barbara Guest married an English lord (LRB, 3 December 2009). She married Stephen Haden-Guest in 1948; he was the son of the Labour MP Leslie Haden-Guest, who was made a political peer in 1950. Stephen Haden-Guest inherited the title in 1960, six years after the couple divorced.

As an American, I'm eternally amused by this sort of thing. I just love it that people out there care whether someone is a lord, or a knight, or whatever. It reminds me of the rule that the wife of a king is a queen, but the husband of a queen is not necessarily a king.

P.S. Yes, I know that Americans are silly in other ways. I grew up 2 blocks away from a McDonald's! I'm not saying that we're better than people from other countries, just that this particular thing amuses me.

Reuters 1, New York Times 0

Analysis.

Credulity.

(See here for background.)

I recently blogged on the following ridiculous (to me) quote from economist Gary Becker:

According to the economic approach, therefore, most (if not all!) deaths are to some extent "suicides" in the sense that they could have been postponed if more resources had been invested in prolonging life.

In my first entry I dealt with Becker's idea pretty quickly and with a bit of mockery ("Sure, 'counterintuitive' is fine, but this seems to be going off the deep end . . ."), and my commenters had no problem with it. But then I updated with a more elaborate argument and discussion of how Becker could've ended up making such a silly-seeming (to me) statement, and the commenters here and here just blasted me. I haven't had such a negative reaction from my blog readers since I made the mistake of saying that PC's are better than Macs.

This got me thinking that sometimes a quick reaction is better than a more carefully thought-out analysis. But I also thought I'd take one more shot at explaining my reasoning and, more importantly, understanding where I might have gone astray. After all, if I can barely convince half the commenters at the sympathetic venue of my own blog, I must be doing something wrong!

Yesterday I posted this graph, a parallel-coordinates plot showing health care spending and life expectancy in a sample of countries:

[Graph: parallel-coordinates plot of health care spending and life expectancy]

I remarked that a scatterplot should be better. Commenter Freddy posted a link to the data--you guys are the best blog commenters in the world!--so, just for laffs, I spent a few minutes making a scatterplot containing all the same information. Here it is. (Clicking on any of the graphs gives a larger version.)


[Graph: scatterplot version with the same information]

(I was able to make the circles gray thanks to the commenters here.)

How do the two graphs compare? There are some ways in which the first graph is better, but I think these have to do with that graph being made by a professional graphic designer--at least, I assume he's a professional; in any case, he's better at this than I am! He also commented that he removed a few countries from the plot to make it less cluttered. Here's what happens if I take them out too:

[Graph: the scatterplot with a few overlapping countries removed]

(Unlike the National Geographic person, I kept in Turkey. It didn't seem right to remove a point that was on the edge of the graph. I also kept in Norway, which was the highest-spending country on the graph, outside the U.S. And I took out Sweden and Finland--sorry, Jouni!--because they overlapped, too. Really, I prefer jittering rather than removing as a solution to overlap, but here I'll go with what was already done in this example.)

What the scatterplot really made me realize was the arbitrariness of the scaling of the parallel coordinate plot. In particular, the posted graph gives a sense of convergence, that spending is all over the map but all countries have pretty much the same life expectancy--look at the way the lines converge to a narrow zone as you follow the lines from the left to the right of the plot.

Actually, though, once you remove the U.S., there's a strong correlation between spending and life expectancy, and this is super-clear from the scatterplot.

The only other consideration is novelty. The scatterplot is great, but it looks like lots of other graphs we've all seen. This is a plus--familiar graphical forms are easier to read--but also a minus, in that it probably looks "boring" to many readers. The parallel-coordinate plot isn't really the right choice for the goal of conveying information, but it's new and exciting, and that's maybe why one of the commenters at the National Geographic site hailed it as "a masterpiece of succinct communication." Recall our occasional discussions here on winners of visualization contests. The goal is not just to display information, it's also to grab the eye. Ultimately, I think the solution is to do both--in this case, to make a scatterplot in some pretty, eye-catching way.

P.S. I never know how much to trust these purchasing-power-adjusted numbers. Recall our discussion of Russia's GDP.

P.P.S. And here's the R code. Yes, I know it could be cleaner, but I just thought some of the debutants out there might find it helpful:
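(What follows is a minimal reconstruction sketch rather than the original listing; the file name and the columns spending, life.exp, and country are invented stand-ins.)

# Sketch: gray circles with country labels, as in the plots above.
health <- read.csv("healthdata.csv")   # hypothetical file, one row per country
plot(health$spending, health$life.exp, type="n",
     xlab="Health care spending (PPP dollars per capita)",
     ylab="Life expectancy (years)")
symbols(health$spending, health$life.exp, circles=rep(1, nrow(health)),
        inches=.1, fg="gray", bg="gray90", add=TRUE)
text(health$spending, health$life.exp, health$country, cex=.7, col="gray30")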

How to make colored circles in R?

In R, I can plot circles at specified sizes using the symbols() function, but for some reason it won't allow me to do it in color. For example, try this:

symbols (0, 0, circles=1, col="red")

It makes a black circle, just as if the "col" argument had never been specified. What's wrong?

P.S. I could just write my own function to draw circles, but that would be cheating. . . .
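For what it's worth, the answer turns out to be that symbols() takes its colors through the fg (border) and bg (fill) arguments rather than col:

symbols (0, 0, circles=1, fg="red", bg="pink")

The col argument is quietly ignored by the circles themselves, which is why they come out black.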

Ben Hyde and Aleks both sent me this:

[Graph: parallel-coordinates plot of health care spending and life expectancy]

The graph isn't as bad as all that, but, yes, a scatterplot would make a lot more sense than a parallel coordinate plot in this case. Also, I don't know how they picked which countries to include. In particular, I'm curious about Taiwan. We visited there once and were involved in a small accident. We were very impressed by the simplicity and efficiency of their health care system. France's system is great too, but everybody knows that.

Ekkehart Schlicht points to this article by Bruno Frey suggesting a change in journal review processes, so that the editorial board first decides whether to accept or reject a paper and then referees are brought in solely to suggest changes on accepted papers. Frey's paper was published in 2003 and, according to Google Scholar, has been cited about 100 times, but I don't know what effects it's had. One journal I know with something close to Frey's system is Statistica Sinica, which screens all submissions through the editorial board before sending out to reviewers. Another is Economic Inquiry, which accepts or rejects your paper as is, without going through a painful revision process. On the downside, Economic Inquiry charged $75 to submit an article, which was kind of irritating. Statistics journals don't do that.

Frey's article is thoughtful and entertaining but does not mention what seems to me to be the biggest advantage of his proposal, which is that it offers a huge reduction in the amount of labor put in by the referees! Frey quotes journal rejection rates of 95%. It would be a lot easier to get these referee reports if there were only 1/20th as many to chase down. When I write a referee report it usually takes about 15 minutes, but other people put more effort into each review, and it's inefficient to waste their time.

I also don't think Frey makes enough of the fact that editing and reviewing journal articles is volunteer work. Sure, there's some prestige involved in editing a journal, and it also gives you some chance to influence the direction of the field, but my impression is these payoffs are low compared to the cost. (Rather than edit a journal, I've chosen to edit a magazine--that is, to blog--which is similar in many ways but gives me the freedom to focus on the topics that interest me rather than on whatever happens to be submitted.) (Regular readers know that I often do react to "submissions"--that is, things that people email me--but I don't have to.)

Beyond this, I agree with Frey's general point that, when you write a book, you're writing for the reader, whereas when you write a journal article, you're writing for the referees. Econ journals are particularly bad, in my experience. It's just a different style. In statistics or political science, someone might publish 5 or 10 major papers in a year. In economics (maybe also in psychology?), people work over and over again on a single paper, trying for that elusive "home run." I don't know that either approach is better, but I find it difficult to switch from one to the other.

Frey also points out that as researchers get older, they're less inclined to spend the time on the referee process, instead writing books or publishing in less-demanding journals or simply placing their articles on the web. Schlicht recommends something called RePEc, and Christian posts papers on arXiv, something that I've found to be a pain in the ass because of the requirement that the article be in LaTeX. I certainly don't plan to submit many articles to econ journals unless I have an economist collaborator who feels like dealing with the review process.

This is all very important to us because we work hard and, having done the work, we'd like others to follow our lead. It's so frustrating to figure something out but then not be able to communicate our findings to others who might be interested.

P.S. Here's Frey's decision tree (which he calls the Journal Publication Game):

Who owns sparklines?

This is pretty funny (in a horrible sort of way).

My Tiger Woods post

I was just thinking about how everyone's buggin Tiger about his stuff on the side, but nobody cared that the Beatles were doing all the same things (well, not the text messaging, I guess) with groupies. The Beatles are the rock-star equivalent of Tiger, right? A long sequence of #1's, disciplined about work, and so on?

Then again, Lennon and McCartney didn't do ads for AT&T, Gillette, Nike, Accenture (huh? what's that, anyway?), Gatorade, or TLC Laser Eye Centers (or any eye centers, as far as I know). Maybe the standards are higher for people in advertising?

Felix Salmon mocks the above-linked study which claims evidence that Tiger Woods's scandal hurt his sponsors financially. What I really don't understand, though, is how it can make sense for these companies to be paying a golfer to endorse their products. I mean, Golf Digest, sure, but the others? I'm gonna buy somebody's razor because they paid a million dollars to some dude who can putt? I mean, sure, I understand the reasoning, sort of: Tiger gets attention, you see his face on TV and you whip around to see what the ad is about. If you're a 30 billion dollar company, it can be worth spending $20 million if you think it will increase profits by 0.067%. But it still seems a bit weird to me. At the level of individual decisions, it makes some sense, but if you step back a bit, it's just bizarre.

P.S. The Freakonomics blog links to a Yahoo News report of the study claiming that Tiger Woods's sponsors lost money, but without linking to Felix Salmon's demolition job. I assume that I'm not the only Freakonomics reader who reads Salmon, so maybe someone will point this out in the comments there.

Taxation curves and poverty traps

Dan Lakeland has been thinking about taxation curves and the poverty trap.

Advice to "never bill by the hour"

Is this true? I usually bill by the hour, but I have to say that there's always some awkwardness about this aspect of consulting. Compared to the typical hourly rates charged by statistical consultants, my impression is that I charge more but that I bill for far fewer hours--partly because I do consulting as an extra, not as my main job, so I'm typically trying to keep the hours limited.

Maybe a fixed charge would be better, but the trouble is that it's not always clear to me what exactly is needed. Or maybe there should be two stages: first a fixed charge where the product is an assessment of the problem, then another fixed charge for the main projects. Or maybe charge an hourly rate for the little problems and a fixed rate for the big ones. It's something to think about. It would be great to get enough money from consulting to really support some of my research efforts.

I recently reviewed Bryan Caplan's book, The Myth of the Rational Voter, for the journal Political Psychology. I wish I thought this book was all wrong, because then I could've titled my review, "The Myth of the Myth of the Rational Voter." But, no, I saw a lot of truth in Caplan's arguments. Here's what I wrote:

Bryan Caplan's The Myth of the Rational Voter was originally titled "The logic of collective belief: the political economy of voter irrationality," and its basic argument goes as follows:

(1) It is rational for people to vote and to make their preferences based on their views of what is best for the country as a whole, not necessarily what they think will be best for themselves individually.

(2) The feedback between voting, policy, and economic outcomes is weak enough that there is no reason to suppose that voters will be motivated to have "correct" views on the economy (in the sense of agreeing with the economics profession).

(3) As a result, democracy can lead to suboptimal outcomes--foolish policies resulting from foolish preferences of voters.

(4) In comparison, people have more motivation to be rational in their economic decisions (when acting as consumers, producers, employers, etc.). Thus it would be better to reduce the role of democracy and increase the role of the market in economic decision-making.

Caplan says a lot of things that make sense and puts them together in an interesting way. Poorly informed voters are a big problem in democracy, and Caplan makes the compelling argument that this is not necessarily a problem that can be easily fixed--it may be fundamental to the system. His argument differs from that of Samuel Huntington and others who claimed in the 1970s that democracy was failing because there was too much political participation. As I recall, the "too much democracy" theorists of the 1970s saw a problem with expectations: basically, there is just no way for "City Hall" to be accountable to everyone, thus they preferred limiting things to a more manageable population of elites. Caplan thinks that voting itself (not just more elaborate demands for governmental attention) is the problem.

Bounding the arguments

I have a bunch of specific comments on the book but first want to bound its arguments a bit.

The Death of the Blog Post?

Aleks sent me this ugly thing. It's a joke? Or perhaps a sad reflection that people prefer production values over substance?

While putting together a chapter on inference from simulations and monitoring convergence (for a forthcoming Handbook of Markov Chain Monte Carlo; more on that another day), I came across this cool article from 2003 by Jarkko Venna, Samuel Kaski, and Jaakko Peltonen, who show how tools from multivariate discriminant analysis can be used to make displays of MCMC convergence that are much more informative than what we're used to. There's also an updated article from 2009 by Venna with Jaakko Peltonen and Samuel Kaski.

After a brief introduction, Venna et al. set up the problem:

It is common practice to complement the convergence measures by visualizations of the MCMC chains. Visualizations are useful especially when analyzing reasons of convergence problems. Convergence measures can only tell that the simulations did not converge, not why they did not. MCMC chains have traditionally been visualized in three ways. Each variable in the chain can be plotted as a separate time series, or alternatively the marginal distributions can be visualized as histograms. The third option is a scatter or contour plot of two parameters at a time, possibly showing the trajectory of the chain on the projection. The obvious problem with these visualizations is that they do not scale up to large models with lots of parameters. The number of displays would be large, and it would be hard to grasp the underlying high-dimensional relationships of the chains based on the component-wise displays.

Some new methods have been suggested. For three dimensional distributions advanced computer graphics methods can be used to visualize the shape of the distribution. Alternatively, if the outputs of the models can be visualized in an intuitive way, the chain can be visualized by animating the outputs of models corresponding to successive MCMC samples. These visualizations are, however, applicable only to special models.

This seems like an accurate summary to me. If visualizations for MCMC have changed much since 2003, the news certainly hasn't reached me. I'd only add a slight modification to point out that with high resolution and small multiples, we can plot dozens of trace plots on the screen at once, rather than the three or four which have become standard (because that's what Bugs does).
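For example, here's a sketch (on fake simulation draws) of that kind of small-multiples display:

n.iter <- 1000; n.chains <- 4; n.param <- 20
draws <- array(rnorm(n.iter*n.chains*n.param), c(n.iter, n.chains, n.param))  # fake draws
par(mfrow=c(5,4), mar=c(2,2,1,1))   # 20 trace plots on one screen
for (j in 1:n.param)
  matplot(draws[,,j], type="l", lty=1, col=1:n.chains,
          xlab="", ylab="", main=paste("theta[", j, "]", sep=""))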

In any case, it's a problem trying to see everything at once in a high-dimensional model. Venna et al. propose to use discriminant analysis on the multiple chains to identify directions in which there is poor mixing, and then display the simulations on this transformed scale.
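To give a flavor of the idea, here's a minimal sketch (my code, not theirs) using lda() from the MASS package, with the chain index playing the role of the class label:

library(MASS)
n.chains <- 10; n.iter <- 500; n.param <- 20
sims <- matrix(rnorm(n.chains*n.iter*n.param), ncol=n.param)   # fake draws
chain <- factor(rep(1:n.chains, each=n.iter))
sims[chain==1,] <- sims[chain==1,] + .3    # mimic one poorly mixing chain
fit <- lda(sims, grouping=chain)           # which directions separate the chains?
proj <- predict(fit)$x                     # draws on the discriminant scale
plot(proj[,1], proj[,2], col=as.integer(chain), pch=".",
     xlab="discriminant 1", ylab="discriminant 2")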

Here's an example, a two-dimensional linear discriminant analysis projection of 10 chains simulated from a hierarchical mixture model:

[Figure: two-dimensional linear discriminant analysis projection of 10 chains from a hierarchical mixture model]

And here's another plot, this time showing the behavior of the chains near convergence, using discriminative component analysis:

[Figure: discriminative component analysis plot of the chains near convergence]

The next step, once these patterns are identified, would be to go back to the original parameters in the models and try to understand what's happening inside the chains.

Venna et al. have come up with what seems like a great idea, and it looks like it could be implemented automatically in Bugs etc. The method is simple and natural enough that probably other people have done it too, but I've never seen it before.

P.S. I wish these people had sent me a copy of their paper years ago so I didn't have to wait so long to discover it.

Garbage Time

Phil amusingly introduces the basketball term "garbage time" to refer to the point where a discussion thread reduces to back-and-forth arguments without much hope of further progress.

That got me thinking how garbage time starts at different points on different blogs. Here at Statistical Modeling, it doesn't happen much at all. The Freakonomics blog often has high-quality comments (as I discussed here), but there are so many of them that there's not really a chance for progress to be made in the comment threads. The posts at 538 get lots of comments, but garbage time there usually starts around comment #2 or so.

Alex Tabarrok posts this beautiful graph that was prepared for the ultimate in bureaucratic institutions:


[Graph: see Tabarrok's post]

Learn to program!

I often run into people who'd like to learn how to program, but don't know where to start. Over the past few years, there has been an emergence of interactive tutorial systems, where a student is walked through the basic examples and syntax.

  • Try Ruby! will teach you Ruby, a Python-like language that's extremely powerful when it comes to preprocessing data.
  • WeScheme will teach you Scheme, a Lisp-like language that makes writing interpreters for variations of Scheme very easy.
  • Lists by Andrew Plotkin is a computer game that requires you to be able to program in Lisp. Lisp is the second-oldest programming language (after Fortran), but Ruby and Python do most of what Lisp has traditionally been useful for.

Maybe there will be a similar tool for R someday!

Thanks to Edward for the pointers!

Linking the unlinkable

Two of the bloggers I find the most entertaining and thought-provoking are Phil Nugent and Steve Sailer. I don't know that they would agree with each other on anything, but they do have one thing in common, which is that they like to review movies. Anyway, each of them has a super-long blogroll, and what I'm wondering is: what's the shortest set of links that will take you from Nugent to Sailer (or vice versa)? It has to be a series of links going from one to the other--i.e., it's not enough that both link to the same page (Arts & Letters Daily, in case you're wondering).

I'm hoping that a long long chain is needed--it's too much to hope that you just "can't get there from here," but I'm pessimistically guessing that, the Internet being what it is, you can get there in two links.

P.S. I wasted a few more minutes and found that Nugent links to 2 Blowhards, who links to Sailer. So that's it. A bit of a letdown, but I guess inevitable given the huge number of links on these guys' pages. I'm hoping that Nugent will hear about this and eliminate his 2 Blowhards link, thus making my linking question more interesting. And, believe me, 2 Blowhards is not nearly as interesting as Nugent or Sailer.

Gueorgi Kossinets writes:

We have an opening in the Content Ads Quality group at Google (Mountain View).

Interested readers can email their resume directly to me, which should speed up the process. I will also be happy to talk informally about the type of work we do, Google culture, etc.


Update on estimates of war deaths

I posted a couple days ago on a controversy over methods of counting war deaths. This is not an area I know much about, and fortunately some actual experts (in addition to Mike Spagat, who got the ball rolling) wrote in to comment.

Their comments are actually better than my original discussion, and so I'm reposting them here:

Bethany Lacina writes:

I didn't work on the Spagat et al. piece, but I'm behind the original battle deaths data. Your readers might be interested in the "Documentation of Coding Decisions" available at the Battle Deaths Database website. The complete definition of "battle deaths"--admittedly a tricky concept--starts on page 5. The discussion of Guatemala starts on page 219.

The goal of the Documentation is to preserve all the sources we used and the logic of how we interpreted them. If you or any of your readers know of sources we haven't consulted, for any conflict, it would be terrific to hear about them: battledeaths@prio.no

Amelia Hoover writes:

This debate is missing a key part -- namely, any sort of awareness that there are estimation methods out there that improve on both surveys (usual stalking horse of Spagat, et al.) and convenience data such as press reports (usual stalking horse of many other people).

Spagat et al. are more or less correct about all of the many, many problems with survey data. They're right to criticize OMG (OMG!). But this isn't, or at any rate shouldn't be, a debate between survey and convenience methods.

The authors dismiss (at page 936; again in footnote 2) estimation techniques other than retrospective mortality surveys and "collation of other reports". But while it's true that demographers often (usually? Help me out here, demographers) use retrospective survey data in their analyses, there's also a long-standing literature that uses census data instead, matching across sources in order to model (a) patterns of inclusion in convenience sources and (b) the number of uncounted cases. This method accurately counts deer, rabbits, residents of the United States, children with various genetic disorders, and HIV patients in Rome (to name a few examples I can think of) -- and, yes, also conflict-related deaths.

Bethany Lacina's link to the PRIO documentation is really interesting on this point. For El Salvador, the case with which I'm most familiar, PRIO's best estimate is 75,000 total deaths -- 55,000 battle deaths and 20,000 "one sided" deaths. I think this is reasonable-ish (maybe the total is between 50,000 and 100,000?), but there's no actual evidence to support such a number. The sources PRIO cites are expert guesses, rather than statistical analyses of any sort.

PRIO's El Salvador estimates are based on *neither* documented/documentable convenience data (e.g., press reports, NGO reports) *nor* survey data. The United Nations-sponsored Truth Commission for El Salvador's list of documented (and partially documented) deaths includes about 14,000 total deaths, many of which are duplicates. Two other NGO databases include about 6,000 and about 1,500 deaths, respectively. Again, there's significant overlap and many duplicates. Yet no one imagines that the total deaths in this conflict were 21,500. In the Salvadoran case as in many others, inclusion in the data is incredibly biased toward urban, educated, and politically active victims. (They're also biased in any number of other ways, of course.)

Prof. Gelman is right to point out the discrepancy between the Guatemala survey numbers, the Guatemala convenience (PRIO) numbers, and the number that most people cite as the best approximation for Guatemala (200,000). Importantly, that "200,000" is based in large part on census numbers. (See http://shr.aaas.org/mtc/chap11.html and http://shr.aaas.org/guatemala/ceh/mds/spanish/toc.html, statistical analyses from the Commission for Historical Clarification, Guatemala's Truth Commission.) So why ignore census correction methods?

Given that discrepancies between survey and convenience data are very often dwarfed by discrepancies between those numbers and the numbers we believe to be correct, I worry that the surveys-versus-convenience-data fight is more about protecting academic projects and prerogatives than about actually finding the correct answer.

Romesh Silva writes:

The claim that demographers often/usually use retrospective mortality surveys in their analyses is a bit off the mark. It looks like it is borne of some confusion in some parts of the academy between the methods of demographers and epidemiologists. Broadly speaking, demographers use a wide array of sources, including population censuses, vital registration systems, demographic surveillance systems, and surveys (of all flavors: longitudinal, panel, and retrospective). In the field of conflict-related mortality, demographers have actually relied almost exclusively on sources other than surveys. For example, Patrick Heuveline and Beth Daponte have used population censuses (and voter registration lists) in Cambodia and Iraq, respectively, and demographers at the ICTY (Helge Brunborg and Ewa Tabeau) have used various types of "found data" which equate to (incomplete) registration lists alongside census correction methods. Distinguished demographers Charles Hirschman and Sam Preston were in the minority amongst demographers when they used a household survey to estimate Vietnamese military and civilian casualties between 1965 and 1975. The folks who routinely use surveys in the field of conflict-related mortality are epidemiologists, not demographers. The folks at Johns Hopkins, Columbia's Mailman School of Public Health, Harvard Humanitarian Initiative, Physicians for Human Rights, MSF, Epicentre, etc., who use variants of the SMART methodology with a two-stage cluster design, are epidemiologists. This design and methodology has been coarsely adapted from a back-of-the-envelope method used to evaluate vaccination coverage in least-developed countries. However, epidemiologists at the London School of Hygiene and Tropical Medicine have recently noted that this method "tends to be followed without considering alternatives" and that "there is a need for expert advice to guide health workers measuring mortality in the field" (see http://www.ete-online.com/content/4/1/9).

I just thought it might help to put this all in one place.
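P.S. For readers unfamiliar with the census-correction idea that Hoover mentions: the simplest version is two-list capture-recapture, which uses the overlap between lists to estimate how many deaths neither list caught. A toy calculation (all numbers invented, nothing to do with the actual lists discussed above):

n1 <- 10000         # deaths documented on list 1 (hypothetical)
n2 <- 5000          # deaths documented on list 2 (hypothetical)
m <- 1000           # deaths appearing on both lists (hypothetical)
N.hat <- n1*n2/m    # Lincoln-Petersen estimate of the total, counted and uncounted
N.hat               # 50000, far more than the 14000 distinct names on the two lists

The serious versions use three or more lists and model the dependence among them, but the logic is the same.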

Stephen Dubner quotes Gary Becker as saying:

According to the economic approach, therefore, most (if not all!) deaths are to some extent "suicides" in the sense that they could have been postponed if more resources had been invested in prolonging life.

Dubner describes this as making "perfect sense" and as being "so unusual and so valuable."

When I first saw this I was irritated and whipped off a quick entry on the sister blog. But then I had some more systematic thoughts about how Becker's silly-clever statement, and Dubner's reaction to it, demonstrate several logical fallacies that I haven't seen isolated before.

No, it's not true that most deaths are suicides

I'll get to the fallacies in a moment but first I'll explain in some detail why I disagree with Becker's statement. The claim that most deaths are suicides seemed evidently ridiculous to me (not just bold, counterintuitive, taboo-shattering, etc., but actually false), and my inclination in such settings is to mock rather than explicate--but Becker and Dubner are smart guys, and if they can get confused on this topic, I'm sure others can too.

The following is the last paragraph in a (positive) referee report I just wrote. It's relevant for lots of other articles too, I think, so I'll repeat it here:

Just as a side note, I recommend that the authors post their estimates immediately; I imagine their numbers will be picked up right away and be used by other researchers. First, this is good for the authors, as others will cite their work; second, these numbers should help advance research in the field; and, third, people will take the estimates seriously enough that, if there are problems, they will be uncovered. It makes sense to start this process now, so if anything bad comes up, it can be fixed before the paper gets published!

I have to admit that I'm typically too lazy to post my estimates right away; usually it doesn't happen until someone sends me an email request and then I put together a dataset. But, after writing the above paragraph, maybe I'll start following my own advice.

Conflict over conflict-resolution research

Mike Spagat writes:

I hope that this new paper [by Michael Spagat, Andrew Mack, Tara Cooper, and Joakim Kreutz] on serious errors in a paper on conflict mortality published in the British Medical Journal will interest you. For one thing, I believe that it is highly teachable. Beyond that, I think it's important for the conflict field (if I do say so myself). Another aspect of this is that the BMJ is refusing to recognize that there are any problems with the paper. This seems to be sadly typical behavior of journals when they make mistakes.

Spagat et al's paper begins:

In a much-cited recent article, Obermeyer, Murray, and Gakidou (2008a) examine estimates of wartime fatalities from injuries for thirteen countries. Their analysis poses a major challenge to the battle-death estimating methodology widely used by conflict researchers, engages with the controversy over whether war deaths have been increasing or decreasing in recent decades, and takes the debate over different approaches to battle-death estimation to a new level. In making their assessments, the authors compare war death reports extracted from World Health Organization (WHO) sibling survey data with the battle-death estimates for the same countries from the International Peace Research Institute, Oslo (PRIO). The analysis that leads to these conclusions is not compelling, however. Thus, while the authors argue that the PRIO estimates are too low by a factor of three, their comparison fails to compare like with like. Their assertion that there is "no evidence" to support the PRIO finding that war deaths have recently declined also fails. They ignore war-trend data for the periods after 1994 and before 1955, base their time trends on extrapolations from a biased convenience sample of only thirteen countries, and rely on an estimated constant that is statistically insignificant.

Here they give more background on the controversy. They make a pretty convincing case that many open questions remain before we can rely on survey-based estimates of war deaths. In particular, they very clearly show that the survey-based estimates provide no evidence at all regarding questions of trends in war deaths--the claims of Obermeyer et al. regarding trends were simply based on a statistical error. The jury is still out, I think, on what numbers should be trusted in any particular case.

Here's a summary of the data used by Obermeyer et al.:

[Figure: summary of the data used by Obermeyer et al.]

Who's on Facebook?

David Blei points me to this report by Lars Backstrom, Jonathan Chang, Cameron Marlow, and Itamar Rosenn on an estimate of the proportion of Facebook users who are white, black, hispanic, and asian (or, should I say, White, Black, Hispanic, and Asian).

Funding research

Via Mendeley, a nice example of several overlapping histograms:

[Figure: several overlapping histograms]

The x axis is overlabelled, but I don't want to nitpick.

Previous post on histogram visualization: The mythical Gaussian distribution and population differences

Update 12/21/09: JB links to an improved version of the histograms by Eric Drexler below. And Eric links to the data. Thanks!

[Figure: Eric Drexler's improved version of the histograms]

Dan Goldstein points to a draft article by Andreas Graefe and J. Scott Armstrong:

Dean Eckles writes:

I have a hopefully interesting question about methods for analyzing varying coefficients as a way of describing similarity and variation between the groups.

In particular, we are studying individual differences in susceptibility to different influence strategies. We have quantitative outcomes (related to buying books), and influence strategy (7 levels, including a control) is a within-subjects factor with two implementations of each strategy (also within-subjects).
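One natural way to set this up (a sketch based on my guess at the design, with invented variable names--not Eckles's actual data or model) is a multilevel model in which each strategy's effect varies by subject:

library(lme4)
set.seed(1)
# fake data: 100 subjects x 7 strategies (incl. control) x 2 implementations
d <- expand.grid(subject=factor(1:100),
                 strategy=factor(c("control", paste("s", 1:6, sep=""))),
                 rep=1:2)
suscept <- rnorm(100*7, sd=.5)   # true subject-by-strategy susceptibilities
d$y <- suscept[as.integer(d$subject:d$strategy)] + rnorm(nrow(d))
fit <- lmer(y ~ strategy + (1 | subject) + (1 | subject:strategy), data=d)
coef(fit)$subject                # each subject's estimated susceptibility profile

With enough data per subject, one could instead fit correlated varying slopes, (strategy | subject), and look at the estimated covariance of the strategy effects across subjects.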

Wired reports a great new opportunity to make money online by suing internet companies for revealing the data:

An in-the-closet lesbian mother is suing Netflix for privacy invasion, alleging the movie rental company made it possible for her to be outed when it disclosed insufficiently anonymous information about nearly half-a-million customers as part of its $1 million contest to improve its recommendation system.

I'm not sure whether the litigators have read this particular section of the Netflix prize rules:

To prevent certain inferences being drawn about the Netflix customer base, some of the rating data for some customers in the training and qualifying sets have been deliberately perturbed in one or more of the following ways: deleting ratings; inserting alternative ratings and dates; and modifying rating dates.

So yes, you can match a set of reviews with someone else, but how will you know that it's really a person and not a random coincidence? 0.5 million review traces give plenty of opportunity for a false positive match. Netflix learned from AOL's data release disaster, which resulted in a few people getting fired.
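To put the false-positive worry in numbers (all of which I'm inventing for illustration):

p <- 1e-6       # hypothetical chance a random customer's perturbed trace matches
n <- 5e5        # customers in the released data
n*p             # expected number of false-positive matches: 0.5
1 - (1-p)^n     # probability of at least one false match: about 0.39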

But this theme is important. Many internet companies provide free services in return for the ability to employ user data for profit. Andrew Parker looked at which companies make profit out of user data. Usually, the data is never given away, but just used to make other people's lives easier. Let's say that you bookmark a particular page--others won't see that you've done it, but they will see that there are people that find that page worthy of saving--therefore it can be listed higher up in search results.


A more problematic area is medicine. Wired reports that there is a market out there for medical records, and that anonymity protection isn't very secure.

Keeping medical data public would allow massive advances in medicine. For example, the Personal Genome Project seeks to analyze a number of volunteers in a lot of detail (see, for example, Steven Pinker's medical record). If a few million people did that, we'd know so much more about disease risks, the factors affecting them, the effectiveness of drugs and diet, and the effects of the genome.

One-sided disclosure gets many people worried--their insurance rates might go up, they might not get a job. It would help if everyone were doing it: nobody feels good being naked when others wear swimsuits.

But we should also ask ourselves as a society - what is insurance? Is insurance a protection against uncontrollable risk or is it an instrument of equality? Is genome our destiny or an uncontrollable risk?

Previous posts on this topic: EU data protection guidelines, Privacy vs Transparency.

Some scams

Tyler Cowen links to this article by Frank Stajano and Paul Wilson:

The success of many attacks on computer systems can be traced back to the security engineers not understanding the psychology of the system users they meant to protect. We [Stajano and Wilson] examine a variety of scams and "short cons" that were investigated, documented and recreated for the BBC TV programme The Real Hustle and we extract from them some general principles about the recurring behavioural patterns of victims that hustlers have learnt to exploit. We argue that an understanding of these inherent "human factors" vulnerabilities, and the necessity to take them into account during design rather than naïvely shifting the blame onto the "gullible users", is a fundamental paradigm shift for the security engineer which, if adopted, will lead to stronger and more resilient systems security.

I wasn't blown away by the theoretical arguments in the article, but the scams are fascinating.

Universities and $

In this article about college funding, Kevin Carey says something that I've long believed, which is that government-supported financial aid doesn't quite work how you might imagine: colleges can just raise their prices along with any aid packages that come along. The price tag for college is not fixed, and so what looks like a subsidy for low-income students can just end up being a way for universities to jack up their prices by a corresponding amount.

But Carey also says some things that don't convince me so much. My impression is that he just threw in all sorts of negative attitudes about universities, without thinking about how they all fit together.

In discussing this, I'm not trying to pick on Kevin Carey, who makes excellent points about the desirability of publicly available information on what students actually learn in college. My point here is to use this generally fine article to highlight some ways in which people get confused when talking about higher education.

Carey writes:

Essentially, colleges don't figure out how much money they need to spend and then go get it. Instead, they get as much money as they can and then spend it. Since reputations are relational--the goal is to be better than the other guy--there is no practical limit on how much colleges can spend in pursuit of self-glorification. As former Harvard President Derek Bok wrote, "Universities share one characteristic with compulsive gamblers and exiled royalty: There is never enough money to satisfy their desires."

I agree that this describes colleges, and I'll take Bok's word for it that it describes compulsive gamblers and exiled royalty too. But doesn't it really describe almost anybody? I mean, who among us, Ubs excepted, figures out how much money they need to spend and then goes and gets it? The much much more common pattern, I think, is that people get what jobs they can do and, ideally, want to do, and then if they need more money, sure, they try to get more. But when people make more, they tend to spend more and feel the need for even more, etc. I don't see at all what's special about universities here--this just seems like a cheap shot to me. Universities are like other organizations: they're happy to take money that people are willing to give to them. I mean, I don't see Apple saying, "Hey, we have enough money--we're gonna give out i-pods for free."

p = 0.5

In the middle of a fascinating article on South Africa's preparations for the World Cup, R. W. Johnson makes the following offhand remark:

Any minute now the usual groaning will be heard from teams which claim that they, uniquely, have been drawn in a 'group of death'. What is the point, one might ask, in groaning about a random draw? Well, the trouble starts there, for the draw is not entirely random. In practice, seven teams are seeded, according to how well they've been doing in international matches, along with an eighth team, the host nation, whose passage into the second round is thus made easier - on paper. The draw depends on which balls rise to the top of the jar and thus get plucked out first; but it's rumoured that certain balls get heated in an oven before a draw, thus guaranteeing that they will bubble to the top. The weakest two teams aside from South Africa and North Korea are South Korea and New Zealand. The odds are, of course, heavily against any two or more of these bottom four finding themselves in the same group. If they do, we will have to be deeply suspicious of the draw.

This got me wondering. What is the probability that the bottom four teams will actually end up in different groups?

Given the rules as stated above, eight of the teams (including South Africa) start in eight different groups. There are 24 slots remaining. Now let's assign the next three low-ranking teams. The first has a 21/24 chance of being in one of the seven groups that does not have South Africa; the next has an 18/23 chance of being in one of the six remaining groups, and the next has a 15/22 chance of being in one of the five remaining. Combining these, the probability that the bottom four teams are in four different groups is (21/24)*(18/23)*(15/22) = 0.47, which makes the chance that at least two of them end up together 0.53. (Unless I did the calculation wrong. Such things happen.)
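In R, for anyone who wants to check:

p.different <- (21/24)*(18/23)*(15/22)
p.different       # 0.47: the bottom four all land in different groups
1 - p.different   # 0.53: at least two of them share a group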

So, no, I don't think that if two of these teams happen to find themselves in the same group, we will have to be deeply suspicious of the draw.

P.S. The 47% event happened: the four bottom-ranked teams are in different brackets. So we can breathe a sigh of relief.

Lowering the minimum wage

Paul Krugman asks, "Would cutting the minimum wage raise employment?" The macroeconomics discussion is interesting, if over my head.

But, politically, of course nobody's going to cut the minimum wage. Can you imagine the unpopularity of a minimum wage cut during a recession? I can't imagine that all the editorial boards of all the newspapers in the country could convince a majority of Congress to vote for this one, whatever its economic merits.

Which makes me wonder why the idea is being discussed at all. Is it an attempt to shoot down a minimum wage increase that might be in the works? Krugman mentions that Serious People are proposing a minimum wage cut, but he doesn't mention who those Serious People are. I can't imagine that they're serious about thinking this might happen.

Other voices, other blogs

What follows is a "meta" sort of discussion, so I'll put it below the fold and most of you can skip it.

Jimmy pointed me to this news article. My reaction to this is that the standards in teaching are low enough that someone like Xiao-Li or me can be considered an entertaining lecturer. It would be a lot harder to get by in standup.

Four out of the last 15 posts on this blog have been related to climate change, which is probably a higher ratio than Andrew would like. But lots of people keep responding to them, so the principle "give the people what they want" suggests that another one won't hurt too much. So, here it is. If you haven't read the other posts, take a look at Andrew's thoughts about forming scientific attitudes, and my thoughts on Climategate and my suggestions for characterizing beliefs. And definitely read the comments on those, too, many of which are excellent.

I want to get a graphic "above the fold", so here's the plot I'll be talking about.
[Figure: probability distributions for warming]

The "All Else Equal" Fallacy, again

Here's the entry from the statistical lexicon:

The "All Else Equal" Fallacy: Assuming that everything else is held constant, even when it's not gonna be.

My original note about this fallacy came a couple years ago when New York Times columnist John Tierney made the counterintuitive claim (later blogged by Steven Levitt) that driving a car is good for the environment. As I wrote at the time:

These guys are making a classic statistical error, I think, which is to assume that all else is held constant. This is the error that also leads people to misinterpret regression coefficients causally. (See chapters 9 and 10 of our book for discussion of this point.) In this case, the error is to assume that the walker and the driver will be making the same trip. In general, the driver will take longer trips--that's one of the reasons for having a car, that you can easily take longer trips. Anyway, my point is not to get into a long discussion of transportation pricing, just to point out that this seemingly natural calculation is inappropriate because of its mistaken assumption that you can realistically change one predictor, leaving all the others constant.

I hadn't thought much about this, but then I saw that Levitt repeated this error in his new Freakonomics book and on his blog, where he writes:

Unmasked

This story makes me think of a few things:

The lively discussion on Phil's entries on global warming here and here prompted me to think about the sources of my own attitudes toward this and other scientific issues.

For the climate change question, I'm well situated to have an informed opinion: I have a degree in physics, two of my closest friends have studied the topic pretty carefully, and I've worked on a couple related research projects, one involving global climate models and one involving tree ring data.

In our climate modeling project we were trying to combine different temperature forecasts on a scale in which Africa was represented by about 600 grid boxes. No matter how we combined these models, we couldn't get any useful forecasts out of them. Also, I did some finite-element analysis many years ago as part of a research project on the superheating of silicon crystals (for more details of the project, you can go to my published research papers and scroll way, way, way down). We were doing analysis on a four-inch wafer, and even that was tricky, so I'm not surprised that you'll have serious problems trying to model the climate in this way. As for the tree-ring analysis, I'm learning more about this now--we're just at the beginning of a three-year NSF-funded project--but, so far, it seems like one of those statistical problems that's easy to state but hard to solve, involving a sort of multilevel modeling of splines that's never been done before. It's tricky stuff, and I can well believe that previous analyses will need to be seriously revised.

Notwithstanding my credentials in this area, I actually take my opinions on climate change directly from Phil: he's more qualified to have an opinion on this than I am--unlike me, he's remained in physics--and he's put some time into reading up and thinking about the issues. He's also a bit of an outsider, in that he doesn't do climate change research himself. And if I have any questions about what Phil says, I can run it by Upmanu--a water-resources expert--and see what he thinks.

What if you don't know any experts personally?

It helps to have experts who are personal friends. Steven Levitt has been criticized for not talking over some of his climate-change speculations with climate expert Raymond Pierrehumbert at the University of Chicago (who helpfully supplied a map showing how Levitt could get to his office), but I can almost sort-of understand why Levitt didn't do this. It's not so easy to understand what a subject-matter expert is saying--there really are language barriers, and if the expert is not a personal friend, communication can be difficult. It's not enough to simply be at the same university, and perhaps Levitt realized this.

Twitteo killed the bloggio star

I've seen the future of Liebling optimality, and it ain't pretty.

A. J. Liebling (author of The Honest Rainmaker and many other classics) once boasted, "I can write faster than anyone who can write better and I can write better than anyone who can write faster." I've long admired this sentiment, as has political journalist Mickey Kaus, who has lived it by moving from magazine and book writing to blogging and, now, twittering.

I'm worried, though, now that Kaus's blogging has become more twitter-like, that he's approaching a logical extreme of Liebling optimality, which is to make his posts shorter and shorter and faster and faster until he's reduced to sitting at his keyboard, posting single characters, one at a time, very rapidly:

e...r...y...4...2...n...u...and so forth.

Some spots on the efficient frontier are more comfortable than others, no?

P.S. On the other hand, I'm sure Kaus still has another book or two or three within him, if he decides to move back in the other direction along that curve.

DIY data analysis: three fun examples

I recently came across some links showing readers how to make their own data analysis and graphics from scratch. This is great stuff--spreading power tools to the masses and all that.

From Nathan Yau: How to Make a US County Thematic Map Using Free Tools and How to Make an Interactive Area Graph with Flare. I don't actually think the interactive area graphs are so great--they work with the Baby Name Wizard but to me they don't do much in the example that Nathan shows--but, that doesn't really matter, what's cool here is that he's showing us all exactly how to do it. This stuff is gonna put us statistical graphics experts out of business^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^Ha great service.

And Chris Masse points me to these instructions from blogger Iowahawk on downloading and analyzing a historical climate dataset. Good stuff.

David points me to this news article by Dennis Cauchon, which begins:

Professor Risk

From the co-author of the celebrated Scholarpedia article on Bayesian statistics...

Visualizing UK budget

I was impressed by Where Does My Money Go?, an interactive visualization interface to the UK government budget. If one ignores the painful color scheme (see below), the interactivity of exploring the data is notable.

[Screenshot: the Where Does My Money Go? budget visualization]

One particularly interesting aspect is a regional spending breakdown, which shows which regions are contributing to the budget and which ones are disproportionately benefiting from it.

The British also have a great website that quantitatively analyzes the behavior in their parliament: Public Whip.

Question about Regression

Marcos Sanches writes:

Some (statistical) stories about BUGS

Hey, I don't think I ever posted a link to this. It's a discussion in the journal Statistics in Medicine of an article by David Lunn, David Spiegelhalter, Andrew Thomas, and Nicky Best. (Sorry but I can't find the Lunn et al. article online, or I'd link to it.) Anyway, here's my discussion. Once upon a time . . .

I first saw BUGS in a demonstration version at a conference in 1991, but I didn't take it seriously until over a decade later, when I found that some of my Ph.D. students in political science were using Bugs to fit their models. It turned out that Bugs's modeling language was ideal for students who wanted to fit complex models but didn't have a full grasp of the mathematics of likelihood functions, let alone Bayesian inference and integration. I also learned that the modular structure of BUGS was a great way for students, and researchers in general, to think more about modeling and less about deciding which conventional structure should be fit to data.

Since then, my enthusiasm for BUGS has waxed and waned, depending on what sorts of problems I was working on. For example, in our study of income and voting in U.S. states [1], my colleagues fit all our models in BUGS. Meanwhile we kept running into difficulty when we tried to expand our model in different ways, most notably when going from varying-intercept multilevel regressions, to varying-intercept, varying-slope regressions, to models with more than two varying coefficients per group. Around this time I discovered lmer [2], a function in R which fits multilevel linear and generalized linear models allowing for varying intercepts and slopes. The lmer function can have convergence problems and does not account for uncertainty in the variance parameters, but it is faster than Bugs and in many cases more reliable--so much so that Jennifer Hill and I retooled our book on multilevel models to foreground lmer and de-emphasize Bugs, using the latter more as a way of illustrating models than as a practical tool.
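For concreteness, here are the lmer calls for the two model types just mentioned (a sketch on fake data, with invented variable names):

library(lme4)
set.seed(1)
state <- factor(rep(1:50, each=20))
income <- rnorm(1000)
y <- rnorm(50, 0, .5)[state] + (1 + rnorm(50, 0, .3)[state])*income + rnorm(1000)
fit.1 <- lmer(y ~ income + (1 | state))            # varying intercepts
fit.2 <- lmer(y ~ income + (1 + income | state))   # varying intercepts and slopes

As noted above, the tradeoff is that lmer gives point estimates of the variance parameters rather than capturing the uncertainty in them.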

What does BUGS do best and what does it do worst?

Lots of accusations are flying around in the climate change debate. People who believe in anthropogenic (human-caused) climate change are accused of practicing religion, not science. People who don't are called "deniers", which some of them think is an attempt to draw a moral link with holocaust deniers. Al Gore referred to Sarah Palin as a "climate change denier," and Palin immediately responded that she believes the climate changes, she just doesn't think the level of greenhouse gases in the atmosphere has anything to do with it. What's the right word to use for people like her? And yes, we do need some terminology if we want to be able to discuss the climate change debate!

Differential Evolution MCMC

John Salvatier writes:

I remember that you once mentioned an MCMC algorithm based on Differential Evolution, so I thought you might be interested in this paper, which introduces an algorithm based on Differential Evolution and claims to be useful even in high dimensional and multimodal problems.

Cool! Could this be implemented in Bugs, Jags, HBC, etc?
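I haven't studied this particular paper's algorithm, but the basic differential-evolution proposal is easy to sketch (my illustration, not the paper's code): each chain proposes a jump along the difference between two other randomly chosen chains, which automatically adapts the proposals to the scale and orientation of the target.

de.mcmc <- function(log.post, n.chains=10, n.iter=2000, d=2) {
  gamma <- 2.38/sqrt(2*d)                        # standard scaling factor
  x <- matrix(rnorm(n.chains*d), n.chains, d)    # initial states
  lp <- apply(x, 1, log.post)
  draws <- array(NA, c(n.iter, n.chains, d))
  for (t in 1:n.iter) {
    for (i in 1:n.chains) {
      ab <- sample(setdiff(1:n.chains, i), 2)    # two other chains
      prop <- x[i,] + gamma*(x[ab[1],] - x[ab[2],]) + rnorm(d, 0, 1e-4)
      lp.prop <- log.post(prop)
      if (log(runif(1)) < lp.prop - lp[i]) {     # Metropolis accept/reject
        x[i,] <- prop
        lp[i] <- lp.prop
      }
    }
    draws[t,,] <- x
  }
  draws
}
draws <- de.mcmc(function(theta) -.5*sum(theta^2))  # example: bivariate normal target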

Jenny quotes Erica Wagner:

Isaac Bashevis Singer wrote for more than four decades on an Underwood portable. For him, his machine was a kind of first editor. "If this typewriter doesn't like a story, it refuses to work," he said. "I don't get a man to correct it since I know if I get a good idea the machine will make peace with me again. I don't believe my own words saying this, but I've had the experience so many times that I'm really astonished. But the typewriter is 42 years old. It should have some literary experience, it should have a mind of its own."

Hey, I've been writing for almost 42 years myself!

More to the point, the Singer quote reminds me of my own experience in doing mathematics. It's virtually impossible for me to write down a formula with pen on paper unless I understand what the formula means. The act of writing enforces rigor. It makes perfect sense to me that a similar thing would happen to Singer when typing stories.

A journalist contacted me to ask what I thought about this article by Marshall Burke, Edward Miguel, Shanker Satyanath, John Dykema, and David Lobell:

Like a lot of scientists -- I'm a physicist -- I assumed the "Climategate" flap would cause a minor stir but would not prompt any doubt about the threat of global warming, at least among educated, intelligent people. The evidence for anthropogenic (that is, human-caused) global warming is strong, comes from many sources, and has been subject to much scientific scrutiny. Plenty of data are freely available. The basic principles can be understood by just about anyone, and first- and second-order calculations can be performed by any physics grad student. Given these facts, questioning the occurrence of anthropogenic global warming seems crazy. (Predicting the details is much, much more complicated). And yet, I have seen discussions, articles, and blog posts from smart, educated people who seem to think that anthropogenic climate change is somehow called into question by the facts that (1) some scientists really, deeply believe that global warming skeptics are wrong in their analyses and should be shut out of the scientific discussion of global warming, and (2) one scientist may have fiddled with some of the numbers in making one of his plots. This is enough to make you skeptical of the whole scientific basis of global warming? Really?

"Orange" ain't so special

Mark Liberman comes in with a data-heavy update (and I mean "data-heavy" in a good way, not as some sort of euphemism for "data-adipose") on my comments of the other day. I'm glad to see that he agrees with me that my impressedness with Laura Wattenberg's observation was justified.

Yet more antblogging

[Photo: ant colony]

James Waters writes:

Equation search, part 2

Some further thoughts on the Eureqa program which implements the curve-fitting method of Michael Schmidt and Hod Lipson:

The program kept running indefinitely, so I stopped it in the morning, at which point I noticed that the output didn't quite make sense, and I went back and realized that I'd messed up when trying to delete some extra data in the file. So I re-ran with the actual data. The program functioned as before but moved much quicker to a set of nearly-perfect fits (R-squared = 99.9997%, and no, that's not a typo). Here's what the program came up with:

[Screenshot: the list of models Eureqa came up with]

The model at the very bottom of the list is pretty pointless, but in general I like the idea of including "scaffolding" (those simple models that we construct on the way toward building something that fits better) so I can't really complain.

It's hard to fault the program for not finding y^2 = x1^2 + x2^2, given that it already had such a success with the models that it did find.

Commenter Michael linked to a blog by somebody called The Last Psychiatrist, discussing the recent study by Rank and Hirschl estimating that half the kids in America in the 1970s were on food stamps at some point in their childhood. I've commented on some statistical aspects of that study, but The Last Psychiatrist makes some good points regarding how the numbers can and should be interpreted.

Hey, made ya look!

Russell's paradox, major league version

From Ubs:

How fast is Rickey? Rickey is so fast that he can steal more bases than Rickey. (And nobody steals more bases than Rickey.)

Maybe so, actually.
