Recently in Statistical graphics Category

1. Understanding the 'Russian Mortality Paradox' in Central Asia: Evidence from Kyrgyzstan

Short answer: alcohol and suicide.

2. Lumberjacks as a counterexample to the idea of a "risk premium"

They take lots of risks and don't get paid well for it.

3. Cell size and scale

This is a visualization you won't want to miss.

4. Three guys named Matt

5. The political philosophy of the private eye

A genre that was rendered obsolete in 1961 (but nobody realizes it).

Variations on the histogram

| 10 Comments

Lorraine Denby and Colin Mallows write:

It is usual to choose to make the bins in a histogram all have the same width. One could also choose to make them all have the same area. These two options have complementary strengths and weaknesses--the equal-width histogram oversmooths in regions of high density and is poor at identifying sharp peaks; the equal-area histogram oversmooths in regions of low density and so does not identify outliers. We describe a compromise approach which avoids both of these defects. We argue that relying on asymptotics of the Integrated Mean Square Error leads to inappropriate recommendations.

I'm so glad they wrote this article (it appeared recently in the Journal of Computational and Graphical Statistics)! I've thought for a long time that (a) histogram bars are typically too wide (for example, as set by default in software packages such as S and R), and (b) that the underlying problem was that people think of the goal of the histogram as to closely approximate the density function.

A key benefit of a histogram is that, as a plot of raw data, it contains the seeds of its own error assessment. Or, to put it another way, the jaggedness of a slightly undersmoothed histogram performs a useful service by visually indicating sampling variability. That's why, if you look at the histograms in my books and published articles, I just about always use lots of bins. I also almost never like those kernel density estimates that people sometimes use to display one-dimensional distributions. I'd rather see the histogram and know where the data are.

Denby and Mallows go far beyond my vague thoughts by considering histograms with varying widths and coming up with a particular algorithm. I'd like to try out their method on my own problems. Is there an R package out there?
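To make the two baseline options in the abstract concrete, here is a minimal Python sketch of equal-width and equal-area bin edges. It is not Denby and Mallows's compromise algorithm, which the post doesn't describe in enough detail to reproduce. The function names and the data are made up for illustration, and "equal area" is taken to mean bin edges at evenly spaced sample quantiles.

```python
def equal_width_edges(data, n_bins):
    """Edges that split the data range into n_bins intervals of equal width."""
    lo, hi = min(data), max(data)
    step = (hi - lo) / n_bins
    return [lo + k * step for k in range(n_bins)] + [hi]


def equal_area_edges(data, n_bins):
    """Edges at evenly spaced sample quantiles, so each bin holds about the
    same number of points (equal area when bar height is count / width)."""
    xs = sorted(data)
    n = len(xs)
    edges = []
    for k in range(n_bins + 1):
        # Linear interpolation between neighboring order statistics.
        pos = k * (n - 1) / n_bins
        i = int(pos)
        frac = pos - i
        nxt = xs[min(i + 1, n - 1)]
        edges.append(xs[i] + frac * (nxt - xs[i]))
    return edges


def bin_counts(data, edges):
    """Count points per bin; the last bin includes its right edge."""
    counts = [0] * (len(edges) - 1)
    for x in data:
        for b in range(len(counts)):
            last = b == len(counts) - 1
            if edges[b] <= x < edges[b + 1] or (last and x == edges[-1]):
                counts[b] += 1
                break
    return counts


# Made-up data: a tight cluster near 0 plus a few far-out values.
data = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 5.0, 9.0, 10.0, 20.0]
width_counts = bin_counts(data, equal_width_edges(data, 4))
area_counts = bin_counts(data, equal_area_edges(data, 4))
print(width_counts)
print(area_counts)
```

With these numbers the equal-width bins lump the whole cluster into one bar, while the equal-area bins give the three largest values a single wide bar, which is the tradeoff the abstract describes.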

Silly stat-based music video

| 1 Comment

Richard Morey writes:

I don't know if you are into this sort of thing, but I came across it on the web and thought it was entertaining. Essentially, it is a music video made up of visualizations of quantitative information. It follows a day in a worker's life. I suspect some of the data is real. Anyway, it is a creative use of data visualization. I don't know anything about the artist(s).

Better than a boxplot

| 11 Comments

I'd love it if someone else were to write my article, tentatively titled "Better than a boxplot," with the following abstract: "We demonstrate graphical options that dominate the boxplot. We hope that, once these alternatives are understood, boxplots are never used again." But I have a horrible feeling I'm going to have to write this article myself.

How to make graphs that work

| 9 Comments

Aleks pointed me to this good advice from Seth Godin on preparing graphs. Some snippets:

1. Don't let popular spreadsheets be in charge of the way you look . . . when you show me something exactly like something I've seen a hundred times before, what do you expect me to do? Here's a hint: Zzzzzz.

2. Tell a story

3. Follow some simple rules

- Time goes on the bottom, and goes from left to right

- Good results should go up on the Y axis. This means that if you're charting weight loss, don't chart "how much I weigh" because good results would go down. Instead, chart "percentage of goal" or "how much I lost."

4. Break some other rules . . .

When I visited AT&T Labs last month, I saw some beautiful poster-size maps from Yifan Hu, visualizing structures in large data sets. Here's TV Land:

uverse_1000_country_labels.png

The research is by Emden Gansner, Yifan Hu, Stephen Kobourov, Chris Volinsky, and you can download their article and more maps at the above link.

Regarding the request for a "good graphical way of showing changes in the distribution of a population among quantile categories," Antony Unwin sends in this:

MultBarsFatherSonIncome.png

He writes:

Graphing a transition matrix

| 3 Comments

Andy Baxter writes:

I wondered if you had a suggestion for a good graphical way of showing changes in the distribution of a population among quantile categories from one time period to another. I'm working on a project in which I need to show our district leadership the stability of various value-added estimates of a given teacher's effectiveness from year to year. For example, how many teachers in the first quintile remain in the first quintile 1 year later? I know I could probably just do it with a table, but I wondered if there was a better way to do it with a graph. Any ideas or links to good examples?

My reply:

I imagine there's been a lot of work on this general task: it's basically the same problem as summarizing transition matrices, which is a big issue in sociology. Anyway, here's my quick suggestion.

Label i as the starting quintile and j as the quintile one year later. You then have 25 data points, corresponding to the percentage of teachers that start in quintile i and end up in quintile j. Call these p_ij. The sum of the 25 p_ij's will be 100% (by definition).

The natural next step would be to make a scatterplot showing these 25 values, perhaps a circle in each grid point with the size of the circle at (i,j) being proportional to p_ij. But I have a slightly different idea which takes up a bit more space but might be more helpful in showing what you're looking for.

I'm thinking of a display with 5 narrow plots, side by side. Plots i=1,2,3,4,5 correspond to starting quintiles 1,2,3,4,5. Plot i has 5 arrows, each starting at position (0,i) and going to positions (1,j), j=1,2,3,4,5. The width of the j-th arrow here is proportional to p_ij. The separate plots can be pretty narrow because they are only going from 0 to 1 on the x-axis.

My suggestion is to give this a try. If it works out, please let me know--I can post the graph on the blog.
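Here is a rough Python sketch of the bookkeeping behind that display: it computes the 25 values p_ij and lists, for each of the five panels, the arrows to draw. The quintile data and the function names are hypothetical, and I'm assuming each arrow runs from x = 0 to x = 1. The drawing itself is left to whatever plotting tool you prefer.

```python
from collections import Counter


def transition_percentages(start, end, n_cat=5):
    """Return p[i][j]: percent of all units that start in category i+1
    and end in category j+1. The n_cat * n_cat entries sum to 100."""
    pairs = Counter(zip(start, end))
    total = len(start)
    return [[100.0 * pairs[(i, j)] / total for j in range(1, n_cat + 1)]
            for i in range(1, n_cat + 1)]


def arrow_segments(p):
    """One panel per starting category i. Each arrow runs from (0, i) to
    (1, j) with line width proportional to p_ij."""
    panels = []
    for i, row in enumerate(p, start=1):
        panels.append([{"from": (0, i), "to": (1, j), "width": w}
                       for j, w in enumerate(row, start=1)])
    return panels


# Hypothetical quintiles for ten teachers in two consecutive years.
year1 = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
year2 = [1, 2, 2, 3, 3, 3, 5, 4, 5, 4]
p = transition_percentages(year1, year2)
panels = arrow_segments(p)
for i, row in enumerate(p, start=1):
    print(i, row)
```

Dividing each row by its row total instead would give the conditional percentages (of teachers starting in quintile i, how many end in j), which may be closer to the question the district leadership is asking.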

I'm thinking of a graph with 25 lines, where the width of line ij is proportional to p_ij. The positions of the lines

How about a set of five graphs, one for each of the five "before" quintiles. Each graph has five lines showing the number of cases starting in

JoAnn Kuchera-Morin's Allosphere. See also here.

Everybody says OmniGraphSketcher is great, but I can't use it because I don't have a Mac. Is there anything that people recommend for Windows? I'm trying to do a lot of remote meetings with people, so this seems like it could be a useful tool.

OmniGraphSketcher

| 6 Comments

Aleks points me to this graph plotting program. I don't know anything about it, but, hey, maybe it's good.

Gapminder TV show

| No Comments

Mike Maltz writes:

This is an hour-long TV show, but well worth watching, even for those (like me) who have seen Rosling's presentations to the TED conference. It's in Swedish, but captioned in English.

Daniel Lakeland writes:

My wife sent me this link, saying how cool it looked. I [Lakeland] told her it was one of the worst things I'd seen in a long time...Apparently it won the Guardian's "Visualization Contest"...

Alfred Inselberg, the inventor of parallel coordinates, sent along this fascinating handout with a bunch of color graphs illustrating the power of the parallel-coordinates idea.

Here's a cool picture, along with Inselberg's caption:

par1.png

In the background is a dataset with 32 variables and 2 categories. On the left is the plot of the first two variables in the original order, on the right are the best two variables after classification. The algorithm discovers the best 9 variables (features) needed to describe the classification rule, with 4% error, and orders them according to their predictive power.

A couple more below:

I want to talk about some similarities between writing and statistical graphics. Just about everybody knows something about writing, and I'd like to help transfer some of this expertise to thinking about statistical graphics.

The story begins with some ugly pie charts I noticed the other day. I was commenting on them and suddenly realized . . . the graphs weren't as bad as I thought they were! To be more precise, the graphs had a lot of failings, but the sum total of all these problems wasn't so bad.

Here are the actual charts:

sanford1.PNG

sanford2.PNG

As I wrote earlier, these graphs have lots of obviously-fixable problems, most notably that the wedges aren't labeled directly. Instead, the reader has to go back and forth, back and forth, between the chart and the legend. On the other hand, the information is conveyed unambiguously.

I'd like to make the analogy to sloppy writing--misspellings, grammatical errors, sentence fragments and run-ons, garden-path sentences, distracting cliches, and all the rest. (All these "errors" can be used to good effect. No rule is absolute. For sure, baby. Much of the time, though, I think these really are mistakes rather than intentional use for emphasis or clarity.)

Why is sloppy writing a bad thing? For example, what's wrong with using "it's" instead of "its," or messing up subject-verb agreement, or losing track of an adverb's pointer, in a setting where the meaning is clear? The problem is that it creates work for your readers, who often have to double back to figure out the meaning. If you're Ezra Pound writing a poem, maybe you want to have that effect, but I don't think it's the goal of most journalists, news bloggers, etc.

OK, back to the pie charts. They could be worse, but they require a lot of work to read. Arguably, this criticism could be thrown at any graph: for example, I love line plots, but if you've never seen a line plot before, you'll struggle with it. The difference is that you can learn to read line plots, but you'll never be able to quickly read the pie charts shown above: no matter what, you have to go back and forth between the pie, the legend, the pie, the legend, and so forth, to keep it all in your mind at once.

To push the analogy further, I'm recommending what might be called the George Orwell approach to statistical graphics: the goal is to be clear as a window pane. This isn't the only option, though. There's the Chris Ware style: graphs that are tiny and nearly impossible to read, but if you stare at them for a long time you realize they actually make sense. Or the Martin Amis style: flashy gimmicks that make the graph fun to read even if you don't care so much about the subject. Or the Veronica Geng style: playing it straight while going over the top at the same time. And so forth.

I think some of the confusion that has arisen from Ed Tufte's work is that people read his book and then want to go make cool graphs of their own. But cool like Amis, not cool like Orwell. We each have our own styles, and I'm not trying to tell you what to do, just to help you look at your own writing and graphics so you can think harder about what you want your style to be.

P.S. Yes, yes, I'm sure I have various usage, grammatical, and stylistic errors above. Give me a break, man! It's just a blog entry. More to the point . . . by now you should trust me enough to think, when you see something discordant, that maybe I've done it on purpose!

P.P.S. Another issue is cost or effort. It wasn't necessarily worth it for Tom Schaller to learn a bunch of new graphical tools just to make his blog entry slightly easier to read. In my discussion above, I'm ignoring the investment in time required to think in terms of graphics and to learn the relevant software.

Ben Hyde pointed me to this data-based dating site. I have no comments on how it works for dates, but they have a lot of fun maps, for example this:

Map: "Are some human lives worth more than others?" (268,864 people have answered)

And this:

Map: "If you knew for sure you would not get caught, would you commit murder for any reason?" (359,761 people have answered)

This is great; I can't resist giving a couple more:

Beta distribution explorer

| 1 Comment

Brendan O'Connor created a small applet that allows exploring the beta distribution interactively (just hit arrow keys on the keyboard):

beta_explorer.png

This is a good example of what interactive visualization can do - Andreas Buja was also showing some cool examples some time ago.

He also has source available (for Processing).

Should Mark Sanford resign?

| 6 Comments

At our sister blog, Tom Schaller says no:

Is Sanford a cad for bolting his family on Father's Day weekend? Of course, but that is a private, moral failing, rather than a failure of public duty. . . .

I [Schaller] oppose most of what Mr. Sanford stands for politically. His showy rejection of federal stimulus money targeted for his state was a crass publicity stunt designed to garner national attention for Mr. Sanford at the expense of his constituents, many of whom are struggling economically. . . . Should Mr. Sanford's ambitions founder on the shoals of a personal scandal, however, yet another opportunity will be lost to establish the long-overdue separation between private comportment and public service. So here's hoping he doesn't resign or, if he does, it is a matter of personal choice rather than him bowing to political pressure.

I see where Schaller is coming from. Lots of people have complicated personal lives, and it's not clear at all that these difficulties have much if anything to do with governing. But I don't know if I agree with him on the wall of separation between private comportment and public service.

Consider the Sanford case. Schaller's a Democrat, so he can evaluate Sanford on his policies. But if Schaller were a Republican, he might very well want Sanford out of there because he tarnishes the brand, makes the party a laughingstock, etc. Also makes it harder for Sanford to convincingly follow a "family values" agenda which Schaller (if he were a Republican) might want. These are legitimate concerns for a Republican to have. Even if you don't think Sanford's personal indiscretions are important, you might want him gone and replaced by a more effective Republican. Just as, from the other direction, a Democrat would've preferred a zipped-fly version of Bill Clinton.

Some time ago FlowingData had an article on visualizing tables - which really is about visualizing spreadsheets in terms of correlations between columns. While Circos generates very colorful displays:

circos.png

Today I was impressed by a much cleaner and Tuftier variant on the theme by Mike Bostock, called Dependency Tree:

dependency-tree.png

Click on the link, it's interactive. Jeff Heer and Bostock also have a new JavaScript visualization toolkit out, ProtoVis, which simplifies the creation of such stuff. The computer scientist in me finds this development very cool. But I still like my correlation matrices.

Sometimes people think it's a disaster when you have more predictors than data points, but I always point out that, no, it's better to have 9 predictors than just 1 or 2. After all, if you really wanted just 1 or 2, you could just throw out most of your data!

Nate's chart is excellent, especially the ordering of the candidates in order of the percent favoring resignation:

sanford2.PNG

I also like the gratuitous exclamation marks, which add fun value without actually making the graph any harder to read. The key reason this works is that Nate wisely did not fill in the blank squares with "No!"s.

My only comments are:

The only thing that puzzles me about this article (sent to me by Chris Wiggins) is that at first it's presented as new: "The trend is buried deep in United States census data . . ." A couple paragraphs down, the article explains that these patterns were published last year by Lena Edlund and Doug Almond (who presented the results in our quantitative political science seminar). In any case, it's an excellent news article and discusses the issues well. The only thing I'd like to see are some sample sizes, so that students who are given this article to read can compute the standard errors on their own.

Also, I have a couple problems with their graph. First, I'm not a fan of expressing sex ratios as #boys per 100 girls. To me, it's clearer just to give %girls (or %boys) as a straight number: 48.8% or whatever. Second, it's a mistake to make these as bar graphs starting at zero. Here, zero is not a reasonable baseline: it's not like you're really expecting to see zero girl births. I appreciate that they were trying to make a pretty graph, but in this case I'd go with a simple dot plot with +/- 1 standard error bars on the points. Or, better still, a line plot with time on the x-axis (one point for each decade), with lines connecting the dots for each ethnic group, and also vertical lines indicating standard errors.
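For students who want to try the conversion and the standard error themselves, here is a short Python sketch. The ratio and the sample size below are made up (the article doesn't report sample sizes), and the standard error uses the usual binomial formula, which treats births as independent.

```python
import math


def boys_per_100_girls_to_pct_girls(ratio):
    """Convert 'boys per 100 girls' to the percentage of births that are girls."""
    return 100.0 * 100.0 / (100.0 + ratio)


def se_pct(pct, n):
    """Binomial standard error, in percentage points, of a percentage from n births."""
    p = pct / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n)


# Hypothetical inputs: 105 boys per 100 girls, observed in 2,000 births.
pct_girls = boys_per_100_girls_to_pct_girls(105.0)
se = se_pct(pct_girls, 2000)
print(f"{pct_girls:.1f}% girls, +/- {se:.1f} (1 s.e.)")
```

The point and its +/- 1 standard error interval are exactly what would go on the dot plot or line plot.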

Line plots are the best, and it's great when you can put time on the x-axis.

A great image viewer

| 5 Comments

I used to display my .png files using the default viewer in Windows. Then Aleks told me about Irfanview. It's much better.

Daniel Becker's Random-Walk graphically demonstrates how different distributions can be generated with physical processes: Normal distribution falls out of a Pachinko machine, and Poisson from a dart-throwing process. He also shows how pseudo random number generators have higher-order correlations within them. Pretty!

[via Infosthetics]

Mojca has pointed me to Paul Nylander's visualizations. He's using raytracing software and Mathematica to create pieces of visualization art:

Chen-Gackstatter.jpg

I tried looking for examples that could be useful in the usual statistical practice, and his example of Horseshoe magnetic fields demonstrates distributions on a surface:

OppositeRotate.gif OppositeTranslate.gif

Michael Maltz wrote an excellent article on visualizing data in criminal justice. You can read the article yourself, but I just wanted to comment on something he writes on page 12 of his article: "Promote Data Visualization as a First Step." I completely agree--and he has some good examples to make this point--but I'd also like to promote data visualization as an intermediate step (to check fit of model to data) and a last step (to summarize the inferences from a fitted model).

Traffic map update

| No Comments

Commenters pointed out that the map to which I linked yesterday actually shows the number of people entering each station, not, as implied by the visual structure of the map, the traffic on the subway lines between the stations. I agree with the commenters that line width doesn't seem like a good way to show information that is at the station level. Better to use differently-sized circles or something like that.

But this sets up a fun statistical problem: estimate the traffic on the subway lines given the data on the number of people entering each station (along with any other available data, and whatever modeling assumptions are needed to complete the picture). I guess there must be people at the transportation dept. doing this sort of thing, but I wouldn't be surprised if they're using deterministic solve-for-x algorithms that could be improved by a more statistical approach.

P.S. Richard Clegg writes in:

As you surmised this is a well-studied problem. Actually in the field of road transport this would be broken into two separate but related problems -- the origin-demand matrix estimation problem (given a set of observations what set of demands from origin to destination best explain them) and the related traffic assignment problem (given an origin demand matrix and a network with limited capacity on links how does one assign traffic onto network links).

In particular the traffic assignment problem has some attractive statistical properties if certain assumptions are made.

I replied:

About 25 yrs ago I worked on finite-element methods for thermal models, so I figured the mathematics would be similar. As noted on the blog, I suspect that inclusion of some stochastic elements to the problem could improve things as well as extend the range of problems to which these methods could be applied.

And Clegg added the following:

For the origin-demand matrix problem there are a variety of approaches both frequentist and Bayesian -- I am far from an expert here (but hope to be more expert soon since I am involved with a grant proposal on the subject which I am hoping will be funded). For the traffic assignment problem there are a number of approaches, "deterministic" and "stochastic" to varying degrees. In the stochastic approach you make certain assumptions about how users disperse across routes of different costs (by assuming an error distribution on the user's perception of route costs -- as it turns out, a Gumbel distribution often produces "nice" answers). There are even the so-called "doubly stochastic" problems where the demand from each origin to each destination is assumed to have a distribution and then users perceive routes imperfectly according to another distribution. If you google "Stochastic user equilibrium" you will find more about the problem than you ever wanted to know.
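One reason the Gumbel assumption gives "nice" answers is that independent Gumbel errors on perceived costs lead to closed-form logit shares across routes. The Python sketch below shows only that route-split step for fixed costs. It is not a full equilibrium solver, since in a real network the costs would themselves depend on the flows. The parameter name theta and the travel times are assumptions for illustration.

```python
import math


def logit_route_shares(costs, theta=1.0):
    """Share of travellers on each route when perceived cost is the true cost
    plus independent Gumbel noise. theta (hypothetical name) sets how sharply
    travellers respond to cost differences."""
    best = min(costs)
    # Subtract the smallest cost before exponentiating for numerical stability.
    weights = [math.exp(-theta * (c - best)) for c in costs]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical travel times (minutes) on three routes between one
# origin-destination pair.
shares = logit_route_shares([20.0, 22.0, 30.0], theta=0.5)
print([round(s, 3) for s in shares])
```

A large theta sends nearly everyone to the cheapest route, which is the deterministic limit; a small theta spreads travellers almost evenly.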

Sounds good. I also expect there's some room for improvement using hierarchical modeling.

Aleks points me to this map (link from here) showing subway ridership using line widths.

It's fun, and it would be good to do something similar with road traffic using available data. The statistical problems would be interesting too: road traffic data are incomplete, so you'd want to do some modeling to get reasonable numbers over time. This could be a great project, actually.

My only comment on the subway graphs is that the B and D trains seem to have disappeared below 145th St.:

145th.png

Maybe their lines could be run parallel to the A/C on the graph? There's a similar problem with the E/F in Queens. Also it looks to me like two of the 1/2/3 lines have disappeared below 96th St.

P.S. More here.

Eric Gilbert and Karrie Karahalios have a paper on tie strength, distinguishing between strong and weak ties in social networks, published at the Computer and Human Interaction conference. Eric is one of the recipients of the 2009 Google fellowships. There are some neat ideas there:

Presenting the distributions of predictors
predictors.png

Pretty, informative and compact.

Distribution of outcomes
outcomes.png

Not sure the median is particularly interesting.

Graphical model summary
model summary.png

They describe it as:

The predictive power of the seven tie strength dimensions. [...] A dimension's weight is computed by summing the absolute values of the coefficients belonging to it. The diagram also lists the top three predictive variables for each dimension. [...]

While the aggregation of coefficients in the same category is nice, there are some problems summing betas together. Rarely occurring values with huge betas are often an artifact of overfitting and not of informativity, and betas for continuous predictors are strongly affected by scale. Consider these betas:

Days since last communication -0.76
Days since first communication 0.755
Intimacy × Structural 0.4
Wall words exchanged 0.299

So, the top two predictors are probably correlated, and opposite to one another - resulting in runaway absolute betas.
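A quick numerical illustration of both concerns, using the two largest coefficients listed above. This is only a sketch: the paper's data aren't available here, so the "moving together one-for-one" case and the days-to-weeks rescaling are assumptions chosen to show the arithmetic, not claims about the fitted model.

```python
def dimension_weight(betas):
    """Weight as described in the quoted caption: sum of absolute coefficients."""
    return sum(abs(b) for b in betas)


# The two largest coefficients quoted above.
days_since_last = -0.76
days_since_first = 0.755

abs_sum = dimension_weight([days_since_last, days_since_first])

# Assumption for illustration only: if the two predictors moved together
# one-for-one, a one-unit change in both would shift the prediction by
# the sum of the signed coefficients.
net_effect = days_since_last + days_since_first

# Scale dependence: measuring a predictor in weeks instead of days
# multiplies its coefficient by 7 without changing the fitted model.
beta_per_week = days_since_last * 7

print(abs_sum, net_effect, beta_per_week)
```

The absolute sum is large while the signed sum nearly cancels, and a change of units alone changes the apparent importance sevenfold.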

I've suggested the concept of net leverage a few years ago in a natural language binary outcome setting as an attempt to improve the presentation of feature importance in regression models, but this topic is worth revisiting.

A while ago I posted some maps based on the Pew pre-election polls to estimate how Obama and McCain did among different income groups, for all voters and for non-Hispanic whites alone. The next day the blogger and political activist Kos posted some criticisms. I disagree with one of Kos's suggestions--he wanted me to rely on exit polls, but I don't actually see them as more reliable than the Pew pre-election polls--but he pointed out some serious problems with my maps. I realized that some fixes were in order. Most importantly:

- My maps would be improved by replacing solid red and blue with continuous shading to distinguish between landslides and narrow margins.

- I needed a more flexible model that would allow the nonlinear pattern of voting and income to vary by state. (In the previous model, I fit a nonlinear pattern (by including a separate logistic regression coefficient for each of the five income categories) but allowed the states to vary only with intercepts and slopes. In the new model, we're letting all five coefficients vary by state.)

During the past couple of months, I've been working on this when I've had a spare hour or two, and now I think we have something reasonable to share. Here it is:

10graphs2008income.png

States colored deep red and deep blue indicate clear McCain and Obama wins; pink and light blue represent wins by narrower margins, with a continuous range of shades going to pure white for states estimated at exactly 50/50.

General comments

The maps are based on a model fit to four ethnic categories (non-Hispanic white, black, Hispanic, other), but I'm only displaying total and non-Hispanic whites. The others are interesting too but they're based on a lot less data: they're my (current) best estimates but are much more reliant on model extrapolation.

The estimates are entirely based on the Pew data--except that we use Census-based voter turnout estimates to reweight estimates in each state, and we shift each state's estimates to be consistent with the actual election outcome in the state. (For example, if our estimate says that Obama got 48% of the total vote in a state (adding up voters from all income and ethnicity categories), and he actually got 46%, then we'd pull down our estimates for each category so that the estimated total is 46%.)
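The post doesn't say exactly how the category estimates are pulled toward the actual result, so the following Python sketch is one plausible way to do it, not the method actually used: add the same constant to every category on the logit scale, chosen so the turnout-weighted total hits the target. The category estimates and weights are made up, and the function names are hypothetical.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def inv_logit(x):
    return 1.0 / (1.0 + math.exp(-x))


def shift_to_total(estimates, weights, target, tol=1e-10):
    """Add one constant to every category on the logit scale so that the
    turnout-weighted average equals target. Solved by bisection."""
    def total(delta):
        return sum(w * inv_logit(logit(p) + delta)
                   for p, w in zip(estimates, weights)) / sum(weights)

    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if total(mid) < target:
            lo = mid
        else:
            hi = mid
    delta = (lo + hi) / 2.0
    return [inv_logit(logit(p) + delta) for p in estimates]


# Hypothetical Obama shares in five income categories and their shares of
# the electorate; the weighted total is 48%, the actual result 46%.
est = [0.58, 0.52, 0.47, 0.44, 0.41]
wts = [0.15, 0.20, 0.25, 0.25, 0.15]
adj = shift_to_total(est, wts, target=0.46)
print([round(a, 3) for a in adj])
```

A logit-scale shift keeps every adjusted share between 0 and 1 and preserves the ordering of the categories, which a simple subtraction of two percentage points would not guarantee near the extremes.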

Some particular changes

I'll talk about a couple of states where Kos pointed out issues with my original maps.

New Hampshire. John McCain won 45% of the two-party vote in New Hampshire, a state which is 93% non-Hispanic white, 1% black, 2% Hispanic, 2% Asian, and 2% other. Based on the Census survey, we estimate that non-Hispanic whites were 96% of New Hampshire's voters in 2008. If whites represented 96% of the voters, and if McCain received 20% of the votes of the other 4%, then his share of the white vote would be 46%--thus, as Kos pointed out, it's hard to believe that McCain won in four of the five income categories among whites in the state, as my original map had implied. The problem was in the way that I'd adjusted things to the national vote.

Michigan. As Kos points out, Michigan was closely divided among whites, and so there was something fishy about my original maps, which had Obama winning among whites in four of the five income categories. The new map does not have this problem.

Colorado. This state reveals some problems with the published exit poll data: according to CNN, McCain got 48% of the white vote in Colorado, but, when this was broken down by income, he got 45% of the vote of whites under $50,000 and 47% of the vote of whites over $50,000. This is a mathematical impossibility: using the exit poll numbers, McCain's percentage of the total white vote should then be (.19*45% + .62*47%)/(.19+.62) = 46.5%, not 48%. I don't know which of these--if either--is correct. I assume all of these numbers are from the corrected exit polls, adjusted to match up to the actual vote proportions in each state. Our estimate gives McCain 51% of the white vote in Colorado. I think this is possible too, and for that matter it's consistent with the exit poll estimate of 48%, which has a standard error of at least sqrt(.48*(1-.48)/(.81*1254))=.016, so the exit poll number is within two standard errors of our estimate.
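The New Hampshire and Colorado arithmetic above can be checked in a few lines of Python. All the inputs are the numbers quoted in the text; the only assumption is that the standard error treats the white respondents as a simple random sample, which is why it is a lower bound.

```python
import math

# New Hampshire: solve 0.45 = 0.96 * w + 0.04 * 0.20 for the white share w.
nh_white = (0.45 - 0.04 * 0.20) / 0.96

# Colorado: white total implied by the two income groups, weighted by
# their shares of the exit-poll sample (0.19 and 0.62).
co_implied = (0.19 * 0.45 + 0.62 * 0.47) / (0.19 + 0.62)

# Standard error of the 48% exit-poll figure, treating the 0.81 * 1254
# white respondents as a simple random sample.
co_se = math.sqrt(0.48 * (1 - 0.48) / (0.81 * 1254))

print(round(nh_white, 3), round(co_implied, 3), round(co_se, 3))
```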

Estimates and raw data

Here are graphs showing our estimates, along with the weighted average from the Pew surveys in each group (including only those respondents who expressed a preference for Obama or McCain and also said they were "absolutely certain" they had registered to vote):

48states.png

You can see the partial pooling from the data to the model, with more pooling in small states such as Wyoming, Rhode Island, and Vermont, and less pooling in states such as California, Texas, and New York where sample sizes were larger. The graphs show estimated McCain vote share, so, unsurprisingly, the lines for whites are higher than the lines for all voters, with differences smaller in states such as Wyoming or Vermont where there are very few nonwhite voters.

Some technical details

Even after restricting to respondents who are certain they are registered, the pre-election polls don't do a great job matching the population of voters. To correctly weight to voters (rather than to the general adult population), we used the 2008 Current Population Survey post-election supplement, which has information on voter turnout. We'll write a technical article describing exactly what we did, but the short version is that the CPS numbers are generally considered to be much more reliable than exit polls or pre-election polls for estimating turnout rates among different groups within a state. What we actually did was to use a multilevel model to smooth the CPS numbers using the latest population totals from the American Community Survey.

Yair also came up with a cool color scheme. Instead of going from deep red to deep blue through purple, we divided up the color scheme as follows: for proportions between 0 and .5, we used different shades of blue (deep blue, getting progressively lighter, toward white), then going from .5 to 1, we used deeper and deeper reds, starting with white, through light pink, to red. (Don't worry, I'll post the R code.) This worked much, much better than the purple schemes I was playing with before. More visual resolution, and a key benefit is that it's immediately clear which states are above and below the 50% threshold. Finally, I did a little trick of my own and used a square-root transformation (more specifically, if the estimated vote proportion for McCain is x, I defined z = 2*(x-.5), and then worked with sign(z)*sqrt(abs(z))) to spread out the resolution near 0.5 and compress it near 0 and 1.
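Until the R code is posted, here is a small Python sketch of the idea as described: a blue-to-white-to-red scale with a signed square-root stretch around 0.5. This is not the original code. The function names and the exact RGB endpoints are assumptions, and the square root is taken of the absolute value of z so that it is defined on the blue side too.

```python
import math


def stretch(x):
    """Map a vote proportion x in [0, 1] to t in [-1, 1], with extra
    resolution near x = 0.5."""
    z = 2.0 * (x - 0.5)
    return math.copysign(math.sqrt(abs(z)), z)


def vote_color(x):
    """RGB in [0, 1]: white at 0.5, shading to blue toward 0 and to red toward 1."""
    t = stretch(x)
    fade = 1.0 - abs(t)  # 1 at the 50/50 point, 0 at a unanimous result
    if t < 0:
        return (fade, fade, 1.0)   # blue side (McCain share below 50%)
    return (1.0, fade, fade)       # red side (McCain share above 50%)


for share in (0.0, 0.45, 0.5, 0.55, 1.0):
    print(share, vote_color(share))
```

Because the square root is steepest at zero, a 45% state and a 55% state already look clearly blue and clearly red, while 80% and 90% look much alike.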

One other thing. The Pew organization sent me their raw data and posted them on the web for anyone to use. The exit polls still refuse to report anything but summaries. I don't see this refusal as a sign of confidence on their part. Please also read my earlier note for further discussion of the Pew and exit polls.

All this work is joint with Yair Ghitza.

This looks like a cool book.

cat.gif

One of the chapters is by John Hughes. Perhaps he'll talk about cinematography? I have no idea what he's been up to lately. There's also this chapter, which I hope is beautiful in its content if not in its appearance.

A horse-race graph!

| No Comments

Responding to my question about graphing horse race results, Megan Pledger writes:

While waiting up late to snipe at an internet auction, I put together some simple data of a horse race and used ggplot to plot it. It's discrete time race data rather than continuous time and has very simple choice options for the horse. The graph is a starting point!

horse_race.gif

[The picture doesn't fully fit on the blog window here; right-click and select "view image" to see the whole thing.]

My reply: Very nice--thanks! I won't look a gift horse in the mouth . . . but if I were to be picky, I'd suggest making the dots smaller, the lines thinner, and the colors of the five horses more distinct. All these tricks should make the lines easier to follow. I'd also suggest gray rather than black for the connecting lines.

I think I'd also supplement it with a blown-up version of the last bit (from 80-100 on the x-axis), since that's where some interesting things are happening.

And here's the code:

I just happened to run across this today. It's awesome.

This is oddly compelling:

ccofworld.jpg

The color scheme is boring--it just replicates geographic information that is already clear from the picture. I'd prefer a more informative color scheme, perhaps based on per-capita GDP, but that's a minor quibble. (With the new color scheme, it might help to outline the continents in gray to make it easier to locate everything.)

Also, of course the dots are not necessary, but maybe they give the map some of its charm. Lower-case letters are certainly much easier to distinguish than upper-case letters.

One other point that otherwise might be missed: What really makes this map work is that it does not display the borders between the countries. Border displays draw attention to the countries' shapes, which is not usually what we care about. That's one reason why I'm not a fan of those distort-a-maps that stretch out states or countries in proportion to their population.

Funny graph

| 8 Comments

Corey Yanofsky pointed me to this:

chart-debtunderrule.jpg-thumb-410x249.jpeg

This is a fun one--it has so many flaws, I hardly know where to begin.

This one is also interesting, in that they seem to have decided retroactively to blame the Democrats as of January, 2007. Fair enough, I guess. In retrospect, I'm surprised the Congressional Republicans and the McCain campaign didn't try this tactic in the 2008 campaign--to say that, sure, the economy is a disaster but it's the fault of the Congressional Democrats. Maybe the "economy is fundamentally strong" pitch wasn't such a good idea.

It's hard to see that arguments about the national debt will convince many people right now. But, stepping back a bit, the role of the Republican campaign right now can't really be to change policy or even to convince a majority of Americans that Obama and the Democrats are doing things wrong. Rather, the short-term goal has got to be to keep up morale among the base. I don't know that the deficit is a great talking point there either, but maybe it is; I imagine they've done some polling.

P.S. Yes, I'm happy to comment on silly Democratic graphs too--just send 'em in!

When demonstrating his Alice in Wonderland example, Brad Paley showed how the words in the center of the display were located by grabbing a word with his mouse, clicking to show its connections with places in the text, and then moving the word, showing the lines of the connections stretching, then letting go to show the word bounce back. The image of the word connected to the places using rubber bands was clear.

What I want to know is, can somebody rig a robot arm to do this so I could feel the pull? Imagine a robot arm that can be moved within a 30cm x 40cm box. You could use this to feel the springiness of the connections in Brad's diagram; the idea is that you'd say a word (for example, "Alice"); the pointer of the robot arm would move to the position of the word in the display, and then you could--with effort--move the robot arm away from this place. When you let go or relax your grip, the pulling of the virtual rubber bands would return the arm back to its original place, and you could feel the strength of the pull.

The arm could also be used to feel a curve (for example, a nonlinear regression or spline, or a mathematical function such as the logarithm or the normal distribution curve), as follows: the arm would start at one end of the curve and the user could grip it and move it along, with the motion physically constrained so that the arm would trace the curve.

In displaying several curves--for example, level curves indicating indifference curves--the arm could start on one of the curves and be programmed to stay on that curve, unless the user pushes hard, in which case there would be resistance during which the arm moves between curves. It would then lock into the next curve, which the user could again trace until he or she pushes hard enough to get the arm unstuck again.

More generally, the robot arm could be used for exploring three-dimensional functions such as physical potentials, likelihood functions, and probability densities. From any point in the two-dimensional box, a "gravitational force" would pull the arm toward a local minimum (or, for a likelihood or probability density, the maximum) of the function. Then with moderate effort the user could move the arm around and, by feeling the resistance, get a sense of gradients, minima, and constraints.
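To make the "gravitational force" idea concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that the function being explored is an independent normal density, so the potential is -log(density) up to a constant and the force on the arm is the negative gradient of that potential. The function names and the normal form are assumptions, not part of any actual robot-arm software.

```python
def potential(position, mode, scale):
    """-log(density), up to a constant, for an independent normal density."""
    return sum(0.5 * ((p - m) / s) ** 2 for p, m, s in zip(position, mode, scale))

def restoring_force(position, mode, scale):
    """Negative gradient of the potential: the pull back toward the mode."""
    return tuple(-(p - m) / s ** 2 for p, m, s in zip(position, mode, scale))
```

In this sketch the pull is zero at the mode and grows linearly with distance from it; it is stronger along directions where the scale is small, which is what the user would feel as a steep, tightly constrained direction.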

An example of how I'd like to use the robot arm together with a visual graph of data and model fit

I was originally thinking of this as a statistical tool for blind people, but I'd actually like to have one of these myself, for example to understand the sensitivity of a model fit to changes in parameter values. I'm thinking of two graphs next to each other: a graph of parameter space and a graph of data with fitted curves. The robot arm would be pointed to the posterior mode or maximum likelihood estimate in parameter space. As I moved the robot arm around, I'd feel resistance--it would be difficult to move far in parameter space without feeling the increase in -log(density)--and at the same time the curve would be moving in the fitted curve + data plot. The muscular resistance information on one graph and the visual information on the other graph would together give me a sense of what aspects of the data are determining the model fit.

Here's an example of what it might look like:

2graphs.png

I'd also like to be able to use the robot arm to pull on the fitted curve and feel the resistance as I move it away from the data.

P.S. Here's the R code for the above graphs:

Two kinds of book

| 9 Comments

One of the things Brad Paley talked about the other day was the computer program he used to make a visualization of the text of Alice in Wonderland [link fixed]. (Click on the "Alice in Wonderland" link; it's really cool.)

My first question when I saw this was, why is the book presented as a circle rather than a line? The circle places the end of the book at the same place as the beginning. There are some reasons this might make sense--after all, Alice wakes up from her dream at the very end of the book, returning to where she was at the start--but, overall, I don't see the circularity making sense. I asked Brad during his talk, but he did not have time to respond (too many questions were being asked, a problem I'd love to have at my own talks!). He indicated that he did have a good reason, though, so if he lets me know I'll report it here.

People asked what was the point of the TextArc display (other than it looking pretty), and Brad gave a bunch of examples of what the plot showed. In some way it was similar to some of my statistical research efforts, in that the results were impressive but ended up confirming things that made sense and that, ultimately, we already knew. In my case, my colleagues and I found that American Indians are not randomly distributed in the social network; in Brad's case, he found that Alice is a central character in Alice in Wonderland, that the words "Mock" and "Turtle" go together, and so forth. (See here for more.)

When pressed further, Brad justified TextArc as a souped-up index. This made a lot of sense to me: his graph tells you lots of information that's not in a conventional index and also allows you to map straight back to the original text. I agree that it's silly to criticize the program for what it doesn't do. It's an automatic program and does a lot. I'm also impressed by any program written more than 5 years ago that still works!

Anyway, one of Brad's remarks about using this tool to understand text made me think that there are two kinds of books:
1. Books that you want to read straight through, from beginning to end.
2. Books that you use for reference, flipping through and looking for what you need.
The horrible thing is that I write all my books as if they will be read from beginning to end, but I'm pretty sure most people read them as reference books. For most people--even most statisticians--reading Bayesian Data Analysis from beginning to end would be like me reading the instruction manual for my washing machine. I pick up the instruction manual when I need it, and then I look for what I need.

Anyway, I thought this might be relevant to TextArc and similar projects. Maybe Alice in Wonderland is not the best example; it might make more sense to use TextArc for a book such as Bayesian Data Analysis that has a sequence but is primarily used for reference. (I went to the TextArc site but can't find the program; at least, there's no easy way to feed in a book and have it produce the TextArc picture.)

Nathan Yau makes some good points in response to my belated comments on his "5 Best Data Visualization Projects of the Year."

First off, I'd like to apologize for saying the projects "suck." That was just rude. Would I like it if somebody said that the examples in Bayesian Data Analysis "suck" because they're not completely realistic, or if somebody said that the demos in Teaching Statistics "suck" because they're not tied closely enough to the lecture material? A better thing for me to say would've been: "I don't particularly like these as data displays, but I'm impressed by the effort that went into them, and I'm glad to see these sort of data-based displays getting a broad audience."

In the interest of constructive discussion, I'd like to make a few points.

Visualizing travel distances

| 2 Comments

Via epc I came across Jonathan Soma's Triptrop NYC, which estimates, practically in real time, how long it takes to travel by subway from one location in NYC to any other, and paints the result as an overlay on a map. Here's Manhattan, starting from Columbia's Statistics Department:

distances.png
stat-subway.png

The walking speed is 3 mph. Travel times in minutes are color-coded, with the key above the map. Jonathan says he used SciPy and curve fitting, along with a precalculated database of over 120,000 distances between locations.

There are other "time travel" tools for London and the UK, but Triptrop NYC is the first one I've come across that allows you to enter your own precise location. As for similar visualizations, here's my post on housing prices in New York.

Programmer/designer W. Bradford Paley spoke yesterday for the data visualization group here at Columbia. He gave an amazing talk, one of the best I've ever seen. One reason I say this is that about half the talk was devoted to an application he built for Wall Street trading--something I just couldn't care less about, it's hard for me to imagine a topic I'm less interested in--and, even so, I liked the talk a lot.

The seminar participants--a mixture of architects, computer scientists, and some other people, including a psychologist and even a statistician--had a lively discussion throughout. In fact, there was so much going on that I'll spread my comments through several blog entries over the next few days.

Right now I want to talk about Paley's speaking style, which was great in so many ways, but what really got to me was how he managed to get so many questions and comments from the audience---so much that people had to ask him to stop taking questions so he could move forward with the material. This was amazing. When I speak, I always struggle to get audience participation. Usually I get a few questions at the end, but not this kind of barrage all the way through.

What can I do to involve the audience more? I've always thought I need more "hooks" but have not been sure how to do it. After seeing Paley's talk, my new idea is to devote more of my talks to process. I typically present results without a lot of detail on how I got there. But maybe it would be better to talk more about what I did. At least, that worked for Paley.

The funny thing is that I love answering questions, and I think I'm good at it. That's one reason I get so frustrated that I don't get more questions when I speak. People typically think my talks are entertaining, informative, and thought provoking--at least, that's what they tell me--but I'm lacking the hooks that draw people in.

P.S. More here on Paley's talk.

Better late than never

| 13 Comments

A friend writes:

Does this stuff suck? Or am I missing something?

My reply: Yes, I agree. They all suck (for the purpose of data display).

Slate has a beautiful animated rendering of the job gains/losses over the past 2 years. It would be very difficult to show the trends without animation.

job-loss.png

Two other things I like: The quantity circles are so much more informative than using color to paint states: we all know that most job losses are in NY and CA, because they're the biggest! Those circles help control for state population density.

The animation helps control for job gains in the previous period: it hides the cities that are relatively stable, but it nicely shows boom-bust cities (NYC) and stagnation-bust cities (Detroit).

(Via Peter's Twitter.)

Via infosthetics, I came across a new and very nice web application for data analysis, Verifiable. Among their featured graphs, there's a very nice one displaying the association between politics and religion:

Party_Affiliation_By_Religious_Tradition,_Percentages.png

This graph also shows how the often-hated bar charts can be effective. In all, the graphs coming out of Verifiable look like some of the best I've come across. Previously, I've written about ManyEyes, which is quite versatile and allows many data types, and Swivel, which was among the first. Nicely done.

[Several commenters have pointed out (thanks!) that the selection of colors is not good, and that some religions in the list are very similar, or too small to be interesting. When it comes to selecting good colors, I stand by ColorBrewer.]

I use R for just about everything, including exploratory data analysis and graphics. The only other package with which I have any familiarity is Mathematica. I've been generally satisfied with R graphics, although there are things that I always struggle with, such as:


  1. using expression() to get symbols and math expressions the way I want them;

  2. writing line labels at an angle so that they lie along the line (and then having to re-do this if I change the dimensions of the plot, e.g. by changing the margins);

  3. setting margins when I have multiple plots on a single figure, so that the axis labels fit but there is still enough room for the data;
  4. placing labels or legends where they don't get in the way of the plot.

At least in my normal course of business, all of these issues only come up when I'm making publication-quality figures (or at least presentation-quality), not when I'm exploring the data or comparing the data to predictions. So I've always thought of R as being excellent for exploratory data analysis, and fair or poor for making publication-quality output. But sometimes I do find myself taking a lot of time on an exploratory plot (such as the example here), which is frustrating.
View image
And then a friend mentioned that he thinks R is good for publication-quality graphics --- you have precise control over everything --- but is terrible for exploratory graphics, which is exactly the opposite of the way I think of it! He pointed out that, aside from some crude things you can do with identify(), R's graphics are non-interactive: you can't click to remove bad data points, or zoom in and out, or click on a line and change its color or width. He said good exploratory graphics programs let you do all of these things. But here's the kicker: he couldn't name a good exploratory graphics program! He says he knows they exist, but he doesn't know what they are.

So: what's worth a look, besides R and Mathematica? Am I using R just because it's what I'm used to (and it's free), or is it actually the best thing out there, as I have always assumed?

Igor Carron forwarded this from Ed Tufte:

The Recovery Accountability and Transparency Board was created by the American Recovery and Reinvestment Act to coordinate and conduct oversight of funds distributed under this law in order to prevent fraud, waste and abuse. . . . The Board has a series of functions and powers to assist it in the mission of providing oversight and promoting transparency regarding expenditure of funds at all levels of government. . . . The Board is also charged under the Act with establishing and maintaining a user friendly website, Recovery.gov, to foster greater accountability and transparency in the use of covered funds. The job of the Recovery Accountability and Transparency Board is to make sure that Recovery.gov fulfills its mandate -- to help citizens track the spending of funds allocated by the American Recovery and Reinvestment Act.

Last month I reported on a statistical analysis by Josh Millet at Criteria Corp., suggesting that the economic climate for small business is improving. Millet now has an update (posted on 1 Apr but I assume it's serious):

With the final March numbers now in, the Hiring Activity Index nudged upwards very slightly again this month, to 62.3% from 61.4% in February. To me [Millet] this is an encouraging sign that the February jump in hiring activity by small businesses was not just a blip. If the data we're seeing means anything, the hiring situation for small and medium-sized businesses has begun to rebound.

Here's the graph I made of his numbers:

criteria2.png

Millet also answers a bunch of potential criticisms of his measure:

There were some interesting comments and questions about the HAI and its potential utility as a leading economic indicator. We [Criteria Corp.] do sell our software on a subscription basis, and someone pointed out that if non-active subscribers didn't renew because of the downturn, this could artificially inflate the HAI because it is based on the percentage of our customer base that is actively doing pre-employment testing in a given month. This is a legitimate point, but I [Millet] will say that while low levels of use are a reason that customers sometimes do not renew, we haven't seen non-renewal rates climb much since November, when the HAI dropped by 10 points. It was also suggested that higher numbers of job-seekers may result in applicants for positions that may not have been desirable previously--this is theoretically possible, but I don't see much evidence for it. What is most certainly true is that companies are getting far more applicants per open position, as I previously blogged about here. However, since the HAI is based on the percentage of companies testing in a month, not the overall volume of tests, this shouldn't influence the HAI unduly, and wouldn't in any case explain the plunge in November and (partial) rebound in February.
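The non-renewal concern is easy to see with a little arithmetic. The sketch below, in Python, treats the index as the percentage of the customer base that is actively testing in a month, as described in the quote. The customer counts are made up for illustration and are not Criteria Corp.'s numbers, and the function name is hypothetical.

```python
def hiring_activity_index(active_customers, total_customers):
    """Percentage of the customer base doing pre-employment testing this month."""
    return 100.0 * active_customers / total_customers

# Hypothetical counts: 500 of 1,000 customers are testing.
before = hiring_activity_index(500, 1000)

# If 100 inactive customers fail to renew, the same 500 active
# customers now make up a larger share of a smaller base.
after = hiring_activity_index(500, 900)
```

Here the index rises from 50% to about 55.6% with no change in actual hiring activity, which is why the non-renewal rate matters for interpreting the measure.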

Aleks forwarded this to me. It looks interesting. I'm disappointed that the readings don't include anything by Bill Cleveland, but on the plus side the course appears to be incredibly well organized. I'm sure I'd love it. They sure didn't have classes like this when I was in college.

A Glimpse of Our Future

| 6 Comments

Jeff pointed me to this graph from congressmember Paul Ryan:

paul ryan budget.gif

Ryan is actually being generous to the Democrats here. You can't imagine how things are going to look around 2150 or so!

Manoel Galdino pointed me to a discussion on the Polmeth list on the topic of reporting p-values and regression coefficients. (The polmeth listserv doesn't seem to have a way to link to threads, but if you go here for March 2009 you can scroll down to the posts on "Displaying regression coefficients.") I don't want to go on and on about this, but in the interest of advancing the ball forward just a bit, here are a few thoughts:

Jose Aleman points me to this conference on 18-19 June at Fordham University:

This conference is about applications of the R software and Graphics system to important policy and research problems, not about R per se. It provides an excellent opportunity to bring together researchers from various disciplines using R in their reproducible research work. We hope to provide practical help to students and researchers alike.

It says here that I'm an invited speaker. I don't actually remember being asked to do so, but if I did, then I guess I'll be there! Fordham is conveniently located near the zoo so perhaps I can somehow combine this with a family trip.

P.S. I checked more carefully and the conference is actually at the Manhattan branch of Fordham (at Lincoln Center). So no zoo trip, unfortunately!

See here. It's an important issue, but their plot has two huge problems:

1. The big fat circles in the diagonal axis are conveying no information and are, to my eye, a distraction.

2. They forgot to order the variables, as a result creating a confusing pattern. Try reordering to put the highly-correlated variables together (as Tian did for Figure 8 in our article).

They also gave the variables unreadable abbreviations. This is not specifically an error with the correlation plot but it's a common mistake that can easily be avoided.
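One simple way to put highly-correlated variables next to each other is a greedy chain, sketched below in Python. This is only an illustration and is not necessarily the ordering method used for the figure mentioned above; the function name and the choice of starting variable are assumptions.

```python
def greedy_order(names, corr):
    """Order variables so each is followed by the remaining one most correlated with it.

    corr is a square list of lists of correlations, in the same order as names.
    """
    n = len(names)
    # Start from the variable with the largest total absolute correlation.
    start = max(range(n),
                key=lambda i: sum(abs(corr[i][j]) for j in range(n) if j != i))
    order = [start]
    remaining = set(range(n)) - {start}
    while remaining:
        last = order[-1]
        # sorted() makes tie-breaking deterministic (lowest index wins)
        nxt = max(sorted(remaining), key=lambda j: abs(corr[last][j]))
        order.append(nxt)
        remaining.remove(nxt)
    return [names[i] for i in order]
```

Hierarchical clustering of the correlation matrix is a common alternative; the greedy chain is just the shortest thing that makes the point.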

P.S. More here from Eduardo and John.

I love stories and for a long time have wanted to put together a little book of my favorite statistics stories. I know this is not something that would ever reach David Sedaris levels of popularity (to say the least) but at least it would give me some good material to use at the beginning of class or for other times when I want to engage students in a way that's not too taxing for them. (In the meantime, I recommend that all of you who teach statistics or methods classes begin each of your classes, while the students are walking in, with a 5-minute discussion of whatever the latest items are on this blog.)

Anyway, I have a new story right here for ya.

Understanding well-being

| 9 Comments

From America's Health Insurance Plans:


The Gallup-Healthways Well-Being Index, a unique twenty-five year partnership in research and care, is an on-going daily survey that began in January 2008. It surveys 1,000 Americans 350 days per year.

The research and methodology underlying the Well-Being Index is based on the World Health Organization definition of health as "not only the absence of infirmity and disease, but also a state of physical, mental, and social well-being."

While I can't say what "1,000 Americans 350 days per year" really means, here's a nice map of the aggregate measure of well-being (if you click on it, you will get a slightly larger version):

well-being.png

It's an interesting dataset and it would be interesting to see some analysis about the factors associated with well-being. If you do it using the tables that are available from the site, post a comment, and I'll add it to the entry later on.

As for the visualization - I would have preferred a continuous color scale, rather than having it collapsed into just 5 levels. Also, the boundaries between districts only have to be drawn when the color for both districts is the same (quite rarely, if you follow the advice from the previous sentence) and when there is no other border closer than n pixels (because the boundaries are less important than the colors indicating the variable of interest).
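The two suggestions above can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function names are hypothetical, the shade is a simple linear rescaling, and the "no other border closer than n pixels" condition is left out because it needs the actual map geometry.

```python
def continuous_shade(value, low, high):
    """Map a score linearly to a shade in [0, 1] instead of five discrete levels."""
    t = (value - low) / (high - low)
    return min(1.0, max(0.0, t))

def borders_to_draw(adjacent_pairs, shade_of, tolerance=0.0):
    """Keep only borders between neighboring districts whose shades match.

    Where the shades differ, the color change itself marks the boundary.
    """
    return [(a, b) for a, b in adjacent_pairs
            if abs(shade_of[a] - shade_of[b]) <= tolerance]
```

With a continuous scale, exact matches between neighbors are rare, so a small positive tolerance is probably more useful in practice than the default of zero.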

David Hillis is a biologist who has written on evolutionary trees. In response to my blog on Laura Novick's research on the perception of cladograms, Hillis writes:

It turns out that the best tree figures for students are neither of the two options she looked at, but rather the kind of trees that we use in Life: The Science of Biology. A more comprehensive study by a Univ. of Missouri education researcher, which included each of these options, clearly showed that the best comprehension by students was achieved with figures like the one in the attached file. People rarely draw trees this way for publications, however, because they are harder to draw than the ones with straight lines.

And here's an example:

hillis.png

I wonder if Laura has done research on this particular type of display.

In honor of Darwin's 200th birthday, some research by psychologist Laura Novick on the presentation of evolutionary trees ("cladograms"):

cladograms.gif

Her research shows that students are much better at understanding the diagram on the left than the one on the right. She calls the one on the left a "tree" and the one on the right a "ladder," which confuses me a bit: the one on the right looks more like tree branches to me.

Andy Sutter writes:

It's been a while (~2 years?) since I was last reading your blog semi-regularly and submitted a comment or two, but I was reading something today that made me recall those days.

At the time, I was curious about why social scientists present data as charts of regression coefficients, since I'd never seen such a presentation in the physical sciences.

Yair showed me this. It's simply amazing. Click on the link RIGHT AWAY and be awed. Great examples and all the code you'll ever need.

Another bad graph

| 2 Comments

Jeff Jenkins writes:

Boxplot challenge

| 12 Comments

In response to the comments here, I say:

I have never ever seen an example where I've felt a boxplot was appropriate. I'm open to being convinced, but I don't think you'll be able to convince me. Bring on the examples!

John and I gave our presentation on statistical graphics today, and then coincidentally I found this monograph by Rafe Donahue (link from Helen DeWitt). I started skimming and it looks pretty good so far. He uses horizontal jittering instead of the horrible boxplot, and that makes me happy already. On the other hand--since I'm being superficial here--I'm not a fan of the marginal-notes style of referencing. I always feel that this style draws undue attention to what are ultimately the least important parts of the book.

More seriously, Donahue's monograph looks interesting, and I'll have to read it more carefully. I've been looking for something on graphics that goes beyond the nuts and bolts of how to make a particular graph and considers what should actually be plotted and why.

On a theoretical level, I wonder how his ideas connect to my ideas of exploratory data analysis and statistical modeling (see here and here). I think the connections are there (as in Donahue's principles #28, 43, 52, and 86: "The data display is the model.").

Actually, many of his principles are things that I tell people also. Just today I discussed how you have to tell the viewer what the plot is (Donahue's principle #23).

P.S. A minor point: Donahue's principle #53 is, "Plot cause versus effect." Doesn't he mean, "Plot effect versus cause"? Usually we say y vs. x, not x vs. y. Or else I'm missing something here.

I'm not gonna miss this!

| No Comments

The following CUIPS Professional Development Seminar: "How (Not) to Present Quantitative Results," Thursday February 12, 12:30-2:00pm, 707 IAB.

Data sources

| 2 Comments

Pippa Norris writes:

I realize that it is clunky but if you could always, always cite the survey source and date below each figure, this would make the book much easier to read and interpret. If I know the source, then it is easier to judge the meaning of the presentation, the exact questions used, and the reliability of the data. I use a lot of figures in presenting my own work, to the despair of my publisher, and I know how difficult it is to both combine elegance and simplicity with technical details in a compact space. But if we don't provide these details, then this is such a bad role model for our students!

She's got a point. I'm a big believer in having the graph and caption be a self-contained entity--as I tell my students, you have to think about people like me who only read the graphs--but I've rarely put the data source right in there too. In our book we have all the data sources listed in the notes at the end, but I agree that putting sources right on the graph would be a good idea. Actually, I think what I want to do is write some R functions to make graphs just the way I like them, and one option on the graph will be to give the data sources in small print near the bottom.

Burt Monroe writes:

I [Burt] sent an entry for the Chance visualization contest. By the time I'd ferreted out the original in our library and quizzed family friends (my father was a biologist) about bacterial taxonomy, I ended up writing a goofy little paper about the whole thing.

Here's Burt's article, and here's his graph:

burt.png

This is pretty, although to my eye it looks a little busy. I'd probably favor including this information in multiple plots. That said, I didn't actually look at the data or the problem--all I did was post the announcement--so maybe it's the best thing to do. I like that Burt investigated the context of the problem and didn't just treat this as a "dataset" to be graphed.

A diagram of graphs

| 3 Comments

Jess sends along this, which isn't a bad idea although I disagree with how it's organized. For one thing, I think just about all graphs are comparisons; for another, I think line graphs are often the way to go, so I'm unhappy to see them in only a few of the pictures here; for another, the scatterplot-plus-regression-line, which I love, isn't anywhere to be found. But I appreciate the thought.

Multicolor text in R

| 1 Comment

Hey, I've wanted to do this for a while! Example code here.

I gotta say, I find the expression() function incredibly difficult to use. Examples are key.

Seth writes:

I'm writing a critique of how epidemiologists analyze their data and one thing they rarely do is provide scatterplots. I suppose their excuse is they have too many points. Do you know of any papers about how to make scatterplots with large numbers of points?

My reply: The book, Graphics of Large Datasets: Visualizing a Million (which I've been planning to review on this blog for literally over two years, I even took notes on it and everything) discusses this issue in detail, including tricks such as alpha-blending.

My impression is that if you have millions or even thousands of points, a density plot can do the trick, as on page 149 of Red State, Blue State. Perhaps readers have other suggestions. But if you just want to make the point and give a definitive reference, I'd go with the Graphics of Large Datasets book.
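Both ideas mentioned above, alpha-blending and a density plot, come down to simple arithmetic. Here is a minimal sketch in plain Python; the function names and the square grid are assumptions for illustration, and in practice a plotting library would do the binning and shading.

```python
from collections import Counter

def stacked_opacity(n_points, alpha):
    """Opacity where n_points semi-transparent points of opacity alpha overlap."""
    return 1 - (1 - alpha) ** n_points

def bin_counts(xs, ys, x_range, y_range, nbins):
    """Count points in each cell of an nbins-by-nbins grid (a crude density plot)."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in zip(xs, ys):
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # skip points outside the plotting region
        # min() puts points on the upper edge into the last cell
        i = min(int((x - x0) / (x1 - x0) * nbins), nbins - 1)
        j = min(int((y - y0) / (y1 - y0) * nbins), nbins - 1)
        counts[(i, j)] += 1
    return counts
```

The first function shows why alpha-blending saturates: with alpha of 0.1, about 22 overlapping points already reach 90% opacity, so for very large datasets the binned counts, shaded on a suitable scale, tend to hold up better.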

John Sides posts this graph:

iraqis.png

You will perhaps not be surprised to hear that I have no comments on the substance but I have some thoughts on the presentation. I'd bound the y-axis at 0 and 100% (currently it goes beyond these limits), also I'd put the year labels between the hash marks rather than on them (think about it: on this scale, 1995 is a time period, not a single point), also I'd put percent signs on the y-axis (e.g., "25%" rather than "25") for some useful redundancy. But other than these minor comments, I think the graph is beautiful.

The year-labeling issue is not completely trivial, especially when trying to interpret when the series ends. I've noticed that people often have difficulties representing time on the x-axis of graphs. Other times you'll see, for example, a graph going from 1950 to 2000 with 50 little hash marks and tiny slanted labels at 1951, 1953, 1955, 1957, etc., instead of simple labels every 20 or 25 years.
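The labeling suggestions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names `sparse_year_labels` and `percent_label` are hypothetical, and each year is treated as the interval from `year` to `year + 1`, so its label sits at the midpoint.

```python
def sparse_year_labels(first, last, every=25):
    """Return (x position, text) pairs for round years only.

    A year is a period, not a point, so the label for a year goes at
    year + 0.5, midway between the hash marks that bound it.
    """
    return [
        (year + 0.5, str(year))
        for year in range(first, last + 1)
        if year % every == 0
    ]


def percent_label(value):
    """Format a y-axis value such as 25 as '25%' for some useful redundancy."""
    return f"{value:g}%"


labels = sparse_year_labels(1950, 2000, every=25)
print(labels)
print([percent_label(v) for v in (0, 25, 50, 75, 100)])
```

With `every=25` a 1950 to 2000 series gets three labels rather than twenty-five slanted ones, and the y-axis runs from "0%" to "100%".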

Jon Peltier saw this horrible graph that I'd discussed earlier:

CXM946.gif

Peltier writes:

Well, this is an eye-catching chart. It seems to show an inward spiral, but the overall trend is really not very clear. It also looks distorted, too tall and not wide enough, but I examined axis settings several times, then even physically measured the lengths of the January-July and April-October spokes, and everything lined up. This optical illusion was caused by plotting the data in a radar chart: April and October were the two largest numbers in 1929, stretching the curve vertically. The first step to improving this chart is to cut it between December and January, and unroll it. . . .

In the chart below, we can easily see the downward trend which started around the time of the stock market crash of October 1929. The trend was well underway even before the Smoot-Hawley Act was enacted in June 1930, because many international trade partners had instituted preemptive retaliatory tariffs of their own. By the middle of 1932, the volume of international trade had effectively plateaued at one-third of its high of 1929:

plughole-timeline.png

That's what I'm talking about.

P.S. Jason Roos writes:

Ooooh this is ugly

| 9 Comments

Steve Buyske points me to this:

CXM946.gif

Boy do I hate this. A straight time series would do so much better. They should also follow the general principle of extending the series, going back before 1929 and after 1933. But my main feeling is that this spiderweb action is just horrible.

Chance magazine graphics contest

| 6 Comments

Mike Larsen, editor of Chance magazine, passed along the announcement for a graphics contest. Entries are due 15 Jan 2009, and there is a specific requirement, which is that they display the data described below (and also here).

CHANCE GRAPHIC DISPLAY CONTEST: Burtin's antibiotic data

The year 2008 marks the 100th anniversary of the birth of Will Burtin (1908-1972). Burtin was an early developer of what has come to be called scientific visualization.

In the post-World War II world, antibiotics were called "wonder drugs" for they provided quick and easy cures for what had previously been intractable diseases. Data were being gathered to aid in learning which drug worked best for what bacterial infection. Being able to see the structure of drug performance from outcome data was an enormous aid for practitioners and scientists alike. In the fall of 1951 Burtin published a graph showing the performance of the three most popular antibiotics on 16 different bacteria.

Graphics on shirts?

| No Comments

Bob writes:

You should team up with these guys to promote your book. This one looks like it could come out of one of your papers with its shared horizontal scale, no ticks, and no vertical scale. It was even done with the help of a NY prof.

I don't really know what's going on here, but if anyone wants to go with this, be my guest...

Sometimes people will email me that their comments aren't published on the blog. It's a good idea to be a registered user to prevent this from happening - as we have tens of thousands of spammy messages, and one sees unspeakable things there. So it was interesting to see a visualization (developed by some famous open source developers) of where blog spam comes from:

Picture 1.png

It's a great visualization, except for the colors: the USA is bright red. But what does this tell us? That the USA has the highest number of computers on the World Wide Web, and the highest total number of blog comments posted? We know that already! The visualization should provide information that isn't known already.

So should one just present the ratio between spammy and hammy comments for each country? That would be valid, but it would involve ad-hoc modeling. Instead, one has to build a model that removes the influence of variables that are already known to influence the outcome, such as the number of computers, the number of all comments posted, and so on. I'll write more about how to do this another day.

Florence Nightingale's graph

| 4 Comments

Chris Zorn pointed me to this news article by Julie Rehmeyer.

COXCOMB.jpg

Given the context, the graph is impressive and important. But given what we know today, it would've been even better as a line plot. (Rehmeyer suggests a bar graph but I assume that's just because she doesn't know about line graphs; see here for a simple example.)

Visualizing election polls

| 3 Comments

A colleague points me to these supremely ugly pie-like graphs by Richard Riesenfeld and Geoff Draper. On the other hand, who am I to say they're ugly? I'm sympathetic to the goal of "exposing complex relationships that are not obvious by usual methods of statistical analysis." And it's hard to argue with "Eighty-eight percent said they enjoyed using the software and 71 percent completed all the tasks without errors." I've certainly never performed such an evaluation of my own graphical methods, instead relying, Tufte-like, on my introspective judgment.

Ben Lauderdale writes:

I [Ben] had this map [see below] on my door for the last week. Based on exactly the same calculation using constant 95% black support and census-proportional representation. The white counties are the ones whose census names didn't match properly with the names used in the library(maps) package in R, I was too lazy to fix them.

ben1.png

Cool. I'd only suggest using light gray rather than heavy black lines between counties; the map as it is overemphasizes the county borders, I think. But I respect his laziness; there's always time later to fix the details.

Ben continues:

[Below are] the state-by-state county share plots for the lower 49, Obama vote share as a function of black population share. V.O. Key's observation that whites who live near blacks in southern states are less positively inclined towards them is *still* visible in several states.

ben2.png

The circle areas are proportional to county voter turnout. (The biggest circle is L.A. county in California, and so forth.)

Ben also had this comment about his map:

It reminded me of something Bob Putnam would say every time someone presented an empirical talk in our Center for the Study of Democratic Politics series during the year he was a fellow here at Princeton: "You should include miles to the Canadian border as a variable in your regression, it is the most important proxy for political culture in America!" At least in the eastern half of the country, he has a point.

Except for New Hampshire and Vermont, I think.

P.S. For graphics enthusiasts, here are some earlier graphs that I gave the thumbs-down on before Ben came up with the 50 plots above:

I have to say I prefer a college freshman's plot to yours, Andrew. Although, you did hack it together at 3am after strolling around Grant Park. And drawing the y axis from 0 is a mistake which you didn't make, too.

Henry posted some great links to voter turnout data and discussions of the topic by Michael McDonald. Henry's graph is here.

Just for fun, I decided to redisplay the information; here is my version:

turnout.png

I've updated it with the latest estimate as of 9 Nov 2008.

Key differences between my graph and Henry's:

1. I go back to 1948, Henry starts at 1980.
2. My y-range goes from 45% to 65%; Henry goes all the way from 0 to 100.
3. Henry's graph labels every election; I label every 20 years.
4. Henry's graph is in gray with many black horizontal lines and a blue line with data; mine is black and white with a line and with data points indicated by dots.

Items 1 and 2 above are the most important, I think: by showing a shorter time range and compressing the y range, Henry makes the changes look less impressive. I understand the rationale for including the whole y-range here, but in this case, since changes are being discussed, and a 5% change is, historically, a big deal, I prefer my graph. I did extend the y-scale out to the [45%,65%] range, though, because I wanted to give a little bit of perspective; it would somehow seem misleading for the data to cover the entire y-range in this case.

In any case, I'm not trying to criticize Henry here; making graphs is just something I like to do, and something I like to think about.

P.S. Below is my (updated) R code, for those of you who want to play at home:

Phoenix Suns shooters

| 4 Comments

Yair sends in this plot of the week:

suns-wings.png

He writes:

This displays the smoothed distribution of shots taken by wing players for the Phoenix Suns in the '07-'08 regular season (Matt Barnes played for the GS Warriors that year). Raja Bell seems like the perfect wing player for the Suns, because he plays defense and then basically sits at the 3-pt line waiting for Steve Nash to give him the ball for a good shot. Leandro Barbosa is similar, but he drives a bit more (especially when Nash is off the floor). Grant Hill didn't fit this mold because he has no 3-pt shot; he is more of a mid-range guy. From this standpoint, Matt Barnes (their free-agent pickup) looks like he could be a better fit. Of course, this plot says nothing about whether he actually hits the threes, but at least his heart is in the right place. Then again, if their offensive system changes because of the new coach, all bets are off.

Pretty graphs, huh? The color scheme seems good for a team called the Suns.

Ted Dunning writes:

Google analytics normally does a pretty good job of dealing with statistical issues. For instance, the Google website optimizer product does a correct logistic regression complete with error bars and (apparently) Bayesian analysis of how likely one setting is to actually be better than another.

But their demo of their latest visualization product is worth a write-up. They seem to ascribe volumes of meaning to variations in small count statistics.

Check out the video.

As Aleks knows, I can't bear to watch videos. I like the idea of dynamic graphics, but I can't stand the lack of control that comes from watching a video. I like to read something that I can see all of at once.

But the Google tool looks pretty cool. Also, I didn't know they did Bayesian logistic regression. I wonder what prior distribution they use? This is a topic that my colleagues and I have thought about.

Ted continues:

Why Model?

| 2 Comments

Stan pointed me to a short article "Why Model?" by J. M. Epstein. The default principle, both in statistics and in machine learning, is to predict. Any act of statistical fitting that involves likelihood is inherently predictive in its nature.

Visualization is in no way different from predictive modeling - it's just that the (sometimes implicit) model is transparent and interpretable. Visualization is not the only type of interpretable model: even a table with regression coefficients is interpretable, a decision tree is an interpretable model, a list of typical cases is an interpretable model. A 2D scatter plot that nicely shows the difference in outcomes is a model, because the two dimensions used by the plot indeed help distinguish the outcomes.

Most priors are grounded purely in the desire to capture the truth, as such they are predictive priors. But the interpretable models involve priors that are not grounded in prediction - but rather in the human cost of interpretation. The more difficult it is to interpret a parameter, the lower prior probability of interpretation it should have.

In summary, while most mathematical treatment of statistical modeling tends to be focused purely on prediction, there is a good reason why the cost of interpretation should be considered. Epstein's list of why interpretability matters should motivate us to care:

What's up with Kazakhstan?

| 5 Comments

Chris Zorn pointed me to this graph and asked for my thoughts. I replied that I'd seen worse, but the use of two dimensions doesn't help, and the comparison to the GDP of Kazakhstan is just weird. I mean, who has any idea what the GDP of Kazakhstan is??

Chris replied,

I'm teaching first-year Ph.D. methods on PoliSci this term, and we have a feature called "Graph of the Day," where -- for five minutes or so at the beginning of every class -- the students all look at and comment on some graph from a paper, the press, etc. I used this one yesterday, and the response (from people with a grand total of three weeks of graduate education) was identical: "What's up with Kazakhstan?", and "Isn't a reference point supposed to be *non-obscure*?"

Graph of voter turnout by age

| 7 Comments

Here's a pretty picture (from Charles Franklin, link from John Sides):

Turnoutbyagecitizens.png

What a great graph! I won't be picky, but if I were, I'd make the following suggestions:
- Bigger numbers on the axes--as is, they're hard to read.
- Add percentage signs on the y-axis.
- Label age every 20 years rather than every 10.
- Put the "80-84" age group at 82 (rather than 80), and put the "85 and up" group at 88 (rather than 85).
- Pick colors other than red and blue.
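The suggestion about where to put the binned age groups can be made concrete with a small Python sketch. The function name `group_position` is hypothetical, and the offset of 3 years for the open-ended group is an assumption chosen only to reproduce the value of 88 suggested above; ideally it would come from the age distribution itself.

```python
def group_position(label, open_ended_offset=3):
    """Return an x position for an age-group label.

    Closed groups such as '80-84' go at the midpoint of their range.
    Open-ended groups such as '85 and up' have no upper bound, so they are
    placed at the lower bound plus an assumed offset (hypothetical).
    """
    if "-" in label:
        lo, hi = (int(part) for part in label.split("-"))
        return (lo + hi) / 2
    lo = int(label.split()[0])
    return lo + open_ended_offset


for label in ("75-79", "80-84", "85 and up"):
    print(label, group_position(label))
```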

Strong cyclones growing stronger

| 9 Comments

Jamie pointed me to this graph in the NYT:

hurricanes.png

Nice! Especially the margin of error, the subtle colorings and the use of the gray background. The y-axis labels are a little weird (why not simply 60, 80, 100, and 120?), and I'm not sure how to think about the x-axis. (Given the scale on the y-axis, should we really care about a change of 1/2 mile per hour in wind speed?) Also, I don't really understand what it means to measure changes in wind speed, if the storms are themselves categorized by wind speed! But, as a graph, it has many excellent features.

Erratum

| No Comments

In Red State, Blue State, I attributed the distorted maps to "computer scientists Michael Gastner, Cosma Shalizi, and Mark Newman." Cosma writes that "none of us are or were 'computer scientists', and in fact we were all trained as physicists, and working as physicists at the time." My bad.

Data as chartjunk

| 1 Comment

31novel.xlarge1.jpg

See here for what I'm talking about.

Pretty pictures

| No Comments

Chris Paulse points to this interesting slideshow.

I find the National Weather Service display to be much more useful than weather.com and other commercial sites. But its city-search finder is terrible. Go here and enter "Atlanta" and see what you get. It's a list of about 20 cities. And they don't even list Atlanta, Georgia first. It's buried in there, about fifteenth in the list. What's that all about?

Bad binning can mislead

| 6 Comments

Howard Wainer writes:

A friend sent me this USA Today article with a graph about HIV:

Animated adiposity

| 2 Comments

Rebecca sends in this animated graph and writes, "all the white states initially are a bit deceptive, but even so, it's pretty striking, and the animation is very effective." I think I'd prefer a time series of the national average, along with a color-coded animated map showing each state relative to the national average in each year.

Dept of silly graphs

| 5 Comments

Bill Harris points to this:

directv.png

Bill writes:

Opening Day

| 2 Comments

Nathan Yau writes,

I recently put up a visualization showing the spread of walmarts over time, ... I'm wondering if you know of any other "opening dates" data (starbucks, for example)? I'm itching to put some more data into my code.

More graphical propaganda

| 3 Comments

John Sides reproduces this graph showing Kenyan election results:

kenyaexitpoll.PNG

What a horrible graph! The re-coloring and re-ordering of the wedges makes the difference between "official results" and "poll" seem much greater than they are.

As in my earlier example of PDA (propaganda data analysis), I have no comments on the merits of the case (for example, what can you learn from a poll taken six months after the election)--I'm just weighing in on the graphical presentation.

Too clever by half

| 11 Comments

I appreciate the effort, but I fear that the message that many have taken from Tufte is "graphs should be cool" rather than "graphs should be clear." As Yu-Sung put it, "I am still figuring out how to read it."

Hey . . . nice graph!

| No Comments

From Andrew Sullivan. More here. I love this stuff.

Andrew Smith sends in this:

graph.jpg

He writes, "I think it beats the pie chart you referenced in your previous blog post! My brain still hurts trying to parse it."

In all seriousness: yes, a scatterplot would be better. And they gotta work on their axis labeling. "61.8?"

P.S. In contrast, the photographic height/weight chart is excellent.

Damn this is cool

| 4 Comments

Chris Zorn writes, http://graphics8.nytimes.com/packages/flash/politics/20080603_MARGINS_GRAPHIC/margins.swf

He's clearly a man of few words. I'll give it as a link. You can play with it, click on things, see all sorts of fun stuff.

What I'd really like to do is pipe this through a hierarchical model to smooth out the inevitable survey fluctuations. Also, it would be good to subtract off main effects. For example, in the graph below, are well-educated Arkansans particularly strong Clinton supporters, or is this just a combination of Arkansas being a Clinton state and small-sample fluctuation?

pretty.png

Anyway, I'm not complainin, just suggesting even more things that could be done with these data and this software. The first thing to do is to run it with the 2000 and 2004 exit polls. This app would go great with our Red State, Blue State book.

I'm not the only one who gets frustrated about such things.

I was thinking more about axes that extend beyond the possible range of the data, and I realized that it's not simply an issue of software defaults but something more important, and interesting, which is the way in which graphics objects are stored on the computer.

R (and its predecessor, S) is designed to be an environment for data analysis, and its graphics functions are focused on plotting data points. If you're just plotting a bunch of points, with no other information, then it makes sense to extend the axes beyond the extremes of the data, so that all the points are visible. But then, if you want, you can specify limits to the graphing range (for example, in R, xlim=c(0,1), ylim=c(0,1)). The defaults for these limits are the range of the data, which R then pads by 4% at each end (the default xaxs="r" and yaxs="r" axis styles).

What R doesn't allow, though, are logical limits: the idea that the space of the underlying distribution is constrained. Some variables have no constraints, others are restricted to be nonnegative, others fall between 0 and 1, others are integers, and so forth. R (and, as far as I know, other graphics packages) just treats data as lists of numbers. You also see this problem with discrete variables; for example when R is making a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn't know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., making it nearly impossible to read sometimes.
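For the discrete case, the fix is to put the bin edges at half-integers so that each value sits in the middle of its own bar. A minimal Python sketch follows; the function name `integer_breaks` and the data are made up for illustration.

```python
from collections import Counter


def integer_breaks(values):
    """Bin edges at half-integers, so each integer value sits mid-bin."""
    lo, hi = min(values), max(values)
    return [k - 0.5 for k in range(lo, hi + 2)]


values = [1, 2, 2, 3, 3, 3, 4, 4, 5]  # made-up data taking the values 1 to 5
breaks = integer_breaks(values)
counts = Counter(values)

print(breaks)
# A crude text histogram: one bar per integer value.
for k in range(min(values), max(values) + 1):
    print(k, "#" * counts[k])
```

In R the same edges can be supplied by hand through the breaks argument of hist(), for example breaks=seq(0.5, 5.5, 1). The point of the complaint is that the software has no way to know it should do this on its own.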

What I think would be better is for every data object to have a "type" attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc. If the type is not specified (i.e., NULL), it could default to unconstrained continuous (thus reproducing what's in R already). Graphics functions could then be free to use the type; for example, if a variable is constrained, one of the plotting options (perhaps the default, perhaps not) would be to have the constraints specify the plotting range.
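To illustrate the idea of a type attached to the data, here is a rough Python sketch. The class `TypedVariable` and its fields are hypothetical; this is not how "mi" or "autograph" store things, just one way that logical limits could travel with the numbers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TypedVariable:
    """Data plus optional logical bounds on the underlying variable."""

    values: Sequence[float]
    lower: Optional[float] = None  # None means unconstrained below
    upper: Optional[float] = None  # None means unconstrained above

    def plot_range(self) -> Tuple[float, float]:
        """Use a logical bound where one is declared, else the data extreme."""
        lo = self.lower if self.lower is not None else min(self.values)
        hi = self.upper if self.upper is not None else max(self.values)
        return lo, hi


proportion = TypedVariable([0.2, 0.35, 0.9], lower=0, upper=1)
count = TypedVariable([3, 10, 42], lower=0)
untyped = TypedVariable([0.2, 0.35, 0.9])

print(proportion.plot_range())  # bounds come from the type
print(count.plot_range())       # lower bound from the type, upper from the data
print(untyped.plot_range())     # falls back to the range of the data
```

With no bounds declared, the behavior reduces to plotting over the range of the data, which matches the idea that an unspecified type defaults to unconstrained continuous.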

Lots of other benefits would flow from this, I think, and that's why we're doing this in "mi" and "autograph". But the basic idea is not limited to any particular application; it's a larger point that data are not just a bunch of numbers; they come with structure.

The discussion here of graphics defaults inspired me to collect this list of defaults in R graphics that I don't like. In no particular order:

- Axes that extend below 0 or above 1
- Tick marks that are too big. They're ok on the windows graphics device, but when I make my graphs using postscript(), I have to set tck=-.02 so that they're not so big.
- Axis labels that are too far from the axes
- Axis numbers that are spaced too closely together
- A horrible system of cryptic graphics parameters ("mgp", "mar", "xaxs", "xaxt", etc)
- Too much space on the outside of the graph. This becomes a real problem when many graphs are put on the page. This can be corrected using mar, but it's a pain, and lots of people don't know about this and just use the default settings (which is why bad defaults are a problem).

I'm sure I could write my own functions to fix these defaults, but I've never gotten around to it; I just copy code from old examples.

There are also things that I have to do by hand but should be done automatically (yes, I know that means I should write my own functions . . .), in particular, labeling individual lines directly on a graph rather than with a legend.

P.S. Yes, I know R is free so I shouldn't complain . . .

Unalphabetize!

| 8 Comments

I dream of a day when a journalist such as Ezra Klein, when seeing a graph such as this from Rob Goodspeed,

wordcounts.jpg

will immediately say, Hey! Why are these items in alphabetical order? That just confuses things. (It's not like they need to be in alphabetical order so that we can look up "faith" in the index or whatever.)

I have no substantive comment on the graph except that it seems unfair to McCain in that his page has fewer total words, which as displayed in the graph makes him look less substantive overall. I mean, maybe it's just a choice for him to focus on just a few issues.

P.S. I'm not knocking Goodspeed, who put in the work to make the graph, or Klein, who went to the trouble of finding it. I'm just saying that in the ideal world, an irrelevantly alphabetized graph would JUMP OUT OF THE PAGE as something not quite right, in the way that a typo or grammatical error does now. But, hey, my job is education, right? So here's my try.

P.P.S. Howard Wainer has called this the Alabama First error and wrote an article on the topic in Chance in 2001.
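Unalphabetizing takes one line of code. The sketch below, in Python, orders the items by size and also shows each as a share of the total, which is one possible answer to the worry that a page with fewer total words looks less substantive. The word counts are invented for illustration and are not taken from the graph.

```python
# Hypothetical word counts, invented for illustration only.
counts = {"economy": 120, "faith": 15, "health": 80, "iraq": 95, "taxes": 40}

# Order by count, largest first, instead of alphabetically.
by_size = sorted(counts.items(), key=lambda item: item[1], reverse=True)

total = sum(counts.values())
for word, n in by_size:
    # The share of the total puts pages of different lengths on the same scale.
    print(f"{word:<8} {n:>4} {n / total:6.1%}")
```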

What's out there? I have a few desires:

1. A speech-oriented statistics package--a front-end to something like Stata or R with voice commands and spoken output. For example:

User: Regress income on height and sex.
Computer: [repeats, to make sure no misunderstanding] Regress income on height and sex.
User: Yes
Computer: There is no "income" variable
U: What variables do we have?
C: height, sex, weight, occupation, earnings, age---
U: [interrupts] Set y to earnings
C: Set y to earnings
U: Yes
C: Regression of earnings on height and sex. The intercept is 3.4 with a standard error of 1.2. The slope for height is . . .
U: Add the interaction of height and sex
C: Add the interaction of height and sex
U: Yes
C: Regression of earnings on height, sex, and height times sex. The intercept is . . .

It would be good to have lots of functions here, but I imagine we could start with regressions and simple statistics and then see what else is useful.

2. A statistical graphics program that uses touch and sound to convey information. For example, a scatterplot or two-dimensional intensity graph could be conveyed with a setup where, as you move a mouse (or a pen, or your hand) over a pad, the computer makes louder sounds where there are more data. I'm thinking of something that sounds like rain, with individual drops for single data points and various sounds of heavy rain or rushing water where there are lots of data.

I'm sure lots more could be done here, for example using some combinations of pitch, timing, chirps, etc., to convey different patterns in data.

Does anyone know what's out there? A quick web search yields this for SPSS and this, which claims to let you hear images, and this screen reader. But what I think we should really be doing is creating some software that is so cool that sighted people will want to use it too.
