November 23, 2008
Visualizing election polls
A colleague points me to these supremely ugly pie-like graphs by Richard Riesenfeld and Geoff Draper. On the other hand, who am I to say they're ugly? I'm sympathetic to the goal of "exposing complex relationships that are not obvious by usual methods of statistical analysis." And it's hard to argue with "Eighty-eight percent said they enjoyed using the software and 71 percent completed all the tasks without errors." I've certainly never performed such an evaluation of my own graphical methods, instead relying, Tufte-like, on my introspective judgment.
Posted by Andrew at 6:51 PM | Comments (3) | TrackBack
November 18, 2008
Estimated votes by county among non-blacks
Ben Lauderdale writes:
I [Ben] had this map [see below] on my door for the last week. Based on exactly the same calculation using constant 95% black support and census-proportional representation. The white counties are the ones whose census names didn't match properly with the names used in the library(maps) package in R, I was too lazy to fix them.

Cool. I'd only suggest using light gray rather than heavy black lines between counties; the map as it is overemphasizes the county borders, I think. But I respect his laziness; there's always time later to fix the details.
Ben continues:
[Below are] the state-by-state county share plots for the lower 49, Obama vote share as a function of black population share. V.O. Key's observation that whites who live near blacks in southern states are less positively inclined towards them is *still* visible in several states.

The circle areas are proportional to county voter turnout. (The biggest circle is L.A. county in California, and so forth.)
Ben also had this comment about his map:
It reminded me of something Bob Putnam would say every time someone presented an empirical talk in our Center for the Study of Democratic Politics series during the year he was a fellow here at Princeton: "You should include miles to the Canadian border as a variable in your regression, it is the most important proxy for political culture in America!" At least in the eastern half of the country, he has a point.
Except for New Hampshire and Vermont, I think.
P.S. For graphics enthusiasts, here are some earlier graphs that I gave the thumbs-down on before Ben came up with the 50 plots above:
First version:

Second version:

Ben was skeptical about proportional circle sizes, but I think it turned out pretty well.
I'd also recommend non-alphabetical ordering of the states and moving away from the misleadingly square 7x7 grid, but I didn't want to hassle Ben any more.
Posted by Andrew at 4:25 PM | Comments (4) | TrackBack
November 6, 2008
The Youth Vote: Freshman vs. Statistics Professor
I have to say I prefer a college freshman's plot to yours, Andrew. Although, you did hack it together at 3am after strolling around Grant Park. Also, drawing the y-axis from 0 is a mistake that you didn't make.
Posted by Boris at 9:49 PM | Comments (2) | TrackBack
November 5, 2008
The stunning^H^H^H^H^H^H^H^H slight increase in voter turnout
Henry posted some great links to voter turnout data and discussions of the topic by Michael McDonald. Henry's graph is here.
Just for fun, I decided to redisplay the information; here is my version:

I've updated it with the latest estimate as of 9 Nov 2008.
Key differences between my graph and Henry's:
1. I go back to 1948, Henry starts at 1980.
2. My y-range goes from 45% to 65%; Henry goes all the way from 0 to 100.
3. Henry's graph labels every election; I label every 20 years.
4. Henry's graph is in gray with many black horizontal lines and a blue line with data; mine is black and white with a line and with data points indicated by dots.
Items 1 and 2 above are the most important, I think: by showing a shorter time range and compressing the y range, Henry makes the changes look less impressive. I understand the rationale for including the whole y-range here, but in this case, since changes are being discussed, and a 5% change is, historically, a big deal, I prefer my graph. I did extend the y-scale out to the [45%,65%] range, though, because I wanted to give a little bit of perspective; it would somehow seem misleading for the data to cover the entire y-range in this case.
In any case, I'm not trying to criticize Henry here; making graphs is just something I like to do, and something I like to think about.
P.S. Below is my (updated) R code, for those of you who want to play at home:
# turnout time series
turnout.year <- seq (1948,2008,4)
turnout.vap <- c(.511,.616,.593,.628,.619,.609,.552,.535,.528,.533,.503,.550,.489,.512,.553,NA)
turnout.VEP <- c(.522,.623,.602,.638,.628,.615,.562,.548,.547,.572,.542,.606,.526,.556,NA,.601)
turnout.VEP[turnout.year==2004] <- turnout.vap[turnout.year==2004] + (turnout.VEP[turnout.year==2000] - turnout.vap[turnout.year==2000])
n <- length (turnout.year)
# the plotting calls below use "turnout", which is not defined above;
# the VEP series is assumed here because it is the one with a 2008 estimate
turnout <- turnout.VEP
png ("turnout.png", height=300, width=400)
par (mar=c(4,4,2,0), tck=-.01, mgp=c(2,.5,0))
plot (turnout.year, turnout, type="l", xlab="Year", ylab="Percentage of voting-age\npopulation who turned out to vote", xaxt="n", yaxt="n", bty="l", ylim=c(.45,.65))
points (turnout.year[1:(n-1)], turnout[1:(n-1)], pch=20)
points (turnout.year[n], turnout[n], pch=21, cex=1.2)
axis (1, seq(1960,2000,20))
yticks <- seq (.45,.65,.05)
axis (2, yticks, paste(yticks*100,"%",sep=""))
mtext ("Voter turnout in postwar presidential elections", line=1)
dev.off()
Posted by Andrew at 12:36 PM | Comments (11) | TrackBack
October 29, 2008
Phoenix Suns shooters
Yair sends in this plot of the week:

He writes:
This displays the smoothed distribution of shots taken by wing players for the Phoenix Suns in the '07-'08 regular season (Matt Barnes played for the GS Warriors that year). Raja Bell seems like the perfect wing player for the Suns, because he plays defense and then basically sits at the 3-pt line waiting for Steve Nash to give him the ball for a good shot. Leandro Barbosa is similar, but he drives a bit more (especially when Nash is off the floor). Grant Hill didn't fit this mold because he has no 3-pt shot; he is more of a mid-range guy. From this standpoint, Matt Barnes (their free-agent pickup) looks like he could be a better fit. Of course, this plot says nothing about whether he actually hits the threes, but at least his heart is in the right place. Then again, if their offensive system changes because of the new coach, all bets are off.
Pretty graphs, huh? The color scheme seems good for a team called the Suns.
Posted by Andrew at 7:21 PM | Comments (3) | TrackBack
October 23, 2008
Google analytics versus random variation
Ted Dunning writes:
Google analytics normally does a pretty good job of dealing with statistical issues. For instance, the Google website optimizer product does a correct logistic regression complete with error bars and (apparently) Bayesian analysis of how likely one setting is to actually be better than another.
But their demo of their latest visualization product is worth a write-up. They seem to ascribe volumes of meaning to variations in small count statistics.
Check out the video.
As Aleks knows, I can't bear to watch videos. I like the idea of dynamic graphics, but I can't stand the lack of control that comes from watching a video. I like to read something that I can see all of at once.
But the Google tool looks pretty cool. Also, I didn't know they did Bayesian logistic regression. I wonder what prior distribution they use? This is a topic that my colleagues and I have thought about.
Ted continues:
I don't know that they do Bayesian logistic regression. I have compared their results on a few cases to your R code (bayesglm) and to non-Bayesian results (glm). The results from all three (Google, glm, bayesglm) were indistinguishable. I suppose that is good news from the standpoint of choice of prior. The only advantage in practice to a Bayesian process would be that having a decent prior would prevent the system from having bad transients when data collection is begun.
The Bayesian-ness that I was referring to was the computation of whether a design option was a good choice. They present these results in terms of probability that a design option would be better than the original design and probability that the design option is the best choice of the ones shown. The computation of these probabilities is inherently Bayesian since it involves integration over the posterior. I have long advocated this way of presenting results to decision makers and am happy to see Google using it. In fact, I generally go one step further since business decision makers usually don't care if they have the absolute best answer and do care about getting answers sooner. The number that I present in these problems is usually the probability that a design option will be within x% of the best design (I call this score viability). This score has the nice property that when an option really doesn't matter, viability of both options will increase to 100% as more data is collected.
I completely agree with Ted's point. There's lots of statistical writing on how to estimate rankings. But, from a decision-analytic point of view, it's very rare that you'd care about rankings at all! Especially in settings with many options and noisy data.
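Here is a minimal sketch of the three probabilities Ted describes, computed by simulation from the posterior. The conversion counts are made up, and the independent Beta(1,1) priors are my assumption for illustration; this is not Google's method or Ted's code.

```r
# Hypothetical conversion counts for three design options (made-up numbers)
successes <- c(original = 120, optionA = 135, optionB = 128)
trials    <- c(original = 1000, optionA = 1000, optionB = 1000)

set.seed(1)
n.sims <- 10000
# Posterior draws of each conversion rate under independent Beta(1,1) priors
# (an assumed prior, chosen only to keep the example short)
draws <- sapply(seq_along(successes), function (j)
  rbeta (n.sims, 1 + successes[j], 1 + trials[j] - successes[j]))
colnames (draws) <- names (successes)

best <- apply (draws, 1, max)                            # best rate in each simulation
p.best <- colMeans (draws == best)                       # Pr (option is the best)
p.beats.orig <- colMeans (draws > draws[,"original"])    # Pr (option beats the original)
x <- 0.05
viability <- colMeans (draws >= (1 - x)*best)            # Pr (within 5% of the best)
print (round (cbind (p.best, p.beats.orig, viability), 2))
```

Viability can never be lower than the probability of being best, and when two options are truly equivalent both viabilities head toward 1 as the data accumulate, which is the property Ted mentions.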
Posted by Andrew at 10:28 PM | Comments (0) | TrackBack
October 21, 2008
Why Model?
Stan pointed me to a short article "Why Model?" by J. M. Epstein. The default principle, both in statistics and in machine learning, is to predict. Any act of statistical fitting that involves likelihood is inherently predictive in its nature.
Visualization is in no way different from predictive modeling - it's just that the (sometimes implicit) model is transparent and interpretable. Visualization is not the only type of interpretable model: even a table with regression coefficients is interpretable, a decision tree is an interpretable model, a list of typical cases is an interpretable model. A 2D scatter plot that nicely shows the difference in outcomes is a model, because the two dimensions used by the plot indeed help distinguish the outcomes.
Most priors are grounded purely in the desire to capture the truth, as such they are predictive priors. But the interpretable models involve priors that are not grounded in prediction - but rather in the human cost of interpretation. The more difficult it is to interpret a parameter, the lower prior probability of interpretation it should have.
In summary, while most mathematical treatment of statistical modeling tends to be focused purely on prediction, there is a good reason why the cost of interpretation should be considered. Epstein's list of why interpretability matters should motivate us to care:
1. Explain (very distinct from predict)
2. Guide data collection
3. Illuminate core dynamics
4. Suggest dynamical analogies
5. Discover new questions
6. Promote a scientific habit of mind
7. Bound (bracket) outcomes to plausible ranges
8. Illuminate core uncertainties.
9. Offer crisis options in near-real time
10. Demonstrate tradeoffs / suggest efficiencies
11. Challenge the robustness of prevailing theory through perturbations
12. Expose prevailing wisdom as incompatible with available data
13. Train practitioners
14. Discipline the policy dialogue
15. Educate the general public
16. Reveal the apparently simple (complex) to be complex (simple)
[Link to paper corrected. Thanks to Lee Sigelman for pointing it out.]
Posted by Aleks Jakulin at 8:38 AM | Comments (2) | TrackBack
September 25, 2008
What's up with Kazakhstan?
Chris Zorn pointed me to this graph and asked for my thoughts. I replied that I'd seen worse, but the use of two dimensions doesn't help, and the comparison to the GDP of Kazakhstan is just weird. I mean, who has any idea what is the GDP of Kazakhstan??
Chris replied,
I'm teaching first-year Ph.D. methods on PoliSci this term, and we have a feature called "Graph of the Day," where -- for five minutes or so at the beginning of every class -- the students all look at and comment on some graph from a paper, the press, etc. I used this one yesterday, and the response (from people with a grand total of three weeks of graduate education) was identical: "What's up with Kazakhstan?", and "Isn't a reference point supposed to be *non-obscure*?"
Posted by Andrew at 9:30 AM | Comments (5) | TrackBack
September 17, 2008
Graph of voter turnout by age
Here's a pretty picture (from Charles Franklin, link from John Sides):

What a great graph! I won't be picky, but if I were, I'd make the following suggestions:
- Bigger numbers on the axes--as is, they're hard to read.
- Add percentage signs on the y-axis.
- Label age every 20 years rather than every 10.
- Put the "80-84" age group at 82 (rather than 80), and put the "85 and up" group at 88 (rather than 85).
- Pick colors other than red and blue.
Posted by Andrew at 8:26 PM | Comments (7) | TrackBack
September 5, 2008
Strong cyclones growing stronger
Jamie pointed me to this graph in the NYT:

Nice! Especially the margin of error, the subtle colorings and the use of the gray background. The y-axis labels are a little weird (why not simply 60, 80, 100, and 120?), and I'm not sure how to think about the x-axis. (Given the scale on the y-axis, should we really care about a change of 1/2 mile per hour in wind speed?) Also, I don't really understand what it means to measure changes in wind speed, if the storms are themselves categorized by wind speed! But, as a graph, it has many excellent features.
Posted by Andrew at 9:58 AM | Comments (9) | TrackBack
September 2, 2008
Erratum
In Red State, Blue State, I attributed the distorted maps to "computer scientists Michael Gastner, Cosma Shalizi, and Mark Newman." Cosma writes that "none of us are or were 'computer scientists', and in fact we were all trained as physicists, and working as physicists at the time." My bad.
Posted by Andrew at 5:01 PM | Comments (0) | TrackBack
August 31, 2008
Data as chartjunk

See here for what I'm talking about.
Posted by Andrew at 7:54 PM | Comments (1) | TrackBack
August 23, 2008
Pretty pictures
Chris Paulse points to this interesting slideshow.
Posted by Andrew at 4:05 PM | Comments (0) | TrackBack
August 22, 2008
Can't the National Weather Service do better than this?
I find the National Weather Service display to be much more useful than weather.com and other commercial sites. But its city-search finder is terrible. Go here and enter "Atlanta" and see what you get. It's a list of about 20 cities. And they don't even list Atlanta, Georgia first. It's buried in there, about fifteenth in the list. What's that all about?
Posted by Andrew at 3:03 AM | Comments (9) | TrackBack
Bad binning can mislead
Howard Wainer writes:
A friend sent me this USA Today article with a graph about HIV:
He sent it because of a paper I published a couple of years ago (with Marc Gessaroli & Monica Verdi) about how we can distort results by changing the bin category boundaries.
The USA Today graph changes the width of the bins. In the attached alternative plots I tried two tactics. One was to plot the average per year HIV infections for each year in a bin:
The other was to group in such a way as to make a (false) point:
P.S. According to comments below, Howard may have been mistaken in his criticisms. Still an interesting discussion topic, though.
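For anyone who wants to try Howard's first tactic, here is a small sketch of converting totals in unequal-width bins to per-year averages. The bin boundaries and counts are made up for illustration; they are not the USA Today numbers.

```r
# Made-up totals for bins of unequal width (illustration only)
start <- c(1980, 1985, 1990, 2000)
end   <- c(1985, 1990, 2000, 2005)
total <- c(500, 600, 1000, 450)

# Dividing by bin width puts every bin on the same per-year scale
per.year <- total / (end - start)

# Bar widths proportional to bin widths, so area corresponds to the total
barplot (per.year, width=end-start, space=0,
  names.arg=paste (start, end, sep="-"), ylab="Average per year")
```

In this toy example the ten-year bin has the largest total but an average per year no higher than the first five-year bin, which is exactly the kind of impression that raw totals can distort.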
Posted by Andrew at 12:14 AM | Comments (6) | TrackBack
July 21, 2008
Animated adiposity
Rebecca sends in this animated graph and writes, "all the white states initially are a bit deceptive, but even so, it's pretty striking, and the animation is very effective." I think I'd prefer a time series of the national average, along with a color-coded animated map showing each state relative to the national average in each year.
Posted by Andrew at 12:08 AM | Comments (2) | TrackBack
July 16, 2008
Dept of silly graphs
Bill Harris points to this:

Bill writes:
Someone on the evaltalk mailing list pointed out the great bar graph representing the numbers 56, 81, and 95 (the number of HD channels on three different services). The group was trying to figure out the y axis scaling that could produce such a graph . . . The scaling is perhaps not that hard, if you assume it's a 3-D plot and the 95 is much closer to the viewer than the other bars. That's not a graphing technique I've seen formally taught.
P.S. I also wonder what's on those extra channels. Are they 14 channels of home shopping, or what?
P.P.S. No, this is not even close to being a candidate for worst graph ever.
Posted by Andrew at 12:30 AM | Comments (4) | TrackBack
July 11, 2008
Opening Day
Nathan Yau writes,
I recently put up a visualization showing the spread of walmarts over time, ... I'm wondering if you know of any other "opening dates" data (starbucks, for example)? I'm itching to put some more data into my code.
Posted by Andrew at 6:03 PM | Comments (2) | TrackBack
July 10, 2008
More graphical propaganda
John Sides reproduces this graph showing Kenyan election results:
What a horrible graph! The re-coloring and re-ordering of the wedges make the difference between "official results" and "poll" seem much greater than it is.
As in my earlier example of PDA (propaganda data analysis), I have no comments on the merits of the case (for example, what can you learn from a poll taken six months after the election)--I'm just weighing in on the graphical presentation.
Posted by Andrew at 3:27 AM | Comments (3) | TrackBack
June 14, 2008
Too clever by half
I appreciate the effort, but I fear that the message that many have taken from Tufte is "graphs should be cool" rather than "graphs should be clear." As Yu-Sung put it, "I am still figuring out how to read it."
Posted by Andrew at 7:01 PM | Comments (11) | TrackBack
June 13, 2008
Hey . . . nice graph!
From Andrew Sullivan. More here. I love this stuff.
Posted by Andrew at 12:31 AM | Comments (0) | TrackBack
June 6, 2008
New candidate for worst graph ever
Andrew Smith sends in this:

He writes, "I think it beats the pie chart you referenced in your previous blog post! My brain still hurts trying to parse it."
In all seriousness: yes, a scatterplot would be better. And they gotta work on their axis labeling. "61.8?"
P.S. In contrast, the photographic height/weight chart is excellent.
Posted by Andrew at 1:04 AM | Comments (3) | TrackBack
June 5, 2008
Damn this is cool
Chris Zorn writes, http://graphics8.nytimes.com/packages/flash/politics/20080603_MARGINS_GRAPHIC/margins.swf
He's clearly a man of few words. I'll give it as a link. You can play with it, click on things, see all sorts of fun stuff.
What I'd really like to do is pipe this through a hierarchical model to smooth out the inevitable survey fluctuations. Also, it would be good to subtract off main effects. For example, in the graph below, are well-educated Arkansans particularly strong Clinton supporters, or is this just a combination of Arkansas being a Clinton state and small-sample fluctuation?

Anyway, I'm not complainin, just suggesting even more things that could be done with these data and this software. The first thing to do is to run it with the 2000 and 2004 exit polls. This app would go great with our Red State, Blue State book.
Posted by Andrew at 12:06 AM | Comments (4) | TrackBack
May 19, 2008
Frustrating inability of standard graphics programs to recognize discrete variables
I'm not the only one who gets frustrated about such things.
Posted by Andrew at 1:21 AM | Comments (4) | TrackBack
May 12, 2008
Axes that extend below 0 or above 1: actually a bigger issue involving how statistical variables are stored on the computer
I was thinking more about axes that extend beyond the possible range of the data, and I realized that it's not simply an issue of software defaults but something more important, and interesting, which is the way in which graphics objects are stored on the computer.
R (and its predecessor, S) is designed to be an environment for data analysis, and its graphics functions are focused on plotting data points. If you're just plotting a bunch of points, with no other information, then it makes sense to extend the axes beyond the extremes of the data, so that all the points are visible. But then, if you want, you can specify limits to the graphing range (for example, in R, xlim=c(0,1), ylim=c(0,1)). The defaults for these limits are the range of the data.
What R doesn't allow, though, are logical limits: the idea that the space of the underlying distribution is constrained. Some variables have no constraints, others are restricted to be nonnegative, others fall between 0 and 1, others are integers, and so forth. R (and, as far as I know, other graphics packages) just treats data as lists of numbers. You also see this problem with discrete variables; for example when R is making a histogram of a variable that takes on the values 1, 2, 3, 4, 5, it doesn't know to set up the bins at the correct places, instead setting up bins from 0 to 1, 1 to 2, 2 to 3, etc., making it nearly impossible to read sometimes.
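The usual workaround for the histogram problem is to set the breaks by hand at the half-integers, so that each bar is centered on one value. A minimal sketch with made-up data:

```r
x <- c(1, 2, 2, 3, 3, 3, 4, 4, 5)    # made-up discrete data taking values 1 to 5

# Breaks at 0.5, 1.5, ..., 5.5 give one bin per integer value
breaks <- seq (min(x) - 0.5, max(x) + 0.5, by=1)
h <- hist (x, breaks=breaks, main="", xlab="x")
print (h$counts)
```

It works, but the point stands: I had to tell R that the variable is discrete by rewriting the breaks myself. Another option for data like these is barplot (table (x)).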
What I think would be better is for every data object to have a "type" attached: the type could be integer, nonnegative integer, positive integer, continuous, nonnegative continuous, binary, discrete with bounded range, discrete with specified labels, unordered discrete, continuous between 0 and 1, etc. If the type is not specified (i.e., NULL), it could default to unconstrained continuous (thus reproducing what's in R already). Graphics functions could then be free to use the type; for example, if a variable is constrained, one of the plotting options (perhaps the default, perhaps not) would be to have the constraints specify the plotting range.
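To make the idea concrete, here is a rough sketch using an R attribute. The function names set.type and logical.limits and the type labels are hypothetical, invented for this example; this is not how "mi" or "autograph" does it.

```r
# Hypothetical helper: attach a "type" attribute to a numeric vector
set.type <- function (x, type=NULL) {
  attr (x, "type") <- type
  x
}

# Hypothetical lookup from type to plotting limits;
# a NULL or unrecognized type falls back to the range of the data
logical.limits <- function (x) {
  type <- attr (x, "type")
  data.range <- range (x, na.rm=TRUE)
  if (is.null (type)) return (data.range)
  switch (type,
    "proportion" = c(0, 1),
    "nonnegative" = c(0, data.range[2]),
    data.range)
}

share <- set.type (c(.12, .35, .48, .77), "proportion")   # made-up data
count <- set.type (c(3, 10, 42), "nonnegative")           # made-up data

# The constraint, not the data, sets the axis; yaxs="i" stops R from padding it
plot (share, ylim=logical.limits (share), yaxs="i", ylab="Share")
```

A real implementation would need the attribute to survive subsetting and arithmetic, which plain attributes in R often do not, so an S4 class or something similar is probably the way to go.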
Lots of other benefits would flow from this, I think, and that's why we're doing this in "mi" and "autograph". But the basic idea is not limited to any particular application; it's a larger point that data are not just a bunch of numbers; they come with structure.
Posted by Andrew at 9:54 AM | Comments (14) | TrackBack
Some graphics defaults that I dislike
The discussion here of graphics defaults inspired me to collect this list of defaults in R graphics that I don't like. In no particular order:
- Axes that extend below 0 or above 1
- Tick marks that are too big. They're ok on the windows graphics device, but when I make my graphs using postscript(), I have to set tck=-.02 so that they're not so big.
- Axis labels that are too far from the axes
- Axis numbers that are spaced too closely together
- A horrible system of cryptic graphics parameters ("mgp", "mar", "xaxs", "xaxt", etc)
- Too much space on the outside of the graph. This becomes a real problem when many graphs are put on the page. This can be corrected using mar, but it's a pain, and lots of people don't know about this and just use the default settings (which is why bad defaults are a problem).
I'm sure I could make my own functions to do this but I haven't ever gotten around to doing this; I just copy code from old examples.
There are also things that I have to do by hand but should be done automatically (yes, I know that means I should write my own functions . . .), in particular, labeling individual lines directly on a graph rather than with a legend.
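For what it's worth, here is the sort of thing I have in mind: a small wrapper that sets my preferred margins and tick marks, followed by direct labeling of lines. The function name my.par and the data are made up, and the particular values are just the ones I tend to copy from old examples.

```r
# Hypothetical wrapper: tighter margins, axis labels closer in, smaller ticks
my.par <- function (...) {
  par (mar=c(3,3,2,1), mgp=c(1.7,.5,0), tck=-.02, ...)
}

my.par ()
years <- seq (1980, 2000, 5)
series <- cbind (A=c(1,2,3,5,8), B=c(2,2,3,3,4))    # made-up data
matplot (years, series, type="l", lty=1, col="black", bty="l",
  xlab="Year", ylab="Value", xlim=c(1980,2003))
# Label each line at its right-hand end instead of using a legend
text (max(years), series[nrow(series),], colnames (series), pos=4)
```

The extra room in xlim is there so the labels fit inside the plotting region.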
P.S. Yes, I know R is free so I shouldn't complain . . .
Posted by Andrew at 8:21 AM | Comments (9) | TrackBack
May 8, 2008
Unalphabetize!
I dream of a day when a journalist such as Ezra Klein, when seeing a graph such as this from Rob Goodspeed,

will immediately say, Hey! Why are these items in alphabetical order? That just confuses things. (It's not like they need to be in alphabetical order so that we can look up "faith" in the index or whatever.)
I have no substantive comment on the graph except that it seems unfair to McCain in that his page has fewer total words, which as displayed in the graph makes him look less substantive overall. I mean, maybe it's just a choice for him to focus on just a few issues.
P.S. I'm not knocking Goodspeed, who put in the work to make the graph, or Klein, who went to the trouble of finding it. I'm just saying that in the ideal world, an irrelevantly alphabetized graph would JUMP OUT OF THE PAGE as something not quite right, in the way that a typo or grammatical error does now. But, hey, my job is education, right? So here's my try.
P.P.S. Howard Wainer has called this the Alabama First error and wrote an article on the topic in Chance in 2001.
Posted by Andrew at 1:27 PM | Comments (8) | TrackBack
April 15, 2008
Statistical software for blind people
What's out there? I have a few desires:
1. A speech-oriented statistics package--a front-end to something like Stata or R with voice commands and spoken output. For example:
User: Regress income on height and sex.
Computer: [repeats, to make sure no misunderstanding] Regress income on height and sex.
User: Yes
Computer: There is no "income" variable
U: What variables do we have?
C: height, sex, weight, occupation, earnings, age---
U: [interrupts] Set y to earnings
C: Set y to earnings
U: Yes
C: Regression of income on height and sex. The intercept is 3.4 with a standard error of 1.2. The slope for height is . . .
U: Add the interaction of height and sex
C: Add the interaction of height and sex
U: Yes
C: Regression of income on height, sex, and height times sex. The intercept is . . .
It would be good to have lots of functions here, but I imagine we could start with regressions and simple statistics and then see what else is useful.
2. A statistical graphics program that uses touch and sound to convey information. A scatterplot or two-dimensional intensity graph could be conveyed with a setup where, as you move a mouse (or a pen, or your hand) over a pad, the computer makes louder sounds where there are more data. I'm thinking of something that sounds like rain, with individual drops for single data points and various sounds of heavy rain or rushing water where there are lots of data.
I'm sure lots more could be done here, for example using some combinations of pitch, timing, chirps, etc., to convey different patterns in data.
Does anyone know what's out there? A quick web search yields this for SPSS and this, which claims to let you hear images, and this screen reader. But what I think we should really be doing is creating some software that is so cool that sighted people will want to use it too.
Posted by Andrew at 11:37 AM | Comments (4) | TrackBack
March 28, 2008
Another way to organize data...
There is a great video on YouTube which shows a project representing numbers of people by grains of rice. You can see the contrast of one person (in this case, Tony Blair - clearly, this is a UK project) to the number of people on one continent, to the number of people in one country, etc. Though simple, it's interesting to actually see someone do this.
Link:
http://www.youtube.com/watch?v=iDWcuBygAUw
Posted by Juli at 11:59 AM | Comments (0) | TrackBack
Surface vs. Contour Plots for the Presentation of Three-Dimensional Data
Chris Zorn sends along this. A rare paper about graphics that offers data as well as opinion!
Posted by Andrew at 12:53 AM | Comments (6) | TrackBack
March 23, 2008
Gap in Life Expectancy Widens for the Nation
I don't really have anything to say about this article in the New York Times, except for a comment on the graph:

It's pretty good--excellent use of a common y-axis--but I think they should've put time on the x-axis and used three separate lines for "low status," "intermediate status," and "high status." I prefer to put time on the x-axis where possible. Especially since the article's focus is on the "gap," which looks like a slope in the above graph. If time were on the x-axis and different lines were used for different-status groups, then the "gaps" really would be gaps between lines in the graph.
The article (by Robert Pear) had some interesting discussion about differences between rich and poor in smoking, risky behavior, access to health insurance, and adherence to treatment advice. The only thing that seemed silly to me was when someone was quoted as saying, "Middle-class and upper-income people have greater access to the huge amounts of health information on the Internet." I mean, sure, the internet is great, but crediting it with years of life expectancy seems a bit too strong. The internet is a fine way to score some Vicodin, but it's hard for me to imagine it can explain health disparities. (Maybe just a failure of my imagination, as usual.)
P.S. Typo fixed; thanks, Garrett!
P.P.S. See Steve Kass's comments below.
Posted by Andrew at 1:36 PM | Comments (7) | TrackBack
March 21, 2008
Nathan Yau's contest
See here. He's offering a copy of Tufte's "The visual display of quantitative information." My own favorite among Tufte's books is his second book, "Envisioning information," which, to my taste, has more in the way of practical tips and less in the way of ranting. (Don't get me wrong, I like all of Tufte's books; I'm just giving you my preferences. His first book was fun to read, but his second book changed how I do statistics.)
Posted by Andrew at 12:54 AM | Comments (2) | TrackBack
March 11, 2008
But why are bananas labeled as more difficult than pears?
(From Kaiser)
P.S. I fixed the mistake in title of blog entry. Anyway, I still think bananas are easier than pears. Although, it's true that you have to take care of bananas so they don't get bruised.
Posted by Andrew at 10:29 PM | Comments (11) | TrackBack
March 5, 2008
Why bother giving it a title at all?
I'm always worried about using too much jargon when labeling my graphs, but I don't think I'll ever be able to top this title:
"Figure 5-9: Annual Intersextile Ranges in Budget Authority for Domestic Subfunctions, Fiscal Years 1951-2005"
I like the "fiscal years" bit--it's a nice touch.
P.S. The actual content in the graph is interesting and important--as the author (Eric Patashnik) notes in the text, "year-to-year variability [in discretionary government spending on domestic items] declined significantly between the 1950s and the mid-1980s." It's all good stuff, just an amusingly jargon-laden title.
Posted by Andrew at 12:21 AM | Comments (2) | TrackBack
February 26, 2008
Plotting a million (or more) better than Galton
Howard pointed me to this cool page of artwork by Chris Jordan. Here's one example:
Plastic Cups, 2008 60x90"
Depicts one million plastic cups, the number used on airline flights in the US every six hours.

Partial zoom:

Detail at actual print size:

Several more are at Jordan's website.
Posted by Andrew at 12:38 AM | Comments (0) | TrackBack
February 11, 2008
More discreteness, please
Justin Wolfers presents this graph that he (along with Eric Bradlow, Shane Jensen, and Adi Wyner) made comparing the career trajectory of Roger Clemens to other comparable pitchers:

The point is that Clemens did unexpectedly well in the later part of his career (better earned run average, allowed fewer walks+hits) compared to other pitchers with long careers. This in turn suggests that maybe performance-enhancing drugs made a difference. Justin writes:
To be clear, we don’t know whether Roger Clemens took steroids or not. But to argue that somehow the statistical record proves that he didn’t is simply dishonest, incompetent, or both. If anything, the very same data presented in the report — if analyzed properly — tends to suggest an unusual reversal of fortune for Clemens at around age 36 or 37, which is when the Mitchell Report suggests that, well, something funny was going on.
I can't comment on the steroids thing at all, but I will say that I'd like more information than is in the graphs. For one thing, Clemens is clearly not a typical pitcher and never has been. At the very least, you'd like to see the comparison of his trajectory with all the other individual trajectories, not simply the average. For another, the graphs above seem to be relying way too much on the quadratic fit. At least for the average of all the other pitchers, why not show the actual averages? Far be it from me to criticize this analysis (especially since I am friends with all four of the people who did it!)--this is just a recreational activity, and I'm sure these guys have better things to do than correct ERA's for A.L./N.L. effects, etc.--but I think you do want to have some comparisons of the entire distribution, as well as a sense of how much the "unusual reversal around ages 36 or 37" is an artifact of the fitted model.
P.S. to Justin, Eric, Shane, and Adi: Now youall have permission to be picky about my analyses in return. . . .
P.P.S. Nathan made this plot showing data from the 16 most recent Hall of Fame pitchers.
Posted by Andrew at 12:40 AM | Comments (11) | TrackBack
February 1, 2008
Exemplary statistical graphics
Chris "last author" Zorn points us to this, commenting: "Too many pie charts, but aside from that..."
Posted by Andrew at 4:02 PM | Comments (3) | TrackBack
January 24, 2008
Does jittering suck?
Antony Unwin saw this scatterplot (see here for background):

and had some comments and suggestions. I'll show his plots below, but first I want to talk about jittering. I wonder if the main problem with my original graph above is that it is too small.
In any case, the jittering makes it look weird, but I wonder whether it would be better if it were jittered a bit more, so that the clusters of points blurred into each other completely. (Since the data are integers, we could just jitter by adding random U(-.5,.5) numbers to each point in the x and y directions.)
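To make the U(-.5,.5) idea concrete, here is a minimal sketch in Python (not the R code behind the original plot). The function name `jitter`, the `seed` argument, and the example scores are all made up for illustration.

```python
import random

def jitter(points, half_width=0.5, seed=None):
    """Add independent U(-half_width, half_width) noise to each x and y."""
    rng = random.Random(seed)
    return [
        (x + rng.uniform(-half_width, half_width),
         y + rng.uniform(-half_width, half_width))
        for x, y in points
    ]

# Hypothetical integer-valued (x, y) scores, with repeated cells.
scores = [(0, 0), (0, 0), (1, 2), (1, 2), (1, 2), (3, -1)]
jittered = jitter(scores, seed=1)
```

With a half-width of 0.5 each integer cell is smeared over its full unit square, so neighboring clusters touch instead of leaving gaps between them.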
But Antony says:
Jittering makes strong assumptions, which are rarely mentioned. It is bad for small cell sizes (you can get odd patterns unless you specifically adjust your jittering to account for cell size and how many do that?) and is bad for large cell sizes (because of overplotting and because you get solid blocks which can hardly be distinguished from one another). In fairness I should declare myself as an anti-jittering fundamentalist and say that there are hardly any circumstances when I think jittering is useful. Jittering is a legacy from the days when you could only plot points. Area plots should always be the first choice.
Maybe he's right. A gray-scale plot using image() might be a better way to go in a situation like this one with many hundreds of data points.
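The alternative to jittering is to count the points in each integer cell and draw the counts. The sketch below only does the counting step, and the mapping from count to gray level (count divided by the largest count) is my assumption, not something specified above; `cell_shades` and the example scores are hypothetical.

```python
from collections import Counter

def cell_shades(points):
    """Map each occupied (x, y) cell to (count, gray level in (0, 1])."""
    counts = Counter(points)
    if not counts:
        return {}
    biggest = max(counts.values())
    return {cell: (n, n / biggest) for cell, n in counts.items()}

# Hypothetical integer-valued (x, y) scores, with repeated cells.
scores = [(0, 0), (0, 0), (1, 2), (1, 2), (1, 2), (3, -1)]
shades = cell_shades(scores)
```

The same counts could just as well drive the square sizes in a fluctuation diagram; only the drawing step differs.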
The Unwin solution
OK, now here are Antony's plots:



and the following description:
The attached fluctuation diagram was drawn in iPlots (which expects mosaicplot variables to be factors):
soc.scoreS1 <- -soc.score.S
econ <- as.factor(econ.score.S)
soc <- as.factor(soc.scoreS1)
imosaic(econ, soc, type="fluctuation")
I think this plot shows the bivariate distribution of the data much better. It is clear where the bulk of the data lie and differences between cells are much more apparent. Best of all, you can link to other displays. I have included the same plot twice more, once with the Democrats highlighted, once with the Republicans highlighted.
The parties were selected from a barchart of party affiliation:
ibar(pid)
He then makes a plot showing Dems, Reps, and Independents in the same grid:
You just include another level in the fluctuation diagram:
imosaic(pid,soc,econ, type="f")
I [Antony] think it looks better with the cells shaded, which can be achieved with
iplot.opt(fillColor="gray")
or colour them by party with
iplot.opt(col=Party)
It might be better to leave out the Independents, as there are not so many, though the ordering looks right, with their plot lying between those for the Democrats and the Republicans.
Here are the pics:



(Sorry about all the white space. I converted the plots from pdf to png to display on the blog.)
(Fixed 1/24/2008 10:42)
Posted by Andrew at 6:00 AM | Comments (13) | TrackBack
January 9, 2008
Errol Morris update
Regarding this story, Antony Unwin sends the following graph with a note:

It all depends what you want to show. I [Unwin] like the attached multiple barcharts view comparing the distribution of reasons by verdict, with the reasons ordered by count. The "Undecided" figures are taken from the piechart and the information on Totals in the article:
Shadow 89
Gravity 5
Camera/Exposure 20
Topography/Climate 21
Character/Artistic 42
Ball Properties 31
Practical Concerns 16
Shelling 45
Rocks 20
Physics 2
The graphic shows that
(a) Shadow was by far the most important reason cited by those saying "On";
(b) the "Undecideds" were also influenced by Shadow;
(c) Character and Shelling were most relevant by a small margin for those judging "Off" and were relatively rarely mentioned by those saying "On";
and this graphic reflects the absolute numbers.
A multiple barcharts view of verdicts by reason is also interesting. Do you consider reason or verdict first?
Posted by Andrew at 12:47 AM | Comments (1) | TrackBack
December 19, 2007
Why do people persist in using terrible statistical graphics?
Some of you will remember that a few months ago this blog mentioned Errol Morris's New York Times article about two famous old photographs, both of which show the same stretch of road on the Crimean peninsula. One photograph shows the road covered with cannonballs, with additional cannonballs strewn around the ground on both sides of the road; the other shows the road clear of cannonballs. As Morris discusses, it has long been assumed that the photo with the clear road --- the "off the road" picture --- was taken first, and that the photographer and his crew then moved a bunch of cannonballs onto the road to take the "on" picture. In his article, Morris questioned whether this ordering was in fact correct.
As Morris discusses in another article, the traditional wisdom was in fact correct: "Off" came first. This can be determined pretty conclusively by looking at the cannonballs that are lying around on the ground: many of them have shifted position slightly, and in every case they are slightly farther downhill in the On photo than in the Off photo. The only story that makes sense is that the Off photo was taken, and then these cannonballs were disturbed (presumably by the photographer and his team
Before the answer was known for sure, Morris asked his readers to send in their opinions and reasons. In a new article, Morris summarizes the reasons, using some of the worst statistical graphics I have seen in 2007 (it's worth taking a look). And he likes them (the graphics, I mean)!
If anyone would like to make a better display, here are Morris' data. (Sorry, he doesn't really discuss what the reasons mean, so you'll just have to work with what's here). The first line is a header line; subsequent lines give the reason, the number of people who cited this reason in describing why they think "On" came first, and the number who cited this reason in describing why "Off" came first. ("Off" is the right answer).
Reason,On,Off
Shadow,149,23
Gravity,3,5
# and Position,155,75
Camera/Exposure,20,10
Topography/Climate,33,17
Character/Artistic,20,37
Ball Properties,17,8
Practical Concerns,60,25
Shelling,10,30
Rocks,13,19
Physics,2,2
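As a starting point for anyone taking up the challenge, here is a small Python sketch that reads the table above and computes, for each reason, the share of citations that came from the "On" camp, with reasons ordered by total count. The data are copied from the lines above; the function name `on_share_by_reason` and the choice of summary are mine.

```python
import csv
import io

MORRIS_DATA = """Reason,On,Off
Shadow,149,23
Gravity,3,5
# and Position,155,75
Camera/Exposure,20,10
Topography/Climate,33,17
Character/Artistic,20,37
Ball Properties,17,8
Practical Concerns,60,25
Shelling,10,30
Rocks,13,19
Physics,2,2
"""

def on_share_by_reason(text):
    """Return (reason, on, off, on_share) tuples, largest total first."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        on, off = int(row["On"]), int(row["Off"])
        rows.append((row["Reason"], on, off, on / (on + off)))
    return sorted(rows, key=lambda r: r[1] + r[2], reverse=True)

summary = on_share_by_reason(MORRIS_DATA)
for reason, on, off, share in summary:
    print(f"{reason:<20} n={on + off:>3}  On share={share:.2f}")
```

Ordering by total keeps the heavily cited reasons at the top, so the rows with only a handful of citations do not dominate the display.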
Posted by Phil at 1:22 PM | Comments (7) | TrackBack
December 7, 2007
Graphical Representations of Voting Results
Matt pointed me to this paper by Robert Vanderbei:
We describe and illustrate various ways to represent election results graphically. The advantages and disadvantages of the various methods are discussed. While there is no one perfect way to fairly represent the outcomes, it is easy to come up with methods that are superior to those used in recent elections.
The coolest things in the paper are some 3-color maps. Here's 1992: blue is Clinton, red is Bush, and green is Perot:

It has the usual problem that large sparsely-populated areas are overrepresented but otherwise is ok, and certainly provides some interesting information. Vanderbei has some interesting discussion of the choice of colors for displaying these scales.
My other thoughts on the paper:
1. What's with the lower-case "democratic" and "republican"? It's standard to write these in caps.
2. I really hate those so-called "cartograms" (p.10 of the paper) since they draw attention to the distortion (the distribution of population) rather than the votes, which is really what we want to see.
3. I still like this map, which unfortunately isn't in the paper:

4. For the maps by Congressional district (page 11 of the paper), I'd prefer to put one dot per district rather than shading. The shading overemphasizes large areas (as usual) and also adds another distracting feature of drawing attention to the shapes of the districts, which is not the main point of interest.
That said, I do often present colored-state maps myself, because it is a clear way of presenting the information, despite all the problems in interpretation.
Posted by Andrew at 2:51 AM | Comments (4) | TrackBack
December 3, 2007
Exploratory data analysis course
Aleks noticed this interesting-looking course:
Course Contents
Predictive Analytics and Exploratory Data Mining
* the relationship between predictive analytics and exploratory data mining
* the role of graphics in exploratory analysis
* complexity in a PowerPoint world
* the analyst's dilemma
Working with Unstructured Data
* data streams versus structured data
* social network analysis as a solution to unstructured problems
* statistical mechanics of network analyses
* predicting with a network
* complex networks versus reductionism
Exploratory Data Mining and Predictive Models
* exploratory data mining success
* predictive modeling methods
* logistic regression
* decision trees
* neural networks
* the truth about neural networks
* comparing and contrasting predictive modeling methods
* model structure and impact on exploratory results
* graphical review of model results
* multi-dimensional graphics
Exploratory Predictive Modeling
* initial data screening
* elements of an exploratory script
* developing complex predictive models for exploratory efforts
* identifying important variables
* analyzing variables, domains, and clusters
* graphical review of models and data
Exploratory Findings
* extracting new hypotheses (exploratory findings) from the predictive model
* building confidence with the exploratory findings
* recognizing and overcoming impediments to acceptance by the target audience
Remind me again why we teach classes on boring topics like "categorical data analysis" . . .
Posted by Andrew at 11:48 AM | Comments (0) | TrackBack
November 22, 2007
Assistance in picking colors and charts
For a few years I have used Cindy Brewer's ColorBrewer system for picking the right color scheme for graphics; it is based on experience from cartography. Recently, it has also been made into an R package (RColorBrewer). In particular, ColorBrewer distinguishes three types of color schemes: diverging, sequential and qualitative:

Diverging blurs out the mean (appropriate for visualizing normally distributed real variables, or also correlations with color), sequential blurs out the zero (appropriate for visualizing exponentially distributed positive variables with color), and qualitative makes it easy to distinguish adjacent values (appropriate for categorical variables).
ColorBrewer provides guidelines with respect to appropriateness of a color scheme for print, monitors, laptops, projectors, photocopying and even color blindness (I once had someone complain about my color schemes after a lecture, only to find out that he was color blind).
I've been frustrated by people who do not use infinite palettes for visualizing data, and RColorBrewer exacerbates the problem by not being able to create palettes of arbitrary size. But, using simple linear interpolation between colors, one can create very appealing infinite palettes that maintain the approximate perceptual linearity (meaning that a change in our perception of color strength is proportional to the change in value across the scale) of ColorBrewer's palettes:

While some people might argue that 11 bins are enough, I'd respond to this by saying that binning is an act of pure and inexcusable laziness in the case when you can easily visualize a continuum.
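For those who want to try this, here is a minimal Python sketch of the interpolation idea: piecewise-linear blending between a few anchor colors, so that any value in [0, 1] gets its own color. The anchor colors below are a made-up blue-white-red ramp, not an actual ColorBrewer palette, and interpolating in plain RGB is an assumption on my part; it only approximately preserves the perceptual properties of the anchors.

```python
def lerp_color(c0, c1, t):
    """Blend two RGB triples; t=0 gives c0, t=1 gives c1."""
    return tuple(round(a + (b - a) * t) for a, b in zip(c0, c1))

def continuous_palette(anchors, value):
    """Map value in [0, 1] to an RGB triple along equally spaced anchors."""
    if len(anchors) < 2:
        raise ValueError("need at least two anchor colors")
    value = min(1.0, max(0.0, value))          # clamp out-of-range input
    pos = value * (len(anchors) - 1)           # position along the ramp
    i = min(int(pos), len(anchors) - 2)        # index of the left anchor
    return lerp_color(anchors[i], anchors[i + 1], pos - i)

# Hypothetical diverging anchors (blue, white, red).
ANCHORS = [(0, 0, 255), (255, 255, 255), (255, 0, 0)]
colors = [continuous_palette(ANCHORS, k / 100) for k in range(101)]
```

Swapping in the 9 or 11 colors of a real ColorBrewer scheme as the anchors gives the arbitrary-size palette described above.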
Just a few days ago, I came across another tool: Juice Analytics Chart Chooser. It lists a number of charts, both as pictures and as PowerPoint and Excel templates:

Each chart may have some of the following features:
- Trend involves a variable indicating time
- Composition involves a set of variables that add up to 1
- Distribution exhibits an occurrence count for different values of some variable
- Comparison pairs up two or more variables
- Relationship visualizes a complex relationship between two or more variables
From this you can see how visualizations carry many parallels to models. Picking a visualization is very much like picking a good model. I have discussed this before.
To finish, here is a chart I use for assigning dimensions to graphical elements. A full circle indicates a good choice, an empty one an acceptable choice, while the absence of a circle means that the element isn't useful for presenting that particular aspect of the data.

Note, nominal corresponds to the qualitative scale above, quantitative to diverging and ordinal approximately to sequential.
Happy Thanksgiving!
Posted by Aleks Jakulin at 1:51 AM | Comments (6) | TrackBack
November 15, 2007
When is a bad graphic a good graphic?
This graphic (from SolarPowerRocks.com, which also gives references for the numbers, which I have not checked) is pretty neat. It compares annual U.S. energy R&D expenditures with the cost of the war in Iraq (one might well question whether it is reasonable or even meaningful to compare those, but that's not what this post is about). The graphic is neat precisely because it is so useless: it makes the point that the costs that it compares are so wildly different in magnitude that you can't even plot them on the same graph. Of course any of us could think of ways that you could plot them on the same graph, but that would make the graphic more informative at the expense of making it useless for its intended purpose.
SolarPowerRocks.com says "These figures are in millions. The source for energy R&D expenditures is from the National Council for Science and the Environment."

One thing worth pointing out, I guess: As Andrew and I have discussed for literally decades, this is the sort of thing that makes people say "if we can afford the war in Iraq, we can afford to spend $X on solar power research." But that works the wrong way: we can't afford the solar power research, because we're spending all of that money in Iraq!
Posted by Phil at 4:47 PM | Comments (2) | TrackBack
October 26, 2007
Venn Diagram Challenge Summary 1.5
A few people have pointed us to some more of the Venn Diagram Challenge diagrams in response to the Venn Diagram Challenge Summary 1:
Patrick Murphy has pointed us to other works of his

It's nicely made and combines the virtues of a Venn diagram with those of a bar chart. If you are interested, there are more versions of his works here, and here.
Here is my [Bernard's] attempt at this. Based on information provided by Igor my [Bernard's] feeling is that the question behind the Venn Diagram is "which combinations of tests are consistent over time".
Bernard's work is good at comparing Autism and Autism Spectrum, although it would have been better if it had the baseline counts somewhere.
We are still working on the second part of the summary.
Posted by Masanao at 10:20 PM | Comments (0) | TrackBack
October 15, 2007
Parallel Coordinates and a talk at Columbia
Alfred Inselberg, the inventor of parallel coordinates (pictured below), will be giving a talk at Columbia this Thursday at 11am. More information in the extended entry.

Columbia Vision and Graphics Center Lecture
Thursday, October 18, 2007, 11am
Schapiro Center, Interschool Lab
Columbia University
Multidimensional Visualization and its Applications
Alfred Inselberg
School of Mathematical Sciences, Tel Aviv University, Israel
Senior Fellow in Visualization -- San Diego Supercomputer Center, USA
The desire to understand the underlying geometry of multidimensional
problems motivated several visualization methodologies to augment our limited
3-dimensional perception. After a short overview, Parallel Coordinates are
rigorously developed, obtaining a 1-1 mapping between subsets of Euclidean
N-space and subsets of 2-space. It leads to representations of lines, flats,
curves, intersections, hypersurfaces, proximities and geometrical
construction algorithms. Convexity can be visualized in any dimension,
as well as non-orientability (Moebius strip) and other properties of
hypersurfaces. This is a visual multidimensional coordinate system with
applications to air traffic control, visual and automatic data mining,
and interactive models of complex systems.
Posted by Aleks Jakulin at 4:59 PM | Comments (9) | TrackBack
Venn Diagram Challenge Summary 1
The Venn Diagram Challenge, which started with this entry, has spurred exciting discussions at Junk Charts, EagerEyes.org, and Perceptual Edge. So I thought I would do my best to put them together in one piece.
The outcomes people created can be divided into 2 groups. The first group dealt with the problem of expressing the "3-way Venn diagram of percentage with different base frequency". The second group went a little deeper, trying to figure out a better way to express graphically what the paper is trying to say. Our ultimate goal is the second one; however, the first problem is itself an interesting challenge, so I will deal with them separately. (The second group will be dealt with in the Venn Diagram Challenge Summary 2, which should come shortly after this article.)
Venn diagram converted into a table:
(For background you can look at the previous posts: the original entry, Antony Unwin's mosaic chart, and Stack Lee's bar chart.)
How to express 3-way Venn diagram of percentage with different base frequency better
Here are the 4 graphs I am aware of that fall in this category:
Stack Lee
Robert Kosara

Patrick Murphy

Antony Unwin

It is always amazing to see how people make cool graphics out of the same data.
There were at least 4 things (percentage, base frequency, structure, and possible trend) in the Venn diagram that could have been expressed graphically. When we dissect the above graphs by these 4 things, the result is the following:

So the biggest differences between the graphs are in the way the structure is expressed. Another point to note is how the different graphs address the issue of base frequency. It's hard to say which one is best because they all have points that I like. For example, to express percentage, Antony's mosaic chart seems the most suitable, since the gray area alongside the green area on each bar makes it clear that it is showing a proportion. To express base frequency, I again like Antony's mosaic chart, since it gives heavy weight to the combinations with more samples, which are the results we should focus on most. As for expressing structure, it is a tough call between Patrick and Robert; I like them both in different ways. Stack's bar chart seems very good at comparing Autism and Autism Spectrum, which I should have put in the chart.
What we did:
We made 2 graphs
Figure 1: line graph of prevalence of best-estimate diagnosis at age 9 years conditioned on clinician (clinician was chosen arbitrarily)

Figure 1. Prevalence of best-estimate diagnosis at age 9 years with frequency of diagnostic combinations at age 2 years expressed as area of circle. Vertical lines show plus or minus 2 standard error bounds based on the implicit binomial distribution with Bayesian correction (*1). Upper graph represents the case where clinician is yes and bottom is for clinician no. PL-ADOS stands for Pre-Linguistic Autism Diagnostic Observation Schedule; ADI-R stands for Autism Diagnostic Interview–Revised.
Figure 2: line graph of prevalence of best-estimate diagnosis at age 9 years by combination of tests
Figure 2. Prevalence of best-estimate diagnosis at age 9 years with frequency of diagnostic combinations at age 2 years expressed as area of circle. Autism. Blue line represents the Pre-Linguistic Autism Diagnostic Observation Schedule (PL-ADOS); Green line represents Autism Diagnostic Interview–Revised (ADI-R); and Red line represents Clinician.
If we do the same analysis we get this:

For figure 1 you can see the trend easily, at the cost of losing the overall structure. Alternatively, figure 2 keeps the structure, but at the cost of visual complexity. Area of circle is not my favorite way to express the base frequency, but it does a good job of showing which points are more important without interfering with the trend line. Also, this figure is generalizable to more complex Venn diagrams.
What do you think? We appreciate your constructive comments!
(If you have charts that were not mentioned in this article and would like to be acknowledged, leave us a comment. Also, those who tackled the issue of sensitivity and specificity: I didn't forget you. You will be mentioned in Venn Diagram Challenge Summary 2. ...to be continued...)
(*1) Calculation of standard error with Bayesian correction is done as:
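The formula was posted as an image and is not reproduced in the text here. As a rough sketch only, one common correction of this kind adds two successes and two failures to the counts before computing the binomial standard error; whether this is exactly the correction used for the figures is an assumption. The function name and defaults below are illustrative.

```python
import math

def adjusted_proportion(y, n, prior_successes=2, prior_failures=2):
    """Estimate and standard error for y successes out of n trials,
    after adding pseudo-counts (assumed here to be 2 and 2)."""
    n_adj = n + prior_successes + prior_failures
    p = (y + prior_successes) / n_adj
    se = math.sqrt(p * (1 - p) / n_adj)
    return p, se

# Plus or minus 2 standard error bounds for a hypothetical cell, 8 of 16.
p, se = adjusted_proportion(8, 16)
lower, upper = p - 2 * se, p + 2 * se
```

The point of the pseudo-counts is that cells with very few cases, or with 0% or 100% observed prevalence, still get a sensible nonzero error bar.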

Posted by Masanao at 10:00 AM | Comments (4) | TrackBack
October 10, 2007
How to make a certain interactive graph in R (or other convenient software)?
Dana Kelly writes, "Here's a link to a NYT article on trends in commercial aviation accident rates. I particularly liked the interactive graphic in the article. Do you know how such graphics can be constructed in R?" The short answer is no, I don't know how to do it. But maybe someone who is reading this knows?
Posted by Andrew at 6:33 PM | Comments (11) | TrackBack
September 28, 2007
Another try at the autism graph
Someone writes "Keep it simple" and sends this in:
See here and here for background.
Posted by Andrew at 12:17 PM | Comments (8) | TrackBack
September 27, 2007
Antony Unwin's graphs for autism data
In response to this query on how to reexpress Venn-diagram data graphically, Antony sends along this picture:

and writes:
The Autism data are surprisingly clearly structured. I haven't included the basic barcharts for each variable, though they provide useful information towards understanding the data.
Since this is a categorical dataset with five variables, some variation of a mosaicplot should be a first choice for displaying the variables in combination. I calculated how many were diagnosed and how many not from the prevalence percentages. I then drew doubledecker plots weighted by these numbers with the diagnosed selected.
In the top figure Groups A and B are aggregated and the seven possible combinations of the three tests are plotted in the nested ordering of Clinician, ADI-R and PL-ADOS. The increasing prevalence with this ordering stands out (i.e., that Clinician tests have higher prevalence rates, and within those then ADI-R). The sizes of the different groups are also emphasised.
In the lower figure Groups A and B are separated by splitting each of the 7 bars in the top figure accordingly. Here it is obvious that there is very little difference between A and B in terms of prevalence with any of the combinations of tests.
The diagrams were drawn with Heike Hofmann's MANET software. It includes a line for the empty zero combination (far left of both plots). The diagrams could also have been drawn with Martin Theus's MONDRIAN software, which runs on all platforms, while MANET only runs on the Mac, but then the labelling beneath the plots would have had to have been added. For a publication the labelling would be further refined.
This graph is indeed pretty, and the bars do a good job of conveying that the ultimate data are counts. Still, I think I'd prefer a set of line graphs. I just find these mosaic plots hard to read. Maybe Masanao and I can try the line plots and then write a joint paper with Antony and Igor comparing the different representations.
Posted by Andrew at 9:10 AM | Comments (5) | TrackBack
September 26, 2007
R video tutorial
Check out this (from Dan Goldstein):
Posted by Andrew at 9:09 AM | Comments (2) | TrackBack
September 25, 2007
Redoing Venn diagrams as readable graphs?
Igor sends along this graph on autism diagnosis
and asks whether this information can be presented better graphically. The answer is definitely yes, although I don't have time in the next 15 seconds to figure out exactly how to do it. My intuition is to do some sort of line plot, showing the probability of autism given different factors which can interact, perhaps using a structure of multiple graphs as in Figure 2 of this paper. Even with binary factors, a graph with the factor on the x-axis can work well, especially if you use small multiples to display different conditions.
I'm not sure, though, since I haven't read enough to have figured out what the substantive goal is here.
Any other thoughts?
P.S. I still like this Venn diagram, though:

P.P.S. See here for Antony Unwin's plot of the autism data.
Posted by Andrew at 9:55 AM | Comments (4) | TrackBack
September 17, 2007
Displaying confidence intervals
Gregor points me to this paper by Tom Louis and Scott Zeger. It's fine--definitely an improvement over the usual tabular displays--but I agree with Gregor that graphical display is better.
Posted by Andrew at 11:07 AM | Comments (0) | TrackBack
September 14, 2007
Resources for science editors
Rahul sent in this (from here). It's good to reach new audiences.
Posted by Andrew at 12:49 AM | Comments (0) | TrackBack
September 12, 2007
Plot the data
One of the early mantras one hears in statistics is "Plot the data." When I first heard it, it was followed by "by hand"; I suspect that part gets elided these days. Still, the advice is good. It's often easier to make sense of a list of numbers if you can visualize them.
Most of the time, that takes time we don't have.
When we get an email or a report with a table of numbers, we know that plotting the numbers means grabbing a piece of graph paper (does your office supply cabinet even stock graph paper anymore?) or opening up your favorite spreadsheet, copying numbers, and drawing a graph. I rarely take the time.
Last week, I got yet another email with a table of numbers showing how something had changed over time. I was curious, so I wrote a short J script (now edited into a one line script) to turn the clipboard into data and another to plot the data.
Voilà! Now I had an easy and quick way to grab and plot data. I tried grabbing data out of an OpenOffice.org Writer document, and it worked, too. Grabbing data out of a Writer table was almost as good; my script lost the shape of the table, but that's easy to fix.
What's more, when you've got it in J, you can also apply various J statistical routines to the data, or you can pass it to R for more advanced statistical processing.
Yet another simple productivity tool, yet another reason to learn J as a tool for thinking and doing, yet another way to make sense with numbers.
As Xiao-Li can tell you, I don't know J, but I do like the idea of quickly making graphs. In R it can take a while. I'm getting better at it, but then again I've been using S and R for almost 20 years. Even so, I always have to spend a lot of time screwing around with the defaults to make things look good.
Bill writes,
Things can be pretty fast and easy in J. I didn't fix the script to have the graph's ordinate always start at or at least include the origin, but that's pretty trivial, as is adding titles, legends, and the like.
What's challenging is J's extreme mathematical ability to deal with arrays and functions of arrays. J has compositions of functions (hooks, forks, conjunctions) that are extremely powerful and easy to use, once you catch on, but it's like learning a foreign language -- until a switch flipped in my brain, it seemed opaque. Similarly, J's rank conjunction lets you create derived functions to operate on arrays in wonderfully interesting ways, making explicit program loops a thing of the past -- but catching onto rank initially can be mind-bending. Fortunately, doing what seems reasonable often works. Thus (+/ % #) is the program ("verb") for the arithmetic mean: the sum over a list (+/) divided by (%) the number of items in the list (#) -- no mention of the size of the list nor even of the list itself -- you can apply it to columns, rows, diagonals, whatever, of data.
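For readers who, like me, don't know J, the (+/ % #) construction can be imitated in Python. This is only a sketch of the idea: the helper `fork` is something I made up to mimic how a three-verb train applies the outer two functions to the argument and combines the results with the middle one; it is not part of any library.

```python
from operator import truediv

def fork(f, g, h):
    """Return the function x -> g(f(x), h(x)), mimicking a J fork (f g h)."""
    return lambda x: g(f(x), h(x))

# (+/ % #): the sum, divided by, the number of items.
mean = fork(sum, truediv, len)

# Hypothetical table of numbers; the same verb works on rows or columns.
table = [[1, 2, 3], [4, 5, 6]]
row_means = [mean(row) for row in table]
col_means = [mean(col) for col in zip(*table)]
```

As in the J version, the definition of `mean` never mentions the list or its length; it is built entirely by combining three existing functions.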
This reminds me a bit of APL (and I don't mean that in a good way), but maybe some sort of menu-based version can be set up. I used to laugh at menus but now I'm thinking this is the way to go. Also to have user-modifiable graphs that then get converted to a script so that the results can be easily saved and replicated.
Posted by Andrew at 1:31 AM | Comments (12) | TrackBack
August 31, 2007
Data visualization
Eric Tassone writes,
Have you seen this impressive article, "Data Visualization: Modern Approaches"? There are some nice visualizations there so I think you will find it worth a browse, and it should balance my account since I once forwarded you that ultra-ugly Treasury graph that made its way onto your blog.
Oddly enough, I was distracted by the ads at the beginning of the linked article. (This was odd because I'm rarely distracted by ads in print magazines.) I wasn't particularly impressed by the examples in the article (except for the Rosling talk which I already knew about; see here and here).
Also I like the baby names site.
Posted by Andrew at 1:55 AM | Comments (0) | TrackBack
Graphs from tables
Samantha Ross writes:
I've been reading about the recent efforts (including yours) to turn tables into graphs. Unfortunately, my experience as an applied researcher with little R knowledge makes constructing these graphs difficult. Even with the new tables2graphs.com, I'm having trouble fitting a graph to my needs. Could you post (or send) some example R code you use to construct graphs for displaying coefficients? And other R code too maybe! Perhaps this is like asking to copy someone else's written paragraph in your own paper without attribution and if so, well, certainly my apologies. But my guess is that applied researchers will continue displaying coefficient tables until they can get sufficient guidance to construct the graphs. I don't want to be one of them so I'm asking for help!
Yu-Sung replies: Try the function coefplot() in our package "arm". Type ?coefplot in R to see example code (plot 7 might be what you want). And you are welcome to ask us questions if you have problems using the function.
Posted by Andrew at 12:31 AM | Comments (1) | TrackBack
August 27, 2007
Standardized coefficients
Denis Cote writes,
I am still struggling with standardized coefficients you suggest in your regression book.
First, I was surprised you don’t seem to mention at all completely standardized coefficients (betas). They are so ubiquitous. Is there any other reason for not standardizing Y other than to keep its original scale? What about meaningless score in some tests?
Second, which coefficients would be best to graph? Are the standard errors from the regression with z2 Xs meaningful and comparable? What would be the appropriate error bar to graph?
...
By the way, I am turning my tables into graphs!
My reply: I'm not sure what you mean by "completely standardized coefficients." We do suggest standardizing continuous input variables by subtracting the mean and dividing by 2 standard deviations. This seems like completely standardizing to me. Also, yes, standardizing y can make sense also. Although in practice we often standardize by taking logs. And, yes, I do think that, typically, coefs for different predictors can be graphed and compared--if the inputs are binary or else have been rescaled. (I like to center binary inputs but it does not typically make sense to rescale a binary input, since a change of 1 unit is already interpretable.)
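In case it helps, the rescaling described in the reply can be sketched in a few lines of Python. The function names are mine, and using the sample standard deviation (n - 1 in the denominator) is an assumption; nothing above says which version is intended.

```python
import statistics

def rescale_2sd(values):
    """Center a continuous input and divide by 2 standard deviations."""
    m = statistics.mean(values)
    sd = statistics.stdev(values)  # sample sd; an assumption, see lead-in
    if sd == 0:
        raise ValueError("cannot rescale a constant input")
    return [(v - m) / (2 * sd) for v in values]

def center(values):
    """Center a binary input without rescaling it."""
    m = statistics.mean(values)
    return [v - m for v in values]

# Hypothetical inputs: a continuous predictor and a 0/1 indicator.
z_age = rescale_2sd([25, 35, 45, 55, 65])
z_female = center([0, 1, 1, 0])
```

After rescaling, the continuous input has standard deviation 0.5, which is the same as a binary input split 50/50, so the two kinds of coefficients end up on roughly comparable scales.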
Posted by Andrew at 2:52 PM | Comments (0) | TrackBack
August 22, 2007
3D social network visualization
Juli sent this link. My pet peeve is calling things "3D." It's a 2D display--it's on a flat screen, after all. Or I guess it is actually 3D, counting time as the third dimension. In any case, it looks cool, but I'm skeptical about its usefulness for understanding networks (for the reasons Matt has given in the past).
Posted by Andrew at 8:30 AM | Comments (1) | TrackBack
August 15, 2007
Another bad chart
I don't want to be doing this every day, but I have to agree with Brendan, who writes that "the differing tilt and skew of the two Y axes makes it really hard to interpret." I'll do youall a favor and not repeat the graph here. It comes from this article. Visually, though, it is weirdly compelling.
P.S. I fixed the link to the graph.
Posted by Andrew at 9:24 AM | Comments (4) | TrackBack
August 14, 2007
Worse than a pie chart?
John sends in this horrible example:

Among other problems, the graph uses areas to represent per-capita numbers. On the plus side, the graph did what was necessary, which was to get attention. For that purpose, the graph is excellent.
Posted by Andrew at 11:19 AM | Comments (1) | TrackBack
August 10, 2007
Some unsolicited advice to teaching assistants, prompted by a question about sketching on the computer
Charlie Gibbons writes,
I will be a TA for intermediate micro this fall and am looking for a program to use to draw graphs for my handouts (eg, utility curves, budget sets, etc). I would like to be able to:
- Draw the curves "freehand" (smooth curves based upon path points, but not by plotting, say, y = ln(x))
- Save in vector format (especially encapsulated PostScript so that it is compatible with LaTeX).
This sounds good to me. Does anyone know of any software out there that does this? My quick thought is that it's not actually so hard to write R code to make graphs that look right--I just play with the functional form a bit (using curve(), that convenient hack in R) until it looks how I want. But, yeah, it would be useful to make sketches and put them in documents. I'm sure something's available that does so.
And now, the unsolicited advice
But . . . maybe my real question to Charlie should be: Should you really be making handouts at all? Teaching assistants always want to lecture and prepare handouts. Really, though, textbook writers are professionals, and they've put everything you need right in the book.
My advice is: take the time you were going to put into these handouts, and instead spend it supervising the students in active learning: mostly working in pairs or small groups on homework or homework-like problems. Or, if you really want to prepare some extra material, prepare some drills so you can do your best to make sure that all your students get the basics down.
Posted by Andrew at 3:07 PM | Comments (8) | TrackBack
Many faces
Juan-José Gibaja-Martíns sends in this link. The graphs aren't all how I would do them, but I like that people are taking graphics seriously.
Posted by Andrew at 3:04 PM | Comments (0) | TrackBack
August 9, 2007
I prefer dotplots to barplots
Masanao made this:

and this:

for a paper I'm involved in. Unfortunately, neither one is going in the paper. In any case, I prefer the one with the dots.
I would only make a few little changes, mostly to make everything smaller (while keeping fonts readable), pulling the lines closer together and also writing the leftmost labels on two lines so they'll fit.
Posted by Andrew at 6:59 AM | Comments (2) | TrackBack
August 2, 2007
Ranking is a trap
Ranks have lots of problems. They're statistically unstable (see the work of Tom Louis) and can mask nonlinearity. I was recently reminded of these patterns in seeing two sets of graphs reproduced by Kaiser:
From the Wall Street Journal, graphs of baby names over time:

The graphs are ok but plotting ranks, rather than proportion of total names each year, is a mistake, since it makes the y-axis extremely hard to interpret, it's not clear where zero is, etc. As Kaiser points out, in any case there are difficulties when the scales are different for different plots, but, beyond this, the ranks are making things tougher.
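A tiny Python sketch of why the rank scale is hard to read, using invented name shares (the names and numbers are hypothetical, not taken from the Wall Street Journal data): two years with very different proportions produce exactly the same ranks.

```python
def ranks(shares):
    """Map each name to its rank (1 = most common) within one year."""
    ordered = sorted(shares, key=shares.get, reverse=True)
    return {name: i + 1 for i, name in enumerate(ordered)}

# Made-up shares of all births in two different years.
year_a = {"Mary": 0.050, "Linda": 0.020, "Susan": 0.019}
year_b = {"Mary": 0.011, "Linda": 0.010, "Susan": 0.009}

print(ranks(year_a))
print(ranks(year_b))
```

In the first year the top name is far ahead of the rest; in the second the three names are nearly tied. A plot of ranks would show identical flat lines for both situations, while a plot of proportions would show the difference and would have a meaningful zero.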
And, from the New York Times, a summary of problems with the subway lines:

Here, the ranks are giving a hyperprecision that is not helpful. (Also, encasing the subway line numbers/letters in black circles makes them harder to read, at least on the screen.) As some commenters pointed out, it would probably be better to just display each line with three or five grades, sort of like how Consumer Reports does the ratings.
Posted by Andrew at 9:17 AM | Comments (6) | TrackBack
The worst graph ever made?
This (found by Kaiser Fung from this horrible BBC site) is possibly the worst graph I've ever seen:

As Kaiser writes, "The use of patterns for shading is especially disconcerting. The graphic also lacks self-sufficiency as we have trouble comparing countries without referencing the underlying data." And commenter Chris Hibbert points out that "they seem to be using a pie chart to compare independent data points. The chart about GP consultations should be a bar chart, since the point is to compare the countries side-by-side rather than to emphasize that they make up a single whole (they don't.)" Actually, I'd prefer a dotplot (following Bill Cleveland's general advice), but that's just quibbling.
Posted by Andrew at 12:16 AM | Comments (11) | TrackBack
July 30, 2007
Voting map without those ugly state boundaries
Matt Franklin sent in this improved version of this picture:

Thanks, Matt!
Posted by Andrew at 12:10 AM | Comments (2) | TrackBack
July 25, 2007
A good graphic, or "too clever by half"?

Some browsers don't seem to be able to display this image very well for some reason, so you may need to try the PDF version.
This is a rather complicated graphic that relates an acceptable risk of contracting anthrax (upper left) to the number of samples you should take to check for anthrax in a building (lower right). The relationship between risk and airborne concentration of spores is very uncertain; the relationship between airborne and surface concentrations is highly variable, depending on activity levels and surface type and other factors; detection probabilities can vary quite a bit; and you might decide that you want to be 99% sure or 99.9% sure that, if the building is contaminated at an unacceptable level, you will obtain at least one 'positive' sample.
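The last link in that chain, from detection probability and desired confidence to number of samples, can be sketched in a few lines of Python. This is only an illustration under a strong simplifying assumption that is mine, not the graphic's: each sample independently comes back positive with the same probability p when the building is contaminated at the unacceptable level, so the chance of at least one positive in n samples is 1 - (1 - p)^n.

```python
import math

def samples_needed(p_detect, confidence):
    """Smallest n with 1 - (1 - p_detect)**n >= confidence.

    Assumes independent samples with a common detection probability,
    which is a simplification made for this sketch.
    """
    if not (0 < p_detect < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_detect))

for p in (0.5, 0.1):
    for conf in (0.99, 0.999):
        print(p, conf, samples_needed(p, conf))
```

Because n grows with log(1 - confidence), going from 99% to 99.9% sure costs about half again as many samples, while a drop in the per-sample detection probability is much more expensive.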
I like this graphic: I think that by playing around with it for a few minutes, you can understand how the various assumptions affect the outcome. I think it is more effective than, say, 8 different plots of "number of samples needed" versus "acceptable risk level", representing different combinations of assumptions for dose-response, resuspension, and detection probability. But a statistician friend says it's "too clever by half."
What do you think?
Posted by Phil at 2:41 PM | Comments (7) | TrackBack
July 18, 2007
tables2graphs.com
John Kastellec writes,
Eduardo Leoni and I have created a web site, located at http://tables2graphs.com, accompanying our paper, "Using Graphs Instead of Tables in Political Science," which is available here.
The site contains complete and annotated R code for all the graphs that appear in the paper. We hope that readers interested in turning tables into graphs can use this code to produce their own graphs in R.
We also would like your help. Because so many social scientists use Stata, we would also like to provide Stata code for creating each graph (if possible). Neither of us is fully versed in Stata graphics, however, and the site currently provides Stata code for only one of the graphs. If you have Stata code that we could apply to some of our graphs and don't mind sharing it with us, we would greatly appreciate it. (Our email addresses can be found on the site).
Regular (or even irregular) readers of this blog will be able to guess that I am supportive of this project.
Posted by Andrew at 9:34 AM | Comments (4) | TrackBack
July 17, 2007
Using between-country comparisons to make implicit causal inferences about policies
Two different people (Christopher Mann and Jeff Lax) pointed me to this graph in the Wall Street Journal that features a goofy regression line. My expertise on taxes and economic growth is zero, and the statistical problems with the regression line are apparent, so I don't really have anything to say here. Hey, if all roads go through Rome, it's only fair that all lines go through Norway.
But, to get serious for a minute . . . Setting aside the concerns with the regression line or with measurement issues in defining the variables being graphed, it's an interesting reminder of the duality between descriptive vs. causal inference and aggregate vs. individual-level analysis (or, as would be said in psychology, between-subject vs. within-subject analysis). I'm not criticizing the use of graphs such as these (or corresponding regression models) that use between-country comparisons to make implicit causal inferences about policies--it's just helpful to remember the assumptions needed to draw these conclusions.
Posted by Andrew at 6:19 AM | Comments (4) | TrackBack
July 9, 2007
Hans Rosling 2007
We had this entry almost a year ago. This year Hans Rosling gives yet another talk titled "New insights on poverty and life around the world". The talk is great, and the ending is quite shocking.

In a follow-up to his now-legendary TED2006 presentation, Hans Rosling demonstrates how developing countries are pulling themselves out of poverty. He shows us the next generation of his Trendalyzer software -- which analyzes and displays data in amazingly accessible ways, allowing people to see patterns previously hidden behind mountains of stats. (Ten days later, he announced a deal with Google to acquire the software.) He also demos Dollar Street, a program that lets you peer in the windows of typical families worldwide living at different income levels. Be sure to watch straight through to the (literally) jaw-dropping finale.
The software for video presentation is quite useful. You can see the outline of the talk and jump to the section that you are interested in by overlaying the cursor on top of it.
Posted by Masanao at 9:30 AM | Comments (4) | TrackBack
July 2, 2007
Mockups for graphs
Kaiser discusses here the value of sketching preliminary versions of a plot to see what might work, before going to the effort of making the full graph. I agree completely--in my class on graphics, we would go through several mock-ups before trying to program something up.
The only trouble is that I don't know of any software for mockups. Ideally one could draw these prototypes and then feed in the data and see what the plots look like. A menu of 50 or so prototypes might do it, I suppose.
One other thing: in a comment here, Derek refers to the "objects" of a graph, i.e., what's being plotted. (For example, in a scatterplot, each dot is a what?) One of my pet peeves with graphs--and data descriptions in general--is that it's standard to label the axes, and often the plot itself, but rarely are the individual objects labeled. I often see a scatterplot where I can't figure out what's being plotted. I would usually ask people, "what are the units of the plot" (using the term "unit" as used in survey sampling, for the items being measured), but this just confuses people, because they think of units of measurement (kilograms or whatever). I'll try the term "object" and see how it goes.
I've been trying to train people to describe a plot by, instead of saying, "This is a graph of weight vs. height", to say, "Each dot is a person. Weight is on the vertical axis and height is on the horizontal axis." It's tough, though. People internalize the objects and forget that others don't know what's being plotted.
Posted by Andrew at 3:25 PM | Comments (6) | TrackBack
Plotting models with more than one input variable
Manuel Spínola writes,
In the book you say that it is a good idea to plot the fitted model. How do you do that when you have, for example, 3 explanatory variables? Or do you mean plotting one variable at a time?
My reply:
I would plot one variable at a time, but then you can use multiple graphs to show different levels of a second variable, and multiple lines per graph to show different levels of a third variable, and solid/dotted lines to show a fourth variable, and a second dimension of a grid of plots to show a fifth, and color to show a sixth. For an example, see Figure 2 of this paper.
I used to like to use different symbols and symbol sizes to add more dimensions, but now I'm happier with "small multiples"--that is, one-way or two-way grids of plots. Sort of like Trellis, except that I like to label the internal axes locally (right on the little graphs) and label the external axes on the outside of the plot (see, for example, Figure 15.7 on page 335 in our book). I find the Trellis convention (using external labels for the internal axes) confusing.
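As a rough sketch of how the data behind such a grid of small multiples might be organized, here is some Python that computes one curve per line and one set of lines per panel. The model, its coefficients, and all variable names are hypothetical; the drawing itself is left to whatever plotting tool you prefer.

```python
def predict(x1, x2, x3, coef):
    """Expected y from a made-up linear model with an x1:x2 interaction."""
    b0, b1, b2, b3, b12 = coef
    return b0 + b1 * x1 + b2 * x2 + b3 * x3 + b12 * x1 * x2

coef = (1.0, 0.5, -0.3, 0.8, 0.2)        # hypothetical fitted coefficients
x1_grid = [i / 10 for i in range(11)]    # x-axis of every little plot
x2_levels = [-1, 0, 1]                   # one panel per level
x3_levels = [0, 1]                       # one line per level within a panel

# panels[x2][x3] is the list of (x1, expected y) points for one line.
panels = {
    x2: {x3: [(x1, predict(x1, x2, x3, coef)) for x1 in x1_grid]
         for x3 in x3_levels}
    for x2 in x2_levels
}

for x2, lines in panels.items():
    for x3, points in lines.items():
        print("panel x2 =", x2, "line x3 =", x3, "first point", points[0])
```

Adding a fourth or fifth variable would mean another nesting level (line type, or a second dimension of the grid), which is where the local axis labels start to matter.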
Posted by Andrew at 11:52 AM | Comments (1) | TrackBack
June 21, 2007
Nations of Europe: adding priors to multidimensional scaling
Yesterday we were looking at the musical taste proximity between European countries. But what about the proximity between European nations in terms of the genes? The field of population genetics investigates this problem. I have taken some Y chromosome data, and computed the distance between two nations based on their genetic distance.
The result, obtained with MDS, is as follows:

I've color-coded different language groups. We can see that North Africans are quite different, and that within Europe, there is a clear gradient from the East to West, with several clusters. The islands of Gotland and Sardinia are composed of a diverse mix from different populations.
The interesting point, however, is that I've initialized the positions of points to the geographic positions, which can roughly be interpreted as a prior. This is a bit unusual: usually the points are randomly initialized, or initialized with some sort of a linear dimension reduction technique, such as with Torgerson's procedure.
Multidimensional scaling is that old way of embedding a set of points described in terms of their similarities into a lower-dimensional space so that the Euclidean distances in this space reflect the similarities. While there are closed-form solutions to the problem when the transformation is linear, usually based on the SVD, one can achieve lower stress by allowing nonlinear transformations, fitted iteratively with an algorithm such as SMACOF.
SMACOF is a deterministic hill-climb, but it depends on the starting point. The starting point can be considered a nuisance, but it can also be considered the equivalent of a regularization term or a prior. Csiszár, for example, pointed out that iterative scaling finds the point from the set of solutions that satisfy constraints that is closest to the starting point. While this doesn't exactly fit the regularization term or prior setting, it is nevertheless a very appropriate way of stabilizing MDS: with so many co-dependent parameters, an MDS posterior distribution seems incomprehensible (although Fig 3 in Jackman's Multidimensional Analysis of Roll Call Data via Bayesian Simulation does provide an interesting visualization, but with the use of informative priors). In this sense, SMACOF and iterative scaling can be seen as the updating of the prior.
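To show what "depends on the starting point" means in practice, here is a small pure-Python sketch of unweighted SMACOF (the Guttman transform applied repeatedly), run once from a sensible starting layout and once from a random one. This is not the code used for the maps in this entry; the five points, the target distances, and the "prior" layout are all invented.

```python
import math
import random

def pairwise(X):
    """Euclidean distance matrix of a list of points."""
    return [[math.dist(p, q) for q in X] for p in X]

def stress(X, target):
    """Raw stress: squared mismatch between embedded and target distances."""
    D = pairwise(X)
    n = len(X)
    return sum((D[i][j] - target[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def guttman_step(X, target):
    """One unweighted SMACOF update (the Guttman transform)."""
    n, dim = len(X), len(X[0])
    D = pairwise(X)
    B = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and D[i][j] > 0:
                B[i][j] = -target[i][j] / D[i][j]
        B[i][i] = -sum(B[i])  # diagonal makes each row sum to zero
    return [[sum(B[i][k] * X[k][d] for k in range(n)) / n
             for d in range(dim)] for i in range(n)]

def smacof(target, init, n_iter=200):
    """Iterate the Guttman transform from a given starting layout."""
    X = [list(p) for p in init]
    for _ in range(n_iter):
        X = guttman_step(X, target)
    return X

# Made-up target dissimilarities, generated from a hidden 2D layout.
hidden = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.5, 0.5)]
target = pairwise(hidden)

# "Prior" start: a distorted but recognizable version of the layout.
prior_init = [(0.1, -0.1), (0.9, 0.1), (1.2, 0.9), (-0.1, 1.1), (0.4, 0.6)]
rng = random.Random(0)
random_init = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(5)]

from_prior = smacof(target, prior_init)
from_random = smacof(target, random_init)

print("stress from prior start: ", stress(from_prior, target))
print("stress from random start:", stress(from_random, target))
```

Each update cannot increase the stress, so both runs improve on their starting layouts, but they need not end in the same place or the same orientation. That is the sense in which the initial layout acts like a prior: it picks which of the many acceptable configurations you end up looking at.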
This is the original placement:

Two dimensions are insufficient to show the complexity of the data. Here you can see the stresses (green means too far, red is too close):

Finally, this is what comes out from initializing the points randomly:

Definitely not as easy to understand as the geographic original - yet it has better "stress" than the prior-based one. The sensible geographic prior nicely helps orient the result.
In summary, the benefit of priors goes beyond the Bayesian methodology.
Posted by Aleks Jakulin at 4:17 PM | Comments (13) | TrackBack
June 12, 2007
Amusing map
Leonardo Monasterio sent me this link. I'd give my thoughts but I don't have much to add beyond what's already in that blog entry. I wouldn't consider this a serious statistical tool but it's amusing.
Actually, this whole Strange Maps blog is cool.
Posted by Andrew at 9:15 AM | Comments (2) | TrackBack
June 6, 2007
Multiple predictors
Jarrett Byrnes writes,
A group of us are working through your Multilevel book, and a question has come up regarding models incorporating multiple predictors. We were working some of the chapters on using simulation to draw inference, but have been puzzling over how one can represent their data with fitted and simulated lines, one factor at a time. True, one can show the fitted and simulated model for a variety of other factors that are not of interest, but this seems unsatisfactory, particularly if you are incorporating three or more predictors in your model. Do you have any suggestions as to how one can best present data and models for these more complex models, such that a reader can assess the relationship between the model and the data for each single predictor?
My reply: I think you have two questions:
1. How to display a fitted regression that has many input variables?
2. How to display a sequence of fitted models?
I'll discuss each issue in turn. (Warning: for neither problem do I have a great answer yet.)
1. How to display a fitted regression that has many input variables? I'd start with curves of the expected value of y as a function of each input, using a separate plot for each input variable and multiple curves as necessary to show interactions. See, for example, the graphs on page 91 of our new book. With more than two inputs, I'd probably stack the graphs vertically. We're still working on our general R function for this. And I'd also display the estimated coefficients (as in the lower graph on page 337 of our book), probably after standardizing the inputs by subtracting means and dividing by two standard deviations.
When a model has three-way interactions or many two-way interactions, the displays start to get tricky, and I have no great answer yet. I do think, though, that if we try harder we'll gradually make progress here. Traditionally, graphical methods have focused on displaying raw data; as that same ingenuity is used to display inferences and fitted models, I think good new general plots will arise.
2. How to display a sequence of fitted models? This is a really important question, and again I don't see anything great right now. The series of graphs for the arsenic example in chapter 5 of our book give some sense of what can be done, but we're pretty disorganized there. It would be good to have a coherent display to visualize what happens to a model when a predictor is added, something like the graphs on page 12 of this paper. I will add, though, that I am not particularly interested in model selection or model averaging, at least as these concepts are typically formulated statistically. I'm more interested in putting together a good model and using simpler models as steps in understanding what the ultimate model is doing.
Posted by Andrew at 8:07 AM | Comments (1) | TrackBack
June 3, 2007
Interaction in information software
I found this interesting article on information software and interaction.

Here is the abstract:
The ubiquity of frustrating, unhelpful software interfaces has motivated decades of research into “Human-Computer Interaction.” In this paper, I suggest that the long-standing focus on “interaction” may be misguided. For a majority subset of software, called “information software,” I argue that interactivity is actually a curse for users and a crutch for designers, and users’ goals can be better satisfied through other means.
Information software design can be seen as the design of context-sensitive information graphics. I demonstrate the crucial role of information graphic design, and present three approaches to context-sensitivity, of which interactivity is the last resort. After discussing the cultural changes necessary for these design ideas to take root, I address their implementation. I outline a tool which may allow designers to create data-dependent graphics with no engineering assistance, and also outline a platform which may allow an unprecedented level of implicit context-sharing between independent programs. I conclude by asserting that the principles of information software design will become critical as technology improves.
It's Edward Tufte applied to software and web design. Since we are seeing more and more information presented on the web in this way, it may be good to give some thought to how to deal with "interaction".
Posted by Masanao at 12:54 PM | Comments (0) | TrackBack
May 31, 2007
Color of Flags
I'm not a pie chart person. But here is an example where I don't mind the use (I found it here):

Using a list of countries generated by The World Factbook database, flags of countries fetched from Wikipedia (as of 26th May 2007) are analysed by a custom made python script to calculate the proportions of colours on each of them. That is then translated on to a piechart using another python script. The proportions of colours on all unique flags are used to finally generate a piechart of proportions of colours for all the flags combined. (note: Colours making up less than 1% may not appear)
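The original scripts are not shown, so here is only a guess at the core calculation in Python: count the pixels of each colour in a flag, turn the counts into proportions, drop colours under 1%, and average the per-flag proportions to get the combined chart. The toy "flags" are short lists of colour names rather than real images, and averaging with equal weight per flag is my assumption about how the combination was done.

```python
from collections import Counter

def colour_proportions(pixels, min_share=0.01):
    """Share of each colour in one flag, dropping colours below min_share."""
    counts = Counter(pixels)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items() if n / total >= min_share}

def combined_proportions(flags):
    """Average the per-flag shares, giving every flag equal weight."""
    totals = Counter()
    for pixels in flags:
        for colour, share in colour_proportions(pixels, 0.0).items():
            totals[colour] += share
    return {c: s / len(flags) for c, s in totals.items()}

# Toy flags as flat lists of pixel colours (invented, not real flag data).
flag_a = ["red"] * 4 + ["white"] * 2
flag_b = ["green"] * 3 + ["white"] * 3
flag_c = ["red"] * 199 + ["pink"]      # pink is under 1% and is dropped

print(colour_proportions(flag_a))
print(colour_proportions(flag_c))
print(combined_proportions([flag_a, flag_b]))
```

With real images the pixel list would come from an image library, and nearby shades would probably need to be binned into a small palette before counting.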
It's pretty, it's something about proportion, it's not trying to show a clear numeric result, data-to-ink/pixel ratio is not a problem in this case, yet there's some information that you will have a hard time seeing from a table. (Such as Tunisia has slightly more white than Turkey.)
Now I'm not for the alphabetical ordering of the countries, but then again I don't have a better suggestion.

Is there any reason that no country uses pink?
This site has a summary of what not to do with pie charts.
You can also look at this paper.
Posted by Masanao at 6:16 PM | Comments (5) | TrackBack
May 24, 2007
A Bayesian formulation of exploratory data analysis and goodness-of-fit testing
I love this paper. Here's the abstract (yes, it's too long, I know):
Exploratory data analysis (EDA) and Bayesian inference (or, more generally, complex statistical modeling)--which are generally considered as unrelated statistical paradigms--can be particularly effective in combination. In this paper, we present a Bayesian framework for EDA based on posterior predictive checks. We explain how posterior predictive simulations can be used to create reference distributions for EDA graphs, and how this approach resolves some theoretical problems in Bayesian data analysis. We show how the generalization of Bayesian inference to include replicated data y.rep and replicated parameters theta.rep follows a long tradition of generalizations in Bayesian theory.
On the theoretical level, we present a predictive Bayesian formulation of goodness-of-fit testing, distinguishing between p-values (posterior probabilities that specified antisymmetric discrepancy measures will exceed 0) and u-values (data summaries with uniform sampling distributions). We explain that p-values, unlike u-values, are Bayesian probability statements in that they condition on observed data.
Having reviewed the general theoretical framework, we discuss the implications for statistical graphics and exploratory data analysis, with the goal being to unify exploratory data analysis with more formal statistical methods based on probability models. We interpret various graphical displays as posterior predictive checks and discuss how Bayesian inference can be used to determine reference distributions.
We conclude with a discussion of the implications for practical Bayesian inference. In particular, we anticipate that Bayesian software can be generalized to draw simulations of replicated data and parameters from their posterior predictive distribution, and these can in turn be used to calibrate EDA graphs.
Also this paper.
Posted by Andrew at 6:09 AM | Comments (2) | TrackBack
May 23, 2007
Cool timeline
Aleks pointed me to this. Usually I'm not such a fan of tricky displays, but I have to admit that this one is kind of pretty.
Posted by Andrew at 11:25 AM | Comments (1) | TrackBack
May 17, 2007
Two-dimensional machine learning conference space
Janez Demsar has shown me a chart of machine learning conferences and their similarity. To compute the similarity between the conferences, he used the Jaccard distance, based on the proportion of authors that publish in both venues (set intersection of authors of both venues) versus those that publish in either (set union of authors of both venues). Afterwards, he employed multidimensional scaling to embed the points into 2D space. Lines' thickness indicates proximity. As for color, red are journals, blue are conferences. He acquired the data from the DBLP.
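For readers who have not met it, the Jaccard distance between two venues is one minus the size of the intersection of their author sets divided by the size of the union. A minimal Python sketch, with invented venue labels and author names (this is not Janez's code or the DBLP data):

```python
def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two collections of authors."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty venues
    return 1 - len(a & b) / len(union)

# Made-up author lists for three hypothetical venues.
authors = {
    "VenueA": {"ann", "bob", "cho"},
    "VenueB": {"bob", "cho", "dee"},
    "VenueC": {"eve", "fay"},
}

venues = sorted(authors)
for i, u in enumerate(venues):
    for v in venues[i + 1:]:
        print(u, v, jaccard_distance(authors[u], authors[v]))
```

The resulting matrix of pairwise distances is what gets handed to multidimensional scaling. Venues with no authors in common all sit at the maximum distance of 1, which is one reason sparse data makes the layout unstable.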

We can see the data mining (KDD/PKDD) more towards the bottom, machine learning in the middle (ICML/ECML/ML/JMLR) largely separating the two, and AI on the top. To the left there are special areas, such as neural networks (ICANN/NN) or medical applications (ARTMED/AIME). Do not, however, interpret these areas as marginal: it's just that the lens was centered on the highly connected conferences to the right of the diagram.
There are a few challenges with analyzing such proximity data statistically. First: the authorship data should be controlled by year: long-running conferences will appear detached from the base. Second: when there is not much data, there is uncertainty in the similarity. For this we first need a probabilistic stress function (an uncertain distance can be stretched or shrunk more than a certain one). Finally, the nonconvexity of MDS can be remedied with good priors. One might also debate the pros and cons of using similarity functions on the original features, or whether to generate the original features directly from the latent variables.
Also see Map of Science and Scientometrics.
Posted by Aleks Jakulin at 12:19 PM | Comments (6) | TrackBack
April 6, 2007
Election & Public Opinion by PIIM
Here is an interactive visualization of Election & Public Opinion by PIIM. It's an interactive display of Red / Blue states. Election data goes all the way back to 1789, the first presidential election.

This application will familiarize you with the voting process of the United States. Explore how public opinion and "creative democracy" has such a persuasive effect on the country; and how just a handful of votes may cause significant impact.
Historical background, the current voting process, and informative visualization of every major election are available. The Issue and policy tools permit some creative "What if" experiments in redrawing an election based on subtle alterations to historical outcomes.
Posted by Masanao at 8:58 AM | Comments (1) | TrackBack
March 27, 2007
How to summarize a multilevel model fit?
Michael Kubovy writes,
Can you point me to a model report of empirical research (preferably of a designed experiment) using mixed models?
As you know, the pattern in psychology is to have a stultifying paragraph listing which effects and interactions were or were not significant: "… the a by b interaction was significant --- F(n1, n2) = 23.3456, p ≤ .00023 … " --- followed by an interaction plot that summarizes the main results.
But suppose I [Kubovy] need to report the process that led me to settle on txt.lmer2:
> anova(txt.lmerM, txt.lmer1M, txt.lmer2M, txt.lmer3M)
Data: txt
Models:
txt.lmerM: duration ~ vis * display + pitch * display + aud * pitch + (1 | subj)
txt.lmer1M: duration ~ vis * display + pitch * display + aud * pitch + (1 + aud | subj)
txt.lmer2M: duration ~ vis * display + pitch * display + aud * pitch + (1 + aud + pitch | subj)
txt.lmer3M: duration ~ vis * display + pitch * display + aud * pitch + (1 + aud + pitch + vis | subj)
             Df   AIC   BIC logLik  Chisq Chi Df Pr(>Chisq)
txt.lmerM.p  12 27216 27287 -13596
txt.lmer1M.p 14 27168 27252 -13570  51.45      2    6.7e-12 ***
txt.lmer2M.p 21 26601 26727 -13280 580.77      7    < 2e-16 ***
txt.lmer3M.p 26 26607 26762 -13278   4.17      5       0.52
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
where *M means that method = 'ML', and then plot the complex pattern I believe best summarizes my data with CIs, but here I need to say that these were obtained using REML.
All this is so unfamiliar to readers and my students that there is resistance. They are accustomed to ANOVAs done with OLS in which effect E is tested against the E by subject interaction.
So: how do I make the report of my data analysis best fit people's pre-conceived ideas of what such a report should look like, w/o misrepresenting what was done?
My (brief) reply: I would plot coefficient estimates graphically (see, for example, pages 306, 307, 312, 313, 328, 338, 341 of our new book). These graphs all look different, which indicates that I don't have any systematic way of doing this yet. I'm hoping this will come, as a product of further research. Also, I don't really like formal model comparisons using AIC, BIC, etc. (but, of these, I much prefer AIC because at least it has a direct interpretation in terms of predictive error). I like summarizing complex Anova-type models in graphical displays as shown in the Anova chapter of our book.
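For readers unfamiliar with the columns of the anova() table quoted above, here is a short Python sketch that recomputes two of them from the printed Df and logLik values, using the standard definitions AIC = 2k - 2 logLik and likelihood-ratio chi-square = 2 times the difference in logLik. The inputs are the rounded numbers from the printout, so the results can differ from the table in the last digit or so; the function names are mine.

```python
def aic(n_params, log_lik):
    """Akaike information criterion: 2k - 2 * log-likelihood."""
    return 2 * n_params - 2 * log_lik

def lr_chisq(log_lik_small, log_lik_big):
    """Likelihood-ratio statistic comparing two nested models."""
    return 2 * (log_lik_big - log_lik_small)

# (name, Df, logLik) as printed in the quoted table.
models = [
    ("txt.lmerM", 12, -13596),
    ("txt.lmer1M", 14, -13570),
    ("txt.lmer2M", 21, -13280),
    ("txt.lmer3M", 26, -13278),
]

for name, df, ll in models:
    print(name, "AIC from rounded logLik:", aic(df, ll))

for (n0, _, ll0), (n1, _, ll1) in zip(models, models[1:]):
    print(n0, "vs", n1, "chi-square from rounded logLik:", lr_chisq(ll0, ll1))

best = min(models, key=lambda m: aic(m[1], m[2]))[0]
print("lowest AIC:", best)
```

This makes the trade-off in the table visible: the step to txt.lmer3M adds 5 parameters but raises the log-likelihood by only about 2, so the penalty of 2 per parameter outweighs the gain and the AIC goes up.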
Posted by Andrew at 11:06 AM | Comments (1) | TrackBack
March 14, 2007
1200+ examples of information visualization at PIIM
A friend of mine introduced this site to me. It's a database for information graphics that Parsons Institute for Information Mapping (PIIM) is building. They are accepting submissions, so if you have an interesting graphical display, take a shot at being in the "most comprehensive, manually annotated (and taxonomically classified) information graphics database in the world" which they are aiming for. I was not able to pull out 1200 graphics, but there are displays that I've never seen before. A list of key words might be helpful.
Posted by Masanao at 7:03 PM | Comments (1) | TrackBack
February 22, 2007
Ugly graphics
Gerlinde Schuller sent me this link. This is bad in so many ways, most notably in its focus on the shapes of countries and its huge-Greenland projection. The cutouts of the United States with a super-prominent detached Alaska are particularly horrifying. I'd suggest they make some scatterplots.
Posted by Andrew at 12:44 AM | Comments (0) | TrackBack
January 8, 2007
A taxonomy of visualizations
The Visual Literacy project has a wonderful taxonomy of visualizations formatted as a periodic table:
Each type of visualization is described in terms of five multi-level attributes:
- high/low complexity of the visualization ("mass") [updated 1/11/07]
- data/information/concept/strategy/metaphor/compound visualization
- process/structure visualization
- overview/detail/both
- divergent(exploratory) / convergent(summary) thinking
While I find the examples of data visualization quite limited, it is interesting to see how much wider the scope of visualization is.
They also have a taxonomy/directory of visualization scholars.
I've had problems viewing it in Firefox (the pop-ups are empty), but it works fine in IE. I found this on Information Aesthetics.
Posted by Aleks Jakulin at 11:16 AM | Comments (2) | TrackBack
December 15, 2006
Jimmy's weight over time
What do I like best about this graph?

My favorite part of this graph is the title--it really personalizes the data. See more here.
Posted by Andrew at 5:35 PM | Comments (1) | TrackBack
November 4, 2006
Civil liberties and war
Adam Berinsky is presenting this paper at the New York Area Political Psychology Meeting today. I don't have much to say about the content of the paper, except that a key issue would seem to me to be framing: are civil liberties a luxury (as our math professors would say in college when proving a theorem, "culture") that we can't afford in wartime, or are civil liberties a form of security that is needed more than ever during a war? I would think that many of the controversies about civil liberties--in policy discussions and in public opinion--depend on this framing.
In any case, I have some comments about the graphs in the paper. First, I like how the paper follows in the Page and Shapiro tradition of presenting results graphically rather than as tables. For the Berinsky paper, I'd recommend more consistency in the presentation, basically displaying the information, wherever possible, as line plots with time on the x-axis. This parallelism will make the paper easier to read, I think--partly because the graphs can be made physically small and thus fit into the text better, also because a compact display allows more information to be displayed and be made visible in one place (so that the reader--and the researcher--can see more comparisons and learn more).
In detail:
The x-axes should be cleaner. I'd recommend either putting a tick mark at Jan 1 for each year, or else showing year boundaries on the x-axis and putting the year labels between tick marks (so that, for example, "2003" is placed between the 1 Jan 2003 and 1 Jan 2004 tick marks). It's confusing to read graphs such as Fig 1 with tick marks at "Jul-2001", "May-2003", "Mar-2003", etc.
Fig 5.2 is hard to read. I'd recommend actually replacing Fig 5.2 by 3 small figures, one for each of the poll questions you're analyzing. Figs would be on common scales, and for each fig, you can show the time series for Reps, Dems, and Independents. I'd also like to see these go back before 1995. Perhaps you can get similar questions from NES?
Figs 5.3 and 5.4 should be combined as time series. Also, I'd like to see these questions ordered in increasing (or decreasing) support for the "no on civil liberties" response.
Fig 5.5 should have some data on it. Actually, I think it should be rewritten as a time series, with year on the x-axis and 2 lines for the 2 levels of war support (0 and 1). Also, I'd make this richer in info by considering subsets of the population. A famous example is education: highly-educated people supported the war more.
Fig 5.6 ("Threat and intolerance") could use a more descriptive title. What are the questions here? It's good for figures to be self-contained. Also, the lines should be labeled directly (not with a legend) and the x-axis should just have labels every 10 years. Again, maybe more could be learned by looking at subsets of the population or at other questions.
Fig 5.7: a little confusing. Maybe breaking up into 2 or 3 or 4 little plots (arranged on a grid) would help. Also, I'd label the x-axis as discussed in the Fig 5.1 comment.
Figs 5.8 and 5.9 should be combined and presented as time series.
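To make the x-axis suggestion above concrete, here is a minimal sketch of the second option: tick marks at each 1 Jan, with the year label centered between consecutive ticks. The function name year_axis and the sample dates are my own assumptions for illustration, not anything from the paper, and the output would still need to be handed to whatever plotting tool is in use.

```python
from datetime import date

def year_axis(first, last):
    """Return (ticks, labels) for a time axis covering first..last.

    ticks  : 1 Jan of every year from first.year through last.year + 1,
             so the data range is enclosed by year boundaries
    labels : (position, "YYYY") pairs, one per year, each placed midway
             between the two ticks that bound that year
    """
    ticks = [date(y, 1, 1) for y in range(first.year, last.year + 2)]
    labels = []
    for lo, hi in zip(ticks, ticks[1:]):
        midpoint = lo + (hi - lo) // 2  # halfway through the year
        labels.append((midpoint, str(lo.year)))
    return ticks, labels

# Hypothetical data range, chosen only to exercise the function.
ticks, labels = year_axis(date(2001, 7, 15), date(2003, 5, 1))
for position, text in labels:
    print(text, "centered at", position.isoformat())
```

The design choice is that ticks mark boundaries and labels mark intervals, so a reader never has to work out where "May-2003" sits relative to the start of the year.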
Posted by Andrew at 7:47 AM | Comments (0) | TrackBack
September 20, 2006
Colors in R
Tian links to a document showing hundreds of shades of colors in R. I don't think I would've listed them alphabetically, but it is convenient to see them all in one place. When picking out colors, don't forget that they look different on the computer, projected onto a screen, and on paper.
Posted by Andrew at 8:43 AM | Comments (3) | TrackBack
June 6, 2006
Displaying Financial Data, redux
A few weeks ago, I posted an entry about a bad graphical display of financial data; specifically, which asset classes have performed well, or badly, by year. Here's the graphic:

I pointed out that although this graphic is poor, it's not easy to display the same information really well, either. For instance, a simple line plot does a far better job than the original graphic of showing the extent to which asset classes do or don't vary together, and which ones have wilder swings from year to year, but it's also pretty confusing to read. Here's what I mean:

I suggested that others might take a shot at this, and a few people did.
Kelly O'Day sent

which is good for comparing variability of different classes, but bad for seeing which classes do or don't vary together in time. Kelly also sent

Hadley Wickham sent this contribution:
(Hadley provides the R code, too, at had.co.nz. I feel that I should note that this R code is both more elegant and more general than what I woulda done.) The lower plot breaks the asset classes into groups based on variance, which is nice. As with my graphic, though, the heavily overlapping lines and sometimes similar colors make it hard to see exactly what is going on with what asset class.
Richard uses a tabular approach, where colors indicate yearly performance:
I would say that each of these has some good and some bad characteristics (even the original one at the top). It's very hard to make a single display that lets you see both the relative and absolute performances, for each year and for the whole period. The original graphic gives up on the absolute performances (or at least gives up on graphically displaying them; you can still read off the percent gains), in favor of simply rank-ordering within each year and overall. My contribution, and the uppermost of Hadley's plots, put everything on a single line plot; you can see how things vary together, you can see the relative volatility (i.e. variance) of the various asset classes...but this is a lot of lines on a single plot, and is therefore hard to read. (Hadley's color scheme is harder for me to distinguish than "my" color scheme, which was an attempt to duplicate the one in the original chart.) Kelly's two contributions attempt to resolve the overlapping-lines issue by presenting the data two ways: side-by-side, which allows visually comparing variances but does not help with comparing temporal behavior; and vertically aligned, which allows comparison of temporal variability but makes it harder to compare variances. Richard's table is easy to follow, but (for me) much of the interpretation comes from reading the numbers rather than taking advantage of our ability to process graphical information.
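Here is a small sketch of the tradeoff between rank-ordering and absolute performance. The asset names and returns are made up for illustration (they are not the numbers from the original chart), and I use the population standard deviation as a stand-in for "volatility"; other summaries would work too.

```python
from statistics import pstdev

# Hypothetical annual returns in percent; NOT the data from the original chart.
returns = {
    "Asset A": {2003: 28.0, 2004: 11.0, 2005: 5.0},
    "Asset B": {2003: 4.0, 2004: 4.5, 2005: 2.5},
    "Asset C": {2003: 39.0, 2004: 20.0, 2005: 14.0},
}

def ranks_by_year(returns):
    """Rank asset classes within each year (1 = best return that year)."""
    years = sorted({year for series in returns.values() for year in series})
    ranks = {}
    for year in years:
        ordered = sorted(returns, key=lambda name: returns[name][year], reverse=True)
        ranks[year] = {name: pos for pos, name in enumerate(ordered, start=1)}
    return ranks

def volatility(returns):
    """Standard deviation of each asset class's returns across years."""
    return {name: pstdev(list(series.values())) for name, series in returns.items()}

ranks = ranks_by_year(returns)
vol = volatility(returns)
for name in returns:
    yearly = [ranks[year][name] for year in sorted(ranks)]
    print(name, "ranks by year:", yearly, "sd of returns:", round(vol[name], 1))
```

In this toy example the within-year ranks are identical every year, so a rank-only display would look completely static even though the returns swing a lot for two of the three classes. That is the information the rank-ordered chart throws away and the line plots keep.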
Comments?
Posted by Phil at 12:55 PM | Comments (4) | TrackBack





