Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics

John and I gave our presentation on statistical graphics today, and then coincidentally I found this monograph by Rafe Donahue (link from Helen DeWitt). I started skimming and it looks pretty good so far. He uses horizontal jittering instead of the horrible boxplot, and that makes me happy already. On the other hand–since I’m being superficial here–I’m not a fan of the marginal-notes style of referencing. I always feel that this style draws undue attention to what are ultimately the least important parts of the book.
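For readers unfamiliar with the horizontal jittering mentioned above, here is a minimal sketch of the idea. It is not taken from Donahue's monograph: the function name `jitter_positions`, the `width` parameter, and the toy data are all hypothetical. Each group gets a slot on the x-axis, and every point is nudged sideways by a small random amount so that tied values do not print on top of each other. The data values themselves are left untouched.

```python
import random


def jitter_positions(groups, width=0.2, seed=0):
    """Return (label, x, y) triples for a jittered dot plot.

    groups: dict mapping group label -> list of numeric values.
    Each group gets an integer slot on the x-axis (0, 1, 2, ...). Every
    point is shifted horizontally by a uniform random offset in
    [-width, width]; the y-values (the data) are never altered.
    """
    rng = random.Random(seed)  # fixed seed so the layout is reproducible
    points = []
    for slot, (label, values) in enumerate(groups.items()):
        for y in values:
            x = slot + rng.uniform(-width, width)
            points.append((label, x, y))
    return points


# Toy data (made up for illustration), with deliberate ties in each group.
data = {
    "control": [4.1, 4.1, 4.3, 5.0, 5.0],
    "treated": [5.2, 5.9, 5.9, 6.4],
}
points = jitter_positions(data)
for label, x, y in points:
    print(f"{label:8s} x={x:+.3f} y={y}")
```

The resulting (x, y) pairs can be passed to any scatterplot routine. Unlike a boxplot, every observation stays visible, which is the appeal for small datasets.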

More seriously, Donahue’s monograph looks interesting, and I’ll have to read it more carefully. I’ve been looking for something on graphics that goes beyond the nuts and bolts of how to make a particular graph and considers what should actually be plotted and why.

On a theoretical level, I wonder how his ideas connect to my ideas of exploratory data analysis and statistical modeling (see here and here). I think the connections are there (as in Donahue’s principles #28, 43, 52, and 86: “The data display is the model”).

Actually, many of his principles are things that I tell people also. Just today I discussed how you have to tell the viewer what the plot is (Donahue’s principle #23).

P.S. A minor point: Donahue’s principle #53 is, “Plot cause versus effect.” Doesn’t he mean, “Plot effect versus cause”? Usually we say y vs. x, not x vs. y. Or else I’m missing something here.

10 thoughts on “Fundamental Statistical Concepts in Presenting Data: Principles for Constructing Better Graphics”

  1. I'm surprised that you don't like boxplots. I think they are the best displays ever! A boxplot uses only five numbers (plus outliers) but tells you so much about the distribution. The example used in the monograph is trivial: it has so few data points. For large-scale data, the dot plot just doesn't work. Give boxplots another chance!

  2. Histograms, frequency polygons, and density plots cannot be as compactly displayed next to one another to show differences between groups. Boxplots are superior to barplots, that's for sure.

  3. The topic "statistical principles for constructing plots" or better "for choosing which plot to make" is interesting!
    In classes I have presented various plots, and the question is always "how do we know which plot to make?" So I started to explain by comparison with hypothesis testing: given a hypothesis (question), we must ask which plot is "optimal" for testing that hypothesis. What is optimal may depend on what the alternative hypotheses are.
    The analogy goes further: given the hypothesis (and chosen plot), there is a "null plot" (expected if the null is true) and an "alternative plot" (expected if the alternative is true). For instance, for normal QQ plots, the null plot is a straight line, and the alternative plots are various curved lines.

    I find this analogy useful.

  4. I usually say x versus y. That might just follow from alphabetizing, rather than any deeper principle, but conceptually, I think of x varying under my control, and y responding. So cause versus effect seems right to me.

    Perhaps if you came from an ordinate versus abscissa world, you'd think y vs. x, but then that's saying effect versus cause, which sounds backwards.

    tomato, tom8o :-)

  5. 'Horrible boxplot' is probably not really fair … If I were to use the term 'horrible', I would rather use it for cases where people plot each and every point in a large dataset, with the effect that you either can't see anything in the plot any more, or you are drawn to relatively few outliers because you can't judge the density correctly.

    It is probably best to be able to use all sorts of different plots and use them when appropriate.

  6. OOOO – I think they might be best…

    I discovered them when I read EDA in '77 after 2 years of Stats and thought they would be useful. Later, when I worked helping people (mainly biologists and agriculturalists) analyse their experiments, I saw lots of real data and very little looked normal!
    When they did their ANOVAs and plots of means ± SEs, that was always missed.

    I think it was a revolutionary step, actually. Why?
    If as Andrew says the graph is the model – then here was a model with few assumptions… just talking about the data.
    For once in my life I disagree with Hadley – a histogram for small-to-medium amounts of data is not better – it wastes space & is generally too lumpy to be useful (i.e., it depends too much on the grouping).
    Of course boxplots are not perfect – they are very weak with data that can be bi- or multi-modal. But these days there are fixes for that.
    This pattern of data is common in biology too (e.g., when sexes are not externally distinguishable).

    Dave

Comments are closed.