At last, an entry that’s not about R

Seth writes:

I’m writing a critique of how epidemiologists analyze their data and one thing they rarely do is provide scatterplots. I suppose their excuse is they have too many points. Do you know of any papers about how to make scatterplots with large numbers of points?

My reply: The book Graphics of Large Datasets: Visualizing a Million (which I’ve been planning to review on this blog for literally over two years; I even took notes on it and everything) discusses this issue in detail, including tricks such as alpha-blending.

My impression is that if you have millions or even thousands of points, a density plot can do the trick, as in page 149 of Red State, Blue State. Perhaps readers have other suggestions. But if you just want to make the point and give a definitive reference, I’d go with the Graphics of Large Datasets book.
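
For discrete or heavily overplotted data, the density-plot idea comes down to counting the points in each cell and shading the cell by its count. Here is a minimal sketch in Python, not taken from the book; the function names `bin_counts` and `gray_levels` and the toy data are made up for illustration, and the linear count-to-gray mapping is just one possible choice.

```python
from collections import Counter


def bin_counts(points, x_width, y_width):
    """Count points per rectangular cell; keys are integer (column, row) indices."""
    counts = Counter()
    for x, y in points:
        # Floor division assigns each point to the cell whose lower-left
        # corner is at (column * x_width, row * y_width).
        counts[(int(x // x_width), int(y // y_width))] += 1
    return counts


def gray_levels(counts):
    """Scale each cell count to (0, 1], where 1 is the most crowded cell."""
    if not counts:
        return {}
    peak = max(counts.values())
    return {cell: n / peak for cell, n in counts.items()}


# Toy data (hypothetical): many observations stacked on a few discrete values.
points = [(1, 1)] * 6 + [(1, 2)] * 3 + [(2, 2)] * 1
counts = bin_counts(points, 1, 1)
levels = gray_levels(counts)
```

Each entry in `levels` can then be drawn as a filled square whose darkness is proportional to the value, so overlapping observations show up as darker cells instead of hiding one another.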

8 thoughts on “At last, an entry that’s not about R”

  1. If you have a lot of observations but they can take only discrete values, or perhaps continuous ones limited by rounding, then adding a small amount of random variation in x and y can help.

  2. Aren't sunflower plots supposed to be good for this sort of thing?

    To respond to the underlying question, however, I wonder if lack of use of scatterplots in epi might be more related to restrictions and reluctance around revealing health data that isn't aggregated to a safe disclosure level?

  3. OneEyedMan: You're talking about jittering, which I love (lots of examples in my books). But in some cases I think the gray-scale density plot is better. Compare the gray-scale image on page 149 of Red State, Blue State with its predecessor, the jittered scatterplots on page 542 of this article. I think the scatterplot does well with a continuous density (see the top graph on p.539 of the above-linked article) but with discrete data I'm now liking the density plot.

    Shartkoff: Sunflower plots are, in my opinion, a too-clever-by-half idea that is completely useless.

    In response to your second point, I think it's more force of habit than anything else.

  4. 2d density plots are ok, but they do assume an underlying continuous smooth distribution on R^2 – and this is often not the case! Binning is a good alternative (and very simple to understand and explain) – and binning using hexagons tends to be a little better than binning with squares, as explained in D. B. Carr, R. J. Littlefield, W. L. Nicholson, and J. S. Littlefield. Scatterplot matrix techniques for large n. Journal of the American Statistical Association, 82 (398):424–436, 1987.

  5. Hadley,

    Thanks for the reference. Also, yes, binning is what I was talking about when I mentioned density plots. In the example mentioned above, the data are actually discrete so the binning is easily done.

  6. Hi,

    Could you point us to some references (especially papers, but books are ok as well) about making graphics? I read a lot of your posts about using graphics rather than tables, and about the problems with other people's graphics, and it makes me think: "Do I know how to make good graphics and how to do a good job of communicating my findings?"

    Thanks
    Manoel Galdino

  7. I suggest using the smoothScatter function from the geneplotter package in Bioconductor. It creates a "smoothed color density representation of the scatterplot, obtained through a kernel density estimate". I use it a lot when I have a lot of data!

    library(geneplotter)
    ?smoothScatter
    x1

  8. I would say that often the issue in epi, at least traditionally, is that outcome and exposure are true dichotomous variables of counts (dead/not dead, "ate tuna salad"/"did not eat tuna salad"), which makes for a miserable looking scatterplot. The best diagnostic presentation of the data is the 2×2 table. I'd argue that most epi questions have at the base this data structure which is not conducive to scatterplots, irrespective of the number of data points (and not appropriate for 'jittering').

    The problem is that the tendency is to make all data fit into that 2×2 even when it's not warranted. I've seen a tragic amount of statistical power vanish into thin air by compulsive dichotomization.
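
The hexagonal binning mentioned in comment 4 can be sketched in a few lines. This is a rough illustration in Python, not the implementation from the Carr et al. paper: it finds the nearest center on each of two staggered rectangular lattices and keeps the closer one, which is one common way to assign points to hexagons. The names `hex_cell` and `hex_counts`, the `width` parameter, and the sample points are all assumptions made up for this example.

```python
import math
from collections import Counter


def hex_cell(x, y, width):
    """Return the index (i, j) of the hexagon containing (x, y).

    Hexagon centers sit at (i * width / 2, j * row_height / 2), where i and j
    are both even or both odd and row_height = sqrt(3) * width.
    """
    row_height = math.sqrt(3) * width
    # Candidate 1: nearest center on the lattice with even indices.
    i_even = 2 * round(x / width)
    j_even = 2 * round(y / row_height)
    # Candidate 2: nearest center on the lattice shifted by half a cell.
    i_odd = 2 * round(x / width - 0.5) + 1
    j_odd = 2 * round(y / row_height - 0.5) + 1

    def dist2(i, j):
        # Squared distance from the point to the center with index (i, j).
        return (x - i * width / 2) ** 2 + (y - j * row_height / 2) ** 2

    # The closer of the two candidates is the center of the enclosing hexagon.
    if dist2(i_even, j_even) <= dist2(i_odd, j_odd):
        return (i_even, j_even)
    return (i_odd, j_odd)


def hex_counts(points, width):
    """Count how many points fall in each hexagon."""
    return Counter(hex_cell(x, y, width) for x, y in points)


# Hypothetical sample: two points near the origin, one in the next row up.
sample = [(0.1, 0.1), (-0.1, 0.05), (0.4, 0.9)]
cells = hex_counts(sample, 1.0)
```

As with square bins, the counts in `cells` can be mapped to gray levels or symbol sizes; the only change is the shape of the cell each point is assigned to.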
