10 thoughts on “Develop your intuition about histograms”

  1. True, though that may be more the breaking algorithm than anything else… Scott or Freedman-Diaconis gets you something much more reasonable (though I'll accept that most people aren't aware that differences exist!)

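    # freq=FALSE puts the histogram on the density scale; add=TRUE overlays the curve
    hist(y,"Scott",freq=FALSE)
    curve(dcauchy(x),col="red",lwd=2,add=TRUE)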

    or…

    hist(y,"Scott",xlim=c(-100,100),freq=FALSE)
    curve(dcauchy(x),col="red",lwd=2,add=TRUE)

    for something more "textbook" :-)
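
    To see how much the breaking algorithm matters, here is a minimal sketch that compares the bin counts suggested by the three rules built into base R. The seed and the sample size of 100 are assumptions made for illustration; note that hist() passes the suggested count through pretty(), so the number of bars actually drawn can differ from these values.

    ```r
    set.seed(42)                      # assumed seed, only for reproducibility
    y <- rcauchy(100)                 # assumed sample size

    # Number of bins each rule suggests for the same sample
    bins <- c(Sturges = nclass.Sturges(y),
              Scott   = nclass.scott(y),
              FD      = nclass.FD(y))
    print(bins)

    # Draw the three histograms side by side
    op <- par(mfrow = c(1, 3))
    for (rule in c("Sturges", "Scott", "FD")) {
      hist(y, breaks = rule, main = rule)
    }
    par(op)
    ```

    Sturges' rule depends only on the sample size, so it cannot react to the heavy tails; the other two rules use the spread of the data.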

  2. Byron,

    These help but I still don't think they look so pretty. And doing xlim=c(-100,100) is cheating, I think! My point is that some distributions aren't so amenable to histograms.

  3. For me, there is no intrinsic reason why a histogram should lead to a "good" visualisation of a distribution. Even setting aesthetic criteria aside, the automatic choice of the bin width is difficult, and is too often badly implemented in software packages. For example, the R default is Sturges' heuristic (1926), whose derivation is simply wrong. Scott (1985) proposed another one based on a Gaussian assumption, but Hyndman (1995) showed that it also behaves badly in some cases. Other work has used more complex heuristics involving more and more computation (Freedman (1981), Wand (1997)), but none of them really convinced me. I'd prefer to consider the histogram as a model of the distribution of the data, and the bin width as a parameter. Then Bayesian inference and the so-called "automatic Occam" could lead to a better choice of the width (at the price of an optimisation in bin-width space).
    Discretisation is a complex issue, more profound than is usually thought.
    If somebody is interested in references, mail me.
    Regards
    Pierre
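
    As a rough illustration of treating the number of bins as a model parameter, the sketch below uses one published Bayesian rule, Knuth's, which is not necessarily the approach meant above. It scores each candidate number M of equal-width bins by its log posterior (up to a constant) and keeps the best. The helper name knuth_logpost, the normal test data, and the search grid 1:100 are all assumptions made for the example.

    ```r
    # Hypothetical helper: log posterior (up to an additive constant) for
    # M equal-width bins spanning the range of x, following Knuth's rule.
    knuth_logpost <- function(x, M) {
      N <- length(x)
      edges <- seq(min(x), max(x), length.out = M + 1)
      counts <- tabulate(cut(x, breaks = edges, include.lowest = TRUE,
                             labels = FALSE), nbins = M)
      N * log(M) + lgamma(M / 2) - M * lgamma(0.5) - lgamma(N + M / 2) +
        sum(lgamma(counts + 0.5))
    }

    set.seed(1)                       # assumed seed
    x <- rnorm(1000)                  # assumed test data (normal, not Cauchy)
    M_grid <- 1:100                   # assumed search grid
    logpost <- sapply(M_grid, function(M) knuth_logpost(x, M))
    M_best <- M_grid[which.max(logpost)]
    print(M_best)

    hist(x, breaks = seq(min(x), max(x), length.out = M_best + 1),
         freq = FALSE, main = paste("Bins chosen:", M_best))
    ```

    The grid search is the "optimisation in bin-width space" mentioned above; for heavy-tailed data such as a Cauchy sample, equal-width bins over the full range remain a poor model whatever M is chosen.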

  4. OK then, smart guy, if a histogram isn't the right answer here, what is a better way to display the distribution?

    I'm tempted to play devil's advocate and say that depending on what you want, a histogram may be OK here: you have a lot of data clustered around 0, and some extreme outliers that go waaay out in the tails. If you care about the distribution "near" zero, and you know it, then it's not cheating to focus on that area, so make a histogram of just the data for |x| < 100 or so, or just |x| < 25.

    If I were to try to defend the lowly histogram, that's what I'd say. In this extreme case, that's a tough defense to make, especially for an exploratory analysis: the point of the histogram isn't supposed to be that it tells you where all of your data are, it's that it shows something about the shape of the distribution, and that's not really the case here.

    There are other standard approaches to seeing something about the distribution, though, like looking at quantiles and extreme values. A simple plot(sort(y)) will kinda show you what's going on, at least to the point that you know to look deeper.

    But yeah, a histogram isn't very helpful.
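
    The quantile and sorted-value ideas above can be sketched as follows. The seed, the sample size, the chosen probabilities, and the cutoff of 25 are assumptions for illustration; the overlaid curve is the Cauchy density rescaled to the truncated range so that it is comparable with the histogram of the central data.

    ```r
    set.seed(42)                      # assumed seed
    y <- rcauchy(10000)               # assumed sample size

    # Sample quantiles next to the theoretical Cauchy quantiles
    probs <- c(0.001, 0.01, 0.25, 0.5, 0.75, 0.99, 0.999)
    q <- quantile(y, probs)
    print(round(q, 2))
    print(round(qcauchy(probs), 2))

    # Sorted values: flat in the middle, near-vertical at both ends
    plot(sort(y), ylab = "sorted y")

    # Histogram restricted to the central region |y| < 25
    central <- y[abs(y) < 25]
    hist(central, breaks = "FD", freq = FALSE, main = "|y| < 25")
    mass <- pcauchy(25) - pcauchy(-25)   # probability inside the cutoff
    curve(dcauchy(x) / mass, col = "red", lwd = 2, add = TRUE)
    ```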

  5. I think histograms are great but not for all purposes. For this particular example, the weird thing is that the data can have such extreme values, either positive or negative. If the data were all-positive, you could just take the log first and then do the histogram. Real data that are approximately Cauchy-distributed tend to be ratios of the form y/x, where x could be either positive or negative. The extreme values correspond to x near 0. I'd probably prefer a scatterplot of y vs. x.
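
    A small simulation makes the ratio point concrete. It assumes x and y are independent standard normals, in which case y/x follows a standard Cauchy distribution; the seed, the sample size, and the threshold of 100 for "extreme" are arbitrary choices for the example.

    ```r
    set.seed(42)                      # assumed seed
    n <- 10000                        # assumed sample size
    x <- rnorm(n)
    y <- rnorm(n)
    r <- y / x                        # ratio of independent normals

    # Flag the extreme ratios and check where their denominators sit
    extreme <- abs(r) > 100
    print(summary(abs(x[extreme])))
    print(summary(abs(x)))

    # Scatterplot of y vs. x with the extreme ratios highlighted
    plot(x, y, pch = 16, cex = 0.3,
         col = ifelse(extreme, "red", "grey60"))
    abline(v = 0, lty = 2)
    ```

    The highlighted points hug the vertical line x = 0, which is the structure the histogram of the ratio hides.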

  6. In terms of teaching advice, I have an interesting quote:

    " The message to the student, though, is clear – don't just automatically accept the default histogram generated by your statistics program. Especially if the data is skewed or multi-modal or has some unusual feature in its distribution, the student should generate a number of histograms and choose that which seems to best give a true picture of the data, or that which best displays the particular aspect of the data in which the student is interested."
    Unfortunately, I can't find who the author of this is; it may be Paul Velleman, but I'm not sure. ;-(

  7. Okay, my stupid comment looks even more stupid for my trying to put a less-than sign in my pretend R code. I was going for this:

    y = rcauchy(100)
    hist(y)

    As for what might be better, given a huge dataset from a Cauchy distribution: I deal mostly with normal distributions, so this would make the most sense to me:

    y = rcauchy(10000)
    n = pnorm(y)
    qqplot(y,n)

    Not the most informative, I guess, but it gets the point across to non-stats-heavies like me.
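
    For comparison, here is a sketch of two more conventional Q-Q plots for the same kind of sample: one against normal quantiles and one against Cauchy quantiles. This is an alternative to the snippet above, not a correction of it; the seed and the variable name theo are assumptions for the example.

    ```r
    set.seed(42)                      # assumed seed
    y <- rcauchy(10000)

    op <- par(mfrow = c(1, 2))

    # Against the normal: the tails bend sharply away from the reference line
    qqnorm(y, main = "Against normal quantiles")
    qqline(y, col = "red")

    # Against the Cauchy: theoretical quantiles at evenly spaced probabilities
    theo <- qcauchy(ppoints(length(y)))
    qqplot(theo, y, main = "Against Cauchy quantiles",
           xlab = "theoretical quantiles", ylab = "sample quantiles")
    abline(0, 1, col = "red")

    par(op)
    ```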
