Perils of analyzing data without context

Song Qian writes:

I am very pleased to see your comment on not analyzing data without context. Would you please elaborate on the reasons on your blog? I have been teaching an intro data analysis class to our professional master's students since 2005. One thing I have emphasized is understanding the underlying scientific problem before conducting any data analysis. This point is not always well received. Thanks.

My response: From a Bayesian point of view, it’s pretty clear: no context = no prior information. It’s really more than that, though, since the context structures the model itself, not just the numerical information that you use to regularize parameter estimates. For the climate change example, Bill Jefferys provides a good discussion here on what you can get from substantive knowledge.
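To make the "no context = no prior information" half of that concrete, here is a minimal sketch in Python, with all numbers invented for illustration: a small noisy sample analyzed with and without an informative prior drawn from substantive context. The conjugate normal-normal update is standard; the particular prior mean, standard deviations, and sample size are assumptions, not anything from a real analysis.

    import numpy as np

    rng = np.random.default_rng(42)

    # Hypothetical data: five noisy measurements of some effect.
    y = rng.normal(loc=2.0, scale=5.0, size=5)
    sigma = 5.0  # assume the measurement sd is known

    # Context-free estimate: the sample mean (flat-prior MLE).
    mle = y.mean()

    # Context-informed estimate: suppose substantive knowledge says the
    # effect is almost surely small (prior mean 0, prior sd 0.5); the
    # conjugate normal-normal posterior mean shrinks accordingly.
    prior_mean, prior_sd = 0.0, 0.5
    post_prec = 1 / prior_sd**2 + len(y) / sigma**2
    post_mean = (prior_mean / prior_sd**2 + y.sum() / sigma**2) / post_prec

    print(f"sample mean (no context):      {mle:.2f}")
    print(f"posterior mean (with context): {post_mean:.2f}")

The deeper point, that context structures the model itself, does not reduce to a few lines like these: it shows up in the choice of likelihood, predictors, and functional form, not just in the prior.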

5 thoughts on “Perils of analyzing data without context”

  1. Put differently, both the prior and the likelihood should be informed by knowledge of the processes that are going on. In a scientific analysis, the likelihood is particularly sensitive to a bad model choice made in the absence of physically informed modeling.

  2. The biggest problem I have found with analyzing data without context in a non-Bayesian analysis is a failure to understand potential bias caused by data collection techniques. I have seen a few examples of this where someone thought they had found something significant until they realized that the data had been collected inconsistently; once adjustments were made, the results were less significant.

    Here, the biggest problem with the ground temperature data has been inconsistent data collection, as the locations, the characteristics of those locations, and the measurement techniques used have all changed over time. These changes mean that there may be bias in the data, in either direction, but without context we can't know whether there is any bias that needs to be adjusted for. (A toy simulation of this kind of step-change bias appears after these comments.)

  3. Aside from the theoretical arguments, there is a very powerful practical one: the possibility of error. There could be measurement error. There could be model error. There could be typos, for god's sake. Your first line of defense is being able to look at the output and ask, does this make sense? If you don't know the context, it's hard to know what "sense" is. I suspect everyone who's ever taught stats urges students to do this kind of reality check, although it isn't a formal principle. But it's still a good idea.
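As a postscript to comment 2: the way an undocumented change in data collection can masquerade as a real effect is easy to see in a toy simulation. The sketch below (all numbers invented) generates 40 years of trendless "temperature" readings with a step shift at year 20, the sort of artifact a station relocation might produce, then fits a line with and without knowledge of the move.

    import numpy as np

    rng = np.random.default_rng(0)

    # Invented example: 40 years of annual temperature anomalies with NO
    # true trend, but the station is moved at year 20, adding a +0.4
    # degree step to every later reading.
    years = np.arange(40.0)
    step = np.where(years >= 20, 0.4, 0.0)      # relocation artifact
    obs = step + rng.normal(0.0, 0.2, size=40)  # no real signal

    # A context-free analyst fits a straight line and "finds" warming.
    slope, _ = np.polyfit(years, obs, deg=1)
    print(f"naive trend:    {slope:+.4f} deg/year")

    # Knowing the relocation date, model the step explicitly; the trend
    # estimate then drops back toward zero.
    X = np.column_stack([years, step, np.ones_like(years)])
    coef, *_ = np.linalg.lstsq(X, obs, rcond=None)
    print(f"adjusted trend: {coef[0]:+.4f} deg/year")

The point of the adjustment is exactly the commenter's: the step term can only be included because someone knows the collection history. Nothing in the data alone distinguishes the relocation artifact from a genuine shift.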
