Bayesian analysis of case-control studies

Harker Rhodes writes:

I’m looking for a chapter in someone’s textbook titled “Bayesian analysis of case-control studies”. Or a chapter with any title covering that subject…

Here’s his longer story:

I’m a clinician teaching himself Bayesian statistics by working his way through your textbook “Bayesian Data Analysis”. Eventually I’ll get where I’m going, but it’s proving to be slow going and I thought that if I described to you the specific problem I’m trying to solve you might help by directing my reading to specific chapters (or some other textbook entirely if that’s appropriate).

I want to analyze genetic association data taking into account prior information about the disease risk of each individual.

Consider a dataset consisting of 270 Parkinson’s disease (PD) patients and the same number of “matched” controls. For each individual I know age, sex, and genotype at both a previously unstudied genetic locus and several other extensively studied loci. Based on prior knowledge, I have an estimate of the odds ratio and 95% CI for the association of each previously studied genetic locus with PD, and I have an estimate of the odds of developing PD as a function of age and sex from the epidemiologic literature.

The cohorts are “matched” in the sense that they are all adults over a particular age and that there are the same number of men and women in the PD and control cohort. But they are not all of the same age and they certainly are not matched for the genotype at the genetic loci for which I have prior information.

The conventional analysis of this data is to ignore age, sex, and genotype at the previously studied loci and simply calculate the OR and 95% CI for the previously unstudied locus. But that ignores everything else I know about these subjects.

For example, both age and sex are very strong risk factors for the development of PD. Intuitively, if I have a PD patient with the candidate locus risk allele who is a 55 y/o woman (and who therefore has a low a priori risk of PD), that means a lot more than having an 85 y/o man with the same genotype. But the conventional analysis treats them both equally.

I think that what I need to do is start with a model which predicts the odds of an individual having PD as a function of age and sex and genotype. That ought to be easy – I have a table with the odds of PD as a function of age and sex from the epidemiology literature and I can multiply that by the product of the true ORs for the individual based on genotype.

But that “model” is not the likelihood function p(y|theta) because y is not a vector of 1s and 0s indicating that particular individuals do or do not have PD. It can’t be since the fact that the first 270 individuals have PD and the remaining 270 do not is guaranteed by the design of the experiment.

What am I doing wrong here?

And will I figure that out for myself if I read “Bayesian Data Analysis” from front to back or is there something else I should be reading?

My reply: I agree that it makes sense to predict the outcome given whatever background variables you have, rather than simply trying to make some sort of crude aggregate comparison. But, you ask, how does the data collection (the case-control study) come into the Bayesian analysis? It comes in through the model for the probability that a unit is included in the sample, as discussed in chapter 7 of Bayesian Data Analysis. What happens is that there’s an unknown parameter which is the prevalence of the cases, and the study is not informative about this parameter, but you can learn about other parameters in the model. I don’t think we have a case-control study example in chapter 7, but the principles are there.
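One way to see why the prevalence drops out is the following sketch. It assumes a logistic model for disease given covariates and assumes that inclusion in the sample depends only on disease status (so it ignores the matching on sex). The symbols $\alpha$, $\beta$, $S$, $\pi_1$, and $\pi_0$ are introduced here for illustration and are not notation from the email or from the book.

```latex
\begin{align*}
\Pr(y=1 \mid x) &= \operatorname{logit}^{-1}(\alpha + \beta x), \\
\Pr(S=1 \mid y=1, x) &= \pi_1, \qquad \Pr(S=1 \mid y=0, x) = \pi_0, \\
\frac{\Pr(y=1 \mid x, S=1)}{\Pr(y=0 \mid x, S=1)}
  &= \frac{\pi_1 \Pr(y=1 \mid x)}{\pi_0 \Pr(y=0 \mid x)}
   = \frac{\pi_1}{\pi_0}\, e^{\alpha + \beta x}, \\
\operatorname{logit} \Pr(y=1 \mid x, S=1)
  &= \underbrace{\alpha + \log(\pi_1/\pi_0)}_{\alpha^{*}} + \beta x .
\end{align*}
```

Under these assumptions the sampled data inform only $\alpha^{*}$ and $\beta$. The intercept $\alpha$, which carries the prevalence, cannot be separated from the sampling ratio $\pi_1/\pi_0$ without outside information such as the epidemiologic table of PD odds by age and sex, while the log odds ratio $\beta$ is unaffected by the design.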

Also, I think there might be something on case-control studies in the Carlin and Louis book, Bayesian Methods for Data Analysis.
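The point in the reply, that a case-control sample says nothing about prevalence but still informs the other parameters, can be checked with a small simulation. This is only a sketch under made-up assumptions: a single binary genotype, a carrier frequency of 0.3, roughly 1% disease prevalence among non-carriers, and a true odds ratio of 2. None of these numbers come from the PD study, and the function name `simulate_case_control` is hypothetical.

```python
import math
import random


def simulate_case_control(n_per_arm=2000, pop_size=300_000, seed=1):
    """Simulate a population, then draw equal numbers of cases and controls."""
    rng = random.Random(seed)
    carrier_freq = 0.3            # assumed genotype frequency (illustrative)
    baseline_odds = 0.01 / 0.99   # assumed odds of disease for non-carriers
    true_or = 2.0                 # assumed true odds ratio for carriers

    cases, controls = [], []
    for _ in range(pop_size):
        g = 1 if rng.random() < carrier_freq else 0
        odds = baseline_odds * (true_or if g else 1.0)
        p = odds / (1.0 + odds)
        (cases if rng.random() < p else controls).append(g)

    # Case-control design: sample on the outcome, fixing the case fraction.
    s_cases = rng.sample(cases, n_per_arm)
    s_controls = rng.sample(controls, n_per_arm)

    a = sum(s_cases)              # carriers among sampled cases
    b = n_per_arm - a             # non-carriers among sampled cases
    c = sum(s_controls)           # carriers among sampled controls
    d = n_per_arm - c             # non-carriers among sampled controls

    return {
        "population_prevalence": len(cases) / pop_size,
        "sample_prevalence": n_per_arm / (2 * n_per_arm),
        "sample_or": (a * d) / (b * c),
        "true_or": true_or,
    }


result = simulate_case_control()
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

The sampled prevalence is 0.5 by construction whatever the population prevalence is, while the odds ratio computed from the sampled 2x2 table should land near the true value, up to sampling noise.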

8 thoughts on “Bayesian analysis of case-control studies”

  1. I'm currently working on a similar problem (genome-wide analysis of coronary artery disease), and I have a conceptual problem with the treatment of age in these models. It seems to me that the outcome variable should actually be age of onset plus a dummy indicator for "onset never". So the model should be changed into some kind of survival thingy, and outcomes should be treated as censored for so-called controls (because they might get the disease later).

    "But that "model" is not the likelihood function p(y|theta) because y is not a vector of 1s and 0s indicating that particular individuals do or do not have PD."

    Nevertheless, treating y as the experimental outcome still permits asymptotically consistent estimation of the odds-ratio by maximum likelihood. I'd guess a Bayesian analysis would inherit the asymptotic consistency from the likelihood function, and the whole deal could be justified using the material on ignorable experimental designs in chapter 7 of BDA. So go ahead and pretend that disease status is the outcome!

  2. Malay Ghosh, Bhramar Mukherjee and several of their students at Florida and Michigan (respectively) have written quite a bit on Bayesian analysis of epidemiological data, in particular on case-control studies. They have a few papers on the subject, which might be worth reading.

  3. For a Bayesian justification of the likelihood-based 'trick' of using prospective analysis on retrospective data, see Seaman and Richardson's Biometrika paper. When fitting models that are a long way from how the data was actually generated (i.e. treating PD as an outcome) Bayesians have to consider carefully how the prior on (all) the fitted model's parameters ends up representing the 'actual' prior beliefs.

    Harker; you could take a look at Mukherjee et al (2005) Bayesian Analysis for case-control studies: A review article. Handbook of Statistics, Vol 25, pp 793-819. Also, your comment about other covariates is puzzling; if you have a frequency matched case-control study, the conventional analysis *would* adjust for the matching variables.

  4. Also likely worth looking at causal analysis literature here – Rubin's approach requires some special finessing for case-control studies (according to Rubin)

    though I am unsure about Pearl's approach here.

    Keith

  5. If the purpose is "to analyze genetic association data taking into account prior information about the disease risk of each individual", what is the difference between the Bayesian method and the conventional method of including them (as well as their interaction with the genetic factor) in the model?

  6. Bayesian Biostatistics (edited by Don Berry and Dalene Stangl) has some good information and references on case-control studies.

  7. You might want to take a look at Sander Greenland's paper "Bayesian perspectives for epidemiological research: I. Foundations and basic methods" in IJE 2006;35:765. A simple way is to add a stratum of 'fictitious' data that describes the prior info, and treat it as extra info in your current analysis.

    To Freddy, I think the problem is that you're losing the information from those prior studies that estimate the risk of PD given a genotype adjusted for age and sex. The conventional analysis of the new data, while controlling for those factors, would be doing the work "all over again", and so the thought is that you're using some of your new information to reestimate what you might be able to simply postulate as a prior using Bayesian techniques.

    But I'm teaching myself, too. Does that sound right?
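The idea in the last comment, postulating a prior from earlier studies instead of re-estimating it, can be illustrated with a normal approximation on the log odds ratio scale. This is a rough sketch and not the data-augmentation procedure itself: it turns a published OR and 95% CI into a mean and variance for the log OR, then combines prior and new-data estimates by inverse-variance weighting. The function names and all numbers are hypothetical.

```python
import math


def log_or_from_ci(or_est, lower, upper):
    """Mean and variance of the log OR implied by an OR and its 95% CI."""
    mean = math.log(or_est)
    se = (math.log(upper) - math.log(lower)) / (2 * 1.96)
    return mean, se ** 2


def combine(prior_mean, prior_var, data_mean, data_var):
    """Normal-normal update: precision-weighted average of prior and data."""
    w_prior = 1.0 / prior_var
    w_data = 1.0 / data_var
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * data_mean)
    return post_mean, post_var


# Hypothetical numbers for one previously studied locus.
prior_mean, prior_var = log_or_from_ci(1.5, 1.2, 1.9)   # from the literature
data_mean, data_var = log_or_from_ci(2.0, 1.1, 3.6)     # from the new sample

post_mean, post_var = combine(prior_mean, prior_var, data_mean, data_var)
half_width = 1.96 * math.sqrt(post_var)
print("posterior OR:", math.exp(post_mean))
print("95% interval:", math.exp(post_mean - half_width),
      math.exp(post_mean + half_width))
```

Because the prior here is much tighter than the new estimate, the combined OR sits closer to the prior value, and its interval is narrower than either input interval. A full analysis would put such priors on the coefficients of a regression model rather than combine summary estimates one locus at a time.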
