There are never 70 distinct parameters

Sam Seaver writes:

I’m a graduate student in computational biology and relatively new to advanced statistics. I’m trying to teach myself how best to approach a problem I have.

My dataset is a small sparse matrix of 150 cases and 70 predictors. It is sparse in the sense of having many zeros, not many ‘NA’s. Each case is a nutrient that is fed into an in silico organism, and its response is whether or not it stimulates growth. Each predictor is one of 70 different pathways that the nutrient may or may not belong to. Because not every nutrient belongs to every pathway, there are many zeros in my matrix. My goal is to use the pathways themselves to predict whether or not a nutrient could stimulate growth, so I wanted to compute a regression coefficient for each pathway, which I could then apply to other nutrients for other species.

There are quite a few singularities in the dataset (summary(glm) reports that 14 coefficients are not defined because of singularities). I know which pathways (and some nutrients) I could remove because they are almost empty, but I would rather not, because these pathways may apply to other species. So I was wondering if there are complementary and/or alternative methods to logistic regression that would give me a coefficient of some kind for each pathway?

My reply:

If you have this kind of sparsity, I think you’ll need to add some prior information or structure to your model. Our paper on bayesglm suggests a reasonable default prior, but it sounds to me that you’ll have to go further.
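To make the role of the prior concrete, here is a minimal sketch in plain Python. It is not the bayesglm algorithm. It simply finds a posterior mode by gradient ascent on the logistic log-likelihood plus independent Cauchy log-priors. The scales 2.5 (coefficients) and 10 (intercept) are chosen to echo the default described in the bayesglm paper, and the toy data, function names, step size, and iteration count are all assumptions made for illustration.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_map_logistic(X, y, scale=2.5, intercept_scale=10.0, steps=2000, lr=0.05):
    """Posterior mode of a logistic regression with independent Cauchy priors.

    Illustrative only: plain gradient ascent, no input rescaling, no
    convergence check. The Cauchy log-prior is not concave, so in general
    this finds a local mode.
    """
    p = len(X[0])
    b0, b = 0.0, [0.0] * p
    for _ in range(steps):
        g0, g = 0.0, [0.0] * p
        # Gradient of the Bernoulli log-likelihood.
        for xi, yi in zip(X, y):
            r = yi - sigmoid(b0 + sum(bj * xij for bj, xij in zip(b, xi)))
            g0 += r
            for j in range(p):
                g[j] += r * xi[j]
        # Gradient of the Cauchy(0, s) log-density: -2*b / (s^2 + b^2).
        g0 -= 2.0 * b0 / (intercept_scale ** 2 + b0 ** 2)
        for j in range(p):
            g[j] -= 2.0 * b[j] / (scale ** 2 + b[j] ** 2)
        b0 += lr * g0
        b = [bj + lr * gj for bj, gj in zip(b, g)]
    return b0, b

# Hypothetical toy data: 8 nutrients, 3 pathways (columns).
#   pathway 0: informative, with overlap between growth and no growth
#   pathway 1: every member stimulates growth (the unpenalized estimate diverges)
#   pathway 2: empty (the data say nothing about its coefficient)
X = [
    [1, 0, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
    [0, 1, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0],
]
y = [1, 0, 0, 1, 1, 1, 1, 0]

b0, b = fit_map_logistic(X, y)
print("intercept:", round(b0, 3))
print("pathway coefficients:", [round(bj, 3) for bj in b])
```

With the prior in place, every pathway gets a finite coefficient: the perfectly separating pathway is pulled back to a moderate positive value, and the empty pathway stays at the prior center of zero instead of being reported as undefined.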

To put it another way: give up the idea that you’re estimating 70 distinct parameters. Instead, think of these coefficients as linked to each other in a complex web.

More generally, I don’t think it ever makes sense to think of a problem with a lot of loose parameters. Hierarchical structure is key. One of our major research problems now is to set up general models for structured parameters, going beyond simple exchangeability.
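As a point of reference, the simplest hierarchical version of this problem, the exchangeable one that the research mentioned above aims to go beyond, could be written as follows. The notation is an assumption introduced here for illustration: $y_i$ indicates growth for nutrient $i$, $x_{ij}$ indicates membership of nutrient $i$ in pathway $j$, and $\mu$ and $\tau$ are estimated from the data rather than fixed in advance.

```latex
\begin{align*}
y_i &\sim \mathrm{Bernoulli}\!\left(\mathrm{logit}^{-1}\!\Big(\alpha + \sum_{j=1}^{70} x_{ij}\,\beta_j\Big)\right), & i &= 1,\dots,150,\\
\beta_j &\sim \mathrm{N}(\mu, \tau^2), & j &= 1,\dots,70.
\end{align*}
```

Here the 70 coefficients share two hyperparameters instead of floating freely. A more structured model might, for example, replace $\mu$ with $\mu_{g[j]}$ for some hypothetical grouping $g$ of related pathways. That is only one possible direction, not a model proposed in the post.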

2 thoughts on “There are never 70 distinct parameters”

  1. Why not use network / graph theory models? There has been a wealth of work on these in the last decade, with a boatload of work on the statistical modeling of these in the last 3 or 4 years. See especially March and June issues of AOAS this year – these both have special sections on statistical network models. A lot of these models are being used to look at genes vs pathways, a similar problem to that of Mr. Seaver.
