Question about Predictive Variance in Heteroscedastic Regression

Posted on October 30, 2006 12:00 AM by Andrew

Jo-Anne Ting writes,

I’m from the Computational Learning and Motor Control lab at the University of Southern California. We are currently looking at a weighted linear regression model where the data has unequal variances (as described in your “Bayesian Data Analysis” book). We use EM to infer the parameters of the posterior distributions.

However, we have noticed that in the scenario where the data set consists of a large number of outliers that are irrelevant to the regression, the value of the posterior predictive variance would be affected by the number of outliers in the data set, since the posterior variance of the data is inversely proportional to the number of samples in the data set. It seems to me that logically, this should not be the case, since I would hope the amount of confidence associated with a prediction would not be decreased by the number of outliers in the data set.

Any insight you could share would be greatly appreciated regarding the effect of the number of samples in a data set on the confidence interval of a prediction in heteroscedastic regression.

My response: I’m not quite sure what’s going on here, because I’m not quite sure what the unequal-variance model is that’s being used. But if you have occasional outliers, then, yes, the predictive variance should be large, since the predictive variance represents uncertainty about individual predicted data points (which, from the evidence of the data so far, could indeed be “outliers”; i.e., far from the model’s point prediction).
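
One way to make this concrete, assuming a normal linear model in which a new observation satisfies \tilde{y} | \beta, \tilde{\sigma}^2 ~ N(\tilde{x}^T \beta, \tilde{\sigma}^2). The notation is illustrative and not taken from the question. The law of total variance splits the predictive variance into two pieces:

```latex
\operatorname{Var}(\tilde{y} \mid y)
  = \underbrace{\operatorname{E}\!\left[\tilde{\sigma}^{2} \mid y\right]}_{\text{noise in a new observation}}
  + \underbrace{\operatorname{Var}\!\left(\tilde{x}^{\top}\beta \mid y\right)}_{\text{uncertainty about the regression line}}
```

Under such a model, only the second term would be expected to shrink roughly like 1/n as data accumulate. The first term does not shrink with n, and it is the term that grows if the model attributes outliers to noise.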

One way to get a handle on this would be to do some cross-validation. Cross-validation shouldn’t be necessary if you fully understand and believe the model, but if you’re still trying to figure things out it can be a helpful way to see if the predictions and predictive uncertainties make sense.
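
A minimal sketch of such a check, assuming an ordinary least-squares fit with a plug-in normal predictive interval rather than the weighted model from the question. The function name `cv_interval_coverage` and the simulated data are hypothetical, chosen only for illustration. The idea is to hold out each fold in turn and count how often the held-out points land inside their nominal 95% predictive intervals.

```python
import numpy as np

def cv_interval_coverage(X, y, k=5, z=1.96, seed=0):
    """Fraction of held-out points inside the nominal 95% predictive interval."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    folds = np.array_split(rng.permutation(n), k)
    hits = 0
    for test_idx in folds:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False
        Xtr, ytr = X[train], y[train]
        # Least-squares fit on the training folds only.
        XtX_inv = np.linalg.inv(Xtr.T @ Xtr)
        beta = XtX_inv @ Xtr.T @ ytr
        resid = ytr - Xtr @ beta
        s2 = resid @ resid / (len(ytr) - p)  # residual variance estimate
        # Predictive variance = noise variance + uncertainty in the fitted line.
        Xte = X[test_idx]
        pred = Xte @ beta
        pred_var = s2 * (1.0 + np.einsum("ij,jk,ik->i", Xte, XtX_inv, Xte))
        hits += np.sum(np.abs(y[test_idx] - pred) <= z * np.sqrt(pred_var))
    return hits / n

# Illustrative data: a straight line with constant normal noise.
rng = np.random.default_rng(0)
n = 1000
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)
coverage = cv_interval_coverage(X, y)
print(f"held-out coverage of nominal 95% intervals: {coverage:.3f}")
```

If the held-out coverage is far from 95%, or varies sharply across regions of x, that suggests the predictive uncertainties from the model are not to be trusted as they stand.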

3 thoughts on “Question about Predictive Variance in Heteroscedastic Regression”

Anonymous on October 30, 2006 at 9:27 am said:

I don't understand something about the question: The questioner says that the outliers are "irrelevant to the regression." What does that mean? If the residual variance is being estimated from the data, then the outliers will (and should) affect the posterior estimate of the variance. If you want the outliers to really be "irrelevant to the regression," one way to do that would be to use a mixture model that assumes the data come from two groups, a "regular" group with small variance and an "outlier" group with large variance. (Gelman, Carlin, Stern and Rubin has an example like that, using response time of schizophrenic people.) But if you're doing a "typical" Bayesian regression, the outliers will affect the variance estimate, as they should.
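
A rough sketch of the two-group idea, assuming both groups share the same regression line and differ only in noise variance. This is a maximum-likelihood EM-style fit (strictly, a conditional-maximization variant), not the fully Bayesian analysis in the book, and the function name `fit_two_variance_mixture`, the starting values, and the simulated data are all hypothetical.

```python
import numpy as np

def normal_pdf(e, var):
    """Density of N(0, var) evaluated at e."""
    return np.exp(-0.5 * e**2 / var) / np.sqrt(2.0 * np.pi * var)

def fit_two_variance_mixture(X, y, n_iter=200):
    """Linear regression whose errors are a mixture of a 'regular' and an 'outlier' normal."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    e = y - X @ beta
    # Crude starting values: a small variance, a large variance, 10% outliers.
    var_reg, var_out, pi_out = np.median(e**2), e.var(), 0.1
    for _ in range(n_iter):
        # E-step: probability that each point belongs to the outlier group.
        p_out = pi_out * normal_pdf(e, var_out)
        p_reg = (1.0 - pi_out) * normal_pdf(e, var_reg)
        r = p_out / (p_out + p_reg)
        # CM-step 1: weighted least squares, with likely outliers down-weighted.
        w = (1.0 - r) / var_reg + r / var_out
        Xw = X * w[:, None]
        beta = np.linalg.solve(X.T @ Xw, Xw.T @ y)
        e = y - X @ beta
        # CM-step 2: update the two variances and the outlier fraction.
        var_reg = np.sum((1.0 - r) * e**2) / np.sum(1.0 - r)
        var_out = np.sum(r * e**2) / np.sum(r)
        pi_out = r.mean()
    return beta, var_reg, var_out, pi_out

# Illustrative data: about 10% of points have much noisier errors.
rng = np.random.default_rng(0)
n = 500
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
is_outlier = rng.random(n) < 0.1
noise_sd = np.where(is_outlier, 10.0, 0.5)
y = 1.0 + 2.0 * x + rng.normal(0.0, noise_sd)
beta_hat, var_reg, var_out, pi_out = fit_two_variance_mixture(X, y)
print("beta:", beta_hat, "var_reg:", var_reg, "var_out:", var_out, "pi_out:", pi_out)
```

In a fit like this the "regular" variance is estimated mostly from the well-behaved points, so adding more outliers mainly changes the outlier fraction and the outlier variance.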

Jo-Anne on November 1, 2006 at 3:56 pm said:

I would expect outliers in the data to affect the variance estimate of a prediction, but I would not expect them to decrease the variance estimate (which is what "typical" Bayesian regression does). I guess my question is: should I be more confident in my prediction because there are more outliers in my data set?

Phil Price on November 3, 2006 at 9:56 am said:

Well, I'm confused about a bunch of stuff. First: I assumed that when you said the outliers "affect" the posterior variance, you meant that the variance went _up_, not down. I agree with you: the posterior estimate of the variance should not go down.

When I run into oddities like this in my own work, it's usually because (1) I'm not fitting the model I thought I was fitting, or (2) I'm misinterpreting the model output in some way. So my standard procedure is to simulate data from my intended model, run my model fit, and see if the answers come out right. (Of course, since I'm using a known model with known parameters, I know what the answers should be.) For some reason I procrastinate and struggle to find other approaches for a long time before I finally do this, even though it rarely takes more than fifteen minutes or perhaps half an hour to generate the simulated dataset, run the model, and check the results. I guess it's sort of an admission of failure somehow, in my subconscious. But in any case, I recommend this approach.
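
The simulate-then-fit check described above might look something like the following sketch. It assumes a heteroscedastic linear model with known per-point noise standard deviations and a weighted least-squares fit; the parameter values and variable names are made up for illustration and are not the model from the original question.

```python
import numpy as np

# Step 1: simulate from the intended model with known parameters.
rng = np.random.default_rng(1)
n = 2000
true_beta = np.array([1.0, 2.0])
x = rng.uniform(-1, 1, n)
X = np.column_stack([np.ones(n), x])
sigma = 0.2 + 1.5 * np.abs(x)  # assumed known, unequal noise sds
y = X @ true_beta + rng.normal(0.0, sigma)

# Step 2: run the model fit (weighted least squares, weights = 1 / variance).
w = 1.0 / sigma**2
Xw = X * w[:, None]
cov = np.linalg.inv(X.T @ Xw)  # sampling covariance of the estimate
beta_hat = cov @ (Xw.T @ y)
se = np.sqrt(np.diag(cov))

# Step 3: compare the answers with the truth, in standard-error units.
z_scores = (beta_hat - true_beta) / se
print("estimate:", beta_hat)
print("standard errors:", se)
print("z-scores vs. truth:", z_scores)
```

If the z-scores are routinely many standard errors from zero across repeated simulations, either the fitting code or the interpretation of its output is probably off.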
