Imputation for longitudinal data

Posted on July 18, 2008 12:14 AM by Andrew

Rey De Castro writes:

I have a longitudinal data set that needs imputation, but the problem doesn’t seem to resemble a typical imputation situation. So I’m casting about for a reasonably defensible approach that I can implement without tremendous custom-programming effort. My question concerns Bayesian approaches to imputation.

The Situation: I have longitudinal data for each of a group of schoolchildren. Each observation in the series is a multilevel class indicator of several canonical locations (i.e., indoor-home, indoor-school, outdoors, commuting) where the child reported being present during a particular 15-minute interval. Essentially, it’s a series giving each child’s location over time at 15-minute intervals. There are ~100 children, and each child’s series is very long: ~2000 observations.

The data for some children are patchy in some places, so I need to impute observations in some reasonably systematic way. The children’s observations are synchronous, so imputation might be done by using the frequency distribution of children among microenvironments at a particular time-interval as a basis for probabilistically imputing an observation. But then, I’m unsure how to combine the between-subject information with information from the child’s own longitudinal series.
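The between-subject idea in the paragraph above can be sketched in a few lines of Python. This is only an illustration under assumptions not in the original message: the function name `impute_cross_sectional`, the use of `None` to mark a missing interval, and the list-of-lists layout are all hypothetical. It draws each missing value from the observed distribution of locations among the other children at the same interval, and it ignores the child's own series entirely, which is exactly the limitation raised above.

```python
import random
from collections import Counter


def impute_cross_sectional(panel, rng=random):
    """Fill gaps using the same-interval distribution across children.

    panel: list of per-child lists of equal length; None marks a missing value.
    Returns a new panel; the input is left unchanged.
    """
    n_times = len(panel[0])
    out = [list(child) for child in panel]
    for t in range(n_times):
        # Count only values that were actually observed at interval t.
        counts = Counter(child[t] for child in panel if child[t] is not None)
        if not counts:
            continue  # nobody was observed at this interval; leave it missing
        locations, weights = zip(*counts.items())
        for child in out:
            if child[t] is None:
                child[t] = rng.choices(locations, weights=weights)[0]
    return out
```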

Could there be a Bayesian approach to imputing these longitudinal data?

My reply: Yes, it’s certainly possible to do a Bayesian model here. The challenge is the time series of 2000 observations per kid. It would help to see some graphs of these time series to get a sense of what would be a good way to model it. You could always try something simple such as a first-order Markov model, which won’t be at all realistic but might be ok for interpolation (i.e., imputation). I have a feeling this could be set up in an approximate way as a multilevel regression but I’m not quite sure how.
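Here is a rough sketch of what the first-order Markov interpolation could look like, in plain Python. It is not a full Bayesian model: the transition probabilities are pooled across all children and estimated by smoothed counts, and the names (`STATES`, `estimate_transitions`, `impute_markov`), the `None` marker for missing values, and the smoothing constant are assumptions made for illustration. Each gap is filled by sampling from the chain conditional on the observed values on both sides of it.

```python
import random
from collections import Counter

STATES = ["indoor-home", "indoor-school", "outdoors", "commuting"]


def estimate_transitions(series_list, states=STATES, pseudo=1.0):
    """Pool observed consecutive pairs across children, with additive smoothing."""
    counts = {a: Counter() for a in states}
    for series in series_list:
        for prev, cur in zip(series, series[1:]):
            if prev is not None and cur is not None:
                counts[prev][cur] += 1
    trans = {}
    for a in states:
        total = sum(counts[a].values()) + pseudo * len(states)
        trans[a] = {b: (counts[a][b] + pseudo) / total for b in states}
    return trans


def impute_markov(series, trans, states=STATES, rng=random):
    """Sample missing values given the observed ones under a first-order chain."""
    n = len(series)
    # Backward pass: beta[t][s] is proportional to
    # P(observed values after time t | location at time t is s).
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        nxt = series[t + 1]
        b = {}
        for s in states:
            if nxt is None:
                b[s] = sum(trans[s][u] * beta[t + 1][u] for u in states)
            else:
                b[s] = trans[s][nxt] * beta[t + 1][nxt]
        z = sum(b.values())
        beta[t] = {s: v / z for s, v in b.items()}  # normalize to avoid underflow
    # Forward pass: draw each missing value given the previous value
    # and everything observed later in the series.
    out = list(series)
    for t in range(n):
        if out[t] is not None:
            continue
        if t == 0:
            # Assumption: uniform distribution over the initial location.
            weights = [beta[0][s] for s in states]
        else:
            weights = [trans[out[t - 1]][s] * beta[t][s] for s in states]
        out[t] = rng.choices(states, weights=weights)[0]
    return out
```

Running `impute_markov` several times on the same series gives several completed data sets, which is the usual starting point for multiple imputation. A multilevel version would let the transition probabilities vary by child rather than pooling them completely.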

2 thoughts on “Imputation for longitudinal data”

Alex on July 17, 2008 at 6:12 PM said:

If you set up the model as a multinomial logit (or probit) in which the dependent variable is the location indicator and the regressors are the lagged location indicators, the model would be similar to a Markov chain (although with an additional noise term in the transition probabilities), since the probability of moving from one state to the next would depend only on the previous state (location) plus the noise term. Perhaps by extending this concept with whatever covariates you are using, a workable imputation scheme could be devised.
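
A minimal sketch of the lagged multinomial logit, assuming plain Python with no external libraries. The function names, the one-hot encoding of the lag, the small ridge penalty, and the gradient-descent settings are all illustrative choices, not anything specified above. With only the lagged location as a regressor, the fitted probabilities are essentially smoothed transition frequencies; extra covariates (time of day, say) would be appended to the feature vector.

```python
import math

STATES = ["indoor-home", "indoor-school", "outdoors", "commuting"]


def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def lag_features(prev, states=STATES):
    """One-hot encoding of the previous location."""
    return [1.0 if prev == s else 0.0 for s in states]


def build_design(series_list, states=STATES):
    """Rows are (lagged location, current location) pairs with both observed."""
    X, y = [], []
    for series in series_list:
        for prev, cur in zip(series, series[1:]):
            if prev is not None and cur is not None:
                X.append(lag_features(prev, states))
                y.append(states.index(cur))
    return X, y


def predict_proba(W, x):
    return softmax([sum(w * v for w, v in zip(row, x)) for row in W])


def fit_multinomial_logit(X, y, n_classes, lr=0.5, n_iter=2000, l2=1e-3):
    """Batch gradient descent on the ridge-penalized negative log-likelihood."""
    n_feat = len(X[0])
    W = [[0.0] * n_feat for _ in range(n_classes)]
    for _ in range(n_iter):
        grad = [[l2 * w for w in row] for row in W]
        for x, k in zip(X, y):
            p = predict_proba(W, x)
            for c in range(n_classes):
                err = (p[c] - (1.0 if c == k else 0.0)) / len(X)
                for j, v in enumerate(x):
                    grad[c][j] += err * v
        for c in range(n_classes):
            for j in range(n_feat):
                W[c][j] -= lr * grad[c][j]
    return W
```

To impute a missing interval, one would call `predict_proba(W, lag_features(previous_location))` and draw a location from the returned probabilities.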

Hadley on July 17, 2008 at 7:57 PM said:

It would be very interesting to see this data set – given that school children spend a large chunk of their time in very predictable locations (e.g. asleep: ~8 hours, school: ~7 hours) I'd want to know exactly where the observations are missing and why. If you have any knowledge about the process causing the missingness, shouldn't you include that in the imputation? Similarly, I'd want more information about what you are going to do with the data after you have imputed it. (i.e. why are you doing imputation in the first place?)
