Social networks

I'm currently working on statistical models to infer network structure from questions on standard surveys. Specifically, we use questions of the form "How many X's do you know?", where X represents a population of interest. These data, known as Aggregated Relational Data (ARD), use survey respondents' social networks to reach members of the population of interest, X. ARD are commonly used to estimate features of populations that are difficult to sample directly, such as the homeless or persons with HIV/AIDS (see, for example, Killworth et al. (1998)). Recently, ARD have also been used to compare levels of segregation by race and across other dimensions of potential social cleavage in contemporary America (DiPrete et al., to appear).
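As background, the simplest way ARD are turned into a population-size estimate is a ratio ("scale-up") estimator in the spirit of Killworth et al. (1998): the share of respondents' acquaintances who belong to X is scaled by the total population size. The sketch below is a minimal illustration only, not the models described on this page. It assumes each respondent's personal network size (degree) is already known or has been estimated separately, and the function name and all numbers in the usage note are hypothetical.

```python
def scale_up_estimate(known_counts, degrees, total_population):
    """Basic ratio (scale-up) estimate of the size of group X.

    known_counts[i]: respondent i's answer to "How many X's do you know?"
    degrees[i]: respondent i's personal network size (assumed known here).
    total_population: size of the overall population.
    """
    if len(known_counts) != len(degrees):
        raise ValueError("known_counts and degrees must have the same length")
    total_degree = sum(degrees)
    if total_degree <= 0:
        raise ValueError("total degree must be positive")
    # Fraction of all reported acquaintances who are in X, scaled up
    # to the size of the whole population.
    return total_population * sum(known_counts) / total_degree
```

For example, three hypothetical respondents who report knowing 1, 0, and 2 members of X, with network sizes 100, 200, and 300 in a population of 1,000,000, give `scale_up_estimate([1, 0, 2], [100, 200, 300], 1_000_000)`, an estimate of 5,000 members.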

My dissertation research, with Professor Tian Zheng, extends our previous work on estimating specific population and respondent features to estimate relationships between groups using ARD. We develop a latent space model in which an individual's propensity to know members of a given group is independent, given the positions of the individual and the group in a latent "social space," similar to latent space models for complete networks (Hoff 2005). Using this framework, we estimate the relative homogeneity of groups and describe variation in the propensity for interaction between respondents and population members.

For a summary of our current work using latent space models, please see our poster from the recent Joint Statistical Meetings in Vancouver or the associated handout. Also, here is a flyer we created which describes our method for a general audience.

Dynamic modeling

I am also working on developing an online binary classification procedure for cases when there is uncertainty about the model to use and when parameters within a model change over time. We address model uncertainty through Dynamic Model Averaging (DMA), a dynamic extension of Bayesian Model Averaging (BMA) in which posterior model probabilities are also allowed to change with time. Our method accommodates different levels of change in the data-generating mechanism by calibrating a "forgetting" factor. We propose an algorithm which adjusts the level of forgetting in a completely online fashion using the posterior predictive distribution.
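To make the role of the forgetting factor concrete, here is a minimal sketch of one time step of the model-probability recursion as it is usually written for DMA: the previous posterior model probabilities are flattened by raising them to a power alpha in (0, 1], then updated by each model's one-step-ahead predictive likelihood. The function name, the default value of alpha, and the fixed (uncalibrated) forgetting factor are assumptions made for illustration; the online calibration of forgetting described above is not shown.

```python
def dma_update(prev_probs, pred_likelihoods, alpha=0.99):
    """One step of dynamic model averaging for model probabilities.

    prev_probs: posterior model probabilities after time t-1 (sum to 1).
    pred_likelihoods: each model's predictive likelihood of the new
        observation at time t.
    alpha: forgetting factor in (0, 1]; alpha = 1 recovers ordinary
        recursive Bayesian model averaging.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must lie in (0, 1]")
    # Forgetting step: flatten the probabilities toward uniform.
    flattened = [p ** alpha for p in prev_probs]
    total = sum(flattened)
    predicted = [f / total for f in flattened]
    # Updating step: Bayes' rule with the predictive likelihoods.
    weighted = [p * lik for p, lik in zip(predicted, pred_likelihoods)]
    norm = sum(weighted)
    return [w / norm for w in weighted]
```

Smaller values of alpha discount past evidence more heavily, so the model probabilities can shift faster when the data-generating mechanism changes.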

We apply our method to data from children with appendicitis who receive either a traditional (open) appendectomy or a laparoscopic procedure. Our results indicate that the factors associated with which children receive a particular type of procedure changed significantly over the seven years of data collection, a feature that is not captured using standard regression modeling. Because our procedure can be implemented completely online, future data collection for similar studies would only require storing sensitive patient information temporarily, reducing the risk of a breach of confidentiality.

Bayesian association rule mining

The emergence of large-scale medical record databases presents exciting opportunities for data-based personalized medicine. Prediction lies at the heart of personalized medicine, and in this paper we propose a statistical model for predicting patient-level sequences of medical symptoms. We develop a new approach for predicting the next event within a "current sequence," given a "sequence database" of past event sequences. Specifically, we propose the Hierarchical Association Rule Mining Model (HARM), which generates a set of association rules such as "dyspepsia and epigastric pain imply heartburn," indicating that dyspepsia and epigastric pain are commonly followed by heartburn. HARM produces a ranked list of these association rules. Both patients and caregivers can use the rules to guide medical decisions. Built-in explanations are a particular advantage of the association rule framework: the rule predicts heartburn because the patient has had dyspepsia and epigastric pain. Using the symptoms of many similar patients, our method provides predictions specialized to any given patient, even when little information about the patient's history of symptoms is available.
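To illustrate what ranking association rules for a current sequence can look like, the sketch below scores each candidate next event by plain empirical confidence: among past sequences in which every event of the current sequence has occurred, the fraction in which the candidate occurs afterwards. This is a simple non-hierarchical, non-Bayesian baseline used only for illustration, not HARM itself; the function name and the toy sequences in the usage note are hypothetical.

```python
from collections import Counter

def rank_next_events(sequence_db, current):
    """Rank candidate next events by empirical rule confidence.

    sequence_db: list of past event sequences (lists of event names).
    current: events observed so far for the patient of interest.
    Returns (event, confidence) pairs, highest confidence first.
    """
    lhs = set(current)
    lhs_count = 0
    followers = Counter()
    for seq in sequence_db:
        seen = set()
        complete_at = None
        # Find the first position by which all current events have occurred.
        for i, event in enumerate(seq):
            seen.add(event)
            if lhs <= seen:
                complete_at = i
                break
        if complete_at is None:
            continue  # this past sequence never contains the full left-hand side
        lhs_count += 1
        # Count each distinct later event once per sequence.
        for event in set(seq[complete_at + 1:]) - lhs:
            followers[event] += 1
    if lhs_count == 0:
        return []
    ranked = [(event, count / lhs_count) for event, count in followers.items()]
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked
```

With the toy database `[["dyspepsia", "epigastric pain", "heartburn"], ["epigastric pain", "dyspepsia", "heartburn", "nausea"], ["dyspepsia", "nausea"], ["dyspepsia", "epigastric pain", "heartburn"]]` and the current sequence `["dyspepsia", "epigastric pain"]`, heartburn is ranked first with confidence 1.0 and nausea second with confidence 1/3. Raw confidence of this kind becomes unreliable when few past sequences match the current one, which is the sparse-history setting that borrowing information across similar patients is meant to address.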