Social networks
I'm currently working on statistical models to infer network structure from questions on standard surveys. Specifically, we use questions of the form "How many X's do you know?", where X represents a population of interest. These data, known as Aggregated Relational Data (ARD), use survey respondents' social networks to reach members of the population of interest, X. ARD are commonly used to estimate features of populations that are difficult to sample directly, such as the homeless or persons with HIV/AIDS (see, for example, Killworth et al. (1998)). Recently, ARD have also been used to compare levels of segregation by race and across other dimensions of potential social cleavage in contemporary America (DiPrete et al., to appear).
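To make the ARD idea concrete, here is a minimal sketch of the basic scale-up calculation that this literature builds on. It is not one of the models developed in this research. The function names, the toy counts, and the population size are all hypothetical, and the sketch assumes that respondents know members of every group in proportion to that group's share of the population.

```python
def estimate_degree(known_counts, group_sizes, population_size):
    """Scale-up estimate of one respondent's personal network size.

    known_counts[k] is the respondent's answer to "How many X_k's do you know?"
    for a group X_k whose total size group_sizes[k] is known (e.g. from a census).
    """
    if len(known_counts) != len(group_sizes):
        raise ValueError("one count is needed per group of known size")
    # Fraction of the known-size groups the respondent reports knowing,
    # scaled up to the whole population.
    return population_size * sum(known_counts) / sum(group_sizes)


def estimate_hidden_size(hidden_counts, degrees, population_size):
    """Scale-up estimate of the size of a hard-to-sample population.

    hidden_counts[i] is respondent i's reported number of acquaintances in the
    population of interest; degrees[i] is that respondent's estimated degree.
    """
    return population_size * sum(hidden_counts) / sum(degrees)


# Hypothetical toy data: two respondents, three groups of known size.
population = 1_000_000
group_sizes = [10_000, 5_000, 5_000]
degrees = [
    estimate_degree([3, 1, 2], group_sizes, population),
    estimate_degree([1, 1, 0], group_sizes, population),
]
hidden_size = estimate_hidden_size([2, 0], degrees, population)
```

The proportionality assumption is the weak point of this simple estimator: respondents differ in how likely they are to know members of particular groups, and recall of acquaintances is imperfect.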
My dissertation research, with Professor Tian Zheng, extends our previous work on estimating specific population and respondent features to estimate relationships between groups using ARD. We develop a latent space model where the propensity for an individual to know members of a given group is independent, given the positions of the individual and the group in a latent "social space," similar to latent space models for complete networks (Hoff 2005). Using this framework, we estimate the relative homogeneity of groups and describe variation in the propensity for interaction between respondents and population members.
For a summary of our current work using latent space models, please see our poster from the recent Joint Statistical Meetings in Vancouver or the associated handout. Also, here is a flyer we created which describes our method for a general audience.
- McCormick, T. H., Salganik, M. J. and Zheng, T. (2010). "How many people do you know?: Efficiently estimating personal network size," Journal of the American Statistical Association, 105, 59-70.
- DiPrete, T. D., Gelman, A., McCormick, T. H., Teitler, J., and Zheng, T. (To appear). "Segregation in social networks based on acquaintanceship and trust," American Journal of Sociology.
- McCormick, T. H. and Zheng, T. (2010). "Latent demographic profile estimation in at-risk populations," Under review.
- McCormick, T. H. and Zheng, T. (2010). "A latent space representation of overdispersed relative propensity in 'How many X's do you know?' data," in Conference Proceedings of the Joint Statistical Meetings, Vancouver, B.C.
- McCormick, T. H., Ruf, J., Moussa, A., DiPrete, T. D., Gelman, A., Teitler, J., and Zheng, T. (2010). "A practical guide to measuring social structure using indirectly observed network data," Under review.
- McCormick, T. H. and Zheng, T. (2009). "Towards a unified framework for inference in Aggregated Relational Data," in Conference Proceedings of the Joint Statistical Meetings, Washington, D.C.
- McCormick, T. H., Ruf, J., Moussa, A., DiPrete, T. D., Gelman, A., Teitler, J., and Zheng, T. (2009). "Measuring social distance using indirectly observed network data," in Conference Proceedings of the Joint Statistical Meetings, Washington, D.C.
- McCormick, T. H. and Zheng, T. (2007). "Adjusting for recall bias in 'How many X's do you know?' surveys," in Conference Proceedings of the Joint Statistical Meetings, Salt Lake City, Utah.
Dynamic modeling
I am also working on developing an online binary classification procedure for cases when there is uncertainty about the model to use and when parameters within a model change over time. We address model uncertainty through Dynamic Model Averaging (DMA), a dynamic extension of Bayesian Model Averaging (BMA) in which posterior model probabilities are also allowed to change with time. Our method accommodates different levels of change in the data-generating mechanism by calibrating a "forgetting" factor. We propose an algorithm which adjusts the level of forgetting in a completely online fashion using the posterior predictive distribution.
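The role of the forgetting factor in the model probabilities can be illustrated with a small sketch. It assumes a common forgetting-based formulation in which the previous model probabilities are raised to a power between 0 and 1 before each update. The function name `dma_update`, the two fixed "models," and the forgetting value of 0.95 are hypothetical. The sketch holds the forgetting factor fixed and does not show the online tuning of forgetting or the within-model parameter updates.

```python
def dma_update(probs, predictive_likelihoods, forgetting=0.99, floor=1e-12):
    """One online update of posterior model probabilities with forgetting.

    probs: model probabilities after the previous observation.
    predictive_likelihoods: for each model, the probability its one-step-ahead
        predictive distribution assigned to the outcome just observed.
    forgetting: value in (0, 1]; 1.0 gives the usual Bayesian update, smaller
        values discount past evidence more heavily.
    """
    # Prediction step: flatten the probabilities toward uniform so that a
    # model that performed poorly in the past can recover quickly.
    flattened = [max(p, floor) ** forgetting for p in probs]
    total = sum(flattened)
    predicted = [f / total for f in flattened]

    # Update step: reweight by how well each model predicted the new outcome.
    weighted = [p * lik for p, lik in zip(predicted, predictive_likelihoods)]
    norm = sum(weighted)
    return [w / norm for w in weighted]


# Two hypothetical fixed models for a binary outcome: model 0 always predicts
# P(y = 1) = 0.8 and model 1 always predicts P(y = 1) = 0.2.
model_p1 = (0.8, 0.2)
probs = [0.5, 0.5]
history = []
for y in [1] * 20 + [0] * 20:  # the data-generating mechanism shifts halfway
    likelihoods = [p if y == 1 else 1.0 - p for p in model_p1]
    probs = dma_update(probs, likelihoods, forgetting=0.95)
    history.append(probs)
```

In this toy run the weight moves to model 0 during the first twenty observations and back to model 1 after the shift. With `forgetting=1.0` the evidence from the first half would take much longer to overturn.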
We apply our method to data from children with appendicitis who receive either a traditional (open) appendectomy or a laparoscopic procedure. Our results indicate that the factors associated with which children receive a particular type of procedure changed significantly over the seven years of data collection, a feature that is not captured using standard regression modeling. Because our procedure can be implemented completely online, future data collection for similar studies would only require storing sensitive patient information temporarily, reducing the risk of a breach of confidentiality.
- McCormick, T. H., Raftery, A. E., Madigan, D., Burd, R. "Dynamic logistic regression and dynamic model averaging for binary classification," Under revision, Biometrics.
Bayesian association rule mining
The emergence of large-scale medical record databases presents exciting opportunities for data-based personalized medicine. Prediction lies at the heart of personalized medicine, and in this paper we propose a statistical model for predicting patient-level sequences of medical symptoms. We develop a new approach for predicting the next event within a "current sequence," given a "sequence database" of past event sequences. Specifically, we propose the Hierarchical Association Rule Mining Model (HARM), which generates a set of association rules such as "dyspepsia and epigastric pain imply heartburn," indicating that dyspepsia and epigastric pain are commonly followed by heartburn. HARM produces a ranked list of these association rules. Both patients and caregivers can use the rules to guide medical decisions. Built-in explanations are a particular advantage of the association rule framework: the rule predicts heartburn because the patient has had dyspepsia and epigastric pain. Using the symptoms of many similar patients, our method provides predictions specialized to any given patient, even when little information about the patient's history of symptoms is available.
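As a rough illustration of ranking association rules for a current sequence, the sketch below scores each rule by its raw empirical confidence in a sequence database. This is a plain counting baseline, not HARM: it does no hierarchical modeling and does not share information across similar patients. The function names and the toy database are hypothetical.

```python
from collections import Counter
from itertools import combinations


def count_rules(sequence_db, max_lhs=2):
    """Count, per sequence, which left-hand sides occur and which rules hold.

    A rule (lhs, rhs) holds in a sequence if every event in lhs occurs
    before some occurrence of rhs in that sequence.
    """
    lhs_counts, rule_counts = Counter(), Counter()
    for seq in sequence_db:
        seen_lhs, seen_rules = set(), set()
        items = sorted(set(seq))
        for size in range(1, max_lhs + 1):
            seen_lhs.update(combinations(items, size))
        for j, event in enumerate(seq):
            earlier = sorted(set(seq[:j]))
            for size in range(1, max_lhs + 1):
                for lhs in combinations(earlier, size):
                    if event not in lhs:
                        seen_rules.add((lhs, event))
        # Each sequence contributes at most once to every count.
        lhs_counts.update(seen_lhs)
        rule_counts.update(seen_rules)
    return lhs_counts, rule_counts


def rank_rules(sequence_db, current_sequence, max_lhs=2):
    """Return applicable rules as (confidence, lhs, rhs), best first."""
    lhs_counts, rule_counts = count_rules(sequence_db, max_lhs)
    current = set(current_sequence)
    ranked = [
        (n_rule / lhs_counts[lhs], lhs, rhs)
        for (lhs, rhs), n_rule in rule_counts.items()
        if set(lhs) <= current and rhs not in current
    ]
    # Highest confidence first; ties go to the more specific left-hand side.
    ranked.sort(key=lambda r: (-r[0], -len(r[1]), r[1], r[2]))
    return ranked


# Hypothetical toy sequence database of past patients' symptom sequences.
db = [
    ["dyspepsia", "epigastric pain", "heartburn"],
    ["dyspepsia", "epigastric pain", "heartburn"],
    ["dyspepsia", "nausea"],
    ["epigastric pain", "dyspepsia", "heartburn"],
]
ranking = rank_rules(db, ["dyspepsia", "epigastric pain"])
```

Each ranked entry carries its own explanation, since the left-hand side lists the symptoms that triggered the prediction. Raw confidence becomes unreliable when a left-hand side appears in only a few sequences, which is the sparse-data situation the hierarchical model is meant to handle.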
- McCormick, T. H., Rudin, C., and Madigan, D. "A hierarchical model for association rule mining of sequential events: an approach to automated medical symptom prediction," Under review. Also available through the MIT Sloan Research Paper Series.