Ideas for Projects

1. Bayesian Data Mining – finding interestingly large counts in massive tables.

DuMouchel, W. (1999). Bayesian data mining in large frequency tables, with an application to the FDA spontaneous reporting system. American Statistician, 53, 177-202.
Paper here. Slides available here.

DuMouchel, W. and Pregibon, D. (2001). Empirical Bayes screening for multi-item associations. Proc. KDD 2001, ACM Press, San Diego, CA.
available here.
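
A possible starting point for this project is the raw ratio of observed to expected counts in a two-way table. The sketch below assumes the expected counts come from row/column independence. The cited papers go further and shrink these ratios with an empirical Bayes model, which is not shown here. The function name and the toy table are illustrative only.

```python
def observed_to_expected(table):
    """Return N_ij / E_ij for every cell, with E_ij = row_i * col_j / total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    ratios = []
    for i, row in enumerate(table):
        ratios.append([
            n / (row_totals[i] * col_totals[j] / total)
            if row_totals[i] and col_totals[j] else float("nan")
            for j, n in enumerate(row)
        ])
    return ratios


# Toy drug-by-event count table (made-up numbers).
counts = [[20, 5], [10, 65]]
ratios = observed_to_expected(counts)
```

Cells with a ratio well above 1 are candidates for "interestingly large" counts, although ratios based on small expected counts are noisy, which is the motivation for the Bayesian shrinkage in the papers above.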

2. Data Squashing – compressing large datasets to facilitate statistical analysis.

DuMouchel, W., Volinsky, C., Johnson, T., Cortes, C., and Pregibon, D. (1999). Squashing flat files flatter. In: Proceedings of the Fifth ACM Conference on Knowledge Discovery and Data Mining, 6-15.
available here

See also this paper and this paper.

3. Recommender Systems – making recommendations based on past shopping/rating behavior

4. Delegate Sampling – ideas for tree building with massive data

Breiman, L. and Friedman, J. (1984). Tools for large data set analysis. In: Statistical Signal Processing, E.J. Wegman and J.G. Smith (Eds.), New York, M. Dekker, 191-197.

Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. In: Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 71-80.

Hulten, G., Spencer, L. and Domingos, P. (2001). Mining time-changing data streams. In: Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 97-106. San Francisco, CA: ACM Press.


5. Big Bayesian Networks – scaling algorithms for learning Bayesian networks

Nice tutorial on David Heckerman’s home page

6. Multiclass Classification – using coding theory ideas for multiclass classification

Dietterich, T. G. and Bakiri, G. (1995). Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2, 263-286.
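
To make the coding-theory idea concrete, here is a minimal sketch of the decoding step only. It assumes each class has been assigned a binary codeword, one binary classifier has been trained per bit, and a new case is assigned to the class whose codeword is nearest in Hamming distance to the predicted bits. The class labels, the 7-bit codebook, and the function names are illustrative assumptions, and the per-bit classifiers are not shown.

```python
def hamming(a, b):
    """Number of positions at which two equal-length bit tuples differ."""
    return sum(x != y for x, y in zip(a, b))


def ecoc_decode(bits, codebook):
    """Return the class whose codeword is closest to the predicted bits."""
    return min(codebook, key=lambda label: hamming(bits, codebook[label]))


# Illustrative codebook: every pair of codewords differs in 4 positions,
# so any single wrong bit is still decoded to the correct class.
codebook = {
    "class1": (1, 1, 1, 1, 1, 1, 1),
    "class2": (0, 0, 0, 0, 1, 1, 1),
    "class3": (0, 0, 1, 1, 0, 0, 1),
    "class4": (0, 1, 0, 1, 0, 1, 0),
}

# Suppose the seven bit-classifiers output class3's codeword with bit 1 flipped.
predicted_bits = (1, 0, 1, 1, 0, 0, 1)
decoded = ecoc_decode(predicted_bits, codebook)
```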


7. Mixture Transition Distributions – models for higher-order Markov chains

Berchtold, A. and Raftery, A.E. (1999). The Mixture Transition Distribution (MTD) Model for High-Order Markov Chains and Non-Gaussian Time Series. Technical Report no. 360, Department of Statistics, University of Washington, August 1999.
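
As a rough illustration, the sketch below assumes the basic MTD form, in which the probability of the next state is a weighted mixture of one-step transition probabilities taken from a single transition matrix, with one weight per lag. The weights are assumed to be non-negative and to sum to one. The names `lambdas` and `Q` and the numbers are illustrative, and the generalizations discussed in the report are not covered.

```python
def mtd_probability(next_state, history, lambdas, Q):
    """P(next_state | history) under a basic MTD model.

    history[g] is the state observed at lag g + 1 (most recent first),
    lambdas[g] is the mixture weight for that lag, and Q[i][j] is the
    shared one-step transition probability from state i to state j.
    """
    return sum(lam * Q[past][next_state] for lam, past in zip(lambdas, history))


# Two states, second-order chain (made-up parameters).
Q = [[0.9, 0.1],
     [0.2, 0.8]]
lambdas = [0.7, 0.3]
history = [0, 1]  # state 0 at lag 1, state 1 at lag 2

p_next_0 = mtd_probability(0, history, lambdas, Q)
p_next_1 = mtd_probability(1, history, lambdas, Q)
```

The appeal for high-order chains is the parameter count: this form needs one transition matrix plus one weight per lag, rather than a full table indexed by every possible history.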

8. Global partial orders from sequential data

Mannila, H. and Meek, C. (2000). Global partial orders from sequential data. Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, 161-168.

9. Probabilistic models for query approximation

Pavlov, D., Mannila, H. and Smyth, P. (2000). Probabilistic models for query approximation with large sparse binary data sets. UAI-2000.

10. Adaptive bagging

11. Pasting bites together for prediction in large data sets and on-line

12. Hierarchical model-based clustering for large datasets

13. Monte Carlo importance sampling for large-scale Bayesian analysis

Ask me and I can give you a paper about this.

14. Retrieval properties of large text corpora

Information retrieval investigators have noticed that retrieval performance (i.e., the proportion of relevant documents returned in response to a query) improves as the size of the document collection increases. The project should conduct some simulations to see if this phenomenon is universal. Come and talk to me if you are interested in this.

15. Bayesian model averaging for logistic regression

"Bayesian model averaging" is a technique to improve the predictive performance of models such as classifiers. It has not really been evaluated in the context of logistic regression. Project: evaluate Bayesian model averaging (BMA) for logistic regression using datasets from the UCI machine learning repository. Software to do BMA is at

16. Bayesian model averaging versus Mixtures of Experts

Compare the predictive performance of Bayesian model averaging to so-called mixtures-of-experts models. BMA software is at: ME software is at

17. Text categorization for disputed authorship

Text categorization concerns the automatic assignment of documents to categories. The idea is to learn a categorization algorithm from a set of hand-labeled documents. There are lots of interesting statistical/data mining ideas in this area. One application concerns disputed authorship: given samples of particular authors' works, assign disputed samples to the right author. The classic work in this area is very old and updating it would make for a nice project. See:

18. Text mining adverse event reports

Various public databases record adverse reactions to medical products. The free-text component of these data is of considerable interest to various parties. Key problems include: (1) automatic assignment of adverse event codes to the free-text reports, (2) recovering the temporal sequence of the adverse events, and (3) mapping verbatim drug names to a canonical form. Lots of possible projects here.
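
For problem (3), a simple baseline is approximate string matching against a list of canonical names. The sketch below uses Python's standard `difflib` module. The three-name lexicon, the function name, and the 0.8 similarity cutoff are assumptions for illustration. A real project would need a proper drug dictionary and would have to handle brand names, abbreviations, and combination products.

```python
import difflib

# Hypothetical canonical lexicon, for illustration only.
CANONICAL = ["aspirin", "ibuprofen", "acetaminophen"]


def canonical_drug_name(verbatim, lexicon=CANONICAL, cutoff=0.8):
    """Map a verbatim drug string to its closest canonical name, or None."""
    cleaned = verbatim.strip().lower()
    matches = difflib.get_close_matches(cleaned, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

For example, `canonical_drug_name(" Asprin ")` returns `"aspirin"`, while a string that resembles nothing in the lexicon returns `None` and can be routed to manual review.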

19. Compare regularized logistic regression to random forests

Logistic regression is a long-established statistical method. Recent work has shown that logistic regression, suitably regularized, works as well as trendier methods like boosted trees and support vector machines on very large-scale problems. There are a few possible projects here that could result in a nice paper.
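
To fix ideas about what "suitably regularized" might mean, here is a minimal sketch assuming an L2 (ridge) penalty fitted by plain batch gradient descent. The penalty type, the step size, the number of epochs, and the toy data are all assumptions for illustration. A serious comparison would use an established library implementation and tune the penalty by cross-validation.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic_l2(xs, ys, lam=0.1, lr=0.1, epochs=500):
    """Minimize mean log-loss + (lam / 2) * ||w||^2; the bias is not penalized."""
    n, d = len(xs), len(xs[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * (gw / n + lam * wi) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_proba(x, w, b):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Toy one-feature data set (made up): negative x is class 0, positive x is class 1.
xs = [[-2.0], [-1.0], [1.0], [2.0]]
ys = [0, 0, 1, 1]
w, b = fit_logistic_l2(xs, ys)
```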

20. Location determination in wireless networks

The basic idea here is to try to figure out a user's physical location from the strengths of the signals to various access points. There is a lot of interest in this problem. Here is a web page on location determination. The papers by Bahl et al. describing the RADAR system are especially nice.
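
One simple way to approach this is fingerprinting: record the signal strengths seen at a set of known locations, then assign a new observation to the surveyed location whose recorded signal vector is closest. The sketch below assumes Euclidean distance in signal space and a tiny made-up radio map with three access points. It is meant in the spirit of the nearest-neighbor idea and is not the RADAR authors' implementation.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two signal-strength vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def locate(observed, radio_map):
    """Return the surveyed location whose stored signal vector is closest."""
    return min(radio_map, key=lambda loc: squared_distance(observed, radio_map[loc]))


# Hypothetical radio map: (x, y) position in meters -> signal strength (dBm)
# from each of three access points.
radio_map = {
    (0.0, 0.0): (-40, -70, -80),
    (10.0, 0.0): (-70, -45, -75),
    (5.0, 8.0): (-75, -72, -42),
}

estimate = locate((-68, -48, -77), radio_map)
```

Natural extensions for a project include averaging the k nearest fingerprints and modeling the signal strength at each location as a distribution rather than a single vector.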

21. Orthographic analysis

Given a list of Hispanic first and last names and a list of non-Hispanic first and last names, build a predictive model that accurately classifies future names as Hispanic or non-Hispanic.
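
One plausible feature representation for such a model is the set of character n-grams in each name, with markers for the start and end of the name so that prefixes and suffixes are distinguishable. The sketch below covers feature extraction only. The boundary symbols `^` and `$`, the default of n = 2, and the function name are assumptions for illustration.

```python
def char_ngrams(name, n=2):
    """Return the character n-grams of a name, with ^ and $ as boundary markers."""
    padded = "^" + name.strip().lower() + "$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


features = char_ngrams("Lopez")
```

Counts of these n-grams could then be fed to any standard classifier, with predictive accuracy estimated on held-out names.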