Data Mining

Syllabus and Description

David Madigan

Department of Statistics

Columbia University





Data mining is a dynamic and fast-growing field at the interface of statistics and computer science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet commerce. The analysis of data on this scale presents exciting new computational and statistical challenges.

This course will provide an overview of current research in data mining and will be suitable for graduate students from many disciplines. The prerequisites for the class are basic computing proficiency and knowledge of elementary concepts in probability and statistics. The course will include a project component, which we will discuss in the first class or two.


Outline of Course Structure

The syllabus below describes in outline the material we hope to cover. This may change as we go, depending on time constraints and the interests of the students in the class. Data mining is a very broad area, encompassing ideas from statistics, machine learning, databases, and visualization. We cannot hope to cover all aspects of data mining in depth. Instead, the course aims to introduce some of the major concepts and explore a few of them in depth.


  1. Introduction and Motivation
     What is data mining? How does it relate to knowledge discovery in databases? What is the data mining process? What are typical applications? What kinds of data do people mine? Introduction to the major classes of techniques: exploratory analysis, descriptive modeling, pattern and rule discovery, and retrieval by content.

  2. Databases and Data Warehousing
     Relational databases and SQL, the data warehousing process, data warehousing designs.

  3. Exploratory Data Analysis and Visualization

  4. An Overview of Data Mining Algorithms

  5. Modeling for Data Mining
     General principles, including model scoring, search, and optimization.

  6. Descriptive Modeling

  7. Predictive Modeling

  8. Pattern and Rule Discovery

  9. Text Mining

  10. Bayesian Data Mining

  11. Observational Studies



T. Hastie, R. Tibshirani, and J. Friedman (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
This will be the primary text for the course.

In addition, we will use:

D. Hand, H. Mannila, and P. Smyth (2001) Principles of Data Mining. MIT Press.

J. Han and M. Kamber (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann.