Data Mining

Syllabus and Description

David Madigan

Department of Statistics

Columbia University

madigan-at-stat.columbia.eeedddduuu

Data mining is a dynamic and fast-growing field at the interface of statistics and computer science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet commerce. The analysis of data on this scale presents exciting new computational and statistical challenges.

This course will provide an overview of current research in data mining and will be suitable for graduate students from many disciplines. The prerequisites for the class are basic computing proficiency and knowledge of elementary concepts in probability and statistics. The course will include a project component, which we will discuss in the first class or two.

Outline of Course Structure

The syllabus below describes in outline the material we hope to cover. This may change as we go, depending on time constraints and the interests of the students in the class. Data mining is a very broad area, encompassing ideas from statistics, machine learning, databases, and visualization. We cannot hope to cover all aspects of data mining in depth. Instead, the course aims to introduce some of the major concepts and explore a few of them in depth.

- Introduction and Motivation
- Databases and Data Warehousing
- Exploratory Data Analysis and Visualization
- An Overview of Data Mining Algorithms
- Modeling for Data Mining
- Descriptive Modeling
- Predictive Modeling
- Pattern and Rule Discovery
- Text Mining
- Bayesian Data Mining
- Observational Studies

Introduction and Motivation: What is data mining? How does it relate to knowledge discovery in databases? What is the data mining process? What are typical applications? What kinds of data do people mine? Introduction to the major classes of techniques: exploratory analysis, descriptive modeling, pattern and rule discovery, and retrieval by content.

Databases and Data Warehousing: Relational databases and SQL, the data warehousing process, data warehousing designs.

General principles, including model scoring, search, and optimization.

Textbook

T. Hastie, R. Tibshirani, and J. Friedman (2001) *The Elements of Statistical Learning: Data Mining, Inference, and Prediction*. Springer-Verlag.

This will be the primary text for the course.

In addition, we will use:

D. Hand, H. Mannila, and P. Smyth (2001) *Principles of Data Mining*. MIT Press.

J. Han and M. Kamber (2000) *Data Mining: Concepts and Techniques*. Morgan Kaufmann.