Data Mining

Syllabus and Description

David Madigan

Department of Statistics

Columbia University

madigan-at-stat.columbia.eeedddduuu

Data Mining is a dynamic and fast-growing field at the interface of Statistics and Computer Science. The emergence of massive datasets containing millions or even billions of observations provides the primary impetus for the field. Such datasets arise, for instance, in large-scale retailing, telecommunications, astronomy, computational biology, and internet commerce. The analysis of data on this scale presents exciting new computational and statistical challenges.

This course will provide an overview of current research in data mining and will be suitable for graduate students from many disciplines. The prerequisites for the class are basic computing proficiency as well as knowledge of elementary concepts in probability and statistics. The course will include a project component – we will discuss this in the first class or two.

Outline of Course Structure

The syllabus below describes in outline the material we hope to cover. This may change as we go, depending on time constraints and the interests of the students in the class. Data mining is a very broad area, encompassing ideas from statistics, machine learning, databases, and visualization. We cannot hope to cover all aspects of data mining in depth. Instead, the course aims to introduce some of the major concepts and explore a few of them in depth.

Introduction and Motivation

what is data mining? how does it relate to knowledge discovery in databases? what is the data mining process? what are typical applications? what kinds of data do people mine? introduction to the major classes of techniques: exploratory analysis, descriptive modeling, pattern and rule discovery, and retrieval by content.

Databases and Data Warehousing

relational databases and SQL, the data warehousing process, and data warehousing designs

Exploratory Data Analysis and Visualization

An Overview of Data Mining Algorithms

Modeling for Data Mining

general principles including model scoring, search and optimization

Descriptive Modeling

Predictive Modeling

Pattern and Rule Discovery

Text Mining

Bayesian Data Mining

Observational Studies

Textbook

T. Hastie, R. Tibshirani, and J. Friedman (2001) The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag.
This will be the primary text for the course.

In addition we will use:

D. Hand, H. Mannila, and P. Smyth (2001) Principles of Data Mining. MIT Press.

J. Han and M. Kamber (2000) Data Mining: Concepts and Techniques. Morgan Kaufmann.