Data Mining (V0.3)

NOTE: Most of the material (slides, dates, etc.) is posted at http://courseworks.columbia.edu/ and this page is merely intended to guide you there.

STAT W4240.001 DATA MINING
TR 06:10P-07:25P
SCHOOL OF SO 903

Instructor: Aleks Jakulin, PhD

Purpose: To introduce students to the cutting edge of data mining, focusing on data analysis problems and data types. A student will be able to approach a data mining problem, analyze it, and identify tools to solve it quickly and efficiently. A distinguishing characteristic of data mining is the pragmatic use of an algorithm as a tool, rather than a focus on its underlying mathematics.

Method: Provide a conceptual framework for navigating the methodological cornucopia of data mining. With the framework, one can pick any tool or piece of software and apply it to the problem at hand.

Prerequisites:

· basic knowledge of probability and statistics

· some knowledge of programming

· quantitative attitude

· passion for truth

· helpful: an eye for graphics, communication/presentation skills, logic, linear algebra, numerical mathematics, advanced programming, computer graphics, visual design

Practical: Students will form teams and address a data mining problem of practical importance. All projects will be published online in the format of a web page. Teams should combine presentation, computational and mathematical skills.

Lecture Format: Each lecture will be centered on a case study. The case will be examined through four aspects: goal, computation, model and data. The aspects will be explained as needed for the case. There will be guest lectures on specific tools, guest talks and demonstrations (15 minutes, followed by a discussion), and just guests sitting in.

Grading: 60% project, 15% midterm (Mar 11), 25% final. The exams will focus on your ability to solve practical problems and find flaws in other types of analysis.

Ethics: Collaboration on the midterm and final is not allowed. All projects will be posted permanently online with a list of credits (who did what). If you use external help, credit it. Your own contributions need to be listed, and everyone will ask questions that will establish whether you did the work yourself.

Self study: I’ll be posting good applied books and links here.

Conceptual Structure:

Most data mining tasks can be reduced to using tools from four basic categories (purpose, models, process and data):

Purpose

Summarization

Prediction

- Compression

- Decision-making

System identification or reverse engineering

- Exploration

- Action, manipulation and causality

Anomaly detection

Active learning and experimentation

Anti-Data Mining: Privacy protection, obfuscation and data camouflage

Models

Visualization

Determinism and probability

Linear models

Interaction

Clustering

Kernels and exemplars

Logic, rules, trees

Geometry

Nonlinearity

Process

Fitting, over-fitting, under-fitting, complexity

Model evaluation: cross-validation, test/training

Model stabilization: priors and regularization

Model comparison

Model addition: ensembles

Model subtraction: holding things constant

Computation: scalability, parallelization, numeric precision

Data

Data types:

- Tabular

- Relational

- Temporal

- Text

- Network

- Geographical

- Structured

- Exotic

- Multi-instance

- Multi-task

Data preparation:

- Collecting

- Cleaning

- Structuring

- Missing data

- Wrong data

- Uncertain data

Links to previous classes:

David Madigan (2008)

Chris Volinsky (2009) (/DataMining on the website)