Data Mining (V0.3)

 

NOTE: Most of the material (slides, dates, etc.) is posted at http://courseworks.columbia.edu/ and this page is only meant to guide you there.

 

STAT W4240.001 DATA MINING
TR     06:10P-07:25P
SCHOOL OF SO 903

 

Instructor: Aleks Jakulin, PhD

 

 

 

Purpose: To introduce students to the cutting edge of data mining, focusing on data analysis problems and data types. A student will be able to approach a data mining problem, analyze it, and identify tools to solve it quickly and efficiently. A distinguishing characteristic of data mining is its pragmatic use of an algorithm as a tool, rather than a focus on the underlying mathematics.


Method: Provide a conceptual framework for navigating the methodological cornucopia of data mining. With the framework, one can pick any tool or piece of software and apply it to the problem at hand.

 

Prerequisites:

       basic knowledge of probability and statistics

       some knowledge of programming

       quantitative attitude

       passion for truth

       helpful: an eye for graphics, communication/presentation skills, logic, linear algebra, numerical mathematics, advanced programming, computer graphics, visual design

 

Practical: Students will form teams and address a data mining problem of practical importance. All projects will be published online in the format of a web page. Teams should combine presentation, computational and mathematical skills.

 

Lecture Format: Each lecture will be centered on a case study. The case will be examined through four aspects: goal, computation, model and data. The aspects will be explained as needed for the case. There will be guest lectures on specific tools, guest talks and demonstrations (15 minutes, followed by a discussion), and just guests sitting in.

 

Grading: 60% project, 15% midterm (Mar 11), 25% final. The exams will focus on your ability to solve practical problems and find flaws in other types of analysis.

 

Ethics: Collaboration on the midterm and final is not allowed. All projects will be posted permanently online with a list of credits (who did what). If you use external help, credit it. Your own contributions need to be listed, and everyone will ask questions that will establish whether you did the work yourself.


Self study: I'll be posting good applied books and links here.

 

Conceptual Structure:

 

Most data mining tasks can be reduced to using tools from four basic categories:

 

 

Purpose

 

 

Summarization

 

Prediction

-       Compression

-       Decision-making

 

System identification or reverse engineering

-       Exploration

-       Action, manipulation and causality

 

Anomaly detection

 

Active learning and experimentation

 

Anti-Data Mining: Privacy protection, obfuscation and data camouflage

 

 

Models

 

 

Visualization

 

Determinism and probability

 

Linear models

 

Interaction

 

Clustering

 

Kernels and exemplars

 

Logic, rules, trees

 

Geometry

 

Nonlinearity

 

 

Process

 

 

Fitting, over-fitting, under-fitting, complexity

 

Model evaluation: cross-validation, test/training

 

Model stabilization: priors and regularization

 

Model comparison

 

Model addition: ensembles

 

Model subtraction: holding things constant

 

Computation: scalability, parallelization, numeric precision

 

 

Data

 

 

Data types:

-       Tabular

-       Relational

-       Temporal

-       Text

-       Network

-       Geographical

-       Structured

-       Exotic

-       Multi-instance

-       Multi-task

 

Data preparation:

-       Collecting

-       Cleaning

-       Structuring

-       Missing data

-       Wrong data

-       Uncertain data

 

 

 

Links to previous classes:

David Madigan (2008)

Chris Volinsky (2009) (/DataMining on the website)