Data Mining (V0.3)
NOTE: Most of the material (slides, dates, etc.) is on http://courseworks.columbia.edu/. This page is merely intended to guide you there.
STAT W4240.001 DATA MINING
SCHOOL OF SO 903
Instructor: Aleks Jakulin, PhD
Purpose: To introduce students to the cutting edge of data mining, focusing on data analysis problems and data types. A student will be able to approach a data mining problem, analyze it, and identify tools to solve it quickly and efficiently. A distinguishing characteristic of data mining is its pragmatic use of an algorithm as a tool, rather than a focus on the underlying mathematics.
Method: Provide a conceptual framework for navigating the methodological cornucopia of data mining. With the framework, one can pick any tool or piece of software and apply it to the problem at hand.
Prerequisites:
- basic knowledge of probability and statistics
- some knowledge of programming
- quantitative attitude
- passion for truth
- helpful: an eye for graphics, communication/presentation skills, logic, linear algebra, numerical mathematics, advanced programming, computer graphics, visual design
Practical: Students will form teams and address a data mining problem of practical importance. All projects will be published online in the format of a web page. Teams should combine presentation, computational and mathematical skills.
Lecture Format: Each lecture will be centered on a case study. The case will be examined through four aspects: goal, computation, model and data. The aspects will be explained as needed for the case. There will be guest lectures on specific tools, guest talks and demonstrations (15 minutes, followed by a discussion), and just guests sitting in.
Grading: 60% project, 15% midterm (Mar 11), 25% final. The exams will focus on your ability to solve practical problems and find flaws in other types of analysis.
Ethics: Collaboration on the midterm and final is not allowed. All projects will be posted permanently online with a list of credits (who did what). If you use external help, credit it. Your own contributions need to be listed, and everyone will ask questions that will establish whether you did the work yourself.
Self study: I'll be posting good applied books and links here.
Most data mining tasks can be reduced to using tools of four basic categories:
System identification or reverse engineering
- Action, manipulation and causality
Active learning and experimentation
Anti-Data Mining: Privacy protection, obfuscation and data camouflage
Determinism and probability
Kernels and exemplars
Logic, rules, trees
Fitting, over-fitting, under-fitting, complexity
Model evaluation: cross-validation, test/training
Model stabilization: priors and regularization
Model addition: ensembles
Model subtraction: holding things constant
Computation: scalability, parallelization, numeric precision
- Missing data
- Wrong data
- Uncertain data
Links to previous classes:
Chris Volinsky (2009) (/DataMining on the website)