Data Mining (V0.3)


NOTE: Most of the material (slides, dates, etc.) is on the course website. This page is merely intended to guide you there.


Time: TR (Tuesday and Thursday), 6:10 PM-7:25 PM


Instructor: Aleks Jakulin, PhD




Purpose: To introduce students to the cutting edge of data mining, focusing on data analysis problems and data types. A student will be able to approach a data mining problem, analyze it, and identify tools to solve it quickly and efficiently. A distinguishing characteristic of data mining is its pragmatic use of an algorithm as a tool, rather than a focus on the underlying mathematics.

Method: Provide a conceptual framework for navigating the methodological cornucopia of data mining. With the framework, one can pick any tool or piece of software and apply it to the problem at hand.


Prerequisites:

-       basic knowledge of probability and statistics

-       some knowledge of programming

-       quantitative attitude

-       passion for truth

-       helpful: an eye for graphics, communication/presentation skills, logic, linear algebra, numerical mathematics, advanced programming, computer graphics, visual design


Practical: Students will form teams and address a data mining problem of practical importance. All projects will be published online in the format of a web page. Teams should combine presentation, computational and mathematical skills.


Lecture Format: Each lecture will be centered on a case study. The case will be examined through four aspects: goal, computation, model and data. The aspects will be explained as needed for the case. There will be guest lectures on specific tools, guest talks and demonstrations (15 minutes, followed by a discussion), and just guests sitting in.


Grading: 60% project, 15% midterm (Mar 11), 25% final. The exams will focus on your ability to solve practical problems and find flaws in other types of analysis.
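To make the weighting concrete, here is a minimal sketch of how the three components would combine into one course score. It assumes each component is scored on the same 0-100 scale; the function name and the example scores are hypothetical and only illustrate the arithmetic.

```python
# Weights taken from the grading line above; everything else is illustrative.
WEIGHTS = {"project": 0.60, "midterm": 0.15, "final": 0.25}


def course_score(scores):
    """Weighted average of component scores (assumed to share one 0-100 scale)."""
    # Sanity check: the weights should account for the whole grade.
    if abs(sum(WEIGHTS.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)


# Hypothetical student, not real data.
example = {"project": 90, "midterm": 80, "final": 70}
print(course_score(example))
```

By hand, the same example is 0.60 x 90 + 0.15 x 80 + 0.25 x 70 = 54 + 12 + 17.5 = 83.5, which shows how strongly the project dominates the result.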


Ethics: Collaboration on the midterm and final is not allowed. All projects will be posted permanently online with a list of credits (who did what). If you use external help, credit it. Your own contributions need to be listed, and everyone will ask questions that will establish whether you did the work yourself.

Self study: I'll be posting good applied books and links here.


Conceptual Structure:


Most data mining tasks can be reduced to using tools from four basic categories:

-       Compression

-       Decision-making

-       System identification or reverse engineering

-       Exploration

-       Action, manipulation and causality

-       Anomaly detection

-       Active learning and experimentation

-       Anti-Data Mining: Privacy protection, obfuscation and data camouflage








-       Determinism and probability

-       Linear models

-       Kernels and exemplars

-       Logic, rules, trees










-       Fitting, over-fitting, under-fitting, complexity

-       Model evaluation: cross-validation, test/training

-       Model stabilization: priors and regularization

-       Model comparison

-       Model addition: ensembles

-       Model subtraction: holding things constant

-       Computation: scalability, parallelization, numeric precision
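
As a rough illustration of the "Model evaluation: cross-validation, test/training" topic listed above, the following sketch shows one common way to do k-fold cross-validation using only the Python standard library. It is not taken from the course material: the helper names (k_fold_indices, cross_validate, fit_mean, predict_mean), the squared-error metric and the toy "predict the training mean" model are all assumptions chosen to keep the example small.

```python
import random


def k_fold_indices(n_items, k, seed=0):
    """Shuffle range(n_items) and split it into k near-equal folds."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Every k-th shuffled index goes to the same fold.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, fit, predict, k=5, seed=0):
    """Return the mean squared error measured on each held-out fold."""
    errors = []
    for held_out in k_fold_indices(len(xs), k, seed):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        # Fit only on the training part; the held-out fold stays unseen.
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        squared = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        errors.append(sum(squared) / len(squared))
    return errors


# Toy model (illustrative only): always predict the mean of the training targets.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)


def predict_mean(model, x):
    return model


if __name__ == "__main__":
    xs = list(range(20))
    ys = [2.0 * x + 1.0 for x in xs]  # synthetic data, made up for the demo
    print(cross_validate(xs, ys, fit_mean, predict_mean, k=5))
```

Each data point is held out exactly once, so the per-fold errors estimate how the model behaves on data it was not fitted to. Comparing them with the error on the training data is one simple way to spot over-fitting.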






Data types:

-       Tabular

-       Relational

-       Temporal

-       Text

-       Network

-       Geographical

-       Structured

-       Exotic

-       Multi-instance

-       Multi-task


Data preparation:

-       Collecting

-       Cleaning

-       Structuring

-       Missing data

-       Wrong data

-       Uncertain data




Links to previous classes:

David Madigan (2008)

Chris Volinsky (2009) (/DataMining on the website)