Truth in Data

David Blei is teaching this cool new course at Princeton in the fall. I’ll give the description and then my thoughts.

COS597A: Truth in Data

In COS513, we covered the fundamentals of probabilistic modeling: How to build models, how to fit models to data, and how to infer unknown quantities based on those fitted models. This suite of computational problems is fundamental to many modern machine learning algorithms, with applications in information retrieval, computer vision, computational linguistics, and bioinformatics.

In traditional machine learning tasks, like search and classification, a model is only as good as its measured performance. (A better spam filter will filter more spam; a better object recognizer will recognize more objects.) Many recent applications of probabilistic modeling, however, are towards more exploratory ends. We can build models to identify the hidden community structure of a social network, the hidden thematic structure of a corpus of documents, or the hidden patterns of genes that govern our biology.

Evaluating models built for exploration is tricky, and this is the problem that we will discuss. Many questions arise: How can we interpret the results of a probabilistic model? What can we say about data based on such an analysis? How can we test our modeling assumptions? How can we diagnose where and when they go wrong? How should we change our model based on these diagnoses?

Computational methods for answering these questions are essential to drawing sound conclusions from data. Most previous work addresses small, low-dimensional data sets and simple statistical models; methods that can handle complex models for massive, high-dimensional data are hard to come by. Our goal will be to come up with a few new ones.

Prerequisites: COS513 or graduate-level background in statistical modeling. Students are expected to be comfortable with probability theory, model building, approximate inference methods (e.g., MCMC), and a statistical programming language (e.g., R or Matlab).

This looks great. I think that “confidence building” and “model understanding” are hugely important topics that don’t fit into the usual statistical theory. See here and here. It’s a wide-open research area.

I’ve been thinking about this stuff for a long time, but especially recently with all the multilevel models I’ve been fitting. For more discussion regarding a specific example, see here.
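One concrete tool for this kind of model checking is the posterior predictive check: simulate replicate data sets from the fitted model and compare a tail-sensitive test statistic against the observed data. The course description doesn't name a specific method, so this is just a minimal sketch in Python (NumPy only), using a plug-in point estimate in place of full posterior draws and a deliberately heavy-tailed data set as the toy example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "observed" data: heavy-tailed (t with 3 df), but we will
# (wrongly) fit a normal model to it.
y = rng.standard_t(df=3, size=200)

# Plug-in fit: sample mean and sd stand in for a full posterior.
# A real check would also draw (mu, sigma) from the posterior.
mu_hat, sigma_hat = y.mean(), y.std(ddof=1)

# Test statistic: maximum absolute value, chosen to be sensitive
# to the tails where the normal model should fail.
def T(data):
    return np.abs(data).max()

# Simulate replicate data sets under the fitted model and compute
# the statistic on each one.
t_rep = np.array([
    T(rng.normal(mu_hat, sigma_hat, size=y.size))
    for _ in range(1000)
])

# Posterior predictive p-value: fraction of replicates at least as
# extreme as the observed statistic. Values near 0 or 1 flag misfit.
p_value = (t_rep >= T(y)).mean()
print(f"observed T = {T(y):.2f}, p-value = {p_value:.3f}")
```

Because the data are heavier-tailed than the model allows, the observed maximum should sit in the extreme of the replicate distribution, signaling the misfit that a pure predictive-accuracy score might never surface.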

And for an impressionistic view of the whole topic, see here.

1 thought on “Truth in Data”

  1. Looks interesting, but "Truth in Data" is a bit deceptive. It's not about QA (the most common industry use of that term) but more "Subjective Data Analysis".

    I don't mean "subjective" pejoratively, but descriptively. Lacking a specific, obvious criterion (e.g. accuracy in filtering spam), such analysis is by nature more subjective. A common example would be customer segmentation, in which multiple segmentation schemes may be in operation in a company at one time because of different goals of the segmentation. If a client asks for "the true customer segments", it's best to laugh out loud just to get the issue out in the open.
