Search engine technology course here at Columbia!

Dragomir Radev is visiting from the University of Michigan and teaching this cool course, which I highly recommend to our computationally-minded statistics students.

COMS 6998-004: Search Engine Technology

Call number: 53013

Fall 2007

Thursdays 6:10-8 PM, starting September 6, 2007

Room 833, Mudd

Dragomir R. Radev

Expected Audience:

– graduate students in Computer Science and related areas
– undergraduate students, with instructor’s permission

Goal of the course

A significant portion of the information that surrounds us is in
textual format. A number of techniques for accessing such information
exist, ranging from databases to natural language processing.

Some of the most prestigious companies these days spend large amounts
of money to build intelligent search engines that allow casual users
to find what they want anytime, from anywhere, and in any language.

In this course, we will cover the theory and practice behind the
implementation of search engines, focusing on a wide range of topics
including methods for text storage and retrieval, the structure of the
Web as a graph, evaluation of systems, and user interfaces.

Instructor

Dragomir R. Radev obtained a PhD in Computer Science from Columbia
in 1999. He has taught numerous courses in Natural Language
Processing, Databases, and Information Retrieval at Columbia and
the University of Michigan. He has received a number of awards,
including the Columbia CS department award for graduate student
teaching, the University of Michigan UROP award for outstanding
research mentorship, and the Gosnell Prize for excellence in
political methodology. As an undergraduate he was on a team that
finished in the top 10 at the ACM International Collegiate
Programming Contest; he later coached Columbia’s team for four
years, taking it to three international finals. Dragomir has
worked in various capacities at Microsoft Research, IBM Research,
MITRE, AT&T Bell Labs, and Yahoo! He recently coached the US team
that took first place at the International Linguistics Olympiad.

Format

The class will meet once a week for 2 hours.

Students will be responsible for three assignments and a final
project. There is no final exam.

Assignments

Here are some sample assignments:
* building a search engine
* text classification for spam recognition
* building a question answering system
* Web graph analysis

Final project

An open-ended research project in one of two categories:
* Research paper – using the SIGIR format. Students will be in
charge of problem formulation, literature survey, hypothesis
formulation, experimental design, implementation, and possibly
submission to a conference like SIGIR or WWW.
* Software system – develop a working, useful system (including an
API). Students will be responsible for identifying a niche
problem and for implementing and deploying a solution, either on
the Web or as an open-source downloadable tool.

Readings

* required: Introduction to Information Retrieval by Manning, Raghavan,
and Schuetze [1]
(http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html).
Use the version of August 12/14, 2007.
* optional: Modeling the Internet and the Web: Probabilistic Methods
and Algorithms by Pierre Baldi, Paolo Frasconi, Padhraic Smyth,
Wiley, 2003, ISBN: 0-470-84906-1 [2] (http://ibook.ics.uci.edu).
* papers from SIGIR, WWW, and other venues

Tentative Syllabus

* Introduction.
* Queries and Documents. Models of Information retrieval. The
Boolean model. The Vector model.
* Document preprocessing. Tokenization. Stemming. The Porter
algorithm. Storing, indexing and searching text. Inverted
indexes.
* Word distributions. The Zipf distribution. The Benford
distribution. Heaps’ law. TF*IDF. Vector space similarity and
ranking (see the TF*IDF sketch after this syllabus).
* Retrieval evaluation. Precision and Recall. F-measure. Reference
collections. The TREC conferences.
* Automated indexing/labeling. Compression and coding. Optimal
codes.
* String matching. Approximate matching.
* Query expansion. Relevance feedback.
* Text classification. Naive Bayes. Feature selection. Decision
trees.
* Linear classifiers. k-nearest neighbors. Perceptron. Kernel
methods. Maximum-margin classifiers. Support vector
machines. Semi-supervised learning.
* Lexical semantics and WordNet.
* Latent semantic indexing. Singular value decomposition.
* Vector space clustering. k-means clustering. EM clustering.
* Crawling the web. Webometrics. Measuring the size of the
web. The Bow-tie model.
* Hypertext retrieval. Web-based IR. Document closures. Focused
crawling.
* Random graph models. Properties of random graphs: clustering
coefficient, betweenness, diameter, giant connected component,
degree distribution.
* Social network analysis. Small worlds and scale-free
networks. Power law distributions. Centrality.
* Graph-based methods. Harmonic functions. Random walks. PageRank
(see the PageRank sketch after this syllabus).
* Models of the web. Hubs and authorities. Bipartite graphs. HITS
and SALSA.
* Discovering communities. Spectral clustering.
* Semi-supervised passage retrieval. Question answering.
* Probabilistic models of IR. Language
models. Burstiness. Self-triggerability.
* Quick overview of topics such as: Collaborative
filtering. Recommendation systems. Information
extraction. Hidden Markov Models. Conditional Random
Fields. User behavior.
* Quick overview of topics such as: Adversarial IR. Spamming and
anti-spamming methods. Text summarization.
* Additional topics (if time permits):
XML retrieval. Text tiling. Human behavior on the web.
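
To give a concrete flavor of the vector-space material above, here is a
minimal TF*IDF ranking sketch in Python. It is illustrative only: the toy
corpus, the query, and the particular log-scaled IDF weighting are made up
for this example and are not taken from the course materials.

    import math
    from collections import Counter

    # Toy corpus (hypothetical documents, for illustration only).
    docs = {
        "d1": "the quick brown fox jumps over the lazy dog",
        "d2": "search engines index the web and rank documents",
        "d3": "vector space models rank documents by cosine similarity",
    }

    def tokenize(text):
        return text.lower().split()

    doc_tokens = {d: tokenize(t) for d, t in docs.items()}

    # Document frequency: number of documents containing each term.
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    N = len(docs)

    def tfidf_vector(tokens):
        # Raw term frequency times a log-scaled inverse document frequency.
        tf = Counter(tokens)
        return {t: tf[t] * math.log(N / df[t]) for t in tf if df.get(t)}

    def cosine(u, v):
        dot = sum(w * v.get(t, 0.0) for t, w in u.items())
        nu = math.sqrt(sum(w * w for w in u.values()))
        nv = math.sqrt(sum(w * w for w in v.values()))
        return dot / (nu * nv) if nu and nv else 0.0

    doc_vecs = {d: tfidf_vector(toks) for d, toks in doc_tokens.items()}
    q_vec = tfidf_vector(tokenize("rank documents by similarity"))

    # Print documents in decreasing order of cosine similarity to the query.
    for d in sorted(doc_vecs, key=lambda d: -cosine(q_vec, doc_vecs[d])):
        print(d, round(cosine(q_vec, doc_vecs[d]), 3))

A real engine would, of course, use an inverted index and more careful term
weighting; the sketch only shows how TF*IDF vectors and cosine scores fit
together.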
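
For the random-walk and PageRank topics, here is a similarly minimal
power-iteration sketch. The four-page link graph is made up, and the damping
factor of 0.85 is just the conventional choice; none of this is course code.

    # PageRank by power iteration on a tiny, hypothetical link graph.
    damping = 0.85

    # Adjacency list: page -> pages it links to (toy example).
    links = {
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }

    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(50):
        # Every page gets the "teleportation" share ...
        new_rank = {p: (1.0 - damping) / n for p in pages}
        # ... plus a share of the rank of every page linking to it.
        for p, outs in links.items():
            if outs:
                for q in outs:
                    new_rank[q] += damping * rank[p] / len(outs)
            else:
                # A dangling page distributes its rank uniformly.
                for q in pages:
                    new_rank[q] += damping * rank[p] / n
        rank = new_rank

    for p in sorted(rank, key=rank.get, reverse=True):
        print(p, round(rank[p], 4))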

References

[1] http://www-csli.stanford.edu/~schuetze/information-retrieval-book.html
[2] http://ibook.ics.uci.edu/