Wednesday, April 19, 2006

[New paper in the literature]: A systematic comparison and evaluation of biclustering methods for gene expression data

Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(9):1122-1129; doi:10.1093/bioinformatics/btl060

Amela Prelić, Stefan Bleuler*, Philip Zimmermann, Anja Wille, Peter Bühlmann, Wilhelm Gruissem, Lars Hennig, Lothar Thiele and Eckart Zitzler

Motivation: In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, which is often referred to as biclustering, makes it possible to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters as well as with other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of the biclustering method are currently available.

Results: First, this paper provides a methodology for comparing and validating biclustering methods that includes a simple binary reference model. Although this model captures the essential features of most biclustering approaches, it is still simple enough to exactly determine all optimal groupings; to this end, we propose a fast divide-and-conquer algorithm (Bimax). Second, we evaluate the performance of five salient biclustering algorithms together with the reference model and a hierarchical clustering method on various synthetic and real datasets for Saccharomyces cerevisiae and Arabidopsis thaliana. The comparison reveals that (1) biclustering in general has advantages over a conventional hierarchical clustering approach, (2) there are considerable performance differences between the tested methods and (3) even the simple reference model delivers relevant patterns within all considered settings.

Availability: The datasets used, the outcomes of the biclustering algorithms and the Bimax implementation for the reference model are available at


Supplementary information: Supplementary data are available at

Tuesday, April 18, 2006

[New paper in the literature]: Family-based designs in the age of large-scale gene-association studies

Nature Reviews Genetics 7, 385-394
Nan M. Laird and Christoph Lange

Abstract: Both population-based and family-based designs are commonly used in genetic association studies to locate genes that underlie complex diseases. The simplest version of the family-based design — the transmission disequilibrium test — is well known, but the numerous extensions that broaden its scope and power are less widely appreciated. Family-based designs have unique advantages over population-based designs, as they are robust against population admixture and stratification, allow both linkage and association to be tested for and offer a solution to the problem of model building. Furthermore, the fact that family-based designs contain both within- and between-family information has substantial benefits in terms of multiple-hypothesis testing, especially in the context of whole-genome association studies.

Monday, April 17, 2006

Papers on Cross-Validation and Bootstrap

Estimating the Error Rate of a Prediction Rule: Improvement of Cross Validation
B Efron. (1983) JASA

Improvements on Cross-Validation: The .632+ Bootstrap Method
B Efron and R Tibshirani (1997) JASA

On Bias, Variance, 0/1-Loss, and the Curse-of-Dimensionality
JH Friedman. (1997) Data mining and Knowledge Discovery.

Friday, April 14, 2006

Will books be endangered species soon?

The Science article "Who needs books" is part of a series of articles on scientific publishing. The sentence I liked most in this article is: "Would Darwin need a publisher now? Would he even write a book?" I can't help musing: would Darwin just put up a blog?

Yesterday, a brilliant young man, Atif E Gulab, came into my office. He is currently a freshman in SEAS at Columbia. He said he would like to do a summer independent research project: launching an online magazine on statistics and statistics education for high school students and college non-majors. He has started a blog on that.

During our discussion, one topic came up: why online? The answer is not hard to see. There are so many things that an online magazine can carry that a regular one can't: tons of pictures, streaming-video interviews, music pieces, comments, etc.

I showed Atif the projects my W1111 students accomplished last semester using a wiki. Several projects employed multimedia in their data collection. The project reports are only fully interesting when you can actually see the pictures they used and hear the music they played to their survey respondents and experiment subjects. It is just like how the illustrated version of The Da Vinci Code, with pictures of all the discussed symbols and artworks, makes reading the book so much more interesting. I bet Dan Brown would not want to write a controversial story about a musician, since he could not show readers the actual music in a printed book.

Atif said that maybe in 10 years no one will read a real book any more. Well, I don't think so. I love books. I love holding them in my hands. Atif does not seem attached to actual printed books at all. Maybe after the passing of my generation, books might really become an endangered species.

Thursday, April 13, 2006

Joint Modeling of Linkage and Association: Identifying SNPs Responsible for a Linkage Signal

Am. J. Hum. Genet., 76:934-949, 2005
Mingyao Li, Michael Boehnke, and Gonçalo R. Abecasis

Abstract: Once genetic linkage has been identified for a complex disease, the next step is often association analysis, in which single-nucleotide polymorphisms (SNPs) within the linkage region are genotyped and tested for association with the disease. If a SNP shows evidence of association, it is useful to know whether the linkage result can be explained, in part or in full, by the candidate SNP. We propose a novel approach that quantifies the degree of linkage disequilibrium (LD) between the candidate SNP and the putative disease locus through joint modeling of linkage and association. [Read more by following the link above]

Genome-wide strategies for detecting multiple loci that influence complex diseases

Nature Genetics 37, 413 - 417 (2005)
Published online: 27 March 2005; doi:10.1038/ng1537

Jonathan Marchini, Peter Donnelly, Lon R Cardon

Abstract: After nearly 10 years of intense academic and commercial research effort, large genome-wide association studies for common complex diseases are now imminent. Although these conditions involve a complex relationship between genotype and phenotype, including interactions between unlinked loci, the prevailing strategies for analysis of such studies focus on the locus-by-locus paradigm. Here we consider analytical methods that explicitly look for statistical interactions between loci. We show first that they are computationally feasible, even for studies of hundreds of thousands of loci, and second that even with a conservative correction for multiple testing, they can be more powerful than traditional analyses under a range of models for interlocus interactions. We also show that plausible variations across populations in allele frequencies among interacting loci can markedly affect the power to detect their marginal effects, which may account in part for the well-known difficulties in replicating association results. These results suggest that searching for interactions among genetic loci can be fruitfully incorporated into analysis strategies for genome-wide association studies.

Maximum-likelihood estimation of haplotype frequencies in nuclear families

Genetic Epidemiology 27:21 - 32
Tim Becker, Michael Knapp

Abstract: The importance of haplotype analysis in the context of association fine mapping of disease genes has grown steadily over the last years. Since experimental methods to determine haplotypes on a large scale are not available, phase has to be inferred statistically. For individual genotype data, several reconstruction techniques and many implementations of the expectation-maximization (EM) algorithm for haplotype frequency estimation exist. Recent research work has shown that incorporating available genotype information of related individuals largely increases the precision of haplotype frequency estimates. We, therefore, implemented a highly flexible program written in C, called FAMHAP, which calculates maximum likelihood estimates (MLEs) of haplotype frequencies from general nuclear families with an arbitrary number of children via the EM-algorithm for up to 20 SNPs. For more loci, we have implemented a locus-iterative mode of the EM-algorithm, which gives reliable approximations of the MLEs for up to 63 SNP loci, or less when multi-allelic markers are incorporated into the analysis. Missing genotypes can be handled as well. The program is able to distinguish cases (haplotypes transmitted to the first affected child of a family) from pseudo-controls (non-transmitted haplotypes with respect to the child). ... [Read more by following the link above] © 2004 Wiley-Liss, Inc.

Several multiclass gene expression papers

A mixture model-based strategy for selecting sets of genes in multiclass response microarray experiments.
Bioinformatics. 2004 Nov 1;20(16):2562-71
Broet P, Lewin A, Richardson S, Dalmasso C, Magdelenat H.

A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression.
Bioinformatics. 2004 Oct 12;20(15):2429-37.
Li T, Zhang C, Ogihara M.

BagBoosting for tumor classification with gene expression data.
Bioinformatics. 2004 Dec 12;20(18):3583-93
Dettling M.

For more papers on this topic:
Pubmed keywords: multiclass (or multi-class) AND gene expression

Wednesday, April 12, 2006

Bayesians, Frequentists, and Scientists

JASA 2005, vol. 100, no. 469, pp. 1 - 5
Brad Efron

Abstract: Broadly speaking, nineteenth century statistics was Bayesian, while the twentieth century was frequentist, at least from the point of view of most scientific practitioners. Here in the twenty-first century scientists are bringing statisticians much bigger problems to solve, often comprising millions of data points and thousands of parameters. Which statistical philosophy will dominate practice? My guess, backed up with some recent examples, is that a combination of Bayesian and frequentist ideas will be needed to deal with our increasingly intense scientific environment. This will be a challenging period for statisticians, both applied and theoretical, but it also opens the opportunity for a new golden age, rivaling that of Fisher, Neyman, and the other giants of the early 1900s. What follows is the text of the 164th ASA presidential address, delivered at the awards ceremony in Toronto on August 10, 2004.

Wednesday, April 05, 2006

NYC: Snow in April --- is it really abnormal?

Only summary data are available. I plotted the boxplots using the five-number summaries provided. Some of the maxima are flagged as outliers by R.

R code: snow.R
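Why would R flag some of the maxima as outliers? R's boxplot() draws whiskers only out to 1.5 times the interquartile range beyond the quartiles and plots anything further out as an individual outlier point. Here is a minimal sketch of that rule — in Python rather than the R of snow.R, and with made-up snowfall numbers, not the actual NYC data:

```python
# Sketch of the 1.5 * IQR rule that R's boxplot() uses to flag outliers.
# With only a five-number summary we can still check whether the reported
# maximum would be drawn as an outlier point rather than a whisker end.

def quartiles(data):
    """Return (Q1, median, Q3) using simple linear interpolation."""
    xs = sorted(data)
    def q(p):
        idx = p * (len(xs) - 1)
        lo = int(idx)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (idx - lo) * (xs[hi] - xs[lo])
    return q(0.25), q(0.5), q(0.75)

def five_number_summary(data):
    """(min, Q1, median, Q3, max) -- the same summary the boxplots use."""
    xs = sorted(data)
    q1, med, q3 = quartiles(xs)
    return xs[0], q1, med, q3, xs[-1]

def max_is_outlier(summary):
    """True if the maximum lies beyond the upper fence Q3 + 1.5 * IQR."""
    mn, q1, med, q3, mx = summary
    return mx > q3 + 1.5 * (q3 - q1)

# Hypothetical April snowfall totals (inches) -- illustration only.
snow = [0.0, 0.0, 0.1, 0.3, 0.5, 0.8, 1.0, 1.2, 1.5, 13.5]
s = five_number_summary(snow)
print(s, max_is_outlier(s))  # the 13.5-inch maximum falls past the fence
```

So a freak snowstorm in an otherwise mild month sits far above the upper fence, and R draws it as a lone point above the whisker — which is exactly what "abnormal" looks like in a boxplot.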

In Spring 2005, I used a new example for W1111: snowfall in NYC. That spring, we had a record amount of snow. My students said that I jinxed it. :)

Well, today is absolutely not my fault.

Tuesday, April 04, 2006

99 bottles of beer

I got this link from a friend's blog. In its own description:

"This Website holds a collection of the Song 99 Bottles of Beer programmed in different programming languages. Actually the song is represented in 933 different programming languages and variations. For more detailed information refer to historic information."

I saw this blog days ago and didn't have much to say about it then. A couple of independent events over the past two days made me feel it is worth noting. It finally hit me that after someone proposed a "solution" to the computerized generation of this song, 932 other parties still proposed different solutions just to achieve the SAME thing. In publishing scientific papers (well, I only know about statistics and genetics), you probably don't want to submit a paper on a new method that achieves the same thing as some existing method. If there is no better performance, there seems to be no new contribution. I used to agree with this statement, but this number, "933", sort of shocked me into thinking. We definitely learn more from 933 different programs that can instruct a computer to print out the lyrics than from just one such program. Then why is there no credit for the runners-up who solve important problems in the scientific world?

I was reading an editorial on the South Korean stem cell scandal. The author argued that, since there is absolutely no credit for the person who makes a discovery one day later than the first, someone may be pushed (by scientific greed) to fabricate something just to take first place for the moment and then go back to work out the details. Sure, most scientists have the integrity not to do something like this. The author also pointed out that the fame the first discoverer receives (amplified by the Internet these days) is making the world of scientific research more and more like a celebrity competition. I think the author has a point, even though I don't think things are this dramatic in statistics.