Wednesday, August 03, 2005

Local Computations with Probabilities on Graphical Structures and Their Application to Expert Systems

Journal of the Royal Statistical Society. Series B (Methodological) Vol. 50, No. 2 (1988), pp. 157-224 [JSTOR link]

S. L. Lauritzen; D. J. Spiegelhalter

Abstract: A causal network is used in a number of areas as a depiction of patterns of 'influence' among sets of variables. In expert systems it is common to perform 'inference' by means of local computations on such large but sparse networks. In general, non-probabilistic methods are used to handle uncertainty when propagating the effects of evidence, and it has appeared that exact probabilistic methods are not computationally feasible. Motivated by an application in electromyography, we counter this claim by exploiting a range of local representations for the joint probability distribution, combined with topological changes to the original network termed 'marrying' and 'filling-in'. The resulting structure allows efficient algorithms for transfer between representations, providing rapid absorption and propagation of evidence. The scheme is first illustrated on a small, fictitious but challenging example, and the underlying theory and computational aspects are then discussed.
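
A rough sketch of the 'marrying' step mentioned in the abstract, as it is usually described: every pair of parents sharing a child is joined by an edge, and edge directions are then dropped. The function name `moralize`, the `{node: [parents]}` input format and the four-node toy network are assumptions made for illustration; they are not taken from the paper, and the 'filling-in' (triangulation) step is not shown.

```python
from itertools import combinations


def moralize(parents):
    """Return the undirected graph obtained by 'marrying' parents.

    `parents` maps each node to a list of its parents in a DAG
    (hypothetical input format chosen for this sketch).
    The result maps each node to the set of its undirected neighbours.
    """
    nodes = set(parents)
    for ps in parents.values():
        nodes.update(ps)

    adj = {v: set() for v in nodes}
    for child, ps in parents.items():
        # Keep every original edge, ignoring its direction.
        for p in ps:
            adj[child].add(p)
            adj[p].add(child)
        # 'Marry' each pair of parents that share this child.
        for a, b in combinations(ps, 2):
            adj[a].add(b)
            adj[b].add(a)
    return adj


# Toy network (illustrative only): A -> C <- B, C -> D
toy_dag = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}
moral = moralize(toy_dag)
print(sorted(moral["A"]))  # A gains B as a neighbour because both are parents of C
```

The only edge added here is A-B; nodes with a single parent, such as D, keep exactly the neighbours they had.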

Tuesday, August 02, 2005

Maximum Likelihood from Incomplete Data via the EM algorithm

Journal of the Royal Statistical Society. Series B (Methodological) Vol. 39, No. 1 (1977), pp. 1-38 [JSTOR link]
A. P. Dempster, N. M. Laird, and D. B. Rubin

Abstract: A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
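
As a small illustration of one of the example classes listed in the abstract (finite mixture models), here is a minimal sketch of EM for a two-component one-dimensional Gaussian mixture. The function names, the initialisation and the simulated data are all assumptions made for this example and are not taken from the paper. The log-likelihood is recorded after every iteration so the monotone behaviour the abstract refers to can be checked numerically.

```python
import math
import random
import statistics


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def log_likelihood(data, w, mu, sigma):
    """Observed-data log-likelihood of the two-component mixture."""
    return sum(
        math.log(w * normal_pdf(x, mu[0], sigma[0])
                 + (1.0 - w) * normal_pdf(x, mu[1], sigma[1]))
        for x in data
    )


def em_two_gaussians(data, n_iter=50):
    """Run EM; return (weight, means, std devs, log-likelihood history)."""
    n = len(data)
    # Crude starting values (an arbitrary choice for this sketch).
    w = 0.5
    mu = [min(data), max(data)]
    s = statistics.pstdev(data)
    sigma = [s, s]
    history = []
    for _ in range(n_iter):
        # E-step: posterior probability that each point came from component 0.
        resp = []
        for x in data:
            a = w * normal_pdf(x, mu[0], sigma[0])
            b = (1.0 - w) * normal_pdf(x, mu[1], sigma[1])
            resp.append(a / (a + b))
        # M-step: weighted maximum likelihood updates.
        n0 = sum(resp)
        n1 = n - n0
        w = n0 / n
        mu = [
            sum(r * x for r, x in zip(resp, data)) / n0,
            sum((1.0 - r) * x for r, x in zip(resp, data)) / n1,
        ]
        sigma = [
            math.sqrt(sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, data)) / n0),
            math.sqrt(sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, data)) / n1),
        ]
        history.append(log_likelihood(data, w, mu, sigma))
    return w, mu, sigma, history


# Simulated data: 200 draws from N(0, 1) and 200 draws from N(5, 1).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)] + [rng.gauss(5.0, 1.0) for _ in range(200)]
w, mu, sigma, history = em_two_gaussians(data)
print(w, mu, sigma)
```

Here the "incomplete data" are the unobserved component labels: the E-step replaces them with their posterior probabilities and the M-step re-estimates the parameters as if those probabilities were weights.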

Reversible Jump Markov Chain Monte Carlo Computation and Bayesian Model Determination

Biometrika Vol. 82, No. 4 (1995), pp. 711-732 [JSTOR link]
Peter J. Green

Abstract: Markov chain Monte Carlo methods for Bayesian computation have until recently been restricted to problems where the joint distribution of all variables has a density with respect to some fixed standard underlying measure. They have therefore not been available for application to Bayesian model determination, where the dimensionality of the parameter vector is typically not fixed. This paper proposes a new framework for the construction of reversible Markov chain samplers that jump between parameter subspaces of differing dimensionality, which is flexible and entirely constructive. It should therefore have wide applicability in model determination problems. The methodology is illustrated with applications to multiple change-point analysis in one and two dimensions, and to a Bayesian comparison of binomial experiments.

Identification and measurement of neighbor-dependent nucleotide substitution processes

Bioinformatics 2005 21(10):2322-2328; doi:10.1093/bioinformatics/bti376
Peter F. Arndt and Terence Hwa

Motivation: Neighbor-dependent substitution processes have generated specific patterns of dinucleotide frequencies in the genomes of most organisms. The CpG-methylation–deamination process, for example, is a prominent process in vertebrates (the CpG effect). Such processes, often with unknown mechanistic origins, need to be incorporated into realistic models of nucleotide substitutions.

Results: Based on a general framework of nucleotide substitutions, we developed a method that identifies the most relevant neighbor-dependent substitution processes, estimates their relative frequencies and judges whether they are important enough to be included in the model. Starting from a model of neighbor-independent nucleotide substitution, we successively added neighbor-dependent substitution processes in the order of their ability to increase the likelihood of the model given the data. The analysis of neighbor-dependent nucleotide substitutions based on repetitive elements found in the genomes of human, zebrafish and fruit fly is presented.

Availability: A web server to perform the presented analysis is freely available at:

Monday, June 06, 2005

An Entropy-Based Statistic for Genomewide Association Studies

Am. J. Hum. Genet. 77:27–40, 2005
Jinying Zhao, Eric Boerwinkle, and Momiao Xiong

Abstract: Efficient genotyping methods and the availability of a large collection of single-nucleotide polymorphisms provide valuable tools for genetic studies of human disease. The standard chi-square statistic for case-control studies, which uses a linear function of allele frequencies, has limited power when the number of marker loci is large. We introduce a novel test statistic for genetic association studies that uses Shannon entropy and a nonlinear function of allele frequencies to amplify the differences in allele and haplotype frequencies to maintain statistical power with large numbers of marker loci. We investigate the relationship between the entropy-based test statistic and the standard chi-square statistic and show that, in most cases, the power of the entropy-based statistic is greater than that of the standard chi-square statistic. The distribution of the entropy-based statistic and the type I error rates are validated using simulation studies. Finally, we apply the new entropy-based test statistic to two real data sets, one for the COMT gene and schizophrenia and one for the MMP-2 gene and esophageal carcinoma, to evaluate the performance of the new method for genetic association studies. The results show that the entropy-based statistic obtained smaller P values than did the standard chi-square statistic.
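
To make the two ingredients named in the abstract concrete, the sketch below computes the standard Pearson chi-square statistic for a table of case and control allele counts, and the Shannon entropy of an allele frequency vector. This is not the paper's entropy-based test statistic, whose exact form is not given in the abstract; the function names and the allele counts are invented for illustration.

```python
import math


def chi_square_allelic(case_counts, control_counts):
    """Pearson chi-square statistic for a 2 x k table of allele counts."""
    n_case = sum(case_counts)
    n_control = sum(control_counts)
    total = n_case + n_control
    stat = 0.0
    for c, d in zip(case_counts, control_counts):
        col = c + d
        # Expected counts under the null of equal allele frequencies.
        exp_case = n_case * col / total
        exp_control = n_control * col / total
        stat += (c - exp_case) ** 2 / exp_case
        stat += (d - exp_control) ** 2 / exp_control
    return stat


def shannon_entropy(freqs):
    """Shannon entropy in bits; zero frequencies contribute nothing."""
    return -sum(p * math.log2(p) for p in freqs if p > 0)


# Invented allele counts at one biallelic marker.
cases = [60, 40]
controls = [40, 60]
chi2 = chi_square_allelic(cases, controls)

case_freqs = [c / sum(cases) for c in cases]
control_freqs = [c / sum(controls) for c in controls]
print(chi2, shannon_entropy(case_freqs), shannon_entropy(control_freqs))
```

The chi-square statistic depends on squared differences between observed and expected allele counts, while entropy is a logarithmic, and therefore nonlinear, function of the frequencies themselves. That contrast is what the abstract draws on when motivating the new statistic.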