Novel Methods for Visualization and Clustering of DNA Microarray Data

Sarah Moussa, University of California, Berkeley

Despite the high dimensionality of microarray and EST data sets, many of their dimensions are correlated, which creates redundancy. In microarray data, gene expression patterns are correlated because genes are co-regulated and because related tissues express functional subsets of genes in a similar fashion; ESTs (Expressed Sequence Tags) are fragments of genes, and hence EST data exhibit similar correlations. Dimensionality reduction eliminates redundancy in a data set and extracts its most important components according to some metric of importance. It also lowers the computational complexity of subsequent data processing (e.g., clustering) and facilitates 2-D or 3-D visualization of the data. Such techniques have achieved moderate success in visualizing and clustering microarray data, even though only linear methods such as principal component analysis (PCA) have been widely used. PCA finds a low-dimensional representation that preserves as much of the total variance in the data as possible; however, this variance-preserving criterion and the subsequent linear projection are not guaranteed to preserve cluster structure or patterns in the data set, and may in fact obscure or remove structure that is present. We will present results of an alternative exploratory analysis of microarray and EST data using novel pattern recognition algorithms (kernel PCA with specificity kernels), nonlinear dimensionality reduction algorithms (Isomap, LLE), classical iterative and agglomerative clustering of genes and tissues, and more computationally intensive spectral clustering techniques, which require the eigendecomposition of a potentially very large matrix. This work was done for Dr. Daniel Rokhsar while on practicum at the DOE Joint Genome Institute in Spring 2005.
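
The point about PCA's variance-preserving criterion can be illustrated with a minimal sketch. It does not use the data or the algorithms of this work: the synthetic two-cluster data set, the helper names `pca_project` and `separation`, and all parameter values are assumptions chosen for illustration. The two clusters differ only along a low-variance axis, so a 1-D PCA projection keeps the high-variance noise axis and largely loses the cluster structure.

```python
import numpy as np


def pca_project(X, k):
    """Project the rows of X onto the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                      # center each column
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = S**2 / (len(X) - 1)              # variance along each component
    return Xc @ Vt[:k].T, variances


def separation(values, labels):
    """Distance between the two cluster means, in units of pooled std."""
    a, b = values[labels == 0], values[labels == 1]
    pooled_std = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return abs(a.mean() - b.mean()) / pooled_std


# Synthetic data (an assumption, not microarray data): 200 points per cluster.
rng = np.random.default_rng(0)
n = 200
labels = np.repeat([0, 1], n)
X = np.column_stack([
    # Axis 0: high variance, carries no cluster information.
    rng.normal(0.0, 10.0, 2 * n),
    # Axis 1: low variance, fully determines cluster membership.
    np.where(labels == 0, -1.0, 1.0) + rng.normal(0.0, 0.2, 2 * n),
])

Z, variances = pca_project(X, k=1)

sep_axis = separation(X[:, 1], labels)   # separation along the informative axis
sep_pca = separation(Z[:, 0], labels)    # separation after 1-D PCA projection

print("variance per component:", variances)
print("cluster separation on informative axis:", sep_axis)
print("cluster separation after 1-D PCA:", sep_pca)
```

In this toy setting the first principal component aligns with the noise axis because that axis dominates the total variance, so the clusters overlap in the projection. Nonlinear or kernel-based methods are one way to look for structure that such a projection discards, although whether they recover it on real data is an empirical question.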

Abstract Author(s): S.C. Moussa, D.S. Rokhsar, M.I. Jordan