A Clustering Algorithm to Identify Latent Cell Types From DNA Methylation Patterns

Alexander Williams, University of California, San Diego

The brain is composed of cells with remarkably diverse functions and morphologies. Defining and characterizing these cell types is a long-standing goal in neuroscience. DNA methylation, an epigenetic modification involving the addition of methyl groups to genomic cytosine nucleotides, is a potentially useful marker of cell identity and function. In particular, methylation patterns are thought to be stable in adult animals and can be mapped using whole-genome bisulfite sequencing (WGBS). Furthermore, the addition and removal of DNA methylation may be involved in cell and tissue differentiation; methylation is targeted to different genes in different cells during development and typically silences those genes, resulting in cell-type-specific gene expression. Methylation patterns vary among major classes of cells in the brain and across cell samples from different tissues and organs.

While it is sometimes possible to separate cells using genetic or molecular markers, many cell types cannot be isolated in large enough quantities for WGBS. As a result, these experiments are typically performed on large tissue samples. Methylation patterns are measured across many short DNA sequencing reads, each originating from a cell of unknown type. We developed an unsupervised learning algorithm to cluster these reads, and thus infer cell-type-specific methylation profiles from tissue samples with mixed cell types. We represent the methylation pattern across bisulfite sequencing reads as a sparse, incomplete binary matrix and obtain a low-rank matrix factorization by alternating convex programming. By appropriately constraining and regularizing the matrix factorization, we can efficiently obtain a soft clustering of the bisulfite sequencing reads into cell types. These results can be used as a heuristic to initialize a hard clustering algorithm, such as k-means clustering, if desired.
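The general idea of a constrained low-rank factorization of an incomplete binary matrix can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`project_rows_to_simplex`, `soft_cluster_reads`), the squared-error loss, the simplex constraint on read memberships, the box constraint on profiles, and the use of alternating projected-gradient steps are all assumptions made for illustration. Reads are rows, CpG sites are columns, and unobserved entries are encoded as NaN.

```python
import numpy as np

def project_rows_to_simplex(V):
    """Euclidean projection of each row of V onto the probability simplex."""
    n, k = V.shape
    U = np.sort(V, axis=1)[:, ::-1]              # each row sorted in descending order
    css = np.cumsum(U, axis=1) - 1.0
    idx = np.arange(1, k + 1)
    rho = (U - css / idx > 0).sum(axis=1)        # size of the support, always >= 1
    theta = css[np.arange(n), rho - 1] / rho     # per-row shift
    return np.maximum(V - theta[:, None], 0.0)

def soft_cluster_reads(X, k, n_iter=200, seed=0):
    """Approximate X (reads x sites, NaN = unobserved) by W @ H.

    W (reads x k): soft cluster memberships, rows constrained to the simplex.
    H (k x sites): hypothetical per-cluster methylation profiles in [0, 1].
    Each block update is one projected-gradient step on a convex subproblem.
    """
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)                          # True where a site was observed
    X0 = np.where(mask, X, 0.0)
    n, m = X.shape
    W = project_rows_to_simplex(rng.random((n, k)))
    H = rng.random((k, m))
    for _ in range(n_iter):
        # Update W with H fixed; step size 1/L, where L <= ||H||_2^2.
        R = mask * (W @ H - X0)
        step_w = 1.0 / (np.linalg.norm(H, 2) ** 2 + 1e-12)
        W = project_rows_to_simplex(W - step_w * (R @ H.T))
        # Update H with W fixed; step size 1/L, where L <= ||W||_2^2.
        R = mask * (W @ H - X0)
        step_h = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-12)
        H = np.clip(H - step_h * (W.T @ R), 0.0, 1.0)
    return W, H
```

Under these assumptions, each row of `W` is a soft assignment of one read to `k` clusters, and `W.argmax(axis=1)` gives a hard assignment that could seed a method such as k-means.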

Abstract Author(s): Alex H. Williams, Eran A. Mukamel