Revising Estimates of Protistan Diversity in the Environment Using an Unsupervised Topic Modeling Approach

Arianna Krinos, Massachusetts Institute of Technology


Computational studies of unicellular microbial eukaryotes (i.e., protists) suffer from a lack of reference genomes in public databases (Caron et al., 2017, Nat. Rev. Microbiol.; Keeling et al., 2014, PLoS Biol.). Although genomes of protists of medical and agricultural importance are overrepresented (Keeling & Campo, 2017, Curr. Biol.), available protist sequences are not representative of the vast diversity of protists in the environment, which leads to under-annotation when homology-based approaches are used. As an alternative, k-mer-based and unsupervised approaches to taxonomic annotation have existed since the early days of environmental genomics (Rosen et al., 2008, Adv. Bioinforma.) and continue to be developed (Manekar & Sathe, 2018, GigaScience). However, recent work has often relied on a critical mass of protistan reference genomes and transcriptomes, leaving many sequences unannotated (Werbin et al., 2022, F1000Research). Here, we introduce a novel topic modeling approach for assigning taxonomic annotations to environmental sequences that leverages reference genomes and transcriptomes while reducing bias. Our approach builds on recent evidence of the usefulness of a PCA-based approach, an interpretable method that also achieves an optimal solution (Ke & Wang, 2019). It follows the traditional topic modeling workflow but introduces seed sequences for topic identification and, after topics are formed, automates consensus annotation of clusters at multiple taxonomic levels. We show how this unsupervised approach compares to similar k-mer-based strategies for taxonomic assignment, as well as to a database search using the DIAMOND tool and a selection of reference protistan genomes. Further, we show how using this approach instead of a traditional database search, which may be skewed towards protists represented in databases, changes estimates of protistan abundance and diversity in environmental samples from previous studies.
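
To make two steps of the workflow described above concrete, the following is a minimal sketch, not the authors' implementation. It shows (1) how a sequence might be turned into a k-mer count vector, the "document" a topic model would consume, and (2) one plausible way to derive a consensus annotation for a cluster from the known lineages of its seed sequences. The function names, the three-rank lineage layout, the 0.8 agreement threshold, and the example taxa are all assumptions made for illustration; the topic modeling step itself is not shown.

```python
from collections import Counter
from itertools import product

# Hypothetical rank layout; real lineages may use more (or different) ranks.
RANKS = ("supergroup", "division", "genus")


def kmer_counts(seq, k=3):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def kmer_vector(seq, k=3):
    """Return counts over the full 4**k vocabulary in a fixed (sorted) order."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = kmer_counts(seq, k)
    return [counts[word] for word in vocab]


def consensus_lineage(seed_lineages, min_fraction=0.8):
    """Deepest lineage prefix shared by at least min_fraction of the seeds.

    seed_lineages: list of tuples ordered from the broadest rank to the
    most specific one, one tuple per seed sequence in the cluster.
    Returns an empty tuple if even the broadest rank lacks agreement.
    """
    consensus = ()
    if not seed_lineages:
        return consensus
    for depth in range(1, len(RANKS) + 1):
        # Compare whole prefixes so the result is always a valid lineage.
        prefixes = Counter(
            tuple(lin[:depth]) for lin in seed_lineages if len(lin) >= depth
        )
        if not prefixes:
            break
        prefix, n = prefixes.most_common(1)[0]
        if n / len(seed_lineages) < min_fraction:
            break
        consensus = prefix
    return consensus


# Illustrative cluster: the seeds agree on the two broader ranks only.
seeds = [
    ("Stramenopiles", "Bacillariophyta", "Thalassiosira"),
    ("Stramenopiles", "Bacillariophyta", "Thalassiosira"),
    ("Stramenopiles", "Bacillariophyta", "Skeletonema"),
]
print(consensus_lineage(seeds))
```

In this sketch a cluster is annotated only as deeply as its seeds agree, so a cluster with mixed genera falls back to a broader rank rather than receiving a possibly wrong genus-level label.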

Abstract Author(s): Arianna I. Krinos, Sarah K. Hu, Michael J. Follows, Harriet Alexander, Frederik Schulz