Better Data Splits for Machine Learning With Astartes

Jackson Burns, Massachusetts Institute of Technology


Machine learning (ML) is an immensely popular approach to accelerating traditional workflows. Critical to ML is the partitioning of datasets into training, validation, and testing subsets, used to initialize, improve, and interrogate models, respectively. Unfortunately, the validation subset is often neglected in literature studies, either intentionally because of data scarcity or unintentionally for lack of a standardized software tool. Omitting it means the testing subset is reused for model selection, which causes data leakage and compromises the integrity of the resulting models. Additionally, it is common practice in the literature to assign the subsets randomly. This approach is fast and efficient but measures only a model's capacity to interpolate: accuracy estimated from such splits may be overly optimistic when the model encounters new data dissimilar to its training data. There is thus a growing need to easily measure performance on extrapolation tasks. To address these issues, we report astartes, an open-source Python package that implements many similarity- and distance-based algorithms to partition data into more challenging training, validation, and testing subsets. Users can then apply these splits, independently of astartes, to better assess out-of-sample performance with any ML model of their choice. Astartes operates on arbitrary vector inputs, comes pre-packaged with featurization schemes for chemical data, and can be easily integrated into new and existing ML workflows.
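To illustrate what a distance-based sampler looks like, the sketch below implements a simplified version of the classic Kennard-Stone algorithm in plain NumPy. This is a hedged, standalone illustration of the general technique, not astartes' actual API; the function name `kennard_stone_split` and the implementation details here are assumptions made for the example.

```python
import numpy as np

def kennard_stone_split(X, train_size):
    """Simplified Kennard-Stone sampler (illustrative only, not astartes' API).

    Greedily builds a training set that covers the feature space:
    start from the two most distant points, then repeatedly add the
    point farthest from everything already selected.
    """
    n = len(X)
    n_train = int(round(train_size * n))
    # Pairwise Euclidean distances between all samples.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Seed with the two most mutually distant points.
    i, j = np.unravel_index(np.argmax(dists), dists.shape)
    selected = [int(i), int(j)]
    remaining = set(range(n)) - set(selected)
    while len(selected) < n_train:
        rem = np.array(sorted(remaining))
        # Each candidate's distance to its nearest already-selected point.
        min_d = dists[np.ix_(rem, selected)].min(axis=1)
        # Pick the candidate that is farthest from the current training set.
        pick = int(rem[np.argmax(min_d)])
        selected.append(pick)
        remaining.remove(pick)
    return sorted(selected), sorted(remaining)

# Toy demonstration on random 2-D data.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
train_idx, test_idx = kennard_stone_split(X, train_size=0.7)
```

Because the training points are chosen to span the feature space, the held-out points tend to lie in its interior or at its fringes, which makes the resulting test error a harsher, more extrapolation-like estimate than a random split would give.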

Development is hosted on GitHub (github.com/JacksonBurns/astartes), and contributions of new sampling methods and featurization approaches are welcome.

Abstract Author(s): Jackson Burns, Kevin Spiekermann, William Green