Robust Data Generation for Machine Learning in High-Resolution Transmission Electron Microscopy of Nanoscale Materials

Luis Rangel DaCosta, University of California, Berkeley

In this work, we present a general framework, guiding principles, and automated tools for constructing arbitrary synthetic datasets that closely mimic experimental data produced on the transmission electron microscope (TEM), and for studying the use of such datasets in machine learning workflows. Our data generation pipeline begins with Construction Zone, an open-source Python package for structure generation built on top of popular materials modeling packages, which allows arbitrary nanoscale atomic scenes to be generated algorithmically and automatically. We study a simple model problem, segmentation of nanoparticles in high-resolution TEM micrographs, and analyze the performance of models on several different experimental datasets. We also demonstrate that models trained entirely on simulated data can achieve state-of-the-art performance, and that model performance, both in- and out-of-distribution, can be improved further by transfer learning on small amounts of experimental data. The use of automated tools to construct high-quality simulated datasets, with full control of the underlying dataset distribution, proves to be a powerful method for investigating the data dependence of machine learning techniques, generalization phenomena, and learning dynamics.
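
As a rough illustration of what algorithmic structure generation with a known ground-truth label can look like, the sketch below carves a spherical nanoparticle out of an fcc lattice and derives a coarse segmentation mask from the projected atom positions. It uses only NumPy and is not the Construction Zone API; the function names `fcc_sphere` and `projected_mask`, the gold-like lattice constant, and the pixel size are assumptions chosen for illustration, and no image simulation is performed.

```python
import numpy as np


def fcc_sphere(lattice_const, radius):
    """Return (N, 3) positions of fcc atoms inside a sphere centered at the origin.

    Hypothetical helper for illustration; units are whatever lattice_const uses.
    """
    # Fractional coordinates of the four atoms in the conventional fcc cell.
    basis = np.array([[0.0, 0.0, 0.0],
                      [0.5, 0.5, 0.0],
                      [0.5, 0.0, 0.5],
                      [0.0, 0.5, 0.5]])
    # Enough unit cells in each direction to cover the sphere.
    n = int(np.ceil(radius / lattice_const)) + 1
    cells = np.arange(-n, n + 1)
    grid = np.array(np.meshgrid(cells, cells, cells, indexing="ij")).reshape(3, -1).T
    positions = (grid[:, None, :] + basis[None, :, :]).reshape(-1, 3) * lattice_const
    # Keep only the atoms that fall inside the sphere.
    return positions[np.linalg.norm(positions, axis=1) <= radius]


def projected_mask(positions, pixel_size, image_size):
    """Return a boolean (image_size, image_size) mask of pixels containing atoms.

    Atoms are projected along z and binned in x and y. A real pipeline would
    simulate the micrograph and usually smooth or fill the label mask.
    """
    half = pixel_size * image_size / 2.0
    counts, _, _ = np.histogram2d(
        positions[:, 0], positions[:, 1],
        bins=image_size, range=[[-half, half], [-half, half]],
    )
    return counts > 0


# Assumed example values: a gold-like lattice constant (angstroms) and a 1 nm radius.
atoms = fcc_sphere(lattice_const=4.08, radius=10.0)
mask = projected_mask(atoms, pixel_size=1.0, image_size=32)
```

Because the structure is built programmatically, the label mask is known exactly for every generated scene, and parameters such as particle size can be sampled from any chosen distribution.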

Abstract Author(s): Luis Rangel DaCosta, Katherine Sytwu, Catherine Groschner, Mary Scott