A Cloud-Native Shared Computing Environment for Large-Scale Analysis of Astronomy Data Sets

Envision a world in which you never have to download a catalog to your own computer, email results and code to collaborators, or move your code to a larger machine to make it run faster. As astronomy data sets grow from gigabytes to petabytes, the traditional science workflow breaks down, and new tools are needed that let analysis scale with the size of the data sets now being produced. Shared computing environments in the cloud offer a better reality: object stores make large data sets broadly accessible, shared file systems allow easy collaboration among users, and elastic computing architectures scale easily and robustly with the size of the workload. We have built a shared computing environment that leverages existing technologies such as JupyterHub, Kubernetes, and Apache Spark to provide and integrate each of these elements, making distributed computing in a shared environment a natural and user-friendly experience. Our platform also integrates and makes available custom software that enables astronomy-specific workloads, such as catalog cross-matches, to be performed quickly in a distributed manner.
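
The abstract does not describe how the platform's cross-match software works, so the following is only an illustrative sketch of why a catalog cross-match lends itself to distributed execution. It assumes a simple zone-based scheme: the sky is cut into declination strips, and each source is compared only against sources in its own and the two adjacent strips. All names here (`cross_match`, `zone_of`, `zone_height_deg`) are hypothetical and are not taken from the authors' software.

```python
import math
from collections import defaultdict


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def zone_of(dec_deg, zone_height_deg):
    """Index of the declination strip (zone) that contains dec_deg."""
    return int(math.floor((dec_deg + 90.0) / zone_height_deg))


def cross_match(cat_a, cat_b, radius_deg, zone_height_deg=1.0):
    """Pair sources of cat_a with sources of cat_b closer than radius_deg.

    Catalogs are lists of (source_id, ra_deg, dec_deg) tuples.
    Returns a list of (id_a, id_b, separation_deg) tuples.
    """
    # A match can differ in declination by at most radius_deg, so checking
    # the zones z-1, z, z+1 is sufficient only if radius <= zone height.
    if radius_deg > zone_height_deg:
        raise ValueError("radius_deg must not exceed zone_height_deg")

    # Bucket catalog B by zone. In a distributed setting (an assumption,
    # not the authors' stated design) each zone could become one partition.
    zones_b = defaultdict(list)
    for source in cat_b:
        zones_b[zone_of(source[2], zone_height_deg)].append(source)

    matches = []
    for id_a, ra_a, dec_a in cat_a:
        z = zone_of(dec_a, zone_height_deg)
        for neighbor in (z - 1, z, z + 1):
            for id_b, ra_b, dec_b in zones_b.get(neighbor, ()):
                sep = angular_separation_deg(ra_a, dec_a, ra_b, dec_b)
                if sep <= radius_deg:
                    matches.append((id_a, id_b, sep))
    return matches
```

Because each zone only needs data from itself and its two neighbors, the work for different zones is independent, which is the property a framework such as Apache Spark could exploit by assigning zones to separate workers. The sketch does not bin by right ascension, so pairs that straddle the RA = 0°/360° boundary are handled by the separation formula itself.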

Abstract Author(s)

Steven Stetzler, Mario Juric

University

University of Washington