Now on ASCR Discovery: Ghost Writer

Processor cores in a giant supercomputer want data delivery that’s fast and efficient. It’s a difficult goal to achieve in large calculations, like those in chemistry and bioinformatics problems. The size and complexity of such computations can snarl communication between hundreds of processors as each solves its own piece of the puzzle. If communication traffic jams develop, programs can bog down. Making big computations efficient means making data transmission fast and seamless.

To overcome communications challenges, researchers have developed Casper, a layer of code that inserts itself into the workflow to reduce delays. Casper is portable, so it should be useful for other high-performance computing (HPC) software that uses a similar communications strategy. It could even facilitate the move toward exascale computing, when machines will be capable of a quintillion calculations per second.

Casper was first developed for NWChem, a Pacific Northwest National Laboratory-designed computational chemistry program that lets researchers model complex chemical processes such as how radioactive waste decays and how ultraviolet radiation causes cancer in skin cells. From its infancy in the mid-1990s, NWChem was designed to run on networked processors, as in an HPC system, using one-sided communication, says Jeff Hammond of Intel Corp.’s Parallel Computing Laboratory.

In one-sided communication, a processor is programmed internally to fetch data from and write data to another processor without that processor’s involvement. Eliminating the need to match communication on two processors greatly simplifies programming in applications where data are accessed irregularly. This strategy also reduces communication overhead, which can burden large-scale simulations. When communication proceeds without a programmer’s instruction, it makes asynchronous progress.

As HPC systems have evolved, the Message Passing Interface (MPI), computer code that facilitates overall communication between parallel processors, has become the lingua franca of supercomputers, allowing researchers to move data via various strategies.

MPI is widely used, but some computer scientists have resisted adopting it. The code can be sluggish with programs that use one-sided communication, Hammond says. When a program doesn’t make asynchronous progress, calculations can crawl to a near standstill.

Jim Dinan, Pavan Balaji, and Hammond, all at Argonne National Laboratory at the time, had worked out some of these communication problems between NWChem and earlier MPI versions. But NWChem didn’t work optimally with MPI-3, the newest version.

To solve this problem, Hammond and Balaji envisioned adding another layer of code that would speed up NWChem, much like introducing nitrous oxide in a dragster engine to boost its power. They worked with Min Si, a University of Tokyo doctoral student, to develop the code now called Casper during her 2013 Argonne internship.

To streamline asynchronous communication, Casper lets researchers designate one or more processor cores as “ghost processes,” Si says. They serve as landing points for communication and allow asynchronous progress to take place while other cores perform the computation.

There’s no need to change MPI, Hammond says. Casper is “just this invisible code that sneaks in and takes care of things. Casper is, you know, the friendly ghost that comes in and helps you.” The friendly ghost solution is particularly useful because it works with any MPI version, no matter who built the supercomputer.

Read more at ASCR Discovery, a website highlighting research supported by the Department of Energy’s Advanced Scientific Computing Research program.

Image caption: In an NWChem simulation (data here from a complex problem involving water molecules), Casper shows uniform improvement as the number of cores increases. Image courtesy of Casper/Argonne National Laboratory.