COVID-19 has impacted nearly every aspect of life and work, but even with the landscape of face-to-face collaboration and interaction profoundly changed, the scientific community continues to drive advancements forward.
The recently concluded GPU Hackathon, hosted by the San Diego Supercomputer Center (SDSC) and held in partnership with the Oak Ridge Leadership Computing Facility (OLCF), NVIDIA, and the National Energy Research Scientific Computing Center (NERSC), brought together seven teams across multiple disciplines in a newly launched, completely remote digital event format. By collaborating with mentors who are experts in GPU programming, these seven teams from 11 institutions, many with little to no GPU experience, worked to port and optimize their applications. Every team achieved a speedup.
Using OpenACC to Accelerate a Material Science Code
Stochastic GW (sGW) is a many-body perturbation theory (MBPT) code that enables the calculation of quasiparticle energies within the GW approach for large systems with many thousands of atoms. The main bottleneck is the application of the Hamiltonian operator to stochastic orbitals, which is dominated by large 3D Fast Fourier Transforms (FFTs) with specialized geometry.
The team Stochastic Donkeys came to the hackathon with the goal of porting the CPU-only code and its core algorithm, the time propagation of a single wave function, to the GPU using OpenACC and cuFFT. They faced many challenges, including a heavily segmented code with convolutions, complex kernels, and linear algebra, and team members who were new to the sGW software.
Since it was the slowest part of the code, the team's initial strategy was to port and optimize the core algorithm by adding OpenACC directives to specific kernels. Finding that GPU memory copies consumed 40% of the runtime, the Stochastic Donkeys began removing the data copies that arose because the code was segmented into many subroutines. They continued to encounter memory allocation and deallocation bottlenecks, so the team compared batched cuFFT transforms (via cufftPlanMany) against individual cuFFT calls issued in asynchronous streams to achieve speedups within the routines, bringing the share of GPU time spent on memory copies down to nine percent.
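The trade-off the team weighed, one batched transform versus many individual ones, can be sketched in Python. The code below runs on NumPy; swapping the import for `import cupy as np` routes the same batched call through cuFFT on the GPU. The array shapes are illustrative, not taken from sGW.

```python
import numpy as np  # drop-in: "import cupy as np" routes these FFTs through cuFFT

# Illustrative batch of stochastic orbitals on a 3D grid (shapes are made up).
n_orbitals, nx, ny, nz = 8, 32, 32, 32
rng = np.random.default_rng(0)
orbitals = rng.standard_normal((n_orbitals, nx, ny, nz)).astype(np.complex128)

# One batched call transforms all orbitals at once (analogous to a single
# cufftPlanMany plan), instead of launching a separate 3D FFT per orbital.
batched = np.fft.fftn(orbitals, axes=(1, 2, 3))

# Equivalent loop of individual transforms (analogous to many separate cuFFT
# calls, which on the GPU would be overlapped with asynchronous streams).
looped = np.stack([np.fft.fftn(orbitals[i]) for i in range(n_orbitals)])

assert np.allclose(batched, looped)
```

Batching amortizes plan setup and launch overhead across all transforms, which is why it helped once the per-routine allocation bottlenecks were exposed.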
The Stochastic Donkeys achieved a 6.3X speedup for the whole code (excluding I/O overhead), as well as speedups of 30-40X for individual routines. With ten percent of the code still running on the CPU, the team has a strategy in place to continue porting the remainder to the GPU.
Tackling Quantum Chemistry and Molecular Dynamics with CUDA
QUantum Interaction Computational Kernel (QUICK) is an open-source quantum chemistry application that solves the electronic Schrödinger equation with Hartree-Fock (HF) or Density Functional Theory (DFT), enabling researchers to compute energies and forces on atoms. QUICK is primarily a Fortran code with some C++, MPI, and CUDA; its algorithmic motifs are two-electron integral engines, numerical quadrature of the exchange-correlation (XC) potential, and linear algebra.
The eight-person QUICK team came to the GPU Hackathon with multiple goals. They first wanted to try to understand the performance of the existing kernels and identify code performance and memory consumption bottlenecks. They also wanted to improve the performance of the existing CUDA code for both electron repulsion integrals (ERIs) and XC quadrature code, as well as develop a multi-GPU implementation. Finally, the team wanted to port the numerically intensive linear algebra to GPU to speed up the matrix diagonalization and develop a strategy for building libraries from the GPU-accelerated parts of QUICK that could be used in other applications.
Initial profiling confirmed an expected bottleneck, with the computationally expensive matrix diagonalization taking significant time, but also uncovered an unexpected bottleneck within the one-electron integrals. Investigation into the source code revealed that improving the performance of the ERI code was too complex for the scope of the hackathon, so the team decided instead to concentrate on parallelizing ERIs across multiple GPUs. The team also decided to use cuSolver, a collection of dense and sparse direct linear solvers and eigensolvers that is part of the CUDA Toolkit, to port the matrix diagonalization to the GPU.
At the end of the hackathon, the QUICK team achieved an 8.5X speedup for the matrix diagonalization over the internal CPU diagonalizer by implementing cuSolver and eliminating unnecessary device/host data transfers. With this approach, the team realized a time savings for each iteration of their simulation (15% for Taxol), with the expectation that even more time would be saved on larger molecules. Additionally, they implemented multi-GPU integral sorting for ERIs, introducing a second sorting algorithm in the ERI engine that allowed each thread in a warp to call the same type of integral and enabled an even distribution of ERIs across the GPUs, achieving good performance and perfect load balancing.
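The diagonalization swap can be sketched in Python. Here `numpy.linalg.eigh` stands in for the CPU path; replacing the import with `import cupy as np` dispatches the same call to cuSolver's dense symmetric eigensolver on the GPU. The matrix below is a random symmetric stand-in, not QUICK's actual Fock matrix, and the dimensions are illustrative.

```python
import numpy as np  # drop-in: "import cupy as np" sends eigh to cuSolver on the GPU

# Mock symmetric matrix standing in for the one QUICK diagonalizes each
# SCF iteration (real dimensions depend on the basis set).
n = 64
rng = np.random.default_rng(0)
a = rng.standard_normal((n, n))
fock = (a + a.T) / 2  # symmetrize so eigh applies

# Dense symmetric eigendecomposition; with CuPy both the matrix and the
# eigenvectors stay on the device, avoiding the host/device transfers the
# team eliminated between iterations.
energies, orbitals = np.linalg.eigh(fock)

# eigh returns eigenvalues in ascending order.
assert np.all(np.diff(energies) >= 0)
```

Keeping the eigenvectors resident on the GPU between SCF iterations is what turns a library swap into the reported per-iteration time savings.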
GOMC for Monte Carlo Simulation of Molecular Systems
Team GOMC from Wayne State University attended the hackathon to optimize GOMC, an open-source package that simulates molecular systems using the Metropolis Monte Carlo algorithm to study vapor-liquid equilibria, adsorption in porous materials, surfactant self-assembly, and condensed-phase structure of complex molecules.
Phase space is sampled by moving molecules to random trial locations and using a Boltzmann weight to decide whether to accept the resulting configuration. For the hackathon, the GOMC team decided to focus on the most computationally expensive move in the code, the multi-particle force-biased translation/rotation move. This move had only just been included in the latest release, so the team had not yet had a chance to optimize it, and the hackathon provided the perfect opportunity.
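The Boltzmann-weighted acceptance test described above is the standard Metropolis criterion. A minimal sketch, using a toy one-dimensional harmonic energy function rather than GOMC's molecular force field, looks like this:

```python
import numpy as np

def metropolis_step(x, energy, beta, step_size, rng):
    """Propose a random trial move and accept it with the Boltzmann weight.

    `energy` is any callable returning a configuration's energy; here the
    "configuration" is a toy 1D coordinate, not GOMC's molecular coordinates.
    """
    trial = x + rng.uniform(-step_size, step_size)
    delta_e = energy(trial) - energy(x)
    # Always accept downhill moves; accept uphill moves with
    # probability exp(-beta * delta_e).
    if delta_e <= 0 or rng.random() < np.exp(-beta * delta_e):
        return trial, True
    return x, False

rng = np.random.default_rng(42)
harmonic = lambda x: 0.5 * x * x  # toy energy surface
x, accepted = 2.0, 0
for _ in range(10_000):
    x, ok = metropolis_step(x, harmonic, beta=1.0, step_size=0.5, rng=rng)
    accepted += ok
```

Run long enough, the chain samples configurations in proportion to exp(-beta * E), which is what makes the averaged quantities physically meaningful.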
Although their initial strategy was to implement Particle Mesh Ewald (PME) for the code's electrostatics, Team GOMC quickly realized this was too ambitious for the hackathon timeline and shifted their focus to the Lennard-Jones and real-space part of the electrostatic calculation. In the end, the team reduced the simulation time from 8.7 seconds at the start of the hackathon to 0.567 seconds at the end, a 16X speedup. This was achieved by eliminating the CPU-side precalculation of vectors of ordered pairs of interacting particles and porting those calculations to the GPU; further parallelizing the algorithm to reduce the number of interactions in the force calculation; and using a counter-based random number generator to synchronize random streams and move a large section of the software from host to device.
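Counter-based generators such as Philox produce random numbers as a pure function of a key and counter, so every GPU thread can compute its own stream with no shared mutable state. NumPy's Philox bit generator illustrates the idea; the per-particle keying scheme below is illustrative, not GOMC's actual implementation.

```python
import numpy as np

def particle_stream(seed, particle_id, n):
    """Illustrative per-particle stream: key the generator by (seed, id).

    Because Philox output is a pure function of its key and counter, each
    particle's stream can be generated independently and in any order,
    which is what makes it GPU-friendly.
    """
    bitgen = np.random.Philox(key=[seed, particle_id])
    return np.random.Generator(bitgen).random(n)

# Streams are fully reproducible ...
a = particle_stream(seed=7, particle_id=3, n=5)
b = particle_stream(seed=7, particle_id=3, n=5)
assert np.array_equal(a, b)

# ... and distinct particles get statistically independent streams.
c = particle_stream(seed=7, particle_id=4, n=5)
assert not np.array_equal(a, c)
```

The same reproducibility lets the host and device agree on random draws without transferring or locking a central RNG state.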
Neural Time Series for Data Modeling
Members of the Kutas Cognitive Electrophysiology Lab (Kutas Lab) use electroencephalogram (EEG) recordings to find systematic signals that can help model the time course of brain responses to stimulus and response events in experimental settings. They came to the GPU Hackathon to work on their open-source application fitgrid v0.4.10, a Python library for modeling time-varying patterns of activity in sensor-array data streams on a 2D grid, and to learn about new tools and approaches that could help port and optimize the application and make it more usable.
The Kutas Lab team initially targeted the Linear Mixed Model (LMM) bottleneck but soon pivoted to large-scale Linear Model (LM) bootstrap resampling. With no learning curve, no new code, and no memory penalty, Team Kutas Lab used CuPy to replace large sections of NumPy and achieved a 25X speedup.
This gave them the confidence to tackle new targets, including: multivariate regression, where the team achieved speedups of 6-14X; time-frequency analysis using cuSignal, a drop-in package from the RAPIDS suite, with speedups of 5-20X; and additional functions such as the Short-Time Fourier Transform (STFT), where 30 minutes of 32-channel EEG recordings are processed with 20 lines of Python in 265 milliseconds.
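An STFT of multichannel data like this reduces to FFTs over short windows of the signal. The minimal NumPy sketch below uses illustrative sizes, a rectangular window, and no overlap; real pipelines such as cuSignal's scipy-compatible `stft` add tapered, overlapping windows.

```python
import numpy as np  # drop-in: "import cupy as np" for the GPU version

fs = 256                    # illustrative sampling rate (Hz)
channels, seconds = 32, 10  # toy stand-in for the 32-channel EEG recordings
rng = np.random.default_rng(0)
signal = rng.standard_normal((channels, fs * seconds))

# Slice each channel into non-overlapping 1-second rectangular windows.
win = fs
n_frames = signal.shape[1] // win
frames = signal[:, : n_frames * win].reshape(channels, n_frames, win)

# FFT each window: result is (channels, frames, frequency bins), i.e. a
# spectrogram per channel.
spectrogram = np.abs(np.fft.rfft(frames, axis=-1))

assert spectrogram.shape == (channels, n_frames, win // 2 + 1)
```

Since the windowed FFTs are independent, they batch naturally on the GPU, which is where the reported millisecond-scale timings come from.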
Other teams met with success as well. Team AMRelectrodynamiX optimized two application codes built on the AMReX library, WarpX and SHIBA, realizing 11X and 8X speedups, respectively. Team GEOSX, named after their simulation framework, accelerated multiple kernels within the application for speedups ranging from 2-70X. Lastly, the team from NERSC worked on MFDn (Many-Fermion Dynamics for nuclear structure), a configuration interaction code for nuclear structure calculations. By the end of the event, the team achieved a 2-3X speedup using OpenACC.
All in all, the GPU Hackathon concluded with all seven teams not only achieving speedups but also leaving the event with solid roadmaps for future development.
Full Steam Ahead for 2020
The first GPU Hackathon of 2020 demonstrated that the completely remote format can be as productive as physical face-to-face events. All the teams at the SDSC GPU Hackathon were able to collaborate successfully, achieve their goals for the event, and develop a strategy for continued work on their scientific applications. Out of adversity, the new format developed in response to challenging times has proven that digital events can keep the community engaged and connected to the resources needed to accelerate science. Several GPU Hackathon events are scheduled globally, so please visit our events page for a complete list.