September 6, 2022 — An Exascale Computing Project-funded team has developed MemHC, a GPU memory management framework that optimizes the calculation of many-body correlation functions.
This computational kernel, fundamental to modern computational physics applications, is both compute- and memory-intensive. MemHC speeds up the calculation of many-body correlation functions with a series of new memory reduction designs. Other recent efforts have focused on optimizing individual tensor contractions, which leaves overall performance suboptimal. MemHC instead optimizes memory management across contractions to reduce redundant GPU memory allocation, redundant CPU-GPU communication, and GPU oversubscription, yielding more efficient calculations. The framework is portable across platforms using Nvidia and AMD GPUs. The team’s work was published in the March 2022 issue of ACM Transactions on Architecture and Code Optimization.
Many-body correlation functions are widely used in computational physics, for example in lattice quantum chromodynamics, and are essential for computing physical observables such as the properties of light nuclei. Calculating these functions is inefficient for three reasons: it is difficult to fully utilize the GPU’s computing power; the calculation produces large intermediate results, which add complexity and can exceed available GPU memory; and data reuse is lacking, which generates a large amount of GPU I/O. MemHC addresses these in three ways. It uses duplication-aware management and delayed release of GPU memory for better data reuse (e.g., intermediate outputs used as inputs for later contractions); it implements data reorganization and on-demand synchronization to eliminate redundant or unnecessary data transfers between CPUs and GPUs; and it applies an optimized least recently used (LRU) eviction policy, called pre-protected LRU, to reduce evictions and increase memory hits. In testing, MemHC achieved 2.17 to 10.73× higher GFLOPS than unified memory management for general correlation functions. For three real physics correlation functions, it improved execution time by 3.56 to 6.12× and GFLOPS by 3.56 to 6.08×. MemHC’s optimized LRU eviction policy outperformed the original policy by up to 1.36×.
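The interplay of duplication-aware management, delayed release, and pre-protected LRU can be illustrated with a small simulation. The sketch below is not MemHC's implementation and involves no real GPU: the names `TensorPool` and `run` are hypothetical, every tensor is assumed to occupy one unit of memory, and "pre-protection" is modeled, as an assumption, by pinning the tensors of the upcoming contraction so they cannot be evicted while its other operands are loaded. Tensors stay resident after use and are released only when space is needed.

```python
from collections import OrderedDict


class TensorPool:
    """Toy model of a fixed-capacity GPU memory pool (hypothetical, no GPU)."""

    def __init__(self, capacity):
        self.capacity = capacity        # units of simulated device memory
        self.used = 0
        self.resident = OrderedDict()   # key -> size, least recently used first
        self.protected = set()          # keys that may not be evicted right now
        self.loads = 0                  # simulated allocations / host-to-device copies
        self.evictions = 0

    def protect(self, keys):
        """Pin the tensors needed by the upcoming contraction."""
        self.protected = set(keys)

    def acquire(self, key, size=1):
        """Return True on a hit; otherwise make room and load the tensor."""
        if key in self.resident:
            # Duplication-aware: reuse the resident copy instead of reloading.
            self.resident.move_to_end(key)
            return True
        while self.used + size > self.capacity:
            self._evict_one()
        self.resident[key] = size
        self.used += size
        self.loads += 1
        return False

    def _evict_one(self):
        # Pick the least recently used tensor that is not protected.
        victim = next((k for k in self.resident if k not in self.protected), None)
        if victim is None:
            raise MemoryError("no evictable tensor: request does not fit")
        self.used -= self.resident.pop(victim)
        self.evictions += 1


def run(steps, capacity, pre_protect):
    """Replay contractions; each step lists the tensors it touches."""
    pool = TensorPool(capacity)
    for tensors in steps:
        if pre_protect:
            pool.protect(tensors)
        for key in tensors:
            pool.acquire(key)
    return pool.loads, pool.evictions


# Three contractions: inputs followed by the output tensor of each step.
steps = [("A", "B", "AB"), ("C", "A", "CA"), ("B", "CA", "BCA")]

print(run(steps, capacity=3, pre_protect=False))  # (8, 5)
print(run(steps, capacity=3, pre_protect=True))   # (7, 4)
```

In this toy trace, plain LRU evicts `A` to make room for `C` and then has to reload `A` for the same contraction. Protecting the operands first avoids that eviction and the extra load. The intermediate `CA` is still resident when the third contraction needs it, so it is reused rather than recomputed or copied again.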
Future work includes extending MemHC to handle more types of hadronic systems and further optimizing it for high-rank tensor contractions, such as tetraquark systems based on 4D tensors, which are far more demanding in memory usage and computational cost. The team also plans to extend the framework to multi-node GPU clusters and to optimize intra-node and inter-node communication, including asynchronous data copying and data prefetching.
Reference: Qihan Wang, Zhen Peng, Bin Ren, Jie Chen, and Robert G. Edwards. “MemHC: An Optimized GPU Memory Management Framework for Accelerating Many-body Correlation.” 2022. ACM Transactions on Architecture and Code Optimization (March).
Source: Exascale Computing Project