
Computer Architecture and SysTems Research Lab (CASTL)

Address: 114 Milton Carothers Hall (MCH), Tallahassee, FL 32306; Contact: Dr. Weikuan Yu (yuw@cs.fsu.edu), 850-644-5442

Orchestrating a Cache-Memory Concert for Massive Parallelism

Contact: Dr. Weikuan Yu

There has been a rapid rise of massive parallelism in modern processors, and the core count is expected to reach thousands within a decade. In contrast, memory bandwidth lags behind, causing an ever-growing gap between off-chip memory bandwidth and the aggregate computing power of a machine. This massive parallelism leads to a host of challenging issues. In particular, it causes an explosion of recently accessed datasets with temporal locality of very short duration and spatial locality of varying strides, frequently with memory accesses of unique striding patterns. This makes it very difficult for conventional cache algorithms to use the limited cache capacity effectively. In addition, the increasing width of massive parallelism congests the memory pipeline, stalling the warp schedulers and degrading performance. Hence, there is a critical need for a cache-memory concert that can orchestrate cache and memory management to meet the challenges of massive parallelism. In the Department of Computer Science at Florida State University, Dr. Weikuan Yu and his team have carried out a number of recent research studies to orchestrate such a cache-memory concert.

Address Indexing Permutation for Intra-Conflict Misses

GPUs employ a hierarchy of caches, which can reduce the latency of memory operations and save on-chip network and off-chip memory bandwidth when there is locality within the accesses. However, contention from massive parallelism often makes caching performance unpredictable. Notably, bursty divergent accesses can cause associativity stalls when they are pathologically concentrated into a few cache sets. Without spreading bursty intra-warp accesses evenly across all cache sets, associativity conflicts inevitably undercut the potential benefits that other optimizations can bring; for example, concurrency throttling techniques would become futile.

Cache indexing functions play a key role in reducing conflict misses by spreading accesses evenly among all sets of cache blocks. For example, pseudo-random cache indexing methods have been extensively studied to reduce conflict misses in CPU systems. However, no prior indexing method has addressed the pathological behaviors of GPU cache indexing. In CPU systems, parallelism is supported at a moderate level, and memory accesses are often dispersed over time. In GPUs, by contrast, high thread counts are common, intra-warp accesses often come in long bursts, and memory bandwidth plays a critical role in sustaining high computation throughput. These distinctive features pose challenges in designing a GPU-specific indexing method. Due to lock-step execution, the GPU Load/Store units stall whenever intra-warp concentration exceeds the available cache associativity.

Through an in-depth analysis of GPU access patterns, we find that column-major strided accesses are likely to incur high intra-warp concentration. Based on this analysis, we propose a Full Permutation (FUP) based indexing method that adapts to both the large and medium strides in this pattern. Across the 10 highly cache-sensitive GPU applications we evaluated, FUP eliminates intra-warp associativity conflicts and outperforms two state-of-the-art indexing methods by 22% and 15%, respectively.
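To make the idea concrete, below is a minimal C++ sketch of a permutation-style set-index function that XOR-folds higher address bits into the conventional set index so that strided accesses are spread across sets. The 128-byte line size, 32-set geometry, and the particular bit ranges folded together are assumptions chosen for illustration; this is not the exact FUP mapping.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative permutation-based set indexing (not the exact FUP mapping).
// Assumes a 128-byte cache line and 32 sets, so the conventional index is
// simply bits [11:7] of the address.
constexpr uint32_t LINE_BITS = 7;   // log2(128-byte line)
constexpr uint32_t SET_BITS  = 5;   // log2(32 sets)
constexpr uint32_t SET_MASK  = (1u << SET_BITS) - 1;

// Conventional modulo indexing: strided accesses whose stride is a multiple
// of (sets * line size) all collapse into one set.
uint32_t conventional_index(uint64_t addr) {
    return (addr >> LINE_BITS) & SET_MASK;
}

// Permutation indexing: XOR-fold higher block-address bits into the set
// index so that both medium and large strides are spread across sets.
uint32_t permuted_index(uint64_t addr) {
    uint64_t block = addr >> LINE_BITS;
    uint32_t idx = block & SET_MASK;
    idx ^= (block >> SET_BITS) & SET_MASK;        // breaks up medium strides
    idx ^= (block >> (2 * SET_BITS)) & SET_MASK;  // breaks up large strides
    return idx & SET_MASK;
}

int main() {
    // Column-major strided accesses from one warp: 32 lanes, 4 KB apart.
    // Conventional indexing maps every lane to the same set; the permuted
    // index disperses them.
    for (int lane = 0; lane < 32; ++lane) {
        uint64_t addr = 0x10000000ull + lane * 4096ull;
        printf("lane %2d  conventional set %2u  permuted set %2u\n",
               lane, conventional_index(addr), permuted_index(addr));
    }
    return 0;
}
```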

Memory Divergence-Aware GPU Cache Management

The Graphics Processing Unit (GPU) has proven itself a viable technology for a wide variety of applications seeking to exploit its massive computing capability. It allows an application to be programmed as thousands of threads running the same code in lock-step, in which warps of 32 threads can be scheduled for execution in every cycle with zero switching overhead. The massive parallelism of these Single-Instruction Multiple-Data (SIMD) threads helps GPUs achieve a dramatic improvement in computational power compared to CPUs. To reduce the latency of memory operations, GPUs employ multiple levels of data caches to save off-chip memory bandwidth when there is locality within the accesses.

Due to massive multithreading, per-thread data cache capacity is often diminished. For example, Fermi supports a maximum of 48 warps (1,536 threads) on each Streaming Multiprocessor (SM), and these warps share a 16KB or 48KB L1 Data Cache (L1D). Thus, coalescing each warp's per-thread global memory accesses into fewer memory transactions not only minimizes the consumption of memory bandwidth but also alleviates cache contention. But when a warp's accesses cannot be coalesced into one or two cache blocks, which is referred to as memory divergence, its cache footprint is often inflated by an order of magnitude, e.g., from 1 to 32 cache blocks. This leads to severe contention among warps, i.e., inter-warp contention, for the limited L1D capacity.
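The footprint blow-up caused by memory divergence can be illustrated with a small C++ sketch that counts the distinct cache blocks touched by one warp instruction. The 32-thread warp and 128-byte block size are conventional GPU parameters; the addresses themselves are made up for illustration.

```cpp
#include <cstdint>
#include <cstdio>
#include <set>

// Counts how many distinct 128-byte L1D blocks a single warp instruction
// touches, i.e., its cache footprint after coalescing. Parameters are
// illustrative rather than tied to a particular GPU generation.
constexpr int      WARP_SIZE  = 32;
constexpr uint64_t BLOCK_SIZE = 128;

int warp_footprint(const uint64_t (&addrs)[WARP_SIZE]) {
    std::set<uint64_t> blocks;
    for (uint64_t a : addrs) blocks.insert(a / BLOCK_SIZE);
    return static_cast<int>(blocks.size());
}

int main() {
    uint64_t coalesced[WARP_SIZE], divergent[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; ++lane) {
        coalesced[lane] = 0x1000 + lane * 4;      // consecutive 4-byte words
        divergent[lane] = 0x1000 + lane * 4096;   // column-major stride
    }
    // Coalesced: 32 * 4 B = 128 B -> 1 block. Divergent: 32 blocks.
    printf("coalesced footprint: %d blocks\n", warp_footprint(coalesced));
    printf("divergent footprint: %d blocks\n", warp_footprint(divergent));
    return 0;
}
```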

The lock-step execution model of GPUs requires a warp to have the data blocks for all of its threads before execution. However, existing cache mechanisms do not recognize the need to manage GPU cache blocks at the warp level in order to increase the number of warps ready for execution. In addition, warp scheduling is very important for GPU-specific cache management to reduce both intra- and inter-warp conflicts and maximize data locality. In this work, we propose a Divergence-Aware Cache (DaCache) management scheme that orchestrates L1D cache management and warp scheduling together for GPGPUs. In DaCache, the insertion position of an incoming data block depends on the fetching warp's scheduling priority. Blocks of warps with lower priorities are inserted closer to the LRU position of the LRU chain so that they have shorter lifetimes in cache. This fine-grained insertion policy is extended to prioritize coherent loads over divergent loads, so that coherent loads are less vulnerable to both inter- and intra-warp thrashing. DaCache also adopts a constrained replacement policy with L1D bypassing to sustain a good supply of Fully Cached Warps (FCW), along with a dynamic mechanism to adjust FCW at runtime. Our experiments demonstrate that DaCache achieves a 40.4% performance improvement over the baseline GPU and outperforms two state-of-the-art thrashing-resistant techniques, RRIP and DIP, by 40% and 24.9%, respectively.
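The following C++ sketch illustrates priority-based insertion into one set's LRU chain in the spirit of DaCache. The 4-way associativity, the greedy-then-oldest priority assignment, and the linear priority-to-depth mapping are simplifying assumptions for illustration, not the exact DaCache policy (which also handles coherent-versus-divergent loads, constrained replacement, and bypassing).

```cpp
#include <cstdint>
#include <cstdio>
#include <list>

// Sketch of priority-based insertion into one cache set's LRU chain:
// blocks fetched by warps with lower scheduling priority are inserted
// closer to the LRU end so they are evicted sooner.
constexpr int ASSOC = 4;  // assumed 4-way set for illustration

struct Block { uint64_t tag; int warp_prio; };

struct CacheSet {
    std::list<Block> lru_chain;  // front = MRU, back = LRU

    void insert(uint64_t tag, int warp_prio, int num_warps) {
        if (static_cast<int>(lru_chain.size()) >= ASSOC)
            lru_chain.pop_back();                       // evict the LRU victim
        // Map scheduling priority (0 = highest) onto an insertion depth:
        // highest-priority warps insert at MRU, lowest near LRU.
        int depth = (num_warps > 1)
                        ? warp_prio * (ASSOC - 1) / (num_warps - 1)
                        : 0;
        auto pos = lru_chain.begin();
        for (int i = 0; i < depth && pos != lru_chain.end(); ++i) ++pos;
        lru_chain.insert(pos, {tag, warp_prio});
    }
};

int main() {
    CacheSet s;
    // 8 warps under greedy-then-oldest scheduling; priority 0 is the oldest.
    s.insert(0xA0, 0, 8);  // highest priority: inserted at MRU
    s.insert(0xB0, 7, 8);  // lowest priority: lands near the LRU position
    s.insert(0xC0, 3, 8);  // mid priority: inserted in between
    for (const Block& b : s.lru_chain)
        printf("tag 0x%llx prio %d\n",
               static_cast<unsigned long long>(b.tag), b.warp_prio);
    return 0;
}
```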

Memory Occlusion Aware Warp Scheduling

GPUs employ a hierarchy of data caches to reduce memory latency and save on-chip network and off-chip memory bandwidth when there is locality within the accesses. However, the cache is frequently thrashed by divergent memory accesses. Recent work has reported that throttling the number of concurrent warps reduces the accumulated working set, so that contention for cache capacity is alleviated and locality is preserved. While these previous efforts improve the memory performance of GPU applications, they overlook some hazardous situations faced by GPU memory instructions.

In our research, we have closely examined execution stalls caused by divergent memory accesses in data-intensive GPU benchmarks. Our detailed analysis of GPU global memory accesses reveals that divergent accesses can lead to the occlusion of Load/Store units and a quick depletion of MSHR (Miss Status Holding Register) entries. This memory occlusion in turn stalls the execution pipeline, degrading overall performance. Our analysis shows that memory occlusion can significantly delay both coherent and divergent GPU instructions, propagate into further stalls in the pipeline, and deteriorate the overall utilization of the GPU. Furthermore, such memory occlusion cannot be mitigated by simply provisioning more MSHR entries when there is a lack of appropriate balance between warp parallelism and L1D locality.
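A toy C++ model below shows how a single divergent warp can deplete the MSHR file and occlude a subsequent coherent load. The 32-entry MSHR file and the all-miss behavior are assumed configurations for illustration only.

```cpp
#include <cstdio>

// Toy model of MSHR depletion ("memory occlusion"): a single divergent warp
// whose 32 accesses all miss can consume the entire MSHR file, so the next
// load, even a fully coherent one, cannot be accepted and the Load/Store
// unit stalls until an outstanding fill returns.
int main() {
    const int mshr_entries = 32;  // assumed MSHR file size
    int in_flight = 0;            // entries consumed by outstanding misses

    // Divergent warp: 32 distinct blocks, assume every access misses in L1D.
    int divergent_misses = 32;
    in_flight += divergent_misses;
    printf("after divergent warp: %d/%d MSHRs in flight\n",
           in_flight, mshr_entries);

    // Coherent warp: 1 block, also a miss, but no MSHR entry is free.
    int coherent_misses = 1;
    if (in_flight + coherent_misses > mshr_entries)
        printf("coherent load stalls: memory occlusion until a fill returns\n");
    return 0;
}
```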

We propose memory Occlusion Aware Warp Scheduling (OAWS) to monitor the usage of MSHR entries, predict the MSHR requirements of memory instructions, and schedule only those warps whose requirements can be satisfied by the available MSHR entries, thereby preventing memory occlusion and increasing the effective parallelism among GPU warps. OAWS is designed with both static and dynamic methods to predict the MSHR entries required by GPU warps. Static OAWS predicts the misses from all warps with a fixed miss rate, while dynamic OAWS takes into account the varying access patterns of different warps to predict cache misses on a per-warp basis. In particular, dynamic OAWS seamlessly integrates warp priority and a concurrency estimation model into its prediction policy; it prevents memory occlusion by dynamically predicting the optimal number of concurrent warps and leverages more MSHR entries without thrashing the L1D, yielding better IPC. Experimental results show that the static and dynamic versions of OAWS achieve 36.7% and 73.1% performance improvement, respectively, compared to baseline GTO scheduling. In particular, dynamic OAWS outperforms MASCAR, CCWS, and SWL-Best by 70.1%, 57.8%, and 11.4%, respectively.
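The sketch below illustrates the occlusion-aware issue check at the core of OAWS in simplified C++: estimate each ready warp's MSHR demand from a predicted per-warp miss rate and issue it only if the free entries can absorb that demand. The per-warp miss rates, the free-entry count, and the rounding are illustrative assumptions rather than the exact OAWS prediction model.

```cpp
#include <cstdio>
#include <vector>

// Simplified occlusion-aware issue check: a warp's memory instruction is
// issued only when its predicted miss count fits in the free MSHR entries;
// otherwise it is deferred to prevent memory occlusion.
struct Warp {
    int id;
    int pending_accesses;   // distinct cache blocks this instruction touches
    double predicted_miss;  // per-warp predicted miss rate (dynamic variant)
};

int predicted_mshr_demand(const Warp& w) {
    // Each predicted miss occupies one MSHR entry until its fill returns.
    return static_cast<int>(w.pending_accesses * w.predicted_miss + 0.5);
}

int main() {
    int free_mshr = 10;  // entries not consumed by in-flight misses (assumed)

    // Ready warps ordered by greedy-then-oldest priority (index 0 = highest).
    std::vector<Warp> ready = {
        {0, 32, 0.20},  // divergent but mostly hitting in L1D
        {1, 32, 0.90},  // divergent and thrashing: would occlude the LSU
        {2,  1, 0.50},  // coherent load
    };

    for (const Warp& w : ready) {
        int demand = predicted_mshr_demand(w);
        if (demand <= free_mshr) {
            printf("issue warp %d (demand %d, free %d)\n",
                   w.id, demand, free_mshr);
            free_mshr -= demand;
        } else {
            printf("defer warp %d (demand %d exceeds free %d)\n",
                   w.id, demand, free_mshr);
        }
    }
    return 0;
}
```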

People

  • FACULTY
  1. Dr. Weikuan Yu
  • STUDENTS
  1. Bin Wang
  2. Zhuo Liu
  3. Yue Zhu

Publications

  1. B. Wang*, Z. Liu*, X. Wang*, and W. Yu. Eliminating Intra-Warp Conflict Misses in GPU. The 18th Conference on Design, Automation and Test in Europe (DATE'15, acceptance rate: 22.4%). Grenoble, France. March 2015.
  2. B. Wang*, W. Yu, X.-H. Sun, and X. Wang. DaCache: Memory Divergence-Aware GPU Cache Management. The 29th International Conference on Supercomputing (ICS'15, acceptance rate: 21%). Newport Beach, CA. June 2015.
  3. B. Wang*, Y. Zhu*, and W. Yu. OAWS: Memory Occlusion Aware Warp Scheduling. International Conference on Parallel Architecture and Compilation Techniques (PACT'16, acceptance rate: 26%). Haifa, Israel. September 2016.

Acknowledgements

This work is funded in part by Florida State University, Auburn University, and an Alabama Innovation Award.

