CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

ABSTRACT
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given the relatively complex programming model and rapid architectural evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools either provide insufficient insight or lack support across different GPU architectures, runtimes, and driver versions. This paper presents CUDAAdvisor, a profiling framework that guides code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on profiling results from GPU kernels, including memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We demonstrate several case studies that derive significant insights to guide GPU code optimization for performance improvement.
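Reuse distance, one of the memory-level metrics the abstract mentions, measures how many distinct addresses are touched between two accesses to the same address; short distances indicate cache-friendly locality. The sketch below is not CUDAAdvisor's implementation, just a minimal host-side illustration of the classic LRU stack-distance computation over a memory-address trace (the function name and trace format are assumptions for this example).

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Compute LRU stack (reuse) distances for a memory-address trace.

    Returns one entry per access: the number of distinct addresses
    touched since the previous access to the same address, or None
    for a cold (first-time) access.
    """
    stack = OrderedDict()  # most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            # Distance = count of distinct addresses accessed more
            # recently than the previous access to `addr`.
            keys = list(stack.keys())
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)  # mark as most recently used
        else:
            distances.append(None)   # cold miss: infinite distance
            stack[addr] = True
    return distances

# Example: the second access to 0x10 sees two distinct intervening
# addresses (0x20 and 0x30), so its reuse distance is 2.
print(reuse_distances([0x10, 0x20, 0x30, 0x10]))
# → [None, None, None, 2]
```

In practice, a profiler applying this to GPU traces would aggregate per-warp or per-instruction distances to predict cache behavior, since a histogram of reuse distances directly bounds the hit rate of an LRU cache of a given size.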