CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

ABSTRACT
General-purpose GPUs have been widely utilized to accelerate parallel applications. Given the relatively complex programming model and rapid architectural evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools either provide insufficient insight or lack support across different GPU architectures, runtimes, and driver versions. This paper presents CUDAAdvisor, a profiling framework that guides code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on profiling results from GPU kernels, including memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We demonstrate several case studies that derive significant insights to guide GPU code optimization for performance improvement.
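Reuse distance, one of the memory-level metrics the abstract mentions, measures how many distinct addresses are touched between two accesses to the same address; short distances indicate cache-friendly locality. The sketch below is not CUDAAdvisor's implementation, just a minimal host-side illustration of the classic LRU stack-distance computation over a memory-address trace (the function name and trace format are assumptions for this example).

```python
from collections import OrderedDict

def reuse_distances(trace):
    """Compute LRU stack (reuse) distances for a memory-address trace.

    Returns one entry per access: the number of distinct addresses
    touched since the previous access to the same address, or None
    for a cold (first-time) access.
    """
    stack = OrderedDict()  # most recently used address is last
    distances = []
    for addr in trace:
        if addr in stack:
            # Distance = count of distinct addresses accessed more
            # recently than the previous access to `addr`.
            keys = list(stack.keys())
            distances.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)  # mark as most recently used
        else:
            distances.append(None)   # cold miss: infinite distance
            stack[addr] = True
    return distances

# Example: the second access to 0x10 sees two distinct intervening
# addresses (0x20 and 0x30), so its reuse distance is 2.
print(reuse_distances([0x10, 0x20, 0x30, 0x10]))
# → [None, None, None, 2]
```

In practice, a profiler applying this to GPU traces would aggregate per-warp or per-instruction distances to predict cache behavior, since a histogram of reuse distances directly bounds the hit rate of an LRU cache of a given size.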