DOI: 10.1145/3168831 (CGO Conference Proceedings)

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

Published: 24 February 2018

ABSTRACT

General-purpose GPUs have been widely utilized to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools either provide insufficient insight or lack support across different GPU architectures, runtimes, and driver versions. This paper presents CUDAAdvisor, a profiling framework that guides code optimization on modern NVIDIA GPUs. CUDAAdvisor performs various fine-grained analyses based on profiling results from GPU kernels, such as memory-level analysis (e.g., reuse distance and memory divergence), control-flow analysis (e.g., branch divergence), and code-/data-centric debugging. Unlike prior tools, CUDAAdvisor supports GPU profiling across different CUDA versions and architectures, including CUDA 8.0 and the Pascal architecture. We demonstrate several case studies in which these analyses yield significant insights that guide GPU code optimization for performance improvement.
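To illustrate one of the memory-level metrics named in the abstract: the reuse distance of a memory access is the number of distinct addresses touched since the previous access to the same address, and it predicts cache behavior (a reuse distance larger than the cache's capacity in lines suggests a miss). The sketch below is a minimal, trace-based illustration of the metric in Python; it is not CUDAAdvisor's implementation, and the function name and example trace are hypothetical.

```python
def reuse_distances(trace):
    """For each access in `trace`, return the number of distinct
    addresses touched since the last access to the same address,
    or infinity for a cold (first) access."""
    last_seen = {}   # address -> index of its most recent access
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Distinct addresses accessed strictly between the two uses.
            between = trace[last_seen[addr] + 1 : i]
            distances.append(len(set(between)))
        else:
            distances.append(float('inf'))  # cold access
        last_seen[addr] = i
    return distances

# Hypothetical per-thread address trace: 'a' and 'b' are each reused
# after two other distinct addresses intervene.
print(reuse_distances(['a', 'b', 'c', 'a', 'b']))
# -> [inf, inf, inf, 2, 2]
```

A profiler such as CUDAAdvisor must additionally attribute such traces to source lines and account for per-warp interleaving; this sketch only shows the metric itself on a flat trace.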

References

  1. NVIDIA. 2017. NVIDIA Visual Profiler. http://docs.nvidia.com/cuda/profiler-users-guide
  2. Jun. 2017. Top500 supercomputer sites. https://www.top500.org/lists/2017/06
  3. L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin, J. Mellor-Crummey, and N. R. Tallent. 2010. HPCToolkit: Tools for performance analysis of optimized parallel programs. Concurrency and Computation: Practice and Experience 22 (2010), 685–701.
  4. R. Ausavarungnirun, S. Ghose, O. Kayiran, G. H. Loh, C. R. Das, M. T. Kandemir, and O. Mutlu. 2015. Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance. In 2015 International Conference on Parallel Architecture and Compilation (PACT). 25–38.
  5. A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt. 2009. Analyzing CUDA workloads using a detailed GPU simulator. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. 163–174.
  6. David Böhme, Markus Geimer, Lukas Arnold, Felix Voigtlaender, and Felix Wolf. 2016. Identifying the Root Causes of Wait States in Large-Scale Parallel Applications. ACM Trans. Parallel Comput. 3, 2, Article 11 (July 2016), 24 pages.
  7. Milind Chabbi, Karthik Murthy, Michael Fagan, and John Mellor-Crummey. 2013. Effective Sampling-driven Performance Tools for GPU-accelerated Supercomputers. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '13). ACM, New York, NY, USA, Article 43, 12 pages.
  8. Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee, and Kevin Skadron. 2009. Rodinia: A benchmark suite for heterogeneous computing. In 2009 IEEE International Symposium on Workload Characterization. IEEE, 44–54.
  9. Guoyang Chen and Xipeng Shen. 2015. Free launch: optimizing GPU dynamic kernel launches through thread reuse. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 407–419.
  10. Xuhao Chen, Li-Wen Chang, Christopher I. Rodrigues, Jie Lv, Zhiying Wang, and Wen-Mei Hwu. 2014. Adaptive Cache Management for Energy-Efficient GPU Computing. In Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-47). IEEE Computer Society, Washington, DC, USA, 343–355.
  11. NVIDIA Corp. 2011. CUDA Tools SDK CUPTI User's Guide DA-05679-001_v01. https://developer.nvidia.com/nvidia-visual-profiler (October 2011).
  12. NVIDIA Corp. 2017. NVIDIA Nsight. http://www.nvidia.com/object/nsight.html
  13. Zheng Cui, Yun Liang, Kyle Rupnow, and Deming Chen. 2012. An accurate GPU performance model for effective control flow divergence optimization. In 2012 IEEE 26th International Parallel & Distributed Processing Symposium (IPDPS). IEEE, 83–94.
  14. Gregory Frederick Diamos, Andrew Robert Kerr, Sudhakar Yalamanchili, and Nathan Clark. 2010. Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10). ACM, New York, NY, USA, 353–364.
  15. Chen Ding and Yutao Zhong. 2001. Reuse Distance Analysis. Technical Report UR-CS-TR-741. Computer Science Department, University of Rochester.
  16. Jayesh Gaur, Raghuram Srinivasan, Sreenivas Subramoney, and Mainak Chaudhuri. 2013. Efficient Management of Last-level Caches in Graphics Processors for 3D Scene Rendering Workloads. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 395–407.
  17. Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In Innovative Parallel Computing (InPar). IEEE.
  18. LLVM Group. 2016. LLVM: User Guide for NVPTX Back-end. http://llvm.org/docs/NVPTXUsage.html
  19. NVIDIA Group. 2017. NVIDIA DGX-1 AI Supercomputer. http://www.nvidia.com/object/deep-learning-system.html
  20. Daniel Hackenberg, Guido Juckeland, and Holger Brunst. 2012. Performance analysis of multi-level parallelism: inter-node, intra-node and hardware accelerators. Concurrency and Computation: Practice and Experience 24, 1 (2012), 62–72.
  21. Google Inc. 2017. TensorFlow: An open-source software library for Machine Intelligence. https://www.tensorflow.org
  22. Intel. 2017. Intel VTune Amplifier XE 2017. http://software.intel.com/en-us/intel-vtune-amplifier-xe (April 2017).
  23. Hyeran Jeon, Gunjae Koo, and Murali Annavaram. 2014. CTA-aware Prefetching for GPGPU. Computer Engineering Technical Report CENG-2014-08 (2014).
  24. W. Jia, K. A. Shaw, and M. Martonosi. 2014. MRPB: Memory request prioritization for massively parallel processors. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 272–283.
  25. Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. ACM SIGARCH Computer Architecture News 41, 3 (2013), 332–343.
  26. Onur Kayıran, Adwait Jog, Mahmut Taylan Kandemir, and Chita Ranjan Das. 2013. Neither more nor less: optimizing thread-level parallelism for GPGPUs. In Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques. IEEE Press, 157–166.
  27. Chris Lattner and Vikram Adve. 2004. LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. In Proceedings of the 2004 International Symposium on Code Generation and Optimization (CGO '04). Palo Alto, California.
  28. Jaekyu Lee, Nagesh B. Lakshminarayana, Hyesoon Kim, and Richard Vuduc. 2010. Many-thread aware prefetching mechanisms for GPGPU applications. In 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). IEEE, 213–224.
  29. Y. Lee, R. Krashinsky, V. Grover, S. W. Keckler, and K. Asanović. 2013. Convergence and scalarization for data-parallel architectures. In Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). 1–11.
  30. Ang Li, Shuaiwen Leon Song, Weifeng Liu, Xu Liu, and Henk Corporaal. 2017. Locality-Aware CTA Clustering for Modern GPUs. In Proceedings of the 22nd ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XXII). ACM, New York, NY, USA.
  31. Ang Li, Gert-Jan van den Braak, Akash Kumar, and Henk Corporaal. 2015. Adaptive and transparent cache bypassing for GPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 17.
  32. Chao Li, Shuaiwen Leon Song, Hongwen Dai, Albert Sidelnik, Siva Kumar Sastry Hari, and Huiyang Zhou. 2015. Locality-driven dynamic GPU cache bypassing. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). ACM, 67–77.
  33. Lingda Li, Ari B. Hayes, Shuaiwen Leon Song, and Eddy Z. Zhang. 2016. Tag-Split Cache for Efficient GPGPU Cache Utilization. In Proceedings of the 2016 International Conference on Supercomputing (ICS '16). ACM, 43.
  34. Allen D. Malony, Scott Biersdorff, Sameer Shende, Heike Jagode, Stanimire Tomov, Guido Juckeland, Robert Dietrich, Duncan Poole, and Christopher Lamb. 2011. Parallel Performance Measurement of Heterogeneous Parallel Systems with GPUs. In Proceedings of the 2011 International Conference on Parallel Processing (ICPP '11). IEEE Computer Society, Washington, DC, USA, 176–185.
  35. Jiayuan Meng, David Tarjan, and Kevin Skadron. 2010. Dynamic Warp Subdivision for Integrated Branch and Memory Divergence Tolerance. In Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10). ACM, New York, NY, USA, 235–246.
  36. C. Nugteren, G. J. van den Braak, H. Corporaal, and H. Bal. 2014. A detailed GPU cache model based on reuse distance theory. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 37–48.
  37. NVIDIA. 2015. CUDA 7.5: Pinpoint Performance Problems with Instruction-Level Profiling. https://devblogs.nvidia.com/parallelforall/cuda-7-5-pinpoint-performance-problems-instruction-level-profiling
  38. NVIDIA. 2015. CUDA Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide
  39. Oracle. 2012. Oracle Solaris Studio. http://www.oracle.com/technetwork/server-storage/solarisstudio/overview/index.html
  40. Keshav Pingali. 2014. Galois. http://iss.ices.utexas.edu/?p=projects/galois
  41. Steve Plimpton. 1995. Fast Parallel Algorithms for Short-range Molecular Dynamics. J. Comput. Phys. 117, 1 (March 1995), 1–19.
  42. Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. 2013. A Locality-aware Memory Hierarchy for Energy-efficient GPU Architectures. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 86–98.
  43. Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2012. Cache-conscious wavefront scheduling. In Proceedings of the 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE Computer Society, 72–83.
  44. Timothy G. Rogers, Mike O'Connor, and Tor M. Aamodt. 2013. Divergence-aware Warp Scheduling. In Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). ACM, New York, NY, USA, 99–110.
  45. A. Sethia, D. A. Jamshidi, and S. Mahlke. 2015. Mascar: Speeding up GPU warps by reducing memory pitstops. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 174–185.
  46. Shuaiwen Leon Song, Chunyi Su, Barry Rountree, and Kirk W. Cameron. 2013. A simplified and accurate model of power-performance efficiency on emergent GPU architectures. In 2013 IEEE 27th International Symposium on Parallel & Distributed Processing (IPDPS). IEEE, 673–686.
  47. Mark Stephenson, Siva Kumar Sastry Hari, Yunsup Lee, Eiman Ebrahimi, Daniel R. Johnson, David Nellans, Mike O'Connor, and Stephen W. Keckler. 2015. Flexible Software Profiling of GPU Architectures. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15). ACM, New York, NY, USA, 185–197.
  48. John E. Stone, David Gohara, and Guochun Shi. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66–73.
  49. Jingweijia Tan, Shuaiwen Leon Song, Kaige Yan, Xin Fu, Andres Marquez, and Darren Kerbyson. 2016. Combating the Reliability Challenge of GPU Register File at Low Supply Voltage. In Proceedings of the 2016 International Conference on Parallel Architectures and Compilation (PACT '16). ACM, New York, NY, USA, 3–15.
  50. Dominic A. Varley. 1993. Practical experience of the limitations of Gprof. Software: Practice and Experience 23, 4 (1993), 461–463.
  51. Jingyue Wu, Artem Belevich, Eli Bendersky, Mark Heffernan, Chris Leary, Jacques Pienaar, Bjarke Roune, Rob Springer, Xuetian Weng, and Robert Hundt. 2016. GPUCC: An Open-Source GPGPU Compiler. In Proceedings of the 2016 International Symposium on Code Generation and Optimization. New York, NY, 105–116. http://dl.acm.org/citation.cfm?id=2854041
  52. P. Xiang, Y. Yang, and H. Zhou. 2014. Warp-level divergence in GPUs: Characterization, impact, and mitigation. In 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA). 284–295.
  53. Xiaolong Xie, Yun Liang, Xiuhong Li, Yudong Wu, Guangyu Sun, Tao Wang, and Dongrui Fan. 2015. Enabling coordinated register allocation and thread-level parallelism optimization for GPUs. In Proceedings of the 48th International Symposium on Microarchitecture. ACM, 395–406.
  54. Xiaolong Xie, Yun Liang, Guangyu Sun, and Deming Chen. 2013. An Efficient Compiler Framework for Cache Bypassing on GPUs. In Proceedings of the International Conference on Computer-Aided Design (ICCAD '13). IEEE Press, Piscataway, NJ, USA, 516–523. http://dl.acm.org/citation.cfm?id=2561828.2561929
  55. X. Xie, Y. Liang, Y. Wang, G. Sun, and T. Wang. 2015. Coordinated static and dynamic cache bypassing for GPUs. In 2015 IEEE 21st International Symposium on High Performance Computer Architecture (HPCA). 76–88.
  56. Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly Elimination of Dynamic Irregularities for GPU Computing. SIGPLAN Not. 46, 3 (March 2011), 369–380.

• Published in

  CGO 2018: Proceedings of the 2018 International Symposium on Code Generation and Optimization
  February 2018, 377 pages
  ISBN: 9781450356176
  DOI: 10.1145/3179541

  Copyright © 2018 ACM. Publication rights licensed to ACM. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

  Publisher: Association for Computing Machinery, New York, NY, United States

  Qualifiers: research-article

  Acceptance Rates: overall acceptance rate 312 of 1,061 submissions, 29%
