DOI: 10.1145/3524059.3532377
Research article · Open Access

Efficiently emulating high-bitwidth computation with low-bitwidth hardware

Published: 28 June 2022

ABSTRACT

Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive computation capability, they often support only a limited set of native data types. To mitigate this problem, previous work has explored software emulation of certain data types, partially compensating for the hardware limitation. However, how to efficiently design additional emulated data types and, for a given application, choose a high-performance one without hurting correctness or precision remains an open problem.

To address these challenges, we present Ape, which can 1) provide different strategies for emulating high-bitwidth data types with native data types, backed by in-depth error analysis; and 2) dynamically and automatically select proper data types and generate efficient code for a given computation at fine granularity, achieving higher performance while maintaining both correctness and precision without human effort. We implement Ape on both NVIDIA Tensor Core and Huawei Ascend. Results show that Ape boosts General Matrix Multiplication and convolution by up to 3.12X and 1.86X on Tensor Core over CUDA Core, and accelerates various applications by up to 1.78X (1.65X on average).
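The core idea behind this style of emulation can be illustrated with a minimal host-side sketch: represent one high-bitwidth value as several native low-bitwidth values and recombine the partial products. The example below uses double → two floats as a stand-in for the float → half/TF32 splits used on Tensor Core; it is an assumption-laden illustration of one classical splitting strategy, not Ape's actual kernels, data-type selection, or error analysis.

```cpp
// Minimal sketch: emulate a higher-precision multiply using two lower-precision
// values per operand (a "split" representation). Plain C++, using double -> 2x float
// as a stand-in for emulating FP32 with low-bitwidth hardware types.
#include <cstdio>

struct Split { float hi, lo; };  // hi + lo together approximate the original double

static Split split(double x) {
    float hi = static_cast<float>(x);                             // leading bits
    float lo = static_cast<float>(x - static_cast<double>(hi));   // residual bits
    return {hi, lo};
}

// Emulated product: expand (a.hi + a.lo) * (b.hi + b.lo) and accumulate the
// partial products in a wider accumulator (here double, playing the role that
// FP32 accumulation plays on Tensor Core).
static double emulated_mul(Split a, Split b) {
    return static_cast<double>(a.hi) * b.hi
         + static_cast<double>(a.hi) * b.lo
         + static_cast<double>(a.lo) * b.hi
         + static_cast<double>(a.lo) * b.lo;   // smallest term; often dropped for speed
}

int main() {
    double x = 3.14159265358979323846, y = 2.71828182845904523536;
    Split sx = split(x), sy = split(y);
    std::printf("exact     : %.17g\n", x * y);
    std::printf("emulated  : %.17g\n", emulated_mul(sx, sy));
    std::printf("fp32 only : %.17g\n",
                static_cast<double>(static_cast<float>(x) * static_cast<float>(y)));
    return 0;
}
```

The emulated result recovers far more precision than a single low-precision multiply because each partial product is exact in the wider accumulator; whether the smallest cross term can be dropped is exactly the kind of performance/precision trade-off an automatic system must reason about.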



Published in

ICS '22: Proceedings of the 36th ACM International Conference on Supercomputing
June 2022, 514 pages
ISBN: 9781450392815
DOI: 10.1145/3524059

Copyright © 2022 Owner/Author. This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States

Publication History

Published: 28 June 2022
Qualifiers: research-article
Overall Acceptance Rate: 584 of 2,055 submissions (28%)
