ABSTRACT
Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive compute capability, they often support only a limited set of native data types. To mitigate this problem, previous works have explored software emulation of certain data types, which partially compensates for the hardware limitation. However, how to efficiently design additional emulated data types and select a high-performance one for a given application without sacrificing correctness or precision remains an open problem.
To address these challenges, we present Ape, which can 1) provide different strategies for emulating high-bitwidth data types with native data types, together with in-depth error analysis; and 2) dynamically and automatically select proper data types and generate efficient code for a given computation at fine granularity, achieving higher performance while maintaining both correctness and precision without human effort. We implement Ape on both NVIDIA Tensor Core and Huawei Ascend. Results show that Ape boosts General Matrix Multiplication and convolution by up to 3.12X and 1.86X on Tensor Core over CUDA Core, and accelerates various applications by up to 1.78X (1.65X on average).
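The abstract does not spell out the emulation mechanics, but a common strategy in this line of work can be illustrated with a short, hedged sketch: split each high-bitwidth operand into a high and a low part that fit the narrower native type, then recover the wide-precision product from several narrow multiplies accumulated in a wider accumulator. The sketch below is not Ape's code; it uses `double` as the "high-bitwidth" type and `float` as a stand-in for the accelerator's native low-bitwidth type, and the helper names `split` and `emulated_mul` are illustrative only.

```cpp
// Minimal sketch of split-based precision emulation (hypothetical, not Ape's code).
// double plays the role of the high-bitwidth type; float plays the role of the
// accelerator's native low-bitwidth type with a wide accumulator.
#include <cstdio>

// Split a double into two floats so that hi + lo approximates x,
// with hi carrying the leading bits and lo the rounding residual.
static void split(double x, float &hi, float &lo) {
    hi = static_cast<float>(x);
    lo = static_cast<float>(x - static_cast<double>(hi));
}

// Emulated product: expand (a_hi + a_lo) * (b_hi + b_lo) and accumulate the
// partial products in a wider accumulator, as a Tensor-Core-style unit does
// when multiplying narrow inputs into an FP32 accumulator.
static double emulated_mul(double a, double b) {
    float a_hi, a_lo, b_hi, b_lo;
    split(a, a_hi, a_lo);
    split(b, b_hi, b_lo);
    double acc = 0.0;
    acc += static_cast<double>(a_hi) * b_hi;  // dominant term
    acc += static_cast<double>(a_hi) * b_lo;  // cross terms refine precision
    acc += static_cast<double>(a_lo) * b_hi;
    // a_lo * b_lo is typically dropped: it falls below the target precision.
    return acc;
}

int main() {
    double a = 3.14159265358979, b = 2.71828182845905;
    printf("exact     : %.17g\n", a * b);
    printf("emulated  : %.17g\n", emulated_mul(a, b));
    printf("naive f32 : %.17g\n",
           static_cast<double>(static_cast<float>(a) * static_cast<float>(b)));
    return 0;
}
```

Running the sketch shows the emulated product agreeing with the exact one to far more digits than the naive single-precision product, which is the effect the error analysis in such emulation strategies quantifies.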