ABSTRACT
Domain-Specific Accelerators (DSAs) are being rapidly developed to support high-performance domain-specific computation. Although DSAs provide massive compute capability, they often support only a limited set of native data types. To mitigate this problem, previous works have explored software emulation of certain data types, which partially compensates for the hardware limitation. However, how to efficiently design additional emulated data types and select a high-performance one for a given application without sacrificing correctness or precision remains an open problem.
To address these challenges, we present Ape, which can 1) provide different strategies for emulating high-bitwidth data types with native data types, together with in-depth error analysis; and 2) dynamically and automatically select proper data types and generate efficient code for a given computation at fine granularity, achieving higher performance while maintaining both correctness and precision without human effort. We implement Ape on both NVIDIA Tensor Core and Huawei Ascend. Results show that Ape boosts General Matrix Multiplication and convolution by up to 3.12X and 1.86X on Tensor Core over CUDA Core, and accelerates various applications by up to 1.78X (1.65X on average).
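The abstract does not spell out the emulation mechanics, but a common strategy in this line of work can be illustrated with a short, hedged sketch: split each high-bitwidth operand into a high and a low part that fit the narrower native type, then recover the wide-precision product from several narrow multiplies accumulated in a wider accumulator. The sketch below is not Ape's code; it uses `double` as the "high-bitwidth" type and `float` as a stand-in for the accelerator's native low-bitwidth type, and the helper names `split` and `emulated_mul` are illustrative only.

```cpp
// Minimal sketch of split-based precision emulation (hypothetical, not Ape's code).
// double plays the role of the high-bitwidth type; float plays the role of the
// accelerator's native low-bitwidth type with a wide accumulator.
#include <cstdio>

// Split a double into two floats so that hi + lo approximates x,
// with hi carrying the leading bits and lo the rounding residual.
static void split(double x, float &hi, float &lo) {
    hi = static_cast<float>(x);
    lo = static_cast<float>(x - static_cast<double>(hi));
}

// Emulated product: expand (a_hi + a_lo) * (b_hi + b_lo) and accumulate the
// partial products in a wider accumulator, as a Tensor-Core-style unit does
// when multiplying narrow inputs into an FP32 accumulator.
static double emulated_mul(double a, double b) {
    float a_hi, a_lo, b_hi, b_lo;
    split(a, a_hi, a_lo);
    split(b, b_hi, b_lo);
    double acc = 0.0;
    acc += static_cast<double>(a_hi) * b_hi;  // dominant term
    acc += static_cast<double>(a_hi) * b_lo;  // cross terms refine precision
    acc += static_cast<double>(a_lo) * b_hi;
    // a_lo * b_lo is typically dropped: it falls below the target precision.
    return acc;
}

int main() {
    double a = 3.14159265358979, b = 2.71828182845905;
    printf("exact     : %.17g\n", a * b);
    printf("emulated  : %.17g\n", emulated_mul(a, b));
    printf("naive f32 : %.17g\n",
           static_cast<double>(static_cast<float>(a) * static_cast<float>(b)));
    return 0;
}
```

Running the sketch shows the emulated product agreeing with the exact one to far more digits than the naive single-precision product, which is the effect the error analysis in such emulation strategies quantifies.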