DOI: 10.1145/3461615.3491106

Improving Model Stability and Training Efficiency in Fast, High Quality Expressive Voice Conversion System

Published: 17 December 2021

ABSTRACT

Voice conversion (VC) systems have made significant progress owing to advanced deep learning methods. Current research is concerned not only with high-quality, fast audio synthesis but also with richer expressiveness. The most popular VC system is constructed by concatenating an automatic speech recognition module with a text-to-speech module (ASR-TTS). However, this design suffers from recognition and pronunciation errors and requires a large amount of data to pre-train the ASR model. We propose an approach to improve the model stability and training efficiency of a VC system. First, a data redundancy reduction method balances the vocabulary distribution so that uncommon words are not ignored during training. Second, adding a connectionist temporal classification (CTC) loss reduces the word error rate (WER) of our system to 3.02%, which is 5.63 percentage points lower than that of the ASR-TTS system (8.65%), while the inference speed of our VC system (real-time rate 19.32) is much higher than that of the baseline system (real-time rate 2.24). Finally, an emotional embedding is added to the pre-trained VC system to generate expressive speech conversion. The results show that, after fine-tuning on a multi-emotional dataset, the system achieves high-quality and expressive speech synthesis.
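
The abstract names the auxiliary CTC loss only at a high level. The sketch below shows one common way such a loss is combined with an attention-based sequence loss (joint CTC-attention training). The class name, the 0.3 weight, the tensor shapes, and the dummy data are illustrative assumptions for this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class JointCTCAttentionLoss(nn.Module):
    # Weighted sum of an attention-decoder loss and an auxiliary CTC loss:
    #   L = (1 - lambda) * L_attention + lambda * L_ctc
    # The weight, shapes, and names here are assumptions for this sketch.

    def __init__(self, ctc_weight: float = 0.3, blank_id: int = 0):
        super().__init__()
        self.ctc_weight = ctc_weight
        self.att_loss = nn.CrossEntropyLoss()                          # decoder branch
        self.ctc_loss = nn.CTCLoss(blank=blank_id, zero_infinity=True) # encoder branch

    def forward(self, att_logits, att_targets, ctc_log_probs,
                ctc_targets, input_lengths, target_lengths):
        # Attention branch: logits (batch, dec_steps, vocab) vs. targets (batch, dec_steps)
        l_att = self.att_loss(att_logits.transpose(1, 2), att_targets)
        # CTC branch: log-probabilities (enc_frames, batch, vocab), labels exclude the blank id
        l_ctc = self.ctc_loss(ctc_log_probs, ctc_targets,
                              input_lengths, target_lengths)
        return (1.0 - self.ctc_weight) * l_att + self.ctc_weight * l_ctc


if __name__ == "__main__":
    batch, enc_frames, dec_steps, vocab = 2, 50, 10, 40
    criterion = JointCTCAttentionLoss(ctc_weight=0.3)
    loss = criterion(
        att_logits=torch.randn(batch, dec_steps, vocab),
        att_targets=torch.randint(1, vocab, (batch, dec_steps)),
        ctc_log_probs=torch.randn(enc_frames, batch, vocab).log_softmax(-1),
        ctc_targets=torch.randint(1, vocab, (batch, dec_steps)),
        input_lengths=torch.full((batch,), enc_frames, dtype=torch.long),
        target_lengths=torch.full((batch,), dec_steps, dtype=torch.long),
    )
    print(f"joint loss: {loss.item():.4f}")

In this formulation the CTC term constrains the encoder toward monotonic alignments, which is the usual motivation for adding it alongside an attention loss; the abstract reports the resulting WER improvement but does not specify the weighting scheme.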


        • Published in

          ICMI '21 Companion: Companion Publication of the 2021 International Conference on Multimodal Interaction
          October 2021
          418 pages
          ISBN: 9781450384711
          DOI: 10.1145/3461615

          Copyright © 2021 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 17 December 2021

          Qualifiers

          • short-paper
          • Research
          • Refereed limited

          Acceptance Rates

          Overall Acceptance Rate: 453 of 1,080 submissions, 42%