ABSTRACT
Voice conversion (VC) systems have made significant progress owing to advances in deep learning. Current research targets not only high-quality, fast audio synthesis but also richer expressiveness. The most popular VC architecture concatenates an automatic speech recognition (ASR) module with a text-to-speech (TTS) module (ASR-TTS). However, this pipeline suffers from recognition and pronunciation errors, and it requires a large amount of data to pre-train the ASR model. We propose an approach that improves the model stability and training efficiency of a VC system. First, a data redundancy reduction method balances the vocabulary distribution so that uncommon words are not ignored during training. Second, adding a connectionist temporal classification (CTC) loss reduces the word error rate (WER) of our system to 3.02%, which is 5.63 percentage points lower than that of the ASR-TTS baseline (8.65%), while our system's inference speed (real-time rate 19.32) is far higher than the baseline's (real-time rate 2.24). Finally, an emotional embedding is added to the pre-trained VC system to generate expressive converted speech. The results show that, after fine-tuning on a multi-emotional dataset, the system achieves high-quality and expressive speech synthesis.
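An auxiliary CTC loss of the kind mentioned above is typically implemented as a weighted term added to the main spectrogram reconstruction objective, tying encoder frames to the transcript and stabilising alignment. The PyTorch sketch below illustrates the general technique only; the function name `joint_loss`, the tensor names, and the loss weight `lambda_ctc` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(decoder_out, target_mel, encoder_logits, token_ids,
               frame_lengths, token_lengths, lambda_ctc=0.5):
    """Mel reconstruction loss plus a weighted auxiliary CTC term."""
    # Main VC objective: distance between predicted and target mel frames.
    recon_loss = F.l1_loss(decoder_out, target_mel)

    # Auxiliary CTC objective: encoder_logits has shape (T, batch, vocab),
    # with index 0 reserved for the CTC blank symbol.
    log_probs = F.log_softmax(encoder_logits, dim=-1)
    ctc_loss = F.ctc_loss(log_probs, token_ids,
                          frame_lengths, token_lengths,
                          blank=0, zero_infinity=True)
    return recon_loss + lambda_ctc * ctc_loss
```

In this formulation, `lambda_ctc` trades off synthesis quality against linguistic accuracy; the hedged value of 0.5 is a placeholder, and in practice such a weight would be tuned on a development set.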