ABSTRACT
Voice conversion (VC) systems have made significant progress owing to advances in deep learning. Current research targets not only high-quality, fast audio synthesis but also richer expressiveness. The most popular VC architecture concatenates an automatic speech recognition (ASR) module with a text-to-speech (TTS) module (ASR-TTS). However, this pipeline suffers from recognition and pronunciation errors, and it requires a large amount of data to pre-train the ASR model. We propose an approach that improves the model stability and training efficiency of a VC system. First, a data redundancy reduction method balances the vocabulary distribution so that uncommon words are not ignored during training. Second, adding a connectionist temporal classification (CTC) loss reduces the word error rate (WER) of our system to 3.02%, which is 5.63 percentage points lower than that of the ASR-TTS baseline (8.65%), while our system's inference speed (real-time rate 19.32) is far higher than the baseline's (real-time rate 2.24). Finally, an emotional embedding is added to the pre-trained VC system to generate expressive converted speech. The results show that, after fine-tuning on a multi-emotional dataset, the system achieves high-quality and expressive speech synthesis.
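An auxiliary CTC loss of the kind mentioned above is typically implemented as a weighted term added to the main spectrogram reconstruction objective, tying encoder frames to the transcript and stabilising alignment. The PyTorch sketch below illustrates the general technique only; the function name `joint_loss`, the tensor names, and the loss weight `lambda_ctc` are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def joint_loss(decoder_out, target_mel, encoder_logits, token_ids,
               frame_lengths, token_lengths, lambda_ctc=0.5):
    """Mel reconstruction loss plus a weighted auxiliary CTC term."""
    # Main VC objective: distance between predicted and target mel frames.
    recon_loss = F.l1_loss(decoder_out, target_mel)

    # Auxiliary CTC objective: encoder_logits has shape (T, batch, vocab),
    # with index 0 reserved for the CTC blank symbol.
    log_probs = F.log_softmax(encoder_logits, dim=-1)
    ctc_loss = F.ctc_loss(log_probs, token_ids,
                          frame_lengths, token_lengths,
                          blank=0, zero_infinity=True)
    return recon_loss + lambda_ctc * ctc_loss
```

In this formulation, `lambda_ctc` trades off synthesis quality against linguistic accuracy; the hedged value of 0.5 is a placeholder, and in practice such a weight would be tuned on a development set.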