TY - JOUR
T1 - Integrating articulatory information in deep learning-based text-to-speech synthesis
AU - Cao, Beiming
AU - Kim, Myungjong
AU - Van Santen, Jan
AU - Mau, Ted
AU - Wang, Jun
N1 - Funding Information:
This work was supported by the National Institutes of Health (NIH) under award number R03 DC013990 and by the American Speech-Language-Hearing Foundation through a New Century Scholar Research Grant. We thank Kristin Teplansky, Katie Purdum, and the volunteering participants.
Publisher Copyright:
Copyright © 2017 ISCA.
PY - 2017
Y1 - 2017
N2 - Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated into deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgments by human listeners showed that the approaches integrating articulatory information outperformed the baseline approach (without articulatory information) in terms of naturalness and speaker voice identity (voice similarity).
AB - Articulatory information has been shown to be effective in improving the performance of hidden Markov model (HMM)-based text-to-speech (TTS) synthesis. Recently, deep learning-based TTS has outperformed HMM-based approaches. However, articulatory information has rarely been integrated into deep learning-based TTS. This paper investigated the effectiveness of integrating articulatory movement data into deep learning-based TTS. The integration of articulatory information was achieved in two ways: (1) direct integration, where articulatory and acoustic features were the output of a deep neural network (DNN), and (2) direct integration plus forward-mapping, where the output articulatory features were mapped to acoustic features by an additional DNN; these forward-mapped acoustic features were then combined with the output acoustic features to produce the final acoustic features. Articulatory (tongue and lip) and acoustic data collected from male and female speakers were used in the experiment. Both objective measures and subjective judgments by human listeners showed that the approaches integrating articulatory information outperformed the baseline approach (without articulatory information) in terms of naturalness and speaker voice identity (voice similarity).
KW - Text-to-speech synthesis
KW - articulatory data
KW - deep learning
KW - deep neural network
UR - http://www.scopus.com/inward/record.url?scp=85039167284&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85039167284&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2017-1762
DO - 10.21437/Interspeech.2017-1762
M3 - Conference article
AN - SCOPUS:85039167284
SN - 2308-457X
VL - 2017-August
SP - 254
EP - 258
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 18th Annual Conference of the International Speech Communication Association, INTERSPEECH 2017
Y2 - 20 August 2017 through 24 August 2017
ER -