Abstract
We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template-based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
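As a rough illustration of what "superpositional" means here, the sketch below adds one phrase curve and a set of per-foot accent curves to obtain an F0 contour. It is not the paper's implementation: the abstract does not specify curve shapes or the domain of superposition, so the linear phrase declination, the raised-cosine accent shape, addition in Hz, and all function and parameter names (`phrase_curve`, `accent_curve`, `f0_contour`) are assumptions made for illustration only.

```python
import math

def phrase_curve(t, start_hz, end_hz, duration):
    """Phrase curve: linear declination from start_hz to end_hz (assumed shape)."""
    frac = min(max(t / duration, 0.0), 1.0)
    return start_hz + (end_hz - start_hz) * frac

def accent_curve(t, foot_start, foot_end, peak_hz):
    """Accent curve tied to one foot: raised-cosine bump (assumed shape).

    Returns 0 outside the foot, so each foot only affects its own time span.
    """
    if not (foot_start <= t <= foot_end):
        return 0.0
    phase = (t - foot_start) / (foot_end - foot_start)
    return peak_hz * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))

def f0_contour(times, phrase, feet):
    """Superpose the phrase curve and all accent curves (additive in Hz, assumed).

    phrase: (start_hz, end_hz, duration)
    feet:   list of (foot_start, foot_end, peak_hz), one entry per foot
    """
    return [
        phrase_curve(t, *phrase) + sum(accent_curve(t, *foot) for foot in feet)
        for t in times
    ]

# Hypothetical values: a 1-second phrase with a single foot spanning 0.0-0.5 s.
contour = f0_contour(
    times=[0.0, 0.25, 0.5, 1.0],
    phrase=(120.0, 100.0, 1.0),
    feet=[(0.0, 0.5, 30.0)],
)
```

In a setup like the one the abstract describes, the per-foot parameters (here only `peak_hz`) would be predicted from the foot-level annotations rather than set by hand.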
Original language | English (US) |
---|---|
Pages (from-to) | 1009-1013 |
Number of pages | 5 |
Journal | Proceedings of the International Conference on Speech Prosody |
Volume | 2016-January |
State | Published - 2016 |
Event | 8th Speech Prosody 2016 - Boston, United States; Duration: May 31 2016 → Jun 3 2016 |
Keywords
- Artificial neural networks
- Intonation modeling
- Prosody
- Text-to-speech synthesis
ASJC Scopus subject areas
- Language and Linguistics
- Linguistics and Language