Data-driven foot-based intonation generator for text-to-speech synthesis

Mahsa Sadat Elyasi Langarani, Jan Van Santen, Seyed Hamidreza Mohammadi, Alexander Kain

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.
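The synthesis-time search described in the abstract can be illustrated with a small sketch. The abstract does not give the cost functions or the search procedure, so everything specific here is an assumption made for illustration: the names `StoredFoot`, `select_accent_curves`, and `join_weight` are hypothetical, the feature match is scored with squared Euclidean distance, the height continuity term is an absolute difference, and the search is a generic Viterbi-style dynamic program rather than the authors' actual algorithm.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StoredFoot:
    """A training foot with its fitted accent curve (hypothetical structure)."""
    features: Sequence[float]  # assumed encoding, e.g. (foot position, foot length)
    height: float              # assumed summary of the accent curve: its peak height


def select_accent_curves(
    targets: Sequence[Sequence[float]],
    inventory: Sequence[StoredFoot],
    join_weight: float = 1.0,
) -> List[int]:
    """Pick one stored foot per target foot, minimizing feature mismatch
    plus a penalty on height differences between successive picks.

    Returns indices into `inventory`. Both cost terms are assumptions.
    """
    if not targets or not inventory:
        return []

    def target_cost(target: Sequence[float], cand: StoredFoot) -> float:
        # Squared Euclidean distance in feature space (assumed metric).
        return sum((a - b) ** 2 for a, b in zip(target, cand.features))

    n = len(inventory)
    # cost[j]: best total cost of any path ending in candidate j so far.
    cost = [target_cost(targets[0], cand) for cand in inventory]
    back: List[List[int]] = []  # back-pointers, one list per later target

    for target in targets[1:]:
        new_cost, pointers = [], []
        for cand in inventory:
            # Best predecessor given the height-continuity penalty.
            best_i = min(
                range(n),
                key=lambda i: cost[i]
                + join_weight * abs(inventory[i].height - cand.height),
            )
            join = join_weight * abs(inventory[best_i].height - cand.height)
            new_cost.append(cost[best_i] + join + target_cost(target, cand))
            pointers.append(best_i)
        cost = new_cost
        back.append(pointers)

    # Trace the cheapest path backwards.
    j = min(range(n), key=lambda i: cost[i])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path
```

In this sketch, raising `join_weight` favors sequences whose accent curve heights change smoothly from foot to foot, while setting it to zero reduces the search to an independent nearest-match lookup per foot.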

Original language: English (US)
Pages (from-to): 1596-1600
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2015-January
State: Published - 2015
Event: 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration: Sep 6 2015 to Sep 10 2015

Keywords

  • Intonation modeling
  • Prosody
  • Text-to-Speech Synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation
