Foot-based intonation for text-to-speech synthesis using neural networks

Mahsa Sadat Elyasi Langarani, Jan van Santen

Research output: Contribution to journal › Conference article › peer-review


We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template-based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
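The superpositional idea in the abstract, an F0 contour formed by adding foot-level accent curves to a phrase curve, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear declination used as the phrase curve, the raised-cosine accent shape, addition in Hz, and the names `Foot`, `phrase_curve`, `accent_curve`, and `f0_contour`. The neural network that estimates model parameters is not shown; the accent amplitudes are simply given by hand.

```python
import math
from dataclasses import dataclass


@dataclass
class Foot:
    start: float      # foot start time in seconds
    end: float        # foot end time in seconds
    amplitude: float  # accent peak height in Hz (hypothetical parameter)


def phrase_curve(t, duration, start_hz=120.0, end_hz=90.0):
    """Assumed phrase curve: linear declination from start_hz to end_hz."""
    frac = min(max(t / duration, 0.0), 1.0)
    return start_hz + (end_hz - start_hz) * frac


def accent_curve(t, foot):
    """Assumed accent shape: raised-cosine bump, zero outside the foot,
    peaking at the foot midpoint."""
    if not (foot.start <= t <= foot.end):
        return 0.0
    phase = (t - foot.start) / (foot.end - foot.start)
    return foot.amplitude * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))


def f0_contour(feet, duration, step=0.01):
    """Sample the sum of the phrase curve and all accent curves every `step` seconds."""
    n = int(round(duration / step)) + 1
    contour = []
    for i in range(n):
        t = i * step
        f0 = phrase_curve(t, duration) + sum(accent_curve(t, f) for f in feet)
        contour.append((t, f0))
    return contour


if __name__ == "__main__":
    # Two hand-specified feet in a one-second phrase (illustrative values only).
    feet = [Foot(0.0, 0.4, 30.0), Foot(0.5, 1.0, 20.0)]
    for t, f0 in f0_contour(feet, duration=1.0, step=0.1):
        print(f"{t:.1f} s  {f0:.1f} Hz")
```

In a system along the lines described above, per-foot parameters such as `amplitude` would be predicted from foot features (for example, position in the sentence and number of syllables) rather than set manually.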

Original language: English (US)
Pages (from-to): 1009-1013
Number of pages: 5
Journal: Proceedings of the International Conference on Speech Prosody
State: Published - 2016
Event: 8th Speech Prosody 2016 - Boston, United States
Duration: May 31 2016 → Jun 3 2016


  • Artificial neural networks
  • Intonation modeling
  • Prosody
  • Text-to-speech synthesis

ASJC Scopus subject areas

  • Language and Linguistics
  • Linguistics and Language

