Foot-based intonation for text-to-speech synthesis using neural networks

Mahsa Sadat Elyasi Langarani; Jan van Santen

Institute on Development and Disability

Research output: Contribution to journal › Conference article › peer-review

Abstract

We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template-based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.
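
As a rough illustration of what "superpositional" means here, the sketch below adds one accent curve per foot to a single phrase curve to obtain an F0 contour. Everything specific in it is an assumption made for illustration and is not taken from the paper: the linearly declining phrase curve, the raised-cosine accent shape, addition in Hz rather than in a log domain, and the function names phrase_curve, accent_curve, and f0_contour. In FONN the curve parameters would be estimated by a neural network; here they are simply hand-picked numbers.

```python
import math

def phrase_curve(t, duration, start_hz=120.0, end_hz=90.0):
    """Hypothetical phrase curve: a straight-line decline over the utterance (Hz)."""
    frac = min(max(t / duration, 0.0), 1.0)
    return start_hz + (end_hz - start_hz) * frac

def accent_curve(t, foot_start, foot_end, amplitude_hz):
    """Hypothetical accent curve: a raised-cosine bump spanning one foot, zero elsewhere."""
    if not (foot_start <= t <= foot_end):
        return 0.0
    phase = (t - foot_start) / (foot_end - foot_start)
    return amplitude_hz * 0.5 * (1.0 - math.cos(2.0 * math.pi * phase))

def f0_contour(times, duration, feet):
    """Superpose the phrase curve and one accent curve per foot.

    feet is a list of (start_s, end_s, amplitude_hz) tuples; these values are
    hand-picked here, standing in for parameters a trained model would predict.
    """
    return [
        phrase_curve(t, duration) + sum(accent_curve(t, s, e, a) for s, e, a in feet)
        for t in times
    ]

if __name__ == "__main__":
    feet = [(0.0, 0.5, 30.0), (0.5, 1.0, 20.0)]  # two feet in a 1-second utterance
    times = [0.0, 0.25, 0.5, 0.75, 1.0]
    for t, f0 in zip(times, f0_contour(times, 1.0, feet)):
        print(f"t={t:.2f}s  F0={f0:.1f} Hz")
```

Each accent bump peaks at the middle of its foot and returns to zero at the foot boundaries, so the contour falls back onto the phrase curve between feet.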

Original language	English (US)
Pages (from-to)	1009-1013
Number of pages	5
Journal	Proceedings of the International Conference on Speech Prosody
Volume	2016-January
State	Published - 2016
Event	8th Speech Prosody 2016 - Boston, United States
Duration	May 31 2016 → Jun 3 2016

Keywords

Artificial neural networks
Intonation modeling
Prosody
Text-to-speech synthesis

ASJC Scopus subject areas

Language and Linguistics
Linguistics and Language

Cite this

@article{6830e56955fd43d68e3e0952cb300328,

title = "Foot-based intonation for text-to-speech synthesis using neural networks",

abstract = "We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly-covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.",

keywords = "Artificial neural networks, Intonation modeling, Prosody, Text-to-speech synthesis",

author = "Langarani, {Mahsa Sadat Elyasi} and {van Santen}, Jan",

year = "2016",

language = "English (US)",

volume = "2016-January",

pages = "1009--1013",

journal = "Proceedings of the International Conference on Speech Prosody",

issn = "2333-2042",

}

TY - JOUR

T1 - Foot-based intonation for text-to-speech synthesis using neural networks

AU - Langarani, Mahsa Sadat Elyasi

AU - van Santen, Jan

PY - 2016

Y1 - 2016

N2 - We propose a method (“FONN”) for F0 contour generation for text-to-speech synthesis. Training speech is automatically segmented into left-headed feet, annotated with syllable start/end times, foot position in the sentence, and the number of syllables in the foot. During training, we fit a superpositional intonation model comprising accent curves associated with feet and phrase curves. We propose to use a neural network for model parameter estimation. We tested the method against the HMM-based Speech Synthesis System (HTS) as well as against a template based variant of FONN (“DRIFT”) by imposing contours generated by the methods onto natural speech and obtaining quality ratings. Test sets varied in degree of coverage by training data. Contours generated by DRIFT and FONN were strongly preferred over HTS-generated contours, especially for poorly-covered test items, with DRIFT slightly preferred over FONN. We conclude that the new methods hold promise for high-quality F0 contour generation while making efficient use of training data.

KW - Artificial neural networks

KW - Intonation modeling

KW - Prosody

KW - Text-to-speech synthesis

UR - http://www.scopus.com/inward/record.url?scp=84982994414&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84982994414&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84982994414

SN - 2333-2042

VL - 2016-January

SP - 1009

EP - 1013

JO - Proceedings of the International Conference on Speech Prosody

JF - Proceedings of the International Conference on Speech Prosody

T2 - 8th Speech Prosody 2016

Y2 - 31 May 2016 through 3 June 2016

ER -
