Data-driven foot-based intonation generator for text-to-speech synthesis

Mahsa Sadat Elyasi Langarani; Jan Van Santen; Seyed Hamidreza Mohammadi; Alexander Kain

Institute on Development and Disability

Research output: Contribution to journal › Conference article › peer-review

3 Scopus citations

Abstract

We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.
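
The synthesis-time search described in the abstract can be pictured as a dynamic-programming selection over stored feet. The following is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not give the cost functions, so the squared-distance target cost, the absolute height difference used as a join cost, the `join_weight` parameter, and all names (`StoredFoot`, `select_accent_curves`) are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class StoredFoot:
    """Hypothetical training-set foot with the height of its fitted accent curve."""
    features: Sequence[float]  # assumed encoding, e.g. (foot position, foot length)
    height: float              # assumed summary of the fitted accent curve


def target_cost(target: Sequence[float], cand: StoredFoot) -> float:
    # Assumption: squared Euclidean distance in the feature space.
    return sum((t - c) ** 2 for t, c in zip(target, cand.features))


def join_cost(prev: StoredFoot, cur: StoredFoot) -> float:
    # Assumption: absolute difference between successive accent curve heights.
    return abs(prev.height - cur.height)


def select_accent_curves(
    targets: Sequence[Sequence[float]],
    inventory: Sequence[StoredFoot],
    join_weight: float = 1.0,
) -> List[int]:
    """Return one inventory index per target foot, minimizing the summed cost."""
    if not targets or not inventory:
        return []
    n = len(inventory)
    # best[j]: lowest cost of any path that ends in inventory[j] at the current foot
    best = [target_cost(targets[0], cand) for cand in inventory]
    back: List[List[int]] = []
    for target in targets[1:]:
        new_best, pointers = [], []
        for cur in inventory:
            # Cheapest predecessor for this candidate, including the join cost.
            i = min(
                range(n),
                key=lambda k: best[k] + join_weight * join_cost(inventory[k], cur),
            )
            step = join_weight * join_cost(inventory[i], cur)
            new_best.append(best[i] + step + target_cost(target, cur))
            pointers.append(i)
        best = new_best
        back.append(pointers)
    # Trace the cheapest path back from the last foot to the first.
    j = min(range(n), key=lambda k: best[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


if __name__ == "__main__":
    inventory = [
        StoredFoot((0.0, 2.0), 10.0),
        StoredFoot((1.0, 3.0), 40.0),
        StoredFoot((1.0, 3.0), 12.0),
    ]
    targets = [(0.0, 2.0), (1.0, 3.0)]
    print(select_accent_curves(targets, inventory, join_weight=0.1))
```

In this toy inventory the second and third stored feet match the second target equally well in feature space, so only the join cost separates them: with a nonzero `join_weight` the search favors the candidate whose height is closer to that of the preceding foot.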

Original language	English (US)
Pages (from-to)	1596-1600
Number of pages	5
Journal	Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume	2015-January
State	Published - 2015
Event	16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 - Dresden, Germany
Duration	Sep 6 2015 → Sep 10 2015

Keywords

Intonation modeling
Prosody
Text-to-Speech Synthesis

ASJC Scopus subject areas

Language and Linguistics
Human-Computer Interaction
Signal Processing
Software
Modeling and Simulation

Cite this

Data-driven foot-based intonation generator for text-to-speech synthesis. / Langarani, Mahsa Sadat Elyasi; Van Santen, Jan; Mohammadi, Seyed Hamidreza et al.
In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2015-January, 2015, p. 1596-1600.

@article{8ce2deea673a417ea80f6b58d1a877bb,

title = "Data-driven foot-based intonation generator for text-to-speech synthesis",

abstract = "We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.",

keywords = "Intonation modeling, Prosody, Text-to-Speech Synthesis",

author = "Langarani, {Mahsa Sadat Elyasi} and {Van Santen}, Jan and Mohammadi, {Seyed Hamidreza} and Kain, Alexander",

note = "Publisher Copyright: Copyright {\textcopyright} 2015 ISCA.; 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015 ; Conference date: 06-09-2015 Through 10-09-2015",

year = "2015",

language = "English (US)",

volume = "2015-January",

pages = "1596--1600",

journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",

issn = "2308-457X",

}

TY - JOUR

T1 - Data-driven foot-based intonation generator for text-to-speech synthesis

AU - Langarani, Mahsa Sadat Elyasi

AU - Van Santen, Jan

AU - Mohammadi, Seyed Hamidreza

AU - Kain, Alexander

PY - 2015

Y1 - 2015

N2 - We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.

AB - We propose a method for generating F0 contours for text-to-speech synthesis. Training speech is automatically annotated in terms of feet, with features indicating start and end times of syllables, foot position, and foot length. During training, we fit a foot-based superpositional intonation model comprising accent curves and phrase curves. During synthesis, the method searches for stored, fitted accent curves associated with feet that optimally match to-be-synthesized feet in the feature space, while minimizing differences between successive accent curve heights. We tested the proposed method against the HMM-based Speech Synthesis System (HTS) by imposing contours generated by these two methods onto natural speech, and obtaining quality ratings. Test sets varied in how well they were covered by the training data. Contours generated by the proposed method were preferred over HTS-generated contours, especially for poorly-covered test items. To test the new method's usefulness for processing marked-up text input, we compared its ability to convey contrastive stress with that of natural speech recordings, and found no difference. We conclude that the new method holds promise for generating comparatively high-quality F0 contours, especially when training data are sparse and when mark-up is required.

KW - Intonation modeling

KW - Prosody

KW - Text-to-Speech Synthesis

UR - http://www.scopus.com/inward/record.url?scp=84959075368&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959075368&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84959075368

SN - 2308-457X

VL - 2015-January

SP - 1596

EP - 1600

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

T2 - 16th Annual Conference of the International Speech Communication Association, INTERSPEECH 2015

Y2 - 6 September 2015 through 10 September 2015

ER -
