TY - GEN
T1 - Spectral voice conversion for text-to-speech synthesis
AU - Kain, A.
AU - Macon, M. W.
PY - 1998
Y1 - 1998
N2 - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.
AB - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.
UR - http://www.scopus.com/inward/record.url?scp=0031623661&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031623661&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.1998.674423
DO - 10.1109/ICASSP.1998.674423
M3 - Conference contribution
AN - SCOPUS:0031623661
SN - 0780344286
SN - 9780780344280
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 285
EP - 288
BT - Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 1998
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1998 23rd IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 1998
Y2 - 12 May 1998 through 15 May 1998
ER -