TY - GEN
T1 - Spectral voice conversion for text-to-speech synthesis
AU - Kain, A.
AU - Macon, M. W.
PY - 1998
Y1 - 1998
N2 - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.
AB - A new voice conversion algorithm that modifies a source speaker's speech to sound as if produced by a target speaker is presented. It is applied to a residual-excited LPC text-to-speech diphone synthesizer. Spectral parameters are mapped using a locally linear transformation based on Gaussian mixture models whose parameters are trained by joint density estimation. The LPC residuals are adjusted to match the target speaker's average pitch. To study effects of the amount of training on performance, data sets of varying sizes are created by automatically selecting subsets of all available diphones by a vector quantization method. In an objective evaluation, the proposed method is found to perform more reliably for small training sets than a previous approach. In perceptual tests, it was shown that nearly optimal spectral conversion performance was achieved, even with a small amount of training data. However, speech quality improved with increases in the training set size.
UR - http://www.scopus.com/inward/record.url?scp=0031623661&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0031623661&partnerID=8YFLogxK
U2 - 10.1109/ICASSP.1998.674423
DO - 10.1109/ICASSP.1998.674423
M3 - Conference contribution
AN - SCOPUS:0031623661
SN - 0780344286
SN - 9780780344280
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 285
EP - 288
BT - Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 1998
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 1998 23rd IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 1998
Y2 - 12 May 1998 through 15 May 1998
ER -