TY - JOUR
T1 - The contribution of various sources of spectral mismatch to audible discontinuities in a diphone database
AU - Klabbers, Esther
AU - Van Santen, Jan P.H.
AU - Kain, Alexander
N1 - Funding Information:
Manuscript received January 5, 2006; revised July 14, 2006. This work was supported by the National Science Foundation under Grant 0313383: “Objective Methods for Predicting and Optimizing Synthetic Speech Quality.” The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Bayya Yegnanarayana.
PY - 2007/3
Y1 - 2007/3
AB - One of the major problems in concatenative synthesis is the occurrence of audible discontinuities between two successive concatenative units. Several studies have attempted to discover objective distance measures that predict the audibility of these discontinuities. In this paper, we investigate mid-vowel joins for three vowels with a range of post-vocalic consonant contexts typical for diphone databases. A first perceptual experiment uses a pairwise comparison procedure to find two subsets of unit combinations: those with versus without audible discontinuities. A second perceptual experiment uses these two subsets in a procedure where formant resynthesis is used to manipulate three sources of discontinuity separately: formant frequencies, formant bandwidths, and overall energy. Results show mismatch in formant frequencies provides the largest contribution to audible discontinuity, followed by mismatch in overall energy.
KW - Audible discontinuities
KW - Diphones
KW - Spectral distance measures
KW - Speech synthesis
UR - http://www.scopus.com/inward/record.url?scp=56149089359&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=56149089359&partnerID=8YFLogxK
U2 - 10.1109/TASL.2006.885250
DO - 10.1109/TASL.2006.885250
M3 - Article
AN - SCOPUS:56149089359
SN - 1558-7916
VL - 15
SP - 949
EP - 956
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
IS - 3
M1 - 4100687
ER -