TY - GEN
T1 - Multirate ASR models for phone-class dependent N-best list rescoring
AU - Gadde, Venkata R.
AU - Sönmez, Kemal
AU - Franco, Horacio
PY - 2005
Y1 - 2005
N2 - Speech comprises a variety of acoustical phenomena occurring at differing rates. Fixed-rate ASR systems in effect assume a constant temporal rate of information flow by incorporating uniform statistics in proportion to a sound's duration. The usual window length of 25-30 milliseconds represents a time-frequency resolution compromise: it aims to follow changes in the spectral trajectories with reasonable speed while retaining a sufficient number of samples to estimate the harmonic structure. In this work, we describe a technique that augments a recognizer using this compromise with information from multiple-rate spectral models that emphasize either better time or better frequency resolution, in order to improve performance. The main idea is to use the hypotheses generated by a fixed-rate recognizer to determine the appropriate model rate for a segment of the speech waveform. This is realized by rescoring N-best lists with acoustical models that use different temporal windows, using a phone-dependent posterior-like score. We report results on the NIST Evaluation 2002 dataset and demonstrate that the rescoring method improves the word error rate (WER) of a baseline system.
UR - http://www.scopus.com/inward/record.url?scp=33846247945&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33846247945&partnerID=8YFLogxK
U2 - 10.1109/ASRU.2005.1566513
DO - 10.1109/ASRU.2005.1566513
M3 - Conference contribution
AN - SCOPUS:33846247945
SN - 0780394798
SN - 9780780394797
T3 - Proceedings of ASRU 2005: 2005 IEEE Automatic Speech Recognition and Understanding Workshop
SP - 157
EP - 161
BT - Proceedings of ASRU 2005
PB - IEEE Computer Society
T2 - ASRU 2005: 2005 IEEE Automatic Speech Recognition and Understanding Workshop
Y2 - 27 November 2005 through 1 December 2005
ER -