Recent innovations in speech-to-text transcription at SRI-ICSI-UW

Andreas Stolcke, Barry Chen, Horacio Franco, Venkata Ramana Rao Gadde, Martin Graciarena, Mei Yuh Hwang, Katrin Kirchhoff, Arindam Mandal, Nelson Morgan, Xin Lei, Tim Ng, Mari Ostendorf, Kemal Sönmez, Anand Venkataraman, Dimitra Vergyri, Wen Wang, Jing Zheng, Qifeng Zhu

Research output: Contribution to journal › Article › peer-review

71 Scopus citations


We summarize recent progress in automatic speech-to-text transcription at SRI, ICSI, and the University of Washington. The work encompasses all components of speech modeling found in a state-of-the-art recognition system, from acoustic features, to acoustic modeling and adaptation, to language modeling. In the front end, we experimented with nonstandard features, including various measures of voicing, discriminative phone posterior features estimated by multilayer perceptrons, and a novel phone-level macro-averaging for cepstral normalization. Acoustic modeling was improved with combinations of front ends operating at multiple frame rates, as well as by modifications to the standard methods for discriminative Gaussian estimation. We show that acoustic adaptation can be improved by predicting the optimal regression class complexity for a given speaker. Language modeling innovations include the use of a syntax-motivated almost-parsing language model, as well as principled vocabulary-selection techniques. Finally, we address portability issues, such as the use of imperfect training transcripts, and language-specific adjustments required for recognition of Arabic and Mandarin.
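The abstract names phone-level macro-averaging for cepstral normalization but does not spell out the procedure. The sketch below illustrates one plausible reading, assuming the usual meaning of "macro-averaging": frames are first averaged within each phone class, and the per-phone means are then averaged with equal weight, so that phones which dominate an utterance do not dominate the mean. It assumes frame-level phone labels are already available (for example from a first-pass alignment); the function names `macro_averaged_mean`, `frame_averaged_mean`, and `normalize` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def frame_averaged_mean(frames):
    """Conventional cepstral mean: every frame carries equal weight."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def macro_averaged_mean(frames, phone_labels):
    """Hypothetical macro-averaged mean: average frames within each phone
    class, then average the per-class means with equal weight per class.
    Assumes one phone label per frame (e.g. from a first-pass alignment)."""
    if not frames or len(frames) != len(phone_labels):
        raise ValueError("need one phone label per frame and at least one frame")
    dim = len(frames[0])
    sums = defaultdict(lambda: [0.0] * dim)  # per-phone running sums
    counts = defaultdict(int)                # per-phone frame counts
    for frame, label in zip(frames, phone_labels):
        acc = sums[label]
        for i, value in enumerate(frame):
            acc[i] += value
        counts[label] += 1
    class_means = [[s / counts[label] for s in sums[label]] for label in sums]
    return [sum(m[i] for m in class_means) / len(class_means) for i in range(dim)]


def normalize(frames, mean):
    """Subtract a cepstral mean vector from every frame."""
    return [[v - m for v, m in zip(frame, mean)] for frame in frames]


if __name__ == "__main__":
    # Toy 2-dimensional "cepstra": three frames of one phone, one of another.
    frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [4.0, 2.0]]
    labels = ["aa", "aa", "aa", "s"]
    print(frame_averaged_mean(frames))          # [1.75, 0.5]
    print(macro_averaged_mean(frames, labels))  # [2.5, 1.0]
```

In the toy input the frequent phone pulls the frame-averaged mean toward its own value, while the macro-averaged mean weighs both phones equally; under this reading, that is what makes the normalization less sensitive to the phonetic content of a particular utterance or speaker.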

Original language: English (US)
Pages (from-to): 1729-1742
Number of pages: 14
Journal: IEEE Transactions on Audio, Speech and Language Processing
Issue number: 5
State: Published - Sep 2006
Externally published: Yes


Keywords

  • Broadcast news (BN)
  • Conversational telephone speech (CTS)
  • Speech-to-text (STT)

ASJC Scopus subject areas

  • Acoustics and Ultrasonics
  • Electrical and Electronic Engineering


