TY - GEN
T1 - OPTIMIZE WAV2VEC2'S ARCHITECTURE FOR SMALL TRAINING SET THROUGH ANALYZING ITS PRE-TRAINED MODEL'S ATTENTION PATTERN
AU - Chen, Liu
AU - Asgari, Meysam
AU - Dodge, Hiroko H.
N1 - Funding Information:
This work was supported by the Oregon Roybal Center for Aging and Technology Pilot Grant (P30 AG024978) and the I-CONECT project (WWW.i-conect.org) funded by NIH R01-AG051628 and R01-AG056102.
Publisher Copyright:
© 2022 IEEE
PY - 2022
Y1 - 2022
N2 - Transformer-based automatic speech recognition (ASR) systems have shown success in the presence of large datasets. However, in medical research, we have to create ASR systems for non-typical populations, i.e., pre-school children with speech disorders, with a small training dataset. To increase training efficiency on small datasets, we optimize the architecture of Wav2Vec 2.0, a variation of the Transformer, by analyzing its pre-trained model's block-level attention patterns. We show that block-level patterns can serve as an indicator for narrowing down the optimization direction. To ensure the reproducibility of our experiments, we use Librispeech-100-clean as training data to simulate the limited-data condition. We apply two techniques, a local attention mechanism and cross-block parameter sharing, with counter-intuitive configurations. Our optimized architecture outperforms the vanilla architecture by about 1.8% absolute word error rate (WER) on dev-clean and 1.4% on test-clean.
AB - Transformer-based automatic speech recognition (ASR) systems have shown success in the presence of large datasets. However, in medical research, we have to create ASR systems for non-typical populations, i.e., pre-school children with speech disorders, with a small training dataset. To increase training efficiency on small datasets, we optimize the architecture of Wav2Vec 2.0, a variation of the Transformer, by analyzing its pre-trained model's block-level attention patterns. We show that block-level patterns can serve as an indicator for narrowing down the optimization direction. To ensure the reproducibility of our experiments, we use Librispeech-100-clean as training data to simulate the limited-data condition. We apply two techniques, a local attention mechanism and cross-block parameter sharing, with counter-intuitive configurations. Our optimized architecture outperforms the vanilla architecture by about 1.8% absolute word error rate (WER) on dev-clean and 1.4% on test-clean.
KW - Transformer
KW - architecture optimization
KW - attention pattern
KW - automatic speech recognition
KW - self-supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85134082793&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134082793&partnerID=8YFLogxK
U2 - 10.1109/ICASSP43922.2022.9747831
DO - 10.1109/ICASSP43922.2022.9747831
M3 - Conference contribution
AN - SCOPUS:85134082793
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 7112
EP - 7116
BT - 2022 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 47th IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2022
Y2 - 23 May 2022 through 27 May 2022
ER -