TY - GEN
T1 - Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)
T2 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023
AU - Gale, Robert C.
AU - Salem, Alexandra C.
AU - Fergadiotis, Gerasimos
AU - Bedrick, Steven
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Speech-language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g., phonology and semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the International Phonetic Alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word- or near-word level, and pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. On the “easy” and “hard” versions of this task, fine-tuning from our models is more accurate by a relative 24% and 29%, respectively, and improves character error rates by a relative 75% and 31%, compared to fine-tuning from the original model.
UR - http://www.scopus.com/inward/record.url?scp=85174539485&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174539485&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85174539485
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 212
EP - 225
BT - ACL 2023 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023 - Proceedings of the Workshop
A2 - Can, Burcu
A2 - Mozes, Maximilian
A2 - Cahyawijaya, Samuel
A2 - Saphra, Naomi
A2 - Kassner, Nora
A2 - Ravfogel, Shauli
A2 - Ravichander, Abhilasha
A2 - Zhao, Chen
A2 - Augenstein, Isabelle
A2 - Rogers, Anna
A2 - Cho, Kyunghyun
A2 - Grefenstette, Edward
A2 - Voita, Lena
PB - Association for Computational Linguistics (ACL)
Y2 - 13 July 2023
ER -