Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)

Robert C. Gale, Alexandra C. Salem, Gerasimos Fergadiotis, Steven Bedrick

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Speech language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g., phonology and semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the International Phonetic Alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word- or near-word level, and pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. For an “easy” and a “hard” version of this task, we show that models fine-tuned from ours are more accurate by a relative 24% and 29%, and reduce character error rates by a relative 75% and 31%, respectively, compared with models fine-tuned from the original model.
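The relative character error rate (CER) figures in the abstract can be read with the following minimal sketch. It is not the authors' evaluation code: it assumes CER is the Levenshtein distance between a reference and a hypothesis string divided by the reference length, and that "relative" means the reduction as a fraction of the baseline value. The function names and the example strings are hypothetical.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# Hypothetical example: one substituted phoneme in a three-symbol IPA target.
example_cer = cer("kæt", "kæp")
```

Under these assumptions, a relative reduction of 75% means the improved model retains one quarter of the baseline error, whatever the absolute values are. Note that this sketch counts Unicode code points, so IPA symbols carrying combining diacritics would count as more than one character.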

Original language: English (US)
Title of host publication: ACL 2023 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023 - Proceedings of the Workshop
Editors: Burcu Can, Maximilian Mozes, Samuel Cahyawijaya, Naomi Saphra, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Chen Zhao, Isabelle Augenstein, Anna Rogers, Kyunghyun Cho, Edward Grefenstette, Lena Voita
Publisher: Association for Computational Linguistics (ACL)
Pages: 212-225
Number of pages: 14
ISBN (Electronic): 9781959429777
State: Published - 2023
Event: 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023 - Toronto, Canada
Duration: Jul 13 2023 → …

Publication series

Name: Proceedings of the Annual Meeting of the Association for Computational Linguistics
ISSN (Print): 0736-587X

Conference

Conference: 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023
Country/Territory: Canada
City: Toronto
Period: 7/13/23 → …

ASJC Scopus subject areas

  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics
