TY - GEN
T1 - Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)
T2 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023, co-located with ACL 2023
AU - Gale, Robert C.
AU - Salem, Alexandra C.
AU - Fergadiotis, Gerasimos
AU - Bedrick, Steven
N1 - Publisher Copyright:
© 2023 Association for Computational Linguistics.
PY - 2023
Y1 - 2023
AB - Speech-language pathologists rely on information spanning the layers of language, often drawing from multiple layers (e.g., phonology and semantics) at once. Recent innovations in large language models (LLMs) have been shown to build powerful representations for many complex language structures, especially syntax and semantics, unlocking the potential of large datasets through self-supervised learning techniques. However, these datasets are overwhelmingly orthographic, favoring writing systems like the English alphabet, a natural but phonetically imprecise choice. Meanwhile, LLM support for the International Phonetic Alphabet (IPA) ranges from poor to absent. Further, LLMs encode text at a word- or near-word level, and pre-training tasks have little to gain from phonetic/phonemic representations. In this paper, we introduce BORT, an LLM for mixed orthography/IPA meant to overcome these limitations. To this end, we extend the pre-training of an existing LLM with our own self-supervised pronunciation tasks. We then fine-tune for a clinical task that requires simultaneous phonological and semantic analysis. On the “easy” and “hard” versions of this task, fine-tuning from our models is more accurate by a relative 24% and 29%, respectively, and improves character error rates by a relative 75% and 31%, compared to fine-tuning from the original model.
UR - http://www.scopus.com/inward/record.url?scp=85174539485&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85174539485&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85174539485
T3 - Proceedings of the Annual Meeting of the Association for Computational Linguistics
SP - 212
EP - 225
BT - ACL 2023 - 8th Workshop on Representation Learning for NLP, RepL4NLP 2023 - Proceedings of the Workshop
A2 - Can, Burcu
A2 - Mozes, Maximilian
A2 - Cahyawijaya, Samuel
A2 - Saphra, Naomi
A2 - Kassner, Nora
A2 - Ravfogel, Shauli
A2 - Ravichander, Abhilasha
A2 - Zhao, Chen
A2 - Augenstein, Isabelle
A2 - Rogers, Anna
A2 - Cho, Kyunghyun
A2 - Grefenstette, Edward
A2 - Voita, Lena
PB - Association for Computational Linguistics (ACL)
Y2 - 13 July 2023
ER -