TY - JOUR
T1 - Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings
AU - Smalheiser, Neil R.
AU - Cohen, Aaron M.
AU - Bonifield, Gary
N1 - Funding Information:
Our studies are supported by NIH grants R01LM10817 and P01AG03934. We thank Ruixue Wang (Wuhan University) for computing some of the word2vec word similarity scores. Thanks, too, to Keven Bretonnel Cohen for suggesting a more vivid introductory paragraph. A preprint of this paper was first deposited into arXiv [42].
Publisher Copyright:
© 2019 Elsevier Inc.
PY - 2019/2
Y1 - 2019/2
AB - Neural embeddings are a popular set of methods for representing words, phrases or text as a low-dimensional vector (typically 50–500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low-dimensional vector, in which the meaning and relative importance of dimensions are transparent to inspection. We have created a near-comprehensive vector representation of words, and selected bigrams, trigrams and abbreviations, using the set of titles and abstracts in PubMed as a corpus. This vector is used to create several novel implicit word-word and text-text similarity metrics. The implicit word-word similarity metrics correlate well with human judgement of word pair similarity and relatedness, and outperform or equal all other reported methods on a variety of biomedical benchmarks, including several implementations of neural embeddings trained on PubMed corpora. Our implicit word-word metrics capture different aspects of word-word relatedness than word2vec-based metrics and are only partially correlated (rho = 0.5–0.8, depending on task and corpus). The vector representations of words, bigrams, trigrams, abbreviations, and PubMed title + abstracts are all publicly available from http://arrowsmith.psych.uic.edu/arrowsmith_uic/word_similarity_metrics.html for release under a CC-BY-NC license. Several public web query interfaces are also available at the same site, including one that allows the user to specify a given word and view its most closely related terms according to direct co-occurrence as well as different implicit similarity metrics.
KW - Dimensional reduction
KW - Implicit features
KW - Natural language processing
KW - Pvtopic
KW - Semantic similarity
KW - Text mining
KW - Vector representation
KW - Word2vec
UR - http://www.scopus.com/inward/record.url?scp=85060518039&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85060518039&partnerID=8YFLogxK
U2 - 10.1016/j.jbi.2019.103096
DO - 10.1016/j.jbi.2019.103096
M3 - Article
C2 - 30654030
AN - SCOPUS:85060518039
SN - 1532-0464
VL - 90
JO - Journal of Biomedical Informatics
JF - Journal of Biomedical Informatics
M1 - 103096
ER -