Five-way Smoking Status Classification Using Text Hot-Spot Identification and Error-correcting Output Codes

Aaron M. Cohen

doi:10.1197/jamia.M2434

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Article › peer-review

31 Scopus citations

Abstract

We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.
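As a rough illustration of the evaluation metric named in the abstract, the sketch below computes micro- and macro-averaged F1 for single-label, multi-class predictions. It is not the author's or the i2b2 organizers' evaluation code: the function name `f1_scores`, the label strings, and the toy gold/predicted lists are assumptions made for this example, and the macro average here runs over the labels passed in (by default, those seen in the data), which may differ from the official scoring.

```python
from collections import Counter


def f1_scores(gold, pred, labels=None):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions.

    `labels` is the set of classes to macro-average over; if omitted, the
    classes that occur in either list are used (an assumption of this sketch).
    """
    if labels is None:
        labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for class g
        else:
            fp[p] += 1          # predicted class p, but it was wrong
            fn[g] += 1          # true class g was missed

    def f1(t, p, n):
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the class is empty
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Toy data (hypothetical, not from the i2b2 collection).
gold = ["UNKNOWN", "UNKNOWN", "UNKNOWN", "NON-SMOKER", "CURRENT SMOKER", "PAST SMOKER"]
pred = ["UNKNOWN", "UNKNOWN", "UNKNOWN", "NON-SMOKER", "PAST SMOKER", "PAST SMOKER"]
micro, macro = f1_scores(gold, pred)
print(f"micro-F1 = {micro:.4f}, macro-F1 = {macro:.4f}")
```

When every document receives exactly one label, micro-F1 reduces to accuracy, so it is dominated by the frequent classes, while macro-F1 weights each class equally and is pulled down by errors on rare classes. That difference is one reason a task with skewed class frequencies might report both.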

Original language	English (US)
Pages (from-to)	32-35
Number of pages	4
Journal	Journal of the American Medical Informatics Association
Volume	15
Issue number	1
DOIs	https://doi.org/10.1197/jamia.M2434
State	Published - Jan 2008

ASJC Scopus subject areas

Health Informatics

Cite this

@article{e5c646d6b0214103a1aa42f404066ef4,
title = "Five-way Smoking Status Classification Using Text Hot-Spot Identification and Error-correcting Output Codes",
abstract = "We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.",
author = "Cohen, {Aaron M.}",
year = "2008",
month = jan,
doi = "10.1197/jamia.M2434",
language = "English (US)",
volume = "15",
pages = "32--35",
journal = "Journal of the American Medical Informatics Association",
issn = "1067-5027",
publisher = "Oxford University Press",
number = "1",
}

TY - JOUR

T1 - Five-way Smoking Status Classification Using Text Hot-Spot Identification and Error-correcting Output Codes

AU - Cohen, Aaron M.

PY - 2008/1

Y1 - 2008/1

N2 - We participated in the i2b2 smoking status classification challenge task. The purpose of this task was to evaluate the ability of systems to automatically identify patient smoking status from discharge summaries. Our submission included several techniques that we compared and studied, including hot-spot identification, zero-vector filtering, inverse class frequency weighting, error-correcting output codes, and post-processing rules. We evaluated our approaches using the same methods as the i2b2 task organizers, using micro- and macro-averaged F1 as the primary performance metric. Our best performing system achieved a micro-F1 of 0.9000 on the test collection, equivalent to the best performing system submitted to the i2b2 challenge. Hot-spot identification, zero-vector filtering, classifier weighting, and error correcting output coding contributed additively to increased performance, with hot-spot identification having by far the largest positive effect. High performance on automatic identification of patient smoking status from discharge summaries is achievable with the efficient and straightforward machine learning techniques studied here.

UR - http://www.scopus.com/inward/record.url?scp=36749057750&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=36749057750&partnerID=8YFLogxK

U2 - 10.1197/jamia.M2434

DO - 10.1197/jamia.M2434

M3 - Article

C2 - 17947623

AN - SCOPUS:36749057750

SN - 1067-5027

VL - 15

SP - 32

EP - 35

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 1

ER -
