Automated feature selection of predictors in electronic medical records data

Jessica Gronsbell; Jessica Minnier; Sheng Yu; Katherine Liao; Tianxi Cai

doi:10.1111/biom.12987

Automated feature selection of predictors in electronic medical records data

Jessica Gronsbell, Jessica Minnier, Sheng Yu, Katherine Liao, Tianxi Cai

Research output: Contribution to journal › Article › peer-review

22 Scopus citations

Abstract

The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.

Original language	English (US)
Pages (from-to)	268-277
Number of pages	10
Journal	Biometrics
Volume	75
Issue number	1
DOIs	https://doi.org/10.1111/biom.12987
State	Published - Mar 2019

Keywords

electronic medical records
feature selection
prediction accuracy
regularized regression
risk prediction

ASJC Scopus subject areas

Statistics and Probability
General Biochemistry, Genetics and Molecular Biology
General Immunology and Microbiology
General Agricultural and Biological Sciences
Applied Mathematics

Access to Document

10.1111/biom.12987

Cite this

@article{65e5c7918fd64420b3db38d02f75218f,

title = "Automated feature selection of predictors in electronic medical records data",

abstract = "The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.",

keywords = "electronic medical records, feature selection, prediction accuracy, regularized regression, risk prediction",

author = "Jessica Gronsbell and Jessica Minnier and Sheng Yu and Katherine Liao and Tianxi Cai",

note = "Publisher Copyright: {\textcopyright} 2018 International Biometric Society",

year = "2019",

month = mar,

doi = "10.1111/biom.12987",

language = "English (US)",

volume = "75",

pages = "268--277",

journal = "Biometrics",

issn = "0006-341X",

publisher = "Wiley-Blackwell",

number = "1",

}

TY - JOUR

T1 - Automated feature selection of predictors in electronic medical records data

AU - Gronsbell, Jessica

AU - Minnier, Jessica

AU - Yu, Sheng

AU - Liao, Katherine

AU - Cai, Tianxi

PY - 2019/3

Y1 - 2019/3

N2 - The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.

AB - The use of Electronic Health Records (EHR) for translational research can be challenging due to difficulty in extracting accurate disease phenotype data. Historically, EHR algorithms for annotating phenotypes have been either rule-based or trained with billing codes and gold standard labels curated via labor intensive medical chart review. These simplistic algorithms tend to have unpredictable portability across institutions and low accuracy for many disease phenotypes due to imprecise billing codes. Recently, more sophisticated machine learning algorithms have been developed to improve the robustness and accuracy of EHR phenotyping algorithms. These algorithms are typically trained via supervised learning, relating gold standard labels to a wide range of candidate features including billing codes, procedure codes, medication prescriptions and relevant clinical concepts extracted from narrative notes via Natural Language Processing (NLP). However, due to the time intensiveness of gold standard labeling, the size of the training set is often insufficient to build a generalizable algorithm with the large number of candidate features extracted from EHR. To reduce the number of candidate predictors and in turn improve model performance, we present an automated feature selection method based entirely on unlabeled observations. The proposed method generates a comprehensive surrogate for the underlying phenotype with an unsupervised clustering of disease status based on several highly predictive features such as diagnosis codes and mentions of the disease in text fields available in the entire set of EHR data. A sparse regression model is then built with the estimated outcomes and remaining covariates to identify those features most informative of the phenotype of interest. Relying on the results of Li and Duan (1989), we demonstrate that variable selection for the underlying phenotype model can be achieved by fitting the surrogate-based model. We explore the performance of our methods in numerical simulations and present the results of a prediction model for Rheumatoid Arthritis (RA) built on a large EHR data mart from the Partners Health System consisting of billing codes and NLP terms. Empirical results suggest that our procedure reduces the number of gold-standard labels necessary for phenotyping thereby harnessing the automated power of EHR data and improving efficiency.

KW - electronic medical records

KW - feature selection

KW - prediction accuracy

KW - regularized regression

KW - risk prediction

UR - http://www.scopus.com/inward/record.url?scp=85061025010&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85061025010&partnerID=8YFLogxK

U2 - 10.1111/biom.12987

DO - 10.1111/biom.12987

M3 - Article

C2 - 30353541

AN - SCOPUS:85061025010

SN - 0006-341X

VL - 75

SP - 268

EP - 277

JO - Biometrics

JF - Biometrics

IS - 1

ER -

Automated feature selection of predictors in electronic medical records data

Abstract

Keywords

ASJC Scopus subject areas

Access to Document

Other files and links

Fingerprint

Cite this