Abstract
Background: Cardiovascular outcomes for people with familial hypercholesterolaemia can be improved with diagnosis and medical management. However, 90% of individuals with familial hypercholesterolaemia remain undiagnosed in the USA. We aimed to accelerate early diagnosis and timely intervention for more than 1·3 million undiagnosed individuals with familial hypercholesterolaemia at high risk for early heart attacks and strokes by applying machine learning to large health-care encounter datasets.
Methods: We trained the FIND FH machine learning model using deidentified health-care encounter data, including procedure and diagnostic codes, prescriptions, and laboratory findings, from 939 clinically diagnosed individuals with familial hypercholesterolaemia (395 of whom had a molecular diagnosis) and 83 136 individuals presumed free of familial hypercholesterolaemia, sampled from four US institutions. The model was then applied to a national health-care encounter database (170 million individuals) and an integrated health-care delivery system dataset (174 000 individuals). Individuals used in model training and those evaluated by the model were required to have at least one cardiovascular disease risk factor (eg, hypertension, hypercholesterolaemia, or hyperlipidemia). A Health Insurance Portability and Accountability Act of 1996-compliant programme was developed to allow providers to receive identification of individuals likely to have familial hypercholesterolaemia in their practice.
Findings: Using a model with a measured precision (positive predictive value) of 0·85, recall (sensitivity) of 0·45, area under the precision–recall curve of 0·55, and area under the receiver operating characteristic curve of 0·89, we flagged 1 331 759 of 170 416 201 patients in the national database and 866 of 173 733 individuals in the health-care delivery system dataset as likely to have familial hypercholesterolaemia. Familial hypercholesterolaemia experts reviewed a sample of flagged individuals (45 from the national database and 103 from the health-care delivery system dataset) and applied clinical familial hypercholesterolaemia diagnostic criteria. Of those reviewed, 87% (95% CI 73–100) in the national database and 77% (68–86) in the health-care delivery system dataset were categorised as having a high enough clinical suspicion of familial hypercholesterolaemia to warrant guideline-based clinical evaluation and treatment.
Interpretation: The FIND FH model successfully scans large, diverse, and disparate health-care encounter databases to identify individuals with familial hypercholesterolaemia.
Funding: The FH Foundation funded this study. Support was received from Amgen, Sanofi, and Regeneron.
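To make the reported metrics concrete, the following minimal Python sketch spells out the standard definitions of precision and recall and computes the share of each population that was flagged, using the counts given in the Findings. The function names and the confusion counts (45 true positives, 8 false positives, 55 false negatives) are hypothetical, chosen only so that the definitions reproduce values close to the reported 0·85 and 0·45; they are not taken from the study.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Positive predictive value: share of flagged individuals who are true cases."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Sensitivity: share of true cases that the model flags."""
    return true_pos / (true_pos + false_neg)


def flag_rate(flagged: int, screened: int) -> float:
    """Fraction of a screened population that the model flags."""
    return flagged / screened


# Counts as reported in the Findings.
national_rate = flag_rate(1_331_759, 170_416_201)
delivery_system_rate = flag_rate(866, 173_733)

# HYPOTHETICAL confusion counts (not from the paper), used only to
# illustrate how precision and recall are computed.
example_precision = precision(true_pos=45, false_pos=8)
example_recall = recall(true_pos=45, false_neg=55)

print(f"national flag rate:        {national_rate:.2%}")
print(f"delivery-system flag rate: {delivery_system_rate:.2%}")
print(f"example precision: {example_precision:.2f}, recall: {example_recall:.2f}")
```

With the reported counts, roughly 0.78% of the national database and 0.50% of the health-care delivery system dataset were flagged.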
Original language | English (US) |
---|---|
Pages (from-to) | e393-e402 |
Journal | The Lancet Digital Health |
Volume | 1 |
Issue number | 8 |
DOIs | 10.1016/S2589-7500(19)30150-5 |
State | Published - Dec 2019 |
ASJC Scopus subject areas
- Medicine (miscellaneous)
- Health Informatics
- Decision Sciences (miscellaneous)
- Health Information Management
In: The Lancet Digital Health, Vol. 1, No. 8, 12.2019, p. e393-e402.
Research output: Contribution to journal › Article › peer-review
TY - JOUR
T1 - Precision screening for familial hypercholesterolaemia
T2 - a machine learning study applied to electronic health encounter data
AU - Myers, Kelly D.
AU - Knowles, Joshua W.
AU - Staszak, David
AU - Shapiro, Michael D.
AU - Howard, William
AU - Yadava, Mrinal
AU - Zuzick, David
AU - Williamson, Latoya
AU - Shah, Nigam H.
AU - Banda, Juan M.
AU - Leader, Joe
AU - Cromwell, William C.
AU - Trautman, Ed
AU - Murray, Michael F.
AU - Baum, Seth J.
AU - Myers, Seth
AU - Gidding, Samuel S.
AU - Wilemon, Katherine
AU - Rader, Daniel J.
N1 - Funding Information: The FH Foundation, a 501(c)(3) organisation, funded this study. Support was received from Amgen, Sanofi, and Regeneron. Amgen is the founding sponsor of the FIND FH initiative. JWK received support from the American Heart Association (grant 15IRG222930034), the Stanford Data Science Initiative, the Stanford Diabetes Research Center (P30DK116074), and the National Institutes of Health (grant U41HG009649). MDS is supported by the National Institutes of Health (grant K12HD043488). The FH Foundation would like to thank Penn Medicine, Stanford University Medical Center, Geisinger Medical Center, The Ohio State University Wexner Medical Center, Oregon Health & Science University, and Laboratory Corporation of America for providing vital data and shared vision for the FIND FH initiative. Moreover, individuals beyond the authors from each institution played a key role in providing data to help to train, test, and validate the FIND FH model. We thank Yuliya Borovskly and Dan Soffer (Penn Medicine); Kylie McElheran, Beth Wilson, and Kathy Lee (Oregon Health & Science University); Kelly J Scheiderer, Brian Myers, and Jing Ding (Ohio State University); Jim Fleming, Wade Tanico, Arren Fisher, Sherrie Duke, Eric Rotthoff, Lee Terrell, and M J Lewis (Laboratory Corporation of America) for their critical contributions.
We describe here the development of a machine learning model, FIND FH, which is designed to identify phenotypic familial hypercholesterolaemia when applied to large medical datasets. FIND FH was built on longitudinal medical data from individuals with at least one documented cardiovascular disease risk factor in their history. This requirement was applied to allow us to collect data for model training and to optimise the model to the envisioned test cases: finding individuals embedded in a health-care system. 
Additionally, it forces the model to focus on discriminating between familial hypercholesterolaemia and other cardiac conditions that might mimic familial hypercholesterolaemia. The FIND FH model is a precision screening tool and does not replace clinical evaluation or existing diagnostic criteria. It differs from the traditional diagnostic criteria (ie, Dutch Lipid Clinic Network, MEDPED, or Simon Broome) in that it does not require specific information, such as tendon xanthomas or family history, that either might not be present or cannot be regularly and reliably extracted from current EHR data at the national scale. FIND FH was structured to identify a phenotype consistent with familial hypercholesterolaemia, whereas the other criteria were designed to evaluate the likelihood of having a positive genetic test for familial hypercholesterolaemia [18,19]. This is why we validated the model using familial hypercholesterolaemia expert clinical evaluation [1]. As genetic testing results become more readily available, we intend to include this feature in future versions. Importantly, the FIND FH model does not rely only on predetermined thresholds for lipid concentrations. The algorithm was developed with data from patients with an existing diagnosis of familial hypercholesterolaemia; at some point in the past, these patients would have met the lipid diagnostic criteria for this condition, but those data were not in the EHR for many of the patients on whom the algorithm was trained. This is a key difference between FIND FH and conventional scoring systems. Although lipid concentrations are helpful to the machine learning algorithm, many patients identified in this study either had no lipid levels recorded and were flagged by other characteristics, or were taking lipid-lowering medications and had lipid levels below pre-treatment diagnostic thresholds. 
Future studies should assess how model performance changes when lipid levels are available for a higher percentage of the cohort. Application of FIND FH to a national database consisting of diagnosis, procedure, and medication transactions in more than 170 million Americans flagged 1 345 477 individuals with medical profiles consistent with familial hypercholesterolaemia. FIND FH was also applied to a health-care delivery system dataset consisting of structured EHR data in more than 170 000 individuals, in which it flagged 866 individuals. The proportion of individuals flagged in these cohorts, which is empirically lower than the training prevalence of 1:71, is a direct reflection of the high precision threshold chosen. We chose this threshold to avoid the possibility of too many false positives at the start of the outreach programme to attending physicians. Chart review of the flagged individuals categorised 77–87% of them as having possible, probable, or definite familial hypercholesterolaemia, indicating a high enough clinical suspicion of familial hypercholesterolaemia to warrant a guideline-based, formal clinical evaluation and treatment. More than half of the individuals flagged would not have been identified with a simple screen for elevated LDL cholesterol levels alone [20,21]; this test does not capture data crucial to conventional diagnostic criteria, such as family history, and misses situations such as an individual on statins with LDL cholesterol levels below threshold. Furthermore, the model flagged both individuals undergoing statin therapy and those who were not. Of the 79 individuals categorised as having risk of possible familial hypercholesterolaemia or greater in the OHSU dataset, 31 had no record of any statin prescriptions in the previous 2 years. 
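The trade-off described above, in which a high precision threshold flags fewer individuals in exchange for fewer false positives, can be illustrated with a small sketch. The text here does not describe how the FIND FH operating threshold was actually selected, so the procedure below is an assumption: it scans candidate cutoffs on labelled validation scores and returns the lowest cutoff whose flagged set still meets a target precision, which maximises recall under that constraint. The function name and the toy scores and labels are hypothetical.

```python
from typing import Optional, Sequence


def lowest_threshold_for_precision(
    scores: Sequence[float], labels: Sequence[int], target: float
) -> Optional[float]:
    """Return the lowest cutoff t such that flagging `score >= t` reaches
    the target precision, or None if no cutoff does.

    Illustrative only; not the selection procedure used in the study.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best: Optional[float] = None
    true_pos = 0
    for flagged, (score, label) in enumerate(ranked, start=1):
        true_pos += label
        # Evaluate only at the end of a run of tied scores, so that the
        # rule `score >= cutoff` flags exactly the items counted so far.
        if flagged < len(ranked) and ranked[flagged][0] == score:
            continue
        if true_pos / flagged >= target:
            best = score
    return best


# HYPOTHETICAL validation scores and labels (1 = confirmed case).
toy_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40]
toy_labels = [1, 1, 0, 1, 0, 0]

print(lowest_threshold_for_precision(toy_scores, toy_labels, target=0.90))
print(lowest_threshold_for_precision(toy_scores, toy_labels, target=0.75))
```

Raising the target precision pushes the cutoff upward and shrinks the flagged set, which is consistent with the flagged proportion falling below the training prevalence.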
These results indicate that application of a machine learning approach such as FIND FH to medical big data might be feasible for identifying many undiagnosed individuals with familial hypercholesterolaemia. FIND FH performed comparably across two types of health-care data: a national health-care encounter database and an integrated health-care delivery system with a structured EHR database. This portability was a design consideration and arises from the fact that the model is built on structured health-care encounter and laboratory result data. Although we [15] and others have found success when including unstructured EHR data and clinical notes in machine learning models [14,22], to our knowledge, no national database with such data currently exists. The fact that the model performs similarly in distinct health-care data frameworks suggests that it might be generalisable to other institutions, agencies, employers, and health-care delivery systems. Our previous model [15] took the complementary approach and showed good performance in identifying previously undiagnosed familial hypercholesterolaemia patients within a single institution. Importantly, the fact that the individuals identified in this latter case were already within the institution led to easier and quicker individual engagement. We have developed a novel HIPAA-compliant outreach process to notify health-care providers of their patients flagged by the FIND FH model, a disclosure that is permitted by the treatment exception to HIPAA privacy rules. This process can be easily implemented across diverse health-care systems. Providers, or an integrated delivery system, can opt in to participate in the programme and then learn the identities of individuals in their practices flagged as having probable familial hypercholesterolaemia. These individuals can then be evaluated by their providers and, if formally diagnosed with familial hypercholesterolaemia, receive the necessary medical therapy. 
This study has several limitations and caveats. By design, the model is given a 3-year snapshot of an individual's full medical history to calculate their likelihood of familial hypercholesterolaemia. It is not possible to account for important pieces of information outside of the 3-year interval. The time windows chosen for building and applying the model to a dataset are a balance between the positive contribution of more data and the increased costs and other issues associated with using longer time windows. We investigated shorter and longer windows in previous FIND FH versions (data not shown) and found that 3 years yielded a good balance. In the training data, this limitation is mitigated by using a large number of individuals from multiple health-care systems. For logistical reasons, the physician review of flagged individuals could only be done on a small subset of those identified. We cannot rule out selection bias in the validation results because neither scenario was perfectly random. In the national database, we relied on professional (second to third degree) connections to collect physician reviews. In the OHSU dataset, we imposed the practical requirement that the patient have at least one LDL cholesterol laboratory result, so that the physicians could easily assess those flagged with conventional diagnostic criteria. Individuals with LDL cholesterol values represent the simplest patients to assess, and we expect this cohort to be the most commonly reviewed in practice. An additional limitation of the study is that the mean age of individuals identified by the FIND FH model was 61 years (SD 15) in the national dataset and 59 years (SD 14) in the OHSU dataset, despite the fact that familial hypercholesterolaemia is a genetic condition and therefore present from birth. 
This result probably stems from two factors: first, the model was trained on individuals diagnosed in specialty lipid clinics, where individuals are typically referred later in the course of their preventive or cardiac care (mean ages of cases and presumed controls in the institutional training datasets were 49–63 years and 60–67 years, respectively) [23]; and second, the very low prevalence of lipid data in individuals younger than 40 years of age in the databases. The final model identified the patients it was trained to find: namely, older individuals with familial hypercholesterolaemia. The best value from the model might be achieved by successful cascade screening of family members of identified and diagnosed cases. Adding relevant clinical notes, particularly family history, would be an important development. However, there is currently no database at the national scale with this information available, nor are these data routinely included in EHRs. Therefore, addition of these data to the model would prevent using the model to scan the full national population. In summary, when applied to two distinct types of large medical datasets, FIND FH identified a large number of individuals with probable familial hypercholesterolaemia who had not been previously diagnosed. Additional validation and demonstration of clinical utility of FIND FH will be needed before large-scale adoption of this approach. A crucial hurdle will be engaging providers to become familiar with machine learning approaches designed to reconnect them with their patients regarding new diagnoses not presented in previous medical encounters. This new tool carries the promise of finding new individuals with familial hypercholesterolaemia at scale and leading to more effective preventive therapy for them and newly identified family members.
Contributors
KDM, JWK, KW, DJR, SJB, MFM, NHS, JMB, and SSG conceived, designed, and oversaw the development of the FIND FH algorithm and this study. 
KDM, SM, WH, and DS developed the model and did the main analysis. MDS, MY, and SJB did clinical patient reviews and contributed to the analysis. KDM, JL, WCC, DZ, and ET developed the HIPAA-compliant methodology and acquired necessary data for this work. LW and DS searched the literature. KDM and DS wrote the initial publication draft and created the figures. All authors contributed to the critical review, editing, and final approval of this manuscript.
Declaration of interests
KW, SSG, DZ, and LW are employees of the FH Foundation; KDM, WH, and DS are paid consultants for the FH Foundation. JWK is the unpaid chief research advisor for the FH Foundation and the FIND FH project and has enrolled patients and adjudicated outcomes in PCSK9i trials. DJR serves on the science advisory board for Alnylam, Novartis, and Pfizer and is an unpaid advisor to the FH Foundation. MDS is supported by NIH K12HD043488. MFM reports receiving grants from Regeneron Pharmaceuticals as well as personal fees from Invitae, both outside the scope of the study. SJB serves on the scientific advisory board for Amgen, Sanofi, Novartis, Regeneron, and Akcea; is a consultant at Sanofi, Amgen, Cleveland Heart Labs, GLG Group, Guidepoint Global, Regeneron, Novo Nordisk, and Akcea; and is a national speaker for Amgen, Aralez, Boehringer Ingelheim Pharmaceutical, Novo Nordisk, and Akcea. NHS is a co-founder and scientific advisor to Cardinal Analytx and an advisor to TwoXaR. WCC was an employee of the Laboratory Corporation of America Holdings. ET is an employee of Pfizer. All other authors declare no competing interests.
Publisher Copyright: © 2019 The Author(s). Published by Elsevier Ltd. This is an Open Access article under the CC BY-NC-ND 4.0 license
PY - 2019/12
Y1 - 2019/12
UR - http://www.scopus.com/inward/record.url?scp=85075522435&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075522435&partnerID=8YFLogxK
U2 - 10.1016/S2589-7500(19)30150-5
DO - 10.1016/S2589-7500(19)30150-5
M3 - Article
AN - SCOPUS:85075522435
SN - 2589-7500
VL - 1
SP - e393-e402
JO - The Lancet Digital Health
JF - The Lancet Digital Health
IS - 8
ER -