Automated confidence ranked classification of randomized controlled trial articles: An aid to evidence-based medicine

Aaron M. Cohen; Neil R. Smalheiser; Marian S. McDonagh; Clement Yu; Clive E. Adams; John M. Davis; Philip S. Yu

doi:10.1093/jamia/ocu025

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Article › peer-review

34 Scopus citations

Abstract

Objective: For many literature review tasks, including systematic review (SR) and other aspects of evidence-based medicine, it is important to know whether an article describes a randomized controlled trial (RCT). Current manual annotation is not complete or flexible enough for the SR process. In this work, highly accurate machine learning predictive models were built that include confidence predictions of whether an article is an RCT.

Materials and Methods: The LibSVM classifier was used with forward selection of potential feature sets on a large human-related subset of MEDLINE to create a classification model requiring only the citation, abstract, and MeSH terms for each article.

Results: The model achieved an area under the receiver operating characteristic curve of 0.973 and mean squared error of 0.013 on the held-out year 2011 data. Accurate confidence estimates were confirmed on a manually reviewed set of test articles. A second model not requiring MeSH terms was also created, and performs almost as well.

Discussion: Both models accurately rank and predict article RCT confidence. Using the model and the manually reviewed samples, it is estimated that about 8000 (3%) additional RCTs can be identified in MEDLINE, and that 5% of articles tagged as RCTs in MEDLINE may not be identified.

Conclusion: Retagging human-related studies with a continuously valued RCT confidence is potentially more useful for article ranking and review than a simple yes/no prediction. The automated RCT tagging tool should offer significant savings of time and effort during the process of writing SRs, and is a key component of a multistep text mining pipeline that we are building to streamline SR workflow. In addition, the model may be useful for identifying errors in MEDLINE publication types. The RCT confidence predictions described here have been made available to users as a web service with a user query form front end at: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/RCT_Tagger.cgi.
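
The two evaluation figures quoted in the Results (area under the receiver operating characteristic curve and mean squared error) can be illustrated with a minimal Python sketch. It assumes the mean squared error is taken between each predicted confidence and the 0/1 reference label, which the abstract does not spell out. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as one half).
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_squared_error(labels, scores):
    """Mean squared difference between confidence and the 0/1 label (assumed definition)."""
    return sum((s - y) ** 2 for y, s in zip(labels, scores)) / len(labels)


# Toy data (hypothetical): 1 = RCT, 0 = not an RCT.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.3, 0.1, 0.4, 0.6]

print(f"AUC = {roc_auc(labels, scores):.3f}")
print(f"MSE = {mean_squared_error(labels, scores):.3f}")
```

A ranking metric (AUC) and a calibration-sensitive metric (MSE) answer different questions, which is presumably why the abstract reports both: the first reflects how well articles are ordered, the second how close the confidence values are to the reference labels.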

Original language	English (US)
Pages (from-to)	707-717
Number of pages	11
Journal	Journal of the American Medical Informatics Association
Volume	22
Issue number	3
DOIs	https://doi.org/10.1093/jamia/ocu025
State	Published - May 2015

Keywords

Evidence-based medicine
Information retrieval
Natural language processing
Randomized controlled trials as topic
Support vector machines
Systematic reviews

ASJC Scopus subject areas

Health Informatics

Cite this

@article{13418419aba94c33aab77d876fd3455d,

title = "Automated confidence ranked classification of randomized controlled trial articles: An aid to evidence-based medicine",

abstract = "Objective: For many literature review tasks, including systematic review (SR) and other aspects of evidence-based medicine, it is important to know whether an article describes a randomized controlled trial (RCT). Current manual annotation is not complete or flexible enough for the SR process. In this work, highly accurate machine learning predictive models were built that include confidence predictions of whether an article is an RCT. Materials and Methods: The LibSVM classifier was used with forward selection of potential feature sets on a large human-related subset of MEDLINE to create a classification model requiring only the citation, abstract, and MeSH terms for each article. Results: The model achieved an area under the receiver operating characteristic curve of 0.973 and mean squared error of 0.013 on the held-out year 2011 data. Accurate confidence estimates were confirmed on a manually reviewed set of test articles. A second model not requiring MeSH terms was also created, and performs almost as well. Discussion: Both models accurately rank and predict article RCT confidence. Using the model and the manually reviewed samples, it is estimated that about 8000 (3\%) additional RCTs can be identified in MEDLINE, and that 5\% of articles tagged as RCTs in MEDLINE may not be identified. Conclusion: Retagging human-related studies with a continuously valued RCT confidence is potentially more useful for article ranking and review than a simple yes/no prediction. The automated RCT tagging tool should offer significant savings of time and effort during the process of writing SRs, and is a key component of a multistep text mining pipeline that we are building to streamline SR workflow. In addition, the model may be useful for identifying errors in MEDLINE publication types. The RCT confidence predictions described here have been made available to users as a web service with a user query form front end at: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/RCT_Tagger.cgi.",

keywords = "Evidence-based medicine, Information retrieval, Natural language processing, Randomized controlled trials as topic, Support vector machines, Systematic reviews",

author = "Cohen, {Aaron M.} and Smalheiser, {Neil R.} and McDonagh, {Marian S.} and Yu, Clement and Adams, {Clive E.} and Davis, {John M.} and Yu, {Philip S.}",

note = "Publisher Copyright: {\textcopyright} The Author 2015.",

year = "2015",

month = may,

doi = "10.1093/jamia/ocu025",

language = "English (US)",

volume = "22",

pages = "707--717",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "3",

}

TY - JOUR

T1 - Automated confidence ranked classification of randomized controlled trial articles

T2 - An aid to evidence-based medicine

AU - Cohen, Aaron M.

AU - Smalheiser, Neil R.

AU - McDonagh, Marian S.

AU - Yu, Clement

AU - Adams, Clive E.

AU - Davis, John M.

AU - Yu, Philip S.

N1 - Publisher Copyright: © The Author 2015.

PY - 2015/5

Y1 - 2015/5

AB - Objective: For many literature review tasks, including systematic review (SR) and other aspects of evidence-based medicine, it is important to know whether an article describes a randomized controlled trial (RCT). Current manual annotation is not complete or flexible enough for the SR process. In this work, highly accurate machine learning predictive models were built that include confidence predictions of whether an article is an RCT. Materials and Methods: The LibSVM classifier was used with forward selection of potential feature sets on a large human-related subset of MEDLINE to create a classification model requiring only the citation, abstract, and MeSH terms for each article. Results: The model achieved an area under the receiver operating characteristic curve of 0.973 and mean squared error of 0.013 on the held-out year 2011 data. Accurate confidence estimates were confirmed on a manually reviewed set of test articles. A second model not requiring MeSH terms was also created, and performs almost as well. Discussion: Both models accurately rank and predict article RCT confidence. Using the model and the manually reviewed samples, it is estimated that about 8000 (3%) additional RCTs can be identified in MEDLINE, and that 5% of articles tagged as RCTs in MEDLINE may not be identified. Conclusion: Retagging human-related studies with a continuously valued RCT confidence is potentially more useful for article ranking and review than a simple yes/no prediction. The automated RCT tagging tool should offer significant savings of time and effort during the process of writing SRs, and is a key component of a multistep text mining pipeline that we are building to streamline SR workflow. In addition, the model may be useful for identifying errors in MEDLINE publication types. The RCT confidence predictions described here have been made available to users as a web service with a user query form front end at: http://arrowsmith.psych.uic.edu/cgi-bin/arrowsmith_uic/RCT_Tagger.cgi.

KW - Evidence-based medicine

KW - Information retrieval

KW - Natural language processing

KW - Randomized controlled trials as topic

KW - Support vector machines

KW - Systematic reviews

UR - http://www.scopus.com/inward/record.url?scp=84940368805&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84940368805&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocu025

DO - 10.1093/jamia/ocu025

M3 - Article

C2 - 25656516

AN - SCOPUS:84940368805

SN - 1067-5027

VL - 22

SP - 707

EP - 717

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 3

ER -
