Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

Byron C. Wallace; Anna Noel-Storr; Iain J. Marshall; Aaron M. Cohen; Neil R. Smalheiser; James Thomas

doi:10.1093/jamia/ocx053

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Article › peer-review

118 Scopus citations

Abstract

Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.
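The triage rule described in the Methods can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the threshold value, and the toy scores and labels below are all hypothetical, and the sketch assumes that crowdworkers correctly label every citation they receive, so the recall it reports is an upper bound.

```python
from typing import List, Tuple


def triage(scores: List[float], threshold: float) -> Tuple[List[int], List[int]]:
    """Split citation indices into (auto_excluded, sent_to_crowd).

    scores: classifier-estimated probability that each citation is an RCT.
    threshold: hypothetical cut-off; citations scoring below it are excluded
    automatically, all others are deferred to crowdworkers.
    """
    auto_excluded = [i for i, s in enumerate(scores) if s < threshold]
    sent_to_crowd = [i for i, s in enumerate(scores) if s >= threshold]
    return auto_excluded, sent_to_crowd


def evaluate(scores: List[float], labels: List[int], threshold: float) -> Tuple[float, float]:
    """Return (recall, workload_reduction) for a given threshold.

    labels: 1 if the citation truly describes an RCT, else 0.
    Assumes a perfect crowd, so the only missed RCTs are those excluded
    automatically by the classifier.
    """
    auto_excluded, sent_to_crowd = triage(scores, threshold)
    total_rcts = sum(labels)
    found_rcts = sum(labels[i] for i in sent_to_crowd)
    recall = found_rcts / total_rcts if total_rcts else 1.0
    workload_reduction = len(auto_excluded) / len(scores) if scores else 0.0
    return recall, workload_reduction


# Toy data, invented purely for illustration.
toy_scores = [0.01, 0.02, 0.90, 0.40, 0.03]
toy_labels = [0, 0, 1, 1, 0]
recall, reduction = evaluate(toy_scores, toy_labels, threshold=0.05)
print(f"recall={recall:.2f}, workload reduction={reduction:.2f}")
```

Lowering the threshold sends more citations to the crowd, which protects recall at the cost of a smaller workload reduction; raising it does the opposite.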

Original language	English (US)
Pages (from-to)	1165-1168
Number of pages	4
Journal	Journal of the American Medical Informatics Association
Volume	24
Issue number	6
DOIs	https://doi.org/10.1093/jamia/ocx053
State	Published - Nov 1 2017

Keywords

Crowdsourcing
Evidence-based medicine
Human computation
Machine learning
Natural language processing

ASJC Scopus subject areas

Health Informatics

Access to Document

10.1093/jamia/ocx053

Cite this

@article{91e5c4eb452f4463ba682e8704c33cea,

title = "Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach",

abstract = "Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.",

keywords = "Crowdsourcing, Evidence-based medicine, Human computation, Machine learning, Natural language processing",

author = "Wallace, {Byron C.} and Anna Noel-Storr and Marshall, {Iain J.} and Cohen, {Aaron M.} and Smalheiser, {Neil R.} and James Thomas",

note = "Publisher Copyright: {\textcopyright} The Author 2017.",

year = "2017",

month = nov,

day = "1",

doi = "10.1093/jamia/ocx053",

language = "English (US)",

volume = "24",

pages = "1165--1168",

journal = "Journal of the American Medical Informatics Association",

issn = "1067-5027",

publisher = "Oxford University Press",

number = "6",

}

TY - JOUR

T1 - Identifying reports of randomized controlled trials (RCTs) via a hybrid machine learning and crowdsourcing approach

AU - Wallace, Byron C.

AU - Noel-Storr, Anna

AU - Marshall, Iain J.

AU - Cohen, Aaron M.

AU - Smalheiser, Neil R.

AU - Thomas, James

N1 - Publisher Copyright: © The Author 2017.

PY - 2017/11/1

Y1 - 2017/11/1

N2 - Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

AB - Objectives: Identifying all published reports of randomized controlled trials (RCTs) is an important aim, but it requires extensive manual effort to separate RCTs from non-RCTs, even using current machine learning (ML) approaches. We aimed to make this process more efficient via a hybrid approach using both crowdsourcing and ML. Methods: We trained a classifier to discriminate between citations that describe RCTs and those that do not. We then adopted a simple strategy of automatically excluding citations deemed very unlikely to be RCTs by the classifier and deferring to crowdworkers otherwise. Results: Combining ML and crowdsourcing provides a highly sensitive RCT identification strategy (our estimates suggest 95%-99% recall) with substantially less effort (we observed a reduction of around 60%-80%) than relying on manual screening alone. Conclusions: Hybrid crowd-ML strategies warrant further exploration for biomedical curation/annotation tasks.

KW - Crowdsourcing

KW - Evidence-based medicine

KW - Human computation

KW - Machine learning

KW - Natural language processing

UR - http://www.scopus.com/inward/record.url?scp=85028670339&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85028670339&partnerID=8YFLogxK

U2 - 10.1093/jamia/ocx053

DO - 10.1093/jamia/ocx053

M3 - Article

C2 - 28541493

AN - SCOPUS:85028670339

SN - 1067-5027

VL - 24

SP - 1165

EP - 1168

JO - Journal of the American Medical Informatics Association

JF - Journal of the American Medical Informatics Association

IS - 6

ER -
