A probabilistic automated tagger to identify human-related publications

Aaron M. Cohen, Zackary O. Dunivin, Neil R. Smalheiser

Research output: Contribution to journal › Article › peer-review

1 Scopus citation


The Medical Subject Heading 'Humans' is manually curated and indicates human-related studies within MEDLINE. However, newly published MEDLINE articles may take months to be indexed, and non-MEDLINE articles lack consistent, transparent indexing of this feature. Therefore, for up-to-date and broad literature searches, there is a need for an independent automated system to identify whether a given publication is human-related, particularly when it lacks Medical Subject Headings. One million MEDLINE records published in 1987-2014 were randomly selected. Text-based features from the title, abstract, author name and journal fields were extracted. A linear support vector machine was trained to estimate the probability that a given article should be indexed as Humans and was evaluated on records from 2015 to 2016. Overall accuracy was high: area under the receiver operating characteristic curve = 0.976, F1 = 95% relative to MeSH indexing. Manual review of cases of extreme disagreement with MEDLINE showed 73.5% agreement with the automated prediction. We have tagged all articles indexed in PubMed with predictive scores and have made the information publicly available at http://arrowsmith.psych.uic.edu/evidence-based-medicine/index.html. We have also made available a web-based interface to allow users to obtain predictive scores for non-MEDLINE articles. This will assist in the triage of clinical evidence for writing systematic reviews.
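The general approach described in the abstract (text features from record fields, a linear support vector machine, a score mapped to a probability) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The Pegasos-style training loop, the stopword list, the field names, the uncalibrated sigmoid used to turn the SVM score into a probability, and the toy records are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "with"}  # assumed list
FIELDS = ("title", "abstract", "authors", "journal")  # hypothetical field names


def featurize(record):
    """Binary bag-of-words features, prefixed by the field they came from."""
    feats = set()
    for field in FIELDS:
        for raw in record.get(field, "").lower().split():
            tok = raw.strip(".,;:()[]'\"")
            if tok and tok not in STOPWORDS:
                feats.add(f"{field}:{tok}")
    return feats


def train_linear_svm(records, labels, epochs=50, lam=0.01, seed=0):
    """Pegasos-style stochastic subgradient descent on the hinge loss.

    labels are +1 (indexed as Humans) or -1 (not indexed as Humans).
    """
    rng = random.Random(seed)
    data = [(featurize(r), y) for r, y in zip(records, labels)]
    w = defaultdict(float)
    t = 0
    for _ in range(epochs):
        rng.shuffle(data)
        for feats, y in data:
            t += 1
            eta = 1.0 / (lam * t)              # decaying step size
            margin = y * sum(w[f] for f in feats)
            for f in w:                        # L2 regularization shrinkage
                w[f] *= 1.0 - eta * lam
            if margin < 1.0:                   # hinge loss is active
                for f in feats:
                    w[f] += eta * y
    return dict(w)


def human_probability(w, record):
    """Squash the SVM score through a sigmoid (assumed, uncalibrated mapping)."""
    score = sum(w.get(f, 0.0) for f in featurize(record))
    return 1.0 / (1.0 + math.exp(-score))


# Toy records invented for illustration only.
train_records = [
    {"title": "Randomized trial of aspirin in patients"},
    {"title": "Cohort study of women with diabetes"},
    {"title": "Survey of hospital patients"},
    {"title": "Gene expression in mice liver"},
    {"title": "Rat model of kidney injury"},
    {"title": "Zebrafish embryo development"},
]
train_labels = [1, 1, 1, -1, -1, -1]

model = train_linear_svm(train_records, train_labels)
p_human = human_probability(model, {"title": "Aspirin use in elderly patients"})
p_mouse = human_probability(model, {"title": "Kidney development in mice"})
print(f"human-like title: {p_human:.3f}, animal-like title: {p_mouse:.3f}")
```

In practice the sigmoid would need to be fitted on held-out data before its output could be read as a probability, and a library implementation of a linear SVM would be a more sensible choice at the scale of one million records.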

Original language: English (US)
Issue number: 2018
State: Published - Jan 1 2018

ASJC Scopus subject areas

  • Information Systems
  • General Biochemistry, Genetics and Molecular Biology
  • General Agricultural and Biological Sciences

