A comparison of techniques for classification and ad hoc retrieval of biomedical documents

A. M. Cohen; J. Yang; W. R. Hersh

Medical Informatics and Clinical Epidemiology

Research output: Contribution to journal › Conference article › peer-review

Abstract

Oregon Health & Science University participated in both the classification and ad hoc retrieval tasks of the TREC 2005 Genomics Track. To better understand the text classification techniques that lead to improved performance, we applied a set of general-purpose biomedical document classification systems to the four triage tasks, varying one system feature or text processing technique at a time. We found that our best and most consistent system consisted of a voting perceptron classifier, chi-square feature selection on full-text articles, binary feature weighting, stemming and stopping, and prefiltering based on the MeSH term Mice. This system approached, but did not surpass, the performance of the best TREC entry for each of the four tasks. Full text provided a substantial benefit over title plus abstract alone. Other common techniques, such as inverse document frequency feature weighting and cosine normalization, were ineffective. For the ad hoc retrieval task, we used the Zettair search engine. Both of our submissions used the Okapi measure, with parameters optimized using the provided sample topics. We used two different query sets in our runs: one with all the words from the topic file and the other with only the keywords. Queries with only keywords consistently outperformed queries with all words from the topic file. Optimization of the Okapi parameters improved our performance.
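
As a rough illustration of two of the techniques named in the abstract, chi-square feature selection over binary (presence/absence) term features might look like the sketch below. This is not the authors' implementation: the function names, the toy documents, and the labels are all hypothetical, and the scoring uses the standard chi-square statistic for a 2x2 term-by-class table.

```python
from collections import Counter


def chi_square_scores(docs, labels):
    """Score each term by the chi-square statistic of its 2x2 table
    (term present/absent versus positive/negative label)."""
    n = len(docs)
    n_pos = sum(labels)
    n_neg = n - n_pos
    present_pos = Counter()
    present_neg = Counter()
    for tokens, label in zip(docs, labels):
        # Binary weighting: a term counts once per document,
        # no matter how many times it occurs there.
        target = present_pos if label else present_neg
        target.update(set(tokens))

    scores = {}
    for term in set(present_pos) | set(present_neg):
        a = present_pos[term]  # positive documents containing the term
        b = present_neg[term]  # negative documents containing the term
        c = n_pos - a          # positive documents without the term
        d = n_neg - b          # negative documents without the term
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        # A zero denominator means the term (or a class) never varies,
        # so the term carries no information and scores 0.
        scores[term] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


def select_features(docs, labels, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    scores = chi_square_scores(docs, labels)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy data: tokenized documents and triage labels (1 = relevant).
toy_docs = [
    ["mice", "tumor", "gene"],
    ["mice", "allele"],
    ["human", "gene"],
    ["human", "protein", "gene"],
]
toy_labels = [1, 1, 0, 0]

if __name__ == "__main__":
    print(select_features(toy_docs, toy_labels, k=2))
```

The selected terms would then serve as the binary input features for a classifier such as the voting perceptron mentioned in the abstract.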

Original language	English (US)
Journal	NIST Special Publication
State	Published - 2005
Event	14th Text REtrieval Conference, TREC 2005 - Gaithersburg, MD, United States
Duration	Nov 15 2005 → Nov 18 2005

ASJC Scopus subject areas

General Engineering

Cite this

@article{e8f17f2f4b17494eae2b793f9bd8b799,

title = "A comparison of techniques for classification and ad hoc retrieval of biomedical documents",

abstract = "Oregon Health \& Science University participated in both the classification and ad hoc retrieval tasks of the TREC 2005 Genomics Track. To better understand the text classification techniques that lead to improved performance, we applied a set of general-purpose biomedical document classification systems to the four triage tasks, varying one system feature or text processing technique at a time. We found that our best and most consistent system consisted of a voting perceptron classifier, chi-square feature selection on full-text articles, binary feature weighting, stemming and stopping, and prefiltering based on the MeSH term Mice. This system approached, but did not surpass, the performance of the best TREC entry for each of the four tasks. Full text provided a substantial benefit over title plus abstract alone. Other common techniques, such as inverse document frequency feature weighting and cosine normalization, were ineffective. For the ad hoc retrieval task, we used the Zettair search engine. Both of our submissions used the Okapi measure, with parameters optimized using the provided sample topics. We used two different query sets in our runs: one with all the words from the topic file and the other with only the keywords. Queries with only keywords consistently outperformed queries with all words from the topic file. Optimization of the Okapi parameters improved our performance.",

author = "Cohen, {A. M.} and Yang, J. and Hersh, {W. R.}",

year = "2005",

language = "English (US)",

journal = "NIST Special Publication",

issn = "1048-776X",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

note = "14th Text REtrieval Conference, TREC 2005 ; Conference date: 15-11-2005 Through 18-11-2005",

}

TY - JOUR

T1 - A comparison of techniques for classification and ad hoc retrieval of biomedical documents

AU - Cohen, A. M.

AU - Yang, J.

AU - Hersh, W. R.

PY - 2005

Y1 - 2005

AB - Oregon Health & Science University participated in both the classification and ad hoc retrieval tasks of the TREC 2005 Genomics Track. To better understand the text classification techniques that lead to improved performance, we applied a set of general-purpose biomedical document classification systems to the four triage tasks, varying one system feature or text processing technique at a time. We found that our best and most consistent system consisted of a voting perceptron classifier, chi-square feature selection on full-text articles, binary feature weighting, stemming and stopping, and prefiltering based on the MeSH term Mice. This system approached, but did not surpass, the performance of the best TREC entry for each of the four tasks. Full text provided a substantial benefit over title plus abstract alone. Other common techniques, such as inverse document frequency feature weighting and cosine normalization, were ineffective. For the ad hoc retrieval task, we used the Zettair search engine. Both of our submissions used the Okapi measure, with parameters optimized using the provided sample topics. We used two different query sets in our runs: one with all the words from the topic file and the other with only the keywords. Queries with only keywords consistently outperformed queries with all words from the topic file. Optimization of the Okapi parameters improved our performance.

UR - http://www.scopus.com/inward/record.url?scp=84873552514&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84873552514&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84873552514

SN - 1048-776X

JO - NIST Special Publication

JF - NIST Special Publication

T2 - 14th Text REtrieval Conference, TREC 2005

Y2 - 15 November 2005 through 18 November 2005

ER -
