Finna: A paragraph prioritization system for biocuration in the neurosciences

Kyle H. Ambert; Aaron M. Cohen; Gully A.P.C. Burns; Eilis Boudreau; Kemal Sonmez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The emphasis on multilevel modeling techniques in the neurosciences has led to an increased need for large-scale, computationally accessible databases containing neuroscientific data. Despite this, such databases are not being populated at a rate commensurate with demand among neuroinformaticians. The reasons for this are common to scientific database curation in general, namely, limited resources. Much of neuroscience's long tradition of research has been documented in computationally inaccessible formats, such as PDF, making large-scale data extraction laborious and expensive. Here, we present a system for alleviating one bottleneck in the workflow for curating a typical knowledge base of neuroscience-related information. Finna is designed to rank-order the component paragraphs of a publication that is predicted to contain information relevant to a knowledge base, in terms of the probability that each paragraph documents relevant data. We were able to achieve excellent performance with our classifier (AUC > 0.90) on our manually curated neuroscience document corpus. Our approach would allow curators to read only a median of 2 paragraphs per document in order to identify information relevant to a neuron-related knowledge base. To our knowledge, this is the first system of its kind, and it will be a useful baseline for developing similar resources for the neurosciences and for curation in general.
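
The ranking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The scores, relevance labels, and function names below are hypothetical, and the per-paragraph probabilities are assumed to come from some already-trained classifier. The sketch orders each document's paragraphs by predicted relevance and counts how many a curator would read before reaching the first relevant one.

```python
from statistics import median


def rank_paragraphs(scores):
    """Return paragraph indices sorted by descending predicted relevance."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def paragraphs_read_until_hit(scores, relevant):
    """Count paragraphs read in ranked order, up to and including the first
    relevant one. Returns None if the document has no relevant paragraph."""
    for n_read, idx in enumerate(rank_paragraphs(scores), start=1):
        if idx in relevant:
            return n_read
    return None


# Hypothetical toy corpus: (per-paragraph scores, indices of relevant paragraphs).
docs = [
    ([0.10, 0.85, 0.40, 0.05], {1}),
    ([0.70, 0.20, 0.65, 0.30], {2}),
    ([0.50, 0.60, 0.55, 0.10], {0}),
]

reads = [paragraphs_read_until_hit(scores, relevant) for scores, relevant in docs]
median_reads = median(reads)
print(reads, median_reads)
```

A summary statistic such as the median of `reads` is one way to express curator effort per document. How the paper itself computes its reported median is not stated here.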

Original language	English (US)
Title of host publication	Discovery Informatics
Subtitle of host publication	AI Takes a Science-Centered View on Big Data - Papers from the AAAI Fall Symposium, Technical Report
Publisher	AI Access Foundation
Pages	2-7
Number of pages	6
ISBN (Print)	9781577356394
State	Published - 2013
Event	2013 AAAI Fall Symposium - Arlington, VA, United States; Duration: Nov 15 2013 → Nov 17 2013

Publication series

Name	AAAI Fall Symposium - Technical Report
Volume	FS-13-01

Other

Other	2013 AAAI Fall Symposium
Country/Territory	United States
City	Arlington, VA
Period	11/15/13 → 11/17/13

ASJC Scopus subject areas

General Engineering

Cite this

Ambert, K. H., Cohen, A. M., Burns, G. A. P. C., Boudreau, E., & Sonmez, K. (2013). Finna: A paragraph prioritization system for biocuration in the neurosciences. In Discovery Informatics: AI Takes a Science-Centered View on Big Data - Papers from the AAAI Fall Symposium, Technical Report (pp. 2-7). (AAAI Fall Symposium - Technical Report; Vol. FS-13-01). AI Access Foundation.

Finna: A paragraph prioritization system for biocuration in the neurosciences. / Ambert, Kyle H.; Cohen, Aaron M.; Burns, Gully A.P.C. et al.
Discovery Informatics: AI Takes a Science-Centered View on Big Data - Papers from the AAAI Fall Symposium, Technical Report. AI Access Foundation, 2013. p. 2-7 (AAAI Fall Symposium - Technical Report; Vol. FS-13-01).

Ambert, KH, Cohen, AM, Burns, GAPC, Boudreau, E & Sonmez, K 2013, Finna: A paragraph prioritization system for biocuration in the neurosciences. in Discovery Informatics: AI Takes a Science-Centered View on Big Data - Papers from the AAAI Fall Symposium, Technical Report. AAAI Fall Symposium - Technical Report, vol. FS-13-01, AI Access Foundation, pp. 2-7, 2013 AAAI Fall Symposium, Arlington, VA, United States, 11/15/13.

@inproceedings{41114c7967d242719226ca2e3d11322f,
title = "Finna: A paragraph prioritization system for biocuration in the neurosciences",
abstract = "The emphasis of multilevel modeling techniques in the neurosciences has led to an increased need for large-scale, computationally-accessible databases containing neuroscientific data. Despite this, such databases are not being populated at a rate commensurate with their demand amongst Neuroinformaticians. The reasons for this are common to scientific database curation in general, namely, limitation of resources. Much of neuroscience's long tradition of research has been documented in computationally inaccessible formats, such as the pdf, making large-scale data extraction laborious and expensive. Here, we present a system for alleviating one bottleneck in the workflow for curating a typical knowledge base of neuroscience-related information. Finna is designed to rank-order the composite paragraphs of a publication that is predicted to contain information relevant to a knowledge base, in terms of the probability that each documents relevant data. We were able to achieve excellent performance with our classifier (AUC > 0.90) on our manually-curated neuroscience document corpus. Our approach would allow curators to read only a median of 2 paragraphs for each document, in order to identify information relevant to a neuron-related knowledge base. To our knowledge, this is the first system of its kind, and will be a useful baseline for developing similar resources for the neurosciences, and curation in general.",

author = "Ambert, {Kyle H.} and Cohen, {Aaron M.} and Burns, {Gully A.P.C.} and Eilis Boudreau and Kemal Sonmez",
year = "2013",
language = "English (US)",
isbn = "9781577356394",
series = "AAAI Fall Symposium - Technical Report",
publisher = "AI Access Foundation",
pages = "2--7",
booktitle = "Discovery Informatics",
note = "2013 AAAI Fall Symposium ; Conference date: 15-11-2013 Through 17-11-2013",
}

TY - GEN
T1 - Finna
T2 - 2013 AAAI Fall Symposium
AU - Ambert, Kyle H.
AU - Cohen, Aaron M.
AU - Burns, Gully A.P.C.
AU - Boudreau, Eilis
AU - Sonmez, Kemal
PY - 2013
Y1 - 2013
N2 - The emphasis of multilevel modeling techniques in the neurosciences has led to an increased need for large-scale, computationally-accessible databases containing neuroscientific data. Despite this, such databases are not being populated at a rate commensurate with their demand amongst Neuroinformaticians. The reasons for this are common to scientific database curation in general, namely, limitation of resources. Much of neuroscience's long tradition of research has been documented in computationally inaccessible formats, such as the pdf, making large-scale data extraction laborious and expensive. Here, we present a system for alleviating one bottleneck in the workflow for curating a typical knowledge base of neuroscience-related information. Finna is designed to rank-order the composite paragraphs of a publication that is predicted to contain information relevant to a knowledge base, in terms of the probability that each documents relevant data. We were able to achieve excellent performance with our classifier (AUC > 0.90) on our manually-curated neuroscience document corpus. Our approach would allow curators to read only a median of 2 paragraphs for each document, in order to identify information relevant to a neuron-related knowledge base. To our knowledge, this is the first system of its kind, and will be a useful baseline for developing similar resources for the neurosciences, and curation in general.

UR - http://www.scopus.com/inward/record.url?scp=84898867244&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84898867244&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84898867244
SN - 9781577356394
T3 - AAAI Fall Symposium - Technical Report
SP - 2
EP - 7
BT - Discovery Informatics
PB - AI Access Foundation
Y2 - 15 November 2013 through 17 November 2013
ER -
