A Distributed Framework for Computation on the Results of Large Scale NLP Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)
Questions that could be answered using large corpora • Second source of data for validation/corroboration • Ligand binding site validation – Verspoor et al. • Rough ideas/leads to PPI from co-occurrence • Protein co-occurrence fraction for use in Hanalyzer networks • Mine more, and more recent, knowledge than is available from curated ontologies
Available Tools and Data • Data • Large corpora: PMC OA, publisher-arranged collections • Curated Ontologies: PRO, GO, etc. • Tools • UIMA for NLP Processing • Batch schedulers (SGE, Torque) to scale UIMA • Hadoop to collate data • RDF to represent knowledge • Triple Store (Franz AllegroGraph) to store and access large amounts of RDF data
Bio Trends: a Sample Integration Project • Function: • Count occurrences of proteins in articles • Collate by date and display in a web app • Design • UIMA over SGE for protein ID, store in RDF files • Read RDF files and collate with Hadoop • Call out to AllegroGraph for ID and attribute lookup • Format resulting data as JSON for availability to web app
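The "collate by date" and "format as JSON" steps above can be sketched in a few lines. This is a minimal illustration, not the Bio Trends implementation: the record shape `(protein_id, year, count)`, the function names, and the placeholder IDs `PR:X` and `PR:Y` are all assumptions made for the example.

```python
import json
from collections import defaultdict


def collate_by_year(records):
    """Sum mention counts per protein per year.

    `records` is an iterable of (protein_id, year, count) tuples; this
    shape is a hypothetical stand-in for whatever the pipeline emits.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for protein_id, year, count in records:
        totals[protein_id][year] += count
    # Convert nested defaultdicts to plain dicts for serialization.
    return {pid: dict(years) for pid, years in totals.items()}


def to_json(collated):
    """Serialize the collated counts; JSON turns integer years into strings."""
    return json.dumps(collated, sort_keys=True)


# Placeholder identifiers, not real PRO IDs.
sample = [("PR:X", 2009, 2), ("PR:X", 2010, 1), ("PR:X", 2009, 3), ("PR:Y", 2010, 4)]
collated = collate_by_year(sample)
payload = to_json(collated)
```

A web app could then fetch `payload` and plot one series per protein, with years on the x-axis.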
Prepare Available Data • Start with raw text: PMC Open Access • 250k full-text journal articles • Identify (annotate) interesting spans (genes) • UIMA pipeline: NERs (ABNER, BANNER, etc.), then a concept mapper on a PRO dictionary to normalize • Output to RDF for various uses
Options to Analyze Data • Load into triple store and query • Necessary for exploring queries with complex results over the entire graph • Ex. • Load individual files into an in-memory store and query in small groups • Suitable for exploring simple queries over many small regions of the graph, e.g. article-related queries • Easier to federate • Hybrid • Some data is not available from the RDF files, only from the triple store.
Map-Reduce • Inspired by Lisp functions “map” and “reduce” • Map applies a function to each element of a list • (a1, a2, …, an), f(x) → (f(a1), f(a2), …, f(an)) • Reduce combines the elements of a list by applying a two-argument function successively • (a1, a2, …, an), f(x,y) → f(…f(f(a1, a2), a3)…, an) • (1, 2, …, n), + → (…((1 + 2) + 3)… + n)
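The two operations on this slide can be tried directly with Python's built-in `map` and `functools.reduce`. This is only an illustration of the idea; the `square` function and the sample list are made up for the example.

```python
from functools import reduce
from operator import add


def square(x):
    return x * x


numbers = [1, 2, 3, 4]

# Map: apply a one-argument function to every element.
mapped = list(map(square, numbers))

# Reduce: fold the list from the left with a two-argument function.
total = reduce(add, numbers)

# Building a string instead of a number makes the left-nesting visible.
nesting = reduce(lambda x, y: f"({x}+{y})", ["1", "2", "3", "4"])
print(nesting)  # prints (((1+2)+3)+4)
```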
Map Reduce on HashMaps • Map can be used to transform from one kind of key, value to a different kind of key, value • (Filename, text) → (gene, count) • Reduce must have same kind of key and value output as input. A call to reduce gets all values for a particular key. • (gene, count) → (gene, count) • (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
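The key-value version can be simulated in memory to show the three stages: map, group by key, reduce. This is a sketch under simplifying assumptions: gene mentions are found by whitespace tokenization rather than by an NER pipeline, and the names `map_phase`, `shuffle`, `reduce_phase`, and `gene_counts` are hypothetical.

```python
from collections import defaultdict


def map_phase(filename, text, genes):
    """(filename, text) -> zero or more (gene, count) pairs."""
    tokens = text.split()
    for gene in genes:
        n = tokens.count(gene)
        if n:
            yield gene, n


def shuffle(pairs):
    """Group values by key, so each reduce call sees all values for one key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(gene, counts):
    """(gene, [count, ...]) -> (gene, count): same key and value kind as the input."""
    return gene, sum(counts)


def gene_counts(documents, genes):
    pairs = [
        pair
        for filename, text in documents.items()
        for pair in map_phase(filename, text, genes)
    ]
    return dict(reduce_phase(g, c) for g, c in shuffle(pairs).items())


docs = {
    "a.txt": "BRCA1 binds",
    "b.txt": "BRCA1 BRCA1 BRCA1 TP53",
    "c.txt": "BRCA1 again",
}
result = gene_counts(docs, ["BRCA1", "TP53"])
```

With these three toy documents the mapper emits (BRCA1, 1), (BRCA1, 3), (BRCA1, 1), which the reducer sums to (BRCA1, 5), as on the slide.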
Hadoop: a distributed map-reduce on maps or hash tables • Can divide work into parallel-friendly tasks by key • Distributes files over the network • Reduces network traffic by performing computation where the data is • Map is used to move from one key-value type to another: from (filename => contents) to (protein-protein pair => co-occurrence count) • Reduce is used to collate results.
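The co-occurrence mapping described above can be sketched without Hadoop to show what the map and reduce steps compute. This is a single-process illustration, not the project's Hadoop job: it assumes each document has already been reduced to a list of protein IDs, it counts a pair at most once per document, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def map_cooccurrence(filename, protein_ids):
    """(filename, protein IDs) -> ((protein, protein), 1) per unordered pair."""
    # Sorting makes (A, B) and (B, A) the same key; set() drops repeats.
    for pair in combinations(sorted(set(protein_ids)), 2):
        yield pair, 1


def cooccurrence_counts(annotated_docs):
    """Collate the mapper output by summing the values for each pair key."""
    counts = Counter()
    for filename, protein_ids in annotated_docs.items():
        for pair, one in map_cooccurrence(filename, protein_ids):
            counts[pair] += one
    return dict(counts)


annotated = {
    "a.txt": ["BRCA1", "TP53", "BRCA1"],
    "b.txt": ["TP53", "BRCA1", "ATM"],
    "c.txt": ["ATM"],
}
pairs = cooccurrence_counts(annotated)
```

Because the pair is the key, a distributed run can send all values for one pair to the same reducer, which is what makes the task split cleanly by key.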
Results • PMC OA • Medline Abstracts
Screen Shot • Grants:
Thank You / Questions • http://www.compbio.ucdenver/bio-trends • Co-authors • William Baumgartner for data generation • Kevin Livingston for RDF and Clojure help • Grants and PIs • Larry Hunter, UCDenver SOM • NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02 • Karin Verspoor, UCDenver SOM • NIH R01 LM010120-01 • Gully Burns, ISI • NSF 0849977