A Distributed Framework for Computation on the Results of Large Scale NLP

Christophe Roeder, William A. Baumgartner Jr., Kevin Livingston, Lawrence Hunter (UC Denver)

Questions that could be answered using large corpora • Second source of data for validation/corroboration • Ligand binding site validation – Verspoor et al. • Rough ideas/leads on protein-protein interactions (PPI) from co-occurrence • Protein co-occurrence fraction for use in Hanalyzer networks • Mine more, and more recent, knowledge than is available from curated ontologies

Available Tools and Data • Data • Large corpora: PMC OA, publisher-arranged collections • Curated Ontologies: PRO, GO, etc. • Tools • UIMA for NLP Processing • Batch schedulers (SGE, Torque) to scale UIMA • Hadoop to collate data • RDF to represent knowledge • Triple Store (Franz AllegroGraph) to store and access large amounts of RDF data

Bio Trends: a Sample Integration Project • Function: • Count occurrences of proteins in articles • Collate by date and display in a web app • Design • UIMA over SGE for protein ID, store in RDF files • Read RDF files and collate with Hadoop • Call out to AllegroGraph for ID and attribute lookup • Format resulting data as JSON for availability to the web app
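The last two steps of this design (collate by date, then format as JSON) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's actual code: the record layout `(protein_id, year, count)`, the function name `collate_by_year`, and the placeholder IDs `PR:EXAMPLE1` and `PR:EXAMPLE2` are all hypothetical.

```python
import json
from collections import defaultdict


def collate_by_year(records):
    """Sum per-article protein counts into a {protein_id: {year: total}} table.

    `records` is an iterable of (protein_id, year, count) tuples; this layout
    is an assumption made for the sketch.
    """
    table = defaultdict(lambda: defaultdict(int))
    for protein_id, year, count in records:
        table[protein_id][year] += count
    # Convert to plain, sorted dicts so the result serializes predictably.
    return {p: dict(sorted(years.items())) for p, years in sorted(table.items())}


# Hypothetical per-article counts; the IDs are placeholders, not real PRO IDs.
records = [
    ("PR:EXAMPLE1", 2009, 2),
    ("PR:EXAMPLE2", 2010, 4),
    ("PR:EXAMPLE1", 2009, 1),
    ("PR:EXAMPLE1", 2010, 1),
]

collated = collate_by_year(records)

# json.dumps turns the integer year keys into strings, as JSON requires.
payload = json.dumps(collated)
```

A web app could then fetch `payload` and plot counts per year for each protein.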

Prepare Available Data • Start with raw text: PMC Open Access: • 250k full-text journal articles • Identify (annotate) interesting spans (genes) • UIMA pipeline, NERs: ABNER, BANNER, etc.; concept mapper on PRO dictionary to normalize • Output to RDF for various uses

Options to Analyze Data • Load into triple store and query • Necessary for exploring queries with complex results over the entire graph • Ex. • Load individual files into in-memory store and query in small groups • Possible for exploring simple queries over many small regions of the graph: article related • Easier to federate • Hybrid • Some data is not available from the RDF files, only from the triple store.

Map-Reduce • Inspired by the Lisp functions “map” and “reduce” • Map applies a function to each element of a list • (a1, a2, …, an), f(x) → (f(a1), f(a2), …, f(an)) • Reduce combines the elements of a list by applying a two-argument function successively • (a1, a2, a3, a4), f(x,y) → f(f(f(a1,a2),a3),a4) • (1, 2, 3, 4), + → (((1+2)+3)+4)
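The two list operations on this slide can be tried directly with Python's built-in `map` and `functools.reduce`. This is only an illustration of the idea; the `square` function and the sample lists are chosen for the sketch.

```python
from functools import reduce
from operator import add


def square(x):
    """Example one-argument function to apply to each element."""
    return x * x


values = [1, 2, 3, 4]

# Map: apply a function to each element of a list.
mapped = list(map(square, values))

# Reduce: fold the list from the left with a two-argument function,
# i.e. ((1 + 2) + 3) + 4.
total = reduce(add, values)

# The same fold over symbolic names shows the nesting order.
nesting = reduce(lambda x, y: f"f({x},{y})", ["a1", "a2", "a3", "a4"])
```

Here `mapped` is `[1, 4, 9, 16]`, `total` is `10`, and `nesting` is the string `f(f(f(a1,a2),a3),a4)`.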

Map Reduce on HashMaps • Map can be used to transform one kind of (key, value) pair into a different kind • (filename, text) → (gene, count) • Reduce must produce the same kind of key and value as it receives. A call to reduce gets all values for a particular key. • (gene, count) → (gene, count) • (BRCA1, 1), (BRCA1, 3), (BRCA1, 1) → (BRCA1, 5)
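The key/value version can be simulated in memory to make the data flow concrete. The sketch below is not Hadoop code: the gene dictionary `GENES`, the whitespace tokenizer, and the function names `map_fn`, `shuffle`, `reduce_fn`, and `run` are assumptions for illustration, and the grouping step stands in for what a framework does between map and reduce.

```python
from collections import Counter, defaultdict

# Hypothetical gene dictionary; a real pipeline would use NER output instead.
GENES = {"BRCA1", "TP53"}


def map_fn(filename, text):
    """(filename, text) -> list of (gene, count) pairs for one document."""
    counts = Counter(tok for tok in text.split() if tok in GENES)
    return list(counts.items())


def shuffle(pairs):
    """Group values by key so each reduce call sees all values for one key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(gene, counts):
    """(gene, [count, ...]) -> (gene, total count)."""
    return gene, sum(counts)


def run(documents):
    pairs = [pair for name, text in documents.items() for pair in map_fn(name, text)]
    return dict(reduce_fn(gene, counts) for gene, counts in shuffle(pairs).items())


documents = {
    "a.txt": "BRCA1 binds TP53",
    "b.txt": "BRCA1 BRCA1 and BRCA1 again",
    "c.txt": "loss of BRCA1",
}
result = run(documents)
```

The three documents emit (BRCA1, 1), (BRCA1, 3), and (BRCA1, 1), which reduce to (BRCA1, 5), matching the example on the slide.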

Hadoop: a distributed map-reduce on maps or hash tables • Can divide work into parallel-friendly tasks by key • Distributes files over the network • Reduces network traffic by performing computation where the data is • Map is used to move from one key-value type to another: from (filename => contents) to (protein-protein, co-occurrence count). • Reduce is used to collate results.
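The move from (filename => contents) to (protein-protein, co-occurrence count) can be sketched as a pair of plain functions. This runs in a single process and does not use the Hadoop API; the protein set `PROTEINS`, the whitespace tokenizer, the choice to count each unordered pair once per article, and all function names are assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical protein dictionary standing in for the annotation output.
PROTEINS = {"BRCA1", "TP53", "RAD51"}


def map_cooccurrence(filename, contents):
    """(filename, contents) -> [((protein_a, protein_b), 1), ...].

    Emits one pair per unordered combination of distinct proteins
    mentioned in the article; sorting makes the pair key canonical.
    """
    found = sorted({tok for tok in contents.split() if tok in PROTEINS})
    return [((a, b), 1) for a, b in combinations(found, 2)]


def reduce_counts(pair, ones):
    """Collate all values for one protein pair into a single count."""
    return pair, sum(ones)


def cooccurrence_counts(files):
    grouped = defaultdict(list)
    for filename, contents in files.items():
        for pair, one in map_cooccurrence(filename, contents):
            grouped[pair].append(one)
    return dict(reduce_counts(pair, ones) for pair, ones in grouped.items())


files = {
    "pmc1.txt": "BRCA1 recruits RAD51 while TP53 is active",
    "pmc2.txt": "TP53 and BRCA1 are both mutated",
    "pmc3.txt": "RAD51 alone",
}
pair_counts = cooccurrence_counts(files)
```

Because each map call depends on only one file and each reduce call on only one key, both steps can be spread across machines, which is the property the slide attributes to dividing tasks by key.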

Results • PMC OA • Medline Abstracts

Screen Shot • Grants:

Thank You / Questions • http://www.compbio.ucdenver/bio-trends • Co-authors • William Baumgartner for data generation • Kevin Livingston for RDF and Clojure help • Grants and PIs • Larry Hunter, UCDenver SOM • NIH 2R01LM009254-04, NIH 2R01LM008111-04A1, NIH 5R01GM083649-02 • Karin Verspoor, UCDenver SOM • NIH R01 LM010120-01 • Gully Burns, ISI • NSF 0849977
