Solving ILP Problems in the EELA infrastructure Inês Dutra Departamento de Ciência de Computadores Universidade do Porto, Portugal
Outline • Introduction • ILP • Examples • Motivation • Experiments • Conclusions • Future Work
Introduction • EELA selected application • Task 3.3: additional applications
Introduction • What is ILP? • It is NOT Instruction Level Parallelism • It is NOT Integer Linear Programming • So, what is it???? • .......
Introduction • It is Inductive Logic Programming • data mining • machine learning • Knowledge/information extraction • Where: • Given: • Set of observations (positive and negative) • Background knowledge (descriptions) • Language bias • Find: • A hypothesis (in first order language) that best explains all positive observations and none of the negatives.
Introduction • Advantages: • Use of an understandable description language • Relational knowledge
Introduction: example TRAINS GOING EAST TRAINS GOING WEST
Introduction: example TRAINS GOING EAST TRAINS GOING WEST eastbound(T) IF has_car(T,C) AND short(C) AND closed(C)
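A minimal sketch of how a rule like the one above can be checked against examples: it should cover every positive (eastbound) train and no negative (westbound) one. The train data is invented for illustration and is not the dataset shown in the slide's figure; the names `Car`, `TRAINS`, `eastbound`, and `covers_correctly` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Car:
    short: bool
    closed: bool


# Hypothetical background knowledge: each train is a list of cars.
TRAINS: Dict[str, List[Car]] = {
    "t1": [Car(short=True, closed=True), Car(short=False, closed=False)],
    "t2": [Car(short=False, closed=True), Car(short=True, closed=True)],
    "t3": [Car(short=True, closed=False), Car(short=False, closed=True)],
    "t4": [Car(short=False, closed=False)],
}
POSITIVES = ["t1", "t2"]  # assumed eastbound examples
NEGATIVES = ["t3", "t4"]  # assumed westbound examples


def eastbound(train: str) -> bool:
    """eastbound(T) IF has_car(T,C) AND short(C) AND closed(C)."""
    return any(car.short and car.closed for car in TRAINS[train])


def covers_correctly(rule: Callable[[str], bool],
                     positives: List[str],
                     negatives: List[str]) -> bool:
    """True when the rule covers all positives and none of the negatives."""
    return (all(rule(t) for t in positives)
            and not any(rule(t) for t in negatives))


if __name__ == "__main__":
    print(covers_correctly(eastbound, POSITIVES, NEGATIVES))
```

Note that the same car must be both short and closed: train `t3` has one short car and one closed car, yet the rule does not fire for it.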
Another, less “toyish” example: extracting knowledge from mammograms
is_malignant(A) if
    'BIRADS_category'(A,b5),
    'MassPAO'(A,present),
    'Age'(A,age6570),
    previous_finding(A,B,C),
    'MassesShape'(B,none),
    'Calc_Punctate'(B,notPresent),
    previous_finding(A,C),
    'BIRADS_category'(C,b3).
This rule states that finding (A) IS malignant IF it: • was classified as BI-RADS 5 • AND had a mass present • in a patient who: • was between the ages of 65 and 70 • had two prior mammograms (B, C) • and prior mammogram (B): • had no mass shape described • had no punctate calcifications • and prior mammogram (C) was classified as BI-RADS 3
Introduction: Motivation • Applications: • Link discovery • Social Network Analysis • Equivalent identities • Drug design • Protein unfolding • Protein metabolism • Why not? Classifying grid failures • And... many others!
Introduction: Motivation • Why does ILP need a grid? • Search space can become large very quickly • Need many experiments to obtain statistically significant results • Cross-validation • Training, tuning, testing • Can combine classifiers: ensembles
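A minimal sketch of combining classifiers into an ensemble. The slides do not say how the classifiers are combined; simple majority voting is assumed here purely for illustration, and the three toy classifiers are hypothetical stand-ins for learned ILP theories.

```python
from collections import Counter
from typing import Callable, Sequence

Classifier = Callable[[int], bool]


def majority_vote(classifiers: Sequence[Classifier], example: int) -> bool:
    """Predict True when strictly more classifiers vote True than False."""
    votes = Counter(clf(example) for clf in classifiers)
    return votes[True] > votes[False]


# Hypothetical classifiers over integers, standing in for learned rules.
ENSEMBLE = [
    lambda x: x > 10,
    lambda x: x % 2 == 0,
    lambda x: x > 100,
]

if __name__ == "__main__":
    print(majority_vote(ENSEMBLE, 12))  # two of three vote True
    print(majority_vote(ENSEMBLE, 7))   # none vote True
```

Each ensemble member has to be learned separately, which is part of why the number of experiments grows so quickly.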
Introduction: Motivation • Assume we want to run a task for one domain: find a “good” hypothesis that describes pos examples • Assume we run 5x4-fold cross-validation • Assume we have 100 classifiers per fold • # of experiments: 2,000
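The count above is just the product 5 × 4 × 100. A small sketch that enumerates the experiments as independent jobs; the field names `repeat`, `fold`, and `classifier` are hypothetical labels, not identifiers from the actual experiment scripts.

```python
from itertools import product

REPEATS = 5                 # 5 repetitions of cross-validation
FOLDS = 4                   # 4 folds per repetition
CLASSIFIERS_PER_FOLD = 100  # classifiers learned per fold

# One dictionary per independent experiment (one grid job each).
experiments = [
    {"repeat": r, "fold": f, "classifier": c}
    for r, f, c in product(range(REPEATS),
                           range(FOLDS),
                           range(CLASSIFIERS_PER_FOLD))
]

if __name__ == "__main__":
    print(len(experiments))  # 2000
```

Because no experiment depends on another, each entry can be submitted as a separate job.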
Introduction: Motivation • Now assume each experiment takes 1 hour to run • How long would it take to generate the 2,000 classifiers to be combined? ~ 83 days!!! • If we consider varying learning parameters and learning algorithms, this number can be really big!!
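The 83-day figure follows from 2,000 hours ÷ 24 hours per day ≈ 83.3 days on a single machine. A short sketch of that arithmetic, with an optional `workers` parameter to show the effect of running jobs in parallel; the function name and the worker count of 100 are illustrative assumptions, and scheduling overhead is ignored.

```python
import math


def wall_clock_days(num_experiments: int,
                    hours_each: float,
                    workers: int = 1) -> float:
    """Idealised wall-clock time in days, ignoring queueing and failures."""
    # Jobs run in "waves" of at most `workers` jobs at a time.
    waves = math.ceil(num_experiments / workers)
    return waves * hours_each / 24


if __name__ == "__main__":
    print(round(wall_clock_days(2000, 1.0), 1))               # 83.3
    print(round(wall_clock_days(2000, 1.0, workers=100), 2))  # 0.83
```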
Experiment • Predict carcinogenicity in rodents • Difficult task • Large search space! • Important problem • Phase 1: • Tuning using 5x4-fold cross-validation • Generating ensembles of up to 100 classifiers • Aleph: well-known ILP system • YAP: Yet Another Prolog
Experiment: one of the classifiers • active(A) if atom(A,_,n,32,B), B ≤ -0.401, has_property(A,cytogen_sce,n), methyl(A,_). • SCE: Sister Chromatid Exchange • SCE is used for the determination of mutagenicity
Experiment • 2 submissions: • From LA • From EU
Experiment • EELA resources utilised: • ~300 resources in LA • 211 jobs in LA
Experiments • Why only 1,969 out of 2,000 jobs? • 2 reasons: • Proxy expiration: • On submission (which takes a very long time!) • On execution • Use of dynamic libraries
Submitting jobs from EU... • from a non-EELA site, BUT • Using the EELA VO: • Jobs run only on EU resources... • Reasons: • Misconfiguration? • Closer brokers with more machines?
Conclusions • Happiness: EELA is working!!! • We can run thousands of experiments! • Frida is happy!!! (see the Condor introductory tutorials if you are curious about Frida) • Experiment showed good utilization of EELA resources in LA and EU • Low failure rate (1%) • Failures caused by: • Dynamic libs not available on the remote machine • Proxy expiration
Future work • More detailed analysis of jobs and logs • Full ILP experiment • More domains • Other kinds of experiments based on Statistical Relational Learning • And, do not forget: ILP can help to model and diagnose errors in the grid environment!
Collaborators • Fernando Silva (DCC-UPorto) • Vítor Santos Costa (DCC-UPorto) • Rui Camacho (FE-UPorto) • Nuno Fonseca (IBMC/IBMEC, Porto) • Beth Burnside (UW-Madison hospital) • David Page (UW-Madison) • Jesse Davis (UWashington)
Thanks!!! Questions??