Solving Inductive Logic Programming Problems in EELA Infrastructure

Solving ILP Problems in the EELA Infrastructure • Inês Dutra • Departamento de Ciência de Computadores, Universidade do Porto, Portugal

Outline • Introduction • ILP • Examples • Motivation • Experiments • Conclusions • Future Work

Introduction • EELA selected application • Task 3.3: additional applications

Introduction • What is ILP? • It is NOT Instruction Level Parallelism • It is NOT Integer Linear Programming • So, what is it???? • .......

Introduction • It is Inductive Logic Programming • Data mining • Machine learning • Knowledge/information extraction • Where: • Given: • Set of observations (positive and negative) • Background knowledge (descriptions) • Language bias • Find: • A hypothesis (in a first-order language) that best explains all positive observations and none of the negatives.

Introduction • Advantages: • Use of an understandable description language • Relational knowledge

Introduction: example • [Figure: trains going east and trains going west]

Introduction: example • [Figure: trains going east and trains going west] • eastbound(T) IF has_car(T,C) AND short(C) AND closed(C)
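
The rule above can be read as a coverage test: a train is predicted eastbound if at least one of its cars is both short and closed. The sketch below checks that reading in Python. The four trains, their cars, and the east/west labels are hypothetical toy data invented for illustration (the original figure is not available), and the `Car` class and `eastbound` function are names assumed here, not part of any ILP system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Car:
    short: bool
    closed: bool


# Hypothetical toy data: each train is a list of cars.
trains = {
    "t1": [Car(short=True, closed=True), Car(short=False, closed=False)],
    "t2": [Car(short=False, closed=True), Car(short=True, closed=True)],
    "t3": [Car(short=True, closed=False), Car(short=False, closed=True)],
    "t4": [Car(short=False, closed=False)],
}
positives = {"t1", "t2"}  # assumed eastbound examples
negatives = {"t3", "t4"}  # assumed westbound examples


def eastbound(cars):
    """eastbound(T) IF has_car(T,C) AND short(C) AND closed(C)."""
    # The same car C must be both short and closed.
    return any(car.short and car.closed for car in cars)


covered = {name for name, cars in trains.items() if eastbound(cars)}
print("covered positives:", sorted(covered & positives))
print("covered negatives:", sorted(covered & negatives))
```

On this toy data the rule covers both positive examples and none of the negatives, which is what the "Find" step of ILP is looking for. Note that train t3 has one short car and one closed car but is still rejected, because the rule requires a single car with both properties.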

Another less “toyish” example: extracting knowledge from mammograms
is_malignant(A) if
    'BIRADS_category'(A,b5),
    'MassPAO'(A,present),
    'Age'(A,age6570),
    previous_finding(A,B,C),
    'MassesShape'(B,none),
    'Calc_Punctate'(B,notPresent),
    previous_finding(A,C),
    'BIRADS_category'(C,b3).
This rule states that finding (A) IS malignant IF it:
• was classified as BI-RADS 5
• had a mass present
• in a patient who:
  • was between the ages of 65 and 70
  • had two prior mammograms (B, C)
• and prior mammogram (B):
  • had no mass shape described
  • had no punctate calcifications
• and prior mammogram (C) was classified as BI-RADS 3

Introduction: Motivation • Applications: • Link discovery • Social Network Analysis • Equivalent identities • Drug design • Protein unfolding • Protein metabolism • Why not? Classifying grid failures • And...many others!

Introduction: Motivation • Why does ILP need a grid? • Search space can become large very quickly • Need many experiments to have statistically significant results • Cross-validation • Training, tuning, testing • Can combine classifiers: ensembles
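
As an illustration of what "combining classifiers" can mean, here is a minimal sketch of an ensemble that takes a majority vote. The slides do not say which combination scheme was used, so majority voting is an assumption made here, and the three threshold classifiers and the `score` field are hypothetical stand-ins for learned ILP theories.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most common label among the individual predictions."""
    label, _count = Counter(predictions).most_common(1)[0]
    return label


def ensemble_predict(classifiers, example):
    """Ask every classifier for a label and combine the answers by voting."""
    return majority_vote([clf(example) for clf in classifiers])


# Hypothetical stand-ins for learned classifiers (an odd number avoids ties).
classifiers = [
    lambda x: x["score"] > 0.3,
    lambda x: x["score"] > 0.5,
    lambda x: x["score"] > 0.7,
]

print(ensemble_predict(classifiers, {"score": 0.6}))  # two of three vote True
print(ensemble_predict(classifiers, {"score": 0.4}))  # two of three vote False
```

Each member of the ensemble has to be learned separately, which is one reason the number of experiments grows so quickly.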

Introduction: Motivation • Assume we want to run a task for one domain: find a “good” hypothesis that describes the positive examples • Assume we run 5x4-fold cross-validation • Assume we have 100 classifiers per fold • Number of experiments: 5 × 4 × 100 = 2,000

Introduction: Motivation • Now assume each experiment takes 1 hour to run • How long would it take to generate the 2,000 classifiers to be combined on a single machine? ~83 days (2,000 hours / 24)!!! • If we also vary learning parameters and learning algorithms, this number grows much larger!!
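
The arithmetic behind these figures can be checked with a few lines of Python. The counts and the one-hour run time come from the slides. The `ideal_days` helper and the worker count of 100 are assumptions added for illustration, and the helper gives only an idealized lower bound that ignores queueing, failures, and data transfer.

```python
import math

cv_runs = 5 * 4               # "5x4-fold" cross-validation: 20 train/test configurations
classifiers_per_fold = 100
hours_per_experiment = 1

experiments = cv_runs * classifiers_per_fold
serial_hours = experiments * hours_per_experiment
serial_days = serial_hours / 24

print(f"{experiments} experiments, about {serial_days:.1f} days on one machine")


def ideal_days(workers):
    """Idealized wall-clock days if `workers` machines run jobs in lockstep batches."""
    batches = math.ceil(experiments / workers)
    return batches * hours_per_experiment / 24


# Hypothetical worker count, chosen only to show the scale of the speed-up.
print(f"about {ideal_days(100):.2f} days with 100 workers (ideal case)")
```

Since the experiments are independent of each other, they are a natural fit for a grid where each one runs as a separate job.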

Experiment • Predict carcinogenicity in rodents • Difficult task • Large search space! • Important problem • Phase 1: • Tuning using 5x4-fold cross-validation • Generating ensembles of up to 100 classifiers • Aleph: well-known ILP system • YAP: Yet Another Prolog
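
The slides do not define "5x4-fold" precisely. One plausible reading, assumed in the sketch below, is a nested scheme: 5 outer folds held out for testing and, inside each outer training set, 4 inner folds held out in turn for tuning. (Five repetitions of plain 4-fold cross-validation would give the same count of 20 configurations.) The function names, the seed, and the dataset size of 40 are hypothetical.

```python
import random


def kfold(indices, k):
    """Split a list of indices into k disjoint folds by striding."""
    return [indices[i::k] for i in range(k)]


def nested_cv(n_examples, outer=5, inner=4, seed=0):
    """Yield (outer_fold, inner_fold, train, tune, test) index lists."""
    rng = random.Random(seed)
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for o, test in enumerate(kfold(indices, outer)):
        test_set = set(test)
        rest = [i for i in indices if i not in test_set]
        for j, tune in enumerate(kfold(rest, inner)):
            tune_set = set(tune)
            train = [i for i in rest if i not in tune_set]
            yield o, j, train, tune, test


splits = list(nested_cv(n_examples=40))
print(len(splits), "train/tune/test configurations")
print(len(splits) * 100, "jobs if each configuration trains 100 classifiers")
```

Under this reading every example lands in exactly one of the train, tune, or test sets for each configuration, and the 20 configurations times 100 ensemble members give the 2,000 experiments mentioned earlier.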

Experiment: one of the classifiers • active(A) if atom(A,_,n,32,B), B ≤ -0.401, has_property(A,cytogen_sce,n), methyl(A,_). • Sister Chromatid Exchange (SCE): SCE is used for the determination of mutagenicity

Experiment • 2 submissions: • From LA (Latin America) • From EU (Europe)

Submitting jobs from LA....

Experiment • EELA resources utilised • ~300 resources in LA • 211 jobs in LA

Experiments • Why 1,969 out of 2,000??? • 2 reasons: • Proxy expiration: • On submission (takes loooooong!!!) • On execution • Use of dynamic libraries
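
A quick calculation puts the 1,969 figure in context. It assumes that 1,969 is the number of jobs that completed successfully out of the 2,000 submitted, which the slide implies but does not state outright.

```python
submitted = 2000
completed = 1969  # assumed to be the number of successful jobs

failed = submitted - completed
failure_rate = failed / submitted

print(f"{failed} failed jobs, {failure_rate:.2%} failure rate")
```

Under that assumption roughly 31 jobs were lost to proxy expiration or missing dynamic libraries, a failure rate of the same order as the 1% quoted in the conclusions.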

Submitting jobs from EU... • from a non-EELA site, BUT • Using the EELA VO: • Jobs run only on EU resources... • Reasons: • Misconfiguration? • Closer brokers with more machines?

Conclusions • Happiness: EELA is working!!! • We can run thousands of experiments! • Frida is happy!!! (see the Condor introductory tutorials if you are curious about Frida) • The experiment showed good utilization of EELA resources in LA and EU • Low failure rate (1%) • Failures caused by: • Dynamic libs not available on the remote machine • Proxy expiration

Future work • More detailed analysis of jobs and logs • Full ILP experiment • More domains • Other kinds of experiments based on Statistical Relational Learning • And, do not forget: ILP can help to model and diagnose errors in the grid environment!

Collaborators • Fernando Silva (DCC-UPorto) • Vítor Santos Costa (DCC-UPorto) • Rui Camacho (FE-UPorto) • Nuno Fonseca (IBMC/IBMEC, Porto) • Beth Burnside (UW-Madison hospital) • David Page (UW-Madison) • Jesse Davis (UWashington)

Thanks!!! Questions??
