e-Science Central: Doing Science on the Web, powered by the Cloud • Paul Watson, Director, Digital Institute, School of Computing Science, Newcastle University, UK (Paul.Watson@ncl.ac.uk) • The team: Hugo Hiden, Simon Woodman, David Leahy, Jacek Cala, Dominic Searson, Vladimir Sykora, Martyn Taylor, Joanna Berry • With thanks to: Microsoft External Research, EPSRC, OneNE, RCUK, Christophe Poulain, Savas Parastatidis
Why Clouds? • Cloud computing can revolutionise e-science • give access to resources when needed • reduce time from idea to realisation • provide sustainable infrastructure
Good Workload Patterns for Clouds (with acknowledgements to Dianne O’Brien) • Bursting: new data processed, new algorithm runs, event triggers computation (e.g. earthquake) • Fast Growth: datasets / applications with rapidly growing popularity [Two plots: Compute against Time, one per pattern]
Science in the Cloud: Option #1 • Problem: building the complex, scalable, dependable systems researchers need is still hard: • high-level IT skills • on-going management costs • bespoke [Diagram: Users on top of Science App 1 .... Science App n, on top of Cloud Infrastructure: Storage & Compute]
The Long Tail of Scientists • individuals, research groups, SMEs • lack skills & access to resources • largely untouched by e-Science
Cloud Challenge for e-Science • How can we increase the number of researchers who benefit? • x100,000 • across a wide range of research areas • in academia and industry
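Science Cloud: Option #2 [Diagram, two stacks side by side. Option #1: Users on top of Science App 1 .... Science App n, on top of Cloud Infrastructure: Storage & Compute. Option #2: Users on top of Science App 1 .... Science App n, on top of a Science Cloud Platform, on top of Cloud Infrastructure: Storage & Compute]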
e-Science Central • Science as a Service for users • Science Cloud Platform for developers [Diagram: Users on top of Science App 1 .... Science App n, on top of the Science Cloud Platform, on top of Cloud Infrastructure: Storage & Compute]
North East Regional e-Science Centre, 2001- • Aim: a regional centre of excellence in e-science [Map of UK e-Science centres: Edinburgh, Newcastle (North East Centre), Belfast, Manchester, Cambridge, Oxford, Cardiff, Imperial, Southampton]
Research Areas: 25+ funded projects (€50M+) • Bioinformatics • Ageing & Health • Neuroscience • Chemical Engineering • Chemistry • Transport • Geomatics • Video Archives • Artistic Performance Analysis • Computer Performance Analysis • Computer Science
Identify Common IT Needs of Research Data (instruments, experimental data, sensors...)
e-Science Central [Architecture diagram, top to bottom: Apps (App .... App) • API • Science Cloud Platform: Analysis Services, Security, Social Networking, Provenance, Workflow Enactment, Metadata • Cloud Infrastructure: Processing, Storage]
Case Study: Project Junior • Predicting Chemical Activity • A collaboration with Prof David Leahy’s Chemistry research group • Funded by Microsoft External Research
Chemists want to know: • Q1. What are the properties of this molecule? (Toxicity, Biological Activity, Solubility) • Q2. What molecule would have an aqueous solubility of 0.1 μg/mL?
Answering the question by performing experiments: time consuming, expensive, ethical issues
An alternative to experimentation: QSAR (Quantitative Structure Activity Relationship) • predict properties based on similar molecules • Activity ≈ f(quantifiable structural attributes), e.g. #atoms, logP, shape, ...
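A minimal sketch of the QSAR idea, assuming a single descriptor (atom count) and invented activity values. The slides do not say what form f takes, so a one-variable least-squares line is used here purely as an illustration; the function names `fit_line` and `predict` are hypothetical.

```python
# Toy QSAR: fit activity ≈ f(descriptor) with a one-variable least-squares line.
# The descriptor values and activities below are invented for illustration.

def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new descriptor value."""
    slope, intercept = model
    return slope * x + intercept


atom_counts = [10, 20, 30, 40]      # descriptor for four known molecules
activities = [1.0, 2.0, 3.0, 4.0]   # their measured activities (made up)

qsar_model = fit_line(atom_counts, activities)
predicted = predict(qsar_model, 25)  # activity estimate for an unseen molecule
```

Real QSAR models combine many descriptors and non-linear learners; the point here is only that a property of a new molecule is estimated from molecules with similar structural attributes.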
Generating the models: Discovery Bus (Leahy et al), www.openqsar.com • Data + Model-Builders give Models • New Data or new Model-Builders trigger Model Generation, giving New / Improved Models
[Workflow diagram] • Input: Chemical Structures & their Activities • Separate Training & Test Data (gives Training Data and Test Data) • Calculate Descriptors from Structures (gives Descriptors + Responses) • Combine Descriptors (gives Combined Descriptors + Responses) • Filter Descriptors (gives Selected Descriptors + Responses) • Build & Test Models Independently: Multiple Linear Regression, Neural Network, Partial Least Squares, Classification Trees, ..... • Select Best Models • Add to Model Database
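The last stages of this workflow (build and test models independently, then select the best) can be sketched as follows. This is an illustration under stated assumptions: the data are synthetic, the two builders (a mean baseline and a one-descriptor linear fit) stand in for the learners named on the slide, and mean squared error on held-out test data is an assumed selection criterion.

```python
# Sketch of "build and test models independently, then select the best".
# Data, builder names and the error metric are illustrative assumptions.

def build_mean_model(train):
    """Baseline builder: always predict the mean training response."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def build_linear_model(train):
    """One-descriptor least-squares builder."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)


def mean_squared_error(model, test):
    """Average squared prediction error over the held-out test rows."""
    return sum((model(x) - y) ** 2 for x, y in test) / len(test)


def select_best(builders, train, test):
    """Build every model on the training data and rank by test error."""
    scored = []
    for name, builder in builders.items():
        model = builder(train)
        scored.append((mean_squared_error(model, test), name))
    scored.sort()
    return scored[0][1], scored


# (descriptor, response) pairs; every third row is held out as test data.
data = [(x, 2.0 * x + 1.0) for x in range(1, 10)]
test_data = data[::3]
train_data = [row for i, row in enumerate(data) if i % 3 != 0]

builders = {"mean": build_mean_model, "linear": build_linear_model}
best_name, ranking = select_best(builders, train_data, test_data)
```

Because each builder only reads the shared training and test data, the loop in `select_best` could be run in parallel, which is what makes this stage a natural fit for many cloud workers.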
Increasing amounts of data for model building... • CHEMBL: data on 622,824 compounds, collected from 33,956 publications • WOMBAT: data on 251,560 structures, for over 1,966 targets • WOMBAT-PK: data on 1,230 compounds, for over 13,000 clinical measurements • All contain structure information & numerical activity data • More models, better models • Computationally expensive: 5 years for new datasets on existing server
Discovery Bus: Good Workload Patterns for Clouds (with acknowledgements to Dianne O’Brien) • Bursting: new data processed, new algorithm runs, event triggers computation (e.g. earthquake) • Fast Growth: datasets / applications with rapidly growing popularity • Discovery Bus triggers: New Model Builder, New Data [Two plots: Compute against Time, one per pattern]
Project JUNIOR • Aim: to use Azure & e-Science Central to generate models in weeks, not years • make models available on the web • so that researchers can generate predictions for their own molecules
[Architecture diagram, top to bottom: Discovery Bus Planner, Amazon AWS • e-Science Central App API • Analysis Services, Security, Social Networking, Provenance, Workflow Enactment, Metadata • Windows Azure: Processing, Storage]
[Workflow enactment diagram] • 1. Discovery Bus invokes e-Science Central Workflow via API • 2. Workflow decomposed to Message Plan • 3. Temporary workflow storage assigned, Message Plan queued for execution • 4. Messages sent in sequence: each Call Message goes either to an Internal Service (RMI / JMS, with NFS access to workflow temporary storage) or to an Azure Service (HTTP, HTTP Post), and each returns a Response Message • 5. Workflow Execution Completes: Discovery Bus notified with results, results data stored in e-Science Central folder
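The message-plan idea above can be sketched in a few lines. This is not the e-Science Central API: `CallMessage`, `decompose`, `execute` and the two service names are hypothetical stand-ins, and the assumption that each response feeds the next call is made only to keep the example small.

```python
# Sketch of a message plan: a workflow is flattened into an ordered list of
# call messages, each dispatched to a service and answered by a response.
# Service names and handlers are hypothetical stand-ins, not the real API.
from collections import deque
from dataclasses import dataclass


@dataclass
class CallMessage:
    service: str      # which service should handle this step
    payload: object   # input data for the step (None until known)


def decompose(workflow_steps, initial_input):
    """Turn an ordered list of service names into a queued message plan."""
    plan = deque(CallMessage(service, None) for service in workflow_steps)
    if plan:
        plan[0].payload = initial_input
    return plan


def execute(plan, services):
    """Send messages in sequence; each response feeds the next call."""
    response = None
    while plan:
        message = plan.popleft()
        if message.payload is None:
            message.payload = response
        response = services[message.service](message.payload)
    return response


# Two toy services; in the slides these would be internal or Azure services.
services = {
    "calculate_descriptors": lambda structure: {"n_chars": len(structure)},
    "build_model": lambda descriptors: f"model({descriptors['n_chars']})",
}

plan = decompose(["calculate_descriptors", "build_model"], "CCO")
result = execute(plan, services)
```

Keeping the plan as plain data means the same plan could be dispatched over different transports, which matches the split between internal and Azure services shown on the slide.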
[Deployment diagram: e-Science Central alongside Azure, with a Web Node, three Worker Nodes, Blob Storage, a Queue, and Results]
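A minimal sketch of the queue and worker-node pattern this diagram suggests, run with threads on one machine rather than cloud nodes. The task (squaring a number) is a placeholder for building a model, and the function names and sentinel-based shutdown are assumptions for illustration, not details taken from the slides.

```python
# Sketch of the queue / worker-node pattern using threads on one machine.
# In the deployment described, workers are cloud nodes; here they are threads.
import queue
import threading


def worker(tasks, results):
    """Pull tasks until a None sentinel arrives, pushing each result."""
    while True:
        task = tasks.get()
        if task is None:
            break
        results.put((task, task * task))  # stand-in for building a model


def run(task_list, n_workers=3):
    """Fan tasks out to n_workers threads and collect all results."""
    tasks, results = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(n_workers)]
    for t in threads:
        t.start()
    for task in task_list:
        tasks.put(task)
    for _ in threads:
        tasks.put(None)  # one sentinel per worker so every thread exits
    for t in threads:
        t.join()
    return dict(results.get() for _ in range(len(task_list)))


squares = run([1, 2, 3, 4, 5])
```

Workers never talk to each other, only to the two queues, so adding more workers needs no change to the task producer.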
Result • Successfully used Windows Azure to generate models quickly • 100 workers gave result in 3 weeks (not 5 years!) • 750K new models available (50x more than previously available)
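As a rough check on these figures, the implied speedup can be worked out directly. Both durations are round numbers from the slide, and the existing server and an Azure worker are not necessarily comparable machines, so the result is indicative only. Reading "50x more" as "50 times the earlier count" is also an assumption.

```python
# Back-of-envelope speedup from the figures on the slide.
WEEKS_PER_YEAR = 52

serial_weeks = 5 * WEEKS_PER_YEAR      # "5 years" on the existing server
cloud_weeks = 3                        # "3 weeks" on Azure
workers = 100

speedup = serial_weeks / cloud_weeks   # about 87x
efficiency = speedup / workers         # about 0.87 per worker

# If 750K models is 50x the earlier total, the earlier total was about 15K.
prior_models = 750_000 / 50
```

An efficiency close to 1 is what one would expect when models are built independently of each other.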
Current e-Science Central Status • 40+ regular users (and growing) • 200K workflows enacted • exploring business models to provide sustainable science as a service: www.inkspotscience.com • In Venus-C: • enhancing and moving workflow engine into Azure • exploring competitive workflow as a generic cloud pattern
Summary • Clouds can revolutionise e-science • provide sustainable infrastructure • reduce time from idea to realisation • but they do NOT make it easier to build the complex, scalable, dependable systems that science needs • e-Science Central offers a Science Cloud Platform • reduces complexity of developing cloud applications • hides cloud entirely from end-users • demo available