
Presentation Transcript


  1. BioGrid: Challenges, Problems and Opportunities. Dheeraj Bhardwaj, Department of Computer Science & Engineering, Indian Institute of Technology, Delhi – 110 016, India. http://www.cse.iitd.ac.in/~dheerajb

  2. BIOLOGICAL PHENOMENON → (measurement process) → DATA → (data analysis, learning) → MODEL → (inference, conclusions)

  3. Bioinformatics vs. Biocomputing (IT vs. BT)

  4. Biological Data: a “Maze” on a Jigsaw Puzzle (from Genome to Phenome)

  5. Equipment for the New Quest • High Performance Computers • Data, Knowledge and Tools • Collaboration of Human Experts (illustrations taken from www.dnr.state.wi.us/org/aw/air/ed/educatio.htm and www.mtnbrook.k12.al.us/academy/2ndgrade/mtn/map.htm)

  6. Needs of High Performance Computing • Increase of genome sequence information • Combinatorial increase of search space: Genome * Transcriptome * Proteome * ... * Phenome • Computer simulation and unknown parameter estimation • Knowledge integration in the “Omic Space”

  7. Needs of High Performance Computing • Impact of Genome Sequence Projects • Human Genome (3,000 Mbp, 2000) • Rapid Increase of Genome Sequence Databases • Strong Computation Demand for Homology Search • Start of Structural Genomics Projects • Determine 10,000 folds in 5 years • Strong Computation Demand for Molecular Simulation

  8. 1st Issue: Homology Search • Rapid increase of data size: doubling every year, updated daily (17 million entries, 50 GBytes as of Oct. 2002) • Rough estimate of homology search time for mouse cDNA (5,000 sequences) against the human genome (3,000 Mbp): 1 CPU ≈ 1 year; 8 CPUs ≈ 1 month; 32 CPUs ≈ 1 week; 256 CPUs ≈ 1 day; 6,400 CPUs ≈ 1 hour
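
A back-of-the-envelope way to reproduce the table above, as a minimal Python sketch: the one-year single-CPU baseline comes from the slide, and near-linear scaling is assumed because the search is embarrassingly parallel. The slide's figures are rounded, so ideal scaling only roughly matches them.

# Rough scaling estimate for the homology-search times quoted above.
# Assumes the slide's ~1 year single-CPU baseline and ideal linear
# speedup; real throughput depends on the tool, hardware and I/O.
HOURS_PER_YEAR = 365 * 24

for cpus in (1, 8, 32, 256, 6400):
    hours = HOURS_PER_YEAR / cpus        # ideal linear scaling
    print(f"{cpus:>5} CPUs: ~{hours:9.1f} hours (~{hours / 24:7.1f} days)")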

  9. 2nd Issue: Molecular Simulation • Nanosecond-scale molecular dynamics simulation of protein molecules with molecular weight 100,000 – 1,000,000 • Stability analysis • Affinity analysis • Folding simulation • Example: Ras p21 (figure labels: Mg, GTP, Lys16, Gly12): number of residues 189, molecular weight 21 kD, oncogene variant Gly12 → Val; a 5 ns simulation takes ~1,000 hours on a 32 Gflops computer

  10. Needs of Resource Sharing • Biological databases (UniGene, TrEMBL, ...) • Bioinformatics tools (BLAST, HMMER, ...) • Programming libraries (BioPerl, BioJava, ...)

  11. Needs of Human Collaboration

  12. Grid for Bioinformatics • Effective for “Embarrassingly Parallel” computation (see the sketch below): homology search, motif search, unknown parameter estimation for cellular models, etc. • “Distributed Resource Sharing” among organizations: Web services, workflow and computational pipelines, autonomous database update, etc. • A “field” for human collaboration: group work on genome annotation, whole-cell simulation, collaboration between biologists and computer scientists, etc.
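
A minimal sketch of what "embarrassingly parallel" means here: each query is scored independently, so tasks can be distributed with no communication between them. The toy score() function below is a stand-in, not a real homology or motif search, and on a Grid the local worker pool would be replaced by remote job submission.

# Each query sequence is handled independently, so work can be farmed
# out to local processes (or, on a Grid, to remote nodes) without any
# inter-task communication.
from multiprocessing import Pool

REFERENCE = "ATGGCGTACGTTAGC"
REF_KMERS = {REFERENCE[i:i + 3] for i in range(len(REFERENCE) - 2)}

def score(query):
    # toy "similarity": number of 3-mers shared with the reference
    return query, sum(query[i:i + 3] in REF_KMERS
                      for i in range(len(query) - 2))

if __name__ == "__main__":
    queries = ["ATGGCGTAC", "TTTTTTTTT", "GTACGTTAG"]
    with Pool(processes=4) as pool:
        for name, hits in pool.map(score, queries):
            print(name, hits)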

  13. Summary of Bioinformatics Trends • The rapid increase in genomic database size causes severe overhead for database services • The demand for molecular dynamics simulation requires high performance computers (including special-purpose computers) • Hence the need for a new bioinformatics platform for sharing databases and high performance computers

  14. Strategic Technology Domains • Modeling and Simulation: from molecule to cell • Information Integration: from genome to phenome • Grid • High Performance Computing (PC clusters, SMP, vector)

  15. Evolution of the Scientific Process • Pre-electronic • Theorize &/or experiment, alone or in small teams; publish paper • Post-electronic • Construct and mine very large databases of observational or simulation data • Develop computer simulations & analyses • Exchange information quasi-instantaneously within large, distributed, multidisciplinary teams

  16. COMPUTATIONAL GRID (chart: compute cost/performance vs. algorithmic complexity / data volume and compute requirements, ~1970–2005, spanning mainframes, vector processors, supercomputers, MPP/SMP, scalable parallel systems, and distributed & Grid systems) • IBM 360/370, CDC 1604/6600, UNIVAC 1100: ~3 MFLOPS per $ million • IBM, CDC, UNIVAC: ~5–20 MFLOPS per $ million • CRAY 1, CDC 203, DEC VAX/FPS: ~60 MFLOPS per $ million • CRAY XMP, CONVEX C1, ALLIANT: ~200–400 MFLOPS per $ million • CRAY YMP, CONVEX C2: ~2–3 GFLOPS per $ million • SGI Power Challenge, IBM SP2, CM5: ~5–8 GFLOPS per $ million • CRAY T3E, SGI Origin, IBM SP: ~20–100 GFLOPS per $ million • Linux clusters, CRAY T3E, SGI Origin, IBM SP, SUN ES 10000: ~1,000 GFLOPS per $ million • Systems are getting larger by 2–4x per year! • Increasing parallelism: add more and more processors • A new kind of parallelism: the GRID • Harness the power of computing resources, which keep growing

  17. HPC Applications Issues • Architectures and Programming Models • Distributed memory systems (MPP, clusters): message passing • Shared memory systems (SMP): shared memory programming • Specialized architectures: vector processing, data parallel programming • The Computational Grid: Grid programming • Applications I/O • Parallel I/O: need for high performance I/O systems and techniques, scientific data libraries, and standard data representations • Checkpointing and Recovery • Monitoring and Steering • Visualization (remote visualization) • Programming Frameworks
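
To make the distributed-memory, message-passing model listed above concrete, here is a minimal sketch assuming the mpi4py bindings are installed (none of the slides prescribe a specific library); run it with something like mpiexec -n 4 python script.py.

# Message-passing sketch: each rank computes a partial result and
# rank 0 gathers and combines them.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

partial = rank * rank                  # stand-in for local work
results = comm.gather(partial, root=0)
if rank == 0:
    print("combined result from", size, "ranks:", sum(results))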

  18. Future of Scientific Computing • Requires large scale simulations, beyond the reach of any single machine • Requires large, geo-distributed, cross-disciplinary collaborations • Systems are getting larger by 2–4x per year! • Increasing parallelism: add more and more processors • A new kind of parallelism: the GRID • Harness the power of computing resources, which keep growing

  19. What Do We Want to Achieve? • Develop High Performance Computing (HPC) applications which are • Portable (laptop → supercomputer → Grid) • Future proof • Grid ready • Develop HPC infrastructure (parallel & Grid systems) which is • User friendly • Based on open source • Efficient in problem solving • Able to achieve high performance • Able to handle large data volumes

  20. Parallel Computer and Grid A parallel computer is a “Collection of processing elements that communicate and co-operate to solve large problems fast”. A Computational Grid is an emerging infrastructure that enables the integrated use of remote high-end computers, databases, scientific instruments, networks and other resources.

  21. A Comparison • SERIAL: fetch/store, compute • PARALLEL: fetch/store, compute/communicate, cooperative game • GRID: fetch/store, discovery of resources, interaction with remote applications, authentication/authorization, security, compute/communicate, etc.

  22. Serial and Parallel Algorithms: Evaluation • Serial algorithm: execution time as a function of input size • Parallel algorithm: execution time as a function of input size, parallel architecture and number of processors used • Parallel system: the combination of an algorithm and the parallel architecture on which it is implemented (speedup and efficiency, sketched below, are the usual figures of merit)
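
Speedup and efficiency are the customary way to express "execution time as a function of input size, architecture and number of processors"; a minimal sketch with illustrative (not measured) timings:

# speedup S(p) = T_serial / T_parallel(p); efficiency E(p) = S(p) / p
def speedup(t_serial, t_parallel):
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, processors):
    return speedup(t_serial, t_parallel) / processors

T1 = 120.0                                        # serial time, one input size
for p, tp in [(2, 65.0), (4, 35.0), (8, 20.0)]:   # illustrative timings (s)
    print(f"{p} procs: speedup {speedup(T1, tp):.2f}, "
          f"efficiency {efficiency(T1, tp, p):.2f}")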

  23. What is the Grid • “Grid Computing [is] distinguished from conventional distributed computing by its focus on large-scale resource sharing, innovative applications, and, in some cases, high performance orientation…we review the “Grid problem”, which we define as flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources- what we refer to as virtual organizations.” From “The Anatomy of the Grid: Enabling Scalable Virtual Organizations” by Foster, Kesselman and Tuecke

  24. Distributed Computing vs. Grid • Grid is an evolution of distributed computing • Dynamic • Geographically independent • Built around standards • Internet backbone • Distributed computing is an “older term” • Typically built around proprietary software and networks • Tightly coupled systems/organizations

  25. Web vs. Grid • Web: uniform naming and access to documents (http://) • Grid: uniform, high performance access to computational resources (figure: software catalogs, sensor nets, colleges/R&D labs linked through the Grid)

  26. Is the World Wide Web a Grid ? • Seamless naming? Yes • Uniform security and Authentication? No • Information Service? Yes or No • Co-Scheduling? No • Accounting & Authorization ? No • User Services? No • Event Services? No • Is the Browser a Global Shell ? No

  27. What does the World Wide Web bring to the Grid? • Uniform naming • A seamless, scalable information service • A powerful new metadata language: XML • XML will be the standard language for describing information in the Grid • SOAP (Simple Object Access Protocol) • Uses XML for encoding and HTTP as the transport protocol • SOAP may become a standard RPC mechanism for Grid services • Portal ideas
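
As an illustration of the point that SOAP uses XML for encoding and HTTP as the transport, here is a minimal sketch; the endpoint URL and the submitJob operation are made up for illustration and do not correspond to any real Grid service.

# A SOAP request is just an XML envelope POSTed over HTTP.
import urllib.request

ENVELOPE = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <submitJob xmlns="http://example.org/gridservice">
      <executable>blastall</executable>
      <cpus>32</cpus>
    </submitJob>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://example.org/gridservice",          # hypothetical endpoint
    data=ENVELOPE.encode("utf-8"),
    headers={"Content-Type": "text/xml; charset=utf-8",
             "SOAPAction": "submitJob"},
)
# urllib.request.urlopen(request) would send it; left commented out
# because the endpoint above does not exist.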

  28. The Ultimate Goal • In the future I will not know or care where my application is executed, as I will acquire and pay to use these resources as I need them

  29. Why Grids? • Large-scale science and engineering are done through the interaction of people, heterogeneous computing resources, information systems, and instruments, all of which are geographically and organizationally dispersed. • The overall motivation for “Grids” is to facilitate the routine interactions of these resources in order to support large-scale science and engineering.

  30. Why Now? • Moore's Law improvements in computing produce highly functional end systems • The Internet and burgeoning wired and wireless networks provide universal connectivity • Changing modes of working and problem solving emphasize teamwork and computation • Network exponentials produce dramatic changes in geometry and geography

  31. Network Exponentials • Network vs. computer performance • Computer speed doubles every 18 months • Network speed doubles every 9 months • Difference: an order of magnitude every 5 years • 1986 to 2000 • Computers: x 500 • Networks: x 340,000 • 2001 to 2010 • Computers: x 60 • Networks: x 4,000 • Moore's Law vs. storage improvements vs. optical improvements; graph from Scientific American (Jan. 2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers
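
A quick arithmetic check of the "order of magnitude per 5 years" claim, as a minimal sketch using only the doubling periods quoted above:

# If computer performance doubles every 18 months and network
# performance every 9 months, the gap after 60 months is about 10x.
months = 60
computer_growth = 2 ** (months / 18)     # ~10x
network_growth = 2 ** (months / 9)       # ~100x
print("computers:", round(computer_growth, 1), "x")
print("networks: ", round(network_growth, 1), "x")
print("gap:      ", round(network_growth / computer_growth, 1), "x")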

  32. Why Grid? • We are seeing a fundamental change in scientific applications • They have become multidisciplinary • They require an incredible mix of varied technologies and expertise • Motivation: “When the network is as fast as the computer's internal links, the machine disintegrates across the net into a set of special purpose appliances.” (Gilder Technology Report, June 2002) • “Many problems require tightly coupled computers, with low latencies and high communication bandwidths; Grid computing may well increase … demand for such systems by making access easier” (Foster, Kesselman, Tuecke, The Anatomy of the Grid)

  33. Convergence between e-Science and e-Business • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour • A biologist combines a range of diverse and distributed resources (databases, tools, instruments) to answer complex questions • 1,000 physicists worldwide pool resources for petaop analyses of petabytes of data • Civil engineers collaborate to design, execute, & analyze shake table experiments • An enterprise configures internal & external resources to support eBusiness workloads (from Steve Tuecke, 12 Oct '01)

  34. Convergence between e-Science and e-Business • Climate scientists visualize, annotate, & analyze terabytes of simulation data • An emergency response team couples real-time data, weather models, and population data • A multidisciplinary analysis in aerospace couples code and data in four companies • A home user invokes architectural design functions at an application service provider • An insurance company mines data from partner hospitals for fraud detection

  35. Important Grid Applications • Data-intensive • Distributed computing (metacomputing) • Collaborative • Remote access to, and computer enhancement of, experimental facilities

  36. An Example Virtual Organization: CERN's Large Hadron Collider • 1,800 physicists, 150 institutes, 32 countries • 100 PB of data by 2010; 50,000 CPUs? • www.griphyn.org www.ppdg.org www.eu-datagrid.org

  37. Grid Communities & Applications: Data Grids for High Energy Physics (tier diagram) • There is a “bunch crossing” every 25 nsecs; there are ~100 “triggers” per second; each triggered event is ~1 MByte in size • Detector → (~PBytes/sec) → Online System → (~100 MBytes/sec) → Tier 0: CERN Computer Centre with an offline processor farm of ~20 TIPS → (~622 Mbits/sec, or air freight, deprecated) → Tier 1: regional centres (FermiLab ~4 TIPS; France, Germany, Italy), each with HPSS storage → (~622 Mbits/sec) → Tier 2 centres (~1 TIPS each, e.g. Caltech) → Institutes (~0.25 TIPS, physics data cache ~1 MBytes/sec) → Tier 4: physicist workstations (Pentium II 300 MHz class) • 1 TIPS is approximately 25,000 SpecInt95 equivalents • Physicists work on analysis “channels”; each institute will have ~10 physicists working on one or more channels, and data for these channels should be cached by the institute server • www.griphyn.org www.ppdg.net www.eu-datagrid.org
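
The arithmetic behind the data-rate labels in the tier diagram, as a rough sketch: the ~100 MB/s stream follows directly from ~100 triggers/s at ~1 MByte each; the duty-cycle factor below is an assumption added for illustration, not a number from the slide.

# ~100 events/s at ~1 MByte each gives the ~100 MB/s offline stream;
# integrated over a running year this is of order a petabyte of raw
# data per experiment (duty cycle assumed, purely illustrative).
events_per_second = 100
event_size_mb = 1.0
seconds_per_year = 365 * 24 * 3600
duty_cycle = 1 / 3                       # assumed fraction of live time

rate_mb_s = events_per_second * event_size_mb
yearly_pb = rate_mb_s * seconds_per_year * duty_cycle / 1e9
print(f"stream: ~{rate_mb_s:.0f} MB/s, ~{yearly_pb:.1f} PB of raw data per year")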

  38. A Brain is a Lot of Data! (Mark Ellisman, UCSD) • And comparisons must be made among many • We need to get to one micron to know the location of every cell; we're just now starting to get to 10 microns • Grids will help get us there and further

  39. Biomedical InformaticsResearch Network (BIRN) • Evolving reference set of brains provides essential data for developing therapies for neurological disorders (multiple sclerosis, Alzheimer’s, etc.). • Today • One lab, small patient base • 4 TB collection • Tomorrow • 10s of collaborating labs • Larger population sample • 400 TB data collection: more brains, higher resolution • Multiple scale data integration and analysis

  40. The Grid: A Brief History • Early '90s • Gigabit testbeds, metacomputing • Mid to late '90s • Early experiments (e.g., I-WAY), academic software projects (e.g., Globus, Legion), application experiments • 2002 • Dozens of application communities & projects • Major infrastructure deployments • Significant technology base (esp. the Globus Toolkit™) • Growing industrial interest • Global Grid Forum: ~500 people, 20+ countries

  41. Today's Grid • A single system interface • Transparent wide-area access to large data banks • Transparent wide-area access to applications on heterogeneous platforms • Transparent wide-area access to processing resources • Security, certification, single sign-on authentication: Grid Security Infrastructure • Data access, transfer & replication: GridFTP, Giggle • Computational resource discovery, allocation and process creation: GRAM, Unicore, Condor-G

  42. Grid Evolution • First Generation Grid: computationally intensive, file access/transfer; a bag of various heterogeneous protocols & toolkits; recognizes the Internet, ignores the Web; academic teams • Second Generation Grid: data intensive and knowledge intensive; service-based architecture; recognizes the Web and Web services; Global Grid Forum; industry participation

  43. Challenging Technical Requirements • Dynamic formation and management of virtual organizations • Online negotiation of access to services: who, what, why, when, how • Establishment of applications and systems able to deliver multiple qualities of service • Autonomic management of infrastructure elements Open Grid Services Architecture http://www.globus.org/ogsa

  44. Elements of the Problem • Resource sharing • Computers, storage, sensors, networks, … • Heterogeneity of device, mechanism, policy • Sharing conditional: negotiation, payment, … • Coordinated problem solving • Integration of distributed resources • Compound quality of service requirements • Dynamic, multi-institutional virtual orgs • Dynamic overlays on classic org structures • Map to underlying control mechanisms

  45. The Grid • Administrative Issues • Security • Multiple Organizations • Coordinated Problem Solving • Diverse Resources • Dynamic • Unreliable • Shared

  46. Grid Technologies: Resource Sharing Mechanisms That … • Address security and policy concerns of resource owners and users • Are flexible enough to deal with many resource types and sharing modalities • Scale to large numbers of resources, many participants, many program components • Operate efficiently when dealing with large amounts of data & computation

  47. Aspects of the Problem • Need for interoperability when different groups want to share resources • Diverse components, policies, mechanisms • E.g., standard notions of identity, means of communication, resource descriptions • Need for shared infrastructure services to avoid repeated development and installation • E.g., one port/service/protocol for remote access to computing, not one per tool/application • E.g., Certificate Authorities: expensive to run • A common need for protocols & services

  48. Hence, a Protocol-Oriented View of Grid Architecture, that Emphasizes … • Development of Grid protocols & services • Protocol-mediated access to remote resources • New services: e.g., resource brokering • “On the Grid” = speak Intergrid protocols • Mostly (extensions to) existing protocols • Development of Grid APIs & SDKs • Interfaces to Grid protocols & services • Facilitate application development by supplying higher-level abstractions

  49. The Hourglass Model • Focus on architecture issues • Propose a set of core services as basic infrastructure • Use these to construct high-level, domain-specific solutions • Design principles • Keep participation cost low • Enable local control • Support for adaptation • The “IP hourglass” model (layers, top to bottom: Applications; diverse global services; core services; local OS)

  50. Layered Grid Architecture (by analogy to the Internet protocol architecture) • Application • Collective: “coordinating multiple resources” (ubiquitous infrastructure services, application-specific distributed services) • Resource: “sharing single resources” (negotiating access, controlling use) • Connectivity: “talking to things” (communication via Internet protocols, and security) • Fabric: “controlling things locally” (access to, and control of, resources) • Internet analogy: Application / Transport, Internet / Link
