
Cyberinfrastructure and Scientific Collaboration


Presentation Transcript


  1. Cyberinfrastructure and Scientific Collaboration. Science and Technology in the Pacific Century (STIP) East Asian Colloquium, sponsored by the East Asian Studies Center, Indiana University. Ballantine Hall 004, November 30, 2007. Geoffrey Fox, Computer Science, Informatics, Physics, Pervasive Technology Laboratories, Indiana University, Bloomington IN 47401. gcf@indiana.edu http://www.infomall.org

  2. Abstract • eScience (or better, eResearch) denotes a research model of virtual organizations of people, data, instruments, and computers linked across the globe. • Although the United States pioneered the technologies that support this model, technology efforts are now much stronger in Europe and Asia. • NSF's new CDI (Cyber-Enabled Discovery and Innovation) and other initiatives are establishing the virtual organization implied by eResearch. • Professor Fox will discuss particular examples of well-known international projects and how this collaboration model impacts the role of brick-and-mortar organizations.

  3. e-moreorlessanything • 'e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it' – from the inventor of the term, John Taylor, Director General of Research Councils UK, Office of Science and Technology. • e-Science is about developing tools and technologies that allow scientists to do 'faster, better or different' research. • Similarly, e-Business captures an emerging view of corporations as dynamic virtual organizations linking employees, customers and stakeholders across the world. • This generalizes to e-moreorlessanything, including presumably e-PacificResearch, e-olympics, e-education … • A deluge of data of unprecedented and inevitable size must be managed and understood. • People (see Web 2.0), computers, and data (including sensors and instruments) must be linked. • On-demand assignment of experts, computers, networks and storage resources must be supported.

  4. Community Grids Laboratory Research • Develop and apply technology to distributed enterprises – mainly science • Funded by NSF, NASA, NIH, DoE and DoD • Cheminformatics – High Throughput Screening data and filtering; PubChem and PubMed, including document analysis • Interactive global Particle Physics (and Plasma Physics) data analysis • Earthquake Science – predicting earthquakes using simulations and a satellite and GPS (Global Positioning System) Sensor Grid • Ice Sheet Dynamics – melting of glaciers • Navajo Nation Grid – education (Science Gateways) and healthcare; supporting digital repositories of American Indian culture • Architecture of Air Force sensor and decision support systems • eSports – real-time collaboration for trainers and athletes with HPER, the IU School of Health, Physical Education, and Recreation.

  5. Some Collaborations • The naïve strategy is to work with the best group in a given field • This group is rarely at Indiana University; does this have implications for broad schools like Informatics? • Earthquake Science with California (UC Davis, USC, JPL) and Australia, Japan, China in a group, ACES, originally set up under APEC • Ice Sheet Dynamics with Kansas University and Elizabeth City State (North Carolina) • Cheminformatics with Cambridge University UK and IU Informatics • Particle Physics with Caltech and a small business, Deep Web Technologies, in Santa Fe (SBIR) • DoD architectures with Ball Aerospace and a small business, Anabas, in California (SBIR) • Visualization for Plasma Physics with General Atomics and Anabas (STTR) • Minority Serving Institutions with University of Houston-Downtown, AIHEC and HACU • Technologies with IU Computer Science, the Open Grid Forum and many around the world • International contacts best with P. R. China and the United Kingdom

  6. Applications, Infrastructure, Technologies • This field is confused by inconsistent use of terminology; I define: • Web Services, Grids and (aspects of) Web 2.0 (Enterprise 2.0) are technologies • Grids could be everything (Broad Grids implementing some sort of managed web) or reserved for specific architectures like OGSA or Web Services (Narrow Grids) • These technologies combine and compete to build electronic infrastructures termed e-infrastructure or Cyberinfrastructure • e-moreorlessanything is an emerging application area of broad importance that is hosted on these infrastructures (e-infrastructure or Cyberinfrastructure) • e-Science, or perhaps better e-Research, is a special case of e-moreorlessanything

  7. What is Cyberinfrastructure? • Cyberinfrastructure is (from NSF) infrastructure that supports distributed science (e-Science) – data, people, computers. • Clearly the core concept is more general than science. • It exploits Internet technology (Web 2.0), adding (via Grid technology) management, security, supercomputers etc. • It has two aspects: parallel – low latency (microseconds) between nodes – and distributed – higher latency (milliseconds) between nodes. • The parallel aspect is needed to get high performance on individual large simulations, data analyses etc.; one must decompose the problem. • The distributed aspect integrates already distinct components – especially natural for data. • Cyberinfrastructure is in general a distributed collection of parallel systems. • Cyberinfrastructure is made of services (originally Web services) that are "just" programs or data sources packaged for distributed access (a minimal sketch follows).
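To make that last point concrete, here is a minimal sketch of a scientific program packaged as a service for distributed access, using only the Python standard library. The analyze function, the port, and the JSON message format are invented for illustration; a real Grid service would add the security and management layers the slide mentions.

```python
# Minimal sketch: a program "packaged for distributed access" as a service.
# Everything here (the analysis, port, message format) is a hypothetical
# illustration, not any particular Grid toolkit.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def analyze(values):
    """Stand-in for any scientific code, e.g. a data filter."""
    return {"count": len(values), "mean": sum(values) / len(values)}

class ServiceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers["Content-Length"])
        payload = json.loads(self.rfile.read(length))   # incoming message
        result = analyze(payload["values"])             # run the wrapped program
        body = json.dumps(result).encode()              # outgoing message
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8080), ServiceHandler).serve_forever()
```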

  8. Underpinnings of Cyberinfrastructure • Distributed software systems are being "revolutionized" by developments from e-commerce, e-Science and the consumer Internet. There is rapid progress in the technology families termed "Web services", "Grids" and "Web 2.0" • The emerging distributed-system picture is of distributed services with advertised interfaces but opaque implementations, communicating by streams of messages over a variety of protocols • Complete systems are built by combining either services or predefined/pre-existing collections of services to achieve new capabilities • As well as the Internet/communication revolutions (distributed systems), multicore chips will likely be hugely important (parallel systems) • Industry, not academia, is leading innovation in these technologies

  9. Computing and Cyberinfrastructure: TeraGrid • TeraGrid resources include more than 250 teraflops of computing capability and more than 30 petabytes of online and archival data storage, with rapid access and retrieval over high-performance networks. TeraGrid is coordinated at the University of Chicago, working with the Resource Provider sites: Indiana University, Oak Ridge National Laboratory, National Center for Supercomputing Applications, Pittsburgh Supercomputing Center, Purdue University, San Diego Supercomputer Center, Texas Advanced Computing Center, University of Chicago/Argonne National Laboratory, and the National Center for Atmospheric Research. [Map: Grid Infrastructure Group (UChicago) with Resource Provider (RP) and Software Integration Partner sites – UW, PSC, UC/ANL, NCAR, PU, NCSA, UNC/RENCI, IU, Caltech, ORNL, USC/ISI, SDSC, TACC]

  10. Large Hadron Collider, CERN, Geneva: 2008 Start • pp collisions at √s = 14 TeV, L = 10^34 cm^-2 s^-1 • 27 km tunnel in Switzerland & France • Experiments: ATLAS and CMS (pp, general purpose; also heavy ions), ALICE (heavy ions), LHCb (B-physics), TOTEM • 5000+ physicists, 250+ institutes, 60+ countries • Physics: Higgs, SUSY, extra dimensions, CP violation, quark-gluon plasma, … the unexpected • Challenges: analyze petabytes of complex data cooperatively; harness global computing, data & network resources

  11. The LHC Data Grid Hierarchy: Developed at Caltech (1999) • >10 Tier1 and ~100 Tier2 centers transforming science. [Diagram: the Online System at the experiment streams ~PByte/sec to the Tier 0+1 CERN Center (PBs of disk; tape robot); ~150-1500 MBytes/sec and 10-40 Gbps links feed the Tier 1 centers (FNAL, IN2P3, INFN, RAL); ~10 Gbps links feed Tier 2 centers; ~1-10 Gbps links feed Tier 3 institutes holding physics data caches; 1 to 10 Gbps links feed Tier 4 workstations.] Tens of petabytes by 2007-8; an exabyte ~5-7 years later; 100 Gbps+ data networks. Emerging vision: a richly structured, global dynamic system.

  12. The Proliferation of Tier2s • LHC computing will be more dynamic & network-oriented • ~100 Tier-2s identified – number still growing. (J. Knobloch)

  13. Closing CMS for the first time (July)

  14. Higgs diphoton Analysis using Rootlets

  15. Data and Cyberinfrastructure • DIKW: the Data → Information → Knowledge → Wisdom transformation (a toy sketch follows this slide) • Applies to e-Science, the distributed business enterprise (including outsourcing), military command and control, and general decision support • (SOAP or just RSS) messages transport information, expressed in a semantically rich fashion, between sources and services that enhance and transform information, so that the complete system implements the DIKW pipeline • Semantic Web technologies like RDF and OWL might give us rich expressivity, but they might be too complicated • We are meant to build application-specific information management/transformation systems for each domain • Each domain has specific services/standards (for APIs and information, such as KML and GML for Geographical Information Systems) • and will use generic services (like R for datamining) and • generic standards (such as RDF, WSDL) • Standards made before consensus, or not observant of technology progress, are dubious
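A toy sketch of the DIKW transformation as composable filter services. The summary statistics, the anomaly threshold, and the recommended actions are all invented placeholders; the point is only that each stage consumes one form of DIKW and produces the next.

```python
# Hypothetical sketch of Data -> Information -> Knowledge -> Wisdom as
# chained filters; thresholds and actions are invented for illustration.
from statistics import mean, stdev

def to_information(raw_readings):
    """Data -> Information: attach structure and summary statistics."""
    return {"n": len(raw_readings),
            "mean": mean(raw_readings),
            "stdev": stdev(raw_readings)}

def to_knowledge(info, threshold=2.0):
    """Information -> Knowledge: interpret against a domain rule."""
    info["anomalous"] = info["stdev"] > threshold
    return info

def to_wisdom(knowledge):
    """Knowledge -> Wisdom/Decision: recommend an action."""
    return "investigate sensor site" if knowledge["anomalous"] else "no action"

# Compose the filters exactly as chained services would be composed.
raw = [1.1, 0.9, 1.0, 7.5, 1.2]
print(to_wisdom(to_knowledge(to_information(raw))))
```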

  16. Information and Cyberinfrastructure • [Architecture diagram: Raw Data → Data → Information → Knowledge → Wisdom → Decisions. Sensor Services (SS), Filter Services (FS), Other Services (OS) and MetaData services (MD), together with databases and a portal, exchange inter-service messages; other Grids and other services connect at several points along the pipeline.]

  17. Information Cyberinfrastructure Architecture • The Party Line approach to Information Infrastructure is clear – one creates a Cyberinfrastructure consisting of distributed services accessed by portals/gadgets/gateways/RSS feeds • Services include: • Computing • “original data” • Transformations or filters implementing DIKW (Data Information Knowledge Wisdom) pipeline • Final “Decision Support” step converting wisdom into action • Generic services such as security, profiles etc. • Some filters could correspond to large simulations • Infrastructure will be set up as a System of Systems (Grids of Grids) • Services and/or Grids just accept some form of DIKW and produce another form of DIKW • “Original data” has no explicit input; just output

  18. Virtual Observatory Astronomy Grid: Integrate Experiments • [Image panels: Radio, Far-Infrared, Visible, Dust Map, Visible + X-ray, Galaxy Density Map]

  19. Minority Serving Institutions and the Grid • Historically the R1 research university powerhouses dominated research due to their concentration of expertise • Cyberinfrastructure allows others to participate, in the same way it supports distributed collaboration, in the spirit of old distance education • The Navajo Nation (Colorado Plateau, covering over 25,000 square miles in northeast Arizona, northwest New Mexico, and southeast Utah), with 110 communities and over 40% unemployment, is building a wireless grid for education and healthcare • http://www.win-hec.org/ World Indigenous Nations Higher Education Consortium • Cyberinfrastructure allows Nations to preserve their geographical identity but participate fully with world-class jobs and research • Some 335 MSIs in the Alliance have similar hopes for Cyberinfrastructure to jump-start their advancement! Is this really true? Didn't work for distance education?

  20. Navajo Nation Wireless Grid • Internet to the Hogan, dedicated January 29, 2007 at Navajo Technical College, Crownpoint NM

  21. Example: Setting up a Polar CI-Grid • The North and South polar ice is melting, with potentially huge environmental impact • As a result of MSI meetings, I am working with the MSI ECSU in North Carolina and Kansas University to design and set up a Polar Grid (Cyberinfrastructure) • This is a network of computers, sensors (on robots and satellites), data and people aimed at understanding the science of ice sheets and the impact of global warming • We have changed the 100,000-year glacier cycle into a ~50-year cycle; the field has increased dramatically in importance and interest • A good area to get involved in, as there is not much established work

  22. Jakobshavn • Greenland's mass loss doubled in the last decade: • 0.23 ± 0.08 mm of sea-level rise per year in 1996 • 0.57 ± 0.1 mm of sea-level rise per year in 2005 • 2/3 of the loss is caused by ice dynamics • 1/3 is due to enhanced runoff (Rignot and Kanagaratnam, Science, 2006) • Jakobshavn discharge: 24 km³/yr (5.6 mile³/yr) in 1996; 46 km³/yr (10.8 mile³/yr) in 2005

  23. Slide courtesy of Dr. Yehuda Bock: http://sopac.ucsd.edu/input/realtime/CRTN_NGGPSUG.ppt

  24. APEC Cooperation for Earthquake Simulation • ACES is an eight-year-long collaboration among scientists interested in earthquake and tsunami prediction • iSERVO is the infrastructure to support the work of ACES • SERVOGrid is the (completed) US Grid that is a prototype of iSERVO • http://www.quakes.uq.edu.au/ACES/ • Chartered under APEC – the Asia-Pacific Economic Cooperation of 21 economies

  25. Grid of Grids: Research Grid and Education Grid • [Diagram: field trip data, streaming data sensors, repositories/federated databases and GIS Grid discovery services feed a Sensor Grid and Database Grid; the SERVOGrid Compute Grid (computer farm) runs research simulations with data filter services and an analysis and visualization portal; customization services carry results from research to education via an Education Grid.]

  26. SERVOGrid and Cyberinfrastructure • Grids are the technology, based on Web services, that implements Cyberinfrastructure, i.e. supports eScience – science as a team sport • Internet-scale managed services that link computers, data repositories, sensors, instruments and people • There is a portal and services in SERVOGrid for: • Applications such as GeoFEST, RDAHMM, Pattern Informatics, Virtual California (VC), Simplex, mesh-generating programs … • Job management and monitoring web services for running the above codes • File management web services for moving files between various machines • Geographical Information System services • The QuakeTables earthquake-specific database • Sensors as well as databases • Context (dynamic metadata) and UDDI long-term metadata services • Services support streaming real-time data

  27. Grid Workflow Datamining in Earth Science • Work with the Scripps Institution of Oceanography • Grid services controlled by workflow process real-time data from ~70 GPS sensors in Southern California • [Workflow stages: NASA GPS earthquake data → streaming data support → transformations → data checking → Hidden Markov datamining (JPL) → real-time display (GIS), with archival throughout] • A simplified sketch of the event-detection step follows.
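A simplified sketch of the streaming pattern in this workflow: one filter stage consumes a real-time stream of GPS displacements, checks the data, and flags candidate events. A rolling z-score stands in for the actual Hidden Markov Model datamining done at JPL; the station name and values are synthetic.

```python
# Hypothetical streaming filter standing in for the HMM datamining stage;
# station names and displacement values are invented for illustration.
from collections import deque

def event_filter(stream, window=30, z_cut=4.0):
    """Yield (station, value) pairs whose displacement is anomalous."""
    history = deque(maxlen=window)
    for station, value in stream:
        if len(history) >= 5:                      # need a baseline first
            m = sum(history) / len(history)
            var = sum((x - m) ** 2 for x in history) / len(history)
            if var > 0 and abs(value - m) / var ** 0.5 > z_cut:
                yield station, value               # candidate event
        history.append(value)

# Usage with a synthetic stream standing in for ~70 Southern California sensors
stream = [("STN1", v) for v in [0.10, 0.12, 0.09, 0.11, 0.10, 0.10, 3.2, 0.11]]
for station, value in event_filter(stream):
    print(f"possible event at {station}: displacement {value}")
```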

  28. Grid-style portal as used in the Earthquake Grid • The portal is built from portlets – user-interface fragments for each service that are composed into the full interface • It uses OGCE technology, as does the planetary science VLAB portal with the University of Minnesota

  29. [Image panels: site-specific irregular scalar measurements; constellations for plate boundary-scale vector measurements (PBO); ice sheets (Greenland); volcanoes (Long Valley, CA); topography (1 km); stress change (Northridge, CA); earthquakes (Hector Mine, CA)]

  30. ACES Components

  31. Grid Workflow Data Assimilation in Earth Science • Grid services, triggered by abnormal events and controlled by workflow, process real-time data from radar and high-resolution simulations for tornado forecasts • [Screenshot: a typical graphical interface for service composition]

  32. Service or Web Service Approach • One uses GML, CML etc. to define the data structures in a system, and one uses services to capture "methods" or "programs" • In eScience, important services fall into three classes: • Simulations • Data access, storage, federation, discovery • Filters for data mining and manipulation • Services could use something like WSDL (Web Service Definition Language) to define interoperable interfaces, but Web 2.0 follows old library practice: one just specifies the interface • The service interface (WSDL) establishes a "contract", independent of implementation, between two services or a service and a client • Services should be loosely coupled, which normally means they are coarse-grained • Services will be composed (linked together) by mashups (typically scripts) or workflow (often XML – BPEL), as sketched below • Software engineering and interoperability/standards are closely related
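A hedged sketch of the mashup style of composition: a short script, rather than a BPEL workflow engine, chains two HTTP services into a pipeline. The endpoint URLs and message fields are placeholders, not real services.

```python
# Mashup-style composition sketch: a plain script links services into a
# workflow. The URLs and payload fields are hypothetical placeholders.
import json
from urllib import request

def call_service(url, payload):
    """POST a JSON message to a service and decode its JSON reply."""
    req = request.Request(url,
                          data=json.dumps(payload).encode(),
                          headers={"Content-Type": "application/json"})
    with request.urlopen(req) as resp:
        return json.loads(resp.read())

# Workflow: fetch data from a data service, pipe it through a filter service.
data = call_service("http://example.org/data-service", {"region": "SoCal"})
result = call_service("http://example.org/filter-service", {"values": data})
print(result)
```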

  33. Relevance of Web 2.0 • They say that Web 1.0 was a read-only Web while Web 2.0 is the wildly read-write collaborative Web • Web 2.0 can help e-Science in many ways • Its tools can enhance scientific collaboration, i.e. effectively support virtual organizations, in different ways from grids • The popularity of Web 2.0 can provide high quality technologies and software that (due to large commercial investment) can be very useful in e-Science and preferable to Grid or Web Service solutions • The usability and participatory nature of Web 2.0 can bring science and its informatics to a broader audience • Web 2.0 can even help the emerging challenge of using multicore chips i.e. in improving parallel computing programming and runtime environments

  34. Grid Capabilities for Science • Open technologies for any large-scale distributed system, adopted by industry, many sciences and many countries (including UK, EU, USA, Asia) • Security, reliability, management and state standards • Service and messaging specifications • User interfaces via portals and portlets, virtualizing to desktops, email, PDAs etc. • ~20 TeraGrid Science Gateways (their name for portals) • OGCE portal technology effort led by Indiana • A uniform approach to accessing distributed (super)computers, supporting single (large) jobs and spawning lots of related jobs • A data and metadata architecture supporting real-time streams and archives as well as federation • Links to the Semantic Web and annotation • Grid (Web service) workflow with standards and several successful instantiations (such as Taverna and MyLead) • Many Earth science grids including ESG (DoE), GEON, LEAD, SCEC, SERVO; LTER and NEON for the environment • http://www.nsf.gov/od/oci/ci-v7.pdf

  35. CICC: Chemical Informatics and Cyberinfrastructure Collaboratory • [Architecture diagram: a Web service infrastructure with portal services (RSS feeds, user profiles, collaboration as in Sakai) and core Grid services (service registry, job submission and management) running over local clusters, IU Big Red, TeraGrid and the Open Science Grid; domain services include OSCAR document analysis, InChI generation/search, computational chemistry (GAMESS, Jaguar etc.) and Varuna.net quantum chemistry.]

  36. Process Chemistry-Biology Interaction Data from HTS (High Throughput Screening) • Percent inhibition or IC50 data is retrieved from HTS; compound data is submitted to PubChem • Scientists at IU prefer Web 2.0 to Grid/Web Services for workflow • Grids can link data analysis (e.g. image processing developed in existing Grids), traditional cheminformatics tools, and annotation tools (Semantic Web, del.icio.us), and enhance lead ID and SAR analysis • A Grid of Grids links collections of services at PubChem, ECCR centers and MLSCN centers • Cheminformatics Grid workflows (one step is sketched below): • Workflows encoding plate & control well statistics, distribution analysis, etc. – Question: Was this screen successful? • Workflows encoding distribution analysis of screening results – Question: What should the active/inactive cutoffs be? • Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc. – Question: What can we learn about the target protein or cell line from this screen?
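One plausible instance of the "plate & control well statistics" step: the standard Z'-factor quality metric for an HTS plate (Zhang et al., 1999), which addresses the question "Was this screen successful?". The control-well values below are invented; a Z' above ~0.5 is conventionally taken to indicate a usable assay.

```python
# Z'-factor plate statistic (Zhang et al., 1999); control values invented.
from statistics import mean, stdev

def z_prime(pos_controls, neg_controls):
    """Z' = 1 - 3*(sigma_pos + sigma_neg) / |mu_pos - mu_neg|."""
    sp, sn = stdev(pos_controls), stdev(neg_controls)
    mp, mn = mean(pos_controls), mean(neg_controls)
    return 1.0 - 3.0 * (sp + sn) / abs(mp - mn)

pos = [95.0, 97.2, 94.1, 96.5]   # e.g. percent inhibition, positive controls
neg = [2.1, 3.4, 1.8, 2.9]       # negative controls
print(f"Z' = {z_prime(pos, neg):.2f}")   # > 0.5 suggests a usable screen
```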

  37. Workflows - Taverna (taverna.sourceforge.net)

  38. Supporting the distributed Enterprise I • Technologies support "virtual organizations", which are real organizations linked electronically – they link people in two modes: • Asynchronous: there are rather difficult-to-use Grid technologies, and powerful but not so security/privacy-sensitive Web 2.0 technologies, varying from YouTube and Connotea to email, wikis and blogs • Synchronous: there are audio-video conferencing and Polycom/WebEx-style tools • Such real-time collaboration tools are still unreliable (I have worked on them since 1997) and you still need a lot of travel

  39. Supporting the distributed Enterprise II • Technologies support "linkage of resources among themselves and to people" • People and data are intrinsically distributed but computers are not • Particle physics has one accelerator, but raw data becomes processed data in some 50 places around the globe • Earthquakes occur all over, and so are their data • Polar science uses UAVs gathering data all over the poles • Cloud computing offers seamless access to a "pile of computers" anywhere • Grid computing integrates multiple computers in different places, which is harder, as one must link computers with different owners and policies

  40. Distance Education etc. • 10 years ago, I expected distance education to be very important • See http://www.old-npac.org/users/gcf/icwujan98/index.html or http://www.npac.syr.edu/users/gcf/virtuniv95/index.html • Plans for ICWU, the International Collaborative Web University • This describes the plans of NPAC (Fox) and Peking University (Li) to set up an International Collaborative Web University and offer initial courses • Initial plans were a 6-course graduate Internetics program and a 2-course Web/Java high-school program; students and instruction would be spread over at least 6 institutions • NPAC and several Chinese universities were already committed, and we expected other Asian, U.S. and European participation • Very little was actually done – why? • The quality of the real-time interactive experience is still poor, or needs more infrastructure than most people have • The tools supporting my distance education classes 2001-2005 (http://grids.ucs.indiana.edu/ptliupages/jsucourse2005/) were poor compared to those I had 1998-2000

  41. Teaching Jackson State, Fall 1997 to Spring 2005 • [Photos: Syracuse and JSU]

  42. The Virtual University I • Motivated either by decreased cost or increased quality of the learning environment • Will succeed due to market pressures (it will offer the best product) • Is technologically possible today and can only get better • The main problem is pervasive Quality of Service for digital audio and video • In structured settings like briefings, lectures etc., support is easier, as they occur at fixed times and digital video is of secondary importance • Brainstorming and general "collaboration" are technically harder

  43. The Virtual University II • A "Center of Excellence" (the "Hermit's Cave Virtual University") is the natural entity to produce and deliver classes • Today 1 faculty member delivers 2 courses a semester – each to, say, 25 students • Instead, 3 faculty collaborate on 1 course and deliver it to some 200 students – perhaps in multiple sessions (200 students are required to fund quality curricula, and 200 students require distance education except in a few classes) • The university acts as an integrator, putting together a set of classes where it may only teach some 20% but acts as a mentor to all • There are important issues as to certification and the "natural unit of instruction" (smaller than a typical degree)

  44. Global Computer Science Status • There was a major federal computer science initiative 1990-2000 (HPCC, High Performance Computing and Communication) • At that time, Europe tried and failed to compete, and Japan was a serious but not leading player in Asia (Japan had a failed Fifth Generation project based on dubious Artificial Intelligence ideas) • Grids and Cyberinfrastructure have replaced HPCC, and here the status is different: • US business (Google, Amazon, Microsoft) is clearly dominant • Government-sponsored work in the "classic Grid" is strongest in Europe • US research is rather chaotic but may be correctly pointing to change • The core of the field is due to US research of around 10 years ago • Work in China, Japan, India, Australia and Latin America is world class and has interesting EU support • China needs to make the step from "developing research" to "major research" power!
