High Performance Data Streaming in a Service Architecture
Jackson State University Internet Seminar, November 18, 2004
Geoffrey Fox
Computer Science, Informatics, Physics
Pervasive Technology Laboratories
Indiana University, Bloomington IN 47401
gcf@indiana.edu
http://www.infomall.org http://www.grid2002.org
Abstract • We discuss a class of HPC applications characterized by large-scale simulations linked to large data streams coming from sensors, data repositories and other simulations. • Such applications will increase in importance to support "data-deluged science". • We show how Web service and Grid technologies offer significant advantages over traditional approaches from the HPC community. • We cover Grid workflow (contrasting it with dataflow) and show how Web Service (SOAP) protocols can achieve high performance.
Parallel Computing • Parallel processing is built on breaking problems up into parts and simulating each part on a separate computer node • There are several ways of expressing this breakup into parts in software: • Message passing as in MPI • The OpenMP model for annotating traditional languages • Explicitly parallel languages like High Performance Fortran • And several computer architectures designed to support this breakup: • Distributed memory with or without custom interconnect • Shared memory with or without good cache • Vector machines, usually with good memory bandwidth
The Six Fundamental MPI routines • MPI_Init (argc, argv) -- initialize • MPI_Comm_rank (comm, rank) -- find process label (rank) in group • MPI_Comm_size (comm, size) -- find total number of processes • MPI_Send (sndbuf,count,datatype,dest,tag,comm) -- send a message • MPI_Recv (recvbuf,count,datatype,source,tag,comm,status) -- receive a message • MPI_Finalize( ) -- shut down • A minimal program using all six is sketched below
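To make the six routines concrete, here is a minimal sketch in Python using mpi4py (the choice of mpi4py is an assumption of this example; the slide names the C bindings). mpi4py calls MPI_Init at import time and MPI_Finalize at exit:

```python
# Minimal sketch of the six fundamental routines via mpi4py.
# Run with e.g.: mpiexec -n 2 python six_routines.py
from mpi4py import MPI       # import performs MPI_Init; exit performs MPI_Finalize

comm = MPI.COMM_WORLD
rank = comm.Get_rank()       # MPI_Comm_rank: this process's label in the group
size = comm.Get_size()       # MPI_Comm_size: total number of processes

if rank == 0:
    comm.send("hello from rank 0", dest=1, tag=42)   # MPI_Send
elif rank == 1:
    msg = comm.recv(source=0, tag=42)                # MPI_Recv
    print(f"rank 1 of {size} received: {msg!r}")
```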
Whatever the Software/Parallel Architecture ….. • The software is a set of linked parts • Threads or processes sharing the same memory, or independent programs on different computers • And the parts must pass information between them in order to synchronize themselves and ensure they really are working on the same problem • The same of course is true in any system • Neurons pass electrical signals in the brain • Humans use a variety of information-passing schemes to build communities: voice, book, phone • Ants and bees use chemical messages • Systems are built of parts, and in interesting systems the parts communicate with each other; this communication expresses "why it is a system" and not a bunch of independent bits
Passing Information • Information passing between parts covers a wide range in size (number of bits electronically) and "urgency" • Communication Time = Latency + (Information Size)/Bandwidth • From society we know that we choose multiple mechanisms with different tradeoffs • Planes have high latency and high bandwidth • Walking is low latency but low bandwidth • Cars are somewhere in between these cases • We can always think of the information being transferred as a message • Whether it is an airplane passenger, sound waves or a posted letter • Or an MPI message, a UNIX pipe between processes, or a method call between threads • A worked example of the formula follows
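To see what the formula implies, here is a quick worked example using numbers quoted later in this talk (about 5 microseconds latency and roughly 0.5 Gigabytes/sec per node on BlueGene/L; the Internet figures of 50 ms and 10 MB/sec are illustrative assumptions):

```python
def comm_time(latency_s, size_bytes, bandwidth_bytes_per_s):
    """Communication Time = Latency + (Information Size)/Bandwidth."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# A 10 KB message on BlueGene/L-like hardware: bandwidth-dominated
print(comm_time(5e-6, 10_000, 0.5e9))   # ~2.5e-05 seconds
# The same message across the Internet: latency-dominated
print(comm_time(50e-3, 10_000, 10e6))   # ~5.1e-02 seconds, ~2000x slower
```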
Parallel Computing and Message Passing • We worked very hard to get a better programming model for parallel computing that removed the need for the user to • Explicitly decompose the problem and derive a parallel algorithm for the decomposed parts • Write MPI programs expressing the explicit decomposition • This effort wasn't very successful, and on distributed-memory machines (including BlueGene/L) at least, MPI-style message passing is the execution model even if one uses a higher-level language • So for parallelism we are forced to use message passing; this is efficient but intellectually hard
What about Web Services? [Diagram: a typical e-commerce application decomposed into services – payment/credit card, security, catalog, warehouse shipping – linked through WSDL interfaces] • Web Services are distributed computer programs that can be written in any language (Fortran .. Java .. Perl .. Python) • The simplest implementations involve XML messages (SOAP) and programs written in net-friendly languages like Java and Python • The diagram shows a typical e-commerce use case
Internet Programming Model • Web Services are designed as the latest distributed computing programming paradigm motivated by the Internet and the expectation that enterprise software will be built on the same software base • Parallel Computing is centered on DECOMPOSITION • Internet Programming is centered on COMPOSITION • The components of e-commerce (catalog, shipping, search, payment) are NATURALLY separated (although they are often mistakenly integrated in older implementations) • These same components are naturally linked by Messages • MPI is replaced by SOAP and the COMPOSITION model is called Workflow • Parallel Computing and the Internet have the same execution model (processes exchanging messages) but very different REQUIREMENTS
Requirements for MPI Messaging [Diagram: a compute-communicate timeline alternating tcalc and tcomm phases] • MPI and SOAP messaging both send data from a source to a destination • MPI supports multicast (broadcast) communication • MPI specifies the destination and a context (in the comm parameter) • MPI specifies the data to send • MPI has a tag to allow flexibility in processing at the receiving processor • MPI has calls to understand the context (number of processors etc.) • MPI requires very low latency and high bandwidth so that tcomm/tcalc is at most 10 • BlueGene/L has bandwidth between 0.25 and 3 Gigabytes/sec/node and a latency of about 5 microseconds • The latency requirement is set so that (Message Size)/Bandwidth > Latency
BlueGene/L MPI I–III [Figures: MPI performance measurements from http://www.llnl.gov/asci/platforms/bluegene/papers/6almasi.pdf; measured point-to-point bandwidth around 500 Megabytes/sec]
Requirements for SOAP Messaging • Web Services have many of the same requirements as MPI, with two differences where MPI is more stringent than SOAP • Latencies are inevitably 1 (local) to 100 milliseconds, which is 200 to 20,000 times that of BlueGene/L • 1) 0.000001 ms – CPU does a calculation • 2) 0.001 to 0.01 ms – MPI latency • 3) 1 to 10 ms – waking up a thread or process • 4) 10 to 1000 ms – Internet delay • Bandwidths for many business applications are low, as one just needs to send enough information for an ATM and a bank to define transactions • SOAP has MUCH greater flexibility in areas like security, fault tolerance and "virtualized addressing", because one can run a lot of software in 100 milliseconds • It typically takes 1-3 milliseconds to gobble up a modest message in Java and "add value"
Ways of Linking Software Modules [Diagram contrasting three linkage styles] • Closely coupled (Java/Python …): METHOD CALL BASED – Module A calls Module B in the same process, 0.001 to 1 millisecond latency • Coarse-grain service model: MESSAGE BASED – Service A and Service B exchange messages, 0.1 to 1000 millisecond latency • EVENT BASED with brokered messages – a publisher posts events to a "message queue in the sky" and "listeners" subscribe to them
MPI and SOAP Integration • Note SOAP specifies the message format and, through WSDL, the interfaces • MPI only specifies the interface, so interoperability between different MPIs requires additional work • IMPI http://impi.nist.gov/IMPI/ • Pervasive networks can support high bandwidth (Terabits/sec soon) but the latency issue is not resolvable in a general way • One can combine MPI interfaces with SOAP messaging, but I don't think this has been done • Just as walking, cars, planes and phones coexist with different properties, so SOAP and MPI are both good and should be used where appropriate
NaradaBrokering • http://www.naradabrokering.org • We have built a messaging system that is designed to support traditional Web Services but has an architecture that allows it to support the high-performance data transport required for scientific applications • We suggest using this system whenever your application can tolerate 1-10 millisecond latency in linking components • Use MPI when you need much lower latency • Use the SOAP approach when MPI interfaces are required but latency is high • As in linking two parallel applications at remote sites • Technically it forms an overlay network, supporting in software features often done at the IP level • The publish/subscribe pattern it implements is sketched below
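Schematically, components talk to a broker such as NaradaBrokering in a publish/subscribe style. The client below is purely hypothetical (NaradaBrokering itself is a Java system exposing JMS-style topics); this is a minimal in-process sketch of the pattern, not its real API:

```python
# Hypothetical broker client; NaradaBrokering's actual API differs.
class BrokerClient:
    """Toy in-process stand-in for a connection to a message broker."""
    def __init__(self):
        self.topics = {}                     # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self.topics.setdefault(topic, []).append(callback)

    def publish(self, topic, payload):
        for callback in self.topics.get(topic, []):
            callback(payload)                # a real broker delivers over the network

broker = BrokerClient()
broker.subscribe("sensor/stream1", lambda msg: print("received", len(msg), "bytes"))
broker.publish("sensor/stream1", b"\x00" * 1024)   # ~1-10 ms through a real broker
```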
[Figure: mean transit delay for message samples in NaradaBrokering over different communication hops (hop-2, hop-3, hop-5, hop-7); transit delay of roughly 1-9 milliseconds versus message payload size of 100-1000 bytes. Testbed: Pentium-3 1 GHz, 256 MB RAM, 100 Mbps LAN, JRE 1.3, Linux]
[Figure: average video delays for one broker – divide by N for N load-balanced brokers; latency in milliseconds at 30 frames/sec versus number of receivers, for one session and for multiple sessions]
NB-enhanced GridFTP • Adds reliability and Web Service interfaces to GridFTP • Preserves parallel TCP performance and offers a choice of transport and firewall penetration
Role of Workflow [Diagram: Service-1, Service-2 and Service-3 linked by messages] • Programming SOAP and Web Services (the Grid): workflow describes the linkage between services • As the services are distributed, the linkage must be by messages • The linkage is two-way and carries both control and data • Apply to multi-disciplinary, multi-scale and multi-program linkage; link visualization to simulation, GIS to simulations, and visualization filters to each other • The Microsoft-IBM specification BPEL is the currently preferred Web Service XML specification of workflow
Example workflow • Here a sensor feeds a data-mining application (we are extending data mining in DoD applications with Grossman from UIC) • The data-mining application drives a visualization • A schematic BPEL rendering of this pipeline is sketched below
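As an illustration of the preceding slide's BPEL remark, the sensor → data-mining → visualization pipeline might be written roughly as follows (the partner-link and operation names are invented for this sketch; only the element structure follows BPEL4WS):

```xml
<!-- Schematic BPEL; names are illustrative, not from a real deployment -->
<process name="SensorMiningViz"
         xmlns="http://schemas.xmlsoap.org/ws/2003/03/business-process/">
  <sequence>
    <receive partnerLink="sensor"     operation="newObservation"
             variable="obs" createInstance="yes"/>
    <invoke  partnerLink="dataMining" operation="mine"
             inputVariable="obs" outputVariable="patterns"/>
    <invoke  partnerLink="visualizer" operation="render"
             inputVariable="patterns"/>
  </sequence>
</process>
```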
SERVOGrid Codes and Relationships [Diagram linking codes: Elastic Dislocation, Elastic Dislocation Inversion, Viscoelastic FEM, Viscoelastic Layered BEM, Fault Model BEM, Pattern Recognizers] • This linkage is called Workflow in Grid/Web Service parlance
Two-level Programming I • The Web Service (Grid) paradigm implicitly assumes a two-level programming model • We make a Service (the same as a "distributed object" or "computer program" running on a remote computer) using conventional technologies • A C++, Java or Fortran Monte Carlo module • Data streaming from a sensor or satellite • Specialized (JDBC) database access • Such services accept and produce data from users, files and databases • The Grid is built by coordinating such services, assuming we have solved the problem of programming the service
Two-level Programming II [Diagram: Service1–Service4 composed at run time] • The Grid is concerned with the composition of distributed services, with run-time interfaces to the Grid as opposed to UNIX pipes/data streams • Familiar from the use of UNIX Shell, PERL or Python scripts to produce real applications from core programs • Such interpretative environments are the single-processor analog of Grid programming • Some projects, like GrADS from Rice University, are looking at integration between the service and composition levels, but the dominant effort looks at each level separately
3-Layer Programming Model [Diagram: Level 1 – application programming with MPI, Fortran, C++ etc.; Level 2 – application semantics (metadata, ontology, Semantic Web) over the basic Web Service infrastructure (Web Service 1, WS 2, WS 3, WS 4); Level 3 – workflow programming with BPEL] • Workflow will be built on top of NaradaBrokering as the messaging layer
Structure of SOAP • SOAP defines a very obvious message structure with a header and a body, just like email • The header contains information used by the "Internet operating system" • Destination, source, routing, context, sequence number … • The message body is partly further information used by the operating system and partly information for the application; the latter is not looked at by the "operating system" except to encrypt or compress it etc. • Note WS-Security supports separate encryption for different parts of a document • Much discussion in the field revolves around what is referenced in the header • This structure makes it possible to define VERY sophisticated messaging • A schematic envelope is shown below
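For concreteness, here is a schematic SOAP envelope with WS-Addressing-style header blocks (the URIs, message ID and body contents are invented for illustration):

```xml
<!-- Illustrative envelope; addresses and payload are made up -->
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:wsa="http://schemas.xmlsoap.org/ws/2004/08/addressing">
  <soap:Header>
    <wsa:To>http://example.org/services/datamining</wsa:To>    <!-- destination -->
    <wsa:MessageID>uuid:0a1b2c3d</wsa:MessageID>               <!-- sequencing -->
    <wsa:Action>http://example.org/mine</wsa:Action>           <!-- routing/context -->
  </soap:Header>
  <soap:Body>
    <mine xmlns="http://example.org/mine">
      <sensorReading>…</sensorReading>                         <!-- application data -->
    </mine>
  </soap:Body>
</soap:Envelope>
```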
Deployment Issues for "System Services" [Diagram: a chain of handlers – e.g. a WS-RM handler followed by other WS-… handlers, which could be WS-Eventing, WS-Transfer … – processing the SOAP Header and Body] • "System Services" (handlers/filters) are ones that act before the real application logic of a service • They gobble up the part of the SOAP header identified by the namespace they care about, and possibly part or all of the SOAP body • e.g. the XML elements in the header from the WS-RM namespace • They return a modified SOAP header and body to the next handler in the chain • A toy sketch of such a chain follows
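The chain itself is a simple pattern; here is a minimal sketch (the message representation as a dictionary and the handler behaviors are invented for illustration – real stacks dispatch on XML namespaces):

```python
# Toy handler chain; each handler consumes its header blocks and passes the rest on.
def wsrm_handler(header, body):
    seq = header.pop("wsrm:Sequence", None)    # gobble up the WS-RM namespace
    if seq is not None:
        pass                                   # acknowledge / reorder here
    return header, body

def security_handler(header, body):
    header.pop("wsse:Security", None)          # verify signature / decrypt here
    return header, body

def run_chain(handlers, header, body):
    for handler in handlers:                   # each returns a modified message
        header, body = handler(header, body)
    return header, body                        # what the application logic sees

hdr = {"wsrm:Sequence": 7, "wsse:Security": "...", "wsa:To": "svc"}
print(run_chain([security_handler, wsrm_handler], hdr, b"<payload/>"))
```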
Fast Web Service Communication I [Table: transfer times for pure SOAP, SOAP over UDP, and binary over UDP; e.g. 7020 versus 5.60 for pure SOAP versus binary over UDP] • Internet messaging systems allow one to optimize message streams at the cost of "startup time" • Web Services can deliver the fastest possible interconnections, with or without reliable messaging • Typical results from Grossman (UIC) compare slow SOAP over TCP with binary and UDP transport (the latter gains a factor of 1000)
Fast Web Service Communication II • The mechanism only works for streams – sets of related messages • The SOAP header in a stream is constant except for the sequence number (Message ID), time-stamp .. • One needs two new types of Web Service specification • "WS-StreamNegotiation" to define how one can use WS-Policy to send messages at the start of a stream that define the methodology for treating the remaining messages in the stream • "WS-FlexibleRepresentation" to define new encodings of messages
Fast Web Service Communication III • Then use "WS-StreamNegotiation" to negotiate the stream in Tortoise SOAP – ASCII XML over HTTP and TCP • Deposit the basic SOAP header through this connection – it is part of the context for the stream (the linking of 2 services) • Agree on firewall penetration, reliability mechanism, binary representation and fast transport protocol • Naturally UDP transport plus WS-RM • Use "WS-FlexibleRepresentation" to define the encoding of a fast transport (on a different port) with messages having just a "FlexibleRepresentationContextToken", sequence number, and time stamp if needed • RTP packets have essentially this structure • Could add stream termination status • Can monitor and control the stream with the original negotiation connection • Can generate different streams optimized for different end-points • A sketch of such a per-message frame follows
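The per-message frame that remains after negotiation is tiny compared with a full SOAP envelope; the field layout below is invented for this sketch, chosen only to mirror the RTP-like structure described above:

```python
import struct

# Hypothetical frame: context token, sequence number, millisecond timestamp.
# Everything else travels once, in the negotiated SOAP header.
FRAME = struct.Struct("!IIQ")        # network byte order: uint32, uint32, uint64

def pack_frame(context_token, seq_no, timestamp_ms, payload):
    return FRAME.pack(context_token, seq_no, timestamp_ms) + payload

def unpack_frame(frame):
    token, seq, ts = FRAME.unpack_from(frame)
    return token, seq, ts, frame[FRAME.size:]

msg = pack_frame(0xBEEF, 1, 1_100_000_000_000, b"binary-encoded body")
print(unpack_frame(msg))             # a 16-byte header instead of an XML envelope
```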
Data Deluged Science • In the past, we worried about data in the form of parallel I/O or MPI-IO, but we didn't consider it as an enabler of new algorithms and new ways of computing • Data assimilation was not central to HPCC • DoE ASC was set up because we didn't want (nuclear) test data! • Now particle physics will get 100 petabytes from CERN • Nuclear physics (Jefferson Lab) is in the same situation • These use around 30,000 CPUs simultaneously, 24x7 • Weather, climate, solid earth (EarthScope) • Bioinformatics curated databases (biocomplexity has only 1000's of data points at present) • Virtual Observatory and SkyServer in astronomy • Environmental sensor nets
Data Deluged Science Computing Paradigm [Diagram linking Data, Information, Ideas, Simulation, Model, Datamining, Reasoning, Data Assimilation, Computational Science and Informatics]
Virtual Observatory Astronomy Grid [Figure: integrating experiments across wavelengths – radio, far-infrared, visible; panels include a dust map, a visible + X-ray view, and a galaxy density map]
DAME: Data Deluged Engineering [Diagram: in-flight data from ~5000 engines – roughly a Gigabyte per aircraft per engine per transatlantic flight – travels over a global network such as SITA to a ground station, then on to the airline, an engine health (data) center and a maintenance centre via Internet, e-mail and pager] • Rolls Royce and the UK e-Science Program: Distributed Aircraft Maintenance Environment
USArray Seismic Sensors
[Figure: site-specific irregular scalar measurements and constellations for plate boundary-scale vector measurements; examples include ice sheets (Greenland), volcanoes (Long Valley, CA), PBO, topography (1 km), stress change, and earthquakes (Northridge, CA; Hector Mine, CA)]
Data Deluged Science Computing Architecture [Diagram: distributed data sources pass through filters into OGSA-DAI Grid services; an analysis/control service links HPC simulation, Grid data assimilation, visualization, and other Grid and Web services; the distributed filters massage data for simulation]
Data Assimilation • Data assimilation implies one is solving some optimization problem, which might have a Kalman-filter-like structure • Due to the data deluge, one will become more and more dominated by the data (Nobs much larger than the number of simulation points) • The natural approach is to form, for each local (position, time) patch, the "important" data combinations so that the optimization doesn't waste time on large-error or insensitive data • The data reduction is done in a naturally distributed fashion, NOT on the HPC machine, as distributed computing is most cost-effective when the calculations are essentially independent • The filter functions must be transmitted from the HPC machine
Distributed Filtering • Nobs(local patch) >> Nfiltered(local patch) ≈ Number_of_Unknowns(local patch) • In the simplest approach, the filtered data are obtained by linear transformations on the original data, based on a Singular Value Decomposition of the least-squares matrix • [Diagram: geographically distributed sensor patches each apply a filter – Nobs(local patch 1) → Nfiltered(local patch 1), Nobs(local patch 2) → Nfiltered(local patch 2) – on distributed machines; the HPC machine factorizes the matrix into a product over local patches, sends each patch the filter it needs, and receives the filtered data] • A sketch of the reduction on one patch follows
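A minimal sketch of the SVD-based reduction on a single patch (the matrix shapes, random data and the choice of retained rank are illustrative; a real system would build the filter from the least-squares matrix of the assimilation problem):

```python
import numpy as np

def make_filter(A, n_filtered):
    """SVD of a patch's least-squares design matrix A (Nobs x Nunknowns);
    keep the leading left singular vectors as the linear filter."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :n_filtered].T               # (n_filtered x Nobs) projection

rng = np.random.default_rng(0)

# HPC machine: factorize per patch and send each patch its filter
A_patch = rng.standard_normal((10_000, 50))  # Nobs >> number of unknowns
F = make_filter(A_patch, n_filtered=50)

# Sensor patch: reduce raw observations before shipping them back
d_obs = rng.standard_normal(10_000)
d_filtered = F @ d_obs                       # 10,000 values -> 50
print(d_filtered.shape)                      # (50,)
```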