This research presents advanced methods for content-based event routing using XML data encoding, with a focus on achieving high performance across diverse applications such as collaborative work, distributed systems, and high-throughput environments. The work includes innovations in flexible XML structures, dynamic schema discovery, efficient serialization techniques, and type safety considerations. The aim is to develop encodings that minimize size and conversion time while ensuring classification efficiency, catering to the needs of embedded systems and large-scale networks alike.
XML Data Binding: Encoding for High-Performance Content-Based Event Routing Gail Kaiser Phil Gross Columbia University Programming Systems Lab
Overview • PSL Intro • MEET Project • Encoding Conversion Efficiency • Encoding Size Efficiency • Encoding Classification Efficiency
Programming Systems Lab • “PSL conducts research on Web technologies, collaborative work, virtual worlds, process/workflow, extended transaction models, software development environments and tools, software engineering, information management, and distributed programming systems” • Lately, lots of XML stuff
PSL XML-related Research • FlexML: Flexible XML • Open-ended XML streams that may include “new” tags • Dynamic schema and semantics discovery and composition • XUES: XML-based Universal Event Service • Event Packager: Data mining over XML structured data • Event Distiller: XML event poset pattern matching • Learning new application-domain events to recognize • DISCUS: Decentralized Information Spaces for Composition and Unification of Services • Rapid and secure application composition using Web Services • Trust Evolution: PGP Trust + KeyNote + real-world business
MEET • Multiply Extensible Event Transport • Content-based multicast routing • Must be efficient enough for embedded and high-performance applications
MEET Motivations • Personal Life Recorder (sensor oriented) • GroupWork Recorder (computer/DB oriented) • Parallel/Grid computing • Distributed simulation • Battlefield C4I • Last, but not least: • Dissertation submission
Relationship to Other Work • Generally modeling communication like: [Figure: Machine A (Relational) and Machine B (XML)] • What actually goes over the line is an afterthought • But with N-way Internet-scale communication • Millions of publishers and subscribers • We can (must!) do better than ASCII text… • Line speed => ≈250 assembly instructions per packet
MEET Extensibility • Want to scale up, to millions of pubs and subs • Want to scale down, to embedded and wireless • No single solution satisfactory at all scales • Composed of hot-swappable subsystems • Router, transports, clock/causality, types, etc.
Why Types • Event data is not just an opaque bag of bits • Subscriptions are Boolean functions over events • Type safety would be nice • What type system to use?
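The idea that subscriptions are Boolean functions over events can be sketched as follows. This is a minimal illustration, not MEET's actual API: the event representation (a plain dictionary) and the helper names `field_equals`, `field_greater`, `all_of`, and `matching_subscribers` are all assumptions made for the example.

```python
from typing import Callable, Dict, List, Union

# Assumed representation: an event is a flat mapping of field names to values.
Value = Union[int, float, str]
Event = Dict[str, Value]
Subscription = Callable[[Event], bool]


def field_equals(name: str, expected: Value) -> Subscription:
    """Predicate that is true when the named field equals the expected value."""
    return lambda event: event.get(name) == expected


def field_greater(name: str, threshold: float) -> Subscription:
    """Predicate that is true when the named field is a number above threshold."""
    def predicate(event: Event) -> bool:
        value = event.get(name)
        # Without a type system, a mistyped field can only be rejected at match time.
        return isinstance(value, (int, float)) and value > threshold
    return predicate


def all_of(*subscriptions: Subscription) -> Subscription:
    """Boolean AND of several subscriptions."""
    return lambda event: all(sub(event) for sub in subscriptions)


def matching_subscribers(event: Event, table: Dict[str, Subscription]) -> List[str]:
    """Return the names of subscribers whose predicate accepts the event."""
    return sorted(name for name, sub in table.items() if sub(event))


table = {
    "alice": all_of(field_equals("type", "temperature"), field_greater("celsius", 30)),
    "bob": field_equals("type", "humidity"),
}
print(matching_subscribers({"type": "temperature", "celsius": 35.5}, table))
```

In this untyped sketch, an event whose `celsius` field holds a string is silently unmatched rather than rejected, which is one reason the slide asks for type safety.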
Initial MEET Type Design • Initial design calls for supporting Java, C#, and XML Schema defined objects “out of the box” • XML Schema used as Ur-language/Esperanto for conversions • Subscriptions are arbitrary Boolean functions on datatypes • XML Schema is not an ideal Ur-type • Excessively complex, verbose, etc.
Encodings for Efficiency • Java, C#, XML, ASN.1 have well-defined but proprietary encodings for instances • Would be nice to have an independent encoding scheme with some desirable properties missing from the above • Fast serialization/deserialization • Elimination of redundant information from message sequences • Data organized for rapid classification/routing
Conversion Efficiency • Need to get to and from wire format as fast as possible • Leverage homogeneity to eliminate unnecessary conversions, e.g., network byte order • ECho system from Eisenhauer et al., Georgia Tech • Using “native data” for ultra-low latency • Necessary for HPC
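One way to avoid unconditional conversion to network byte order is for the sender to write in its own native order and tag the message, so that only a receiver with a different order has to swap. The sketch below illustrates that idea with Python's `struct` module. The wire layout (one flag byte, a 32-bit count, then 32-bit integers) is a hypothetical format chosen for the example and is not ECho's actual encoding.

```python
import struct
import sys

LITTLE, BIG = 0, 1
NATIVE = LITTLE if sys.byteorder == "little" else BIG
_PREFIX = {LITTLE: "<", BIG: ">"}


def encode(values, order=NATIVE):
    """Encode ints as: 1 flag byte, a 32-bit count, then 32-bit signed ints.

    By default the sender uses its own byte order, so it never swaps.
    """
    prefix = _PREFIX[order]
    body = struct.pack(f"{prefix}I{len(values)}i", len(values), *values)
    return bytes([order]) + body


def decode(message):
    """Decode using the byte order named in the flag byte.

    When the flag equals the receiver's native order, no byte swapping is
    needed; otherwise the receiver alone pays for the conversion.
    """
    prefix = _PREFIX[message[0]]
    (count,) = struct.unpack_from(f"{prefix}I", message, 1)
    return list(struct.unpack_from(f"{prefix}{count}i", message, 5))


print(decode(encode([1, -2, 300])))
```

Between machines of the same architecture, this design means neither side reorders bytes, which is the homogeneous case the slide wants to exploit.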
Size Efficiency • Ideal for single message is self-describing data • With multiple messages of same type, one can pull out redundant type info, e.g., schema • Goal is to go further: If 90% of content of messages is the same, generate a new subtype with fixed values • From self-describing to all-schema is a continuum
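The step from "schema pulled out" to "subtype with fixed values" can be sketched as follows. This is a simplified illustration under stated assumptions: messages are flat dictionaries that all share the same field names, and only fields that are identical in every sampled message are moved into the subtype. The function names `derive_subtype`, `compress`, and `expand` are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Message = Dict[str, Any]


def derive_subtype(messages: List[Message]) -> Tuple[Message, List[str]]:
    """Split fields into fixed values shared by every message and varying names."""
    first = messages[0]
    fixed = {
        name: value
        for name, value in first.items()
        if all(m.get(name) == value for m in messages)
    }
    varying = sorted(name for name in first if name not in fixed)
    return fixed, varying


def compress(message: Message, varying: List[str]) -> List[Any]:
    """Send only the varying values, in the agreed field order."""
    return [message[name] for name in varying]


def expand(values: List[Any], fixed: Message, varying: List[str]) -> Message:
    """Rebuild the full message from the subtype definition and the values."""
    message = dict(fixed)
    message.update(zip(varying, values))
    return message


samples = [
    {"sensor": "t1", "unit": "C", "value": 20.5},
    {"sensor": "t1", "unit": "C", "value": 21.0},
]
fixed, varying = derive_subtype(samples)
print(fixed, varying, compress(samples[1], varying))
```

Once both parties hold the subtype definition (`fixed` and `varying`), each later message carries only the values that actually change.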
Classification Efficiency • When bits start arriving serially at the router, would like to begin cut-through routing as soon as possible • Avoid the curse of IP/IPv6: source address first • Want key routing bits as close to the front as possible • Want data in fixed locations
Fast Classifying: First Things First • In the packet, type info first (after magic) • Would like to represent type codes as bit string with “most significant” info e.g. parent type first, followed by subtype identifier, sub-subtype, etc. • Need access to type hierarchy • Popular classification fields at the front • Need to tag with popularity metadata • “subscribers will want to select on me”
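A type code that puts the parent type first lets a router test "is this a subtype of what the subscriber asked for?" as a prefix match on the leading bits. The sketch below shows one possible assignment, assuming a single-inheritance hierarchy given as a parent-to-children mapping. The hierarchy, the fixed-width sibling numbering, and the function names are assumptions for illustration, and a real packet would also need a way to mark where the type code ends.

```python
from typing import Dict, List


def assign_type_codes(children: Dict[str, List[str]], root: str) -> Dict[str, str]:
    """Give every type a bit string that extends its parent's bit string."""
    codes = {root: ""}
    pending = [root]
    while pending:
        parent = pending.pop()
        kids = children.get(parent, [])
        if not kids:
            continue
        # Fixed width per sibling group, so no sibling code is a prefix of another.
        width = max(1, (len(kids) - 1).bit_length())
        for index, kid in enumerate(kids):
            codes[kid] = codes[parent] + format(index, f"0{width}b")
            pending.append(kid)
    return codes


def is_subtype(message_code: str, subscribed_code: str) -> bool:
    """True when the message's type is the subscribed type or a descendant."""
    return message_code.startswith(subscribed_code)


hierarchy = {
    "Event": ["Sensor", "Log"],
    "Sensor": ["Temperature", "Humidity", "Pressure"],
}
codes = assign_type_codes(hierarchy, "Event")
print(codes)
```

Because the parent's bits arrive first, a router subscribed to `Sensor` can decide after the first bit of the type code, without waiting for the subtype bits.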
Fast Classifying: Fixed Positions • Would like to avoid scanning through long or variable-length fields • Long/Variable data needs to be in a separate channel/section • Primitives and fixed-length references at the front • References point into data section • Classifier can jump large, uninteresting data quickly
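The layout described here can be sketched as a fixed-size header holding primitives plus an (offset, length) reference, with the variable-length data in a trailing section. The field names, the `MEET` magic value, and the 18-byte layout below are hypothetical choices for the example, not a format defined by the project.

```python
import struct

# Hypothetical header: magic, type code, priority, payload offset, payload length.
# "!" selects a fixed byte order with no padding, giving an 18-byte header.
HEADER = struct.Struct("!4sHiII")
MAGIC = b"MEET"


def pack_event(type_code: int, priority: int, payload: bytes) -> bytes:
    """Place fixed-size fields first and the variable-length payload after them."""
    header = HEADER.pack(MAGIC, type_code, priority, HEADER.size, len(payload))
    return header + payload


def classify(packet: bytes, wanted_type: int, min_priority: int) -> bool:
    """Decide on routing by reading only fixed offsets in the header."""
    magic, type_code, priority, _, _ = HEADER.unpack_from(packet, 0)
    return magic == MAGIC and type_code == wanted_type and priority >= min_priority


def read_payload(packet: bytes) -> bytes:
    """Follow the header's reference into the data section."""
    _, _, _, offset, length = HEADER.unpack_from(packet, 0)
    return packet[offset:offset + length]


packet = pack_event(type_code=7, priority=3, payload=b"x" * 1000)
print(classify(packet, wanted_type=7, min_priority=2))
```

The classifier never touches the 1000-byte payload: it works even when handed only the first `HEADER.size` bytes of the packet, which is what makes early cut-through decisions possible.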
Plus: Schema Format • We’d like the schema format to be amenable to programmatic manipulation and analysis • For instance, when negotiating formats, we’d like to be able to compute how our original format offer differs from the counter-offer • XML Schema is pretty good for this
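Computing how an offer differs from a counter-offer can be sketched as a structural diff. The example below stands in a flat field-name-to-type mapping for a schema, which is a deliberate simplification: a real XML Schema is a tree with far richer constraints. The type names and the `diff_schemas` function are assumptions for illustration.

```python
from typing import Dict, NamedTuple, Tuple

Schema = Dict[str, str]  # assumed stand-in: field name -> type name


class SchemaDiff(NamedTuple):
    added: Dict[str, str]                 # fields only in the counter-offer
    removed: Dict[str, str]               # fields only in the original offer
    changed: Dict[str, Tuple[str, str]]   # field -> (offered type, countered type)


def diff_schemas(offer: Schema, counter: Schema) -> SchemaDiff:
    """Report how a counter-offer differs from the original format offer."""
    added = {name: t for name, t in counter.items() if name not in offer}
    removed = {name: t for name, t in offer.items() if name not in counter}
    changed = {
        name: (offer[name], counter[name])
        for name in offer
        if name in counter and offer[name] != counter[name]
    }
    return SchemaDiff(added, removed, changed)


offer = {"id": "int32", "value": "float64", "note": "string"}
counter = {"id": "int64", "value": "float64", "unit": "string"}
diff = diff_schemas(offer, counter)
print(diff)
```

A negotiating party could inspect `diff.changed` to decide whether a widened type is acceptable, or `diff.removed` to see which fields the peer does not want to receive.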
Conclusions • Efficient instance transfer is an interesting case for data-binding • Special needs for efficiency • But we can negotiate our own format among the communicating parties • Some explicit support for this in a general data-binding solution could help acceptance