Optimizing XML Processing for Grid Applications Using an Emulation Framework Rajdeep Bhowmik1, Chaitali Gupta1, Madhusudhan Govindaraju1, Aneesh Aggarwal2 1. Grid Computing Research Laboratory (GCRL), Department of Computer Science 2. Electrical and Computer Engineering State University of New York at Binghamton IPDPS 2008, Miami, Florida
Motivation • Emergence of Chip Multiprocessors (CMPs) • Need to study XML-based grid middleware and applications for • performance limitations, bottlenecks, and optimization opportunities • How should grid middleware and applications be re-structured and re-tooled for multi-core processors? • What designs will ensure that middleware and applications scale well with the increase in the number of processing cores?
McGrid • McGrid: Multi-core Grid Emulator • An emulation framework for grid middleware • Built on top of SESC: a cycle-accurate full-system multi-core simulator • Configurable for system and micro-architectural parameters • Current focus • Obtain performance results for XML-based grid middleware processing documents on multi-core systems
Grid Simulators • Many grid emulators and simulators exist • GridSim, GangSim, SimGrid, MicroGrid • do not give feedback at the micro-architecture level • memory access patterns, cache coherency overheads, synchronization between the threads of the application • Some fundamental challenges for code on CMPs • fair and efficient allocation of shared resources between concurrent threads • automatic detection of independent modules • modules that can be executed in parallel
McGrid Design Goals • Micro-architectural Simulator – • Designed on top of SESC • Allows pinning of threads to specific processing cores • Provide Micro-architectural Feedback – • cache access patterns of multiple threads • cache misses for different cache sizes • invalidations due to cache coherency protocol • conflicts in accesses to shared resources • CPU cycles wasted due to synchronization
McGrid Design Goals (2) • Configurable Design • allow analysis of grid-middleware performance for different processor types used in the heterogeneous grid environment. • Configuration options • Cache and physical memory size • Processor and memory speed • Number of on-chip cores • Pipeline and pre-fetch depth in each core • Execution width of each core
Porting to Multi-core systems • Initial analysis focus • XML based documents for job submission • Event stream documents • Workflow specifications • SOAP Messages with complex types • Serialized data formats • Decomposition • Parts that need to be thread-private • Parts that can be shared among threads • Scheduling • Mix of threads executing in parallel on CMPs • Choice of core for a particular thread
XML-based Grid Middleware Design Considerations • Role of XML in Grid Middleware • Namespaces • XML Docs with Repetition of Elements • XML Docs without Repetition of Elements • Buffering • Scanning and Caching • Co-Referenced Objects and Graphs
Bio-Medical Document • The element atom appears repeatedly • Each atom element shares namespaces defined at the top
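A minimal sketch of the document shape described above (the element names and namespace URI are illustrative, not taken from the actual bio-medical document):

```xml
<molecule xmlns:bio="http://example.org/bio">
  <bio:atom id="1" element="C"/>
  <bio:atom id="2" element="H"/>
  <bio:atom id="3" element="H"/>
  <!-- atom repeats; every atom resolves bio: from the root declaration -->
</molecule>
```

Because the prefix is declared once at the root, every repeated atom element resolves it from the same read-only binding, which each core can cache privately.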
WS-Security Document • Non-sequence-based • Some elements are more expensive to process
Research Questions • How should namespaces be defined and used in XML processing to avoid triggering expensive synchronization algorithms between the cores? • What are the ways to cache frequently used namespaces that result in performance gains on a multi-core processor? • For what class of grid applications will the use of multiple threads on a multi-core processor provide significant speed-up compared to the serial processing model that is widely used for processing XML documents on a single-core processor?
Research Questions (2) • What optimizations can be enabled when the size of sequence based XML documents is known in advance? • What are the algorithms that can detect the cache access pattern of the application and dynamically distribute the processing load evenly among the various cores? • This aspect of the research is part of future work
Performance Results • Experimental Setup – • SESC – a cycle-accurate architectural simulator • Each core has • Private 32Kbyte 4-way set-associative Level-1 data cache • Private 32Kbyte 2-way set-associative Level-1 instruction cache • Private 512Kbyte 8-way set-associative Level-2 cache • Cache Replacement Policy • LRU • Cache Coherence Protocol • MESI • Cache Line Size • 64-byte • For our performance tests • MIPS cross-compiler built from the tool-chain gcc 3.4, glibc-2.3.2, Linux kernel headers 2.4.15
3 Threading Approaches • Single threaded • A single thread is used on a single core • Scanned threaded • First thread scans the document • determines points of parallelism • New threads process in parallel after that • Direct threaded • Same as scanned threaded except • the scanning part is skipped • assumed that parallel processing points are known • based on processing in previous runs • same document size and type
Threading Configuration Measurements • Speed-up of direct threading over single-threading: about 92% for all document sizes. • Speed-up of scanned threading over single-threading: about 20% for a 500-element document and about 12% for a 4000-element document.
Direct-threading Performance • Performance almost doubles with doubling of the number of cores. Speed-up of about 92% for 2000 and 4000 elements
Performance Impact of Caching • Performance of direct-threading for varying number of elements per core. • Processing is done by two threads running on two different cores. • Elements are evenly divided between the threads. • Results for 3 cases – • Case 1 – Document preparation and processing is done on different cores. • Case 2 – Document is prepared in the core that processes the bottom half of the elements. • Case 3 – Document is prepared in the core that processes the top half of the elements.
Performance Impact of Caching • Performance of the two processing cores for the three cases of direct-threading for various document sizes.
Results for Even and Uneven Distribution of Elements with Direct-threading • With even distribution of elements – • Core 1 has the shortest running time among the cores • Core 3 has the longest running time among the cores • With uneven distribution of elements – • Best performance is obtained for the distribution in which the running times of all cores are equal
Performance Impact of Cache Coherency • Configuration Details • Shared data structure for XML processing • Shared hash table to process a co-referenced object • Config 1 – Each write of an element is followed by a read of the element • Config 2 – Each write of an element is followed by three reads of the element
Performance Impact of Cache Coherency • Performance for the two configurations of the shared hash table for various application document sizes and number of cores.
Table-lookup and Shared Stack based Namespace Implementations • Performance of the two configurations of the shared namespace stack for various document sizes and cores.
Conclusions • XML docs should avoid redefinition of namespaces in inner elements • to prevent triggering expensive synchronization algorithms between the various cores. • The number of elements in an XML doc may have to be unevenly divided among the multiple cores • taking into account the cache access patterns of the threads. • When the size of a sequence-based document is known in advance or can be estimated accurately, a simple threading approach that distributes the elements equally between the threads performs the best • because the processing of the document is equally divided between the threads. • Threads should be scheduled on cores that have already cached all or part of the data. • Non-sequence-based documents should be scanned first. • The processing load should then be balanced among the different cores.
Future Work Future work includes – • Run the emulator on a larger number of representative XML documents and grid middleware services. • Run the emulator on representative grid applications. • Study the effect of different thread scheduling schemes on the cache access patterns of each core. • Quantify the benefits of parallel XML parsing techniques for different document types and sizes. • Use the network simulator from the MicroGrid project to simulate inter-node communication between grid nodes.