100 likes | 196 Vues
This project focuses on implementing scalable tree-based algorithms using MRNet for efficient handling of reverse indexes and online queries. Explore the background of MRNet, reverse indexing for keyword queries, Wisconsin Badgers case study, objectives and experiments, progress in data and design/implementation, and next steps for deploying and optimizing performance. Learn how tree topology affects application performance through macro and micro benchmarks and scale-up/speed-up curves with cluster node increase.
E N D
Implementing Scalable Tree-based Algorithms using MRNet Ting ChenMark Cowlishaw
Introduction • Background on MRNet • Our Applications • Reverse index • Online queries • Description of Experiments • Progress • Next steps
Background: MRNet [Roth, Arnold, Miller03] • Tree-Based Overlay Network • Nodes of a distributed application are arranged in a tree-structure • Leaves producing data that is aggregated and filtered by higher levels of the tree • Separation of programs and their running tree topology • A TBON program can run on tree-network of different topologies
Reverse indexes for Keyword Queries • A keyword query is a list of words: < w1,w2, ...,wn > . • A typical reverse index is a list of words • Each word points to a list of document IDs containing the word. • A document ID list can either be sorted by document names or by the number of times the word appears in the document. Wisconsin Badgers
On-line queries: N-best frequency (N=2) DocC: 7 DocB: 3 DocD: 0 DocF: 6 DocA: 5 DocC: 7 DocF: 6 DocD: 0 DocB: 3 DocA: 5 DocE: 2
Objective / Experiments • How does tree topology affect application performance • Macro-benchmarks: throughput/time-to-completion for index building and response time for on-line queries • Micro-benchmarks: the amount of total IO, I/O performed for each node, data transferred • Scale-Up and Speed-Up curves with the increase of cluster nodes • Is Multiple-level ( > 2) helpful?
Progress - data • Experiment Data • All Wikipedia documents • 8GB of data • 4 Million documents • Probably enough for a tree with 32 leaves • Data in Wiki-text format
Progress – Design / Implementation • Message Formats • Document/keyword delivery • Acknowledgment • keys / statistics • Messages for Microbenchmarks • Shared Classes • DocumentStatistics (back end) [in test] • StatisticsEntry (all) [in test] • StatisticsList (all) • Familiarity with MRNet Toolset • Debugging MRNet Programs
Next Steps • Deploy at medium scale (~32 Nodes) • Experiments to determine fan-out • Maximize throughput (bytes/second) • Minimize time-to-completion • Microbenchmarks • Collected and filtered using MRNet • Idle time • Messages per second, total message traffic
Futures • Replace N-best with distance from median • Add support for new document types • XML • HTML • Generalize to other MapReduce[Dean04] Applications • More realistic relevance ranking • Reverse hyperlink count • Collection term vectors