COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System

COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System # $ # * $ Aoying Zhou , Rong Zhang , Quang Hieu Vu , Weining Qian # Department of Computer Science and Engineering, Fudan University, Shanghai, China; {ayzhou,rongzh}@fudan.edu.cn * ¤Singapore MIT Alliance, National University of Singapore, Singapore; hieuvq@nus.edu.sg $ Software Engineering Institute, East China Normal University, Shanghai, China; {ayzhou,wnqian}@sei.ecnu.edu.cn Our Method: Motivation: Existing methods supporting content-based search in P2P systems use either servers or super-peers to maintain global statistic. These methods are not scalable and are prone to the bottleneck problem. • COSTA (COntent-based Search using Term Aggregation) is based on an adaptive indexing structure that combines a Chord ring and a balanced tree. • The tree built on top of the Chord ring is used to aggregate and classify terms adaptively. • The Chord ring is used to index terms of tree nodes. • At each tree node, terms are classified to • Important terms: can distinguish a node from its neighbor nodes. These terms are indexed directly to the Chord ring. • Unimportant terms: are either popular or rare terms. They are aggregated to higher level nodes. System Architecture Term Indexing • Classification: Standard IR method • Documents: term vectors in Cartesian space. • Weight of a term: • Aggregation: • Each leaf node summaries its shared documents to a summary vector and sends this vector to its parent. • Each internal node calculates weights of terms of its children from which terms are classified. • Important terms are indexed to the Chord ring. • Others are aggregated to a summary vector sent to the parent node. Node Structure *Term index: term, its local frequency, its local inverse document frequency and information of the publisher node Query Processing Algorithm: Query format: Q= keyword1, …,keywordn • Look up indices of queried terms from the Chord ring. • Based on indices, compute the similarity score of each tree node involving Q. • Rank these tree nodes according to their similarity scores and select top-K of them for query processing. • * Query distribution quota is used to control the number of queries a node can send to its descendant leaf nodes. Conclusion: By adaptively aggregating and classifying terms along a hierarchical tree structure, COSTA is able to support content-based search well without using servers or super-peers for maintaining global statistic. Reference: http://homepage.fudan.edu.cn/~wnqian/Costa.pdf

COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System

COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System

Presentation Transcript

Locality Sensitive Hashing and Large Scale Image Search

The Indexing or Dividing Head

Lecture 5: Large-Scale Path Loss

Book 3: Use of Accommodations in Large-Scale Assessments

Latent Semantic Indexing

Large scale proteome comparisons Genome trees

Hadoop and Amazon Web Services

IR - Indexing

Large-Scale Data Processing with MapReduce

IR - Indexing

CMPT 454

Scale, Scale Factor, & Scale Drawings

Chapter 22: Distributed Databases

The Phenomenology of Large Scale Structure

Distributed system

Distributed Databases

Introduction to Large-Scale Graph Computation

Week 4 The Large Scale Universe

Utilizing DeltaV Adaptive Control

CS 245: Database System Principles Notes 4: Indexing

2. Communication in Distributed Systems

Operating System Security

COSTA: Adaptive Indexing for Terms in a Large-scale Distributed System