## On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems

Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu
University of Rochester; Yahoo! Inc.
ACM SIGIR 2004, Session: Dimensionality reduction

### Abstract (1/2)

- Promising direction
  - Combine IR with peer-to-peer technology for scalability, fault tolerance, and low administration cost
- pSearch
  - Places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI)
- Limitations (inherited from LSI)
  - When the corpus is large, retrieval quality is poor
  - The Singular Value Decomposition (SVD) in LSI is unscalable in terms of both memory and time

### Abstract (2/2)

- Contributions
  - To reduce the cost of SVD, reduce the size of its input matrix through doc clustering and term selection
  - Proper normalization of semantic vectors for terms and docs improves recall by 76%
  - To further improve retrieval quality, use low-dimensional subvectors of semantic vectors to cluster documents in the overlay, and then use Okapi to guide the search and doc selection

### Introduction (1/3)

- Information grows exponentially
  - Exceeds 10^18 bytes each year
- P2P systems
  - Scalability, fault tolerance, and a self-organizing nature raise hope for building large-scale IR systems
- pSearch
  - Places docs in the network according to doc semantics derived from LSI
  - The search cost for a query is reduced to a small number of routing hops

### Introduction (2/3)

- The limitations of pSearch
  - When the corpus is large, retrieval quality is poor
  - The SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time
- Proposed techniques to address these limitations
  - eLSI (efficient LSI): doc clustering and term selection
  - Proper normalization of semantic vectors for terms and docs improves recall by 76%
  - LSI+Okapi: use low-dimensional subvectors of semantic vectors to implicitly cluster docs, and then use Okapi to guide the search process and doc selection

### Introduction (3/3)

- Contributions
  - Deriving a low-dimensional representation for high-dimensional data is a common theme in many fields, e.g. Principal Component Analysis (PCA) and LSI
  - The proper configuration we found for LSI should be of general interest to the LSI community
  - Since nearest-neighbor search in a high-dimensional space is prohibitive, we propose pSearch

### pSearch System Overview (1/4)

- An example of how the system works
  - pSearch uses a CAN to organize engine nodes, and answers queries with an extension of LSI called pLSI
- Vector Space Model (VSM)
  - ltc term weighting

### pSearch System Overview (2/4)

- Latent Semantic Indexing
  - A: term-doc matrix of rank r
  - LSI approximates A with a rank-k matrix A_k = U_k Σ_k V_k^T by omitting all but the k largest singular values (a minimal sketch of this step follows below)
- Content-Addressable Network (CAN)
  - A CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node
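To make the rank-k approximation concrete, here is a minimal, self-contained Python/NumPy sketch on a toy term-doc matrix. Raw counts stand in for ltc weights, and the folding formula q̂ = qᵀ U_k Σ_k⁻¹ is the standard LSI construction; none of this code is from pSearch itself.

```python
import numpy as np

# Toy term-doc matrix A (t terms x d docs). In pSearch the entries would
# be ltc-weighted term frequencies; raw counts are used here for brevity.
A = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 0, 2],
], dtype=float)

k = 2  # dimensions kept (the paper uses k = 50-350 for small corpora)

# Truncated SVD: A ~= U_k diag(s_k) V_k^T, keeping the k largest
# singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U_k, s_k = U[:, :k], s[:k]

def fold_in(vec):
    """Project a term-space vector (doc or query) into the k-dim
    semantic space and normalize it to unit length, per the paper's
    normalization finding."""
    v = (vec @ U_k) / s_k          # standard LSI folding: q^T U_k S_k^-1
    return v / np.linalg.norm(v)

doc_vecs = np.array([fold_in(A[:, j]) for j in range(A.shape[1])])
query_vec = fold_in(np.array([1.0, 0, 0, 0, 1.0]))

# On unit vectors, cosine similarity is just a dot product.
print(doc_vecs @ query_vec)
```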
### pSearch System Overview (3/4)

- The pLSI algorithm
  - pLSI combines LSI and CAN to build pSearch
  - Upon reaching the destination, the query is flooded to nodes within a small radius r
  - Content-directed search: each node samples the content stored on its neighbors and uses the samples to decide which node to search next
  - LSI uses a k = 50-350 dimensional space for small corpora

### pSearch System Overview (4/4)

- Dimension mismatch between CAN and LSI
  - The real dimension of a CAN cannot be higher than l = O(log(n))
  - pSearch partitions a k-dimensional semantic vector into multiple l-dimensional subvectors
  - Given a doc, its index is stored at p places in the CAN, using its first p subvectors as DHT keys (p = 4)
  - If two subvectors are similar, the corresponding full vectors are likely to be similar as well (a sketch of this scheme appears after the next few slides)
- accuracy = |A ∩ B| / |A|
  - A: the 15 docs retrieved for each TREC 7&8 query based on the full 300-dimensional semantic vectors; B: the docs retrieved using the subvectors

### Improving Retrieval Quality (1/5)

- Proper LSI configuration
  - Term normalization
  - Doc normalization
  - The choice of matrix used to project vectors into the semantic space
- Experiment
  - SVDPACK
  - Corpus: disks 4 and 5 from TREC; 528,543 docs, 2 GB
  - Queries: the title field of topics 351-450
  - ltc weighting is used to generate the term-doc matrix for SVD
  - Due to memory limitations, only 15% of the TREC corpus is selected to construct an 83,098-term by 79,316-doc matrix for SVD, which projects vectors into a 300-dimensional space (memory: 1.7 GB; time: 57 minutes on a 2 GHz Pentium 4)

### Improving Retrieval Quality (2/5)

- Improvement
  - Retrieve 1,000 docs for each query and report the average number of relevant docs
  - Returns 76% more relevant docs when both terms and docs are normalized
  - Normalizing terms improves performance by emphasizing important terms
  - Normalizing docs corroborates the belief that cosine is a robust measure of similarity

### Improving Retrieval Quality (3/5)

- TREC vs. Medlars corpus
  - Medlars: 1,033 docs and 30 queries
  - Docs and queries are projected into a 50-dimensional space
  - 50 dimensions are sufficient for the small corpus
  - 300 dimensions are insufficient for the large corpus
  - Normalization is beneficial when the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus

### Improving Retrieval Quality (4/5)

- LSI performs poorly on a large corpus
  - LSI does not exploit doc length in ranking
  - A 300-dimensional semantic space is insufficient for TREC; LSI's performance can be improved by increasing the dimensionality
- LSI+Okapi (a sketch of the re-ranking step follows below)
  - Use 4-plane pLSI (each plane 25 dimensions)
  - Each plane retrieves 1,000 docs; Okapi then ranks the returned 4,000 docs
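The Okapi re-ranking step might look like the following sketch: each plane returns candidate doc ids, and BM25 scores their union. The corpus, the BM25 idf variant, and the parameters k1 = 1.2 and b = 0.75 are common defaults for illustration, not values taken from the paper.

```python
import math
from collections import Counter

def bm25(query_terms, doc_terms, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Okapi BM25 score of one doc for a bag-of-words query.
    df maps a term to its document frequency."""
    tf = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        if term not in tf:
            continue
        idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (k1 + 1) / (
            tf[term] + k1 * (1 - b + b * len(doc_terms) / avg_len))
    return score

# Hypothetical candidates returned by the p search planes; BM25 ranks
# their union instead of trusting the per-plane LSI similarities.
corpus = {1: "p2p overlay routing in a can".split(),
          2: "latent semantic indexing via svd".split(),
          3: "okapi ranking for p2p search".split()}
candidates = [1, 2, 3]  # union of the per-plane result lists
df = Counter(t for terms in corpus.values() for t in set(terms))
avg_len = sum(len(terms) for terms in corpus.values()) / len(corpus)

query = "p2p search".split()
ranked = sorted(candidates, reverse=True,
                key=lambda i: bm25(query, corpus[i], df, len(corpus), avg_len))
print(ranked)  # doc 3 matches both query terms and should rank first
```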
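Returning to the subvector scheme from the system overview, the sketch below splits semantic vectors into l-dimensional subvectors and computes accuracy = |A ∩ B| / |A| against full-vector retrieval. The vectors here are random stand-ins, so the leading dimensions carry no special weight and the reported accuracy is pessimistic; with real LSI vectors, where the earliest dimensions matter most, the first p subvectors do much better.

```python
import numpy as np

rng = np.random.default_rng(0)
k, l, p, top = 300, 8, 4, 15   # 300-dim vectors, p = 4 planes; l is O(log n)

# Unit-length semantic vectors (random stand-ins for LSI output).
docs = rng.normal(size=(1000, k))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)
query = docs[0] + 0.1 * rng.normal(size=k)
query /= np.linalg.norm(query)

def retrieve(dims):
    """Top matches using only the given slice of dimensions."""
    sims = docs[:, dims] @ query[dims]
    return set(np.argsort(sims)[::-1][:top])

A = retrieve(slice(0, k))              # reference: full 300-dim vectors
B = set()                              # union over the first p subvectors,
for i in range(p):                     # each of which serves as a DHT key
    B |= retrieve(slice(i * l, (i + 1) * l))

print("accuracy =", len(A & B) / len(A))
```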
### Improving Retrieval Quality (5/5)

- Precision-recall for TREC
- Precision-recall for Medlars
- High-end precision for TREC
  - P@i: precision when retrieving i docs for a query
- The performance of LSI+Okapi
  - High-end precision approaches that of Okapi, but low-end precision still lags behind
  - Low-end precision can be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost

### Improving the Efficiency of LSI (1/4)

- Traditionally
  - LSI uses the term-doc matrix as the input for SVD
  - For a matrix A ∈ R^(t×d) with about c nonzero elements per column, the time complexity of SVD is O(t·d·c)
- The eLSI algorithm (a sketch appears at the end of this section)
  - Use spherical k-means to cluster docs into concepts C = [c_1 c_2 … c_s] ∈ R^(t×s)
  - Compute the aggregate weight of each term i over the concept columns of C
  - Select a subset of e rows from matrix C to construct a row-reduced matrix
  - e: the top e terms with the largest aggregate weight

### Improving the Efficiency of LSI (2/4)

- For the TREC corpus
  - The complete term-doc matrix has 408,653 rows and 528,155 columns
  - The reduced matrix has fewer than 2,000 rows and 2,000 columns
- Projection
  - Terms are projected into the semantic space using V_k
  - A doc (or query) vector q is projected into the semantic space and normalized to unit length

### Improving the Efficiency of LSI (3/4)

- Other dimensionality-reduction methods
  - Random Projection (RP), sketched at the end of this section
  - The first step of all the other algorithms partitions docs into k clusters G = [g_1 g_2 … g_k] ∈ R^(t×k)
  - Concept Indexing (CI)
  - The third algorithm solves a least-squares problem
  - QR decomposition

### Improving the Efficiency of LSI (4/4)

- RP-eLSI: F is a random matrix
- Comparing dimension-reduction methods
  - RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data
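A sketch of the eLSI pipeline on synthetic data: spherical k-means over unit-length doc columns, term selection, then SVD on the small row-reduced matrix. Taking the aggregate weight of a term as its row sum over C is an assumption for illustration; the paper's exact weighting may differ.

```python
import numpy as np

rng = np.random.default_rng(1)
t, d, s, e, k = 200, 500, 20, 50, 10   # terms, docs, clusters, kept terms, dims

# Sparse-ish random term-doc matrix with unit-length doc columns.
A = rng.random((t, d)) * (rng.random((t, d)) < 0.05)
A /= np.maximum(np.linalg.norm(A, axis=0), 1e-12)

# Spherical k-means on doc columns: cosine similarity, unit centroids.
centroids = A[:, rng.choice(d, s, replace=False)].copy()
for _ in range(10):
    assign = np.argmax(centroids.T @ A, axis=0)   # nearest centroid per doc
    for j in range(s):
        members = A[:, assign == j]
        if members.size:
            v = members.sum(axis=1)
            centroids[:, j] = v / max(np.linalg.norm(v), 1e-12)

C = centroids                          # C in R^(t x s), one column per cluster

# Term selection: keep the e terms (rows) with the largest aggregate
# weight, taken here as the row sum over C -- an assumption.
keep = np.argsort(C.sum(axis=1))[::-1][:e]
C_reduced = C[keep, :]                 # e x s instead of t x d

# SVD on the tiny reduced matrix is cheap; the O(t*d*c) SVD of A is avoided.
U, sv, Vt = np.linalg.svd(C_reduced, full_matrices=False)
U_k, sv_k = U[:, :k], sv[:k]
print(A.shape, "->", C_reduced.shape, "-> U_k", U_k.shape)
```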
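For comparison, random projection avoids SVD entirely: multiply the data by a random matrix F. The Gaussian construction below is one standard choice for F, not necessarily the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(2)
t, d, k = 200, 500, 50
A = rng.random((t, d))                     # term-doc matrix

# F: k x t random Gaussian matrix. By the Johnson-Lindenstrauss lemma,
# F approximately preserves pairwise geometry when k is large enough
# relative to the data's intrinsic dimensionality, matching the
# observation that RP works when the reduced space is "sufficient".
F = rng.normal(size=(k, t)) / np.sqrt(k)
A_low = F @ A                              # k x d low-dimensional docs
A_low /= np.linalg.norm(A_low, axis=0)     # unit length for cosine similarity
print(A_low.shape)
```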