On Scaling Latent Semantic Indexing for Large Peer-to-Peer Systems Chunqiang Tang, Sandhya Dwarkadas, Zhichen Xu University of Rochester; Yahoo! Inc. ACM SIGIR2004 Session: Dimensionality reduction
Abstract (1/2) • Promising direction • Combine IR with peer-to-peer technology for scalability, fault-tolerance and low administration cost • pSearch • Places docs onto a p2p overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI) • Limitation (inherits LSI) • When the corpus is large, retrieval quality is bad • The Singular Value Decomposition (SVD) in LSI is unscalable in terms of both memory and time.
Abstract (2/2) • Contributions • To reduce the cost of SVD, we reduce the size of its input matrix through doc clustering and term selection • Proper normalization of semantic vectors for terms and docs improves recall by 76% • To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and doc selection
Introduction (1/3) • Information grows exponentially • Exceeds 10^18 bytes each year • P2P systems • Their scalability, fault-tolerance, and self-organizing nature raise hope for building large-scale IR systems • pSearch • Populates docs in the network according to doc semantics derived from LSI • The search cost for a query is reduced to route hops
Introduction (2/3) • The limitations of pSearch • When the corpus is large, retrieval quality is bad • The SVD that LSI uses to derive semantic vectors of docs is not scalable in terms of memory consumption and computation time • Proposed techniques to address these limitations • eLSI (efficient LSI): doc clustering and term selection • Proper normalization of semantic vectors for terms and docs improves recall by 76% • LSI+Okapi: use low-dimensional subvectors of semantic vectors to implicitly cluster docs, and then use Okapi to guide the search process and doc selection
Introduction (3/3) • Contributions • Deriving low-dimensional representations for high-dimensional data is a common theme in many fields, e.g., Principal Component Analysis (PCA) and LSI • The proper configuration we found for LSI should be of general interest to the LSI community • Since nearest neighbor search in a high-dimensional space is prohibitive, we propose pSearch.
pSearch System Overview (1/4) • An example of how the system works • pSearch uses a CAN to organize engine nodes into an overlay and uses an extension of LSI, called pLSI, to answer queries • Vector Space Model (VSM) • ltc term weighting
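The ltc term weighting named on this slide can be sketched as follows. This is a minimal illustration assuming the usual SMART reading of "ltc" (logarithmic term frequency, inverse document frequency, cosine normalization); the function name `ltc_weights` and the toy docs are made up for illustration and are not taken from pSearch.

```python
import math
from collections import Counter

def ltc_weights(docs):
    """Return one {term: weight} dict per tokenized doc (SMART 'ltc' scheme, assumed)."""
    n_docs = len(docs)
    # Document frequency: number of docs that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        # l: 1 + log(tf); t: log(N / df)
        raw = {t: (1.0 + math.log(f)) * math.log(n_docs / df[t]) for t, f in tf.items()}
        # c: cosine normalization, so every doc vector has unit length
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weighted.append({t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()})
    return weighted

# Toy corpus (hypothetical data).
docs = [["peer", "overlay", "peer"], ["semantic", "overlay"], ["semantic", "vector", "vector"]]
vectors = ltc_weights(docs)
```

In the first toy doc, "peer" is both more frequent and rarer across docs than "overlay", so it receives the larger weight.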
pSearch System Overview (2/4) • Latent Semantic Indexing • A: term-doc matrix, rank=r • LSI approximates A with a rank-k matrix by omitting all but the k largest singular values • Content-Addressable Network (CAN) • CAN partitions a d-dimensional Cartesian space into zones and assigns each zone to a node
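The rank-k approximation described above can be illustrated with NumPy. This is only a small sketch on a made-up dense matrix; the variable names and the choice of k = 2 are assumptions for the example, and a real term-doc matrix would be large and sparse.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((6, 5))  # toy term-doc matrix: 6 terms x 5 docs (made-up data)
k = 2                   # number of singular values to keep (illustrative)

# Thin SVD: A = U @ diag(s) @ Vt, singular values sorted in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values and the matching singular vectors.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

rank_k = np.linalg.matrix_rank(A_k)
# The Frobenius error equals the norm of the discarded singular values.
err = np.linalg.norm(A - A_k, "fro")
expected_err = np.sqrt(np.sum(s[k:] ** 2))
```

The error identity checked here is the Eckart-Young result: no other rank-k matrix is closer to A in Frobenius norm.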
pSearch System Overview (3/4) • The pLSI Algorithm • The pLSI algorithm combines LSI and CAN to build pSearch • Upon reaching the destination, the query is flooded to nodes within a small radius r • Content-directed search algorithm: each node samples content stored on its neighbors and uses the samples to decide which one to search next • LSI uses a semantic space of k=50~350 dimensions for small corpora
pSearch System Overview (4/4) • Dimension mismatch between CAN and LSI • The real dimension of a CAN can’t be higher than l=O(log(n)) • Partitions a k-dimensional semantic vector into multiple l-dimensional subvectors • Given a doc, we store its index at p places in the CAN using its first p subvectors as DHT keys (p=4) • Two similar subvectors ensure their full vectors are also similar • accuracy = |A∩B|/|A| • A: the 15 docs retrieved for each TREC-7 and TREC-8 query based on 300-dimensional semantic vectors
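The subvector partitioning and the accuracy measure above can be sketched as follows. The helper names are hypothetical, l = 25 is borrowed from the 25-dimensional planes mentioned later in the slides, and B is assumed here to be the set of docs retrieved using the subvectors, since the slide does not define it.

```python
import numpy as np

def subvector_keys(vec, l, p):
    """Split vec into consecutive l-dimensional subvectors and return the first p."""
    n_sub = len(vec) // l
    subs = [vec[i * l:(i + 1) * l] for i in range(n_sub)]
    return subs[:p]

def accuracy(a_ids, b_ids):
    """accuracy = |A intersect B| / |A|."""
    a, b = set(a_ids), set(b_ids)
    return len(a & b) / len(a)

vec = np.arange(300, dtype=float)        # stand-in for a 300-dimensional semantic vector
keys = subvector_keys(vec, l=25, p=4)    # four keys, one per place the index is stored
acc = accuracy([1, 2, 3, 4], [2, 4, 6])  # two of the four docs in A also appear in B
```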
Improving Retrieval Quality (1/5) • Proper LSI Configuration • Term normalization • Doc normalization • The choice of using to project vectors • Experiment • SVDPACK • Corpus: disks 4 and 5 from TREC, 528,543 docs, 2GB • Queries: the title field of topics 351-450 • Use ltc to generate the term-doc matrix for SVD • Due to memory limitations, select only 15% of the TREC corpus to construct an 83,098-term by 79,316-doc matrix for SVD, which projects vectors into a 300-dimensional space (memory: 1.7GB, time: 57 min on a 2GHz Pentium 4)
Improving Retrieval Quality (2/5) • Improvement • Retrieve 1,000 docs for each query and report the average # of relevant docs • Returns 76% more relevant docs when both terms and docs are normalized • Normalizing terms improves performance by emphasizing terms • The benefit of normalizing docs corroborates the belief that cosine is a robust measure for similarity
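The effect of doc normalization on ranking can be shown with a toy example. This sketch covers only the doc side (unit-length vectors, so the inner product becomes the cosine); the vectors and the helper `normalize_rows` are invented for illustration and do not reproduce the paper's configuration.

```python
import numpy as np

def normalize_rows(M):
    """Scale each row to unit length; all-zero rows are left unchanged."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return M / norms

# Three toy doc vectors in a 2-dimensional semantic space (made-up data).
docs = np.array([[3.0, 0.0], [10.0, 10.0], [0.0, 0.5]])
query = np.array([1.0, 0.0])

# Without normalization, the long vector wins even though it points elsewhere.
raw_scores = docs @ query
# With normalization, the score is the cosine of the angle to the query.
cos_scores = normalize_rows(docs) @ (query / np.linalg.norm(query))

raw_best = int(np.argmax(raw_scores))
cos_best = int(np.argmax(cos_scores))
```

Here the unnormalized inner product prefers doc 1 because of its length, while the cosine prefers doc 0, which points in exactly the query's direction.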
Improving Retrieval Quality (3/5) • TREC vs. Medlars Corpus • Medlars: 1,033 docs and 30 queries • Docs and queries are projected into a 50-dimensional space • 50 dimensions are sufficient for the small corpus • 300 dimensions are insufficient for the large corpus • Normalization is beneficial if the dimensionality of the semantic space is insufficient to capture the fine structure of the corpus.
Improving Retrieval Quality (4/5) • LSI is bad for large corpora • LSI does not exploit doc length in ranking • A 300-dimensional semantic space is insufficient for TREC. LSI’s performance can be improved by increasing dimensionality. • LSI+Okapi • Use 4-plane pLSI (each plane 25 dimensions) • Each plane retrieves 1,000 docs; Okapi ranks the returned 4,000 docs
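The two-stage LSI+Okapi idea (planes supply candidates, Okapi ranks them) can be sketched as below. The slides give no Okapi parameters, so this uses a common BM25 variant with k1 = 1.2 and b = 0.75 as an assumption; the function names, the toy corpus, and the use of global corpus statistics in one place are all illustrative simplifications rather than the pSearch implementation.

```python
import math
from collections import Counter

def bm25_score(query, doc, df, n_docs, avg_len, k1=1.2, b=0.75):
    """Score one tokenized doc against a query with a common BM25 variant (assumed)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        if tf[term] == 0:
            continue
        idf = math.log(1.0 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
        # Length normalization: longer-than-average docs are penalized.
        denom = tf[term] + k1 * (1.0 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1.0) / denom
    return score

def rerank(query, plane_results, corpus):
    """Merge the candidate ids returned by each plane, then rank them with BM25."""
    candidates = set().union(*plane_results)
    n_docs = len(corpus)
    df = Counter(t for doc in corpus.values() for t in set(doc))
    avg_len = sum(len(d) for d in corpus.values()) / n_docs
    scored = [(bm25_score(query, corpus[d], df, n_docs, avg_len), d) for d in candidates]
    # Highest score first; ties broken by doc id for determinism.
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))]

# Toy corpus and per-plane candidate lists (hypothetical data).
corpus = {
    "d1": ["latent", "semantic", "indexing"],
    "d2": ["peer", "overlay", "network", "routing"],
    "d3": ["semantic", "overlay", "semantic", "search"],
    "d4": ["okapi", "ranking"],
}
plane_results = [["d1", "d2"], ["d3", "d1"], ["d2"], ["d3"]]
ranked = rerank(["semantic", "search"], plane_results, corpus)
```

Only docs returned by at least one plane are scored, which is why d4 never appears in the final ranking.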
Improving Retrieval Quality (5/5) • Precision-recall for TREC • Precision-recall for Medlars • High-end precision for TREC • P@i: precision when retrieving i docs for a query • The performance of LSI+Okapi • High-end precision approaches that of Okapi, but the low-end still lags behind. • The low-end precision can be improved by allowing each plane to return more candidate docs for Okapi to rank, but this would increase the search cost.
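The P@i measure defined above is simple to compute. The following is a minimal sketch with a hypothetical function name and made-up doc ids.

```python
def precision_at(ranked_ids, relevant_ids, i):
    """P@i: fraction of the top-i retrieved docs that are relevant."""
    relevant = set(relevant_ids)
    top = ranked_ids[:i]
    return sum(1 for d in top if d in relevant) / i

ranking = ["a", "b", "c", "d"]  # docs in the order returned for one query
relevant = {"a", "c"}           # relevance judgments for that query
p1 = precision_at(ranking, relevant, 1)
p2 = precision_at(ranking, relevant, 2)
p4 = precision_at(ranking, relevant, 4)
```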
Improving the Efficiency of LSI (1/) • Traditionally • LSI uses the term-doc matrix as the input for SVD • For a matrix A ∈ R^{t×d} with about c nonzero elements per column, the time complexity of SVD is O(t·d·c) • The eLSI algorithm • Use spherical k-means to cluster docs: C = [c_1 c_2 … c_s] ∈ R^{t×s} • The aggregate weight of a term i: • We select a subset of e rows from matrix C to construct a row-reduced matrix • e: top e terms with the largest aggregate weight
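The two reduction steps above (cluster docs into s centroids, then keep the e heaviest term rows) can be sketched as follows. The aggregate-weight formula is not shown on the slide, so this example assumes it is the sum of a term's absolute weights across centroids; that choice, the function names, the random initialization, and the toy sizes are all assumptions for illustration.

```python
import numpy as np

def spherical_kmeans(A, s, n_iter=10, seed=0):
    """Cluster the unit-normalized columns of A (terms x docs) into s clusters by cosine."""
    rng = np.random.default_rng(seed)
    X = A / np.maximum(np.linalg.norm(A, axis=0, keepdims=True), 1e-12)
    # Initialize centroids with s distinct docs chosen at random.
    C = X[:, rng.choice(X.shape[1], size=s, replace=False)].copy()
    for _ in range(n_iter):
        # Assign each doc to the centroid with the highest cosine similarity.
        labels = np.argmax(C.T @ X, axis=0)
        for j in range(s):
            members = X[:, labels == j]
            if members.shape[1] > 0:
                c = members.sum(axis=1)
                C[:, j] = c / max(np.linalg.norm(c), 1e-12)  # keep centroids on the unit sphere
    return C, labels

def select_rows(C, e):
    """Keep the e term rows of C with the largest aggregate weight."""
    weight = np.abs(C).sum(axis=1)  # ASSUMPTION: the slide omits the actual formula
    keep = np.sort(np.argsort(-weight)[:e])
    return C[keep, :], keep

rng = np.random.default_rng(1)
A = rng.random((20, 12))                        # toy term-doc matrix (made-up data)
C, labels = spherical_kmeans(A, s=3)            # C is t x s
C_reduced, kept_terms = select_rows(C, e=5)     # e x s input for the SVD
U, sigma, Vt = np.linalg.svd(C_reduced, full_matrices=False)
```

The SVD now runs on an e × s matrix instead of the full t × d matrix, which is the source of the savings.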
Improving the Efficiency of LSI (2/) • For TREC corpus • The complete term-doc matrix has 408,653 rows and 528,155 columns • The row-reduced matrix has fewer than 2,000 rows and 2,000 cols • Projection • Project terms into the semantic space using V_k • Project a doc (or query) vector q into the semantic space and normalize it to unit length
Improving the Efficiency of LSI (3/) • Other Dimensionality Reduction Methods • Random Projection (RP) • The first step of all other algorithms partitions docs into k clusters, G = [g_1 g_2 … g_k] ∈ R^{t×k} • Concept Indexing (CI) • The third algorithm solves the least-squares problem • QR decomposition
Improving the Efficiency of LSI (4/) • RP-eLSI: F is a random matrix • Comparing Dimension Reduction Methods • RP performs well when the dimensionality of the reduced space is sufficient to capture the real dimensionality of the data
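Plain random projection with a random matrix F can be sketched as follows. The slides do not say how F is drawn or how RP-eLSI combines it with eLSI, so this example assumes a Gaussian F scaled by 1/sqrt(k) and shows only the projection step; the names and sizes are illustrative.

```python
import numpy as np

def random_projection(A, k, seed=0):
    """Project the columns of A (terms x docs) into k dimensions with a random matrix F."""
    rng = np.random.default_rng(seed)
    t = A.shape[0]
    # ASSUMPTION: Gaussian entries; the 1/sqrt(k) scaling keeps vector lengths
    # roughly unchanged in expectation.
    F = rng.standard_normal((k, t)) / np.sqrt(k)
    return F @ A, F

rng = np.random.default_rng(2)
A = rng.random((1000, 8))           # toy term-doc matrix: 1,000 terms x 8 docs
B, F = random_projection(A, k=50)   # each doc is now a 50-dimensional vector
```

Unlike SVD, this step needs no decomposition of A, which makes it cheap, at the price of needing enough target dimensions to cover the data's real dimensionality.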