
Dimensionality reduction by random projection and latent semantic indexing


Presentation Transcript


  1. Dimensionality reduction by random projection and latent semantic indexing Ângelo Cardoso, IST/UTL, December 2009 Paper by Jessica Lin and Dimitrios Gunopulos

  2. Outline • Introduction • Latent Semantic Indexing (LSI) • Random Projection (RP) • Combining LSI and Random Projection • Experiments • Dataset and pre-processing • Document Similarity • Document Clustering

  3. Introduction – Latent Semantic Indexing • Vector-space model • Term-to-document matrix where each entry is the relative frequency of a term in the document • Find a k-dimensional subspace onto which to project the original term-to-document matrix • SVD gives the optimal solution in the mean-squared-error sense • Benefits: speed up queries • Address synonymy • Find the intrinsic dimensionality of the data
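
A minimal NumPy sketch of this projection, assuming the usual truncated-SVD formulation of LSI (the toy matrix and the choice k = 2 are illustrative, not taken from the paper):

```python
import numpy as np

def lsi(A, k):
    """Project a (terms x documents) matrix onto its top-k left singular
    directions; the rank-k subspace is optimal in the mean-squared-error sense."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)  # thin SVD: A = U diag(s) Vt
    Uk = U[:, :k]          # m x k term-to-concept mapping
    return Uk.T @ A        # documents expressed in the k-dimensional concept space

# Toy term-to-document matrix (m terms x n documents), illustrative values only
A = np.array([[0.5, 0.0, 0.4],
              [0.5, 0.3, 0.0],
              [0.0, 0.7, 0.6]])
docs_2d = lsi(A, k=2)      # each column is a document in 2 dimensions
```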

  4. Introduction – Random Projection • What if we construct the projection subspace randomly? • Johnson-Lindenstrauss lemma: if points in a vector space are projected onto a randomly selected subspace of suitably high dimension, then the distances between the points are approximately preserved • Making the subspace orthogonal is computationally expensive • However, we can rely on a result by Hecht-Nielsen: in a high-dimensional space there are many more almost-orthogonal directions than orthogonal ones
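
A minimal sketch of such a projection, assuming a Gaussian random matrix that is not orthogonalized (relying on the almost-orthogonality result above); the 1/sqrt(k) scaling is the standard Johnson-Lindenstrauss choice:

```python
import numpy as np

def random_projection(A, k, seed=0):
    """Project an (m x n) term-to-document matrix onto k random directions.
    The rows of R are not orthogonalized: in high dimension they are
    almost orthogonal anyway (Hecht-Nielsen)."""
    rng = np.random.default_rng(seed)
    m = A.shape[0]
    # i.i.d. Gaussian entries; the 1/sqrt(k) scaling keeps pairwise
    # distances approximately preserved (Johnson-Lindenstrauss)
    R = rng.standard_normal((k, m)) / np.sqrt(k)
    return R @ A           # k x n reduced representation
```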

  5. Combining LSI and Random Projection – Motivation • LSI • Captures the underlying semantics • Highly accurate • Can improve retrieval performance • Computationally expensive: O(cmn), where m is the number of terms, c is the average number of terms per document and n is the number of documents • Random Projection • Efficient in terms of computational time • Does not preserve as much information as LSI

  6. Combining LSI and Random Projection – Algorithm • Proposed in: Latent Semantic Indexing: A Probabilistic Analysis; Papadimitriou, C.H., Raghavan, P., Tamaki, H. and Vempala, S.; Journal of Computer and System Sciences; 2000 • Idea: improve Random Projection accuracy and improve LSI computational time • First the data is reduced to a lower dimension l (= k1) using Random Projection • LSI is then applied to the reduced lower-dimensional data to further reduce it to the desired dimension k2 • Overall complexity is O(ml(l + c)) • RP on the original data: O(mcl) • LSI on the reduced lower-dimensional data: O(ml²)
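
A sketch of the two-stage reduction under the same assumptions as the previous snippets (dense NumPy arrays for brevity; in practice the term-to-document matrix would be stored sparse):

```python
import numpy as np

def rp_lsi(A, k1, k2, seed=0):
    """Stage 1: random projection of the (m x n) matrix down to k1 dimensions.
    Stage 2: LSI (truncated SVD) on the small k1 x n matrix, down to k2."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    R = rng.standard_normal((k1, m)) / np.sqrt(k1)
    B = R @ A                                         # k1 x n, cheap to compute
    U, s, Vt = np.linalg.svd(B, full_matrices=False)  # SVD of a much smaller matrix
    return U[:, :k2].T @ B                            # k2 x n final representation
```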

  7. Experiments – Similarity: Dataset and Pre-processing • Two subsets of the Reuters text categorization collection • Common and rare words are removed • Porter stemming • Term-document matrix representation • Documents normalized to unit length • Larger subset: 10377 documents, 12113 terms, term-document matrix density is 0.4% • Smaller subset: 1831 documents, 5414 terms, term-document matrix density is 0.8%
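
A rough sketch of this pipeline, assuming NLTK's PorterStemmer is available and using illustrative document-frequency thresholds (the exact stop-word and rare-word criteria are not given in the transcript):

```python
import numpy as np
from collections import Counter
from nltk.stem import PorterStemmer   # assumes NLTK is installed

def build_matrix(docs, stopwords, min_df=3, max_df_ratio=0.5):
    """Term-to-document matrix of relative frequencies with common and rare
    words removed, Porter stemming, and unit-length document columns."""
    stemmer = PorterStemmer()
    tokenized = [[stemmer.stem(w) for w in d.lower().split() if w not in stopwords]
                 for d in docs]
    df = Counter(t for doc in tokenized for t in set(doc))   # document frequencies
    n = len(docs)
    vocab = sorted(t for t, c in df.items() if c >= min_df and c / n <= max_df_ratio)
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), n))
    for j, doc in enumerate(tokenized):
        for t in doc:
            if t in index:
                A[index[t], j] += 1.0
        if doc:
            A[:, j] /= len(doc)                  # relative frequency
    norms = np.linalg.norm(A, axis=0)
    A[:, norms > 0] /= norms[norms > 0]          # normalize documents to unit length
    return A, vocab
```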

  8. Experiments – Similarity: Layout • Three techniques for dimensionality reduction are compared • Latent Semantic Indexing (LSI) • Random Projection (RP) • Combination of Random Projection and LSI (RP_LSI) • The dimensionality of the original data is reduced to a lower dimension k • k = 50, 100, 200, 300, 400, 500, 600

  9. Experiments – Similarity: Metrics • Euclidean distance • Cosine of the angle between documents • Determining the error: randomly select 100 document pairs and calculate their distances before and after dimensionality reduction • Compute the correlation between the distance vectors before (x) and after (y) dimensionality reduction • The error is defined in terms of this correlation
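
A sketch of this evaluation; the exact error formula is not reproduced in the transcript, so the snippet assumes error = 1 − correlation(x, y), which should be read only as one plausible reading of the slide:

```python
import numpy as np

def similarity_error(A, A_red, n_pairs=100, seed=0):
    """Sample random document pairs, compute their Euclidean distances in the
    original (x) and reduced (y) spaces, and return 1 - correlation(x, y).
    NOTE: the 1 - correlation definition is an assumption, not the slide's formula."""
    rng = np.random.default_rng(seed)
    n = A.shape[1]
    pairs = [rng.choice(n, size=2, replace=False) for _ in range(n_pairs)]
    x = np.array([np.linalg.norm(A[:, i] - A[:, j]) for i, j in pairs])
    y = np.array([np.linalg.norm(A_red[:, i] - A_red[:, j]) for i, j in pairs])
    return 1.0 - np.corrcoef(x, y)[0, 1]
```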

  10. Experiments – Similarity: Distance before and after dimensionality reduction • The best technique in terms of error is LSI, as expected • RP_LSI improves the accuracy of RP for both Euclidean distance and dot product (* RP_LSI: k1 = 600)

  11. Experiments – Similarity: RP_LSI – k1 and k2 parameters • The amount of the second reduction (the final dimension k2) matters more for achieving a small error than the amount of the first reduction (k1) • This suggests that LSI plays a more important role than RP in preserving similarity

  12. Experiments – Similarity: Running Time • RP_LSI performs slightly worse than LSI on the larger dataset (more sparse) • RP_LSI achieves a significant improvement over LSI on the smaller dataset (less sparse) (* RP_LSI: k1 = 600)

  13. Experiments – Clustering: Layout • Clustering is applied to the data before and after dimensionality reduction • Experiments are performed on the smaller dataset • The clustering algorithm chosen is classic k-Means • Effective • Low computational cost • Document vectors are normalized to unit length before clustering • Centroids are normalized to unit length after clustering

  14. Experiments – Clustering: k-Means • The k-Means objective is to minimize the sum of intra-cluster errors • The quality of dimensionality reduction is evaluated using this criterion • Since the dimensionality of the data is reduced, the criterion is computed in the original space to make the comparison possible • The number of clusters is set to 5, roughly the number of main topics in the dataset • Initialization is random • k-Means is repeated 20 times for each experiment and the average is taken
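
A sketch of this evaluation loop, assuming SciPy's classic k-means (scipy.cluster.vq.kmeans2) and the setup described above; centroid normalization follows the previous slide:

```python
import numpy as np
from scipy.cluster.vq import kmeans2   # classic k-means with random initialization

def clustering_quality(A_orig, A_red, n_clusters=5, n_runs=20):
    """Cluster the reduced documents, then score each partition by the sum of
    intra-cluster squared errors computed in the ORIGINAL space, so different
    reductions stay comparable. The score is averaged over random restarts."""
    scores = []
    for _ in range(n_runs):
        # documents are columns, so cluster the transposed (n x k) matrix
        _, labels = kmeans2(A_red.T, n_clusters, minit='random')
        sse = 0.0
        for c in range(n_clusters):
            members = A_orig[:, labels == c]
            if members.shape[1] == 0:
                continue
            centroid = members.mean(axis=1, keepdims=True)
            norm = np.linalg.norm(centroid)
            if norm > 0:
                centroid = centroid / norm       # normalize centroid to unit length
            sse += np.sum((members - centroid) ** 2)
        scores.append(sse)
    return float(np.mean(scores))
```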

  15. Experiments – Clustering: Results • LSI and RP_LSI show results similar to the original data even for smaller dimensions • RP shows significantly worse performance for smaller dimensions and more similar performance for larger dimensions • LSI shows slightly better results than RP_LSI • Clustering results using Euclidean distance are similar

  16. Conclusion • LSI and Random Projection were compared • The combination of Random Projection and LSI was analyzed • The sparseness of the data seems to play a central role in the effectiveness of this technique • The technique appears to be more effective the less sparse the original data is • SVD complexity is linear in the sparseness (number of non-zeros) of the data • Random Projection makes the data completely dense • The gain from first reducing the dimensionality competes with the extra cost added to the SVD calculation by making the data completely dense

  17. Conclusion • Additional experiments are needed to confirm that it is indeed the sparseness of the data that causes the discrepancy between the observed running time and what was previously expected • Other dimensionality reduction algorithms that preserve the sparseness of the data might help improve the running time of LSI

  18. Questions
