This paper explores methods for text filtering and collaborative filtering, focusing on how collective user interests can be harnessed to recommend relevant documents. It discusses the construction of content profiles, the application of pure collaborative filtering, and the integration of content-based filtering with collaborative methods to enhance recommendations. By utilizing latent semantic indexing (LSI) and extensive experimental results, the work aims to improve filtering performance and outlines future directions including analysis of various datasets and techniques.
Techniques for Collaboration in Text Filtering Ian Soboroff Department of Computer Science and Electrical Engineering University of Maryland, Baltimore County ian@cs.umbc.edu
Overview • Text filtering and collaborative filtering • Finding collaboration among content profiles • Experimental results • Ongoing work
Information Filtering • Given • a stream of documents (news articles, movies) • a set of users (with stable and specific interests) • Recommend documents to users who will be interested in them • "Tell me when a jazz CD comes out that I'll like." • "Tell me when an earthquake is reported."
Content Filtering • Construct profiles from example documents • vector of weights for terms in documents • can use known relevant and nonrelevant docs • can use external resources such as a home page, job description, or research papers • Match new documents against content profiles
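A minimal sketch of the profile-and-match idea on this slide. The slides do not say how term weights are computed, so the Rocchio-style weighting (relevant mean minus a damped nonrelevant mean), the raw term-frequency vectors, the parameter names `beta` and `gamma`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_vector(text):
    """Raw term-frequency vector for a lowercased, whitespace-tokenized text."""
    return Counter(text.lower().split())

def build_profile(relevant, nonrelevant, beta=1.0, gamma=0.5):
    """Rocchio-style profile (an assumed weighting, not taken from the slides):
    mean of relevant vectors minus a damped mean of nonrelevant vectors."""
    profile = Counter()
    for doc in relevant:
        for term, tf in tf_vector(doc).items():
            profile[term] += beta * tf / len(relevant)
    for doc in nonrelevant:
        for term, tf in tf_vector(doc).items():
            profile[term] -= gamma * tf / len(nonrelevant)
    return dict(profile)

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical training examples for one user interested in earthquakes.
profile = build_profile(
    relevant=["earthquake hits turkey", "strong earthquake reported"],
    nonrelevant=["jazz album released"],
)

# Match new documents from the stream against the profile.
score_quake = cosine(profile, tf_vector("earthquake reported in turkey"))
score_jazz = cosine(profile, tf_vector("new jazz album"))
```

A document would be recommended when its score clears some per-user threshold; how that threshold is chosen is outside what the slide covers.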
Filtering in a Community • Many people will be watching the same stream • Some of them may have overlapping interests • earthquakes, mideast politics, building codes, Turkey • Charles Mingus, Duke Ellington, Kenny G • Want to take advantage of group effort
"Pure" Collaborative Filtering • collect users' ratings for documents • thumbs up/down, or 1-5 scale • compute correlations among users • predict ratings for new/unseen items using existing ratings and correlation values
Pure CF Example (ratings for Comedies and Dramas; "?" marks a rating to be predicted)
Alice: 5 7
Bob: ? 9 7 ? 2 9
Carmen: 4 9 7 8 1 8
Doug: ? 9
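The correlate-then-predict steps of pure collaborative filtering can be sketched as follows. This uses one common formulation (Pearson correlation over co-rated items, then the target user's mean plus correlation-weighted deviations of other users); the slides do not specify the exact formula, so treat that choice, the function names, and the small ratings matrix as assumptions. The matrix is made up and is not the example from the slide.

```python
import numpy as np

# Hypothetical ratings (rows = users, columns = items); np.nan means "not rated".
ratings = np.array([
    [5.0, 4.0, 1.0, np.nan],   # user 0: we want a prediction for item 3
    [4.0, 5.0, 2.0, 1.0],      # user 1: tastes similar to user 0
    [1.0, 2.0, 5.0, 5.0],      # user 2: tastes opposite to user 0
])

def pearson(a, b):
    """Pearson correlation over the items both users rated; 0.0 if undefined."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if both.sum() < 2:
        return 0.0
    x = a[both] - a[both].mean()
    y = b[both] - b[both].mean()
    denom = np.sqrt((x ** 2).sum() * (y ** 2).sum())
    return float((x * y).sum() / denom) if denom > 0 else 0.0

def predict(ratings, user, item):
    """User's mean rating plus correlation-weighted deviations of other raters."""
    user_mean = np.nanmean(ratings[user])
    num = den = 0.0
    for other in range(ratings.shape[0]):
        if other == user or np.isnan(ratings[other, item]):
            continue
        w = pearson(ratings[user], ratings[other])
        num += w * (ratings[other, item] - np.nanmean(ratings[other]))
        den += abs(w)
    return user_mean + num / den if den > 0 else user_mean

predicted = predict(ratings, user=0, item=3)
```

Here the similar user disliked item 3 and the dissimilar user liked it, so both pieces of evidence push the prediction for user 0 below that user's average rating.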
Combining Content and Collaboration • Pure collaborative filtering • can recommend anything • must have ratings to give predictions • don't know much about documents or ratings • Adding content to collaboration • content filtering can recommend an unrated document • exploit common themes among content profiles
One Approach to CBCF • Construct content profiles • Documents are vectors of weighted features • Build profiles from known relevant and nonrelevant documents • Collaborative step • Combine profile vectors into single matrix • Compute latent semantic index of profile collection • Route new documents in profiles' "LSI space"
Latent Semantic Indexing • Compute singular value decomposition of a content matrix $M = [w_{td}]$ ($t$ terms by $d$ documents): $M_{t \times d} = T_{t \times r}\,\Sigma_{r \times r}\,D^{T}_{r \times d}$ • $D$, a representation of $M$ in $r$ dimensions • $T$, a matrix for transforming new documents • $\Sigma$ gives relative importance of dimensions
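As a worked form of "a matrix for transforming new documents": writing the decomposition as $M = T \Sigma D^{T}$, with $\Sigma$ the diagonal matrix of singular values (the symbol name is assumed here, following standard LSI notation), keeping only the $k$ largest singular values gives a rank-$k$ approximation, and a new document vector $q$ in term space is folded into the reduced space as shown. The subscript $k$ and the names $q$, $\hat{q}$, $m_j$, and $d_j$ are notation introduced for this sketch.

```latex
M \approx M_k = T_k \, \Sigma_k \, D_k^{T},
\qquad
\hat{q} = \Sigma_k^{-1} \, T_k^{T} \, q
```

Applying the same projection to the $j$-th column $m_j$ of $M$ with $k = r$ recovers the $j$-th row of $D$ (as a column vector $d_j$), since the columns of $T$ are orthonormal: $\Sigma^{-1} T^{T} m_j = \Sigma^{-1} T^{T} T \Sigma \, d_j = d_j$. This is why projected new documents are directly comparable with the rows of $D$.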
Collaborating with LSI • LSI dimensions are ... • based on term co-occurrence patterns between documents (profiles) • ordered by their prominence in collection • LSI space built from profiles • highlights common patterns among profiles • "noisy" dimensions can be pruned • project new documents into a collaborative space for routing
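A small NumPy sketch of routing in the profiles' LSI space. The term list, the weights in the profile matrix, the choice `k=2`, and the helper names `build_lsi`, `fold_in`, and `route` are all hypothetical; the slides describe the approach but not an implementation.

```python
import numpy as np

# Hypothetical term-by-profile matrix (rows = terms, columns = user profiles).
# Profiles 0 and 1 share an earthquake theme; profile 2 is about jazz.
profiles = np.array([
    [2.0, 1.0, 0.0],   # earthquake
    [1.0, 2.0, 0.0],   # turkey
    [1.0, 0.0, 0.0],   # building (only in profile 0)
    [0.0, 0.0, 2.0],   # jazz
    [0.0, 0.0, 1.0],   # mingus
])

def build_lsi(matrix, k):
    """Rank-k SVD: term matrix T_k, singular values s_k, profile coordinates D_k."""
    T, s, Dt = np.linalg.svd(matrix, full_matrices=False)
    return T[:, :k], s[:k], Dt[:k, :].T

def fold_in(doc, T_k, s_k):
    """Project a term-space document vector into the k-dimensional LSI space."""
    return (T_k.T @ doc) / s_k

def route(doc, T_k, s_k, D_k):
    """Cosine similarity between the projected document and each profile."""
    q = fold_in(doc, T_k, s_k)
    return D_k @ q / (np.linalg.norm(D_k, axis=1) * np.linalg.norm(q))

# Keep the two most prominent dimensions and prune the rest.
T_k, s_k, D_k = build_lsi(profiles, k=2)

# New document mentioning "earthquake" and "building".
new_doc = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
scores = route(new_doc, T_k, s_k, D_k)
```

In this toy setup profile 1 never mentions "building", yet after pruning it scores as highly as profile 0 because the two profiles share a dominant dimension. That is the collaborative effect the slide describes, shown under very idealized conditions.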
Experiments with Cranfield • Cranfield, a standard (if small) IR collection • 1398 documents, 255 scored queries • Profiles: selected Cranfield queries • 26 queries with ≥ 15 relevant documents • 70% of a profile's relevant docs used in each profile • Results show improvement from using LSI of profiles • compared to using profiles alone • compared to using LSI of all of Cranfield
Results: Average Precision
Method                                  k-value    Set 1    Set 2
Content (log-tfidf)                     -          0.2894   0.2705
Content LSI (LSI of all of Cranfield)   25         0.2656   0.1980
                                        50         0.3136   0.2686
                                        100        0.3251   0.3053
                                        200, 500   0.3314, 0.3144, 0.3149, 0.3302 (cell positions not recoverable)
Collaborative LSI (LSI of profiles)     8          0.3136   0.2583
                                        15         0.4151   0.3745
                                        18         0.3600   0.3615
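For readers unfamiliar with the metric in the table, here is a sketch of average precision for a single ranked list. The slides do not say which variant was used, so this assumes the standard non-interpolated definition; the function name and document IDs are made up.

```python
def average_precision(ranked_doc_ids, relevant_ids):
    """Non-interpolated average precision for one ranked list.

    Precision is taken at the rank of each relevant document retrieved,
    summed, and divided by the total number of relevant documents."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)

# Hypothetical ranking: relevant docs d1 and d2 appear at ranks 2 and 4,
# and relevant doc d9 is never retrieved.
ap = average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2", "d9"})
```

In this example the precisions at the two hits are 1/2 and 2/4, so the score is (0.5 + 0.5) / 3. A figure such as those in the table would then be the mean of this value over all profiles in a set.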
Experiments with TREC • TREC-8 routing task • Profiles: 50 topics (351-400) • Test Documents: Financial Times 1993-4 • Training Documents: FT 92, LA Times 89-90, FBIS • Building profiles • short topic description • known relevant documents in training set • sample of non-relevant documents from training set
Average Precision in TREC • Average precision... • with profiles alone = 0.4464 • with profile LSI = 0.3971 • LSI shows no improvement over original profiles • Some topics conceivably have common interests • "hydrogen energy"; "hydrogen fuel automobiles"; "hybrid fuel cars" • "clothing sweatshops"; "human smuggling" • But too little training overlap?
Conclusions • LSI can improve filtering performance • but might not, if SVD can't find anything to work with • LSI of profiles is much cheaper to compute than LSI of a whole collection (or even a sample!)
Current and Future Work • Looking at other collections • More TREC! • Reuters-21578 • Collaborative filtering collections... such as? • Looking at other techniques • Comparison to collaboration alone? • Other methods of combining content and collaboration