Automatic Collection “Recruiter” Shuang Song
Project Goal • Given a collection, automatically suggest other items to add to the collection • Design a process to achieve the task • Apply different filtering algorithms • Evaluate the result
The Process [Diagram: (1) query terms and training sets are derived from the collection, (2) the query terms are submitted to an external source, (3) the query results are filtered into new items] • Tokenization and frequency counting • New items extraction • New items filtering and ranking
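The deck contains no code; below is a minimal Python sketch of steps 1 and 2 of the process, tokenizing the collection and counting term frequencies to derive query terms. `search_external_source` is a hypothetical placeholder for step 3, not part of the original system.

```python
# Minimal sketch of steps 1-2: tokenize the collection, count term
# frequencies, take the top terms as query terms.
from collections import Counter
import re

def extract_query_terms(collection_docs, k=10):
    """Return the k most frequent terms across the collection."""
    counts = Counter()
    for doc in collection_docs:
        tokens = re.findall(r"[a-z]+", doc.lower())  # simple word tokenizer
        counts.update(tokens)
    return counts.most_common(k)

# Example use, mirroring the (term, frequency) pairs on the query-terms slide:
# query_terms = extract_query_terms(collection_docs)
# candidates = search_external_source([t for t, _ in query_terms])  # hypothetical step 3
```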
Filtering Algorithms • Latent Semantic Analysis (LSA) • Pre-processing, no stemming • SVD over the term-by-document matrix • Pseudo-document representation of new items • Gzip compression algorithm
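A minimal sketch of the LSA bullets, assuming a plain term-frequency term-by-document matrix A (terms x documents) built from the collection; the truncated SVD and the standard folding-in construction give the pseudo-document representation of a new item. The rank k and other details are illustrative.

```python
import numpy as np

def lsa_fit(A, k):
    """Truncated rank-k SVD of the term-by-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(q, U_k, s_k):
    """Project a new item's term-frequency vector q into the k-dimensional
    LSA feature space (the standard pseudo-document construction)."""
    return (U_k.T @ q) / s_k

# U_k, s_k, Vt_k = lsa_fit(A, k=...)   # columns of Vt_k represent the collection docs
# v_star = fold_in(q_new, U_k, s_k)    # pseudo-document vector for a candidate item
```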
Relevance Measure - LSA [Diagram: the collection signature vector V and a new item's pseudo-document vector V* compared in the LSA feature space]
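The slide does not say how the collection signature vector V is built; a natural assumption, used in this sketch, is the centroid of the collection's document vectors in the LSA space, with R_LSA taken as the cosine between V and a candidate's pseudo-document vector V*.

```python
import numpy as np

def cosine(v, v_star):
    """Cosine similarity between two vectors in the LSA feature space."""
    return float(v @ v_star / (np.linalg.norm(v) * np.linalg.norm(v_star)))

# V = Vt_k.mean(axis=1)       # assumed: centroid signature of the collection
# r_lsa = cosine(V, v_star)   # relevance score for one candidate item
```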
First Experiment – Math Forum Collection • 19 courseware items in the collection • 10 items in the experiment set • First 5 from the Math Forum • The other 5 from other collections on www.smete.org
Second Experiment – Collaborative Filtering Collection • 12 papers in the collection • 11 items in the experiment set • First 10 from Citeseer • Query terms submitted: (information 284) (algorithm 250) (ratings 217) (filtering 159) (system 197) (query 149) (reputation 114) (reviewer 109) (collaborative 106) (recommendations 98) • The last item is the paper we read in class: “An Algorithm for Automated Rating of Reviewers”
Second Experiment – User Study • 6 people in my research lab participated in this study • 3 of them with IR background • 3 of them without IR background • They were asked to rate the 11 items in the experiment set according to their degree of relevance to the given collection
Second Experiment Results – comparison of runs with and without SVD and with and without term weightings [chart]
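The slide does not name the weighting scheme; log-entropy weighting is a common choice for LSA and is used here purely as an illustration of what a "with weightings" run might apply to A before the SVD.

```python
import numpy as np

def log_entropy_weight(A):
    """Return A with log local weighting and entropy global weighting,
    a common (assumed, not confirmed by the slides) LSA weighting scheme."""
    p = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # P(doc | term)
    n_docs = A.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        # entropy global weight: 1 + sum_j p_ij log p_ij / log n
        ent = 1 + np.nansum(p * np.log(p), axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(A) * ent
```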
Second Experiment – precision and recall (cutoff: R_LSA > 0.5 & R_gzip > 0.2) [chart]
Second Experiment – precision and recall (cutoff: R_LSA > 0.4 & R_gzip > 0.17) [chart]
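A sketch of how the reported precision and recall could be computed from the user-study ratings at a given cutoff; the AND of the two thresholds follows the "&" on the slides, while the `scores` and `relevant` structures are assumed for illustration.

```python
def precision_recall(scores, relevant, lsa_cut, gzip_cut):
    """scores: item -> (R_LSA, R_gzip); relevant: set of items judged relevant."""
    accepted = {i for i, (r_lsa, r_gzip) in scores.items()
                if r_lsa > lsa_cut and r_gzip > gzip_cut}
    tp = len(accepted & relevant)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall(scores, relevant, lsa_cut=0.5, gzip_cut=0.2)   # first cutoff
# precision_recall(scores, relevant, lsa_cut=0.4, gzip_cut=0.17)  # looser cutoff
```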
Comparison of Two Filtering Algorithms • Gzip works well when the input documents are just abstracts, while LSA works for both abstracts and full text • LSA captures word association patterns and statistical importance; gzip scans for repetition only • LSA is computationally demanding, while gzip is simple • Effectiveness
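The slides do not give the exact gzip score, so the sketch below is one plausible formulation, assumed here: a candidate is scored by how much better it compresses when concatenated with the collection text, i.e. by the repetition it shares with the collection.

```python
import zlib

def gzip_relevance(collection_text, item_text):
    """Assumed score: relative compression gain the item gets from the collection."""
    c = lambda s: len(zlib.compress(s.encode("utf-8")))
    alone = c(item_text)
    together = c(collection_text + item_text) - c(collection_text)
    return (alone - together) / alone  # fraction of the item "explained" by the collection

# r_gzip = gzip_relevance(collection_text, candidate_abstract)
```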
To-Do List and Future Work • Obtain accurate and trustworthy evaluation from experts (the collection owner?) • Automatically extract full text and abstracts from Citeseer