Automatic Collection “Recruiter” Shuang Song
Project Goal • Given a collection, automatically suggest other items to add to the collection • Design a process to achieve the task • Apply different filtering algorithms • Evaluate the result
The Process [Diagram: (1) query terms and training sets are derived from the collection, (2) the query terms are submitted to an external source, (3) the query results are filtered into new items] • Tokenization and frequency counting • New items extraction • New items filtering and ranking
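The deck contains no code; below is a minimal Python sketch of steps 1 and 2 of the process, tokenizing the collection and counting term frequencies to derive query terms. `search_external_source` is a hypothetical placeholder for step 3, not part of the original system.

```python
# Minimal sketch of steps 1-2: tokenize the collection, count term
# frequencies, take the top terms as query terms.
from collections import Counter
import re

def extract_query_terms(collection_docs, k=10):
    """Return the k most frequent terms across the collection."""
    counts = Counter()
    for doc in collection_docs:
        tokens = re.findall(r"[a-z]+", doc.lower())  # simple word tokenizer
        counts.update(tokens)
    return counts.most_common(k)

# Example use, mirroring the (term, frequency) pairs on the query-terms slide:
# query_terms = extract_query_terms(collection_docs)
# candidates = search_external_source([t for t, _ in query_terms])  # hypothetical step 3
```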
Filtering Algorithms • Latent Semantic Analysis (LSA) • Pre-processing, no stemming • SVD over the term-by-document matrix • Pseudo-document representation of new items • Gzip compression algorithm
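A minimal sketch of the LSA bullets, assuming a plain term-frequency term-by-document matrix A (terms x documents) built from the collection; the truncated SVD and the standard folding-in construction give the pseudo-document representation of a new item. The rank k and other details are illustrative.

```python
import numpy as np

def lsa_fit(A, k):
    """Truncated rank-k SVD of the term-by-document matrix A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]

def fold_in(q, U_k, s_k):
    """Project a new item's term-frequency vector q into the k-dimensional
    LSA feature space (the standard pseudo-document construction)."""
    return (U_k.T @ q) / s_k

# U_k, s_k, Vt_k = lsa_fit(A, k=...)   # columns of Vt_k represent the collection docs
# v_star = fold_in(q_new, U_k, s_k)    # pseudo-document vector for a candidate item
```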
Relevance Measure - LSA [Diagram: the collection signature vector V and a new item's pseudo-document vector V* compared in the LSA feature space]
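The slide does not say how the collection signature vector V is built; a natural assumption, used in this sketch, is the centroid of the collection's document vectors in the LSA space, with R_LSA taken as the cosine between V and a candidate's pseudo-document vector V*.

```python
import numpy as np

def cosine(v, v_star):
    """Cosine similarity between two vectors in the LSA feature space."""
    return float(v @ v_star / (np.linalg.norm(v) * np.linalg.norm(v_star)))

# V = Vt_k.mean(axis=1)       # assumed: centroid signature of the collection
# r_lsa = cosine(V, v_star)   # relevance score for one candidate item
```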
First Experiment – Math Forum Collection • 19 courseware items in the collection • 10 items in the experiment set • First 5 from the Math Forum • The other 5 from other collections on www.smete.org
Second Experiment – Collaborative Filtering Collection • 12 papers in the collection • 11 items in the experiment set • First 10 from Citeseer • Query terms submitted: (information 284) (algorithm 250) (ratings 217) (filtering 159) (system 197) (query 149) (reputation 114) (reviewer 109) (collaborative 106) (recommendations 98) • The last item is the paper we read in class: “An Algorithm for Automated Rating of Reviewers”
Second Experiment – User Study • 6 people in my research lab participated in this study • 3 of them with IR background • 3 of them without IR background • They were asked to rate the 11 items in the experiment set according to their degree of relevance to the given collection
Second Experiment Results – comparison of runs with and without SVD and with and without term weightings [chart]
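The slide does not name the weighting scheme; log-entropy weighting is a common choice for LSA and is used here purely as an illustration of what a "with weightings" run might apply to A before the SVD.

```python
import numpy as np

def log_entropy_weight(A):
    """Return A with log local weighting and entropy global weighting,
    a common (assumed, not confirmed by the slides) LSA weighting scheme."""
    p = A / np.maximum(A.sum(axis=1, keepdims=True), 1e-12)  # P(doc | term)
    n_docs = A.shape[1]
    with np.errstate(divide="ignore", invalid="ignore"):
        # entropy global weight: 1 + sum_j p_ij log p_ij / log n
        ent = 1 + np.nansum(p * np.log(p), axis=1, keepdims=True) / np.log(n_docs)
    return np.log1p(A) * ent
```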
Second Experiment – precision and recall (cutoff: R_LSA > 0.5 & R_gzip > 0.2) [chart]
Second Experiment – precision and recall (cutoff: R_LSA > 0.4 & R_gzip > 0.17) [chart]
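A sketch of how the reported precision and recall could be computed from the user-study ratings at a given cutoff; the AND of the two thresholds follows the "&" on the slides, while the `scores` and `relevant` structures are assumed for illustration.

```python
def precision_recall(scores, relevant, lsa_cut, gzip_cut):
    """scores: item -> (R_LSA, R_gzip); relevant: set of items judged relevant."""
    accepted = {i for i, (r_lsa, r_gzip) in scores.items()
                if r_lsa > lsa_cut and r_gzip > gzip_cut}
    tp = len(accepted & relevant)
    precision = tp / len(accepted) if accepted else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall(scores, relevant, lsa_cut=0.5, gzip_cut=0.2)   # first cutoff
# precision_recall(scores, relevant, lsa_cut=0.4, gzip_cut=0.17)  # looser cutoff
```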
Comparison of Two Filtering Algorithms • Gzip works well when the input documents are just abstracts, while LSA works for both abstracts and full text • LSA captures word association patterns and statistical importance; gzip scans for repetition only • LSA is computationally demanding, while gzip is simple • Effectiveness
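The slides do not give the exact gzip score, so the sketch below is one plausible formulation, assumed here: a candidate is scored by how much better it compresses when concatenated with the collection text, i.e. by the repetition it shares with the collection.

```python
import zlib

def gzip_relevance(collection_text, item_text):
    """Assumed score: relative compression gain the item gets from the collection."""
    c = lambda s: len(zlib.compress(s.encode("utf-8")))
    alone = c(item_text)
    together = c(collection_text + item_text) - c(collection_text)
    return (alone - together) / alone  # fraction of the item "explained" by the collection

# r_gzip = gzip_relevance(collection_text, candidate_abstract)
```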
To-Do List and Future Work • Obtain accurate and trustworthy evaluation from experts (the collection owner?) • Automatically extract full text and abstracts from Citeseer