Learning to Cluster Web Search Results. Hua-Jun Zeng, Qi-Cai He, Zheng Chen, Wei-Ying Ma, Jinwen Ma
Contents • Motivation • Algorithm - Salient phrases extraction - Learning to rank salient phrases • Experiments • Conclusion
Motivation • Organizing Web search results into clusters facilitates users' quick browsing through search results. • Traditional clustering techniques don't generate clusters with highly readable names → salient phrases
Problem formalization and algorithm • The algorithm is composed of four steps: 1. Search result fetching 2. Document parsing and phrase property calculation 3. Salient phrase ranking 4. Post-processing
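The four steps above can be sketched as a simple pipeline. This is only an illustrative skeleton: the step functions (`fetch`, `parse`, `rank`, `postprocess`) are hypothetical placeholders, not interfaces from the paper.

```python
def cluster_search_results(query, fetch, parse, rank, postprocess):
    """Sketch of the four-step pipeline; the step functions are
    hypothetical placeholders supplied by the caller."""
    docs = fetch(query)               # 1. fetch search results for the query
    candidates = parse(docs)          # 2. parse documents, compute phrase properties
    ranked = rank(candidates)         # 3. rank candidate salient phrases
    return postprocess(ranked, docs)  # 4. assign documents to the top phrases
```

For example, plugging in trivial lambdas shows the data flow: `cluster_search_results("q", lambda q: ["b", "a"], lambda d: d, sorted, lambda r, d: r)` returns the ranked list `["a", "b"]`.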
Salient Phrases extraction • Phrase Frequency/Inverted Document Frequency (TFIDF) • Phrase Length • Intra-Cluster Similarity • Cluster Entropy • Phrase Independence
TFIDF • TFIDF(w) = f(w) · log(N / |D(w)|), where w = the current phrase, D(w) = the set of documents that contain w, N = the total number of documents, f(w) = the frequency of w in the result set
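A minimal Python sketch of the TFIDF property above. The substring-based matching is a simplification for illustration (real phrase matching would use tokenized n-grams).

```python
import math

def phrase_tfidf(phrase, documents):
    """TFIDF of a candidate phrase over the result set:
    f(w) * log(N / |D(w)|), with naive substring matching."""
    n = len(documents)
    f_w = sum(doc.count(phrase) for doc in documents)       # f(w)
    d_w = sum(1 for doc in documents if phrase in doc)      # |D(w)|
    if d_w == 0:
        return 0.0
    return f_w * math.log(n / d_w)

docs = ["web search results clustering",
        "search engine results",
        "image retrieval"]
score = phrase_tfidf("search", docs)  # appears twice, in 2 of 3 docs
```

Here `score` is 2 · log(3/2): the phrase occurs twice overall and in two of the three documents.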
Phrase Length • Generally, a longer name is preferred for users’ browsing
Intra-Cluster Similarity (ICS) • First, convert documents into vectors • For each candidate cluster, we then calculate its centroid as the average of its document vectors • ICS is calculated as the average cosine similarity between the documents and the centroid
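The ICS computation above can be sketched directly, assuming documents are already converted to numeric vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    num = sum(a * b for a, b in zip(u, v))
    du = math.sqrt(sum(a * a for a in u))
    dv = math.sqrt(sum(b * b for b in v))
    return num / (du * dv) if du and dv else 0.0

def intra_cluster_similarity(vectors):
    """ICS: average cosine similarity between each document
    vector and the cluster centroid (the mean vector)."""
    n = len(vectors)
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / n for i in range(dim)]
    return sum(cosine(v, centroid) for v in vectors) / n

vecs = [[1.0, 0.0], [0.0, 1.0]]
ics = intra_cluster_similarity(vecs)  # centroid is [0.5, 0.5]
```

For two orthogonal unit vectors the centroid sits between them, so ICS is cos(45°) ≈ 0.707; a cluster of identical vectors would score 1.0.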
Phrase Independence • A phrase is independent when the entropy of its context is high (i.e., the left and right contexts are random enough). We use IND to measure the independence of phrases.
• The IND value for the right context can be calculated similarly • The final IND value is the average of the two
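A sketch of the independence measure described above: IND is the average of the entropies of the words seen immediately to the left and right of the phrase. The sample contexts are illustrative only.

```python
import math
from collections import Counter

def context_entropy(contexts):
    """Shannon entropy (bits) of the word distribution
    appearing adjacent to a phrase."""
    counts = Counter(contexts)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def independence(left_contexts, right_contexts):
    """IND: average of left- and right-context entropies."""
    return (context_entropy(left_contexts) + context_entropy(right_contexts)) / 2

# Varied left context (entropy 2 bits), fixed right context (entropy 0 bits)
ind = independence(["fast", "slow", "new", "old"],
                   ["car", "car", "car", "car"])
```

Here the left context is uniformly random over four words (2 bits) while the right context is always the same word (0 bits), so IND = 1.0; a phrase like "car" embedded in the fixed collocation "car dealer" would get a low IND and be penalized as a cluster name.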
Learning to rank salient phrases • Regression tries to determine the relationship between two random variables x = (x1, x2, …, xp) and y, where x = (TFIDF, LEN, ICS, CE, IND) • Linear Regression
Experiments • For each query, 200 documents are returned from search engines. • Extract all n-grams (n ≤ 3) from the documents. • Use SVM_Light [2] to do support vector regression.
Experiments (cont'd) • Evaluation Measure - precision (P) at top N results to measure the performance: P@N = (number of relevant phrases in the top N) / N
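The P@N measure above is straightforward to compute; the ranked phrases and relevance judgments below are made-up examples, not results from the paper.

```python
def precision_at_n(ranked_phrases, relevant, n):
    """P@N: fraction of the top-N ranked phrases judged relevant."""
    top = ranked_phrases[:n]
    return sum(1 for p in top if p in relevant) / n

ranked = ["search engine", "clustering", "free download", "web"]
good = {"search engine", "clustering", "web"}
p3 = precision_at_n(ranked, good, 3)  # 2 of the top 3 are relevant
```

With two of the top three phrases judged relevant, P@3 = 2/3 ≈ 0.667.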
Experiments(cont’d) • Training Data Collection
Experimental Results • Property Comparison.
Experimental Results (cont’d) • Learning Methods Comparison
Experimental Results (cont'd) • Learning Methods Comparison - the coefficients of one of the linear regression models are as follows: y = −0.427 + 0.146 · TFIDF + 0.241 · LEN − 0.022 · ICS + 0.065 · CE + 0.266 · IND
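Applying the reported coefficients is a single dot product. The feature values in the example call are invented for illustration; only the coefficients come from the slide.

```python
def salience_score(tfidf, length, ics, ce, ind):
    """Salience score from the reported linear-regression
    coefficients: y = -0.427 + 0.146*TFIDF + 0.241*LEN
    - 0.022*ICS + 0.065*CE + 0.266*IND."""
    return (-0.427 + 0.146 * tfidf + 0.241 * length
            - 0.022 * ics + 0.065 * ce + 0.266 * ind)

# Hypothetical (already-normalized) feature values for one phrase
example = salience_score(0.8, 0.5, 0.6, 0.4, 0.9)
```

Note the signs match the intuition behind the properties: longer, more frequent, and more independent phrases score higher, while ICS carries a small negative weight in this particular model.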
Experimental Results (cont'd) • Input Document Number
Experimental Results(cont’d) • Coverage and Overlap
Conclusion • Reformulates the search result clustering problem as a supervised salient phrase ranking problem. • Several properties and regression models are proposed to calculate salience scores for salient phrases.