
Semantic, Hierarchical, Online Clustering of Web Search Results


Presentation Transcript


  1. Semantic, Hierarchical, Online Clustering of Web Search Results. Yisheng Dong. Presenter: 조이현

  2. Overview • Previous step result • Identifying base cluster • Basic idea • Basic definition • Orthogonal clustering • Determine cluster number • Combining base clusters • Prototype system • Conclusion

  3. Previous step result • Term-document matrix • Row vectors represent the terms (key phrases). • Column vectors represent the documents. • The element A(i, j) = 1 if the i-th term Ti occurs in the j-th document Dj, and 0 otherwise. A is the m×n term-document matrix.
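
A minimal sketch of building such a matrix (the terms, documents, and plain substring matching below are hypothetical stand-ins for the paper's suffix-array key-phrase extraction):

    import numpy as np

    # Hypothetical terms (key phrases) and documents; substring matching
    # stands in for the paper's key-phrase extraction step.
    terms = ["object oriented", "programming", "analysis"]
    docs = [
        "object oriented programming in java",
        "an introduction to object oriented analysis",
        "functional programming basics",
    ]

    # A(i, j) = 1 if term T_i occurs in document D_j, else 0.
    A = np.array([[1.0 if t in d else 0.0 for d in docs] for t in terms])
    print(A.shape)  # (m, n) = (3, 3)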

  4. Basic idea [Figure: bipartite graph of the association between terms (key phrases) and documents] • Terms (documents) linked to the same document (term) should be semantically close. • Densely linked terms or documents should be grouped together.

  5. Definitions concerning clusters • Cluster vector (xg) • Cg is a cluster of m objects t1, t2, ···, tm. • xg denotes the cluster vector of Cg. • xg is an m-dimensional vector with |xg| = 1. • xg(i) represents the intensity with which ti belongs to Cg. • Cluster density • Assume xg (yg) is a cluster vector over the row (column) vectors of A. • The cluster density of xg (yg) is |xgᵀA| (respectively |Ayg|).
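
A short sketch of these definitions, assuming the small hypothetical matrix A below:

    import numpy as np

    # Rows = terms, columns = documents (hypothetical values).
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])

    # x_g is a unit cluster vector over the m terms; x_g[i] is the
    # intensity with which term t_i belongs to cluster C_g.
    x_g = np.array([1.0, 1.0, 0.0])
    x_g /= np.linalg.norm(x_g)             # enforce |x_g| = 1
    row_density = np.linalg.norm(x_g @ A)  # |x_g^T A|

    # y_g is a unit cluster vector over the n documents.
    y_g = np.array([1.0, 0.0, 0.0])
    col_density = np.linalg.norm(A @ y_g)  # |A y_g|
    print(row_density, col_density)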

  6. Eigenvalue & eigenvector • A is a linear transformation represented by a matrix A. • If Ax = λx for a non-zero vector x, then λ is an eigenvalue of A and x is a corresponding (right) eigenvector.
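
A quick numerical check of the definition (the matrix is an arbitrary example; symmetric matrices are the case the later slides need, via AAᵀ):

    import numpy as np

    M = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    lam, x = eigvals[-1], eigvecs[:, -1]  # largest eigenpair
    assert np.allclose(M @ x, lam * x)    # Ax = lambda * x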

  7. Orthogonal clustering def. (1/2) • Let x1 be the cluster vector with maximum density and x2 another cluster vector; in the slide's decomposition, η is x2's component along x1. • Clusters with high density capture the main implicit concepts. • The larger η, the higher the cluster density of x2, so with no constraint on x2 it will be arbitrarily close to x1. • To obtain meaningful distinct clusters, x2 should be orthogonal to x1.

  8. Orthogonal clustering def. (2/2) • The orthogonal clustering of the row (column) vectors of A is the discovery of a set of cluster vectors x1, x2, ···, xk. • xg (1 ≤ g ≤ k) is the cluster vector with maximum density subject to being orthogonal to x1, ···, xg-1.

  9. Finding the solution (1/3) (orthogonal clustering problem) • Rayleigh quotient def.: R(M, x) = xᵀMx / xᵀx. • M is a real m×m symmetric matrix. • λ1 ≥ λ2 ≥ ··· ≥ λm are the eigenvalues of M. • p1, p2, ···, pm are the orthonormal eigenvectors corresponding to these eigenvalues. • Theorem 1: the maximum of R(M, x) over x orthogonal to p1, ···, pg-1 is λg, attained at x = pg.
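
A numerical sketch of Theorem 1 on a random symmetric matrix: the Rayleigh quotient at the g-th eigenvector equals λg, the maximum attainable subject to orthogonality against p1, ···, pg-1:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    M = B @ B.T                        # random real symmetric matrix
    lam, P = np.linalg.eigh(M)
    lam, P = lam[::-1], P[:, ::-1]     # sort eigenpairs in descending order

    for g in range(4):
        p = P[:, g]
        # Rayleigh quotient R(M, p_g) = p^T M p / p^T p = lambda_g
        assert np.isclose(p @ M @ p / (p @ p), lam[g])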

  10. Finding the solution (2/3) (orthogonal clustering problem) • SVD (singular value decomposition) def. • A is an m×n matrix with rank(A) = r. • λ1 ≥ λ2 ≥ ··· ≥ λr > 0 are the r non-zero eigenvalues of AAᵀ. • x1, x2, ···, xm (y1, y2, ···, yn) are the orthonormal eigenvectors of AAᵀ (AᵀA), called the left (right) singular vectors of A. • U = [x1, x2, ···, xm], V = [y1, y2, ···, yn]. • σg = √λg is called the g-th singular value of A.

  11. Finding the solution (3/3) (orthogonal clustering problem) • Theorem 2 • The left (right) singular vectors of A are the cluster vectors discovered through orthogonal clustering of the row (column) vectors of A. • Proof sketch • xg must have maximum density subject to being orthogonal to x1, ···, xg-1 (by the definition of orthogonal clustering). • Since |xgᵀA|² = xgᵀAAᵀxg, this is the Rayleigh quotient problem of Theorem 1, so xg must be the g-th eigenvector pg of AAᵀ.
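
A sketch of Theorem 2 using numpy's SVD: the left singular vectors are the orthogonal cluster vectors of A's rows, and each cluster's density is its singular value:

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    for g, sigma in enumerate(S):
        x_g = U[:, g]                      # g-th term-cluster vector
        density = np.linalg.norm(x_g @ A)  # |x_g^T A|
        assert np.isclose(density, sigma)  # density equals sigma_g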

  12. Determine cluster number (1/2) • Cluster matrix def. • The left singular vector xg and right singular vector yg share the same singular value σg, so the term cluster described by xg and the document cluster described by yg in fact have the same "meaning".
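
A hypothetical sketch of reading off such a paired cluster: the term cluster comes from xg and the document cluster from yg of the same singular triplet (the term/document names and the top-2 cutoff are illustrative assumptions):

    import numpy as np

    terms = ["object oriented", "programming", "analysis"]
    docs = ["doc0", "doc1", "doc2"]
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    g = 0                                 # densest cluster
    x_g, y_g = U[:, g], Vt[g, :]
    top_terms = [terms[i] for i in np.argsort(-np.abs(x_g))[:2]]
    top_docs = [docs[j] for j in np.argsort(-np.abs(y_g))[:2]]
    print(top_terms, top_docs)            # two views of the same concept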

  13. Determine cluster number (2/2) • Orthogonal clustering quality def.: q(A, k) measures how much of A's structure the first k clusters capture. • Given a quality threshold q* (e.g. 80%), the ideal cluster number k* is the minimum k satisfying q(A, k) ≥ q*.
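
The slide's exact formula for q(A, k) is not preserved in the transcript; a natural choice, assumed in this sketch, is the fraction of the total squared cluster density Σσg² (i.e. of ‖A‖F²) captured by the first k clusters:

    import numpy as np

    def ideal_cluster_number(A, q_star=0.8):
        # Assumed quality: q(A, k) = sum(sigma_1..k ^ 2) / sum(sigma_1..r ^ 2).
        sigma = np.linalg.svd(A, compute_uv=False)
        quality = np.cumsum(sigma**2) / np.sum(sigma**2)
        # Minimum k with q(A, k) >= q* (indices are 0-based, hence + 1).
        return int(np.searchsorted(quality, q_star) + 1)

    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    print(ideal_cluster_number(A, q_star=0.8))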

  14. Combining base clusters

  Combining base clusters X and Y:

    if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
        X and Y are merged into one cluster;
    } else if ( |X| > |Y| ) {
        if ( |X ∩ Y| / |Y| > t2 ) { let Y become X's child; }
    } else {
        if ( |X ∩ Y| / |X| > t2 ) { let X become Y's child; }
    }

  Merging labels:

    if ( label_x is a substring of label_y ) {
        label_xy = label_y;
    } else if ( label_y is a substring of label_x ) {
        label_xy = label_x;
    } else {
        label_xy = label_x + " " + label_y;
    }
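
A minimal runnable sketch of the combining rules above; representing base clusters as sets of document ids and the default thresholds t1, t2 are assumptions:

    def merge_label(label_x, label_y):
        # Substring rule from the slide; otherwise concatenate the labels.
        if label_x in label_y:
            return label_y
        if label_y in label_x:
            return label_x
        return label_x + " " + label_y

    def combine(x_docs, y_docs, label_x, label_y, t1=0.5, t2=0.75):
        inter = len(x_docs & y_docs)
        if inter / len(x_docs | y_docs) > t1:
            return ("merge", merge_label(label_x, label_y))
        if len(x_docs) > len(y_docs):
            if inter / len(y_docs) > t2:
                return ("y_child_of_x", label_x)
        elif inter / len(x_docs) > t2:
            return ("x_child_of_y", label_y)
        return ("unrelated", None)

    # Heavily overlapping clusters are merged and their labels combined.
    print(combine({1, 2, 3}, {1, 2, 3, 4},
                  "object oriented", "object oriented programming"))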

  15. Prototype system • Created a prototype system named WICE (Web Information Clustering Engine). • It deals well with the special problems of clustering Chinese text. • Output for the query "object oriented": • object oriented programming • object oriented analysis, etc.

  16. Conclusion • Main contributions • The benefit of using key phrases. • A suffix-array-based method for key phrase extraction. • The concept of orthogonal clustering. • The WICE system was designed and implemented. • Further work • Further experiments. • Detailed analysis and interpretation of the experimental results. • Comparison with other clustering algorithms.
