
Semantic, Hierarchical, Online Clustering of Web Search Results


Presentation Transcript


  1. Semantic, Hierarchical, Online Clustering of Web Search Results. Yisheng Dong. Presenter: 조이현

  2. Overview • Previous step result • Identifying base cluster • Basic idea • Basic definition • Orthogonal clustering • Determine cluster number • Combining base clusters • Prototype system • Conclusion

  3. Previous step result • Term-document matrix • Row vectors represent the terms (key phrases). • Column vectors represent the documents. • The element A(i, j) = 1 if the i-th term Ti occurs in the j-th document Dj, and 0 otherwise. A is the m×n term-document matrix.
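
A minimal sketch of building such a matrix (the terms, documents, and plain substring matching below are hypothetical stand-ins for the paper's suffix-array key-phrase extraction):

    import numpy as np

    # Hypothetical terms (key phrases) and documents; substring matching
    # stands in for the paper's key-phrase extraction step.
    terms = ["object oriented", "programming", "analysis"]
    docs = [
        "object oriented programming in java",
        "an introduction to object oriented analysis",
        "functional programming basics",
    ]

    # A(i, j) = 1 if term T_i occurs in document D_j, else 0.
    A = np.array([[1.0 if t in d else 0.0 for d in docs] for t in terms])
    print(A.shape)  # (m, n) = (3, 3)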

  4. Basic idea [Figure: bipartite graph of the association between terms (key phrases) and documents] • Terms (documents) linked to the same document (term) should be semantically close. • Densely linked terms or documents should be grouped together.

  5. Definitions concerning clusters • Cluster vector (xg) • Cg is a cluster of m objects t1, t2, ···, tm. • xg denotes the cluster vector of Cg. • xg is an m-dimensional vector with |xg| = 1. • xg(i) represents the intensity with which ti belongs to Cg. • Cluster density • Assume xg (yg) is a cluster vector over the row (column) vectors of A. • The cluster density of xg (yg) is |xgᵀA| (respectively |Ayg|).
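
A short sketch of these definitions, assuming the small hypothetical matrix A below:

    import numpy as np

    # Rows = terms, columns = documents (hypothetical values).
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])

    # x_g is a unit cluster vector over the m terms; x_g[i] is the
    # intensity with which term t_i belongs to cluster C_g.
    x_g = np.array([1.0, 1.0, 0.0])
    x_g /= np.linalg.norm(x_g)             # enforce |x_g| = 1
    row_density = np.linalg.norm(x_g @ A)  # |x_g^T A|

    # y_g is a unit cluster vector over the n documents.
    y_g = np.array([1.0, 0.0, 0.0])
    col_density = np.linalg.norm(A @ y_g)  # |A y_g|
    print(row_density, col_density)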

  6. Eigenvalue & eigenvector • A is a linear transformation represented by a matrix A. • If Ax = λx for a non-zero vector x, then λ is an eigenvalue of A and x is a corresponding (right) eigenvector.
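
A quick numerical check of the definition (the matrix is an arbitrary example; symmetric matrices are the case the later slides need, via AAᵀ):

    import numpy as np

    M = np.array([[2.0, 1.0],
                  [1.0, 2.0]])
    eigvals, eigvecs = np.linalg.eigh(M)  # eigenvalues in ascending order
    lam, x = eigvals[-1], eigvecs[:, -1]  # largest eigenpair
    assert np.allclose(M @ x, lam * x)    # Ax = lambda * x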

  7. Orthogonal clustering def. (1/2) • Let x1 be the cluster vector with maximum density and x2 another cluster vector; in the slide's decomposition, η is x2's component along x1. • Clusters with high density capture the main implicit concepts. • The larger η, the higher the cluster density of x2, so with no constraint on x2 it will be arbitrarily close to x1. • To obtain meaningful distinct clusters, x2 should be orthogonal to x1.

  8. Orthogonal clustering def. (2/2) • The orthogonal clustering of the row (column) vectors of A is the discovery of a set of cluster vectors x1, x2, ···, xk. • xg (1 ≤ g ≤ k) is the cluster vector with maximum density subject to being orthogonal to x1, ···, xg-1.

  9. Finding the solution (1/3) (orthogonal clustering problem) • Rayleigh quotient def.: R(M, x) = xᵀMx / xᵀx. • M is a real m×m symmetric matrix. • λ1 ≥ λ2 ≥ ··· ≥ λm are the eigenvalues of M. • p1, p2, ···, pm are the orthonormal eigenvectors corresponding to these eigenvalues. • Theorem 1: the maximum of R(M, x) over x orthogonal to p1, ···, pg-1 is λg, attained at x = pg.
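
A numerical sketch of Theorem 1 on a random symmetric matrix: the Rayleigh quotient at the g-th eigenvector equals λg, the maximum attainable subject to orthogonality against p1, ···, pg-1:

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.standard_normal((4, 4))
    M = B @ B.T                        # random real symmetric matrix
    lam, P = np.linalg.eigh(M)
    lam, P = lam[::-1], P[:, ::-1]     # sort eigenpairs in descending order

    for g in range(4):
        p = P[:, g]
        # Rayleigh quotient R(M, p_g) = p^T M p / p^T p = lambda_g
        assert np.isclose(p @ M @ p / (p @ p), lam[g])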

  10. Finding the solution (2/3) (orthogonal clustering problem) • SVD (singular value decomposition) def. • A is an m×n matrix with rank(A) = r. • λ1 ≥ λ2 ≥ ··· ≥ λr > 0 are the r non-zero eigenvalues of AAᵀ. • x1, x2, ···, xm (y1, y2, ···, yn) are the orthonormal eigenvectors of AAᵀ (AᵀA), called the left (right) singular vectors of A. • U = [x1, x2, ···, xm], V = [y1, y2, ···, yn]. • σg = √λg is called the g-th singular value of A.

  11. Finding the solution (3/3) (orthogonal clustering problem) • Theorem 2 • The left (right) singular vectors of A are the cluster vectors discovered through orthogonal clustering of the row (column) vectors of A. • Proof sketch • xg must have maximum density subject to being orthogonal to x1, ···, xg-1 (by the definition of orthogonal clustering). • Since |xgᵀA|² = xgᵀAAᵀxg, this is the Rayleigh quotient problem of Theorem 1, so xg must be the g-th eigenvector pg of AAᵀ.
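
A sketch of Theorem 2 using numpy's SVD: the left singular vectors are the orthogonal cluster vectors of A's rows, and each cluster's density is its singular value:

    import numpy as np

    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    for g, sigma in enumerate(S):
        x_g = U[:, g]                      # g-th term-cluster vector
        density = np.linalg.norm(x_g @ A)  # |x_g^T A|
        assert np.isclose(density, sigma)  # density equals sigma_g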

  12. Determine cluster number (1/2) • Cluster matrix def. • The left singular vector xg and right singular vector yg share the same singular value σg, so the term cluster described by xg and the document cluster described by yg in fact have the same "meaning".
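
A hypothetical sketch of reading off such a paired cluster: the term cluster comes from xg and the document cluster from yg of the same singular triplet (the term/document names and the top-2 cutoff are illustrative assumptions):

    import numpy as np

    terms = ["object oriented", "programming", "analysis"]
    docs = ["doc0", "doc1", "doc2"]
    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    U, S, Vt = np.linalg.svd(A, full_matrices=False)

    g = 0                                 # densest cluster
    x_g, y_g = U[:, g], Vt[g, :]
    top_terms = [terms[i] for i in np.argsort(-np.abs(x_g))[:2]]
    top_docs = [docs[j] for j in np.argsort(-np.abs(y_g))[:2]]
    print(top_terms, top_docs)            # two views of the same concept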

  13. Determine cluster number (2/2) • Orthogonal clustering quality def.: q(A, k) measures how much of A's structure the first k clusters capture. • Given a quality threshold q* (e.g. 80%), the ideal cluster number k* is the minimum k satisfying q(A, k) ≥ q*.
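
The slide's exact formula for q(A, k) is not preserved in the transcript; a natural choice, assumed in this sketch, is the fraction of the total squared cluster density Σσg² (i.e. of ‖A‖F²) captured by the first k clusters:

    import numpy as np

    def ideal_cluster_number(A, q_star=0.8):
        # Assumed quality: q(A, k) = sum(sigma_1..k ^ 2) / sum(sigma_1..r ^ 2).
        sigma = np.linalg.svd(A, compute_uv=False)
        quality = np.cumsum(sigma**2) / np.sum(sigma**2)
        # Minimum k with q(A, k) >= q* (indices are 0-based, hence + 1).
        return int(np.searchsorted(quality, q_star) + 1)

    A = np.array([[1.0, 1.0, 0.0],
                  [1.0, 0.0, 1.0],
                  [0.0, 1.0, 0.0]])
    print(ideal_cluster_number(A, q_star=0.8))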

  14. Combining base clusters

  Combining base clusters X and Y:

    if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
        X and Y are merged into one cluster;
    } else if ( |X| > |Y| ) {
        if ( |X ∩ Y| / |Y| > t2 ) { let Y become X's child; }
    } else {
        if ( |X ∩ Y| / |X| > t2 ) { let X become Y's child; }
    }

  Merging labels:

    if ( label_x is a substring of label_y ) {
        label_xy = label_y;
    } else if ( label_y is a substring of label_x ) {
        label_xy = label_x;
    } else {
        label_xy = label_x + " " + label_y;
    }
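
A minimal runnable sketch of the combining rules above; representing base clusters as sets of document ids and the default thresholds t1, t2 are assumptions:

    def merge_label(label_x, label_y):
        # Substring rule from the slide; otherwise concatenate the labels.
        if label_x in label_y:
            return label_y
        if label_y in label_x:
            return label_x
        return label_x + " " + label_y

    def combine(x_docs, y_docs, label_x, label_y, t1=0.5, t2=0.75):
        inter = len(x_docs & y_docs)
        if inter / len(x_docs | y_docs) > t1:
            return ("merge", merge_label(label_x, label_y))
        if len(x_docs) > len(y_docs):
            if inter / len(y_docs) > t2:
                return ("y_child_of_x", label_x)
        elif inter / len(x_docs) > t2:
            return ("x_child_of_y", label_y)
        return ("unrelated", None)

    # Heavily overlapping clusters are merged and their labels combined.
    print(combine({1, 2, 3}, {1, 2, 3, 4},
                  "object oriented", "object oriented programming"))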

  15. Prototype system • Created a prototype system named WICE (Web Information Clustering Engine). • It deals well with the special problems of clustering Chinese text. • Output for the query "object oriented": • object oriented programming • object oriented analysis, etc.

  16. Conclusion • Main contributions • The benefit of using key phrases. • A suffix-array-based method for key phrase extraction. • The concept of orthogonal clustering. • The WICE system was designed and implemented. • Further work • Further experiments. • Detailed analysis and interpretation of the experimental results. • Comparison with other clustering algorithms.
