1 / 32

Clustering Applications at Yahoo!

Clustering Applications at Yahoo!. Deepayan Chakrabarti (deepay@yahoo-inc.com). Outline. Clustering, in itself, is often not the primary problem Exploratory analysis is rarely needed Methods are often tied to the “final” goal. Data Problem Method Issues. Data Problem Method Issues.

Gabriel
Télécharger la présentation

Clustering Applications at Yahoo!

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Clustering Applications at Yahoo! Deepayan Chakrabarti (deepay@yahoo-inc.com)

  2. Outline • Clustering, in itself, is often not the primary problem • Exploratory analysis is rarely needed • Methods are often tied to the “final” goal Data Problem Method Issues Data Problem Method Issues Data Problem Method Issues …

  3. Outline • Advertiser-keyword graph(+ social networks) • Graph Partitioning • CTR predictions for ads on webpages • Co-clustering • Query refinement and suggestions • Local search methods • Conclusions

  4. Outline • Advertiser-keyword graph(+ social networks) • Graph Partitioning • CTR predictions for ads on webpages • Co-clustering • Query refinement and suggestions • Local search methods • Conclusions

  5. Graph Partitioning (Applications) • Find clusters of advertisers and keywords • Keyword suggestions • Running experiments on some “natural” clusters • Similar: • Y! Answers • Flickr bids Bidded Keyword Advertiser ~10M nodes

  6. Graph Partitioning (Applications) • Find clusters of IM users • Targeted advertising • Exploratory analysis • Clusters of the Web Graph • Distributed pagerank computation Who-messages-whom IM graph ~100M nodes

  7. Graph Partitioning (Methods) • Basic “Global” spectral partitioning [Ng+/01] • Find 2nd eigenvector of the graph Laplacian • This embeds all nodes on the real line • Split the line in two, to get two clusters • Can approximate the optimal conductance cut • For more clusters: • Use k eigenvectors (for known k), or • Split in two, and recurse on each cluster

  8. Graph Partitioning (Methods) • However, this has problems [Lang/05, Leskovec+/08]: • Min. conductance or quotient cuts lead to “small chunks” Better balance  worse cuts Large-sized low-quotient cuts are actually just unions of whiskers “Whiskers”

  9. Graph Partitioning (Methods) • However, this has problems: • Min. conductance or quotient cuts lead to “small chunks” • Balance is not very strongly encouraged • Recursive partitioning takes too long • Each eigenvector computation yields only a small cluster being broken off • Two alternatives: • More balanced cuts  recursion is faster • Unbalanced cuts, but much faster computation

  10. Perfect balance on real line: NP-Hard Perfect balance on hypersphere: SDP formulation [Lang/05] Graph Partitioning (Methods) • Achieving better balance • Combine algorithms (e.g., [Anderson+/08]) • METIS (more balanced cuts), followed by • Flow-based improvement (conductance) • Stronger balance constraints Many nodes One node Spectral embedding

  11. Graph Partitioning (Methods) • Faster Computation via “local” graph partitioning [Spielman+/04, Anderson+/06] • Pick seeds randomly • Build local clusters around seed • Bite off cluster, and repeat • Time for each iteration is proportional to the size of the local cluster (scalability) • Better for large graphs, or when not all clusters are needed

  12. Graph Partitioning (Issues) • Many complex networks are very different from planar/mesh-like networks • Good small cuts • Good large cuts are hard to find, and may not even exist • Hard to have a good hierarchical partitioning • Is conductance really the best objective? • METIS+flow finds lower conductance cuts, but • Local spectral methods finds more “tightly-knit” cuts [Leskovec+/08] • Is there a good compromise?

  13. Graph Partitioning (Issues) • Speed and Scalability • Most implementations have trouble with large graphs, or graphs with some extremely high-degree nodes • Map-reduce style partitioning algorithms?

  14. Outline • Advertiser-keyword graph(+ social networks) • Graph Partitioning • CTR predictions for ads on webpages • Co-clustering • Query refinement and suggestions • Local search methods • Conclusions

  15. Co-clustering (Applications) Ads • The Content Match problem • Predict click-thru rate (CTR) • Extreme sparsity • Few views • Even fewer clicks Webpages clicks = CTR views

  16. Co-clustering (Methods) Ad clusters Cell CTR = row effect + column effect + block effect • Approximate cell CTR using block CTR [Dhillon+/03] • Minimize divergence between the original matrix and its reconstruction • Combats sparsity Webpage clusters

  17. Co-clustering (Issues) • Picking the right number of clusters • Use MDL [Chakrabarti+/2004] • Handling new ads/pages • Combine co-clustering with feature-based prediction models [Agarwal+/2007] • Handling extra information • E.g., if each page and ad can be categorized into a taxonomy [Chakrabarti+/2007] • Can the taxonomy be automatically modified?

  18. Co-clustering (Issues) • Iterative process (hard for map-reduce) • Factor-3 approximation by just clustering webpages and ads separately [Dasgupta+/2008]

  19. Outline • Advertiser-keyword graph(+ social networks) • Graph Partitioning • CTR predictions for ads on webpages • Co-clustering • Query refinement and suggestions • Local search methods • Conclusions

  20. Query Refinement • User inputs ambiguous query (“madonna”) • Search engine asks:“Did you mean: songs, videos, pictures?” Refinement = cluster of terms lyrics “song title” album …

  21. Query Refinement • Suppose we could relate queries with keywords Related Keywords Queries Madonna Beatles songs lyrics albums pictures photos Honda Ford

  22. Query Refinement (Problem) • Different from plain bipartite graph partitioning • Don’t confuse users! • only 3 or fewer clusters for each query • only a few easily-distinguishable clusters overall • Clustering quality now also depends on • the query workload • the algo that picks the “top-3” clusters for any query

  23. Query Refinement (Method) • Can optimally pick top-k clusters for any query [Wang+/2009] • for a wide range of matching functions • only if clusters are disjoint • Iteratively improve clustering via local search • Move a keyword to a new cluster • Update top-k clusters for all queries in workload • Repeat

  24. Query Refinement (Issues) • Iterative  slow • Each iteration has to go over the entire historical query logs • Optimality guarantees? • Modeling issues: • How do we present a cluster to the user? • Cluster naming? • How does a user interact with a cluster?

  25. Outline • Advertiser-keyword graph(+ social networks) • Graph Partitioning • CTR predictions for ads on webpages • Co-clustering • Query refinement and suggestions • Local search methods • Conclusions

  26. Conclusions • Standalone clustering applications are rare! • Constraints on clustering • What use will it serve (as in query refinement)? • What is the desired “natural” cluster size? • Clustering for predictions (e.g., CTR) • Missing values • Extreme sparsity • Combining clustering with explore-exploit

  27. Conclusions • Even for standalone clustering: • Scalability concerns • 10s to 100s of millions of nodes • Skewed degree distributions • Algorithms amenable to map-reduce • The right “balance” relaxation • Tradeoffs between cluster “compactness”, conductance, and partition sizes • Is it even reasonable to require balance?

  28. References • On Spectral Clustering: Analysis and an Algorithm, by Ng, Jordan, & Weiss, in NIPS 2002 • Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems, by Spielman & Teng, STOC 2004 • Fixing two weaknesses of the Spectral Method, by Lang, NIPS 2005 • Local Graph Partitioning using PageRank Vectors, by Anderson, Chung, & Lang, SODA 2008 • An algorithm for improving graph partitions, by Anderson & Lang, SODA 2008 • Finding dense and isolated submarkets in a sponsored search spending graph, by Lang & Anderson, CIKM 2007 • Clustering of bipartite advertiser-keyword graph, by Carrasco, Fain, Lang, & Zhukov, IEEE Computer Society, 2003 • Statistical Properties of Community Structure in Large Social and Information Networks, by Leskovec, Lang, Dasgupta, & Mahoney, WWW 2008 • Approximate Algorithms for Co-Clustering, by Anagnostopoulos, Dasgupta, & Kumar, PODS 2008 • Information-theoretic co-clustering, by Dhillon, Mallela, & Modha, in KDD 2003 • Fully Automatic Cross-Associations, by Chakrabarti, Papadimitriou, & Faloutsos, in KDD 2004 • Predictive discrete latent factor models for large scale dyadic data, by Merugu, & Agarwal, in KDD 2007 • Estimating Rates of Rare Events at Multiple Resolutions, by Agarwal, Broder, Chakrabarti, Diklic, Josifovski, & Sayyadian, in KDD 2007 • Mining Broad Latent Query Aspects from Search Sessions, by Wang, Chakrabarti, & Punera, in KDD 2009

  29. Graph Partitioning (Applications) • Find clusters of users posting in similar groups • Enhance group activity • Are their special groups? Can we build special tools for them? • Similar: • Y! Answers • Flickr Yahoo! Groups Users ~100M users, ~10M groups

  30. Graph Partitioning (Issues) • Many complex networks are very different from planar/mesh-like networks • Good small cuts • Good large cuts are hard to find, and may not even exist • Hard to have a good hierarchical partitioning • Good large cuts • If they exist, most algorithms find them • If not: METIS+flow finds lower conductance cuts, but local spectral methods finds more “coherent” cuts • Is there a good compromise?

  31. Query Refinement • Given an ambiguous query, present user with up to 5 refinements • madonna  songs, videos, pictures • Each refinement represents a cluster of terms • songs = (lyrics, “song title”, album, …)

  32. Query Refinement • How can we relate “madonna” and “songs”? Search“madonna songs” Search “madonna” click no click Beatles Honda Ford lyrics albums pictures photos songs Madonna Queries Reformulations

More Related