
AutoPart: Parameter-Free Graph Partitioning and Outlier Detection






Presentation Transcript


  1. AutoPart: Parameter-Free Graph Partitioning and Outlier Detection Deepayan Chakrabarti (deepay@cs.cmu.edu)

  2. Problem Definition: Group people in a social network, or species in a food web, or proteins in protein-interaction graphs …

  3. Reminder: A graph has N nodes and E directed edges.

  4. Problem Definition • Goals: • [#1] Find groups (of people, species, proteins, etc.) • [#2] Find outlier edges (“bridges”) • [#3] Compute inter-group “distances” (how similar are two groups of proteins?)

  5. Problem Definition • Desired properties: • Fully automatic (estimate the number of groups) • Scalable • Allow incremental updates

  6. Related Work • Graph partitioning: METIS (Karypis+/1998) and spectral partitioning (Ng+/2001) require a measure of imbalance between clusters, or the number of partitions. • Clustering techniques: K-means and variants (Pelleg+/2000, Hamerly+/2003) are not fully automatic; information-theoretic co-clustering (Dhillon+/2003) considers rows and columns separately; LSI (Deerwester+/1990) requires choosing the number of “concepts”.

  7. Outline • Problem Definition • Related Work • Finding clusters in graphs • Outliers and inter-group distances • Experiments • Conclusions

  8. Outline • Problem Definition • Related Work • Finding clusters in graphs • What is a good clustering? • How can we find such a clustering? • Outliers and inter-group distances • Experiments • Conclusions

  9. What is a “good” clustering? • Similar nodes are grouped together • As few groups as necessary The result is a matrix with a few, homogeneous blocks: Good Clustering implies Good Compression.

  10. Main Idea: Good Compression implies Good Clustering. Rearrange the binary adjacency matrix into node groups and encode it block by block. Total Encoding Cost = Code Cost + Description Cost = Σ_i (n_i1 + n_i0) · H(p_i1) + [cost of describing n_i1, n_i0 and the groups], where p_i1 = n_i1 / (n_i1 + n_i0) is the density of ones in block i.
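To make the two-part cost concrete, here is a minimal Python sketch of the code-cost term (helper names are mine, and the description bits for n_i1, n_i0 and the group sizes are left out): each of the k×k blocks of the regrouped matrix is charged its size times the binary entropy of its density of ones.

```python
import math

def block_code_cost(n1, n0):
    """(n1 + n0) * H(p1) with p1 = n1 / (n1 + n0): bits to encode one block."""
    n = n1 + n0
    if n == 0 or n1 == 0 or n0 == 0:
        return 0.0  # a perfectly homogeneous (or empty) block costs nothing
    p1 = n1 / n
    return n * (-p1 * math.log2(p1) - (1 - p1) * math.log2(1 - p1))

def total_code_cost(adj, groups, k):
    """Sum of block code costs for a k-way node grouping of a binary matrix."""
    size = [0] * k
    for g in groups:
        size[g] += 1
    ones = [[0] * k for _ in range(k)]          # ones[i][j] = n_1 of block (i, j)
    for u, row in enumerate(adj):
        for v, bit in enumerate(row):
            if bit:
                ones[groups[u]][groups[v]] += 1
    return sum(
        block_code_cost(ones[i][j], size[i] * size[j] - ones[i][j])
        for i in range(k)
        for j in range(k)
    )
```

On a perfectly block-diagonal matrix the right grouping drives this cost to zero, while lumping all nodes into one group pays the full entropy of the mixed block.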

  11. Examples: Total Encoding Cost = Code Cost Σ_i (n_i1 + n_i0) · H(p_i1) + Description Cost (describing n_i1, n_i0 and the groups). With one node group, the Description Cost is low but the Code Cost is high; with n node groups (one per node), the Description Cost is high but the Code Cost is low.

  12. What is a “good” clustering? For the better grouping, both terms of the Total Encoding Cost, the Code Cost Σ_i (n_i1 + n_i0) · H(p_i1) and the Description Cost of n_i1, n_i0 and the groups, are low.

  13. Outline • Problem Definition • Related Work • Finding clusters in graphs • What is a good clustering? • How can we find such a clustering? • Outliers and inter-group distances • Experiments • Conclusions

  14. Algorithms: an example adjacency matrix, partitioned into k = 5 node groups.

  15. Algorithms: Start with the initial matrix → find good groups for a fixed k (lowering the encoding cost) → choose better values for k → final grouping.

  16. Algorithms: Start with the initial matrix → find good groups for a fixed k (lowering the encoding cost) → choose better values for k → final grouping.

  17. Fixed number of groups k • Reassign: for each node, reassign it to the group which minimizes the code cost.
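A brute-force sketch of this Reassign pass (helper names are mine; it recomputes the full cost for every candidate move, whereas AutoPart updates the block counts incrementally to stay O(E)):

```python
import math

def code_cost(adj, groups, k):
    """Sum of (n_i1 + n_i0) * H(p_i1) over the k*k blocks of the grouped matrix."""
    size = [0] * k
    for g in groups:
        size[g] += 1
    ones = [[0] * k for _ in range(k)]
    for u, row in enumerate(adj):
        for v, bit in enumerate(row):
            if bit:
                ones[groups[u]][groups[v]] += 1
    def block(n1, n):
        if n == 0 or n1 == 0 or n1 == n:
            return 0.0
        p = n1 / n
        return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return sum(block(ones[i][j], size[i] * size[j])
               for i in range(k) for j in range(k))

def reassign(adj, groups, k):
    """One Reassign pass: move each node to the group minimizing the code cost."""
    for v in range(len(adj)):
        costs = []
        for g in range(k):
            trial = groups[:]
            trial[v] = g              # hypothetically place node v in group g
            costs.append((code_cost(adj, trial, k), g))
        groups[v] = min(costs)[1]     # keep the cheapest placement
    return groups
```

Starting from a grouping with one misplaced node, a single pass moves it back to the block-diagonal grouping.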

  18. Algorithms Find good groups for fixed k Start with initial matrix Lower the encoding cost Final grouping Choose better values for k

  19. Choosing k • Split: • Find the group R with the maximum entropy per node • Choose the nodes in R whose removal reduces the entropy per node in R • Send these nodes to the new group, and set k=k+1
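One way to sketch this Split step, assuming “entropy per node” of a group R means the code cost of R's row and column blocks divided by |R| (helper names are mine, not from the paper):

```python
import math

def block(n1, n):
    """Code cost of one block: n * H(n1 / n)."""
    if n == 0 or n1 == 0 or n1 == n:
        return 0.0
    p = n1 / n
    return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))

def entropy_per_node(adj, groups, k, r):
    """Code cost of group r's row and column blocks, per member of r."""
    members = [v for v in range(len(adj)) if groups[v] == r]
    if not members:
        return 0.0
    cost = 0.0
    for g in range(k):
        others = [v for v in range(len(adj)) if groups[v] == g]
        out_ones = sum(adj[u][v] for u in members for v in others)
        cost += block(out_ones, len(members) * len(others))
        if g != r:  # don't count the diagonal block twice
            in_ones = sum(adj[u][v] for u in others for v in members)
            cost += block(in_ones, len(others) * len(members))
    return cost / len(members)

def split(adj, groups, k):
    """Open group k; move out of the max-entropy group R every node whose
    departure lowers R's entropy per node."""
    r = max(range(k), key=lambda g: entropy_per_node(adj, groups, k, g))
    k += 1
    for v in range(len(adj)):
        if groups[v] != r:
            continue
        before = entropy_per_node(adj, groups, k, r)
        groups[v] = k - 1                     # tentatively move v out
        if entropy_per_node(adj, groups, k, r) >= before:
            groups[v] = r                     # the move didn't help; undo it
    return groups, k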

  20. Algorithms: Start with the initial matrix → find good groups for a fixed k (Reassign, lowering the encoding cost) → choose better values for k (Splits) → final grouping.

  21. Algorithms • Properties: • Fully Automatic  number of groups is found automatically • Scalable  O(E) time • Allow incremental updates  reassign new node/edge to the group with least cost, and continue…

  22. Outline • Problem Definition • Related Work • Finding clusters in graphs • What is a good clustering? • How can we find such a clustering? • Outliers and inter-group distances • Experiments • Conclusions

  23. Outlier edges: Deviations from “normality” give lower-quality compression, so outliers are the edges whose removal maximally reduces the total cost.
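A small sketch of this outlier criterion (helper names are mine; it scores each edge by rerunning the full cost computation, which a real implementation would do incrementally):

```python
import math

def code_cost(adj, groups, k):
    """Sum of (n_i1 + n_i0) * H(p_i1) over the k*k blocks."""
    size = [0] * k
    for g in groups:
        size[g] += 1
    ones = [[0] * k for _ in range(k)]
    for u, row in enumerate(adj):
        for v, bit in enumerate(row):
            if bit:
                ones[groups[u]][groups[v]] += 1
    def block(n1, n):
        if n == 0 or n1 == 0 or n1 == n:
            return 0.0
        p = n1 / n
        return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return sum(block(ones[i][j], size[i] * size[j])
               for i in range(k) for j in range(k))

def outlier_edges(adj, groups, k, top=1):
    """Rank edges by how much deleting them reduces the total code cost."""
    base = code_cost(adj, groups, k)
    scored = []
    for u, row in enumerate(adj):
        for v, bit in enumerate(row):
            if bit:
                adj[u][v] = 0                  # hypothetically delete edge (u, v)
                scored.append((base - code_cost(adj, groups, k), (u, v)))
                adj[u][v] = 1                  # restore the edge
    scored.sort(reverse=True)
    return [edge for _, edge in scored[:top]]
```

On a block-diagonal graph with a single cross-group “bridge”, that bridge is exactly the edge whose removal reduces the cost the most.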

  24. Inter-cluster distances: Two groups are “close” if merging them does not increase the cost by much: distance(i,j) = relative increase in cost on merging i and j.

  25. Inter-cluster distances, for an example with three groups: distance(Grp1,Grp2) = 5.5, distance(Grp1,Grp3) = 5.1, distance(Grp2,Grp3) = 4.5. Two groups are “close” if merging them does not increase the cost by much: distance(i,j) = relative increase in cost on merging i and j.
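This merge-based distance can be sketched as follows (helper names are mine; when the unmerged cost is already zero the relative increase is undefined, so this sketch falls back to the absolute increase):

```python
import math

def code_cost(adj, groups, k):
    """Sum of (n_i1 + n_i0) * H(p_i1) over the k*k blocks."""
    size = [0] * k
    for g in groups:
        size[g] += 1
    ones = [[0] * k for _ in range(k)]
    for u, row in enumerate(adj):
        for v, bit in enumerate(row):
            if bit:
                ones[groups[u]][groups[v]] += 1
    def block(n1, n):
        if n == 0 or n1 == 0 or n1 == n:
            return 0.0
        p = n1 / n
        return n * (-p * math.log2(p) - (1 - p) * math.log2(1 - p))
    return sum(block(ones[i][j], size[i] * size[j])
               for i in range(k) for j in range(k))

def distance(adj, groups, k, i, j):
    """Increase in code cost when groups i and j are merged, relative to the
    unmerged cost (absolute increase if the unmerged cost is zero)."""
    base = code_cost(adj, groups, k)
    merged = [min(i, j) if g in (i, j) else g for g in groups]  # fold j into i
    inc = code_cost(adj, merged, k) - base
    return inc / base if base > 0 else inc
```

Merging two groups with identical connectivity costs nothing, while merging two groups with different connectivity makes their blocks heterogeneous and raises the cost.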

  26. Outline • Problem Definition • Related Work • Finding clusters in graphs • What is a good clustering? • How can we find such a clustering? • Outliers and inter-group distances • Experiments • Conclusions

  27. Experiments “Quasi block-diagonal” graph with noise=10%

  28. Experiments • DBLP dataset (an authors-versus-authors matrix) • 6,090 authors in: SIGMOD, ICDE, VLDB, PODS, ICDT • 175,494 “dots”, one “dot” per co-citation

  29. Experiments: k = 8 author groups found; one group contains Stonebraker, DeWitt and Carey.

  30. Experiments: Inter-group distances among the author groups Grp1 through Grp8.

  31. Experiments • Epinions dataset (a users-versus-users matrix) • 75,888 users • 508,960 “dots”, one “dot” per “trust” relationship • k = 19 groups found, including a small dense “core”

  32. Experiments: Time (in seconds) versus the number of “dots”: runtime is linear in the number of “dots”  Scalable.

  33. Conclusions • Goals: • Find groups • Find outliers • Compute inter-group “distances” • Properties: • Fully Automatic • Scalable • Allow incremental updates
