
Recognizing Communities on the Web

Recognizing Communities on the Web. CS349 presentation by Audrey Kao, covering two papers: Recognizing Nepotistic Links on the Web, by Brian D. Davison, and Self-Organization of the Web and Identification of Communities, by Gary William Flake, Steve Lawrence, C. Lee Giles, and Frans M. Coetzee.


Presentation Transcript


  1. Recognizing Communities on the Web
     CS349 Presentation by Audrey Kao
     • Recognizing Nepotistic Links on the Web, Brian D. Davison
     • Self-Organization of the Web and Identification of Communities, Gary William Flake, Steve Lawrence, C. Lee Giles, Frans M. Coetzee

  2. Introduction
     • How do links determine web communities?
     • Natural community formation vs. web authors manipulating nepotistic links
     • Graph-theoretic approach vs. machine learning approach
     • Both papers are fairly dated, from 2002

  3. What is a Web Community?
     A collection of web pages in which each member page has more links to pages inside the community than to pages outside it.
     Goal: to identify web communities.
     Why? For practical applications and web analysis.
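
The definition above can be checked mechanically on a small link graph. The following is a minimal Python sketch, not code from either paper. The function name `is_community`, the toy page names, and the choice to count links in both directions are assumptions made for illustration.

```python
from collections import defaultdict


def is_community(links, members):
    """Return True if every member has more links inside `members` than outside.

    `links` is an iterable of (source, target) pairs. Each link is counted
    for both of its endpoints (an assumption of this sketch).
    """
    members = set(members)
    inside = defaultdict(int)
    outside = defaultdict(int)
    for src, dst in links:
        for page, other in ((src, dst), (dst, src)):
            if page in members:
                if other in members:
                    inside[page] += 1
                else:
                    outside[page] += 1
    return all(inside[page] > outside[page] for page in members)


# Hypothetical toy graph: a, b, c link to each other; c also links out to x.
links = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "x"), ("x", "y")]
tight_group = is_community(links, {"a", "b", "c"})
loose_group = is_community(links, {"c", "x"})
print(tight_group, loose_group)
```

Here `{a, b, c}` satisfies the definition, while `{c, x}` does not, because `c` has more links leaving the set than staying within it.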

  4. Maximum Flow Communities
     • Given a directed graph G = (V, E) with edge capacities c(u, v) ∈ Z+ and two vertices s, t ∈ V, find the maximum flow that can be routed from the source s to the sink t while obeying all capacity constraints
     • The Max-Flow Min-Cut theorem states that the value of the maximum flow from s to t equals the capacity of the minimum cut that separates s and t
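
To make the max-flow and min-cut relationship concrete, here is a minimal Python sketch using the Edmonds-Karp method (breadth-first augmenting paths). The slides do not say which max-flow algorithm the paper used, so this choice, the function name `max_flow_min_cut`, and the toy graph are all assumptions for illustration.

```python
from collections import defaultdict, deque


def max_flow_min_cut(capacity, s, t):
    """Edmonds-Karp sketch: return (max flow value, source side of a min cut).

    `capacity` maps (u, v) edges to positive integer capacities.
    """
    residual = defaultdict(int)
    neighbors = defaultdict(set)
    for (u, v), c in capacity.items():
        residual[(u, v)] += c
        neighbors[u].add(v)
        neighbors[v].add(u)  # reverse direction is needed in the residual graph

    def reachable_from_source():
        # Breadth-first search over edges that still have residual capacity.
        parent = {s: None}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in parent and residual[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = reachable_from_source()
        if t not in parent:
            # No augmenting path left: the reached vertices are the source
            # side of a minimum cut.
            return flow, set(parent)
        # Find the bottleneck capacity along the path from s to t.
        bottleneck = float("inf")
        v = t
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[(parent[v], v)])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = t
        while parent[v] is not None:
            u = parent[v]
            residual[(u, v)] -= bottleneck
            residual[(v, u)] += bottleneck
            v = u
        flow += bottleneck


# Hypothetical toy graph: a and b are well connected to s, weakly to c.
capacity = {
    ("s", "a"): 3, ("s", "b"): 3, ("a", "b"): 1,
    ("a", "c"): 1, ("b", "c"): 1, ("c", "t"): 5,
}
flow, source_side = max_flow_min_cut(capacity, "s", "t")
cut_capacity = sum(c for (u, v), c in capacity.items()
                   if u in source_side and v not in source_side)
print(flow, sorted(source_side), cut_capacity)
```

In this toy graph the flow value and the capacity of the cut edges agree, as the theorem requires, and the source side of the cut is the tightly connected group around `s`.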

  5. Exact vs. Approximate Flow Communities
     • Exact: the “sink” is artificial and generic, i.e., it receives an edge from every other vertex
     • Accepts any bi-directional link
     • The community is highly connected internally, but isolated from the rest of the graph
     • Approximate: determined by a fixed-depth crawl
     • Uses the exact-flow-community algorithm, then chooses the highest-ranked sites and repeats the algorithm
     • Rank is determined by the number of edges a site has to other sites within the community
     • This model was used for the study, as it better represents the actual web
     Score determined by the total number of inbound and outbound links a page has to other pages in its community…
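
The ranking step described above (score a page by its inbound plus outbound links to other community members) can be sketched in a few lines of Python. The function name `community_scores`, the alphabetical tie-breaking, and the toy page names are assumptions for illustration, not details from the paper.

```python
def community_scores(links, members):
    """Rank members by inbound plus outbound links to other members.

    `links` is an iterable of directed (source, target) pairs. Ties are
    broken alphabetically so the output is deterministic (an assumption).
    """
    members = set(members)
    scores = {page: 0 for page in members}
    for src, dst in links:
        if src in members and dst in members and src != dst:
            scores[src] += 1  # outbound link within the community
            scores[dst] += 1  # inbound link within the community
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy data: "ext" is outside the community and is ignored.
links = [
    ("home", "paper"), ("paper", "home"), ("bio", "home"),
    ("home", "ext"), ("ext", "bio"),
]
ranking = community_scores(links, {"home", "paper", "bio"})
print(ranking)
```

In the approximate procedure, the top entries of such a ranking would be the candidates chosen before the algorithm is repeated.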

  6. Sample Results

     Francis Crick Community
       80  Biography of Francis Harry Compton Crick (Nobel Foundation)
       79  Biography of James Dewey Watson (Nobel Foundation)
       51  The Nobel Prize in Physiology or Medicine 1962 (Nobel Foundation)
       50  Biographical Sketch of James Dewey Watson (Cold Spring Harbor Lab.)
       41  A structure for Deoxyribose Nucleic Acid (Nature, April 2, 1953)
       ...
        1  Felix D’Herelle and the Origins of Molecular Biology (Amazon.com)
        1  Biography of Gregor Mendel
        1  Magazine: HMS Beagle Home
        1  The Alfred Russel Wallace Page
        1  U.S. Human Genome Project 5 Year Plan

     Stephen Hawking Community
       85  Professor Stephen W. Hawking’s web pages
       46  Stephen Hawking’s Universe at PBS
       17  The Stephen Hawking Pages
       15  Stephen Hawking Builds Robotic Exoskeleton (parody at the Onion)
       14  Stephen Hawking and Intel
       ...
        1  Did the cosmos arise from nothing? MSNBC story
        1  Spanish page for Stephen Hawking’s Universe
        1  Relativity Group at DAMTP, Cambridge
        1  Millennium Mathematics Project
        1  Particle Physics Education and Information Sites

     Ronald Rivest Community
       86  Ronald L. Rivest : Home Page
       29  Chaffing and Winnowing: Confidentiality without Encryption
       20  Thomas H. Cormen’s home page at Dartmouth
        9  The Mathematical Guts of RSA Encryption
        8  German news story on Cryptography
       ...
        1  Phil Zimmermann’s PGP web page
        1  A Very Brief History of Computer Science
        1  Cormen / Leiserson / Rivest: Introduction to Algorithms
        1  Security and Encryption Links
        1  HotBot Directory: Computers & Internet, Computer Science, People: R

     Most Significant Text Features, by Community
       Francis Crick: crick, nobel, dna, “francis crick”, “the nobel”, “of dna”, watson, “james watson”, francis, molecular, biology, genetics, “watson and”, “structure of”, “crick and”
       Stephen Hawking: hawking, “stephen hawking”, stephen, “hawking s”, “s universe”, physics, “black holes”, “the universe”, cambridge, cosmology, einstein, relativity, damtp, “universe the”
       Ronald Rivest: rivest, “l rivest”, “ronald l”, ronald, cryptography, rsa, “ron rivest”, lcs, “theory lcs”, encryption, “lcs mit”, theory, chaffing, winnowing, crypto

  7. Results, cont’d
     • Communities are strongly topically related, as shown by simple binary text classifiers
     • The study used three-term binary classifiers to characterize communities:
       crick or nobel or darwin (matches 54% of the Francis Crick community, but only 0.5% of random web pages)
       hawking or relativity or “for mathematical” (matches 84% of the Stephen Hawking community, but only 0.2% of random pages)
     • Breadth-first crawling strategies do not yield topically relevant pages (only 10% of pages at a depth of two matched the classification rules)
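
A three-term rule of this kind is a simple disjunction: a page matches if it contains any one of the terms. The Python sketch below shows one way such a rule and its match rate could be computed. The tokenization, the function names, and the sample page texts are assumptions for illustration, and the toy numbers are not the study's results.

```python
import re


def matches_rule(text, terms):
    """True if the text contains any term as a whole word or whole phrase."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    padded = " " + " ".join(words) + " "
    return any(f" {term} " in padded for term in terms)


def match_rate(pages, terms):
    """Fraction of pages matched by the disjunctive rule."""
    if not pages:
        return 0.0
    return sum(matches_rule(page, terms) for page in pages) / len(pages)


# Hypothetical page texts; only the rule itself comes from the slide.
rule = ["crick", "nobel", "darwin"]
community_pages = [
    "Biography of Francis Crick",
    "The Nobel Prize in Physiology or Medicine 1962",
    "U.S. Human Genome Project 5 Year Plan",
    "Biography of Gregor Mendel",
]
random_pages = ["Cheap flights and hotels", "Local weather forecast"]
community_rate = match_rate(community_pages, rule)
random_rate = match_rate(random_pages, rule)
print(community_rate, random_rate)
```

Padding the normalized text with spaces lets the same check handle single words and two-word phrases such as “for mathematical”.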

  8. What are Nepotistic Links?
     • Nepotistic links: links between pages that are present for reasons other than merit
     • Links between sites under the same administrative control, like About.com
     • Advertising/paid links
     • Note: different from duplicate pages or mirrored sites
     Eba6.com Mapquesy.com

  9. Preliminary Experiments
     • Two data sets were used:
       1. 1,536 arbitrarily selected, manually labeled links
       2. 750 random links from the DiscoWeb search engine’s 7 million pages, also manually labeled as either nepotistic or not
     • 75 binary features were used, for example:
       • Identical page titles or descriptions?
       • Page descriptions overlap in at least some percentage of their text?
       • Identical complete host names?
       • Some number of initial IP address octets identical?
       • Pages share at least some percentage of outgoing links?
       • Domains have the same contact email address?
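
A few of the binary features listed above can be sketched as a Python function over a pair of pages joined by a link. This is an illustration only: the page record layout (`url`, `title`, `ip`, `outlinks`), the function name `link_features`, the two-octet IP prefix, and the use of Jaccard overlap with a 0.5 threshold are all assumptions, since the slide gives no thresholds.

```python
from urllib.parse import urlsplit


def link_features(src, dst, ip_prefix_octets=2, overlap_threshold=0.5):
    """Compute a few yes/no features for the link from page `src` to page `dst`.

    Each page is a dict with hypothetical keys: url, title, ip, outlinks.
    """
    src_out, dst_out = set(src["outlinks"]), set(dst["outlinks"])
    union = src_out | dst_out
    # Jaccard overlap of outgoing links (one possible reading of "share at
    # least some percentage of outgoing links").
    overlap = len(src_out & dst_out) / len(union) if union else 0.0
    n = ip_prefix_octets
    return {
        "same_title": src["title"] == dst["title"],
        "same_host": urlsplit(src["url"]).hostname == urlsplit(dst["url"]).hostname,
        "same_ip_prefix": src["ip"].split(".")[:n] == dst["ip"].split(".")[:n],
        "shared_outlinks": overlap >= overlap_threshold,
    }


# Hypothetical pages using reserved example domains and addresses.
page_a = {
    "url": "http://www.example.com/a", "title": "Example", "ip": "192.0.2.10",
    "outlinks": ["http://example.org/x", "http://example.org/y"],
}
page_b = {
    "url": "http://shop.example.com/b", "title": "Example", "ip": "192.0.2.77",
    "outlinks": ["http://example.org/x", "http://example.org/y", "http://example.org/z"],
}
features = link_features(page_a, page_b)
print(features)
```

Each link then becomes a vector of yes/no answers, which is the form a decision tree learner expects.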

  10. Machine Learning
     • The C4.5 decision tree package was used to learn classification rules from the binary features
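
C4.5 chooses which feature to split on using the gain ratio: information gain divided by the entropy of the split itself. The Python sketch below computes that criterion for a single feature. It is not the C4.5 package, and the feature names and labels are made-up toy data.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy, in bits, of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of splitting on the feature, divided by split entropy."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the labels after the split.
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(feature_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: six links labeled nepotistic (True) or not (False).
nepotistic = [True, True, True, False, False, False]
same_host = [True, True, True, False, False, False]
same_title = [True, False, True, False, True, False]
host_ratio = gain_ratio(same_host, nepotistic)
title_ratio = gain_ratio(same_title, nepotistic)
print(host_ratio, title_ratio)
```

In this toy data `same_host` separates the labels perfectly, so it would be preferred over `same_title` as the first split.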

  11. Results

  12. Results, cont’d
     • Links can be classified more accurately if already categorized search engine results are used as “training data”
     • The second data set was too small; it does not represent the variety of sites on the web
     • Nepotistic links largely do not affect popular pages

  13. Conclusions
     • Both experiments focused on binary classifiers
     • Naïve researchers: the web is too large to run either algorithm on in full, and both studies used small samples to begin with
