Acknowledgements and: how to use these slides

Presentation Transcript

  1. Web structure mining / link mining and Web communities • Bettina Berendt • “Knowledge and the Web”, summer semester 2005 • http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/ • last updated: 2005-05-04

  2. Acknowledgements and: how to use these slides • Some of these slides were taken from the slide set of the Web Mining book by Baldi, Frasconi, and Smyth (http://ibook.ics.uci.edu/Slides/MIW%20Chapter%205.ppt) – thank you for a great book and slides! • These slides are marked at the bottom left corner • I also based the slide layout on that slide set • Some figures were taken from the two presented articles (see p.4). • Further materials can be found in the directory of this session (http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/Session4) • Slides that just carry a title were developed in class and on the blackboard. • Please feel free to re-use these slides in your own teaching, and please credit their origin.

  3. Objectives • To explore “what’s in a link” and to see what knowledge can therefore be extracted by analysing links • To calculate the popularity of a site based on link analysis • To see how linkage defines communities

  4. Outline: Theory and applications of link analysis for ... • Search: ranking of search engine results  • Scientific communities: co-citation analysis and other bibliometrics • Chen, C. & Carr, L. (1999) Visualizing the evolution of a subject domain: A case study. Proc. IEEE Visualization 1999. • An example of a resulting archive: citeseer • Identification of Web communities • Flake, G.W., Lawrence, S., & Giles, C.L. (2000). Efficient identification of web communities. Proc. KDD 2000. • Outlook: Social network analysis

  5. The Web is a graph; (scientific) literature is a graph

  6. Recall: Trees (slide from ISI) • A is the root node • B is the parent of D and E • D and E are children of B • (C,F) is an edge • 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves • A, B, C, D, E, F, G, H, I are internal nodes • The level (or depth) of E is 2 (number of edges to root) • The height (or order) of the tree is 4 (max number of edges from root to a leaf node) • The degree of node B is 2 (number of children) Based on Tom Blough, Introduction to Programming. http://www.rh.edu/~blought/fall02_cish4960/notes/lecture11-12.ppt

  7. Graphs (data structure def.) • Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a binary relation between vertices, adjacency. • Formal Definition: A graph G can be defined as a pair (V,E), where V is a set of vertices, and E is a set of edges between the vertices E = {(u,v) | u, v in* V}. If the graph is undirected, the adjacency relation defined by the edges is symmetric, or E = {{u,v} | u, v in V} (sets of vertices rather than ordered pairs). If the graph does not allow self-loops, adjacency is irreflexive. (http://www.nist.gov/dads/HTML/graph.html) Note: Edges are also called links (esp. in hypertext graphs like the WWW). • * „in“ denotes the „element-of“ relation
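
The formal definition above can be written down almost literally. The following is a minimal sketch; the vertex names and the helper functions are illustrative assumptions, not part of the original slides.

```python
# Hypothetical toy graph; the vertex names are illustrative only.
V = {"a", "b", "c"}
E = {("a", "b"), ("b", "c"), ("c", "a")}  # directed: ordered pairs (u, v)


def is_graph(V, E):
    """Every edge must connect two vertices that are elements of V."""
    return all(u in V and v in V for (u, v) in E)


def is_symmetric(E):
    """Adjacency is symmetric if every edge also exists in reverse."""
    return all((v, u) in E for (u, v) in E)


def is_irreflexive(E):
    """No self-loops: no vertex is adjacent to itself."""
    return all(u != v for (u, v) in E)


# Undirected view: each edge is a set {u, v} rather than an ordered pair.
E_undirected = {frozenset(edge) for edge in E}
```

A directed hypertext graph such as the Web is usually not symmetric: a link from a to b does not imply a link from b to a.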

  8. What's in a link? 1. “This is good.” • Basic assumptions of early link analysis • Hyperlinks contain information about the human judgment of a site • The more incoming links to a site, the more it is judged important

  9. Outline of the “link analysis for search engine ranking” part • Early Approaches to Link Analysis • Hubs and Authorities: HITS • Page Rank • Stability • Probabilistic Link Analysis • Limitation of Link Analysis Modeling the Internet and the Web School of Information and Computer Science University of California, Irvine

  10. Early Approaches: Bray 1996 • The visibility of a site is measured by the number of other sites pointing to it • The luminosity of a site is measured by the number of other sites to which it points • → Limitation: failure to capture the relative importance of different parent (child) sites
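
Both measures are plain degree counts, which a short sketch makes concrete. The function name and the toy link list are assumptions made for illustration; duplicate links and self-links are dropped because the slide counts "other sites".

```python
from collections import Counter


def visibility_luminosity(links):
    """links: iterable of (source_site, target_site) pairs.

    Returns two Counters: visibility (number of other sites pointing to a
    site) and luminosity (number of other sites a site points to).
    """
    distinct = {(u, v) for u, v in links if u != v}  # drop repeats and self-links
    visibility = Counter(v for _, v in distinct)  # in-degree
    luminosity = Counter(u for u, _ in distinct)  # out-degree
    return visibility, luminosity


# Hypothetical toy data; the last pair repeats the first on purpose.
toy_links = [("A", "C"), ("B", "C"), ("C", "D"), ("A", "D"), ("A", "B"), ("A", "C")]
visibility, luminosity = visibility_luminosity(toy_links)
```

Here C and D are equally visible, although C is pointed to by the most luminous site A and by B, while D is pointed to by A and C. The counts cannot express that the parents differ in importance, which is the limitation named on the slide.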

  11. Early Approaches: Mark 1988 • To calculate the score S of a document at vertex v: $S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$ • v: a vertex in the hypertext graph G = (V, E) • S(v): the global score • s(v): the score if the document is isolated • ch[v]: children of the document at vertex v • Limitations: • - Requires G to be a directed acyclic graph (DAG) • - If v has a single link to w, S(v) > S(w) • - If v has a long path to w and s(v) < s(w), then S(v) > S(w) • → unreasonable
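
The recursion can be sketched in a few lines. This is an illustration under assumptions: the function name, the dictionary representation, and the toy scores are not from the slides, and a vertex without children is assumed to keep its isolated score s(v).

```python
from functools import lru_cache


def mark_scores(children, s):
    """children: dict vertex -> list of child vertices (must form a DAG).
    s: dict vertex -> isolated score s(v).
    Returns a dict vertex -> global score S(v).
    """

    @lru_cache(maxsize=None)
    def S(v):
        ch = children.get(v, [])
        if not ch:
            return s[v]  # assumption: no children means S(v) = s(v)
        # s(v) plus the average global score of the children
        return s[v] + sum(S(w) for w in ch) / len(ch)

    return {v: S(v) for v in s}


# Hypothetical example: v has a single link to w, and s(v) < s(w).
single_link = mark_scores({"v": ["w"], "w": []}, {"v": 1.0, "w": 5.0})

# Hypothetical example: r links to x and y.
two_children = mark_scores({"r": ["x", "y"]}, {"r": 1.0, "x": 2.0, "y": 4.0})
```

In the first example v ends up with a higher global score than w even though its isolated score is lower, which is the behaviour the slide calls unreasonable. The recursion would not terminate on a graph with a cycle, hence the DAG requirement.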

  12. HITS - Kleinberg’s Algorithm • HITS – Hypertext Induced Topic Selection • For each vertex v ∈ V in a subgraph of interest: a(v) - the authority of v; h(v) - the hubness of v • A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites • Hubness shows the importance of a site. A good hub is a site that links to many authoritative sites

  13. Authority and Hubness • [Figure: node 1 is pointed to by nodes 2, 3, and 4, and points to nodes 5, 6, and 7] • h(1) = a(5) + a(6) + a(7) • a(1) = h(2) + h(3) + h(4)

  14. Authority and Hubness Convergence • Recursive dependency: • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$ • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$ • Using linear algebra, we can prove: a(v) and h(v) converge

  15. HITS Example • Find a base subgraph: • Start with a root set R = {1, 2, 3, 4} • {1, 2, 3, 4} - nodes relevant to the topic • Expand the root set R to include all the children and a fixed number of parents of nodes in R • → A new set S (base subgraph)

  16. HITS Example • Hubs and authorities: two n-dimensional vectors a and h • HubsAuthorities(G) • $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$ • $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$ • $t \leftarrow 1$ • repeat • for each v in V • do $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$ • $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$ • $a_t \leftarrow a_t / \|a_t\|$ • $h_t \leftarrow h_t / \|h_t\|$ • $t \leftarrow t + 1$ • until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$ • return $(a_t, h_t)$
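
The pseudocode above translates into a short runnable sketch. Several details are assumptions because the slide does not fix them: the Euclidean norm for normalisation, the sum of absolute differences for the stopping test, the `max_iter` guard, and the toy edge list. Like the pseudocode, it updates a and h from the previous iteration's values.

```python
import math


def hits(edges, eps=1e-9, max_iter=1000):
    """edges: list of (source, target) links. Returns (authority, hubness) dicts."""
    nodes = sorted({x for edge in edges for x in edge})
    parents = {v: [] for v in nodes}
    children = {v: [] for v in nodes}
    for u, v in edges:
        children[u].append(v)
        parents[v].append(u)

    def normalize(x):
        # Euclidean norm; fall back to 1.0 so an all-zero vector stays zero
        norm = math.sqrt(sum(val * val for val in x.values())) or 1.0
        return {k: val / norm for k, val in x.items()}

    a = {v: 1.0 for v in nodes}
    h = {v: 1.0 for v in nodes}
    for _ in range(max_iter):
        # authority: sum of hub scores of the parents (previous iteration)
        new_a = normalize({v: sum(h[w] for w in parents[v]) for v in nodes})
        # hubness: sum of authority scores of the children (previous iteration)
        new_h = normalize({v: sum(a[w] for w in children[v]) for v in nodes})
        delta = sum(abs(new_a[v] - a[v]) + abs(new_h[v] - h[v]) for v in nodes)
        a, h = new_a, new_h
        if delta < eps:
            break
    return a, h


# Hypothetical toy graph: pages 1 and 2 act as hubs for pages 3, 4, and 5.
toy_edges = [(1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
authority, hubness = hits(toy_edges)
```

In this toy graph pages 3 and 4 receive the same authority and outrank page 5, and page 2 is the stronger hub because it links to all three. Whether the iteration settles on a unique answer depends on the graph; the `max_iter` guard is there for cases where it does not.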

  17. HITS Example • Results • [Figure: authority and hubness weights for nodes 1 to 15]

  18. HITS Improvements: Bharat and Henzinger (1998) • HITS problems • A document can contain many identical links to the same document on another host • Links are generated automatically (e.g. messages posted on newsgroups) • Solutions • Assign weights to identical multiple edges that are inversely proportional to their multiplicity • Prune irrelevant nodes or regulate the influence of a node with a relevance weight
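
The first solution can be illustrated with a small sketch. The function name and the toy edges are assumptions; the code only shows the "weight inversely proportional to multiplicity" idea from the slide, not the full method of the cited paper.

```python
from collections import Counter


def multiplicity_weights(edges):
    """edges: list of (source, target) pairs, possibly containing repeats.

    Returns a list of (source, target, weight) with weight = 1 / multiplicity,
    so that k identical edges together count as a single edge.
    """
    counts = Counter(edges)
    return [(u, v, 1.0 / counts[(u, v)]) for (u, v) in edges]


# Hypothetical example: x links to y three times, z links to y once.
toy_multi_edges = [("x", "y")] * 3 + [("z", "y")]
weighted = multiplicity_weights(toy_multi_edges)
```

With these weights x and z each contribute a total of 1 to y, so repeating a link no longer inflates the target's score.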

  19. Markov Chain Notation • Random surfer model • Description of a random walk through the Web graph • Interpreted as a transition matrix, with the asymptotic probability that a surfer is currently browsing a given page • $r_t = M r_{t-1}$, where M is the transition matrix of a first-order Markov chain (stochastic) • Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?
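
The question on the slide can be tried out on a small example. The sketch below iterates the recurrence directly; the function name, the tolerance, and the three-page toy graph are assumptions, and M is taken to be column-stochastic (entry M[i][j] is the probability of moving from page j to page i).

```python
def iterate_ranks(M, r0, eps=1e-12, max_iter=10_000):
    """Repeat r_t = M r_{t-1} until the rank vector stops changing."""
    n = len(M)
    r = list(r0)
    for _ in range(max_iter):
        new_r = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(new_r, r)) < eps:
            return new_r
        r = new_r
    return r


# Hypothetical toy Web: page 0 links to 1 and 2, page 1 links to 2, page 2 links to 0.
# Each column holds the outgoing probabilities of one page and sums to 1.
toy_M = [
    [0.0, 0.0, 1.0],
    [0.5, 0.0, 0.0],
    [0.5, 1.0, 0.0],
]
ranks_from_page0 = iterate_ranks(toy_M, [1.0, 0.0, 0.0])
ranks_from_uniform = iterate_ranks(toy_M, [1 / 3, 1 / 3, 1 / 3])
```

For this particular graph both starting vectors lead to the same ranks (0.4, 0.2, 0.4). That is a property of this example, in which every page can reach every other page and the walk is not periodic; it is not guaranteed for arbitrary Web graphs.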

  20. Limits of Link Analysis • META tags / invisible text • Search engines relying on meta tags in documents are often misled (intentionally) by web developers • Pay-for-place • Search engine bias: organizations pay search engines and page rank • Advertisements: organizations pay high-ranking pages for advertising space • With a primary effect of increased visibility to end users and a secondary effect of increased respectability due to relevance to a high-ranking page

  21. Limits of Link Analysis • Stability • Adding even a small number of nodes/edges to the graph has a significant impact • Topic drift – similar to TKC • A top authority may be a hub of pages on a different topic resulting in increased rank of the authority page • Content evolution • Adding/removing links/content can affect the intuitive authority rank of a page requiring recalculation of page ranks

  22. What's in a link? 2. “This has something to do with my document / me.”

  23. Ex.: the citeseer archive

  24. Co-citation analysis and bibliographic coupling: basic ideas

  25. Matrix Notation • Adjacency matrix A = [matrix not shown] • * http://www.kusatro.kyoto-u.com

  26. Co-citation matrices; document similarity
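
This slide carries only a title, so the following sketch rests on the usual bibliometric definitions rather than on what was developed in class: with an adjacency matrix A where A[k][i] = 1 if document k cites document i, the co-citation count of i and j is the number of documents citing both, and the bibliographic coupling of i and j is the number of references they share. The function names and the toy matrix are assumptions.

```python
def cocitation(A):
    """C[i][j] = number of documents k that cite both i and j (A^T A)."""
    n = len(A)
    return [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def bibliographic_coupling(A):
    """B[i][j] = number of documents k cited by both i and j (A A^T)."""
    n = len(A)
    return [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical citation data: documents 0 and 1 both cite 2 and 3; document 2 cites 3.
toy_A = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
C = cocitation(toy_A)
B = bibliographic_coupling(toy_A)
```

Documents 2 and 3 are co-cited twice (by 0 and 1), and documents 0 and 1 are coupled through two shared references. Either matrix can serve as a document similarity measure.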

  27. Turning similarity into spatial proximity: an MDS map

  28. A clearer visualization: pathfinder networks

  29. Visualizing evolution with pathfinder networks

  30. Web communities • A community is a collection of web pages created by individuals or any kind of association that have a common interest in a specific topic, such as fan pages of a baseball team or official pages of PC vendors. Blog communities are another example. • Formally: Flake et al.'s definition of an “ideal community” • == “A Pokemon web site is a site that links to or is linked by more Pokemon sites than non-Pokemon sites.”
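
The quoted definition can be checked mechanically for a candidate set of pages. This sketch follows the wording on the slide (strictly more links inside than outside, counting links in either direction); the function name and the toy link list are assumptions.

```python
def is_ideal_community(candidate, edges):
    """True if every page in `candidate` has more links (in either direction)
    to pages inside the candidate set than to pages outside it."""
    members = set(candidate)
    for page in members:
        inside = outside = 0
        for u, v in edges:
            if u == page and v != page:
                other = v
            elif v == page and u != page:
                other = u
            else:
                continue  # edge does not touch this page, or is a self-link
            if other in members:
                inside += 1
            else:
                outside += 1
        if inside <= outside:
            return False
    return True


# Hypothetical link data: three pages p1..p3 link to each other; p1 also links out.
toy_web_edges = [("p1", "p2"), ("p2", "p3"), ("p3", "p1"), ("p1", "x"), ("x", "y")]
```

With this data `{"p1", "p2", "p3"}` satisfies the definition, whereas `{"p1", "x"}` does not, because p1 has two links to pages outside that set and only one inside.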

  31. An example, and how to find communities

  32. Approximate communities • To apply this nice theorem, we would need to have the whole Web on our hard disk! • Realistically, we crawl a part of the Web starting with some pages that are in the community we are interested in. • Questions: • What is crawling? • What pages are retrieved during this crawl? • What other assumptions have to be made?

  33. Crawling • Archives are not always given • Crawling = techniques for assembling archives from the Web • Simple: Unix command-line utility wget • Sophisticated: WIRE (contains analysis) → next week • Crawling contains graph search
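
The graph-search part of crawling can be sketched without touching the network. The example below is a breadth-first search over an in-memory dictionary that stands in for the Web; the function name, the `get_links` callback, and the toy data are assumptions. A real crawler would fetch and parse pages inside `get_links`.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl. Returns pages in the order they were visited."""
    queue = deque(dict.fromkeys(seeds))  # distinct seeds, original order
    seen = set(queue)
    visited = []
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)
        for link in get_links(page):
            if link not in seen:  # never enqueue a page twice
                seen.add(link)
                queue.append(link)
    return visited


# Hypothetical in-memory "Web": page -> pages it links to.
toy_web = {"seed": ["a", "b"], "a": ["c"], "b": ["a", "c"], "c": ["seed"]}
full_crawl = crawl(["seed"], lambda page: toy_web.get(page, []))
short_crawl = crawl(["seed"], lambda page: toy_web.get(page, []), max_pages=2)
```

The `max_pages` limit reflects that only a part of the Web is retrieved, and the choice of seeds determines which part.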

  34. Focused community crawling

  35. What is the virtual sink (= the site that is definitely not in the community)? • In the ideal version: • In the approximate version, use an artificial virtual sink (a theorem ensures correctness even if this is not really at the center of the graph)

  36. Repetition for a better result

  37. What's in a link? 3. "This is my boss." • Examples of problems created by such „nepotistic links“: • Web: link farms • Much work since 2000 - http://www.cse.lehigh.edu/~brian/pubs/2000/aaaiws/aaai2000ws.pdf • Science / citation analysis

  38. Outlook: social network analysis • Bibliometrics and link mining have their roots in a much older area: social network analysis • Direct transfer of the link analysis methods we have found: find "opinion leaders" in ciao.de and similar sites • → see also viral marketing • Others: analyse communication patterns, prestige, power, ...

  39. Social networks example