## Acknowledgements and: how to use these slides


**Web structure mining / link mining and Web communities**

Bettina Berendt
“Knowledge and the Web”, summer semester 2005
http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/
last updated: 2005-05-04

**Acknowledgements and: how to use these slides**

• Some of these slides were taken from the slide set of the Web Mining book by Baldi, Frasconi, and Smyth (http://ibook.ics.uci.edu/Slides/MIW%20Chapter%205.ppt). Thank you for a great book and slides!
  • These slides are marked at the bottom left corner
  • I also based the slide layout on that slide set
• Some figures were taken from the two presented articles (see p. 4).
• Further materials can be found in the directory of this session (http://www.wiwi.hu-berlin.de/~berendt/lehre/2005s/kaw/Session4)
• Slides that just carry a title were developed in class and on the blackboard.
• Please feel free to re-use these slides in your own teaching, and please credit their origin.

**Objectives**

• To explore “what’s in a link” and to see what knowledge can therefore be extracted by analysing links
• To calculate the popularity of a site based on link analysis
• To see how linkage defines communities

**Outline: Theory and applications of link analysis for ...**

• Search: ranking of search engine results
• Scientific communities: co-citation analysis and other bibliometrics
  • Chen, C. & Carr, L. (1999). Visualizing the evolution of a subject domain: A case study. Proc. IEEE Visualization 1999.
  • An example of a resulting archive: CiteSeer
• Identification of Web communities
  • Flake, G.W., Lawrence, S., & Giles, C.L. (2000). Efficient identification of web communities. Proc. KDD 2000.
• Outlook: Social network analysis

**Recall: Trees (slide from ISI)**

• A is the root node
• B is the parent of D and E
• D and E are children of B
• (C,F) is an edge
• 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 are leaves
• A, B, C, D, E, F, G, H, I are internal nodes
• The level (or depth) of E is 2 (number of edges to root)
• The height (or order) of the tree is 4 (max number of edges from root to a leaf node)
• The degree of node B is 2 (number of children)
Based on Tom Blough, Introduction to Programming. http://www.rh.edu/~blought/fall02_cish4960/notes/lecture11-12.ppt

**Graphs (data structure def.)**

• Definition: A set of items connected by edges. Each item is called a vertex or node. Formally, a graph is a set of vertices and a binary relation between vertices, adjacency.
• Formal definition: A graph G can be defined as a pair (V,E), where V is a set of vertices and E is a set of edges between the vertices: E = {(u,v) | u, v in* V}. If the graph is undirected, the adjacency relation defined by the edges is symmetric, or E = {{u,v} | u, v in V} (sets of vertices rather than ordered pairs). If the graph does not allow self-loops, adjacency is irreflexive. (http://www.nist.gov/dads/HTML/graph.html)
Note: Edges are also called links (esp. in hypertext graphs like the WWW).
• * “in” denotes the “element-of” relation

**What's in a link? 1.
“This is good.”**

Basic assumptions of early link analysis
• Hyperlinks contain information about the human judgment of a site
• The more incoming links a site has, the more important it is judged to be

**Outline of the “link analysis for search engine ranking” part**

• Early Approaches to Link Analysis
• Hubs and Authorities: HITS
• PageRank
• Stability
• Probabilistic Link Analysis
• Limitations of Link Analysis
Modeling the Internet and the Web, School of Information and Computer Science, University of California, Irvine

**Early Approaches**

Bray 1996
• The visibility of a site is measured by the number of other sites pointing to it
• The luminosity of a site is measured by the number of other sites to which it points
• Limitation: fails to capture the relative importance of different parent (child) sites

**Early Approaches**

Mark 1988
• To calculate the score S of a document at vertex v:

$$S(v) = s(v) + \frac{1}{|ch[v]|} \sum_{w \in ch[v]} S(w)$$

  • v: a vertex in the hypertext graph G = (V, E)
  • S(v): the global score
  • s(v): the score if the document is isolated
  • ch[v]: children of the document at vertex v
• Limitations:
  • Requires G to be a directed acyclic graph (DAG)
  • If v has a single link to w, S(v) > S(w)
  • If v has a long path to w and s(v) < s(w), then S(v) > S(w), which is unreasonable

**HITS - Kleinberg’s Algorithm**

• HITS: Hypertext Induced Topic Selection
• For each vertex v ∈ V in a subgraph of interest:
  • a(v): the authority of v
  • h(v): the hubness of v
• A site is very authoritative if it receives many citations. Citations from important sites weigh more than citations from less important sites.
• Hubness shows the importance of a site.
A good hub is a site that links to many authoritative sites.

**Authority and Hubness**

[Figure: nodes 2, 3, and 4 link to node 1; node 1 links to nodes 5, 6, and 7]
h(1) = a(5) + a(6) + a(7)
a(1) = h(2) + h(3) + h(4)

**Authority and Hubness Convergence**

• Recursive dependency:
  • $a(v) \leftarrow \sum_{w \in pa[v]} h(w)$
  • $h(v) \leftarrow \sum_{w \in ch[v]} a(w)$
• Using linear algebra, we can prove that a(v) and h(v) converge

**HITS Example**

Find a base subgraph:
• Start with a root set R = {1, 2, 3, 4}
• {1, 2, 3, 4}: nodes relevant to the topic
• Expand the root set R to include all the children and a fixed number of parents of nodes in R
• The result is a new set S (the base subgraph)

**HITS Example**

Hubs and authorities: two n-dimensional vectors a and h
• HubsAuthorities(G)
• $\mathbf{1} \leftarrow [1, \ldots, 1] \in \mathbb{R}^{|V|}$
• $a_0 \leftarrow h_0 \leftarrow \mathbf{1}$
• $t \leftarrow 1$
• repeat
  • for each $v$ in $V$ do
    • $a_t(v) \leftarrow \sum_{w \in pa[v]} h_{t-1}(w)$
    • $h_t(v) \leftarrow \sum_{w \in ch[v]} a_{t-1}(w)$
  • $a_t \leftarrow a_t / \|a_t\|$
  • $h_t \leftarrow h_t / \|h_t\|$
  • $t \leftarrow t + 1$
• until $\|a_t - a_{t-1}\| + \|h_t - h_{t-1}\| < \varepsilon$
• return $(a_t, h_t)$

**HITS Example Results**

[Figure: authority and hubness weights for nodes 1 to 15]

**HITS Improvements**

Bharat and Henzinger (1998)
• HITS problems
  • A document can contain many identical links to the same document on another host
  • Links are generated automatically (e.g.
messages posted on newsgroups)
• Solutions
  • Assign weights to identical multiple edges that are inversely proportional to their multiplicity
  • Prune irrelevant nodes, or regulate the influence of a node with a relevance weight

**Markov Chain Notation**

• Random surfer model
  • Description of a random walk through the Web graph
  • Interpreted as a transition matrix with asymptotic probability that a surfer is currently browsing that page

$$r_t = M r_{t-1}$$

M: transition matrix for a first-order Markov chain (stochastic)
Does it converge to some sensible solution (as $t \to \infty$) regardless of the initial ranks?

**Limits of Link Analysis**

• META tags / invisible text
  • Search engines relying on meta tags in documents are often misled (intentionally) by web developers
• Pay-for-place
  • Search engine bias: organizations pay search engines for ranking
  • Advertisements: organizations pay high-ranking pages for advertising space
  • The primary effect is increased visibility to end users; the secondary effect is increased respectability through association with a high-ranking page

**Limits of Link Analysis**

• Stability
  • Adding even a small number of nodes/edges to the graph has a significant impact
• Topic drift (similar to the TKC, or tightly knit community, effect)
  • A top authority may be a hub of pages on a different topic, resulting in increased rank of the authority page
• Content evolution
  • Adding or removing links or content can affect the intuitive authority rank of a page, requiring recalculation of page ranks

**What's in a link? 2.
“This has something to do with my document / me.”**

**Co-citation analysis and bibliographic coupling: basic ideas**

**Matrix Notation**

Adjacency matrix A = [matrix figure not preserved] (* http://www.kusatro.kyoto-u.com)

**Web communities**

• A community is a collection of web pages created by individuals or any kind of association with a common interest in a specific topic, such as fan pages of a baseball team or official pages of PC vendors. Blog communities are another example.
• Formally: Flake et al.'s definition of an “ideal community”
• == “A Pokemon web site is a site that links to or is linked by more Pokemon sites than non-Pokemon sites.”

**Approximate communities**

• To apply this nice theorem, we would need to have the whole Web on our hard disk!
• Realistically, we crawl a part of the Web, starting with some pages that are in the community we are interested in.
• Questions:
  • What is crawling?
  • What pages are retrieved during this crawl?
  • What other assumptions have to be made?

**Crawling**

• Archives are not always given
• Crawling = techniques for assembling archives from the Web
  • Simple: Unix command-line utility wget
  • Sophisticated: WIRE (contains analysis); next week
• Crawling involves graph search

**What is the virtual sink (= the site that is definitely not in the community)?**

• In the ideal version:
• In the approximate version, use an artificial virtual sink (a theorem ensures correctness even if this is not really at the center of the graph)

**What's in a link? 3.
"This is my boss."**• Examples of problems created by such „nepotistic links“: • Web: link farms • Much work since 2000 - http://www.cse.lehigh.edu/~brian/pubs/2000/aaaiws/aaai2000ws.pdf • Science / citation analysis**Outlook: social network analysis**• Bibliometrics and link mining have their roots in a much older are: social network analysis • Direct transfer of the link analysis methods we have found: find "opinion leaders" in ciao.de and similar sites • see also viral marketing • Others: analyse communication patterns, prestige, power, ...