
CS 124

Outline: anchor text; background on networks; bibliometric (citation) networks; social networks; link analysis for ranking; PageRank; HITS; search engine optimization. The Web as a Directed Graph. Assumption 1: A hyperlink between pages denotes author-perceived relevance (a quality signal). Slide from Chris Manning.

Audrey


Presentation Transcript


    1. Dan Jurafsky Lecture 18: Networks part I: Link Analysis, PageRank CS 124/LINGUIST 180: From Languages to Information

    2. Slide from Chris Manning

    3. The Web as a Directed Graph Slide from Chris Manning

    4. Anchor Text WWW Worm - McBryan [Mcbr94]. For the query "ibm", how to distinguish between: IBM's home page (mostly graphical); IBM's copyright page (high term freq. for "ibm"); a rival's spam page (arbitrarily high term freq.). Slide from Chris Manning

    5. Indexing anchor text When indexing a document D, include anchor text from links pointing to D. Slide from Chris Manning

    6. Indexing anchor text Can sometimes have unexpected side effects. Like what? Can score anchor text with a weight depending on the authority of the anchor page's website. E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust the anchor text from them. Slide from Chris Manning

    7. Anchor Text Other applications: Weighting/filtering links in the graph. Generating page descriptions from anchor text. Slide from Chris Manning

    8. Roots of Web Link Analysis: Bibliometrics. Social network analysis. Slide from Chris Manning

    9. Citation Analysis: Impact Factor Developed by Garfield in 1972 to measure the importance (quality, influence) of scientific journals. Measure of how often papers in the journal are cited by other scientists. Computed and published annually by the Institute for Scientific Information (ISI). The impact factor of a journal J in year Y is the average number of citations (from indexed documents published in year Y) to a paper published in J in year Y−1 or Y−2. Does not account for the quality of the citing article. Slide from Ray Mooney
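
The impact factor definition above is a simple ratio. A minimal sketch, assuming the citation and paper counts are already known; the function name and the example numbers are made up for illustration:

```python
def impact_factor(citations_in_year, papers_published):
    """Impact factor of a journal J in year Y.

    citations_in_year: citations made in year Y to papers that J
        published in years Y-1 and Y-2 (hypothetical count).
    papers_published: number of papers J published in years Y-1 and Y-2.
    """
    if papers_published == 0:
        raise ValueError("journal published no papers in Y-1 or Y-2")
    return citations_in_year / papers_published


# Hypothetical journal: 150 papers in the two prior years,
# cited 450 times by documents published in year Y.
example_if = impact_factor(450, 150)
```

Every citation counts the same here, whatever the quality of the citing article, which is the limitation the slide points out.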

    10. Citations vs. Links Web links are a bit different from citations: Many links are navigational. Many pages with high in-degree are portals, not content providers. Not all links are endorsements. Company websites don't point to their competitors. Citations to relevant literature are enforced by peer review. Slide from Ray Mooney

    11. Social network analysis Social network analysis is the study of social entities (people in an organization, called actors), and their interactions and relationships. The interactions and relationships can be represented with a network or graph: each vertex (or node) represents an actor and each link represents a relationship. CS583, Bing Liu, UIC

    12. Centrality Important or prominent actors are those that are linked or involved with other actors extensively. A person with extensive contacts (links) or communications with many other people in the organization is considered more important than a person with relatively fewer contacts. The links can also be called ties. A central actor is one involved in many ties. CS583, Bing Liu, UIC

    13. Prestige Prestige is a more refined measure of prominence of an actor than centrality. Distinguish: ties sent (out-links) and ties received (in-links). A prestigious actor is one who is the object of extensive ties as a recipient. To compute the prestige, we use only in-links. Difference between centrality and prestige: centrality focuses on out-links; prestige focuses on in-links. PageRank is based on prestige. CS583, Bing Liu, UIC

    14. Drawing on the citation work First attempt to do link analysis. Slide from Chris Manning

    15. Query-independent ordering First generation: using link counts as simple measures of popularity. Two basic suggestions: Undirected popularity: Each page gets a score = the number of in-links plus the number of out-links (3+2=5). Directed popularity: Score of a page = number of its in-links (3). Slide from Chris Manning
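
The two link-count scores above can be sketched in a few lines. The edge list below is a made-up graph chosen so that page "p" has 3 in-links and 2 out-links, matching the numbers on the slide; the function and page names are illustrative only.

```python
from collections import defaultdict


def link_popularity(edges):
    """Return (undirected, directed) popularity scores per page.

    edges: iterable of (source_page, target_page) hyperlinks.
    """
    in_deg = defaultdict(int)
    out_deg = defaultdict(int)
    for src, dst in edges:
        out_deg[src] += 1
        in_deg[dst] += 1
    pages = set(in_deg) | set(out_deg)
    # Undirected popularity: in-links plus out-links.
    undirected = {p: in_deg[p] + out_deg[p] for p in pages}
    # Directed popularity: in-links only.
    directed = {p: in_deg[p] for p in pages}
    return undirected, directed


# Hypothetical graph: three pages link to "p", and "p" links to two pages.
edges = [("a", "p"), ("b", "p"), ("c", "p"), ("p", "d"), ("p", "e")]
undirected, directed = link_popularity(edges)
```

Here `undirected["p"]` is 5 and `directed["p"]` is 3.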

    16. Query processing First retrieve all pages meeting the text query (say "venture capital"). Order these by their link popularity (either variant on the previous page). More nuanced: use link counts as a measure of static goodness, combined with text match score. Slide from Chris Manning

    17. Spamming simple popularity Exercise: How do you spam each of the following heuristics so your page gets a high score? Each page gets a static score = the number of in-links plus the number of out-links. Static score of a page = number of its in-links. Slide from Chris Manning

    18. Intuition of PageRank

    19. Pagerank scoring Imagine a browser doing a random walk on web pages: Start at a random page. At each step, go out of the current page along one of the links on that page, equiprobably. In the steady state each page has a long-term visit rate - use this as the page's score. Slide from Chris Manning

    20. Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. Slide from Chris Manning

    21. Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link. 10% - a parameter. Slide from Chris Manning
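
The teleporting rules above translate directly into a matrix of one-step probabilities. A minimal sketch, assuming pages are numbered from 0 and each page's out-links are listed once; the function name and the three-page graph are hypothetical:

```python
def transition_matrix(out_links, teleport=0.10):
    """Build the one-step probabilities of the teleporting random walk.

    out_links[i] is the list of distinct page indices that page i links to.
    Entry P[i][j] is the probability of moving from page i to page j.
    """
    n = len(out_links)
    P = []
    for links in out_links:
        if not links:
            # Dead end: jump to a page chosen uniformly at random.
            row = [1.0 / n] * n
        else:
            # With probability `teleport`, jump to a uniformly random page.
            row = [teleport / n] * n
            # With the remaining probability, follow a random out-link.
            for j in links:
                row[j] += (1.0 - teleport) / len(links)
        P.append(row)
    return P


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, and page 2 is a dead end.
P = transition_matrix([[1, 2], [2], []])
```

Each row of `P` sums to 1, and the dead-end row is uniform.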

    22. Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited (not obvious, will show this). How do we compute this visit rate? Slide from Chris Manning

    23. Markov chains A Markov chain consists of n states, plus an n × n transition probability matrix P. At each step, we are in exactly one of the states. For 1 ≤ i, j ≤ n, the matrix entry P_ij tells us the probability of j being the next state, given we are currently in state i. Slide from Chris Manning

    24. Markov chains Clearly, for all i, Σ_j P_ij = 1 (each row of P sums to 1). Markov chains are abstractions of random walks. Exercise: represent the teleporting random walk from 3 slides ago as a Markov chain, for this case: [figure omitted] Slide from Chris Manning

    25. Ergodic Markov chains A Markov chain is ergodic if: you have a path from any state to any other; and, for any start state, after a finite transient time T0, the probability of being in any state at a fixed time T > T0 is nonzero. Slide from Chris Manning

    26. Ergodic Markov chains For any ergodic Markov chain, there is a unique long-term visit rate for each state: the steady-state probability distribution. Over a long time period, we visit each state in proportion to this rate. It doesn't matter where we start. Slide from Chris Manning

    27. Probability vectors A probability (row) vector x = (x1, …, xn) tells us where the walk is at any point. E.g., (0 0 0 … 1 … 0 0 0), with the 1 in position i, means we're in state i. Slide from Chris Manning

    28. Change in probability vector If the probability vector is x = (x1, …, xn) at this step, what is it at the next step? Recall that row i of the transition prob. matrix P tells us where we go next from state i. So from x, our next state is distributed as xP. Slide from Chris Manning
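
One step of the walk is a row-vector-times-matrix product. A small sketch of that product in plain Python; the 2-state matrix is a made-up example, not one from the lecture:

```python
def step(x, P):
    """Return the distribution after one step: the row vector x times P."""
    n = len(P)
    return [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]


# Hypothetical 2-state chain; each row sums to 1.
P = [[0.25, 0.75],
     [0.50, 0.50]]

x0 = [1.0, 0.0]   # we are in state 0 with certainty
x1 = step(x0, P)  # equals row 0 of P
x2 = step(x1, P)  # distribution after two steps
```

Starting from a known state i, the next-step distribution is just row i of P, as the slide says.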

    29. Steady state example The steady state looks like a vector of probabilities a = (a1, …, an): ai is the probability that we are in state i. Slide from Chris Manning

    30. How do we compute this vector? Let a = (a1, …, an) denote the row vector of steady-state probabilities. If our current position is described by a, then the next step is distributed as aP. But a is the steady state, so a = aP. Solving this matrix equation gives us a. So a is the (left) eigenvector for P. (Corresponds to the principal eigenvector of P with the largest eigenvalue.) Transition probability matrices always have largest eigenvalue 1. Slide from Chris Manning

    31. One way of computing a Recall, regardless of where we start, we eventually reach the steady state a. Start with any distribution (say x = (1 0 … 0)). After one step, we're at xP; after two steps at xP^2, then xP^3 and so on. "Eventually" means for large k, xP^k = a. Algorithm: multiply x by increasing powers of P until the product looks stable. Slide from Chris Manning
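
The "multiply until stable" algorithm above can be sketched as follows. The stopping tolerance, the step cap, and the 2-state example matrix are assumptions for illustration; the slide only says to stop when the product "looks stable".

```python
def steady_state(P, tol=1e-10, max_steps=10_000):
    """Approximate a with a = aP by repeatedly multiplying x by P."""
    n = len(P)
    x = [1.0] + [0.0] * (n - 1)  # start in state 0, i.e. x = (1 0 ... 0)
    for _ in range(max_steps):
        nxt = [sum(x[i] * P[i][j] for i in range(n)) for j in range(n)]
        # Stop once no entry changes by more than tol in one step.
        if max(abs(a - b) for a, b in zip(nxt, x)) < tol:
            return nxt
        x = nxt
    return x


# Hypothetical ergodic 2-state chain.
P = [[0.25, 0.75],
     [0.50, 0.50]]
a = steady_state(P)
```

For this matrix, solving a = aP by hand gives a = (0.4, 0.6), which the iteration approaches within a few dozen steps.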

    32. Pagerank summary Preprocessing: Given graph of links, build matrix P. From it compute a. The entry a_i is a number between 0 and 1: the pagerank of page i. Query processing: Retrieve pages meeting query. Rank them by their pagerank. Order is query-independent. Slide from Chris Manning

    33. The reality Pagerank is used in Google, but is hardly the full story of ranking Many sophisticated features are used Some address specific query classes Machine learned ranking heavily used Slide from Chris Manning

    34. Pagerank: Issues and Variants How realistic is the random surfer model? What if we modeled the back button? Surfer behavior sharply skewed towards short paths. Search engines, bookmarks & directories make jumps non-random. Biased Surfer Models: Weight edge traversal probabilities based on match with topic/query (non-uniform edge selection). Bias jumps to pages on topic (e.g., based on personal bookmarks & categories of interest). Slide from Chris Manning

    35. Topic Specific Pagerank Goal: pagerank values that depend on query topic. Conceptually, we use a random surfer who teleports, with say 10% probability, using the following rule: Selects a category (say, one of the 16 top level ODP categories) based on a query- & user-specific distribution over the categories. Teleports to a page uniformly at random within the chosen category. Sounds hard to implement: can't compute PageRank at query time! Slide from Chris Manning

    36. Topic Specific Pagerank Offline: Compute pagerank for individual categories. Query independent as before. Each page has multiple pagerank scores, one for each ODP category, with teleportation only to that category. Online: Distribution of weights over categories computed by query context classification. Generate a dynamic pagerank score for each page - weighted sum of category-specific pageranks. Slide from Chris Manning

    37. Influencing PageRank (Personalization) Input: Web graph W; influence vector v over topics, v : (page → degree of influence). Output: Rank vector r : (page → page importance wrt v). r = PR(W, v) Slide from Chris Manning

    38. Non-uniform Teleportation Slide from Chris Manning

    39. Interpretation of Composite Score Given a set of personalization vectors {v_j}: Σ_j [w_j · PR(W, v_j)] = PR(W, Σ_j [w_j · v_j]). Given a user's preferences over topics, express as a combination of the basis vectors v_j. Slide from Chris Manning
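
The identity on this slide says that a weighted sum of per-topic pageranks equals the pagerank computed with the weighted teleport vector. The sketch below checks it numerically on a tiny made-up graph. It assumes a 10% teleport probability and a graph with no dead ends (dead-end handling would need its own rule); the topic names, weights, and function name are hypothetical.

```python
def personalized_pagerank(out_links, v, teleport=0.10, steps=500):
    """PR(W, v): random walk that teleports according to distribution v.

    out_links[i] lists the pages that page i links to (no dead ends assumed).
    """
    n = len(out_links)
    r = list(v)
    for _ in range(steps):
        # Teleport mass lands according to v ...
        nxt = [teleport * v[j] for j in range(n)]
        # ... and the rest follows out-links equiprobably.
        for i, links in enumerate(out_links):
            share = (1.0 - teleport) * r[i] / len(links)
            for j in links:
                nxt[j] += share
        r = nxt
    return r


# Hypothetical 4-page web graph with no dead ends.
W = [[1, 2], [2], [0], [0, 2]]
v_sports = [1.0, 0.0, 0.0, 0.0]  # "sports" topic teleports only to page 0
v_health = [0.0, 0.0, 0.0, 1.0]  # "health" topic teleports only to page 3
w = [0.9, 0.1]                   # user's weights over the two topics

pr_sports = personalized_pagerank(W, v_sports)
pr_health = personalized_pagerank(W, v_health)
# Left-hand side: weighted sum of the topic-specific pageranks.
combined = [w[0] * s + w[1] * h for s, h in zip(pr_sports, pr_health)]
# Right-hand side: pagerank with the weighted teleport vector.
mixed_v = [w[0] * s + w[1] * h for s, h in zip(v_sports, v_health)]
direct = personalized_pagerank(W, mixed_v)
```

The two vectors `combined` and `direct` agree up to floating-point error, which is why the per-topic pageranks can be computed offline and only mixed at query time.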

    40. Interpretation Slide from Chris Manning

    41. Interpretation Slide from Chris Manning

    42. Interpretation Slide from Chris Manning

    43. Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links on a subject, e.g., "Bob's list of cancer-related links." Authority pages occur recurrently on good hubs for the subject. Best suited for "broad topic" queries rather than for page-finding queries. Gets at a broader slice of common opinion. Slide from Chris Manning

    44. Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic. Circular definition - will turn this into an iterative computation. Slide from Chris Manning

    45. The hope Slide from Chris Manning

    46. High-level scheme Extract from the web a base set of pages that could be good hubs or authorities. From these, identify a small set of top hub and authority pages; iterative algorithm. Slide from Chris Manning
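
The transcript does not spell out the iterative algorithm, so the following is a sketch of the commonly described HITS updates under stated assumptions: an authority score is the sum of the hub scores of pages linking to it, a hub score is the sum of the authority scores of pages it links to, and scores are rescaled each round (here to sum to 1, one of several possible normalizations). The base set and page names are made up.

```python
def hits(edges, iterations=50):
    """Return (hub, authority) score dicts for a base set of pages.

    edges: list of (source_page, target_page) links within the base set.
    """
    pages = sorted({p for edge in edges for p in edge})
    hub = {p: 1.0 for p in pages}
    auth = {p: 0.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, dst in edges:
            auth[dst] += hub[src]
        # Hub: sum of authority scores of the pages it points to.
        new_hub = {p: 0.0 for p in pages}
        for src, dst in edges:
            new_hub[src] += auth[dst]
        # Rescale so scores stay bounded; only relative values matter.
        a_total = sum(auth.values()) or 1.0
        h_total = sum(new_hub.values()) or 1.0
        auth = {p: s / a_total for p, s in auth.items()}
        hub = {p: s / h_total for p, s in new_hub.items()}
    return hub, auth


# Hypothetical base set: three link lists pointing at two content pages.
edges = [("hub1", "A"), ("hub1", "B"),
         ("hub2", "A"), ("hub2", "B"),
         ("hub3", "A")]
hub, auth = hits(edges)
top_authority = max(auth, key=auth.get)
```

In this example "A" ends up as the top authority, and "hub1" and "hub2" outrank "hub3" as hubs because they also point to "B".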

    47. Spam in Search Or, Search Engine Optimization

    48. The trouble with paid search ads It costs money. What's the alternative? Search Engine Optimization: "Tuning" your web page to rank highly in the algorithmic search results for select keywords. Alternative to paying for placement. Thus, intrinsically a marketing function. Performed by companies, webmasters and consultants ("Search engine optimizers") for their clients. Some perfectly legitimate, some very shady. Slide from Chris Manning

    49. Search engine optimization (Spam) Motives Commercial, political, religious, lobbies Promotion funded by advertising budget Operators Contractors (Search Engine Optimizers) for lobbies, companies Web masters Hosting services Forums E.g., Web master world ( www.webmasterworld.com ) Search engine specific tricks Discussions about academic papers Slide from Chris Manning

    50. Simplest forms First generation engines relied heavily on tf/idf. The top-ranked pages for the query "maui resort" were the ones containing the most "maui"s and "resort"s. SEOs responded with dense repetitions of chosen terms, e.g., "maui resort maui resort maui resort". Often, the repetitions would be in the same color as the background of the web page. Repeated terms got indexed by crawlers, but were not visible to humans on browsers. Slide from Chris Manning
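
To see why dense repetition worked, consider a deliberately simplified scorer that just counts query-term occurrences (a stand-in for the tf part of tf/idf; real engines were more involved). The scoring function and both page texts are made up for illustration.

```python
def term_frequency_score(text, query):
    """Score a page by how many times the query terms occur in it."""
    words = text.lower().split()
    return sum(words.count(term) for term in query.lower().split())


honest_page = "Our Maui resort has ocean views and a quiet beach"
# Keyword-stuffed page: the target terms repeated 20 times.
stuffed_page = "maui resort " * 20 + "book now"

honest_score = term_frequency_score(honest_page, "maui resort")
stuffed_score = term_frequency_score(stuffed_page, "maui resort")
```

Under this scorer the stuffed page scores 40 against 2 for the honest page, so it would be ranked first.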

    51. Variants of keyword stuffing Misleading meta-tags, excessive repetition Hidden text with colors, style sheet tricks, etc. Slide from Chris Manning

    52. Cloaking Serve fake content to search engine spider DNS cloaking: Switch IP address. Impersonate Slide from Chris Manning

    53. More spam techniques Doorway pages Pages optimized for a single keyword that re-direct to the real target page Link spamming Mutual admiration societies, hidden links, awards Domain flooding: numerous domains that point or re-direct to a target page Robots Millions of submissions via Add-Url Slide from Chris Manning

    54. The war against spam Quality signals - Prefer authoritative pages based on: Votes from authors (linkage signals) Votes from users (usage signals) Robust link analysis Ignore statistically implausible linkage (or text) Use link analysis to detect spammers (guilt by association) Spam recognition by machine learning Training set based on known spam Family friendly filters Linguistic analysis, general classification techniques, etc. For images: flesh tone detectors, source text analysis, etc. Editorial intervention Blacklists Top queries audited Complaints addressed Suspect pattern detection Slide from Chris Manning

    55. More on spam Web search engines have policies on SEO practices they tolerate/block http://help.yahoo.com/help/us/ysearch/index.html http://www.google.com/intl/en/webmasters/ Adversarial IR: the unending (technical) battle between SEOs and web search engines Research http://airweb.cse.lehigh.edu/ Slide from Chris Manning

    56. NY Times article on JC Penney link spam http://www.nytimes.com/2011/02/13/business/13search.html In the last several months, JCPenney.com was in the #1 spot for: "dresses", "bedding", "area rugs". Someone paid to have thousands of links placed on hundreds of sites scattered around the Web: 2,015 pages with phrases like "casual dresses", "evening dresses", "little black dress" or "cocktail dress". The NY Times informed Google. At 7 p.m., J. C. Penney was the #1 result for "Samsonite carry on luggage". Two hours later, it was at No. 71.

    57. New Problem: Content Farms Demand Media: eHow, etc. http://www.wired.com/magazine/2009/10/ff_demandmedia/all/1 Demand Media's legion of low-paid writers pump out 4,000 video clips and articles a day. It starts with an algorithm based on: Search terms (popular terms from more than 100 sources comprising 2 billion searches a day), The ad market (a snapshot of which keywords are sought after and how much they are fetching), The competition (what's online already and where a term ranks in search results). Wired on Google's change: http://www.wired.com/epicenter/2011/02/google-clamp-down-content-factories/ Google updated its core ranking algorithm to decrease the prevalence of content farms in top search results.

    58. How to address content farms? From Google blog: "We've been exploring different algorithms to detect content farms, which are sites with shallow or low-quality content. One of the signals we're exploring is explicit feedback from users. To that end, today we're launching an early, experimental Chrome extension so people can block sites from their web search results. If installed, the extension also sends blocked site information to Google, and we will study the resulting feedback and explore using it as a potential ranking signal for our search results."
