
Crawling, Ranking and Indexing



  1. Crawling, Ranking and Indexing

  2. Organizing the Web • The Web is big. Really big. • Over 3 billion pages, just in the indexable Web • The Web is dynamic • Problems: • How to store a database of links? • How to crawl the web? • How to recommend pages that match a query?

  3. Architecture of a Search Engine • 1. A web crawler gathers a snapshot of the Web • 2. The gathered pages are indexed for easy retrieval • 3. The user submits a search query • 4. The search engine ranks pages that match the query and returns an ordered list

  4. Indexing the Web • Once a crawl has collected pages, the full text is compressed and stored in a repository • Each URL is mapped to a unique ID • A document index is created • For each document, it contains a pointer into the repository, status, checksum, and pointers to the URL & title • A hit list is created for each word in the lexicon • Records the occurrences of a word in a particular document, including position, font, capitalization, “plain or fancy” • Fancy: the word occurs in a title, tag, or URL
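
A minimal sketch, assuming plausible field names, of how the document-index records and hit lists described above might be represented; the exact layout (the Hit fields, the record fields) is an illustrative assumption, not the structure of any particular engine.

    from dataclasses import dataclass

    @dataclass
    class Hit:
        # One occurrence of a word in one document.
        position: int       # word offset within the document
        fancy: bool         # True if it occurs in a title, tag, or URL
        capitalized: bool

    @dataclass
    class DocumentRecord:
        doc_id: int         # unique ID assigned to the URL
        url: str
        title: str
        repo_offset: int    # pointer into the compressed repository
        checksum: str       # hash of the page text, used to detect changes
        status: str = "crawled"

    # Hit lists: word -> docID -> occurrences of the word in that document
    hit_lists: dict[str, dict[int, list[Hit]]] = {}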

  5. Indexing the Web • Each word in the hit list has a wordID • A forward index is created • 64 barrels; each contains a range of wordIDs • If a document contains words for a particular barrel, its docID is added, along with a list of wordIDs and hit lists • Maps documents to words • Wrinkle: can use TF-IDF to keep only “significant” keywords • TF-IDF = Term Frequency × Inverse Document Frequency
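
A small sketch of TF-IDF scoring as the slide suggests; the raw-count TF and unsmoothed logarithmic IDF used here are one common convention among several. Keeping only each document's top-scoring terms would shrink the forward index at some cost in recall.

    import math
    from collections import Counter

    def tfidf(docs: list[list[str]]) -> list[dict[str, float]]:
        """Score every term of every tokenized document by TF * IDF."""
        n = len(docs)
        # Document frequency: in how many documents does each term appear?
        df = Counter(term for doc in docs for term in set(doc))
        scores = []
        for doc in docs:
            tf = Counter(doc)
            scores.append({
                term: (count / len(doc)) * math.log(n / df[term])
                for term, count in tf.items()
            })
        return scores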

  6. Indexing the Web • An inverted index is created • The forward index is sorted according to word • For every valid wordID in the lexicon, create a pointer to the appropriate barrel • Points to a list of docIDs and hit lists • Maps keywords to URLs • Some wrinkles: • Morphology: stripping suffixes (stemming), singular vs. plural, tense, case folding • Semantic similarity: words with similar meanings share an index entry • Issue: trading coverage (number of hits) for precision (how closely hits match the request)
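
A minimal sketch of building an inverted index from tokenized documents, with a crude suffix-stripping rule standing in for real morphology (an assumption; production systems use a proper stemmer such as Porter's).

    from collections import defaultdict

    def stem(word: str) -> str:
        # Crude stand-in for a stemmer: case folding plus stripping a plural "s".
        w = word.lower()
        return w[:-1] if w.endswith("s") and len(w) > 3 else w

    def build_inverted_index(docs: dict[int, list[str]]) -> dict[str, list[int]]:
        """Map each stemmed keyword to the sorted docIDs that contain it."""
        index = defaultdict(set)
        for doc_id, words in docs.items():
            for w in words:
                index[stem(w)].add(doc_id)
        return {w: sorted(ids) for w, ids in index.items()}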

  7. Indexing Issues • Indexing techniques were designed for static collections • How to deal with pages that change? • Periodic crawls, rebuild the index • Crawls at varied frequencies (revisit frequently changing pages more often) • Records need a way to be “purged” • A hash of each page is stored, so changed content can be detected • The text of a link to a page can be used to help label that page • Helps eliminate the addition of spurious keywords
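
A small sketch of how the stored hash might be used to decide whether a recrawled page needs reindexing; choosing MD5 as the checksum is an illustrative assumption.

    import hashlib

    def checksum(page_text: str) -> str:
        return hashlib.md5(page_text.encode("utf-8")).hexdigest()

    def needs_reindex(stored_checksum: str, fetched_text: str) -> bool:
        """Reindex a recrawled page only if its content actually changed."""
        return checksum(fetched_text) != stored_checksum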

  8. Indexing Issues • Availability and speed • Most search engines will cache the page being referenced. • Multiple search terms • OR: separate searches concatenated • AND: intersection of searches computed. • Regular expressions not typically handled. • Parsing • Must be able to handle malformed HTML, partial documents
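
A sketch of multi-term queries, assuming the inverted index built earlier (term -> sorted list of docIDs): OR takes the union of the posting lists, AND their intersection.

    def query_or(index: dict[str, list[int]], terms: list[str]) -> set[int]:
        """OR: union of the posting lists of all terms."""
        return set().union(*(index.get(t, []) for t in terms))

    def query_and(index: dict[str, list[int]], terms: list[str]) -> set[int]:
        """AND: intersection of the posting lists of all terms."""
        postings = [set(index.get(t, [])) for t in terms]
        return set.intersection(*postings) if postings else set()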

  9. Ranking • The primary challenge of a search engine is to return results that match a user’s needs. • A word will potentially map to millions of documents • How to order them?

  10. PageRank • Google uses PageRank to determine relevance • Based on the “quality” of a page’s inward links • A simplified version: • Let N_v be the number of outward links of page v • R(p) = c * Sum_{v in inward(p)} R(v) / N_v • c is a normalizing factor

  11. PageRank • Sum the PageRank of each page that points to a given page, each divided by that page’s outdegree • Let p be a page, with T1 … Tn linking to p • PR(p) = (1-d) + d * Sum_{i=1..n} PR(T_i) / out(T_i) • out(T_i) is the number of outward links of T_i • d is a ‘damping’ factor • PR ‘propagates’ through a graph • Defined recursively, but can be computed iteratively • Repeat until PR does not change by more than some delta
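
A minimal sketch of the iterative computation just described; the damping value 0.85 and the stopping threshold are conventional choices, not values fixed by the slides.

    def pagerank(links: dict[str, list[str]], d: float = 0.85,
                 delta: float = 1e-6, max_iter: int = 100) -> dict[str, float]:
        """Iterate PR(p) = (1-d) + d * sum(PR(t)/out(t)) until convergence."""
        pages = set(links) | {v for outs in links.values() for v in outs}
        # inward[p] lists the pages that link to p
        inward = {p: [] for p in pages}
        for src, outs in links.items():
            for dst in outs:
                inward[dst].append(src)
        pr = {p: 1.0 for p in pages}
        for _ in range(max_iter):
            new = {p: (1 - d) + d * sum(pr[t] / len(links[t]) for t in inward[p])
                   for p in pages}
            # Stop once no page's rank moves by more than delta.
            if max(abs(new[p] - pr[p]) for p in pages) < delta:
                return new
            pr = new
        return pr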

  12. PageRank • Intuition: A page is useful if many popular sites link to it • Justification: • Imagine a random surfer who keeps clicking through links • 1-d is the probability that, at each step, she gets bored and jumps to a random page instead of following a link • Pros: difficult to game the system • Cons: creates a “rich get richer” web structure where highly popular sites grow in popularity

  13. HITS • HITS is also commonly used for document ranking. • Gives each page a hub score and an authority score • A good authority is pointed to by many good hubs. • A good hub points to many good authorities. • Users want good authorities.

  14. Hubs and Authorities • Common community structure • Hubs • Many outward links • Lists of resources • Authorities • Many inward links • Provide resources, content
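
A minimal sketch of the mutually reinforcing updates the two slides above describe: authority scores accumulate from inbound hubs, hub scores from outbound authorities, with normalization each round. The fixed iteration count stands in for a convergence test.

    import math

    def hits(links: dict[str, list[str]], iters: int = 50):
        """Return (hub, authority) scores for every page in the link graph."""
        pages = set(links) | {v for outs in links.values() for v in outs}
        hub = {p: 1.0 for p in pages}
        auth = {p: 1.0 for p in pages}
        for _ in range(iters):
            # A good authority is pointed to by many good hubs.
            auth = {p: 0.0 for p in pages}
            for src, outs in links.items():
                for dst in outs:
                    auth[dst] += hub[src]
            # A good hub points to many good authorities.
            hub = {p: sum(auth[dst] for dst in links.get(p, [])) for p in pages}
            # Normalize so the scores do not grow without bound.
            for scores in (auth, hub):
                norm = math.sqrt(sum(s * s for s in scores.values())) or 1.0
                for p in scores:
                    scores[p] /= norm
        return hub, auth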

  15. Hubs and Authorities • [Diagram: a bipartite link pattern, hubs pointing to authorities] • Estimates from link structure: over 100,000 Web communities • Often not categorized by portals

  16. Issues with Ranking Algorithms • Spurious keywords and META tags • Users reinforcing each other • Increases “authority” measure • Link Similarity vs. Content similarity • Topic drift • Many hubs link to more than one topic

  17. Crawling the web • How to collect Web data in the first place? • Spiders are used to crawl the web and collect pages. • A page is downloaded and its outward links are found. • Each outward link is then downloaded. • Exceptions: • Links from CGI interfaces • Robot Exclusion Standard

  18. Crawling the Web • We may want to be a bit smarter about selecting documents to crawl • Web is too big • Building a special-purpose search engine • Indexing a particular site • Choosing where to go first is a hard problem.

  19. Crawling the Web • Basic Algorithm: • Let Q be a queue, and S be a starting node • Enqueue(Q, S) • While (notEmpty(Q)) • W = Dequeue(Q) • V1, …, Vn = outwardLinks(W) • Enqueue(Q, V1, …, Vn) • The queue of not-yet-visited links is called the frontier • The Enqueue function is the tricky part (a runnable sketch follows)
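
A runnable sketch of this queue-driven crawl; fetch and extract_links are caller-supplied stand-ins for a real HTTP client and HTML parser (assumptions, not a fixed API), and a visited set is added so no page is fetched twice.

    from collections import deque

    def crawl(start_url: str, fetch, extract_links, max_pages: int = 1000):
        """Breadth-first crawl: dequeue a page, fetch it, enqueue its links."""
        queue = deque([start_url])   # the frontier
        visited = set()
        pages = {}
        while queue and len(pages) < max_pages:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)
            page = fetch(url)
            pages[url] = page
            # This is the "tricky" Enqueue: a smarter policy would
            # prioritize links here (best-first, PageRank-guided, ...).
            for link in extract_links(page):
                if link not in visited:
                    queue.append(link)
        return pages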

  20. Crawling the Web • BestFirst • Sorts the queue according to cosine similarity to the starting document • sim(S,V) = Sum_{w in S and V} f_wS * f_wV / sqrt(Sum_{w in S} f_wS^2 * Sum_{w in V} f_wV^2) • f_wD is the frequency of word w in document D • This is the cosine of the angle between the two term-frequency vectors • Expand documents most similar to the starting document
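
A direct transcription of the formula above; splitting on whitespace as the tokenizer is a simplifying assumption.

    import math
    from collections import Counter

    def cosine_similarity(s_text: str, v_text: str) -> float:
        """sim(S,V): normalized dot product of the term-frequency vectors."""
        fs = Counter(s_text.lower().split())
        fv = Counter(v_text.lower().split())
        num = sum(fs[w] * fv[w] for w in fs.keys() & fv.keys())
        den = math.sqrt(sum(c * c for c in fs.values()) *
                        sum(c * c for c in fv.values()))
        return num / den if den else 0.0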

  21. Crawling the Web • PageRank can also be used to guide a crawl. • PageRank was designed to model a random walk through a web graph. • Select pages probabilistically based on their PageRank • One issue: PageRank must be recomputed frequently. • Leads to a crawl of the most “valuable” sites.

  22. Web structure • Structure is important for: • Predicting traffic patterns • Who will visit a site? • Where will visitors arrive from? • How many visitors can you expect? • Estimating coverage • Is a site likely to be indexed?

  23. Core • Compact • Short paths between sites • “Small world” phenomenon: average distances between sites are small relative to the size of the graph • The number of inward and outward links follows a power law • Mechanism: preferential attachment • As new sites arrive, the probability of gaining an inward link is proportional to in-degree (a simulation sketch follows)
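
A tiny simulation of the preferential-attachment mechanism just described; the rule of one new inward link per arriving site (with +1 smoothing so unlinked sites can still be chosen) is a simplifying assumption. The resulting in-degree distribution is heavy-tailed, consistent with a power law.

    import random

    def preferential_attachment(n_sites: int, seed: int = 0) -> list[int]:
        """Grow a graph where link targets are chosen proportionally to in-degree."""
        rng = random.Random(seed)
        in_degree = [0]                  # start from a single site
        for _ in range(1, n_sites):
            # Probability of gaining the new link is proportional to in-degree.
            weights = [d + 1 for d in in_degree]
            target = rng.choices(range(len(in_degree)), weights=weights)[0]
            in_degree[target] += 1
            in_degree.append(0)          # the newcomer arrives with no inward links
        return in_degree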

  24. Power laws and small worlds • Power laws occur everywhere in nature • Distribution of site sizes, city sizes, incomes, word frequencies, business sizes, earthquake magnitudes, spread of disease • Growing networks tend to evolve toward power-law degree distributions • Small-world phenomenon • “Neighborhoods” will be joined by a common member • Hubs serve to connect neighborhoods • Linkage is closer than one might expect • Application: construction of networks and protocols that produce maximal flow/efficiency

  25. Local structure • More diverse than a power law • Pages with similar topics self-organize into communities • Short average path length • High link density • Webrings • Inverse: Does a high link density imply the existence of a community? • Can this be used to study the emergence and growth of web communities?

  26. Web Communities • Alternate definition • Each member has more links to community members than non-community members. • Extension of a clique. • Can be discovered with network flow algorithms. • Can be used to discover new “categories” • Help people interested in a topic find each other. • Focused crawling, filtering, recommender systems
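
A small check of the alternate definition above, assuming links maps each page to the set of pages it links to or is linked from; actually discovering such sets with network-flow algorithms is the harder problem and is not sketched here.

    def is_community(members: set[str], links: dict[str, set[str]]) -> bool:
        """True if every member has more links inside the set than outside."""
        for page in members:
            neighbors = links.get(page, set())
            if len(neighbors & members) <= len(neighbors - members):
                return False
        return True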
