Presentation Transcript


  1. National & Kapodistrian University of Athens, Dept. of Informatics & Telecommunications, MSc. in Computer Systems Technology, Distributed Systems
  Searching the Web
  By A. Arasu, J. Cho, H. Garcia-Molina, A. Paepcke, S. Raghavan
  Giorgos Matrozos, M 414, g.matrozos@di.uoa.gr

  2. This paper is about … Search Engines
  • Generic architecture
  • Each component’s architecture
  • Each component’s design and implementation techniques:
    • Crawling
    • Page storage
    • Indexing
    • Link analysis

  3. A Quick Look
  • Why use search engines, and why is their work hard?
    Ans: Over a billion pages, a great growth rate, about 23% of pages changing daily, and very complicated linking between pages.
  • What about Information Retrieval?
    Ans: IR techniques are used, but on their own they are unsuitable, because they were designed for small, coherent collections. The Web, on the other hand, is massive, incoherent, distributed and rapidly changing.

  4. Search Engine Components
  A search engine consists of:
  • a Crawler module
  • a Crawler Control module
  • a Page Repository
  • an Indexer module
  • a Collection Analysis module
  • a Utility Index
  • a Query Engine module
  • a Ranking module

  5. General Search Engine Architecture

  6. The Crawler Module
  • Starts with an initial set of URLs S0
  • Retrieves URLs from a prioritized queue
  • Downloads each page, extracts any new URLs and places them in the queue
  • Repeats until it decides to stop (a sketch of this loop follows below)
  But some questions arise:
  • Which pages should the crawler download? Ans: page selection methods
  • How should the crawler refresh pages? Ans: page refresh methods
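A minimal sketch of this crawl loop in Python. The fetch, extract_links and priority functions are placeholder parameters assumed for illustration, not the paper's implementation:

```python
import heapq
from urllib.parse import urljoin

def crawl(seed_urls, fetch, extract_links, priority, max_pages=1000):
    """Generic crawl loop: pop the highest-priority URL, download it,
    enqueue newly discovered URLs, stop after max_pages downloads."""
    # heapq is a min-heap, so push negated priorities to pop the "best" URL first
    queue = [(-priority(u), u) for u in seed_urls]
    heapq.heapify(queue)
    seen = set(seed_urls)
    pages = {}

    while queue and len(pages) < max_pages:
        _, url = heapq.heappop(queue)
        page = fetch(url)          # placeholder: download the page contents
        if page is None:
            continue
        pages[url] = page
        for raw_link in extract_links(page):   # placeholder: parse out hrefs
            link = urljoin(url, raw_link)
            if link not in seen:
                seen.add(link)
                heapq.heappush(queue, (-priority(link), link))
    return pages
```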

  7. Page Selection
  The crawler may want to download important pages first, so that the collection is of good quality. But:
  • What is important? → Importance metrics
  • How does the crawler operate? → Crawler models
  • How does the crawler guess which pages are good? → Ordering metrics

  8. Importance Metrics I
  • Interest Driven
  Given a query Q, the importance of a page P is defined as the textual similarity between P and Q. P and Q are represented as vectors <w1, …, wn>, where wi corresponds to the ith word of the vocabulary: wi is the number of appearances of the word in the document, multiplied by idf (inverse document frequency), where idf is based on the number of appearances of the word in the whole collection. The similarity IS(P) is the cosine product of the P and Q vectors. During a crawl, idf cannot be computed exactly because it relies on global information; if we want to use idf factors, they must be estimated from reference idf statistics gathered at other times. The similarity computed this way, IS'(P), is only an estimate, because we have not yet seen the entire collection needed to compute the actual IS(P).
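A small sketch of estimating IS'(P) as a cosine similarity using precomputed reference idf values. The tokenizer and the reference_idf dictionary are simplified assumptions, not the paper's code:

```python
import math
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_vector(text, reference_idf):
    """Weight each term by its count in the text times a reference idf value."""
    counts = Counter(tokenize(text))
    return {term: count * reference_idf.get(term, 1.0) for term, count in counts.items()}

def estimated_similarity(page_text, query_text, reference_idf):
    """IS'(P): cosine similarity between the page and query tf-idf vectors."""
    p = tfidf_vector(page_text, reference_idf)
    q = tfidf_vector(query_text, reference_idf)
    dot = sum(w * q[t] for t, w in p.items() if t in q)
    norm = math.sqrt(sum(w * w for w in p.values())) * math.sqrt(sum(w * w for w in q.values()))
    return dot / norm if norm else 0.0
```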

  9. Importance Metrics II
  • Popularity Driven
  One way to define popularity is to use a page’s backlink count, that is, the number of links that point to the page. This count determines its popularity metric IB(P). Note that the crawler can only estimate IB'(P), because the exact metric requires information about the whole Web, and the estimate may be inaccurate early in the crawl. A more sophisticated but similar technique is also used in PageRank.
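A sketch of estimating IB'(P) from the portion of the link graph seen so far. The crawled_links structure is an assumption for illustration:

```python
from collections import Counter

def estimated_backlink_counts(crawled_links):
    """IB'(P): count incoming links using only the pages crawled so far.

    crawled_links maps each crawled URL to the list of URLs it links to.
    """
    counts = Counter()
    for source, targets in crawled_links.items():
        for target in set(targets):   # count each source page at most once
            if target != source:      # ignore self-links
                counts[target] += 1
    return counts
```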

  10. Importance Metrics III
  • Location Driven
  IL(P) is a function of the page’s location, not of its contents: if URL u leads to P, then IL(P) is a function of u. This is a way to evaluate a page’s importance through its location. One such measure is the number of slashes in the URL: addresses with fewer slashes are considered more useful.
  FINALLY → a combined metric: IC(P) = k1 * IS(P) + k2 * IB(P) + k3 * IL(P)

  11. Crawler Models I
  For a given importance metric, the crawler’s guesses are evaluated with a quality metric.
  • Crawl and Stop
  The crawler starts with an initial page P0 and stops after visiting K pages, where K is fixed (the number of pages downloaded in one crawl). A perfect crawler would have visited the pages ranked R1…RK according to the importance metric (the “hot” pages). The real crawler visits only M (≤ K) of these hot pages, so its performance is PCS(C) = M*100/K. A crawler that visits pages at random would have an expected performance of K*100/T, where T is the number of pages in the entire Web: each visited page is a hot page with probability K/T, so the expected number of hot pages visited before the crawler stops is K²/T.

  12. Crawler Models II
  • Crawl and Stop with Threshold
  In this model there is an importance target G, and only pages with importance higher than G are considered hot. Assume their number is H. The performance PST(C) is the percentage of the H hot pages that the crawler visits.
  • If K < H, the ideal crawler has performance K*100/H.
  • If K ≥ H, the ideal crawler has performance 100%.
  A random crawler is expected to visit (H/T)*K hot pages when it stops, so its performance is K*100/T.
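Restating the two models' performance measures in formula form, following the text above (K pages crawled, M of them hot, H hot pages in total, T pages on the Web):

```latex
% Crawl and Stop
P_{CS}(C) = \frac{M}{K} \cdot 100,
\qquad
E\big[P_{CS}(\mathrm{random})\big] = \frac{K}{T} \cdot 100

% Crawl and Stop with Threshold (hot = importance above G)
P_{ST}(\mathrm{ideal}) = \min\!\left(\frac{K}{H},\, 1\right) \cdot 100,
\qquad
E\big[P_{ST}(\mathrm{random})\big] = \frac{K}{T} \cdot 100
```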

  13. Ordering Metrics
  The crawler selects the next URL from its queue according to an ordering metric. The ordering metric can only use information already seen by the crawler, and it should be designed with an importance metric in mind. For example, if the crawler searches for high-popularity pages, the natural ordering metric is IB'(P); location metrics can also be used. It is hard to derive an ordering metric from the similarity metric, since we have not seen P yet.

  14. Page Refresh
  After downloading pages, the crawler has to refresh them periodically. Two strategies:
  • Uniform Refresh Policy: revisit all pages at the same frequency f, regardless of how often they change.
  • Proportional Refresh Policy: assume λi is the change frequency of page ei and fi is the crawler’s revisit frequency for ei; then the ratio λi/fi is kept the same for every i.
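A small sketch of how the two policies would allocate a fixed refresh budget. The function names and the budget parameter are illustrative assumptions:

```python
def uniform_refresh(change_rates, total_refreshes_per_day):
    """Uniform policy: every page gets the same revisit frequency."""
    n = len(change_rates)
    return {page: total_refreshes_per_day / n for page in change_rates}

def proportional_refresh(change_rates, total_refreshes_per_day):
    """Proportional policy: revisit frequency f_i is proportional to the
    change frequency lambda_i, so lambda_i / f_i is the same for all pages."""
    total_change = sum(change_rates.values())
    return {page: total_refreshes_per_day * rate / total_change
            for page, rate in change_rates.items()}

# Example: e1 changes 9 times/day, e2 once/day, budget of 10 refreshes/day
rates = {"e1": 9.0, "e2": 1.0}
print(uniform_refresh(rates, 10))       # {'e1': 5.0, 'e2': 5.0}
print(proportional_refresh(rates, 10))  # {'e1': 9.0, 'e2': 1.0}
```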

  15. Freshness and Age Metrics
  Some definitions:
  • Freshness of a local page ei at time t
  • Freshness of the local collection S at time t
  • Age of a local page ei at time t
  • Age of the local collection S at time t
  • The time average of the freshness of ei and of S (and, similarly, the time average of age)
  All of the above are approximations.
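The formulas themselves did not survive this transcript; the following is a reconstruction of the standard definitions from the Cho & Garcia-Molina freshness work that the slides draw on:

```latex
% Freshness of page e_i and of the collection S = {e_1, ..., e_N} at time t
F(e_i;t) = \begin{cases} 1 & \text{if } e_i \text{ is up-to-date at time } t \\ 0 & \text{otherwise} \end{cases}
\qquad
F(S;t) = \frac{1}{N} \sum_{i=1}^{N} F(e_i;t)

% Age: how long the local copy has been out of date
A(e_i;t) = \begin{cases} 0 & \text{if } e_i \text{ is up-to-date at time } t \\ t - t_{\mathrm{mod}}(e_i) & \text{otherwise} \end{cases}
\qquad
A(S;t) = \frac{1}{N} \sum_{i=1}^{N} A(e_i;t)

% where t_mod(e_i) is the time of the first unrefreshed change of e_i.
% Time averages (defined analogously for age):
\bar{F}(e_i) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(e_i;t)\,dt,
\qquad
\bar{F}(S) = \lim_{t \to \infty} \frac{1}{t} \int_0^t F(S;t)\,dt
```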

  16. Refresh Strategy I
  Note that crawlers can download or update only a limited number of pages within a period, because they have limited resources. Consider a simple example: a collection of two pages e1 and e2, where e1 changes 9 times per day and e2 changes once a day. For e1, the day is split into 9 intervals and e1 changes once and only once in each interval, but we do not know precisely when. Similarly, e2 changes once and only once each day, but we do not know precisely when. Assume our crawler can refresh one page per day. But which page? If we refresh e2 in the middle of the day and its change happened before that point, e2 will be up-to-date for the remaining half day. The probability that the change happens before the middle of the day is 1/2, so the expected benefit is 1/4 of a day, and so on.

  17. Refresh Strategy II
  It can be proved mathematically that the uniform refresh policy is always superior or equal to the proportional policy, for any number of pages, change frequencies and refresh rates, under both the freshness and the age metrics. The optimal solution assumes that pages change following a Poisson process and that their change frequencies are static. The proof and the idea behind this statement are described in “Cho, Garcia-Molina, Synchronizing a Database to Improve Freshness, International Conference on Management of Data, 2000”.

  18. Storage
  The page repository must manage a large collection of web pages. There are four challenges:
  • Scalability. It must be possible to distribute the repository across a cluster of computers and disks to cope with the size of the Web.
  • Dual access modes. Random access is used to quickly retrieve a specific web page, while streaming access is used to read the entire collection. The first is used by the Query Engine, the second by the Indexer and Analysis modules.
  • Large bulk updates. As new versions of pages are stored, the space occupied by the old ones must be reclaimed through compaction and reorganization.
  • Obsolete pages. A mechanism is needed for detecting and removing obsolete pages.

  19. Page Distribution Policies
  Assumption: the repository is designed to function over a cluster of interconnected storage nodes.
  • Uniform distribution. A page can be stored at any node, independently of its identifier.
  • Hash distribution. The page identifier is hashed to yield a node identifier, and the page is stored at the corresponding node.
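A minimal sketch of hash-based distribution. The node count and the use of MD5 over a normalized URL are illustrative assumptions:

```python
import hashlib

def node_for_page(normalized_url, num_nodes):
    """Hash distribution: map a page identifier to one of num_nodes storage nodes."""
    digest = hashlib.md5(normalized_url.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes

# Example: with 8 storage nodes, any node can locate a page without a lookup table
print(node_for_page("http://www.example.com/index.html", 8))
```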

  20. Physical Page Organization Methods
  Within a node, there are three operations to support: page addition/insertion, high-speed streaming and random page access. Possible organization methods:
  • Hash-based
  • Log-structured
  • Hashed-log

  21. Update Strategies
  • Batch-mode or steady crawler. A batch-mode crawler is a periodic crawler that crawls for a certain amount of time, so the repository receives updates only on a certain number of days each month. In contrast, a steady crawler crawls without any pause and updates the repository continuously.
  • Partial or complete crawls. Depending on the crawl, the update can be:
    • In place: the pages are directly integrated into the repository’s existing collection, possibly replacing older versions.
    • Shadowing: the new pages are stored separately, and the update is applied in a separate step.

  22. The Stanford WebBase Repository
  • It is a distributed storage system that works with the Stanford WebCrawler.
  • The repository employs a node manager to monitor the storage nodes and collect status information.
  • Since the Stanford crawler is a batch crawler, the repository applies a shadowing technique.
  • URLs are first normalized to yield a canonical representation; the page identifier is computed as a signature of this normalized URL.

  23. Indexing
  • Structure (or link) index. The Web is modeled as a graph: the nodes are pages and the edges are hyperlinks from one page to another. The index provides neighborhood information: given a page P, retrieve the pages pointed to by P or the pages pointing to P.
  • Text (or content) index. Text-based retrieval continues to be the primary method for identifying pages relevant to a query. Indices to support this retrieval can be implemented with suffix arrays, inverted files (inverted indices) and signature files.
  • Utility indices. Special-purpose indices, such as site indices for searching within a single domain.
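A toy inverted index, just to illustrate the text-index structure that the following slides build at scale. The posting format of (page_id, position) is a simplified assumption:

```python
import re
from collections import defaultdict

def build_inverted_index(pages):
    """Map each term to a sorted list of (page_id, position) postings.

    pages maps a page id to its text.
    """
    index = defaultdict(list)
    for page_id, text in pages.items():
        for position, term in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[term].append((page_id, position))
    return {term: sorted(postings) for term, postings in index.items()}

index = build_inverted_index({1: "searching the web", 2: "the web graph"})
print(index["web"])   # [(1, 2), (2, 1)]
```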

  24. WebBase text-indexing system I
  Three types of nodes:
  • Distributors, which store the pages to be indexed
  • Indexers, which execute the core of the index-building engine
  • Query servers, across which the final inverted index is partitioned
  The inverted index is built in two stages:
  • Each distributor runs a process that disseminates the pages to the indexers, so that each indexer receives a mutually disjoint subset. The indexers extract postings, sort them and flush them to intermediate structures on disk.
  • The intermediate structures are merged to create the inverted files and their lexicons; these pairs are transferred to the query servers.

  25. WebBase text-indexing system II
  The core of the indexing is the index-builder process, which can be parallelized as a pipeline of three phases: loading, processing and flushing.
  • Loading: pages are read and stored in memory.
  • Processing: pages are parsed and stored as a set of postings in a memory buffer; the postings are then sorted, first by term and then by location.
  • Flushing: the sorted postings are saved to disk as a sorted run.

  26. WebBase Indexing System Statistics I
  One of the most commonly used statistics is idf. The idf of a term w is log(N/dfw), where N is the total number of pages in the collection and dfw is the number of pages that contain at least one occurrence of w. To avoid the query-time overhead, WebBase computes and stores statistics as part of index creation.
  • Avoiding explicit I/O for statistics: local data are sent to the statistician only when they are already available in memory. Two strategies:
    • ME, FL: send local information during merging or during flushing.
    • Local aggregation: multiple postings for a term pass through memory in groups, e.g. 1000 postings for “cat”; only the pair (“cat”, 1000) needs to be sent to the statistician.
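A small sketch of computing idf from document-frequency counts at index-creation time. The postings format follows the toy index above; the function name is an assumption:

```python
import math

def idf_statistics(index, total_pages):
    """idf(w) = log(N / df_w), where df_w is the number of pages containing w."""
    stats = {}
    for term, postings in index.items():
        df = len({page_id for page_id, _ in postings})   # distinct pages containing the term
        stats[term] = math.log(total_pages / df)
    return stats

print(idf_statistics({"web": [(1, 2), (2, 1)], "graph": [(2, 2)]}, total_pages=2))
# {'web': 0.0, 'graph': 0.693...}
```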

  27. PageRank I
  PageRank extends the basic idea of citation counting by taking into consideration the importance of the pages pointing to a given page. Thus a page receives more importance if, say, YAHOO points to it than if an unknown page does. Note that the definition of PageRank is recursive.
  • Simple PageRank
  Let 1…m be the pages of the Web, N(i) the number of outgoing links of page i, and B(i) the set of pages that point to i; the rank is then defined as shown below. This definition leads to the idea of random walks, the so-called Random Surfer Model: it can be proved that the PageRank of a page is proportional to the frequency with which a random surfer would visit it.
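The formula itself was lost with the slide graphics; the standard simple PageRank definition in the notation above is:

```latex
r(i) = \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```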

  28. PageRank II
  • Practical PageRank
  Simple PageRank is well defined only if the link graph is strongly connected, which is not the case for the Web. A rank sink is a connected cluster of pages with no outgoing links to the rest of the graph; a rank leak is a single page with no outgoing links. Hence two fixes: removal of all leak nodes (nodes with out-degree 0), and introduction of a decay factor d to solve the problem of sinks. The modified PageRank is shown below, where m is the number of nodes in the graph.
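The modified formula also did not survive the transcript; a reconstruction with decay factor d (here read as the probability of jumping to a random page, so conventions for d may differ from the original slide) is:

```latex
r(i) = \frac{d}{m} + (1 - d) \sum_{j \in B(i)} \frac{r(j)}{N(j)}
```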

  29. HITS I
  A link-based search algorithm: Hyperlink-Induced Topic Search. Instead of producing a single ranking score, HITS produces two: an Authority score and a Hub score. Authority pages are those most likely to be relevant to a query, while hub pages are not necessarily authorities themselves but point to several of them.
  • The HITS algorithm
  The basic idea is to identify a small subgraph of the Web and apply link analysis to it, in order to locate the authorities and hubs for a given query.

  30. HITS II
  • Identifying the focused subgraph: the pages returned for the query are expanded with pages that link to them or are linked from them.
  • Link analysis: two kinds of operations are applied in each step, the I and O operations (reconstructed below).
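The update rules were lost with the slide graphics; the standard HITS I and O operations, reconstructed to match the authority/hub terminology above, are:

```latex
% I operation: a page's authority score is the sum of the hub scores of the pages pointing to it
a_i = \sum_{j \,:\, j \rightarrow i} h_j
\qquad\qquad
% O operation: a page's hub score is the sum of the authority scores of the pages it points to
h_i = \sum_{j \,:\, i \rightarrow j} a_j
```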

  31. HITS III
  The algorithm iteratively repeats the I and O steps, with normalization of the scores after each iteration, until the hub and authority scores converge.
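A compact sketch of that iteration in Python. The graph representation and fixed iteration count are assumptions for illustration, not the paper's code:

```python
import math

def hits(graph, iterations=50):
    """Iterate the I and O operations with normalization on a focused subgraph.

    graph maps each page to the list of pages it links to.
    Returns (authority, hub) score dictionaries.
    """
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # I operation: authority = sum of hub scores of pages pointing to p
        auth = {p: sum(hub[q] for q in pages if p in graph.get(q, ())) for p in pages}
        # O operation: hub = sum of authority scores of pages p points to
        hub = {p: sum(auth[q] for q in graph.get(p, ())) for p in pages}
        # Normalize so the scores stay bounded
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub
```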

  32. Other Link-Based Techniques
  • Identifying communities. An interesting problem is to identify communities on the Web. See refs [30] and [40].
  • Finding related pages. The Companion and Cocitation algorithms. See refs [22], [32] and [38].
  • Classification and resource compilation. The problem of automatically classifying documents. See refs [13], [14], [15].

  33. T H E E N D !
