Web and Search Engines

The Web: An Overview
• Developed by Tim Berners-Lee and colleagues at CERN in 1990.
• Currently governed by the World Wide Web Consortium (W3C).
• First widely used graphical web browser: Mosaic.
• Had over 800 million publicly indexable web pages and 180 million publicly indexable images by February 1999.
• Over 16 million web servers.
• Created numerous millionaires and billionaires!

Search Engine Technology
Two general paradigms for finding information on the Web:
• Browsing: from a starting point, navigate through hyperlinks to find desired documents. Yahoo's category hierarchy facilitates browsing.
• Searching: submit a query to a search engine to find desired documents. Many well-known search engines exist on the Web: AltaVista, Excite, HotBot, Infoseek, Lycos, Google, Northern Light, etc.

Browsing Versus Searching
• A category hierarchy is built mostly manually, whereas search engine databases can be created automatically.
• Search engines can index many more documents than a category hierarchy.
• Browsing is good for finding some desired documents; searching is better for finding many desired documents.
• Browsing is more accurate than searching (less junk is encountered).

Search Engine
A search engine is essentially a text retrieval system for web pages plus a Web interface. So what's new?

Some Characteristics of the Web
• Web pages are widely distributed on many servers.
• Web pages are extremely dynamic and volatile.
• Web pages have more structure (they are extensively tagged).
• Web pages are extensively linked.
• Web pages are very voluminous and diversified.
• Web pages often have other associated metadata.
• Web users are ordinary folks without special training, and they tend to submit short queries.
• There is a very large user community.

Overview of this Topic
Discuss how to take the special characteristics of the Web into consideration when building good search engines.
Specific subtopics:
• Robots
• The use of tag information
• The use of link information
• Collaborative filtering

Robots
A robot (also known as a spider, crawler, or wanderer) is a program for fetching web pages from the Web.
Main idea:
• Place some initial URLs into a URL queue.
• Repeat the steps below until the queue is empty:
  • Take the next URL from the queue and fetch the web page using HTTP.
  • Extract new URLs from the downloaded web page and add them to the queue.
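The main loop above can be sketched in Python as follows. This is a minimal illustration rather than a production robot: `fetch` and `extract_urls` are hypothetical callables supplied by the caller, the toy in-memory "Web" is made-up data, and the `seen` set is an added assumption (the slide does not mention it) so that a URL is never queued twice.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_urls):
    """Breadth-first crawl; returns {url: page} for every page fetched."""
    queue = deque(seed_urls)          # the URL queue
    seen = set(seed_urls)             # assumption: remember queued URLs to avoid loops
    pages = {}
    while queue:                      # repeat until the queue is empty
        url = queue.popleft()         # take the next URL from the queue
        page = fetch(url)             # in a real robot this would be an HTTP GET
        if page is None:              # bad or dead link: skip it
            continue
        pages[url] = page
        for new_url in extract_urls(url, page):
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return pages

# Toy in-memory "Web" (hypothetical data): each page is just its list of links.
TOY_WEB = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/"],
    "http://example.com/b": ["http://example.com/missing"],
}

def toy_fetch(url):
    return TOY_WEB.get(url)

def toy_extract(url, page):
    return page
```

Calling `crawl(["http://example.com/"], toy_fetch, toy_extract)` visits the three existing pages in breadth-first order and skips the missing one.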

Robots: What initial URLs to use?
• The choice depends on the type of search engine to be built.
• For general-purpose search engines, use URLs that are likely to reach a large portion of the Web, such as the Yahoo home page.
• For local search engines covering one or several organizations, use the URLs of the home pages of these organizations. In addition, use an appropriate domain constraint.

Robots: Examples
• To create a search engine for PUCPR University, use the initial URL www.pucpr.br and the domain constraint "pucpr.br". Only URLs containing "pucpr.br" will be used.
• To create a search engine for FK (Fachhochschule Konstanz), use initial URL and domain constraints...
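A domain constraint of this kind might be checked as sketched below. Note the assumption: the slide describes a plain "URL contains the string" test, whereas this sketch matches on the host name only, which avoids accepting URLs that merely mention the domain in their path. The function name `in_domain` is hypothetical.

```python
from urllib.parse import urlsplit

def in_domain(url, domain):
    """True if the URL's host is `domain` or one of its subdomains."""
    host = urlsplit(url).hostname or ""   # hostname is lower-cased by urlsplit
    return host == domain or host.endswith("." + domain)
```

For instance, `in_domain("http://www.pucpr.br/cursos", "pucpr.br")` is true, while a URL on another host whose path happens to contain "pucpr.br" is rejected.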

Robots: How to extract URLs from a web page?
Need to identify all possible tags and attributes that hold URLs.
• Anchor tag: <a href="URL" …> … </a>
• Option tag: <option value="URL" …> … </option>
• Map: <area href="URL" …>
• Frame: <frame src="URL" …>
• Link to an image: <img src="URL" …>
• Relative path vs. absolute path: <base href= …>
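The tags and attributes listed above can be collected with Python's standard `html.parser` module, as in the sketch below. It is a simplified illustration: the class and function names are hypothetical, relative URLs are resolved against the `<base href>` if one is present (otherwise against the page's own URL), and an `<option value>` is treated as a URL even though in practice it may hold other data.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# (tag, attribute) pairs that may hold URLs, as listed on the slide.
URL_ATTRS = {("a", "href"), ("area", "href"), ("frame", "src"),
             ("img", "src"), ("option", "value")}

class LinkExtractor(HTMLParser):
    def __init__(self, page_url):
        super().__init__()
        self.base = page_url      # default base for relative paths
        self.raw = []             # URLs exactly as written in the page

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if tag == "base" and name == "href":
                self.base = urljoin(self.base, value)
            elif (tag, name) in URL_ATTRS:
                self.raw.append(value)

def extract_urls(page_url, html):
    """Return the absolute URLs referenced by the page, in document order."""
    parser = LinkExtractor(page_url)
    parser.feed(html)
    parser.close()
    return [urljoin(parser.base, u) for u in parser.raw]
```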

Robots: How fast should we download web pages from the same server?
• Downloading web pages from a web server consumes that server's resources.
• Be considerate of the web servers you visit (e.g., fetch one page per minute from the same server).
Other issues:
• Handling bad links and links to servers that are down
• Handling duplicate pages
• Robot exclusion protocol
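One possible way to enforce a per-server delay is sketched below. The class name, its methods, and the idea of passing the current time in explicitly (which keeps the example testable) are all assumptions made for illustration; the 60-second default simply mirrors the "one page per minute" example.

```python
class PolitenessPolicy:
    """Tracks the last fetch time per host and reports how long to wait."""

    def __init__(self, min_delay=60.0):
        self.min_delay = min_delay     # seconds between fetches from one host
        self.last_fetch = {}           # host -> time of the most recent fetch

    def wait_time(self, host, now):
        """Seconds the robot should still wait before fetching from `host`."""
        last = self.last_fetch.get(host)
        if last is None:               # never fetched from this host
            return 0.0
        return max(0.0, self.min_delay - (now - last))

    def record_fetch(self, host, now):
        self.last_fetch[host] = now
```

A robot would call `wait_time` before each fetch (for example with `time.monotonic()` as `now`), sleep or pick a URL on another host if the result is positive, and call `record_fetch` afterwards.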

Robots Exclusion Protocol
• The site administrator puts a "robots.txt" file at the root of the host's web directory, e.g.:
  • http://www.ebay.com/robots.txt
  • http://www.cnn.com/robots.txt
• The file is a list of excluded directories for a given robot (user-agent).
• Exclude all robots from the entire site:
    User-agent: *
    Disallow: /

Robots Exclusion Protocol Examples
• Exclude specific directories:
    User-agent: *
    Disallow: /tmp/
    Disallow: /cgi-bin/
    Disallow: /users/paranoid/
• Exclude a specific robot:
    User-agent: GoogleBot
    Disallow: /
• Allow only a specific robot:
    User-agent: GoogleBot
    Disallow:
    User-agent: *
    Disallow: /

Robots: Another example
    User-agent: webcrawler
    Disallow:            # no restriction for webcrawler
    User-agent: lycra
    Disallow: /          # no access for robot lycra
    User-agent: *
    Disallow: /tmp       # all other robots can index
    Disallow: /logs      # docs not under /tmp, /logs
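A robot can evaluate rules like these with Python's standard `urllib.robotparser` module. The sketch below parses the example file from a string instead of downloading it; the host `example.com`, the agent name `otherbot`, and the helper `make_parser` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: webcrawler
Disallow:            # no restriction for webcrawler

User-agent: lycra
Disallow: /          # no access for robot lycra

User-agent: *
Disallow: /tmp       # all other robots can index
Disallow: /logs      # docs not under /tmp, /logs
"""

def make_parser(robots_txt):
    """Build a parser from the text of a robots.txt file."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser
```

With `p = make_parser(ROBOTS_TXT)`, a call such as `p.can_fetch("lycra", "http://example.com/index.html")` reports whether that robot may fetch that URL. A live robot would instead use `set_url(...)` and `read()` to download the file from the site.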

Robots: Several research issues
• Fetching more important pages first with limited resources.
• Fetching web pages in a specified subject area, such as movies or sports, to create domain-specific search engines.
• Efficiently re-fetching web pages to keep the web page index up to date.

Robots: Efficient Crawling through URL Ordering [Cho 98]
• Default ordering is based on breadth-first search.
• Efficient crawling fetches important pages first.
Importance definitions:
• Similarity of a page to a driving query
• Backlink count of a page
• PageRank of a page
• Forward link count of a page
• Domain of a page
• A combination of the above

Robots: A method for fetching pages related to a driving query first [Cho 98]
• Suppose the query is "computer".
• A page is related (hot) if "computer" appears in the title or appears ≥ 10 times in the body of the page.
• Some heuristics for guessing that a URL leads to a hot page:
  • The anchor of the URL contains "computer".
  • The URL itself contains "computer".
  • The URL is within 3 links of a hot page.
• Call such a URL a hot URL.

Robots: Crawling Algorithm
    hot_queue = url_queue = empty;   /* initialization */
    /* hot_queue stores hot URLs and url_queue stores other URLs */
    enqueue(url_queue, starting_url);
    while (hot_queue or url_queue is not empty) {
        url = dequeue2(hot_queue, url_queue);
        /* dequeue hot_queue first if it is not empty */
        page = fetch(url);
        if (page is hot) then hot[url] = true;
        enqueue(crawled_urls, url);
        url_list = extract_urls(page);
        for each u in url_list
            if (u not in url_queue and u not in hot_queue
                and u not in crawled_urls)   /* u is a new URL */
                if (u is a hot URL) enqueue(hot_queue, u);
                else enqueue(url_queue, u);
    }
Reported experimental results indicate that the method is effective.
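The two-queue algorithm can be sketched in runnable Python as follows. This is an illustration under several assumptions, not the implementation from [Cho 98]: pages come from a hypothetical `fetch` callable that returns a dict with `title`, `body`, and `links` (a list of `(anchor_text, url)` pairs), and the "within 3 links of a hot page" heuristic is tracked with a distance table that doubles as the set of already-seen URLs.

```python
from collections import deque

def is_hot_page(page, query, min_count=10):
    """Hot: query in the title, or at least `min_count` times in the body."""
    return (query in page["title"].lower()
            or page["body"].lower().count(query) >= min_count)

def crawl_hot_first(start_url, fetch, query, max_dist=3):
    hot_queue, url_queue = deque(), deque([start_url])
    dist = {start_url: None}    # links from a known hot page (None = unknown)
    order, hot = [], set()      # crawl order and the set of hot pages found
    while hot_queue or url_queue:
        # Dequeue from hot_queue first if it is not empty.
        url = hot_queue.popleft() if hot_queue else url_queue.popleft()
        page = fetch(url)
        if page is None:
            continue
        order.append(url)
        if is_hot_page(page, query):
            hot.add(url)
            d = 0
        else:
            d = dist[url]
        child_dist = None if d is None else d + 1
        for anchor, u in page["links"]:
            if u in dist:       # already queued or crawled
                continue
            dist[u] = child_dist
            near_hot = child_dist is not None and child_dist <= max_dist
            if query in anchor.lower() or query in u.lower() or near_hot:
                hot_queue.append(u)
            else:
                url_queue.append(u)
    return order, hot
```

On a small site whose home page links to "about", "computer lab", and "news" pages, the lab page and the pages it links to are fetched before the other two.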

Fish Search and ARACHNID
Fish search (De Bra 94): Search by intelligently and automatically navigating through real online web pages from a starting point.
Some key features:
• Uses heuristics to select the next page to navigate to.
• Client-based search.
• Favors depth-first search.
ARACHNID (Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery, Menczer 97)
Key features:
• Starts from multiple promising starting points.
• Each agent acts like a fish search engine but with more sophisticated navigation techniques.

Use of Tag Information
• Web pages are mostly HTML documents (for now).
• HTML tags allow the author of a web page to:
  • Control the display of page contents on the Web.
  • Express their emphasis on different parts of the page.
• HTML tags provide additional information about the contents of a web page.
• Question: Can we make use of the tag information to improve the effectiveness of a search engine?

Use of Tag Information
Two main ideas for using tags:
• Assign different importance to term occurrences in different tags.
• Use anchor text to index referenced documents.
[Figure: Page 1 contains the anchor text "airplane ticket and hotel", which links to Page 2 (http://travelocity.com/).]

Use of Tag Information
Many search engines use tags to improve retrieval effectiveness.
• Assigning different importance to term occurrences in different tags is used in AltaVista, HotBot, Yahoo, Lycos, LASER, and SIBRIS.
• WWWW and Google use terms in anchor tags to index a referenced page.
Shortcomings:
• Very few tags are considered.
• The relative importance of tags has not been studied.
• Rigorous performance studies are lacking.

Use of Tag Information: The Webor Method (Cutler 97, Cutler 99)
• Partition HTML tags into six ordered classes: title, header, list, strong, anchor, plain.
• Extend the term frequency value of a term in a document into a term frequency vector (TFV). Suppose term t appears $tf_i$ times in the i-th class, i = 1, 2, 3, 4, 5, 6. Then $TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6)$.
• Example: If, for page p, the term "konstanz" appears 1 time in the title, 2 times in the headers, and 8 times in the anchors of hyperlinks pointing to p, then for this term in p: TFV = (1, 2, 0, 0, 8, 0).

Use of Tag Information: The Webor Method (Continued)
• Assign different importance values to term occurrences in different classes. Let $civ_i$ be the importance value assigned to the i-th class. We have the class importance vector $CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6)$.
• Extend the tf term weighting scheme as follows. Suppose that for term t, $TFV = (tf_1, tf_2, tf_3, tf_4, tf_5, tf_6)$. Then
  $tfw = TFV \cdot CIV = tf_1 \, civ_1 + \dots + tf_6 \, civ_6$
• When CIV = (1, 1, 1, 1, 0, 1), the new tfw becomes the tfw of traditional text retrieval.
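The weighting above is a plain dot product, which the short sketch below spells out. The function name `tfw` follows the slide's notation; the second CIV used in the note afterwards is only an illustrative choice of weights.

```python
# Class order used throughout: title, header, list, strong, anchor, plain.
CLASSES = ("title", "header", "list", "strong", "anchor", "plain")

def tfw(tfv, civ):
    """Term frequency weight: dot product of TFV and CIV."""
    if len(tfv) != len(CLASSES) or len(civ) != len(CLASSES):
        raise ValueError("TFV and CIV must each have six components")
    return sum(tf * w for tf, w in zip(tfv, civ))
```

For the "konstanz" example, TFV = (1, 2, 0, 0, 8, 0): with CIV = (1, 1, 1, 1, 0, 1) the weight is 1 + 2 = 3, the traditional term frequency, because anchor text on other pages is ignored. With an illustrative CIV = (2, 5, 1, 8, 8, 1) it becomes 2 + 10 + 64 = 76.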

Use of Tag Information: The Webor Method (Continued)
Challenge: How to find the (optimal) $CIV = (civ_1, civ_2, civ_3, civ_4, civ_5, civ_6)$ that improves retrieval performance the most?
Our solution: Find the optimal CIV experimentally.
• Need a test bed for the experiments so that the performance of a given CIV can be measured.
• Need a systematic way to try out different CIVs and to find the optimal (or near-optimal) CIV.

Use of Tag Information: The Webor Method (from Weiyi Meng, Binghamton University)
Creating a test bed:
• Web pages: a snapshot of the Binghamton University site in Dec. 1996 (about 4,600 pages; after removing duplicates, about 3,000 pages).
• Queries: 20 queries were created (see next slide).
• For each query, the documents relevant to the query were identified manually.

Use of Tag Information: The Webor Method (Continued)
20 test bed queries:
• web-based retrieval
• concert and music
• neural network
• intramural sports
• master thesis in geology
• cognitive science
• prerequisite of algorithm
• campus dining
• handicap student help
• career development
• promotion guideline
• non-matriculated admissions
• grievance committee
• student associations
• laboratory in electrical engineering
• research centers
• anthropology chairman
• engineering program
• computer workshop
• papers in philosophy and computer and cognitive system

Use of Tag Information: The Webor Method (Continued)
Use a genetic algorithm to find the optimal CIV.
• The initial population has 30 CIVs:
  • 25 are randomly generated (each component in the range [1, 15]).
  • 5 are "good" CIVs from manual screening.
• Each new generation of CIVs is produced by executing crossover, mutation, and reproduction.

Use of Tag Information: The Genetic Algorithm (continued)
Crossover
• Done for each consecutive pair of CIVs, with probability 0.75.
• A single random cut is made for each selected pair.
• Example (cut after the second component):
    old pair              new pair
    (1, 4 | 2, 1, 2, 1)   (2, 3 | 2, 1, 2, 1)
    (2, 3 | 1, 2, 5, 1)   (1, 4 | 1, 2, 5, 1)

Use of Tag Information: The Genetic Algorithm (continued)
Mutation
• Performed on each CIV with probability 0.1.
• When mutation is performed, each CIV component is either decreased or increased by one with equal probability, subject to the range conditions of each component.
• Example: If a component is already 15, it cannot be increased.
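The crossover and mutation operators might look like the sketch below. The function names and the use of tuples are illustrative assumptions. The slides do not say what happens when a mutation step would leave the range [1, 15], so this sketch assumes the component is simply left unchanged in that case.

```python
import random

CIV_MIN, CIV_MAX = 1, 15   # component range stated on the slides

def crossover(a, b, cut):
    """Single-point crossover: exchange the first `cut` components."""
    return b[:cut] + a[cut:], a[:cut] + b[cut:]

def maybe_crossover(a, b, rng, p_cross=0.75):
    """Cross a consecutive pair with probability p_cross at a random cut."""
    if rng.random() < p_cross:
        cut = rng.randint(1, len(a) - 1)   # cut strictly inside the vector
        return crossover(a, b, cut)
    return a, b

def mutate(civ, rng, p_mut=0.1):
    """With probability p_mut, move every component by +1 or -1."""
    if rng.random() >= p_mut:
        return civ
    out = []
    for c in civ:
        moved = c + rng.choice((-1, 1))
        # Assumption: a step that would leave the range is dropped.
        out.append(moved if CIV_MIN <= moved <= CIV_MAX else c)
    return tuple(out)
```

`crossover((1, 4, 2, 1, 2, 1), (2, 3, 1, 2, 5, 1), 2)` reproduces the new pair from the slide example, `((2, 3, 2, 1, 2, 1), (1, 4, 1, 2, 5, 1))`. Passing a seeded `random.Random` as `rng` makes a run repeatable.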

Use of Tag Information: The Genetic Algorithm (continued)
The fitness function
• The initial fitness of a CIV is:
  • 0, when the 11-point average precision is less than 0.22;
  • (11-point average precision − 0.22), otherwise.
• The final fitness is the initial fitness divided by the sum of the initial fitnesses of all the CIVs in the current generation. Hence:
  • each fitness is between 0 and 1;
  • the sum of all fitnesses is 1.

Use of Tag Information: The Genetic Algorithm (continued)
Reproduction
• A wheel-of-fortune scheme is used to select the parent population.
• The scheme selects fit CIVs with high probability and unfit CIVs with low probability.
• The same CIV may be selected more than once.
The algorithm terminates after 25 generations, and the best CIV obtained is reported as the optimal CIV. The 11-point average precision achieved by the optimal CIV is reported as the performance of the CIV.
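The fitness normalization and the wheel-of-fortune (roulette-wheel) selection can be sketched as follows. The function names are hypothetical, and the fallback to uniform fitness when every CIV scores below the threshold is an assumption, since the slides do not cover that case.

```python
import random

def initial_fitness(avg_precision, threshold=0.22):
    """0 below the threshold, otherwise the margin above it."""
    return max(0.0, avg_precision - threshold)

def final_fitnesses(precisions):
    """Normalize initial fitnesses so that they sum to 1."""
    init = [initial_fitness(p) for p in precisions]
    total = sum(init)
    if total == 0.0:
        # Assumption: with no CIV above the threshold, treat all as equal.
        return [1.0 / len(init)] * len(init)
    return [f / total for f in init]

def select_parents(population, fitnesses, rng):
    """Draw a parent population of the same size, with replacement,
    each CIV being picked with probability equal to its fitness."""
    return rng.choices(population, weights=fitnesses, k=len(population))
```

For 11-point average precisions of 0.20, 0.25, and 0.28, the initial fitnesses are 0, 0.03, and 0.06, so the final fitnesses are 0, 1/3, and 2/3, and the first CIV is never selected as a parent.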

Use of Tag Information: The Webor Method (continued): Experimental Results
Classes: title, header, list, strong, anchor, plain

    Queries   Opt. CIV              Normal   New     Improvement
    1st 10    (2, 8, 1, 8, 8, 1)    0.182    0.254   39.6%
    2nd 10    (2, 7, 1, 8, 8, 1)    0.172    0.255   48.3%
    all       (2, 5, 1, 8, 8, 1)    0.177    0.254   43.5%

Conclusions:
• anchor and strong are the most important classes;
• header is also important;
• title is only slightly more important than list and plain.

Use of Tag Information: The Webor Method (continued): Summary
• The Webor method has the potential to substantially improve retrieval effectiveness.
• However, no definitive conclusions should be drawn yet, as the results are preliminary.
• Need to:
  • expand the set of queries in the test bed;
  • use other Web page collections.

Use of Link Information
Hyperlinks among web pages provide new document retrieval opportunities.
Selected examples:
• Anchor texts can be used to index a referenced page (e.g., Webor, WWWW, Google).
• The ranking score (similarity) of a page with respect to a query can be spread to its neighboring pages.
• Links can be used to compute the importance of web pages based on citation analysis.
• Links can be combined with a regular query to find authoritative pages on a given topic.

Use of Link Information: Vector spread activation (Yuwono 97)
• The final ranking score of a page p is the sum of its regular similarity and a portion of the similarity of each page that points to p.
• Rationale: If a page is pointed to by many relevant pages, then the page is also likely to be relevant.
• Let $sim(q, d_i)$ be the regular similarity between q and $d_i$; let $rs(q, d_i)$ be the ranking score of $d_i$ with respect to q; and let $link(j, i) = 1$ if $d_j$ points to $d_i$, and 0 otherwise. Then
  $rs(q, d_i) = sim(q, d_i) + \alpha \sum_{j} link(j, i) \, sim(q, d_j)$
  where $\alpha = 0.2$ is a constant parameter.
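A direct translation of the ranking-score formula is sketched below. The data layout is an assumption made for illustration: similarities are held in a dict keyed by document id, and links are given as `(j, i)` pairs meaning that document j points to document i.

```python
def spread_activation(sim, links, alpha=0.2):
    """Ranking score: own similarity plus alpha times the similarity
    of every page that points to the document."""
    rs = dict(sim)                    # start from the regular similarities
    for j, i in links:                # link(j, i) = 1: d_j points to d_i
        rs[i] += alpha * sim[j]
    return rs
```

For example, if d1 and d2 have similarities 0.5 and 0.3 and both point to d3, whose own similarity is 0, then d3 receives a ranking score of 0.2 × 0.5 + 0.2 × 0.3 = 0.16 while d1 and d2 keep their scores.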

Use of Link Information: PageRank citation ranking (Page 98)
• The Web can be viewed as a huge directed graph G(V, E), where V is the set of web pages (vertices) and E is the set of hyperlinks (directed edges).
• Each page may have a number of outgoing edges (forward links) and a number of incoming edges (backlinks).
• Each backlink of a page represents a citation of the page.
• PageRank is a measure of global web page importance based on the backlinks of web pages.

Computing PageRank
PageRank is based on the following basic ideas:
• If a page is linked to by many pages, then the page is likely to be important.
• If a page is linked to by important pages, then the page is likely to be important even though not many pages link to it.
• The importance of a page is divided evenly and propagated to the pages it points to.
[Figure: example of rank propagation, with values 5, 10, 5.]

Computing PageRank: PageRank Definition
Let u be a web page, $F_u$ be the set of pages u points to, $B_u$ be the set of pages that point to u, and $N_u = |F_u|$ be the number of pages in $F_u$.
The rank (importance) of a page u can be defined by:
  $R(u) = \sum_{v \in B_u} \frac{R(v)}{N_v}$

Computing PageRank
PageRank is defined recursively and can be computed iteratively.
• Initialize all page ranks to 1/N, where N is the number of vertices in the Web graph.
• In the i-th iteration, the rank of a page is computed using the ranks of its parent pages from the (i−1)-th iteration.
• Repeat until all ranks converge.
Let $R_i(u)$ be the rank of page u in the i-th iteration and $R_0(u)$ be the initial rank of u. Then
  $R_i(u) = \sum_{v \in B_u} \frac{R_{i-1}(v)}{N_v}$
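The iteration can be sketched in a few lines of Python. This is the basic scheme only, with no correction for rank sinks or pages without forward links; the graph is given as a hypothetical dict mapping each page to the list of pages it points to, and every page is assumed to appear as a key with at least one forward link.

```python
def pagerank_basic(links, tol=1e-10, max_iter=1000):
    """Iterate R_i(u) = sum over v in B_u of R_{i-1}(v) / N_v."""
    pages = sorted(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}            # R_0(u) = 1/N
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            share = rank[v] / len(targets)        # R(v) / N_v
            for u in targets:                     # v is a parent of u
                new_rank[u] += share
        converged = max(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank
```

For the small graph A → B, A → C, B → C, C → A, the ranks settle at R(A) = R(C) = 0.4 and R(B) = 0.2, which can be checked by substituting into the definition: R(B) = R(A)/2 and R(C) = R(A)/2 + R(B).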

Computing PageRank: Matrix representation
Let M be an $N \times N$ matrix and $m_{uv}$ be the entry at the u-th row and v-th column:
• $m_{uv} = 1/N_v$ if page v has a link to page u;
• $m_{uv} = 0$ if there is no link from v to u.
Let $R_i$ be the $N \times 1$ rank vector for the i-th iteration and $R_0$ be the initial rank vector. Then
  $R_i = M \times R_{i-1}$

Computing PageRank
If the ranks converge, i.e., there is a rank vector R such that $R = M \times R$, then R is the eigenvector of matrix M with eigenvalue 1.
Convergence is guaranteed only if:
• M is aperiodic (the Web graph is not a big cycle). This is practically guaranteed for the Web.
• M is irreducible (the Web graph is strongly connected). This is usually not true.

Computing PageRank
Rank sink: A page or a group of pages is a rank sink if it can receive rank propagation from its parents but cannot propagate rank to other pages.
A rank sink causes the loss of total rank.
[Figure: example graph with pages A, B, C, D, in which (C, D) is a rank sink.]

Computing PageRank
A solution to the non-irreducibility and rank sink problems:
• Conceptually, add a link from each page v to every page (including itself).
• If v has no forward links originally, set all entries in the corresponding column of M to 1/N.
• If v has forward links originally, replace $1/N_v$ in the corresponding column by $c \cdot 1/N_v$ and then add $(1 - c) \cdot 1/N$ to all entries of that column, where 0 < c < 1.

Computing PageRank
Let M* be the new matrix.
• M* is irreducible.
• M* is stochastic: the sum of all entries in each column is 1 and there are no negative entries.
Therefore, if M is replaced by M*, as in
  $R_i = M^* \times R_{i-1}$
then convergence is guaranteed and there is no loss of total rank (which is 1).

Computing PageRank
Interpretation of M* based on the random walk model:
• If page v has no forward links originally, a web surfer at v can jump to any page in the Web with probability 1/N.
• If page v has forward links originally, a surfer at v can either follow a link to another page with probability $c \cdot 1/N_v$, or jump to any page with probability $(1 - c) \cdot 1/N$.

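Computing PageRank: Example
Suppose the Web graph has pages A, B, C, D with links A → C, B → C, C → D, D → A, and D → B. With rows and columns ordered A, B, C, D:
  $M = \begin{pmatrix} 0 & 0 & 0 & 1/2 \\ 0 & 0 & 0 & 1/2 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \end{pmatrix}$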

Computing PageRank: Example (continued)
Suppose c = 0.8. All entries in Z are 0 and all entries in K are 1/4. With rows and columns ordered A, B, C, D:
  $M^* = 0.8 \, (M + Z) + 0.2 \, K = \begin{pmatrix} 0.05 & 0.05 & 0.05 & 0.45 \\ 0.05 & 0.05 & 0.05 & 0.45 \\ 0.85 & 0.85 & 0.05 & 0.05 \\ 0.05 & 0.05 & 0.85 & 0.05 \end{pmatrix}$
After 30 iterations: R(A) = R(B) = 0.176, R(C) = 0.332, R(D) = 0.316.
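The example can be reproduced with the short pure-Python sketch below. The function names are hypothetical, and the link structure (A → C, B → C, C → D, D → A, D → B) is read off the nonzero entries of the matrices in the example. `build_m_star` follows the construction rule for M*: a column of 1/N for a page without forward links, otherwise c/N_v on linked entries plus (1 − c)/N everywhere.

```python
def build_m_star(links, pages, c=0.8):
    """Return M* as a list of rows; entry [u][v] is the weight from v to u."""
    n = len(pages)
    idx = {p: i for i, p in enumerate(pages)}
    m_star = [[0.0] * n for _ in range(n)]
    for v in pages:
        targets = links.get(v, [])
        for u in pages:
            if not targets:                       # v has no forward links
                value = 1.0 / n
            else:
                follow = c / len(targets) if u in targets else 0.0
                value = follow + (1.0 - c) / n    # random-jump share
            m_star[idx[u]][idx[v]] = value
    return m_star

def iterate(m, iterations=30):
    """Apply R_i = M* x R_{i-1}, starting from the uniform vector."""
    n = len(m)
    rank = [1.0 / n] * n
    for _ in range(iterations):
        rank = [sum(m[u][v] * rank[v] for v in range(n)) for u in range(n)]
    return rank

PAGES = ["A", "B", "C", "D"]
LINKS = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A", "B"]}
```

`iterate(build_m_star(LINKS, PAGES), 30)` returns ranks for A, B, C, D that agree with the values on the slide to three decimal places, and the ranks sum to 1 because every column of M* sums to 1.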
