
How to Build a Search Engine


Presentation Transcript


  1. How to Build a Search Engine 樂倍達數位科技股份有限公司 范綱岷(Kung-Ming Fung) kmfung@doubleservice.com 2008/04/01

  2. Outline • Introduction • Different Kinds of Search Engine • Architecture • Robot, Spider, Crawler • HTML and HTTP • Indexing • Keyword Search • Evaluation Criteria • Related Work • Discussion • About Google • Ajax: A New Approach to Web Applications • References

  3. Introduction

  4. Different Kinds of Search Engine • Directory Search • Full Text Search • Web pages • News • Images • … • Meta Search

  5. Number of pages: Directory < Full-text < Meta • Directory Search • ODP: Open Directory Project, http://dmoz.org/ • Full-Text Search • Google, http://www.google.com/

  6. Meta Search • MetaCrawler, http://www.metacrawler.com/ • Aibang (愛幫), http://www.aibang.com/

  7. Simplified control flow of the meta search engine. Reference: Context and Page Analysis for Improved Web Search, http://www.neci.nec.com/~lawrence/papers.html

  8. Architecture • Simple architecture: WWW → Robot/Spider/Crawler → Database → Indexing → Keyword Search

  9. Typical high-level architecture of a Web crawler Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  10. Typical anatomy of a large-scale crawler. Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  11. High Level Google Architecture Reference: A Survey On Web Information Retrieval Technologies

  12. The architecture of a standard meta search engine. Reference: Web Search – Your Way

  13. The architecture of a meta search engine. Reference: Web Search – Your Way

  14. Cyclic architecture for search engines Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  15. Robot, Spider, Crawler • The robot is the component of a search engine responsible for gathering data; it is also called a spider or crawler. It automatically collects web pages from sites on a configured schedule, usually starting from a set of predefined seed sites and recursively following their links to gather more pages. • A major performance bottleneck is DNS lookup.

  16. Goal • Resolving the hostname in the URL to an IP address using DNS (Domain Name System). • Connecting a socket to the server and sending the request. • Receiving the requested page in response.
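The three steps above can be sketched in Python with the standard socket module. This is a minimal HTTP/1.0 fetch, not a production fetcher: error handling, redirects, and timeout recovery are omitted, and the host/path arguments are whatever the crawler pulls from its URL frontier.

```python
import socket

def build_request(host, path="/"):
    # HTTP/1.0 closes the connection after the response,
    # which keeps the receive loop below simple.
    return "GET {} HTTP/1.0\r\nHost: {}\r\n\r\n".format(path, host)

def fetch(host, path="/"):
    # Step 1: resolve the hostname to an IP address via DNS.
    ip = socket.gethostbyname(host)
    # Step 2: connect a socket to the server and send the request.
    sock = socket.create_connection((ip, 80), timeout=10)
    sock.sendall(build_request(host, path).encode("ascii"))
    # Step 3: receive the requested page until the server closes.
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)
    sock.close()
    return b"".join(chunks)
```

Because DNS lookup is the bottleneck noted above, large crawlers typically cache resolved addresses rather than calling the resolver for every URL.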

  17. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  18. Number of static and dynamic pages at a given depth. Dynamic pages: 5 levels; static pages: 15 levels. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  19. Policy • A selection policy that states which pages to download. • A re-visit policy that states when to check for changes to the pages. • A politeness policy that states how to avoid overloading Web sites. • A parallelization policy that states how to coordinate distributed Web crawlers. Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  20. The view of Web Crawler Reference: Structural abstractions of hypertext documents for Web-based retrieval

  21. Flow of a basic sequential crawler Reference: Crawling the Web.
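The flow of a basic sequential crawler can be sketched as a loop over a URL frontier. This is an illustrative sketch, not the figure's exact algorithm; fetch_page and extract_links stand in for the HTTP and HTML-parsing components discussed elsewhere in these slides.

```python
from collections import deque

def crawl(seeds, fetch_page, extract_links, max_pages=100):
    # Frontier of URLs still to visit, seeded with the start pages.
    frontier = deque(seeds)
    visited = set()
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()           # next URL (breadth-first order)
        if url in visited:
            continue
        visited.add(url)
        html = fetch_page(url)             # download the page
        if html is None:
            continue
        pages[url] = html                  # store it for later indexing
        for link in extract_links(html):   # grow the frontier
            if link not in visited:
                frontier.append(link)
    return pages
```

Using a FIFO deque gives the breadth-first ordering listed later under crawl strategies; swapping in a priority queue yields the backlink-count or PageRank-ordered variants.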

  22. A multi-threaded crawler model Reference: Crawling the Web.

  23. HTML and HTTP • HTML – Hypertext Markup Language • HTTP – Hypertext Transfer Protocol • TCP – Transmission Control Protocol • HTTP is built on top of TCP. • Hyperlink • A hyperlink is expressed as an anchor tag with an href attribute. • <a href="http://www.ntust.edu.tw/">NTUST</a> • URL – Uniform Resource Locator (http://www.ntust.edu.tw/)

  24. GET / HTTP/1.0

HTTP/1.1 200 OK
Date: Sat, 13 Jan 2001 09:01:02 GMT
Server: Apache/1.3.0 (Unix) PHP/3.0.4
Last-Modified: Wed, 20 Dec 2000 13:18:38 GMT
Accept-Ranges: bytes
Content-Length: 5437
Connection: Close
Content-Type: text/html

<html>
<head>
<title>NTUST</title>
</head>
<body>
…
</body>
</html>

  25. For checking a URL Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  26. Operation of a crawler Reference: Crawling a Country Better Strategies than Breadth-First for Web Page Ordering.

  27. Get new URLs Reference: Crawling on the World Wide Web.

  28. HTML Tag Tree Reference: Crawling the Web.

  29. HTML Tag Tree Reference: Crawling the Web.

  30. Strategies • Breadth-first • Backlink-count • Batch-pagerank • Partial-pagerank • OPIC (On-line Page Importance Computation) • Larger-sites-first Reference: Carlos Castillo, Effective Web Crawling, Dept. of Computer Science - University of Chile, November 2004.

  31. Re-visit policy • Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as: F_p(t) = 1 if the local copy of p is identical to the live copy at time t, and 0 otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.

  32. Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as: A_p(t) = 0 if p has not been modified since it was last downloaded, and t − (modification time of p) otherwise. Reference: Web crawler, From Wikipedia, the free encyclopedia, http://en.wikipedia.org/wiki/Web_crawler.
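The two re-visit measures translate directly into code. A hedged sketch, where modified_at is the time the live page last changed (None if it has not changed since the last download):

```python
def freshness(local_copy, live_copy):
    # Binary measure: 1 if the stored copy still matches the live page.
    return 1 if local_copy == live_copy else 0

def age(t, modified_at):
    # 0 while the local copy is still current; otherwise the time
    # elapsed since the live page was modified.
    if modified_at is None or modified_at > t:
        return 0
    return t - modified_at
```

A re-visit policy then schedules pages so that average freshness stays high (or average age stays low) across the repository.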

  33. Robot Exclusion http://www.robotstxt.org/wc/exclusion.html • The robots exclusion protocol • The robots META tag

  34. The Robots Exclusion Protocol - /robots.txt • Where to create the robots.txt file? In the top-level directory of the web server, so that it is retrieved as /robots.txt (e.g. http://www.example.com/robots.txt).

  35. URLs are case sensitive, and "/robots.txt" must be all lower-case • Examples: • To exclude all robots from the entire server:
User-agent: *
Disallow: /
• To exclude all robots from part of the server:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/

  36. To exclude a single robot:
User-agent: BadBot
Disallow: /
• To allow a single robot (and exclude all others):
User-agent: WebCrawler
Disallow:

User-agent: *
Disallow: /

  37. To exclude all files except one • Put the files to be disallowed into a separate directory and disallow that directory:
User-agent: *
Disallow: /~joe/docs/
• Or explicitly disallow each file:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html

  38. A sample robots.txt file:
# AltaVista Search
User-agent: AltaVista Intranet V2.0 W3C Webreq
Disallow: /Out-Of-Date/

# Exclude some access-controlled areas
User-agent: *
Disallow: /Team/
Disallow: /Project/
Disallow: /Systems/
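A polite crawler can apply rules like these with Python's standard urllib.robotparser module; a small sketch (the rules and the example.com URLs are illustrative, not from a real site):

```python
from urllib.robotparser import RobotFileParser

robots_txt = """\
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
"""

# Normally the parser fetches http://host/robots.txt itself;
# here we feed it the rules directly.
rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://www.example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "http://www.example.com/tmp/x.html"))  # False
```

The crawler should call can_fetch before every download and skip any URL for which it returns False.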

  39. The Robots META Tag • <meta name="robots" content="noindex,nofollow"> • Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>...

  40. Examples: • <meta name="robots" content="index,follow"> • <meta name="robots" content="noindex,follow"> • <meta name="robots" content="index,nofollow"> • <meta name="robots" content="noindex,nofollow"> • INDEX: whether an indexing robot should index the page • FOLLOW: whether a robot should follow the links on the page • The defaults are INDEX and FOLLOW.

  41. Indexing • In general, an index is built by storing every word or phrase from a page in a keyword index file. Besides the page content itself, keywords that the page's author defines in meta tags are also often included in the index. • TF, IDF, inverted index • Stop words
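The TF and IDF weights mentioned above can be computed as follows. This is one common variant (relative term frequency times log inverse document frequency); it assumes documents are already tokenized into word lists and that the term occurs somewhere in the corpus:

```python
import math

def tf_idf(term, doc, corpus):
    # TF: relative frequency of the term in this document.
    tf = doc.count(term) / len(doc)
    # IDF: log of (number of documents / documents containing the term).
    # Assumes the term appears in at least one document.
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)
```

A term that appears in every document (like a stop word) gets IDF = log(1) = 0, which is exactly why stop words can be dropped from the index.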

  42. (b) is an inverted index of (a) Reference: Supporting web query expansion efficiently using multi-granularity indexing and query processing

  43. d1: My1 care2 is3 loss4 of5 care6 with7 old8 care9 done10. • d2: Your1 care2 is3 gain4 of5 care6 with7 new8 care9 won10. • tid: token ID • did: document ID • pos: position Reference: Soumen Chakrabarti, mining the web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  44. Two variants of the inverted index data structure.
d1: My care is loss of care with old care done.
d2: Your care is gain of care with new care won.
Document-level index:
my -> d1
care -> d1; d2
is -> d1; d2
loss -> d1
of -> d1; d2
with -> d1; d2
old -> d1
done -> d1
your -> d2
gain -> d2
new -> d2
won -> d2
Positional index:
my -> d1/1
care -> d1/2,6,9; d2/2,6,9
is -> d1/3; d2/3
loss -> d1/4
of -> d1/5; d2/5
with -> d1/7; d2/7
old -> d1/8
done -> d1/10
your -> d2/1
gain -> d2/4
new -> d2/8
won -> d2/10
Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  45. Usually stored on disk • Implemented using a B-tree or a hash table
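A minimal in-memory version of the positional variant can be sketched in Python. Real systems keep the index on disk in a B-tree or hash table as noted above; this sketch uses plain dictionaries and the d1/d2 example from the previous slides:

```python
from collections import defaultdict

def build_index(docs):
    # Maps each token -> {doc_id: [positions]}, i.e. the positional
    # variant of the inverted index (positions are 1-based).
    index = defaultdict(lambda: defaultdict(list))
    for did, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token.strip(".,")][did].append(pos)
    return index

docs = {
    "d1": "My care is loss of care with old care done.",
    "d2": "Your care is gain of care with new care won.",
}
index = build_index(docs)
```

Looking up index["care"] now yields {"d1": [2, 6, 9], "d2": [2, 6, 9]}, matching the slide's positional listing; phrase queries are answered by intersecting position lists.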

  46. Large-scale crawlers often use multiple ISPs and a bank of local storage servers to store the pages crawled. Reference: Soumen Chakrabarti, Mining the Web – Discovering Knowledge from Hypertext Data, Morgan Kaufmann, 2003.

  47. Keyword Search • The retrieval software is the key factor in whether a search engine gains wide use, because users mostly judge a system by its search speed and search results, and both fall within the retrieval software's scope. • Artificial intelligence, natural language • Ranking: PageRank, HITS • Query Expansion

  48. WAIS: • Wide Area Information Servers (WAIS) is a software suite that builds full-text indexes and provides full-text retrieval of network resources. It consists of three main parts: a server, a client, and a protocol. • Query styles: • Keyword • Concept-based • Fuzzy • Natural Language

  49. PageRank • A page can have a high PageRank if there are many pages pointing to it, or if some of the pages that point to it themselves have a high PageRank. Reference: A Survey On Web Information Retrieval Technologies

  50. We assume page A has pages T1…Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1; Google usually sets d to 0.85. C(A) is defined as the number of links going out of page A. The PageRank of page A is given as follows: PR(A) = (1 − d) + d · (PR(T1)/C(T1) + … + PR(Tn)/C(Tn)) Reference: A Survey On Web Information Retrieval Technologies
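The PageRank formula can be iterated to a fixed point. A minimal sketch in Python, assuming every page has at least one outgoing link (so C(T) is never zero) and using the paper's formulation, in which ranks average to 1 rather than summing to 1:

```python
def pagerank(links, d=0.85, iterations=50):
    # links maps each page to the list of pages it links to.
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # start every page at rank 1
    for _ in range(iterations):
        new = {}
        for a in pages:
            # Sum PR(T)/C(T) over every page T that points to A.
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if a in links[t])
            new[a] = (1 - d) + d * incoming
        pr = new
    return pr
```

For example, in the graph a→{b,c}, b→{c}, c→{a}, page c ends up with the highest rank: it is cited by both a and b, and its own rank then flows back to a.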
