WWW servers and search engines


Presentation Transcript


  1. WWW servers and search engines 2004, 劉震昌

  2. Web browser and server
     • Tools to read HTML documents
     • Client: Web browser
       • Click a link, send a request, display the returned document
     • Server: Web server (e.g., one running IIS)
       • Find the document, return the HTML document
     • Where is the web server?
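
The request/response exchange on this slide can be sketched with Python's standard library. This is a minimal illustration, not how IIS or any real browser is implemented; the `DOCUMENTS` table, `DocumentHandler`, `fetch`, and `demo` are hypothetical names chosen for the example, and the server binds to the local machine only.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical document store standing in for HTML files on the server's disk.
DOCUMENTS = {"/index.html": b"<html><body>Hello, Web</body></html>"}


class DocumentHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = DOCUMENTS.get(self.path)      # server: find document
        if body is None:
            self.send_error(404, "Not Found")
            return
        self.send_response(200)              # server: return HTML document
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo quiet


def fetch(host, port, path):
    """Client: send a request and return (status code, response body)."""
    conn = http.client.HTTPConnection(host, port, timeout=5)
    try:
        conn.request("GET", path)
        response = conn.getresponse()
        return response.status, response.read().decode("utf-8")
    finally:
        conn.close()


def demo(path="/index.html"):
    # Port 0 asks the operating system for any free port.
    server = HTTPServer(("127.0.0.1", 0), DocumentHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        host, port = server.server_address
        return fetch(host, port, path)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

A browser does the same thing as `fetch` when a link is clicked, then renders the returned HTML instead of printing it.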

  3. Probing the Internet (cont.)
     • Tools: tracert, ping
     • Packet: the unit of data transmission on a network
     • A packet travels from the source, through routers, to the destination (www.yahoo.com.tw)

  4. Probing the Internet (How do you know you are on the Internet?)
     • ping www.yahoo.com.tw

       Pinging rc.tpe.yahoo.com [202.1.237.23] with 32 bytes of data:

       Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
       Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
       Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
       Reply from 202.1.237.23: bytes=32 time=4ms TTL=246

       Ping statistics for 202.1.237.23:
           Packets: Sent = 4, Received = 4, Lost = 0 (0% loss),
       Approximate round trip times in milli-seconds:
           Minimum = 4ms, Maximum = 5ms, Average = 4ms
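
The statistics at the end of the ping output follow directly from the four reply lines. The sketch below recomputes them from the reply text on this slide. `summarize` is a hypothetical helper, and the integer division for the average is an assumption, chosen because it reproduces the 4 ms reported above. The regular expression only handles replies of the form `time=Nms`.

```python
import re

# Reply lines copied from the ping output on the slide.
PING_REPLIES = """\
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=5ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
Reply from 202.1.237.23: bytes=32 time=4ms TTL=246
"""


def summarize(replies_text, sent):
    """Recompute packet-loss and round-trip statistics from reply lines."""
    times = [int(ms) for ms in re.findall(r"time=(\d+)ms", replies_text)]
    received = len(times)
    lost = sent - received
    summary = {
        "sent": sent,
        "received": received,
        "lost": lost,
        "loss_percent": 100 * lost // sent,
    }
    if times:
        summary["minimum"] = min(times)
        summary["maximum"] = max(times)
        # Assumption: the average is truncated to whole milliseconds.
        summary["average"] = sum(times) // len(times)
    return summary


if __name__ == "__main__":
    print(summarize(PING_REPLIES, sent=4))
```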

  5. The route from source to destination
     • tracert www.yahoo.com.tw

       Tracing route to rc.tpe.yahoo.com [202.1.237.23]
       over a maximum of 30 hops:

         1    <1 ms    <1 ms    <1 ms  gateway.lan20.csie.ncnu.edu.tw [163.22.20.254]
         2    <1 ms    <1 ms    <1 ms  ip253.puli01.ncnu.edu.tw [163.22.1.253]
         3    <1 ms    <1 ms    <1 ms  ip090.puli255-64-203.ncnu.edu.tw [203.64.255.90]
         4     1 ms     1 ms     1 ms  140.128.251.38
         5    17 ms    74 ms     2 ms  tc-tanet-gw01.router.hinet.net [211.22.189.186]
         6     2 ms     1 ms     1 ms  211.22.189.190
         7     1 ms     1 ms     1 ms  tc-c12r2.router.hinet.net [211.22.189.74]
         8     4 ms     4 ms     4 ms  tp-s2-c12r2.router.hinet.net [210.65.200.30]
         9     4 ms     4 ms     4 ms  tp-s2-c6r8.router.hinet.net [211.22.35.181]
        10     9 ms     5 ms     6 ms  211.22.41.89
        11     5 ms     5 ms     5 ms  rc.tpe.yahoo.com [202.1.237.23]

       Trace complete.

  6. Lab#5
     • Use ping and tracert to probe www.google.com.tw
     • Record your results in a text file
     • Email the file to me with the subject: Lab5 <student ID>

  7. How do you host a site (WWW, FTP, …) with a dynamic IP?
     • DHCP (Dynamic Host Configuration Protocol)
       • DHCP explanation
       • The same host may get IP 163.22.123.111 at one time and IP 163.22.123.123 at another
       • If we want to communicate with it, what is its IP or domain name?
     • Run your own DNS (domain name server)
     • Dynamically register the IP and domain name

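  8. www.no-ip.com
     • A dynamic DNS service: the DNS server at www.no-ip.com stores the mapping
     • Register the mapping between the IP and the domain name, e.g., Kamiry.no-ip.com to IP 163.22.123.111
     • Reference: No-IP usage documentation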
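
The idea behind dynamic registration can be sketched as a name-to-address table that the host updates whenever DHCP hands it a new address. This is an in-memory illustration only and does not use the No-IP service or its client; `DynamicDnsRegistry` and `Host` are hypothetical classes, and the host name and addresses are the ones shown on the slides.

```python
class DynamicDnsRegistry:
    """Hypothetical stand-in for a dynamic DNS server's name-to-IP table."""

    def __init__(self):
        self._records = {}

    def register(self, hostname, ip):
        # Host names are case-insensitive, so store them in lower case.
        self._records[hostname.lower()] = ip

    def resolve(self, hostname):
        return self._records.get(hostname.lower())


class Host:
    """A machine whose IP address changes with each DHCP lease."""

    def __init__(self, hostname, registry):
        self.hostname = hostname
        self.registry = registry
        self.ip = None

    def on_new_lease(self, ip):
        # Every time DHCP assigns a new address, re-register the mapping.
        self.ip = ip
        self.registry.register(self.hostname, ip)


if __name__ == "__main__":
    registry = DynamicDnsRegistry()
    host = Host("Kamiry.no-ip.com", registry)
    host.on_new_lease("163.22.123.111")
    print(registry.resolve("kamiry.no-ip.com"))
    host.on_new_lease("163.22.123.123")
    print(registry.resolve("kamiry.no-ip.com"))
```

Clients keep using the same domain name; only the registry entry changes when the lease does.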

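  9. Installing IIS (Internet Information Services)
     • Found on the Windows CD
     • Installation instructions
     • IIS configuration
     • Microsoft IIS is very widespread and has many security vulnerabilities; a non-Microsoft WWW server can be used instead
       • E.g., Apache, AnalogX, …
     • Reference documents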

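  10. HW#3
      • Set up a WWW server on your own computer
      • Email the server's domain name to me
      • Put your personal web page on your own computer
      • The server must be running during the hours specified by the teaching assistant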

  11. Searching the Web
      • Ref: Chapter 13 of “Modern Information Retrieval” by Ricardo Baeza-Yates and Berthier Ribeiro-Neto

  12. Outline
      • Measuring the Web
      • Methods for searching the Web
      • Search engines
      • Web directories

  13. Searching the Web
      • The WWW started in 1989
      • The textual data alone is estimated to be on the order of one terabyte
      • Goal: how can we efficiently manage, retrieve, and filter information from the Web?

  14. Challenges
      • Distributed data
        • Data spans many computers interconnected without a predefined topology
      • High percentage of volatile data
        • 40% of the Web changes every month
      • Large volume
      • Unstructured and redundant data
        • 30% of Web pages are (near) duplicates
      • Heterogeneous data
        • Different languages

  15. Measuring the Web
      • 1998: about 3M (3 million) Web servers on the Internet
      • Number of Web servers = 1/10 of the number of computers on the Internet

  16. Measuring the Web (cont.)
      • 1998
        • 5 KB per Web page on average
        • 300M (300 million) Web pages
        • 300M * 5 KB = 1.5 terabytes
        • Growing at a rate of 20M pages per month
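
The size estimate on this slide is a single multiplication. The sketch below spells it out, assuming decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), which is what makes the product come out to exactly 1.5 TB. `pages_after` additionally assumes the 20M pages per month growth stays constant, which the slide does not claim.

```python
PAGES_1998 = 300_000_000         # 300M Web pages
AVG_PAGE_BYTES = 5_000           # 5 KB per page, decimal units (assumption)
GROWTH_PER_MONTH = 20_000_000    # 20M new pages per month


def total_terabytes(pages, avg_page_bytes):
    """Total text volume in terabytes (1 TB = 10**12 bytes)."""
    return pages * avg_page_bytes / 1e12


def pages_after(months):
    """Page count after some months, assuming constant linear growth."""
    return PAGES_1998 + GROWTH_PER_MONTH * months


if __name__ == "__main__":
    print(total_terabytes(PAGES_1998, AVG_PAGE_BYTES))  # 1.5
    print(pages_after(12))                              # 540000000
```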

  17. Growth of the Web
      • [Figure: number of Web pages and Web sites, in millions (axis ticks at 100, 200, 300), by year, 1996 to 1998]

  18. Methods for searching the Web
      • Search engines
        • Index the Web documents as a full-text database
        • AltaVista, Google, …
      • Web directories (portal site directories)
        • Classify selected Web documents by subject
        • Yahoo!

  19. Search engine concept
      • Model the Web as a database
      • All queries must be answered without accessing the Web pages
      • [Diagram: the user sends queries to the database]
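
One common way to answer queries without touching the pages themselves is to build an inverted index ahead of time. The slide does not name a specific structure, so the sketch below is an assumption for illustration; the three example pages and the function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical pages, as they might have been collected earlier.
PAGES = {
    "http://example.org/a": "Web servers return HTML documents",
    "http://example.org/b": "Search engines index Web documents",
    "http://example.org/c": "Routers forward packets",
}


def tokenize(text):
    """Split text into lower-case alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(pages):
    """Map each term to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for term in tokenize(text):
            index[term].add(url)
    return index


def lookup(index, term):
    """Answer a one-word query from the index alone, in sorted URL order."""
    return sorted(index.get(term.lower(), set()))


if __name__ == "__main__":
    index = build_index(PAGES)
    print(lookup(index, "Web"))
    print(lookup(index, "packets"))
```

At query time only `index` is consulted; the pages are read once, when the index is built.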

  20. Search engines (cont.)
      • AltaVista (www.altavista.com)
        • 20 multi-processor machines
        • 130 GB of RAM each
        • Over 500 GB of disk space each
        • 75% of the resources are used by the query engine

  21. The top search engines
      • International
        • Google (www.google.com)
        • www.yahoo.com
        • www.altavista.com
        • Inktomi (www.inktomi.com)
      • Statistics on search engines
        • www.searchenginewatch.com
        • http://imt.net/~notess/search
      • Taiwan
        • Yahoo!/Kimo, which uses Google
        • Openfind (www.openfind.com.tw), by Prof. 吳昇 of National Chung Cheng University
        • Yam (www.yam.com.tw)

  22. Search engines (cont.)
      • Centralized crawler-indexer architecture
      • [Diagram components: users, User Interface, Query Engine, Index database, Indexer, Crawler, Web]

  23. User Interface
      • Query interface
        • Keywords
        • Boolean operators
      • Answer interface
        • Rank the retrieved pages by:
          • Statistics about term occurrence within the document
          • Popularity
          • Hyperlink information
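
As a rough illustration of the two interfaces, the sketch below filters documents with a Boolean AND or OR over keywords and then ranks the matches by how often the query terms occur. Raw term counts are only one simple reading of "statistics about term occurrence"; popularity and hyperlink information are not modeled. The documents and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical document collection.
DOCS = {
    "doc1": "web search engines rank web pages",
    "doc2": "a web server returns pages",
    "doc3": "search the library catalog",
}


def term_counts(text):
    """Count how often each lower-case term occurs in the text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def search(docs, terms, operator="AND"):
    """Boolean keyword query, ranked by total occurrences of the terms."""
    combine = all if operator == "AND" else any
    scores = {}
    for name, text in docs.items():
        counts = term_counts(text)
        if combine(counts[term] > 0 for term in terms):
            scores[name] = sum(counts[term] for term in terms)
    # Highest score first; ties are broken by document name.
    return sorted(scores, key=lambda name: (-scores[name], name))


if __name__ == "__main__":
    print(search(DOCS, ["web", "pages"], operator="AND"))
    print(search(DOCS, ["search", "web"], operator="OR"))
```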

  24. [Diagram repeated: crawler-indexer architecture with users, User Interface, Query Engine, Index database, Indexer, Crawler, Web]

  25. Crawler
      • Also called robots, spiders, wanderers, walkers, and knowbots
      • Despite the name, a crawler runs on a local system and sends requests to remote Web servers
      • Method: start with a set of URLs, and from those pages extract other URLs
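
The method on this slide (start with a set of URLs, then extract further URLs from each page) can be sketched as a breadth-first traversal. To keep the example runnable without network access, the "Web" here is a hypothetical in-memory dictionary, and `extract_links` uses a deliberately simple regular expression rather than a real HTML parser.

```python
import re
from collections import deque

# Hypothetical miniature Web: URL -> HTML source.
WEB = {
    "http://a.example/": '<a href="http://b.example/">B</a> <a href="http://c.example/">C</a>',
    "http://b.example/": '<a href="http://c.example/">C</a> <a href="http://a.example/">A</a>',
    "http://c.example/": '<a href="http://d.example/">D</a>',
    "http://d.example/": "no links here",
    "http://e.example/": '<a href="http://a.example/">A</a>',
}


def extract_links(html):
    """Pull the target of every href="..." attribute out of the page."""
    return re.findall(r'href="([^"]+)"', html)


def crawl(seeds, fetch, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    frontier = deque(seeds)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:      # invalid link: nothing to index
            continue
        visited.append(url)
        for link in extract_links(html):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


if __name__ == "__main__":
    print(crawl(["http://a.example/"], WEB.get))
```

Note that `http://e.example/` is never visited: no page reachable from the seed links to it, which is one reason a crawler's picture of the Web is incomplete.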

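  26. Crawler (cont.)
      • Because of how the Web is traversed, the index of a search engine can be thought of as analogous to the stars in the sky: each page was captured at a different time, so the index never shows the Web as it is at one moment
      • The share of invalid links in search engines varies from 2% to 9%
      • As of 1998, the fastest crawlers can traverse up to 10M Web pages per day
      • At that rate, covering 300M pages takes 300M / 10M = 30 days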

  27. Web directories
      • Classify Web pages by category
      • Directories are hierarchical taxonomies that classify human knowledge
      • Yahoo! has close to 1M pages classified
      • How are pages classified?
        • Pages have to be submitted to the Web directories
        • Classification is done manually by a few people
        • Automatic classification is not yet mature
      • Not every page is classified
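
A hierarchical taxonomy can be sketched as a tree of categories, where a submitted page is filed under a path of category names. The `Category` class, the helper functions, and the category names below are hypothetical and are not taken from any real directory.

```python
class Category:
    """One node in the directory's category tree."""

    def __init__(self, name):
        self.name = name
        self.children = {}   # sub-category name -> Category
        self.pages = []      # URLs filed directly under this category

    def child(self, name):
        """Return the named sub-category, creating it if needed."""
        if name not in self.children:
            self.children[name] = Category(name)
        return self.children[name]


def classify(root, path, url):
    """File a submitted URL under a path such as ['Computers', 'Internet']."""
    node = root
    for name in path:
        node = node.child(name)
    node.pages.append(url)


def pages_under(node):
    """All URLs in this category and its sub-categories."""
    found = list(node.pages)
    for name in sorted(node.children):
        found.extend(pages_under(node.children[name]))
    return found


def build_demo():
    root = Category("Top")
    classify(root, ["Computers", "Internet"], "www.altavista.com")
    classify(root, ["Computers", "Internet"], "www.google.com")
    classify(root, ["Computers"], "www.openfind.com.tw")
    classify(root, ["News"], "www.yam.com.tw")
    return root


if __name__ == "__main__":
    root = build_demo()
    print(pages_under(root.child("Computers")))
```

Pages that nobody submits never appear in the tree, which matches the slide's point that not every page is classified.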

  28. Some Web directories

      Web directory    URL                  Web sites (K)   Categories
      Yahoo!           www.yahoo.com        750
      LookSmart        www.looksmart.com    300             24
      Lycos Subjects   a2z.lycos.com        50
      eBLAST           www.eblast.com       125
      NewHoo           www.newhoo.com       100             23
      Magellan         www.mckinley.com     60
      Netscape         www.netscape.com
      Snap             www.snap.com

  29. Lab on search engines
      • Today, 1:00~3:00

  30. Final typing test
      • 10/20
      • Students who do not meet the standard lose 10 points from their total semester score
