
CS5286 Algorithms and Techniques for Web Search

CS5286 Algorithms And Techniques for Web Search. Objective: Provide a practical introduction to algorithms and techniques for information retrieval over the Internet. ...

LeeJohn

Presentation Transcript


    Slide 1:CS5286 Algorithms And Techniques for Web Search

    Objective: Provide a practical introduction to algorithms and techniques for information retrieval over the Internet.

    Slide 2: Contact

    Lecturer: Professor DENG, Xiaotie. Room Y6321, Ext 8632, Email: csdeng
    TA: SUN Wei. Room CYC2207, Ext 8030, Email: sunwei@cs

    Slide 3: Assessment

    Coursework: 50%
    Quizzes: 20% (two quizzes, each worth 10% of the final mark).
    Group project: 27% (2-3 people in a group).
    Participation: 3%, earned at the Discussion Forum, tutorials and classes (one point each).
    Examination: 50% (one 1.5-hour examination).
    At least 30% of the examination marks are required to pass.

    Slide 4: Reference Books

    Modern Information Retrieval, by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, Addison Wesley, 1999.
    Guide to Search Engines, by Wes Sonnenreich and Tim Macinta, Wiley Computer Publishing, 1998.

    Slide 5: Students Will Acquire The Following

    Web access: automated access to existing search engines; the use of spiders/robots for web searching; collection of visitor information to one's own web site.
    Web mining: ranking techniques for web sites on specific topics; automated abstract generation; user profiles.
    Information retrieval: basic models; major query operations; indexing and searching; new research topics.

    Slide 6: Some Helpful Web Sites

    A history of search engines: http://www.wiley.com/legacy/compbooks/sonnenreich/webdev/history.html
    Java and the class URL (search under class net): http://java.sun.com/j2se/1.3/docs/api/index.html
    Free search engines written in Java: http://www.freewarejava.com/applets/search.shtml
    Robots: http://www.robotstxt.org/wc/robots.html

    Slide 7: Tentative Lecture Plan

    The Internet and Web
    Collection of Information over the Web
    Quiz 1
    Models of Information Retrieval
    Query techniques
    Quiz 2
    Start of Project
    Text Operations
    Indexing and Searching Techniques

    Slide 8: Tentative Tutorial Session Plan

    The purpose: to provide hands-on learning experience.
    Material to be covered: a review of Java and connecting to the Internet; the functionality of a spider/robot; access to major search engines; a simple search engine in Java.
    In addition, the tutorial sessions will include: submission and discussion of the project proposal and plan; project presentations.

    Slide 9: Plan For The Group Project

    Two or three people in a group.
    It is best to do a project that applies one of the following available tools to some application problem: a spider/robot; the major search engines; the simple search engine in Java.
    Some examples of possible projects: build a network map of co-authorship relations; build relationship networks by Internet information retrieval; design a method to test which search engine covers more web pages.
    Start your project as early as possible.

    Slide 10: Pre-Requisites

    You know how to program in Java, or you are capable of learning Java programming in about one week.
    DROP the course if you are not.
    There will be a quick quiz on Java to determine whether the course is suitable for you.

    Slide 11:Lecture 1: Introduction

    Slide 12:A Simple Search Engine Architecture

    Components shown in the diagram: Web, User, Spider, Indexer, Query Interface, Query Engine, Database.

    Slide 13: Major issues

    Spider and communication between the computer and the Internet.
    Data/document model for information retrieval.
    Query protocol design.
    User profile techniques.
    Interactive information retrieval technique design.

    Slide 14: Spiders

    A spider retrieves web pages automatically:
    Start with a URL and retrieve the associated web page.
    Find all URLs on the web page.
    Recursively retrieve the URLs that have not yet been searched.
    Algorithmic issues: How to choose the next URL? How to avoid overloading sub-networks?
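
    The retrieval loop on this slide can be sketched without touching the network. The example below is only an illustration under stated assumptions: the class name CrawlOrderSketch, the in-memory link map, and the maxPages limit are hypothetical, and breadth-first order is just one possible answer to "how to choose the next URL".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Breadth-first crawl over an in-memory link graph (no network access). */
final class CrawlOrderSketch {

    /**
     * Returns pages in the order a breadth-first spider would visit them.
     * links maps each URL to the URLs found on that page (hypothetical data).
     */
    static List<String> crawl(String startUrl, Map<String, List<String>> links, int maxPages) {
        Deque<String> frontier = new ArrayDeque<>(); // URLs waiting to be retrieved
        Set<String> seen = new HashSet<>();          // URLs already queued or visited
        List<String> visited = new ArrayList<>();

        frontier.add(startUrl);
        seen.add(startUrl);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();            // first in, first out: breadth-first
            visited.add(url);
            for (String next : links.getOrDefault(url, List.of())) {
                if (seen.add(next)) {                // add() returns false for known URLs
                    frontier.add(next);
                }
            }
        }
        return visited;
    }
}
```

    Replacing the queue with a priority queue would change which URL is chosen next without changing the rest of the loop.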

    Slide 15:Indexer

    Selects the terms to index for a document.
    May utilise co-operation from web page authors, who can use META tags to indicate specific terms to index: <META name="keywords" content="information retrieval">
    Algorithmic issues: how to choose terms, phrases or other entities to index so as to respond to user queries accurately and efficiently.
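
    As a rough illustration of reading author-supplied index terms, the sketch below pulls the keywords out of a META tag. It is an assumption-laden example: MetaKeywordSketch is a hypothetical name, the regular expression only handles a double-quoted content attribute that follows the name attribute, and a real indexer would use a proper HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Extracts comma-separated terms from a META keywords tag (simplified). */
final class MetaKeywordSketch {
    // Matches <meta name="keywords" content="..."> in any letter case.
    private static final Pattern KEYWORDS = Pattern.compile(
        "<meta\\s+name\\s*=\\s*\"keywords\"\\s+content\\s*=\\s*\"([^\"]*)\"",
        Pattern.CASE_INSENSITIVE);

    static List<String> keywords(String html) {
        List<String> terms = new ArrayList<>();
        Matcher m = KEYWORDS.matcher(html);
        if (m.find()) {
            for (String term : m.group(1).split(",")) {
                String cleaned = term.trim().toLowerCase(Locale.ROOT);
                if (!cleaned.isEmpty()) {
                    terms.add(cleaned);              // normalised index term
                }
            }
        }
        return terms;                                // empty when no tag is present
    }
}
```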

    Slide 16:Database

    Trade-off between hardware cost and speed.
    Algorithmic issues: efficiency in space; redundancy as a trade-off for speed in query response.
    Cost efficiency: How many computers to use? How to distribute load efficiently?

    Slide 17:Query Engine

    Returns the most relevant documents for queries.
    Algorithmic issues: document model; relevance analysis.
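
    To make "most relevant" concrete, here is a deliberately naive sketch that scores each document by how often the query terms occur in it. This is not the relevance model of the course or of any real engine; the class TermCountRankingSketch and its methods are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Ranks documents by raw counts of query terms (a toy relevance measure). */
final class TermCountRankingSketch {

    /** Sums the occurrences of every query term in the text, ignoring case. */
    static int score(String query, String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            counts.merge(word, 1, Integer::sum);
        }
        int total = 0;
        for (String term : query.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                total += counts.getOrDefault(term, 0);
            }
        }
        return total;
    }

    /** Returns the ids of documents with a positive score, best first. */
    static List<String> rank(String query, Map<String, String> docs) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> entry : docs.entrySet()) {
            if (score(query, entry.getValue()) > 0) {
                ids.add(entry.getKey());
            }
        }
        ids.sort((a, b) -> Integer.compare(score(query, docs.get(b)), score(query, docs.get(a))));
        return ids;
    }
}
```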

    Slide 18:Query Interface

    Analyses user profiles and generates user-specific query results.
    Algorithmic issues: design of efficient and user-friendly query protocols.

    Slide 19: Interesting Problems

    Finding the needle in the haystack: searching for specific information on the Internet.
    User-specific ranking of documents on the web: how to collect and apply user information to provide better service.
    Trust analysis of information on the web: avoid providing false information.
    Trustworthiness analysis of virtual identities over the Internet. http://www.firstgov.gov/Citizen/Topics/Internet_Fraud.shtml

    Slide 20:Some Facts about the Internet

    Slide 21:Statistics About Internet

    Internet domain growth: http://www.isc.org/index.pl?/ops/ds/
    How to conduct an Internet Domain Survey: http://www.isc.org/ds/faq.html

    Slide 22:Internet Growth Charts

    http://www.cyveillance.com/web/us/newsroom/releases/2000/2000-07-10.htm

    Slide 23:Internet Provides Varieties of Information

    Text documents
    Multimedia files
    Interactive information services
    Internet group membership services
    Databases
    Frauds: Trojan horses and phishing tricks

    Slide 24:Major Features of Information Retrieval on the Internet

    Large amount of information
    Rapid information update
    Dynamic hyperlink structure
    Varieties of data format, language and quality

    Slide 25:Some Difficulties for Internet Information Retrieval Systems

    Diversified user base (from laymen to computer nerds): could we develop an evolving system that adapts to the user?
    Language ambiguity: this becomes an especially important issue because of the variety of data on the Internet. How do we collect user profiles and apply profiling techniques to resolve it?

    Slide 26:Search Engines Today

    Slide 27:Evolving Search Engines

    Tools for finding information on the Web. Problem: hidden databases, e.g. the New York Times.
    Directory: a hand-constructed hierarchy of topics (e.g. Yahoo).
    Search engine: a machine-constructed index (usually by keyword).
    Interactive searching: http://www.learnthenet.com/english/html/78tutorial.htm
    Specialized searching: Google Scholar, http://www.scholar.google.com/
    Guide to finding search engines: http://www.searchenginecolossus.com/
    New trends in search engines: http://www.searchengineshowdown.com/

    Slide 28:Coverage of Search Engine

    Number of web pages covered: self-claimed, and may include pages known only through links, without the page itself being analysed.
    Page depth: the maximum amount of information indexed for an individual web page. http://blog.searchenginewatch.com/blog/041111-084221

    Slide 29:Search Engine Sizes (Apr. 6, 2001)

    Source: SearchEngineWatch.com
    Legend: AV=Altavista, EX=Excite, FAST=FAST, GG=Google, Go=Go (Infoseek), INK=Inktomi, NL=Northern Light, WT=WebTop.com.
    Estimated total web pages: about 2 billion.
    Shaded data for GG and Inktomi includes pages indexed but not visited.
    Searches/day (millions): 100, 12, 50, 47, 50, 5

    Slide 30:Search Engine Sizes (Dec 11, 2001)

    SOURCE: http://searchenginewatch.com/reports/sizes.html AV=Altavista, EX=Excite, FAST=FAST, GG=Google, Go=Go (Infoseek), INK=Inktomi, NL=Northern Light, WT=WebTop.com.

    Slide 31:Search Engine Size Trends

    SOURCE: http://searchenginewatch.com/reports/article.php/2156481#trend

    Slide 32:Search Engines Disjointness

    SOURCE: SEARCHENGINESHOWDOWN

    Slide 33:Search Engines Uniqueness

    SOURCE: http://www.searchengineshowdown.com/stats/overlap.shtml

    Slide 34:Time Spent Per Visitor (minutes) by Search Engine, April 1999

    SOURCE: http://www.nielsen-netratings.com/ AV=Altavista, EX=Excite, Go/IS=Go/Infoseek, GT=GoTo, HB=Hotbot, LS=LookSmart, LY=Lycos, MSN=MSN, NS=Netscape, WC=Webcrawler, YH=Yahoo.

    Slide 35:Time Spent Per Visitor (minutes) by Search Engine, June 2002

    SOURCE: http://searchenginewatch.com/reports/netratings.html MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, IS=InfoSpace; OVR=Overture (GoTo), AV=AltaVista, NS=Netscape, LS=LookSmart, LY=Lycos; DP=Dogpile.

    Slide 36:Total Hours Spent (millions) by Search Engine, June 2002

    SOURCE: http://searchenginewatch.com/reports/netratings.html MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, IS=InfoSpace; OVR=Overture (GoTo), AV=AltaVista, NS=Netscape, LS=LookSmart, LY=Lycos; DP=Dogpile.

    Slide 37:Audience Reach by Search Engine, July 2001

    SOURCE: http://wreportus.mediametrix.com/clientCenter.html AJ=Ask Jeeves, AV=Altavista, DH=Direct Hit, DP=Dogpile, EX=Excite, GG=Google, GO=Go/Infoseek, G2N=GoTo, HB=Hotbot, iWN=iWon, LS=LookSmart, LY=Lycos, MC=Metacrawler, MM=Mamma, MSN=MSN, NL=Northern Light, NS=Netscape, WC=Webcrawler, YH=Yahoo. Audience Reach = % of active surfers visiting during the month. Totals exceed 100% because of overlap.

    Slide 38:Audience Reach by Search Engine, Mar. 2002

    SOURCE: http://searchenginewatch.com/reports/mediametrix.html MSN=MSN, YH=Yahoo, GG=Google, AOL=AOL, AJ=Ask Jeeves, LS=LookSmart, ISP=InfoSpace, NS=Netscape, OVR=Overture (GoTo). Audience Reach = % of active surfers visiting during month. Totals exceed 100% because of overlap

    Slide 39:Start With Spider

    Slide 40:Spider Architecture

    Components shown in the diagram: Database, Database Interface, Shared URL pool, several url_spider instances (the spiders), Web Space.
    Labelled flows: Http Request, Http Response, Add a new URL, Get a URL.
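
    The shared URL pool in this architecture has to be safe for several spiders to use at once. The sketch below shows one possible shape for it, using standard java.util.concurrent collections; the class SharedUrlPoolSketch and its method names are hypothetical and not taken from the course code.

```java
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

/** A URL pool that many spider threads can share without extra locking. */
final class SharedUrlPoolSketch {
    private final Queue<String> pending = new ConcurrentLinkedQueue<>(); // URLs not yet fetched
    private final Set<String> known = ConcurrentHashMap.newKeySet();     // every URL ever added

    /** "Add a new URL": queues the URL unless it has been added before. */
    boolean add(String url) {
        if (known.add(url)) {        // atomic check-and-insert
            pending.add(url);
            return true;
        }
        return false;
    }

    /** "Get a URL": returns the next pending URL, or null when the pool is empty. */
    String next() {
        return pending.poll();
    }
}
```

    Each spider thread would call next(), fetch the page, and call add() for every link it finds; the known set keeps two spiders from fetching the same page.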

    Slide 41:Communication

    How a web browser communicates with the computer.
    How a browser communicates with the Internet.
    How data travels through the Internet.
    How a web browser communicates with a web server.

    Slide 42:Web Browser

    A primary tool for gathering information from the Internet.
    Examples: Netscape Navigator (now Firefox) and Microsoft's Internet Explorer.

    Slide 43:Web Server

    It connects a computer's web pages to the Internet.
    It serves web pages to browsers.
    It usually listens on TCP port 80.

    Slide 44:Uniform Resource Locator(URL)

    The address of a web page on the net.
    The web server waits at this address for browsers.
    A web browser uses the URL to travel to the address and request the desired web page from the web server.
    If the web server gives the page to the web browser, the browser then displays it to the user.
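
    A URL carries the pieces a browser needs to find the server and the page. The following sketch splits one apart with the standard java.net.URI class; UrlPartsSketch and describe are hypothetical names, and filling in port 80 for http URLs without an explicit port is the only default the example handles.

```java
import java.net.URI;

/** Splits a URL into the parts a browser uses to reach the web server. */
final class UrlPartsSketch {

    /** Returns "scheme host port path", for example "http example.com 80 /". */
    static String describe(String address) {
        URI uri = URI.create(address);
        int port = uri.getPort();                    // -1 when the URL names no port
        if (port == -1 && "http".equals(uri.getScheme())) {
            port = 80;                               // usual port of a web server
        }
        String path = uri.getPath().isEmpty() ? "/" : uri.getPath();
        return uri.getScheme() + " " + uri.getHost() + " " + port + " " + path;
    }
}
```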

    Slide 45:TCP/IP for Internet Connection

    IP stands for Internet Protocol.
    TCP stands for Transmission Control Protocol.
    TCP is layered on top of IP.
    The resulting communication system is TCP/IP.

    Slide 46:The IP layer

    The inter-network layer.
    Data are broken down into packets of limited size and sent to their destinations.
    An IP address (IPv4) consists of four 8-bit numbers, for example 144.214.37.200.
    Routers use the IP address to send packets to their destinations.
    Packets of the same stream of data may travel by different routes.
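
    Since an address of this form is four 8-bit numbers, it fits in 32 bits. The sketch below packs the dotted form into a single value; Ipv4Sketch and toLong are hypothetical names used only for this illustration, and a long is used so the result is never negative.

```java
/** Packs a dotted-quad IPv4 address into one 32-bit value. */
final class Ipv4Sketch {

    /** Converts, for example, "0.0.1.0" to 256. */
    static long toLong(String dotted) {
        String[] parts = dotted.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("expected 4 parts: " + dotted);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {          // each part must fit in 8 bits
                throw new IllegalArgumentException("bad octet: " + part);
            }
            value = (value << 8) | octet;            // shift earlier octets up by 8 bits
        }
        return value;
    }
}
```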

    Slide 47:The TCP layer

    A service provider protocol.
    It provides a logical connection between the sender and the receiver of data over the unreliable network.
    Its data integrity functions and mechanisms are the basis for application services such as FTP and Telnet.

    Slide 48:TCP/IP Port Number

    One for each specific application layer service.
    Used between two host computers to identify which application program is to receive the incoming traffic.
    Ports 0-1023 are pre-assigned and are called well-known ports.
    If you want to assign a port number to your own application, use a number above 1023.

    Slide 49:Browser/Server Interaction

    You type a URL (or click on one).
    Your browser opens a connection with the web server at the URL.
    Your browser tells the web server the particular page you want.
    The web server sends back a response giving information about the page.
    It then sends back the appropriate page.
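
    The messages in this exchange are plain text, so they can be shown without opening a connection. The sketch below builds the request a browser might send and reads the status code from the first line of a response; HttpExchangeSketch and its methods are hypothetical names, and only a minimal HTTP/1.1 GET is assumed.

```java
/** Text of a minimal HTTP request, and parsing of a response status line. */
final class HttpExchangeSketch {

    /** Builds the request that asks the server at host for the page at path. */
    static String buildRequest(String host, String path) {
        return "GET " + path + " HTTP/1.1\r\n"
             + "Host: " + host + "\r\n"
             + "Connection: close\r\n"
             + "\r\n";                               // blank line ends the headers
    }

    /** Extracts the numeric code from a status line such as "HTTP/1.1 200 OK". */
    static int statusCode(String statusLine) {
        String[] parts = statusLine.trim().split("\\s+");
        if (parts.length < 2 || !parts[0].startsWith("HTTP/")) {
            throw new IllegalArgumentException("not a status line: " + statusLine);
        }
        return Integer.parseInt(parts[1]);
    }
}
```

    The status line and the headers that follow it are the "information about the page"; the page itself comes after the blank line that ends the response headers.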

    Slide 50:The Spider

    A spider does this automatically (without anyone clicking on a link or typing a URL).
    It is an automated program that searches the web:
    Read a web page.
    Store/index the relevant information on the page.
    Follow all the links on the page (and repeat the above for each link).
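
    Following the links on a page starts with finding them. The sketch below is a simplified illustration: it looks for double-quoted href attributes in anchor tags with a regular expression and resolves them against the page's own URL. LinkExtractorSketch is a hypothetical name, and real pages need a proper HTML parser.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Finds the links on a page and turns them into absolute URLs (simplified). */
final class LinkExtractorSketch {
    // Matches <a ... href="..."> in any letter case; other quoting styles are ignored.
    private static final Pattern HREF =
        Pattern.compile("<a\\s[^>]*?href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

    static List<String> links(String pageUrl, String html) {
        URI base = URI.create(pageUrl);
        List<String> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                found.add(base.resolve(m.group(1)).toString()); // relative -> absolute
            } catch (IllegalArgumentException e) {
                // skip hrefs that are not valid URIs
            }
        }
        return found;
    }
}
```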

    Slide 51:Caution About Using A Spider

    A poorly written spider may put an unexpected amount of traffic load on the network.
    Be responsible for your actions.
    Use a well-tested spider instead of writing your own.
    Test it locally before running it over the Internet.
    Follow the standard guidelines: www.robotstxt.org/wc/guidelines.html
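
    One part of behaving responsibly is honouring a site's robots.txt file. The sketch below reads only the Disallow lines that follow a "User-agent: *" line and checks a path against them. It is a simplified assumption of how such a check could look, not a complete implementation of the robots exclusion rules; RobotsRulesSketch and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Reads the rules for all robots from robots.txt text (simplified). */
final class RobotsRulesSketch {

    /** Collects the Disallow path prefixes listed under "User-agent: *". */
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> prefixes = new ArrayList<>();
        boolean inStarGroup = false;
        for (String raw : robotsTxt.split("\\r?\\n")) {
            String line = raw.replaceFirst("#.*", "").trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;                                       // blank or malformed line
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (field.equals("disallow") && inStarGroup && !value.isEmpty()) {
                prefixes.add(value);
            }
        }
        return prefixes;
    }

    /** A path is allowed when it starts with none of the disallowed prefixes. */
    static boolean allowed(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }
}
```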

    Slide 52:Tutorials

    Start with a review of Java.
    Then how to connect to the Internet.
    Use of a spider.
    Major functionality of a search engine.
    In addition, certain tasks will be assigned so that you gain first-hand experience.

    Slide 53:Today's Tutorial

    A typical Java program.
    A typical Java program that takes a URL as input and returns the content of the web page.
    Some further questions will be left as exercises.
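
    A program of the second kind might look like the sketch below. It is not the tutorial's own code: PageFetchSketch and fetch are hypothetical names, the content is assumed to be UTF-8 text, and readAllBytes needs Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

/** Takes a URL as input and returns the content found at that address. */
final class PageFetchSketch {

    static String fetch(String address) {
        try {
            URL url = URI.create(address).toURL();
            try (InputStream in = url.openStream()) {          // opens the connection
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);                 // e.g. host unreachable
        }
    }
}
```

    Calling PageFetchSketch.fetch with an http address returns the page source as a string; the same call also works with a file: URL, which is convenient for testing locally before going over the Internet.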

    Slide 54:Next Week's Tutorial

    Introduction to Java network programming.
    Introduction to HTTP.
    The Java URL class for establishing an HTTP connection.
