CS 430: Information Discovery

CS 430: Information Discovery Lecture 21 Web Search 3

Course Administration Thursday, November 11: No office hours. Tuesday, November 16: No class. Wednesday, November 17: Discussion class requires you to read three short papers. Wednesday, December 1: Discussion class requires you to search for and read materials on a specified topic.

Effective Information Retrieval 1. Comprehensive metadata with Boolean retrieval (e.g., monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available. 2. Full text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous material, but requires available full text. Neither of these methods is very effective when applied directly to the Web.

Effective Information Retrieval (cont) 3. Full text indexing with contextual information and ranked retrieval (e.g., Google, Teoma). Excellent for mixed textual information with rich structure. 4. Contextual information with non-textual materials and ranked retrieval (e.g., Google and Yahoo image retrieval). Promising, but still experimental.

New concepts in Web Searching • Goal of search is redefined to emphasize precision of the most highly ranked group of hits. • Concept of relevance is changed to include importance of documents as a factor in ranking. • Browsing is tightly connected to searching. • Contextual information is used as an integral part of the search.

Browsing Users give queries of 2 to 4 words. Most users click only on the first few results; few go beyond the fold on the first page. 80% of users use a search engine to find sites: search to find the site, browse to find the information. Amit Singhal, Google, 2004

Browsing and Searching Searching is followed by browsing. Browsing the hit list: helpful summary records (snippets); removal of duplicates; grouping results from a single site. Browsing the web pages themselves: direct links from the snippets to the pages; cache with highlights; translation in same format.
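The hit-list clean-up described on this slide can be sketched in a few lines. This is a minimal illustration, not how any search service does it: the function name `group_hits` and the `per_site` limit are hypothetical, and "duplicate" here means an identical URL, whereas production engines also detect near-duplicate content.

```python
from collections import OrderedDict
from urllib.parse import urlsplit


def group_hits(hits, per_site=2):
    """Drop repeated URLs and keep at most `per_site` hits per host.

    `hits` is a list of (url, title) pairs already in rank order.
    Returns an OrderedDict mapping host -> list of (url, title),
    with hosts in the order of their best-ranked hit.
    """
    seen_urls = set()
    by_site = OrderedDict()
    for url, title in hits:
        if url in seen_urls:  # exact duplicate: skip it
            continue
        seen_urls.add(url)
        host = urlsplit(url).netloc.lower()
        group = by_site.setdefault(host, [])
        if len(group) < per_site:  # cap the number shown per site
            group.append((url, title))
    return by_site
```

Grouping by host keeps one large site from filling the first page, which matters when most users look only at the first few results.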

Dynamic Snippets Query: Cornell sports LII: Law about...Sports...sports law: an overview. Sports Law encompasses a multitude areas of law brought together in unique ways. Issues ... vocation. Amateur Sports. ...www.law.cornell.edu/topics/sports.html Query: NCAA Tarkanian LII: Law about...Sports... purposes. See NCAA v. Tarkanian, 109 US 454 (1988). State action status may also be a factor in mandatory drug testing rules. On ...www.law.cornell.edu/topics/sports.html
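The two snippets above come from the same page but differ because each is built around the query terms. A minimal sketch of that idea follows, assuming a simple "window of words around each match" rule; the function name `dynamic_snippet` and the parameters `width` and `max_fragments` are hypothetical, and real services use more elaborate selection.

```python
import re


def dynamic_snippet(text, query, width=8, max_fragments=2):
    """Build a query-dependent snippet from windows of words around matches."""
    words = text.split()
    terms = {t.lower() for t in query.split()}
    fragments = []
    last_end = -1  # index of the last word already used in a fragment
    for i, word in enumerate(words):
        token = re.sub(r"\W+", "", word).lower()  # strip punctuation
        if token in terms and i > last_end:
            start = max(last_end + 1, i - width // 2)
            end = min(len(words), i + width // 2 + 1)
            fragments.append(" ".join(words[start:end]))
            last_end = end - 1
            if len(fragments) == max_fragments:
                break
    return " ... ".join(fragments)
```

Calling the function twice on the same page text with different queries returns different fragments, which is the behavior the slide illustrates.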

Contextual information The context in which an item exists may give useful information for searching. Information about a document: • Content (terms, formatting, etc.) • Metadata (externally created following rules) • Context (citations and links, reviews, annotations, etc.) Context has many uses: • Selecting documents to index • Retrieval clues (e.g., anchor text) • Ranking

Context: Anchor Text words words words Cornell University words words words Linking page Linked to page <a href = "http://www.cornell.edu">Cornell University</a> HTML source
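Anchor text is indexed under the page that is linked to, not the page that contains the link. The sketch below shows one way to collect it with Python's standard `html.parser` module; the class name `AnchorTextParser` and the function `anchor_text` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser


class AnchorTextParser(HTMLParser):
    """Collect the text inside each <a href=...> element, keyed by target URL."""

    def __init__(self):
        super().__init__()
        self.anchors = defaultdict(list)
        self._href = None   # target of the link currently open, if any
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = " ".join("".join(self._text).split())  # normalize spaces
            if text:
                self.anchors[self._href].append(text)
            self._href = None


def anchor_text(html):
    """Return {target URL: [anchor texts]} for one linking page."""
    parser = AnchorTextParser()
    parser.feed(html)
    parser.close()
    return dict(parser.anchors)
```

An indexer would merge these lists across all linking pages, so that a query for "Cornell University" can match www.cornell.edu even if the home page itself never uses those words.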

Context: Image Searching <img src="images/Arms.jpg" alt="Photo of William Arms"> HTML source Captions and other adjacent text on the web page From the Information Science web site
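Since the image itself carries no searchable text, the surrounding markup supplies the index terms. A minimal sketch, assuming only the `alt` attribute is used (captions and adjacent text would need further parsing); `ImageTextParser` and `index_images` are hypothetical names.

```python
from html.parser import HTMLParser


class ImageTextParser(HTMLParser):
    """Record (src, alt) for every <img> element on a page."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if attributes.get("src"):
                self.images.append((attributes["src"], attributes.get("alt") or ""))


def index_images(html):
    """Return {term: set of image URLs} built from alt text only."""
    parser = ImageTextParser()
    parser.feed(html)
    parser.close()
    index = {}
    for src, alt in parser.images:
        for term in alt.lower().split():
            index.setdefault(term, set()).add(src)
    return index
```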

Reference Pattern Ranking using Dynamic Document Sets PageRank calculates document ranks for the entire (fixed) set of documents. The calculations are made periodically (e.g., monthly) and the document ranks are the same for all queries. Concept of dynamic document sets. Reference patterns among documents that are related to a specific query convey more information than patterns calculated across entire document collections. With dynamic document sets, reference patterns are calculated for a set of documents that are selected based on each individual query.

Reference Pattern Ranking using Dynamic Document Sets Teoma Dynamic Ranking Algorithm (used in Ask Jeeves) 1. Search using conventional term weighting. Rank the hits using similarity between query and documents. 2. Select the highest ranking hits (e.g., top 5,000 hits). 3. Carry out PageRank or similar algorithm on this set of hits. This creates a set of document ranks that are specific to this query. 4. Display the results ranked in the order of the reference patterns calculated.
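The four steps above can be sketched as follows. This is an illustration of the general idea only, not Teoma's actual algorithm: term weighting is replaced by a crude count of query-term occurrences, the link analysis is a plain PageRank-style power iteration restricted to links within the selected hits, and all function names and the damping factor of 0.85 are assumptions.

```python
from collections import Counter


def text_score(query, text):
    """Stand-in for term weighting: count query-term occurrences."""
    counts = Counter(text.lower().split())
    return sum(counts[t] for t in query.lower().split())


def local_rank(doc_ids, links, damping=0.85, iterations=50):
    """PageRank-style ranks using only links between documents in doc_ids."""
    doc_ids = list(doc_ids)
    members = set(doc_ids)
    out = {d: [t for t in links.get(d, []) if t in members and t != d]
           for d in doc_ids}
    n = len(doc_ids)
    rank = {d: 1.0 / n for d in doc_ids}
    for _ in range(iterations):
        # Rank held by pages with no in-set links is spread evenly.
        dangling = sum(rank[d] for d in doc_ids if not out[d])
        new = {d: (1 - damping) / n + damping * dangling / n for d in doc_ids}
        for d in doc_ids:
            for t in out[d]:
                new[t] += damping * rank[d] / len(out[d])
        rank = new
    return rank


def dynamic_ranking(query, docs, links, top_k=5000):
    """Steps 1-4: text search, select top hits, rank by links, order results."""
    scores = {d: text_score(query, text) for d, text in docs.items()}   # step 1
    hits = [d for d in sorted(scores, key=lambda d: (-scores[d], d))
            if scores[d] > 0]
    selected = hits[:top_k]                                             # step 2
    if not selected:
        return []
    rank = local_rank(selected, links)                                  # step 3
    return sorted(selected, key=lambda d: (-rank[d], d))                # step 4
```

Because the link graph is rebuilt from the hits for each query, a page that is heavily linked from unrelated pages gains nothing; only references from other documents matching the query count.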

Scalability [Figure: The growth of the web, 1994 to 2000. Vertical axis: logarithmic scale from 1 to 10,000,000,000.]

Scalability Web search services are centralized systems • Over the past 9 years, Moore's Law has enabled the services to keep pace with the growth of the web and the number of users, while adding extra function. • Will this continue? • Possible areas for concern are: staff costs, telecommunications costs, disk access rates.

Growth of Web Searching In November 1997: • AltaVista was handling 20 million searches/day. • Google forecast for 2000 was 100s of millions of searches/day. In 2004, Google reports 250 million web searches/day, and estimates that the total number over all engines is 500 million searches/day. Moore's Law and web searching In 7 years, Moore's Law predicts computer power will increase by a factor of at least 2^4 = 16. It appears that computing power is growing at least as fast as web searching.
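The comparison on this slide can be checked with a short calculation. The doubling periods used below (21 and 18 months) are assumptions made for illustration: four doublings in 7 years corresponds to a doubling period of 21 months, and 18 months is a commonly quoted figure. Comparing AltaVista's 1997 load with the 2004 estimate for all engines is also only indicative, since the slide gives no 1997 total across engines.

```python
def moore_factor(years, doubling_months):
    """Growth in computing power after `years`, doubling every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


def growth_factor(start, end):
    """Ratio of two search volumes."""
    return end / start


conservative = moore_factor(7, 21)     # four doublings: the slide's lower bound
faster = moore_factor(7, 18)           # assumed 18-month doubling period
searches = growth_factor(20e6, 500e6)  # 20 million/day to 500 million/day

print(f"Moore's Law, 21-month doubling: x{conservative:.1f}")
print(f"Moore's Law, 18-month doubling: x{faster:.1f}")
print(f"Search volume, 1997 to 2004:    x{searches:.1f}")
```

Under the 18-month assumption the two growth factors are close to each other; under the lower bound of 16, search volume grows somewhat faster than computing power.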

Growth of Google In 2000: 85 people, 50% technical, 14 Ph.D. in Computer Science. In 2000: Equipment: 2,500 Linux machines, 80 terabytes of spinning disks, 30 new machines installed daily. Reported by Larry Page, Google, March 2000. At that time, Google was handling 5.5 million searches per day; the increase rate was 20% per month. By fall 2002, Google had grown to over 400 people. In 2004, Google plans to hire 1,000 new people.

Scalability: Performance Very large numbers of commodity computers Algorithms and data structures scale linearly • Storage • Scale with the size of the Web • Compression/decompression • System • Crawling, indexing, sorting simultaneously • Searching • Bounded by disk I/O

Software and Hardware Replication [Diagram: a search service backed by many replicated advertisement servers, index servers, spell checking servers, and document servers.]

Scalability: Numbers of Computers Very rough calculation: In March 2000, 5.5 million searches per day required 2,500 computers. In fall 2004, computers are about 8 times more powerful. Estimated number of computers for 250 million searches per day: (250/5.5) x 2,500/8 = about 15,000. Some industry estimates suggest that Google may have as many as 100,000 computers.
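The rough calculation can be reproduced directly. The function name `estimated_computers` is hypothetical, and the model assumes, as the slide does, that the number of machines scales linearly with query load and inversely with per-machine power.

```python
def estimated_computers(searches_per_day, base_searches_per_day,
                        base_computers, speedup):
    """Scale a known installation linearly with load, inversely with speed."""
    load_ratio = searches_per_day / base_searches_per_day
    return load_ratio * base_computers / speedup


estimate = estimated_computers(
    searches_per_day=250e6,       # fall 2004
    base_searches_per_day=5.5e6,  # March 2000
    base_computers=2500,          # March 2000
    speedup=8,                    # 2004 machines vs. 2000 machines
)
print(f"Estimated computers: {estimate:,.0f}")
```

The unrounded result is a little over 14,000, which the slide rounds to about 15,000. Either figure is far below the industry estimates of up to 100,000 machines, which suggests that the linear model leaves out factors such as index growth, added services, and redundancy.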

Scalability: Staff Programming: Have very well trained staff. Isolate complex code. Most coding is single image. System maintenance: Organize for minimal staff (e.g., automated log analysis, do not fix broken computers). Customer service: Automate everything possible, but complaints, large collections, etc. require staff.

Evaluation: Web Searching Test corpus must be dynamic: the web is dynamic (10%-20% of URLs change every month); spam methods change continually. Queries are time sensitive: topics are hot and then not; need to have a sample of real queries. Languages: at least 90 different languages, reflected in cultural and technical differences. Amit Singhal, Google, 2004

Other Uses of Web Crawling and Associated Technology The technology developed for web search services has many other applications. Conversely, technology developed for other Internet applications can be applied in web searching. • Related objects (e.g., Amazon's "Other people bought the following"). • Recommender and reputation systems (e.g., Epinions' reputation system).

Google API

Selective searching

Google News
