
Search Engines


Presentation Transcript


  1. Search Engines CS 186 Guest Lecture Prof. Marti Hearst SIMS

  2. Web Search Questions • How do search engines differ from DBMSs? • What do people search for? • How do search engines work? • Interfaces • Ranking • Architecture

  3. Web Search vs DBMS?

  4. A Comparison: Web Search vs. DBMS • Imprecise vs. precise • Ranked results vs. usually unordered • “Satisficing” results vs. complete results • Unedited content vs. controlled content • Keyword queries vs. SQL • Mainly read-only vs. reads and writes • Inverted index vs. B-trees

  5. What Do People Search for on the Web?

  6. What Do People Search for on the Web? • Genealogy/Public Figure: 12% • Computer related: 12% • Business: 12% • Entertainment: 8% • Medical: 8% • Politics & Government: 7% • News: 7% • Hobbies: 6% • General info/surfing: 6% • Science: 6% • Travel: 5% • Arts/education/shopping/images: 14% Something is missing… Study by Spink et al., Oct 98. Survey on Excite, 13 questions; data for 316 surveys. www.shef.ac.uk/~is/publications/infres/paper53.html

  7. What Do People Search for on the Web? • 50,000 queries from Excite, 1997 • Most frequent terms: 4660 sex, 3129 yahoo, 2191 internal site admin check from kho, 1520 chat, 1498 porn, 1315 horoscopes, 1284 pokemon, 1283 SiteScope test, 1223 hotmail, 1163 games, 1151 mp3, 1140 weather, 1127 www.yahoo.com, 1110 maps, 1036 yahoo.com, 983 ebay, 980 recipes

  8. Why do these differ? • Self-reporting survey • The nature of language • Only a few ways to say certain things • Many different ways to express most concepts • UFO, Flying Saucer, Space Ship, Satellite • How many ways are there to talk about history?

  9. Intranet Queries (Aug 2000): 3351 bearfacts, 3349 telebears, 1909 extension, 1874 schedule+of+classes, 1780 bearlink, 1737 bear+facts, 1468 decal, 1443 infobears, 1227 calendar, 989 career+center, 974 campus+map, 920 academic+calendar, 840 map, 773 bookstore, 741 class+pass, 738 housing, 721 tele-bears, 716 directory, 667 schedule, 627 recipes, 602 transcripts, 582 tuition, 577 seti, 563 registrar, 550 info+bears, 543 class+schedule, 470 financial+aid

  10. Intranet Queries • Summary of sample data from 3 weeks of UCB queries • 13.2% Telebears/BearFacts/InfoBears/BearLink (12297) • 6.7% Schedule of classes or final exams (6222) • 5.4% Summer Session (5041) • 3.2% Extension (2932) • 3.1% Academic Calendar (2846) • 2.4% Directories (2202) • 1.7% Career Center (1588) • 1.7% Housing (1583) • 1.5% Map (1393) • Average query length over last 4 months: 1.8 words • This suggests what is difficult to find from the home page

  11. Different kinds of users; different kinds of data • Legal and news collections: • professional searchers • paying (by the query or by the minute) • Online bibliographic catalogs (melvyl) • scholars searching scholarly literature • Web • Every type of person with every type of goal • No “driving school” for searching

  12. Different kinds of information needs; different kinds of queries • Example: Search on “Mazda” • What does this mean on the web? • What does this mean on a news collection? • Example: “Mazda transmissions” • Example: “Manufacture of Mazda transmissions in the post-cold war world”

  13. Web Queries • Web queries are SHORT • ~2.4 words on average (Aug 2000) • Has increased, was 1.7 (~1997) • User Expectations • Many say “the first item shown should be what I want to see”! • This works if the user has the most popular/common notion in mind

  14. Recent statistics from Inktomi, August 2000, for one client, one week • Total # queries: 1315040 • Number of repeated queries: 771085 • Number of queries with repeated words: 12301 • Average words/query: 2.39 • Query type: all words 0.3036; any words 0.6886; some words 0.0078 • Boolean: 0.0015 (0.9777 AND / 0.0252 OR / 0.0054 NOT) • Phrase searches: 0.198 • URL searches: 0.066 • URL searches w/ http: 0.000 • Email searches: 0.001 • Wildcards: 0.0011 (0.7042 '?'s) • Fraction '?' at end of query: 0.6753 • Interrogatives when '?' at end: 0.8456

  15. How to Optimize for Short Queries? • Find good starting places • User still has to search at the site itself • Dialogues • Build upon a series of short queries • Not well understood how to do this for the general case • Question Answering • AskJeeves – hand edited • Automated approaches are under development, but are either very simple or domain-specific

  16. How to Find Good Starting Points? • Manually compiled lists • Directories • e.g., Yahoo, Looksmart, Open Directory • Page “popularity” • Frequently visited pages (in general) • Frequently visited pages as a result of a query • Link “co-citation” • Which sites are linked to by other sites? • Number of pages in the site • Not currently used (as far as I know)

  17. Directories vs. Search Engines: An IMPORTANT Distinction • Directories: hand-selected sites; search over the contents of the descriptions of the pages; organized in advance into categories • Search Engines: all pages in all sites; search over the contents of the pages themselves; organized after the query by relevance rankings or other scores

  18. Link Analysis for Starting Points • Assumptions: • If the pages pointing to this page are good, then this is also a good page. • The words on the links pointing to this page are useful indicators of what this page is about. • References: Page et al. 98, Kleinberg 98
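
To make the first assumption concrete, here is a minimal sketch of a link-analysis score computed by power iteration over a made-up four-page graph; the graph, damping factor, and iteration count are illustrative assumptions, not the exact algorithms of the cited Page et al. or Kleinberg papers.

```python
# Simplified PageRank-style scoring by power iteration.
# The link graph below is hypothetical.
links = {              # page -> pages it links to
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
score = {p: 1.0 / len(pages) for p in pages}
damping = 0.85

for _ in range(50):    # iterate until scores settle
    new = {p: (1 - damping) / len(pages) for p in pages}
    for p, outs in links.items():
        for q in outs:
            new[q] += damping * score[p] / len(outs)
    score = new

print(sorted(score.items(), key=lambda kv: -kv[1]))
# Pages pointed to by many well-scored pages (here "C") rank highest.
```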

  19. Co-Citation Analysis • Has been around since the 50’s. (Small, Garfield, White & McCain) • Used to identify core sets of • authors, journals, articles for particular fields • Not for general search • Main Idea: • Find pairs of papers that cite third papers • Look for commonalities • A nice demonstration by Eugene Garfield at: • http://165.123.33.33/eugene_garfield/papers/mapsciworld.html

  20. Link Analysis for Starting Points • Why does this work? • The official Toyota site will be linked to by lots of other official (or high-quality) sites • The best Toyota fan-club site probably also has many links pointing to it • Less high-quality sites do not have as many high-quality sites linking to them

  21. Co-citation analysis (From Garfield 98)

  22. Link Analysis for Starting Points • Does this really work? • Actually, there have been no rigorous evaluations • Seems to work for the primary sites; not clear if it works for the relevant secondary sites • One (small) study suggests that sites with many pages are often the same as those with good link co-citation scores. (Terveen & Hill, SIGIR 2000)

  23. What is Really Being Used? • Today’s search engines combine these methods in various ways • Integration of Directories • Today most web search engines integrate categories into the results listings • Lycos, MSN, Google • Link analysis • Google uses it; others are using it or will soon • Words on the links seem to be especially useful • Page popularity • Many use DirectHit’s popularity rankings

  24. Ranking Algorithms

  25. The problem of ranking Query: cat dog fish orangutang Cat cat cat Dog dog dog Fish fish fish Cat cat cat Cat cat cat Cat cat cat Orangutang Fish Which is the best match?

  26. Assigning Weights to Terms • Binary Weights • Raw term frequency • tf x idf • Recall the Zipf distribution • Want to weight terms highly if they are • frequent in relevant documents … BUT • infrequent in the collection as a whole • Automatically derived thesaurus terms

  27. Binary Weights • Only the presence (1) or absence (0) of a term is included in the vector

  28. Raw Term Weights • The frequency of occurrence for the term in each document is included in the vector
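
A small sketch of the two weighting schemes above on a hypothetical two-document collection: binary weights record only presence or absence of a term, raw weights record how often it occurs.

```python
from collections import Counter

docs = {
    "d1": "cat cat cat dog fish",           # hypothetical toy documents
    "d2": "dog dog fish fish fish fish",
}
vocab = sorted({t for text in docs.values() for t in text.split()})

for doc_id, text in docs.items():
    counts = Counter(text.split())
    raw = [counts[t] for t in vocab]                  # raw term frequencies
    binary = [1 if counts[t] else 0 for t in vocab]   # presence/absence
    print(doc_id, vocab, raw, binary)
# d1: cat=3, dog=1, fish=1 -> raw [3, 1, 1], binary [1, 1, 1]
```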

  29. Assigning Weights • Goal: give more weight to terms that are • Common in THIS document • Uncommon in the collection as a whole • The tf x idf measure: • term frequency (tf) • inverse document frequency (idf)
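
One way to compute such a weight, as a rough sketch (the documents below are made up, and the exact tf×idf variant used on the original slides is not shown in this transcript):

```python
import math
from collections import Counter

docs = {                                   # hypothetical collection
    "d1": "nova galaxy heat nova nova",
    "d2": "hollywood film role diet",
    "d3": "nova film heat heat",
}
N = len(docs)
df = Counter()                             # document frequency per term
for text in docs.values():
    df.update(set(text.split()))

def tfidf(doc_id, term):
    tf = docs[doc_id].split().count(term)  # term frequency in this document
    idf = math.log(N / df[term])           # inverse document frequency
    return tf * idf

print(tfidf("d1", "nova"))    # frequent in d1, but appears in 2 of 3 docs
print(tfidf("d1", "galaxy"))  # rare in the collection -> higher idf
```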

  30. Document Vectors • Documents are represented as “bags of words” • Represented as vectors when used computationally • A vector is like an array of floating-point numbers • Each vector holds a place for every term in the collection • Therefore, most vectors are sparse

  31. Document Vectors: One location for each word. [Term-frequency table for documents A–I over the terms nova, galaxy, heat, h’wood, film, role, diet, fur; blank cells mean 0 occurrences.] “Nova” occurs 10 times in text A, “Galaxy” occurs 5 times in text A, “Heat” occurs 3 times in text A.

  32. Document Vectors: One location for each word. [Same term-frequency table as the previous slide.] “Hollywood” occurs 7 times in text I, “Film” occurs 5 times in text I, “Diet” occurs 1 time in text I, “Fur” occurs 3 times in text I.

  33. Document Vectors [Same table, with columns labeled by document IDs A–I and rows labeled by terms.]

  34. Vector Space Model • Documents are represented as vectors in term space • Terms are usually stems • Documents represented by binary vectors of terms • Queries represented the same as documents • Query and Document weights are based on length and direction of their vector • A vector distance measure between the query and documents is used to rank retrieved documents
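
A minimal sketch of vector-space ranking using cosine similarity over raw term counts, applied to the cat/dog/fish/orangutang example from slide 25 (toy data; real engines use weighted vectors and index-based evaluation rather than scoring every document):

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine of the angle between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

docs = {
    "d1": Counter("cat cat cat dog dog dog fish fish fish".split()),
    "d2": Counter("cat cat cat cat cat cat cat cat cat".split()),
    "d3": Counter("orangutang fish".split()),
}
query = Counter("cat dog fish orangutang".split())

for doc_id, vec in sorted(docs.items(),
                          key=lambda kv: cosine(query, kv[1]),
                          reverse=True):
    print(doc_id, round(cosine(query, vec), 3))
# d1 matches three of the four query terms and ranks first.
```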

  35. Documents in 3D Space Assumption: Documents that are “close together” in space are similar in meaning.

  36. tf x idf
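
The formula on this slide did not survive extraction. A common form of the weight, consistent with the description on slide 29 (an assumed variant, not necessarily the one on the original slide), is:

```latex
% tf x idf weight of term j in document i (one common variant; assumed)
% tf_{ij} = frequency of term j in document i
% N       = number of documents in the collection
% n_j     = number of documents containing term j
w_{ij} = \mathrm{tf}_{ij} \times \log\!\left(\frac{N}{n_j}\right)
```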

  37. Computing Similarity Scores [Figure: a 2D plot with axes scaled from 0.2 to 1.0; the original diagram did not survive extraction.]

  38. The results of ranking Query: cat dog fish orangutang Cat cat cat Dog dog dog Fish fish fish Cat cat cat Cat cat cat Cat cat cat Orangutang Fish What does vector space ranking do?

  39. High-Precision Ranking Proximity search can help get high-precision results if the query has more than one term • Hearst ’96 paper: • Combine Boolean and passage-level proximity • Shows significant improvements when retrieving the top 5, 10, 20, 30 documents • Results reproduced by Mitra et al. 98 • Google uses something similar
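
As a rough illustration of passage-level proximity (not the exact method of the Hearst ’96 paper), positional postings make it possible to check whether the query terms occur near each other in a document:

```python
# Hypothetical positional postings: term -> {doc_id: [word positions]}
postings = {
    "mazda":         {"d1": [3, 40], "d2": [7]},
    "transmissions": {"d1": [5],     "d2": [250]},
}

def within_window(terms, doc_id, window=20):
    """True if every term occurs at least once inside some `window`-word
    span of the document (a crude passage-level proximity test)."""
    position_lists = [postings[t].get(doc_id, []) for t in terms]
    if any(not p for p in position_lists):
        return False
    return any(
        all(any(abs(p - anchor) < window for p in plist)
            for plist in position_lists)
        for anchor in position_lists[0]
    )

print(within_window(["mazda", "transmissions"], "d1"))  # True: positions 3 and 5
print(within_window(["mazda", "transmissions"], "d2"))  # False: 7 vs 250
```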

  40. What is Really Being Used? • Lots of variation here • Pretty messy in many cases • Details usually proprietary and fluctuating • Combining subsets of: • Term frequencies • Term proximities • Term position (title, top of page, etc) • Term characteristics (boldface, capitalized, etc) • Link analysis information • Category information • Popularity information

  41. Web Spam • Email Spam: • Undesired content • Web Spam: • Content disguised as something it is not, so that it will: • be retrieved more often than it otherwise would • be retrieved in contexts in which it otherwise would not be retrieved

  42. Web Spam • What are the types of Web spam? • Add extra terms to get a higher ranking • Repeat “cars” thousands of times • Add irrelevant terms to get more hits • Put a dictionary in the comments field • Put extra terms in the same color as the background of the web page • Add irrelevant terms to get different types of hits • Put “sex” in the title field in sites that are selling cars • Add irrelevant links to boost your link analysis ranking • There is a constant “arms race” between web search companies and spammers

  43. Inverted Index • This is the primary data structure for text indexes • Main Idea: • Invert documents into a big index • Basic steps: • Make a “dictionary” of all the tokens in the collection • For each token, list all the docs it occurs in. • Do a few things to reduce redundancy in the data structure
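
A minimal sketch of those basic steps (crude tokenizer, no stemming or stopword handling; the two documents are the ones used on slide 46):

```python
import re
from collections import defaultdict, Counter

docs = {
    "d1": "Now is the time for all good men to come to the aid of their country",
    "d2": "It was a dark and stormy night in the country manor. "
          "The time was past midnight",
}

# Dictionary of tokens: term -> {doc_id: within-document frequency}
index = defaultdict(dict)
for doc_id, text in docs.items():
    tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenizer
    for term, freq in Counter(tokens).items():     # merge duplicate entries
        index[term][doc_id] = freq

print(index["country"])   # {'d1': 1, 'd2': 1}
print(index["manor"])     # {'d2': 1}
print(index["time"])      # {'d1': 1, 'd2': 1}
```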

  44. Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries • Also used for statistical ranking algorithms

  45. Inverted Indexes An Inverted File is a vector file “inverted” so that rows become columns and columns become rows

  46. How Are Inverted Files Created • Documents are parsed to extract tokens. These are saved with the Document ID. • Doc 1: “Now is the time for all good men to come to the aid of their country” • Doc 2: “It was a dark and stormy night in the country manor. The time was past midnight”

  47. How Inverted Files are Created After all documents have been parsed, the inverted file is sorted alphabetically.

  48. How Inverted Files are Created • Multiple term entries for a single document are merged. • Within-document term frequency information is compiled.

  49. How Inverted Files are Created [Figure: the final inverted file split into a Dictionary and Postings.]

  50. Inverted indexes • Permit fast search for individual terms • For each term, you get a list consisting of: • document ID • frequency of term in doc (optional) • position of term in doc (optional) • These lists can be used to solve Boolean queries: • country -> d1, d2 • manor -> d2 • country AND manor -> d2 • Also used for statistical ranking algorithms
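
The AND query above can be answered by intersecting the two postings lists; a sketch with sorted doc-ID lists (hypothetical code, matching the slide’s country/manor example):

```python
def intersect(p1, p2):
    """Merge-intersect two sorted postings lists of doc IDs."""
    i = j = 0
    result = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            result.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return result

country = ["d1", "d2"]            # postings from the slide's example
manor = ["d2"]
print(intersect(country, manor))  # ['d2']  -> country AND manor
```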
