Maximizing Search Engines: Practices and Success Measures

Rensselaer Polytechnic Institute • CSCI-4220 – Network Programming • David Goldschmidt, Ph.D. • What are we searching for? {week 9} • From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

What is search? • What is search? • What are we searching for? • How many searches are processed per day? • What is the average number of words in text-based searches?

Finding things • Applications and varieties of search: • Web search • Site search • Vertical search • Enterprise search • Desktop search • As-you-type search • Proximity search

Acquisition and indexing

User interaction and querying

Measures of success (i) • Relevance • Search results contain information the searcher was looking for • Problems with vocabulary mismatch • Homonyms (e.g., “Jersey shore”) • User relevance • Search results relevant to one user may be completely irrelevant to another user

Measures of success (ii) http://trec.nist.gov • Precision • Proportion of retrieved documents that are relevant • How precise were the results? • Recall (and coverage) • Proportion of relevant documents that were actually retrieved • Did we retrieve all of the relevant documents?
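The two proportions above can be computed directly from sets of document identifiers. The following is a minimal sketch; the function name `precision_recall` and the document IDs are made up for illustration, and relevance judgments are assumed to be given.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved: document IDs returned by the search engine
    relevant:  document IDs judged relevant for the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved

    # Precision: proportion of retrieved documents that are relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: proportion of relevant documents that were retrieved.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 5 documents judged relevant.
retrieved_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = ["d2", "d4", "d5", "d6", "d7"]
p, r = precision_recall(retrieved_docs, relevant_docs)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved documents are relevant, and two of the five relevant documents were retrieved, so the two measures differ even though they share the same numerator.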

Measures of success (iii) • Timeliness and freshness • Search results contain information that is current and up-to-date • Performance • Users expect subsecond response times • Media • User devices are constantly changing (cellphones, mobile devices, tablets, etc.)

Measures of success (iv) • Scalability • Designs that perform equally well as the system grows and expands • Increased number of documents, number of users, etc. • Flexibility (or adaptability) • Tune search engine components to keep up with the changing landscape • Spam-resistance

Information retrieval (IR) • Gerard Salton (1927-1995) • Pioneer in information retrieval • Defined information retrieval as: • “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information” • This was 1968 (before the Internet and Web!)

(Un)structured information • Structured information: • Often stored in a database • Organized via predefined tables, columns, etc. • Example query: select all accounts with balances less than $200 • Unstructured information: • Document text (headings, words, phrases) • Images, audio, video (often relies on textual tags)
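The structured query on this slide maps directly onto a database lookup. The following is a small sketch using Python's built-in `sqlite3` module; the `accounts` table, its columns, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical accounts table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 150.0), ("bob", 950.0), ("carol", 42.5)],
)

# Select all accounts with balances less than $200.
low_balance = conn.execute(
    "SELECT owner, balance FROM accounts WHERE balance < ? ORDER BY owner",
    (200,),
).fetchall()
print(low_balance)
conn.close()
```

Because the data has a predefined schema, the query has an exact answer. Unstructured text has no such schema, which is why search relies on the statistical matching described on the following slides.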

Processing text • Search and IR have largely focused on text processing and documents • Search typically uses the statistical properties of text • Word counts • Word frequencies • But typically ignores linguistic features (noun, verb, etc.)
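As a rough illustration of these statistical properties, the sketch below computes word counts and relative frequencies for a short string. The tokenizer (lowercasing and splitting on non-alphanumeric characters) is a simplifying assumption, and `word_statistics` is a name chosen for this example.

```python
import re
from collections import Counter


def word_statistics(text):
    """Return (counts, frequencies) for the words in text."""
    # Naive tokenization: lowercase, keep runs of letters, digits, apostrophes.
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    total = len(words)
    # Relative frequency: occurrences of a word divided by total words.
    frequencies = {w: c / total for w, c in counts.items()} if total else {}
    return counts, frequencies


sample = "Search engines index text. Text search uses word counts."
counts, frequencies = word_statistics(sample)
print(counts.most_common(2))
print(round(frequencies["text"], 3))
```

No part-of-speech information is used here: "search" counts the same whether it appears as a noun or a verb.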

Politeness and robots.txt • Web crawlers adhere to a politeness policy: • GET requests sent every few seconds or minutes • A robots.txt file specifies what crawlers are allowed to crawl
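A crawler can check these rules before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on an inline robots.txt; the file contents, the `ExampleBot` user agent, and the example.com URLs are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: everything is allowed except /private/,
# and crawlers are asked to wait 5 seconds between requests.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("ExampleBot", "https://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "https://example.com/private/data.html")
delay = parser.crawl_delay("ExampleBot")

print(allowed, blocked, delay)
```

A polite crawler would skip the disallowed URL and sleep for at least the stated delay between successive requests to the same host.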

Sitemaps • Default priority is 0.5 • Some URLs might not be discovered by the crawler
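The sketch below shows how a crawler might read a sitemap and apply the 0.5 default when a URL has no priority element. The sitemap contents and the `parse_sitemap` helper are made up for illustration; only the standard sitemap namespace is taken from the sitemap protocol.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5  # used when a <url> entry has no <priority>

# Hypothetical sitemap; the second URL might never be found by following links.
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>0.9</priority>
  </url>
  <url>
    <loc>https://example.com/archive/old-page.html</loc>
  </url>
</urlset>
"""


def parse_sitemap(xml_text):
    """Return (url, priority) pairs sorted by descending priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        loc = url.findtext("sm:loc", namespaces=SITEMAP_NS)
        priority_text = url.findtext("sm:priority", namespaces=SITEMAP_NS)
        priority = float(priority_text) if priority_text else DEFAULT_PRIORITY
        entries.append((loc.strip(), priority))
    return sorted(entries, key=lambda entry: entry[1], reverse=True)


entries = parse_sitemap(sitemap_xml)
print(entries)
```

Listing a URL in the sitemap gives the crawler a way to reach pages that no crawled page links to.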

A day in the life of a crawler • What about checking for updated pages?

Freshness vs. age • Freshness is essentially a Boolean value (the crawled copy is either current or it is not) • Age measures the degree to which a crawled page is out of date
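The difference between the two measures can be sketched in a few lines. The definitions below are assumptions based on a common formulation (a page is fresh if it has not changed since the last crawl; otherwise its age is the time elapsed since it changed), and timestamps are plain numbers in arbitrary units, such as hours.

```python
def is_fresh(last_crawled, last_modified):
    """Boolean freshness: the crawled copy still matches the live page."""
    return last_modified <= last_crawled


def age(now, last_crawled, last_modified):
    """Age of the crawled copy: 0 while fresh, else time since the page changed."""
    if is_fresh(last_crawled, last_modified):
        return 0
    return now - last_modified


# Hypothetical timeline: crawled at t=10, page changed at t=15, now t=20.
print(is_fresh(last_crawled=10, last_modified=15))
print(age(now=20, last_crawled=10, last_modified=15))
```

Freshness flips to false at the moment the page changes and carries no further information, whereas age keeps growing until the page is crawled again.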
