
What are we searching for? {week 9}

Rensselaer Polytechnic Institute, CSCI-4220 – Network Programming, David Goldschmidt, Ph.D. From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0.


Presentation Transcript


  1. Rensselaer Polytechnic Institute CSCI-4220 – Network Programming David Goldschmidt, Ph.D. What are we searching for? {week 9} From Search Engines: Information Retrieval in Practice, 1st edition, by Croft, Metzler, and Strohman, Pearson, 2010, ISBN 0-13-607224-0

  2. What is search? • What is search? • What are we searching for? • How many searches are processed per day? • What is the average number of words in text-based searches?

  3. Finding things • Applications and varieties of search: • Web search • Site search • Vertical search • Enterprise search • Desktop search • As-you-type search • Proximity search

  4. Acquisition and indexing

  5. User interaction and querying

  6. Measures of success (i) • Relevance • Search results contain information the searcher was looking for • Problems with vocabulary mismatch • Homonyms (e.g. “Jersey shore”: the coastline, or the TV show featuring Snooki) • User relevance • Search results relevant to one user may be completely irrelevant to another user

  7. Measures of success (ii) • Precision • Proportion of retrieved documents that are relevant • How precise were the results? • Recall (and coverage) • Proportion of relevant documents that were actually retrieved • Did we retrieve all of the relevant documents? • See http://trec.nist.gov
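
The two proportions on this slide can be computed directly from a set of retrieved documents and a set of relevant documents. The sketch below is a minimal illustration; the function name `precision_recall` and the document IDs are made up for the example and do not come from the slides.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query.

    precision = |retrieved AND relevant| / |retrieved|
    recall    = |retrieved AND relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgments: 4 documents retrieved, 3 of them relevant,
# and 6 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d4", "d7", "d8", "d9"]

p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50
```

Here the results are fairly precise (3 of 4 retrieved documents are relevant), but only half of the relevant documents were found.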

  8. Measures of success (iii) • Timeliness and freshness • Search results contain information that is current and up-to-date • Performance • Users expect subsecond response times • Media • User devices are constantly changing (cell phones, mobile devices, tablets, etc.)

  9. Measures of success (iv) • Scalability • Designs that perform equally well as the system grows and expands • Increased number of documents, number of users, etc. • Flexibility (or adaptability) • Tune search engine components to keep up with a changing landscape • Spam-resistance

  10. Information retrieval (IR) • Gerard Salton (1927-1995) • Pioneer in information retrieval • Defined information retrieval as: • “a field concerned with the structure, analysis, organization, storage, searching, and retrieval of information” • This was 1968 (before the Internet and Web!)

  11. (Un)structured information • Structured information: • Often stored in a database • Organized via predefined tables, columns, etc. • Example query: “Select all accounts with balances less than $200” • Unstructured information: • Document text (headings, words, phrases) • Images, audio, video (often relies on textual tags)
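
As an illustration of a query over structured information, the sketch below runs the account-balance query against an in-memory SQLite database. The `accounts` table, its columns, and the sample rows are assumptions made up for this example.

```python
import sqlite3

# Hypothetical schema and data, used only to illustrate a structured query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
)
conn.executemany(
    "INSERT INTO accounts (owner, balance) VALUES (?, ?)",
    [("alice", 150.0), ("bob", 2500.0), ("carol", 199.99)],
)

# "Select all accounts with balances less than $200"
low = conn.execute(
    "SELECT owner, balance FROM accounts WHERE balance < ? ORDER BY owner",
    (200,),
).fetchall()
print(low)  # [('alice', 150.0), ('carol', 199.99)]
conn.close()
```

The predefined columns make the condition exact; there is no comparable way to ask the same question of free text without first extracting structure from it.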

  12. Processing text • Search and IR have largely focused on text processing and documents • Search typically uses the statistical properties of text • Word counts • Word frequencies • But typically ignores linguistic features (noun, verb, etc.)
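
A minimal sketch of the statistics mentioned on this slide: it counts words and converts the counts to relative frequencies, with no linguistic analysis. The tokenizer (lowercase, runs of letters and digits) and the function name `word_frequencies` are simplifying assumptions for the example.

```python
import re
from collections import Counter


def word_frequencies(text):
    """Return (counts, freqs): raw word counts and relative frequencies."""
    # Assumed tokenizer: lowercase the text and keep runs of letters/digits.
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    total = len(words)
    freqs = {word: count / total for word, count in counts.items()}
    return counts, freqs


text = "To be, or not to be: that is the question."
counts, freqs = word_frequencies(text)

print(counts["to"], counts["be"])  # 2 2
print(freqs["question"])           # 0.1
```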

  13. Politeness and robots.txt • Web crawlers adhere to a politeness policy: • GET requests sent every few seconds or minutes • A robots.txt file specifies what crawlers are allowed to crawl
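
The slide's own robots.txt example did not survive in this transcript. As a stand-in, the sketch below checks URLs against a small made-up robots.txt using Python's standard `urllib.robotparser`. The file contents, the `ExampleBot` user agent, and the example.com URLs are all assumptions; `Crawl-delay` is a widely used but nonstandard directive.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, not taken from the slides.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# A polite crawler checks each URL before sending a GET request.
allowed = parser.can_fetch("ExampleBot", "http://www.example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html")

# Seconds to wait between requests, or None if the file does not say.
delay = parser.crawl_delay("ExampleBot")

print(allowed, blocked, delay)  # True False 5
```

A crawler would typically fetch robots.txt once per host, cache the parsed rules, and sleep for at least the requested delay between requests to that host.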

  14. Sitemaps • Default priority is 0.5 • Some URLs might not be discovered by the crawler
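
The sitemap shown on this slide is missing from the transcript. The sketch below parses a small made-up sitemap in the sitemaps.org XML format and falls back to the default priority of 0.5 when a URL entry has no `<priority>` element. The URLs and the helper name `read_sitemap` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sitemap; the second entry omits <priority>.
SITEMAP_XML = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://www.example.com/</loc>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>http://www.example.com/about.html</loc>
  </url>
</urlset>
"""

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
DEFAULT_PRIORITY = 0.5


def read_sitemap(xml_text):
    """Return a list of (url, priority) pairs from a sitemap document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        priority = url.findtext("sm:priority", default=None, namespaces=NS)
        entries.append((loc, float(priority) if priority else DEFAULT_PRIORITY))
    return entries


entries = read_sitemap(SITEMAP_XML)
print(entries)
```

Listing a URL here gives the crawler a way to find pages that no link points to, which is one reason some URLs would otherwise go undiscovered.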

  15. A day in the life of a crawler • What about checking for updated pages?

  16. Freshness vs. age • Freshness is essentially a Boolean value (a crawled copy is either fresh or stale) • Age measures the degree to which a crawled page is out of date
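
A minimal sketch of the distinction, under one common formulation that is assumed here rather than stated on the slide: a copy is fresh if the page has not changed since it was crawled, and the age of a stale copy is the time elapsed since the page changed. Times are plain numbers (for example, hours) and the function names are hypothetical.

```python
def is_fresh(crawled_at, last_modified):
    """Boolean freshness: True if the page has not changed since the crawl."""
    return last_modified <= crawled_at


def age(now, crawled_at, last_modified):
    """Age of the crawled copy: 0 while fresh, otherwise time since the change."""
    if is_fresh(crawled_at, last_modified):
        return 0.0
    return float(now - last_modified)


# Hypothetical timeline in hours: crawled at t=10, page changed at t=12.
print(is_fresh(crawled_at=10, last_modified=12))     # False
print(age(now=15, crawled_at=10, last_modified=12))  # 3.0
print(age(now=15, crawled_at=10, last_modified=8))   # 0.0
```

Two stale copies are equally "not fresh", but age keeps growing the longer a changed page goes without being crawled again, so it separates slightly stale copies from badly outdated ones.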
