
Search


Presentation Transcript


  1. Search
  Stephen Robertson, Microsoft Research Cambridge

  2. MSR Cambridge
  • Andrew Herbert, Director
  • Cambridge Laboratory …
  • External Research Office
  • Stephen Emmott

  3. MSR Cambridge
  • Systems & Networking – Peter Key
    • Operating Systems
    • Networking
    • Distributed Computing
  • Machine Learning & Perception – Christopher Bishop
    • Machine Learning
    • Computer Vision
    • Information Retrieval

  4. MSR Cambridge
  • Programming Principles & Tools – Luca Cardelli
    • Programming Principles & Tools
    • Security
  • Computer-Mediated Living – Ken Wood
    • Human Computer Interaction
    • Ubiquitous Computing
    • Sensors and Devices
    • Integrated Systems

  5. Search: a bit of history
  People sometimes assume that G**gle invented search … but of course this is false.
  • Library catalogues
  • Scientific abstracts
  • Printed indexes
  • The 1960s to 80s: Boolean search
  • Free text queries and ranking – a long gestation
  • The web

  6. Web search
  • The technology
    • Crawling
    • Indexing
    • Ranking
    • Efficiency and effectiveness
  • The business
    • Integrity of search
    • UI, speed
  • Ads
    • Ad ranking
    • Payment for clickthrough

  7. Other search environments
  • Within-site
  • Specialist databases
  • Enterprise/intranet
  • Desktop

  8. How search engines work
  • Crawl a lot of documents
  • Create a vast index
    • Every word in every document
    • Point to where it occurred
  • Allow documents to inherit additional text
    • From the URL
    • From anchors in other documents…
    • Index this as well
  • Also gather static information
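
To make the indexing step concrete, here is a minimal sketch (not part of the original slides) of an inverted index in Python: every word in every document is recorded together with where it occurred, which is what lets the engine look up query words directly later on. The tokenisation and data layout are simplifying assumptions.

    from collections import defaultdict

    def build_inverted_index(documents):
        # documents: dict mapping doc_id -> text.  In a real engine the text
        # would also include inherited anchor text and URL words, as the
        # slide notes, with static information gathered separately.
        index = defaultdict(list)  # term -> list of (doc_id, positions)
        for doc_id, text in documents.items():
            positions = defaultdict(list)
            for pos, word in enumerate(text.lower().split()):
                positions[word].append(pos)
            for word, occurrences in positions.items():
                index[word].append((doc_id, occurrences))
        return index

    docs = {
        "d1": "search engines crawl and index documents",
        "d2": "the index points to where each word occurred",
    }
    index = build_inverted_index(docs)
    print(index["index"])   # [('d1', [4]), ('d2', [1])]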

  9. How search engines work
  Given a query:
  • Look up each query word in the index
  • Throw all this information at the ranker
  Ranker: a computing engine which calculates a score for each document and identifies the top n scoring documents. The score depends on a whole variety of features, and may include static information.
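
A minimal sketch of the ranker step just described (again my illustration, not the talk's): look up each query word in the inverted index, score every candidate document, and keep the top n. The score_fn argument is a placeholder for whichever feature-based formula is used; the following slides discuss what goes into it.

    import heapq

    def rank(query, index, score_fn, n=10):
        # Collect every document that matches at least one query word,
        # score each candidate, and return the n highest-scoring ones.
        candidates = set()
        for word in query.lower().split():
            for doc_id, _positions in index.get(word, []):
                candidates.add(doc_id)
        scored = ((score_fn(query, doc_id), doc_id) for doc_id in candidates)
        return heapq.nlargest(n, scored)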

  10. A core challenge: ranking
  • What features might be useful?
    • Features of the query-document pair
    • Features of the document
    • Maybe features of the query
    • Simple / transformed / compound
  • Combining features
    • Formulae
    • Weights and other free parameters
  • Tuning / training / learning
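
As a toy illustration of combining features via a formula with weights as free parameters (the feature names and weight values below are invented for the example, not taken from the talk), the simplest such formula is a weighted linear combination:

    # Hypothetical features of a query-document pair; the weights are the
    # free parameters that tuning / training / learning would adjust.
    FEATURE_WEIGHTS = {
        "text_match": 2.0,    # query-document feature
        "anchor_match": 1.5,  # query-document feature (inherited anchor text)
        "doc_quality": 0.5,   # static, query-independent document feature
    }

    def combined_score(features):
        # The simplest combining formula: a weighted linear sum.
        return sum(FEATURE_WEIGHTS.get(name, 0.0) * value
                   for name, value in features.items())

    print(combined_score({"text_match": 0.8, "anchor_match": 0.2, "doc_quality": 0.9}))  # 2.35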

  11. Ranking algorithms
  • Based on probabilistic models
    • we are trying to predict relevance
  • … plus a little linguistic analysis
    • but this is secondary to the statistics
  • … plus a great deal of know-how, experience, experiment
  • Need:
    • Evidence from all possible sources
    • … combined appropriately
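
The slides do not name a particular formula, but a classic example of a ranking function derived from a probabilistic relevance model is BM25. The sketch below (my addition) shows how it turns term and document statistics into a score; k1 and b are free parameters of the kind the later slides discuss tuning.

    import math

    def bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
        # tf: term frequency in the document, df: documents containing the
        # term, n_docs: collection size.
        # Classic idf; it can go negative for very common terms
        # (some variants add 1 inside the log to avoid this).
        idf = math.log((n_docs - df + 0.5) / (df + 0.5))
        length_norm = k1 * ((1 - b) + b * doc_len / avg_doc_len)
        return idf * tf * (k1 + 1) / (tf + length_norm)

    def bm25_score(term_stats, doc_len, avg_doc_len, n_docs):
        # Document score = sum of per-term weights over the query terms;
        # term_stats is a list of (tf, df) pairs for the query terms.
        return sum(bm25_term_weight(tf, df, doc_len, avg_doc_len, n_docs)
                   for tf, df in term_stats)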

  12. Evaluation
  • User queries
  • Relevance judgements
    • by humans
    • yes-no or multilevel
  • Evaluation measures
    • How to evaluate a ranking?
    • Only the top end matters
    • Various different measures in use
  • Public bake-offs
    • TREC etc.
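
Since "only the top end matters", the measures in use weight the top ranks most heavily. As one concrete illustration (the slide does not single out any measure), precision at k and average precision can be computed from yes-no human judgements like this:

    def precision_at_k(ranking, relevant, k):
        # Fraction of the top-k results judged relevant (yes-no judgements).
        return sum(1 for doc_id in ranking[:k] if doc_id in relevant) / k

    def average_precision(ranking, relevant):
        # Mean of precision@rank over the ranks where a relevant document
        # appears; rewards putting relevant documents near the top.
        hits, total = 0, 0.0
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id in relevant:
                hits += 1
                total += hits / rank
        return total / len(relevant) if relevant else 0.0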

  13. Using evaluation data for training
  • Task: to optimise a set of parameters
    • E.g. weights of features
  • Optimisation is potentially very powerful
    • Can make a huge difference to effectiveness
  • But there are challenges…
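
A minimal sketch of the training loop this slide implies (the helpers run_ranker and evaluate are assumptions, not from the talk): sweep candidate values of one free parameter and keep the value that maximises the mean evaluation score over the judged queries. Real systems tune many parameters jointly with far more sophisticated optimisation, as the next slides note.

    def tune_weight(candidate_weights, queries, run_ranker, evaluate):
        # run_ranker(weight, query) -> ranked doc ids (assumed helper);
        # evaluate(query, ranking) -> a score such as average precision.
        best_weight, best_score = None, float("-inf")
        for w in candidate_weights:
            score = sum(evaluate(q, run_ranker(w, q)) for q in queries) / len(queries)
            if score > best_score:
                best_weight, best_score = w, score
        return best_weight, best_score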

  14. Challenge 1: Optimisation methods
  • Training is something of a black art
    • Not easy to write recipes for
  • Much work currently on optimisation methods
    • Some of it coming from the machine learning community

  15. Challenge 2: a tradeoff
  • Many features require many parameters
    • From a machine learning point of view, the more the better
  • Many parameters mean much training
  • Human relevance judgements are expensive

  16. Challenge 3: How specific?
  • How much does the environment matter?
  • Different features
    • E.g. characteristics of documents, file types, linkage, statistical properties…
  • Different kinds of queries
    • Or different mixes of the same kinds
  • Different factors affecting relevance
  • Access constraints
  • …

  17. Challenge 3: How specific?
  • And if it does matter… how to train for the specific environment?
  • Web search: huge training effort
  • Enterprise: some might be feasible
  • Desktop: unlikely
  • Within-site / specialist databases: some might be feasible

  18. Looking for alternatives
  If training is difficult… some other possibilities:
  • Robustness – parameters with stable optima (probably means fewer features)
  • Training tool-kits (but remember the black art)
  • Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
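
To make the auto-training idea concrete, one purely speculative sketch (the slide itself only calls this a long-term prospect): treat a click on a result as an implicit preference over a higher-ranked result that was skipped, and nudge the feature weights so the clicked document would score higher next time.

    def clickthrough_update(weights, clicked_features, skipped_features, lr=0.01):
        # Purely illustrative: move the weights so the clicked document
        # would outscore the higher-ranked document the user skipped.
        for name in set(clicked_features) | set(skipped_features):
            diff = clicked_features.get(name, 0.0) - skipped_features.get(name, 0.0)
            weights[name] = weights.get(name, 0.0) + lr * diff
        return weights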

  19. A little about Microsoft
  • Web search: MSN takes on Google and Yahoo
    • New search engine is closing the gap
    • Some MSRC input
  • Enterprise search: MS Search and SharePoint
    • New version is on its way
    • Much MSRC input
  • Desktop: also MS Search

  20. Final thoughts
  • Search has come a long way since the library card catalogue
  • … but it is by no means a done deal
  • This is a very active field, both academically and commercially
  I confidently expect that it will change as much in the next 16 years as it has since 1990.
