This article explores the history and development of search technologies, tracing their origins from traditional library catalogs and scientific abstracts to modern web search engines. It discusses the technical components of search, such as crawling, indexing, and ranking algorithms, highlighting the challenges in optimizing search results. Additionally, it examines the role of user queries and relevance judgments in evaluating search effectiveness. As search technology continues to evolve, this piece emphasizes the importance of ongoing research and innovation in the field.
Search
Stephen Robertson
Microsoft Research Cambridge
MSR Cambridge
• Andrew Herbert, Director
• Cambridge Laboratory …
• External Research Office – Stephen Emmott
MSR Cambridge
• Systems & Networking – Peter Key
  • Operating Systems
  • Networking
  • Distributed Computing
• Machine Learning & Perception – Christopher Bishop
  • Machine Learning
  • Computer Vision
  • Information Retrieval
MSR Cambridge
• Programming Principles & Tools – Luca Cardelli
  • Programming Principles & Tools
  • Security
• Computer-Mediated Living – Ken Wood
  • Human Computer Interaction
  • Ubiquitous Computing
  • Sensors and Devices
  • Integrated Systems
Search: a bit of history
People sometimes assume that G**gle invented search … but of course this is false
• Library catalogues
• Scientific abstracts
• Printed indexes
• The 1960s to 80s: Boolean search
• Free text queries and ranking – a long gestation
• The web
Web search
• The technology
  • Crawling
  • Indexing
  • Ranking
  • Efficiency and effectiveness
• The business
  • Integrity of search
  • UI, speed
  • Ads
    • Ad ranking
    • Payment for clickthrough
Other search environments
• Within-site
• Specialist databases
• Enterprise/intranet
• Desktop
How search engines work
• Crawl a lot of documents
• Create a vast index
  • Every word in every document
  • Point to where it occurred
• Allow documents to inherit additional text
  • From the URL
  • From anchors in other documents…
  • Index this as well
• Also gather static information
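To make the indexing step concrete, here is a minimal sketch of an inverted index in Python: every word in every document points back to where it occurred. The tokenisation, the toy documents and the data structures are illustrative assumptions, not a description of any production engine.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the (doc_id, position) pairs where it occurs.

    `documents` is assumed to be a dict of doc_id -> text; a real engine
    would also index inherited text (URL words, anchor text from other
    documents) and gather static per-document information alongside it.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {
    "d1": "search engines crawl and index documents",
    "d2": "ranking decides which documents appear first",
}
index = build_inverted_index(docs)
print(index["documents"])   # [('d1', 5), ('d2', 3)]
```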
How search engines work
Given a query:
• Look up each query word in the index
• Throw all this information at the ranker
Ranker: a computing engine which calculates a score for each document and identifies the top n scoring documents
The score depends on a whole variety of features, and may include static information
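A matching sketch of the query side, again purely illustrative: each query word is looked up in a small hand-built index of the same shape as above, and a toy ranker scores every candidate document and returns the top n. The score here is just a count of matching postings, standing in for the much richer feature-based score a real ranker computes.

```python
from collections import Counter

# A tiny inverted index of the shape sketched above: term -> (doc_id, position) postings.
index = {
    "ranking":   [("d2", 0)],
    "documents": [("d1", 5), ("d2", 3)],
    "search":    [("d1", 0)],
}

def rank(query, index, n=10):
    """Look up each query word in the index and score the candidate documents.

    The score is a simple match count; a real ranker combines many features
    and may fold in static, query-independent information as well.
    """
    scores = Counter()
    for word in query.lower().split():
        for doc_id, _position in index.get(word, []):
            scores[doc_id] += 1
    return scores.most_common(n)   # top-n (doc_id, score) pairs

print(rank("ranking documents", index))   # [('d2', 2), ('d1', 1)]
```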
A core challenge: ranking
• What features might be useful?
  • Features of the query-document pair
  • Features of the document
  • Maybe features of the query
  • Simple / transformed / compound
• Combining features
  • Formulae
  • Weights and other free parameters
  • Tuning / training / learning
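One simple way to combine features, sketched here with made-up feature names and weights: describe each query-document pair by a feature vector and take a weighted combination, the weights being the free parameters that tuning, training or learning must set. Real systems may also transform features and use non-linear combinations.

```python
def combined_score(features, weights):
    """Weighted linear combination of ranking features for one query-document pair."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values and weights.
features = {"text_match": 7.3, "anchor_match": 1.0, "log_inlinks": 2.1}
weights = {"text_match": 1.0, "anchor_match": 0.8, "log_inlinks": 0.3}
print(combined_score(features, weights))   # 8.73
```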
Ranking algorithms
• Based on probabilistic models
  • we are trying to predict relevance
• … plus a little linguistic analysis
  • but this is secondary to the statistics
• … plus a great deal of know-how, experience, experiment
• Need:
  • Evidence from all possible sources
  • … combined appropriately
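The slides do not name a particular formula, but BM25, which grew out of the probabilistic relevance models Robertson worked on, is the standard example of a statistically motivated ranking function. The sketch below uses one common variant of its IDF component; the sample numbers are invented.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; k1 and b are the usual free parameters. Several IDF variants
    exist; this one adds 1 inside the log to keep weights non-negative.
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A document's score is the sum of the weights of the query terms it contains.
print(bm25_term_weight(tf=3, doc_len=120, avg_doc_len=100, df=50, num_docs=10_000))
```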
Evaluation
• User queries
• Relevance judgements
  • by humans
  • yes-no or multilevel
• Evaluation measures
  • How to evaluate a ranking?
  • Only the top end matters
  • Various different measures in use
• Public bake-offs
  • TREC etc.
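As an illustration of the kind of top-heavy measure the slide alludes to, here is a small sketch of precision at k (for yes/no judgements) and discounted cumulative gain (for multilevel judgements); the judgement values are invented.

```python
import math

def precision_at_k(judgements, k):
    """Fraction of the top-k results judged relevant (yes/no judgements)."""
    return sum(1 for j in judgements[:k] if j > 0) / k

def dcg_at_k(judgements, k):
    """Discounted cumulative gain: multilevel judgements discounted by rank,
    so only the top end of the ranking contributes much."""
    return sum(j / math.log2(rank + 2) for rank, j in enumerate(judgements[:k]))

# Judgements of the top 5 documents returned for one query (0 = not relevant).
judgements = [2, 0, 1, 0, 2]
print(precision_at_k(judgements, 5), dcg_at_k(judgements, 5))
```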
Using evaluation data for training
• Task: to optimise a set of parameters
  • E.g. weights of features
• Optimisation is potentially very powerful
  • Can make a huge difference to effectiveness
• But there are challenges…
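A toy illustration of what training on evaluation data can mean: sweep a single feature weight over a grid and keep the value that most often puts a judged-relevant document at the top, over a small set of hypothetical judged queries. Real systems tune many parameters at once with far better optimisation methods, which is exactly where the challenges below come in.

```python
# Each hypothetical query has candidate documents described as
# (text_score, link_score, judged_relevant).
training_queries = [
    [(3.0, 0.2, True), (2.5, 1.5, False)],
    [(1.0, 0.1, True), (0.8, 0.9, False)],
    [(2.0, 0.5, False), (1.8, 0.4, True)],
]

def queries_won(w):
    """Number of queries whose top document under score = text + w * link is relevant."""
    wins = 0
    for candidates in training_queries:
        top = max(candidates, key=lambda c: c[0] + w * c[1])
        wins += top[2]
    return wins

best_w = max((step / 10 for step in range(0, 21)), key=queries_won)
print(best_w, queries_won(best_w))
```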
Challenge 1: Optimisation methods
• Training is something of a black art
  • Not easy to write recipes for
• Much work currently on optimisation methods
  • Some of it coming from the machine learning community
Challenge 2: a tradeoff
• Many features require many parameters
  • From a machine learning point of view, the more the better
• Many parameters mean much training data
• Human relevance judgements are expensive
Challenge 3: How specific?
How much does the environment matter?
• Different features
  • E.g. characteristics of documents, file types, linkage, statistical properties…
• Different kinds of queries
  • Or different mixes of the same kinds
• Different factors affecting relevance
  • Access constraints
  • …
Challenge 3: How specific?
And if it does matter… how to train for the specific environment?
• Web search: huge training effort
• Enterprise: some might be feasible
• Desktop: unlikely
• Within-site / specialist databases: some might be feasible
Looking for alternatives
If training is difficult… some other possibilities:
• Robustness – parameters with stable optima (probably means fewer features)
• Training tool-kits (but remember the black art)
• Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
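The auto-training idea is only sketched in the slide, but one way to picture it: treat a click on a lower-ranked result, when a higher-ranked one was skipped, as a pairwise preference and nudge a feature weight so the clicked document would have scored higher. The single-weight update below is a deliberately crude illustration of that idea, not a worked-out method.

```python
def update_weight_from_click(w, clicked_feature, skipped_feature, lr=0.01):
    """Perceptron-style nudge of one feature weight from a click preference.

    With score = base + w * feature, raising w when the clicked document has
    the larger feature value widens its score margin over the skipped one.
    """
    return w + lr * (clicked_feature - skipped_feature)

w = 0.5
w = update_weight_from_click(w, clicked_feature=2.0, skipped_feature=1.2)
print(w)   # 0.508
```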
A little about Microsoft
• Web search: MSN takes on Google and Yahoo
  • New search engine is closing the gap
  • Some MSRC input
• Enterprise search: MS Search and SharePoint
  • New version is on its way
  • Much MSRC input
• Desktop: also MS Search
Final thoughts
• Search has come a long way since the library card catalogue
• … but it is by no means a done deal
• This is a very active field
  • both academically and commercially
I confidently expect that it will change as much in the next 16 years as it has since 1990