This article explores the history and development of search technologies, tracing their origins from traditional library catalogs and scientific abstracts to modern web search engines. It discusses the technical components of search, such as crawling, indexing, and ranking algorithms, highlighting the challenges in optimizing search results. Additionally, it examines the role of user queries and relevance judgments in evaluating search effectiveness. As search technology continues to evolve, this piece emphasizes the importance of ongoing research and innovation in the field.
Search
Stephen Robertson
Microsoft Research Cambridge
MSR Cambridge
• Andrew Herbert, Director
• Cambridge Laboratory …
• External Research Office – Stephen Emmott
MSR Cambridge
• Systems & Networking – Peter Key
  • Operating Systems
  • Networking
  • Distributed Computing
• Machine Learning & Perception – Christopher Bishop
  • Machine Learning
  • Computer Vision
  • Information Retrieval
MSR Cambridge
• Programming Principles & Tools – Luca Cardelli
  • Programming Principles & Tools
  • Security
• Computer-Mediated Living – Ken Wood
  • Human Computer Interaction
  • Ubiquitous Computing
  • Sensors and Devices
  • Integrated Systems
Search: a bit of history
People sometimes assume that G**gle invented search … but of course this is false
• Library catalogues
• Scientific abstracts
• Printed indexes
• The 1960s to 80s: Boolean search
• Free text queries and ranking – a long gestation
• The web
Web search
• The technology
  • Crawling
  • Indexing
  • Ranking
  • Efficiency and effectiveness
• The business
  • Integrity of search
  • UI, speed
  • Ads
    • Ad ranking
    • Payment for clickthrough
Other search environments
• Within-site
• Specialist databases
• Enterprise/intranet
• Desktop
How search engines work
• Crawl a lot of documents
• Create a vast index
  • Every word in every document
  • Point to where it occurred
• Allow documents to inherit additional text
  • From the URL
  • From anchors in other documents…
  • Index this as well
• Also gather static information
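To make the indexing step concrete, here is a minimal sketch of an inverted index in Python: every word in every document points back to where it occurred. The tokenisation, the toy documents and the data structures are illustrative assumptions, not a description of any production engine.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the (doc_id, position) pairs where it occurs.

    `documents` is assumed to be a dict of doc_id -> text; a real engine
    would also index inherited text (URL words, anchor text from other
    documents) and gather static per-document information alongside it.
    """
    index = defaultdict(list)
    for doc_id, text in documents.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return index

docs = {
    "d1": "search engines crawl and index documents",
    "d2": "ranking decides which documents appear first",
}
index = build_inverted_index(docs)
print(index["documents"])   # [('d1', 5), ('d2', 3)]
```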
How search engines work
Given a query:
• Look up each query word in the index
• Throw all this information at the ranker
Ranker: a computing engine which calculates a score for each document and identifies the top n scoring documents
The score depends on a whole variety of features, and may include static information
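A matching sketch of the query side, again purely illustrative: each query word is looked up in a small hand-built index of the same shape as above, and a toy ranker scores every candidate document and returns the top n. The score here is just a count of matching postings, standing in for the much richer feature-based score a real ranker computes.

```python
from collections import Counter

# A tiny inverted index of the shape sketched above: term -> (doc_id, position) postings.
index = {
    "ranking":   [("d2", 0)],
    "documents": [("d1", 5), ("d2", 3)],
    "search":    [("d1", 0)],
}

def rank(query, index, n=10):
    """Look up each query word in the index and score the candidate documents.

    The score is a simple match count; a real ranker combines many features
    and may fold in static, query-independent information as well.
    """
    scores = Counter()
    for word in query.lower().split():
        for doc_id, _position in index.get(word, []):
            scores[doc_id] += 1
    return scores.most_common(n)   # top-n (doc_id, score) pairs

print(rank("ranking documents", index))   # [('d2', 2), ('d1', 1)]
```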
A core challenge: ranking
• What features might be useful?
  • Features of the query-document pair
  • Features of the document
  • Maybe features of the query
  • Simple / transformed / compound
• Combining features
  • Formulae
  • Weights and other free parameters
  • Tuning / training / learning
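One simple way to combine features, sketched here with made-up feature names and weights: describe each query-document pair by a feature vector and take a weighted combination, the weights being the free parameters that tuning, training or learning must set. Real systems may also transform features and use non-linear combinations.

```python
def combined_score(features, weights):
    """Weighted linear combination of ranking features for one query-document pair."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical feature values and weights.
features = {"text_match": 7.3, "anchor_match": 1.0, "log_inlinks": 2.1}
weights = {"text_match": 1.0, "anchor_match": 0.8, "log_inlinks": 0.3}
print(combined_score(features, weights))   # 8.73
```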
Ranking algorithms
• Based on probabilistic models
  • we are trying to predict relevance
• … plus a little linguistic analysis
  • but this is secondary to the statistics
• … plus a great deal of know-how, experience, experiment
• Need:
  • Evidence from all possible sources
  • … combined appropriately
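The slides do not name a particular formula, but BM25, which grew out of the probabilistic relevance models Robertson worked on, is the standard example of a statistically motivated ranking function. The sketch below uses one common variant of its IDF component; the sample numbers are invented.

```python
import math

def bm25_term_weight(tf, doc_len, avg_doc_len, df, num_docs, k1=1.2, b=0.75):
    """BM25 weight of one query term in one document.

    tf: term frequency in the document; df: number of documents containing
    the term; k1 and b are the usual free parameters. Several IDF variants
    exist; this one adds 1 inside the log to keep weights non-negative.
    """
    idf = math.log((num_docs - df + 0.5) / (df + 0.5) + 1)
    tf_norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_norm

# A document's score is the sum of the weights of the query terms it contains.
print(bm25_term_weight(tf=3, doc_len=120, avg_doc_len=100, df=50, num_docs=10_000))
```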
Evaluation
• User queries
• Relevance judgements
  • by humans
  • yes-no or multilevel
• Evaluation measures
  • How to evaluate a ranking?
  • Only the top end matters
  • Various different measures in use
• Public bake-offs
  • TREC etc.
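As an illustration of the kind of top-heavy measure the slide alludes to, here is a small sketch of precision at k (for yes/no judgements) and discounted cumulative gain (for multilevel judgements); the judgement values are invented.

```python
import math

def precision_at_k(judgements, k):
    """Fraction of the top-k results judged relevant (yes/no judgements)."""
    return sum(1 for j in judgements[:k] if j > 0) / k

def dcg_at_k(judgements, k):
    """Discounted cumulative gain: multilevel judgements discounted by rank,
    so only the top end of the ranking contributes much."""
    return sum(j / math.log2(rank + 2) for rank, j in enumerate(judgements[:k]))

# Judgements of the top 5 documents returned for one query (0 = not relevant).
judgements = [2, 0, 1, 0, 2]
print(precision_at_k(judgements, 5), dcg_at_k(judgements, 5))
```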
Using evaluation data for training
• Task: to optimise a set of parameters
  • E.g. weights of features
• Optimisation is potentially very powerful
  • Can make a huge difference to effectiveness
• But there are challenges…
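A toy illustration of what training on evaluation data can mean: sweep a single feature weight over a grid and keep the value that most often puts a judged-relevant document at the top, over a small set of hypothetical judged queries. Real systems tune many parameters at once with far better optimisation methods, which is exactly where the challenges below come in.

```python
# Each hypothetical query has candidate documents described as
# (text_score, link_score, judged_relevant).
training_queries = [
    [(3.0, 0.2, True), (2.5, 1.5, False)],
    [(1.0, 0.1, True), (0.8, 0.9, False)],
    [(2.0, 0.5, False), (1.8, 0.4, True)],
]

def queries_won(w):
    """Number of queries whose top document under score = text + w * link is relevant."""
    wins = 0
    for candidates in training_queries:
        top = max(candidates, key=lambda c: c[0] + w * c[1])
        wins += top[2]
    return wins

best_w = max((step / 10 for step in range(0, 21)), key=queries_won)
print(best_w, queries_won(best_w))
```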
Challenge 1: Optimisation methods
• Training is something of a black art
  • Not easy to write recipes for
• Much work currently on optimisation methods
  • Some of it coming from the machine learning community
Challenge 2: a tradeoff
• Many features require many parameters
  • From a machine learning point of view, the more the better
• Many parameters mean much training data
• Human relevance judgements are expensive
Challenge 3: How specific?
How much does the environment matter?
• Different features
  • E.g. characteristics of documents, file types, linkage, statistical properties…
• Different kinds of queries
  • Or different mixes of the same kinds
• Different factors affecting relevance
  • Access constraints
  • …
Challenge 3: How specific?
And if it does matter… how to train for the specific environment?
• Web search: huge training effort
• Enterprise: some might be feasible
• Desktop: unlikely
• Within-site / specialist databases: some might be feasible
Looking for alternatives
If training is difficult… some other possibilities:
• Robustness – parameters with stable optima (probably means fewer features)
• Training tool-kits (but remember the black art)
• Auto-training – a system that trains itself on the basis of clickthrough (a long-term prospect)
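The auto-training idea is only sketched in the slide, but one way to picture it: treat a click on a lower-ranked result, when a higher-ranked one was skipped, as a pairwise preference and nudge a feature weight so the clicked document would have scored higher. The single-weight update below is a deliberately crude illustration of that idea, not a worked-out method.

```python
def update_weight_from_click(w, clicked_feature, skipped_feature, lr=0.01):
    """Perceptron-style nudge of one feature weight from a click preference.

    With score = base + w * feature, raising w when the clicked document has
    the larger feature value widens its score margin over the skipped one.
    """
    return w + lr * (clicked_feature - skipped_feature)

w = 0.5
w = update_weight_from_click(w, clicked_feature=2.0, skipped_feature=1.2)
print(w)   # 0.508
```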
A little about Microsoft
• Web search: MSN takes on Google and Yahoo
  • New search engine is closing the gap
  • Some MSRC input
• Enterprise search: MS Search and SharePoint
  • New version is on its way
  • Much MSRC input
• Desktop: also MS Search
Final thoughts
• Search has come a long way since the library card catalogue
• … but it is by no means a done deal
• This is a very active field
  • both academically and commercially
I confidently expect that it will change as much in the next 16 years as it has since 1990