going further together
Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng, Vice-Chair, BCS IRSG. Contents: The BCS Information Retrieval SG. What is IR anyway? How search engines work. Why search is hard. Where’s it all going?
Presentation Transcript
Information Search & Retrieval: Problems, solutions, trends… Tony Rose, PhD MBCS CEng Vice-Chair, BCS IRSG
Contents • The BCS Information Retrieval SG • What is IR anyway? • How search engines work • Why search is hard • Where’s it all going?
Information Retrieval SG • Growing rapidly • 750+ members • Annual conference (ECIR) • FDIA • Various 1-day events • Search Solutions • Informer • Discounts for various events, e.g. SIGIR • … is free to join!
Information Retrieval SG • Traditional focus on search (text retrieval) • Knowledge management, Multimedia retrieval, User experience, Information visualisation, extraction, summarisation, etc. • Latest issue of Informer: • “Searching for the Music You Like” • “Exploring Maps through Geo-referenced Images and RDF Shared Metadata” • “Using Semantic Relations to improve Question Answering” • “Modeling & Annotation of Dance Media Semantics”
What is IR? • “Science of searching for: • information in documents • documents themselves • metadata which describe documents, • within databases • …whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web”
The Need for IR • In a word … Infoglut • 800 MB of recorded information is produced per person per year [Computing magazine] • Up to 80% of corporate information is unstructured • Documents, emails, images, voicemail, etc. • So … can’t we just use Google?
How do Search Engines Work? • On the surface: • Understand what the user wants • Find documents about that topic • In reality: • Count words • Apply a simple equation
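The "count words, apply a simple equation" idea can be sketched as follows. The slides do not say which equation is meant; TF-IDF weighting is assumed here as one common choice, and the function names and toy documents are illustrative only.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; real engines do far more than this.
    return text.lower().split()


def tf_idf_scores(query, docs):
    """Score each document against the query with a basic TF-IDF sum."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)  # term frequency within this document
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                idf = math.log(n_docs / df[term])  # rarer terms weigh more
                score += tf[term] * idf
        scores.append(score)
    return scores


# Toy collection (an assumption, not taken from the slides).
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "search engines count words",
]
scores = tf_idf_scores("cat mat", docs)
```

The first document matches both query terms, the second matches only the more common one, and the third matches neither, so the scores fall in that order.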
How do Search Engines Work? • Measure the conceptual distance between your query and each document in the DB • Return the best matches [Source: Maristella Agosti, University of Padova]
The Central Problem in IR • Information seeker: concepts → query terms • Author: concepts → document terms • Do these represent the same concepts? [Source: Jimmy Lin, University of Maryland]
The Central Problem in IR • How do you represent the concepts? • Documents and queries = “bag of words” • Unordered set of terms + numeric weights • How do you calculate similarity? • Set theory (e.g. Boolean) • Algebraic (e.g. vector space) • Probabilistic
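As a rough illustration of two of the similarity families listed above, the sketch below represents text as a bag of words and compares a set-theoretic (Boolean AND) match with an algebraic (vector space, cosine) score. Raw term counts stand in for the numeric weights, which is a simplifying assumption; the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    # Unordered terms with numeric weights (here: raw counts).
    return Counter(text.lower().split())


def boolean_and_match(query, doc):
    # Set-theoretic model: every query term must appear in the document.
    return set(bag_of_words(query)) <= set(bag_of_words(doc))


def cosine_similarity(a, b):
    # Vector space model: cosine of the angle between two term vectors.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


query = bag_of_words("new york companies")
doc = bag_of_words("new companies in york")
similarity = cosine_similarity(query, doc)
```

The Boolean model gives a yes/no answer, while the cosine score allows results to be ranked. Neither looks at word order, so "new york companies" and "new companies in york" still score highly against each other.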
IR models [Source: Wikipedia]
How do we Evaluate Search? • Assume that results are either relevant or non-relevant • Precision: • Proportion of retrieved documents that are relevant • Recall: • Proportion of known-relevant documents that were actually retrieved • But what about: indexing / retrieval speed, query language, user experience, etc.? [Figure: relevant vs. retrieved documents]
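The two measures defined above can be computed directly from sets of document identifiers. This is a minimal sketch; the identifiers and the function name are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query."""
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Four documents retrieved, three known to be relevant, two in common.
precision, recall = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d2", "d4", "d7"],
)
# precision is 2/4, recall is 2/3
```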
Why Search is Hard • Document representation • Keywords are not enough • Blind Venetian = Venetian Blind • Terms are not independent • Structural & discourse dependencies, co-references, etc. • Imperfect “stop lists” • the, and, of…
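Two of the representation problems above are easy to reproduce in a few lines. The sketch assumes a tiny hand-picked stop list (not one from the slides) and a hypothetical `index_terms` helper.

```python
from collections import Counter

# Assumed stop list for illustration; real lists are longer and still imperfect.
STOP_WORDS = {"the", "and", "of", "to", "be", "or", "not"}


def index_terms(text, stop_words=STOP_WORDS):
    # Bag of words with stop words removed; word order is discarded.
    return Counter(t for t in text.lower().split() if t not in stop_words)


# Word order is lost, so these two phrases become indistinguishable.
same_bag = index_terms("Blind Venetian") == index_terms("Venetian blind")

# A query made entirely of stop words leaves nothing to search for.
hamlet = index_terms("to be or not to be")
```

Here `same_bag` is true and `hamlet` is empty, which is why keyword matching alone is not enough.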
Why Search is Hard • Morphological relationships • Computer, computing, compute, computed… • Index documents using word stems • False positives: • organization, organ → organ • police, policy → polic • arm, army → arm • False negatives: • cylinder, cylindrical • create, creation • Europe, European • Prefixes are particularly difficult • Un*, dis* • Delegate = de-leg-ate • Ratify = rat-ify
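The stemming errors listed above can be reproduced with a deliberately crude suffix stripper. This is a toy written for illustration, not the Porter stemmer or any production algorithm; the suffix list and the minimum stem length are assumptions.

```python
# Suffixes are tried in order, longest first (an assumed, hand-picked list).
SUFFIXES = ("ization", "ation", "ing", "ed", "er", "es", "e", "s", "y")


def naive_stem(word):
    word = word.lower()
    for suffix in SUFFIXES:
        # Keep at least three characters so short words survive intact.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Intended behaviour: related forms share one stem.
family = {naive_stem(w) for w in ["computer", "computing", "compute", "computed"]}

# False positives: unrelated words collapse to the same stem.
false_positives = [
    (naive_stem("organization"), naive_stem("organ")),
    (naive_stem("police"), naive_stem("policy")),
    (naive_stem("arm"), naive_stem("army")),
]

# False negatives: related words end up with different stems.
false_negatives = [
    (naive_stem("cylinder"), naive_stem("cylindrical")),
    (naive_stem("create"), naive_stem("creation")),
    (naive_stem("Europe"), naive_stem("European")),
]
```

Adding more suffix rules fixes some false negatives but tends to create new false positives, which is the trade-off the slide points at.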
Why Search is Hard • Named entity recognition • Companies in New York • New companies in York • NEs are highly discriminatory • People • Places • Organisations • Many vertical applications • e.g. bioscience
Why Search is Hard • Semantic relationships • Car = automobile • Buy = purchase • Sick = ill • Synonym rings • Car, automobile, truck, bus, taxi... • Appropriate level of abstraction depends on user & task • Development of subject-specific taxonomies • “concept matching”
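One simple way to use synonym rings is to expand each query term into its ring before matching. The sketch below assumes small hand-built rings based on the examples above; the data structure and the `expand_query` name are illustrative, not a standard API.

```python
# Hand-built rings taken from the slide's examples (kept narrow on purpose).
SYNONYM_RINGS = [
    {"car", "automobile"},
    {"buy", "purchase"},
    {"sick", "ill"},
]


def expand_query(query, rings=SYNONYM_RINGS):
    """Return one sorted list of alternatives per query term."""
    expanded = []
    for term in query.lower().split():
        group = {term}
        for ring in rings:
            if term in ring:
                group |= ring  # any member of the ring may match
        expanded.append(sorted(group))
    return expanded


alternatives = expand_query("buy car")
# → [['buy', 'purchase'], ['automobile', 'car']]
```

Widening the "car" ring to include truck, bus and taxi would raise recall but lower precision, which is why the right level of abstraction depends on the user and the task.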
Why Search is Hard • Word sense disambiguation • “Bank” • Financial institution? • Part of a river? • An aerial manoeuvre? • Active research area • Categorisation & clustering of results
Google’s Insight • Exploit the link structure inherent in the web • Calculate a measure of each document’s value • Independent of any query • “PageRank” • Overall relevance based on 100+ parameters • Constant battle with SEOs • Enterprise search is a different proposition… • As is desktop search
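A query-independent link score of this kind can be sketched with a simplified power-iteration version of the published PageRank idea. This is not Google's production ranking; the damping factor of 0.85, the iteration count, and the four-page toy graph are all assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives a small baseline share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # A page divides its damped rank among the pages it links to.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # A page with no outlinks spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Toy web graph: "c" is linked to by three pages, "d" by none.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best = max(ranks, key=ranks.get)
```

No query appears anywhere in the calculation: the score depends only on who links to whom, and would be combined with query-dependent signals at search time.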
Where’s it all going? • Vertical search • Jobs, travel, health, people, etc. • Rich media search • Audio, video, TV, images • Specialised content search • blogs, news, classifieds • Social search • Personalisation
Where’s it all going? • Mobile search • Answer engines • Active research community in Question Answering • Multi / cross-lingual search • Search agents • Human UI
Further Information • www.irsg.bcs.org • Informer • ECIR (March 2008, Glasgow) • Search Solutions 2008 (Sept 2008, London)