Explore the integration of human expertise and artificial intelligence in vertical search for improved precision and understanding of search results. Learn about expert search, legal search, and recent trends in search algorithms. Discover how editorial value and user behavior enhance search capabilities.
Human Expertise and Artificial Intelligence in Vertical Search • Peter Jackson & Khalid Al-Kofahi • Corporate Research & Development
The Paradox of Search • The further you get from keyword indexing and retrieval, the harder it is to explain a search result • Professional searchers demand transparency • Tool versus appliance • You need an ‘explanatory model’ that people can relate to and understand, even if it is actually just a cartoon of the real process • Examples: Basic PageRank, Collaborative Filtering • Such models don’t work so well in vertical domains • Links aren’t always endorsements • Sparsity of data in smaller communities
Recent Trends in Search • Fragmentation of ‘horizontal’ search • Media, location, demographics (Weber & Castillo, 2010) • More sophisticated models of user behavior • Post-click behaviors (Zhong, Wang, et al., 2010) • ‘Practical semantics’ versus Semantic Web • Maps as search results for local, micro-results • Incorporation of domain knowledge into search • Taxonomies, vocabularies, use cases, workflows
The Example of Legal Search • The completeness requirement • Recall as important as precision • Less redundancy than on the Web • The authority requirement • Court superiority, jurisdiction • Highly cited cases and statutes • Supersession by statute or regulation • The multi-topical nature of documents • A case may cover many points of law but be cited for only one • Citations can be negative as well as positive per topic • These factors also apply to scientific documents
Expert Search • In many verticals, there are at least two sources of expertise available for enhancing search • Editors and authors, who generate useful metadata • Users, who generate clickstreams and other data • Editorial value addition especially improves recall • Helps find both fat-neck and long-tail documents on a topic • Aggregate user behavior mostly improves precision • Power users find the most relevant and important documents • The model of expert search enables and explains the portfolio of results, rather than individual results
Sources of Evidence: Authors & Editors • [Diagram: Burger King Corp. v. Rudzewicz shown with its headnotes and key numbers (Headnote, KN), its text and citations, and the cases linked to it] • Issue: Long arm jurisdiction • 12 A (Key cases) • 54 B (Highly Relevant)
Sources of Evidence: Authors & Editors • [Diagram: Burger King Corp. v. Rudzewicz with its headnote and key number pairs (HN1 KN1, HN2 KN2, HN3 KN2, …, HN35 KN14), linked to sets of cases through ALR, CJS, and AMJUR, and to another set of related cases]
Sources of Evidence: Users (I) • [Diagram: user sessions 1 through N, each with queries (Query 1 … Query N) and actions (Click, Print, KeyCite) on cases, including Burger King Corp. v. Rudzewicz] • Link query language to document language via click, print, and cite-checking behaviors • Identify documents that are co-clicked, co-printed, etc., with the Burger King case across user sessions
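The co-click and co-print idea on this slide can be sketched as a pair count over sessions. This is only an illustration under stated assumptions: the session log, the document identifiers (`case_a`, `case_b`, `case_c`), and the function names are invented here, and the slides do not say how the production system computed these associations.

```python
from collections import Counter
from itertools import combinations

# Hypothetical session log: session id -> documents clicked or printed.
# All identifiers and values are invented for illustration.
sessions = {
    "s1": ["burger_king", "case_a", "case_b"],
    "s2": ["burger_king", "case_a"],
    "s3": ["burger_king", "case_c"],
    "s4": ["case_a", "case_b"],
}

def co_action_counts(sessions):
    """Count how many sessions each unordered pair of documents shares."""
    counts = Counter()
    for docs in sessions.values():
        # De-duplicate within a session so repeated clicks count once.
        for a, b in combinations(sorted(set(docs)), 2):
            counts[(a, b)] += 1
    return counts

def related_to(target, counts):
    """Documents that co-occur with `target`, most frequent first."""
    related = Counter()
    for (a, b), n in counts.items():
        if a == target:
            related[b] += n
        elif b == target:
            related[a] += n
    return related.most_common()

related = related_to("burger_king", co_action_counts(sessions))
```

Counting sessions rather than raw actions keeps one very active user from dominating the signal, which is one plausible design choice among several.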
Sources of Evidence: Users (II) • [Diagram: sessions 1 through N with queries and click/print actions on Burger King Corp. v. Rudzewicz in the last 3 months] • Original breach of contract and trademark infringement case turned into a civil procedure case about jurisdiction on appeal • Query counts shown: "personal jurisdiction" 176, "minimum contacts" 50, "forum selection clause" 39, "personal jurisdiction" 39, "forum non conveniens" 32, "choice of law" 29 • User actions: 10417 • Total sessions: 9758
AI & The Ranking Problem • Supervised Machine Learning (Ranker SVM) • Iteratively retrieve and rank documents • Incorporate all available cues: text similarity, classifications, citations, user behavior and query logs • All of this requires lots of data! • Training & Validation • Gold data: hand-crafted research reports covering a variety of legal issues • Each report contains an issue statement, multiple queries, all seminal and highly relevant documents, and some relevant documents • > 100K documents judged against ~400 legal issues • System was also tested by an independent third party
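As a rough illustration of how a ranking SVM learns from graded judgments, the sketch below turns judged documents for one issue into preference pairs and fits a linear scorer with a hinge loss. Everything specific is an assumption: the three features (text similarity, citation signal, click signal), the grades, the hyperparameters, and the plain subgradient training loop are invented for illustration and are not the authors' implementation.

```python
import random

# Hypothetical judged documents for one issue:
# ([text_similarity, citation_signal, click_signal], relevance_grade)
docs = [
    ([0.9, 0.8, 0.7], 2),  # seminal
    ([0.6, 0.4, 0.5], 1),  # relevant
    ([0.7, 0.1, 0.1], 0),  # not relevant
]

def pairwise_differences(docs):
    """Build x_i - x_j for every pair where document i outranks document j."""
    diffs = []
    for xi, yi in docs:
        for xj, yj in docs:
            if yi > yj:
                diffs.append([a - b for a, b in zip(xi, xj)])
    return diffs

def train_linear_ranker(diffs, epochs=200, lr=0.1, reg=0.01, seed=0):
    """Subgradient descent on the L2-regularized pairwise hinge loss."""
    rng = random.Random(seed)
    pairs = list(diffs)  # copy so the caller's list is not reordered
    w = [0.0] * len(pairs[0])
    for _ in range(epochs):
        rng.shuffle(pairs)
        for d in pairs:
            margin = sum(wi * di for wi, di in zip(w, d))
            w = [wi * (1.0 - lr * reg) for wi in w]  # regularization shrink
            if margin < 1.0:  # pair is violated or inside the margin
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w

def score(w, x):
    """Linear relevance score used to order documents."""
    return sum(wi * xi for wi, xi in zip(w, x))

w = train_linear_ranker(pairwise_differences(docs))
ranked = sorted(docs, key=lambda d: score(w, d[0]), reverse=True)
```

The pairwise transform is what lets graded gold data such as "seminal", "highly relevant", and "relevant" be used directly: only the ordering between documents for the same issue matters, not the absolute grade.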
Hadoop for Big Data Processing • At launch, query logs contained ~2 billion records • Queries & user actions • Relied on a Hadoop cluster to • Run Extract, Transform, and Load (ETL) processes • Cluster similar queries together • Extract, normalize, collate citation contexts • Dramatic improvement in processing times • From tens of hours to tens of minutes
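Grouping similar queries fits the map and reduce pattern naturally. The sketch below runs both phases in a single Python process to show the shape of the job; on Hadoop the same two phases would be distributed across the cluster. The normalization rule (lowercase, strip punctuation, sort tokens) and the sample queries are assumptions made for illustration, standing in for whatever similarity measure was actually used.

```python
import re
from collections import defaultdict

def normalize(query):
    """Crude grouping key: lowercase alphanumeric tokens in sorted order."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return " ".join(sorted(tokens))

def map_phase(records):
    """Emit (key, raw_query) pairs, as a mapper would."""
    for query in records:
        yield normalize(query), query

def reduce_phase(pairs):
    """Collect all raw queries that share a key, as a reducer would."""
    groups = defaultdict(list)
    for key, query in pairs:
        groups[key].append(query)
    return dict(groups)

# Invented sample of raw query strings.
raw_queries = [
    '"personal jurisdiction"',
    "Personal Jurisdiction",
    "jurisdiction, personal",
    '"minimum contacts"',
    "minimum contacts",
]

clusters = reduce_phase(map_phase(raw_queries))
```

Because each record is mapped independently and each key is reduced independently, the work splits cleanly across cores, which is the data parallelism the slides rely on.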
Cluster Configuration: Queries • 8 machines, each with 16 cores • Only 14 cores/machine were available for processing • Giving a total of 112 cores • Block size of 64 MB • Each core processes one block at a time • Cluster can process 7 GB at each step • Latest cluster is twice the size: 224 cores • Almost 1 TB of memory and over 1 PB of storage
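The capacity figures on this slide follow from simple arithmetic, worked through below. The machine, core, and block-size numbers come from the slide; the `steps_needed` helper and the 70 GB input size in the usage line are hypothetical additions for illustration.

```python
import math

machines = 8
usable_cores_per_machine = 14   # 16 cores each, 14 available for processing
block_size_mb = 64

total_cores = machines * usable_cores_per_machine   # 8 * 14 = 112
mb_per_step = total_cores * block_size_mb           # one block per core
gb_per_step = mb_per_step / 1024                    # 7168 MB = 7 GB

def steps_needed(input_gb):
    """Processing steps for an input of `input_gb` gigabytes (hypothetical helper)."""
    return math.ceil(input_gb / gb_per_step)

print(total_cores, gb_per_step, steps_needed(70))
```

With the later 224-core cluster the same calculation gives 14 GB per step, so the number of steps for a given input halves.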
The Power of Expert Search • Leverages expertise of community: authors, editors, & users • We know why documents are linked • We know exactly who our users are • Metadata, authority & aggregated user data all contribute to relevance, importance & popularity • Can still benefit from Power Law phenomena so common on the Web • Can exploit data parallelism to achieve the same kind of scale as horizontal search
Lessons Learned • Vertical search is not just about search • It’s about findability • Includes navigation, recommendations, clustering, faceted classification, etc. • It’s about satisfying a set of well-understood tasks • Usually on enhanced content • Usually for expert customers • Leveraging human value addition is key • None of the human actors set out to improve search • Difficult to design complete solution upfront • Need platform for experimentation and validation at scale
Questions? • A relevant paper is downloadable from http://labs.thomsonreuters.com