CHORUS What is « Search » A functional view ------------------------- 2008-04-21 Henri Gouraud

CHORUS What is « Search » A functional view ------------------------- 2008-04-21 Henri Gouraud WP2

Overall goal
• Break down search into essential (necessary) components
• Identify issues associated with each component
• Facilitate matching of use-cases with functional overview
• For a given use-case, identify “critical” components
  • Those for which there is no known solution
  • Those for which existing solutions are not performing
• Identify use-cases where the model breaks
  • Repair/extend model
  • Identify potential « new models »
----> Prepare Gap Analysis

This analysis tries to be « Media » independent
• Functions are media independent
  • Document discovery
  • Meta-data extraction
  • User Interface
  • .....
• Techniques necessary to implement each function are media dependent ...
  • Text extraction
  • Speech to text
  • Image signatures
  • ....
• ... and are at varying levels of maturity and performance

Top level vision
• Search engines come into play when « direct » search into the document repository fails (volume, performance, ...)
[Diagram labels: Documents, Querying, Indexing, Matching, Data-base]

At the core: matching
• Matching happens between two « computer based » chunks of data
  • Query-meta-data, derived from the user input (and the user's context)
  • Document-meta-data, derived from the documents being searched
[Diagram labels: Query-meta-data, Matching, Document-meta-data, Data-base]

The Matching process
• Simple or boolean
  • AND, OR, NEAR, parentheses, regular expressions, ...
• Accurate or fuzzy
  • Spelling, phonetic, « similar to », ...
• Typed
  • Author:xx, Title:xx, ...
• Centralized/distributed
  • Across single LAN, across WAN, peer-to-peer, ...
• Issues
  • New media types: algorithms
  • Performance
    • Single query response time
    • Query throughput
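The matching modes listed above can be illustrated with a minimal Python sketch. Everything in it is an assumption for illustration and not part of the original slides: the toy index (term to set of document ids), the function names, and the use of the standard-library difflib similarity as a stand-in for spelling-tolerant matching.

```python
from difflib import get_close_matches

# Hypothetical toy index: term -> ids of documents containing it.
INDEX = {
    "search": {1, 2, 3},
    "engine": {1, 3},
    "speech": {2},
    "image": {3, 4},
}

def match_and(terms, index):
    """Boolean AND: documents that contain every term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def match_or(terms, index):
    """Boolean OR: documents that contain at least one term."""
    result = set()
    for t in terms:
        result |= index.get(t, set())
    return result

def match_fuzzy(term, index, cutoff=0.8):
    """Fuzzy match: accept indexed terms whose spelling is close to `term`."""
    close = get_close_matches(term, list(index), n=3, cutoff=cutoff)
    return match_or(close, index)
```

For example, match_and(["search", "engine"], INDEX) keeps only documents holding both terms, while match_fuzzy("serch", INDEX) tolerates the misspelling and falls back to the postings of "search". NEAR, typed and distributed matching are not covered by this sketch.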

The document side
• The main issue: the « Transform » step
  • Extracting useful information from the documents
[Diagram labels: Pull, Crawl, Document, Push, Transform, Matching, D-meta-data, Build, Data-base, Content]

The document side
• Document discovery
  • Pull=crawling, push=OK
  • Completeness, freshness
• Building the SE data-base
  • Scalability, reliability
  • Incremental
  • Distributed
• Transform: elaborating D-meta-data
  • Deal with existing meta-data, multi-pass process, ...
  • Dealing with multiplicity of content types and formats
  • For each type, a specific meta-data elaboration process
• Issues
  • Algorithm (for each media type)
  • Performance (relates to document repository size and churn rate)
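The Transform and Build steps above can be sketched for plain text as follows. This is a minimal illustration under stated assumptions, not the deck's design: the transform function, the Index class and the dictionary layout of the D-meta-data are all hypothetical names, and other media types would need their own transform.

```python
import re
from collections import defaultdict

def transform(doc_id, text, existing_meta=None):
    """Elaborate D-meta-data: keep existing meta-data, add extracted terms."""
    meta = dict(existing_meta or {})
    meta["id"] = doc_id
    meta["terms"] = sorted(set(re.findall(r"[a-z0-9]+", text.lower())))
    return meta

class Index:
    """Toy search-engine data-base supporting incremental updates."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> document ids
        self.docs = {}                    # document id -> D-meta-data

    def add(self, meta):
        """Add a document; re-adding an id replaces its old version."""
        self.remove(meta["id"])
        self.docs[meta["id"]] = meta
        for term in meta["terms"]:
            self.postings[term].add(meta["id"])

    def remove(self, doc_id):
        """Drop a document and clean up postings that become empty."""
        old = self.docs.pop(doc_id, None)
        if old is None:
            return
        for term in old["terms"]:
            self.postings[term].discard(doc_id)
            if not self.postings[term]:
                del self.postings[term]
```

Both pull (a crawler) and push (a publisher) would end in the same index.add(transform(...)) call. Replacing a document in place rather than rebuilding the whole data-base is what keeps the cost tied to churn rate rather than to repository size.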

The user side
• The two main issues
  • Transforming the user query into Q-meta-data
  • Organizing the results into manageable form
[Diagram labels: Query, UI, Transform, Q-meta-data, User, Navigation, Matching, UI, Organize, Results, Data-base]

The user side
• Capturing the « user intent »
  • The DWIM ("Do What I Mean") dream
  • Providing useful hints (what is « searchable »?)
• Organizing the results
  • Assume multiple results, i.e. choice or refinement
• Issues
  • Algorithm (for each media type)
    • Clustering, structuring, summarizing, ...
  • User Interface (for each terminal type)
  • Performance (under the ½ sec threshold)
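The two user-side steps can be illustrated with a small Python sketch: turning a raw query into Q-meta-data, and organizing results into groups the user can choose from. The function names, the field:value query syntax and the result dictionaries are assumptions made for this example. Grouping by a facet is a deliberately simple stand-in for the clustering and structuring mentioned above.

```python
from collections import defaultdict

def to_q_meta(raw_query):
    """Split a raw query into typed fields (name:value) and free terms."""
    fields, terms = {}, []
    for token in raw_query.lower().split():
        name, sep, value = token.partition(":")
        if sep and name and value:
            fields[name] = value
        else:
            terms.append(token)
    return {"fields": fields, "terms": terms}

def organize(results, facet):
    """Group result titles by one facet so the user can refine a choice."""
    groups = defaultdict(list)
    for result in results:
        groups[result.get(facet, "unknown")].append(result["title"])
    return dict(groups)
```

For instance, to_q_meta("Author:Gouraud functional view") separates the typed constraint from the two free terms, and organize(results, "media") buckets hits by media type. Capturing real user intent (context, hints, personalization) is well beyond this sketch.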

The big picture
[Diagram labels: Pull, Push, UI, Query, Transform, Crawl, Librarian, Q-meta-data, Document, Navigation, User, Transform, Matching, D-meta-data, Build, Organize, UI, Results, Data-base, Content, Intra-doc navigation]

The big picture issues
• On the document side, acquiring D-meta-data that will speed up the matching process
  • Performance trade-off
• On the document side, acquiring D-meta-data that will be relevant on the user side
  • That will fit « naturally » with the potential user queries
  • That will assist in organizing results into « manageable » form

Context, personalization
[Diagram labels: same as « The big picture », plus User context and Content context]

A Functional breakdown of Search Engine (it is much more complex)
[Diagram labels: same as « The big picture », plus User context, Content context and Corpora]

Search vs Alerts
[Diagram labels: same as « The big picture », plus Stored queries, User context and Content context]
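The slide only names « Stored queries ». One common reading, assumed here, is that an alert is a query stored ahead of time and matched against each newly arriving document, rather than against the existing data-base. The sketch below illustrates that reading. The AlertService class, its method names and the all-terms-present matching rule are hypothetical.

```python
class AlertService:
    """Toy alerting: stored queries are matched against incoming documents."""

    def __init__(self):
        self.stored = {}  # query name -> set of required terms

    def store(self, name, terms):
        """Register a stored query under a name."""
        self.stored[name] = {t.lower() for t in terms}

    def on_new_document(self, text):
        """Return the names of stored queries fully matched by the new text."""
        words = set(text.lower().split())
        return sorted(
            name for name, required in self.stored.items() if required <= words
        )
```

In search, the query arrives last and the documents are indexed in advance. In this alert sketch the roles are swapped: the queries are held and each new document triggers the matching.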

Acting on results
[Diagram labels: same as « The big picture », plus Stored queries, User context, Content context, Act, and User as a “librarian”]

Some global cross-functional issues
• IP, access rights, usage rights
• Security, privacy, …
• Business model
• Architecture, APIs, standards, …
• Software engineering
• Scalability

The Research triangle for Search Engines
[Diagram labels: same as « The big picture », plus User context and Content context]

Next steps
• Quantify limits associated with each functional component
  • Main driving parameter (size/churn, user population, media type, ...)
  • Influence on other functional components
  --> Identify main use-case typology terms
• Compare/describe research and industry use-cases according to the proposed functional description
• Prepare for gap analysis
  • Identify expected functional level progress
  • Identify « mismatch » cases, alternative/complementary models
