
Search and Ye Shall Find (maybe)


Presentation Transcript


  1. Search and Ye Shall Find (maybe). Seminar on Emergent Information Technology, August 20, 2007. Douglas W. Oard

  2. What Do People Search For? • Searchers often don’t clearly understand • The problem they are trying to solve • What information is needed to solve the problem • How to ask for that information • The query results from a clarification process • Dervin’s “sense making”: Need → Gap → Bridge

  3. Process/System Co-Design

  4. Design Strategies • Foster human-machine synergy • Exploit complementary strengths • Accommodate shared weaknesses • Divide-and-conquer • Divide task into stages with well-defined interfaces • Continue dividing until problems are easily solved • Co-design related components • Iterative process of joint optimization

  5. Human-Machine Synergy • Machines are good at: • Doing simple things accurately and quickly • Scaling to larger collections in sublinear time • People are better at: • Accurately recognizing what they are looking for • Evaluating intangibles such as “quality” • Both are pretty bad at: • Mapping consistently between words and concepts

  6. Supporting the Search Process

  [Diagram: the search process from the user’s side. Source Selection (predict, nominate, choose) leads to Query Formulation; the Query goes to Search (inside the IR System), which returns a Ranked List for Selection; the selected Document is Examined and Delivered. Feedback loops support Query Reformulation and Relevance Feedback, and Source Reselection.]

  7. Supporting the Search Process

  [Diagram: the same process from the system’s side. Document Acquisition builds the Collection, Indexing builds the Index, and the Index supports Search inside the IR System; Source Selection and Query Formulation produce the Query, Search returns a Ranked List, and Selection, Examination, and Delivery follow as before.]

  8. Taylor’s Model of Question Formation • Q1 Visceral Need → Q2 Conscious Need → Q3 Formalized Need → Q4 Compromised Need (Query) • The slide contrasts end-user search with intermediated search along this progression

  9. Search Component Model

  [Diagram: the Information Need drives Query Formulation, producing a Query; Query Processing applies a representation function to yield a Query Representation, while Document Processing applies a representation function to each Document to yield a Document Representation. A Comparison Function matches the two representations and outputs a Retrieval Status Value; Human Judgment of Utility evaluates the result.]

  10. Relevance • Relevance relates a topic and a document • Duplicates are equally relevant, by definition • Constant over time and across users • Pertinence relates a task and a document • Accounts for quality, complexity, language, … • Utility relates a user and a document • Accounts for prior knowledge

  11. “Bag of Terms” Representation • Bag = a “set” that can contain duplicates • “The quick brown fox jumped over the lazy dog’s back” → {back, brown, dog, fox, jump, lazy, over, quick, the, the} • Vector = values recorded in any consistent order • {back, brown, dog, fox, jump, lazy, over, quick, the, the} → [1 1 1 1 1 1 1 1 2]
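
A minimal sketch of this representation in Python (the tokenizer and names are illustrative; note the slide also stems terms such as “jumped” → “jump”, which this sketch omits):

```python
from collections import Counter
import re

def bag_of_terms(text):
    """Lowercase, split into word tokens, and count each term (a 'bag')."""
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

bag = bag_of_terms("The quick brown fox jumped over the lazy dog's back")
# Fix one consistent term order, then record counts as a vector.
vocabulary = sorted(bag)
vector = [bag[t] for t in vocabulary]
print(vocabulary)  # ['back', 'brown', "dog's", 'fox', ...]
print(vector)      # mostly 1s, with 2 for 'the'
```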

  12. Bag of Terms Example

  Document 1: “The quick brown fox jumped over the lazy dog’s back.”
  Document 2: “Now is the time for all good men to come to the aid of their party.”
  Stopword list: for, is, of, the, to

  Term     Doc 1   Doc 2
  aid        0       1
  all        0       1
  back       1       0
  brown      1       0
  come       0       1
  dog        1       0
  fox        1       0
  good       0       1
  jump       1       0
  lazy       1       0
  men        0       1
  now        0       1
  over       1       0
  party      0       1
  quick      1       0
  their      0       1
  time       0       1

  13. Advantages of Ranked Retrieval • Closer to the way people think • Some documents are better than others • Enriches browsing behavior • Decide how far down the list to go as you read it • Allows more flexible queries • Long and short queries can produce useful results

  14. Counting Terms • Terms tell us about documents • If “rabbit” appears a lot, it may be about rabbits • Documents tell us about terms • “the” is in every document -- not discriminating • Documents are most likely described well by rare terms that occur in them frequently • Higher “term frequency” is stronger evidence • Low “document frequency” makes it stronger still
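
A small sketch of both counts in Python, using the two example documents from slide 12 (the term lists are illustrative, with stopwords already removed):

```python
from collections import Counter

docs = {
    1: ["quick", "brown", "fox", "jump", "over", "lazy", "dog", "back"],
    2: ["now", "time", "all", "good", "men", "come", "aid", "their", "party"],
}

# Term frequency: how often a term occurs within one document.
tf = {d: Counter(terms) for d, terms in docs.items()}

# Document frequency: how many documents contain the term at all.
df = Counter(term for terms in docs.values() for term in set(terms))

print(tf[1]["fox"])  # 1 -- evidence that doc 1 is about foxes
print(df["fox"])     # 1 -- rare across the collection, so discriminating
```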

  15. Document Length Normalization • Long documents have an unfair advantage • They use a lot of terms • So they get more matches than short documents • And they use the same words repeatedly • So they have much higher term frequencies • Normalization seeks to remove these effects

  16. “Okapi” Term Weights • The weight for term i in document j multiplies a length-normalized TF component by an IDF component:

  w_{i,j} = \frac{tf_{i,j}}{k_1 [(1-b) + b \, dl_j / avdl] + tf_{i,j}} \cdot \log \frac{N - df_i + 0.5}{df_i + 0.5}

  where tf_{i,j} is the term frequency, df_i the document frequency, N the number of documents, dl_j the document length, avdl the average document length, and k_1 and b are tuning constants.
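
A minimal sketch of this weight in Python, assuming the common defaults k1 = 1.2 and b = 0.75 (the slide does not specify the constants):

```python
import math

def okapi_weight(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25-style term weight: length-normalized TF times IDF."""
    tf_component = tf / (k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf)
    idf_component = math.log((n_docs - df + 0.5) / (df + 0.5))
    return tf_component * idf_component

# A rare term occurring often in a short document scores highest.
print(okapi_weight(tf=3, df=5, n_docs=1000, doc_len=80, avg_doc_len=120))
```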

  17. Summary • Goal: find documents most similar to the query • Compute normalized document term weights • Some combination of TF, DF, and Length • Optionally, get query term weights from the user • Estimate of term importance • Compute inner product of query and doc vectors • Multiply corresponding elements and then add
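
The inner-product step, sketched in Python with dictionaries of term weights standing in for sparse vectors (names are illustrative):

```python
def score(query_weights, doc_weights):
    """Inner product: multiply corresponding term weights, then add."""
    return sum(w * doc_weights.get(term, 0.0)
               for term, w in query_weights.items())

query = {"fox": 1.0, "dog": 0.5}
doc = {"fox": 0.8, "lazy": 0.3, "dog": 0.6}
print(score(query, doc))  # 1.0*0.8 + 0.5*0.6, approximately 1.1
```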

  18. Some Questions for Indexing • How long will it take to find a document? • Is there any work we can do in advance? • If so, how long will that take? • How big a computer will I need? • How much disk space? How much RAM? • What if more documents arrive? • How much of the advance work must be repeated? • Will searching become slower? • How much more disk space will be needed?

  19. The Indexing Process

  Term     Doc: 1  2  3  4  5  6  7  8    Postings
  aid           0  0  0  1  0  0  0  1    4, 8
  all           0  1  0  1  0  1  0  0    2, 4, 6
  back          1  0  1  0  0  0  1  0    1, 3, 7
  brown         1  0  1  0  1  0  1  0    1, 3, 5, 7
  come          0  1  0  1  0  1  0  1    2, 4, 6, 8
  dog           0  0  1  0  1  0  0  0    3, 5
  fox           0  0  1  0  1  0  1  0    3, 5, 7
  good          0  1  0  1  0  1  0  1    2, 4, 6, 8
  jump          0  0  1  0  0  0  0  0    3
  lazy          1  0  1  0  1  0  1  0    1, 3, 5, 7
  men           0  1  0  1  0  0  0  1    2, 4, 8
  now           0  1  0  0  0  1  0  1    2, 6, 8
  over          1  0  1  0  1  0  1  1    1, 3, 5, 7, 8
  party         0  0  0  0  0  1  0  1    6, 8
  quick         1  0  1  0  0  0  0  0    1, 3
  their         1  0  0  0  1  0  1  0    1, 5, 7
  time          0  1  0  1  0  1  0  0    2, 4, 6

  (The inverted file is reached through a letter-prefix lookup: A → AI, AL; B → BA, BR; then C, D, F, G, J, L, M, N, O, P, Q; T → TH, TI.)

  20. The Finished Product

  Term     Postings
  aid      4, 8
  all      2, 4, 6
  back     1, 3, 7
  brown    1, 3, 5, 7
  come     2, 4, 6, 8
  dog      3, 5
  fox      3, 5, 7
  good     2, 4, 6, 8
  jump     3
  lazy     1, 3, 5, 7
  men      2, 4, 8
  now      2, 6, 8
  over     1, 3, 5, 7, 8
  party    6, 8
  quick    1, 3
  their    1, 5, 7
  time     2, 4, 6

  (Only the term → postings inverted file remains, reached through the same letter-prefix lookup.)
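
A minimal sketch of the indexing step in Python (the doc ids and terms echo the example; the structure is illustrative):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in sorted(index.items())}

docs = {1: ["quick", "brown", "fox"], 3: ["fox", "dog"], 5: ["dog", "lazy"]}
index = build_inverted_index(docs)
print(index["fox"])  # [1, 3] -- the postings list for 'fox'
```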

  21. Building an Inverted Index • Simplest solution is a single sorted array • Fast lookup using binary search • But sorting large files on disk is very slow • And adding one document means starting over • Tree structures allow easy insertion • But the worst case lookup time is linear • Balanced trees provide the best of both • Fast lookup and easy insertion • But they require 45% more disk space
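
A sketch of the sorted-array option in Python: binary search gives fast lookup, but keeping the array sorted makes insertion linear (the names and data are illustrative):

```python
import bisect

terms = ["aid", "all", "back", "brown", "come", "dog", "fox"]  # kept sorted
postings = [[4, 8], [2, 4, 6], [1, 3, 7], [1, 3, 5, 7],
            [2, 4, 6, 8], [3, 5], [3, 5, 7]]

def lookup(term):
    """Binary search over the sorted term array: O(log n) comparisons."""
    i = bisect.bisect_left(terms, term)
    if i < len(terms) and terms[i] == term:
        return postings[i]
    return []

print(lookup("dog"))  # [3, 5]

# Adding a document may add new terms; each insertion shifts every later
# element, O(n) per term -- which is what makes trees attractive for updates.
i = bisect.bisect_left(terms, "good")
terms.insert(i, "good")
postings.insert(i, [2, 4, 6, 8])
```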

  22. (Uncompressed) Index Size • Very compact for Boolean retrieval • About 10% of the size of the documents • If an aggressive stopword list is used • Not much larger for ranked retrieval • Perhaps 20% • Enormous for proximity operators • Sometimes larger than the documents!

  23. Index Compression • CPUs are much faster than disks • A disk can transfer 1,000 bytes in ~20 ms • The CPU can do ~10 million instructions in that time • Compressing the postings file is a big win • Trades decompression time for fewer disk reads • Key idea: reduce redundancy • Trick 1: store relative offsets (some will be the same) • Trick 2: use an optimal coding scheme
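
Both tricks sketched in Python: store gaps between successive document ids, then code each gap with a variable-byte scheme (one common coding; the slide does not name a specific one):

```python
def encode_postings(doc_ids):
    """Trick 1: store gaps (relative offsets) instead of absolute ids.
    Trick 2: code each gap in as few bytes as possible (variable-byte):
    7 data bits per byte, high bit set on the last byte of each gap."""
    out = bytearray()
    prev = 0
    for doc_id in doc_ids:
        gap, prev = doc_id - prev, doc_id
        chunks = []
        while True:
            chunks.append(gap & 0x7F)
            gap >>= 7
            if gap == 0:
                break
        out.extend(reversed(chunks[1:]))  # high-order 7-bit groups first
        out.append(chunks[0] | 0x80)      # low-order group, end marker set
    return bytes(out)

def decode_postings(data):
    """Undo both tricks: rebuild gaps, then running-sum back to doc ids."""
    doc_ids, gap, prev = [], 0, 0
    for byte in data:
        gap = (gap << 7) | (byte & 0x7F)
        if byte & 0x80:                   # end of this gap
            prev += gap
            doc_ids.append(prev)
            gap = 0
    return doc_ids

encoded = encode_postings([1, 3, 5, 7, 8])
print(len(encoded), decode_postings(encoded))  # 5 bytes, and it round-trips
```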

  24. Problems with “Free Text” Search • Homonymy • Terms may have many unrelated meanings • Polysemy (related meanings) is less of a problem • Synonymy • Many ways of saying (nearly) the same thing • Anaphora • Alternate ways of referring to the same thing

  25. Two Ways of Searching

  [Diagram: an author writes the document using terms to convey meaning. In free-text search, the searcher constructs a query from terms that may appear in documents, and query terms are matched against document terms (content-based query-document matching). In controlled-vocabulary search, an indexer chooses appropriate concept descriptors, the searcher constructs a query from the available descriptors, and query descriptors are matched against document descriptors (metadata-based query-document matching). Both paths produce a retrieval status value.]

  26. Supervised Learning

  [Diagram: labelled training examples, each a vector of feature values (f1, f2, f3, f4, …, fN: v1 … vN, w1 … wN) with a class label (Cv, Cw), are fed to a Learner that produces a Classifier; the Classifier assigns a class Cx to a new example (x1 … xN).]

  27. Example: kNN Classifier
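
The slide’s figure is not in the transcript; a minimal kNN sketch in Python (the feature values and labels are made up for illustration):

```python
import math
from collections import Counter

def knn_classify(query, examples, k=3):
    """Label a point by majority vote among its k nearest neighbors.
    'examples' is a list of (feature_vector, label) pairs."""
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

training = [((1.0, 1.0), "rabbit"), ((1.2, 0.9), "rabbit"),
            ((5.0, 5.1), "fox"), ((4.8, 5.3), "fox")]
print(knn_classify((1.1, 1.0), training))  # 'rabbit'
```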

  28. Problems with Controlled Vocabulary • New concepts • Users and indexers may think differently • Using thesauri effectively requires training

  29. Index Spam • Goal: Manipulate rankings of an IR system • Multiple strategies: • Create bogus user-assigned metadata • Add invisible text (font in background color, …) • Alter your text to include desired query terms • “Link exchanges” create links to your page

  30. Some Observable Behaviors

  31. [Table: observable behaviors organized by behavior category]

  32. [Table: observable behaviors organized by behavior category and minimum scope]

  33. Some Examples • Read/Ignored, Saved/Deleted, Replied to • (Stevens, 1993) • Reading time • (Morita & Shinoda, 1994; Konstan et al., 1997) • Hypertext Link • (Brin & Page, 1998)

  34. Estimating Authority from Links

  [Diagram: a hub page links out to several other pages; pages that many hubs link to gain authority.]
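
The hub/authority idea in this diagram is the mutually reinforcing iteration of HITS-style link analysis; a minimal sketch in Python (the link graph is illustrative, and the update rule follows the standard formulation rather than anything specified on the slide):

```python
def hits(links, iterations=20):
    """links: dict page -> list of pages it points to.
    Good hubs point to good authorities; good authorities are
    pointed to by good hubs."""
    pages = set(links) | {p for targets in links.values() for p in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        for p in pages:  # authority: sum of hub scores of pages linking in
            auth[p] = sum(hub[q] for q in links if p in links.get(q, []))
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        for p in pages:  # hub: sum of authority scores of pages linked to
            hub[p] = sum(auth[q] for q in links.get(p, []))
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

links = {"h1": ["a1", "a2"], "h2": ["a1"], "a1": [], "a2": []}
hub, auth = hits(links)
print(max(auth, key=auth.get))  # 'a1' -- pointed to by both hubs
```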

  35. Collecting Click Streams • Browsing histories are easily captured • Make all links initially point to a central site • Encode the desired URL as a parameter • Build a time-annotated transition graph for each user • Cookies identify users (when they use the same machine) • Redirect the browser to the desired page • Reading time is correlated with interest • Can be used to build individual profiles • Used to target advertising by doubleclick.com
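
A minimal sketch of the central-site redirect trick using Python’s standard library (the /go path, the url parameter, and the logging format are assumptions, not from the slides):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlparse, parse_qs
import time

class ClickLogger(BaseHTTPRequestHandler):
    """All links point here first, e.g. /go?url=http://example.com/page.
    Log the time-annotated transition, then redirect to the real page."""
    def do_GET(self):
        query = parse_qs(urlparse(self.path).query)
        target = query.get("url", ["/"])[0]
        user = self.headers.get("Cookie", "anonymous")  # identifies the user
        print(time.time(), user, target)  # one edge of the transition graph
        self.send_response(302)           # redirect the browser onward
        self.send_header("Location", target)
        self.end_headers()

# HTTPServer(("", 8080), ClickLogger).serve_forever()
```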

  36. [Bar chart: mean reading time in seconds (y-axis, 0–180) by interest rating for full-text telecommunications articles; the bars rise from 32 seconds (No Interest) through 43 (Low) and 50 (Moderate) to 58 (High Interest).]

  37. Problems with Observed Behavior • Protecting privacy • What absolute assurances can we provide? • How can we make remaining risks understood? • Scalable rating servers • Is a fully distributed architecture practical? • Non-cooperative users • How can the effect of spamming be limited?

  38. Putting It All Together
