
Today’s Topics



Presentation Transcript


  1. Today’s Topics • Boolean IR • Signature files • Inverted files • PAT trees • Suffix arrays

  2. Boolean IR • Documents composed of TERMS (words, stems) • Express the result in set-theoretic terms: A AND B, (A AND B) OR C • Pre-1970's origins; the dominant industrial model through 1994 (Lexis-Nexis, DIALOG) • [Venn diagram: documents containing terms A, B, and C]

  3. Boolean Operators • A AND B • A OR B • (A AND B) OR C • A AND (NOT B) • Proximity operators (extended ANDs): • Adjacent AND → "A B", e.g. "Johns Hopkins", "The Who" • Proximity window → A w/10 B: A and B within +/- 10 words • A w/sent B: A and B in the same sentence
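The operators above reduce to set operations over document-ID sets; a minimal Python sketch, with made-up document IDs:

```python
# Boolean IR operators as set operations over document-ID sets.
# The postings data here are made up for illustration.
docs_A = {14, 39, 156, 227}
docs_B = {39, 45, 156, 208}
docs_C = {7, 227}

assert docs_A & docs_B == {39, 156}                     # A AND B
assert docs_A | docs_B == {14, 39, 45, 156, 208, 227}   # A OR B
assert (docs_A & docs_B) | docs_C == {7, 39, 156, 227}  # (A AND B) OR C

# NOT is taken relative to the full collection of known documents.
all_docs = docs_A | docs_B | docs_C
assert docs_A & (all_docs - docs_B) == {14, 227}        # A AND (NOT B)
```

Proximity operators need word positions, not just document IDs, which is why they are treated separately later.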

  4. Boolean IR (implementation) • Bit vectors (one 0/1 entry per term_i in each document vector V1, V2, ...): impractical → very sparse (wastefully big) and costly to compare • Inverted files (a.k.a. index) • PAT trees (a more powerful index)

  5. Problems with Boolean IR • Does not effectively support relevance ranking of returned documents • Base model: expression satisfaction is Boolean; a document either matches the expression or it doesn't • Extensions to permit ordering, e.g. for (A AND B) OR C: • Supermatches (5 terms/doc > 3 terms/doc) • Partial matches (expression incompletely satisfied: give partial credit) • Importance weighting (10A OR 5B)

  6. Boolean IR • Advantages: • Can directly control the search • Good for precise queries in structured data (e.g. database search or a legal index) • Disadvantages: • Must directly control the search: users must be familiar with the domain and term space (know what to ask for and what to exclude) • Poor at relevance ranking • Poor at weighted query expansion, user modelling, etc.

  7. Signature Files • Map each document's bit vector to a shorter signature (fewer bits) using superimposed coding with a mapping/hash function f( ) • Problem: several different document bit vectors (i.e. different word sets) get mapped to the same signature • Use a stoplist to help keep common words from overwhelming the signatures

  8. False Drop Problem • On retrieval, all documents/bit vectors mapped to the matching signature are retrieved (returned) • Only a portion are relevant • Need a secondary validation step to make sure the target words actually match • Prob(false drop) = Prob(signature qualifies AND text does not)

  9. Efficiency Problem • Testing for a signature match may require a linear scan through all document signatures

  10. Vertical Partitioning • Bit-slice the signatures onto different devices for parallel comparison • AND together the matches from each slice to get the result • Improves sig1 vs. sig2 comparison speed, but still requires an O(N) linear search of all signatures

  11. Horizontal Partitioning • Goal: avoid sequential scanning of the signature file • Apply a hash function or index to the input signature, yielding specific candidate signatures in the signature database to try

  12. Inverted Files • Like an index to a book: maps terms to the documents that contain them • e.g. Baum → 14, 39, 156; Bayes → 39, 45, 156, 290; Viterbi → 41, 86, 156, 217

  13. Inverted Files • Very efficient for single-word queries: just enumerate the documents pointed to by the index, O(|A|) = O(SA) for term A's postings list • Efficient for ORs: just enumerate both lists and remove duplicates, O(SA + SB)
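The OR can be sketched as a linear merge over two sorted postings lists that drops duplicates; the postings values are illustrative:

```python
# OR of two sorted postings lists: linear merge with duplicate removal,
# O(S_A + S_B) in the lengths of the two lists.
def or_merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # shared doc: emit once
        elif a[i] < b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])                          # drain whichever remains
    out.extend(b[j:])
    return out

assert or_merge([14, 39, 156], [39, 45, 156, 290]) == [14, 39, 45, 156, 290]
```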

  14. ANDs using Inverted Files (meet search) • Method 1: merge the two postings lists, e.g. index for Bayes: 14, 39, 156, 227, 319; index for Viterbi: 39, 45, 58, 96, 156, 208 • Begin with two pointers (i, j), one into each index (A, B) • If A[i] = B[j], write A[i] to output and advance both • If A[i] < B[j], i++; else j++ • Cost O(SA + SB): same as OR, but smaller output
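Method 1 as a two-pointer merge sketch, using the Bayes/Viterbi postings from the slide:

```python
# Method 1: AND ("meet") of two sorted postings lists with two pointers,
# O(S_A + S_B); only the matching document IDs are emitted.
def and_merge(a, b):
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1   # present in both lists
        elif a[i] < b[j]:
            i += 1                             # advance the smaller side
        else:
            j += 1
    return out

bayes   = [14, 39, 156, 227, 319]
viterbi = [39, 45, 58, 96, 156, 208]
assert and_merge(bayes, viterbi) == [39, 156]
```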

  15. ANDs using Inverted Files • Method 2: useful if one index is much smaller than the other (SA << SB) • For all members A[i] of the smaller index (e.g. Johns: 39, 227), do a binary search for A[i] in the larger index (e.g. Hopkins: 1, 5, 25, 28, 39, 45, 58, 96, 156) • Cost: SA * log2(SB); can achieve SA * log log(SB) • For A AND B AND C: order by smaller lists and intersect pairwise
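Method 2 as a sketch using Python's `bisect` for the binary search, with the Johns/Hopkins postings from the slide:

```python
# Method 2: when S_A << S_B, binary-search each member of the smaller
# list in the larger one, costing about S_A * log2(S_B) comparisons.
from bisect import bisect_left

def and_bsearch(small, big):
    out = []
    for x in small:
        pos = bisect_left(big, x)          # leftmost insertion point
        if pos < len(big) and big[pos] == x:
            out.append(x)                  # x actually occurs in big
    return out

johns   = [39, 227]                               # smaller index
hopkins = [1, 5, 25, 28, 39, 45, 58, 96, 156]     # larger index
assert and_bsearch(johns, hopkins) == [39]
```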

  16. Proximity Search • Document-level indexes are not adequate • Option 1: index position offsets into the corpus (e.g. occurrences of Anthony, Johns, Hopkins across Doc 1 … Doc i) rather than document numbers • Before: match if ptrA = ptrB • Now: "A B" = match if ptrA = ptrB - 1 • A w/10 B = match if |ptrA - ptrB| <= 10 • Cost: size of index approaches size of corpus
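A sketch of proximity matching over position-level postings; the position offsets are hypothetical:

```python
# Proximity match over position-level postings: each list holds corpus
# word offsets (illustrative values), not document numbers.
def adjacent(pos_a, pos_b):
    """'A B': some occurrence of B immediately follows one of A."""
    return any(pb == pa + 1 for pa in pos_a for pb in pos_b)

def within(pos_a, pos_b, k):
    """A w/k B: some pair of occurrences within +/- k word positions."""
    return any(abs(pa - pb) <= k for pa in pos_a for pb in pos_b)

johns_pos   = [102, 540]    # hypothetical offsets of "Johns"
hopkins_pos = [103, 918]    # hypothetical offsets of "Hopkins"
assert adjacent(johns_pos, hopkins_pos)     # phrase "Johns Hopkins"
assert within(johns_pos, hopkins_pos, 10)   # Johns w/10 Hopkins
```

A production index would intersect sorted position lists rather than use this quadratic scan, but the matching conditions are the ones from the slide.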

  17. Variations 1 • Don't index function words (e.g. index "Johns" and "Hopkins" but not "The" in "The Johns Hopkins") • Do a linear match search in the corpus for the unindexed words • Savings of roughly 50% on index size • Potential speed improvement given data access costs

  18. Variations 2: Multilevel Indexes • Two levels: a document-level index, then a position-level index within each document (e.g. occurrences of Anthony, Johns, Hopkins) • Supports parallel search • May have a paging cost advantage • Cost: large index, N + dV (avg. doc/vocabulary size)

  19. Interpolation Search • Useful when data are numeric and uniformly distributed • Example: an index of 100 cells holding values ranging from 0 … 1000; goal: find the value 211 • Binary search: begin looking at cell 50 • Interpolation search: can we make a better guess for the first cell to examine? (211 is about 21% of the way through the value range, suggesting cell 21)

  20. Interpolation Search • Binary search: Bsearch(low, high, key): mid = (high + low) / 2; if key = A[mid], return mid; else if key < A[mid], Bsearch(low, mid-1, key); else Bsearch(mid+1, high, key) • Interpolation search: Isearch(low, high, key): mid = best estimate of position = low + (high - low) * (expected % of the way through the range)
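Both searches as runnable Python sketches over a uniformly spaced array (the array values are illustrative):

```python
# Binary vs. interpolation search on a sorted array; interpolation
# estimates the probe position from the key's expected fraction of the
# value range, which pays off when values are uniformly distributed.
def bsearch(a, key, low, high):
    if low > high:
        return -1
    mid = (low + high) // 2
    if key == a[mid]:
        return mid
    if key < a[mid]:
        return bsearch(a, key, low, mid - 1)
    return bsearch(a, key, mid + 1, high)

def isearch(a, key, low, high):
    if low > high or key < a[low] or key > a[high]:
        return -1
    if a[high] == a[low]:
        return low if a[low] == key else -1
    # mid = low + (high - low) * (expected fraction through the range)
    mid = low + (high - low) * (key - a[low]) // (a[high] - a[low])
    if key == a[mid]:
        return mid
    if key < a[mid]:
        return isearch(a, key, low, mid - 1)
    return isearch(a, key, mid + 1, high)

a = list(range(0, 1000, 10))   # 100 uniformly spaced values
assert bsearch(a, 210, 0, len(a) - 1) == 21
assert isearch(a, 210, 0, len(a) - 1) == 21
```

On this array interpolation search lands on cell 21 in its very first probe, matching the "go directly to the expected region" behavior on the next slide.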

  21. Comparison • Typical sequence of cells tested: • Binary search: 50, 25, 12, 18, 22, 21, 19 • Interpolation search: 21, 19 → goes directly to the expected region • Expected cost: log log(N) probes

  22. Cost of Computing an Inverted Index • Simple approach: emit (word, position) pairs and sort → N log N for corpus size N • If N >> memory size: • Tokenize (words → integers) • Create a histogram of token counts • Allocate space in the index • Do a multipass (K-pass) scan through the corpus, on each pass adding only the tokens in that pass's bin

  23. K-pass Indexing • Partition the vocabulary into K bins (e.g. K = 2: W1, W2 indexed in pass 1; W3, W4 in pass 2) • Time ≈ KN, but a big win over N log N on paging
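A toy sketch of K-pass indexing; the round-robin split of the vocabulary into bins is an illustrative choice (a real indexer would size bins to fit memory):

```python
# K-pass indexing sketch: partition the vocabulary into K bins and scan
# the corpus K times, adding only that pass's tokens each time, so each
# bin's postings are built entirely in memory.
from collections import defaultdict

def k_pass_index(corpus_tokens, k):
    vocab = sorted(set(corpus_tokens))
    bins = [set(vocab[i::k]) for i in range(k)]   # round-robin vocab split
    index = defaultdict(list)
    for b in bins:                                # one pass per bin
        for pos, tok in enumerate(corpus_tokens):
            if tok in b:
                index[tok].append(pos)            # record word position
    return dict(index)

tokens = ["to", "be", "or", "not", "to", "be"]
idx = k_pass_index(tokens, 2)
assert idx["to"] == [0, 4]
assert idx["be"] == [1, 5]
```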

  24. Vector Models for IR • Gerard Salton, Cornell: (Salton + Lesk, 68), (Salton, 71), (Salton + McGill, 83) • SMART system ("Salton's Magical Automatic Retrieval Tool"?) • Chris Buckley, Cornell → current keeper of the flame

  25. Vector Models for IR • Boolean model: document vectors (V1, V2) of 0/1 bits, one per term (word, stem, or special compound) • SMART vector model: real-valued term weights, e.g. Doc V1 = (1.0, 3.5, 4.6, 0.1, 0.0, 0.0), Doc V2 = (0.0, 0.0, 0.0, 0.1, 4.0, 0.0) • SMART vectors are composed of real-valued term weights, NOT simply Boolean term-present-or-not

  26. Example • Term weights per document (terms: Comput*, C++, Sparc, genome, Biolog*, protein, Compiler, DNA): • Doc V1: 3, 5, 4, 1, 0, 1, 0, 0 • Doc V2: 1, 0, 0, 0, 5, 3, 1, 4 • Doc V3: 2, 8, 0, 1, 0, 1, 0, 0 • Issues: • How are weights determined? (simple options: raw frequency; frequency weighted by region, titles, keywords) • Which terms to include? Stoplists • Stem or not?

  27. Queries and Documents Share the Same Vector Representation • Given query Q → map it to vector VQ and find the document Di for which sim(Vi, VQ) is greatest

  28. Similarity Functions • Many other options available (Dice, Jaccard) • Cosine similarity is self-normalizing: can use arbitrary integer values (weights don't need to be probabilities) • e.g. V1 = (100, 200, 300, 50), V2 = (1, 2, 3, 0.5), V3 = (10, 20, 30, 5) all point in the same direction, so cosine treats them as identical
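Cosine similarity as a short sketch, checked on the proportional vectors from the slide:

```python
# Cosine similarity: dot product over the product of vector lengths.
# Proportional vectors score 1.0 regardless of magnitude, which is why
# the weights need not be normalized into probabilities.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

v1 = [100, 200, 300, 50]
v2 = [1, 2, 3, 0.5]
v3 = [10, 20, 30, 5]
assert abs(cosine(v1, v2) - 1.0) < 1e-9   # same direction → similarity 1
assert abs(cosine(v1, v3) - 1.0) < 1e-9
```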
