Intro to Information Retrieval
By the end of the lecture you should be able to:
• explain the differences between database and information retrieval technologies
• describe the basic maths underlying set-theoretic and vector models of classical IR.
Reminder: efficiency is vital
• Google finds documents which match your keywords; this must be done EFFICIENTLY. It can't just go through each document from start to end for each keyword.
• So the cache stores a copy of the document, and also a “cut-down” version of the document for searching: just a “bag of words”, a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document).
• Try to match keywords against this list; if found, then return the full document.
• Even cleverer: dictionary and inverted file…
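The "bag of words" idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `cache` dictionary, the helper names `bag_of_words`, `contains` and `search`, and the two sample texts are hypothetical, and a real search engine does far more than this.

```python
import bisect
import re

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in text."""
    return sorted(set(re.findall(r"[a-z]+", text.lower())))

def contains(bag, keyword):
    """Binary-search the sorted bag instead of scanning the whole document."""
    keyword = keyword.lower()
    i = bisect.bisect_left(bag, keyword)
    return i < len(bag) and bag[i] == keyword

# Hypothetical cache: document id -> (full text, cut-down bag of words).
cache = {}
for doc_id, text in {
    "d1": "Recipe for jam pudding",
    "d2": "DoT report on traffic lanes",
}.items():
    cache[doc_id] = (text, bag_of_words(text))

def search(keyword):
    """Return the full text of every cached document whose bag holds keyword."""
    return [text for text, bag in cache.values() if contains(bag, keyword)]
```

Because each bag is sorted, a keyword lookup costs a binary search per document rather than a pass over the full text. Note that this sketch matches whole words only, so "lane" does not match "lanes".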
Inverted file structure
[Figure: three linked structures. Dictionary: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), … Inverted or postings file: 1 2 1 2 3 2 2 3 4 … Data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, … A further numeric column reads 1 3 6 7 9 …]
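A minimal sketch of an inverted file, assuming the dictionary is a Python `dict` that maps each term to its postings list (the sorted ids of the documents containing it). The function names and the three sample documents, borrowed from the titles used later in the lecture, are illustrative choices rather than the structure any particular system uses.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of ids of documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    # Sort the dictionary by term and each postings list by document id.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Data file: document id -> full document (hypothetical sample data).
docs = {
    1: "Recipe for jam pudding",
    2: "DoT report on traffic lanes",
    3: "Radio item on traffic jam in Pudding Lane",
}

index = build_inverted_index(docs)

def lookup(index, docs, term):
    """Follow the postings for term back to the full documents."""
    return [docs[doc_id] for doc_id in index.get(term.lower(), [])]
```

The length of a postings list, for example `len(index["traffic"])`, corresponds to the count shown in brackets next to each dictionary term in the figure, assuming that count is the number of documents containing the term.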
Informal introduction
• IR was developed for bibliographic systems. We shall refer to ‘documents’, but the technique extends beyond items of text.
• Central to IR is the representation of a document by a set of ‘descriptors’ or ‘index terms’ (“words in the document”).
• Searching for a document is carried out (mainly) in the ‘space’ of index terms.
• We need a language for formulating queries, and a method for matching queries with document descriptors.
Architecture
[Figure: block diagram with the components user, query, query matching, hits, feedback, learning component, and object base (objects and their descriptions).]
Basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j}$ to be a weight associated with the ith keyword and the jth document.
For the jth document, we define an index term vector, $d_j$:
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}
d1 = (1, 1, 0, 0, 0): Recipe for jam pudding
d2 = (0, 0, 1, 1, 0): DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0): Radio item on traffic jam in Pudding Lane
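The example vectors can be reproduced with a short Python sketch. It assumes binary weights and a very crude normalisation step (lower-casing and dropping a final "s", so that "lanes" matches the index term "lane"); `normalise` and `index_term_vector` are hypothetical helper names, not part of the lecture.

```python
T = ("pudding", "jam", "traffic", "lane", "treacle")

def normalise(word):
    """Crude stand-in for stemming (an assumption): lower-case, drop a final 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def index_term_vector(text, terms=T):
    """Binary weights: component i is 1 if the i-th index term occurs in text."""
    words = {normalise(w) for w in text.split()}
    return tuple(1 if t in words else 0 for t in terms)

D = {
    "d1": "Recipe for jam pudding",
    "d2": "DoT report on traffic lanes",
    "d3": "Radio item on traffic jam in Pudding Lane",
}

vectors = {name: index_term_vector(text) for name, text in D.items()}
```

Each vector has one component per index term in T, in the same order, so the vectors of different documents can be compared position by position.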
Set theoretic, Boolean model
• Queries are Boolean expressions formed using keywords, e.g.:
  (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
• The query is re-expressed in disjunctive normal form (DNF), cf. T = {pudding, jam, traffic, lane, treacle}, e.g.:
  (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
• To match a document with a query:
  sim(d, qDNF) = 1 if d is equal to a component of qDNF
  sim(d, qDNF) = 0 otherwise
(1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
T = {pudding, jam, traffic, lane, treacle}
[Figure: diagram with regions labelled treacle, pudding, jam, traffic and lane.]
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)
Collecting results
T = {pudding, jam, traffic, lane, treacle}
• Query: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
[Figure: diagram with regions labelled treacle, pudding, jam, traffic and lane, annotated with the query (jam ∨ treacle) ∧ (pudding) ∧ ¬(lane) ∧ ¬(traffic).]
Answer: d1 = (1, 1, 0, 0, 0), Jam pud recipe
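The Boolean matching rule can be sketched in Python as follows. The DNF is obtained here by brute force, enumerating every binary vector over T and keeping those that satisfy the query; that is an assumption made for illustration (it is only practical for a handful of terms), and `to_dnf`, `query` and `sim` are hypothetical names.

```python
from itertools import product

T = ("pudding", "jam", "traffic", "lane", "treacle")

def query(pudding, jam, traffic, lane, treacle):
    """(jam OR treacle) AND pudding AND NOT lane AND NOT traffic."""
    return bool((jam or treacle) and pudding and not lane and not traffic)

def to_dnf(predicate, n):
    """Full DNF as the set of binary vectors of length n satisfying predicate."""
    return {bits for bits in product((0, 1), repeat=n) if predicate(*bits)}

def sim(d, q_dnf):
    """Exact match: 1 if d equals a component of the DNF query, else 0."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = to_dnf(query, len(T))

docs = {
    "d1": (1, 1, 0, 0, 0),
    "d2": (0, 0, 1, 1, 0),
    "d3": (1, 1, 1, 1, 0),
}

hits = [name for name, d in docs.items() if sim(d, q_dnf) == 1]
```

Here `q_dnf` contains exactly the three components listed in the lecture, and `hits` contains only d1. Document d3 mentions jam and pudding but is rejected because it also contains traffic and lane, which shows the exact-match nature of the model.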
Statistical vector model
• Weights, $0 \le w_{i,j} \le 1$, are no longer binary-valued.
• The query is also represented by a vector, $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$, e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8), cf. T = {pudding, jam, traffic, lane, treacle}
• To match the jth document with a query:
$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}}$$
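Cosine coefficient
$$\frac{\sum_{i=1}^{n} w_{i,j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \times \sqrt{\sum_{i=1}^{n} w_{i,q}^{2}}} = \cos(\theta)$$
[Figure: document vector D1 with components w11, w21 and query vector Q with components w1q, w2q, drawn against axes T1 and T2, with angle θ between them.]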
Cosine coefficient = cos(0) = 1
[Figure: D1 and Q drawn against axes T1 and T2, pointing in the same direction, θ = 0.]
Cosine coefficient = cos(90º) = 0
[Figure: D1 drawn along axis T1 (w21 = 0) and Q drawn along axis T2 (w1q = 0), θ = 90º.]
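A minimal Python sketch of the cosine coefficient, using only the standard library. The function name `cosine` and the two-term test vectors are illustrative assumptions; returning 0.0 for an all-zero vector is also a choice made here, since the formula is undefined in that case.

```python
import math

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    # Assumption: treat a zero-length vector as matching nothing.
    return dot / norm if norm else 0.0

# Two-term examples mirroring the special cases above (hypothetical weights).
same_direction = cosine((1.0, 2.0), (2.0, 4.0))  # theta = 0, approximately 1
orthogonal = cosine((3.0, 0.0), (0.0, 2.0))      # theta = 90 degrees, exactly 0
```

Only the direction of the vectors matters: scaling every weight of a document by the same factor leaves its similarity to the query unchanged.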
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2), Jam pud recipe
$$\sum_{i=1}^{n} w_{i,1}\, w_{i,q} = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$$
$$\sum_{i=1}^{n} w_{i,1}^{2} = 0.8^{2} + 0.8^{2} + 0.0^{2} + 0.0^{2} + 0.2^{2} = 1.32$$
$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$
$$\mathrm{sim}(d_1, q) = \frac{1.44}{\sqrt{1.32} \times \sqrt{2.0}} = 0.89$$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0), DoT Report
$$\sum_{i=1}^{n} w_{i,2}\, w_{i,q} = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$$
$$\sum_{i=1}^{n} w_{i,2}^{2} = 0.0^{2} + 0.0^{2} + 0.9^{2} + 0.8^{2} + 0.0^{2} = 1.45$$
$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$
$$\mathrm{sim}(d_2, q) = \frac{0.0}{\sqrt{1.45} \times \sqrt{2.0}} = 0.0$$
q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0), Radio Traffic Report
$$\sum_{i=1}^{n} w_{i,3}\, w_{i,q} = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$$
$$\sum_{i=1}^{n} w_{i,3}^{2} = 0.6^{2} + 0.9^{2} + 1.0^{2} + 0.6^{2} + 0.0^{2} = 2.53$$
$$\sum_{i=1}^{n} w_{i,q}^{2} = 1.0^{2} + 0.6^{2} + 0.0^{2} + 0.0^{2} + 0.8^{2} = 2.0$$
$$\mathrm{sim}(d_3, q) = \frac{1.14}{\sqrt{2.53} \times \sqrt{2.0}} = 0.51$$
Collecting results
cf. T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)
Rank | Document vector | Document | sim
1. | d1 = (0.8, 0.8, 0.0, 0.0, 0.2) | Jam pud recipe | 0.89
2. | d3 = (0.6, 0.9, 1.0, 0.6, 0.0) | Radio Traffic Report | 0.51
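The ranked list can be reproduced with a short Python sketch using the lecture's query and document vectors. Dropping documents whose similarity is zero is an assumption made here to match the list above, where d2 does not appear; `cosine` and `rank` are hypothetical names.

```python
import math

def cosine(d, q):
    """Cosine coefficient between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # Jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT Report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # Radio Traffic Report
}

def rank(docs, q):
    """Score every document, drop zero scores (assumption), sort best first."""
    scored = [(name, cosine(d, q)) for name, d in docs.items()]
    matches = [pair for pair in scored if pair[1] > 0]
    return sorted(matches, key=lambda pair: pair[1], reverse=True)

ranking = [(name, round(score, 2)) for name, score in rank(docs, q)]
```

Unlike the Boolean model, d3 is returned here as a partial match: it shares pudding and jam with the query but is ranked below d1 because much of its weight lies on traffic and lane.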
Discussion: Set theoretic model
• The Boolean model is simple and queries have precise semantics, but it is an ‘exact match’ model and does not rank results.
• The Boolean model is popular with bibliographic systems and is available on some search engines.
• Users find Boolean queries hard to formulate.
• There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
Discussion: Vector Model
• The vector model is simple and fast, and in practice it leads to ‘good’ results.
• Partial matching leads to ranked output.
• It is a popular model with search engines.
• It rests on an underlying assumption of term independence (not realistic: phrases, collocations, grammar).
• The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
Questions raised
• Where do the index terms come from? (ALL the words in the source documents?)
• What determines the weights?
• How well can we expect these systems to work for practical applications?
• How can we improve them?
• How do we integrate IR into more traditional DB management?
Questions to think about
• Why is a traditional database unsuited to the retrieval of unstructured information?
• How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
• For the matching coefficient sim(., .), show that 0 <= sim(., .) <= 1, and that sim(a, a) = 1.
• Compare and contrast the ‘vector’ and ‘set theoretic’ models in terms of their power to represent documents and queries.