Introduction to Information Retrieval: Database vs. Technologies

Intro to Information Retrieval
By the end of the lecture you should be able to:
• explain the differences between database and information retrieval technologies
• describe the basic maths underlying set-theoretic and vector models of classical IR.

Reminder: efficiency is vital
• Google finds documents which match your keywords; this must be done EFFICIENTLY. It can't just go through each document from start to end for each keyword.
• So the cache stores a copy of the document, and also a “cut-down” version of the document for searching: just a “bag of words”, a sorted list (or array/vector/…) of the words appearing in the document (with links back to the full document).
• Try to match keywords against this list; if found, then return the full document.
• Even cleverer: dictionary and inverted file…
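The bag-of-words idea can be sketched as follows. This is a minimal illustration, not how any real search engine is implemented: the `cache` contents, the tokenising rule and the function names are all assumptions made for the example. Keeping each bag sorted means a keyword can be found by binary search rather than by scanning the document from start to end.

```python
import re
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in `text`."""
    return sorted(set(re.findall(r"[a-z0-9]+", text.lower())))

def contains(bag, keyword):
    """Binary search for `keyword` in a sorted bag of words."""
    i = bisect_left(bag, keyword)
    return i < len(bag) and bag[i] == keyword

# Hypothetical cache: document id -> full document text.
cache = {
    "doc1": "Recipe for jam pudding",
    "doc2": "DoT report on traffic lanes",
    "doc3": "Radio item on traffic jam in Pudding Lane",
}

# The "cut-down" version of each document, linked back by document id.
bags = {doc_id: bag_of_words(text) for doc_id, text in cache.items()}

def search(keyword):
    """Return the full text of every document whose bag contains `keyword`."""
    keyword = keyword.lower()
    return [cache[doc_id] for doc_id in cache if contains(bags[doc_id], keyword)]

print(search("jam"))
```

Note that this sketch does no stemming, so a search for "lane" does not match "lanes".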

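Inverted file structure
[Figure: three linked structures. The dictionary lists each term with its number of postings: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), … Each dictionary entry points into the inverted (postings) file, a list of document numbers (1 2 | 1 2 3 | 2 | 2 3 4 | …), whose entries point to the documents in the data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, …]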
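A small sketch of a dictionary plus postings file may help here. It is a simplification under stated assumptions: both structures are plain in-memory Python dicts, the three sample documents are hypothetical, and the function names are invented for the example. The dictionary maps each term to its number of postings, and the postings map each term to the sorted list of documents containing it.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc number -> list of words.

    Returns (dictionary, postings): dictionary maps term -> number of
    postings, postings maps term -> sorted list of doc numbers.
    """
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            postings[term].append(doc_id)
    dictionary = {term: len(ids) for term, ids in postings.items()}
    return dictionary, dict(postings)

def docs_with_all(terms, postings):
    """Doc numbers containing every term: intersect the postings lists."""
    result = None
    for term in terms:
        ids = set(postings.get(term, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

# Hypothetical data file: doc number -> words already extracted from it.
docs = {
    1: ["jam", "pudding", "recipe"],
    2: ["traffic", "lane", "report"],
    3: ["traffic", "jam", "pudding", "lane"],
}

dictionary, postings = build_inverted_index(docs)
print(dictionary["jam"], postings["jam"])
print(docs_with_all(["traffic", "jam"], postings))
```

The point of the structure is that a query only touches the postings of its own keywords, never the documents that contain none of them.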

IR vs DBMS

informal introduction
• IR was developed for bibliographic systems. We shall refer to ‘documents’, but the technique extends beyond items of text.
• Central to IR is the representation of a document by a set of ‘descriptors’ or ‘index terms’ (“words in the document”).
• Searching for a document is carried out (mainly) in the ‘space’ of index terms.
• We need a language for formulating queries, and a method for matching queries with document descriptors.

architecture
[Figure: system architecture with the components user, query, query matching, hits, feedback, learning component, and object base (objects and their descriptions).]

basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j}$ to be a weight associated with the ith keyword and the jth document. For the jth document, we define an index term vector, $d_j$:

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$$

For example: D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}
• d1 = (1, 1, 0, 0, 0): recipe for jam pudding
• d2 = (0, 0, 1, 1, 0): DoT report on traffic lanes
• d3 = (1, 1, 1, 1, 0): radio item on traffic jam in Pudding Lane
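As a sketch of how binary index term vectors like these could be produced, the snippet below maps a document's terms onto the ordered term list T. The function name is invented for the example, and it assumes each document has already been reduced to normalised index terms (for instance "lanes" to "lane"), a step the slide does not describe.

```python
# Ordered list of index terms, T.
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def index_term_vector(doc_terms, index_terms=T):
    """Binary weights: w_ij = 1 if index term i occurs in document j, else 0."""
    present = set(doc_terms)
    return tuple(1 if term in present else 0 for term in index_terms)

# Assumed pre-normalised terms for the three example documents.
d1 = index_term_vector(["jam", "pudding"])                     # jam pudding recipe
d2 = index_term_vector(["traffic", "lane"])                    # DoT report
d3 = index_term_vector(["traffic", "jam", "pudding", "lane"])  # radio item

print(d1, d2, d3)
```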

set theoretic, Boolean model
• Queries are Boolean expressions formed using keywords, eg: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
• Query is re-expressed in disjunctive normal form (DNF) over T = {pudding, jam, traffic, lane, treacle}, eg: $q_{DNF}$ = (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1)
• To match a document with a query: sim(d, $q_{DNF}$) = 1 if d is equal to a component of $q_{DNF}$, and 0 otherwise
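The DNF components and the exact-match rule can be checked with a short sketch. It takes a brute-force route that is an assumption of this example, not a method given in the lecture: enumerate all 2^n binary vectors over T and keep those that satisfy the query. That is only practical for a tiny term list, but it makes the meaning of the DNF components concrete.

```python
from itertools import product

T = ["pudding", "jam", "traffic", "lane", "treacle"]

def query(vector):
    """(jam OR treacle) AND pudding AND NOT lane AND NOT traffic."""
    w = dict(zip(T, vector))
    return bool((w["jam"] or w["treacle"]) and w["pudding"]
                and not w["lane"] and not w["traffic"])

def to_dnf(predicate, n):
    """All binary vectors of length n that satisfy the predicate."""
    return {v for v in product((0, 1), repeat=n) if predicate(v)}

def sim(d, q_dnf):
    """Exact match: 1 if d equals a component of q_dnf, otherwise 0."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = to_dnf(query, len(T))

d1 = (1, 1, 0, 0, 0)  # jam pudding recipe
d2 = (0, 0, 1, 1, 0)  # DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)  # radio item on traffic jam in Pudding Lane

print(sorted(q_dnf))
print(sim(d1, q_dnf), sim(d2, q_dnf), sim(d3, q_dnf))
```

Only d1 matches; d3 mentions jam and pudding but is excluded by the negated terms, and there is no notion of a partial match.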

[Figure: diagram over the index terms T = {pudding, jam, traffic, lane, treacle}, relating the DNF query (1, 1, 0, 0, 0) ∨ (1, 0, 0, 0, 1) ∨ (1, 1, 0, 0, 1) to the documents d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0).]

collecting results
T = {pudding, jam, traffic, lane, treacle}
• Query: (‘Jam’ ∨ ‘Treacle’) ∧ ‘Pudding’ ∧ ¬‘Lane’ ∧ ¬‘Traffic’
• Answer: d1 = (1, 1, 0, 0, 0), the jam pud recipe
[Figure: diagram of the terms treacle, pudding, jam, traffic and lane illustrating the query.]

Statistical vector model
• weights, 0 <= $w_{i,j}$ <= 1, no longer binary-valued
• query also represented by a vector q = ($w_{1,q}$, $w_{2,q}$, …, $w_{n,q}$)
• eg q = (1.0, 0.6, 0.0, 0.0, 0.8) (cf. T = {pudding, jam, traffic, lane, treacle})
• to match the jth document with a query:

$$\mathrm{sim}(d_j, q) = \frac{\sum_{i=1}^{n} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{n} w_{i,q}^2}}$$

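[Figure: document vector D1 = ($w_{1,1}$, $w_{2,1}$) and query vector Q = ($w_{1,q}$, $w_{2,q}$) plotted against term axes T1 and T2, with angle θ between them.]

$$\text{Cosine coefficient} = \frac{\sum_{i=1}^{n} w_{i,j} \times w_{i,q}}{\sqrt{\sum_{i=1}^{n} w_{i,j}^2} \times \sqrt{\sum_{i=1}^{n} w_{i,q}^2}} = \cos(\theta)$$

[Figure: D1 and Q pointing in the same direction, θ = 0.] Cosine coefficient = cos(0) = 1

[Figure: D1 along T1 and Q along T2, so $w_{2,1}$ = 0, $w_{1,q}$ = 0 and θ = 90º.] Cosine coefficient = cos(90º) = 0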
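The cosine coefficient and its two limiting cases can be sketched in a few lines. The function name and the two-term example vectors are assumptions chosen to mirror the diagrams; returning 0.0 for an all-zero vector is also a choice made for this sketch, since the coefficient is undefined there.

```python
import math

def cosine(d, q):
    """Cosine of the angle between document vector d and query vector q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

# Same direction (theta = 0): the vector lengths differ, the angle does not.
same_direction = cosine((0.5, 0.5), (1.0, 1.0))

# D1 along T1 and Q along T2 (theta = 90 degrees): no term in common.
orthogonal = cosine((1.0, 0.0), (0.0, 1.0))

print(same_direction, orthogonal)
```

Because the coefficient depends only on the angle, scaling all the weights of a document up or down does not change its score.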

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2) Jam pud recipe

$$\sum_{i=1}^{n} w_{i,1} \times w_{i,q} = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$$

$$\sum_{i=1}^{n} w_{i,1}^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$$

$$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$$

$$\mathrm{sim}(d_1, q) = \frac{1.44}{\sqrt{1.32} \times \sqrt{2.0}} = 0.89$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0) DoT Report

$$\sum_{i=1}^{n} w_{i,2} \times w_{i,q} = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$$

$$\sum_{i=1}^{n} w_{i,2}^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$$

$$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$$

$$\mathrm{sim}(d_2, q) = \frac{0.0}{\sqrt{1.45} \times \sqrt{2.0}} = 0.0$$

q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0) Radio Traffic Report

$$\sum_{i=1}^{n} w_{i,3} \times w_{i,q} = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$$

$$\sum_{i=1}^{n} w_{i,3}^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$$

$$\sum_{i=1}^{n} w_{i,q}^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$$

$$\mathrm{sim}(d_3, q) = \frac{1.14}{\sqrt{2.53} \times \sqrt{2.0}} = 0.51$$

collecting results
T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)

Rank  Document vector                  Document               sim
1.    d1 = (0.8, 0.8, 0.0, 0.0, 0.2)   Jam pud recipe         0.89
2.    d3 = (0.6, 0.9, 1.0, 0.6, 0.0)   Radio Traffic Report   0.51
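The ranking can be reproduced with a short sketch that scores each document vector against q and sorts by descending similarity. The function names are invented for the example, and dropping documents with a score of zero is an assumption made so that the output lists only the two ranked documents.

```python
import math

def cosine(d, q):
    """Cosine coefficient between two weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norms = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norms if norms else 0.0

def rank(docs, q):
    """Return (name, sim) pairs with sim > 0, best match first."""
    scored = [(name, cosine(vector, q)) for name, vector in docs.items()]
    hits = [(name, score) for name, score in scored if score > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

# T = (pudding, jam, traffic, lane, treacle)
q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # radio traffic report
}

ranking = rank(docs, q)
for position, (name, score) in enumerate(ranking, start=1):
    print(f"{position}. {name} ({score:.2f})")
```

Unlike the Boolean model, d3 is still returned as a partial match, just with a lower score than d1.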

Discussion: Set theoretic model
• The Boolean model is simple and its queries have precise semantics, but it is an ‘exact match’ model and does not rank results.
• The Boolean model is popular with bibliographic systems and is available on some search engines.
• Users find Boolean queries hard to formulate.
• There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.

Discussion: Vector Model
• The vector model is simple and fast, and results show that it leads to ‘good’ retrieval.
• Partial matching leads to ranked output.
• It is a popular model with search engines.
• It rests on an underlying assumption of term independence (not realistic! Phrases, collocations, grammar).
• The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).

questions raised
• Where do the index terms come from? (ALL the words in the source documents?)
• What determines the weights?
• How well can we expect these systems to work for practical applications?
• How can we improve them?
• How do we integrate IR into more traditional DB management?

Questions to think about
• Why is a traditional database unsuited to retrieval of unstructured information?
• How would you re-express a Boolean query, eg (A or B or (C and not D)), in disjunctive normal form?
• For the matching coefficient, sim(., .), show that 0 <= sim(., .) <= 1, and that sim(a, a) = 1.
• Compare and contrast the ‘vector’ and ‘set theoretic’ models in terms of power of representation of documents and queries.
