
Introduction to Digital Libraries Information Retrieval




Presentation Transcript


  1. Introduction to Digital Libraries Information Retrieval

  2. Performance of search • 3 major classes of performance measures • precision / recall • TREC conference series, http://trec.nist.gov/ • space / time • see Esler & Nelson, JNCA, for an example • http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf • usability • probably the most important measure, but largely ignored

  3. Precision and Recall • Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved) • Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in the database)
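As a concrete illustration, precision and recall can be computed directly from the set of retrieved document ids and the set of relevant document ids. A minimal sketch in Python; the document ids are made up for the example:

    # Precision and recall from a retrieved set and a relevant set (illustrative sketch).
    def precision_recall(retrieved, relevant):
        retrieved, relevant = set(retrieved), set(relevant)
        hits = retrieved & relevant                       # relevant documents that were retrieved
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Example with made-up document ids:
    p, r = precision_recall(retrieved={1, 2, 3, 4, 5}, relevant={2, 4, 6, 7})
    # 2 of the 5 retrieved are relevant -> precision 0.4; 2 of the 4 relevant were found -> recall 0.5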

  4. Standard Evaluation Measures • Starts with a CONTINGENCY table:

                        retrieved      not retrieved
       relevant             w                x           n1 = w + x
       not relevant         y                z
                        n2 = w + y                        N

  5. Precision and Recall • From all the documents that are relevant out there, how many did the IR system retrieve? Recall = w / (w + x) • From all the documents that are retrieved by the IR system, how many are relevant? Precision = w / (w + y)

  6. User-Centered IR Evaluation • More user-oriented measures • Satisfaction, informativeness • Other types of measures • Time, cost-benefit, error rate, task analysis • Evaluation of user characteristics • Evaluation of interface • Evaluation of process or interaction

  7. Technical View: Retrieval as Matching Documents to Queries [diagram: the document space and the query space are each reduced to surrogates (terms, vectors, samples, query forms A, B, etc.) that are compared by a match algorithm] Retrieval is algorithmic. Evaluation is typically a binary decision for each pairwise match and one or more aggregate values for a set of matches (e.g., recall and precision).

  8. Human View: Information-Seeking Process [diagram labels: problem, perceived needs, queries, actions, physical interface, indexes, data, results] Information seeking is an active, iterative process controlled by a human who changes throughout the process. Evaluation is relative to human needs.

  9. IR Models • User task: Retrieval (ad hoc, filtering) and Browsing • Classic models: Boolean, Vector, Probabilistic • Set theoretic: Fuzzy, Extended Boolean • Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks • Probabilistic: Inference Network, Belief Network • Structured models: Non-Overlapping Lists, Proximal Nodes • Browsing: Flat, Structure Guided, Hypertext

  10. “Classic” Retrieval Models • Boolean • Documents and queries are sets of index terms • Vector • Documents and queries are vectors in N-dimensional space • Probabilistic • Based on probability theory

  11. Boolean Searching • Exactly what you would expect • and, or, not operations defined • requires an exact match • based on inverted file • (computer and science) and (not(animals)) would prevent a document with “use of computers in animal science research” from being retrieved
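A minimal sketch of the idea in Python, assuming a toy corpus: the inverted file maps each term to the set of documents that contain it, and AND/OR/NOT become set intersection, union, and complement. The helper name and documents are illustrative, and the query terms are adjusted to the exact word forms in the toy documents (exact matching is itself the slide's point):

    from collections import defaultdict

    def build_inverted_file(docs):
        """Map each term to the set of document ids that contain it."""
        index = defaultdict(set)
        for doc_id, text in docs.items():
            for term in text.lower().split():
                index[term].add(doc_id)
        return index

    docs = {1: "use of computers in animal science research",
            2: "computers in materials science research"}
    index = build_inverted_file(docs)
    all_ids = set(docs)

    # (computers AND science) AND NOT animal  -- exact-match set operations
    result = (index["computers"] & index["science"]) & (all_ids - index["animal"])
    # result -> {2}: document 1 is about computer science but is excluded because it mentions "animal",
    # mirroring the slide's example.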

  12. Boolean ‘AND’ • Information AND Retrieval [Venn diagram: the result is the intersection of the set of documents containing “Information” and the set containing “Retrieval”]

  13. Example • Draw a Venn diagram for: Care and feeding and (cats or dogs) • What is the meaning of: Information and retrieval and performance or evaluation

  14. Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information ∧ retrieval” • Q2 = “information ∧ ¬computer”
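One way to check the exercise is to reuse the build_inverted_file sketch from slide 11 on these four documents (a sketch, assuming Q1 means AND and Q2 means AND NOT):

    docs = {"D1": "computer information retrieval",
            "D2": "computer retrieval",
            "D3": "information",
            "D4": "computer information"}
    index = build_inverted_file(docs)      # from the slide-11 sketch
    all_ids = set(docs)

    q1 = index["information"] & index["retrieval"]              # information AND retrieval
    q2 = index["information"] & (all_ids - index["computer"])   # information AND NOT computer
    # q1 -> {"D1"};  q2 -> {"D3"}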

  15. Boolean-based Matching • Exact match systems: separate the documents containing a given term from those that do not. • Terms: Mediterranean, scholarships, horticulture, agriculture, cathedrals, adventure, disasters, leprosy, recipes, bridge, Venus, tennis, flags • [binary term-document incidence matrix: one 0/1 row per term, one column per document] • Example queries: flags AND tennis; leprosy AND tennis; Venus OR (tennis AND flags); (bridge OR flags) AND tennis

  16. Exercise ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))

  17. Boolean features • Order dependency of operators • ( ), NOT, AND, OR (DIALOG) • May differ on different systems • Nesting of search terms • Nutrition and (fast or junk) and food

  18. Boolean Limitations • Searches can become complex for the average user • too much ANDing can clobber recall • tricky syntax: “research AND NOT computer science”, “research AND NOT (computer science)” (implicit OR), and “research AND NOT (computer AND science)” are all different (frequently seen in NTRS logs)

  19. Vector Model • Calculate degree of similarity between document and query • Ranked output by sorting similarity values • Also called ‘vector space model’ • Imagine your documents as N-dimensional vectors (where N=number of words) • The “closeness” of 2 documents can be expressed as the cosine of the angle between the two vectors
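A minimal sketch of cosine similarity between two term-weight vectors, assuming plain Python lists of weights (no external libraries):

    import math

    def cosine(d, q):
        """Cosine of the angle between document vector d and query vector q."""
        dot = sum(dv * qv for dv, qv in zip(d, q))
        norm_d = math.sqrt(sum(dv * dv for dv in d))
        norm_q = math.sqrt(sum(qv * qv for qv in q))
        return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0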

  20. Vector Space Model • Documents and queries are points in N-dimensional space (where N is the number of unique index terms in the data collection) [figure: a document point D and a query point Q in this space]

  21. Vector Space Model with Term Weights • assume document terms have different values for retrieval • therefore assign weights to each term in each document • example: • proportional to frequency of term in document • inversely proportional to frequency of term in collection

  22. Graphic Representation • Example: D1 = 2T1 + 3T2 + 5T3, D2 = 3T1 + 7T2 + T3, Q = 0T1 + 0T2 + 2T3 [figure: D1, D2 and Q plotted as vectors on the axes T1, T2, T3] • Is D1 or D2 more similar to Q? • How to measure the degree of similarity? Distance? Angle? Projection?
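One way to answer, reusing the cosine sketch from slide 19 (values rounded):

    d1 = [2, 3, 5]   # D1 = 2T1 + 3T2 + 5T3
    d2 = [3, 7, 1]   # D2 = 3T1 + 7T2 + T3
    q  = [0, 0, 2]   # Q  = 0T1 + 0T2 + 2T3

    cosine(d1, q)    # 10 / (sqrt(38) * 2) ~= 0.81
    cosine(d2, q)    #  2 / (sqrt(59) * 2) ~= 0.13  -> D1 is more similar to Q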

  23. Document and Query Vectors • Documents and queries are vectors of terms • Vectors can use binary keyword weights or real-valued weights in [0, 1] (e.g., term frequencies) • Example terms: “dog”, “cat”, “house”, “sink”, “road”, “car” • Binary: (1, 1, 0, 0, 0, 0), (0, 0, 1, 1, 0, 0) • Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)

  24. Document Collection Representation • A collection of n documents can be represented in the vector space model by a term-document matrix:

              T1     T2    ...    Tt
        D1   w11    w21    ...   wt1
        D2   w12    w22    ...   wt2
        :     :      :             :
        Dn   w1n    w2n    ...   wtn

  An entry in the matrix corresponds to the “weight” of a term in the document; zero means the term has no significance in the document or simply does not occur in it.
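A small sketch of building such a matrix of raw term frequencies in Python; the toy documents are made up, and the sketch stores one row per document (the transpose of the slide's layout):

    from collections import Counter

    docs = ["information retrieval systems", "digital libraries information"]
    vocab = sorted({t for d in docs for t in d.split()})                # T1..Tt
    matrix = [[Counter(d.split())[t] for t in vocab] for d in docs]     # one row of weights per document
    # vocab  -> ['digital', 'information', 'libraries', 'retrieval', 'systems']
    # matrix -> [[0, 1, 0, 1, 1], [1, 1, 1, 0, 0]]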

  25. Inner Product: Example 1 [figure: documents d1-d7 plotted against index terms k1, k2, k3]

  26. Vector Space Example • Indexed words: factors, information, help, human, operation, retrieval, systems • Query: “human factors in information retrieval systems”, vector (1 1 0 1 0 1 1) • Record 1 contains human, factors, information, retrieval: vector (1 1 0 1 0 1 0) • Record 2 contains human, factors, help, systems: vector (1 0 1 1 0 0 1) • Record 3 contains factors, operation, systems: vector (1 0 0 0 1 0 1) • Simple match: Query x Rec1 = (1 1 0 1 0 1 0), score 4; Query x Rec2 = (1 0 0 1 0 0 1), score 3; Query x Rec3 = (1 0 0 0 0 0 1), score 2 • Weighted match (record weights Rec1 = (2 3 0 5 0 3 0), Rec2 = (2 0 4 5 0 0 1), Rec3 = (2 0 0 0 2 0 1)): Query x Rec1 = (2 3 0 5 0 3 0), score 13; Query x Rec2 = (2 0 0 5 0 0 1), score 8; Query x Rec3 = (2 0 0 0 0 0 1), score 3

  27. Term Weights: Term Frequency • More frequent terms in a document are more important, i.e. more indicative of the topic. fij = frequency of term i in document j • May want to normalize term frequency (tf) across the entire corpus: tfij = fij / max{fij}
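A sketch of the normalization, assuming (as is usual for this formula) that the maximum is taken over the term frequencies of the same document:

    def normalized_tf(freqs):
        """freqs: raw term frequencies f_ij for one document j; returns tf_ij = f_ij / max f_ij."""
        max_f = max(freqs.values())
        return {term: f / max_f for term, f in freqs.items()}

    normalized_tf({"information": 4, "retrieval": 2, "library": 1})
    # -> {'information': 1.0, 'retrieval': 0.5, 'library': 0.25}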

  28. Some formulas for Sim • Dot product • Cosine • Dice • Jaccard [figure: document vector D and query vector Q in the t1-t2 plane]
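The slide only names the measures; their usual definitions for a document vector D = (d_1, ..., d_t) and a query vector Q = (q_1, ..., q_t) are the standard weighted forms (not taken from the slide):

    Dot product: sim(D, Q) = Σ_i d_i q_i
    Cosine:      sim(D, Q) = Σ_i d_i q_i / ( sqrt(Σ_i d_i²) * sqrt(Σ_i q_i²) )
    Dice:        sim(D, Q) = 2 Σ_i d_i q_i / ( Σ_i d_i² + Σ_i q_i² )
    Jaccard:     sim(D, Q) = Σ_i d_i q_i / ( Σ_i d_i² + Σ_i q_i² - Σ_i d_i q_i )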

  29. Example • Documents: Austen's Sense and Sensibility (SAS) and Pride and Prejudice (PAP); Bronte's Wuthering Heights (WH) • cos(SAS, PAP) = .996 x .993 + .087 x .120 + .017 x 0.0 = 0.999 • cos(SAS, WH) = .996 x .847 + .087 x .466 + .017 x .254 = 0.929

  30. Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra

  31. The Idea • q_or = k_x ∨ k_y; w_xj = x and w_yj = y • sim(q_or, d_j) = sqrt((x² + y²) / 2) [figure: document d_j plotted at (x, y) in the (k_x, k_y) weight plane, with the corners (0,0) and (1,1) marked] • We want a document to be as far as possible from (0,0)
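A minimal sketch of the two-term, p = 2 case in Python. The AND form, sim(q_and, d_j) = 1 - sqrt(((1 - x)² + (1 - y)²) / 2), is the standard companion formula and is added for completeness; it is not shown on the slide:

    import math

    def ext_or(x, y):
        """Extended Boolean OR similarity, p = 2: distance of (x, y) from (0, 0)."""
        return math.sqrt((x**2 + y**2) / 2)

    def ext_and(x, y):
        """Extended Boolean AND similarity, p = 2: closeness of (x, y) to (1, 1)."""
        return 1 - math.sqrt(((1 - x)**2 + (1 - y)**2) / 2)

    ext_or(0.5, 0.5), ext_and(0.5, 0.5)   # both 0.5: partial matching yields a graded score, not just 0 or 1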

  32. Fuzzy Set Model • Queries and docs represented by sets of index terms: matching is approximate from the start • This vagueness can be modeled using a fuzzy framework, as follows: • with each term is associated a fuzzy set • each doc has a degree of membership in this fuzzy set • This interpretation provides the foundation for many models for IR based on fuzzy theory

  33. Probabilistic Model • Views retrieval as an attempt to answer a basic question: “What is the probability that this document is relevant to this query?” • expressed as P(REL|D), i.e. the probability of relevance REL given a particular document D

  34. Probabilistic Model • An initial set of documents is retrieved somehow • The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected) • The system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • Keep in mind that the description of the ideal answer set has to be guessed at the very beginning • The description of the ideal answer set is modeled in probabilistic terms

  35. Recombination after dimensionality reduction

  36. Classic IR Models • Vector vs. probabilistic: “Numerous experiments demonstrate that probabilistic retrieval procedures yield good results. However, the results have not been sufficiently better than those obtained using Boolean or vector techniques to convince system developers to move heavily in this direction.”

  37. Example • Build the inverted file for the following documents • F1 = {Written Quiz for Algorithms and Techniques of Information Retrieval} • F2 = {Program Quiz for Algorithms and Techniques of Web Search} • F3 = {Search on the Web for Information on Algorithms}
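One way to work through the exercise is to reuse the build_inverted_file sketch from slide 11 (a sketch; stop-word removal and stemming are left out):

    docs = {"F1": "Written Quiz for Algorithms and Techniques of Information Retrieval",
            "F2": "Program Quiz for Algorithms and Techniques of Web Search",
            "F3": "Search on the Web for Information on Algorithms"}
    index = build_inverted_file(docs)   # from the slide-11 sketch
    # e.g. index["algorithms"] -> {'F1', 'F2', 'F3'},  index["web"] -> {'F2', 'F3'},
    #      index["information"] -> {'F1', 'F3'},       index["quiz"] -> {'F1', 'F2'}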

  38. Exercise • Term-document matrix:

                 c1  c2  c3  c4  c5  m1  m2  m3  m4
    human         1   0   0   1   0   0   0   0   0
    interface     1   0   1   0   0   0   0   0   0
    computer      1   1   0   0   0   0   0   0   0
    user          0   1   1   0   1   0   0   0   0
    system        0   1   1   2   0   0   0   0   0
    response      0   1   0   0   1   0   0   0   0
    time          0   1   0   0   1   0   0   0   0
    EPS           0   0   1   1   0   0   0   0   0
    survey        0   1   0   0   0   0   0   0   1
    trees         0   0   0   0   0   1   1   1   0
    graph         0   0   0   0   0   0   1   1   1
    minors        0   0   0   0   0   0   0   1   1

  • Give the scores of the 9 documents for the query “trees, minors” using Boolean search.
  • Give the scores of the 9 documents for the query “trees, minors” using the vector model.
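A sketch of how the two scorings could be computed over this matrix. It assumes the Boolean query means trees AND minors and the vector score is a simple dot product of raw counts with a binary query; both are interpretations of the exercise, not stated on the slide:

    # The two term rows the query needs; columns are c1..c5, m1..m4.
    trees  = [0, 0, 0, 0, 0, 1, 1, 1, 0]
    minors = [0, 0, 0, 0, 0, 0, 0, 1, 1]
    doc_names = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]

    boolean_scores = [int(t > 0 and m > 0) for t, m in zip(trees, minors)]   # trees AND minors
    vector_scores  = [t + m for t, m in zip(trees, minors)]                  # dot product with binary query
    # boolean_scores -> [0, 0, 0, 0, 0, 0, 0, 1, 0]  (only m3 contains both terms)
    # vector_scores  -> [0, 0, 0, 0, 0, 1, 1, 2, 1]  (m3 ranks first; m1, m2, m4 tie next)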
