
Classic IR Models



Presentation Transcript


  1. Classic IR Models • Boolean model • simple model based on set theory • queries as Boolean expressions • adopted by many commercial systems • Vector space model • queries and documents as vectors in an M-dimensional space • M is the number of terms • find documents most similar to the query in the M-dimensional space • Probabilistic model • a probabilistic approach • assume an ideal answer set for each query • iteratively refine the properties of the ideal answer set Information Retrieval Models

  2. Document Index Terms • Each document is represented by a set of representative index terms or keywords • requires text pre-processing (off-line) • these terms summarize document contents • adjectives, adverbs, connectives are less useful • the index terms are mainly nouns (lexicon look-up) • Not all terms are equally useful • very frequent terms are not useful • very infrequent terms are not useful either • terms have varying relevance (weights) when used to describe documents

  3. Text Preprocessing • Extract terms from documents and queries • document - query profile • Processing stages • word separation • sentence splitting • change terms to a standard form (e.g., lowercase) • eliminate stop-words (e.g., and, is, the, …) • reduce terms to their base form (e.g., eliminate prefixes, suffixes) • construct term indices (usually inverted files)
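The preprocessing stages above can be sketched in Python; the stop-word list and the suffix-stripping rule are illustrative stand-ins for a real lexicon and stemmer:

```python
import re

# Illustrative stop-word list; a real system uses a much larger one.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "to"}

def preprocess(text):
    """Word separation, lowercasing, stop-word removal, crude suffix stripping."""
    tokens = re.findall(r"[a-z]+", text.lower())        # separate words, standard form
    tokens = [t for t in tokens if t not in STOPWORDS]  # eliminate stop-words
    # Naive stand-in for stemming: strip a plural 's' from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]
```

In practice the last step would be a real stemmer (e.g., Porter's), and the resulting terms would feed the inverted-file construction.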

  4. Text Preprocessing Chart from Baeza-Yates & Ribeiro-Neto, 1999

  5. Inverted Index [Figure: an inverted index over 11 documents; each index term (άγαλμα, αγάπη, …, δουλειά, …, πρωί, …, ωκεανός) points to a posting list of (document, frequency) pairs, e.g., (1,2)(3,4) and (4,3)(7,5)]
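A minimal sketch of building such an inverted index in Python, with posting lists of (document id, term frequency) pairs as in the figure (function and variable names are mine):

```python
from collections import Counter, defaultdict

def build_inverted_index(docs):
    """Map each term to a posting list of (doc_id, term_frequency) pairs.

    docs: list of token lists, one per document (doc ids start at 1)."""
    index = defaultdict(list)
    for doc_id, tokens in enumerate(docs, start=1):
        for term, tf in sorted(Counter(tokens).items()):
            index[term].append((doc_id, tf))  # postings stay sorted by doc_id
    return dict(index)
```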

  6. Basic Notation • Document: usually text • D: document collection (corpus) • d: an instance of D • Query: same representation as documents • Q: set of all possible queries • q: an instance of Q • Relevance: R(d,q) • binary relation R: D × Q → {0,1} • d is “relevant” to q iff R(d,q) = 1, or • degree of relevance: R(d,q) ∈ [0,1], or • probability of relevance: R(d,q) = Prob(R|d,q)

  7. Term Weights • T = {t1, t2, …, tM}: the terms in the corpus • N: number of documents in the corpus • dj: a document • dj is represented by (w1j, w2j, …, wMj) where • wij > 0 if ti appears in dj • wij = 0 otherwise • q is represented by (q1, q2, …, qM) • R(d,q) > 0 if q and d have common terms

  8. Term Weighting [Table: the term-by-document weight matrix; rows are terms t1, t2, …, tM, columns are documents d1, d2, …, dN, and entry wij is the weight of term ti in document dj]

  9. Document Space [Figure: the document space (corpus) D, with a query q surrounded by relevant and non-relevant documents]

  10. Boolean Model • Based on set theory and Boolean algebra • Boolean queries: “John” and “Mary” not “Ann” • terms linked by “and”, “or”, “not” • term weights are 0 or 1 (wij = 0 or 1) • query terms are present or absent in a document • a document is relevant if the query condition is satisfied • Pros: simple, used in many commercial systems • Cons: no ranking, not easy to express complex queries
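A query such as “John” and “Mary” not “Ann” reduces to set operations over the posting sets; a minimal sketch (the index here maps terms to sets of document ids):

```python
def boolean_query(index, must=(), must_not=()):
    """AND together the posting sets of `must` terms, then subtract `must_not` terms."""
    if not must:
        return set()
    result = set.intersection(*(index.get(t, set()) for t in must))
    for t in must_not:
        result -= index.get(t, set())
    return result

# "John" and "Mary" not "Ann"
index = {"john": {1, 2}, "mary": {2, 3}, "ann": {3}}
answer = boolean_query(index, must=("john", "mary"), must_not=("ann",))
```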

  11. Query Processing • For each term ti in query q = {t1, t2, …, tM} • use the index to retrieve all dj with wij > 0 • sort them in decreasing order (e.g., by term frequency) • Return documents satisfying the query condition • Slow for many terms: involves set intersections • Keep only the top K documents for each term at step 2, or • Do not process all query terms

  12. Vector Space Model • Documents and queries are M-dimensional term vectors • non-binary weights to index terms • a query is similar to a document if their vectors are similar • retrieved documents are sorted in decreasing order of similarity • a document may match a query only partially • SMART is the most popular implementation

  13. Query – Document Similarity • Similarity is defined as the cosine of the angle θ between the document and query vectors: sim(d, q) = cos θ = (d · q) / (|d| |q|)
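The cosine can be computed directly from the weight vectors; a small self-contained sketch:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) for two equal-length term-weight vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0
```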

  14. Weighting Scheme • tf × idf weighting scheme: wij = tfij × idfi • wij: weight of term ti associated with document dj • freqij: frequency of term ti in document dj • tfij = freqij / maxl freqlj: normalized frequency • the maximum frequency maxl freqlj is computed over all terms l in dj • idfi = log(N / ni): inverse document frequency • ni: number of documents where term ti occurs
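The scheme maps onto a one-line function; a sketch using the quantities defined above (raw frequency, the document's maximum term frequency, document frequency ni, corpus size N):

```python
import math

def tf_idf_weight(freq_ij, max_freq_j, n_i, N):
    """w_ij = (freq_ij / max_l freq_lj) * log(N / n_i)."""
    tf = freq_ij / max_freq_j   # frequency normalized by the doc's max term frequency
    idf = math.log(N / n_i)     # rare terms get a large idf, ubiquitous terms get ~0
    return tf * idf
```

A term occurring in every document gets idf = log(1) = 0 and thus weight 0, matching the intuition on slide 17.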

  15. Weight Normalization • Many ways to express the weights • e.g., using log(tfij) to dampen raw frequencies • The weight is normalized to [0,1] • Normalize by document length

  16. Normalization by Document Length • The longer the document, the more likely it is for a given term to appear in it • Normalize the term weights by document length (so longer documents are not given more weight)

  17. Comments on Term Weighting • tfij: term frequency – measures how well a term describes a document • intra-document characterization • idfi: terms appearing in many documents are not very useful in distinguishing relevant from non-relevant documents • inter-document characterization • This scheme favors average-frequency terms

  18. Comments on Vector Space Model • Pros: • at least as good as other models • approximate query matching: a query and a document need not contain exactly the same terms • allows for ranking of results • Cons: • assumes term independence

  19. Document Distance • Consider documents d1, d2 with vectors u1, u2 • their distance is defined as the length AB in the accompanying figure, i.e., the Euclidean distance between the vector endpoints

  20. Probabilistic Model • Computes the probability that the document is relevant to the query • ranks the documents according to their probability of being relevant to the query • Assumption: there is a set R of relevant documents which maximizes the overall probability of relevance • R: ideal answer set • R is not known in advance • initially assume a description (the terms) of R • iteratively refine this description

  21. Basic Notation • D: corpus, d: an instance of D • Q: set of queries, q: an instance of Q • P(R|d): probability that d is relevant • P(R̄|d): probability that d is not relevant

  22. Probability of Relevance • P(R|d): probability that d is relevant • Bayes rule: P(R|d) = P(d|R) P(R) / P(d) • P(d|R): probability of selecting d from R • P(R): probability of selecting R from D • P(d): probability of selecting d from D

  23. Document Ranking • Take the odds of relevance as the rank: O(d) = P(R|d) / P(R̄|d) • Minimizes the probability of an erroneous judgment • P(R) and P(R̄) are the same for all docs and can be dropped

  24. Ranking (cont’d) • Each document is represented by a set of index terms t1, t2, …, tM • assume binary weights wi for terms ti • d = (w1, w2, …, wM) where • wi = 1 if the term appears in d • wi = 0 otherwise • Assuming independence of index terms: P(d|R) = Πi P(wi|R)

  25. Ranking (cont’d) • By taking logarithms and omitting constant terms, the rank becomes a sum of term weights • R is initially unknown

  26. Initial Estimation • Make simplifying assumptions such as • P(ti|R) = 0.5 and P(ti|R̄) = ni / N • where ni: number of documents containing ti and N: total number of documents • Retrieve an initial answer set using these values • Refine the answer iteratively
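With these initial estimates the rank reduces to a sum of log-odds weights over the query terms present in the document; a sketch (df holds the ni document-frequency counts; the 0.5 and ni/N estimates follow this slide):

```python
import math

def bim_score(doc_terms, query_terms, df, N):
    """Initial binary-independence rank: sum of log-odds weights for matching terms."""
    score = 0.0
    for t in query_terms:
        if t in doc_terms and 0 < df.get(t, 0) < N:
            p = 0.5          # initial P(t|R): term equally likely to appear in R
            r = df[t] / N    # initial P(t|not R): estimated from document frequency
            score += math.log(p / (1 - p)) + math.log((1 - r) / r)
    return score
```

Rare matching terms dominate the score, since (1 − r)/r grows as the document frequency shrinks.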

  27. Improvement • Let V be the number of documents retrieved initially • Take the first r answers as relevant • From them compute Vi: the number of those documents containing ti • Update the initial probabilities: P(ti|R) = Vi / V, P(ti|R̄) = (ni − Vi) / (N − V) • Resubmit the query and repeat until convergence

  28. Comments on Probabilistic Model • Pros: • good theoretical basis • Cons: • need to guess the initial probabilities • binary weights • independence assumption • Extensions: • relevance feedback: humans choose relevant docs • Okapi formula for non-binary weights

  29. Comparison of Models • The Boolean model is simple and used almost everywhere. It does not allow for partial matches. It is the weakest model • The Vector space model has been shown (Salton and Buckley) to outperform the other two models • Various extensions deal with their weaknesses

  30. Query Modification • The results are not always satisfactory • some answers are correct, others are not • queries can’t specify the user’s needs precisely • Iteratively reformulate and resubmit the query until the results become satisfactory • Two approaches • relevance feedback • query expansion

  31. Relevance Feedback • Mark answers as • relevant: positive examples • irrelevant: negative examples • Query: a point in document space • at each iteration compute a new query point • the query moves towards an “optimal point” that distinguishes relevant from non-relevant documents • the weights of query terms are modified • “term reweighting”

  32. Rocchio Vectors [Figure: the query point moves from q0 to q1 to q2, approaching the optimal query]

  33. Rocchio Formula • Query point: q1 = α·q0 + (β/n1)·Σ di − (γ/n2)·Σ dj • di: relevant answer • dj: non-relevant answer • n1: number of relevant answers • n2: number of non-relevant answers • α, β, γ: relative strength (usually α = β = γ = 1) • α = 1, β = 0.75, γ = 0.25: q0 and the relevant answers carry the important information
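The Rocchio update translates directly into vector arithmetic; a sketch over plain Python lists, averaging the relevant and non-relevant answer vectors:

```python
def rocchio(q0, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q1 = alpha*q0 + (beta/n1)*sum(relevant) - (gamma/n2)*sum(non_relevant)."""
    M = len(q0)

    def centroid(vectors):
        if not vectors:               # no feedback of this kind: contributes nothing
            return [0.0] * M
        return [sum(v[i] for v in vectors) / len(vectors) for i in range(M)]

    cr, cn = centroid(relevant), centroid(non_relevant)
    return [alpha * q0[i] + beta * cr[i] - gamma * cn[i] for i in range(M)]
```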

  34. Query Expansion • Adds new terms to the query which are somehow related to existing terms • synonyms from a dictionary (e.g., staff, crew) • semantically related terms from a thesaurus (e.g., WordNet: man, woman, mankind, human, …) • terms with similar pronunciation (Phonix, Soundex) • Better results in many cases, but the query defocuses (topic drift)

  35. Comments • Do all together • query expansion: new terms are added from relevant documents, dictionaries, thesauri • term reweighting by the Rocchio formula • If consistent relevance judgments are provided • 2-3 iterations improve results • quality depends on the corpus

  36. Extensions • Pseudo relevance feedback: mark the top k answers as relevant, the bottom k answers as non-relevant, and apply the Rocchio formula • Relevance models for the probabilistic model • evaluation of initial answers by humans • term reweighting model by Bruce Croft, 1983

  37. Text Clustering • The grouping of similar vectors into clusters • Similar documents tend to be relevant to the same requests • Clustering in M-dimensional space • M: number of terms

  38. Clustering Methods • Sound methods based on the document-to-document similarity matrix • graph theoretic methods • O(N²) time • Iterative methods operating directly on the document vectors • O(N log N) or O(N²/log N) time

  39. Sound Methods • Two documents with similarity > T (threshold) are connected with an edge [Duda & Hart 73] • clusters: the connected components (or maximal cliques) of the resulting graph • problem: selection of an appropriate threshold T
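Finding the connected components of the thresholded similarity graph is a short traversal; a sketch over a full similarity matrix:

```python
def threshold_clusters(sim, T):
    """Clusters = connected components of the graph linking docs i, j with sim[i][j] > T."""
    n = len(sim)
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        comp, stack = set(), [start]   # depth-first search from an unseen doc
        while stack:
            i = stack.pop()
            if i in comp:
                continue
            comp.add(i)
            stack.extend(j for j in range(n) if j != i and sim[i][j] > T)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters
```

Raising T splits clusters apart and lowering it merges them, which is exactly the threshold-selection problem the slide mentions.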

  40. Zahn’s Method [Zahn71] • Find the minimum spanning tree • For each doc, delete incident edges with length l > lavg • lavg: average length of its incident edges • Or remove the longest edge (1 edge removed ⇒ 2 clusters, 2 edges removed ⇒ 3 clusters) • Clusters: the connected components of the graph • [Figure: the dashed edge is inconsistent and is deleted]

  41. Iterative Methods • K-means clustering (K known in advance) • Choose some seed points (documents) • possible cluster centroids • Repeat until the centroids do not change • assign each vector (document) to its closest centroid • compute new centroids • reassign vectors to improve the clusters
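The loop above can be sketched as plain k-means over document vectors (a fixed iteration count stands in for the convergence test):

```python
def kmeans(vectors, seeds, iterations=10):
    """Assign each vector to its closest centroid, recompute centroids, repeat."""
    centroids = [list(s) for s in seeds]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in vectors:
            # squared Euclidean distance to each centroid; pick the nearest
            best = min(range(len(centroids)),
                       key=lambda k: sum((a - b) ** 2 for a, b in zip(v, centroids[k])))
            clusters[best].append(v)
        for k, members in enumerate(clusters):
            if members:   # empty clusters keep their old centroid
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, clusters
```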

  42. Cluster Searching • The M-dimensional query vector is compared with the cluster centroids • search the closest cluster • retrieve documents with similarity > T

  43. References • "Modern Information Retrieval", Ricardo Baeza-Yates & Berthier Ribeiro-Neto, Addison Wesley, 1999 • "Searching Multimedia Databases by Content", Christos Faloutsos, Kluwer Academic Publishers, 1996 • Information Retrieval Resources: http://nlp.stanford.edu/IR-book/information-retrieval.html • TREC: http://trec.nist.gov/ • SMART: http://en.wikipedia.org/wiki/SMART_Information_Retrieval_System • LEMUR: http://www.lemurproject.org/ • LUCENE: http://lucene.apache.org/
