
Re-ranking Retrieved Documents Using Query Term Distances
Kerem Ali Uluğ, Turhan O. Daybelge

Outline
• Motivation
  – How query term distances can affect the relevance of a document
• Problem statement and a possible solution
• Scoring spans in a document according to query-term distances
  – Different approaches
• Detecting spans contained in a document
• Combining pieces
  – A numeric example
• Input data for our project
• Evaluating the performance
CS533 Information Retrieval Systems

Motivation
• In conventional IR systems, documents are retrieved and ranked by their relevance to a query using the tf-idf approach.
• However, the relevance of a document to a query should increase as the distance between query terms in the document decreases.
• We need to re-rank the documents retrieved by the IR system according to query-term distances, so that proximity information is incorporated into the ranking.

Motivation
Example query: “renk körlüğü” (color blindness), evaluated as renk AND körlük

Göz Duyusu; ışık, şekil, renk, hareket ve derinlik gibi çok çeşitli özelliklerin toplamıdır. Görme duyusunun gelişmesi, doğumdan sonra altı yaşına kadar devam eder. Doğumda, iki göz arasındaki denge herhangi bir nedenle bozulmuş ise, bir göz, beyin tarafından tercih edilir, diğer göz atıl kapasite ile kullanılır. Düşük kapasite ile kullanılan gözün görme yeteneği azalır ve göz tembelliği oluşur. Göz hastalıkları kalıtım ile geçen, mikrobik, çeşitli kazalar ve mekanik birçok nedenlerle ortaya çıkabilir. Ülkemizde akraba evlilikleri, çocukluk çağı körlüklerinin başta gelen sebebidir. Geri kalmış ülkelerde trahom gibi mikrobik ve A vitamini eksikliği gibi beslenme bozukluğu, başlıca körlük nedenleridir.

For our query, an IR system can assign a high rank to this irrelevant document when it relies solely on a tf-idf based method. However, the query terms are so far apart (renk occurs in the first sentence, körlük only in the last two) that they are not semantically related.

Problem & Possible Solution
• A tf-idf based ranking fails in this kind of situation because term-distance (proximity) information is not used.
• Intuitively, semantically related terms often occur near each other in a document, and a typical query searches for such related term groups.
• We should use a re-ranking method under which the relevance of a document to a query increases when:
  – the distance between the query-term occurrences within appropriately chosen groups (spans) in the document decreases;
  – the number of such spans in the document increases.
• Such a re-ranking method should assign higher ranks to more relevant documents.

Problem & Possible Solution
• We need a method that assigns a score to each retrieved document.
• If we can assign a proximity score to each span Si in document D, then we can calculate a relevance score for D from those span scores.
• We can then re-rank the documents according to this measure.
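As a rough illustration of this re-ranking step, the sketch below assumes that a document's score is simply the sum of its span scores. That combination rule is an assumption made here for illustration, and the names `document_score` and `rerank` are hypothetical.

```python
from typing import Dict, List


def document_score(span_scores: List[float]) -> float:
    """Combine span scores into one document score (plain sum: an assumption)."""
    return sum(span_scores)


def rerank(span_scores_by_doc: Dict[str, List[float]]) -> List[str]:
    """Return document ids ordered by descending combined score."""
    return sorted(
        span_scores_by_doc,
        key=lambda doc_id: document_score(span_scores_by_doc[doc_id]),
        reverse=True,
    )


# Made-up span scores for three retrieved documents.
example_scores = {"D2": [0.125], "D3": [], "D1": [0.5, 0.25]}
print(rerank(example_scores))  # ['D1', 'D2', 'D3']
```

A document with no spans at all gets a score of zero under this rule and falls to the bottom of the list.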

Things to be solved
• Two problems must be solved:
  – How do we assign scores to spans using proximity information?
  – How do we group related terms into spans?

Calculating Span Scores
• Factors that may affect the score of a span:
  – Lexical distance between the query terms in the span
  – Whether the span crosses semantic boundaries (such as sentence and paragraph boundaries)
  – Number of unique query-term occurrences in the span

Calculating Span Scores: Lexical Distance
• Suppose that a span covers words w1 through w10 and contains four query-term occurrences, namely w1, w4, w8 and w10. Some span-length measures are given below:
[The list of span-length measures was not preserved.]
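The specific measures from this slide are not reproduced here. As an illustration only, the sketch below computes a few common candidates for the w1, w4, w8, w10 example. The choice of measures and the names `span_measures`, `extent`, `total_gap`, `max_gap` and `mean_gap` are assumptions, not the authors' definitions.

```python
from typing import Dict, List


def span_measures(positions: List[int]) -> Dict[str, float]:
    """Candidate length measures for a span, given query-term word positions."""
    ordered = sorted(positions)
    # Distances between consecutive query-term occurrences.
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return {
        "extent": ordered[-1] - ordered[0] + 1,   # words covered by the span
        "total_gap": sum(gaps),                   # sum of consecutive distances
        "max_gap": max(gaps) if gaps else 0,      # widest consecutive distance
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


measures = span_measures([1, 4, 8, 10])
print(measures)
# {'extent': 10, 'total_gap': 9, 'max_gap': 4, 'mean_gap': 3.0}
```

Any of these grows as the query-term occurrences spread out, so a score that decreases with the chosen measure penalizes scattered occurrences.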

Calculating Span Scores: Lexical Distance
• With this approach, the farther apart the query-term occurrences are in a document, the lower the score assigned to that document.
• Limiting the span length:
  – We can set an upper limit Lmax on the span length.
  – Two terms ti and tj will be considered unrelated if dist(ti, tj) > Lmax.

Calculating Span Scores: Crossing Semantic Boundaries
• Since a span represents a group of semantically related terms, the score of a span must drop when it crosses a semantic boundary.
• Possible semantic boundaries:
  – Sentence boundaries
  – Paragraph boundaries
  – Section boundaries, etc.
• This problem can also be handled with the lexical distance concept:
  – A semantic boundary between a pair of terms can be treated as increasing the distance between those terms.
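One way to picture "a boundary increases the distance" is to add a fixed penalty for every boundary lying between two terms. This is a minimal sketch under that assumption: the function names, the penalty value and the representation of boundaries as sorted word positions where a new sentence starts are all hypothetical.

```python
from bisect import bisect_right
from typing import List


def effective_distance(pos_a: int, pos_b: int,
                       boundaries: List[int], penalty: int) -> int:
    """Lexical distance plus a penalty per semantic boundary crossed.

    `boundaries` holds sorted word positions at which a new unit
    (e.g. a sentence) starts; `penalty` is an assumed tuning constant.
    """
    lo, hi = sorted((pos_a, pos_b))
    # Count boundaries b with lo < b <= hi.
    crossed = bisect_right(boundaries, hi) - bisect_right(boundaries, lo)
    return (hi - lo) + penalty * crossed


def related(pos_a: int, pos_b: int, boundaries: List[int],
            penalty: int, l_max: int) -> bool:
    """Terms count as related only if the effective distance is within l_max."""
    return effective_distance(pos_a, pos_b, boundaries, penalty) <= l_max


sentence_starts = [6, 15]
print(effective_distance(4, 8, sentence_starts, penalty=5))   # 9
print(effective_distance(8, 10, sentence_starts, penalty=5))  # 2
print(related(4, 8, sentence_starts, penalty=5, l_max=8))     # False
```

With this formulation the same Lmax cutoff handles both plain distance and boundary crossings, so no separate rule is needed.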

Calculating Span Scores: Unique Query Terms
• The more repeated query-term occurrences a span contains, the lower its score should be.
• For example, for a three-term query, suppose the following two spans of equal length are identified in the text:
  q1 x x x q2 x q3 (should have a higher score)
  q1 x x x q2 x q1
• The span that has more unique query terms covers the query better.
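The comparison of the two spans can be made concrete with a simple coverage ratio: the fraction of distinct query terms that appear in the span. The ratio and the name `coverage` are illustrative assumptions; the slides only state that the span with more unique terms should score higher.

```python
from typing import Iterable


def coverage(span_words: Iterable[str], query_terms: Iterable[str]) -> float:
    """Fraction of distinct query terms that occur in the span."""
    query = set(query_terms)
    return len(query & set(span_words)) / len(query)


query = ["q1", "q2", "q3"]
span_a = ["q1", "x", "x", "x", "q2", "x", "q3"]
span_b = ["q1", "x", "x", "x", "q2", "x", "q1"]

print(coverage(span_a, query))  # 1.0
print(coverage(span_b, query) < coverage(span_a, query))  # True
```

Both spans have the same length, so only the coverage factor separates them.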

Calculating Span Scores: Previous Research
• Cormack et al., University of Waterloo & University of Toronto
• Hawking et al., Australian National University
• Shin et al., AI Lab, Seoul National University
• Song et al., Microsoft Research Asia
• We present the first two approaches in the next slides.

Calculating Span Scores: Cormack et al.
• Cormack et al. used a method named “ranking by solution density”.
• Suppose a document contains n spans.
• The score of a span S is calculated by the formula: [formula not preserved]
• After scoring each span, we sort the spans in descending order of score (S1, S2, ..., Sn).
• Finally, the total score of the document is calculated as: [formula not preserved]

Calculating Span Scores: Hawking et al.
• Hawking et al. propose a similar distance-based relevance formula.
• The relevance contribution score of a span S to document D for a query Q is defined as: [formula not preserved]
  – C is a constant, usually 1, but it may be adjusted according to the number of repeated query terms in the span.
  – F is a function, usually the identity, but it may be adjusted to alter the rate at which the relevance contribution score decays with span length.
  – n = |Q| is the number of unique query terms in the query.
  – Lmax is the maximum allowable span length.

Determining Spans
• There are many possible spans in a document.
• An example query: “Türkiye Avrupa İlişkileri” (Turkey-Europe relations), evaluated as Türkiye AND Avrupa AND İlişki

UNESCO Türkiye Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

Determining Spans
• We plan to use an iterative algorithm.
• The algorithm iterates through the query-term occurrences in the document.
• A maximum allowable distance (MAX_DIS) between query-term occurrences is defined.
• The same query term is not allowed to occur more than once in a span.
• Every query-term occurrence is covered by a span at the end of a single pass of the algorithm.

Determining Spans
current-term = first query-term hit
do while current-term ≠ NIL
    if the distance between the current-term and the next-term is bigger than the threshold MAX_DIS
        then the current-span ends and a new span begins with the next-term
    else if the current-term and the next-term are identical
        then the current-span ends and a new span begins with the next-term
    else if the next-term is identical to a hit within the current-span
        then the distance between the current-term and the next-term is compared with the distance between the identical hit and its next hit, and the span is separated at the bigger gap
    otherwise add the current-term to the current-span
    current-term = next-term
repeat
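The pseudocode above can be sketched in Python as follows. This is one reading of it, not a definitive implementation: it assumes hits are given as (word position, normalized query term) pairs sorted by position, measures distance as the difference of positions, treats the final branch as appending the next hit to the current span, and closes the current span when the two compared gaps are equal. The name `detect_spans` is hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[int, str]  # (word position, normalized query term)


def detect_spans(hits: List[Hit], max_dis: int) -> List[List[Hit]]:
    """Group position-sorted query-term hits into spans in a single pass."""
    if not hits:
        return []
    spans: List[List[Hit]] = []
    current: List[Hit] = [hits[0]]
    for nxt in hits[1:]:
        last_pos, last_term = current[-1]
        gap = nxt[0] - last_pos
        # Too far apart, or the same term twice in a row: start a new span.
        if gap > max_dis or nxt[1] == last_term:
            spans.append(current)
            current = [nxt]
            continue
        # Index of an earlier hit in the current span with the same term.
        dup = next((i for i, (_, t) in enumerate(current) if t == nxt[1]), None)
        if dup is None:
            current.append(nxt)
        elif current[dup + 1][0] - current[dup][0] > gap:
            # Bigger gap is right after the earlier duplicate: split there.
            spans.append(current[: dup + 1])
            current = current[dup + 1:] + [nxt]
        else:
            # Bigger (or equal) gap is before the next hit: split here.
            spans.append(current)
            current = [nxt]
    spans.append(current)
    return spans


# Hits for the example text, with positions from whitespace tokenization
# (counted by hand; the tokenization is an assumption).
hits_d1 = [(2, "türkiye"), (11, "türkiye"), (13, "avrupa"), (23, "avrupa"),
           (26, "türkiye"), (27, "ilişki"), (32, "türkiye"), (33, "avrupa")]
spans_d1 = detect_spans(hits_d1, max_dis=8)
for span in spans_d1:
    print([term for _, term in span])
```

For this input the sketch yields four spans: a lone türkiye, then türkiye + avrupa, then avrupa + türkiye + ilişki, then türkiye + avrupa. Other tokenizations or distance definitions may group hits differently.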

Determining Spans
• Looking at the example again:

UNESCO [Türkiye] Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

Determining Spans
• We will try several approaches to determining spans by modifying the algorithm.
• Since we have relevance judgments for the documents, we will be able to compare the approaches.
• One example variation is to allow more than one occurrence of a query term in a span.

Ranking Example
The following two documents contain the same number of query terms:

D1: UNESCO Türkiye Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, Türkiye Cumhuriyeti'nin Avrupa değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “Avrupa Birliği - Türkiye ilişkileri” konulu panelde konuşan Tacar, Türkiye'nin Avrupa Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

D2: Rusya Devlet Başkanı Vladimir Putin'den önce, mesajları Türkiye'ye ulaştı. Artık Türkiye için Rusya, "komünizm tehlikesi", Rusya için Türkiye NATO üyesi hasım olmadığına göre, ilişkilere bu gözle bakmak iki ülkenin de yararına. Bunun Avrupa'ya alternatif bir blok anlayışı taşıması da gerekmez. Avrupa birliği bizim için son çare değildir. Türkiye'nin ulusal çıkarları doğrultusunda Avrupa dışında da temaslarımıza devam etmeliyiz.

Ranking Example
Running the span detection algorithm with MAX_DIS = 8, we obtain the following spans:

D1: UNESCO [Türkiye] Milli Komitesi Başkan Vekili ve Büyükelçi Pulat Tacar, [Türkiye Cumhuriyeti'nin Avrupa] değerleri çerçevesi içinde olduğunu söyledi. Doğuş Üniversitesi tarafından düzenlenen “[Avrupa Birliği - Türkiye ilişkileri]” konulu panelde konuşan Tacar, [Türkiye'nin Avrupa] Birliği yolunda büyük ilerlemeler kaydettiğini, ancak hala bazı eksiklikleri olduğunu belirtti.

D2: Rusya Devlet Başkanı Vladimir Putin'den önce, mesajları [Türkiye'ye] ulaştı. Artık [Türkiye] için Rusya, "komünizm tehlikesi", Rusya için [Türkiye NATO üyesi hasım olmadığına göre, ilişkilere bu gözle bakmak iki ülkenin de yararına. Bunun Avrupa'ya] alternatif bir blok anlayışı taşıması da gerekmez. [Avrupa] birliği bizim için son çare değildir. [Türkiye'nin ulusal çıkarları doğrultusunda Avrupa] dışında da temaslarımıza devam etmeliyiz.

Ranking Example
• Using the formula proposed by Hawking et al., we calculate a score for each document, with the parameters:
  – C = 1
  – F(x) = x (identity function)
  – Lmax = MAX_DIS × (|Q| − 1) = 16
• Score(D1) = 0.592
• Score(D2) = 0.159
• The results show that D1 is more relevant to the query and should therefore be ranked higher than D2.

Input Data
• We will use the Bilkent Information Retrieval Group 2006 queries, run on Milliyet 2001-2005 documents, as data:
  – Number of documents: 408,305
  – Average article size: 234 tokens
  – Total database size: 800 MB
  – Number of evaluated queries: 52
  – Average number of documents per query: 474
  – Average number of relevant documents per query: 133
• The current system retrieves documents according to its own ranking system.
• We will use the query data (terms), the original rankings of the documents, the relevance judgments for the documents, and the document full texts.

Evaluation Strategy
• Get data from the existing system.
• Re-rank the documents according to our method.
• Check whether relevant documents are ranked high and non-relevant documents receive lower ranks (compared with the original rankings).
• If the re-ranking is successful, the precision values at the 11 standard recall levels should improve (using the TREC interpolation rule).
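The comparison in the last step can be sketched as follows. The code computes interpolated precision at the 11 standard recall levels (0.0, 0.1, ..., 1.0), where the interpolated precision at a level is the maximum precision observed at any recall greater than or equal to that level. The function name and the toy rankings are illustrative assumptions.

```python
from typing import List, Set


def eleven_point_precision(ranking: List[str], relevant: Set[str]) -> List[float]:
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one query."""
    found = 0
    points = []  # (recall, precision) measured at each relevant document
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            found += 1
            points.append((found / len(relevant), found / rank))
    levels = [i / 10 for i in range(11)]
    # Interpolation: best precision at any recall >= the level (0 if none).
    return [max((p for r, p in points if r >= level), default=0.0)
            for level in levels]


relevant_docs = {"d1", "d3"}
original = eleven_point_precision(["d1", "d2", "d3", "d4"], relevant_docs)
reranked = eleven_point_precision(["d1", "d3", "d2", "d4"], relevant_docs)
print([round(p, 3) for p in original])
print([round(p, 3) for p in reranked])
```

In this toy case the re-ranked list moves the second relevant document up one position, which raises the interpolated precision at recall levels 0.6 through 1.0 from 2/3 to 1.0. Averaging these 11 values over all evaluated queries gives one curve per ranking method to compare.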


Summary
• We have motivated proximity-based document relevance measures.
• This relevance information can be used to re-rank the output of a conventional IR system.
• We have introduced two main approaches for determining span scores and cumulative scores for each document.
• We have introduced an algorithm that chooses semantically related query terms and groups them into spans.
• We have shown on a simple example how a distance-based relevance measure can assign a higher score to a more relevant document.
