SMART System: Automatic Information Retrieval

CS 430: Information Discovery Lecture 9 Vector Methods

Course Administration
• No discussion class tomorrow
• Assignment 2 will be posted shortly
• Please hand in laptop questionnaires

SMART System
An experimental system for automatic information retrieval
• automatic indexing to assign terms to documents and queries
• collect related documents into common subject classes
• identify documents to be retrieved by calculating similarities between documents and queries
• procedures for producing an improved search query based on information obtained from earlier searches
Gerald Salton and colleagues
Harvard 1964-1968
Cornell 1968-1988

Vector Space Methods
• Problem: Given two text documents, how similar are they? (One document may be a query.)
• Vector space methods that measure similarity do not assume exact matches.
• Benefits of similarity measures rather than exact matches:
  • Encourage long queries, which are rich in information. An abstract should be very similar to its source document.
  • Accept probabilistic aspects of writing and searching. Different words will be used if an author writes the same document twice.

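Vector space revision

$\mathbf{x} = (x_1, x_2, x_3, \ldots, x_n)$ is a vector in an n-dimensional vector space.

Length of $\mathbf{x}$ is given by (extension of Pythagoras's theorem):

$$|\mathbf{x}|^2 = x_1^2 + x_2^2 + x_3^2 + \cdots + x_n^2$$

If $\mathbf{x}_1 = (x_{11}, \ldots, x_{1n})$ and $\mathbf{x}_2 = (x_{21}, \ldots, x_{2n})$ are vectors, the inner product (or dot product) is given by:

$$\mathbf{x}_1 \cdot \mathbf{x}_2 = x_{11}x_{21} + x_{12}x_{22} + x_{13}x_{23} + \cdots + x_{1n}x_{2n}$$

Cosine of the angle $\theta$ between the vectors $\mathbf{x}_1$ and $\mathbf{x}_2$:

$$\cos(\theta) = \frac{\mathbf{x}_1 \cdot \mathbf{x}_2}{|\mathbf{x}_1| \, |\mathbf{x}_2|}$$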
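The three definitions above can be sketched in a few lines of Python. The function names `dot`, `length`, and `cosine` are illustrative choices, not part of the lecture, and returning 0.0 when either vector has zero length is a convention assumed here.

```python
import math

def dot(x, y):
    """Inner product of two vectors of the same dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum(a * b for a, b in zip(x, y))

def length(x):
    """Euclidean length |x|, the n-dimensional form of Pythagoras's theorem."""
    return math.sqrt(dot(x, x))

def cosine(x, y):
    """Cosine of the angle between x and y.

    Returns 0.0 if either vector has zero length (an assumed convention,
    since the angle is undefined in that case).
    """
    denom = length(x) * length(y)
    return dot(x, y) / denom if denom else 0.0

print(length([3, 4]))
print(cosine([1, 0], [0, 1]))
print(round(cosine([1, 2], [2, 4]), 6))
```

Perpendicular vectors give a cosine of 0, and vectors pointing in the same direction give a cosine of 1 (up to floating-point rounding), whatever their lengths.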

Vector Space Methods: Concept
• n-dimensional space, where n is the total number of different terms used to index a set of documents.
• Each document is represented by a vector, with magnitude in each dimension equal to the (weighted) number of times that the corresponding term appears in the document.
• Similarity between two documents is measured by the angle between their vectors: the smaller the angle (the larger its cosine), the more similar the documents.

Three terms represented in 3 dimensions
[Figure: document vectors d1 and d2 drawn in the space of three term axes t1, t2, t3.]

Example 1: Incidence array

terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
d1      1    1    0    0    0    0    0    0    √2
d2      1    1    0    1    0    0    0    1    √4
d3      0    0    1    1    1    1    1    0    √5

Weights: tij = 1 if document i contains term j and zero otherwise

Example 1 (continued)
Similarity of documents in example:

      d1    d2    d3
d1    1     0.71  0
d2    0.71  1     0.22
d3    0     0.22  1

• Similarity measures the occurrences of terms, but no other characteristics of the documents.
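The similarity values for Example 1 can be checked with a short sketch. With 0/1 weights the inner product is just the number of shared terms and each length is the square root of the number of distinct terms, so sets are enough. The helper name `incidence_cosine` is an illustrative choice.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Incidence weighting: only the set of distinct terms in a document matters.
term_sets = {name: set(text.split()) for name, text in docs.items()}

def incidence_cosine(a, b):
    """Cosine for 0/1 term vectors: shared terms / sqrt(|a| * |b|)."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))

names = sorted(term_sets)
similarity = {
    (p, q): round(incidence_cosine(term_sets[p], term_sets[q]), 2)
    for p in names
    for q in names
}

for p in names:
    print(p, [similarity[(p, q)] for q in names])
```

For d1 and d2 this gives 2 / √(2 × 4) ≈ 0.71, and for d2 and d3 it gives 1 / √(4 × 5) ≈ 0.22, matching the values in the example.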

Example 2: Frequency array

terms in d1 -> ant ant bee
terms in d2 -> bee hog ant dog
terms in d3 -> cat gnu dog eel fox

terms  ant  bee  cat  dog  eel  fox  gnu  hog  length
d1      2    1    0    0    0    0    0    0    √5
d2      1    1    0    1    0    0    0    1    √4
d3      0    0    1    1    1    1    1    0    √5

Weights: tij = frequency with which term j occurs in document i

Example 2 (continued)
Similarity of documents in example:

      d1    d2    d3
d1    1     0.67  0
d2    0.67  1     0.22
d3    0     0.22  1

• Similarity depends upon the weights given to the terms.
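A sketch of the same calculation with term-frequency weights shows where the change from 0.71 to 0.67 comes from. It uses `collections.Counter` as a sparse term vector; the function name `cosine` and the rounding to two places are choices made for this illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Frequency weighting: t_ij is the number of times term j occurs in document i.
freqs = {name: Counter(text.split()) for name, text in docs.items()}

def cosine(u, v):
    """Cosine between two sparse term-weight vectors stored as mappings."""
    shared = u.keys() & v.keys()
    inner = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return inner / (norm_u * norm_v) if norm_u and norm_v else 0.0

names = sorted(freqs)
similarity = {
    (p, q): round(cosine(freqs[p], freqs[q]), 2) for p in names for q in names
}

for p in names:
    print(p, [similarity[(p, q)] for q in names])
```

Counting "ant" twice in d1 raises the inner product with d2 from 2 to 3, but it also lengthens d1 from √2 to √5, so the similarity becomes 3 / (√5 × 2) ≈ 0.67.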

Vector similarity computation
Documents in a collection are assigned terms from a set of n terms.

The term assignment array T is defined as:
• if term j does not occur in document i, $t_{ij} = 0$
• if term j occurs in document i, $t_{ij}$ is greater than zero (the value of $t_{ij}$ is called the weight of term j in document i)

Similarity between $d_i$ and $d_j$ is defined as:

$$\cos(d_i, d_j) = \frac{\sum_{k=1}^{n} t_{ik} t_{jk}}{|d_i| \, |d_j|}$$

Simple use of vector similarity
• Threshold: For query q, retrieve all documents with similarity more than 0.50
• Ranking: For query q, return the n most similar documents ranked in order of similarity
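Both retrieval rules can be sketched on the three example documents. The function names, the use of term-frequency weights for both query and documents, and the alphabetical tie-break in the ranking are assumptions made for this illustration.

```python
import math
from collections import Counter

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}

def cosine(u, v):
    """Cosine between two sparse term-frequency vectors."""
    inner = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return inner / (norm_u * norm_v) if norm_u and norm_v else 0.0

def score_all(query, docs):
    """Similarity of the query to every document; the query is treated as a document."""
    q = Counter(query.split())
    return {name: cosine(q, Counter(text.split())) for name, text in docs.items()}

def retrieve_by_threshold(query, docs, threshold=0.50):
    """Threshold rule: every document whose similarity exceeds the threshold."""
    scores = score_all(query, docs)
    return sorted(name for name, s in scores.items() if s > threshold)

def retrieve_top_n(query, docs, n):
    """Ranking rule: the n most similar documents, best first."""
    scores = score_all(query, docs)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

print(retrieve_by_threshold("ant dog", docs))
print([name for name, _ in retrieve_top_n("ant dog", docs, 2)])
```

For the query "ant dog", d1 and d2 pass the 0.50 threshold and d2 ranks first. The threshold rule returns a set whose size varies with the query, whereas the ranking rule always returns n documents, however weak the matches.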

Contrast with Boolean searching
With Boolean retrieval, a document either matches a query exactly or not at all
• Encourages short queries
• Requires precise choice of index terms
• Requires precise formulation of queries (professional training)
With retrieval using similarity measures, similarities range from 0 to 1 for all documents
• Encourages long queries to have as many dimensions as possible
• Benefits from large numbers of index terms
• Benefits from queries with many terms, not all of which need match the document
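The contrast can be seen on the three example documents. In this sketch the Boolean query is simplified to a conjunction (AND) of its terms, which is an assumption made for brevity; real Boolean systems also support OR and NOT. The function names are illustrative.

```python
import math

docs = {
    "d1": "ant ant bee",
    "d2": "bee hog ant dog",
    "d3": "cat gnu dog eel fox",
}
term_sets = {name: set(text.split()) for name, text in docs.items()}

def boolean_and(query_terms, term_sets):
    """Exact match: documents that contain every query term."""
    return sorted(
        name for name, terms in term_sets.items() if query_terms <= terms
    )

def ranked_by_similarity(query_terms, term_sets):
    """Cosine with 0/1 weights: every document gets a score from 0 to 1."""
    def cos(a, b):
        return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0
    scores = {name: cos(query_terms, terms) for name, terms in term_sets.items()}
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = {"ant", "bee", "cat"}
print(boolean_and(query, term_sets))
for name, score in ranked_by_similarity(query, term_sets):
    print(name, round(score, 2))
```

No document contains all three query terms, so the conjunctive Boolean query returns nothing, while the similarity ranking still orders all three documents by how much of the query they share.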

Document vectors as points on a surface
• Normalize all document vectors to be of length 1
• Then the ends of the vectors all lie on a surface with unit radius
• For similar documents, we can represent parts of this surface as a flat region
• Similar documents are represented as points that are close together on this surface
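A small sketch illustrates why closeness on the unit surface corresponds to similarity. After normalization the inner product of two vectors equals their cosine, and the squared straight-line distance between their endpoints is 2 − 2 cos, so a higher cosine means a shorter distance. The vectors are d1 and d2 from Example 2 with terms in the order ant, bee, cat, dog, eel, fox, gnu, hog; the function names are illustrative.

```python
import math

def normalize(x):
    """Scale x to length 1; a zero vector has no direction and is rejected."""
    norm = math.sqrt(sum(a * a for a in x))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [a / norm for a in x]

def dot(x, y):
    """Inner product of two vectors of the same dimension."""
    return sum(a * b for a, b in zip(x, y))

# Term order: ant, bee, cat, dog, eel, fox, gnu, hog (frequency weights).
d1 = [2, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 0, 0, 0, 1]

u1, u2 = normalize(d1), normalize(d2)

cos = dot(u1, u2)  # for unit vectors the inner product is the cosine
dist_sq = sum((a - b) ** 2 for a, b in zip(u1, u2))

print(round(cos, 2))
print(round(dist_sq, 2), round(2 - 2 * cos, 2))
```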

Relevance feedback (concept)
[Figure: hits from the original search plotted around the original query and the reformulated query. Legend: x = documents identified as non-relevant; o = documents identified as relevant.]
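The slide gives only the concept, not a formula. As one possible sketch, the formulation commonly attributed to Rocchio moves the query toward the average of the relevant documents and away from the average of the non-relevant ones. The weights `alpha`, `beta`, and `gamma` below are illustrative values, and clipping negative weights to zero is an assumed convention.

```python
def centroid(vectors, n_terms):
    """Average of a list of term vectors; a zero vector if the list is empty."""
    if not vectors:
        return [0.0] * n_terms
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(n_terms)]

def reformulate(query, relevant, non_relevant, alpha=1.0, beta=0.75, gamma=0.25):
    """Rocchio-style update: alpha*q + beta*mean(relevant) - gamma*mean(non_relevant).

    Negative term weights are clipped to zero (an assumed convention).
    """
    n = len(query)
    rel = centroid(relevant, n)
    non = centroid(non_relevant, n)
    return [
        max(0.0, alpha * q + beta * r - gamma * s)
        for q, r, s in zip(query, rel, non)
    ]

# Term order: ant, bee, cat, dog, eel, fox, gnu, hog (incidence weights).
query = [1, 0, 0, 1, 0, 0, 0, 0]  # "ant dog"
d2 = [1, 1, 0, 1, 0, 0, 0, 1]     # judged relevant (hypothetical judgment)
d3 = [0, 0, 1, 1, 1, 1, 1, 0]     # judged non-relevant (hypothetical judgment)

new_query = reformulate(query, [d2], [d3])
print(new_query)
```

The reformulated query gains weight on "bee" and "hog", which occur in the relevant document, and loses some weight on "dog", which also occurs in the non-relevant one.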

Document clustering (concept)
[Figure: documents plotted as points (x) that fall into groups.]
• Document clusters are a form of automatic classification.
• A document may be in several clusters.
