Computer communication B
Information retrieval
Information retrieval: introduction 1
• This topic addresses the question of how to find relevant information in large collections of documents
• These documents have to be processed by computers
• Often there are too many hits: 1,530,000 Dutch pages for the query "insurance"
Information retrieval: introduction 1
• Sometimes ambiguities have to be resolved: for example, the acronym LSA can stand for:
• Linguistic Society of America, and what else? Let's google it
• Information retrieval (IR): finding the documents relevant to a specific topic in a large collection of documents
Information retrieval: introduction 2
• Search engines are a kind of IR system
• Two characteristics differentiate IR from simply searching in databases:
• Vagueness: the user cannot express and formalize her/his information need in a precise way
• Uncertainty: the system has no knowledge about the content of the documents
• Contrast with information extraction (IE): extracting the relevant pieces of information for a specific topic from a large collection of documents
• The authors of the documents and their users are very often separate groups
Information retrieval: introduction 3
• The search does not go directly through the documents; instead it looks for index-terms (or descriptors)
• An index-term captures the essence of the topic of a document (it is a sort of keyword used in the search)
• Preparation step: building the search index
• Determine the relevant terms and their occurrences in each document
• Terms are not just groups of characters between spaces (otherwise string search would be enough)
• Save this information in an index (a minimal sketch follows below)
• Both branches, indexing and searching, are well developed
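To make the indexing step concrete, here is a minimal sketch in Python; the two example documents and the naive tokenizer are invented for illustration, and a real indexer would add stemming, stop-word removal, and so on:

    from collections import defaultdict
    import re

    documents = {
        1: "Insurance companies compare car insurance rates.",
        2: "The Linguistic Society of America publishes journals.",
    }

    # Build an inverted index: term -> set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)

    # A query term is looked up in the index, not in the documents.
    print(index["insurance"])  # {1}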
Information retrieval: introduction 3
• Search instructions are translated into index-terms
• They are evaluated against the index (not against the documents)
• An index optimizes the search, and that is what makes answering queries efficient
Information retrieval: introduction 4 • A index is statistical. It does not change automatic when documents are added or are taken away (or disappear). • Results of a search are arranged according to their relevance • The search procedure (formalized in an algorithm) has to evaluate the relevance of a document in a search • The algorithms for the creation of the ranks can be “misused” to push WebPages in front of the search (“search engine optimization” SEO) • The higher the position of the page in the search, the higher the numbers of times that it will get visited. Advantage! • An example: insurance pages
Information retrieval: Vector space models 1
• Documents are characterized/evaluated according to their index-terms
• Each document is represented by a vector
• The dimensions of the vector are the index-terms; a document vector can therefore have many dimensions
• The value for an index-term is the number of times that term appears in the document (often the value is 0)
• A metric for the similarity between two documents is the cosine of the angle between their vectors (see the sketch below)
• Queries are interpreted as vectors as well
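A minimal sketch of the cosine similarity metric, assuming two invented term-count vectors over the same three index-terms:

    import math

    # Term counts over the index-terms ("insurance", "car", "rate").
    doc_a = [3, 1, 0]
    doc_b = [1, 0, 2]

    def cosine(u, v):
        # cos(angle) = (u . v) / (|u| * |v|)
        dot = sum(x * y for x, y in zip(u, v))
        norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
        return dot / norms if norms else 0.0

    print(round(cosine(doc_a, doc_b), 3))  # 0.424

The same function can compare a query vector against every document vector in order to rank the documents.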
Vector space models 2
• An example of a vector space model with only 2 index-terms
• Boolean search methods take a more macroscopic perspective (whole documents are compared, not their index-terms)
Vector space models 3
• The more often a term appears in a document, the more important it is for that document
• But raw term counts (term frequency: tf(t,d)) treat all terms as equally important (i.e. as having the same weight)
• This can introduce a bias, because terms differ strongly in how common they are
• Therefore we count how many documents in the whole collection D contain a term t (df(t): document frequency)
• From df(t) we can calculate the inverse document frequency: idf(t) = log( |D| / df(t) )
• The weight of a term in a document is then calculated with the tf-idf formula: w(t,d) = tf(t,d) · idf(t) (a sketch follows below)
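A small sketch of the tf-idf weighting over an invented three-document collection (the natural logarithm is used here; the base of the logarithm only rescales the weights):

    import math

    # Term frequencies per document.
    docs = [
        {"insurance": 3, "car": 1},
        {"insurance": 1, "rate": 2},
        {"linguistics": 4},
    ]

    def tf_idf(term, doc, docs):
        tf = doc.get(term, 0)
        df = sum(1 for d in docs if term in d)         # document frequency
        idf = math.log(len(docs) / df) if df else 0.0  # inverse document frequency
        return tf * idf

    # "insurance" appears in 2 of 3 documents, so its weight is dampened;
    # "linguistics" appears in only 1 document, so its weight is boosted.
    print(tf_idf("insurance", docs[0], docs))    # 3 * ln(3/2) ~ 1.22
    print(tf_idf("linguistics", docs[2], docs))  # 4 * ln(3/1) ~ 4.39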
Information retrieval: evaluation 1
• The success of an IR system has several aspects
• Precision: how many of the found documents are relevant to the query?
• Formula: P = |found ∩ relevant| / |found|
Information retrieval: evaluation 1
• Recall: how many of the relevant documents are found by the query?
• Formula: R = |found ∩ relevant| / |relevant|
Information retrieval: evaluation 1
• Fall-out: how many of the irrelevant documents are found by the query?
• Formula: F = |found ∩ irrelevant| / |irrelevant|
• There is an inverse correlation between precision and recall
Information retrieval: evaluation 2
• Example: 20 documents found, 18 of them relevant; 3 relevant documents are not found; 27 irrelevant documents are also not found
• Precision: 18/20 = 90%
• Recall: 18/21 = 85.7%
• Fall-out: 2/29 = 6.9%
• A first attempt at a metric that combines precision and recall: accuracy
• How many documents are classified correctly (relevant and found / irrelevant and not found)?
• In our example: (18 + 27)/50 = 90%
• But the large majority of documents are irrelevant and not found (in real systems above 99%), so accuracy is dominated by them and is not a good evaluation metric (the calculation is worked out below)
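The numbers from the example, worked out in a few lines of Python:

    found_relevant, found_irrelevant = 18, 2
    missed_relevant, missed_irrelevant = 3, 27

    relevant = found_relevant + missed_relevant        # 21 relevant in total
    irrelevant = found_irrelevant + missed_irrelevant  # 29 irrelevant in total
    found = found_relevant + found_irrelevant          # 20 documents found
    total = relevant + irrelevant                      # 50 documents

    precision = found_relevant / found                       # 0.9
    recall = found_relevant / relevant                       # 0.857...
    fallout = found_irrelevant / irrelevant                  # 0.069...
    accuracy = (found_relevant + missed_irrelevant) / total  # 0.9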
Information retrieval: evaluation 3
• Second attempt: the F-value
• It balances precision and recall: the harmonic mean of the two (computed below)
• Formula: F = 2PR / (P + R)
• In our example: F = 2 · (18/20 · 18/21) / (18/20 + 18/21) ≈ 0.88
• Another metric looks at the order of the found documents: are the most relevant documents ranked first?
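And the F-value for the same example:

    precision, recall = 18 / 20, 18 / 21
    f_value = 2 * precision * recall / (precision + recall)
    print(round(f_value, 2))  # 0.88, the harmonic mean of precision and recall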