
Computer communication B



  1. Computer communication B Information retrieval

  2. Information retrieval: introduction 1 • This topic addresses the question of how relevant information can be found in large collections of documents • These documents have to be processed by computers • Often there are too many hits: 1,530,000 Dutch pages for the query “insurance”

  3. Information retrieval: introduction 1 • Sometimes ambiguities have to be resolved: the acronym LSA, for example, can stand for: • Linguistic Society of America, and what else? Let’s google it • Information retrieval (IR) searches for documents relevant to a specific topic in a large collection of documents

  4. Information retrieval: introduction 2 • Search engines are a kind of IR system • Two characteristics distinguish IR from simple database search • Vagueness: the user cannot express and formalize her/his information need precisely • Uncertainty: the system has no knowledge of the content of the documents • Difference from Information Extraction (IE): IE extracts the relevant information on a specific topic from a large collection of documents • The authors of the documents and their users are very often separate groups

  5. Information retrieval: introduction 3 • The search does not go directly through the documents; instead it searches index terms (or descriptors) • An index term captures the essence of a document’s topic (a sort of keyword that is used in the search) • Steps of the preparation: building the search index • Determine the relevant terms and their occurrences in each document • Terms are not just groups of characters between spaces (otherwise string search would be enough) • Save this information in an index • Both steps are well developed

  6. Information retrieval: introduction 3 • Search instructions are translated into index terms • These are evaluated against the index (not against the documents) • The index is what makes the search efficient: it optimizes how the answer is computed (see the sketch below)
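
A minimal sketch of both sides of this process in Python, with an invented toy collection: an inverted index is built once from the documents, and a query is then answered from the index alone, without re-reading the documents.

```python
from collections import defaultdict

# Toy document collection (invented for illustration).
docs = {
    1: "car insurance offers cheap car insurance",
    2: "health insurance for students",
    3: "buying a used car",
}

def tokenize(text):
    # Naive tokenizer: lowercase and split on whitespace.
    # Real systems also normalize, stem, and drop stop words.
    return text.lower().split()

# Build the inverted index: term -> set of ids of documents containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in tokenize(text):
        index[term].add(doc_id)

def search(query):
    # Translate the query into index terms and evaluate them
    # against the index, not against the documents themselves.
    term_sets = [index[term] for term in tokenize(query)]
    return set.intersection(*term_sets) if term_sets else set()

print(search("car insurance"))   # {1}: the only document with both terms
```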

  7. Information retrieval: introduction 4 • An index is static: it does not change automatically when documents are added or removed (or disappear) • The results of a search are ranked according to their relevance • The search procedure (formalized in an algorithm) has to evaluate the relevance of a document for a given search • The ranking algorithms can be “misused” to push web pages to the top of the results (“search engine optimization”, SEO) • The higher a page ranks in the results, the more often it will be visited. An advantage! • An example: insurance pages

  8. Information retrieval: Vector space models 1 • Documents are characterized/evaluated according to their index terms • Each document is identified with a vector • The index terms are the dimensions of the vector, so a document vector can have many dimensions • The value for an index term is the number of times that term appears in the document (sometimes the value is 0) • A metric for the similarity of two documents is the cosine of the angle between their vectors (see the sketch below) • Searches are interpreted as vectors as well
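
A small sketch of the cosine measure in Python (the index terms and counts are invented for illustration): the cosine is the dot product of two vectors divided by the product of their lengths, so vectors pointing in the same direction score 1.0 and orthogonal vectors score 0.0.

```python
import math

def cosine(u, v):
    # Cosine similarity: dot product divided by the product of the lengths.
    dot = sum(a * b for a, b in zip(u, v))
    length = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / length if length else 0.0

# Term counts over the index terms ("insurance", "car", "health"):
d1 = [2, 2, 0]       # document 1
d2 = [1, 0, 1]       # document 2
query = [1, 1, 0]    # the search, interpreted as a vector too

print(cosine(d1, query))  # 1.0: same direction as the query
print(cosine(d2, query))  # 0.5: only one term in common
```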

  9. Vector space models 2 • An example of a vector space model with only 2 index terms • Boolean search methods take a more macroscopic perspective (whole documents are matched, not their index-term weights)

  10. Vector space models 3 • The more often a term appears in a document, the more important it is for that document • But raw term weights (term frequency: tf_{t,d}) suggest that all terms are equally important (i.e. have the same weight) • This can introduce a bias due to the difference in frequency between terms • Therefore we analyse how many documents in the whole collection D contain a given term t (df_t: document frequency) • With df_t we can calculate the inverse document frequency: idf_t = log(|D| / df_t) • The weight of a term in a document is then calculated with the tf-idf formula: w_{t,d} = tf_{t,d} · idf_t (see the sketch below)
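
A short Python sketch of this weighting with an invented toy collection, using the natural logarithm (the choice of base only scales the weights):

```python
import math

# Toy collection D: term counts per document (invented for illustration).
docs = [
    {"insurance": 3, "car": 1},
    {"car": 2, "used": 1},
    {"insurance": 1, "health": 2},
]
N = len(docs)   # |D|

def df(term):
    # Document frequency: in how many documents does the term occur?
    return sum(1 for d in docs if term in d)

def tf_idf(term, doc):
    # tf-idf weight: raw term frequency damped by idf_t = log(|D| / df_t).
    return doc.get(term, 0) * math.log(N / df(term))

# "insurance" occurs in 2 of 3 documents, "health" in only 1,
# so the rarer term "health" gets the larger idf boost.
print(tf_idf("insurance", docs[0]))  # 3 * log(3/2) ≈ 1.22
print(tf_idf("health", docs[2]))     # 2 * log(3/1) ≈ 2.20
```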

  11. Information retrieval: evaluation 1 • The success of IR has several aspects • Precision: how many of the found documents are relevant to the search? • Formula: P = |found ∩ relevant| / |found|

  12. Information retrieval: evaluation 1 • Recall • How many of the relevant documents are found by the search? • Formula: R = |found ∩ relevant| / |relevant|

  13. Information retrieval: evaluation 1 • Fall-out • How many of the irrelevant documents are found by the search? • Formula: F = |found ∩ irrelevant| / |irrelevant| • There is an inverse correlation between precision and recall

  14. Information retrieval: evaluation 2 • Example: 20 found documents, 18 of them relevant; 3 relevant documents are not found; 27 irrelevant documents are also not found • Precision: 18/20 = 90% • Recall: 18/21 ≈ 85.7% • Fall-out: 2/29 ≈ 6.9% • A first attempt at a metric that combines precision and recall: accuracy • How many documents are classified correctly (relevant and found / irrelevant and not found)? • In our example: (18+27)/50 = 90% • But the large majority of irrelevant, not-found documents (above 99% of the collection in real systems) dominates the count, so accuracy is not a good evaluation measure (the sketch below reproduces these numbers)
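
The slide’s numbers, reproduced in a few lines of Python:

```python
found = 20                                   # documents returned by the search
found_relevant = 18                          # of which relevant
relevant = found_relevant + 3                # 3 relevant were missed -> 21
irrelevant = (found - found_relevant) + 27   # 2 found + 27 not found -> 29
total = relevant + irrelevant                # 50 documents in all

precision = found_relevant / found                  # 18/20 = 0.90
recall = found_relevant / relevant                  # 18/21 ≈ 0.857
fallout = (found - found_relevant) / irrelevant     # 2/29  ≈ 0.069
accuracy = (found_relevant + 27) / total            # 45/50 = 0.90
print(precision, recall, fallout, accuracy)
```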

  15. Information retrieval: evaluation 3 • Second attempt: the F-value • It balances precision and recall: the harmonic mean of the two • Formula: F = 2PR / (P + R) • In our example: F = 2(18/20 · 18/21) / (18/20 + 18/21) ≈ 0.878, i.e. about 87.8% (computed below) • Another metric considers the order of the found documents: are the most relevant documents ranked first?
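
Checking the harmonic mean for the example directly:

```python
precision, recall = 18 / 20, 18 / 21
f_value = 2 * precision * recall / (precision + recall)
print(round(f_value, 3))   # 0.878, i.e. about 87.8%
```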
