
Information Retrieval



Presentation Transcript


  1. Information Retrieval
     Yan Huang - CSCI5330 Database Implementation - Information Retrieval

  2. Information Retrieval Systems
     [Figure: keyword query, Document, IR System, document]

  3. Keyword Search
     • In full-text retrieval, all the words in each document are considered to be keywords.
     • We use the word "term" to refer to the words in a document.
     • Ranking documents by their estimated relevance to a query is critical.

  4. Similarity Based Retrieval
     • Similarity-based retrieval: retrieve documents similar to a given document.
     • Similarity can be used to refine the answer set of a keyword query.
     • The user selects a few relevant documents from those retrieved by the keyword query, and the system finds other documents similar to these.

  5. Similarity Measures
     • A similarity measure is a function that computes the degree of similarity between two vectors.
     • Using a similarity measure between the query and each document:
       • It is possible to rank the retrieved documents in order of presumed relevance.
       • It is possible to enforce a threshold so that the size of the retrieved set can be controlled.

  6. Relevance Ranking
     • Relevance ranking is based on factors such as:
       • Term frequency: how often the query keyword occurs in the document.
       • Inverse document frequency: how many documents the query keyword occurs in. The fewer the documents, the more importance is given to the keyword.
       • Hyperlinks to documents: the more links point to a document, the more important the document is.

  7. Relevance Ranking Using Terms (Cont.)
     • Most systems add to the above model:
       • Words that occur in the title, author list, section headings, etc. are given greater importance.
       • Words whose first occurrence is late in the document are given lower importance.
       • Very common words such as "a", "an", "the", "it", etc. are eliminated. These are called stop words.
       • Proximity: if the query keywords occur close together in the document, the document has higher importance than if they occur far apart.
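Stop-word elimination can be sketched in a few lines. This is only an illustration: the stop-word list holds just the four words named on the slide, and the whitespace tokenizer and the name `remove_stop_words` are assumptions made for the example.

```python
# Hypothetical stop-word list: only the four examples from the slide.
# Real systems use much longer lists.
STOP_WORDS = {"a", "an", "the", "it"}


def remove_stop_words(text):
    """Lowercase the text, split on whitespace, and drop stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


print(remove_stop_words("The repair of a motorcycle"))
# → ['repair', 'of', 'motorcycle']
```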

  8. Vector Space Model
     • Assume t distinct terms remain after preprocessing; call them index terms or the vocabulary.
     • These "orthogonal" terms form a vector space. Dimension = t = |vocabulary|.
     • Each term i in a document or query j is given a real-valued weight $w_{ij}$.
     • Both documents and queries are expressed as t-dimensional vectors: $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$

  9. Term Weights
     • More frequent terms in a document are more important, i.e. more indicative of the topic.
       $f_{ij}$ = frequency of term i in document j
     • May want to normalize term frequency (tf) by dividing by the frequency of the most common term in the document:
       $tf_{ij} = f_{ij} / \max_i\{f_{ij}\}$
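A minimal sketch of the normalized term frequency, assuming a document is already a list of tokens. The function name `normalized_tf` and the sample document are made up for illustration.

```python
from collections import Counter


def normalized_tf(tokens):
    """Return tf_ij = f_ij / max_i f_ij for every term of one document."""
    counts = Counter(tokens)          # f_ij: raw frequency of each term
    if not counts:
        return {}
    max_count = max(counts.values())  # frequency of the most common term
    return {term: count / max_count for term, count in counts.items()}


doc = ["retrieval", "database", "retrieval", "text", "retrieval", "database"]
tf = normalized_tf(doc)
print(tf["retrieval"])  # → 1.0 (the most common term always gets weight 1)
```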

  10. Reverse Term Weights
      • Terms that appear in many different documents are less indicative of the overall topic.
        $df_i$ = document frequency of term i = number of documents containing term i
        $idf_i$ = inverse document frequency of term i = $\log_2(N / df_i)$ (N: total number of documents)
      • An indication of a term's discrimination power.
      • The log is used to dampen the effect relative to tf.
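The inverse document frequency can be computed directly from its definition. The sketch below assumes a collection given as a list of token lists; the function name and the four sample documents are hypothetical.

```python
import math


def inverse_document_frequency(documents):
    """Return idf_i = log2(N / df_i) for every term in the collection."""
    n_docs = len(documents)
    df = {}
    for tokens in documents:
        for term in set(tokens):      # count each document at most once
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n_docs / count) for term, count in df.items()}


docs = [
    ["database", "index"],
    ["database", "query"],
    ["database", "index"],
    ["retrieval", "query"],
]
idf = inverse_document_frequency(docs)
print(idf["retrieval"])  # → 2.0, since the term occurs in 1 of 4 documents
```

A term that occurs in every document would get idf 0, which is the sense in which common terms have no discrimination power.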

  11. TF-IDF Weighting
      • A typical combined term importance indicator is tf-idf weighting:
        $w_{ij} = tf_{ij} \cdot idf_i = tf_{ij} \log_2(N / df_i)$
      • A term occurring frequently in the document but rarely in the rest of the collection is given high weight.
      • Many other ways of determining term weights have been proposed.
      • Experimentally, tf-idf has been found to work well.
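Putting the two factors together, a rough sketch of tf-idf weighting might look as follows. It assumes the max-normalized tf and the base-2 idf given on the slides; the function name `tf_idf` and the toy collection are illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: w_ij} dict per document, with w_ij = tf_ij * idf_i."""
    n_docs = len(documents)
    # df_i: number of documents that contain term i
    df = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        max_count = max(counts.values()) if counts else 1
        weights.append({
            term: (count / max_count) * math.log2(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["pagerank", "pagerank", "web"],
    ["web", "index"],
    ["web", "query"],
    ["web", "index", "query"],
]
w = tf_idf(docs)
print(w[0]["pagerank"])  # → 2.0: frequent in this document, rare elsewhere
print(w[0]["web"])       # → 0.0: occurs in every document
```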

  12. Inner Product Measure
      • The similarity between the vectors for document $d_j$ and query $q$ can be computed as the vector inner product:
        $sim(d_j, q) = d_j \cdot q = \sum_{i=1}^{t} w_{ij} w_{iq}$
        where $w_{ij}$ is the weight of term i in document j and $w_{iq}$ is the weight of term i in the query.
      • For binary vectors, the inner product is the number of matched query terms in the document (size of the intersection).
      • For weighted term vectors, it is the sum of the products of the weights of the matched terms.

  13. Inner Product -- Examples
      • Binary:
        D = 1, 1, 1, 0, 1, 1, 0
        Q = 1, 0, 1, 0, 0, 1, 1
        sim(D, Q) = 3
        Size of vector = size of vocabulary = 7. A 0 means the corresponding term is not found in the document or query.
        Vocabulary terms labeling the vector positions: architecture, management, information, computer, text, retrieval, database.
      • Weighted:
        D1 = 2T1 + 3T2 + 5T3
        D2 = 3T1 + 7T2 + 1T3
        Q = 0T1 + 0T2 + 2T3
        sim(D1, Q) = 2*0 + 3*0 + 5*2 = 10
        sim(D2, Q) = 3*0 + 7*0 + 1*2 = 2
      • Problems?
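The examples on this slide can be checked with a short sketch. The vectors are the ones given above; the function name `inner_product` and the variable names are chosen for the example.

```python
def inner_product(doc, query):
    """Sum of the products of corresponding term weights."""
    return sum(d * q for d, q in zip(doc, query))


# Binary vectors over the 7-term vocabulary
D = [1, 1, 1, 0, 1, 1, 0]
Q = [1, 0, 1, 0, 0, 1, 1]
print(inner_product(D, Q))         # → 3

# Weighted vectors over T1, T2, T3
D1 = [2, 3, 5]
D2 = [3, 7, 1]
Q_weighted = [0, 0, 2]
print(inner_product(D1, Q_weighted))  # → 10
print(inner_product(D2, Q_weighted))  # → 2
```

Because the score is not normalized, a document can rank higher simply by having larger weights overall.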

  14. t3 1 D1 Q 2 t1 t2 D2 Cosine Similarity Measure • Cosine similarity measures the cosine of the angle between two vectors. • Inner product normalized by the vector lengths. CosSim(dj, q) = Yan Huang - CSCI5330 Database Implementation –Information Retrieval

  15. Relevance Using Hyperlinks
      • Problem with keyword search?
      • Problem with ranking by the most frequently visited Web sites?
      • Idea: use the popularity of a Web site (e.g. how many people visit it) to rank the site's pages that match the given keywords.
      • Problem: it is hard to find the actual popularity of a site.

  16. Different Ranking Factors
      • Keyword and anchor-text based search finds all the related pages first.
      • PageRank then ranks the search result set.
      • A highly ranked page is of no interest to you if it is not related to your query.

  17. Link Counts
      [Figure: pages labeled Taher's Home Page, Sep's Home Page, CS361, DB Pub Server, CNN, Yahoo!; one page is linked by 2 unimportant pages, another is linked by 2 important pages]

  18. Definition of PageRank
      Let us calculate.

  19. Definition of PageRank
      [Figure: link graph over Sep, Taher, DB Pub Server, CNN, Yahoo! with labels 1/2, 1/2, 1, 1 and values 0.05, 0.25, 0.1, 0.1, 0.1]

  20. PageRank Diagram
      Initialize all nodes to the same rank: 0.333, 0.333, 0.333.

  21. PageRank Diagram
      Propagate ranks across links (multiplying by link weights). Values shown: 0.167, 0.333, 0.333, 0.167.

  22. PageRank Diagram
      Values shown: 0.5, 0.333, 0.167.

  23. PageRank Diagram
      Values shown: 0.167, 0.5, 0.167, 0.167.

  24. PageRank Diagram
      Values shown: 0.333, 0.5, 0.167.

  25. PageRank Diagram
      After a while: 0.4, 0.4, 0.2.

  26. Computing PageRank
      • Initialize: give every page the same starting importance.
      • Repeat until convergence: importance of page i = sum, over the pages j that link to page i, of (importance of page j) / (number of outlinks from page j).
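The iteration can be sketched as below. The three-page graph is an assumption: the diagrams did not survive extraction, and the links X→Y, X→Z, Y→X, Z→Y were inferred because they reproduce the numbers on the PageRank Diagram slides (0.333 each at the start, then 0.333/0.5/0.167, settling at 0.4/0.4/0.2). The page names and the function name are hypothetical, and the sketch assumes every page has at least one outlink.

```python
def pagerank_undamped(out_links, iterations=100):
    """Repeatedly set each page's importance to the sum of
    importance(j) / outlinks(j) over the pages j that link to it."""
    pages = list(out_links)
    rank = {page: 1.0 / len(pages) for page in pages}  # equal start
    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for page, targets in out_links.items():
            share = rank[page] / len(targets)  # split evenly over outlinks
            for target in targets:
                new_rank[target] += share
        rank = new_rank
    return rank


# Assumed graph, inferred from the numbers on the diagram slides.
graph = {"X": ["Y", "Z"], "Y": ["X"], "Z": ["Y"]}

first = pagerank_undamped(graph, iterations=1)
final = pagerank_undamped(graph)
print({page: round(value, 3) for page, value in first.items()})
# → {'X': 0.333, 'Y': 0.5, 'Z': 0.167}
print({page: round(value, 3) for page, value in final.items()})
# → {'X': 0.4, 'Y': 0.4, 'Z': 0.2}
```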

  27. Definition of PageRank
      • The importance of a page is given by the importance of the pages that link to it.
      • d is a damping factor, usually 0.85.
      • Quantities labeled in the formula: importance of page i; importance of page j; number of outlinks from page j; the pages j that link to page i.
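A sketch of the damped computation follows. The slide's exact formula is not preserved, so the update used here, r_i = (1 - d)/n + d · Σ r_j / outlinks(j), is an assumption: it is one common normalization, and other presentations use 1 - d as the constant term. The graph, the function name `pagerank`, and the `tol` and `max_iter` parameters are likewise illustrative; every page is assumed to have at least one outlink.

```python
def pagerank(out_links, d=0.85, tol=1e-10, max_iter=1000):
    """Damped PageRank by repeated propagation (assumed (1 - d)/n variant)."""
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(max_iter):
        # Every page receives a base amount, plus damped shares from in-links.
        new_rank = {page: (1.0 - d) / n for page in pages}
        for page, targets in out_links.items():
            share = d * rank[page] / len(targets)
            for target in targets:
                new_rank[target] += share
        change = sum(abs(new_rank[page] - rank[page]) for page in pages)
        rank = new_rank
        if change < tol:  # stop once the ranks no longer move
            break
    return rank


# Hypothetical three-page graph: page -> pages it links to.
graph = {"X": ["Y", "Z"], "Y": ["X"], "Z": ["Y"]}
ranks = pagerank(graph)
print(round(sum(ranks.values()), 6))  # → 1.0
```

With d = 1 the base amount vanishes and the update reduces to plain propagation of importance along links.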

  28. Synonyms and Homonyms
      • Synonyms
        • E.g. document: "motorcycle repair", query: "motorcycle maintenance"
        • The system needs to realize that "maintenance" and "repair" are synonyms.
        • The system can extend the query to "motorcycle and (repair or maintenance)".
      • Homonyms
        • E.g. "object" has different meanings as a noun and as a verb.
        • Meanings can be disambiguated (to some extent) from the context.
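Query extension with synonyms can be sketched as follows. The synonym table is a hypothetical stand-in for a real thesaurus, and the function names are invented for the example. Each query term becomes a group of alternatives (OR), and a document matches if it satisfies every group (AND).

```python
# Hypothetical synonym table; a real system would consult a thesaurus.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}


def expand_query(terms):
    """Turn each query term into the set {term} plus its synonyms."""
    return [{term} | SYNONYMS.get(term, set()) for term in terms]


def matches(expanded_query, document_terms):
    """AND across query terms, OR within each group of alternatives."""
    doc = set(document_terms)
    return all(group & doc for group in expanded_query)


# "motorcycle maintenance" becomes motorcycle AND (maintenance OR repair)
query = expand_query(["motorcycle", "maintenance"])
print(matches(query, ["motorcycle", "repair"]))  # → True
print(matches(query, ["bicycle", "repair"]))     # → False
```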

  29. Indexing of Documents
      • An inverted index maps each keyword $K_i$ to the set $S_i$ of documents that contain the keyword.
      • Documents are identified by identifiers.
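A minimal sketch of an inverted index, assuming documents are given as a dict from identifier to text and that keywords are lowercased, whitespace-separated words. The function name and the three sample documents are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each keyword K_i to the set S_i of identifiers of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return dict(index)


docs = {
    1: "motorcycle repair",
    2: "motorcycle maintenance",
    3: "database index",
}
index = build_inverted_index(docs)
print(index["motorcycle"])                    # → {1, 2}
print(index["motorcycle"] & index["repair"])  # → {1}
```

An "and" query then reduces to intersecting the sets of its keywords, as in the last line.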

  30. Measuring Retrieval Effectiveness
      • Relevant performance metrics:
        • Precision: what percentage of the retrieved documents are relevant to the query.
        • Recall: what percentage of the documents relevant to the query were retrieved.

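  31. Precision and Recall
      [Table: predicted versus actual class, with cells a, b, c]
      • Here a = relevant documents that were retrieved, b = relevant documents that were not retrieved, c = retrieved documents that are not relevant.
      • Precision: p = a/(a+c)
        • Among all the retrieved documents, how many are actual positives?
      • Recall: r = a/(a+b)
        • Percentage of the actual positives that were retrieved.
      • F measure: 2pr/(p+r)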
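The three measures can be computed from two sets of document identifiers. The sketch assumes a counts documents that are both retrieved and relevant, b counts relevant documents that were missed, and c counts retrieved documents that are not relevant, which is what the formulas a/(a+c) and a/(a+b) imply. Returning 0.0 when a denominator is zero is an assumed convention.

```python
def precision_recall_f(retrieved, relevant):
    """Return (precision, recall, F measure) for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    a = len(retrieved & relevant)   # retrieved and relevant
    b = len(relevant - retrieved)   # relevant but missed
    c = len(retrieved - relevant)   # retrieved but not relevant
    precision = a / (a + c) if a + c else 0.0
    recall = a / (a + b) if a + b else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical query: 4 documents retrieved, 6 actually relevant, 2 in common.
p, r, f = precision_recall_f(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
print(p)  # → 0.5
```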

  32. Training Data
      • Problem: which documents are actually relevant, and which are not?
      • Usual solution: human judges.
      • Create a corpus of documents and queries, with humans deciding which documents are relevant to which queries.
      • TREC (Text REtrieval Conference) benchmark

  33. Web Crawling
      • Web crawlers are programs that locate and gather information on the Web.
        • They recursively follow hyperlinks present in known documents to find other documents.
        • They start from a seed set of documents.
      • Fetched documents
        • Are handed over to an indexing system.
        • Can be discarded after indexing, or stored as a cached copy.
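The crawl loop can be sketched without touching the network. Here a small in-memory dict stands in for the Web, and `fetch_links` and `index_document` are hypothetical callbacks for fetching a page's hyperlinks and handing the page to an indexer. Breadth-first order is a choice made for the example, not something the slide prescribes.

```python
from collections import deque

# Hypothetical in-memory "Web": page -> hyperlinks found on that page.
LINKS = {
    "seed": ["a", "b"],
    "a": ["b", "c"],
    "b": ["seed"],
    "c": [],
    "orphan": ["seed"],  # nothing links here, so it is never discovered
}


def crawl(seeds, fetch_links, index_document):
    """Follow hyperlinks from the seed set, indexing each page exactly once."""
    seen = set(seeds)
    frontier = deque(seeds)
    while frontier:
        page = frontier.popleft()
        index_document(page)             # hand the page to the indexing system
        for target in fetch_links(page):
            if target not in seen:       # avoid fetching a page twice
                seen.add(target)
                frontier.append(target)
    return seen


indexed = []
visited = crawl(["seed"], lambda page: LINKS.get(page, []), indexed.append)
print(indexed)  # → ['seed', 'a', 'b', 'c']
```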

  34. Browsing
      • Storing related documents together in a library facilitates browsing: users can see not only the requested document but also related ones.
      • Browsing is facilitated by a classification system that organizes logically related documents together.
      • The organization is hierarchical: a classification hierarchy.

  35. A Classification Hierarchy For A Library System

  36. Classification DAG
      • Documents can reside in multiple places in a hierarchy in an information retrieval system, since physical location is not important.
      • The classification hierarchy is thus a Directed Acyclic Graph (DAG).

  37. A Classification DAG For A Library Information Retrieval System

  38. Web Directories
      • A Web directory is just a classification directory on Web pages.
        • E.g. Yahoo! Directory, Open Directory project
      • Issues:
        • What should the directory hierarchy be?
        • Given a document, which nodes of the directory are categories relevant to the document?
        • This is often done manually.
        • Classification of documents into a hierarchy may be done based on term similarity.

  39. Some slides of this slide set are adapted from the following:
      • Prof. James Allan's course slides
      • Extrapolation Methods for Accelerating PageRank Computations by Sepandar D. Kamvar et al.
