
Information Retrieval to Knowledge Retrieval , one more step






Presentation Transcript


  1. Information Retrieval to Knowledge Retrieval, one more step Xiaozhong Liu Assistant Professor School of Library and Information Science Indiana University Bloomington

  2. What is Information? What is Retrieval? What is Information Retrieval?

  3. I am Retriever

  4. How to find this book in Library?

  5. Search something based on User Information Need!! How to express your information need? Query

  6. User Information Need!! What is a good query? What is a bad query? Good query: query ≈ information need. Bad query: query ≠ information need. Wait!!! Users NEVER make mistakes!!! It's OUR job!!!

  7. Task 1: Given a user information need, how can we help the user (or automatically) propose a better query? If there is a query… Perfect query: User input query:

  8. User Information Need!! What are good results? What are bad results? Given a query, how do we retrieve results? Query Results

  9. Task 2: Given a (not perfect) query, how do we retrieve documents from the collection? F(query, doc) Very large, unstructured text data!!! Can you give me an example?

  10. If the query term exists in the doc: yes, this is a result. If the query term does NOT exist in the doc: no, this is not a result. F(query, doc) Is there any problem with this function? Brainstorm…
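A minimal sketch of this exact-match function in Python (the function name and toy documents are mine, not from the slides):

```python
def boolean_match(query, doc):
    """Return True iff every query term occurs verbatim in the document."""
    doc_terms = set(doc.lower().split())
    return all(term in doc_terms for term in query.lower().split())

# Exact term matching is brittle, as the next slide's example shows:
print(boolean_match("obama wife", "my wife supports obama new policy"))
print(boolean_match("obama wife", "michelle first lady of the united states"))
```

The second document is about Obama's wife but shares no terms with the query, so exact matching misses it entirely.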

  11. Query: Obama’s wife Doc 1. My wife supports Obama’s new policy on… Doc 2. Michelle, as the first lady of the United States… Yes, this is a very challenging task!

  12. Another problem Collection size: 5 billion Match doc: 5 My algorithm successfully finds all the 5 docs! In… 3 billion results…

  13. User Information Need!! How do we help users find what they need among all the retrieved results? Query Results

  14. Task 3: Given the retrieved results, how do we help users find what they need? If the retrieval algorithm retrieved 1 billion results from the collection, what will you do??? Search with Google and click "next"??? Yes, we can help users find what they need!

  15. Query: Indiana University Bloomington Can you read them one by one? Do you use it??

  16. [Diagram: User, Information Need, Query, System, and Results, connected by arrows labeled 1, 2, 3 for the three tasks above]

  17. They are not independent! [Same diagram: User, Information Need, Query, System, Results, with arrows 1, 2, 3]

  18. [Diagram: Information Retrieval covers many media: text, maps, images, music, ……]

  19. [Diagram: Information Retrieval covers text (web, scholar, document, blog, news), maps, images, music, ……]

  20. Index

  21. Documents vs. Database Records • Relational database records are typically made up of well-defined fields: SELECT * FROM students WHERE gpa > 2.5 Can we handle text the same way? Find all the docs containing "Xiaozhong": SELECT * FROM documents WHERE text LIKE '%xiaozhong%' We need a more effective way to index the text!

  22. Collection C: doc1, doc2, doc3 ……… docN Vocabulary V: w1, w2, w3 ……… wn Document doci: di1, di2, di3 ……… dim, where all dij ∈ V Query q: q1, q2, q3 ……… qt, where each qx is a query term

  23. Collection C: doc1, doc2, doc3 ……… docN
V:     w1  w2  w3  ………  wn
Doc1    1   0   0         1
Doc2    0   0   0         1
Doc3    1   1   1         1
………
DocN    1   0   1         1
Query q: 0, 1, 0 ………

  24. Collection C: doc1, doc2, doc3 ……… docN (Normalization is very important!)
V:     w1  w2  w3  ………  wn
Doc1    3   0   0         9
Doc2    0   0   0         7
Doc3    2  11  21         1
………
DocN    7   0   1         2
Query q: 0, 3, 0 ………

  25. Collection C: doc1, doc2, doc3 ……… docN (Normalization is very important!)
V:     w1    w2    w3    ………  wn
Doc1   0.41  0     0           0.62   (weights)
Doc2   0     0     0           0.12
Doc3   0.42  0.11  0.34        0.13
………
DocN   0.01  0     0.19        0.24
Query q: 0, 0.37, 0 ………

  26. Term weighting: TF * IDF. Inverse document frequency: IDF = 1 + log(N/k), where N = total number of docs in the collection and k = number of docs containing word w. Term frequency: TF = freq(w, doc) / |doc|. Or… An effective way to weight each word in a document.
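The TF and IDF formulas on this slide translate directly into Python (a sketch; the function names and toy collection are mine):

```python
import math

def tf(term, doc_tokens):
    # Term frequency, normalized by document length |doc|.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, collection):
    # Inverse document frequency: 1 + log(N / k).
    N = len(collection)                           # total number of docs
    k = sum(1 for d in collection if term in d)   # docs containing the term
    return 1 + math.log(N / k) if k else 0.0

def tf_idf(term, doc_tokens, collection):
    return tf(term, doc_tokens) * idf(term, collection)

docs = [["cat", "dog"], ["cat", "cat"], ["snake"]]
print(tf_idf("cat", docs[1], docs))   # higher: "cat" is every token here
print(tf_idf("cat", docs[0], docs))   # lower: "cat" is half the tokens
```

A rare term like "snake" gets a larger IDF than the common "cat", so a single occurrence of it counts for more.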

  27. Retrieval Model? Ranking? Index Speed? Semantic? Space? Document representation meets the requirement of retrieval system

  28. Stemming: Education, Educate, Educational, Educating, Educations → Educat. Very effective at improving system performance. Some risk! E.g. LA Lakers = LA Lake?
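A toy illustration of suffix stripping (this is NOT the real Porter algorithm; the suffix list and minimum-stem rule are arbitrary choices of mine, just to show the idea):

```python
def crude_stem(word):
    """Strip the first matching suffix, keeping a stem of at least 4 letters."""
    for suffix in ("ational", "ations", "ating", "ation", "ate", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 4:
            return word[: -len(suffix)]
    return word

# All the "educat-" family collapses to one index term:
for w in ("educate", "educating", "education", "educational", "educations"):
    print(w, "->", crude_stem(w))
```

The risk on the slide is real: an aggressive stemmer can conflate unrelated words (Lakers vs. Lake), trading precision for recall.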

  29. Inverted index
Doc 1: I love my cat. Doc 2: This cat is lovely! Doc 3: Yellow cat and white cat.
Tokens: i, love, my, cat, this, is, lovely, yellow, and, white
After stemming and stopword removal: i, love, cat, thi, yellow, white
i - 1
love - 1, 2
thi - 2
cat - 1, 2, 3
yellow - 3
white - 3
We lose something?

  30. Inverted index
Doc 1: I love my cat. Doc 2: This cat is lovely! Doc 3: Yellow cat and white cat.
Adding term frequencies (term - doc:freq):
i - 1:1
love - 1:1, 2:1
thi - 2:1
cat - 1:1, 2:1, 3:2
yellow - 3:1
white - 3:1
We still lose something?

  31. Inverted index
Doc 1: I love my cat. Doc 2: This cat is lovely! Doc 3: Yellow cat and white cat.
Adding term positions (term - doc:position):
i - 1:1
love - 1:2, 2:4
thi - 2:1
cat - 1:4, 2:2, 3:2, 3:5
yellow - 3:1
white - 3:4
Why do you need position info?
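A positional inverted index like this one can be built in a few lines (a sketch: no stemming or stopword removal here, so "lovely" stays distinct from "love"; positions are 1-based to match the slide):

```python
from collections import defaultdict

def build_positional_index(docs):
    """Map each term to {doc_id: [token positions]}, positions 1-based."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split(), start=1):
            index[token].setdefault(doc_id, []).append(pos)
    return index

docs = {1: "i love my cat",
        2: "this cat is lovely",
        3: "yellow cat and white cat"}
index = build_positional_index(docs)
print(index["cat"])    # doc 1 position 4, doc 2 position 2, doc 3 positions 2 and 5
```

Position lists are exactly what the next slide's proximity matching needs.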

  32. Proximity of query terms. Query: information retrieval. Doc 1: Information retrieval is important for digital library. Doc 2: I need some information about the dogs, my favorite is golden retriever.

  33. Index – bag of words. Query: information retrieval. Doc 1: Information retrieval is important for digital library. Doc 2: I need some information about the dogs, my favorite is golden retriever. What's the limitation of bag-of-words? Can we make it better? n-gram: Doc 1: information retrieval, retrieval is, is important, important for …… (bi-grams) Better semantic representation! What's the limitation?
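Extracting the bi-grams shown here is a one-liner (a sketch; the function name is mine):

```python
def ngrams(tokens, n=2):
    """All contiguous n-grams of a token list; n=2 gives bi-grams."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "information retrieval is important for digital library".split()
print(ngrams(tokens))   # ('information', 'retrieval'), ('retrieval', 'is'), ...
```

The limitation shows up immediately: the vocabulary grows roughly with the square of the word vocabulary, and most bi-grams are rare, so the index gets large and sparse.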

  34. Index – bag of "phrases"? Doc 1: …… big apple …… Doc 2: …… apple …… More precise, less ambiguous. How to identify phrases in documents? Identify syntactic phrases using POS tagging, or take n-grams from existing resources.

  35. Noise detection. What is the noise of a web page? Non-informative content…

  36. Web crawler - freshness. The Web is changing, but we cannot constantly check all the pages… Need to find the most important pages and how frequently they change: www.nba.com www.iub.edu www.restaurant????.com Sitemap: a list of URLs for each host, with modification time and change frequency.

  37. Retrieval

  38. Model: Mathematical modeling is frequently used with the objective to understand, explain, reason about, and predict behavior or phenomena in the real world (Hiemstra, 2001). E.g., some models help you predict tomorrow's stock price…

  39. Vector Space Model. Hypothesis: the retrieval and ranking problem = a similarity problem! Is that a good hypothesis? Why? Retrieval function: Similarity(query, document) returns a score!!! We can rank the documents!!!

  40. Vector Space Model: so, a query is just a short document.

  41. Collection C: doc1, doc2, doc3 ……… docN
V:     w1    w2    w3    ………  wn
Doc1   0.41  0     0           0.62
Doc2   0     0     0           0.12
Doc3   0.42  0.11  0.34        0.13
………
DocN   0.01  0     0.19        0.24
Query q: 0, 0.37, 0 ………

  42. Collection C: doc1, doc2, doc3 ……… docN
V:     w1    w2    w3    ………  wn
Doc1   0.41  0     0           0.62   (doc vector)
Doc2   0     0     0           0.12
Doc3   0.42  0.11  0.34        0.13
………
DocN   0.01  0     0.19        0.24
Query q: 0, 0.37, 0 ………   (query vector)
Similarity(query vector, doc vector)

  43. Doc1: ……cat……dog……cat…… Doc2: ……cat……dog Doc3: ……snake…… Query: dog cat
[Diagram: doc and query vectors plotted in a 2-D term space with axes "cat" and "dog"; doc1 at (cat 2, dog 1), doc2 at (cat 1, dog 1), doc3 off both axes]

  44. Doc1: ……cat……dog……cat…… Doc2: ……cat……dog Doc3: ……snake…… Query: dog cat
[Diagram: in the cat/dog plane, doc2 points in the same direction as the query; θ is the angle between the query vector and a doc vector]
F(q, doc) = cosine similarity(q, doc). Why cosine?

  45. Vector Space Model. Vocabulary V: w1, w2, w3 ……… wn. Dimension = n = vocabulary size. Document doci: di1, di2, di3 ……… din, where all dij ∈ V. Query q: q1, q2, q3 ……… qn. Same dimensional space!!!

  46. Doc1: ……Cat……dog……cat…… Doc2: ……Cat……dog Doc3: ……snake…… Query: dog cat Try!
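One way to try the exercise (a sketch; the vocabulary ordering (cat, dog, snake) and the raw term counts are my choice, following slide 43):

```python
import math

def cosine_similarity(q, d):
    """cos(theta) between two equal-length weight vectors; 0 if either is all-zero."""
    dot = sum(qi * di for qi, di in zip(q, d))
    norm = math.sqrt(sum(qi * qi for qi in q)) * math.sqrt(sum(di * di for di in d))
    return dot / norm if norm else 0.0

# Raw term counts over the vocabulary (cat, dog, snake):
doc1  = [2, 1, 0]   # ...cat...dog...cat...
doc2  = [1, 1, 0]   # ...cat...dog
doc3  = [0, 0, 1]   # ...snake...
query = [1, 1, 0]   # "dog cat"

for name, d in [("doc1", doc1), ("doc2", doc2), ("doc3", doc3)]:
    print(name, cosine_similarity(query, d))
```

doc2 points in exactly the query's direction (cosine 1.0), doc1 is close behind, and doc3 shares no terms (cosine 0), so the ranking is doc2 > doc1 > doc3. This is why cosine works: it compares directions, ignoring document length.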

  47. Term weighting. Doc [0.42 0.11 0.34 0.13]: weight, how? TF * IDF. Inverse document frequency: IDF = 1 + log(N/k), where N = total number of docs in the collection, k = number of docs containing word w. Term frequency: TF = freq(w, doc) / |doc|. Or…

  48. More TF. Weighting is very important for the retrieval model! We can improve TF, e.g. replace freq(term, doc) with log[freq(term, doc)] • BM25: a TF variant that saturates term frequency and normalizes by document length.
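The slide names BM25 without showing the formula; the standard Okapi BM25 scoring function can be sketched as follows (k1 and b are its usual free parameters; the function name and toy collection are mine):

```python
import math

def bm25(query_terms, doc, collection, k1=1.2, b=0.75):
    """Okapi BM25 score of one tokenized doc against a list of query terms."""
    N = len(collection)
    avgdl = sum(len(d) for d in collection) / N       # average doc length
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in collection if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        f = doc.count(term)                           # raw term frequency
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [["cat", "dog", "cat"], ["cat", "dog"], ["snake"]]
print(bm25(["cat"], docs[0], docs))
```

Unlike raw TF, the (k1 + 1) / (f + k1 ...) factor saturates: the tenth occurrence of a term adds far less than the first, and longer documents are penalized through the b term.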

  49. Vector Space Model. But… the bag-of-words assumption = words are independent! Query = document? Maybe not true! Vectors and SEO (Search Engine Optimization)… synonyms? Semantically related words?

  50. TF, IDF: how about these… (+ parameters, normalization)
Pivoted Normalization Method
Dirichlet Prior Method
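The two methods named here are usually written as follows (standard formulations from the retrieval literature, not taken from the slides; s and μ are the free parameters, avgdl the average document length, and p(w|C) the collection language model):

```latex
% Pivoted length normalization: the document-length normalizer in the TF weight
\mathrm{norm}(d) = (1 - s) + s \cdot \frac{|d|}{\mathrm{avgdl}}

% Dirichlet prior smoothing: smoothed term probability in a language model
p(w \mid d) = \frac{c(w, d) + \mu \, p(w \mid C)}{|d| + \mu}
```

Both address the same issue flagged on slides 24-25: raw term counts must be normalized so that long documents are not unfairly favored.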
