Gravitation-Based Model for Information Retrieval

Presentation Transcript


  1. Gravitation-Based Model for Information Retrieval Shuming Shi Ji-Rong Wen Qing Yu Ruihua Song Wei-Ying Ma Microsoft Research Asia shumings@microsoft.com From: http://www.awesomelibrary.org/images/solar-system-nasa.jpg SIGIR’2005

  2. Background • A core problem in Information Retrieval (IR): determine the relevance of a document to a query. • Example: given the query “Bill Clinton” and a document, is the document relevant, and how relevant?

  3. Background • IR Models & Perspectives • IR models define the representation of documents and queries, and the relevance relationship between them • Behind every IR model is a primary perspective on information retrieval

  4. Background • Hard questions • What is the essence of information retrieval? • What is the right perspective on it? • So far, we have learned more about IR each time a new perspective was adopted • It would therefore be helpful to view IR problems from further new perspectives • We try to view IR from the perspective of physics

  5. Background (1687 AD) From: http://csep10.phys.utk.edu/astr161/lect/history/newtongrav.html

  6. Background From: http://www.enterprisemission.com/hyper2a.php

  7. Background • We live in a physical world governed by fundamental laws of physics. • Can we get help from “the God” in acquiring a deeper understanding of information retrieval? • We simply start from Newton’s Universal Law of Gravitation…
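
For reference, Newton's law of universal gravitation, which the slide names as its starting point, is usually written as below. The symbols $F$ (attractive force), $G$ (gravitational constant), $m_1$ and $m_2$ (the two point masses), and $r$ (the distance between them) are standard physics notation, not notation taken from the slides.

```latex
F = G \, \frac{m_1 m_2}{r^2}
```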

  8. Preliminary Achievements • It is encouraging that we can really benefit from nature. With the new perspective, we obtain the following preliminary achievements: • We build a new IR model, GBM, from which many effective ranking functions can be derived • The BM25 formula can be derived from our model, which gives an intuitive physical interpretation of this powerful and robust function • A more reasonable approach to structured document retrieval can be obtained directly from the model; this approach is not only highly effective but also robust under various conditions • Background on BM25: it was first discovered by Robertson et al, inspired by the shape of a complex formula derived from a probabilistic model under the 2-Poisson assumption; Amati and Rijsbergen proposed a probabilistic framework with which the BM25 function with some special parameters (k1=1.2, b=0.75; or k1=2, b=0.75) can be approximated numerically; a complete theoretical derivation of the BM25 formula is still lacking

  9. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  10. GBM: Initial Idea • IR concepts & notations: |D|: document length; df(t): document frequency of t; avdl: average document length in a collection; N: total number of documents; c(t,D): number of occurrences of t in D (also written as tf(t,D)) • A mapping needs to be built from the concepts of information retrieval to those of physics: the relevance score between a query (e.g. Bill Clinton) and a document corresponds to an attractive force; further physics concepts to be mapped include mass, distance, …
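
The notation above can be made concrete with a small sketch that computes N, avdl, df(t), |D|, and c(t,D) for a tokenized collection. The toy documents, the whitespace tokenization, and the helper names `collection_stats` and `c` are assumptions made for illustration; they do not come from the slides.

```python
from collections import Counter

# Hypothetical toy collection; ids and texts are invented for illustration.
docs = {
    "d1": "bill clinton was the president".split(),
    "d2": "clinton visited the city".split(),
    "d3": "the bill passed".split(),
}


def collection_stats(docs):
    """Return (N, avdl, df) for a collection of tokenized documents."""
    n_docs = len(docs)
    avdl = sum(len(tokens) for tokens in docs.values()) / n_docs
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # each term counts once per document
    return n_docs, avdl, df


def c(term, tokens):
    """c(t, D): number of occurrences of term t in document D."""
    return tokens.count(term)


N, avdl, df = collection_stats(docs)
doc_len_d1 = len(docs["d1"])  # |D| for document d1
```

Here |D| is measured in tokens; a real system would more likely use a proper tokenizer and an inverted index rather than list scans.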

  11. GBM: Notations & Basic Concepts • Particle (= atom): basic element of any object • A particle has two attributes: mass and type • Type: determined by the term object it composes

  12. GBM: Notations & Basic Concepts • A term object has four attributes: type, shape, mass, and diameter • H(D): hidden terms in document D • Two natural assumptions: [not preserved in the transcript]

  13. Notation List

  14. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  15. Discrete GBM Model • Key points: 1. Under the attraction of query terms, the structure of each document is adjusted to an optimized-term-placement state. 2. The relevance between a document and a query is defined as the attractive force between them when the document is in its optimized-term-placement state. • Optimized-term-placement state: a state in which the aggregated force between the document and the query is maximized

  16. Term Weighting Formula • The force between query term t and its i-th nearest occurrence in D: [equation not preserved] • The maximal (optimized) gravitational force between t and D: [equation not preserved] • The attractive force between D and Q: [equation not preserved] • Unknown expressions: m(t,Q), m(t,D), and di(t,D) • Need: mass and diameter estimation

  17. Mass and Diameter Estimation • For any two terms, their mass ratio in any document is equal to the ratio of their average masses in the whole collection. • Assume that all terms in the same document have equal diameters. (Assumption-2) (Assumption-1) • Define a document-independent mass for each (type of) term; it denotes the average mass of term t in the whole collection. (Assumption-3) (Assumption-4)

  18. Ultimate Discrete GBM Formula • The average (document-independent) mass of term t in the collection: [equation not preserved] • The ultimate term-weighting function: [equation and auxiliary definitions not preserved] • The mass of a document is a measure of its quality, which depends on how informative and important it is. • Relationship with PageRank? <Future work>

  19. Ultimate Discrete GBM Formula • If m(D) = const, di(D) = const, and [condition not preserved], then we obtain a special case of the term-weighting function: [equation and auxiliary definitions not preserved] • Two parameters: [not preserved]

  20. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  21. Continuous GBM Model • Term shape: ideal cylinder • Document D is now in its optimized-term-placement state

  22. Term Weighting Formula • The force between query term t and its i-th nearest occurrence in D: [equation not preserved] • The maximal (optimized) gravitational force between t and D: [equation not preserved]

  23. Ultimate Continuous GBM Formula • By doing mass and diameter estimation, we obtain the ultimate term-weighting function: [equation and auxiliary definitions not preserved] • If m(D) = const, di(D) = const, and [condition not preserved], then we obtain a special case of the above term-weighting function: [equation not preserved] • Two parameters: [not preserved]

  24. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  25. Continuous GBM Formula vs. BM25 • A special case of the continuous GBM term-weighting function: [equation and auxiliary definitions not preserved] • BM25 term-weighting function: [equation not preserved]
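
Since the BM25 formula on this slide did not survive in the transcript, the sketch below uses the commonly cited Okapi BM25 term weight (without the query-term-frequency component) as a stand-in. It is not a reproduction of the slide's rendering or of the GBM special case. The function names and the particular IDF variant are assumptions; the defaults k1=1.2 and b=0.75 are the values mentioned earlier in the talk.

```python
import math
from collections import Counter


def bm25_term_weight(tf, df, doc_len, avdl, n_docs, k1=1.2, b=0.75):
    """Commonly cited Okapi BM25 weight of one term in one document.

    tf is c(t,D), df is df(t), doc_len is |D|, n_docs is N.
    """
    # Robertson/Sparck Jones style IDF (one of several variants in use).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    # Document-length normalization controlled by b.
    length_norm = k1 * (1.0 - b + b * doc_len / avdl)
    # Saturating term-frequency component controlled by k1.
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def bm25_score(query_terms, doc_tokens, df, avdl, n_docs):
    """Sum the term weights of all query terms that occur in the document."""
    counts = Counter(doc_tokens)
    return sum(
        bm25_term_weight(counts[t], df[t], len(doc_tokens), avdl, n_docs)
        for t in query_terms
        if counts[t] > 0
    )
```

With this IDF variant the weight becomes negative for terms that occur in more than half of the documents, which is one reason practical systems often clamp or smooth the IDF.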

  26. Other Ranking Formulas Derived • Ranking formulas (highly simplified versions) derived from the continuous GBM model with various gravitational field functions

  27. Check with Heuristic Constraints • [Fang et al, SIGIR’04]: Some heuristic constraints related to TF, IDF, and document length that all reasonable ranking formulas should satisfy • TFC1, TFC2 • TDC, M-TDC • LNC1, LNC2 • TF-LNC • All our derived term-weighting functions satisfy all of the above constraints.

  28. Preliminary Experiments • Experimental Setup • Corpora characteristics • Query sets used in the experiments

  29. Preliminary Experiments • Experimental Results • Optimal performance comparison among some formulas over various corpora and tasks (measure: mean average precision)

  30. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  31. Structured Document Retrieval • A document is said to be structured here when it contains multiple fields. • Current approaches for structured document retrieval • Score combination • The most commonly used and well-studied approach • Rank combination is a special case of score combination • Term-frequency combination • [Robertson et al, CIKM’04]: An extension of BM25 • [Ogilvie et al, SIGIR’03]: Linearly combining language models • Each approach works moderately well, but…

  32. Score Combination Issues • For a multi-term query, a document matching a single query term over many fields could get an unreasonably higher score than another document that matches all the query terms in a few fields. (See discussions in [Robertson et al, CIKM’04]) • Example: score(d1) = s + s + s + … + s = 8s; score(d2) = 2s + 2s + 0 + … + 0 = 4s; hence score(d1) > score(d2), which is unreasonable
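
The arithmetic in this example can be replayed with a small sketch of linear score combination. The helper `combined_score`, the choice of eight equally weighted fields (implied by the total of 8s), and the value s = 1.0 are assumptions made for illustration.

```python
def combined_score(field_scores, field_weights=None):
    """Linear score combination: weighted sum of per-field scores."""
    if field_weights is None:
        field_weights = [1.0] * len(field_scores)  # assume equal field weights
    return sum(w * s for w, s in zip(field_weights, field_scores))


s = 1.0  # hypothetical score for matching one query term in one field

# d1 matches a single query term in each of eight fields.
d1_fields = [s] * 8
# d2 matches both query terms in two fields and nothing elsewhere.
d2_fields = [2 * s, 2 * s] + [0.0] * 6

score_d1 = combined_score(d1_fields)
score_d2 = combined_score(d2_fields)
```

Because the per-field scores are simply added, d1 outranks d2 even though it never matches the second query term, which is the behaviour the slide calls unreasonable.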

  33. TF Combination Issues • Consider a single-term query Q=t, and some documents with two fields (F1, F2). Assume w1 = weight(F1) = 5 and w2 = weight(F2) = 1. • Example-1 (assuming |d1|=|d2|): tf(t,d1) = w1 * 1 + w2 * 0 = 5; tf(t,d2) = w1 * 0 + w2 * 6 = 6; score(d1) < score(d2): reasonable • Example-2 (assuming |d3|=|d4|): tf(t,d3) = w1 * 1 + w2 * 8 = 13; tf(t,d4) = w1 * 0 + w2 * 14 = 14; score(d3) < score(d4): unreasonable • Larger w1? • Can’t remove this issue • Potential risk of making the case of Example-1 unreasonable
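
The two examples can be checked with a short sketch of weighted term-frequency combination. The per-field counts are the operands given on the slide (for d4, 0 occurrences in F1 and 14 in F2, which combine to 14 when w2 = 1). The helper names, the integer range scanned for w1, and the reading that d3 should outrank d4 are assumptions. The sketch also assumes equal document lengths within each pair and a score that grows with the combined tf.

```python
def combined_tf(field_tfs, field_weights):
    """Weighted sum of per-field term frequencies."""
    return sum(w * tf for w, tf in zip(field_weights, field_tfs))


# Occurrences of t per field as (F1, F2), taken from the two examples.
d1, d2 = (1, 0), (0, 6)
d3, d4 = (1, 8), (0, 14)


def orderings(w1, w2=1):
    """Return (example1_ok, example2_ok) for the given field weights."""
    weights = (w1, w2)
    # Example-1 is considered reasonable when d2 outranks d1.
    example1_ok = combined_tf(d1, weights) < combined_tf(d2, weights)
    # Example-2 would be reasonable if d3 outranked d4 (assumed reading).
    example2_ok = combined_tf(d3, weights) > combined_tf(d4, weights)
    return example1_ok, example2_ok


# Scan integer values of w1 for one that makes both examples come out right.
fixes_both = [w1 for w1 in range(1, 21) if all(orderings(w1))]
```

Under these assumptions `fixes_both` is empty: any w1 large enough to repair Example-2 reverses the ordering in Example-1, which matches the slide's remark about the risk of a larger w1.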

  34. Structured Document Retrieval by GBM

  35. Experimental Results • Performance comparison of different approaches for the combination of body and title fields

  36. Outline • Background • Gravitation-based Model • Notations & Basic Concepts • Discrete GBM Model • Continuous GBM Model • Model analysis • GBM Model for Structured Document Retrieval • Summary

  37. Summary • Viewing IR from a different viewpoint is as important as going deeper within traditional perspectives. • This paper may be a first step toward taking a physics viewpoint • It is encouraging that we can really benefit from nature • A family of effective ranking functions is derived • BM25 is given a physical interpretation • A more reasonable approach to structured document retrieval is obtained

  38. Sorry, Sir Isaac Newton. Hope I am not abusing your laws.

  39. The End • Gravitation-Based Model for Information Retrieval • Please send your comments to: shumings@microsoft.com
