
Term Weighting approaches in automatic text retrieval.


Presentation Transcript


  1. Term Weighting approaches in automatic text retrieval. Presented by Ehsan

  2. References • Modern Information Retrieval: textbook • Slides on the Vectorial Model by Dr. Rada • The paper itself

  3. The main idea • A text indexing system based on weighted single terms performs better than one based on more complex text representations • Effective term weighting is therefore of crucial importance.

  4. Basic IR • Attach content identifiers to both stored texts and user queries • A content identifier/term is a word or a group of words extracted from the documents/queries • Underlying assumption • The semantics of the documents and queries can be expressed by these terms

  5. Two things to consider • What is an appropriate content identifier? • Are all identifiers of the same importance? • If not, how can we discriminate one term from the others?

  6. Choosing content identifiers • Use single terms/words as individual identifiers • Use a more complex text representation as the identifier • An example (two texts that share the same key terms yet differ completely in meaning): • “Industry is the mother of good luck” • Mother said, “Good luck”.

  7. Complex text representation • A set of related terms based on statistical co-occurrence • Term phrases consisting of one or more governing terms (the head of the phrase) together with the corresponding dependent terms • Grouping words under a common heading, as in a thesaurus • Constructing a knowledge base to represent the content of the subject area

  8. What is better: single or complex terms? • Constructing complex text representations is inherently difficult • It requires sophisticated syntactic/statistical analysis programs • An example: using term phrases gives about a 20% improvement in some cases • In other cases the results are quite discouraging • Knowledge bases: effective vocabulary tools covering subject areas of reasonable scope are still under development • Conclusion: using single terms as content identifiers is preferable

  9. The second issue • How do we discriminate between terms? • With term weights, of course! • Effectiveness of an IR system • Documents with relevant items must be retrieved • Documents with irrelevant/extraneous items must be rejected

  10. Precision and Recall • Recall • The number of relevant documents retrieved divided by the total number of relevant documents in the collection • Precision • Out of the documents retrieved, the fraction that are relevant • Our goal • High recall, to retrieve as many relevant documents as possible • High precision, to reject extraneous documents • Basically, it is a trade-off (both measures are sketched below)
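Both measures are easy to state precisely in code. A minimal Python sketch, with hypothetical document-id sets chosen only for illustration:

    def precision_recall(retrieved, relevant):
        """Compute precision and recall for one query.

        retrieved: set of document ids returned by the system
        relevant:  set of document ids judged relevant
        """
        hits = retrieved & relevant
        precision = len(hits) / len(retrieved) if retrieved else 0.0
        recall = len(hits) / len(relevant) if relevant else 0.0
        return precision, recall

    # Hypothetical run: 3 of the 4 retrieved documents are relevant
    # (precision = 0.75), but only 3 of the 6 relevant documents were
    # found (recall = 0.5).
    p, r = precision_recall({1, 2, 3, 4}, {2, 3, 4, 7, 8, 9})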

  11. Weighting mechanism • To get high recall • Term frequency, tf • But when high-frequency terms are prevalent in the whole document collection, high tf alone retrieves nearly every document • To get high precision • Inverse document frequency, idf • Varies inversely with the number of documents n in which the term appears • idf is given by log2(N/n), where N is the total number of documents • To discriminate terms • We use tf × idf (see the sketch below)
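A minimal sketch of this weight in Python, using the slide's idf = log2(N/n); the counts are hypothetical:

    import math

    def tf_idf(tf, n, N):
        """tf x idf as defined on the slide: idf = log2(N / n), where N is
        the collection size and n is the number of documents containing
        the term."""
        return tf * math.log2(N / n)

    # A term with tf = 5 occurring in 10 of 1000 documents scores
    # 5 * log2(100) ~= 33.2, while the same tf for a term occurring in
    # 500 of 1000 documents scores only 5 * log2(2) = 5.0, so terms
    # common to the whole collection are suppressed.
    w_rare = tf_idf(5, 10, 1000)
    w_common = tf_idf(5, 500, 1000)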

  12. Two more things to consider • The current tf × idf mechanism favors longer documents • Introduce a normalizing factor in the weight to equalize document lengths • Probabilistic model • The term weight is the proportion of relevant documents in which a term occurs divided by the proportion of irrelevant items in which the term occurs • It is given by log((N-n)/n) (both ideas are sketched below)
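Both refinements can be sketched directly from the slide's formulas (the vectors are hypothetical, and log2 is used to match the idf on the previous slide; the slide leaves the base unspecified):

    import math

    def cosine_normalize(weights):
        """Divide every weight by the vector's Euclidean length, so a long
        document cannot score highly just by containing more terms."""
        norm = math.sqrt(sum(w * w for w in weights))
        return [w / norm for w in weights] if norm else weights

    def prob_weight(n, N):
        """Probabilistic collection weight from the slide: log((N - n) / n)."""
        return math.log2((N - n) / n)

    # Two documents with the same term proportions but different lengths
    # get identical normalized vectors:
    cosine_normalize([3.0, 4.0])   # -> [0.6, 0.8]
    cosine_normalize([6.0, 8.0])   # -> [0.6, 0.8]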

  13. Term weighting components • Term frequency components: b, t, n • Collection frequency components: x, f, p • Normalization components: x, c • What would be the weighting system given by tfc.nfx? (see the decoder sketch below)
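Reconstructing the paper's component definitions (b = binary, t = raw tf, n = augmented tf 0.5 + 0.5*tf/max_tf; x = none, f = idf, p = probabilistic; x = no normalization, c = cosine), a sketch that decodes a three-letter code into a per-term weight; cosine normalization applies to the whole vector, so it is left to the caller:

    import math

    def term_weight(code, tf, max_tf, n, N):
        """Weight one term under a three-letter scheme such as 'tfc' or 'nfx'.

        code[0]: tf component         -- b (binary), t (raw), n (augmented)
        code[1]: collection component -- x (none), f (log2(N/n)),
                                         p (log2((N-n)/n))
        code[2]: normalization        -- x (none), c (cosine; applied over
                 the full vector by the caller, so not computed here)
        """
        tf_part = {"b": 1.0,
                   "t": float(tf),
                   "n": 0.5 + 0.5 * tf / max_tf}[code[0]]
        cf_part = {"x": 1.0,
                   "f": math.log2(N / n),
                   "p": math.log2((N - n) / n)}[code[1]]
        return tf_part * cf_part

    # tfc.nfx: document terms get raw tf * idf (cosine-normalized later),
    # query terms get augmented tf * idf with no normalization.
    doc_w = term_weight("tfc", tf=3, max_tf=7, n=10, N=1000)
    query_w = term_weight("nfx", tf=1, max_tf=1, n=10, N=1000)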

  14. Experimental evidence • Query vectors • For tf • Short queries: use n • Long queries: use t • For idf • Use f • For normalization • Use x

  15. Experimental evidence • Document vectors • For tf • Technical vocabulary: use n • More varied vocabulary: use t • For idf • Use f in general • For documents from different domains, use x • For normalization • Documents with heterogeneous lengths: use c • Homogeneous lengths: use x

  16. Conclusion • Best document weighting: tfc, nfc (or tpc, npc) • Best query weighting: nfx, tfx, bfx (or npx, tpx, bpx) • A combined example follows below • Questions?
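To see how the recommended document and query weightings fit together, a hypothetical end-to-end sketch scoring one document against one query under tfc.nfx:

    import math

    N = 1000                                        # hypothetical collection size
    idf = [math.log2(N / 10), math.log2(N / 400)]   # a rare and a common term

    # Document side (tfc): raw tf * idf, then cosine normalization.
    doc = [3 * idf[0], 5 * idf[1]]
    norm = math.sqrt(sum(w * w for w in doc))
    doc = [w / norm for w in doc]

    # Query side (nfx): augmented tf * idf, no normalization.
    q_tf, q_max = [1, 1], 1
    query = [(0.5 + 0.5 * t / q_max) * i for t, i in zip(q_tf, idf)]

    # Retrieval score: inner product of document and query vectors.
    score = sum(d * q for d, q in zip(doc, query))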
