
Automatic Indexing



  1. Automatic Indexing Hsin-Hsi Chen

  2. Indexing • indexing: assign identifiers to text items • assign: manual vs. automatic indexing • identifiers: • objective vs. nonobjective text identifiers (objective identifiers, e.g., author names, publisher names, dates of publication, …, are defined by cataloging rules) • controlled vs. uncontrolled vocabularies (controlled vocabularies are defined in instruction manuals, terminological schedules, …) • single-term vs. term-phrase identifiers

  3. Two Issues • Issue 1: indexing exhaustivity • exhaustive: assign a large number of terms • nonexhaustive: assign only a few terms • Issue 2: term specificity • broad terms (generic): cannot distinguish relevant from nonrelevant items • narrow terms (specific): retrieve relatively fewer items, but most of them are relevant

  4. Parameters of Retrieval Effectiveness • Recall: the proportion of relevant items that are retrieved • Precision: the proportion of retrieved items that are relevant • Goal: high recall and high precision

  5. Retrieved Part

                     Relevant Items    Nonrelevant Items
      Retrieved            a                   b
      Not Retrieved        c                   d

      Recall = a / (a + c),  Precision = a / (a + b)

  6. A Joint Measure • F-score: F = ((β² + 1) · P · R) / (β² · R + P), where P is precision and R is recall • β is a parameter that encodes the relative importance of recall and precision • β = 1: equal weight • β > 1: precision is more important • β < 1: recall is more important
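
A minimal sketch in Python of these evaluation measures, built on the contingency-table counts a, b, c from slide 5 (the counts in the example call are made up); the F-score follows the slide's stated convention that β > 1 favors precision.

    def recall(a, c):
        # a: relevant retrieved, c: relevant not retrieved.
        return a / (a + c)

    def precision(a, b):
        # a: relevant retrieved, b: nonrelevant retrieved.
        return a / (a + b)

    def f_score(p, r, beta=1.0):
        # F = ((beta^2 + 1) * P * R) / (beta^2 * R + P)
        return ((beta ** 2 + 1) * p * r) / (beta ** 2 * r + p)

    # Toy counts: 8 relevant retrieved, 2 nonrelevant retrieved, 12 relevant missed.
    p, r = precision(8, 2), recall(8, 12)
    print(p, r, f_score(p, r))  # 0.8 0.4 0.5333...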

  7. Choices of Recall and Precision • Both recall and precision vary from 0 to 1. • In principle, the average user wants to achieve both high recall and high precision. • In practice, a compromise must be reached because simultaneously optimizing recall and precision is not normally achievable.

  8. Choices of Recall and Precision (Continued) • Particular choices of indexing and search policies have produced performance ranging from 0.8 precision with 0.2 recall to 0.1 precision with 0.8 recall. • In many circumstances, recall and precision values between 0.5 and 0.6 are satisfactory for the average user.

  9. Term-Frequency Consideration • Function words • for example, "and", "or", "of", "but", … • the frequencies of these words are high in all texts • Content words • words that actually relate to document content • varying frequencies in the different texts of a collection • indicate term importance for content

  10. A Frequency-Based Indexing Method • Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words. • Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di. • Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T.
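
A minimal sketch of this three-step method in Python; the stop list and the threshold value are toy choices, not the ones used in any particular system.

    from collections import Counter

    STOP_LIST = {"and", "or", "of", "but", "the", "a", "is", "in", "for", "are"}

    def index_terms(document_text, threshold=1):
        # Step 1: eliminate common function words via the stop list.
        words = [w for w in document_text.lower().split() if w not in STOP_LIST]
        # Step 2: compute the term frequency tf_ij of each remaining term.
        tf = Counter(words)
        # Step 3: assign every term whose frequency exceeds the threshold T.
        return {term: freq for term, freq in tf.items() if freq > threshold}

    print(index_terms("information retrieval systems and retrieval of information"))
    # {'information': 2, 'retrieval': 2}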

  11. Discussions • high-frequency terms favor recall • high precision requires the ability to distinguish individual documents from each other • a high-frequency term is good for precision only when its frequency is not equally high in all documents

  12. Inverse Document Frequency • Inverse Document Frequency (IDF) for term Tj: idfj = log(N / dfj), where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs and N is the number of documents in the collection • terms that fulfil both the recall and the precision goals occur frequently in individual documents but rarely in the remainder of the collection

  13. New Term Importance Indicator • weight wij of a term Tj in a document Di: wij = tfij × idfj = tfij × log(N / dfj) • Eliminating common function words • Computing the value of wij for each term Tj in each document Di • Assigning to the documents of a collection all terms with sufficiently high (tf×idf) factors
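
A minimal sketch of tf × idf weighting over a toy three-document collection; the corpus and the natural-log base are illustrative assumptions.

    import math
    from collections import Counter

    docs = [
        "information retrieval systems",
        "database systems",
        "information extraction",
    ]
    tokenized = [d.split() for d in docs]
    N = len(tokenized)
    # df_j: number of documents in which term T_j occurs.
    df = Counter(term for doc in tokenized for term in set(doc))

    def tfidf(doc_tokens):
        # w_ij = tf_ij * log(N / df_j)
        tf = Counter(doc_tokens)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    print(tfidf(tokenized[0]))
    # 'retrieval' (df=1) scores log 3; 'information' and 'systems' (df=2) score log 1.5.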

  14. Term-Discrimination Value • Useful index terms distinguish the documents of a collection from each other • Document Space • two documents are assigned very similar term sets when the corresponding points in the document configuration appear close together • when a high-frequency term with no discriminating power is assigned, it increases the document space density

  15. A Virtual Document Space • [figure: the document space in its original state, after assignment of a good discriminator, and after assignment of a poor discriminator]

  16. Good Term Assignment • When a term is assigned to the documents of a collection, the few items to which the term is assigned will be distinguished from the rest of the collection. • This should increase the average distance between the items in the collection and hence produce a document space less dense than before.

  17. Poor Term Assignment • A high-frequency term is assigned that does not discriminate between the items of a collection. • Its assignment will render the documents more similar to each other. • This is reflected in an increase in document space density.

  18. Term-Discrimination Value • definition: dvj = Q - Qj, where Q and Qj are the space densities before and after the assignment of term Tj • dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term

  19. Variations of Term-Discrimination Value with Document Frequency • [figure: dvj as a function of document frequency, from low frequency up to N] • low frequency: dvj ≈ 0 (candidates for thesaurus transformation) • medium frequency: dvj > 0 • high frequency: dvj < 0 (candidates for phrase transformation)

  20. Another Term Weighting • wij = tfij × dvj • compared with wij = tfij × idfj: • idfj decreases steadily with increasing document frequency • dvj increases from zero to a positive value as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger

  21. Document Centroid • Issue: efficiency problem; computing the space density from all pairwise similarities requires N(N-1) comparisons • document centroid: C = (c1, c2, c3, ..., ct) with cj = (1/N) Σk dkj, where dkj is the weight of the j-th term in document k • space density: Q = (1/N) Σk sim(C, Dk), the average similarity between each document and the centroid
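
A minimal sketch tying slides 18-21 together: space density is computed through the centroid in O(N) similarity computations, and the discrimination value of a term is obtained by comparing the density before and after the term is assigned. The toy vectors and the use of cosine similarity are assumptions.

    import math

    def cosine(u, v):
        # Cosine similarity between two term-weight vectors.
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def centroid(docs):
        # C = (c1, ..., ct) with cj = (1/N) * sum_k d_kj.
        n = len(docs)
        return [sum(col) / n for col in zip(*docs)]

    def density(docs):
        # Q = (1/N) * sum_k sim(C, D_k): O(N) instead of N(N-1) pairwise similarities.
        c = centroid(docs)
        return sum(cosine(c, d) for d in docs) / len(docs)

    def discrimination_value(docs, j):
        # dv_j = Q - Q_j: density before assigning term j minus density after.
        before = [[w if k != j else 0.0 for k, w in enumerate(d)] for d in docs]
        return density(before) - density(docs)

    # Term 2 occurs with equally high frequency in every document:
    # a poor discriminator, so dv < 0.
    docs = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 2.0]]
    print(discrimination_value(docs, 2))  # negative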

  22. Discussions • dvj and idfj are global properties of terms in a document collection • ideal: term characteristics that differ between the relevant and nonrelevant items of a collection

  23. Probabilistic Term Weighting • Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant items of a collection • Definition: consider a collection of document vectors of the form x = (x1, x2, x3, ..., xt)

  24. Probabilistic Term Weighting • Pr(x|rel), Pr(x|nonrel): occurrence probabilities of item x in the relevant and nonrelevant document sets • Pr(rel), Pr(nonrel): the item's a priori probabilities of relevance and nonrelevance • Further assumptions: terms occur independently in relevant documents; terms occur independently in nonrelevant documents

  25. Derivation Process • rank documents by the log-odds of relevance: g(x) = log( Pr(rel|x) / Pr(nonrel|x) ) = log( Pr(x|rel)·Pr(rel) / (Pr(x|nonrel)·Pr(nonrel)) ), by Bayes' rule • under the independence assumptions, Pr(x|rel) = Πi Pr(xi|rel) and Pr(x|nonrel) = Πi Pr(xi|nonrel)

  26. Given a document D = (d1, d2, …, dt), the retrieval value of D is: g(D) = Σi log( Pr(xi = di | rel) / Pr(xi = di | nonrel) ), where di is the term weight of term xi.

  27. Assume di is either 0 or 1: • di = 0: the i-th term is absent from D • di = 1: the i-th term is present in D • pi = Pr(xi=1|rel), 1 - pi = Pr(xi=0|rel) • qi = Pr(xi=1|nonrel), 1 - qi = Pr(xi=0|nonrel)

  28. The retrieval value of each term Tj present in a document (i.e., dj = 1) is: log( pj·(1-qj) / (qj·(1-pj)) ), the term relevance weight trj.

  29. New Term Weighting • term-relevance weight of term Tj: trj • indexing value of term Tj in document Di: wij = tfij × trj • Issue: it is necessary to characterize both the relevant and nonrelevant documents of a collection • how can a representative document sample be found? • use feedback information from retrieved documents?
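
A minimal sketch of the term-relevance weight and the combined indexing weight; the probability values in the example call are hypothetical.

    import math

    def term_relevance(p, q):
        # tr_j = log( p_j (1 - q_j) / (q_j (1 - p_j)) )
        return math.log(p * (1 - q) / (q * (1 - p)))

    def weight(tf, p, q):
        # w_ij = tf_ij * tr_j
        return tf * term_relevance(p, q)

    # p: occurrence probability of the term in relevant documents;
    # q: occurrence probability in nonrelevant documents (toy values).
    print(weight(tf=3, p=0.5, q=0.1))  # ~6.59: frequent term, rare outside the relevant set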

  30. Estimation of Term-Relevance • Little is known in advance about the relevance properties of terms. • The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection: qj = dfj / N • The occurrence probability of a term in the small number of relevant items is approximated by a constant value, pj = 0.5 for all j.

  31. With pj = 0.5 and qj = dfj / N: trj = log( pj(1-qj) / (qj(1-pj)) ) = log( (N - dfj) / dfj ). When N is sufficiently large, N - dfj ≈ N, so trj ≈ log( N / dfj ) = idfj.

  32. Estimation of Term-Relevance • Estimate the number of relevant items rj in the collection that contain term Tj as a function of the known document frequency dfj of the term Tj: pj = rj / R, qj = (dfj - rj) / (N - R), where R is an estimate of the total number of relevant items in the collection.
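
A minimal sketch of the two estimation strategies from slides 30-32; all counts (N, dfj, R, rj) are hypothetical, and real implementations commonly add a 0.5 adjustment to avoid zero counts.

    import math

    def tr_no_feedback(N, df):
        # p_j = 0.5 and q_j = df_j / N  =>  tr_j = log((N - df_j) / df_j) ~ idf_j.
        return math.log((N - df) / df)

    def tr_with_feedback(N, df, R, r):
        # p_j = r_j / R, q_j = (df_j - r_j) / (N - R).
        p = r / R
        q = (df - r) / (N - R)
        return math.log(p * (1 - q) / (q * (1 - p)))

    print(tr_no_feedback(N=10000, df=50))               # ~5.29, close to idf
    print(tr_with_feedback(N=10000, df=50, R=20, r=10)) # ~5.52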

  33. Term Relationships in Indexing • Single-term indexing • Single terms are often ambiguous. • Many single terms are either too specific or too broad to be useful. • Complex text identifiers • subject experts and trained indexers • linguistic analysis algorithms, e.g., NP chunker • term-grouping or term clustering methods

  34. Tree-Dependence Model • Only certain dependent term pairs are actually included; the other term pairs and all higher-order term combinations are disregarded. • Example: a sample term-dependence tree

  35. Pairs from the sample term-dependence tree: • included (tree edges): (children, school), (school, girls), (school, boys) • disregarded pairs: (children, girls), (children, boys) • disregarded higher-order combinations: (school, girls, boys), (children, achievement, ability)

  36. Term Classification (Clustering) • [figure: term-by-document assignment matrix; columns correspond to terms, rows to documents]

  37. Term Classification (Clustering) • Column part: group terms whose corresponding column representations reveal similar assignments to the documents of the collection. • Row part: group documents that exhibit sufficiently similar term assignments.
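
A minimal sketch of the column-part grouping: terms whose columns in the term-document matrix are sufficiently similar land in the same class. The matrix, the single-pass grouping strategy, and the threshold are illustrative assumptions.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

    # Rows are documents, columns are terms (toy assignment matrix).
    matrix = [
        [1, 1, 0, 0],
        [1, 1, 0, 1],
        [0, 0, 1, 1],
    ]
    columns = list(zip(*matrix))

    def group_terms(columns, threshold=0.8):
        # Single-pass grouping: a term joins the first class whose
        # representative column is similar enough, else starts a new class.
        groups = []
        for j, col in enumerate(columns):
            for g in groups:
                if cosine(col, columns[g[0]]) >= threshold:
                    g.append(j)
                    break
            else:
                groups.append([j])
        return groups

    print(group_terms(columns))  # [[0, 1], [2], [3]]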

  38. Linguistic Methodologies • Indexing phrases: nominal constructions including adjectives and nouns • Assign syntactic class indicators (i.e., parts of speech) to the words occurring in document texts. • Construct word phrases from sequences of words exhibiting certain allowed syntactic markers (noun-noun and adjective-noun sequences).
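
A minimal sketch of this syntactic-marker filter; the hand-written POS lexicon stands in for a real tagger, and the allowed patterns follow the slide (adjective-noun and noun-noun).

    # Toy POS lexicon; a real system would use a trained tagger.
    LEXICON = {
        "effective": "ADJ", "retrieval": "NOUN", "systems": "NOUN",
        "information": "NOUN", "science": "NOUN", "computer": "NOUN",
    }
    ALLOWED = {("ADJ", "NOUN"), ("NOUN", "NOUN")}  # adjective-noun, noun-noun

    def phrases(words):
        tags = [LEXICON.get(w, "OTHER") for w in words]
        return [
            words[i] + " " + words[i + 1]
            for i in range(len(words) - 1)
            if (tags[i], tags[i + 1]) in ALLOWED
        ]

    print(phrases("effective retrieval systems".split()))
    # ['effective retrieval', 'retrieval systems']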

  39. Term-Phrase Formation • Term phrase: a sequence of related text words that carries a more specific meaning than its single terms, e.g., "computer science" vs. "computer" • [figure repeated from slide 19: variations of dvj with document frequency; high-frequency terms are candidates for phrase transformation]

  40. Simple Phrase-Formation Process • the principal phrase component (phrase head): a term with a document frequency exceeding a stated threshold, or exhibiting a negative discriminator value • the other components of the phrase: medium- or low-frequency terms with stated co-occurrence relationships with the phrase head • common function words: not used in the phrase-formation process

  41. An Example • Effective retrieval systems are essential for people in need of information. • "are", "for", "in", and "of": common function words • "system", "people", and "information": phrase heads

  42. The Formatted Term-Phrases • [table: candidate phrases formed from the sentence on slide 41 around the phrase heads; the ratios 2/5 and 5/12 give the fraction of phrases marked *, i.e., assumed useful for content identification]
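
A minimal sketch of the simple phrase-formation process applied to the sentence of slide 41; the stop list is taken from that slide, and the phrase heads are assumed in their surface forms ("systems", "people", "information").

    STOP_LIST = {"are", "for", "in", "of"}               # function words, slide 41
    PHRASE_HEADS = {"systems", "people", "information"}  # phrase heads, slide 41

    def form_phrases(text):
        # Drop function words, then keep adjacent word pairs in which
        # at least one member is a phrase head.
        words = [w for w in text.lower().split() if w not in STOP_LIST]
        return [a + " " + b for a, b in zip(words, words[1:])
                if a in PHRASE_HEADS or b in PHRASE_HEADS]

    sentence = "effective retrieval systems are essential for people in need of information"
    print(form_phrases(sentence))
    # ['retrieval systems', 'systems essential', 'essential people',
    #  'people need', 'need information']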

  43. The Problems • A phrase-formation process controlled only by word co-occurrences and the document frequencies of certain words is not likely to generate a large number of high-quality phrases. • Additional syntactic criteria for phrase heads and phrase components may provide further control in phrase formation.

  44. Additional Term-Phrase Formation Steps • Syntactic class indicators are assigned to the terms, and phrase formation is limited to sequences of specified syntactic markers, such as adjective-noun and noun-noun sequences (adverb-adjective and adverb-noun sequences are excluded). • The phrase elements are all chosen from within the same syntactic unit, such as the subject phrase, object phrase, or verb phrase.

  45. Consider Syntactic Units • effective retrieval systems are essential for people in need of information • subject phrase: effective retrieval systems • verb phrase: are essential • object phrase: people in need of information

  46. Phrases within Syntactic Components • [subj effective retrieval systems] [vp are essential] for [obj people need information] • Adjacent phrase heads and components within syntactic components: • retrieval systems* • people need • need information* • Phrase heads and components that co-occur within syntactic components: • effective systems • (2/3 of the adjacent phrases, marked *, are useful)

  47. Problems • More stringent phrase-formation criteria produce fewer phrases, both good and bad, than less stringent methodologies. • Prepositional phrase attachment, e.g., The man saw the girl with the telescope. • Anaphora resolution, e.g., He dropped the plate on his foot and broke it.

  48. Problems (Continued) • Any phrase-matching system must be able to deal with the problems of: • synonym recognition • differing word orders • intervening extraneous words • Example • retrieval of information vs. information retrieval
