
Text Operations


Presentation Transcript


  1. Text Operations J. H. Wang Feb. 21, 2008

  2. The Retrieval Process [Diagram: the user need enters through the User Interface; Text Operations produce the logical view of both the documents and the query; Query Operations build the query, which Searching runs against the Index (an inverted file built by Indexing through the DB Manager Module over the Text Database); the retrieved docs are ordered by Ranking into ranked docs, and user feedback refines the query.]

  3. Outline • Document Preprocessing (7.1-7.2) • Text Compression (7.4-7.5): skipped • Automatic Indexing (Chap. 9, Salton) • Term Selection

  4. Document Preprocessing • Lexical analysis • Letters, digits, punctuation marks, … • Stopword removal • “the”, “of”, … • Stemming • Prefix, suffix • Index term selection • Noun • Construction of term categorization structure • Thesaurus

  5. Logical View of the Documents [Diagram: docs flow through structure recognition, accents and spacing, stopword removal, noun groups, stemming, and manual indexing, moving from the full text toward a set of index terms.]

  6. Lexical Analysis • Converting a stream of characters into a stream of words • Recognition of words • Digits: usually not good index terms • Ex.: The number of deaths due to car accidents between 1910 and 1989, “510B.C.”, credit card numbers, … • Hyphens • Ex: state-of-the-art, gilt-edge, B-49, … • Punctuation marks: normally removed entirely • Ex: 510B.C., program codes: x.id vs. xid, … • The case of letters: usually not important • Ex: Bank vs. bank, Unix-like operating systems, …
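
The lexical-analysis choices listed on this slide can be expressed as a small tokenizer. The sketch below is only illustrative and assumes one particular set of decisions (lowercase everything, keep hyphenated words, drop tokens that are pure digits); the function name tokenize and the regular expression are not from the slides.

    import re

    def tokenize(text):
        """Turn a character stream into index-term candidates.

        Illustrative choices: lowercase everything, keep hyphenated words
        such as "state-of-the-art", and drop tokens that are pure digits.
        """
        # A word may contain letters, digits, and internal hyphens.
        tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
        result = []
        for tok in tokens:
            tok = tok.lower()      # the case of letters is usually not important
            if tok.isdigit():      # digits alone are usually poor index terms
                continue
            result.append(tok)
        return result

    print(tokenize("The B-49 flew 510 miles; a state-of-the-art, Unix-like system."))
    # ['the', 'b-49', 'flew', 'miles', 'a', 'state-of-the-art', 'unix-like', 'system']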

  7. Elimination of Stopwords • Stopwords: words that are too frequent among the documents in the collection and are therefore not good discriminators • Articles, prepositions, conjunctions, … • Some verbs, adverbs, and adjectives • Goal: to reduce the size of the indexing structure • Stopword removal might reduce recall • Ex: “to be or not to be”

  8. Stemming • The substitution of the words by their respective stems • Ex: plurals, gerund forms, past tense suffixes, … • A stem is the portion of a word which is left after the removal of its affixes (i.e., prefixes and suffixes) • Ex: connect, connected, connecting, connection, connections • Controversy about the benefits • Useful for improving retrieval performance • Reducing the size of the indexing structure

  9. Stemming • Four types of stemming strategies • Affix removal, table lookup, successor variety, and n-grams (or term clustering) • Suffix removal • Porter’s algorithm (available in the Appendix) • Simplicity and elegance
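
Porter’s algorithm itself is a multi-step set of suffix rules; the fragment below is only a toy suffix stripper, not Porter’s algorithm, meant to illustrate the idea of affix removal. The suffix list and the name simple_stem are invented for this sketch.

    def simple_stem(word):
        """Toy suffix-removal stemmer (NOT Porter's algorithm).

        Strips a few common English suffixes so that related forms such as
        connect, connected, connecting, connection conflate to one stem.
        """
        for suffix in ("ions", "ion", "ings", "ing", "ed", "s"):
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    for w in ["connect", "connected", "connecting", "connection", "connections"]:
        print(w, "->", simple_stem(w))
    # all five forms reduce to the stem "connect"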

  10. Index Term Selection • Manually or automatically • Identification of noun groups • Most of the semantics is carried by the noun words • Systematic elimination of verbs, adjectives, adverbs, connectives, articles, and pronouns • A noun group is a set of nouns whose syntactic distance in the text does not exceed a predefined threshold

  11. Thesauri • Thesaurus: a reference to a treasury of words • A precompiled list of important words in a given domain of knowledge • For each word in this list, a set of related words • Ex: synonyms, … • It also involves normalization of vocabulary, and a structure

  12. Example Entry in Peter Roget’s Thesaurus • Cowardly, adjective • Ignobly lacking in courage: cowardly turncoats. • Syns: chicken (slang), chicken-hearted, craven, dastardly, faint-hearted, gutless, lily-livered, pusillanimous, unmanly, yellow (slang), yellow-bellied (slang).

  13. Main Purposes of a Thesaurus • To provide a standard vocabulary for indexing and searching • To assist users with locating terms for proper query formulation • To provide classified hierarchies that allow the broadening and narrowing of the current query request according to the user needs

  14. Motivation for Building a Thesaurus • Using a controlled vocabulary for the indexing and searching • Normalization of indexing concepts • Reduction of noise • Identification of indexing terms with a clear semantic meaning • Retrieval based on concepts rather than on words • Ex: term classification hierarchy in Yahoo!

  15. Main Components of a Thesaurus • Index terms: individual words, groups of words, phrases • Concept • Ex: “missiles, ballistic” • Definition or explanation • Ex: seal (marine animals), seal (documents) • Relationships among the terms • BT (broader), NT (narrower) • RT (related): much more difficult to specify • A layout design for these term relationships • A list or bi-dimensional display
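
One simple way to hold these components in memory is a term record carrying BT/NT/RT links; the dictionary layout and the sample entries below are invented illustrations, not content from the slides.

    # A minimal in-memory thesaurus: each index term carries broader (BT),
    # narrower (NT), and related (RT) terms. All entries are invented examples.
    thesaurus = {
        "missiles, ballistic": {
            "BT": ["weapons"],                       # broader term
            "NT": ["ICBM"],                          # narrower term
            "RT": ["rockets", "guidance systems"],   # related terms
        },
        "seal (marine animals)": {"BT": ["mammals"], "NT": [], "RT": ["sea lions"]},
        "seal (documents)": {"BT": ["authentication"], "NT": [], "RT": ["stamps"]},
    }

    def broaden(term):
        """Broader terms, useful for widening a query (see slide 13)."""
        return thesaurus.get(term, {}).get("BT", [])

    def narrow(term):
        """Narrower terms, useful for sharpening a query."""
        return thesaurus.get(term, {}).get("NT", [])

    print(broaden("missiles, ballistic"))   # ['weapons']
    print(narrow("missiles, ballistic"))    # ['ICBM']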

  16. Automatic Indexing (Term Selection)

  17. Automatic Indexing • Indexing • assign identifiers (index terms) to text documents • Identifiers • single-term vs. term phrase • controlled vs. uncontrolled vocabularies (e.g., instruction manuals, terminological schedules, …) • objective vs. nonobjective text identifiers (objective identifiers such as author names, publisher names, and dates of publication are controlled by cataloging rules)

  18. Two Issues • Issue 1: indexing exhaustivity • exhaustive: assign a large number of terms • nonexhaustive: assign only the main aspects of subject content • Issue 2: term specificity • broad terms (generic): cannot distinguish relevant from nonrelevant documents • narrow terms (specific): retrieve relatively fewer documents, but most of them are relevant

  19. Recall vs. Precision • Recall (R) = number of relevant documents retrieved / total number of relevant documents in the collection • The proportion of relevant items retrieved • Precision (P) = number of relevant documents retrieved / total number of documents retrieved • The proportion of retrieved items that are relevant • Example: for a query such as “Taipei” [Venn diagram: all docs, relevant docs, retrieved docs]
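
As a concrete check of the two definitions, the sketch below computes recall and precision for a tiny made-up result set; the document identifiers and counts are invented for illustration.

    def recall_precision(retrieved, relevant):
        """Recall = relevant retrieved / all relevant;
        precision = relevant retrieved / all retrieved."""
        hits = len(set(retrieved) & set(relevant))
        return hits / len(relevant), hits / len(retrieved)

    # Hypothetical query "Taipei": 4 relevant documents exist in the collection,
    # 5 documents were retrieved, and 3 of the retrieved ones are relevant.
    relevant = {"d1", "d2", "d3", "d4"}
    retrieved = ["d1", "d2", "d3", "d7", "d9"]
    print(recall_precision(retrieved, relevant))   # (0.75, 0.6)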

  20. More on Recall/Precision • Simultaneously optimizing both recall and precision is not normally achievable • Narrow and specific terms: precision is favored • Broad and nonspecific terms: recall is favored • When a choice must be made between term specificity and term breadth, the former is generally preferable • High-recall, low-precision output will burden the user • Lack of precision is more easily remedied than lack of recall

  21. Term-Frequency Consideration • Function words • for example, "and", "of", "or", "but", … • the frequencies of these words are high in all texts • Content words • words that actually relate to document content • varying frequencies in the different texts of a collection • indicate term importance for content

  22. A Frequency-Based Indexing Method • Eliminate common function words from the document texts by consulting a special dictionary, or stop list, containing a list of high-frequency function words • Compute the term frequency tfij for all remaining terms Tj in each document Di, specifying the number of occurrences of Tj in Di • Choose a threshold frequency T, and assign to each document Di all terms Tj for which tfij > T
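
The three steps of this method map directly onto a few lines of Python. The stop list, threshold, and sample document below are invented, and tokenization is reduced to a plain split for brevity.

    from collections import Counter

    STOP_LIST = {"the", "of", "and", "or", "but", "a", "on", "to"}  # toy stop list

    def frequency_index(doc_text, threshold=1):
        """Frequency-based indexing:
        1) drop function words found in the stop list,
        2) compute tf_ij for the remaining terms T_j,
        3) keep only terms whose frequency exceeds the threshold T."""
        terms = [t for t in doc_text.lower().split() if t not in STOP_LIST]
        tf = Counter(terms)
        return {term: freq for term, freq in tf.items() if freq > threshold}

    doc = "the cat sat on the mat and the cat slept on the mat"
    print(frequency_index(doc, threshold=1))   # {'cat': 2, 'mat': 2}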

  23. More on Term Frequency • High-frequency term • Recall • Ex: “Apple” • But only if its occurrence frequency is not equally high in other documents • Low-frequency term • Precision • Ex: “Huntington’s disease” • Able to distinguish the few documents in which they occur from the many from which they are absent

  24. How to Compute Weight wij? • Inverse document frequency, idfj • tfij*idfj (TFxIDF) • Term discrimination value, dvj • tfij*dvj • Probabilistic term weighting, trj • tfij*trj • Global properties of terms in a document collection

  25. Inverse Document Frequency • Inverse document frequency (IDF) for term Tj: idfj = log(N / dfj), where dfj (the document frequency of term Tj) is the number of documents in which Tj occurs and N is the total number of documents • Fulfils both the recall and the precision goals • Favors terms that occur frequently in individual documents but rarely in the remainder of the collection

  26. TFxIDF • Weight wij of a term Tj in a document Di: wij = tfij × idfj = tfij × log(N / dfj) • Procedure: eliminate common function words, compute the value of wij for each term Tj in each document Di, and assign to the documents of a collection all terms with sufficiently high (tf×idf) weights
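
Using the idf formula reconstructed above, TF×IDF weights can be computed as in the sketch below; the three-document toy corpus and the function name tfidf_weights are invented for illustration.

    import math
    from collections import Counter

    def tfidf_weights(docs):
        """Compute w_ij = tf_ij * log(N / df_j) for every term in every document."""
        n_docs = len(docs)
        df = Counter()                 # df_j: number of documents containing term j
        for doc in docs:
            df.update(set(doc))
        weights = []
        for doc in docs:
            tf = Counter(doc)
            weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
        return weights

    docs = [["apple", "banana", "apple"],
            ["banana", "cherry"],
            ["cherry", "cherry", "durian"]]
    for w in tfidf_weights(docs):
        print(w)
    # rarer terms (apple, durian) receive higher weights than common ones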

  27. Term-Discrimination Value • Useful index terms • Distinguish the documents of a collection from each other • Document space • Each point represents a particular document of the collection • The distance between two points is inversely proportional to the similarity between the respective term assignments • When two documents are assigned very similar term sets, the corresponding points in the document configuration appear close together

  28. A Virtual Document Space [Diagram: three snapshots of the document space: the original state; after assignment of a good discriminator (documents spread farther apart); and after assignment of a poor discriminator (documents pushed closer together).]

  29. Good Term Assignment • When a term is assigned to the documents of a collection, the few documents to which the term is assigned will be distinguished from the rest of the collection • This should increase the average distance between the documents in the collection and hence produce a document space less dense than before

  30. Poor Term Assignment • A high-frequency term is assigned that does not discriminate between the objects of a collection • Its assignment will render the documents more similar • This is reflected in an increase in document space density

  31. Term Discrimination Value • Definition: dvj = Q - Qj, where Q and Qj are the space densities before and after the assignment of term Tj • The space density is the average pairwise similarity between all pairs of distinct documents: Q = 1/(N(N-1)) Σi Σk≠i sim(Di, Dk) • dvj > 0: Tj is a good term; dvj < 0: Tj is a poor term.

  32. Variations of Term-Discrimination Value with Document Frequency [Plot: dvj versus document frequency from 0 to N; low-frequency terms have dvj = 0, medium-frequency terms have dvj > 0, and high-frequency terms have dvj < 0.]

  33. TFij × dvj • wij = tfij × dvj • compared with wij = tfij × idfj • idfj: decreases steadily with increasing document frequency • dvj: increases from zero to positive as the document frequency of the term increases, then decreases sharply as the document frequency becomes still larger • Issue: efficiency problem of computing N(N-1) pairwise similarities

  34. Document Centroid • Document centroid C = (c1, c2, c3, ..., ct), where cj = (1/N) Σi wij and wij is the weight of the j-th term in document Di • A “dummy” average document located in the center of the document space • The space density can then be approximated by the average similarity between each document and the centroid: Q ≈ (1/N) Σi sim(C, Di)
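
The efficiency point of slides 33-34 can be made concrete: density is approximated with N similarities against the centroid instead of N(N-1) pairwise similarities, and dvj = Q - Qj (slide 31) is obtained by recomputing the density with term Tj removed. The cosine measure, the toy vectors, and the function names below are assumptions for this sketch.

    import math

    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu and nv else 0.0

    def density(doc_vectors):
        """Approximate space density Q as the average similarity between each
        document and the centroid (N comparisons instead of N(N-1))."""
        n, t = len(doc_vectors), len(doc_vectors[0])
        centroid = [sum(d[j] for d in doc_vectors) / n for j in range(t)]
        return sum(cosine(centroid, d) for d in doc_vectors) / n

    def discrimination_value(doc_vectors, j):
        """dv_j = Q - Q_j: density before assignment of term j (term removed)
        minus density after assignment (term present); a positive value means
        the term spreads the documents apart and is a good discriminator."""
        q_after = density(doc_vectors)
        stripped = [[0 if k == j else w for k, w in enumerate(d)] for d in doc_vectors]
        q_before = density(stripped)
        return q_before - q_after

    # Three toy documents over terms [t0, t1, t2]; the weights are invented.
    docs = [[2, 1, 0], [0, 1, 3], [1, 1, 1]]
    print(discrimination_value(docs, 0))
    print(discrimination_value(docs, 1))   # t1 occurs everywhere: poorer discriminator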

  35. Probabilistic Term Weighting • Goal: explicit distinctions between occurrences of terms in relevant and nonrelevant documents of a collection • Definition: given a user query q and the ideal answer set of the relevant documents, decision theory gives the best ranking score for a document D as g(D) = log[ Pr(D|rel)·Pr(rel) / (Pr(D|nonrel)·Pr(nonrel)) ]

  36. Probabilistic Term Weighting • Pr(rel), Pr(nonrel): a document’s a priori probabilities of relevance and nonrelevance • Pr(D|rel), Pr(D|nonrel): occurrence probabilities of document D in the relevant and nonrelevant document sets

  37. Assumptions • Terms occur independently in documents

  38. Derivation Process

  39. Given a document D = (d1, d2, …, dt) • Assume di is either 0 (absent) or 1 (present) • For a specific document D: Pr(di=1|rel) = pi, Pr(di=0|rel) = 1-pi, Pr(di=1|nonrel) = qi, Pr(di=0|nonrel) = 1-qi
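
The content of the "Derivation Process" slide did not survive in this transcript; below is a standard reconstruction of the binary-independence derivation implied by slides 35-39, written in LaTeX. The name g(D) for the ranking score is an assumption of this sketch.

    % Under term independence (slide 37) and binary term indicators d_i (slide 39):
    \Pr(D \mid rel)    = \prod_{i=1}^{t} p_i^{d_i} (1-p_i)^{1-d_i}
    \qquad
    \Pr(D \mid nonrel) = \prod_{i=1}^{t} q_i^{d_i} (1-q_i)^{1-d_i}

    % Substituting into the log-odds ranking function of slide 35:
    g(D) = \log \frac{\Pr(D \mid rel)\,\Pr(rel)}{\Pr(D \mid nonrel)\,\Pr(nonrel)}
         = \sum_{i=1}^{t} d_i \log \frac{p_i (1-q_i)}{q_i (1-p_i)}
           + \sum_{i=1}^{t} \log \frac{1-p_i}{1-q_i}
           + \log \frac{\Pr(rel)}{\Pr(nonrel)}

    % The last two terms are the same for every document, so ranking depends only
    % on the first sum; its per-term coefficient is the term relevance weight tr_i.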

  40. Term Relevance Weight • trj = log[ pj(1-qj) / (qj(1-pj)) ]

  41. Issue • How to compute pj and qj? • pj = rj / R, qj = (dfj - rj) / (N - R) • rj: the number of relevant documents that contain term Tj • R: the total number of relevant documents • N: the total number of documents

  42. Estimation of Term-Relevance • The occurrence probability of a term in the nonrelevant documents, qj, is approximated by the occurrence probability of the term in the entire document collection: qj = dfj / N • The large majority of documents will be nonrelevant to the average query • The occurrence probabilities of the terms in the small number of relevant documents are assumed to be equal, using a constant value pj = 0.5 for all j
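
Putting slides 40-42 together: with pj = 0.5 and qj = dfj / N, the term relevance weight reduces to log((N - dfj) / dfj). The sketch below computes it for a few invented document frequencies.

    import math

    def term_relevance_weight(df_j, n_docs, p_j=0.5):
        """tr_j = log[ p_j (1 - q_j) / (q_j (1 - p_j)) ],
        with q_j approximated by df_j / N (slide 42)."""
        q_j = df_j / n_docs
        return math.log((p_j * (1 - q_j)) / (q_j * (1 - p_j)))

    # Invented statistics: a collection of 10,000 documents.
    for df in (10, 100, 5000):
        print(df, round(term_relevance_weight(df, 10_000), 3))
    # Rare terms (small df_j) get large weights; df_j = 5000 gives log(1) = 0.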

  43. Comparison • With pj = 0.5 and qj = dfj / N, the term relevance weight becomes trj = log[ (1 - dfj/N) / (dfj/N) ] = log[ (N - dfj) / dfj ] • When N is sufficiently large, N - dfj ≈ N, so trj ≈ log(N / dfj) = idfj
