Collocations and Terminology
Vasileios Hatzivassiloglou
University of Texas at Dallas
Collocations
• Frank Smadja, “Retrieving Collocations from Text”, Computational Linguistics, 1993
• Recurrent combinations of words that co-occur more often than chance, often with non-compositional meaning
• Technical and non-technical
Examples of collocations (* marks an unacceptable variant)
• The Dow Jones average of industrials
• The Dow average
• The Dow industrials
• *The Jones industrials
• The Dow Jones industrial
• *The industrial Dow
• *The Dow industrial
Collocation properties
• Arbitrary (dialect dependent)
  • ride a bike, set the table
• Domain dependent
  • dry suit, wet suit
• Recurrent
• Cohesive
  • Part of a collocation primes the rest
Applications
• Lexicography
• Grammatical restrictions (compare with/to, but associate with)
• Generation
• Translation
Types of collocations
• Predicative relations
  • make a decision, hostile takeover
  • flexible (syntactic variability, intervening words)
• Rigid word groups
  • over the counter market
• Phrases with open slots
  • fluency in a domain
Issues in finding collocations
• Possibly more than two words
• Need a measure that extends beyond the binary case
• Possibly intervening words
• Possibly morphological and syntactic variation
• Semantic constraints (cf. doctors-dentists and doctors-hospitals)
Xtract stage one
• For a given word, find all collocates at positions -5 to +5
• Three criteria:
  • strength (normalized frequency); the cutoff rejects 95% of candidates, versus the 68% expected under a normal distribution
  • position histogram must not be flat
  • select peak from histogram
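The three stage-one criteria can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not Xtract itself: the function name `stage_one`, the thresholds `k0`, `u0`, `k1` and their default values, and the use of population variance as the "flatness" measure are all choices made for this example.

```python
from collections import Counter, defaultdict
from statistics import mean, pstdev

def stage_one(sentences, target, window=5, k0=1.0, u0=10.0, k1=1.0):
    """Return (collocate, strength, peak_offsets) for one target word.

    sentences: list of token lists. k0, u0, k1 are hypothetical thresholds.
    """
    # hist[collocate][offset] = how often the collocate appears at that offset
    hist = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok != target:
                continue
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(tokens):
                    hist[tokens[j]][off] += 1
    if not hist:
        return []

    freqs = {w: sum(c.values()) for w, c in hist.items()}
    mu = mean(freqs.values())
    sigma = pstdev(freqs.values())
    if sigma == 0:
        return []  # all collocates equally frequent: nothing stands out

    offsets = [o for o in range(-window, window + 1) if o != 0]
    results = []
    for w, counts in hist.items():
        # Criterion 1: strength, the frequency normalized as a z-score
        strength = (freqs[w] - mu) / sigma
        if strength < k0:
            continue
        # Criterion 2: the position histogram must not be flat
        per_pos = [counts[o] for o in offsets]
        p_bar = mean(per_pos)
        spread = sum((p - p_bar) ** 2 for p in per_pos) / len(per_pos)
        if spread < u0:
            continue
        # Criterion 3: keep only the positions that peak above the rest
        cutoff = p_bar + k1 * spread ** 0.5
        peaks = [o for o, p in zip(offsets, per_pos) if p >= cutoff]
        results.append((w, round(strength, 2), peaks))
    return results
```

On a toy corpus where "Jones" always directly follows "Dow", the sketch would report "Jones" with a single peak at offset +1, while collocates seen only once are dropped by the strength test.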
Xtract stage two
• Start from word pairs
• Look at each position in between, to the left, and to the right
• Keep words that appear very often
• If that fails, keep parts of speech that satisfy this criterion
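A rough Python sketch of this expansion step is given below. It assumes the concordance lines for a word pair are already POS-tagged and aligned on the first word of the pair; the function name `stage_two`, the `(tokens, anchor)` input shape, the `<TAG>` slot notation, and the 0.75 threshold are illustrative assumptions rather than details taken from the slides.

```python
from collections import Counter

def stage_two(contexts, span=range(-2, 5), threshold=0.75):
    """Build a pattern around an anchored word pair.

    contexts: list of (tagged_tokens, anchor_index), where tagged_tokens is
    a list of (word, pos) tuples and anchor_index locates the first word of
    the pair. Returns one slot per offset in `span`: a word, a "<POS>"
    placeholder, or None when nothing is frequent enough.
    """
    total = len(contexts)
    pattern = []
    for off in span:
        words, tags = Counter(), Counter()
        for tokens, anchor in contexts:
            j = anchor + off
            if 0 <= j < len(tokens):
                word, tag = tokens[j]
                words[word] += 1
                tags[tag] += 1
        slot = None
        if words:
            word, count = words.most_common(1)[0]
            if count / total >= threshold:
                slot = word                      # a word appears very often
            else:
                tag, count = tags.most_common(1)[0]
                if count / total >= threshold:
                    slot = f"<{tag}>"            # fall back to part of speech
        pattern.append(slot)
    return pattern
```

With contexts such as "the Dow Jones industrial average rose / fell / climbed", the word slots stay fixed while the final position generalizes to a verb tag.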
Xtract stage three
• Applied to pairs of words
• Requires (partial) parsing
• Examines the syntactic relationship between words and keeps those pairs with consistent relationships (e.g., verb-object)
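The consistency filter can be illustrated with a short Python sketch. It assumes a parser has already labeled each occurrence of a pair with a syntactic relation; the parser itself is out of scope here, and the function name `stage_three`, the label strings, and the 0.75 threshold are hypothetical.

```python
from collections import Counter

def stage_three(pair_labels, threshold=0.75):
    """Keep pairs whose occurrences mostly share one syntactic relation.

    pair_labels: dict mapping (word1, word2) to the list of relation labels
    a parser assigned to each co-occurrence (assumed input).
    Returns a dict mapping each kept pair to its dominant relation.
    """
    kept = {}
    for pair, labels in pair_labels.items():
        if not labels:
            continue
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) >= threshold:
            kept[pair] = label
    return kept
```

A pair labeled verb-object in nine of ten sentences would be kept, while a pair split evenly between two relations would be discarded.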
Evaluation
• Ask a lexicographer to evaluate the output
• 40% precision after stages one and two
• 80% precision after stage three
• 94% conditional recall
Terminology
• Béatrice Daille, “Study and Implementation of Combined Techniques for Automatic Extraction of Terminology”, ACL Balancing Act workshop, 1994
• Terms refer to concepts
• Terms are key for populating a domain ontology
• Terms are typically nominal compounds of certain structure, e.g., NN, N of N
Defining terms
• Unique reference
• Unique translation
• Term extension by
  • modification (e.g., addition of an adjective)
  • substitution
  • extension of structure
  • coordination
Algorithm
• Apply syntactic constraints to match pairs of words in a candidate term
• Filter by application of an association measure
• Measures examined: pointwise mutual information, Φ² (chi-square normalized by sample size), log-likelihood ratio
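The three measures can be computed from a 2×2 contingency table for a word pair. The Python sketch below uses the standard textbook forms; the cell names `a`, `b`, `c`, `d` and the function name `association_scores` are notation chosen for this example, not taken from the slides. Here `a` counts occurrences of both words together, `b` the first word without the second, `c` the second without the first, and `d` neither.

```python
import math

def association_scores(a, b, c, d):
    """Return (pmi, phi2, llr) for a 2x2 table of co-occurrence counts."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("empty contingency table")
    r1, r2 = a + b, c + d          # row totals (first word present / absent)
    c1, c2 = a + c, b + d          # column totals (second word present / absent)

    # Pointwise mutual information: log2 of observed over expected joint count
    pmi = math.log2(a * n / (r1 * c1)) if a > 0 else float("-inf")

    # Phi-squared: the chi-square statistic divided by n
    denom = r1 * r2 * c1 * c2
    phi2 = (a * d - b * c) ** 2 / denom if denom else 0.0

    # Log-likelihood ratio: 2 * sum of O * ln(O / E); empty cells contribute 0
    observed = (a, b, c, d)
    expected = (r1 * c1 / n, r1 * c2 / n, r2 * c1 / n, r2 * c2 / n)
    llr = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)
    return pmi, phi2, llr
```

For a table with independent words (all four cells equal) all three scores are zero; for two words that only ever occur together, Φ² reaches its maximum of 1.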
Observations
• Compare with a reference list
• Frequency is a strong predictor
• Log-likelihood ratio works best
• Additional criteria:
  • diversity of the distribution of each word
  • distance between the two words (determines flexibility but not term status)
Justeson and Katz
• Justeson and Katz, “Technical Terminology: Some Linguistic Properties and an Algorithm for Identification in Text”, Natural Language Engineering, 1995
Analysis
• Examined association measures
• Well-known problems:
  • eliminating general-language constructs (e.g., collocations)
  • what to do with single-word terms?
Observations
• Frequency works well
• But a stronger predictor is repetition: P(k>1) compared to P(k≥1), where k is the number of times a candidate occurs in the same document
• Use syntactic patterns to propose terms, then check if they reappear in the same document
• Require this across multiple documents
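The propose-then-check procedure can be sketched in Python as follows. The sketch assumes documents are already POS-tagged with a coarse tag set ("A" adjective, "N" noun, anything else ignored) and uses a deliberately simplified pattern, a run of adjectives and nouns ending in a noun; the function names, the `max_len` limit, and the thresholds `min_repeats` and `min_docs` are illustrative assumptions.

```python
from collections import Counter

def candidates(tagged, max_len=4):
    """Count multi-word spans of adjectives/nouns that end in a noun.

    tagged: list of (word, tag) tuples for one document.
    """
    counts = Counter()
    n = len(tagged)
    for i in range(n):
        for j in range(i + 2, min(i + max_len, n) + 1):
            span = tagged[i:j]
            if all(tag in ("A", "N") for _, tag in span) and span[-1][1] == "N":
                counts[" ".join(word.lower() for word, _ in span)] += 1
    return counts

def repeated_terms(docs, min_repeats=2, min_docs=2):
    """Keep candidates repeated within a document, in enough documents."""
    doc_support = Counter()
    for tagged in docs:
        for term, count in candidates(tagged).items():
            if count >= min_repeats:      # k > 1 in this document
                doc_support[term] += 1
    return sorted(t for t, support in doc_support.items() if support >= min_docs)
```

Note that this sketch also returns nested sub-spans of a longer term (for example both "markov model" and "hidden markov model"); deciding which span to keep is left out.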
Term Expansion
• Jacquemin, Klavans, and Tzoukermann, “Expansion of Multi-Word Terms for Indexing and Retrieval Using Morphology and Syntax”, ACL 1997
• Need to expand a given list of terms, especially for scientific domains
Term variation
• Syntactic (same words, different structure)
• Morphosyntactic (derivational forms of words)
• Semantic (synonyms are used)
• In IR, normalization through stemming and removal of stop words
Approach
• Process the corpus, matching new candidate terms to old ones via unification
• Matching based on
  • inflectional morphology (transducer)
  • derivational morphology (rule-based)
  • syntactic transformations
  • additions of words
Results
• Manual inspection of several thousand proposed terms
• Precision of 89%
• Effectiveness in indexing increases by a factor of three when using the variants (precision/recall from 99.7/72 to 97/93)