210 likes | 348 Vues
HLT, Data Sparsity and Semantic Tagging Louise Guthrie (University of Sheffield) Roberto Basili (University of Tor Vergata, Rome) Hamish Cunningham (University of Sheffield). Outline. A ubiquitous problem: data sparsity The approach: coarse-grained semantic tagging
E N D
HLT, Data Sparsity and Semantic Tagging • Louise Guthrie (University of Sheffield) • Roberto Basili (University of Tor Vergata, Rome) • Hamish Cunningham (University of Sheffield) 1(21)
Outline • A ubiquitous problem: data sparsity • The approach: • coarse-grained semantic tagging • learning by combining multiple evidence • The evaluation: intrinsic and extrinsic measures • The expected outcomes: architectures, tools, development support 2(21)
Applications PresentWe’ve seen growing interest in a range of HLT tasks: e.g. IE, MT Trends • Fully portable IE, unsupervised learning • Content Extraction vs. IE 3(21)
Data Sparsity • Language Processing depends on a model of the features important to an application. • MT - Trigrams and frequencies • Extraction - Word patterns • New texts always seem to have lots of phenomena we haven’t seen before 4(21)
Different kinds of patterns Person was appointed as postof company Company named person to post • Almost all extraction systems tried to find patterns of mixed words and entities. • People, Locations, Organizations, dates, times, currencies 5(21)
Can we do more? Astronauts aboard the space shuttle Endeavor were forced to dodge a derelict Air Force satellite Friday Humans aboard space_vehicle dodge satellitetimeref. 6(21)
Could we know these are the same? The IRA bombed a family owned shop in Belfast yesterday. FMLN set off a series of explosions in central Bogota today. ORGANIZATION ATTACKED LOCATION DATE 7(21)
Machine translation • Ambiguity of words often means that a word can translate several ways. • Would knowing the semantic class of a word, help us to know the translation? 8(21)
Sometimes . . . • Crane the bird vs crane the machine • Bat the animal vs bat for cricket and baseball • Seal on a letter vs the animal 9(21)
SO .. P(translation(crane) = grulla | animal) > P(translation(crane) = grulla) P(translation(crane) = grua | machine) > P(translation(crane) = grua) Can we show the overall effect lowers entropy? 10(21)
Language Modeling – Data Sparseness again .. • We need to estimate Pr (w3 | w1 w2) • If we have never seen w1w2 w3 before • Can we instead develop a model and estimate Pr (w3 | C1 C2) or Pr (C3 | C1 C2) 11(21)
A Semantic Tagging technology. How? • We will exploit similarity with NE tagging, ... • Development of pattern matching rules as incremental wrapper induction • ... with semantic (sense) disambiguation • Use as much evidence as possible • Exploit existing resources like MRD or LKBs • ... and with machine learning tasks • Generalize from positive examples in training data 12(21)
Multiple Sources of Evidence • Lexical information (priming effects) • Distributional information from general and training texts • Syntactic features • SVO patterns or Adjectival modifiers • Semantic features • Structural information in LKBs • (LKB-based) similarity measures 13(21)
Machine Learning for ST • Similarity estimation • among contexts (texts overlaps, …) • among lexical items wrt MRD/LKBs • We will experiment • Decision tree learning (e.g. C4.5) • Support Vector Machines (e.g. SVM light) • Memory-based Learning (TiMBL) • Bayesian learning 14(21)
What’s New? • Granularity • Semantic categories are coarser than word senses (cfr. homograph level in MRD) • Integration of existing ML methods • Pattern induction is combined with probabilistic description of word semantic classes • Co-training • Annotated data are used to drive the sampling of further evidence from unannotated material (active learning) 15(21)
How we know what we’ve done: measurement, the corpus • Hand-annotated corpus • from the BNC, 100-million word balanced corpus • 1 million words annotated • a little under ½ million categorised noun phrases • Extrinsic evaluationPerplexity of lexical choice in Machine Translation • Intrinsic evaluationStandard measures or precision, recall, false positives • (baseline: tag with most common category = 33%) 16(21)
Ambiguity levels in the training data NPs by semantic categories: 0 104824 23.1% 1 119228 26.3% 2 96852 21.4% 3 44385 9.8% 4 35671 7.9% 5 15499 3.4% 6 13555 3.0% 7 7635 1.7% 8 6000 1.3% 9 2191 0.5% 10 3920 0.9% 11 1028 0.2% 12 606 0.1% 13 183 0.0% 14 450 0.1% 15 919 0.2% 17 414 0.1% Total NPs (interim) 453360 17(21)
Maximising project outputs:software infrastructure for HLT • Three outputs from the project: • 1. A new resource • Automatical annotation of the whole corpus • Experimental evidence re. 1.- how accurate the final results are- how accurate the various methods employed are • Component tools for doing 1., based on GATE(a General Architecture for Text Engineering) 18(21)
What is GATE? • An architectureA macro-level organisational picture for LE software systems. • A frameworkFor programmers, GATE is an object-oriented class library that implements the architecture. • A development environmentFor language engineers, computational linguists et al, GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. • Some free components... ...and wrappers for other people's components • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc. • Free software (LGPL). Download at http://gate.ac.uk/download/ 19(21)
Where did GATE come from? • A number of researchers realised in the early- mid-1990s (e.g. in TIPSTER): • Increasing trend towards multi-site collaborative projects • Role of engineering in scalable, reusable, and portable HLT solutions • Support for large data, in multiple media, languages, formats, and locations • Lower the cost of creation of new language processing components • Promote quantitative evaluation metrics via tools and a level playing field • History: • 1996 – 2002: GATE version 1, proof of concept • March 2002: version 2, rewritten in Java, component based, LGPL, more users • Fall 2003: new development cycle 20(21)
Role of GATE in the project • Productivity- reuse some baseline components for simple tasks- development environment support for implementors (MATLAB for HLT?)- reduce integration overhead (standard interfaces between components)- system takes care of persistency, visualisation, multilingual edit, ... • Quantification- tool support for metrics generation - visualisation of key/response differences- regression test tool for nightly progress verification • Repeatability- open source supported, maintained, documented software- cross-platform (Linux, Windows, Solaris, others)- easy install and proven useability (thousands of people, hundreds of sites)- mobile code if you write in Java; web services otherwise 21(21)