Ontology-Aware Information Extraction
http://gate.ac.uk/
Hamish Cunningham, Kalina Bontcheva
Department of Computer Science, University of Sheffield
OntoWeb 4, SIG 5, 2002
GATE, a General Architecture for Text Engineering
• GATE is:
  • An architecture: a macro-level organisational picture for language engineering (LE) software systems.
  • A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
  • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for tasks such as Information Extraction.
• Free software (LGPL). Mature, robust software (in development since 1995). Download at http://gate.ac.uk/download
• Comes with:
  • Some free components, and wrappers for other people's components
  • Tools for: evaluation; visualisation and editing; persistence; IR; IE; dialogue; ontologies; etc.
Applications; languages
• GATE has been used for a variety of applications, including:
  • MUMIS: automatic creation of semantic indexes for multimedia programme material
  • MUSE: a multi-genre IE system
  • EMILLE: a 70 million word corpus of Indic languages
  • Metadata for Medline (at Merck)
  • Creation of metadata for Semantic Web Services; documentation using NLG
  • HSE: summarisation of health and safety information from company reports
  • OldBaileyIE: NE recognition on 17th century Old Bailey Court reports
  • AKT: language technology in knowledge management
  • AMITIES: call centre automation
  • Digital libraries / e-philology for ancient languages researchers
  • Various Medical Informatics and database technology projects
• IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian next year)
Some users
At time of writing, a representative fraction of GATE users includes:
• Longman Pearson publishing, UK
• BT Exact Technologies, UK
• Merck KGaA, Germany
• Canon Europe, UK
• Knight Ridder (the second biggest US news publisher)
• BBN Technologies, US
• Sirma AI Ltd., Bulgaria
• Resco AB, Sweden/Finland/Germany
• Glaxo Smith Kline Plc: drug-based navigation of Medline abstracts
• Master Foods NV: extraction of commodities events from news
• the American National Corpus project, US
• Imperial College, London, the University of Manchester, Queen Mary College, UMIST, the University of Karlsruhe, Vassar College, ISI / the University of Southern California, and a large number of other UK, US and EU universities
• the Perseus Digital Library project, Tufts University, US
Scientific method and HLT
• How do we really know that this stuff works?!
• Open source systems make experimental repeatability easier and therefore cut down on site-specific skew effects.
• GATE's IE tools have competed in MUC, TREC (QA), ACE, and DUC. TIDES Surprise Language exercise next year.
• GATE includes markup and automated evaluation tools, which make quantitative evaluation easier.
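To make "quantitative evaluation" concrete, here is a minimal sketch of how IE output is commonly scored against hand-marked annotations using precision, recall and F1. It is not GATE's evaluation tool: the `Span` type, the `score` function and the sample data are all hypothetical, and it assumes strict matching (same offsets and same label).

```python
from typing import Dict, NamedTuple, Set


class Span(NamedTuple):
    """A hypothetical annotation: character offsets plus an entity label."""
    start: int
    end: int
    label: str


def score(gold: Set[Span], predicted: Set[Span]) -> Dict[str, float]:
    """Strict scoring: a prediction counts only if offsets and label both match."""
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up data: three hand-marked entities, two system outputs,
# one of which has the right span but the wrong label.
gold = {
    Span(0, 7, "Person"),
    Span(20, 29, "Location"),
    Span(40, 46, "Organization"),
}
predicted = {
    Span(0, 7, "Person"),
    Span(20, 29, "Organization"),
}

result = score(gold, predicted)
print(result)
```

Because the scoring is a pure function of two annotation sets, anyone with the same corpus and the same system output can reproduce the figures, which is the repeatability point made on this slide.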
Collaboration opportunities
• Interoperation, integration, not re-invention: collaboration not competition
• Take the code, do what you like with it, perhaps contribute something back
• Involve us in your 6th Framework projects
• Join KITShare: a network of excellence in Knowledge and Interface Tool Sharing.
The Holy Grail
• Problem: gap between many current IE tools and SemWeb needs
What is needed?
• Content, not Information, Extraction
  • Identify the ontological reference, not just the class
  • Maintain referential integrity (coreference)
• Ontology-aware IE tools
  • Use instances already in the ontology
  • React to changes in the ontology
• Support experienced users to change the IE tools
GATE and Content Extraction
ANNIE: open-source IE system in GATE, providing modules needed for content extraction
• Pre-processing
• Named entity recognition
• Coreference resolution
  • ANNIE handles proper names, pronouns, and nominals
• Easy-to-use pattern-action rule language to enable customisation and postprocessing of the IE results
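The idea of a pattern-action rule can be sketched as follows: a pattern is matched against a sequence of tokens, and when it matches, an action creates a new annotation over the matched span. This is a Python illustration only, not the syntax of GATE's rule language; `Token`, `Annotation`, `Rule`, `tokenize`, `apply_rule` and the title list are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Token:
    text: str
    start: int
    end: int


@dataclass
class Annotation:
    type: str
    start: int
    end: int


@dataclass
class Rule:
    # One predicate per token position (the pattern) ...
    pattern: Sequence[Callable[[Token], bool]]
    # ... and a function that builds an annotation from the match (the action).
    action: Callable[[List[Token]], Annotation]


def tokenize(text: str) -> List[Token]:
    """Whitespace tokeniser that records character offsets."""
    tokens, pos = [], 0
    for word in text.split():
        start = text.index(word, pos)
        end = start + len(word)
        tokens.append(Token(word, start, end))
        pos = end
    return tokens


def apply_rule(rule: Rule, tokens: List[Token]) -> List[Annotation]:
    """Slide the pattern over the tokens; fire the action on each match."""
    out, i, n = [], 0, len(rule.pattern)
    while i + n <= len(tokens):
        window = tokens[i:i + n]
        if all(pred(tok) for pred, tok in zip(rule.pattern, window)):
            out.append(rule.action(window))
            i += n  # skip past the matched tokens
        else:
            i += 1
    return out


TITLES = {"Mr", "Mrs", "Dr", "Prof"}  # toy list, for illustration only

# Pattern: a title followed by a capitalised word. Action: annotate as Person.
person_rule = Rule(
    pattern=[
        lambda t: t.text.rstrip(".") in TITLES,
        lambda t: t.text[:1].isupper(),
    ],
    action=lambda m: Annotation("Person", m[0].start, m[-1].end),
)

text = "Yesterday Dr Smith met Prof Jones in Sheffield"
annotations = apply_rule(person_rule, tokenize(text))
found = [text[a.start:a.end] for a in annotations]
print(found)
```

Keeping the pattern and the action as separate pieces is what makes this style of rule easy to customise: a user can add or adjust a rule without touching the matching engine.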
Ontologies as explicit IE resources
• Reuse, not reinvention:
  • Protégé for ontology maintenance
  • Sesame/KAON for storage and reasoning
• Ontology-aware gazetteers
  • Provide the ontological class of each entry
  • Use instances from the ontology for IE
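A minimal sketch of what an ontology-aware gazetteer might look like: each entry carries its ontological class and an instance identifier, and entries are populated from ontology instances rather than from a flat word list. The class `OntologyGazetteer`, its methods, and the `ex:` identifiers are assumptions made for illustration, not GATE's API. Matching here is naive substring search with no word-boundary check.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Lookup:
    """A gazetteer hit, tagged with ontological class and instance."""
    start: int
    end: int
    text: str
    onto_class: str
    instance: str


class OntologyGazetteer:
    def __init__(self) -> None:
        # label -> (ontological class, instance identifier)
        self._entries: Dict[str, Tuple[str, str]] = {}

    def add_instance(self, label: str, onto_class: str, instance: str) -> None:
        """Register an ontology instance; later lookups will find it."""
        self._entries[label] = (onto_class, instance)

    def annotate(self, text: str) -> List[Lookup]:
        hits = []
        for label, (onto_class, instance) in self._entries.items():
            start = text.find(label)
            while start != -1:
                end = start + len(label)
                hits.append(Lookup(start, end, label, onto_class, instance))
                start = text.find(label, start + 1)
        return sorted(hits, key=lambda h: h.start)


gaz = OntologyGazetteer()
gaz.add_instance("Sheffield", "City", "ex:Sheffield")  # hypothetical instance

text = "GATE was developed in Sheffield."
before = gaz.annotate(text)

# A new instance is added to the ontology; the gazetteer sees it immediately.
gaz.add_instance("GATE", "SoftwareSystem", "ex:GATE")
after = gaz.annotate(text)

print([(h.text, h.onto_class) for h in before])
print([(h.text, h.onto_class) for h in after])
```

The second call picks up the newly added instance without any change to the gazetteer itself, which is the behaviour the slide asks for when the ontology is the source of the entries.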
Ontology-aware IE
• The IE tools can use available formal knowledge and reasoning
  • Ontology-based anaphora resolution
    • G. Bush, G. Brown, the president
• The correct ontological classes are assigned to the recognised entities
• Changes in the ontology are available to the IE tools
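The "G. Bush, G. Brown, the president" example can be sketched as follows: the nominal "the president" is resolved to whichever preceding mention the ontology says is an instance of the matching class. This is a toy illustration under stated assumptions, not the algorithm used in GATE: the class hierarchy, the instance-to-class table and the nominal-to-class table are all hypothetical data.

```python
from typing import List, Optional

# Hypothetical toy ontology: class hierarchy and known instances.
SUBCLASS_OF = {
    "President": "Politician",
    "Chancellor": "Politician",
    "Politician": "Person",
}
INSTANCE_CLASS = {
    "G. Bush": "President",
    "G. Brown": "Chancellor",
}
# Hypothetical mapping from nominal phrases to ontological classes.
NOMINAL_CLASS = {
    "the president": "President",
    "the politician": "Politician",
}


def is_a(cls: Optional[str], ancestor: str) -> bool:
    """True if cls equals ancestor or is a (transitive) subclass of it."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def resolve(nominal: str, preceding_mentions: List[str]) -> Optional[str]:
    """Pick the most recent preceding mention whose class fits the nominal."""
    target = NOMINAL_CLASS.get(nominal.lower())
    if target is None:
        return None
    for mention in reversed(preceding_mentions):
        cls = INSTANCE_CLASS.get(mention)
        if cls is not None and is_a(cls, target):
            return mention
    return None


mentions = ["G. Bush", "G. Brown"]
antecedent = resolve("the president", mentions)
print(antecedent)
```

A purely recency-based resolver would pick "G. Brown" here because it is the nearer mention; consulting the ontology is what lets "the president" skip over it. If the instance table changes, the same call returns a different antecedent with no change to the resolver.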