Discussing the future of the web with the Semantic Web, GATE, and language technology; exploring ubiquitous, permeable, and companionable web concepts for semantic discovery and interaction.
The Semantic Web and Language Technology • BT Exact, Martlesham • Hamish Cunningham, Department of Computer Science, University of Sheffield • Friday October 11th 2002 • Next generation web • GATE, language technology infrastructure 1 (19)
A Ubiquitous Permeable Web • The next generation of the web must be: • ubiquitous: semantics for every device, every organisation, every individual; • permeable: allow contextual data to penetrate and persist; • companionable: able to engage with us via multiple natural modalities. • Roles for Language Technology: • discovery of semantics (ubiquity); • mediating between context and personal semantic memories (permeability); • conversing with people and the semantic web (companionableness). 2(19)
Critical Mass for the Semantic Web • The SW: machine-processable, repurposable data to complement hypertext • But: semantics = 0.0000000...% of the Web • How to achieve critical mass? Huge-scale automatic annotation. Requirements: • Huge scale: – freely available to all EU citizens – distributed (over a Grid) – repurposable (delivered as Web Services) • Portability and robustness via: – simple and therefore shallow HLT methods – +ve and –ve learning – analogs of IPSEs for computer-literate users 3 (19)
Motivation for Software Infrastructure for Language Engineering • Need for scalable, reusable, and portable HLT solutions • Support for large data, in multiple media, languages, formats, and locations • Lowering the cost of creation of new language processing components • Promoting quantitative evaluation metrics via tools and a level playing field 4 (19)
Motivation (II): 5 (19)
GATE, a General Architecture for Text Engineering • An architecture: a macro-level organisational picture for LE software systems. • A framework: for programmers, GATE is an object-oriented class library that implements the architecture. • A development environment: for language engineers, computational linguists et al., GATE is a graphical development environment bundled with a set of tools for doing e.g. Information Extraction. • Some free components... ...and wrappers for other people's components • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc. • Free software (LGPL). Download at http://gate.ac.uk/download/ 6 (19)
Architectural principles • Non-prescriptive, theory neutral (strength and weakness) • Re-use, interoperation, not reimplementation (e.g. diverse XML support, integration of tools like Protégé, Jena and Weka) • (Almost) everything is a component, and component sets are user-extendable • Component-based development • An OO way of chunking software: Java Beans • GATE components: CREOLE = modified Java Beans (Collection of REusable Objects for Language Engineering) • The minimal component = 10 lines of Java, 10 lines of XML, 1 URL. 7 (19)
GATE Language Resources • GATE LRs are documents, ontologies, corpora, lexicons, … • Documents / corpora: • GATE documents loaded from local files or the web... • Diverse document formats: text, HTML, XML, email, RTF, SGML. • Processing Resources • Algorithmic components known as PRs – beans with execute methods. • All PRs can handle Unicode data by default. • Clear distinction between code and data (simple repurposing). • 20-30 freebies with GATE • e.g. Named entity recognition; WordNet; Protégé; Ontology; OntoGazetteer; DAML+OIL export; Information Retrieval based on Lucene 8 (19)
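The split between language resources (data) and processing resources (code with an execute method) can be illustrated with a small sketch. This is a conceptual analogue written in Python, not GATE's Java API: the class names `Document`, `Annotation`, `ProcessingResource`, `WhitespaceTokeniser`, `Gazetteer` and the helper `run_pipeline` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Annotation:
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    type: str
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    """Language resource (hypothetical): holds data only, no processing logic."""
    content: str
    annotations: List[Annotation] = field(default_factory=list)


class ProcessingResource:
    """Processing resource (hypothetical): a bean-like object whose
    parameters are set as attributes before execute() is called."""
    document: Optional[Document] = None

    def execute(self) -> None:
        raise NotImplementedError


class WhitespaceTokeniser(ProcessingResource):
    def execute(self) -> None:
        doc = self.document
        offset = 0
        for word in doc.content.split():
            # Locate each word at or after the previous token's end.
            start = doc.content.index(word, offset)
            end = start + len(word)
            doc.annotations.append(Annotation(start, end, "Token"))
            offset = end


class Gazetteer(ProcessingResource):
    def __init__(self, entries: Set[str], kind: str):
        self.entries = entries
        self.kind = kind

    def execute(self) -> None:
        doc = self.document
        # Iterate over a snapshot of the tokens, since Lookups are appended.
        for token in [a for a in doc.annotations if a.type == "Token"]:
            if doc.content[token.start:token.end] in self.entries:
                doc.annotations.append(
                    Annotation(token.start, token.end, "Lookup", {"kind": self.kind})
                )


def run_pipeline(resources: List[ProcessingResource], document: Document) -> Document:
    """Run each processing resource in order over the same document."""
    for pr in resources:
        pr.document = document
        pr.execute()
    return document


doc = run_pipeline(
    [WhitespaceTokeniser(), Gazetteer({"Ltd"}, "companyDesignator")],
    Document("Acme Ltd opened in Sheffield"),
)
for ann in doc.annotations:
    print(ann.type, doc.content[ann.start:ann.end], ann.features)
```

Because the document carries only data, the same components can be pointed at a different document, or the same document handed to a different component set, without changing either side.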
[Figure: A Language Analysis Example. Labels: GATE format handlers (HTML docs, RTF docs, XML docs, …); document content, document metadata, document format data, linguistic data; processing components (ANNIE, named entity, coreference, POS tagger, event extraction, …); custom application 1; storage in a relational database (Oracle/PostgreSQL) or file storage.]
Building IE Components in GATE (1) • The ANNIE system – a reusable and easily extendable set of components 11 (19)
Building IE Components in GATE (2) • JAPE: a Java Annotation Patterns Engine • Light, robust regular-expression-based processing • Cascaded finite state transduction • Low-overhead development of new components • Rule: Company1 • Priority: 25 • ( • ( {Token.orthography == upperInitial} )+ • {Lookup.kind == companyDesignator} • ):companyMatch • --> • :companyMatch.NamedEntity = { kind = company, rule = "Company1" } 12 (19)
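To make the Company1 rule concrete, the following Python sketch mimics what its pattern matches: one or more upper-initial tokens followed by a company designator, with the whole span annotated as a NamedEntity. It is an illustration of the pattern only, not how JAPE is implemented. Several simplifications are assumptions: the Lookup is modelled as a feature on the token rather than as a separate annotation, the orthography classifier is a rough stand-in for a real tokeniser, the longest match is preferred, and the names `Token`, `Annotation`, `tokenize` and `match_company1` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Token:
    text: str
    orthography: str                   # stands in for Token.orthography
    lookup_kind: Optional[str] = None  # stands in for Lookup.kind


@dataclass
class Annotation:
    start: int  # index of the first token, inclusive
    end: int    # index after the last token, exclusive
    type: str
    features: dict


def classify(word: str) -> str:
    """Rough orthography classifier (an assumption, not a real tokeniser)."""
    if word.isupper():
        return "allCaps"
    if word[0].isupper():
        return "upperInitial"
    if word.islower():
        return "lowercase"
    return "mixedCaps"


def tokenize(text: str, designators: Set[str]) -> List[Token]:
    """Split on whitespace and mark gazetteer hits as company designators."""
    return [
        Token(w, classify(w), "companyDesignator" if w in designators else None)
        for w in text.split()
    ]


def match_company1(tokens: List[Token]) -> List[Annotation]:
    annotations = []
    i = 0
    while i < len(tokens):
        best_end = None
        j = i
        # ( {Token.orthography == upperInitial} )+
        while j < len(tokens) and tokens[j].orthography == "upperInitial":
            j += 1
            # {Lookup.kind == companyDesignator} directly after the run so far
            if j < len(tokens) and tokens[j].lookup_kind == "companyDesignator":
                best_end = j + 1  # remember the longest match seen
        if best_end is None:
            i += 1
        else:
            # :companyMatch.NamedEntity = { kind = company, rule = "Company1" }
            annotations.append(
                Annotation(i, best_end, "NamedEntity",
                           {"kind": "company", "rule": "Company1"})
            )
            i = best_end
    return annotations


tokens = tokenize("shares in Acme Widgets Ltd rose", {"Ltd", "plc", "Inc"})
for ann in match_company1(tokens):
    print(" ".join(t.text for t in tokens[ann.start:ann.end]), ann.features)
# prints: Acme Widgets Ltd {'kind': 'company', 'rule': 'Company1'}
```

Note that a capitalised sentence-initial word directly before a company name would be swallowed by the same pattern, which is one reason such rules are typically cascaded with others and given priorities.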
The Semantic Web and GATE • GATE is being used for development of (semi-)automatic methods for: • linking web pages to Ontologies using Information Extraction; • learning and evolving Ontologies via IE and lexical semantic network traversal. 13 (19)
Information Retrieval Support • Based on the Lucene IR engine 16 (19)
Displaying Multilingual Data • All the visualisation and editing tools for ML LRs use enhanced Java facilities: 17 (19)
Applications • GATE has been used for a variety of applications, including: • MUMIS: automatic creation of semantic indexes for multimedia programme material • MUSE: a multi-genre IE system • Metadata for Medline (at Merck) • ACE: participation in the Automatic Content Extraction programme • HSE: summarisation of health and safety information from company reports • OldBaileyIE: NE recognition on 17th century Old Bailey Court reports. • Various Medical Informatics and database technology projects • IE in Romanian, Bulgarian, Greek, Bengali, Spanish, Swedish, German, Italian, and French (Arabic, Chinese and Russian this autumn) 18 (19)
Conclusion • GATE: an infrastructure that lowers the overhead of creating & embedding robust NLP components • Further information: http://gate.ac.uk/ • Online demos, tutorials and documentation • Software downloads • Talks and papers 19 (19)