Where the Web Went Wrong gate.ac.uk/ nlp.shef.ac.uk/ Hamish Cunningham

Where the Web Went Wrong
http://gate.ac.uk/ http://nlp.shef.ac.uk/
Hamish Cunningham, Dept. Computer Science, University of Sheffield
Graz, May 2004

Contents 2(21)
• The Web, presentation, and syndication
• A Semantic Web for eCulture
  • annoy half the audience
  • annoy the other half
• eCulture, metadata and human language
  • motivation
• Information Extraction: quantified language computing
• MUMIS, GATE, ...
• Cultural memory is not a luxury

Syndication and Mediation 3(21)
• The web promotes diversity, but also fragmentation
• Original web: separate content and presentation (“this is a header”, not “set in 20 point bold font”)
• Now: many incompatible/inaccessible interfaces
• Memory Institutions (museums, libraries, archives) need to:
  • pool their impact: syndication in networked communities
  • support repurposable content
• Therefore data must be presentation independent
• Candidate technologies: DC, CIDOC, XML, RSS, RDF, OWL (“semantic web”)...

Semantic Web (1) 4(21)
• Memory Institutions (museums, libraries, archives) host massively diverse content
• Fortunately, the differences are primarily at the level of data structure and syntax. Significant conceptual overlaps exist between the descriptive schema used by memory institutions; elemental concepts such as objects, people, places, events, and the interrelationships between them are almost universal.
  (Building semantic bridges between museums, libraries and archives: The CIDOC Conceptual Reference Model, T. Gill, April 2004)
• Therefore we can add a semantic metadata layer to provide generalised inter-institution resource location
• Syndication and mediation for free!

Semantic Web (2): good news and bad news 5(21)
• The good news: the SW is a focus of AI and metadata work
• The bad news: AI always fails
• How does the machine tell the difference between “Mother Teresa is a saint” and “Tony Blair is a saint”? (Or, who tells Google which statement is important?)
  • Other web users do, by linking (also cf. Amazon)
• Two solutions to the AI problem:
  • allow curators and users to build their own (simple specific models can succeed, but the cost may be too high)
  • use recommender systems to make the user a curator’s assistant (researchers and students may barter for access)
• Any route to searchable content!

IT context: the Knowledge Economy and Human Language 6(21)
• Gartner, December 2002:
  • taxonomic and hierarchical knowledge mapping and indexing will be prevalent in almost all information-rich applications
  • through 2012 more than 95% of human-to-computer information input will involve textual language
• A contradiction:
  • to deal with the information deluge we need formal knowledge in semantics-based systems
  • our archived history is in informal and ambiguous natural language
• The challenge: to reconcile these two phenomena

HLT: Closing the Loop 7(21)
[Diagram linking Human Language, Controlled Language and Formal Knowledge (ontologies and instance bases; Semantic Web, Semantic Grid, Semantic Web Services) via OIE, (A)IE, CLIE and (M)NLG.]
KEY
• MNLG: Multilingual Natural Language Generation
• OIE: Ontology-aware Information Extraction
• AIE: Adaptive IE
• CLIE: Controlled Language IE

Information Extraction 8(21)
• Information Extraction (IE) pulls facts and structured information from the content of large text collections.
• Contrast IE and Information Retrieval
• NLP history: from NLU to IE
• Progress driven by quantitative measures
  • MUC: Message Understanding Conferences
  • ACE: Automatic Content Extraction

IE Example 9(21)
“The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc.”
• NE (named entities): "rocket", "Tuesday", "Dr. Head", "We Build Rockets"
• CO (coreference): "it" = rocket; "Dr. Head" = "Dr. Big Head"
• TE (template elements): the rocket is "shiny red" and Head's "brainchild".
• TR (template relations): Dr. Head works for We Build Rockets Inc.
• ST (scenario template): rocket launch event with various participants
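The five task outputs for the rocket passage can be written down as plain data. The sketch below is a hand-filled illustration, not the output of any extraction system; the class name `Extraction`, its field names, the `works_for` label and the `resolve` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Extraction:
    """Hand-filled container for the five task outputs (all names are illustrative)."""
    entities: List[str] = field(default_factory=list)                    # NE
    coreference: Dict[str, str] = field(default_factory=dict)            # CO: mention -> referent
    descriptions: Dict[str, List[str]] = field(default_factory=dict)     # TE
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # TR
    scenario: Dict[str, str] = field(default_factory=dict)               # ST


# Values copied from the slide; the dictionary keys are assumptions.
rocket_example = Extraction(
    entities=["rocket", "Tuesday", "Dr. Head", "We Build Rockets"],
    coreference={"it": "rocket", "Dr. Head": "Dr. Big Head"},
    descriptions={"rocket": ["shiny red", "brainchild of Dr. Head"]},
    relations=[("Dr. Head", "works_for", "We Build Rockets Inc.")],
    scenario={"event": "rocket launch", "vehicle": "rocket", "date": "Tuesday"},
)


def resolve(mention: str, extraction: Extraction) -> str:
    """Return the referent recorded for a mention, or the mention itself if none is recorded."""
    return extraction.coreference.get(mention, mention)
```

Calling `resolve("it", rocket_example)` looks up the coreference link, which is the step that lets the later tasks attach "brainchild" to the rocket rather than to a bare pronoun.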

Performance levels 10(21)
• Vary according to text type, domain, scenario, language
• NE: up to 97% (tested in English, Spanish, Japanese, Chinese, others)
• CO: 60-70% resolution
• TE: 80%
• TR: 75-80%
• ST: 60% (but: human level may be only 80%)
(Extensive quantitative evaluation since early ’90s; mainly on text, ASR; now also video OCR)
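MUC-style evaluations typically compare system output against a human-annotated key and report precision, recall and their harmonic mean (F-measure). The slide does not say which measure each figure refers to, so the sketch below only shows how such scores are computed; the function name and the sample sets are invented for illustration.

```python
def precision_recall_f1(predicted, gold):
    """Score a set of predicted items against a gold-standard set."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented example: the system finds three of four gold entities plus one spurious item.
gold = {"rocket", "Tuesday", "Dr. Head", "We Build Rockets"}
predicted = {"rocket", "Tuesday", "Dr. Head", "Inc."}
p, r, f = precision_recall_f1(predicted, gold)
```

Here three of the four predictions are correct and three of the four gold items are found, so precision, recall and F-measure all come out at 0.75.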

Ontology-based IE 11(21)
Text: “XYZ was established on 03 November 1978 in London. It opened a plant in Bulgaria in …”
[Diagram: Ontology & KB. Classes: Location, Company, City, Country. Instances: XYZ, London, UK, Bulgaria. Relations: type, HQ, partOf, establOn (“03/11/1978”).]
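One way to picture the knowledge base behind this slide is as subject-predicate-object triples. The sketch below is one plausible reading of the flattened diagram, not a reconstruction of it: which instance carries which `type`, and the `HQ` and `partOf` links between XYZ, London and UK, are assumptions, as are the helper names `objects` and `hq_country`.

```python
# Assumed reading of the diagram as (subject, predicate, object) triples.
triples = [
    ("XYZ", "type", "Company"),
    ("London", "type", "City"),
    ("UK", "type", "Country"),
    ("Bulgaria", "type", "Country"),
    ("XYZ", "HQ", "London"),
    ("London", "partOf", "UK"),
    ("XYZ", "establOn", "03/11/1978"),
]


def objects(subject, predicate):
    """All objects linked to a subject by a predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def hq_country(company):
    """Follow HQ, then partOf, to find the country of a company's headquarters."""
    for city in objects(company, "HQ"):
        for country in objects(city, "partOf"):
            return country
    return None
```

With these triples, `hq_country("XYZ")` chains two links to reach "UK", an answer that appears nowhere in the source sentence and is available only because the extracted facts are tied to the ontology.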

Classes, instances & metadata 12(21)
[Diagram: ontology fragment with nodes … Entity, Person, Job-title, president, minister, chancellor, G.Brown …; captions “Classes+instances before” and “Classes+instances after”, the latter adding Bush.]
“Gordon Brown met George Bush during his two day visit.”
<metadata>
  <DOC-ID>http://… 1.html</DOC-ID>
  <Annotation>
    <s_offset> 0 </s_offset>
    <e_offset> 12 </e_offset>
    <string>Gordon Brown</string>
    <class>…#Person</class>
    <inst>…#Person12345</inst>
  </Annotation>
  <Annotation>
    <s_offset> 18 </s_offset>
    <e_offset> 32 </e_offset>
    <string>George Bush</string>
    <class>…#Person</class>
    <inst>…#Person67890</inst>
  </Annotation>
</metadata>
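Stand-off metadata of this shape can be produced by recording character offsets for each recognised mention. The sketch below is a minimal illustration using Python's standard `xml.etree.ElementTree`; the `annotate` function is hypothetical, the namespace prefixes and the DOC-ID are left out because they are elided on the slide, and the offsets are computed by simple string search over the sentence as quoted here (end offset exclusive), so they need not match the illustrative values shown on the slide.

```python
import xml.etree.ElementTree as ET


def annotate(text, mentions):
    """Build a <metadata> element with one <Annotation> per mention found in text.

    mentions: iterable of (surface string, class id, instance id).
    Only the first occurrence of each surface string is annotated.
    """
    root = ET.Element("metadata")
    for surface, cls, inst in mentions:
        start = text.find(surface)
        if start < 0:
            continue  # mention not present in the text; skip it
        ann = ET.SubElement(root, "Annotation")
        ET.SubElement(ann, "s_offset").text = str(start)
        ET.SubElement(ann, "e_offset").text = str(start + len(surface))
        ET.SubElement(ann, "string").text = surface
        ET.SubElement(ann, "class").text = cls
        ET.SubElement(ann, "inst").text = inst
    return root


text = "Gordon Brown met George Bush during his two day visit."
mentions = [
    ("Gordon Brown", "#Person", "#Person12345"),
    ("George Bush", "#Person", "#Person67890"),
]
root = annotate(text, mentions)
offsets = [
    (int(a.findtext("s_offset")), int(a.findtext("e_offset")))
    for a in root.iter("Annotation")
]
```

Because the annotations point into the text by offset rather than wrapping it in tags, the same document can carry several independent layers of metadata without being modified.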

An example: the MUMIS project 13(21)
• Multimedia Indexing and Searching Environment
• Composite index of a multimedia programme from multiple sources in different languages
• ASR, video processing, Information Extraction (Dutch, English, German), merging, user interface
• University of Twente/CTIT, University of Sheffield, University of Nijmegen, DFKI, MPI, ESTEAM AB, VDA
• An important experimental result: multiple sources for same events can improve extraction quality
• PrestoSpace applications in news and sports archiving
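The claim that multiple sources can improve extraction quality can be illustrated with a toy merge. The slide does not describe how MUMIS merges its sources, so the sketch below substitutes simple majority voting per attribute; the function name and the three sample records are made up for illustration.

```python
from collections import Counter


def merge_by_vote(extractions):
    """Merge several extractions of one event by majority vote on each attribute."""
    merged = {}
    keys = {key for extraction in extractions for key in extraction}
    for key in sorted(keys):
        votes = Counter(e[key] for e in extractions if key in e)
        merged[key] = votes.most_common(1)[0][0]
    return merged


# Invented records: three sources describe the same event, each with a gap or an error.
sources = [
    {"event": "goal", "scorer": "Beckham", "minute": 59},
    {"event": "goal", "scorer": "Beckham"},                 # minute missing
    {"event": "goal", "scorer": "Owen", "minute": 59},      # wrong scorer
]
merged = merge_by_vote(sources)
```

No single source here is both complete and correct, yet the merged record is, which is the intuition behind combining sources even when each one is noisy.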

Semantic Query 14(21)
• Not “goal Beckham” (includes e.g. missed goals, or “this was not a goal”)
• Instead: “goal events with scorer David Beckham”
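The difference between the two queries can be shown on a toy collection. In the sketch below, the three sentences, the event records and both search functions are invented for illustration; the event records stand in for what an IE system might have extracted, and nothing here reflects the actual MUMIS interface.

```python
# Invented commentary snippets.
documents = [
    "Goal! Beckham scores from the free kick.",
    "Beckham shoots, but this was not a goal.",
    "Beckham misses the goal by inches.",
]

# Hand-written event records standing in for IE output over those snippets.
events = [
    {"type": "goal", "scorer": "David Beckham", "doc": 0},
    {"type": "missed_shot", "player": "David Beckham", "doc": 2},
]


def keyword_search(terms, docs):
    """Indices of documents containing every term, ignoring case."""
    return [
        i for i, doc in enumerate(docs)
        if all(term.lower() in doc.lower() for term in terms)
    ]


def semantic_search(event_type, scorer, records):
    """Indices of documents holding an event of the given type with the given scorer."""
    return [
        r["doc"] for r in records
        if r["type"] == event_type and r.get("scorer") == scorer
    ]


keyword_hits = keyword_search(["goal", "Beckham"], documents)
semantic_hits = semantic_search("goal", "David Beckham", events)
```

The keyword query matches all three snippets, including the miss and the disallowed goal, while the query over event records returns only the first.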

The results: England win! 15(21)

GATE, a General Architecture for Text Engineering, is... 16(21)
• An architecture: a macro-level organisational picture for LE software systems.
• A framework: for programmers, GATE is an object-oriented class library that implements the architecture.
• A development environment: for language engineers, a graphical development environment.
• GATE comes with...
  • Free components, and wrappers for other people’s stuff
  • Tools for: evaluation; visualise/edit; persistence; IR; IE; dialogue; ontologies; etc.
• Free software (LGPL) at http://gate.ac.uk/download/
• Used by thousands of people at hundreds of sites

A bit of a nuisance (GATE users) 17(21)
GATE team projects.
Past:
• Conceptual indexing: MUMIS: automatic semantic indices for sports video
• MUSE, cross-genre entity finder
• HSL, Health-and-safety IE
• Old Bailey: collaboration with HRI on 17th century court reports
• Multiflora: plant taxonomy text analysis for biodiversity research e-science
• ACE/TIDES: Arabic, Chinese NE
• JHU summer w/s on semtagging
• EMILLE: S. Asian languages corpus
• hTechSight: chemical eng. K. portal
Present:
• Advanced Knowledge Technologies: €12m UK five site collaborative project
• SEKT Semantic Knowledge Technology
• PrestoSpace MM Preservation/Access
• KnowledgeWeb Semantic Web
Future:
• New eContent project LIRICS
Thousands of users at hundreds of sites. A representative sample:
• the American National Corpus project
• the Perseus Digital Library project, Tufts University, US
• Longman Pearson publishing, UK
• Merck KgAa, Germany
• Canon Europe, UK
• Knight Ridder, US
• BBN (leading HLT research lab), US
• SMEs inc. Sirma AI Ltd., Bulgaria
• Stanford, Imperial College, London, the University of Manchester, UMIST, the University of Karlsruhe, Vassar College, the University of Southern California and a large number of other UK, US and EU Universities
• UK and EU projects inc. MyGrid, CLEF, dotkom, AMITIES, Cub Reporter, EMILLE, Poesia...

GATE – infrastructure for semantic metadata extraction 18(21)
• Combines learning and rule-based methods (new work on mixed-initiative learning)
• Allows combination of IE and IR
• Enables use of large-scale linguistic resources for IE, such as WordNet
• Supports ontologies as part of IE applications - Ontology-Based IE
• Supports languages from Hindi to Chinese, Italian to German

PrestoSpace Semantics Architecture 19(21)
[Diagram. Components: Text Sources (EN, IT, ...) processed by IE; AV Signals (signal md, transcriptions, ASR, etc.); Annotations; Ontology-Based Metadata Merging; Final Annotations; Multilingual Conceptual Q & A.]

Memory is not a luxury 20(21)
• C21st: all the C20th mistakes but bigger & better?
• If you don’t know where you’ve been, how can you know where you’re going?
• Archives: ammunition in the war on ignorance
• Ammunition is useless if you can’t find it: new technology must make our history accessible to all, for all our futures

Links 21(21)
• This talk: http://gate.ac.uk/sale/talks/eculture-graz-may2004.ppt
• Related projects:
