CSC 9010: AeroText, Ontologies, AeroDAML Dr. Paula Matuszek Paula_A_Matuszek@glaxosmithkline.com (610) 270-6851
AeroText
• Information Extraction tool marketed by Lockheed Martin
• Capabilities similar to GATE
• Much better developed IDE
• Less open to extensions of the system itself
• Equally steep learning curve for effective use!
• Lockheed AeroText General Overview
• Lockheed AeroText White Paper
Ontologies
• Information Extraction requires modeling extensive domain knowledge
• Other applications of text mining, such as document categorization, can also use domain information
• In modeling such knowledge we often create an ontology: an explicit formal specification of how to represent the objects, concepts, and other entities that are assumed to exist in some area of interest and the relationships that hold among them.
A Simple Ontology: Birthdates
• Objects, concepts, entities:
  • months, days, years
  • dates
  • first names
  • last names
  • persons
  • birthdates
• Relationships between them:
  • a date has exactly one month, one day, and one year
  • a birthdate is a date
  • a person has at least one first name and exactly one last name
  • a person has a birthdate
  • a birthdate has a person
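The birthdate ontology above can be sketched as a few Python classes. This is only an illustration of how the cardinality and "is a" relationships might be encoded; the class and field names are assumptions made for this example, and the person shown is invented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Date:
    # A date has exactly one month, one day, and one year.
    month: int
    day: int
    year: int

    def __post_init__(self):
        # Coarse range checks only; month lengths are not modeled here.
        if not 1 <= self.month <= 12:
            raise ValueError("month must be in 1..12")
        if not 1 <= self.day <= 31:
            raise ValueError("day must be in 1..31")


@dataclass(frozen=True)
class Birthdate(Date):
    # "A birthdate is a date" becomes a subclass relationship.
    pass


@dataclass(frozen=True)
class Person:
    # At least one first name, exactly one last name, and a birthdate.
    first_names: Tuple[str, ...]
    last_name: str
    birthdate: Birthdate

    def __post_init__(self):
        if len(self.first_names) < 1:
            raise ValueError("a person needs at least one first name")


# Hypothetical instance, used only to exercise the constraints.
mary = Person(("Mary", "Ann"), "Smith", Birthdate(3, 14, 1970))
```

The inverse relationship ("a birthdate has a person") is not stored here; in this sketch it would have to be recovered by searching the set of persons, which is one of the design decisions an ontology language makes explicit.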
Who and Why?
• Many groups are developing ontologies:
  • standardize terms and vocabulary
  • facilitate the semantic web
  • improve information integration
  • interested in the domain itself
• Some ontologies under development
  • Cyc
  • GO (Gene Ontology)
  • UMLS (Unified Medical Language System)
  • CIA World Factbook
DAML
• DARPA Agent Markup Language
• A language for describing ontologies
• Example: an ontology for dates
• Extensive information available at www.daml.org.
UBOT
• UML Based Ontology Toolkit
• Part of a DARPA project to automatically mark up web pages to make them machine-readable
• The purpose of DAML is to annotate information on the web to make it machine-readable so that software agents can interpret it and reason with it: the semantic web
• http://ubot.lockheedmartin.com/ubot/intro/index.html
AeroDAML
• AeroDAML is a web service that takes a web page as an input and generates DAML markup.
• Uses AeroText as the underlying extraction tool.
• Works with various ontologies.
• Paper describing system
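To make the "web page in, markup out" idea concrete, here is a toy Python sketch of the general shape of such a pipeline. It is not AeroDAML or AeroText and does not produce real DAML: it assumes a single hypothetical `Date` class, uses a regular expression in place of a real extraction engine, and returns plain dictionaries instead of markup. All names in it are invented for this example.

```python
import re
from html.parser import HTMLParser


class _TextOnly(HTMLParser):
    """Collects the text content of a page, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


# Assumed pattern: dates written like "March 14, 1970".
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
DATE_RE = re.compile(rf"\b({MONTHS}) (\d{{1,2}}), (\d{{4}})\b")


def annotate(html):
    """Return one annotation dict per date found in the page text."""
    parser = _TextOnly()
    parser.feed(html)
    text = " ".join(parser.chunks)
    return [
        {"class": "Date", "month": m.group(1),
         "day": int(m.group(2)), "year": int(m.group(3))}
        for m in DATE_RE.finditer(text)
    ]


page = "<html><body><p>She was born on <b>March 14, 1970</b>.</p></body></html>"
annotations = annotate(page)
```

A real system differs mainly in the middle step: the extraction rules are far richer than one regular expression, and the classes and properties in the output come from whichever ontology was selected.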
Lab: try out AeroDAML
• AeroDAML page
• Choose a news page (www.phillynews.com, Google News, ...) and tag it with the Cyc and CIA ontologies.
• How well did each ontology do at picking up content? Did they miss things they should have found? Was anything tagged incorrectly?
• Repeat for one of your domain-specific documents, or a web page in a specific area. Try a different ontology if you think one of the others might be more interesting.
• How was the annotation different?
• Are we enabling the semantic web?
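One way to put numbers on the lab questions ("did it miss things?", "was anything tagged incorrectly?") is to compare the tool's tags against a small hand-made answer key and compute precision and recall. The sketch below assumes each tag is a (text, class) pair; the example tags are invented, not actual AeroDAML output.

```python
def precision_recall(predicted, gold):
    """Compare predicted tags with a hand-labeled gold set.

    precision = fraction of predicted tags that are correct
    recall    = fraction of gold tags that were found
    """
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical answer key and tool output for one page.
gold = {
    ("Philadelphia", "City"),
    ("Lockheed Martin", "Organization"),
    ("March 14, 1970", "Date"),
}
predicted = {
    ("Philadelphia", "City"),
    ("Lockheed Martin", "Person"),   # right text, wrong class
    ("March 14, 1970", "Date"),
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Missed items lower recall, and incorrect tags lower precision, so reporting both for each ontology gives a compact way to compare them on the same page.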