
Ontotext @ JRC

Presentation Transcript


  1. Ontotext @ JRC

  2. Semantic Web • The Semantic Web is an abstract representation of data on the WWW, based on RDF and other standards • The Semantic Web is being developed by the W3C, in collaboration with a large number of researchers and industrial partners • http://www.w3.org/2001/sw/ • http://www.SemanticWeb.org

  3. Semantic Web (II) • “The Semantic Web is an extension of the current web in which information is given well-defined meaning, better enabling computers and people to work in cooperation.” [Berners-Lee et al. 2001] The spirit: • Automatically processable metadata regarding the structure (syntax) and the meaning (semantics) of the content; • Presented in a standard form; • Dynamic interpretation for unforeseen purposes

  4. Semantic Web: Languages • RDF(S) – the next slides • SHOE, XOL, etc. – the pioneers • Topic Maps – a metadata language with limited impact • OIL – Ontology Interchange Language, the basis of the next two http://www.ontoknowledge.org/oil/ • Description Logics-based multilayered language • DAML+OIL – the predecessor of OWL, no longer being developed • OWL – the W3C standard Semantic Web ontology language, http://www.w3.org/2001/sw/WebOnt/ • Extends RDF(S), but also constrains it • Has multiple layers (Lite, DL, Full) • Transitive/symmetric/etc. properties, disjointness, cardinality restrictions

  5. Semantic Web: Problems • A critical mass of metadata is necessary • Still a lack of consensus on many issues (like query languages) • Lack of established practice at the proper scale and complexity • Lack of robust semantic (these days, RDFS) repositories, which: • Should be as flexible, multi-purpose and easy to use as HTTP servers and • As efficient in structured knowledge management as an RDBMS

  6. What are Sirma & Ontotext? • Established in 1992 as a Bulgarian AI Lab. • Current structure: • Sirma Group International Corp, Montreal, Canada; • 8 subsidiary companies; the most important ones follow below. • Sirma AI, Sofia • The R&D backbone of the group with two divisions: • Sirma Solutions: e-Business, banking, C3, e-Publishing, consultancy; • Ontotext Lab: Knowledge and Language Engineering. • EngView Systems, Montreal • CAD/CAM systems and applications. • WorkLogic.Com, Ottawa • Web-based collaboration, workflow, e-Gov.

  7. Software Development and Research since 1992 • Track record of success – large companies and government organizations in US, Canada, Western Europe and Bulgaria; • Top-3 software company in Bulgaria; • About 70 developers; • ISO 2001 Certificate; • 1999 EIST prize winner.

  8. Sirma Businesses and Domains Diverse business, ranging from COTS products to custom projects, consultancy, and outsourcing services. Major areas: • AI – expert systems (besides Ontotext); • B2B marketplaces; • CAD/CAM (for packaging, quality control); • e-Government, CSCW, Groupware, Workflow; • Banking; • C3/C4 Systems (military, airport traffic); • VoIP billing systems; • e-Publishing, Proofing tools.

  9. Ontotext Lab An R&D lab of Sirma for Knowledge and Language Engineering. Research and core technology development for knowledge discovery, management, and engineering. Specialized for applications in Semantic Web, Knowledge Management, and Web Services. Aside from the scientific matters, most of us are just professional software developers.

  10. Leading Semantic Web Technology Provider Ontotext is a leading Semantic Web technology provider, being: • the developer of the KIM Semantic Annotation Platform; • a co-developer of the GATE language engineering platform; • a co-developer of the Sesame semantic repository and the OWLIM high-performance OWL reasoner; • the developer of the WSMO4J semantic web services API; • a partner in the SWAN Semantic Web Annotator project. Ontotext is part of most of the major European research projects in the field and is the most successful Bulgarian participant in FP6.

  11. Mission • A critical mass of research in a number of AI areas has made efficient KM almost possible. • The technology on the market is mostly of two sorts: • Expensive black boxes • Academic prototypes Our mission is: • To develop and popularize open, skillfully engineered tools... • For Information Extraction and Knowledge Management, • Which considerably reduce the cost of implementing and using KM applications.

  12. Major Research Areas We focus on building cutting-edge expertise and technology in the following areas: • ontology design, management, and alignment; • knowledge representation, reasoning; • information extraction (IE), applications in IR; • semantic web services; • upper-level ontologies and lexical semantics; • NLP: POS tagging, gazetteers, co-reference resolution, named entity recognition (NER); • machine learning (HMM, NN, etc.)

  13. Academic & Technology Partners • NLP Group, Sheffield University, UK; • Digital Enterprise Research Institute (DERI), Institut für Informatik, Innsbruck, Austria, and National University of Ireland, Galway; • Aduna (Aidministrator) b.v., The Netherlands; • Linguistic Modelling Lab, CLPOI, Bulgarian Academy of Sciences; • British Telecommunications Plc (BT), UK; • Forschungszentrum Informatik (FZI) and Institut AIFB, Karlsruhe, Germany.

  14. Customers • SemanticEdge GmbH, Berlin, Germany; • QinetiQ Ltd, UK; • Fairway Consultants, UK.

  15. Research Projects We were/are part of a number of FP5 research projects: • On-To-Knowledge - the project which invented OIL. Ontology Middleware Module and a DAML+OIL reasoner. • VISION - Towards Next Generation Knowledge Management. • OntoWeb - Ontology-based information exchange for knowledge management … • SWWS - Semantic Web enabled Web Services.

  16. Research Projects (II) FP6 integrated projects that started in January 2004, with durations of about 3 years: • SEKT: Semantic Knowledge Technologies. Targeting a synergy of Ontology and Metadata Technology, Knowledge Discovery and Human Language Technology. • DIP: Data, Information, and Process Integration with Semantic Web Services. • PrestoSpace: Preservation towards storage and access. Standardized Practices for Audiovisual Contents in Europe. • Infrawebs: Intelligent Framework for Generating Open (Adaptable) Development Platforms for Web-Service Enabled Applications Using Semantic Web Technologies, Distributed Decision Support Units and Multi-Agent-Systems.

  17. Introduction to Ontologies Despite the formal definitions, ontologies are: • Conceptual models or schemata • Represented in a formalism which allows • Unambiguous “semantic” interpretation • Inference • Can be considered a combination of: • DB schema • XML Schema • OO-diagram (e.g. UML) • Subject hierarchy/taxonomy (think of Yahoo) • Business logic rules

  18. Introduction to Ontologies (II) • Imagine a DB storing “John is a son of Mary”. • It will be able to “answer” just: • Who are the sons of Mary? Whose son is John? • An ontology with a definition of the family relationships could also infer: • John is a child of Mary (more general); • Mary is a woman; • Mary is the mother of John (inverse); • Mary is a relative of John (generalized inverse). • The above facts would remain “invisible” to a typical DB, whose model of the world is limited to data structures of strings and numbers.
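The inference on this slide can be illustrated with a minimal Python sketch. It is not how any particular reasoner is implemented; the property names (`sonOf`, `childOf`, `relativeOf`, `motherOf`) are hypothetical, and the fact that Mary is a woman is supplied as an explicit input here rather than inferred.

```python
# Hypothetical mini-ontology: property names are illustrative only.
SUB_PROPERTY = {"sonOf": "childOf", "childOf": "relativeOf"}  # p implies the more general q
SYMMETRIC = {"relativeOf"}                                    # p(a, b) implies p(b, a)


def infer(facts):
    """Forward-chain over (subject, property, object) triples until nothing new appears."""
    facts = set(facts)
    while True:
        new = set()
        for s, p, o in facts:
            if p in SUB_PROPERTY:                  # generalization
                new.add((s, SUB_PROPERTY[p], o))
            if p in SYMMETRIC:                     # symmetric property
                new.add((o, p, s))
            if p == "childOf" and (o, "type", "Woman") in facts:
                new.add((o, "motherOf", s))        # inverse, restricted to women
        if new <= facts:                           # fixpoint reached
            return facts
        facts |= new


# Assumption: Mary's type is given explicitly as a stored fact.
closure = infer({("John", "sonOf", "Mary"), ("Mary", "type", "Woman")})
for triple in sorted(closure):
    print(triple)
```

A plain DB lookup would only see the `sonOf` row; the closure also contains the `childOf`, `motherOf`, and both `relativeOf` triples.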

  19. Products • The Ontology Middleware Module (OMM) is an enterprise back-end for formal KR and KM applications based on Semantic Web standards • An extension of the Sesame RDF(S) repository that adds a Knowledge Control System • OMM integration options: Built-In, RMI, SOAP, HTTP

  20. Products • BOR – a DAML+OIL reasoner. • Proprietary GATE components: • Hash Gazetteer. A high-performance lookup tool. • Hidden Markov Model Learner. A stochastic module for filtering annotations, disambiguation, etc., based on confidence measures. • The News Collector is a web service, collecting and indexing articles from the top-10 global news wires: • About 1000 articles/day, annotated and indexed using KIM; • Used to validate the heuristics and resources of KIM.

  21. Products (II) • The KIM Platform (the next slides), http://www.ontotext.kim. • SWWS Studio (http://swws.ontotext.com) • Semantic Web Service description development environment • Developed in the course of the SWWS project • Based on WSMO (http://www.wsmo.org) • WSMO4J (http://wsmo4j.sourceforge.net) • A WSMO API and a reference implementation • for building Semantic Web Services applications • Used in WSMO Studio (http://www.wsmostudio.org/) • The basis for ORDI, used in OMWG (http://www.omwg.org) • Used in projects DIP, SEKT, Infrawebs

  22. OWLIM • OWLIM is a high-performance OWL repository • Storage and Inference Layer (SAIL) for the Sesame RDF database • OWLIM performs OWL DLP reasoning • It uses the IRRE (Inductive Rule Reasoning Engine) for forward-chaining and “total materialization” • In-memory reasoning and query evaluation • OWLIM provides reliable persistence, based on RDF N-Triples • OWLIM can manage millions of statements on desktop hardware • Extremely fast upload and query evaluation even for huge ontologies and knowledge bases

  23. Scalability: Upload and Reasoning

  24. Scalability: Query Answering • Q2: Pattern of 12 statement-joins and LIKE literal constraint

  25. OWLIM under LUBM Benchmark • The Lehigh Univ. evaluation is one of the most comprehensive benchmark experiments published recently (ISWC 2004, WSJ 2005) • Synthetically generated OWL knowledge bases • The biggest set generated is LUBM(50,0) – 6M explicit statements • 14 queries, checking different inferences OWLIM on LUBM: • On a desktop machine OWLIM loads LUBM(50,0) in 10 min • The only other system known to load it takes 12 hours • All the queries are answered correctly Based on this we can claim that OWLIM is the fastest OWL repository in the world!

  26. JOCI • “Jobs & Contacts Intelligence”, Innovantage, Fairway Consultants • Gathering recruitment-related information from the web-sites of UK organizations • Offering services on top of this data to recruitment agencies, job portals, and others. • JOCI uses KIM for information extraction (IE, text-mining) • JOCI makes use of a domain ontology to: • support the IE process, • structure the knowledge base with the obtained results, and • facilitate semantic queries. • Sirma is a shareholder in Fairway Consultants

  27. JOCI Dataflow [Diagram; components: UK Web Space, Web UI, Information Extraction, KIM Server, Single-Document IE, Semantic Repository, Focused Crawler, Crawler, Classifier, Object Consolidation, Document Store]

  28. JOCI: Vacancy Consolidation/Matching [Diagram; nodes: Consolidated Vacancy, Vacancy 1, Vacancy 2, “IT Applications Support Analyst”, “Support Analyst”, Glasgow, Scotland, U.K., City, Country, Location; edge labels: hasJobTitle, locatedIn, sub-string, subRegionOf, type, subClassOf]
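A rough Python sketch of the matching idea suggested by this diagram: two vacancy records are treated as the same vacancy when one job title is a sub-string of the other and one location lies inside the other. The transcript does not say which vacancy carries which title or location, so the assignment below is an assumption, and the function and field names are hypothetical rather than part of JOCI.

```python
# Region hierarchy taken from the diagram labels (subRegionOf edges).
SUB_REGION_OF = {"Glasgow": "Scotland", "Scotland": "U.K."}


def region_chain(place):
    """Return the place followed by every region that encloses it."""
    chain = [place]
    while chain[-1] in SUB_REGION_OF:
        chain.append(SUB_REGION_OF[chain[-1]])
    return chain


def same_vacancy(a, b):
    """Heuristic match: sub-string job titles and compatible locations."""
    title_a, title_b = a["title"].lower(), b["title"].lower()
    titles_agree = title_a in title_b or title_b in title_a
    places_agree = (a["place"] in region_chain(b["place"])
                    or b["place"] in region_chain(a["place"]))
    return titles_agree and places_agree


# Assumed assignment of titles and locations to the two vacancies.
vacancy1 = {"title": "IT Applications Support Analyst", "place": "Glasgow"}
vacancy2 = {"title": "Support Analyst", "place": "Scotland"}
print(same_vacancy(vacancy1, vacancy2))
```

A production consolidation step would need stronger evidence than a sub-string test (employer, posting date, salary), so this only illustrates the two relations drawn on the slide.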

  29. JOCI Statistics • The figures below are indicative and reflect an earlier state of the JOCI system: • The actual figures are to be announced after the launch of JOCI • Web-sites inspected: 0.5M • Web-sites with vacancy announcements: 30K • Extracted vacancies: 100K

  30. The KIM Platform • A platform offering services and infrastructure for: • (semi-) automatic semantic annotation and • ontology population • semantic indexing and retrieval of content • query and navigation over the formal knowledge • Based on Information Extraction technology

  31. KIM: What’s Inside? The KIM Platform includes: • Ontologies (PROTON + KIMSO + KIMLO) and KIM World KB • KIM Server – with a set of APIs for remote access and integration • Front-ends: Web-UI and plug-in for Internet Explorer.

  32. The AIM of KIM • Aim: to arm Semantic Web applications • by providing a metadata generation technology • in a standard, consistent, and scalable framework

  33. What Does KIM Do? Semantic Annotation

  34. Simple Usage: Highlight, Hyperlink, and…

  35. Simple Usage: … Explore and Navigate

  36. Simple Usage: … Enjoy a Hyperbolic Tree View

  37. KIM is Based On… KIM is based on the following open-source platforms: • GATE – the most popular NLP and IE platform in the world, developed at the University of Sheffield. Ontotext is its biggest co-developer. www.gate.ac.uk and www.ontotext.com/gate • OWLIM – an OWL repository, compliant with the Sesame RDF database from Aduna B.V. www.ontotext.com/owlim • Lucene – an open-source IR engine by Apache. jakarta.apache.org/lucene/

  38. How KIM Searches Better KIM can match a query like: “Documents about a telecom company in Europe, John Smith, and a date in the first half of 2002” with a document containing: “At its meeting on the 10th of May, the board of Vodafone appointed John G. Smith as CTO”. Classical IR could not match: • Vodafone with a “telecom in Europe”, because: • Vodafone is a mobile operator, which is a sort of telecom; • Vodafone is in the UK, which is a part of Europe. • 10th of May with a “date in the first half of 2002”; • “John G. Smith” with “John Smith”.
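The three matching steps on this slide can be sketched in Python as follows. This is an illustration of the idea only, not the KIM API: the dictionaries stand in for a knowledge base, all names are hypothetical, and the year 2002 is assumed to have been resolved from the document's own date, since the quoted sentence does not state it.

```python
from datetime import date

# Hypothetical background knowledge standing in for an ontology and KB.
SUBCLASS_OF = {"MobileOperator": "Telecom", "Telecom": "Company"}
TYPE_OF = {"Vodafone": "MobileOperator"}
LOCATED_IN = {"Vodafone": "UK"}
SUB_REGION_OF = {"UK": "Europe"}


def is_a(entity, cls):
    """True if the entity's class equals cls or is a (transitive) subclass of it."""
    current = TYPE_OF.get(entity)
    while current is not None:
        if current == cls:
            return True
        current = SUBCLASS_OF.get(current)
    return False


def located_in(entity, region):
    """True if the entity's location equals region or lies inside it."""
    current = LOCATED_IN.get(entity)
    while current is not None:
        if current == region:
            return True
        current = SUB_REGION_OF.get(current)
    return False


def same_person(name_a, name_b):
    """Crude alias test: first and last names agree, middle initials ignored."""
    a, b = name_a.replace(".", "").split(), name_b.replace(".", "").split()
    return a[0] == b[0] and a[-1] == b[-1]


# Entities assumed to have been extracted from the example sentence.
doc = {"organizations": ["Vodafone"],
       "people": ["John G. Smith"],
       "dates": [date(2002, 5, 10)]}


def matches(document):
    """Query: a telecom in Europe, John Smith, and a date in the first half of 2002."""
    return (any(is_a(o, "Telecom") and located_in(o, "Europe")
                for o in document["organizations"])
            and any(same_person(p, "John Smith") for p in document["people"])
            and any(date(2002, 1, 1) <= d <= date(2002, 6, 30)
                    for d in document["dates"]))


print(matches(doc))
```

Keyword search sees none of the query words in the sentence; the match succeeds only because class, region, alias, and date knowledge are consulted.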

  39. Entity Pattern Search

  40. Pattern Search: Entity Results

  41. Entity Pattern Search: KIM Explorer

  42. Semantic Metadata in KIM… • Provides a specific metadata schema, • focusing on named entities (particulars), • as well as number and time-expressions, addresses, etc., • everything “specific”, apart from the general concepts. • Defines specific tasks for generation and usage of the metadata which are well-understood and measurable. • Why not metadata about general things (universals)? • It is too complex… • but we leave the door open. • The particulars seem to provide a good 80/20 compromise.

  43. World Knowledge in KIM Rationale: • provide common knowledge about world entities; • KIM bets on scale and avoids heavy semantics: minimum modeling of common sense, almost no axioms; • The ontology is encoded in OWL Lite and RDF. • In addition, a number of rules (generative axioms) are defined, e.g.: <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z> • Axioms of this sort are supported by OWLIM and they provide a consistent mechanism for “custom” extensions to the OWL or RDF(S) semantics with respect to a particular ontology
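The generative axiom quoted on this slide can be applied by simple forward chaining. The Python sketch below shows the technique under stated assumptions: triples are plain tuples, the entity names are hypothetical examples, and nothing here reflects how OWLIM implements its rule engine.

```python
def materialize(facts):
    """Apply <X,locatedIn,Y> and <Y,subRegionOf,Z> => <X,locatedIn,Z> until a fixpoint."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        located = [(x, y) for x, p, y in facts if p == "locatedIn"]
        regions = [(y, z) for y, p, z in facts if p == "subRegionOf"]
        for x, y in located:
            for y2, z in regions:
                if y == y2 and (x, "locatedIn", z) not in facts:
                    facts.add((x, "locatedIn", z))   # newly derived statement
                    changed = True
    return facts


# Hypothetical explicit statements.
kb = materialize({
    ("ex:Vacancy1", "locatedIn", "Glasgow"),
    ("Glasgow", "subRegionOf", "Scotland"),
    ("Scotland", "subRegionOf", "U.K."),
})
for triple in sorted(kb):
    print(triple)
```

Storing all derived statements up front is what makes later queries cheap: asking for everything located in the U.K. becomes a direct lookup.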

  44. PROTON • Name. PROTON is an acronym for Proto Ontology • ex-names: BULO (basic upper-level ontology), GO (generic ontology); • not a Russian space rocket; • “proto” – used in the sense of “primary”, “beginning”, “giving rise to”, vs. “first in time” or “oldest”; • connotations: positive, fundamental, elemental, “in favour of”, even romantic (like a science-fiction novel from the 1960s). • Intended usage. A Basic Upper-Level Ontology like PROTON is used for: • ontology population; • knowledge modelling and integration strategy of a KM environment; • generation of domain, application, and other ontologies.

  45. PROTON Design • Design principles: • domain-independence; • light-weight logical definitions; • compliance with popular metadata standards; • good coverage of concrete and/or named entities (i.e. people, organizations, numbers); • no specific support for general concepts (such as “apple”, “love”, “walk”); however, the design allows for such extensions.

  46. Some Figures… • PROTON defines about 250 classes and 100 properties • Providing coverage of most of the upper-level concepts necessary for semantic annotation, indexing, and retrieval • A modular architecture, allowing for great flexibility of usage and extension: • SYSTEM module - contains a few meta-level primitives (6 classes and 7 properties); introduces the notion of 'entity', which can have aliases; • TOP module - the highest, most general, conceptual level, consisting of about 20 classes; • UPPER module - over 200 general classes of entities, which often appear in multiple domains.

  47. PROTON Ontology Language • The current version of the ontology is encoded in OWL Lite. • A few custom entailment rules (axioms) are also defined for usage in tools that support them, for instance: Premise: <xxx, protont:roleHolder, yyy> <xxx, protont:roleIn, zzz> <yyy, rdf:type, protont:Agent> Consequent: <yyy, protont:involvedIn, zzz> • Axioms of this sort are interpreted by OWLIM • PROTON is portable to any OWL (Lite)-compliant tool. • PROTON can also be used without such axioms.
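To make the premise/consequent form concrete, here is a small Python sketch of a generic rule matcher applied to the rule quoted on this slide. Variables are written with a leading `?` instead of xxx/yyy/zzz, the `ex:` instances are hypothetical, and the code is an illustration of the technique rather than OWLIM's rule language or engine.

```python
def match(pattern, triple, binding):
    """Extend `binding` so that `pattern` equals `triple`; return None if impossible."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):                       # variable term
            if binding.setdefault(term, value) != value:
                return None                            # conflicts with earlier binding
        elif term != value:                            # constant term must be identical
            return None
    return binding


def apply_rule(premises, consequent, facts):
    """Return every consequent triple whose premises are all satisfied by `facts`."""
    bindings = [{}]
    for pattern in premises:
        bindings = [extended
                    for binding in bindings
                    for triple in facts
                    if (extended := match(pattern, triple, binding)) is not None]
    return {tuple(b.get(term, term) for term in consequent) for b in bindings}


PREMISES = [("?x", "protont:roleHolder", "?y"),
            ("?x", "protont:roleIn", "?z"),
            ("?y", "rdf:type", "protont:Agent")]
CONSEQUENT = ("?y", "protont:involvedIn", "?z")

# Hypothetical instance data: a CTO role held by a person within a company.
facts = {("ex:ctoRole", "protont:roleHolder", "ex:JohnSmith"),
         ("ex:ctoRole", "protont:roleIn", "ex:Vodafone"),
         ("ex:JohnSmith", "rdf:type", "protont:Agent")}

derived = apply_rule(PREMISES, CONSEQUENT, facts)
print(derived)
```

In a real repository the `rdf:type protont:Agent` statement would often itself be inferred from a more specific class; here it is asserted directly to keep the sketch short.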

  48. Other Standards: Relations • ADL Feature Type Thesaurus and GNS • the backbone of the Location branch; • in turn aligned with the geographic feature designators of the GNS database of NIMA; • PROTON is more coarse-grained, taking about 80 out of 300 types. • Dublin Core • the basic element set is available as properties of the protont:InformationResource and protont:Document classes; • the resource type vocabulary is mapped to sub-classes of InformationResource. • OpenCyc and WordNet – consulted and referred to in glosses. • ACE (Automatic Content Extraction) annotation types – covered. • FOAF – easy mapping is assured (e.g. the Account class was added). • DOLCE, EuroWordnet Top, and others – consulted to various extents.

  49. Other Standards: Compliance • Other models are not directly imported (for consistency reasons) • The mapping of the appropriate primitives is easy, on the basis of • a compliant design, and • formal notes in the PROTON glosses, which indicate the appropriate mappings. • For instance, in PROTON, a protont:inLanguage property is defined • as an equivalent of the dc:language element in Dublin Core • with a domain protont:InformationResource • and a range protont:Language

  50. KIM World KB A quasi-exhaustive coverage of the most popular entities in the world … • What a person is expected to have heard about beyond the horizons of their country, profession, and hobbies. • Entities of general importance … like the ones that appear in the news … KIM “knows”: • Locations: mountains, cities, roads, etc. • Organizations, all important sorts of: business, international, political, government, sport, academic… • Specific people, etc.
