
Smartlogic Semaphore: Enterprise-scale Semantic Platform

Build and manage semantic models, simplify ontology development, automate metadata enrichment, and leverage ontologies for enhanced content experiences. Semaphore delivers these capabilities at an enterprise scale.


Presentation Transcript


  1. Contents

  2. A very small introduction to Smartlogic Semaphore

  3. Semaphore in a nutshell • Build and manage semantic models • Simplify the ingestion, development or customization of ontologies • Resource description and authoring • Automated and assisted metadata enrichment, entity recognition and fact extraction • Knowledge discovery and surfacing • Contextual metadata-driven search and navigation: leverage ontologies to deliver the richest experience for users when publishing, using and analyzing content • Semaphore delivers these capabilities at enterprise scale

  4. Conceptual architecture

  5. Where Semaphore fits in the Enterprise (diagram) • Consuming applications: Business Applications (e.g. compliance), Analytics (graph), Search, Mobile (e.g. targeted content), Content Production, Portals (e.g. self-service) • Semaphore capabilities: Model Management, Classification, Extraction, Text Mining, NLP, Semantic Enhancement • Sources and external systems: Social Media, ERP, Content Applications, External Systems, Linked Data

  6. Proven Technology - Fortune 2000 Customers and Tier 1 Partners (use cases driving partner pull) • FINANCIAL SERVICES: Compliance (KYC, FATCA, MiFID II), Risk Assessment (contract analysis), Reputation Management • LIFE SCIENCES: Market Compliance (Adverse Event Reporting, IDMP), Metadata Hub for KM, Research Recycling • MEDIA & PUBLISHING: Monetising corpus (for new audiences), Metadata Automation, Operational Intelligence (Data Lake) • HEALTHCARE: Predictive Analytics (preventative healthcare), Customer Service, Drug and Treatment (knowledge surfacing) • HEAVY ASSET SERVICING: Compliance (ITAR), Customer Service, Social Media (Knowledge Harvesting) • RETAIL: Metadata Management, Data Harmonization, Cross-sell/Upsell • INTELLIGENCE & SECURITY: Field Intelligence Ops, Operational Intelligence, Knowledge Harvesting

  7. What we do: Unified Enterprise Intelligence • Professional Services: assist customers to accelerate deployments, avoid mistakes and maximize their return on investment with Semaphore, with specialist knowledge in information science; ontologies, text analytics and graph stores; integration; content management; search; analytics; and semantically-powered user experience • Semaphore Enterprise Semantic Platform: the Semaphore Suite (usability/workflow features, new integrations, advances from R&D) plus maintenance (24x7 software incident customer support, software patch access, software upgrade access, documentation) • Training: web-based self-training and instructor-led classroom (or virtual classroom) training for Ontology Management, Classification, and Systems Administration & Integration • Partners: partnering with software, reseller, management consulting and system integration partners to deliver the most comprehensive semantic technology solutions to organizations around the globe

  8. Applied taxonomy frameworks

  9. Applied taxonomy frameworks • There is great intellectual and analytical power in building conceptual models such as taxonomies – sense-making • But, ultimately, we want to apply a taxonomy to the most common resource requiring conceptual management in organizations – information • It turns out that applying taxonomy to information is non-trivial • Applying taxonomy manually is impracticable for numerous reasons • Consistent, coherent, large-scale, fast, untiring application of taxonomy is not a job best left to humans • What we want is a way of systematically applying taxonomy to content • We want a computerised, systematized, applied taxonomy framework • Semaphore is one such framework

  10. Semaphore’s applied taxonomy framework • Semaphore’s approach is: model-driven, rules-applied, and linguistic-aware • Semaphore is model-driven: the concepts and their labels that drive classification reside in the model (the taxonomy!) • Semaphore is rules-applied: the classification behaviour is applied using clear, logical, human-readable rules • Semaphore is linguistic-aware: it is aware of many linguistic concepts such as stems, lemmas, parts of speech, punctuation, etc.
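To make the model-driven, linguistic-aware idea concrete, here is a minimal sketch in Python. It is not Semaphore's rule engine; the model, the concept IDs and the crude stemmer are all invented for illustration.

```python
# Minimal sketch only: not Semaphore's rule engine. The model,
# concept IDs and the crude stemmer are invented for illustration.
import re

# The "model": concepts with preferred and alternative labels.
MODEL = {
    "concept/finance": {"pref": "finance", "alt": ["financial", "finances"]},
    "concept/health": {"pref": "health", "alt": ["healthcare", "medical"]},
}

def crude_stem(word: str) -> str:
    """A stand-in for real linguistic processing (stems/lemmas)."""
    for suffix in ("ial", "es", "s", "care"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def classify(text: str) -> dict:
    """Model-driven classification: score = label-stem hit count."""
    tokens = [crude_stem(t) for t in re.findall(r"[a-z]+", text.lower())]
    scores = {}
    for concept_id, labels in MODEL.items():
        stems = {crude_stem(l) for l in [labels["pref"], *labels["alt"]]}
        hits = sum(1 for t in tokens if t in stems)
        if hits:
            scores[concept_id] = hits
    return scores
```

The point of the sketch is the division of labour: the concepts and labels live in the model, the matching behaviour lives in the (here trivial) rules, and the linguistic normalisation sits underneath both.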

  11. A typical Ontology Editor model

  12. A typical classification looks like this

  13. A conceptual search (supported by SAYT – search-as-you-type)

  14. The results of the conceptual search

  15. It is Publisher that transforms the model • Publisher facilitates Semaphore being an applied taxonomy framework • Publisher generates the rules • It does this by applying its configuration files to what it has read from the model • It then processes that information through its pipeline of processors for generating variants, etc. • It generates one rulebase per concept in the model, and it publishes all the rules to Classification Server • (It also generates an index of the model which is consumed by Semantic Enhancement Server)

  16. Classification Server • Classification Server then implements what Publisher has output • It is the application that understands the linguistic side of things (parsing, tokenising, stemming, and so on), both for the labels from the rules and for the content itself • You send it content and it computes the matches between the rules it has loaded and the content • It then returns tags • The score of each tag represents the strength of the match (as computed by the rules in Classification Server) • It can process thousands of documents per hour, and can be scaled out to handle any volume
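As a rough illustration of the tag-and-score output described here, the sketch below models a tag record and sorts a hypothetical server response by match strength. The JSON shape is assumed, not Classification Server's actual API.

```python
# Hypothetical sketch of the tag-and-score shape; the real
# Classification Server API and response format may differ.
from dataclasses import dataclass

@dataclass
class Tag:
    concept_id: str   # ID of the matched concept in the model
    label: str        # label that produced the match
    score: float      # strength of the match, per the rulebase

def to_tags(response: dict) -> list:
    """Turn an assumed JSON response into Tag records, strongest first."""
    tags = [Tag(m["id"], m["label"], m["score"]) for m in response["matches"]]
    return sorted(tags, key=lambda t: t.score, reverse=True)
```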

  17. Semaphore 4 has progressed on several fronts

  18. Semaphore 4: On-premise, Cloud and Hybrid solutions

  19. Semaphore 4 typical physical deployment (architecture diagram) • Connectors and integrations: Nutch Pipeline Connector, CMIS toolkit, MarkLogic CPF connector, Term Store Integration, Semaphore Columns, Content Type Updater, Workflow Events • Target systems: MarkLogic, Apache Solr, Alfresco, Documentum, etc., Microsoft SharePoint 2010/2013 • Front ends: Search Application Framework and Toolkit Widgets on a Web Application Server (Servlet 3.0 / JSP Spec 2.2), Semaphore Web Parts • Administration: Semaphore Central Admin App

  20. Triple Store integration

  21. Some modern trends in applying taxonomies

  22. In the old days… • Things have progressed even since I joined Smartlogic just over seven years ago! • It used to be about: • …thesauri… • …manual tagging, assisted tagging, automatic tagging... • …for some facet, usually some “aboutness” style tagging… • …at the document level… • …of a simple tag… • …with a score as to its pertinence… • Bit of a generalisation, but, generally, that is what we were doing… • With some simple entity extraction

  23. The metadata management landscape, spanning structured to unstructured assets, with a multi-direction flow of metadata between tools • Metadata repositories (data definitions): define data assets by location, type, format, etc. • Data mapping tools (asset relationship mapping): manage explicitly defined mappings between assets • Taxonomy management tools (vocabulary maintenance): maintain a business glossary or taxonomy; catalog and categorize assets according to the taxonomy • Algorithmic classification tools (interpretation and classification): classify and tag assets based on content; derive relationships between assets from ontological relationships • Content intelligence (semantic metadata enrichment): extract entities, facts and sentiment; create triples for advanced analytics

  24. Technological changes… • So, what has changed? • The maturation of linked data, graph databases, semantic web standards, etc. • All these things share, with taxonomies, a graph topology • To take advantage of these graph-topologies, we are now seeing a huge uptick in interest in extracting complex, contextualised, structured, graph-like metadata • That is, people are interested in not just document to document links, but also, the contexts within documents, and the facts therein • We are super-excited by this and wish to exploit Semaphore to power this sort of metadata • We call such graph-based metadata facts!
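The "graph-like metadata" idea can be sketched as plain subject-predicate-object triples: document-level tags are single edges, while facts form small connected graphs. All URIs and values below are made up; the point is only the shape of the data and how it can be walked.

```python
# Sketch: a document-level "aboutness" tag vs. graph-like fact metadata
# as subject-predicate-object triples. All URIs/values are made up.
doc_level_tag = ("doc/123", "about", "topic/restaurants")

fact_triples = [
    ("review/1", "rdf:type", "Review"),
    ("review/1", "reviews", "restaurant/le-jardin"),
    ("restaurant/le-jardin", "phone", "+00 00 0000 0000"),  # placeholder
]

def objects_of(triples, subject, predicate):
    """Walk the graph: all objects for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]
```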

  25. The metadata challenge has always been… • Metadata has to come from somewhere • It does not grow under cabbage leaves • There is no metadata fairy • Creating it is the most difficult thing • Some exists in the wild • BUT, even where it exists… • It is not consistent • Human-added information tends to be inaccurate and incomplete Where does the metadata come from?

  26. …but metadata itself is evolving! • It is changing from document level tags describing the “aboutness” of content • To become much more focussed on the specific facts inherent in content • Representing the contexts of those facts properly • Representing those contexts in structured ways • So as to produce complex, structured, facts that are queryable • This allows certain aspects of the content to be treated as structured data Where do the structured, complex facts come from?

  27. What is a fact? • We have some new content - a restaurant review • It has various “facts” in it: • A review fact: • The reviewer fact • The review date fact • A restaurant fact: • The reviewed restaurant’s name fact • The restaurant’s address fact • The restaurant’s telephone number fact

  28. So what? • This allows the querying of telephone numbers of restaurants, not reviews! • We would like to extract the facts listed on the previous slide from the unstructured text in the context of the objects they are talking about • That is, we want to go from unstructured document-level text to structured object-level facts • We want to group our facts such that they create the right context • The review and all its sub-facts • The restaurant and all its sub-facts • We would like to semantically unify our content such that the same object (concept) across all facts (e.g. the same restaurant) is given the same URI • This allows the content to be unified and integrated at the semantic level of the objects / concepts talked about in the content

  29. How is fact extraction approached in Semaphore? • Fact extraction can mean a variety of things to Semaphore • However, ultimately, it is about identifying very specific evidence in content, which will be extracted as either: • Textual facts – that is, they have no ID • Conceptual facts – that is, they have an ID • Textual facts can be: • Verbatim (e.g. a sentence, a word) • Normalised (e.g. a DATE, a NUMBER) • Conceptual facts can be: • Verbatim (return the evidence in the content matched on by the concept) • Normalised (return the preferred label or URI for the matched concept) • In both cases, we will always have the ID of the concept
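The four combinations above (textual vs. conceptual, verbatim vs. normalised) can be sketched as a small data structure. The field names and example values are illustrative, not Semaphore's actual output schema.

```python
# Illustrative only: field names and values are not Semaphore's
# actual output schema.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractedFact:
    evidence: str                # text matched in the content
    normalised: Optional[str]    # normalised form (e.g. ISO date), if any
    concept_id: Optional[str]    # concept ID/URI; None for textual facts

# Textual, verbatim: a phrase as found, no ID
f1 = ExtractedFact("a charming little place", None, None)
# Textual, normalised: a DATE entity normalised to ISO form
f2 = ExtractedFact("1st March 2020", "2020-03-01", None)
# Conceptual, verbatim: the matched evidence plus the concept's ID
f3 = ExtractedFact("Le Jardin", None, "http://example.org/restaurant/42")
# Conceptual, normalised: the preferred label plus the concept's ID
f4 = ExtractedFact("Le Jardin", "Le Jardin Restaurant",
                   "http://example.org/restaurant/42")
```

Note that the two conceptual facts carry the same ID: that shared identifier is what makes later unification across the corpus possible.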

  30. Some content: a review • If we want to extract the review’s subtitle, we would use verbatim text • If we wanted to extract the name of the restaurant, we could use either verbatim text or a normalised concept • To extract the telephone number we could use the TELEPHONE entity (using verbatim text would be too difficult)

  31. Some content: a review • If we wanted to extract the name of the reviewer, we could use either an entity, i.e. PERSON or a normalised concept • If we wanted to extract the date of the review, we could use normalised text with an entity, i.e. DATE

  32. Conceptual extraction using taxonomies • To get the most benefit, we want the majority of our fact extraction to return concepts from the model, the taxonomy, with IDs • IDs allow us to unify our facts across the corpus • We can exploit the traditional taxonomic properties of alternative labels, associations, etc. when doing the extraction to give us those IDs (e.g. URIs, GUIDs, etc.) • We are also bringing the full weight of Semaphore’s linguistic-awareness to the extraction process • This all massively increases the likelihood of extraction working against variant content – it is still a form of classification! • Just highly objectified and contextualised

  33. We want to extract ingredient facts • The ingredient facts are complicated as there are several ways to structure a recipe ingredient • ½ onion • 1 celery stick • 2 tomatoes, peeled • 1 cup of flour, sieved • 250g of sugar, sieved • Here we see we have • A food (for example, tomatoes) • A measure (typically a number) • An amount (for example, cup or grams) • A preparation (for example, peeled)
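A naive way to see why these sub-facts are separable is a throwaway regex over sample lines like the ones above. This is emphatically not how Semaphore does it (Semaphore is model-driven), just a sketch of the food / measure / amount / preparation split.

```python
# Throwaway regex sketch of the food / measure / amount / preparation
# split; Semaphore's model-driven extraction is far more robust.
import re

PATTERN = re.compile(
    r"(?P<measure>[\d/½¼]+)\s*"        # measure: a number or fraction
    r"(?P<amount>cup of|g of|)\s*"     # amount/unit, possibly absent
    r"(?P<food>[a-z ]+?)"              # the food itself
    r"(?:,\s*(?P<prep>[a-z ]+))?$"     # optional preparation
)

def parse_ingredient(line: str) -> dict:
    """Split one ingredient line into its sub-facts (None if absent)."""
    m = PATTERN.match(line.lower())
    return {k: (v or None) for k, v in m.groupdict().items()} if m else {}
```

The regex already shows the fragility of a purely textual approach: every new unit or phrasing needs another hand-written alternative, which is exactly the maintenance burden a model-driven framework avoids.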

  34. Defining our own zones, or grouped facts! • All fact extraction depends on identifying a zone • Semaphore allows you to define zones / facts by defining a mix of conceptual and textual extractions, allowing complex, or grouped, facts to be extracted • This greatly increases the usefulness of fact extraction; we can define complex zones / facts and have them come back with all the contextual structure intact • So, for example, a recipe is a fact / zone (fact: recipe) • Which is made up of ingredients (fact: ingredient) • Each ingredient has a food (fact: food) • Each ingredient has an amount (fact: amount) • Each ingredient has a preparation (fact: preparation) • This is a complex or grouped fact that has a hierarchical structure
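The grouped recipe fact described above might be represented as a nested structure like this (the shape is illustrative, not Semaphore's output format):

```python
# Illustrative nesting of a grouped fact; not Semaphore's actual
# output format.
recipe_fact = {
    "fact": "recipe",
    "children": [
        {
            "fact": "ingredient",
            "children": [
                {"fact": "food", "value": "tomatoes"},
                {"fact": "amount", "value": "2"},
                {"fact": "preparation", "value": "peeled"},
            ],
        },
    ],
}

def leaves(fact: dict) -> list:
    """Flatten a grouped fact into its value-bearing leaf sub-facts."""
    if "children" not in fact:
        return [fact]
    return [leaf for child in fact["children"] for leaf in leaves(child)]
```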

  35. Model based extractions – considerations • We can see that we need to account for at least four kinds of ingredient sub-facts that we may encounter when extracting the recipe ingredient facts that ultimately make up a recipe • These facts speak to taxonomies of concepts in the model

  36. Our output facts!

  37. Linked (meta)data! • So, what people want to do with their content is becoming increasingly complex and sophisticated • And taxonomy-managed concepts can support those requirements by being used together, in concert, in sequence, to provide focussed disambiguation to extract facts • Once found, those concepts then provide consistent URIs across the corpus of content, providing semantic integration across the content • “Semantic”, as we can walk our graph of facts according to our model, which provides us with the schema to compose further, more interesting queries • For example, give me all recipes that use less than 2 cans of any preparation (whole, chopped, pureed, etc.) of canned tomatoes • That structure helps make connections between content, between facts, supporting analytics and further inferences!
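The canned-tomatoes example can be sketched as a query over extracted facts that have been unified by concept URI. Everything below (URIs, quantities, field names) is invented for illustration.

```python
# All URIs, quantities and field names below are invented.
CANNED_TOMATOES = "http://example.org/food/canned-tomatoes"

facts = [
    {"recipe": "recipe/1", "food": CANNED_TOMATOES, "cans": 1, "prep": "chopped"},
    {"recipe": "recipe/2", "food": CANNED_TOMATOES, "cans": 3, "prep": "whole"},
    {"recipe": "recipe/3", "food": "http://example.org/food/flour", "cans": 0, "prep": None},
]

def recipes_with_less_than(n_cans: int, food_uri: str) -> list:
    """Match by URI, not surface text: because every fact carries the
    same URI for the same concept, this works however the food was
    written in the original content."""
    return [f["recipe"] for f in facts
            if f["food"] == food_uri and 0 < f["cans"] < n_cans]
```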

  38. Taxonomist-managed complex fact extraction and generation

  39. Taking it further • We are now becoming involved in increasingly sophisticated fact-extraction projects, involving graph databases, analytics, etc. • We always want to do things at the model level – we want to make applying taxonomy as easy as we can! • So, we have developed a new model-driven solution to enable taxonomists to drive the fact extraction using a model and taxonomies • It also has special templates and config for Publisher – it is all OOTB (out of the box) • This means the taxonomists can control the structure of the facts and how they should be extracted • We are increasingly working with graph database vendors on such projects • Merging the throughput of the Semaphore applied taxonomy framework’s fact extraction and generation with enterprise-scale graph databases

  40. Modelling fact extraction (sequences) • So, what does our new applied taxonomy framework look like? • The FACTS framework defines a number of building blocks that can be used (by taxonomists) together to build fact extraction sequences • A sequence can be very simple or very complex – and anywhere in between! • The fundamental building blocks of any sequence are: anchors, skips and facts • Those building blocks are ordered into sequences that spell out the patterns we know hold facts in the content

  41. The FACTS building blocks… • An anchor is a concept that anchors the fact sequence in the documents • As a concept, it can have potentially several alternative labels • The semantics are that any one of the alternative labels needs to be found • A skip is a concept that ignores a configurable number of tokens (words / sentences / paragraphs), and can be: • greedy (eat up as many of the skip tokens as you can until you find the fact or anchor) • or non-greedy (eat up as few of the skip tokens as you can until you find the fact or anchor) • A fact is a concept, and can be simple or complex • You can define any grouping of facts and sub-facts but ….
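A toy token-level matcher can show the anchor / skip / fact semantics. It implements only the non-greedy skip (stop at the first match within the budget); a greedy skip would take the last match instead. None of this is the actual FACTS engine; the tokens and labels are invented.

```python
# Toy matcher for anchor / skip / fact semantics; only the non-greedy
# skip is implemented (a greedy skip would take the LAST match within
# the budget). This is not the actual FACTS engine.
def match_sequence(tokens, steps):
    """steps: ('anchor'|'fact', labels) or ('skip', max_tokens).
    Returns captured fact tokens, or None if the pattern fails."""
    i, budget, captured = 0, 0, []
    for kind, arg in steps:
        if kind == "skip":
            budget = arg          # allow up to `arg` inconsequential tokens
            continue
        found = None
        for j in range(i, min(i + budget + 1, len(tokens))):
            if tokens[j] in arg:  # any one of the labels matches
                found = j
                break
        if found is None:
            return None           # anchor/fact not found within the skip
        if kind == "fact":
            captured.append(tokens[found])
        i, budget = found + 1, 0
    return captured

TOKENS = "commencing on 2020-01-01 until expiry on 2021-01-01".split()
DATES = {"2020-01-01", "2021-01-01"}
SEQ = [("anchor", {"commencing"}), ("skip", 2), ("fact", DATES),
       ("skip", 2), ("anchor", {"expiry"}), ("skip", 2), ("fact", DATES)]
```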

  42. E.g. Extracting inception / expiry dates • Imagine we require the inception date and expiry date of a contract • For the content above, we can see that there is a sequence pattern that holds the facts that we are after:

  43. A naïve sequence • The pattern is a sequence something like this: • Look for words that mean there will be a date • Allow some inconsequential words to appear • Look for a date fact • Allow some inconsequential words to appear • Look for words that mean there will be another date • Allow some inconsequential words to appear • Look for a date • There may be other sequences that hold the content in other documents, but, this is probably a reasonable sequence that, if we make it sufficiently robust and universal, will work across a number of documents

  44. Using our building blocks • That simple linear sequence is something like this: • Look for evidence of a contract inception date anchor • Allow 50 inconsequential words to appear • Look for an inception date fact • Allow 20 inconsequential words to appear • Look for evidence of a contract expiry date anchor • Allow 20 inconsequential words to appear • Look for an expiry date fact
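One hedged way to mirror that linear sequence is a regex, shown below. The real FACTS framework is model-driven rather than regex-based; the date format, the anchor words and the skip budgets here are assumptions for illustration.

```python
# The real FACTS framework is model-driven, not regex-based; the date
# format, anchor words and skip budgets here are assumptions.
import re

DATE = r"\d{1,2} \w+ \d{4}"                 # e.g. "1 March 2020"
def skip(n):
    return rf"(?:\S+\s+){{0,{n}}}?"         # up to n non-greedy skip words

pattern = re.compile(
    rf"commenc\w*\s+{skip(50)}(?P<inception>{DATE})"
    rf"[\s\S]*?expir\w*\s+{skip(20)}(?P<expiry>{DATE})",
    re.IGNORECASE,
)

m = pattern.search(
    "This agreement shall commence on 1 March 2020 and shall "
    "expire at midnight on 28 February 2023."
)
```

Hand-writing patterns like this is exactly what the model-driven approach replaces: in FACTS the anchors are concepts with alternative labels, so variant wording is handled by the model, not by more regex.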

  45. The Semaphore FACTS model

  46. Semaphore output (and as RDF)

  47. There are many different fact structures / sequences available • Many different fact structures are possible, not only linear ones • You can have many levels of fact, with child / sub-facts to any depth • This allows several facts to be grouped under a common parent fact, to contextualise them • There are also many different kinds of basic extraction sequences in the model, again, not just linear • There are: • Linear sequences of facts (look for all of the facts in the sequence specified) • Intersection sequences of facts (look for several facts anywhere in between two anchors separated by some number of tokens) • Facts that share a common fact (look for a common fact and then look for several facts that refer to it) • Optional facts (look for any of the fact options)

  48. There are many different fact literals / values available • The available fact literals include: • Concepts from taxonomies, which use pref / alt labels to match on facts in content • That is, we are looking for a concept, and using its alt labels to also help us identify the concept, as it might appear in variant ways in the content • Wildcard labels, which regex-match on facts in content • e.g. £#.#, which would match £10.00 in content and return that fact • Entity types, which match on facts in content • Semaphore comes with around 30 entity types, including: • PERSON, DATE, URL, PERCENT, ORGANIZATION, ADDRESS, etc.
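Assuming "#" in a wildcard label stands for a run of digits (as the £#.# example suggests), translating such a label into a regex might look like this sketch:

```python
# Assumes "#" means a run of digits, as the £#.# example suggests.
import re

def wildcard_to_regex(label: str):
    """Translate a wildcard label into a compiled regex."""
    parts = (re.escape(p) for p in label.split("#"))
    return re.compile(r"\d+".join(parts))

money = wildcard_to_regex("£#.#")           # equivalent to £\d+\.\d+
```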
