
IBM Leadership in Search, Text Analysis and Classification




Presentation Transcript


  1. Content Analytics with Enterprise Search: Putting Your Content in Motion. Realize the value of content to transform your business.

  2. IBM Leadership in Search, Text Analysis and Classification • IBM has a 50+ year history in text analysis and discovery • As early as 1957, IBM published pioneering research on text classification (and related topics, such as text search and the automatic creation of text abstracts) • IBM invests ~$50M annually in research and development for search and text analytics • Over 200 people actively engaged in R&D • IBM holds over 200 patents in information access, with more each year

  3. Content Analytics: Going from raw information to rapid insight. Uncover business insight through a unique visual-based approach: • Aggregate and extract from multiple sources … to form large text-based collections from multiple internal and external sources (and types), including ECM repositories, structured data, social media and more • Organize, analyze and visualize … enterprise content (and data) by identifying trends, patterns, correlations, anomalies and business context from collections • Search and explore to derive insight … from collections to confirm what is suspected or uncover something new, before customizing models and integrating with other systems and processes

  4. IBM Content Analytics – A platform for rapid insight • Multiple views for visual analysis, exploration and investigation • 8 unique views of content, including subdocument views • Dynamically search and explore content for new business insight • Connections and Dashboard views to easily detect insights • Add your own custom views • Powerful solution modeling and support for advanced classification tools for more accurate and deeper insight • Enhanced analytics configuration tools • Deliver rapid insight to other systems, users and applications for a complete business view • Quickly generate Cognos BI reports, and link between Cognos reports and ICA views • Deliver analysis to IBM Case Manager solutions

  5. Content Analytics – A platform for rapid insight. Views include: Document Analysis, Facets, Dashboard, Time Series, Sentiment, Deviations / Trends, Connections, and Facet Pairs.

  6. IBM Content Analytics Approach (architecture overview). External and internal information sources feed IBM Content Analytics for exploration and analysis; interactive assessment and discovery of business insight; and delivery of insight to users, systems and processes, including industry solutions, business intelligence, predictive systems, ECM and advanced case management. Solution and modeling tools: IBM Content Analytics Studio and IBM Content Classification.

  7. Text Analytics is the basis for Content Analytics. Example: "Not only was the pick-up line at the counter very long, but I waited 30 minutes just to talk to a rude representative who gave me a car that smelled like smoke, had stained floor mats, a dented fender, and only half a tank of gas." What is Text Analytics? Text Analytics (NLP*) describes a set of linguistic, statistical, and machine learning techniques that allow text to be analyzed and key information extracted for business integration. What is Content Analytics? Content Analytics (Text Analytics + Mining) refers to the text analytics process plus the ability to visually identify and explore trends, patterns, and statistically relevant facts found in various types of content spread across internal and external content sources. * Natural Language Processing
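The facet-extraction idea behind the rental-car complaint above can be sketched in a few lines. This is a toy illustration, not the ICA implementation: real pipelines use linguistic analysis and machine learning, while the facet names and patterns here are hand-written for demonstration only.

```python
import re

# Toy "issue" facets extracted from the complaint with bare regexes.
COMPLAINT = ("Not only was the pick-up line at the counter very long, but "
             "I waited 30 minutes just to talk to a rude representative who "
             "gave me a car that smelled like smoke, had stained floor mats, "
             "a dented fender, and only half a tank of gas")

ISSUE_PATTERNS = {
    "wait_time": r"waited (\d+) minutes",
    "staff": r"rude representative",
    "vehicle_condition": r"smelled like smoke|stained floor mats|dented fender",
    "fuel": r"half a tank",
}

def extract_issues(text):
    """Return a facet -> matches mapping, like a simple facet view."""
    facets = {}
    for facet, pattern in ISSUE_PATTERNS.items():
        hits = re.findall(pattern, text)
        if hits:
            facets[facet] = hits
    return facets
```

Running `extract_issues(COMPLAINT)` surfaces the wait time, the staff complaint, three vehicle-condition issues, and the fuel issue as separate facets, which is the kind of structured output a mining view aggregates across many documents.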

  8. IBM Content Analytics – How it works. Content enters the system through the Content Analytics crawlers, the content push API, or the real-time NLP REST API, and flows through the Content Analytics UIMA pipeline and its annotators, turning source information into analyzed content (and data). Source information can be corporate (contact center, test data, dealer notes, ECM, etc.) or external (NHTSA, Edmunds, Consumer Reports, MotorTrend, etc.); analyzed output can feed an RDB or IBM Master Data Mgmt. Example extraction from "Owner reports check engine lite flashes after refueling": component issue "Engine Light" (noun phrase), situation "Refueling" (prepositional phrase), plus extracted concepts such as person, issue, driver action and warning. Administrators have fine-grained control over the entities and facets that are created.

  9. Wikipedia: the UIMA standard. UIMA stands for Unstructured Information Management Architecture. An OASIS standard as of March 2009, UIMA is to date the only industry standard for content analytics. UIMA is a component software architecture, originally developed by IBM, for the development, discovery, composition, and deployment of multi-modal analytics for the analysis of unstructured information and its integration with search technologies. The source code for a reference implementation of this framework was first made available on SourceForge, and later on the website of the Apache Software Foundation. An example is a logistics analysis software system that converts unstructured data such as repair logs and service notes into relational tables; these tables can then be used by automated tools to detect maintenance or manufacturing problems. Other examples are systems used in medical environments to analyze clinical notes. http://en.wikipedia.org/wiki/UIMA

  10. UIMA (Unstructured Information Management Architecture) • Analysis engines are interchangeable and reusable • Analysis engines pass artifacts as annotations via a common data store, the CAS (Common Analysis Structure) • The Watson Jeopardy! challenge used UIMA (UIMA-AS) to beat the show's human champions. Example pipeline on the sentence "Porsche was stolen at 11:30 a.m. in Queens": a part-of-speech annotator and a morphological annotator tag tokens (noun, verb, preposition, numeral); a named entity annotator produces City (cityName=New York, cityDistrict=Queens) and Time (timeOfDay=noon) annotations; an event annotator produces a Crime (name=theft) annotation.
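The CAS-and-annotators pattern can be sketched as follows. These classes and annotators are invented stand-ins for illustration, not the UIMA SDK: the point is that interchangeable annotators share one analysis structure, each adding typed annotations over the same source text.

```python
# Minimal sketch of the UIMA idea: a shared analysis structure ("CAS")
# that successive annotators enrich with typed annotations.
class CAS:
    def __init__(self, text):
        self.text = text
        self.annotations = []  # (type, begin, end)

    def add(self, atype, begin, end):
        self.annotations.append((atype, begin, end))

def token_annotator(cas):
    # Naive stand-in for tokenization / part-of-speech tagging.
    pos = 0
    for token in cas.text.split():
        begin = cas.text.index(token, pos)
        cas.add("Token", begin, begin + len(token))
        pos = begin + len(token)

def named_entity_annotator(cas):
    # Naive gazetteer lookup, like the slide's city example.
    for name, atype in [("Queens", "City"), ("Porsche", "Product")]:
        idx = cas.text.find(name)
        if idx >= 0:
            cas.add(atype, idx, idx + len(name))

cas = CAS("Porsche was stolen at 11:30 a.m. in Queens")
for annotator in (token_annotator, named_entity_annotator):
    annotator(cas)
```

Because each annotator only reads the text and appends annotations, the pipeline order and membership can change without rewriting the components, which is the interchangeability the slide emphasizes.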

  11. UIMA Compliant Analysis Engines in ICA • ICA provides a number of annotators for advanced text analysis • Language Identification • Linguistic Analysis • Dictionary Lookup • Pattern Matcher • Named Entity Recognition • Document Classification • Custom text analysis can be added as a UIMA annotator • e.g. an annotator that recognizes product numbers and adds additional information such as product name, release date or price. The standard pipeline stack on the UIMA framework: language identification, tokenization, word analytics, named entity recognition, multi-word analytics and classification, with custom analytics layered on top.
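The product-number annotator mentioned above can be sketched like this. The number format, catalog entries, and feature names are all invented for illustration; a real annotator would be packaged as a UIMA component and consult an actual product catalog.

```python
import re

# Hypothetical catalog keyed by an invented product-number format.
CATALOG = {
    "1234-A01": {"name": "Example Widget", "release": "2011", "price": "99"},
}

def product_annotator(text):
    """Yield ((begin, end), features) for each product number found,
    enriching the match with catalog information when available."""
    for m in re.finditer(r"\b\d{4}-[A-Z]\d{2}\b", text):
        yield (m.start(), m.end()), CATALOG.get(m.group(), {})

hits = list(product_annotator("Please ship product 1234-A01 by Friday."))
```

The enrichment step (looking the match up in `CATALOG`) is what turns a plain pattern hit into a facet with business meaning, which is the value the slide attributes to custom annotators.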

  12. https://www.ibm.com/developerworks/mydeveloperworks/blogs/36db6433-2f12-4533-9c68-489067780bfd/entry/overview3?lang=en

  13. Text Analytics Catalog

  14. Content Analytics Studio. Content Analytics Studio is an integrated development environment (IDE) for creating your own custom analysis engine. With Content Analytics Studio, you can: • Create language- and domain-specific dictionaries • Write rules to match character patterns • Write rules to identify patterns of tokens and other annotations • Create UIMA annotators based on these dictionaries and rules • Annotate text documents and view the details of annotations • Annotate collections of documents … all without needing to write code or understand the underlying technology.

  15. Content Analytics Studio View Project Resources

  16. Content Analytics Studio Sample text for building a model

  17. Content Analytics Studio UIMA Pipeline components

  18. Iterative Process • Content analytics is an iterative process of: • Build: process documents • Validate: verify resource updates • Analyze: perform analytics to find new insights • Modify: update resources with new insights

  19. Architecture

  20. ICA V3.0 System Architecture (diagram). Major subsystems: • Crawler Framework – Web, Seed List, NNTP, Notes, QuickPlace, Domino Doc Mgt, Exchange Server, FileNet P8, Case Mgr, DB2, DB2 Content Mgr, JDBC DB, Windows/Unix file system, Agent for File System, Content Integrator, WebSphere Portal, Web Content Mgr, SharePoint and custom crawlers, plus crawler plug-ins, the importer framework and a CSV importer • Document Processors – parser, UIMA annotators, document categorizer, document clustering, web link analysis, term of interest, thumbnail generation, document generator • Indexer Service – search index, taxonomy index, thumbnail index, facet count sub-index, document cache, raw data store • Text Analytics & Search Runtime – Contents Miner UI, Enterprise Search UI, SIAPI / REST / real-time NLP applications, Cognos BI integration, exporter with export plug-ins (RDB, XML, CSV) • Global Processing and Common Infrastructure – admin UI, configuration, control, monitor, security, logging, scheduler, inspector. External endpoints include Content Analytics Studio and Cognos BI.

  21. UIMA Compliant Analysis Engines in ICA • Resolve many ambiguities in text • Recognize domain-specific terms / expressions • Deal with the grammatical characteristics of each language (e.g. English, Chinese, Japanese, French, German, …). Pipeline stages: Language Identification; Lexical Analysis (paragraph/sentence segmentation, tokenization, character normalization, lemmatization, part-of-speech tagging); Phrasal Analysis / Shallow Parsing (named entity extraction, phrase recognition, sentiment analysis, etc.); Deep Parsing and JJSA (Japanese only); Document Classification; and Custom Analysis added as UIMA PEAR custom analysis engines. Example (English): in "According to finance report IBM Corp.'s EPS increased by 10.1%", tokens are tagged (prepositions, singular nouns, proper noun, possessive, past-tense verb, numeral, adjective), "IBM Corp." is recognized as a corporation, and "increased by 10.1%" yields a positive sentiment (finance – increase).

  22. Supported Languages

  23. Supported Data. Data Source • Web: HTTP/HTTPS (RSS, Atom), news groups (NNTP), WebSphere Portal, Web Content Management • Relational Database: DB2 family (DB2 UDB, Informix, DB2 for iSeries, DB2 for z/OS), Oracle, MS SQL Server, Sybase, VSAM, IMS, CA-Datacom, Software AG Adabas • Collaboration System: Lotus Notes/Domino databases, QuickPlace, Domino.Doc, Lotus Quickr, Lotus Connections, MS Exchange • Content Management System: IBM Case Manager, DB2 Content Manager, Documentum, FileNet CS, FileNet P8, Hummingbird, LiveLink Open Text, Portal Document Manager (PDM), Microsoft SharePoint • File System: Unix File System, Windows File System. Data Format • Plain Text • HTML • XML • Office documents: Adobe Portable Document Format (PDF), MS Rich Text Format (RTF), MS Word, MS Excel, MS PowerPoint, Lotus Word Pro, Lotus 1-2-3, Lotus Freelance, Ichitaro. More than 300 formats can be supported by changing the configuration file.

  24. ICA Web Application Security • When running on WebSphere Application Server (WAS), global security needs to be configured to enable the login settings

  25. Document Level Security by Security Token • Security tokens can be assigned at crawl time by: adding a fixed value as the security token; assigning the security token based on field values (only some crawlers); or attaching the token programmatically using a custom crawler plug-in • The search application must be customized to pass the tokens that the current user holds • The search engine returns documents only if the given tokens match the indexed security tokens on each document. Flow: 1. Security tokens are assigned to documents by a crawler plug-in (or extracted from the native data source) before parsing and indexing. 2. At search time, the user is authenticated and credentials are retrieved. 3. The search runtime filters results by matching security tokens against the user's credentials.
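The token-matching step in the flow above can be sketched as a simple intersection test. The document data, token format, and the rule that token-less documents are public are all illustrative assumptions, not the ICA behavior.

```python
# Each indexed document carries security tokens; at query time a result
# is kept only when the user's credentials include a matching token.
INDEX = [
    {"id": "doc1", "tokens": {"group:finance"}},
    {"id": "doc2", "tokens": {"group:hr", "user:alice"}},
    {"id": "doc3", "tokens": set()},  # assumed public: no token required
]

def filter_results(results, user_tokens):
    """Keep documents whose tokens intersect the user's, or need none."""
    return [d for d in results
            if not d["tokens"] or d["tokens"] & user_tokens]

visible = filter_results(INDEX, {"user:alice"})
```

Here a user holding only `user:alice` sees doc2 (direct match) and doc3 (no token required) but not doc1, mirroring the results-filtering step in the slide's flow.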

  26. APIs • Custom search and admin applications can be implemented with the REST API • Language independent • Provides all required functions for creating a search UI • Search navigation • Facet navigation • Search functions • Faceted search • Fetch content, thumbnails and document previews • List spelling corrections, synonym expansions and type-ahead suggestions • And more… • Provides the required functions for administering search • Managing collections • Controlling and monitoring components • Adding documents to a collection
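A faceted-search call against a REST API like the one described above might be constructed as follows. The endpoint path and parameter names here are illustrative assumptions; consult the ICA REST API reference for the actual contract.

```python
from urllib.parse import urlencode

def build_search_url(host, collection, query, facet=None, start=0, rows=10):
    # Assemble a hypothetical faceted-search request; parameter names
    # ("collection", "query", "facet", ...) are assumptions for the sketch.
    params = {"collection": collection, "query": query,
              "start": start, "results": rows}
    if facet:
        params["facet"] = facet
    return "http://%s/api/v10/search?%s" % (host, urlencode(params))

url = build_search_url("ica.example.com", "claims", "engine light",
                       facet="keyword.issue")
```

Because the API is plain HTTP, any language that can issue a GET request and parse the response can build a search UI on top of it, which is what "language independent" means on the slide.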

  27. Extension Point List

  28. IBM Content Analytics: Analysis Export Capability. Three export points: 1. Crawled Document Export – export documents with their metadata and content as they were crawled, from the crawler's data store. 2. Analyzed Document Export – export documents with the results of text analytics (natural language processing, named entity extraction, classification, or user-implemented logic) before indexing. 3. Searched Document Export – export documents limited by search or analysis, with original content, from the search index. Exports flow through export plug-ins to content intelligence consumers such as an RDB, IBM Master Data Mgmt, InfoSphere and ECM solutions; an import plug-in brings content back into Content Analytics.
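The "Analyzed Document Export" idea described above can be sketched as a small CSV exporter: documents plus their extracted facets are written out before indexing. The document shape and field names are invented for illustration.

```python
import csv
import io

# Hypothetical analyzed documents: raw text plus an extracted "issue" facet.
DOCS = [
    {"id": "d1", "text": "check engine light flashes", "issue": "Engine Light"},
    {"id": "d2", "text": "stalls after refueling", "issue": "Refueling"},
]

def export_csv(docs):
    """Serialize analyzed documents to CSV for a downstream consumer."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["id", "text", "issue"])
    writer.writeheader()
    writer.writerows(docs)
    return buf.getvalue()
```

A real export plug-in would target the consumer's schema (an RDB table, an MDM feed), but the shape is the same: analytics results serialized into a structured format other systems can load.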

  29. Basic Analytics and Search Concepts • Structured Content – data that has unambiguous values and is easily processed by a computer program • Unstructured Content – information that is generally recorded in a natural language as free text • Text Analytics – a form of natural language processing that includes linguistic, statistical, and machine learning techniques for analyzing text and extracting key information • Collection – a set of data sources and options for crawling, parsing, indexing, and searching those data sources • Analytics Collection – a collection that is set up to be used for content mining • Search Collection – a collection that is set up to be used for a search application • Crawler – a software program that retrieves documents from data sources and gathers information that can be used to create search indexes • Annotator – a software component that performs specific linguistic analysis tasks and produces and records annotations • Parser – a program that interprets documents that are added to the data store; the parser extracts information from the documents and prepares them for indexing, search, and retrieval

  30. Q&A
