240 likes | 412 Vues
Text Analytics World Future Directions of Text Analytics. Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com. Agenda. Introduction: Current State of Text Analytics Survey Roadblocks for Text Analytics
E N D
Text Analytics World Future Directions of Text Analytics Tom ReamyChief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com
Agenda • Introduction: • Current State of Text Analytics • Survey • Roadblocks for Text Analytics • Complexity and Customization • Fast and Slow (Thinking) Text Analytics • Building Text Analytics Brains • New Methods for Text Analytics • Lessons from Watson • Some Wild New Ideas and Approaches • Questions
Introduction: KAPS Group • Knowledge Architecture Professional Services – Network of Consultants • Applied Theory – Faceted taxonomies, complexity theory, natural categories, emotion taxonomies • Services: • Strategy – IM & KM - Text Analytics, Social Media, Integration • Taxonomy/Text Analytics development, consulting, customization • Text Analytics Quick Start – Audit, Evaluation, Pilot • Social Media: Text based applications – design & development • Partners – SAS, Smart Logic, Expert Systems, SAP, IBM, FAST, Concept Searching, Attensity, Clarabridge, Lexalytics • Projects – Portals, taxonomy, Text analytics – news, expertise location, information strategy, text analytics evaluation, Quick Start in Text A. • Clients: Genentech, Novartis, Northwestern Mutual Life, Financial Times, Hyatt, Home Depot, Harvard Business Library, British Parliament, Battelle, Amdocs, FDA, GAO, World Bank, etc. • Presentations, Articles, White Papers – www.kapsgroup.com
Introduction:What is Text Analytics? • Text Mining – NLP, statistical, predictive, machine learning • Semantic Technology – ontology, fact extraction • Extraction – entities – known and unknown, concepts, events • Catalogs with variants, rule based • Sentiment Analysis • Objects and phrases – statistics & rules – Positive and Negative • Auto-categorization • Training sets, Terms, Semantic Networks • Rules: Boolean - AND, OR, NOT • Advanced – DIST(#), ORDDIST#, PARAGRAPH, SENTENCE • Disambiguation - Identification of objects, events, context • Build rules based, not simply Bag of Individual Words
Text Analytics WorldCurrent State of Text Analytics • History – academic research, focus on NLP • Inxight –out of ZeroxParc • Moved TA from academic and NLP to auto-categorization, entity extraction, and Search-Meta Data • Explosion of companies – many based on Inxight extraction with some analytical-visualization front ends • Half from 2008 are gone - Lucky ones got bought • Early applications – News aggregation and Enterprise Search – • Second Wave = shift to sentiment analysis • Enterprise search down, taxonomy up –need for metadata – not great results from either – 10 years of effort for what? • Text Analytics is growing – But
Text Analytics WorldCurrent State of Text Analytics • Current Market: 2012 – exceed $1 Bil for text analytics (10% of total Analytics) • Growing 20% a year • Search is 33% of total market • Other major areas: • Sentiment and Social Media Analysis, Customer Intelligence • Business Intelligence, Range of text based applications • Fragmented market place – full platform, low level, specialty • Embedded in content management, search, No clear leader.
Text Analytics WorldCurrent State of Text Analytics: Vendor Space • Taxonomy Management – SchemaLogic, Pool Party • From Taxonomy to Text Analytics • Data Harmony, Multi-Tes • Extraction and Analytics • Linguamatics (Pharma), Temis, whole range of companies • Business Intelligence – Clear Forest, Inxight • Sentiment Analysis – Attensity, Lexalytics, Clarabridge • Open Source – GATE • Stand alone text analytics platforms – IBM, SAS, SAP, Smart Logic, Expert System, Basis, Open Text, Megaputer, Temis, Concept Searching • Embedded in Content Management, Search • Autonomy, FAST, Endeca, Exalead, etc.
Future Directions: Survey Results • 28% just getting started, 11% not yet • What factors are holding back adoption of TA? • Lack of clarity about value of TA – 23.4% • Lack of knowledge about TA – 17.0% • Lack of senior management buy-in - 8.5% • Don’t believe TA has enough business value -6.4% • Other factors • Financial Constraints – 14.9% • Other priorities more important – 12.8% • Lack of articulated strategic vision – by vendors, consultants, advocates, etc.
Text Analytics WorldPrimary Obstacle: Complexity • Usability of software is one element • More important is difficulty of models: • Conceptual and document models • General need – more structure but also more flexible kinds of structure and interactions • More modules and more ways of combining or interacting – IBM – select best answer but others • Competitive – learn and evolve – Feedback! • Cooperative – join together to form higher level structures
Text Analytics WorldPrimary Obstacle: Complexity: Partial Solutions • Build complex semantic networks – basic concepts – good for demo, gets a start, but very complex to build on • Library of taxonomies – but all need major customization and often are not a good starting point – different types of taxonomies – index vs. categorization • Customization – Text Analytics– heavily context dependent • Content, Questions, Taxonomy-Ontology • Level of specificity – Telecommunications • Specialized vocabularies, acronyms • Specialized relationships – conceptual and organizational • How overcome?
Text Analytics World Thinking Fast and Slow – Daniel Kahneman • System 1 and System 2 – Daniel Kahneman • System 1 – fast and automatic – little conscious control • Represents categories as prototypes – stereotypes • Norms for immediate detection of anomalies – distinguish the surprising from the normal • fast detection of simple differences, detect hostility in a voice, find best chess move (if a master) • Priming / Anchoring – susceptible to systemic errors • Temperature Example • Biased to believe and confirm • Focuses on existing evidence (ignores missing – WYSIATI) • .
Text Analytics World Thinking Fast and Slow • System 2 – Complex, effortful judgments and calculations • System 2 is the only one that can follow rules, compare objects on several attributes, and make deliberate choices • Understand complex sentences • Check the validity of a complex logical argument • Focus attention – can make people blind to all else – Invisible Gorilla • Similar to traditional dichotomies – Tacit – Explicit, etc • Basic Design – System 1 is basic to most experiences, and System 2 takes over when things get difficult – conscious control • Text Analysis and Text Mining / Auto-Cat and TA Cat
Text Analytics WorldSystem 1 & 2 – and Text Analytics Approaches • “Automatic Categorization” – System 1 prototypes • Limited value -- only works in simple environments • Shallow categories with large differences • Not open to conscious control • System 2 – categories – complex, minute differences, deep categories • Together: • Choose one or other for some contexts • Combine both – need to develop new kinds of categories and/or new ways to combine?
Text Analytics World Text Mining and Text Analytics • Text Analytics and Big Data enrich each other • Data tells you what people did, TA tells you why • Text Analytics – pre-processing for TM • Discover additional structure in unstructured text • Behavior Prediction – adding depth in individual documents • New variables for Predictive Analytics, Social Media Analytics • New dimensions – 90% of information, 50% using Twitter analysis • Text Mining for TA– Semi-automated taxonomy development • Apply data methods, predictive analytics to unstructured text • New Models – Watson ensemble methods, reasoning apps • Extraction – smarter extraction – sections of documents, Boolean, advanced rules – drug names, adverse events – major mention
Text Analytics WorldIntegration of Text and Data Analytics • Expertise Location: Case Study: Data and Text • Data Sources: • HR Information: Geography, Title-Grade, years of experience, education, projects worked on, hours logged, etc. • Text Sources: • Document authored (major and minor authors) – data and/or text • Documents associated (teams, themes) – categorized to a taxonomy • Experience description – extract concepts, entities • Self-reported expertise – requires normalization, quality control • Complex judgments: • Faceted application • Ensemble methods – combine evaluations
Text Analytics World : Building on the PlatformExpertise Analysis • Expertise Characterization for individuals, communities, documents, and sets of documents • Experts prefer lower, subordinate levels • Novice & General – high and basic level • Experts language structure is different • Focus on procedures over content • Applications: • Business & Customer intelligence – add expertise to sentiment • Deeper research into communities, customers • Expertise location- Generate automatic expertise characterization based on documents
Text Analytics WorldNew Approaches – Applied Watson • Key concept is that multiple approaches are required – and a way to combine them – confidence score • Aim = 85% accuracy of 50% of questions (Ken Jennings – 92% of 62% • Used a combination of structure and text search • Massive parallelism, many experts, pervasive confidence estimation, integration of shallow and deep knowledge • Key step – fast filtering to get to top 100 (System 1) • Then – intense analysis to evaluate (System 2) – multiple scoring
Text Analytics WorldNew Approaches – Applied Watson • Multiple sources – taxonomies, ontologies, etc. • Special modules – temporal and spatial reasoning – anomalies • Taxonomic, Geospatial, Temporal, Source Reliability, Gender, Name Consistency, Relational, Passage Support, Theory Consistency, etc. • Merge answer scores before ranking • 3 Years, 20 researchers of all types • Got to 70% of 70% - in two hours • More difficult answers / more complete questions
Text Analytics WorldNew Approaches: Adding Structure to Content • Contexts – whole range of types of context • Document types-purpose, Textual complexity, formats • Categorization by page, sections (text markers) or even sentence or phrase – Key – remember what the last page was • [Key– documents are not unstructured – they have a variety of structures] • Use generic components – like the level of generality of terms or concepts (general and context specific)
Text Analytics WorldNew Approaches • Idea – build a higher level language – like tutoring systems • More complex primitives • IDEA – Crowd sourcing – to evolve better structures – how design to avoid design by committee – other side of wisdom of crowds • Design TA Game – 1,000’s to play and evolve • Partner with MOOC - example – better essay evaluation – avoid gaming the system – lots of multi-syllabic words – nonsense • Also to enhance software / modules
New Directions in Text AnalyticsConclusions • Text Analytics is growing – but • Big obstacles remain • Strategic Vision of text analytics in the enterprise, applications • Concrete and quick application to drive acceptance • Software still too complex, un-integrated • New models are being developed Cognitive science – System 1 and 2, AI – brains that learn Watson like integrated approaches • Overcome complexity – modules (System 1/ Standard) with new ways of integrating (System 2 / Customized) – smarter and easier
Questions? Tom Reamytomr@kapsgroup.com KAPS Group http://www.kapsgroup.com Upcoming: Taxonomy Boot Camp – KMWorld -DC, Nov 3-6 Workshop on Text Analytics Text Analytics World – San Francisco, March 17-19
Future Directions for Text AnalyticsSocial Media: Beyond Simple Sentiment • Analysis of Conversations- Higher level context • Techniques: self-revelation, humor, sharing of secrets, establishment of informal agreements, private language • Detect relationships among speakers and changes over time • Strength of social ties, informal hierarchies • Combination with other techniques • Expertise Analysis – plus Influencers • Quality of communication (strength of social ties, extent of private language, amount and nature of epistemic emotions – confusion+) • Experiments - Pronoun Analysis – personality types • Analysis of phrases, multiple contexts – conditionals, oblique
Introduction: Personal • Deep Background: History of Ideas – dissertation – Models of Historical Knowledge • Artificial Intelligence research at Stanford AI Lab • Programming – designed two computer games, educational software • Started an Education Software company, CTO • Height of California recession • Information Architect – Chiron/Novartis, Schwab Intranet • Importance of metadata, taxonomy, search – Verity • From technology to semantics, usability • From library science to cognitive science • 2002 – started consulting company