Text Analytics Workshop Evaluation of Software. Tom Reamy Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com. Agenda. Features, Varieties, Vendors Enterprise Context Start with Self-Knowledge Text Analytics Team Evaluation Process
 
                
Text Analytics Workshop – Evaluation of Software Tom Reamy, Chief Knowledge Architect KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com
Agenda • Features, Varieties, Vendors • Enterprise Context • Start with Self-Knowledge • Text Analytics Team • Evaluation Process • Features and Capabilities – Filter • Proof of Concept / Pilot
Text Analytics Software – Features • Entity Extraction • Multiple types, custom classes – entities, concepts, events • Auto-categorization – Taxonomy Structure • Training sets – Bayesian, Vector space • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Boolean – Full search syntax – AND, OR, NOT • Advanced – NEAR (#), PARAGRAPH, SENTENCE • Advanced Features • Facts / ontologies / Semantic Web – RDF + • Sentiment Analysis
Varieties of Taxonomy/Text Analytics Software • Taxonomy Management • Synaptica, SchemaLogic • Full Platform • SAP-Inxight, ClearForest, SAS-Teragram, Data Harmony, Concept Searching, IBM • Content Management • Nstein, Interwoven, Documentum, etc. • Embedded – Search • FAST, Autonomy, Endeca, Exalead, etc. • Specialty • Sentiment Analysis – Lexalytics
Vendors of Taxonomy/Text Analytics Software • Attensity • Business Objects – Inxight • Clarabridge • ClearForest • Data Harmony / Access Innovations • GATE (Open Source) • IBM Content Analyst • Lexalytics • MultiTes • Nstein • SAS-Teragram • SchemaLogic • Smart Logic • Synaptica • Wikionomy • Wordmap • Lots More
Evaluating Taxonomy/Text Analytics Software – Start with Self Knowledge • Strategic and Business Context • Info Problems – what, how severe • Strategic Questions – why, what value from the taxonomy/text analytics, how are you going to use it • Formal Process – KA audit – content, users, technology, business and information behaviors, applications – or informal for smaller organizations • Text Analytics Strategy/Model – forms, technology, people • Existing taxonomic resources, software • Need this foundation to evaluate and to develop
Evaluating Taxonomy/Text Analytics Software – Start with Self Knowledge • Do you need it – and if so, what blend? • Taxonomy Management Only • Multiple taxonomies, languages, authors-editors • Technology Environment – ECM, Enterprise Search – where is it embedded • Publishing Process – where and how is metadata being added – now and projected future • Can it utilize auto-categorization, entity extraction, summarization • Is the current search adequate – can it utilize text analytics? • Applications – text mining, BI, CI, Alerts?
Design of the Text Analytics Selection Team • Traditional Candidates - IT • Experience with large software purchases • Search/Categorization is unlike other software • Experience with needs assessments • Need more – know what questions to ask, knowledge audit • Objective criteria • Looking where there is light? • Asking IT to select taxonomy software is like asking a construction company to select the design of your house. • They have the budget • OK, they can play.
Design of the Text Analytics Selection Team • Traditional Candidates - Business Owners • Understand the business • But don’t understand information behavior • Focus on business value, not technology • Focus on semantics is needed • They can get executive sponsorship, support, and budget. • OK, they can play
Design of the Text Analytics Selection Team • Traditional Candidates - Library • Understand information structure • But not how it is used in the business • Experts in search experience and categorization • Suitable for experts, not regular users • Experience with variety of search engines, taxonomy software, integration issues • OK, they can play
Design of the Text Analytics Selection Team • Interdisciplinary Team, headed by Information Professionals • Relative Contributions • IT – Set necessary conditions, support tests • Business – provide input into requirements, support project • Library – provide input into requirements, add understanding of search semantics and functionality • Much more likely to make a good decision • Create the foundation for implementation
Evaluating Text Analytics Software – Process • Start with Self Knowledge • Eliminate the unfit • Filter One – Ask Experts – reputation, research – Gartner, etc. • Market strength of vendor, platforms, etc. • Feature scorecard – minimum, must have, filter to top 3 • Filter Two – Technology Filter – match to your overall scope and capabilities – a filter, not a focus • Filter Three – Focus Group – one-day visit – 3-4 vendors • Deep pilot (2) / POC – advanced, integration, semantics • Focus on working relationship with vendor.
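The feature-scorecard step (must-have features, then filter to the top 3) can be sketched in a few lines. This is only an illustration under stated assumptions: the vendor labels, feature names, and the 0-10 `reputation` score are hypothetical placeholders, not data from the workshop. Features act as a pass/fail filter here, and only the surviving candidates are ranked.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str            # placeholder vendor label (hypothetical)
    features: frozenset  # features the product offers
    reputation: float    # hypothetical 0-10 score from expert research


def shortlist(candidates, must_have, top_n=3):
    """Drop candidates missing any must-have feature, then keep the top_n by reputation."""
    fit = [c for c in candidates if must_have <= c.features]
    return sorted(fit, key=lambda c: c.reputation, reverse=True)[:top_n]


# Hypothetical must-have list; yours comes out of the self-knowledge step.
MUST_HAVE = frozenset({"auto-categorization", "entity extraction", "api"})

candidates = [
    Candidate("Vendor A", frozenset({"auto-categorization", "entity extraction", "api"}), 7.5),
    Candidate("Vendor B", frozenset({"auto-categorization", "api"}), 9.0),
    Candidate("Vendor C", frozenset({"auto-categorization", "entity extraction", "api", "sentiment"}), 8.0),
    Candidate("Vendor D", frozenset({"auto-categorization", "entity extraction", "api"}), 6.0),
    Candidate("Vendor E", frozenset({"auto-categorization", "entity extraction", "api", "skos"}), 8.5),
]

print([c.name for c in shortlist(candidates, MUST_HAVE)])
# prints ['Vendor E', 'Vendor C', 'Vendor A']
```

Vendor B has the highest reputation but is eliminated because it lacks a must-have feature, which is the point of using the checklist as a filter instead of a score.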
Evaluating Text Analytics Software – Feature Checklist and Score: Basic Features, Admin • New, copy, rename, delete, merge • Branches not just nodes • Scope Notes • Spell check • Search – all parts and selected (only taxonomy nodes) • Names and Identifiers for terms and nodes • Check for duplicates • Versioning, multiple authors • Analytical reports – structure, application to documents
Evaluating Text Analytics Software – Feature Checklist and Score: Usability • Ease of use – copy, paste, rename, merge, etc. • User Documentation, user manuals, on-line help, training and tutorials • Visualization • File structure, tree, hierarchy and alphabetical • Automatic Taxonomy/Node & Rule Generation • Nonsense for Taxonomy • Node – suggestions for sub-categories, rules • Variety of node relationships – child-parent, related
Evaluating Text Analytics Software – Feature Checklist and Score: Additional Features • Language support – international – if you have need for it • Scalability – Size of taxonomy rarely important • More important for auto-categorization • Import-Export – XML and SKOS • Support standards – NISO, etc., Mapping between taxonomies • API / SDK • Security, Access Rights, Roles • Advanced Features – future growth • Facts / ontologies / Semantic Web – RDF + • Sentiment Analysis
Evaluating Text Analytics Software – Advanced Features – Text Analytics as Platform • Entity Extraction • Multiple types, custom classes • Summarization • Customizable rules, map to different content • Auto-categorization • Training sets • Terms – literal strings, stemming, dictionary of related terms • Rules – simple – position in text (Title, body, url) • Advanced – saved search queries (full search syntax) • NEAR, SENTENCE, PARAGRAPH • Boolean – X NEAR Y and Not-Z
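To make the "X NEAR Y and Not-Z" rule concrete, here is a minimal sketch of how such a categorization rule might be evaluated. It is not any vendor's rule engine: the tokenizer, the function names, and the word-distance interpretation of NEAR are all assumptions, and real products differ in how they count distance and handle stemming.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (an assumed tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def near(tokens, x, y, distance):
    """True if some occurrence of x is within `distance` token positions of some y."""
    xs = [i for i, t in enumerate(tokens) if t == x]
    ys = [i for i, t in enumerate(tokens) if t == y]
    return any(abs(i - j) <= distance for i in xs for j in ys)


def rule_matches(text, x, y, not_z, distance=5):
    """Evaluate the rule: x NEAR y AND NOT not_z."""
    tokens = tokenize(text)
    return near(tokens, x, y, distance) and not_z not in tokens


doc = "The merger agreement was signed after the acquisition talks"
print(rule_matches(doc, "merger", "acquisition", "rumor", distance=6))  # prints True
print(rule_matches(doc, "merger", "acquisition", "rumor", distance=5))  # prints False
```

In the example, "merger" and "acquisition" sit six token positions apart, so the rule fires at distance 6 but not at distance 5. Tuning that window against your own content is the kind of work the POC rounds are for.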
Evaluating Taxonomy Software – POC • Quality of results is the essential factor • 6 weeks POC – bake off / or short pilot • Real life scenarios, categorization with your content • Preparation: • Preliminary analysis of content and users' information needs • Set up software in lab – relatively easy • Train taxonomist(s) on software(s) • Develop taxonomy if none available • Six week POC – 3 rounds of development, test, refine / Not OOB • Need SMEs as test evaluators – also to do an initial categorization of content
Evaluating Taxonomy Software – POC • Majority of time is on auto-categorization • Need to balance uniformity of results with vendor unique capabilities – have to determine at POC time • Risks – getting software installed and working, getting the right content, initial categorization of content • Elements: • Content • Search terms / search scenarios • Training sets • Test sets of content • Taxonomy Developers – expert consultants plus internal taxonomists
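One way to put a number on "quality of results" in each POC round is to compare the software's auto-categorization of the test set with the SMEs' initial categorization. The slides do not prescribe a metric; per-category precision and recall are used here as an assumed, commonly used choice, and the document IDs and category names are hypothetical.

```python
def precision_recall(predicted, expected, category):
    """Compare auto-categorization with SME categorization for one category.

    predicted, expected: dict mapping document id -> set of category names.
    Returns (precision, recall); each is 0.0 when its denominator is zero.
    """
    tp = fp = fn = 0
    for doc_id, gold in expected.items():
        auto = predicted.get(doc_id, set())
        if category in auto and category in gold:
            tp += 1  # software and SME agree the category applies
        elif category in auto:
            fp += 1  # software assigned it, SME did not
        elif category in gold:
            fn += 1  # SME assigned it, software missed it
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical test set: SME categorization vs. one vendor's output.
sme = {"d1": {"HR"}, "d2": {"HR", "Legal"}, "d3": {"Legal"}, "d4": {"HR"}}
auto = {"d1": {"HR"}, "d2": {"Legal"}, "d3": {"HR"}, "d4": {"HR"}}

p, r = precision_recall(auto, sme, "HR")
print(f"HR precision={p:.2f} recall={r:.2f}")  # prints HR precision=0.67 recall=0.67
```

Running the same test set through each vendor after each of the three development rounds gives comparable figures, while vendor-specific capabilities still have to be judged separately.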
Evaluating Taxonomy Software – POC Test Cases • Auto-categorization to existing taxonomy – variety of content • Clustering – automatic node generation • Summarization • Entity extraction – build a number of catalogs – design which ones based on projected needs – example: privacy info (SS#, phone, etc.) • Entity examples – people, organizations, methods, etc. • Evaluate usability in action by taxonomists
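The privacy-info catalog mentioned above (SS#, phone) is the simplest kind of entity extraction to illustrate, because those entities follow fixed patterns. The sketch below is an assumption-laden illustration, not a product feature: it uses plain regular expressions for US-style formats only, and the catalog names and patterns are hypothetical. Entities such as people, organizations, or methods need dictionaries or trained extractors and are not covered by this approach.

```python
import re

# Hypothetical "privacy info" catalog: one regex per entity type (US formats assumed).
PRIVACY_CATALOG = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
}


def extract_entities(text, catalog=PRIVACY_CATALOG):
    """Return a dict mapping each entity type to the list of matches found in text."""
    return {name: pattern.findall(text) for name, pattern in catalog.items()}


sample = "Call 555-123-4567 about SSN 123-45-6789."
print(extract_entities(sample))
# prints {'ssn': ['123-45-6789'], 'phone': ['555-123-4567']}
```

A POC test case would run such a catalog over a sample of your own content and have evaluators check both the misses and the false hits.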
Evaluating Taxonomy Software – POC – Issues • Quality of content • Quality of initial human categorization • Normalize among different test evaluators • Quality of taxonomists – experience with text analytics software and/or experience with content and information needs and behaviors • Quality of taxonomy • General issues – structure (too flat or too deep) • Overlapping categories • Differences in use – browse, index, categorize • IMPORTANT!!!
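Before normalizing among test evaluators, it helps to know how far apart they are. The slides do not name a method; Cohen's kappa, a standard chance-corrected agreement statistic for two raters, is used here as an assumed choice. The sketch takes one category label per document from each evaluator, with hypothetical labels.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who each assigned one label per document."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of documents")
    n = len(rater_a)
    # Observed agreement: share of documents where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two SME evaluators for the same four documents.
evaluator_1 = ["HR", "HR", "Legal", "Legal"]
evaluator_2 = ["HR", "Legal", "Legal", "Legal"]
print(cohens_kappa(evaluator_1, evaluator_2))  # prints 0.5
```

A low value suggests the evaluators read the categories differently, which points back at scope notes or overlapping categories in the taxonomy before it says anything about the software.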
Conclusion • Start with self-knowledge – what will you use it for? • Current Environment – technology, information • Basic Features are only filters, not scores • Integration – need an integrated team (IT, Business, KA) • For evaluation and development • POC – your content, real world scenarios – not scores • Foundation for development, experience with software • Development is better, faster, cheaper • Categorization is essential, time consuming • Categorization – essential issue is complexity of language • Entity Extraction – essential issue is scale
Questions? Tom Reamy tomr@kapsgroup.com KAPS Group Knowledge Architecture Professional Services http://www.kapsgroup.com