Empowering the Publishing Process with Semantic Technologies

Stephen Cohen, Principal Consultant
O’Reilly Tools of Change Conference, 23 February 2010

Agenda • Overview • Semantic technologies • Case studies • Benefits and challenges • Questions

Innodata Isogen – Who We Are
Innodata Isogen provides knowledge, production, technology and consulting services to the world’s leading media, publishing and information services companies
• We specialize in publishing, helping our clients to:
  • lower total cost of ownership for their content supply chain
  • re-engineer business processes
  • multi-shore services to lower cost, manage risk and balance the cost / quality ratio
  • combine content and technology outsourcing to add value
• Our clients include:
  • leading scholarly, business and legal publishers
  • secondary publishers (content aggregators)
  • agencies of the U.S. Department of Defense
  • major aerospace manufacturers
• 6,500 global staff
• Locations: London, Paris, Israel, Delhi, Manila, Cebu, Colombo, New Jersey, Dallas

Overview
• Semantic technologies are often used to monetize content more effectively and improve the customer experience on the Web
  • semantic advertising
  • semantic search
• They have also been used effectively throughout the publishing process
• Today we will talk about companies that are using semantic technologies and text mining to process content better, faster, cheaper

What Do Publishers Have in Common?
• They all want to deliver information better, faster, cheaper
• Better
  • offer the information customers and users want and need (focused)
  • make it easier for customers to discover new information and relationships between information
• Faster
  • get it into the hands of customers ahead of your competition (when they need it)
• Cheaper
  • do it in the most cost-effective way possible

Semantic Analysis Tools Can Help
• Across the content supply chain
• Better
  • more accurate, consistent content tagging, indexing, abstracting, linking
• Faster
  • find out sooner about new information (e.g., announcements, legal opinions, rules changes)
  • (semi) automate content enrichment
  • increase throughput
• Cheaper
  • deploy resources most cost effectively (do more with less)

Semantic Technologies: Some Characteristics
• Briefly, semantic technologies are algorithms that seek to model the associative processes that humans perform to extract meaning from information
• Knowing a little bit about “the man behind the curtain” can help when it comes to deciding which approach is a good fit for your company’s needs
• They can be rules-based, use statistical analysis, use semantic and linguistic clustering, etc.
• Not surprisingly, there are many approaches to modeling and each has its strengths and weaknesses

Rules-Based Text Analysis
• Precisely defines criteria by which a document belongs to a category
• Matches terms in a thesaurus to words in content
• Typically uses “if-then-else” rules
• Relatively easy to deploy; start with simple rules and enhance over time
• Rules can get complex, difficult to maintain
• Example rules:
  • Word = shrub? Assign Category = ‘bush’
  • Word = Bush AND within 4 words of President? Assign Category = ‘chief executive’
  • doc.type = email? Assign Category = ‘internal communication’
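The three example rules on this slide can be sketched in a few lines of Python. This is only an illustration of the if-then style, not any vendor's rules engine: the function name classify, the window parameter and the doc_type argument are assumptions made for the example.

```python
import re


def classify(text, doc_type=None, window=4):
    """Apply the three example if-then rules and return the assigned categories."""
    categories = set()
    # Keep the original casing: the name 'Bush' and the plant 'shrub' differ by it.
    words = re.findall(r"[A-Za-z']+", text)

    # Rule 1: the word 'shrub' assigns the category 'bush'.
    if any(w.lower() == "shrub" for w in words):
        categories.add("bush")

    # Rule 2: 'Bush' within `window` words of 'President' assigns 'chief executive'.
    bush_positions = [i for i, w in enumerate(words) if w == "Bush"]
    president_positions = [i for i, w in enumerate(words) if w == "President"]
    if any(abs(b - p) <= window for b in bush_positions for p in president_positions):
        categories.add("chief executive")

    # Rule 3: document metadata (here, a document type) can drive a rule as well.
    if doc_type == "email":
        categories.add("internal communication")

    return categories
```

Even at this size the maintenance concern is visible: every new sense of a word needs another hand-written condition.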

Statistical Analysis
• Word frequency
• Relative placement of words, groupings
• Distance between words in a document
• Pattern analysis
• Co-occurrence of terms to find clumps or clusters of closely related documents
• Makes assignments to categories based on a set of training documents
• Requires more time to deploy due to need to select a representative set of documents for training the tool
• Accuracy of the semantic analysis will depend on how well the training documents have been chosen
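As a rough illustration of assigning categories from training documents, the sketch below uses word frequencies with a small naive Bayes classifier. Naive Bayes is one common statistical technique chosen here for the example; the slide does not name a specific algorithm, and the function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labelled_docs):
    """Count words per category from (text, category) training pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, category in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokenize(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return {"word_counts": word_counts, "doc_counts": doc_counts, "vocabulary": vocabulary}


def classify(model, text):
    """Return the category with the highest log-probability for the text."""
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_category, best_score = None, float("-inf")
    for category, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][category]
        total_words = sum(counts.values())
        score = math.log(n_docs / total_docs)  # prior from the training set mix
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[word] + 1) / (total_words + vocab_size))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical training set; real deployments need a representative sample.
TRAINING = [
    ("The court issued an opinion on the appeal", "legal"),
    ("The judge ruled on the case law citation", "legal"),
    ("The team won the game with a late goal", "sports"),
    ("The player scored in the final match", "sports"),
]
```

The result depends entirely on the training pairs, which is the point made on the slide: a poorly chosen training set yields poor assignments.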

Semantic and Linguistic Clustering
• Concept extraction
• Language dependent
• Documents clustered or grouped depending on meaning of words using thesauri, parts-of-speech analyzers, rule-based & probabilistic grammar, etc.
• Analyzes structure of sentences
  • analysis of words: prefixes, suffixes, roots
  • word-level analysis including parts of speech
  • analyzes structure & relationships between words in a sentence
  • possible meanings of a sentence; enhanced by statistical analysis

The Content Supply Chain
• We view the publishing process in terms of a supply chain
• It runs from content acquisition through conversion and enhancement, on to product assembly and, lastly, product publishing and distribution
• Using semantic tools has an impact on roles and responsibilities, workflows and the way content is processed at each stage of the content supply chain
• Semantic tools and text mining are used at different stages of the editorial and production process

Semantic Tools in the Content Supply Chain
• Stages: Source / Create; Convert / Structure; Normalize; Store / Manage; Edit / Enhance; Product Assembly; Publish / Distribute
• Semantic tools applied along the chain:
  • Intelligent agents for targeted retrieval (content federation); “acquire what is new or changed from sites I am interested in”
  • Extract content for tagging; identify not only document structure but document meaning; structure unstructured content
  • Controlled vocabulary and authority list management; taxonomy managers; knowledge management
  • Linking; entity extraction; citations; classification, machine-aided indexing; contextual meaning
  • Abstracting, auto-summarization (e.g., synopses, headnotes)
  • Custom publishing; ‘Synthetic documents’
  • Content delivery for multiple output channels and product formats

Case Studies

Preview of Case Studies • Rules-based auto-classification • Document analysis and entity linking • Auto-summarization • Product assembly • Custom information feeds

Case Study: Rules-based Auto-classification

Rules-based Auto-classification
• Set-up
  • Taxonomy manager: add/remove terms; create groupings; map terms
  • Automatic update of rules to reflect changes in taxonomy
  • Define classification rules: indexer defines classification rules in the rules management system
  • Baseline test set: test & adjust rules
  • Review usage statistics: rules used, not used; add, modify, delete rules
• Auto-classification (rules base)
  • System applies rules to classify content against taxonomy
  • System tracks rules usage (which ones used; frequency)
• Indexer review
  • Indexer accepts, rejects, adds classification terms
  • Reviews rules the system applied that yielded wrong classification
  • Flags problems to rules builder; suggests new terms
  • System tracks rules that generated incorrect classifications
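The feedback loop in this diagram (apply rules, track which rules fire, record which ones an indexer rejects) can be sketched as a small data structure. All class, field and method names below are hypothetical, and the "all trigger terms present" matching is a simplifying assumption rather than the system described in the case study.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Rule:
    rule_id: str
    trigger_terms: frozenset  # every term must appear in the document
    taxonomy_term: str        # classification term assigned when the rule fires


@dataclass
class RulesBase:
    rules: list
    usage: Counter = field(default_factory=Counter)       # how often each rule fired
    rejections: Counter = field(default_factory=Counter)  # indexer-rejected firings

    def classify(self, text):
        """Apply every rule; return {taxonomy_term: rule_id that assigned it}."""
        words = set(text.lower().split())
        assigned = {}
        for rule in self.rules:
            if rule.trigger_terms <= words:
                self.usage[rule.rule_id] += 1
                assigned[rule.taxonomy_term] = rule.rule_id
        return assigned

    def reject(self, rule_id):
        """Record that an indexer rejected a term produced by this rule."""
        self.rejections[rule_id] += 1

    def unused_rules(self):
        """Rules that have never fired: candidates for review or deletion."""
        return [r.rule_id for r in self.rules if self.usage[r.rule_id] == 0]
```

Keeping the rule identifier alongside each assigned term is what lets a reviewer trace a wrong classification back to the rule that produced it.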

Case Study: Document Analysis and Entity Linking

Document Analysis and Entity Linking
• Focus is on document analysis and entity linking in editorial workflow
• Subsidiary of a global legal publishing house
  • content base of 3.5 million cases, related documents
  • manages over 17 million citations
  • updates of case law processed daily
  • cases growing at 20% per annum
• Challenges
  • avoid processes performed manually by individuals
  • allow the user to select and filter the information needed for their job
  • take into account an increasing number of legal information sources
• This case study describes the target configuration, which is not yet fully realized

Goals for the New Process
• Aid the process of knowledge extraction and storage
  • identify legal sources (e.g., official publication, case law decision)
  • extract legal citations (which source is cited and why?)
  • populate a knowledge base and cyclically enrich the content
• Process each piece of information one time
  • normalize, tag, enrich, link, form concepts, etc.
• Build a standardized common knowledge base for use throughout the editorial and production process and downstream by end users
• Maintain consistent thesauri, ontologies, taxonomies and provide a mechanism for their management and updating

Document Analysis and Linking Process
• Definition phase
  • Domain-specific lists for entity recognition
  • Text mining rules
  • Baseline test set: test text analysis tool
• Automated text analysis
  • Entity extraction
  • Automated semantic analysis
  • Tag content
  • Linking
  • Iterative application of rules
• Review and QC (legal editors, librarian)
  • Use search, navigation tools to review, identify, and correct
  • Weekly review of exception reports: entity error, link error, concept error
• List and rules maintenance
• Knowledge management: search and navigation services; enriched content
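A minimal sketch of the extract, link and report-exceptions loop follows. It assumes, purely for illustration, citations in the "volume U.S. page" reporter style and a knowledge base held as a plain dictionary; the case study does not say which citation formats or storage the publisher uses, and the identifiers shown are hypothetical.

```python
import re

# Assumed citation shape for the example: '<volume> U.S. <page>'.
CITATION = re.compile(r"\b(\d{1,3}) U\.S\. (\d{1,4})\b")


def extract_and_link(text, knowledge_base):
    """Return (links, exceptions) for the citations found in the text.

    links      -- (citation, target id) pairs resolved in the knowledge base
    exceptions -- citations with no known target, for the editors' exception report
    """
    links, exceptions = [], []
    for match in CITATION.finditer(text):
        citation = match.group(0)
        target = knowledge_base.get(citation)
        if target is None:
            exceptions.append(citation)  # a 'link error' in the diagram's terms
        else:
            links.append((citation, target))
    return links, exceptions
```

Unresolved citations are collected rather than dropped, which mirrors the review step in the diagram where editors work from exception reports.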

Benefits of the New Process
• Workflow
  • a semi-automated process
  • editors review QC output from text mining tool to enhance and correct as necessary
  • analysis and linking by automated text analysis tool
  • parallel processing in text analysis tool
  • analysis, referencing and linking became part of the same workflow
• Roles and responsibilities
  • editors no longer need to be experts in mark-up languages; content is tagged automatically
  • low-value editorial tasks handled by text analysis tool
  • existing staff can focus on high-value tasks
  • new role to maintain and enhance semantic lists and text mining tool rules
• Content
  • quality of document analysis improves through enhancements to the lists and rules used by the text mining tool
  • able to federate metadata across multiple content management systems
  • same knowledge base and text mining tool integrated into online products

Case Study: Auto-summarization

Auto-summarization – Major Newspaper
• Content in
• Document analysis: source; type; format; content
• Auto-summarization rules base
  • Document zones
  • Rules: semantics; dictionary; complex grammar rules
  • Section weightings
  • Sentence position
  • Relative importance of sentences
  • Markers for start of sections, paragraphs, sentences
  • Sentence length of summary
• Extent of automation depends on article importance:
  • Auto-summarization (draft version), then expert review and edit (final version), OR
  • Manual summarization by outsourced or in-house experts
• Administrator monitors, improves rules set based on usage
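Three of the inputs listed on this slide (sentence position, relative importance of sentences, and summary length) are enough for a simple extractive sketch. Word frequency is used below as a stand-in for "relative importance"; that choice, the lead_bonus weighting and the function name are assumptions for illustration, not the newspaper's actual rules base.

```python
import re
from collections import Counter


def summarize(text, max_sentences=2, lead_bonus=1.0):
    """Pick the highest-scoring sentences and return them in original order."""
    # Sentence markers: split after '.', '!' or '?' followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]
    word_freq = Counter(re.findall(r"[a-z]+", text.lower()))

    scored = []
    for position, sentence in enumerate(sentences):
        words = re.findall(r"[a-z]+", sentence.lower())
        if not words:
            continue
        # Relative importance: average document-wide frequency of the sentence's words.
        frequency_score = sum(word_freq[w] for w in words) / len(words)
        # Sentence position: earlier sentences get a larger bonus.
        position_score = lead_bonus / (position + 1)
        scored.append((frequency_score + position_score, position, sentence))

    top = sorted(scored, reverse=True)[:max_sentences]
    return [sentence for _, _, sentence in sorted(top, key=lambda item: item[1])]
```

The output is a draft: in the workflow shown on the slide, an expert would still review and edit it for important articles.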

Case Study: Product Assembly

Product Assembly
• New content process: Source / Capture; Convert / Normalize; Analyze / Classify / Enhance (Editorial)
• Content repository: XML content store; rich media
• Extract product content from repository: select content (XQuery) for each product
• Format product (render): WCSS, proprietary; FOSI, XSL-FO, proprietary; XSLT, CSS, RSS
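The select-then-render pattern in this diagram can be sketched with Python's standard library. The slide names XQuery for selection; the example below swaps in the limited XPath support of xml.etree.ElementTree as a stand-in, and the repository markup, element names and renderers are all hypothetical.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical single-source XML repository content.
REPOSITORY = """<repository>
  <article topic="tax"><title>New filing rules</title><body>Rules change in April.</body></article>
  <article topic="labor"><title>Overtime update</title><body>Thresholds rise.</body></article>
  <article topic="tax"><title>Credit guidance</title><body>Guidance issued.</body></article>
</repository>"""


def select_content(repository_xml, topic):
    """Select the articles for one product (XPath here, XQuery on the slide)."""
    root = ET.fromstring(repository_xml)
    return root.findall(f".//article[@topic='{topic}']")


def render_text(articles):
    """Render the selection for a plain-text channel."""
    return "\n".join(f"{a.findtext('title')}: {a.findtext('body')}" for a in articles)


def render_html(articles):
    """Render the same selection for a web channel."""
    items = "".join(
        f"<li><b>{escape(a.findtext('title'))}</b> {escape(a.findtext('body'))}</li>"
        for a in articles
    )
    return f"<ul>{items}</ul>"
```

The same selection feeds both renderers, which is the idea behind producing several product formats from one repository.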

Case Study: Custom Information Feeds

Custom Information Feeds
[Diagram; labels recovered from the slide]
• Columns: Repository; Content; Rules; Delivery; End Users
• Levels: Pro; College; High School; Rec
• Sports: Baseball; Football; Soccer; Hockey
• Content types: Scores; Players; People; News; Standings; Stats; Scheds
• Formats: XML; Rich Media
• Delivery: Real-time updates; Targeted info; Real-time feeds; Enriched email
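One way to read this diagram is that rules match tagged content against what each end user has asked for. The sketch below assumes content items carry a set of tags (level, sport, content type) and that each subscription is a set of required tags; the data shapes and the function name route are hypothetical.

```python
def route(items, subscriptions):
    """Return {user: [headline, ...]} for items whose tags cover the user's rule."""
    feeds = {user: [] for user in subscriptions}
    for item in items:
        for user, required_tags in subscriptions.items():
            # Deliver only when every tag the user asked for is on the item.
            if required_tags <= item["tags"]:
                feeds[user].append(item["headline"])
    return feeds


# Hypothetical tagged content and subscriptions, using labels from the slide.
ITEMS = [
    {"headline": "Late homer decides opener", "tags": {"pro", "baseball", "scores"}},
    {"headline": "Conference standings updated", "tags": {"college", "football", "standings"}},
    {"headline": "Playoff schedule released", "tags": {"pro", "hockey", "scheds"}},
]
SUBSCRIPTIONS = {
    "fan_a": {"pro", "baseball"},
    "fan_b": {"college"},
    "fan_c": {"pro"},
}
```

The routing is only as precise as the tags, so the quality of the upstream classification step determines how targeted each feed can be.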

Benefits of Using Semantic Technologies
• People
  • minimize high-value resources performing commodity tasks
  • editors focus on real editorial added value; no need to be concerned about markup
  • increased capacity without increasing headcount
  • novice indexers come up to speed more quickly
• Process
  • reduced processing time due to automation
  • sequential tasks can be performed in one step
  • products can be more targeted to specific customer needs
  • parts can be outsourced
• Content
  • richer, more consistent classification, linking, summarization, semantic tagging
  • common controlled vocabularies maintained and applied across entire content base
  • same content can be classified and summarized along more dimensions to serve different customer groups
  • greater value can be extracted from unstructured content with text mining and semantic analysis
  • taxonomy managers support a rigorous approach to maintenance and updating

Challenges Using Semantic Technologies
• People
  • retraining resources for new roles (rules builder, taxonomy manager, etc.) is time consuming
  • level of accuracy depends on ability of editors to write logical rules
• Process
  • time required to refine rules and train analysis engine can be extensive (some report 12-18 months)
  • productivity improvements are a function of thesaurus structure, rule-builder’s skill level, document type; the more complex any of these is, the longer it takes to achieve return on investment
• Content
  • automated content analysis doesn’t match up to the analytical skills of trained subject area experts (at least in some highly technical disciplines)
  • some find it difficult to measure the impact of indexing consistency
  • lower quality when there is fully automated machine-aided indexing with no follow-on QC by subject area experts

Questions

Thank You
Stephen Cohen, Principal Consultant
scohen@innodata-isogen.com
+1 (201) 371-8044
Innodata Isogen, Inc.
Three University Plaza, Hackensack, NJ 07601
+1 (201) 371-2828
www.innodata-isogen.com
