What is Semantic Publishing? And Why Should I Care?

Jabin White, Director of Strategic Content, Wolters Kluwer Health – P&E • May 13, 2010 • PSP Presents – Semantic Publishing: An Introduction

Agenda • Introductions • Some definitions • Vocabularies, Taxonomies, and Ontologies, Oh My! • What is metadata, and why should publishers care? • What is semantic tagging, and why should publishers care? • Impact of all this on publishers’… • Workflows/processes • Business cases • The Semantic Web • Final Thoughts, Recommendations

Introductions: My Company • Director of Strategic Content for Wolters Kluwer Health – Professional & Education • Wolters Kluwer Health includes: • Lippincott Williams & Wilkins titles • Ovid • UpToDate • Provation Order Sets • Drug Facts & Comparisons • Medi-Span • Clin-eguide

Introductions: Me • Started as Editorial Assistant • Dove into SGML in the mid-90s working on drug reference • Six years at Elsevier in Electronic Production • Don’t typecast me! • Joined WK Health in May 2009 • Responsible for making sure content flows through company more efficiently (DTDs, Content Management, Authoring Tools, Semantic Enrichment, Product Information Management, etc.)

The Web - Stop the Insanity! • A few humble web stats: • There are 2 billion (billion!) Google searches daily • There are 1 trillion (1,000,000,000,000) unique URLs in Google’s index • There are 2,695,205 articles in English on Wikipedia • It would take 412.3 years to view all the content on YouTube (3/08), but don’t try, because there are 13 hours of video uploaded every minute ** Source: Adam Singer’s “Social Media, Web 2.0 and Internet Stats” site: http://thefuturebuzz.com/2009/01/12/social-media-web-20-internet-numbers-stats/

So What? • Clay Shirky’s concept of “Filter Failure” • When the capacity of people to “keep up with” information is exceeded, curation becomes the value differentiator

Definitions • Controlled vocabulary: a bunch of words, no relationships • But there is an advantage if all users use the same terms to describe things • Taxonomy: a controlled vocabulary with hierarchy • Thesaurus: often used interchangeably with controlled vocabulary, also sometimes referred to as an ontology • Ontology: all of the above; think neural network with a bunch of relationships • Metadata: data about data (we’ll get to that)
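To make the difference between these definitions concrete, here is a minimal sketch in Python. All terms and relationship names are toy data invented for illustration; they are not drawn from any real vocabulary.

```python
# Hypothetical toy data for illustration only.

# Controlled vocabulary: an agreed list of terms, no relationships.
controlled_vocabulary = {"sports", "baseball", "basketball", "pitcher"}

# Taxonomy: the same kind of terms arranged in a hierarchy (child -> parent).
taxonomy = {"baseball": "sports", "basketball": "sports"}

# Ontology: typed relationships between terms, stored as
# (subject, relation, object) triples.
ontology = {
    ("baseball", "is_a", "sports"),
    ("basketball", "is_a", "sports"),
    ("pitcher", "plays", "baseball"),
}


def broader_terms(term, parents):
    """Walk up the hierarchy and return every ancestor of term."""
    chain = []
    while term in parents:
        term = parents[term]
        chain.append(term)
    return chain


def related(subject, relation, triples):
    """Return every object linked to subject by the given relation."""
    return sorted(o for s, r, o in triples if s == subject and r == relation)
```

The vocabulary can only answer "is this an approved term?"; the taxonomy can also answer "what is broader?"; the ontology can answer questions about any named relationship.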

Some Level-Setting • Unfortunately, these definitions have been diluted to the point of uselessness by their misuse • Think “Content Management” around the year 2000 • MetaThesaurus – a collection of all of these things • EXAMPLE: UMLS

Information Classification • Pretty Wonky, Pretty Fast • Hypernym: Broader Term, more general • Car is a hypernym of Pinto • Hyponym: Narrower Term • Baseball is a hyponym of sports • Meronym: part term • Kansas is a meronym of United States • Holonym: whole term • European Union is a holonym of France
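The four relation names come in inverse pairs, which a short Python sketch can show using the slide's own examples. The dictionary and function names are illustrative assumptions.

```python
# Narrower -> broader pairs (hyponym -> hypernym), using the slide's examples.
BROADER = {"pinto": "car", "baseball": "sports"}

# Part -> whole pairs (meronym -> holonym), using the slide's examples.
WHOLE = {"Kansas": "United States", "France": "European Union"}


def invert(mapping):
    """Turn a child -> parent mapping into parent -> [children]."""
    inverted = {}
    for child, parent in mapping.items():
        inverted.setdefault(parent, []).append(child)
    return inverted


# Inverting "broader" gives "narrower"; inverting "whole" gives "parts".
NARROWER = invert(BROADER)
PARTS = invert(WHOLE)
```

Storing only one direction and deriving the other keeps the two views from drifting apart.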

Taxonomies in STM

Some Heavy Hitters • UMLS • MeSH • SNOMED-CT • ICD-9 and ICD-10 • RxNORM • LOINC, ICPC-93, and VA/KP Subset of SNOMED

UMLS – Unified Medical Language System • More than 5 million terms or named entities • Divided into concepts, and each term has unique identifier • Not a vocabulary, but a mapping BETWEEN vocabularies

UMLS • Vocabularies included in the UMLS: • MeSH Headings in 8 languages • ICPC-93 in 14 languages • WHO Adverse Drug Reaction Terminology in 5 languages • SNOMED-2, SNOMED-3, and UK Clinical Terms (former Read Codes) • ICD-10 in English and German • ICD-10-AM (Australian Modification) • ICD-9 (US Modification)

The Semantic Network (UMLS) • Semantic types are big things like Disease, Syndrome, or Clinical Drug • Semantic relationships are useful links between semantic types (e.g., Clinical Drug treats Disease or Symptom)
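As a rough illustration of how type-level relationships constrain statements about individual terms, here is a hedged Python sketch. The type names follow the slide; the term-to-type assignments, the `ALLOWED` set, and the `is_valid` helper are hypothetical and do not reflect the actual UMLS data files or API.

```python
# Hypothetical miniature of a semantic network. Type names follow the slide;
# the term assignments below are illustrative only.
SEMANTIC_TYPE = {
    "aspirin": "Clinical Drug",
    "headache": "Disease or Symptom",
}

# Relationships are declared between types, not between individual terms.
ALLOWED = {("Clinical Drug", "treats", "Disease or Symptom")}


def is_valid(subject, relation, obj):
    """Check whether a statement about two terms fits the type-level rules."""
    triple = (SEMANTIC_TYPE.get(subject), relation, SEMANTIC_TYPE.get(obj))
    return triple in ALLOWED
```

With this in place, "aspirin treats headache" passes the check while "headache treats aspirin" does not.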

One Concept, Many Names

MeSH – Medical Subject Headings • An 11-level hierarchy developed and maintained by the National Library of Medicine, part of the US Department of Health and Human Services • The indexing method for MEDLINE/PubMed • MEDLINE contains more than 16 million references to journal articles in the life sciences, with concentration in biomedicine • 5,200 journals worldwide in 37 languages • Since 2005, 2,000-4,000 references are added daily, Tuesday-Saturday, all indexed to MeSH • Loading suspended for two weeks every November/December while MeSH is updated

The MeSH Staff

SNOMED-CT • Systematized Nomenclature of Medicine (Clinical Terms) • 344,000 concepts, arguably the most complete clinical taxonomy in the world • Developed and maintained by the College of American Pathologists • Licensed by NLM, freely available to license as part of UMLS • US Standard for electronic health information exchange by Health IT standards panel • Adopted for use by US government through the Consolidated Health Informatics (CHI) initiative

ICD-9 and ICD-10 • International Classification of Diseases • Version 9 moving to Version 10 (US is slower than rest of the world on this) • Codes that define diseases: • Example: 411.0 = Postmyocardial infarction syndrome (aka Dressler’s Syndrome) • Used to drive insurance reimbursements, billing, and other classifications of diseases • Used by the US government to compute morbidity and mortality figures

RxNorm • Standardized names for drugs, collections of drugs, and delivery devices • Like MeSH, developed and maintained by National Library of Medicine • Also includes standard way of expressing generic and trade names, ingredients, strengths, and dose forms

LOINC Mapping Files • Logical Observation Identifiers Names and Codes • A set of universal names and ID codes for identifying laboratory and clinical test results • Used to better communicate with HIT (Health Information Technology) systems • Not much of an impact on publishers, but we should know about them

What is Metadata, and Why Should Publishers Care?

What is Metadata? • Reading most definitions of metadata and related standards is like trying to resolve disputes with my kids • Metadata is “data about data” • But what does that mean? • Its use may be increasing, but metadata is NOT new

Why Should Publishers Care • In the move from print publishing to digital, metadata is a powerful tool to help publishers get content in the right place, in the right format, and known to the right systems and people, at the right time • Print books were easy • Everyone knew what they were • You could really only use them one way • They had a beginning, an end, a physical presence, and a set price (mostly)

Why Should Publishers Care • Today, computers are often communicating with one another as much as they are with users (people) • Metadata becomes critical in: • B2B relationships • Enhancing B2C relationships • B2-_________ relationships • The quality of the metadata gives publishers a more powerful voice in what happens to their content

Why Should Publishers Care? • For example: • A digital asset (an image) • What file format is it? • How big is the image? • Who took the picture? • Who owns the picture? • Can you use it on your web site? If you do, what credit do you have to give to the owner? • What date was it created? • Is it part of a collection? • Is it related to another piece of content? • Does it stand alone or is it part of a group of images?
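Each of those questions maps naturally onto a field in a metadata record. The following is a minimal Python sketch of such a record; every field name and the `web_credit` helper are assumptions made for illustration, not a published schema.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class ImageMetadata:
    """Hypothetical record answering the questions asked about an image."""
    file_format: str                 # What file format is it?
    width_px: int                    # How big is the image?
    height_px: int
    photographer: str                # Who took the picture?
    rights_holder: str               # Who owns the picture?
    web_use_allowed: bool            # Can you use it on your web site?
    credit_line: str                 # What credit must be given?
    created: date                    # What date was it created?
    collection: Optional[str] = None  # Is it part of a collection?
    related_content_ids: List[str] = field(default_factory=list)


def web_credit(meta):
    """Return the credit to show on a web site, or None if web use is not allowed."""
    return meta.credit_line if meta.web_use_allowed else None
```

Once the answers live in fields rather than in someone's head, another system can make the "can we use this?" decision without a phone call.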

Publishers Should Care • If a publisher’s goal is to disseminate content to the widest possible audience, metadata is critical

Publisher Relationships • Again, in books you had one use model • Metadata allows publishers to have diverse relationships with content consumers and other information providers • Customers (duh) • Aggregators • The Open Web (not Google, but other search engines) • But don’t try to “game” the search engines with adult keywords; that’s just wrong • There have been lawsuits over use of meta keywords, including Playboy suing two adult web sites • Technology partners/developers • Systems wherein content is a “value add” • Multiple output formats

Types of Metadata • HTML Metadata • <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1"> • <meta name="verify-v1" content="kBoFGUuwppiWVWGx4Ypzkw1Cs1GgMYEMMbfNr7FY65w=" /> • <meta name="description" content="International publisher of professional health information for physicians, nurses, specialized clinicians & students. Medical & nursing charts, journals, and pda software."> • <meta name="keywords" content="springhouse, medical book, nursing journal, medical pda software, lippincott medical reference, lww, lippincott, lww com, medical publisher"> • <link rel="stylesheet" href="/css/style.css" type="text/css"> (Slide callouts: For people • For search engines)
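To see how a machine reads this kind of HTML metadata, here is a small sketch using Python's standard-library `html.parser`. The `MetaCollector` class, the `read_meta` helper, and the shortened sample markup are illustrative assumptions; real crawlers do considerably more.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags into a dict keyed by name or http-equiv."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("name") or attr.get("http-equiv")
        if key:
            self.meta[key.lower()] = attr.get("content", "")


def read_meta(html):
    """Parse an HTML string and return its meta name -> content mapping."""
    collector = MetaCollector()
    collector.feed(html)
    return collector.meta


# Shortened, hypothetical sample in the style of the tags on the slide.
SAMPLE = (
    '<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">'
    '<meta name="description" content="Publisher of health information." />'
    '<meta name="keywords" content="medical book, nursing journal">'
)
```

Calling `read_meta(SAMPLE)["keywords"]` returns the keyword string, which is the part aimed at search engines rather than at human readers.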

Types of Metadata • Classifying Metadata • ISBN (I told you this wasn’t new) • Dewey Decimal System • Books in Print/CIP/Library of Congress data • MARC records • DOI (Digital Object Identifier) • Descriptive Metadata (sorry, my examples are from STM) • ICD-9 and ICD-10 Codes • MeSH • SNOMED-CT • NANDA, NIC, NOC for Nursing • NDC, HCPCS for drugs OLD NEW

Semantic Metadata • Using controlled vocabularies, extra power can be added to content via semantic tagging to drive: • More precise searching • Contextually-based connections • Lowering of “two terms meaning the same thing” syndrome (hypertension vs. high blood pressure; heart attack vs. myocardial infarction) • Filling in of content gaps • Semantic tagging *is* metadata, but it deserves its own section (coming up)

What is Semantic Tagging?

Semantic Basics • Semantic tagging describes what content *is*, not how it should *look* on the page or screen • Contrast with structural tagging, which is made of elements such as <para>, <list>, and <title> • Both are XML, but semantics is like XML on steroids! • Doing semantic tagging without a controlled vocabulary is madness for scholarly publishing • Think “folksonomies”
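The contrast between structural and semantic tagging can be sketched with Python's standard-library `xml.etree.ElementTree`. The `<concept>` element and its `vocab` and `code` attributes are hypothetical names invented for this example, not part of any particular DTD; the ICD-9 code is the one quoted earlier in the deck.

```python
import xml.etree.ElementTree as ET

# Structural tagging only: the machine knows this is a paragraph, nothing more.
STRUCTURAL = "<para>The patient developed Dressler's syndrome.</para>"

# Semantic tagging added: a hypothetical <concept> element ties the phrase
# to a code in a controlled vocabulary.
SEMANTIC = (
    "<para>The patient developed "
    '<concept vocab="ICD-9" code="411.0">Dressler\'s syndrome</concept>.'
    "</para>"
)


def concepts(xml_text):
    """Return (vocabulary, code, text) for each tagged concept in the XML."""
    root = ET.fromstring(xml_text)
    return [(el.get("vocab"), el.get("code"), el.text)
            for el in root.iter("concept")]
```

Both strings render identically on the page, but only the second one lets software find the paragraph by code instead of by guessing at wording.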

Manual Tagging • DESCRIPTION: A subject matter expert (SME) reads chapter/article, indexes or tags based on content, resulting in enriched content • POSITIVES – If precision is needed, and clinical understanding of concepts (i.e., judgment) is required, probably still the best option • NEGATIVES - Cost prohibitive on large volumes of information; not scalable; inconsistent if the controlled vocabulary is not followed or different taggers are used

Manual Tagging – Other Factors • Offshore resources have improved in recent years as “knowledge work” has gone global, resulting in cost reductions • Some processes considered “too expensive” to be done manually before could be revisited • Great dependence on *type* of content, which means use cases should drive workflow decisions

Automated Approaches • DESCRIPTION: Software crawls content, adds tags/unique identifiers or finds concepts & patterns to drive more intelligent search or entity extraction • POSITIVES – Very effective in finding “trends” or concepts over a large repository of data; growing industry because of information overload (aka Data Mining, Text Analysis) • NEGATIVES – Sometimes leads to false positives, lack of precision or judgment by machines processing data
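A very simple form of automated tagging is dictionary matching, sketched below in Python. The vocabulary, the concept identifiers, and the `auto_tag` function are all hypothetical; commercial engines use far more sophisticated techniques, and this sketch is only meant to show where false positives come from.

```python
import re

# Hypothetical term -> concept identifier mapping (identifiers are made up).
VOCAB = {
    "hypertension": "CONCEPT-HTN",
    "high blood pressure": "CONCEPT-HTN",
    "cold": "CONCEPT-COLD",  # intended to mean the common cold
}


def auto_tag(text):
    """Return sorted (offset, matched text, concept id) for each vocabulary hit."""
    hits = []
    for term, concept_id in VOCAB.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0), concept_id))
    return sorted(hits)
```

Running `auto_tag("Apply a cold compress.")` tags "cold" as the common cold, which is exactly the kind of false positive a machine produces when it has no judgment about context.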

Automated Approaches – Other Factors • If used effectively, quick wins on large repositories • Can be used to accomplish projects that would never be attempted (or approved) manually

Combination Approaches • DESCRIPTION: Automated process followed by SME checking (deeper level than straight QA) and addition of specific conceptual information • POSITIVES – best of both worlds for projects that deserve it; can drive precision but can also cover large repositories • NEGATIVES – costs; every time software or people act on your content, there are costs – you don’t get a discount from either because you are doing both 

FUD Around Semantic Search • Semantic Search engines • TEMIS, Collexis, NetBase, Vivisimo, OpenCalais • Finding semantic concepts based on entities and search algorithms • Finding a needle in a haystack • Semantic Tagging • People (SMEs) identify concepts and tag accordingly • Drives precision in search and other things • Finding the right needle in a stack of 10 needles

A Note About “Folksonomies” • Having users “tag” or classify data is increasing in popularity • Not much use in clinical areas of health sciences • If you are sick, do you want to know what 100 people think, or the one expert?

Impact on Publishers

Impact on Publishers • Impact depends on how deep you want to go • i.e., what am I going to get in return for investing in metadata, and is it worth it? • More and more, this is not an “if” proposition, it’s “how much” • Publishers who buy in have two basic choices on approach:

Option 1: Metadata in the Workflow • Requires deeper commitment, but has bigger potential upside • Positive impact on product creation and development • Requires thinking about tools, workflows, and enterprise-level systems to allow for creation and MAINTENANCE of metadata • Combination of good metadata in the workflow and creativity in product development team can pay big benefits • Allows participation of authors (or subject matter experts in lieu of) at the beginning of the workflow

Option 2: Outside the Workflow • Requires lesser commitment, but potentially fewer rewards • Can be done with zero impact on current systems • Has benefit of content being in “final form” (whatever that means anymore) when intelligence is added in metadata • Can keep SMEs as a separate offshoot of the workflow – easily outsourced • Can attack this problem with brute force semantic search engines, but this is a different thing

Impact on Publishers • Active vs. Passive Metadata • Active metadata • Publisher intentionally associates markup with certain pieces of content • Often using controlled vocabulary • Includes semantic indexing • Can also be machine-based, using scripts, etc. • Passive metadata • Metadata created based on use of content • Image X was used as part of an image bank on pediatric • Inheritance of properties from parent objects
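Inheritance of properties from parent objects can be sketched with Python's standard-library `collections.ChainMap`. The book, chapter, and image records and their values are hypothetical, and a real content management system would model this differently.

```python
from collections import ChainMap

# Hypothetical metadata at three levels of a content hierarchy.
book = {"publisher": "Example Press", "subject": "cardiology"}
chapter = {"title": "Chapter 3", "subject": "hypertension"}
image = {"file_format": "JPEG"}

# Lookups try the image first, then its chapter, then the book, so the image
# inherits anything it does not define itself.
effective = ChainMap(image, chapter, book)
```

Here the image never had a subject assigned, yet `effective["subject"]` resolves to the chapter's value and `effective["publisher"]` to the book's, with no one tagging the image directly.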

Implications for Search • Machines don’t know the difference between hypertension and high blood pressure • More accurately, machines don’t know they are the SAME • How this is handled is a matter of User Experience (did you mean? … give them the result … etc.), but the content must be tagged first
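One way to teach a machine that two terms are the SAME is to map every synonym to a single preferred concept before searching. The sketch below uses the synonym pairs mentioned in this deck; the document identifiers, the tag sets, and the `search` function are hypothetical.

```python
# Synonym -> preferred term, using the pairs mentioned in this deck.
SYNONYMS = {
    "hypertension": "hypertension",
    "high blood pressure": "hypertension",
    "myocardial infarction": "myocardial infarction",
    "heart attack": "myocardial infarction",
}

# Hypothetical documents, already tagged with preferred terms.
DOCS = {
    "doc-1": {"hypertension"},
    "doc-2": {"myocardial infarction"},
}


def search(query):
    """Normalize the query to a preferred term, then match tagged documents."""
    concept = SYNONYMS.get(query.strip().lower())
    if concept is None:
        return []
    return sorted(doc_id for doc_id, tags in DOCS.items() if concept in tags)
```

A search for "High Blood Pressure" now finds the document tagged "hypertension". What to show the user (a "did you mean?" prompt or the result itself) remains a user experience decision layered on top.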
