1 / 43

Probase: A Knowledge Base for Text Understanding


Presentation Transcript


  1. Probase: A Knowledge Base for Text Understanding

  2. Integrating, representing, and reasoning over human knowledge: the grand challenge of the 21st century.

  3. Knowledge Bases • Scope: • Cross domain / encyclopedia: Freebase, Cyc, etc. • Vertical domain / directory: Entity Cube, etc. • Challenges: • completeness (e.g., a complete list of restaurants) • accuracy (e.g., address, phone, menu of each restaurant) • data integration

  4. “By being able to place something into its proper place in the concept hierarchy, one can learn a considerable amount about it.” “Concepts are the glue that holds our mental world together.”

  5. Yellow Pages or Knowledge Bases? • Existing knowledge bases are more like big Yellow Pages books • Goals: • Complete list of items • Complete information about each item • But humans do not need “complete” knowledge in order to understand the world

  6. What’s wrong with the yellow-pages approach? • Rich in instances, poor in concepts • Understanding = forming concepts • Do ontologies always have an elegant hierarchy? • Manually crafted ontologies are for human use • Our mental world does not have an elegant hierarchy

  7. Probase: “It is better to use the world as its own model.” • Capture concepts (in our mental world) • Quantify uncertainty (for reasoning)

  8. 1. More than 2.7 million concepts automatically harnessed from 1.68 billion documents. 2. Computation/reasoning enabled by scoring: • Consensus: e.g., is there a company called Apple? • Typicality: e.g., how likely are you to think of Apple when you think about companies? • Ambiguity: e.g., does the word Apple, sans any context, represent Apple the company? • Similarity: e.g., how likely is an actor also a celebrity? • Freshness: e.g., Pluto as a dwarf planet is a fresher claim than Pluto as a planet. • … 3. Give machines a new CPU (Commonsense Processing Unit) powered by a distributed graph engine called Trinity. 4. A little knowledge goes a long way after machines acquire a human touch.

  9. Probase has a big concept space [chart: concept-space sizes of Probase, Freebase, and Cyc]

  10. Probase has 2.7 million concepts [chart: frequency distribution of concepts]

  11. Uncertainty: Probase vs. Freebase

  12. Probase Internals • painter (a sub-concept of artist), with attributes Born, Died, …, Movement; e.g., Picasso: 1881, 1973, …, Cubism • painting (a sub-concept of art, linked to painter by a “created by” relation), with attributes Year, Type, …; e.g., Guernica: 1937, Oil on Canvas, …

  13. Data Sources • Patterns for single statements NP such as {NP, NP, ..., (and|or)} NP such NP as {NP,}* {(or|and)} NP NP {, NP}* {,} or other NP NP {, NP}* {,} and other NP NP {,} including {NP ,}* {or | and} NP NP {,} especially {NP,}* {or|and} NP • Examples: • Good: “rich countries such as USA and Japan …” • Tough: “animals other than cats such as dogs …” • Hopeless: “At Berklee, I was playing with cats such as Jeff Berlin, Mike Stern, Bill Frisell, and Neil Stubenhaus.”
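A minimal sketch of matching the first pattern above, assuming plain-text input and treating the single word before “such as” as the concept’s head noun (a crude stand-in for real noun-phrase chunking; this is not Probase’s actual extractor):

```python
import re

# "NP such as NP, NP, ..., (and|or) NP" on raw text; a single word stands
# in for each noun phrase, and the instance list is cut at punctuation.
SUCH_AS = re.compile(r"(\w+) such as ([^.;!?]+)")
LIST_SEP = re.compile(r",\s*(?:and\s+|or\s+)?|\s+(?:and|or)\s+")

def extract_isa_pairs(sentence):
    """Yield (concept, instance) candidate pairs from one sentence."""
    for concept, tail in SUCH_AS.findall(sentence):
        for instance in LIST_SEP.split(tail):
            if instance.strip():
                yield concept, instance.strip()

print(list(extract_isa_pairs("rich countries such as USA and Japan.")))
# [('countries', 'USA'), ('countries', 'Japan')]
```

Note that the same regex happily extracts (“cats”, “dogs”) from the “tough” sentence above, which is exactly why scoring and filtering are needed downstream of pattern matching.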

  14. Examples • Animals other than dogs such as cats … • Eating fruits such as apple … can be good for your health. • Eating disorders such as … can be bad for your health.

  15. Properties • Given a class, find its properties • Candidate seed property pattern: “What is the [property] of [instance]?” • “Where”, “When”, and “Who” questions are also considered

  16. Similarity between two concepts • Weighted linear combination of • similarity between the sets of instances • similarity between the sets of attributes • Examples: (nation, country), (celebrity, well-known politicians)
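A sketch of that weighted combination, assuming Jaccard overlap as the set-similarity measure and an illustrative weight w (both are assumptions; the slide does not fix either choice):

```python
def jaccard(a: set, b: set) -> float:
    """Set overlap: |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def concept_similarity(insts1, insts2, attrs1, attrs2, w=0.5):
    """sim = w * sim(instance sets) + (1 - w) * sim(attribute sets)."""
    return w * jaccard(insts1, insts2) + (1 - w) * jaccard(attrs1, attrs2)

# (nation, country): nearly identical instances and attributes -> high score
print(concept_similarity({"china", "india", "france"},
                         {"china", "india", "france", "fiji"},
                         {"capital", "population", "language"},
                         {"capital", "population", "language"}))  # 0.875
```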

  17. Beyond noun phrases • Example: the verb “hit” • Small object, Hard surface • (bullet, concrete), (ball, wall) • Natural disaster, Area • (earthquake, Seattle), (Hurricane Floyd, Florida) • Emergency, Country • (economic crisis, Mexico), (flood, Britain)

  18. Quantify Uncertainty: the foundation of text understanding and reasoning • Typicality: P(concept | instance), P(instance | concept), P(concept | property), P(property | concept) • Similarity: sim(concept1, concept2)
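One plausible way to estimate the typicality scores, assuming is-a co-occurrence counts n(concept, instance) extracted as above; the counts below are made up for illustration:

```python
from collections import defaultdict

def typicality(counts):
    """Estimate P(instance | concept) and P(concept | instance) from counts."""
    per_concept, per_instance = defaultdict(int), defaultdict(int)
    for (c, i), n in counts.items():
        per_concept[c] += n
        per_instance[i] += n
    p_inst = {(c, i): n / per_concept[c] for (c, i), n in counts.items()}
    p_conc = {(c, i): n / per_instance[i] for (c, i), n in counts.items()}
    return p_inst, p_conc

counts = {("company", "apple"): 5000, ("fruit", "apple"): 8000,
          ("company", "microsoft"): 9000}
p_inst, p_conc = typicality(counts)
print(round(p_conc[("company", "apple")], 2))  # 0.38: the fruit reading dominates
```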

  19. Text Mining / IE: State of the Art • Bag-of-words approaches: e.g., LDA • Based on statistics across multiple documents • Simple bag of words, no semantics • Supervised learning: e.g., CRF • Labeled training data required • Difficulty with out-of-sample features • Lack of semantics • What role can a knowledge base play?

  20. Shopping in Bellevue during TechFest • Five of us bought 5 Kinects and posed in front of an Apple store. • An apple store that sells fruit, or an Apple store that sells iPads?

  21. Step by Step Understanding

  22. Explicit Semantic Analysis (ESA) • Goal: latent topics => explicit topics • Approach: map text to Wikipedia articles • An inverted index records each word’s occurrences in Wikipedia articles • Given a document, we derive a distribution over Wikipedia articles

  23. Explicit Semantic Analysis (ESA) • bag of words => bag of Wikipedia articles (a distribution over Wikipedia articles) • A bag of Wikipedia articles is not equivalent to a clear concept in our mental world. • Top 10 concepts for “Bank of America”: Bank, Bank of America, Bank of America Plaza (Atlanta), Bank of America Plaza (Dallas), MBNA, VISA (credit card), Bank of America Tower New York City, NASDAQ, MasterCard, Bank of America Corporate Center
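A toy sketch of the ESA pipeline, with a tiny hand-made inverted index standing in for the real word-to-article statistics (real ESA uses tf-idf weights over all of Wikipedia):

```python
from collections import defaultdict

# word -> {Wikipedia article: weight}; illustrative numbers only
INVERTED_INDEX = {
    "bank":    {"Bank": 0.9, "Bank of America": 0.6},
    "america": {"Bank of America": 0.7, "United States": 0.8},
}

def esa_vector(text):
    """Map a document to a weighted bag of Wikipedia articles."""
    scores = defaultdict(float)
    for word in text.lower().split():
        for article, weight in INVERTED_INDEX.get(word, {}).items():
            scores[article] += weight
    return dict(scores)

print(esa_vector("Bank America"))
# {'Bank': 0.9, 'Bank of America': 1.3, 'United States': 0.8}
```

The output is exactly the kind of article distribution criticized above: useful as features, but not a single clear concept.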

  24. Short Text • Challenge: • Not enough statistics • Applications: • Twitter • Query/search log • Anchor text • Image/video tags • Document paraphrasing and annotation

  25. Comparison of Knowledge Bases

  26. When the machine sees the word ‘apple’

  27. When the machine sees ‘apple’ and ‘pear’ together

  28. When the machine sees ‘China’ and ‘Israel’ together

  29. What China is but Israel is not?

  30. What Israel is but China is not?

  31. What’s the difference from a text cloud?

  32. When a machine sees attributes …

  33. Entity Abstraction • Given a set of entities $E = \{e_1, \dots, e_n\}$ • The naïve Bayes rule gives $P(c \mid E) \propto P(c) \prod_{i=1}^{n} P(e_i \mid c)$ • where $c$ is a concept, and $P(e_i \mid c)$ is computed based on the concept-entity co-occurrence
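A sketch of that scoring in code, assuming concept-term co-occurrence counts n(c, t) and per-concept totals are available; the Laplace smoothing constant and vocabulary size are added assumptions to keep unseen terms from zeroing out a concept:

```python
import math

def conceptualize(terms, n, concept_totals, alpha=0.1, vocab=10**6):
    """Rank concepts c by log P(c) + sum_t log P(t | c) (naive Bayes)."""
    grand_total = sum(concept_totals.values())
    scores = {}
    for c, total in concept_totals.items():
        score = math.log(total / grand_total)             # log P(c)
        for t in terms:
            p = (n.get((c, t), 0) + alpha) / (total + alpha * vocab)
            score += math.log(p)                          # log P(t | c)
        scores[c] = score
    return sorted(scores.items(), key=lambda kv: -kv[1])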

  34. How to Infer a Concept from Attributes? • Concept-instance counts (concept, instance, frequency): (university, Florida State University, 75), (university, Harvard University, 388), (university, University of California, 142), (country, China, 97346), (country, the United States, 91083), (country, India, 80351), (country, Canada, 74481) • Instance-attribute counts (instance, attribute, frequency): (Florida State University, website, 34), (Harvard University, website, 38), (University of California, city, 12), (China, capital, 43), (the United States, capital, 32), (India, population, 35), (Canada, population, 21) • Aggregated concept-attribute counts (concept, attribute, frequency): (university, website, 4568), (university, city, 2343), (country, capital, 4345), (country, population, 3234), … • Given a set of attributes $A = \{a_1, \dots, a_m\}$, the naïve Bayes rule gives $P(c \mid A) \propto P(c) \prod_{j=1}^{m} P(a_j \mid c)$, where $P(a_j \mid c) = n(c, a_j) / \sum_{a'} n(c, a')$
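Plugging the concept-attribute counts above into the conceptualize() sketch from slide 33 (the priors here come from the attribute totals, an assumption made for illustration):

```python
n = {("university", "website"): 4568, ("university", "city"): 2343,
     ("country", "capital"): 4345, ("country", "population"): 3234}
totals = {"university": 4568 + 2343, "country": 4345 + 3234}

print(conceptualize(["capital", "population"], n, totals))
# "country" outranks "university" given the attributes capital + population
```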

  35. Examples • Concepts related to the entities “china” and “india” • Concepts related to the attributes “language” and “population”

  36. When the Type of a Term Is Unknown: • Given a set of terms $T = \{t_1, \dots, t_n\}$ with unknown types, where each type is either $e$ (“entity”) or $a$ (“attribute”) • Generative model: using the naïve Bayes rule gives $P(c \mid T) \propto P(c) \prod_{i=1}^{n} \big( P(t_i, e \mid c) + P(t_i, a \mid c) \big)$ • Discriminative model (Noisy-OR): using the Bayes rule twice gives $P(c \mid T) = 1 - \prod_{i=1}^{n} \big( 1 - P(c \mid t_i) \big)$, with $P(c \mid t_i) = P(c \mid t_i, e)\,P(e \mid t_i) + P(c \mid t_i, a)\,P(a \mid t_i)$
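A sketch of the Noisy-OR combination, with illustrative per-term posteriors P(c | t_i) that are assumed to already mix the entity and attribute readings of each term (the numbers are invented, not Probase scores):

```python
def noisy_or(per_term_posteriors):
    """P(c | t1..tn) = 1 - prod_i (1 - P(c | t_i))."""
    miss = 1.0
    for p in per_term_posteriors:
        miss *= 1.0 - p   # probability that term t_i fails to signal c
    return 1.0 - miss

# c = "emerging market"; terms = china, india, language, population
print(round(noisy_or([0.30, 0.25, 0.05, 0.10]), 2))  # 0.55
```

Unlike the generative product, a single irrelevant term cannot drive the score to zero; each term can only add evidence for the concept.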

  37. Examples • Given “china”, “india”, “language”, and “population”, the concept “emerging market” will be ranked 1st

  38. Example (Cont’d)

  39. Clustering Twitter Messages • Problem 1 (unique concepts): use keywords to retrieve tweets in 3 categories: • 1. Microsoft, Yahoo, Google, IBM, Facebook • 2. cat, dog, fish, pet, bird • 3. Brazil, China, Russia, India • Problem 2 (concepts with subtle differences): use keywords to retrieve tweets in 4 categories: • 1. United States, American, Canada • 2. Malaysia, China, Singapore, India, Thailand, Korea • 3. Angola, Egypt, Sudan, Zambia, Chad, Gambia, Congo • 4. Belgium, Finland, France, Germany, Greece, Spain, Switzerland
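A sketch of the feature step behind this experiment: replace each tweet’s raw words with the concepts they map to, so that a standard clusterer can group “Microsoft” and “Google” tweets even when they share no words. The term-to-concept table here is a toy stand-in for Probase lookups:

```python
TERM_CONCEPTS = {
    "microsoft": {"company", "software company"},
    "google":    {"company", "search engine"},
    "cat": {"animal", "pet"}, "dog": {"animal", "pet"},
    "brazil": {"country", "emerging market"},
    "india":  {"country", "emerging market"},
}

def concept_features(tweet):
    """Union of the concepts of every known term in the tweet."""
    feats = set()
    for word in tweet.lower().split():
        feats |= TERM_CONCEPTS.get(word, set())
    return feats

print(concept_features("Google sues Microsoft"))
# {'company', 'software company', 'search engine'}
```

Turning these sets into binary vectors and running any clustering algorithm (e.g., k-means) over them is one plausible realization of the comparison on the next slide.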

  40. Comparison Results

  41. Other Possible Applications • Query expansion • Information retrieval • Content-based advertising • Short-text classification/clustering • Twitter analysis/search • Text-block tagging; summarization of text surrounding images/videos • Document classification/clustering/summarization • News analysis • Out-of-sample example classification/clustering

  42. © 2011 Microsoft Corporation. All rights reserved. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
