Ontology Construction from Text: Big Picture and OntoGen Tool

Ontology construction from text Blaz Fortuna

Outline • Big picture • OntoGen • Future work

Big picture

Vision: extracting structured information from text • What is “text”? • From single documents to large corpora • different granularity • What is “structured information”? • From topic taxonomies to full-blown ontologies • different expressivity

Available tools • Text mining • … for dealing with large corpora • Natural Language Processing (NLP) • … for dealing with sentence level structure • Machine learning • … for abstracting structure from data (modeling) • … inside of many text mining and NLP algorithms • Visualization • … for user interactions

The Plan [Figure: methods placed along two axes. Axes: granularity (document, corpus) and expressiveness. Items: Semantic Graphs, Template Extraction, Q&A, OntoGen.]

OntoGen

OntoGen • Tool for semi-automatic ontology construction from large text corpora • Integrates several text-mining methods • Clustering • Active learning • Classification • Visualizations • Publicly available at • ontogen.ijs.si [Fortuna, Mladenić, Grobelnik, 2005]

Ontology construction with OntoGen • Semi-automatic • the system provides suggestions and insights into the domain • the user interacts with the parameters of the methods • final decisions are taken by the user • Data-driven • most of the aid provided by the system is based on some underlying data • instances are described by features extracted from the data (e.g. word vectors)

Ontology model in OntoGen • An ontology is a data model representing: • a set of concepts within a domain • the relationships between these concepts • OntoGen models an ontology as a graph/network structure consisting of: • a set of concepts (vertices in a graph), • a set of instances assigned to particular concepts (data records assigned to vertices in a graph) • a set of relationships connecting concepts (directed edges in a graph) • each instance is described by a set of features
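The graph model on this slide can be sketched as a small data structure. This is only an illustration under assumptions: the class names, the "sub-concept-of" label and the example concepts are hypothetical and are not taken from OntoGen itself.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A data record (e.g. a document) described by a set of features."""
    name: str
    features: dict = field(default_factory=dict)


@dataclass
class Concept:
    """A vertex of the ontology graph, holding the instances assigned to it."""
    name: str
    instances: list = field(default_factory=list)


@dataclass
class Ontology:
    """Concepts plus directed, labelled edges between them."""
    concepts: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (source, label, target)

    def add_concept(self, name):
        self.concepts[name] = Concept(name)
        return self.concepts[name]

    def relate(self, source, label, target):
        # Both endpoints must already be vertices of the graph.
        if source not in self.concepts or target not in self.concepts:
            raise KeyError("both concepts must exist before relating them")
        self.relations.append((source, label, target))


# Hypothetical example content, chosen only for illustration.
onto = Ontology()
onto.add_concept("Mathematics")
sub = onto.add_concept("Numerical methods")
sub.instances.append(Instance("d2", {"numerical": 1.0, "analysis": 1.0}))
onto.relate("Numerical methods", "sub-concept-of", "Mathematics")
```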

Example of a Topic Ontology

Instance representation • Bag of words: • Vocabulary: {w_i | i = 1, …, N} • Documents are represented with vectors (word space): • Example: • Document set: • d1 = “Canonical Correlation Analysis” • d2 = “Numerical Analysis” • d3 = “Numerical Linear Algebra” • Vocabulary: • {“Canonical”, “Correlation”, “Analysis”, “Numerical”, “Linear”, “Algebra”} • Document vector representation: • x1 = (1, 1, 1, 0, 0, 0) • x2 = (0, 0, 1, 1, 0, 0) • x3 = (0, 0, 0, 1, 1, 1)
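The example on this slide can be reproduced with a few lines of code. This is a minimal sketch that assumes whitespace tokenization and raw word counts; the function names are illustrative and not part of OntoGen.

```python
def build_vocabulary(docs):
    """Collect words in order of first appearance."""
    vocab = []
    for doc in docs:
        for word in doc.split():
            if word not in vocab:
                vocab.append(word)
    return vocab


def to_vector(doc, vocab):
    """Count how often each vocabulary word occurs in the document."""
    words = doc.split()
    return tuple(words.count(w) for w in vocab)


docs = [
    "Canonical Correlation Analysis",
    "Numerical Analysis",
    "Numerical Linear Algebra",
]
vocab = build_vocabulary(docs)
vectors = [to_vector(d, vocab) for d in docs]
```

With the three documents above, `vectors` matches x1, x2 and x3 from the slide.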

Basic idea behind OntoGen [Figure: a text corpus is turned into an ontology. Node labels: Domain, Concept A, Concept B, Concept C.]

Concept discovery – unsupervised • Clustering based approach • K-means clustering of the instances • Clusters offered as suggestions • User selects relevant suggestions
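The clustering step can be sketched as plain k-means over the instance vectors. The sketch below assumes squared Euclidean distance and random initial centroids; the distance measure and initialization actually used by OntoGen are not stated on the slide, and the four sample points are hypothetical.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Return (assignment, centroids) after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Hypothetical bag-of-words vectors forming two obvious groups.
points = [
    (1, 1, 1, 0, 0, 0),
    (1, 1, 0, 0, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 0, 1, 1, 0),
]
assignment, centroids = kmeans(points, k=2)
```

Each resulting cluster would then be shown to the user as a candidate sub-concept.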

Concept discovery – unsupervised • Visualization based • Topic-landscape based visualization • Each instance is shown as one yellow point on the map • Similar instances appear closer together • User can make a concept by selecting a region of the map • Pink points on the map are selected instances

Concept discovery – supervised • Active learning based approach • User enters a query • System ranks the instances according to the query • User labels instances: • Yes – belongs to the concept • No – does not belong to the concept • Once there are enough instances, system switches to SVM based active learning • When done, concept added to the ontology.
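The two phases on this slide can be sketched as follows. The first function ranks instances by cosine similarity to the query; the second picks the unlabelled instance closest to a linear decision boundary, which is a common selection rule for SVM-based active learning. Both choices are assumptions for illustration, and the weight vector is a hypothetical stand-in for a trained linear SVM, not OntoGen's actual model.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_query(query_vec, instances):
    """Indices of instances, most similar to the query first."""
    return sorted(
        range(len(instances)),
        key=lambda i: cosine(query_vec, instances[i]),
        reverse=True,
    )


def most_uncertain(weights, bias, instances, labelled):
    """Index of the unlabelled instance closest to the hyperplane w.x + b = 0."""
    candidates = [i for i in range(len(instances)) if i not in labelled]
    return min(
        candidates,
        key=lambda i: abs(sum(w * x for w, x in zip(weights, instances[i])) + bias),
    )


instances = [
    (1, 1, 1, 0, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (0, 0, 0, 1, 1, 1),
    (0, 0, 1, 0, 0, 0),
]
query = (0, 0, 0, 1, 0, 0)  # the single query word "Numerical"
ranking = rank_by_query(query, instances)

# Hypothetical linear model, standing in for an SVM trained on the labels so far.
weights, bias = (-1.0, -1.0, 0.5, 1.0, 0.0, 0.0), 0.0
next_to_label = most_uncertain(weights, bias, instances, labelled={0, 1})
```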

Concept discovery – supervised • Classification based approach • Instances are classified into a background ontology • called OntoLight • Concepts with the most instances provided as sub-concept suggestions

Concept naming – unsupervised • Automatic extraction of keywords, for describing the concepts • First approach based on TFIDF weights of words • Second approach based on SVM based feature selection algorithm
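The first approach can be sketched as ranking the words of a concept's documents by TF-IDF. This sketch assumes one common weighting, term frequency times log(N / document frequency), and assumes the concept's documents are part of the full collection; the exact weighting used by OntoGen is not given on the slide.

```python
import math
from collections import Counter


def tfidf_keywords(concept_docs, all_docs, top=3):
    """Top keywords for a concept, ranked by summed TF-IDF weight."""
    n = len(all_docs)
    df = Counter()
    for doc in all_docs:
        df.update(set(doc.lower().split()))  # document frequency
    tf = Counter()
    for doc in concept_docs:
        tf.update(doc.lower().split())  # term frequency inside the concept
    scores = {w: tf[w] * math.log(n / df[w]) for w in tf}
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]


all_docs = [
    "Canonical Correlation Analysis",
    "Numerical Analysis",
    "Numerical Linear Algebra",
]
keywords = tfidf_keywords(all_docs[1:], all_docs)
```

Words that occur in every document get weight zero, so the extracted keywords favour terms that distinguish the concept from the rest of the corpus.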

Concept naming – supervised • Classification based approach • Concept’s instances are classified into a background ontology • called OntoLight • Names from the background ontology with the most classified instances are provided as suggestions • Shows what the concept is called in some pre-defined vocabulary

Concept visualization Instances are visualized as points on a 2D map. The distance between two instances on the map corresponds to their similarity. Characteristic keywords are shown for all parts of the map. User can select groups of instances on the map to create sub-concepts.

Ontology visualization Ontology concepts visualized as points on the 2D topic map. Topic map generated from a set of text documents.

Multiple views of the same data • Simple taxonomy on top of Reuters news articles • Two different views (Topics view, Countries view): one focuses on topics, one focuses on geography • Each view yields a different taxonomy on the data • An SVM based method detects the importance of keywords for each view • Example articles: • “Lloyd’s CEO questioned in recovery suit in U.S.”: Ronald Sandler, chief executive of Lloyd's of London, on Tuesday underwent a second day of court interrogation about … • “UK takeovers and mergers”: The following are additions and deletions to the takeovers and mergers list for the week beginning August 19, as provided by the Takeover …

Word weight learning • The word weight learning method is based on SVM feature selection. • Besides ranking the words it also assigns them weights based on the SVM classifier. Notation: • N – number of documents • {x_1, …, x_N} – documents • C(x_i) – set of categories for document x_i • n – number of words • {w_1, …, w_n} – word weights • {n^j_1, …, n^j_n} – SVM normal vector for the j-th category Algorithm: • Calculate linear SVM classifier for each category • Calculate word weights for each category from SVM normal vectors. Weight for i-th word and j-th category is: • Final word weights are calculated separately for each document:

Relations – preprocessing • Named-entity profile • Sentences extracted from the articles in which the named entity appears • Example: Agassi • Olympic champion Agassi meets of Morocco in the first round. • Co-occurrence profiles • Sentences extracted from the articles in which two named entities appear together • Example: Sampras – Agassi • There will be no repeat of last year's men's final with eighth-ranked Agassi landing in Sampras's half of the draw. • Relationship • By extracting keywords from a co-occurrence profile we get a summary of the relationship between two named entities. • Keywords are extracted from the bag-of-words vectors of the co-occurrence profile
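Building the two kinds of profiles can be sketched as a sentence filter. The sketch assumes a naive punctuation-based sentence splitter and plain substring matching for entity mentions; a real pipeline would rely on a named-entity recognizer. The second sentence of the sample article is made up for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def entity_profile(articles, entity):
    """All sentences that mention the entity."""
    return [s for a in articles for s in split_sentences(a) if entity in s]


def cooccurrence_profile(articles, first, second):
    """All sentences that mention both entities."""
    return [
        s
        for a in articles
        for s in split_sentences(a)
        if first in s and second in s
    ]


articles = [
    "There will be no repeat of last year's men's final with eighth-ranked "
    "Agassi landing in Sampras's half of the draw. "
    "Sampras plays on Monday."  # hypothetical sentence, not from the corpus
]
sampras = entity_profile(articles, "Sampras")
pair = cooccurrence_profile(articles, "Sampras", "Agassi")
```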

Relations – example Bill Clinton: • Iraq [476] • president, missiles, attacks, Kurdish, northern • Bob Dole [294] • republican, president, presidential, candidates, poll • United States [204] • president, Monday, southern, move, election • White House [146] • president, spokesman, reporters, Friday, campaign • Iran [74] • president, investment, gas, law, penalize • Congress [66] • president, calling, billion, republican, democrat • Chicago [42] • president, conventional, democrat, drug, campaign • Al Gore [40] • president, vice, bus, tour, election Chicago: • Clinton [236] • conventional, democrat, training, day, campaign • U.S. [164] • trader, markets, purchasers, index, future • New York [100] • variety, mixed, critical, poll, bulletproof • Dole [70] • conventional, democrat, campaign, drug, Sunday • Kansas City [70] • basis, wheat, bushels, fob, red • Los Angeles [60] • variety, mixed, critical, poll, stg • Illinois [34] • democrat, state, conventional, trip, mayor • Chicago Board of Trade [34] • future, deliverable, stocks, bus, reporters • San Francisco [34] • operations, municipal, full, remain, services • Boston [32] • fared, comparatively, game, existed, American

Relations – abstraction • Clustering of named entities using k-means clustering • Relations between clusters are established based on the co-occurrence profiles of the named entities: • Let C1 and C2 be two clusters • Let p_ij be the co-occurrence profile between named entities d_i and d_j • P = {p_ij | d_i from C1 and d_j from C2} • The relation is defined by the profile set P • A summary of the relation is extracted from the centroid vector of the profiles in P
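The abstraction step can be sketched as follows: collect the profiles whose two entities fall into the two clusters, average their bag-of-words vectors into a centroid, and report the heaviest words. The sketch assumes raw word counts as weights and treats entity pairs as unordered; the profile counts below are hypothetical.

```python
from collections import Counter


def relation_summary(cluster1, cluster2, profiles, top=3):
    """profiles maps an entity pair (a, b) to a bag-of-words Counter."""
    # Profile set P: pairs with one entity in each cluster (either order).
    selected = [
        vec
        for (a, b), vec in profiles.items()
        if (a in cluster1 and b in cluster2) or (a in cluster2 and b in cluster1)
    ]
    if not selected:
        return []
    total = Counter()
    for vec in selected:
        total.update(vec)
    # Centroid: mean weight of each word over the selected profiles.
    centroid = {w: c / len(selected) for w, c in total.items()}
    return sorted(centroid, key=lambda w: (-centroid[w], w))[:top]


# Hypothetical co-occurrence profiles, for illustration only.
profiles = {
    ("Bosnia", "Washington"): Counter(peace=3, election=2, war=1),
    ("Sarajevo", "United States"): Counter(peace=1, election=2, spokesman=1),
    ("Russia", "France"): Counter(meeting=4),
}
c1 = {"Bosnia", "Bosnian", "Sarajevo"}
c3 = {"Washington", "United States"}
summary = relation_summary(c1, c3, profiles)
```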

Relations – example • Example of clusters: • Cluster 1: • Named entities: Bosnia, Bosnian, Sarajevo • Keywords: serbs, moslems, bosnian, election • Cluster 2: • Named entities: Russia, Britain, Germany, France • Keywords: meeting, country, government, told • Cluster 3: • Named entities: Washington, United States • Keywords: spokesman, military, missiles • Example of relations: • Cluster 1 vs. Cluster 3: • Named entities: U.N., U.S., American, Washington, Bosnia, Turkey, Richard Holbrooke, U.N. Security Council, White House • Keywords: election, serb, war, bosnians, moslem, peace, tribunal, police, spokesman, crime • Cluster 1 vs. Cluster 2: • Named entities: NATO, Yugoslavia, Bosnia, Croatia, Serbia, Belgrade, Balkan, OSCE, Burns • Keywords: country, election, state, international, peace, meeting, secretary, foreign, talks, member

Relations – example [Figure: named-entity clusters and keyword lists from a relation graph.] Entity clusters: • Hashimoto, Romano Prodi, Benjamin Netanyahu, Jim Bolger • Bill Clinton, Jacques Chirac, Suharto, Hosni Mubarak, Leonid Kuchma • Russia, Britain, Germany, France, China, EU • Supreme Court, U.S. District Court, Simpson, Justice Department • Tennessee Valley Authority, New Hill, TVA, Florida Power & Light Co, St Lucie Keyword lists: • minister, prime, meeting, foreign, talks, president, peace, visit, told, officials • president, meeting, visit, talks, leaders, minister, secretary, officials, state • meeting, country, government, told, officials, union, minister, secretary, trade, report • courts, case, year, told, rules, trials, charges, sentenced, law, file • plant, powerful, company, venture, electrical, projects, million, joint, province, state

Relations – example [Figure: abstracted relation graph. Labels: Minister, President, Visit, Visit, Country, Rule, Invest, Court, Power plant.]

Evaluation • First prototype was successfully used: • Applied in multiple domains: • business, legislation and digital libraries (SEKT project) • Users were always domain experts • with limited knowledge and experience with ontology construction / knowledge engineering • Feedback from the first trials used as input for the second prototype • the one presented here • User study performed for the second prototype • Main impression • the tool saves time • is especially useful when working with large collections of documents • Main disadvantages • abstraction • unattractive interface design • Used in several EU projects • SWING, TAO, NEON, ECOLEAD, E4, TOOLEAST

From the users • Many users use the program for exploration • New York Times uses it for • analyzing user comments, • segmenting website users • Also used by people from: • Microsoft • Honda, Japan • Siemens Austria • University of Washington • University of Melbourne, Victoria, Australia • FIAT CRF, Italy • University of Haifa, Israel • Motilal Nehru National Institute of Technology, India • Slovenian Army • Shanghai Jiao Tong University, China • University of Cyprus • Mehiläinen Medical Center, Finland • Food Safety Division, Alberta Agriculture, Canada • University of South Carolina • National Institute of Telecommunications, Poland • Katholieke Universiteit Leuven • University of Amsterdam • Txt eSolutions, Italy • Insiel, Italy • AMI communities (~1500 development engineers) • Virtuelefabrik • Avtomobilski grozd • University of Nova Gorica • ISOIN (cluster of 1600 companies, suppliers for Airbus)

Future work

Move towards finer granularity • Semantic graphs • Extract data points at the sentence level • OntoGen does it at the document level • Based on triplets extracted from sentence structure • Subject • Predicate • Object • Extraction can be done with • Parsers • Structured learning • Triplets from one document can be merged into semantic graphs • Stronger than bag-of-words • Example application: • Document summarization
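Merging triplets into a semantic graph can be sketched as building an adjacency structure keyed by subject, so that triplets sharing an entity end up attached to the same node. The triplets below are hypothetical; in practice they would come from a parser or a structured-learning extractor, which this sketch does not cover.

```python
from collections import defaultdict


def build_semantic_graph(triplets):
    """Map each subject to its outgoing (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in triplets:
        edge = (predicate, obj)
        if edge not in graph[subject]:  # skip triplets repeated across sentences
            graph[subject].append(edge)
    return dict(graph)


# Hypothetical triplets extracted from the sentences of one document.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("city", "has", "buildings"),
    ("earthquake", "hits", "city"),
]
graph = build_semantic_graph(triplets)
```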

Template extraction • Hypothesis: • People view events through “templates” • Models of how things evolve, relate • Use these models to understand, predict • Goal: • automatic extraction of such models from texts

Search over triplets • Triplet extraction was run over the Reuters corpus • 800k news articles from 1996 to 1997
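A search over extracted triplets can be sketched as pattern matching in which any of the three slots may be left open. This is an assumed query model for illustration, with hypothetical triplets; the slide does not describe how the actual search is implemented.

```python
def search_triplets(triplets, subject=None, predicate=None, obj=None):
    """Return triplets matching the pattern; None acts as a wildcard."""
    pattern = (subject, predicate, obj)
    return [
        t
        for t in triplets
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


# Hypothetical triplets standing in for the output of the extraction step.
triplets = [
    ("earthquake", "hits", "city"),
    ("earthquake", "kills", "people"),
    ("storm", "hits", "coast"),
]
hits = search_triplets(triplets, predicate="hits")
quake = search_triplets(triplets, subject="earthquake")
```

Grouping the results of such queries by predicate is one way to see which relations recur around a subject.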


Template: earthquake [Figure: template graph for Earthquake. Labels: Places, Government, Time-period, Hits, Measured by, Hits in, Registered in, Kills, Richter scale, Collapses, People, Buildings.]

Thank you! Questions?
