
Querying for relations from the semi-structured Web


Presentation Transcript


  1. Querying for relations from the semi-structured Web Sunita Sarawagi, IIT Bombay, http://www.cse.iitb.ac.in/~sunita. Contributors: Rahul Gupta, Girija Limaye, Prashant Borole, Rakesh Pimplikar, Aditya Somani

  2. Web Search • Mainstream web search • User → keyword queries • Search engine → ranked list of documents • 15 glorious years of squeezing all of the user’s search needs into this least common denominator • Structured web search • User → natural language queries or structured queries • Search engine → point answers, record sets • Many challenges in understanding both the query and the content • 15 years of slow but steady progress

  3. The Quest for Structure • Vertical structured search engines: Structure → Schema → Domain-specific • Shopping: ShopBot (Etzioni + 1997) • Product name, manufacturer, price • Publications: CiteSeer (Lawrence, Giles + 1998) • Paper title, author name, email, conference, year • Jobs: FlipDog, WhizBang! Labs (Mitchell + 2000) • Company name, job title, location, requirements • People: DBLife (Doan 07) • Name, affiliations, committees served, talks delivered • Triggered much research on extraction and IR-style search of structured data (BANKS ‘02).

  4. Horizontal Structured Search • Domain-independent structure • Small, generic set of structured primitives over entities, types, relationships, and properties • <Entity> IsA <Type> • Mysore is a city • <Entity> Has <Property> • <City> Average rainfall <Value> • <Entity1> <related-to> <Entity2> • <Person> born-in <City> • <Person> CEO-of <Company>

  5. Types of Structured Search • Web + People → Structured databases (Ontologies) • Created manually (Psyche), or semi-automatically (Yago) • True Knowledge (2009), Wolfram Alpha (2009) • Web annotated with structured elements • Queries: keywords + structured annotations • Example: <Physicist> +cosmos • Open-domain structure extraction and annotation of web docs (2005—)

  6. Users, Ontologies, and the Web • Users are from Venus • Bi-syllabic, impatient, believe in mind-reading • Ontologies are from Mars • One structure to fit all • Web content creators are from some other galaxy • Ontologies = axvhjizb • Let search engines bring the users

  7. What is missed in Ontologies • The trivial, the transient, and the textual • Procedural knowledge • What do I do on an error? • Huge body of invaluable text of various types • Reviews, literature, commentaries, videos • Context • By stripping knowledge to its skeletal form, context that is so valuable for search is lost • As long as queries are unstructured, the redundancy and variety in unstructured sources is invaluable.

  8. Structured annotations in HTML • IsA annotations • KnowItAll (2004) • Open-domain relationships • TextRunner (Banko 2007) • Ontological annotations • SemTag and Seeker (2003) • Wikipedia annotations (Wikify! 2007, CSAW 2009) • All view documents as a sequence of tokens • Challenging to ensure high accuracy

  9. WWT: Table queries over the semi-structured web

  10. Queries in WWT • Query by content • Query by description

  11. Answer: Table with ranked rows

  12. Keyword search to find structured records • Query: computer science concept, inventor, year • The correct answer is not one click away • Verbose articles, not structured tables • Desired records are spread across many documents • The only document with an unstructured list of some desired records

  13. The only list in one of the retrieved pages

  14. Highly relevant Wikipedia table not retrieved in the top-k • The ideal answer should be integrated from these incomplete sources

  15. Attempt 2: Include samples in the query • Keywords: alan turing machine codd relational database • Known examples • Documents relevant only to the keywords • The ideal answer is still spread across many documents

  16. WWT Architecture [Architecture diagram; components: user query table, Type Inference over the type-system hierarchy, Query Builder, content+context Index, Extractor (CRF models, record labeler), cell and row Resolvers (with offline resolver statistics), Consolidator, Ranker, final consolidated table; offline: annotate web sources against the Ontology, extract record sources into tables T1,…,Tk with source labels L1,…,Lk, build the resolver statistics and the index store.]

  17. Offline: Annotating to an Ontology • Annotate table cells with entity nodes and table columns with type nodes [Example: an ontology fragment with types All, People, Entertainers, Indian_directors, movies, Indian_films, English_films, 2008_films, Terrorism_films; table cells such as A_Wednesday, Black&White, and Coffee_house are linked to entity nodes, where Coffee_house is ambiguous between the film and the location.]

  18. Challenges • Ambiguity of entity names • “Coffee house” is both a movie name and a place name • Noisy mentions of entity names • Black&White versus Black and White • Multiple labels • The Yago ontology has an average of 2.2 types per entity • Missing type links in the ontology: cannot use the least common ancestor • Missing link: Black&White to 2008_films • Not a missing link: 1920 to Terrorism_films • Scale: • Yago has 1.9 million entities, 200,000 types

  19. A unified approach • A graphical model jointly labels cells and columns to maximize the sum of scores, where ycj = entity label of cell c of column j and yj = type label of column j • Score(ycj): string similarity between c and ycj • Score(yj): string similarity between the header of j and yj • Score(yj, ycj) • Subsumed entity: inversely proportional to the distance between them • Outside entity: fraction of overlapping entities between yj and the immediate parent of ycj • Handles missing links: the overlap of 2008_films with 2007_films is zero, but with Indian_films it is non-zero (see the scoring sketch below) [Diagram: type node yj = movies with children English_films, Indian_films, Terrorism_films; cells y1j and y2j are subsumed entities, y3j is an outside entity.]
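As a concrete illustration of the scoring above, here is a minimal, hypothetical Python sketch. The similarity measure, the toy ontology fragment, and the brute-force enumeration used in place of graphical-model inference are all assumptions, not the WWT implementation; the point is only that a column type and the cell entities are chosen jointly to maximize the summed string-similarity and type-compatibility scores.

```python
import difflib
from itertools import product

def sim(a, b):
    """String similarity in [0, 1] (stand-in for WWT's similarity functions)."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Toy ontology fragment: type -> set of member entities (assumed for illustration).
ONTOLOGY = {
    "Indian_films": {"A_Wednesday", "Black_and_White", "Coffee_House_(film)"},
    "2008_films":   {"A_Wednesday", "Coffee_House_(film)"},
    "Locations":    {"Coffee_House_(Loc)"},
}

def type_cell_score(col_type, entity):
    """Score(y_j, y_cj): 1 if the type subsumes the entity, else 0 (the slide's
    overlap heuristic for outside entities is simplified away here)."""
    return 1.0 if entity in ONTOLOGY[col_type] else 0.0

def best_labels(header, cells):
    """Brute-force the joint assignment of one column type and the cell entities."""
    entities = sorted({e for members in ONTOLOGY.values() for e in members})
    best, best_score = None, float("-inf")
    for col_type in ONTOLOGY:
        for assignment in product(entities, repeat=len(cells)):
            score = sim(header, col_type)                                # Score(y_j)
            score += sum(sim(c, e) for c, e in zip(cells, assignment))   # Score(y_cj)
            score += sum(type_cell_score(col_type, e) for e in assignment)
            if score > best_score:
                best, best_score = (col_type, assignment), score
    return best, best_score

print(best_labels("film", ["A Wednesday", "Black&White", "Coffee house"]))
```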

  20. WWT Architecture [Architecture diagram, as on slide 16.]

  21. Extraction: Content queries • Extracting query columns from list records • Query: Q; a source: Li • New York University (NYU), New York City, founded in 1831. • Columbia University, founded in 1754 as King’s College. • Binghamton University, Binghamton, established in 1946. • State University of New York, Stony Brook, New York, founded in 1957 • Syracuse University, Syracuse, New York, established in 1870 • State University of New York, Buffalo, established in 1846 • Rensselaer Polytechnic Institute (RPI) at Troy. • Lists are often human-generated.

  22. Extraction Query: Q Extracted table columns • New York University (NYU), New York City, founded in 1831. • Columbia University, founded in 1754 as King’s College. • Binghamton University, Binghamton, established in 1946. • State University of New York, Stony Brook, New York, founded in 1957 • Syracuse University, Syracuse, New York, established in 1870 • State University of New York, Buffalo, established in 1846 • Rensselaer Polytechnic Institute (RPI) at Troy. Rule-based extractor insufficient. Statistical extractor needs training data. Generating that is also not easy!

  23. Extraction: Labeled data generation • Lists are unlabeled; labeled records are needed to train a CRF • A fast but naïve approach for generating labeled records • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • (Query about colleges in NY; fragment of a relevant list source)

  24. Extraction: Labeled data generation • A fast but naïve approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • Another match for New York; another match for New York University • In the list, look for matches of every query cell.

  25. Extraction: Labeled data generation A fast but naïve approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. 1 2 • In the list, look for matches of every query cell. • Greedily map each query row to the best match in the list

  26. Extraction: Labeled data generation • A fast but naïve approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • 1 2 • Unmapped (hurts recall) • Assumed as ‘Other’ • Wrongly mapped • The hard matching criterion has significantly lower recall • Missed segments • Does not use natural clues like Univ = University • Greedy matching can lead to really bad mappings

  27. Generating labeled data: Soft approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York.

  28. Generating labeled data: Soft approach • 0.9 • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • 0.3 • 1.8 • Match score for each query and source row • Score of the best segmentation of the source row into query columns • Score of a segment s for column c: probability that cell c of the query row is the same as segment s • Computed by the Resolver module based on the type of the column

  29. Generating labeled data: Soft approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • 0.7 • 0.3 • 2.0 • Match score for each query and source row • Score of the best segmentation of the source row into query columns • Score of a segment s for column c: probability that cell c of the query row is the same as segment s • Computed by the Resolver module based on the type of the column (a dynamic-programming sketch of the segmentation score is given below)
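The "score of the best segmentation of the source row into query columns" can be computed with a small dynamic program over token positions. The sketch below is purely illustrative: the tokenizer, the in-order column assumption, and the difflib-based per-segment scorer are assumptions standing in for WWT's resolver-based segment scores.

```python
import difflib
from functools import lru_cache

def best_segmentation_score(tokens, query_cells, segment_score):
    """DP over (token position, next query column to place).
    Each query column may match one contiguous token span (in order) or be
    skipped; tokens left unmatched are treated as 'Other' and contribute 0."""
    n, m = len(tokens), len(query_cells)

    @lru_cache(maxsize=None)
    def dp(i, j):
        if j == m or i == n:        # all columns placed, or row exhausted
            return 0.0
        best = max(dp(i + 1, j),    # token i left unmatched ("Other")
                   dp(i, j + 1))    # column j skipped entirely
        for k in range(i + 1, n + 1):   # column j matches tokens[i:k]
            seg = " ".join(tokens[i:k])
            best = max(best, segment_score(seg, query_cells[j]) + dp(k, j + 1))
        return best

    return dp(0, 0)

# Toy per-segment scorer (assumption): plain string-similarity ratio.
score = lambda seg, cell: difflib.SequenceMatcher(None, seg.lower(), cell.lower()).ratio()

row = "New York Univ. in NYC".split()
print(best_segmentation_score(row, ["New York University", "New York City"], score))
```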

  30. Generating labeled data: Soft approach • New York Univ. in NYC • Columbia University in NYC • Monroe Community College in Brighton • State University of New York in • Stony Brook, New York. • [Bipartite match scores: 0.9, 0.3, 0.7, 1.8, 0.3, 1.8, 2; greedy matching shown in red] • Compute the maximum-weight matching between query rows and source rows (see the sketch below) • Better than greedily choosing the best match for each row • Soft string matching increases the labeled candidates significantly • Vastly improves recall and leads to better extraction models.
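The maximum-weight matching between query rows and source rows can be computed with the Hungarian algorithm; a minimal sketch using SciPy's linear_sum_assignment follows. The score matrix values are made up for illustration.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical match scores: rows = query records, columns = source-list records.
scores = np.array([
    [1.8, 0.3, 0.9],   # query row 0 vs. source rows 0..2
    [0.3, 2.0, 0.7],   # query row 1 vs. source rows 0..2
])

# linear_sum_assignment minimizes total cost, so negate to maximize total score.
q_idx, s_idx = linear_sum_assignment(-scores)
for q, s in zip(q_idx, s_idx):
    print(f"query row {q} -> source row {s}  (score {scores[q, s]:.1f})")
```

Unlike the greedy strategy, this never assigns two query rows to the same source record, which is exactly the failure mode highlighted on the previous slide.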

  31. Extractor • Train a CRF on the generated labeled data (a training sketch is given below) • Feature set • Delimiters, HTML tokens in a window around labeled segments • Alignment features • Collective training of multiple sources
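A minimal training sketch, using the third-party sklearn-crfsuite package as a stand-in for WWT's CRF. The feature template below (lowercased token, a crude delimiter flag, a ±1 token window) and the tiny training set are simplified assumptions; the HTML and alignment features mentioned above are omitted.

```python
import sklearn_crfsuite  # third-party CRF package, used here as a stand-in

def token_features(tokens, i):
    """Simplified feature template for token i of a source record."""
    feats = {
        "lower": tokens[i].lower(),
        "is_delim": tokens[i] in {",", ".", "in", "founded", "established"},
    }
    if i > 0:
        feats["prev"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        feats["next"] = tokens[i + 1].lower()
    return feats

# Tiny labeled set, as would be produced by the soft-matching step (hypothetical).
sentences = ["New York Univ. in NYC".split(), "Columbia University in NYC".split()]
labels = [["College", "College", "College", "Other", "City"],
          ["College", "College", "Other", "City"]]

X = [[token_features(s, i) for i in range(len(s))] for s in sentences]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X, labels)

test = "Binghamton University in Binghamton".split()
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```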

  32. Experiments • Aim: Reconstruct Wikipedia tables from only a few sample rows. • Sample queries • TV Series: Character name, Actor name, Season • Oil spills: Tanker, Region, Time • Golden Globe Awards: Actor, Movie, Year • Dadasaheb Phalke Awards: Person, Year • Parrots: common name, scientific name, family

  33. Experiments: Dataset • Corpus: • 16M lists from 500M pages from a web crawl. • 45% of lists retrieved by index probe are irrelevant. • Query workload • 65 queries. Ground truth hand-labeled by 10 users over 1300 lists. • 27% queries not answerable with one list (difficult). • True consolidated table = 75% of Wikipedia table, 25% new rows not present in Wikipedia.

  34. Extraction performance • Benefits of soft training data generation, alignment features, staged-extraction on F1 score. • More than 80% F1 accuracy with just three query records

  35. Queries in WWT • Query by content • Query by description

  36. Extraction: Description queries • Non-informative headers • No headers

  37. Context to get at relevant tables • Ontological annotations • Context is the union of: text around tables, headers, ontology labels when present [Example: an ontology fragment with types All, People, Chemical_elements, Metals, Non_Metals, Alkali, Gas, and entities such as Hydrogen, Lithium, Sodium, Aluminium, Carbon.]

  38. Joint labeling of table columns • Given • Candidate tables: T1, T2, …, Tn • Query columns: q1, q2, …, qm • Task: label the columns of each Ti with {q1, q2, …, qm, none} to maximize the sum of these scores • Score(T, j, qk) = ontology type match + header string match with qk • Score(T, *, qk) = match of the description of T with qk • Score(T, j, T’, j’, qk) = content overlap of column j of table T with column j’ of table T’ when both are labeled qk • Inference in a graphical model → solved via belief propagation (a sketch of the score structure is given below)
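To make the score structure concrete, here is a small hypothetical sketch that builds unary and pairwise scores for two toy candidate tables and finds the best joint column labeling by brute-force enumeration, standing in for belief propagation. The label set, header-match numbers, and Jaccard overlap are illustrative assumptions.

```python
from itertools import product

LABELS = ["q1_name", "q2_year", "none"]   # query columns plus a 'none' label (assumption)

def unary(table, col, label):
    """Score(T, j, q_k): header/type match with the query column (toy numbers)."""
    return table["header_match"][col].get(label, 0.0)

def pairwise(tA, cA, tB, cB, label):
    """Score(T, j, T', j', q_k): content overlap when both columns share a label."""
    if label == "none":
        return 0.0
    a, b = tA["columns"][cA], tB["columns"][cB]
    return len(a & b) / max(1, len(a | b))   # Jaccard overlap as a stand-in

# Two toy candidate tables (column contents as sets of cell strings).
T1 = {"columns": [{"turing machine", "relational database"}, {"1936", "1970"}],
      "header_match": [{"q1_name": 0.9}, {"q2_year": 0.8}]}
T2 = {"columns": [{"relational database", "lambda calculus"}, {"1970", "1936"}],
      "header_match": [{}, {"q2_year": 0.5}]}
tables = [T1, T2]

def total_score(assignment):
    """assignment maps (table index, column index) -> label."""
    s = sum(unary(tables[t], c, lab) for (t, c), lab in assignment.items())
    for (t, c), lab in assignment.items():
        for (t2, c2), lab2 in assignment.items():
            if t < t2 and lab == lab2:
                s += pairwise(tables[t], c, tables[t2], c2, lab)
    return s

cols = [(t, c) for t, tab in enumerate(tables) for c in range(len(tab["columns"]))]
best = max((dict(zip(cols, labs)) for labs in product(LABELS, repeat=len(cols))),
           key=total_score)
print(best)
```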

  39. WWT Architecture [Architecture diagram, as on slide 16.]

  40. Step 3: Consolidation • Merging the extracted tables into one • Merging duplicates [Figure: extracted tables combined into a single consolidated table]

  41. Consolidation • Challenge: resolving when two rows are the same in the face of • Extraction errors • Missing columns • Open-domain • No training. • Our approach: a specially designed Bayesian Network with interpretable and generalizable parameters

  42. Resolver Bayesian Network [Figure: a Bayesian network with nodes P(RowMatch | rows q, r) and P(i-th cell match | qi, ri) for each cell i = 1, …, n] • Cell-level probabilities • Parameters automatically set using list statistics • Derived from user-supplied type-specific similarity functions (a toy sketch of the row-match computation is given below)
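A toy sketch of how per-cell match probabilities might be combined into P(RowMatch | rows q, r). The naive-Bayes-style conditional parameters and the string-ratio cell similarity are illustrative assumptions, not WWT's automatically learned parameters.

```python
import difflib

def cell_match_prob(q_cell, r_cell):
    """P(cell match | q_i, r_i): a type-specific similarity; a string ratio stands in."""
    if not q_cell or not r_cell:           # missing column: stay uncommitted
        return 0.5
    return difflib.SequenceMatcher(None, q_cell.lower(), r_cell.lower()).ratio()

def row_match_prob(q_row, r_row, w_match=0.9, w_nomatch=0.1):
    """P(RowMatch | rows q, r) under a naive-Bayes-style combination: each cell
    independently emits its similarity given match / non-match.
    w_match and w_nomatch are illustrative parameters."""
    p_match, p_nomatch = 0.5, 0.5          # uniform prior on RowMatch
    for q_cell, r_cell in zip(q_row, r_row):
        s = cell_match_prob(q_cell, r_cell)
        p_match *= w_match * s + (1 - w_match) * (1 - s)
        p_nomatch *= w_nomatch * s + (1 - w_nomatch) * (1 - s)
    return p_match / (p_match + p_nomatch)

print(row_match_prob(["Alan Turing", "Turing machine", "1936"],
                     ["A. Turing", "turing machine", None]))
```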

  43. Ranking • Factors for ranking • Relevance: membership in overlapping sources • Support from multiple sources • Completeness: importance of the columns present • Penalize records with only common ‘spam’ columns like City and State • Correctness: extraction confidence

  44. Relevance ranking on set membership • Weighted sum approach • Score of a set t: s(t) = fraction of query rows in t • Relevance of a consolidated row r: the sum of s(t) over the sets t that contain r • Graph-walk based approach • Random walk from rows to table nodes, starting from the query rows, with random restarts to the query rows (a sketch is given below) [Figure: graph of query rows, tables, and consolidated rows]
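A minimal sketch of the graph-walk approach: a random walk with restart over a toy row/table graph, computed by power iteration. The graph, the restart probability, and the iteration count are assumptions for illustration.

```python
import numpy as np

# Toy bipartite graph: consolidated rows connected to the tables they came from.
nodes = ["row0", "row1", "row2", "row3", "tableA", "tableB"]
edges = [("row0", "tableA"), ("row1", "tableA"), ("row1", "tableB"),
         ("row2", "tableB"), ("row3", "tableB")]
query_rows = {"row0", "row1"}          # the user's example rows

idx = {n: i for i, n in enumerate(nodes)}
A = np.zeros((len(nodes), len(nodes)))
for u, v in edges:                     # undirected edges between row and table nodes
    A[idx[u], idx[v]] = A[idx[v], idx[u]] = 1.0
P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix

restart = np.array([1.0 if n in query_rows else 0.0 for n in nodes])
restart /= restart.sum()

alpha, pi = 0.15, restart.copy()
for _ in range(100):                   # power iteration for the stationary distribution
    pi = alpha * restart + (1 - alpha) * pi @ P

for n in nodes:
    print(f"{n}: {pi[idx[n]]:.3f}")    # rows sharing tables with query rows rank higher
```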

  45. Ranking Criteria • Score(Row r): • Graph relevance of r • Importance of the columns C present in r (high if C functionally determines the others) • Sum of per-cell extraction confidences, where each cell’s confidence is a noisy-OR of the extraction confidences from the individual CRFs (see the sketch below)
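For reference, the noisy-OR combination of independent extraction confidences is just one minus the product of the per-source error probabilities; a minimal sketch with made-up confidence values:

```python
def noisy_or(confidences):
    """Combine independent extraction confidences: the cell is taken as correct
    unless every source's extraction of it is wrong."""
    p_all_wrong = 1.0
    for p in confidences:
        p_all_wrong *= (1.0 - p)
    return 1.0 - p_all_wrong

# A cell extracted from three sources with these CRF confidences (assumed values).
print(noisy_or([0.6, 0.7, 0.5]))   # -> 0.94
```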

  46. Overall performance [Charts: difficult queries vs. all queries] • To justify the sophisticated consolidation and resolution, compare against: • Processing only the (magically known) single best list → no consolidation/resolution required • Simple consolidation, with no merging of approximate duplicates • WWT has > 55% recall and beats both baselines; the gain is bigger for difficult queries.

  47. Running time • < 30 seconds with 3 query records. • Can be improved by processing sources in parallel. • Variance high because time depends on number of columns, record length etc.

  48. Related Work • Google-Squared • Developed independently. Launched in May 2009 • User provides keyword query, e.g. “list of Italian joints in Manhattan”. Schema inferred. • Technical details not public. • Prior methods for extraction and resolution. • Assume labeled data/pre-trained parameters • We generate labeled data, and automatically train resolver parameters from the list source.

  49. Summary • Structured web search & the role of non-text, partially structured web sources • WWT system • Domain-independent • Online: structure interpretation at query time • Relies heavily on unsupervised statistical learning • Graphical model for table annotation • Soft-approach for generating labeled data • Collective column labeling for descriptive queries • Bayesian network for resolution and consolidation • Page rank + confidence from a probabilistic extractor for ranking

  50. What next? • Designing plans for non-trivial ways of combining sources • Better ranking and user-interaction models • Expanding the query set • Aggregate queries: tables are rich in quantities • Point queries: attribute value and relationship queries • Interplay between the semi-structured web and ontologies • Augmenting one with the other • Quantify the information in structured sources vis-à-vis text sources on typical query workloads
