Extracting Relations from XML Documents

C. T. Howard Ho, Joerg Gerhardt, Eugene Agichtein*, Vanja Josifovski. IBM Almaden and Columbia University*

Extraction for Data Integration: Motivating Example • [Figure: two schema trees labeled External Schema and Native Schema. Surviving node labels: Products (video, books, music), item (title, author, publisher, ISBN, price); Publications, book (title, author, publisher, ISBN, price).]

Why Extract Data from XML? • XML query processing is still in development and is not yet as fast as an RDBMS • Relational query processing is still the standard for many business applications • Extracting into one relational schema avoids the overhead of XML data integration at run time • Extracted relations are best exploited for relatively static data (e.g., product catalogs)

Related Work • XTRACT (induces DTDs) • Lore/DataGuides • HTML Wrappers (LixTo, RoadRunner, WHISK, STALKER, … ) • Plain Text Information Extraction (Proteus, Snowball, Rapier) • Supervised/Assisted XML Schema Mapping (e.g., Clio)

Outline • Motivation • Problem statement • XMLMiner approach • Training XMLMiner • Extraction from new documents • Some observations from the prototype • Summary

Problem Statement • Given a target flat relation R, extract the information for the tuples of R from XML (or HTML) documents whose schemas may vary significantly. • Problems with current integration/extraction approaches: • Hard-coding the rules/queries requires significant effort, and the resulting rules can be brittle. • An XML Schema or DTD is not always provided.

XMLMiner Approach • Learn signatures from example XML documents • Represent document structure while maintaining flexibility (to allow schema variations) • Assume that a tuple in the target relation corresponds to a subtree rooted at an instance node. (The subtree may contain more detailed information about the tuple than is needed.) • Represent input document nodes as vectors, and then find the closest (i.e., most similar) instance node vector • Use labels and data values to map children of the instance node to target tuple attributes

XMLMiner Architecture: Training and Extraction • [Figure: architecture diagram. Surviving labels: Canonical Tree, Canonical Tree.]

High Level Description • Training: • Each XML document is merged/split into a schema-like tree, called the canonical tree • The user identifies the attribute nodes (under the instance node) that correspond to the target tuple attributes • The system derives the instance node in the tree • Build a model for the structure of the tuple and of each attribute • Extracting: • Apply the model to find the most likely instance node and attribute nodes in the new XML documents

Training Stage I: Create Canonical Tree for each Example Document

Canonical Form Conversion Example: Merging Similar Nodes • [Figure: Original Document Structure next to the “Merged” Document] • Merge all siblings with the same label (e.g., Item → Item*) • Intuition: Siblings with the same label represent “similar” entities.
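
The merging step above can be sketched in a few lines of Python. This is a minimal illustration, not the XMLMiner implementation: the function names `canonical` and `canonical_tree`, the tuple representation `(label, children)`, and the rule that a label gets a trailing `*` when some parent holds more than one child with that tag are all assumptions made for the example. The split step for heterogeneous nodes is not covered.

```python
import xml.etree.ElementTree as ET


def canonical(nodes):
    """Merge the children of a group of same-label nodes (hypothetical helper).

    Returns a list of (label, children) pairs, one per distinct child tag.
    """
    pooled = {}       # child tag -> children with that tag, across the whole group
    repeated = set()  # tags that occur more than once under a single parent
    for node in nodes:
        counts = {}
        for child in node:
            pooled.setdefault(child.tag, []).append(child)
            counts[child.tag] = counts.get(child.tag, 0) + 1
        repeated.update(tag for tag, n in counts.items() if n > 1)
    return [
        (tag + "*" if tag in repeated else tag, canonical(members))
        for tag, members in pooled.items()
    ]


def canonical_tree(root):
    """Return the merged, label-only tree for a parsed document."""
    return (root.tag, canonical([root]))


doc = ET.fromstring(
    "<Products><Books>"
    "<Item><Title/><Price/></Item>"
    "<Item><Title/><ISBN/></Item>"
    "</Books></Products>"
)
tree = canonical_tree(doc)
```

Here the two `Item` siblings collapse into one `Item*` node whose children are the union of the children seen under either item, which is what makes optional children such as `ISBN` survive the merge.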

Example: Split Heterogeneous Nodes → Canonical Form • [Figure: Canonical Tree]

Training Stage I Result: Canonical Tree • [Figure: Original Document next to its Canonical Form]

Training Stage II: Generate Instance Node Signatures • Features used to create the signature of an instance node I (item) in the canonical tree: • A: Ancestors of I • S: Siblings of I • C: Descendants of I • I: Self (the tag of I) • Siblings and ancestors → position of I in the document • Descendants → internal structure of I

Training Stage (cont.): Example Instance Node Signature • Signature (A, S, C, I) for Item: [ A: { “Products”, “Books” }, S: { “Category_Desc” }, C: { “Title”, “Author”, “Publisher”, “New”, “Used”, “ISBN”, “Price”, “Num_Copies” }, I: { “Item” } ]
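
As a rough sketch of how such a signature could be collected, the Python below gathers the four tag sets for one node. The function name `instance_signature` and the sample document are hypothetical; the document was constructed only so that the result matches the signature shown on the slide, and term weights are left out.

```python
import xml.etree.ElementTree as ET


def instance_signature(root, node):
    """Collect tag sets A (ancestors), S (siblings), C (descendants), I (self)."""
    # ElementTree keeps no parent pointers, so build a child -> parent map first.
    parent = {child: p for p in root.iter() for child in p}

    ancestors = set()
    current = node
    while current in parent:
        current = parent[current]
        ancestors.add(current.tag)

    siblings = set()
    if node in parent:
        siblings = {s.tag for s in parent[node] if s is not node}

    descendants = {d.tag for d in node.iter() if d is not node}
    return {"A": ancestors, "S": siblings, "C": descendants, "I": {node.tag}}


# Hypothetical document, shaped to reproduce the slide's example signature.
root = ET.fromstring(
    "<Products><Books><Category_Desc/>"
    "<Item><Title/><Author/>"
    "<Publisher><New><ISBN/><Price/><Num_Copies/></New><Used/></Publisher>"
    "</Item></Books></Products>"
)
sig = instance_signature(root, root.find(".//Item"))
```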

Signature Similarity • Vector space model, TF*IDF weights for terms • Incorporates structure (similarity-by-region) • SX: [ A: { “Products”: 1 }, S: { “Music”: 0.33, “Video”: 0.33 }, C: { “Title”: 0.33, “Author”: 0.33, “Publisher”: 0.33, “New”: 0.2, “Used”: 0.2, “ISBN”: 0.6, “Price”: 0.2, “Copies”: 0.5 }, I: { “Item” } ] • SY: [ A: { “Products”: 1, “Books”: 0.5 }, S: { “CDs”: 0.5 }, C: { “Title”: 0.33, “Author”: 0.33, “Publisher”: 0.33, “ISBN”: 0.6, “Price”: 0.2, “Copies”: 0.5 }, I: { “Book” } ] • Similarity(SX, SY) = SX.A * SY.A + SX.S * SY.S + SX.C * SY.C + SX.I * SY.I
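
The region-wise sum can be worked through on the two example signatures. The sketch below treats each region as a sparse vector (a dict from tag to weight) and takes the weights as given rather than computing TF*IDF. The slide shows no weight for the self region I, so a weight of 1.0 is assumed there; `region_dot` and `similarity` are illustrative names.

```python
REGIONS = ("A", "S", "C", "I")


def region_dot(x, y):
    """Dot product of two sparse vectors stored as {tag: weight} dicts."""
    return sum(w * y[tag] for tag, w in x.items() if tag in y)


def similarity(sx, sy):
    """Sum of the per-region dot products, as in the slide's formula."""
    return sum(region_dot(sx[r], sy[r]) for r in REGIONS)


SX = {
    "A": {"Products": 1.0},
    "S": {"Music": 0.33, "Video": 0.33},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33, "New": 0.2,
          "Used": 0.2, "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Item": 1.0},  # weight 1.0 is an assumption; the slide gives none
}
SY = {
    "A": {"Products": 1.0, "Books": 0.5},
    "S": {"CDs": 0.5},
    "C": {"Title": 0.33, "Author": 0.33, "Publisher": 0.33,
          "ISBN": 0.6, "Price": 0.2, "Copies": 0.5},
    "I": {"Book": 1.0},  # same assumption
}
score = similarity(SX, SY)
```

With these numbers the ancestor region contributes 1.0 (the shared “Products”), the descendant region contributes about 0.9767, and the sibling and self regions contribute nothing, so the score is about 1.9767.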

Training Stage III: Attribute Signatures • Structural + data signature S(D, A, S, C, I) • Data signature D for the values of R.X (e.g., can be a histogram of the values of X) • Structure signature for attribute X: (A; S; C; I): • Similar to the instance signature • Original instance node → “document” root • A → ancestors (Item, Publisher, New) • I → self (ISBN) • S → siblings (Price, NumCopies) • C → null

Outline • Motivation • Problem statement • XMLMiner approach • Training XMLMiner • Extraction from new documents • XMLMiner prototype • Summary

Extraction Stage • Assumption: Input documents have internal regularity • Compute canonical tree for some of the input documents • Build signature of each node in the canonical form, and compute similarity with known instance node signatures • Map descendants of highest scoring node to attributes of target table using attribute signatures

Extraction I: Represent test documents in canonical form • Intuition: • Robustness (allows “optional” nodes) • Efficiency: the canonical form has fewer nodes than the original tree • [Figure: Test Document (Publications with repeated book nodes; children editor, title, author, publisher, ISBN, price) next to its Canonical Form (Publications, book*, editor, title, author, publisher, ISBN, price)]

Extraction II: Find Instance Node in Canonical Tree • For each node K in the canonical tree CT: • Compute the signature SK of K • Compute the score of K as Similarity(SK, SI), where SI is the signature of instance node I from training • The node with the highest score is the instance node in CT • [Figure: canonical tree of the test document: Publications, book*, editor, title, author, publisher, ISBN, price]
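
The search loop above might look like the following sketch. It makes several simplifying assumptions that are not in the slides: tags are lower-cased before comparison, each region is a plain tag set with binary weights, and the per-region score is a cosine over those binary vectors, standing in for the TF*IDF-weighted similarity. The names `node_signatures`, `similarity`, and `find_instance_node` are illustrative, and the input is assumed to be in canonical form already.

```python
import math
import xml.etree.ElementTree as ET

REGIONS = ("A", "S", "C", "I")


def node_signatures(root):
    """Yield (node, signature) for every node; a signature maps region -> tag set."""
    parent = {child: p for p in root.iter() for child in p}
    for node in root.iter():
        ancestors, current = set(), node
        while current in parent:
            current = parent[current]
            ancestors.add(current.tag.lower())
        siblings = set()
        if node in parent:
            siblings = {s.tag.lower() for s in parent[node] if s is not node}
        descendants = {d.tag.lower() for d in node.iter() if d is not node}
        yield node, {"A": ancestors, "S": siblings,
                     "C": descendants, "I": {node.tag.lower()}}


def similarity(sk, si):
    """Sum over regions of the cosine between binary tag vectors (an assumption)."""
    total = 0.0
    for r in REGIONS:
        if sk[r] and si[r]:
            total += len(sk[r] & si[r]) / math.sqrt(len(sk[r]) * len(si[r]))
    return total


def find_instance_node(root, si):
    """Return the node whose signature scores highest against the trained one."""
    return max(node_signatures(root), key=lambda pair: similarity(pair[1], si))[0]


# Trained signature for Item (lower-cased), taken from the training example.
SI = {
    "A": {"products", "books"},
    "S": {"category_desc"},
    "C": {"title", "author", "publisher", "new", "used",
          "isbn", "price", "num_copies"},
    "I": {"item"},
}
test_doc = ET.fromstring(
    "<Publications><book><editor/><title/><author/>"
    "<publisher/><ISBN/><price/></book></Publications>"
)
best = find_instance_node(test_doc, SI)
```

In this toy case only the descendant region overlaps. The `book` node wins over `Publications` because its descendant set is smaller for the same overlap, which is why some form of normalization is used here instead of a raw overlap count.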

Extraction III: Map children of instance node to attributes • For each node J in the subtree rooted at K • For each attribute X of R • ASJ = attribute signature of J • ASX = attribute signature of X • Compute the score of J as Similarity(ASJ, ASX) • Pick the mapping that maximizes the product of the scores over the attributes of R • [Figure: canonical subtree under book*: editor, title, author, publisher, ISBN, price]
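
One simple way to pick the mapping is an exhaustive search, sketched below. The slides do not say how the maximization is carried out, so the brute-force enumeration, the one-to-one restriction (each node is used for at most one attribute), the function name `best_mapping`, and the score values are all assumptions for illustration. It needs at least as many candidate nodes as attributes and is only practical for small subtrees.

```python
from itertools import permutations
from math import prod


def best_mapping(scores, attributes, nodes):
    """Return (mapping, product) maximizing the product of per-attribute scores.

    scores maps (node, attribute) -> similarity; missing pairs count as 0.
    """
    best, best_score = None, -1.0
    # Try every ordered choice of distinct nodes, one per attribute.
    for chosen in permutations(nodes, len(attributes)):
        score = prod(scores.get((n, a), 0.0) for n, a in zip(chosen, attributes))
        if score > best_score:
            best, best_score = dict(zip(attributes, chosen)), score
    return best, best_score


# Made-up similarity scores between subtree nodes and target attributes.
scores = {
    ("title", "Title"): 0.9, ("editor", "Title"): 0.6, ("price", "Title"): 0.1,
    ("title", "Price"): 0.1, ("editor", "Price"): 0.1, ("price", "Price"): 0.8,
}
mapping, total = best_mapping(scores, ["Title", "Price"], ["editor", "title", "price"])
```

Because the objective is a product, a single near-zero score rules out a whole mapping, so one badly matched attribute cannot be compensated by the others.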

Extraction IV: Generate XPath queries for the new documents • Apply the XPath queries to the “new” XML documents • Simple XPath queries can be handled by the Xerces parser or by a more advanced “streaming parser”
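
To illustrate the last step, the sketch below applies one path for the instance node and one relative path per attribute, and returns one row per instance. It uses Python's `xml.etree.ElementTree`, which supports only a limited XPath subset, as a stand-in for the parsers named on the slide. The function name `extract_tuples`, the paths, and the sample document are hypothetical.

```python
import xml.etree.ElementTree as ET


def extract_tuples(root, instance_path, attribute_paths):
    """Return one dict per instance node, keyed by target attribute name.

    attribute_paths maps attribute name -> path relative to the instance node.
    Attributes whose path matches nothing come back as None.
    """
    rows = []
    for instance in root.findall(instance_path):
        rows.append({attr: instance.findtext(path)
                     for attr, path in attribute_paths.items()})
    return rows


new_doc = ET.fromstring(
    "<Publications>"
    "<book><title>A</title><ISBN>1</ISBN><price>10</price></book>"
    "<book><title>B</title><ISBN>2</ISBN></book>"
    "</Publications>"
)
rows = extract_tuples(
    new_doc, "./book", {"title": "title", "isbn": "ISBN", "price": "price"}
)
```

The second book has no price element, so its row carries `None` for that attribute, which would map naturally to a NULL in the target relation.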

XMLMiner Prototype • Successfully finds the best instance node (“Book”) in the test document

Summary • Partially supervised, low-effort XML relational extraction • Flexible vector space representation that preserves some of the original structure • Can potentially be more robust than current state-of-the-art systems that rely on rules
