Exploring Data Extraction Techniques and Schema Alignment Roadmap

Data Extraction

Road map • String Matching and Tree Matching • Multiple Alignments • Building DOM Trees • Extraction Given a List Page: Flat Data Records • Extraction Given a List Page: Nested Data Records • Extraction Given Multiple Pages • Summary CS511, Bing Liu, UIC

Some useful algorithms • The key is to find the encoding template from a collection of encoded instances of the same type. • A natural way to do this is to detect repeated patterns in the HTML encoding strings. • String edit distance and tree edit distance are obvious techniques for the task. We describe these techniques.

String edit distance • String edit distance: the most widely used string comparison technique. • The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: • (1) change a letter, • (2) insert a letter, and • (3) delete a letter.

String edit distance (definition)

Dynamic programming

An example • The edit distance matrix and back trace path • alignment
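
The dynamic-programming computation and back trace named on the slides above can be sketched as follows. This is a minimal illustration that assumes unit cost for each of the three point mutations; the function names and the sample strings are hypothetical and do not come from the slides.

```python
def edit_distance(s1: str, s2: str):
    """Return (distance, matrix) for the unit-cost edit distance of s1 and s2."""
    m, n = len(s1), len(s2)
    # d[i][j] = edit distance between s1[:i] and s2[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete i letters
    for j in range(1, n + 1):
        d[0][j] = j  # insert j letters
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,           # delete a letter
                          d[i][j - 1] + 1,           # insert a letter
                          d[i - 1][j - 1] + change)  # change (or keep) a letter
    return d[m][n], d


def back_trace(s1: str, s2: str, d):
    """Follow the matrix from the bottom-right corner to recover one alignment."""
    i, j = len(s1), len(s2)
    top, bottom = [], []
    while i > 0 or j > 0:
        change = 1 if (i > 0 and j > 0 and s1[i - 1] != s2[j - 1]) else 0
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + change:
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


distance, matrix = edit_distance("kitten", "sitting")
aligned_1, aligned_2 = back_trace("kitten", "sitting", matrix)
```

Each column of the returned alignment is either a match, a changed letter, or a gap ("-"), so the number of non-matching columns equals the edit distance.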

Tree Edit Distance • Tree edit distance between two trees A and B (labeled ordered rooted trees) is the cost associated with the minimum set of operations needed to transform A into B. • The set of operations used to define tree edit distance includes three operations: • node removal, • node insertion, and • node replacement. A cost is assigned to each of the operations.

Definition

Simple tree matching • In the general setting, • mapping can cross levels, e.g., node a in tree A and node a in tree B. • Replacements are also allowed, e.g., node b in A and node h in B. • We describe a restricted matching algorithm, called simple tree matching (STM), which has been shown to be quite effective for Web data extraction. • STM is a top-down algorithm. • Instead of computing the edit distance of two trees, it evaluates their similarity by producing the maximum matching through dynamic programming.

Simple Tree Matching algorithm

An example
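
A sketch of simple tree matching, written from the description above (top-down, no level crossing, no replacements, maximum matching by dynamic programming). The `Node` class and the two small trees are hypothetical illustrations, and the recurrence follows the commonly published form of STM rather than the exact pseudocode on the slide, which did not survive in this transcript.

```python
from __future__ import annotations
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list[Node] = field(default_factory=list)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Return the number of node pairs in a maximum top-down matching of a and b."""
    if a.label != b.label:
        return 0  # roots differ: nothing below them may match either
    m, n = len(a.children), len(b.children)
    # M[i][j] = best matching between the first i subtrees of a and first j of b
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots


tree_a = Node("html", [Node("body", [Node("table", [Node("tr"), Node("tr")]), Node("p")])])
tree_b = Node("html", [Node("body", [Node("table", [Node("tr")]), Node("div"), Node("p")])])
score = simple_tree_matching(tree_a, tree_b)
```

Here the matched pairs are html, body, table, one tr, and p, so the score is 5. Dividing the score by the average size of the two trees is one common way to turn it into a similarity in [0, 1].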

Schema Alignment: Three Steps [BBR11] • Schema alignment: mediated schema + attribute matching + schema mapping • Enables linkage, fusion to be semantically meaningful • Mediated schema: enables domain-specific modeling • Attribute matching: identifies correspondences between schema attributes • Schema mapping: specifies transformation between records in different schemas

Probabilistic Mediated Schemas [DDH08] • Source schemas: S1(name, hPhone, hAddr, oPhone, oAddr), S4(name, pPh, pAddr) • Mediated schemas: automatically created by inspecting sources • Clustering of source attributes • Volume, variety of sources → uncertainty in accuracy of clustering

Probabilistic Mediated Schemas [DDH08] • Source schemas: S1(name, hPhone, hAddr, oPhone, oAddr), S4(name, pPh, pAddr) • Example P-mediated schema MS • M1({name}, {hPhone, pPh}, {oPhone}, {hAddr, pAddr}, {oAddr}) • M2({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr, oAddr}) • M3({name}, {hPhone, pPh}, {oPhone}, {hAddr}, {pAddr}, {oAddr}) • M4({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr}, {oAddr}) • MS = {(M1, 0.6), (M2, 0.4)}

Probabilistic Mappings [DHY07, DDH08] • Source schemas: S1(name, hPhone, hAddr, oPhone, oAddr), S4(name, pPh, pAddr) • Mapping between P-mediated schema and a source schema • Example mappings between M1 and S1 • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …) • G = {(G1, 0.6), (G2, 0.4)}

Probabilistic Mappings • Source schemas: S1(name, hPhone, hAddr, oPhone, oAddr), S4(name, pPh, pAddr) • Mapping between P-mediated schema and a source schema • Answering queries on P-mediated schema based on P-mappings • By table semantics: one mapping for all tuples in a table • By tuple semantics: different mappings are okay in a table

Probabilistic Mappings: By Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by table semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …)

Probabilistic Mappings: By Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by table semantics, in a possible world • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)

Probabilistic Mappings: By Table Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by table semantics, across all possible worlds

Probabilistic Mappings: By Tuple Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by tuple semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)

Probabilistic Mappings: By Tuple Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by tuple semantics, across all possible worlds • Note the difference with the result of Q2, under by table semantics

Introduction to Hadoop • Hadoop Map/Reduce is • a Java-based software framework for easily writing applications • which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware • in a reliable, fault-tolerant manner.

Hadoop Cluster Architecture • Client • Job submission node: JobTracker • HDFS master: NameNode • Slave nodes (three shown): each with a TaskTracker and a DataNode From Jimmy Lin’s slides

Hadoop HDFS

Hadoop Cluster Rack Awareness

Hadoop Development Cycle (you and the Hadoop cluster) • 1. Scp data to cluster • 2. Move data into HDFS • 3. Develop code locally • 4. Submit MapReduce job (4a. Go back to Step 3) • 5. Move data out of HDFS • 6. Scp data from cluster From Jimmy Lin’s slides

Divide and Conquer • Partition the “work” into w1, w2, w3 • Each “worker” processes one partition and produces a partial result: r1, r2, r3 • Combine the partial results into the final “result” From Jimmy Lin’s slides
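
The partition, worker, combine pattern can be sketched on a single machine as below. This is only an analogy for the idea on the slide: it uses Python threads rather than a Hadoop cluster, and the task (sum of squares) and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split the work into n_parts roughly equal pieces (w1, w2, ...)."""
    return [work[i::n_parts] for i in range(n_parts)]


def worker(chunk):
    """Process one piece and return its partial result (r1, r2, ...)."""
    return sum(x * x for x in chunk)


def combine(partials):
    """Merge the partial results into the final result."""
    return sum(partials)


def divide_and_conquer(work, n_workers=3):
    parts = partition(work, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, parts))
    return combine(partials)


result = divide_and_conquer(list(range(10)))
```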

High-level MapReduce pipeline

Detailed Hadoop MapReduce data flow

Outline • Hadoop Basics • Case Study • Word Count • Pairwise Similarity • PageRank • K-Means Clustering • Matrix Factorization • Cluster Coefficient • Resource Entries to ML labs • Advanced Topics • Q&A

Word Count with MapReduce • Input: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat" • Map: emit per-document term counts: Doc 1: one 1, two 1, fish 2; Doc 2: red 1, blue 1, fish 2; Doc 3: cat 1, hat 1 • Shuffle and Sort: aggregate values by keys • Reduce: sum the counts for each term: fish 4; blue, cat, hat, one, red, two 1 each From Jimmy Lin’s slides
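
The three phases of word count can be simulated in plain Python as follows. This is an in-memory sketch of the data flow, not Hadoop's Java API; the function names are hypothetical, and unlike the figure it also counts "in" and "the".

```python
import re
from collections import defaultdict


def map_word_count(doc_id, text):
    """Map: emit (word, 1) for every word occurrence in one document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Shuffle and sort: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_word_count(word, counts):
    """Reduce: sum the values collected for one word."""
    return word, sum(counts)


def run_word_count(docs):
    mapped = [kv for doc_id, text in docs.items() for kv in map_word_count(doc_id, text)]
    return dict(reduce_word_count(w, c) for w, c in shuffle(mapped).items())


DOCS = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
counts = run_word_count(DOCS)
```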

Calculating document pairwise similarity • Trivial solution • load each vector O(N) times • load each term O(df_t^2) times • Goal: a scalable and efficient solution for large collections From Jimmy Lin’s slides

Better Solution • Each term contributes only if it appears in both documents of a pair • Load weights for each term once • Each term contributes O(df_t^2) partial scores From Jimmy Lin’s slides

Decomposition • Each term contributes only if it appears in both documents of a pair • Load weights for each term once • Each term contributes O(df_t^2) partial scores • The partial scores are generated in map and summed in reduce From Jimmy Lin’s slides

Standard Indexing • (a) Map: tokenize each doc • (b) Shuffle: group values by terms • (c) Reduce: combine the posting list of each term From Jimmy Lin’s slides

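Inverted Indexing with MapReduce • Input: Doc 1 "one fish, two fish"; Doc 2 "red fish, blue fish"; Doc 3 "cat in the hat" • Map: emit term → (doc, tf): Doc 1: one → (1, 1), two → (1, 1), fish → (1, 2); Doc 2: red → (2, 1), blue → (2, 1), fish → (2, 2); Doc 3: cat → (3, 1), hat → (3, 1) • Shuffle and Sort: aggregate values by keys • Reduce: build one posting list per term: blue → (2, 1); cat → (3, 1); fish → (1, 2), (2, 2); hat → (3, 1); one → (1, 1); red → (2, 1); two → (1, 1) From Jimmy Lin’s slides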
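
Inverted indexing follows the same map, shuffle, reduce flow, except that the mapper emits (doc, tf) postings instead of bare counts. The sketch below is a plain-Python simulation with hypothetical function names, not Hadoop code, and it does not drop "in" and "the" as the figure appears to.

```python
import re
from collections import Counter, defaultdict


def map_index(doc_id, text):
    """Map: tokenize one document and emit (term, (doc_id, term frequency))."""
    tf = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, freq in tf.items():
        yield term, (doc_id, freq)


def build_index(docs):
    """Shuffle by term, then reduce by combining postings into a sorted list."""
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_index(doc_id, text):
            postings[term].append(posting)
    return {term: sorted(plist) for term, plist in sorted(postings.items())}


DOCS = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
index = build_index(DOCS)
```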

Indexing (3-doc toy collection) • Doc 1: "Clinton Obama Clinton"; Doc 2: "Clinton Cheney"; Doc 3: "Clinton Barack Obama" • Indexing yields postings of (doc, tf): Clinton → (1, 2), (2, 1), (3, 1); Cheney → (2, 1); Barack → (3, 1); Obama → (1, 1), (3, 1) From Jimmy Lin’s slides

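Pairwise Similarity • (a) Generate pairs: for each term, multiply the weights of every pair of documents in its posting list (Clinton gives 2, 2, 1; Obama gives 1) • (b) Group pairs: collect the partial scores by document pair • (c) Sum pairs: the three document pairs score 2, 3, and 1 • How to deal with the long list? From Jimmy Lin’s slides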
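
The generate, group, and sum steps can be sketched over an inverted index as below. The index is the three-document toy collection with raw term frequencies as weights; the document numbering (1 = "Clinton Obama Clinton", 2 = "Clinton Cheney", 3 = "Clinton Barack Obama") is an assumption, and the function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# term -> posting list of (doc_id, weight); weights here are raw term frequencies.
INDEX = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
}


def map_pairs(term, postings):
    """Map: a term contributes one partial score per pair of documents containing it."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def pairwise_similarity(index):
    grouped = defaultdict(list)           # shuffle: group partial scores by pair
    for term, postings in index.items():
        for pair, score in map_pairs(term, postings):
            grouped[pair].append(score)
    # Reduce: sum the partial scores of each document pair.
    return {pair: sum(scores) for pair, scores in sorted(grouped.items())}


similarity = pairwise_similarity(INDEX)
```

A term that occurs in df documents produces df(df-1)/2 pairs, which is why very long posting lists (very common terms) dominate the cost.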

Record Linkage for Big Data Slides from Luna Dong’s VLDB Tutorial

Record Linkage: Three Steps [EIV07, GM12] • Record linkage: blocking + pairwise matching + clustering • Scalability, similarity, semantics • Blocking: efficiently create small blocks of similar records; ensures scalability • Pairwise matching: compares all record pairs in a block; computes similarity
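
A minimal sketch of the first two steps, blocking and pairwise matching. The records, the blocking key (first three letters of the name), the Jaccard token similarity, and the 0.5 threshold are all hypothetical choices made for illustration; the slides do not prescribe any of them, and the clustering step is not shown.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical records (invented for illustration).
RECORDS = [
    {"id": 1, "name": "John Smith", "addr": "12 Main St Chicago"},
    {"id": 2, "name": "John Smith", "addr": "12 Main St Chicago IL"},
    {"id": 3, "name": "John Doe", "addr": "99 Oak Ave Urbana"},
    {"id": 4, "name": "Mary Jones", "addr": "5 Elm Rd Chicago"},
]


def blocking_key(record):
    """Blocking: a cheap key so that only records sharing it are compared."""
    return record["name"].lower()[:3]


def tokens(record):
    return set((record["name"] + " " + record["addr"]).lower().split())


def jaccard(a, b):
    """Pairwise matching: token-set overlap as a simple similarity score."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


def candidate_pairs(records):
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)  # compare all pairs inside a block only


def match(records, threshold=0.5):
    return [(a["id"], b["id"]) for a, b in candidate_pairs(records)
            if jaccard(a, b) >= threshold]


candidates = list(candidate_pairs(RECORDS))
matches = match(RECORDS)
```

With these four records, blocking reduces the six possible pairs to three candidates, and only records 1 and 2 pass the similarity threshold.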
