Data Extraction
Road map • String Matching and Tree Matching • Multiple Alignments • Building DOM Trees • Extraction Given a List Page: Flat Data Records • Extraction Given a List Page: Nested Data Records • Extraction Given Multiple Pages • Summary CS511, Bing Liu, UIC
Some useful algorithms • The key is to find the encoding template from a collection of encoded instances of the same type. • A natural way to do this is to detect repeated patterns in the HTML encoding strings. • String edit distance and tree edit distance are obvious techniques for the task. We describe these techniques.
String edit distance • String edit distance: the most widely used string comparison technique. • The edit distance of two strings, s1 and s2, is defined as the minimum number of point mutations required to change s1 into s2, where a point mutation is one of: • (1) change a letter, • (2) insert a letter, and • (3) delete a letter.
String edit distance (definition)
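The formula on this slide did not survive extraction. A standard recursive formulation that agrees with the three point mutations listed above is sketched below. It assumes every mutation costs 1; here \epsilon is the empty string, c_1 and c_2 are the last letters of the two strings, and [c_1 \neq c_2] is 1 when the letters differ and 0 otherwise.

```latex
\begin{aligned}
d(\epsilon, \epsilon) &= 0 \\
d(s, \epsilon) = d(\epsilon, s) &= |s| \\
d(s_1 + c_1,\; s_2 + c_2) &= \min
\begin{cases}
d(s_1, s_2) + [c_1 \neq c_2] & \text{(change, or keep if equal)} \\
d(s_1 + c_1,\; s_2) + 1 & \text{(insert } c_2\text{)} \\
d(s_1,\; s_2 + c_2) + 1 & \text{(delete } c_1\text{)}
\end{cases}
\end{aligned}
```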
Dynamic programming
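The body of this slide was lost, so the following is only a minimal sketch of the usual dynamic-programming table for string edit distance, assuming unit cost for each change, insertion, and deletion. The function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Minimum number of unit-cost changes, insertions, and deletions
    needed to turn s1 into s2."""
    m, n = len(s1), len(s2)
    # d[i][j] holds the edit distance between the prefixes s1[:i] and s2[:j].
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = i  # delete all i letters of s1[:i]
    for j in range(1, n + 1):
        d[0][j] = j  # insert all j letters of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            change = d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1])
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(change, delete, insert)
    return d[m][n]
```

The table has (m+1) x (n+1) cells and each cell is filled in constant time, so the sketch runs in O(mn) time and space.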
An example • The edit distance matrix and back trace path • alignment
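The matrix and alignment shown on this slide were lost in extraction. As a hedged illustration of the same idea, the sketch below fills the edit distance matrix and then walks a back trace path from the bottom-right cell to recover one optimal alignment. The function name `align`, the unit costs, and the gap symbol "-" are assumptions made for this example.

```python
def align(s1: str, s2: str):
    """Return (distance, aligned_s1, aligned_s2); '-' marks a gap."""
    m, n = len(s1), len(s2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]),  # change or keep
                d[i - 1][j] + 1,                              # delete from s1
                d[i][j - 1] + 1,                              # insert from s2
            )
    # Back trace: at each cell, step to a predecessor that explains its value.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s1[i - 1] != s2[j - 1]):
            top.append(s1[i - 1]); bottom.append(s2[j - 1])
            i -= 1; j -= 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            top.append(s1[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(s2[j - 1])
            j -= 1
    return d[m][n], "".join(reversed(top)), "".join(reversed(bottom))
```

Several back trace paths can be optimal; this sketch prefers the diagonal step, so it returns just one of the possible alignments.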
Tree Edit Distance • Tree edit distance between two trees A and B (labeled ordered rooted trees) is the cost associated with the minimum set of operations needed to transform A into B. • The set of operations used to define tree edit distance includes three operations: • node removal, • node insertion, and • node replacement. A cost is assigned to each of the operations.
Definition
Simple tree matching • In the general setting, • mapping can cross levels, e.g., node a in tree A and node a in tree B. • Replacements are also allowed, e.g., node b in A and node h in B. • We describe a restricted matching algorithm, called simple tree matching (STM), which has been shown to be quite effective for Web data extraction. • STM is a top-down algorithm. • Instead of computing the edit distance of two trees, it evaluates their similarity by producing the maximum matching through dynamic programming.
Simple Tree Matching algorithm
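The algorithm listing on this slide was lost. The sketch below follows the description given above: it works top-down, returns 0 when the two roots differ, and otherwise uses dynamic programming over the two ordered child lists to find the maximum matching. The `Node` class and the function name are illustrative assumptions, not the original listing.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)  # ordered list of Node


def simple_tree_matching(a: Node, b: Node) -> int:
    """Number of node pairs in a maximum top-down matching of trees a and b."""
    if a.label != b.label:
        return 0  # differing roots: nothing below them is matched
    m, n = len(a.children), len(b.children)
    # M[i][j] = best matching between the first i subtrees of a
    # and the first j subtrees of b, preserving sibling order.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            M[i][j] = max(M[i][j - 1], M[i - 1][j], M[i - 1][j - 1] + w)
    return M[m][n] + 1  # +1 for the matched roots
```

Because a match is only attempted between nodes whose parents already match, the sketch allows neither replacements nor level crossing, which is the restriction that separates STM from general tree edit distance.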
An example
Schema Alignment: Three Steps [BBR11] • Schema alignment: mediated schema + matching + mapping • Enables linkage and fusion to be semantically meaningful
Schema Alignment: Three Steps • Schema alignment: mediated schema + matching + mapping • Mediated schema: enables domain-specific modeling
Schema Alignment: Three Steps • Schema alignment: mediated schema + matching + mapping • Attribute matching: identifies correspondences between schema attributes
Schema Alignment: Three Steps • Schema alignment: mediated schema + matching + mapping • Schema mapping: specifies transformation between records in different schemas
Probabilistic Mediated Schemas [DDH08] • Source schemas: S1(name, hPhone, hAddr, oPhone, oAddr), S4(name, pPh, pAddr) • Mediated schemas: automatically created by inspecting sources • Clustering of source attributes • Volume and variety of sources → uncertainty in the accuracy of clustering
Probabilistic Mediated Schemas [DDH08] • Example P-mediated schema MS • M1({name}, {hPhone, pPh}, {oPhone}, {hAddr, pAddr}, {oAddr}) • M2({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr, oAddr}) • M3({name}, {hPhone, pPh}, {oPhone}, {hAddr}, {pAddr}, {oAddr}) • M4({name}, {hPhone}, {pPh, oPhone}, {hAddr}, {pAddr}, {oAddr}) • MS = {(M1, 0.6), (M2, 0.4)}
Probabilistic Mappings [DHY07, DDH08] • Mapping between P-mediated schema and a source schema • Example mappings between M1 and S1 • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …) • G = {(G1, 0.6), (G2, 0.4)}
Probabilistic Mappings • Mapping between P-mediated schema and a source schema • Answering queries on P-mediated schema based on P-mappings • By-table semantics: one mapping for all tuples in a table • By-tuple semantics: different mappings are okay in a table
Probabilistic Mappings: By-Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-table semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …)
Probabilistic Mappings: By-Table Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-table semantics, in a possible world • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)
Probabilistic Mappings: By-Table Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by-table semantics, across all possible worlds
Probabilistic Mappings: By-Tuple Semantics • Consider query Q1: SELECT name, pPh, pAddr FROM MS • Result of Q1, under by-tuple semantics, in a possible world • G1({M1.n, name}, {M1.phP, hPhone}, {M1.phA, hAddr}, …) • G2({M1.n, name}, {M1.phP, oPhone}, {M1.phA, oAddr}, …)
Probabilistic Mappings: By-Tuple Semantics • Now consider query Q2: SELECT pAddr FROM MS • Result of Q2, under by-tuple semantics, across all possible worlds • Note the difference with the result of Q2, under by-table semantics
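The result tables for Q2 were lost in extraction. The sketch below shows how the two semantics can give different answer probabilities. It makes several assumptions: the two source rows are hypothetical, and the p-mapping is reduced to the single choice that matters for Q2, namely that pAddr resolves to hAddr under G1 (probability 0.6) and to oAddr under G2 (probability 0.4).

```python
from collections import defaultdict
from itertools import product

# Hypothetical rows of source S1 (not the data from the original slides).
rows = [
    {"name": "Alice", "hAddr": "A", "oAddr": "B"},
    {"name": "Bob", "hAddr": "B", "oAddr": "C"},
]
# Simplified p-mapping for pAddr: (source attribute, probability).
p_mapping = [("hAddr", 0.6), ("oAddr", 0.4)]


def by_table(rows, p_mapping):
    """One mapping applies to every tuple in the table."""
    probs = defaultdict(float)
    for attr, p in p_mapping:
        for answer in {row[attr] for row in rows}:
            probs[answer] += p
    return dict(probs)


def by_tuple(rows, p_mapping):
    """Each tuple independently picks its own mapping."""
    probs = defaultdict(float)
    for choice in product(p_mapping, repeat=len(rows)):
        world_prob = 1.0
        answers = set()
        for row, (attr, p) in zip(rows, choice):
            world_prob *= p
            answers.add(row[attr])
        for answer in answers:
            probs[answer] += world_prob
    return dict(probs)
```

With these rows, address "B" is certain under by-table semantics (it is Bob's hAddr under G1 and Alice's oAddr under G2), but under by-tuple semantics there is a possible world (G1 for Alice, G2 for Bob) in which "B" is not returned, so its probability drops below 1. Enumerating all mapping sequences is exponential in the number of rows, so this is only meant for toy inputs.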
Introduction to Hadoop • Hadoop Map/Reduce is • a Java-based software framework for easily writing applications • which process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware • in a reliable, fault-tolerant manner.
Hadoop Cluster Architecture • Client • Job submission node: JobTracker • HDFS master: NameNode • Slave nodes (three shown): each runs a TaskTracker and a DataNode From Jimmy Lin’s slides
Hadoop Development Cycle (between you and the Hadoop cluster) 1. Scp data to cluster 2. Move data into HDFS 3. Develop code locally 4. Submit MapReduce job 4a. Go back to Step 3 5. Move data out of HDFS 6. Scp data from cluster
Divide and Conquer • Partition the “Work” into w1, w2, w3 • Each “worker” processes one part, producing r1, r2, r3 • Combine r1, r2, r3 into the “Result”
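The partition, work, and combine pattern can be sketched on a single machine with a thread pool. This is only an illustration of the pattern, not of Hadoop itself; the task (summing squares) and the names `partition`, `worker`, and `divide_and_conquer` are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, k):
    """Split the work into k parts (w1, w2, ..., wk)."""
    return [work[i::k] for i in range(k)]


def worker(part):
    """Process one part and return its partial result (r_i)."""
    return sum(x * x for x in part)


def divide_and_conquer(work, k=3):
    parts = partition(work, k)
    with ThreadPoolExecutor(max_workers=k) as pool:
        results = list(pool.map(worker, parts))
    return sum(results)  # combine the partial results
```

The combine step here is a plain sum, which works because the partial results can be merged in any order.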
Outline • Hadoop Basics • Case Study • Word Count • Pairwise Similarity • PageRank • K-Means Clustering • Matrix Factorization • Cluster Coefficient • Resource Entries to ML labs • Advanced Topics • Q&A
Word Count with MapReduce • Input: Doc 1 “one fish, two fish”, Doc 2 “red fish, blue fish”, Doc 3 “cat in the hat” • Map: emit each word as a key, with its document and count as the value, e.g., fish → (Doc 1, 2) and fish → (Doc 2, 2) • Shuffle and Sort: aggregate values by keys • Reduce: combine the values collected for each word (blue, cat, fish, hat, one, red, two)
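The three phases can be simulated in plain Python on the three toy documents. This is a single-process sketch, not Hadoop code; the function names and the tokenization rule (lowercase alphabetic runs) are assumptions made for the example.

```python
import re
from collections import defaultdict

docs = {
    1: "one fish, two fish",
    2: "red fish, blue fish",
    3: "cat in the hat",
}


def map_phase(doc_id, text):
    """Emit (word, 1) for every word occurrence in the document."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield word, 1


def shuffle(pairs):
    """Group values by key and sort the keys, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(word, counts):
    """Sum all counts collected for one word."""
    return word, sum(counts)


def word_count(docs):
    pairs = [kv for doc_id, text in docs.items() for kv in map_phase(doc_id, text)]
    return dict(reduce_phase(w, c) for w, c in shuffle(pairs).items())
```

In a real job the map and reduce calls would run on different nodes, and the shuffle would move the grouped values across the network between them.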
Calculating document pairwise similarity • Trivial solution • load each vector O(N) times • load each term O(df_t^2) times, where df_t is the document frequency of term t • Goal: a scalable and efficient solution for large collections
Better Solution • Each term contributes only if it appears in both documents of a pair • Load weights for each term once • Each term contributes O(df_t^2) partial scores
Decomposition • Each term contributes only if it appears in both documents of a pair • Load weights for each term once • Each term contributes O(df_t^2) partial scores • Map: generate each term's partial scores • Reduce: sum the partial scores for each document pair
Standard Indexing • (a) Map: tokenize each doc • (b) Shuffle: group values by terms • (c) Reduce: combine each term's values into a posting list
Inverted Indexing with MapReduce • Input: Doc 1 “one fish, two fish”, Doc 2 “red fish, blue fish”, Doc 3 “cat in the hat” • Map: emit (term, (doc, tf)), e.g., one → (1, 1), two → (1, 1), fish → (1, 2), red → (2, 1), blue → (2, 1), fish → (2, 2), cat → (3, 1), hat → (3, 1) • Shuffle and Sort: aggregate values by keys • Reduce: write one posting list per term, e.g., fish → (1, 2), (2, 2)
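A single-process sketch of the same pipeline on the toy documents follows. The mapper emits one (term, (doc_id, tf)) pair per distinct term in a document, and the reducer sorts each term's postings by document id. Function names and the tokenization rule are assumptions made for the example.

```python
import re
from collections import Counter, defaultdict

docs = {
    1: "one fish, two fish",
    2: "red fish, blue fish",
    3: "cat in the hat",
}


def map_phase(doc_id, text):
    """Emit (term, (doc_id, term frequency)) for each distinct term."""
    tf = Counter(re.findall(r"[a-z]+", text.lower()))
    for term, count in tf.items():
        yield term, (doc_id, count)


def build_index(docs):
    # Shuffle: group postings by term.
    groups = defaultdict(list)
    for doc_id, text in docs.items():
        for term, posting in map_phase(doc_id, text):
            groups[term].append(posting)
    # Reduce: sort each posting list by document id.
    return {term: sorted(postings) for term, postings in sorted(groups.items())}
```

Compared with word count, the only real change is the value type: a (doc_id, tf) posting instead of a bare count, and a list-building reducer instead of a sum.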
Indexing (3-doc toy collection) • Documents: “Clinton Obama Clinton”, “Clinton Cheney”, “Clinton Barack Obama” • Indexing produces one posting list of term frequencies per term: Clinton, Cheney, Barack, Obama
Pairwise Similarity • (a) Generate pairs • (b) Group pairs • (c) Sum pairs • How to deal with the long list?
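The three steps can be sketched over an inverted index of the toy collection. The postings below are an assumption reconstructed from the three toy documents: raw term frequency is used as the term weight, and the document numbering (1 = “Clinton Obama Clinton”, 2 = “Clinton Cheney”, 3 = “Clinton Barack Obama”) is chosen here for illustration.

```python
from collections import defaultdict
from itertools import combinations

# term -> list of (doc_id, weight); weights are raw term frequencies (assumed).
index = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama": [(1, 1), (3, 1)],
    "cheney": [(2, 1)],
    "barack": [(3, 1)],
}


def map_phase(term, postings):
    """(a) Generate pairs: one partial score per document pair sharing the term."""
    for (d1, w1), (d2, w2) in combinations(sorted(postings), 2):
        yield (d1, d2), w1 * w2


def pairwise_similarity(index):
    # (b) Group pairs: collect partial scores by document pair.
    groups = defaultdict(list)
    for term, postings in index.items():
        for pair, score in map_phase(term, postings):
            groups[pair].append(score)
    # (c) Sum pairs: add up the partial scores for each pair.
    return {pair: sum(scores) for pair, scores in groups.items()}
```

A term that occurs in df_t documents generates df_t(df_t - 1)/2 pairs, which is why a long posting list, such as the one for a very common term, dominates the cost.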
Record Linkage for Big Data Slides from Luna Dong’s VLDB Tutorial
Record Linkage: Three Steps [EIV07, GM12] • Record linkage: blocking + pairwise matching + clustering • Scalability, similarity, semantics
Record Linkage: Three Steps • Blocking: efficiently create small blocks of similar records • Ensures scalability
Record Linkage: Three Steps • Pairwise matching: compares all record pairs in a block • Computes similarity
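A minimal end-to-end sketch of the three steps is given below. Everything specific in it is an assumption made for illustration: the records are hypothetical, the blocking key is the lowercased last name, the similarity is token-level Jaccard with a 0.4 threshold, and clustering is done by taking connected components of the match graph.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical records (not from the original slides).
records = {
    1: {"name": "John Smith", "addr": "12 Main St"},
    2: {"name": "Jon Smith", "addr": "12 Main Street"},
    3: {"name": "Mary Jones", "addr": "5 Oak Ave"},
    4: {"name": "Mary Jones", "addr": "5 Oak Avenue"},
    5: {"name": "Sam Smith", "addr": "99 Elm Rd"},
}


def block(records):
    """Blocking: group record ids by a cheap key (here, the last name)."""
    blocks = defaultdict(list)
    for rid, rec in records.items():
        blocks[rec["name"].split()[-1].lower()].append(rid)
    return dict(blocks)


def similarity(r1, r2):
    """Pairwise matching score: Jaccard similarity over word tokens."""
    t1 = set(re.findall(r"\w+", f"{r1['name']} {r1['addr']}".lower()))
    t2 = set(re.findall(r"\w+", f"{r2['name']} {r2['addr']}".lower()))
    return len(t1 & t2) / len(t1 | t2)


def match(records, blocks, threshold=0.4):
    """Compare all record pairs inside each block only."""
    return [
        (a, b)
        for ids in blocks.values()
        for a, b in combinations(sorted(ids), 2)
        if similarity(records[a], records[b]) >= threshold
    ]


def cluster(ids, matches):
    """Clustering: connected components of the match graph (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matches:
        parent[find(a)] = find(b)
    groups = defaultdict(list)
    for i in sorted(ids):
        groups[find(i)].append(i)
    return sorted(groups.values())
```

On these five records, blocking reduces the comparisons from 10 pairs to 4 (three within the "smith" block, one within "jones"), which is the scalability benefit the slide refers to.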