400 likes | 473 Vues
AJAX: Model, Declarative Language, and Algorithms. Helena Galhardas. Plan. Context Problem statement Contributions Our data cleaning solution Validation Related solutions Conclusions. Application context. Eliminate errors and duplicates within a single source
E N D
AJAX: Model, Declarative Language, and Algorithms Helena Galhardas
Plan • Context • Problem statement • Contributions • Our data cleaning solution • Validation • Related solutions • Conclusions
Application context • Eliminate errors and duplicates within a single source • Integrate data from different sources • Migrate poorly structured data into structured data
Human Knowledge Human Knowledge Data Transformation Typical architecture TARGET DATA SOURCE DATA Data Extraction Data Transformation Data Loading ... ... Data Analysis Metadata Dictionaries Schema Integration
Data cleaning Activity of transforming source data into target data without errors, duplicates, and inconsistencies
Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) Events(eventKey, name) PubsAuthors(pubKey, authorKey) Data Cleaning Motivating example (1) DirtyData(paper:String)
Authors Publications DQua | Dallan Quass AGup | Ashish Gupta JWid | Jennifer Widom ….. QGMW96| Making Views Self-Maintainable for Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996 Events PubsAuthors PDIS | Conference on Parallel and Distributed Information Systems Data Cleaning QGMW96 | DQua QGMW96 | AGup …. DirtyData [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95 Motivating example (2)
Plan • Context • Problem statement • Contributions • Our data cleaning solution • Validation • Related solutions • Conclusions
Modeling a data cleaning process Authors Duplicate Elimination DirtyTitles... DirtyEvents DirtyAuthors • A data cleaning process is • modeled by a directed acyclic • graph of data transformations Extraction Standardization Cities Tags Formatting DirtyData
Existing technology • Ad-hoc code • difficult to maintain • Extraction Transformation Loading(ETI, Informatica, Sagent) • limited cleaning functionality • Data Reengineering(Integrity) • fixed implementation for certain operators • Specific-domain cleaning(idCentric, PureIntegrate) • names and addresses • Duplicate elimination(DataCleanser, matchIt) • finds/eliminates duplicates
App. Domain 1 App. Domain 2 App. Domain 3 Problems of existing solutions (1) Data cleaning transformations ... The semantics of some data transformations is defined in terms of their implementation algorithms
Clean data Rejected data Cleaning process Dirty Data Problems of existing solutions (2) There is a lack of interactive facilities to tune a data cleaning application program
AJAX • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application
Data cleaning framework • Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation • Physical level: set of algorithms that implement the logical operations
View Cluster Match Map Merge Apply Logical level: parametric operators • View: arbitrary SQL query • Map: iterator-based one-to-many mapping with arbitrary user-defined functions • Match: iterator-based approximate join • Cluster: uses an arbitrary clustering function • Merge: extends SQL group-by with user-defined aggregate functions • Apply: executes an arbitrary user-defined algorithm
Authors Duplicate Elimination DirtyTitles... DirtyAuthors Extraction Standardization Cities Tags Formatting DirtyData Logical level
Physical level Authors Authors Merge Java Scan Duplicate Elimination Cluster TC Match NL DirtyTitles... DirtyTitles... DirtyAuthors DirtyAuthors Extraction Map Java Scan Standardization Map Java Scan Cities Tags Cities Tags Formatting Map SQL Scan DirtyData DirtyData Logical level
Contributions • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application
Match • Input: 2relations • Finds data records that correspond to the same real object • Calls distance functions for comparing field values and computing the distance between input tuples • Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples
Example Authors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination
Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination
Authors Merge Cluster Match DirtyAuthors Duplicate Elimination Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Input: DirtyAuthors(authorKey, name) 861|johann christoph freytag 822|jc freytag 819|j freytag 814|j-c freytag Output: MatchAuthors(authorKey1, authorKey2, name1, name2) 861|822|johann christoph freytag| jc freytag 822|814|jc freytag|j-c freytag ... MatchAuthors
Implementation of the match operator s1 S1,s2 S2 (s1, s2) is a match if editDistance (s1, s2) < maxDist
S1 S2 ... Nested loop editDistance • Very expensive evaluation when handling large amounts of data • Need alternative execution algorithms for the same logical specification
A database solution CREATE TABLE MatchAuthors AS SELECT authorKey1, authorKey2, distance FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2) WHERE distance < maxDist; • No optimization supported for a Cartesian product with external function calls
Window scanning S n
S Window scanning n
S Window scanning n • May loose some matches
length- 1 length length length + 1 editDistance String distance filtering S1 S2 John Smit John Smith Jogn Smith John Smithe maxDist = 1
Annotation-based optimization • The user specifies types of optimization • The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editDistance(da1.name, da2.name) WHERE dist < maxDist % distance-filtering: map= length; dist = abs % INTO MatchAuthors
Contributions • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application
Management of exceptions • Problem: to mark tuples not handledby the cleaning criteria of an operator • Solution: to specify the generation of exceptional tupleswithin a logical operator • exceptions are thrown by external functions • output constraints are violated
Example (1) CREATE MAP ExtractionCities FROMStandardizedDirtyData dd LET city = extractCities(dd.paper, Cities), { SELECT dd.paperKey AS pubKey, city AS city INTOExtractedCities CONSTRAINT NOT NULL city } ExtractedCities (pubKey, city) Extraction Map Cities StandardizedDirtyData (pubKey, paper)
4| ManyDifferentCities StandardizedDirtyDataexc ExtractedCities ExtractionCities Cities StandardizedDirtyData 4|y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park Example(2)
Debugger facility • Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions • Supports the interactive data modification and the incremental execution of some logical operators
BackwardDerivation ForwardDerivation Backward/forward data derivation 4|Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland 4| ManyDifferentCities StandardizedDirtyDataexc DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities Cities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992
StandardizedDirtyDataexc DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData Interactive data correction (1) 4|Technical Report, Univ. Of Wisconsin and Univ. Of Maryland 4| ManyDifferentCities Cities 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992 101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992
4| Technical Report, Univ. Of Wisconsin 101| Technical Report, Univ. Of Maryland 4| Madison 101| College Park DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992 101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992 Interactive data correction(2) incremental incremental Cities incremental incremental