1 / 40

AJAX: Model, Declarative Language, and Algorithms

AJAX: Model, Declarative Language, and Algorithms. Helena Galhardas. Plan. Context Problem statement Contributions Our data cleaning solution Validation Related solutions Conclusions. Application context. Eliminate errors and duplicates within a single source

deron
Télécharger la présentation

AJAX: Model, Declarative Language, and Algorithms

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. AJAX: Model, Declarative Language, and Algorithms Helena Galhardas

  2. Plan • Context • Problem statement • Contributions • Our data cleaning solution • Validation • Related solutions • Conclusions

  3. Application context • Eliminate errors and duplicates within a single source • Integrate data from different sources • Migrate poorly structured data into structured data

  4. Human Knowledge Human Knowledge Data Transformation Typical architecture TARGET DATA SOURCE DATA Data Extraction Data Transformation Data Loading ... ... Data Analysis Metadata Dictionaries Schema Integration

  5. Data cleaning Activity of transforming source data into target data without errors, duplicates, and inconsistencies

  6. Publications(pubKey, title, eventKey, url, volume, number, pages, city, month, year) Authors(authorKey, name) Events(eventKey, name) PubsAuthors(pubKey, authorKey) Data Cleaning Motivating example (1) DirtyData(paper:String)

  7. Authors Publications DQua | Dallan Quass AGup | Ashish Gupta JWid | Jennifer Widom ….. QGMW96| Making Views Self-Maintainable for Data Warehousing |PDIS| null | null | null | null | Miami Beach | Florida, USA | 1996 Events PubsAuthors PDIS | Conference on Parallel and Distributed Information Systems Data Cleaning QGMW96 | DQua QGMW96 | AGup …. DirtyData [1] Dallan Quass, Ashish Gupta, Inderpal Singh Mumick, and Jennifer Widom. Making Views Self-Maintainable for Data Warehousing. In Proceedings of the Conference on Parallel and Distributed Information Systems. Miami Beach, Florida, USA, 1996 [2] D. Quass, A. Gupta, I. Mumick, J. Widom, Making views self-maintianable for data warehousing, PDIS’95 Motivating example (2)

  8. Plan • Context • Problem statement • Contributions • Our data cleaning solution • Validation • Related solutions • Conclusions

  9. Modeling a data cleaning process Authors Duplicate Elimination DirtyTitles... DirtyEvents DirtyAuthors • A data cleaning process is • modeled by a directed acyclic • graph of data transformations Extraction Standardization Cities Tags Formatting DirtyData

  10. Existing technology • Ad-hoc code • difficult to maintain • Extraction Transformation Loading(ETI, Informatica, Sagent) • limited cleaning functionality • Data Reengineering(Integrity) • fixed implementation for certain operators • Specific-domain cleaning(idCentric, PureIntegrate) • names and addresses • Duplicate elimination(DataCleanser, matchIt) • finds/eliminates duplicates

  11. App. Domain 1 App. Domain 2 App. Domain 3 Problems of existing solutions (1) Data cleaning transformations ... The semantics of some data transformations is defined in terms of their implementation algorithms

  12. Clean data Rejected data Cleaning process Dirty Data Problems of existing solutions (2) There is a lack of interactive facilities to tune a data cleaning application program

  13. AJAX • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application

  14. Data cleaning framework • Logical level: set of logical operators to express cleaning criteria enclosed in each data transformation • Physical level: set of algorithms that implement the logical operations

  15. View Cluster Match Map Merge Apply Logical level: parametric operators • View: arbitrary SQL query • Map: iterator-based one-to-many mapping with arbitrary user-defined functions • Match: iterator-based approximate join • Cluster: uses an arbitrary clustering function • Merge: extends SQL group-by with user-defined aggregate functions • Apply: executes an arbitrary user-defined algorithm

  16. Authors Duplicate Elimination DirtyTitles... DirtyAuthors Extraction Standardization Cities Tags Formatting DirtyData Logical level

  17. Physical level Authors Authors Merge Java Scan Duplicate Elimination Cluster TC Match NL DirtyTitles... DirtyTitles... DirtyAuthors DirtyAuthors Extraction Map Java Scan Standardization Map Java Scan Cities Tags Cities Tags Formatting Map SQL Scan DirtyData DirtyData Logical level

  18. Contributions • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application

  19. Match • Input: 2relations • Finds data records that correspond to the same real object • Calls distance functions for comparing field values and computing the distance between input tuples • Output: 1 relation containing matching tuples and possibly 1 or 2 relations containing non-matching tuples

  20. Example Authors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination

  21. Example Authors CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Merge Cluster MatchAuthors Match DirtyAuthors Duplicate Elimination

  22. Authors Merge Cluster Match DirtyAuthors Duplicate Elimination Example CREATE MATCH MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET distance = editDistance(da1.name, da2.name) WHERE distance < maxDist INTO MatchAuthors Input: DirtyAuthors(authorKey, name) 861|johann christoph freytag 822|jc freytag 819|j freytag 814|j-c freytag Output: MatchAuthors(authorKey1, authorKey2, name1, name2) 861|822|johann christoph freytag| jc freytag 822|814|jc freytag|j-c freytag ... MatchAuthors

  23. Implementation of the match operator  s1 S1,s2  S2 (s1, s2) is a match if editDistance (s1, s2) < maxDist

  24. S1 S2 ... Nested loop editDistance • Very expensive evaluation when handling large amounts of data • Need alternative execution algorithms for the same logical specification

  25. A database solution CREATE TABLE MatchAuthors AS SELECT authorKey1, authorKey2, distance FROM (SELECT a1.authorKey authorKey1, a2.authorKey authorKey2, editDistance (a1.name, a2.name) distance FROM DirtyAuthors a1, DirtyAuthors a2) WHERE distance < maxDist; • No optimization supported for a Cartesian product with external function calls

  26. Window scanning S n

  27. S Window scanning n

  28. S Window scanning n • May loose some matches

  29. length- 1 length length length + 1 editDistance String distance filtering S1 S2 John Smit John Smith Jogn Smith John Smithe maxDist = 1

  30. Annotation-based optimization • The user specifies types of optimization • The system suggests which algorithm to use Ex: CREATE MATCHING MatchDirtyAuthors FROM DirtyAuthors da1, DirtyAuthors da2 LET dist = editDistance(da1.name, da2.name) WHERE dist < maxDist % distance-filtering: map= length; dist = abs % INTO MatchAuthors

  31. Contributions • An extensible data cleaning framework • A declarative language for logical operators • Efficient implementation of the match operator • A debugger facility for tuning a data cleaning program application

  32. Management of exceptions • Problem: to mark tuples not handledby the cleaning criteria of an operator • Solution: to specify the generation of exceptional tupleswithin a logical operator • exceptions are thrown by external functions • output constraints are violated

  33. Example (1) CREATE MAP ExtractionCities FROMStandardizedDirtyData dd LET city = extractCities(dd.paper, Cities), { SELECT dd.paperKey AS pubKey, city AS city INTOExtractedCities CONSTRAINT NOT NULL city } ExtractedCities (pubKey, city) Extraction Map Cities StandardizedDirtyData (pubKey, paper)

  34. 4| ManyDifferentCities StandardizedDirtyDataexc ExtractedCities ExtractionCities Cities StandardizedDirtyData 4|y ioannidis r ng k shim and t sellis parametric query optimization technical report univ of wisconsin madison and univ of maryland college park Example(2)

  35. Debugger facility • Supports the (backward and forward) data derivation of tuples wrt an operator to debug exceptions • Supports the interactive data modification and the incremental execution of some logical operators

  36. BackwardDerivation ForwardDerivation Backward/forward data derivation 4|Technical Report, Univ. Of Wisconsin, and Univ. Of Maryland 4| ManyDifferentCities StandardizedDirtyDataexc DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities Cities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison and Univ. Of Maryland, College Park, 1992

  37. StandardizedDirtyDataexc DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData Interactive data correction (1) 4|Technical Report, Univ. Of Wisconsin and Univ. Of Maryland 4| ManyDifferentCities Cities 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992 101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992

  38. 4| Technical Report, Univ. Of Wisconsin 101| Technical Report, Univ. Of Maryland 4| Madison 101| College Park DirtyEvents ExtractedCities ExtractionAuthorsTitleEvent ExtractionCities StandardizedDirtyDataForExtraction StandardizedDirtyData StandardizeDataForExtraction StandardizeData KeyDirtyData 4| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Wisconsin, Madison, 1992 101| Y. Ioannidis, R. Ng, K. Shim, and T. Sellis. Parametric query optimization. Technical Report, Univ. Of Maryland, College Park, 1992 Interactive data correction(2) incremental incremental Cities incremental incremental

  39. AJAX Architecture

  40. AJAX Demo

More Related