1 / 41

Conceptual Modeling for ETL processes

Conceptual Modeling for ETL processes. Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil , asimi , spiros}@dblab . ece . ntua . gr. National Technical University of Athens KDBS Laboratory http://www.dbnet.ece.ntua.gr. General Idea. The problem:

niyati
Télécharger la présentation

Conceptual Modeling for ETL processes

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Conceptual Modeling for ETL processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil,asimi,spiros}@dblab.ece.ntua.gr National Technical University of Athens KDBS Laboratory http://www.dbnet.ece.ntua.gr

  2. General Idea • The problem: • The conceptual part of the definition of ETL processin the early stages of a DW project • The key idea: • The mapping of the attributes of the data sources to the attributes of the DW tables

  3. Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work

  4. Extract Transform & Clean Load Sources DSA DW Extract-Transform-Load (ETL)

  5. Motivation • Practical necessity • e.g., 80% of the development time in a DW project • In-house development, ad-hoc solutions • Lack of related work • The front end of the DW has monopolized the research on the conceptual part of DW modeling Thus, the design, development and deployment of ETL processes, needs modeling, design and methodological foundations

  6. PK S1.A DW.A Motivation • Early stages of the DW design : • Conceptsare still fuzzy and changing frequently • Lots ofinterviews with people • No time for a full, clean-cut definition of the DW and the ETL workflow • Still, we can: • Trace the mapping of the attributes of the data sources to the attributes of the DW tables • Trace necessary constraints and transformations for the ETL process

  7. Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work

  8. Conceptual Model • Entities of our model: • Concepts • Attributes • Part-of Relationships • Transformations • Serial Composition of Transformations • Provider Relationships • Notes • ETL Constraints • Candidate Relationships

  9. Conceptual Model • Concepts • a name, finite set of attributes • represent an entity in the source database or in the DW • Attributes • same role as in ER/dimensional models • a granular module of information We do not employ standard UML notation for concepts and attributes, for the reason that we need to treat attributes as first class citizens of our model

  10. Conceptual Model • Part-of Relationships • finite set of attributes • emphasize the fact that a concept is composed of a set of attributes

  11. Conceptual Model • Example • Source 1 • S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST} • Data Warehouse • DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}

  12. Conceptual Model

  13. Conceptual Model • Transformations • finite set of input/output attributes, a symbol • abstractions that represent parts, or full modules of code, executing a single task • two categories: • filtering or data cleaning operations (e.g., foreign key violations) • transformation operations (e.g., aggregation)

  14. Conceptual Model • Provider Relationships • finite set of input/output attributes, an appropriate transformation • map a set of input attributes to a set of output attributes through a relevant transformation* * If the attributes are semantically and physically compatible, no transformation is required

  15. Conceptual Model

  16. Conceptual Model • Notes • informal tags, exactly as in UML modeling • used for: • simple comments explaining design decisions • explanation of the semantics of the applied transformation • tracing of runtime constraints

  17. Conceptual Model

  18. Conceptual Model • ETL Constraints • finite set of attributes, a single transformation • express the fact that the data of a certain concept fulfill several requirements

  19. Conceptual Model

  20. Conceptual Model • Candidate Relationships • a single candidate concept, a single target concept • used when a certain DW concept is populated by a finite set of more than one candidate source concepts Active Candidate Relationship • a certain candidate that has been selected for the population of the target concept • a specialization of candidate relationships

  21. Conceptual Model

  22. Conceptual Model

  23. Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work

  24. Instantiation & Specialization Layers • The key issues: • generecity • identification of a small set of generic constructs to capture all cases • usability • construction of a ‘palette’ of frequently used types

  25. Instantiation & Specialization Layers • Metamodel layer • a set of generic entities, able to represent any ETL scenario • involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship • Template layer • a set of ‘built-in’ specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios • Schema layer • a specific ETL scenario • all the entities of the Schema layer are instances of the classes of the Metamodel layer

  26. Instantiation & Specialization Layers

  27. Instantiation & Specialization Layers • Template layer • Four groups of logical transformations • Filters • Unary transformations • Binary transformations • Composite transformations • Two groups of physical transformations • Transfer operations • File operations

  28. Filters Selection (σ) Not null (NN) Primary key violation (PK) Foreign key violation (FK) Unique value (UN) Domain mismatch DM) Unary transformations Push Aggregation (γ) Projection (π) Function application (f) Surrogate key assignment(SK) Tuple normalization (N) Tuple denormalization (DN) Binary transformations Union (U) Join () Diff (Δ) Update Detection (ΔUPD) Composite transformations Slowly changing dimension (Type 1,2,3) (SDC-1/2/3) Format mismatch (FM) Data type conversion (DTC) Switch (σ*) Extended union (U) File operations EBCDIC to ASCII conversion (EB2AS) Sort file (Sort) Transfer operations Ftp (FTP) Compress/Decompress (Z/dZ) Encrypt/Decrypt (Cr/dCr) Instantiation & Specialization Layers

  29. Outline • Introduction • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work

  30. Methodology • Step 1 • Identification of the proper data stores • Step 2 • Candidates and active candidates for the involved data stores • Step 3 • Attribute mapping between the providers and the consumers • Step 4 • Annotating the diagram with runtime constraints

  31. Outline • Introduction • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work

  32. Conclusions • Our contributions lies in: • The proposal of a novel conceptual model which is customized for the tracing of inter-attributerelationships and the respective ETL activities • A customizable and extensible construction • The introduction of a 'palette' of a set of frequently used ETL activities

  33. On-going/Future Work The Arktos II project is aimed towards the • Conceptual modeling • Logical modeling • Optimization • What-if analysis of ETL scenarios http://www.dblab.ece.ntua.gr/ ~pvassil/projects/arktos_II

  34. Thank you

  35. Back-up slides

  36. Logical Model [DMDW’02] DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2  Aggregate2 FTP2 Sources DW

  37. Conceptual Model

  38. The lifecycle of a Data Warehouse and its ETL processes

  39. Conceptual Model

  40. Conceptual Model • General Notes • It is not a process/workflow model • It is orthogonal to the conceptual models which are available for the modeling of DW star schemata • It is specifically tailored for the back end of the DW • Any of the proposals for the DW front end can be combined with our approach

  41. Conceptual Model • Serial Composition of Transformations • a single initiating transformation, a single subsequent transformation • combine several transformations in a single provider relationship

More Related