420 likes | 625 Vues
Conceptual Modeling for ETL processes. Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil , asimi , spiros}@dblab . ece . ntua . gr. National Technical University of Athens KDBS Laboratory http://www.dbnet.ece.ntua.gr. General Idea. The problem:
E N D
Conceptual Modeling for ETL processes Panos Vassiliadis, Alkis Simitsis, Spiros Skiadopoulos {pvassil,asimi,spiros}@dblab.ece.ntua.gr National Technical University of Athens KDBS Laboratory http://www.dbnet.ece.ntua.gr
General Idea • The problem: • The conceptual part of the definition of ETL processin the early stages of a DW project • The key idea: • The mapping of the attributes of the data sources to the attributes of the DW tables
Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work
Extract Transform & Clean Load Sources DSA DW Extract-Transform-Load (ETL)
Motivation • Practical necessity • e.g., 80% of the development time in a DW project • In-house development, ad-hoc solutions • Lack of related work • The front end of the DW has monopolized the research on the conceptual part of DW modeling Thus, the design, development and deployment of ETL processes, needs modeling, design and methodological foundations
PK S1.A DW.A Motivation • Early stages of the DW design : • Conceptsare still fuzzy and changing frequently • Lots ofinterviews with people • No time for a full, clean-cut definition of the DW and the ETL workflow • Still, we can: • Trace the mapping of the attributes of the data sources to the attributes of the DW tables • Trace necessary constraints and transformations for the ETL process
Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work
Conceptual Model • Entities of our model: • Concepts • Attributes • Part-of Relationships • Transformations • Serial Composition of Transformations • Provider Relationships • Notes • ETL Constraints • Candidate Relationships
Conceptual Model • Concepts • a name, finite set of attributes • represent an entity in the source database or in the DW • Attributes • same role as in ER/dimensional models • a granular module of information We do not employ standard UML notation for concepts and attributes, for the reason that we need to treat attributes as first class citizens of our model
Conceptual Model • Part-of Relationships • finite set of attributes • emphasize the fact that a concept is composed of a set of attributes
Conceptual Model • Example • Source 1 • S1.PARTSUPP {PKEY, SUPPKEY, QTY, COST} • Data Warehouse • DW.PARTSUPP {PKEY, SUPPKEY, DATE, QTY, COST}
Conceptual Model • Transformations • finite set of input/output attributes, a symbol • abstractions that represent parts, or full modules of code, executing a single task • two categories: • filtering or data cleaning operations (e.g., foreign key violations) • transformation operations (e.g., aggregation)
Conceptual Model • Provider Relationships • finite set of input/output attributes, an appropriate transformation • map a set of input attributes to a set of output attributes through a relevant transformation* * If the attributes are semantically and physically compatible, no transformation is required
Conceptual Model • Notes • informal tags, exactly as in UML modeling • used for: • simple comments explaining design decisions • explanation of the semantics of the applied transformation • tracing of runtime constraints
Conceptual Model • ETL Constraints • finite set of attributes, a single transformation • express the fact that the data of a certain concept fulfill several requirements
Conceptual Model • Candidate Relationships • a single candidate concept, a single target concept • used when a certain DW concept is populated by a finite set of more than one candidate source concepts Active Candidate Relationship • a certain candidate that has been selected for the population of the target concept • a specialization of candidate relationships
Outline • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work
Instantiation & Specialization Layers • The key issues: • generecity • identification of a small set of generic constructs to capture all cases • usability • construction of a ‘palette’ of frequently used types
Instantiation & Specialization Layers • Metamodel layer • a set of generic entities, able to represent any ETL scenario • involves classes: Concept, Attribute, Transformation, ETL Constraint and Relationship • Template layer • a set of ‘built-in’ specializations of the entities of the Metamodel layer, specifically tailored for the most frequent elements of ETL scenarios • Schema layer • a specific ETL scenario • all the entities of the Schema layer are instances of the classes of the Metamodel layer
Instantiation & Specialization Layers • Template layer • Four groups of logical transformations • Filters • Unary transformations • Binary transformations • Composite transformations • Two groups of physical transformations • Transfer operations • File operations
Filters Selection (σ) Not null (NN) Primary key violation (PK) Foreign key violation (FK) Unique value (UN) Domain mismatch DM) Unary transformations Push Aggregation (γ) Projection (π) Function application (f) Surrogate key assignment(SK) Tuple normalization (N) Tuple denormalization (DN) Binary transformations Union (U) Join () Diff (Δ) Update Detection (ΔUPD) Composite transformations Slowly changing dimension (Type 1,2,3) (SDC-1/2/3) Format mismatch (FM) Data type conversion (DTC) Switch (σ*) Extended union (U) File operations EBCDIC to ASCII conversion (EB2AS) Sort file (Sort) Transfer operations Ftp (FTP) Compress/Decompress (Z/dZ) Encrypt/Decrypt (Cr/dCr) Instantiation & Specialization Layers
Outline • Introduction • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work
Methodology • Step 1 • Identification of the proper data stores • Step 2 • Candidates and active candidates for the involved data stores • Step 3 • Attribute mapping between the providers and the consumers • Step 4 • Annotating the diagram with runtime constraints
Outline • Introduction • Motivation • Conceptual Model • Instantiation and Specialization Layers • Methodology for the usage of the conceptual model • Conclusions and Future Work
Conclusions • Our contributions lies in: • The proposal of a novel conceptual model which is customized for the tracing of inter-attributerelationships and the respective ETL activities • A customizable and extensible construction • The introduction of a 'palette' of a set of frequently used ETL activities
On-going/Future Work The Arktos II project is aimed towards the • Conceptual modeling • Logical modeling • Optimization • What-if analysis of ETL scenarios http://www.dblab.ece.ntua.gr/ ~pvassil/projects/arktos_II
Logical Model [DMDW’02] DS.PS1.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW1 DS.PS_NEW1.PKEY, DS.PS_OLD1.PKEY SUPPKEY=1 COST DATE DS.PS1 DIFF1 A2EDate SK1 $2€ Add_SPK1 DS.PS_OLD1 U rejected rejected rejected Log Log Log DS.PS2.PKEY, LOOKUP_PS.SKEY, SUPPKEY DS.PS_NEW2 DS.PS_NEW2.PKEY, DS.PS_OLD2.PKEY SUPPKEY=2 COST DATE=SYSDATE QTY>0 DS.PS2 NotNULL AddDate Add_SPK2 SK2 CheckQTY DIFF2 DS.PS_OLD2 rejected rejected Log Log DSA PKEY, DAY MIN(COST) DW.PARTSUPP S1_PARTSUPP V1 Aggregate1 FTP1 PKEY, MONTH AVG(COST) DW.PARTSUPP.DATE, DAY TIME S2_PARTSUPP V2 Aggregate2 FTP2 Sources DW
Conceptual Model • General Notes • It is not a process/workflow model • It is orthogonal to the conceptual models which are available for the modeling of DW star schemata • It is specifically tailored for the back end of the DW • Any of the proposals for the DW front end can be combined with our approach
Conceptual Model • Serial Composition of Transformations • a single initiating transformation, a single subsequent transformation • combine several transformations in a single provider relationship