1 / 36

The Oracle9i Multi-Terabyte Data Warehouse

Session id:. The Oracle9i Multi-Terabyte Data Warehouse. Jeff Parker Manager Data Warehouse Development Amazon.com. The Challenges. Rapidly evolving business Growing data volumes Do more with less. The Challenges. Rapidly evolving business New international markets

elan
Télécharger la présentation

The Oracle9i Multi-Terabyte Data Warehouse

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Session id: The Oracle9i Multi-Terabyte Data Warehouse Jeff ParkerManager Data Warehouse Development Amazon.com

  2. The Challenges • Rapidly evolving business • Growing data volumes • Do more with less

  3. The Challenges • Rapidly evolving business • New international markets • Continual innovation of features on Amazon • Buy it used • Magazine subscriptions • Marketplace Partnerships – Toys R Us, Target • Growing data volumes • Do more with less

  4. The Challenges • Rapidly evolving business • Growing data volumes • 2X growth yearly over the past 5 years • Currently 10 Terabytes of raw data • Do more with less

  5. The Challenges • Rapidly evolving business • Growing data volumes • Do more with less • Innovative use of technology and resources • Throwing money and people at the problem is not an option • Leverage existing investment in Oracle

  6. Addressing the issues • Rapidly evolving business • Denormalize only for performance reasons • Create a solution that allows new datasets to be brought in rapidly to the DW, but without high maintenance costs • Growing data volumes • Do more with less

  7. Addressing the issues • Rapidly evolving business • Growing data volumes • Dual database approach to ETL • Staging database for efficient transformation of large datasets. SQL and hash-joins allow transforms to scale in a non-linear fashion • Second database optimized for analytics • Oracle as an API • Simplifies ETL architecture • Better scalability than traditional ETL tools • Do more with less

  8. Addressing the issues • Rapidly evolving business • Growing data volumes • Do more with less • One DW schema supports all countries • Cut costs by eliminating unneeded software • Data driven Load functionality

  9. The ETL Process • Extract data from source • The Load process • Dimensional Transforms

  10. The ETL Process • Extract data from source • Can create one or more files to be loaded • Must produce Metadata upon which the Load process can depend • The Load Process • Dimensional Transforms

  11. Extract produced Metadata • Describes each field in database type terms • Changes as the dataset changes • Can reference multiple files • Very reliable • No additional overhead

  12. XML Based Metadata <DATA CHARSET="UTF8" DELIMITER="\t" ROWS=”1325987> <COLUMNS> <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0“/> <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80“/> <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS“/> <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8“/> </COLUMNS> <FILES> <FILE PATHNAME="/flat/datasets_20020923_US.txt.1“/> <FILE PATHNAME="/flat/datasets_20020923_US.txt.2“/> </FILES> </DATA>

  13. The ETL Process • Extract data from source • The Load Process • Makes extensive use of External Tables • MERGE and Bulk Insert • Contains integrated DBA tasks • Every load is tracked in an operational database • Dimensional Transforms

  14. The Load Process

  15. The Load Process • External Tables • access to files on the operating system • Is a building block in a broader ETL process • MERGE & Bulk Insert • Integrated DBA tasks

  16. The External Table • Created by using Metadata from the Extract process • Data is read-only • No indexes • Use DBMS_STATS to set number of rows

  17. Example External Table • Copy the data to the database server • Data must reside in a file system location specified by the DBA’s. - create directoryDAT_DIR as ‘/stage/flat’

  18. Example External Table 2. Create the external table using the DML from the extract. CREATE TABLE XT_datasets_77909 ( dataset_id NUMBER , dataset_name VARCHAR2(80) , creation_date DATE ,created_by VARCHAR2(8) ) ORGANIZATION EXTERNAL( TYPE ORACLE_LOADER DEFAULT DIRECTORY dat_dir ACCESS PARAMETERS( records delimited by newline characterset UTF8 fields terminated by '\t' LOCATION (‘/flat/datasets_20020923_US.txt' )

  19. The External Table • No pre-staging of data • Ability to describe a flat file to Oracle • Handles horizontally partitioned files • Good error messaging

  20. The Load Process • External Tables • MERGE • Can be run in parallel • Combined with external table provides a powerful set of ETL tools • Integrated DBA tasks

  21. MERGE • Allows for update or insert in a single statement • If key value already exists • Yes, update row • No, insert row • MERGE statement is auto-generated • Row level column transforms are supported

  22. MERGE

  23. MERGE example MERGE into DATASETS ds USING ( SELECT ds.dataset_name ,ds.creation_date ,nvl(created_by,’nobody’) as created_by ,sysdate as last_updated FROM XT_datasets_77909 xt ) src On ( xt.dataset_id = ds.dataset_id ) When matched then UPDATE SET ds.dataset_name = src.dataset_name ,ds.creation_date = src.dataset_name ,ds.created_by = src.created_by ,ds.last_updated = sysdate when not matched then INSERT( dataset_name, creation_date, created_by, last_updated ) VALUES( dataset_name, creation_date, created_by, sysdate )

  24. MERGE • Issues we faced • Duplicate records in the dataset • NESTED-LOOPS from external table • Parallelism is not enabled by default • Bulk Load partition determination

  25. The Load Process • External Tables • MERGE • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error

  26. Integrated DBA Tasks • Provided by the DBA team • Managed by the DBA team • ETL team does not need special knowledge of table layout

  27. Integrated DBA Tasks • Truncate Partition developer makes call truncate_partition( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes alter table TABLE-NAME drop partition dbi20020930_101;

  28. Integrated DBA Tasks • Analyze Partition developer makes call analyze_partition( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes dbms_stats.gather_table_stats(ownname , tabname , partname , cascade , estimate_percent, granularity);

  29. Integrated DBA Tasks • Return Partition Name developer makes call get_partition_name( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and returns the appropriate name of the partition. This is very useful when bulk loading tables.

  30. Integrated DBA Tasks • Partitioning utilities • Helps to streamline the process • Reduces workload of DBA team • Helps to eliminate the problem of double loads for Snapshot tables and partitions

  31. The Load Process • External Tables • Provides access to flat files outside the database • MERGE • Parallel “upsert” simplifies ETL • Row level transforms can be performed in SQL • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error • Loads are repeatable processes

  32. Summary • Reduction in time to integrate new subject areas • Oracle parallelism scales well • Eliminated unneeded software

  33. Summary • Oracle has delivered on the DW promise • Oracle External table combined with MERGE is a viable alternative to other ETL tools • ETL tools are ready today

  34. & Q U E S T I O N S A N S W E R S

  35. Reminder – please complete the OracleWorld session surveyThank you.

More Related