The Oracle9i Multi-Terabyte Data Warehouse
Session id: The Oracle9i Multi-Terabyte Data Warehouse. Jeff Parker, Manager, Data Warehouse Development, Amazon.com
The Challenges • Rapidly evolving business • Growing data volumes • Do more with less
The Challenges • Rapidly evolving business • New international markets • Continual innovation of features on Amazon • Buy it used • Magazine subscriptions • Marketplace Partnerships – Toys R Us, Target • Growing data volumes • Do more with less
The Challenges • Rapidly evolving business • Growing data volumes • 2X growth yearly over the past 5 years • Currently 10 Terabytes of raw data • Do more with less
The Challenges • Rapidly evolving business • Growing data volumes • Do more with less • Innovative use of technology and resources • Throwing money and people at the problem is not an option • Leverage existing investment in Oracle
Addressing the issues • Rapidly evolving business • Denormalize only for performance reasons • Create a solution that lets new datasets be brought into the DW rapidly, without high maintenance costs • Growing data volumes • Do more with less
Addressing the issues • Rapidly evolving business • Growing data volumes • Dual database approach to ETL • Staging database for efficient transformation of large datasets. SQL and hash-joins allow transforms to scale in a non-linear fashion • Second database optimized for analytics • Oracle as an API • Simplifies ETL architecture • Better scalability than traditional ETL tools • Do more with less
Addressing the issues • Rapidly evolving business • Growing data volumes • Do more with less • One DW schema supports all countries • Cut costs by eliminating unneeded software • Data-driven Load functionality
The ETL Process • Extract data from source • The Load process • Dimensional Transforms
The ETL Process • Extract data from source • Can create one or more files to be loaded • Must produce Metadata upon which the Load process can depend • The Load Process • Dimensional Transforms
Extract-produced Metadata • Describes each field in database type terms • Changes as the dataset changes • Can reference multiple files • Very reliable • No additional overhead
XML Based Metadata

<DATA CHARSET="UTF8" DELIMITER="\t" ROWS="1325987">
  <COLUMNS>
    <COLUMN ID="dataset_id" DATA_TYPE="NUMBER" DATA_PRECISION="38" DATA_SCALE="0"/>
    <COLUMN ID="dataset_name" DATA_TYPE="VARCHAR2" DATA_LENGTH="80"/>
    <COLUMN ID="CREATION_DATE" DATA_TYPE="DATE" DATE_MASK="YYYY/MM/DD.HH24:MI:SS"/>
    <COLUMN ID="CREATED_BY" DATA_TYPE="VARCHAR2" DATA_LENGTH="8"/>
  </COLUMNS>
  <FILES>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.1"/>
    <FILE PATHNAME="/flat/datasets_20020923_US.txt.2"/>
  </FILES>
</DATA>
The ETL Process • Extract data from source • The Load Process • Makes extensive use of External Tables • MERGE and Bulk Insert • Contains integrated DBA tasks • Every load is tracked in an operational database • Dimensional Transforms
The Load Process • External Tables • access to files on the operating system • Is a building block in a broader ETL process • MERGE & Bulk Insert • Integrated DBA tasks
The External Table • Created by using Metadata from the Extract process • Data is read-only • No indexes • Use DBMS_STATS to set number of rows
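Because an external table carries no real optimizer statistics, the ROWS count from the extract metadata can be handed to the optimizer directly. A minimal sketch, reusing the table name and row count from the example slides (schema resolution via USER is an assumption):

```sql
-- Feed the ROWS count from the extract metadata to the optimizer
-- so it can plan joins against the external table sensibly.
-- Table name and row count come from the earlier example slides.
BEGIN
  DBMS_STATS.SET_TABLE_STATS(
    ownname => USER,
    tabname => 'XT_DATASETS_77909',
    numrows => 1325987
  );
END;
/
```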
Example External Table 1. Copy the data to the database server. Data must reside in a file system location specified by the DBAs: CREATE DIRECTORY dat_dir AS '/stage/flat';
Example External Table 2. Create the external table using the DDL from the extract.

CREATE TABLE XT_datasets_77909 (
  dataset_id    NUMBER,
  dataset_name  VARCHAR2(80),
  creation_date DATE,
  created_by    VARCHAR2(8)
)
ORGANIZATION EXTERNAL (
  TYPE ORACLE_LOADER
  DEFAULT DIRECTORY dat_dir
  ACCESS PARAMETERS (
    RECORDS DELIMITED BY NEWLINE CHARACTERSET UTF8
    FIELDS TERMINATED BY '\t'
  )
  LOCATION ( 'datasets_20020923_US.txt' )
);
The External Table • No pre-staging of data • Ability to describe a flat file to Oracle • Handles horizontally partitioned files • Good error messaging
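Horizontally partitioned extracts fit naturally here: the LOCATION clause may list several files (such as the two produced by the extract on the metadata slide), and Oracle presents them as a single logical table. A sketch of just that clause, with file names relative to the dat_dir directory from the earlier example:

```sql
-- Fragment only: multiple flat files exposed as one external table.
-- File names follow the metadata example, relative to dat_dir.
LOCATION ( 'datasets_20020923_US.txt.1',
           'datasets_20020923_US.txt.2' )
```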
The Load Process • External Tables • MERGE • Can be run in parallel • Combined with external table provides a powerful set of ETL tools • Integrated DBA tasks
MERGE • Allows for update or insert in a single statement • If key value already exists • Yes, update row • No, insert row • MERGE statement is auto-generated • Row level column transforms are supported
MERGE example

MERGE INTO DATASETS ds
USING (
  SELECT xt.dataset_id,
         xt.dataset_name,
         xt.creation_date,
         NVL(xt.created_by, 'nobody') AS created_by
  FROM XT_datasets_77909 xt
) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE SET
  ds.dataset_name  = src.dataset_name,
  ds.creation_date = src.creation_date,
  ds.created_by    = src.created_by,
  ds.last_updated  = SYSDATE
WHEN NOT MATCHED THEN
  INSERT ( dataset_id, dataset_name, creation_date, created_by, last_updated )
  VALUES ( src.dataset_id, src.dataset_name, src.creation_date, src.created_by, SYSDATE );
MERGE • Issues we faced • Duplicate records in the dataset • NESTED-LOOPS from external table • Parallelism is not enabled by default • Bulk Load partition determination
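Since parallelism is not enabled by default, parallel DML has to be switched on per session and the statement hinted. A hedged sketch (the degree of 8 is an illustrative assumption, not a value from the talk):

```sql
-- Parallel DML is off by default: enable it for the session,
-- then hint both sides of the MERGE. Degree 8 is only an example.
ALTER SESSION ENABLE PARALLEL DML;

MERGE /*+ PARALLEL(ds, 8) */ INTO DATASETS ds
USING ( SELECT /*+ PARALLEL(xt, 8) */ xt.*
        FROM XT_datasets_77909 xt ) src
ON ( src.dataset_id = ds.dataset_id )
WHEN MATCHED THEN UPDATE SET ds.last_updated = SYSDATE
WHEN NOT MATCHED THEN
  INSERT ( dataset_id, last_updated )
  VALUES ( src.dataset_id, SYSDATE );
```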
The Load Process • External Tables • MERGE • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error
Integrated DBA Tasks • Provided by the DBA team • Managed by the DBA team • ETL team does not need special knowledge of table layout
Integrated DBA Tasks • Truncate Partition developer makes call truncate_partition( 'TABLE-NAME', partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes alter table TABLE-NAME truncate partition dbi20020930_101;
Integrated DBA Tasks • Analyze Partition developer makes call analyze_partition( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and executes dbms_stats.gather_table_stats(ownname , tabname , partname , cascade , estimate_percent, granularity);
Integrated DBA Tasks • Return Partition Name developer makes call get_partition_name( ‘TABLE-NAME’, partition-key1, partition-key2, partition-key3 ) DBA utility translates this and returns the appropriate name of the partition. This is very useful when bulk loading tables.
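The slides do not show the utility's internals, but one plausible sketch of get_partition_name is a lookup against the data dictionary, assuming the dbiYYYYMMDD_NNN naming pattern seen on the truncate example slide. Everything here (the signature, the single date key, the pattern) is a hypothetical illustration, not Amazon's actual code:

```sql
-- Hypothetical sketch: resolve a snapshot-date key to a partition name
-- via the dictionary, assuming the dbiYYYYMMDD_NNN convention shown
-- earlier. The real utility took several generic partition keys.
CREATE OR REPLACE FUNCTION get_partition_name(
  p_table_name    IN VARCHAR2,
  p_snapshot_date IN DATE
) RETURN VARCHAR2
IS
  v_name VARCHAR2(30);
BEGIN
  SELECT partition_name
    INTO v_name
    FROM all_tab_partitions
   WHERE table_name = UPPER(p_table_name)
     AND partition_name LIKE
         'DBI' || TO_CHAR(p_snapshot_date, 'YYYYMMDD') || '%';
  RETURN v_name;
END;
/
```

Callers can then target a bulk load at exactly that partition, which is what makes the call useful when bulk loading tables.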
Integrated DBA Tasks • Partitioning utilities • Helps to streamline the process • Reduces workload of DBA team • Helps to eliminate the problem of double loads for Snapshot tables and partitions
The Load Process • External Tables • Provides access to flat files outside the database • MERGE • Parallel “upsert” simplifies ETL • Row level transforms can be performed in SQL • Integrated DBA tasks • Reduces workload required by the DBA team • Streamlines the load process • Eliminates human error • Loads are repeatable processes
Summary • Reduction in time to integrate new subject areas • Oracle parallelism scales well • Eliminated unneeded software
Summary • Oracle has delivered on the DW promise • Oracle External table combined with MERGE is a viable alternative to other ETL tools • ETL tools are ready today
Questions & Answers
Reminder: please complete the OracleWorld session survey. Thank you.