What is Data Warehousing?

What is Data Warehousing?

A data warehousing is a technique for collecting and managing data from varied sources to provide meaningful business insights. It is a blend of technologies and components which allows the strategic use of data.

It is electronic storage of a large amount of information by a business which is designed for query and analysis instead of transaction processing. It is a process of transforming data into information and making it available to users in a timely manner to make a difference.

How Datawarehouse works? • A Data Warehouse works as a central repository where information arrives from one or more data sources. Data flows into a data warehouse from the transactional system and other relational databases. • Data may be: • Structured • Semi-structured • Unstructured data

The data is processed, transformed, and ingested so that users can access the processed data in the Data Warehouse through Business Intelligence tools, SQL clients, and spreadsheets. A data warehouse merges information coming from different sources into one comprehensive database. • By merging all of this information in one place, an organization can analyze its customers more holistically. This helps to ensure that it has considered all the information available. Data warehousing makes data mining possible. Data mining is looking for patterns in the data that may lead to higher sales and profits.

Who needs Data warehouse? • Decision makers who rely on mass amount of data • Users who use customized, complex processes to obtain information from multiple data sources. • It is also used by the people who want simple technology to access the data • It also essential for those people who want a systematic approach for making decisions. • If the user wants fast performance on a huge amount of data which is a necessity for reports, grids or charts, then Data warehouse proves useful. • Data warehouse is a first step If you want to discover 'hidden patterns' of data-flows and groupings.

What Is a Data Warehouse Used For? • Airline • Banking • Healthcare • Public sector • Investment and Insurance sector • Telecommunication • Hospitality Industry

Advantages of Data Warehouse • Data warehouse allows business users to quickly access critical data from some sources all in one place. • Data warehouse provides consistent information on various cross-functional activities. It is also supporting ad-hoc reporting and query. • Data Warehouse helps to integrate many sources of data to reduce stress on the production system. • Data warehouse helps to reduce total turnaround time for analysis and reporting.

Restructuring and Integration make it easier for the user to use for reporting and analysis. • Data warehouse allows users to access critical data from the number of sources in a single place. Therefore, it saves user's time of retrieving data from multiple sources. • Data warehouse stores a large amount of historical data. This helps users to analyze different time periods and trends to make future predictions.

Disadvantages of Data Warehouse • Not an ideal option for unstructured data. • Creation and Implementation of Data Warehouse is surely time confusing affair. • Data Warehouse can be outdated relatively quickly • Difficult to make changes in data types and ranges, data source schema, indexes, and queries.

The data warehouse may seem easy, but actually, it is too complex for the average users. • Despite best efforts at project management, data warehousing project scope will always increase. • Sometime warehouse users will develop different business rules. • Organisations need to spend lots of their resources for training and Implementation purpose.

ETL process • ETL (Extract, Transform and Load) is a process in data warehousing responsible for pulling data out of the source systems and placing it into a data warehouse. ETL involves the following tasks:

Extracting the data • Extracting the data from source systems (SAP, ERP, other operational systems), data from different source systems is converted into one consolidated data warehouse format which is ready for transformation processing.

Transforming the data transforming the data may involve the following tasks: • applying business rules (so-called derivations, e.g., calculating new measures and dimensions), • cleaning (e.g., mapping NULL to 0 or "Male" to "M" and "Female" to "F" etc.), • filtering (e.g., selecting only certain columns to load), • splitting a column into multiple columns and vice versa,

joining together data from multiple sources (e.g., lookup, merge), • transposing rows and columns, • applying any kind of simple or complex data validation (e.g., if the first 3 columns in a row are empty then reject the row from processing)

Loading the data • loading the data into a data warehouse or data repository other reporting applications

OLTP vs. OLAP • We can divide IT systems into transactional (OLTP) and analytical (OLAP). In general we can assume that OLTP systems provide source data to data warehouses, whereas OLAP systems help to analyze it.

On-line Transaction Processing • OLTP is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is put on very fast query processing, maintaining data integrity in multi-access environments and an effectiveness measured by number of transactions per second. In OLTP database there is detailed and current data, and schema used to store transactional databases is the entity model (usually 3NF).

On-line Analytical Processing • OLAP is characterized by relatively low volume of transactions. Queries are often very complex and involve aggregations. For OLAP systems a response time is an effectiveness measure. OLAP applications are widely used by Data Mining techniques. In OLAP database there is aggregated, historical data, stored in multi-dimensional schemas (usually star schema).

Data Warehouse Schema Architecture Data Warehouse environment usually transforms the relational data model into some special architectures. There are many schema models designed for data warehousing but the most commonly used are: • Star Schema • Snowflake Schema

The determination of which schema model should be used for a data warehouse should be based upon the analysis of project requirements, accessible tools and project team preferences.

Multidimensional Data Model • Multidimensional data model stores data in the form of data cube. Mostly, data warehousing supports two or three-dimensional cubes. • A data cube allows data to be viewed in multiple dimensions. A dimensions are entities with respect to which an organization wants to keep records. For example in store sales record, dimensions allow the store to keep track of things like monthly sales of items and the branches and locations.

A multidimensional databases helps to provide data-related answers to complex business queries quickly and accurately. • Data warehouses and Online Analytical Processing (OLAP) tools are based on a multidimensional data model. OLAP in data warehousing enables users to view data from different angles and dimensions.

What is star schema? • The star schema architecture is the simplest data warehouse schema. It is called a star schema because the diagram resembles a star, with points radiating from a center. The center of the star consists of fact table and the points of the star are the dimension tables. Usually the fact tables in a star schema are in third normal form(3NF) whereas dimensional tables are de-normalized. Despite the fact that the star schema is the simplest architecture, it is most commonly used nowadays and is recommended by Oracle.

What is snowflake schema? • The snowflake schema architecture is a more complex variation of the star schema used in a data warehouse, because the tables which describe the dimensions are normalized.

OLAP Operations • Roll-up (drill-up) The roll-up operation performs aggregation on a data cube either by climbing up the hierarchy or by dimension reduction.

Example : Roll-up (drill-up)

Drill-down:-

Slice and dice • The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. Reduces the dimensionality of the cubes.

For example, if we want to make a select where Medal = 5

The dice operation defines a sub-cube by performing a selection on two or more dimensions. • For example, if we want to make a select where Medal = 3 or Location = New York

dice operation

Pivot • Pivot is also known as rotate. It Rotates the data axis to view the data from different perspectives.

In the OLAP world, there are mainly two different types: Multidimensional OLAP (MOLAP) and Relational OLAP (ROLAP). Hybrid OLAP (HOLAP) refers to technologies that combine MOLAP and ROLAP.

Definition of ROLAP • ROLAP is Relational Online Analytical Processing model, where the data is stored as in relational database i.e. rows and columns in the data warehouse. In the ROLAP model data is present in the front of the user in the multidimensional form. To display the data, in a multidimensional view, a semantic layer of metadata is created that maps dimension to the relational tables. Metadata also supports aggregation of the data.

Definition of MOLAP • MOLAP is a Multidimensional Online Analytical Processing model. The data used for analysis is stored in specialized multidimensional databases (MDDBs). The multidimensional database management systems are proprietary software systems.

Difference between ROLAP/MOLAP • ROLAP stands for Relational Online Analytical Processing whereas; MOLAP stands for Multidimensional Online Analytical Processing. • In both the cases, ROLAP and MOLAP data is stored in the main warehouse. In ROLAP data is directly fetched from the main warehouse whereas, in MOLAP data is fetched from the proprietary databases MDDBs. • In ROLAP, data is stored in the form of relational tables but, in MOLAP data is stored in the form of a multidimensional array made of data cubes.

ROLAP deals with large volumes of data whereas, MOLAP deals with limited data summaries kept in MDDBs. • ROLAP engines use complex SQL to fetch data from the data warehouse. However, MOLAP engine creates prefabricated and pre calculated data cubes to present multidimensional view of data to a user and to manage data sparsity in data cubes, MOLAP uses Sparse matrix technology.

ROLAP engine creates a multidimensional view of data dynamically whereas, MOLAP statically stores multidimensional view of data in proprietary databases MDDBs for a user to view it from there. • As ROLAP creates a multidimensional view of data dynamically, it is slower than MOLAP which do not waste time in creating a multidimensional view of data.

What is Data Mart? • A data mart is focused on a single functional area of an organization and contains a subset of data stored in a Data Warehouse. • A data mart is a condensed version of Data Warehouse and is designed for use by a specific department, unit or set of users in an organization. E.g., Marketing, Sales, HR or finance. It is often controlled by a single department in an organization.

Data Mart usually draws data from only a few sources compared to a Data warehouse. Data marts are small in size and are more flexible compared to a Data warehouse.

What is Data Warehousing?