1 / 57

BI & DM for CRM

BI & DM for CRM. Lecture 2 Data Warehouse and OLAP. Steps in Knowledge Discovery. Steps in KD cont. Data Cleaning : To remove noise and inconsistent data / check data validity & correctness – PLAY WİTH DATA ( good step to know your data)

Télécharger la présentation

BI & DM for CRM

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.


Presentation Transcript

  1. BI & DM for CRM Lecture 2 Data Warehouse and OLAP

  2. Steps in Knowledge Discovery

  3. Steps in KD cont... • Data Cleaning: Toremovenoiseandinconsistent data / check data validity & correctness – PLAY WİTH DATA (good step toknowyour data) • Data İntegration: Tocombinemultiplesources of data / integratevarious data sources on variousplatforms (DBMS/flatfiles/spreedsheetsetc..) • Data Selection: Retrieverelevent data from data repositoryforanalysis • Data Transformation: Totransformorconsolidate data intodifferentformstodeploy in DM operations. (If Data Warehouseswereused, theprocesswould be beforethe data selection.) • Data Mining: Toapplyintelligentmodelsforextractingpatternsfrom data. • Pattern Evaluation: Toidentifyinterestingpatterns, result of DM process, on somemeasures. • Knowledge presentation: Tovisualizetherepresentation of knowledge, drivenby DM processandpatternevaluationz

  4. DM System Architecture

  5. So, what is Data Warehouse? Looselyspeaking; • A decision support database that is maintained separately from the organization’s operational database • Support information processing by providing a solid platform of consolidated, historical data for analysis. • W.H. Inmon- ‘ A data warehouse is a subject-oriented, integrated, time-varientandnonvolatilecollection of data in support of management’sdecisionmakingproces.’

  6. DW—Subject-Oriented • Organized around major subjects, such as customer, product, sales. • Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing. • Provide a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process. 6

  7. DW—Integrated • Constructed by integrating multiple, heterogeneous data sources • relational databases, flat files, on-line transaction records • Data cleaning and data integration techniques are applied. • Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources • E.g., Hotel price: currency, tax, breakfast covered, etc. • When data is moved to the warehouse, it is converted. 7

  8. DW—Time Variant • The time horizon for the data warehouse is significantly longer than that of operational systems. • Operational database: current value data. • Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) • Every key structure in the data warehouse • Contains an element of time, explicitly or implicitly • But the key of operational data may or may not contain “time element”. 8

  9. DW—Non-Volatile • A physically separate store of data transformed from the operational environment. • Operational update of data does not occur in the data warehouse environment. • Does not require transaction processing, recovery, and concurrency control mechanisms • Requires only two operations in data accessing: • initial loading of data and access of data. 9

  10. On-line transaction processing vs On-line Analytical Processing 10

  11. Why seperate Database? • Main reason is high performance for both systems – processing OLAP queries in transactional DBs would substantially degrade the performance of operational tasks • Unlike OLTP, OLAP usually needs read only access from the database. So, for OLAP operations DBs does not require concurrent control and recovery. If applied for OLAP this may jeopardize the execution of OLTP operations • Systems are mainly used for different operations...

  12. Data Model- Multidimentional DW & OLAP are based on multidementional data models... Actually... It is all about multidimentional data models...

  13. 2D view of data Facebook Data...

  14. 3D view of Data cont

  15. LATTICE OF CUBOIDS ALL 0 D – Apex cuboid Posting 1-D Cuboids Gender Education Time Time, Posting Time, Gender Time, Education Posting, Gender 2-D Cuboids Time, Education Gender, Education Time, Posting, Gender Time, Posting, Education Time, Gender, Education Posting, Gender, Education 3-D Cuboids Time, Posting, Gender, Education 4-D Cuboids

  16. Latice & Cuboids • Any n-D data as a series of (n-1)-D “cubes” • In data warehousing literature, • A data cube is referred to as a cuboid • The lattice of cuboids forms a data cube. • The cuboid holding the lowest level of summarization is called a • the 4-D cuboid is the base cuboid for the given four dimensions • base cuboid. • The top most 0-D cuboid, which holds the highest-level of summarization, is called the apex cuboid. • Here this is the total posting • typically denoted by all

  17. Conceptual Modelling of DW • Modeling data warehouses: dimensions & measures • Star schema: A fact table in the middle connected to a set of dimension tables • Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake • Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation


  19. Snow Flake Schema FACT TABLE

  20. Fact Constellation

  21. Categorization & Computation • How are measures computed? • Data Cube measure is a numarical function that can be evaluated at each point in the data cube space. • Distributive- sum(), min(), max()- • How many posts are from USA? • Sum (Washington)+Sum(Nevada).... • Algebraic: sum(),count(), standard_deviation()... • Holisatic: median(), mode()

  22. Concept Hierarchies • What is it? Flow of a concept (could be dimension) from lower levels to higher levels. More specific More General

  23. Example for concept hierarchyLocation all all Europe ... North_America region Germany ... Spain Canada ... Mexico country Vancouver ... city Frankfurt ... Toronto L. Chan ... M. Wind office 23

  24. HierarchicalvslatticeStructures(ConceptHierarchy) year country quarter Province or state week month city Street day We also have user defined concept hierarcies like Fiscal year or Academic Year

  25. How can we use Concept Hierarchies useful in OLAP? • In multi dimensional model, data are organized in multi dimensions and each dimension contains multi level of abstraction defined by concept hierarchies • This type of organization provides user to view the data from various perspectives • Basically, OLAP provides user friendly environment for interactive data analysis. OLAP Operations in Multi Dimensional Data

  26. Roll-up (drill-up) • Drill-down • Slice and Dice: The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. • Pivot: visualization operation that rotates the data axes in view in order to provide an alternative presentation of the data. OLAP ACTIONS

  27. Roll-up (drill-up):Performs aggregation on a data cube, either by climbing up a concept hierarchy for a dimension or dimension reduction. Hierarchy Roll Up Dimension Roll Up Roll up action exp...

  28. Drill Down:Drill down is reverse of roll-up. It navigates from general hierarchy to more specific hierarchy. Adding new dimension to data Drill Down Drill down action exp...

  29. The slice operation performs a selection on one dimension of the given cube, resulting in a subcube. The operation defines a subcube by performing a selection on two or more dimensions. Slice Dice Slice & Dice Action exp...

  30. drill across: involving (across) more than one fact table • drill through: through the bottom level of the cube to its back-end relational tables (using SQL) • ranking the top N or bottom N items in lists • moving averages • growth rates • interests Other OLAP Actions

  31. Radial lines from a central point • each line represents a concept hierarchy for a dimension • each abstraction level is called a footprint • granularities available for use by OLAP • four radial lines for concept hierarchies • location,customer,item,time • time line has 4 footprints: • day,month,quarter,year Star-Net Query Model

  32. A Star-Net Query Model Customer Orders Shipping Method Customer CONTRACTS AIR-EXPRESS ORDER TRUCK PRODUCT LINE Time Product ANNUALY QTRLY DAILY PRODUCT ITEM PRODUCT GROUP CITY SALES PERSON COUNTRY DISTRICT REGION DIVISION Each circle is called a footprint Location Promotion Organization

  33. Design of a Data Warehouse: A Business Analysis Framework • Four views regarding the design of a data warehouse • Top-down view • allows selection of the relevant information necessary for the data warehouse • Data source view • exposes the information being captured, stored, and managed by operational systems • Data warehouse view • consists of fact tables and dimension tables • Business query view • sees the perspectives of data in the warehouse from the view of end-user

  34. Data Warehouse Design Process • Top-down, bottom-upapproachesor a combination of both • Top-down: Starts withoveralldesignandplanning (mature) • Bottom-up: Starts withexperimentsandprototypes (rapid) • From software engineeringpoint of view • Waterfall: structuredandsystematicanalysis at each step beforeproceedingtothenext • Spiral: rapidgeneration of increasinglyfunctionalsystems, shortturnaround time, quickturnaround • Typical data warehousedesignprocess • Choose a businessprocessto model, e.g., orders, invoices, etc. • Choosethegrain (atomiclevel of data) of thebusinessprocess • Choosethedimensionsthatwillapplytoeachfacttablerecord • Choosethemeasurethatwillpopulateeachfacttablerecord

  35. Enterprise Warehouse • Collects all information about subject spanning of the entire organization • Data Mart • A subset of corporate-wide data that is valuable to specific groups of users. Such as marketing • Virtual Warehouse • A set of views over operational databases • Only some of the possible summary views may be materialized Three Data Warehouse Models

  36. Cuboids are referred as aggregations • One factor affecting storage requirements • Sparsity: the amount of empty cells in a cube • The base cuboid is likely to contain many empty cells • it is a spares cube or array • the 0 or lower dimensional cuboids are less spares than the higher dimensional ones • it is not likely that they contain empty cells • Moving along higher levels for the dimension hierarchy • the cuboids becomes less spares or more dense Storage of a cube

  37. Two dimensional sparse cuboid One dimensional dense cuboid

  38. ROLAP – Relational OLAP • MOLAP – Multidimensional OLAP • HOLAP – Hybrid OLAP OLAP Server Architectures

  39. Use relational or extended-relational DBMS to store and manage warehouse data and OLAP middle ware to support missing pieces • query response is generally slower • low storage requirement • Include optimization of DBMS backend, implementation of aggregation navigation logic, and additional tools and services • greater scalability • appropriate for large data sets that are infrequently queried • historical data from less recent previous years ROLAP

  40. Array-based multidimensional storage engine (sparse matrix techniques) • fast indexing to pre-computed summarized data • a two-level storage representation • dense subcubes are stored as array structures • spars subcubes are stored by compression techniques • appropriate for cubes with frequent use and rapid query response MOLAP

  41. combines ROLAP and MOLAP benefiting from • greater scalability of ROLAP • faster computation of MOLAP • Large volumes of data base cuboid is stored in a relational database • aggregations are stored as arrays • appropriate for for cubes that requre • rapid query response for summaries based on a large amount of base data HOLAP

  42. Data cubes can be viewed as lattice of cuboids • The bottom cuboid is the base cuboid • The top is the apex cuboid • What is the number of cuboids for N dimensional data cube? 2N • OLAP computes at least some of the cuboids • For fast response • For avoiding redundant calculation Efficient Data Cube Computation

  43. all 0-D(apex) cuboid country product date 1-D cuboids product,date product,country date, country 2-D cuboids 3-D(base) cuboid product, date, country Example

  44. Three Types • Materialize every cuboids • Huge amounts of memory space • Non materialization – Zero cube calculation • Show processing of queries • Some (partial) materialization • Trade off between storage space and response time • Selection of which cuboids to materialize • Based on size, sharing, access frequency and etc Materialization of data cube

  45. Complete load of the cube • all dimension and fact table data is read and • all specified aggregations cuboids are calculated • process a cube when • its structure is new or • its dimensions or measures have been edited • Incrementally updating a cube • new data is added but existing data not changed and cube structure si the same • Refreshing • data cleared and reloaded • its aggregations recalculated • faster then processing:no design of aggregation tables Processing the cubes

  46. Dimension member or measure whose value is computed at run time using an expression • Only the definitions are stored but values exists only in memory upon a query • do not increase in cube size • Ex: if sales and cost are included in the base fact table • a profit measure can be a calculated member • profit = sales – cost • Average_sales = sales/#_items_sold Calculated Members

  47. Combination of multiple cubes in one logical cube • can be based on a single cube to expose only selected subsets of measures and dimensions • Require no physical space • store only the dimensionsinformation not actualdata • provide a valuable security function • limiting the access of some users Virtual Cubes

  48. Attribute of a dimension member • provides additional information about the member • a column in the same dimension table as the associated members • used in queries • provide users more options when analysing cube data Member Cubes

  49. A typical time table: (time_id,day,month,quarter,year,businessday,leap,day of the week) dimension levels day<month<quarter<year • member properties for day: weekend or business day:0 or 1 • day of the week:1,2,3,...,7 • a member property for year is whether it is leap year or not:0 or 1 Exp Member Property – Time Table

  50. Logical dimension based on a member property of a level in a physical dimension • enables users to analyse cube data based on the member properties of dimension levels • add a virtual dimension to a cube only if • the dimension that supplies its member property is also included in the cube • adding a virtual dimension does not increase cube size • not affect cube processing time • calculated in memory when needed • query processing time is slower Virtual Dimension

More Related