190 likes | 322 Vues
This comprehensive guide explores essential concepts in data warehousing and mining, focusing on vital quality dimensions such as timeliness, relevance, and reliability. It delves into effective data management strategies, the significance of internal and external data sources, and the role of databases in business intelligence. Additionally, it highlights the challenges and benefits associated with data warehousing projects, including the importance of metadata and overcoming organizational barriers. Aimed at helping organizations leverage data for informed decision-making, this resource is crucial for data professionals.
E N D
Data Warehousing/Mining MSC636 Winter 2003
Timeliness Sufficiency Detail/aggregation Understandability Bias Relevance Comparability Reliability Redundancy Cost efficiency Quantifiability Format Data/Information Quality Issues
Internal data Shared sources TPS/operational data Financial Accounting Production HRM Data warehouses/data marts Private/personal data External data Particularly for EIS Commercial/ subscription databases Industry data Governmental/agency/public access data Primary data collection Sources of Data
Data Management • DBMS software well developed • Hierarchical • Simple, predefined relationships • Efficient large-volume processing • Network • Complex, pre-defined relationships • Efficient, but inflexible • Relational • Flexible • Efficient-enough
Data Warehouses • Database system extracted from TPS and other data sources • Specifically designed to support DSS/OLAP/DM applications • Data warehousing • Implies type of data being collected • Implies interest in exploiting data and building knowledge about customers, markets, other environments
Characteristics of DW • Separate from TPS data (for speed) • Subject-oriented (vs. application oriented) • Integrated (filtered & cleaned) • Non-volatile (no user updates) • Time variant (historical, time stamped) • Summarized (various levels of aggregation) • Larger (historical, intentionally non-normalized, 50 GB to 1+ TB) • Data sources (internal & external)
Related Concepts • Data store--> operational databases • Data mart--> smaller scope, devoted to SBU or department (baby step?) • Data warehouse--> generally enterprise-level strategy • Benefits of integrating info needs • Challenges of analysis, implementation, & management
Meta-Data in DW • Descriptions of the data in the DW; how it was acquired, aggregated, processed; and where it was collected • Comparable to repository & data dictionaries • Tables, attributes, & keys • Data sources • Coding, domains, aliases • Record counts, table sizes • Transformation mapping • Explanation of business rules & changes • Algorithms for summarization • Data ownership • Recording DW use history • Logical data models from TPS data
Why Warehouse Data? • Data quality • Data “clean-up” • Less variability in extracted data (spider web) • Data access (30% in Sentry Market survey) • Dependability (cabinet effect) • Simplified access to multiple sources • Summarized and historical data • System performance • Less interference with TPS/OLTP processing • IS staff productivity • UPS, HP: programmer reductions
Why? • Sentry Market survey 1996 • 55%-->DW vital to business objectives • 30%-->Improved enterprise-wide access to data top benefit • Other top reasons: improved data accuracy, better data control, better consistency, & cost savings • >50% of Fortune 500 have DW projects underway or completed • Assumption that DW will be used, decisions will be made differently
Justifying the DW Project • ROI, NPV used to allocate resources to projects • DW costs are uncertain & benefits are qualitative • Suggested approaches: value analysis, intangible value analysis, business value-added analysis, scenario approach • 1996 paper (IDC): avg 3-yr ROI 401% • 62 non-random selection, completed projects • 85% fail to meet objectives • Higher than expected costs, poor utilization, poor planning • 40% never fully implemented • Investigating DW vital, implementation optional
Warehousing Woes • Turf battles, user resistance, & power struggles • Crossing organizational boundaries • Issues of information ownership • High-level champion • Assigning “stewardship” for data with business units with IS assuming technical & advisor responsibilities • Establish conflict resolution process • Demonstrate benefits & roll out continuous progress • Encourage user involvement & management expectations • 7 Deadly Sins • Technology issues (lack of standardization)
OLAP • Approach to DSS (E.F. Codd, 1993) • On-Line Analytical Processing • vs. On-Line Transactional Processing • Requires rapid response to complex queries • Consolidation, drill-down, slicing & dicing • Synonymous with multi-dimensional analysis, BI systems, data analysis systems • Supported by warehoused data • Multi-dimensional databases (MOLAP) • Modified relational databases (ROLAP)
Multi-Dimensional DBMS • Data related in database via • Measurement of interest (sales, inventory, profit) • Dimension 1: aggregated by category (product, region, segment, etc.) • Dimension 2: aggregated by time period (daily, weekly, monthly, etc.) • Data organized in cubes (hypercubes) rather than related tables • Worksheet analogy (sheets, books, more books) • Sparcity creates inefficient storage (50 GB)
Relational OLAP • De-normalized tables • Star schema • Dimension tables (entity related) • Fact tables (creates “slices” of derived data regarding dimensions) • Able to handle more data than MOLAP • Focus on speed justifies redundancies • Beta vs. VHS?
MD-DBMS Realities • Larger storage requirement • up to 40% larger than RDBMS • More expensive DBMS software • up to 50% higher than RDBMS • Implementation slower & maintenance more difficult • Standards not yet set & compatibility limited • IBM, Microsoft, Sybase, Oracle, MicroStrategy, Arbor Essbase, Cognos • Web-enabled OLAP servers • www.cognos.com-->PowerPlay, web-enabled OLAP tool
Data Mining • Competitive advantage of knowledge discovery • Algorithmic models for finding patterns in data and inferring rules from those patterns (one aspect of OLAP) • Siftware, data digging, BI, KDS • IBM, SAS, Hyperion, Cognos • Designed for functional specialists and/or analysts • Future integration into DBMS and industry-specific products
Data Mining • ANN, AI, inferential & predictive statistics • Identify unpredictable patterns not amenable to traditional abstract or conceptual modeling • Classification (assign individuals to categories; supervised training; fraud testing, buyer categories) • Associations (correlation between events or measures; market basket analysis--things bought together) • Sequences (ordering of events; explanatory) • Clustering (creating new groups based on known or unknown attributes)
Data Visualization • Process of turning numerical data into meaningful images • Beyond simple charting/graphing • Tools to integrate analysis & visual presentation in OLAP environment • Relies on human capacity to interpret images more easily than numbers • Geographic Information Systems • Specialized database that stores, retrieves, and analyzes spatial data • Marketing utility