modern data warehousing mining and visualization core concepts n.
Skip this Video
Loading SlideShow in 5 Seconds..
Modern Data Warehousing, Mining, and Visualization: Core Concepts PowerPoint Presentation
Download Presentation
Modern Data Warehousing, Mining, and Visualization: Core Concepts

Modern Data Warehousing, Mining, and Visualization: Core Concepts

170 Vues Download Presentation
Télécharger la présentation

Modern Data Warehousing, Mining, and Visualization: Core Concepts

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript

  1. Chapter 7: The Future of Data Mining, Warehousing, and Visualization Modern Data Warehousing, Mining, and Visualization: Core Concepts

  2. 7-1: The Future of Data Warehousing • As a DW becomes a mature part of an organization, it is likely that it will become as “transparent” as any other part of the IS. • One challenge to face is coming up with a workable set of rules that ensure privacy as well as facilitating the use of large data sets. • Another is the need to store unstructured data such as multimedia, maps and sound. • The growth of the Internet allows integration of external data into a DW, but its varying quality is likely to lead to the evolution of third-party intermediaries whose purpose is to rate data quality.

  3. Predicting the Future • In a technology-intensive area, it doesn’t pay to get too far ahead of the curve. • “The past is the best prelude to history.” • Old example: Josephson junctions. • A switching element based on superconductivity --- rendered useless by IC’s. • New Example: Quantum computing. • We’re inventing clever algorithms for a device that may well never exist.

  4. The data explosion • The amount of data stored in electronic storage media increases at a fast pace • UC Berkeley estimated that 5 Exabytes of new data were generated in 2002 • Amount of data doubles every 18-24 months • 1 Exabyte = 1 billion Gigabytes • It took 300,000 years for humans to accumulate 12 Exabytes of information, it took only 2.5 years more for the next 12 Exabytes

  5. Billion Gigs A guide to collective names for scientific units • Kilo = 103 • Mega = 106 • Giga = 109 • Tera = 1012 • Peta = 1015 • Exa = 1018 • Zetta = 1021 • Yotta = 1024

  6. The data explosion • In March 2007, an IDC study reported that 161 Exabytes of new data were generated in the year 2006. At the same time, 185 Exabytes of storage were available.

  7. The data explosion • In March 2008, another IDC study reported that, at 281 billion gigabytes (281 exabytes), the digital universe in 2007 was 10% bigger than originally estimated ! •

  8. The data explosion • According to the June 2009 update of the Cisco Visual Networking Index IP traffic forecast, by 2013, annual global IP traffic will reach two-thirds of a zettabyte or 667 exabytes. • Internet video will generate over 18 exabytes per month in 2013. • Global mobile data traffic will grow at a CAGR of 131 percent between 2008 and 2013, reaching over two exabytes per month by 2013.

  9. Long-Lived Themes • Very high-level query languages. • If you are going to deal with very large amounts of data, there has to be a lot of uniformity in what you do. • SQL-based user interfaces, like QBE in Access will be central to the future of Data Warehouses • Query optimization. • The success of a very high-level language depends on the ability to produce efficient implementations.

  10. Some Good, New Directions • Languages and systems for automating the process of integrating databases. • Everyone acts as if this problem were solved, but it is not. • Stream data collection processing. • Many applications where data whizzes by so fast that storage and processing are limited. • E.g., telecom billing, intrusion detection, etc.

  11. More New Directions • New kinds of data • e.g., images, audio. • Data mining: • SAS Enterprise Mining-type GUI interfaces • Automation of database design and tuning. • Exploiting new architectures: • Parallel database machines. • Peer-to-peer and distributed systems.

  12. Integrated Architecture • Historically, market and business forces have moved organizations toward ineffective nonintegrated DW systems . • Far too often, a “silo” DW simply replaces a silo OLTP system. • To survive in a future world of low-cost, turnkey application systems, the transition to a federated architecture must be made.

  13. i2 Supply Chain Oracle Financials Siebel CRM 3rd Party Data Marketing DW Supply Chain Data Mart Oracle Financial DW Subset Non-Architected Data Marts Typical Nonintegrated Information Architecture

  14. Federated Integrated Information Architecture Siebel CRM 3rd Party Data i2 Supply Chain Oracle Financials Common Data Staging Area Federated Supply Chain Data Mart Federated Financial DW Federated Marketing DW Subset Non-Architected Data Marts

  15. Future • The future of data warehousing is clearly multi-faceted. • There is a lot of blurring today with: • CRM, Enterprise Systems and E-commerce initiatives. • Data warehousing is really becoming the method for storing analytic-capable data for all these applications and more, many of which are packaged. • Architectures will need to be more tightly integrated. • E-commerce is cranking up data volumes.

  16. 7-2: Alternate Storage and the Data Warehouse • Surprisingly, the future of data warehousing is nothigh-performance disk storage, but an array of alternative storage. • Involves two forms of alternative storage: • Near-line storage involves an automated silo where tape cartridges are handled automatically. • Secondary storage which is slower and less expensive, such as CD-ROMs and floppy disks. • Firms like Teradata, Inc., Storage Technology Corp. [STC] and others specialize in high volume storage systems

  17. Speed and Capacity of Various Near-Line Storage Media

  18. Robotic Tape Retrieval Arm Main View of the Tape Silo View Through Silo Entry Door Close-up of tape storage carousel Typical Near-Line Tape Storage Silo

  19. Why Use Alternative Storage? • The data in a DW are stable. They are placed there once and left alone, so do not need to be updated at high speed. • The queries that operate on the DW data often require long streams of data stored sequentially. Operational access requires different units of data from different storage areas. • The DW is of indeterminate size and is always increasing in volume, requiring flexible capacity. • When data gets accessed less often as it ages, it can be moved to secondary storage, making access to newer data more efficient.

  20. 7-3: Trends in Data Warehousing • Customer interaction and learning relationships require capturing information “everywhere” and massive scalability. • Enterprise applications generate data that is doubling very 9-12 months. • The time available for working with data is shrinking and the need for 24×7 access is becoming the norm. • Fast implementation and easeof management are becoming more and more important. • In the future, more organizations will build Web applications that operate in conjunction with the DW.

  21. 7-4: The Future of Data Mining As promising as the field may be, it has pitfalls: • The quality of data can make or break the data mining effort. • In order to mine the data, companies first have to integrate, transform and cleanse it. • To obtain value from data mining, organizations must be able to change their mode of operation and maintain the effort (agile corporations). • Finally, there are concerns about privacy.

  22. Personalization versus Privacy • Companies that use data mining for target marketing walk a tightrope between personalization and privacy. • Implementation of the recent FTC guidelines about information practices can be a problem since companies often do not know how they will use information ahead of time. Signed releases from customers increasing required. • Further, technology appears to create new ways to acquire information faster than the legal system can handle the ethical and property issues.

  23. 7-5: Using Data Mining to Protect Privacy • While Internet use has grown, so have the problems of network intrusion. • One current intrusion detection technique is misuse detection – scanning for known malicious activity patterns known by signatures. • Another technique is anomaly detection where there is an attempt to identify malicious activity based on deviations from norms. • Most intrusion detection systems operate by the signature approach.

  24. Shortfalls of Current Detection Schemes • Variants – although signature lists are updated frequently, minor changes in the “exploit code” can produce a “new undetected” intruder. • False positives – a detection system may be too conservative and declare an intrusion when there is none. • E.g., Intruder scoring techniques for email … • False negatives – an intrusion won’t be detected until a signature has been identified. • Data overload – as traffic grows, the ability to find new hacks becomes harder and harder.

  25. 7-6: Trends Affecting the Future of Data Mining • While the available data increases exponentially, the number of new data analysts graduating each year has been fairly constant. Either of lot of data will go unanalyzed or automatic procedures will be needed. • Increases in hardware speed and capacity makes it possible to analyze data sets that were too large just a few years ago. • The next generation Internet will connect sites 100 times faster than current speeds. • To be more profitable, businesses will need to react more quickly and offer better service, and do it all with fewer people and at a lower cost.

  26. 7-7: The Future of Data Visualization • Weapons performance and safety – • Data visualization coupled with simulation models can show how weapons perform under typical conditions and the effect of weapons aging. • Medical trauma treatment – • Today’s surgeons use computer vision to assist in surgery. In the future this trend suggests that local medical personnel can also be assisted from afar by specialists through telepresence. • X-ray transmission resolution now at acceptable limits

  27. Augmented-reality Headset Worn by Surgeon

  28. 7-8: Components of Future Visualization Applications • The data visualization environment links the critical components and enables the smooth flow of information among the components. • In the future, the bounds between computers, graphics and human knowledge will become more blurred. • Many advances in technology will be need to handle the visualization environment of the future. • Intelligent file systems and data management software will contend with thousands of coupled storage devices.

  29. In Conclusion • Data explosion recent years have seen a dramatic increase in the amount of information stored in electronic format. • It has been estimated that the amount of information in the world doubles every 20 months and the size and number of databases are increasing even faster • Data and information are crucial for decision making, especially in business operations. As a prominent top manager aid said, • "Whoever has information fastest and uses it wins" • [Watterson K., from BYTE].

  30. Future Vision • The objective of taking a view on the future is not so much about trying to guess lottery numbers, it is about combining the past and the present with what we think is likely to occur. • That way we believe we are able to forecast with some accuracy. • Predicting the future is like predicting the weather, events will occur that were unexpected and geniuses have a habit of seeing things differently leading to major shifts in the way things are done.

  31. Accounting and DW • •