1 / 63

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI. EXTENSIONS TO DSS. BI systems (aka EIS) Geographical Information Systems (GIS) Collaborative Systems (formerly GDSS) Expert Systems OLAP/Data mining/warehousing. DATA WAREHOUSES. DATA WAREHOUSE.

abla
Télécharger la présentation

MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. MINING & WAREHOUSING (MSS2) BY CHANDRA S. AMARAVADI

  2. EXTENSIONS TO DSS • BI systems (aka EIS) • Geographical Information Systems (GIS) • Collaborative Systems (formerly GDSS) • Expert Systems • OLAP/Data mining/warehousing

  3. DATA WAREHOUSES

  4. DATA WAREHOUSE A large collection of historical data that is organized specifically for use in decision support (i.e. OLAP, data mining)

  5. DATA WAREHOUSING USERS APPLICATIONS WAREHOUSE User1 Prodn. S E R V E R Mktg. User2 Fin. User3 User4 Acctg.

  6. THE DATA LIFE CYCLE Dashboard/BI Internal Data Data Visualization Data Marts OLAP External Data Data Warehouse Decisions Internal Data Data Mining Data Marts Knowledge The activities taking place with respect to data for warehouse/OLAP/mining

  7. INTEGRATION OF DATA Integration Operational databases Data Warehouse application A – m,f M/F application B – 1,0 application C – x,y application D – male, female

  8. CHARACTERISTICS OF WAREHOUSES Data warehouses have a number of characteristics • Subject-oriented • Integrated • Time-variant • Non-volatile

  9. KEY FEATURES OF DATA WAREHOUSES… Subject-oriented: A data warehouse is organized around major subjects, such as customer, supplier, product, and sales. Integrated: A data warehouse is usually constructed by integrating data from multiple heterogeneous sources, such as relational databases, flat files, and on-line transaction records. Time-variant: Data are stored to provide information from a historical perspective (e.g., the past 5-10 years). Every key structure in the data warehouse contains, either implicitly or explicitly, an element of time Nonvolatile: The data in a warehouse is permanent.

  10. DESIGN OF WAREHOUSE Design of warehouses is similar to databases: • Schema not radically different • W/H designed primarily for queries • Database will be converted to cube at run time.

  11. ORGANIZING THE W/H STAR SCHEMA: Consists of a large central table and a set of smaller tables, one each for each dimension. SNOWFLAKE SCHEMA: A variant of the star schema, Where some dimension tables are normalized, thereby splitting the data into additional tables. CONSTELLATION SCHEMA: A collection of stars.

  12. STAR SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS What are the dimensions here?

  13. SNOWFLAKE SCHEMA OF A DATA WAREHOUSE (FYI) BRANCH ACCOUNTS TRANSACTIONS CREDIT DEBIT

  14. CONSTELLATION SCHEMA OF A DATA WAREHOUSE FOR SALES AND SHIPPING (FYI) DEPARTMENTS BRANCH ACCOUNTS EMPLOYEES TRANSACTIONS CREDIT DEBIT

  15. EXAMPLE OF WAREHOUSE & MARTS Weekly sales by region 2007-2013 Sales by Product Line 2010-2013 Weekly sales by state 2005-2013 Weekly sales by product 2008-2013 Data Mart Data Mart Warehouse Sales Detail Sales Detail 2003-2013

  16. OLAP & DATA WAREHOUSES

  17. ONLINE ANALYTICAL PROCESSING OLAP: Tools to analyze data in a warehouse for decision support. How many light bulbs sold in December? • Aggregation (“data summarization”) • Dimension • Slice & Dice • Cube organization (Multi-dimensional database)

  18. AGGREGATION & CUBE Region Units NE SE SW North. E 40 South E. 20 South W. 30 Midwest 20 South W. 50 North E. 65 65 20 30 40 50 sales in the Northern region?

  19. DIMENSIONS & CONCEPT HIERARCHIES A dimension is an aspect of the data, it is a characteristic of a variable such as location, for sales variable. Dimensions can have hierarchies (or various levels of aggregations) A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts

  20. CONCEPT HIERARCHY Mfr. dimension iPhone5 iPhone4 150,600 GALII GAL FX G G2 13,600 30,000 15,067 18,240 51,326

  21. MULTI-DIMENSIONAL ORGANIZATION Sales, costs etc. nw Products (tables, desks, lamps..) sw Regions Cube organization supports slice & dice

  22. MULTI-DIMENSIONAL ORGANIZATION.. March sales February sales January sales shows multi-dimensional/cube organization

  23. CUBE ORGANIZATION • Data from warehouse imported into memory • A sophisticated 3D representation is created • Referred to as “sparse matrix” • Sides of cube are dimensions • Allows “slice & dice” • Answers to high level queries/reports

  24. DATA MINING

  25. DATA MINING Application of statistical and AI techniques to identify patterns that exist in large databases but are hidden in the vast amounts of data. e.g. sequence/association, classification, and clustering • Buyer non-buyer • Success/failure of an operation • Customers who spend more • Preferences of target groups

  26. SOME DATA MINING APPLICATIONS • Predicting the probability of default for consumer loans • Predicting audience response to TV advertisements • Predicting the probability that a cancer patient will respond to radiation therapy. • Predicting the probability that an offshore well will produce oil • A phone company mined data from 140 million households, each with as many as 10,000 attributes, including life-style and calling habits. Identified 22 profiles

  27. DATA MINING AS A STEP IN THE PROCESS OF KNOWLEDGE DISCOVERY Evaluation and Presentation Knowledge Data Mining Selection and Transformation Patterns Data Warehouse Cleaning and Integration Database Flat files

  28. DATA MINING PROCESS Cleaning & integration – data is brought in from multiple sources Selection & transformation – sometimes called dimensionality reduction, it is concerned with selection of dimensions and sometimes the raw data needs to be transformed to suit the problem e.g. calculate margin. Data mining - process of extracting data patterns, using statistical or AI techniques. Pattern evaluation - identifying patterns useful and relevant to the organizational context. Knowledge presentation -- Visualization and knowledge representation techniques are used to present the mined knowledge to the user.

  29. OLAP, MINING & WAREHOUSING... Data warehousing refers to the use of high speed/high capacity servers to store historical transaction information and to make this information accessible to decision makers. OLAP is used to perform high level analysis of data based on data summarization (aggregation) and slice and dice operations. For e.g. how many shoes sold in midwest in Feb? Data mining refers to identification of patterns from data.

  30. DATA MINING TECHNIQUES

  31. TYPES OF ANALYSES Sequence -- Activities occurring one after another e.g. loan after buying car, warranty. Association -- (AKA Market Basket Analysis) Activities which occur together (e.g. bread and meat) Classification -- Identifying profiles of data classified into pre-defined groups (frequent & infrequent shoppers) Clustering -- Identifying natural characteristics of data (what major areas are cust. coming from?)

  32. SEQUENTIAL ANALYSIS

  33. SEQUENTIAL ANALYSIS Predict purchasing • 40% of customers buy a gray skirt six months after buying a red jacket • Identified by time-series analysis, Neural networks or genetic algorithms

  34. IDENTIFYING SEQUENTIAL PATTERNS.. applications in forecasting exchange rates, meat consumption , bankruptcies etc.

  35. ASSOCIATIVE ANALYSIS

  36. ASSOCIATION Identifies items purchased together • Customers who bought items A, B, C also buy Y • 85 percent of customers who buy a certain wine brand also buy a certain type of pasta • Many men buy a six-pack when they purchase diapers • Enhances product placement in stores

  37. ANALYSIS TECHNIQUES ASSOCIATION Algorithm A-priori • list all items – 1 item set (C1) • filter by min. transaction support (L1) • identify 2-item sets (L1* L1) = (C2 ) • filter and generate L2 • repeat process... Database Min. transaction support is the number (sometimes given as %) of transactions in which the item must occur.

  38. A-PRIORI ALGORITHM minimum Support: 2 Scan Database C1 L1 Scan Database minimum Support: 2 L2 C2 C2 Scan Database L3 C3 C3

  39. DISCUSSION Apply associative rule mining (Use A-priori algorithm) to the following portfolios of clients of a brokerage company, to identify stocks that are purchased together. Use a minimum support of two.

  40. CLASSIFICATION ANALYSIS

  41. CLASSIFICATION A technique for grouping data into pre-defined classes using certain attributes of the data. E.g. defaulter or not, cruise customer, 4G subscriber or not etc. • Define classes • Identify classification criteria • Select suitable technique • applicationy

  42. CLASSIFICATION TECHNIQUES • Discriminant analysis (categorical membership) • Bayesian classification • Logistic regression (probabilistic membership) • Neural networks

  43. DISCRIMINANT ANALYSIS* A method of classification that uses a Discriminant Function to decide classes • DF based on attributes and weights. • Calculated as a weighted average of attributes and weights (of training data) (Zave score) E.g. (GMAT + 200 * UGPA) > 1200 • Given a new set of data, calculate Z score • Compare with Zav DF – Discriminant Function *a simplified version

  44. CLASSIFICATION WITH BAYESIAN ANALYSIS

  45. REFRESHER ON PROBABILITY Probability is the chance that an event/outcome will take place. • E.g. Cubs had 43 wins 36 losses in one season • Probability of winning?

  46. PRIOR (CONDITIONAL) PROBABILITIES Prior probabilities are knowledge of other events which may help improve predictions • E.g. p(IPO success) = 0.33 • Suppose we know that a big company behind IPO • p(IPO success/big company) = 0.99 Which is higher? P(successful cellphone call) or P(successful call/subscriber in service area)

  47. CONDITIONAL PROBABILITIES.. If we see a student in the union and he/she is a WIU student What is the probability he/she is a) CBT?, b) COAS?, c) COFAC?, d) COE?

  48. BAYESIAN ANALYSIS Bayes theorem can be exploited for classification • Bayes theorem • P(A/B) = [P(B/A) x P(A)]/P(B) • Classify into class I or II based on conditional probability • P(class I/x) > P(class II/x)? * • E.g. wiu student in union, which college are they? A method for classifying objects/events into classes based on probabilities of occurrence of the objects/events *x is some condition e.g. surgery or being a shopper in a retail chain

  49. INTRODUCTION TO BAYESIAN ANALYSIS We are interested in p(person becoming a manager/mba)* How can we use Baye’s theorem? ASSUME • 300 m population; 100 m employees • 500,000 are managers • 10,000 managers go to college for an MBA • 20 m go to college • 100,000 do MBA *you need to write formula using terms from the problem

  50. THIS SLIDE IS INTENTIONALLY BLANK

More Related