1 / 85

The Software Infrastructure for Electronic Commerce

The Software Infrastructure for Electronic Commerce. Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes. Lectures Three and Four. Data preprocessing Multidimensional data analysis Data mining

barny
Télécharger la présentation

The Software Infrastructure for Electronic Commerce

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. The Software Infrastructurefor Electronic Commerce Databases and Data Mining Lecture 3: An Introduction To Data Mining (I) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes

  2. Lectures Three and Four • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering

  3. Why Data Preprocessing? • Quality decisions come from quality data. • Problems with real life data: • Data needs to be integrated from different sources • Missing values • Noisy and inconsistent values • Data is not at the right level of aggregation

  4. Recall: The Relational Data Model • A relational database is a set of relations • A relation has two components: • The relation instance. Basically a table, with rows (also: records, tuples) and columns (also: fields, attributes).Number of records in the relation instance: Cardinality. • The relation schema. Specifies name of the relation, plus name and type of each column. • Turing award (Nobel price in CS) for Codd in 1981 for his work on the relational model

  5. Example: Customer Relation • Relation schema:Customers (cid: integer, name: string, byear: integer, state: string) • Relation instance:

  6. Data Integration • Integrate data from multiple sources into a common format for data mining. • Note: A good data warehouse has already taken care of this step. Data Warehouse Server OLTPDBMSs Extract, clean,transform, aggregate,load, update Other Data Sources Data Marts

  7. Data Integration (Contd.) Problem: Heterogeneous schema integration • Different attribute names • Different units: Sales in $, sales in Yen, sales in DM

  8. Data Integration (Contd.) Problem: Heterogeneous schema integration • Different scales: Sales in dollars versus sales in pennies • Derived attributes: Annual salary versus monthly salary

  9. Data Integration (Contd.) Problem: Inconsistency due to redundancy • Customer with customer-id 150 has three children in relation1 and four children in relation2 • Computation of annual salary from monthly salary in relation1 does not match “annual-salary” attribute in relation2

  10. Missing Values Often the values of some attributes in a record are not known. Reasons for missing values: • Attribute does not apply (e.g., maiden name) • Inconsistency with other recorded data • Equipment malfunction • Human errors • Attribute introduced recently (e.g., email address)

  11. Missing Values: Approaches • Ignore the record • Complete the missing value: • Manual completion: Tedious and likely to be infeasible • Fill in a global constant, e.g., “NULL”, “unknown” • Use the attribute mean or mode • Construct a data mining model that predicts the missing value

  12. Noisy Data • Examples: • Faulty data collection instruments • Data entry problems, misspellings • Data transmission problems • Technology limitation • Inconsistency in naming conventions • Duplicate records with different values for a common field

  13. Noisy Data: Remove Outliers

  14. Noisy Data: Smoothing y Y1 Y1’ x X1

  15. Noisy Data: Normalization • Scale data to fall within a small, specified range • Leave out extreme order statistics • Min-max normalization • Z-score normalization • Normalization by decimal scaling

  16. Data Reduction Problem: • Data might not be at the right scale for analysis.Example: Individual phone calls versus monthly phone call usage • Complex data mining tasks might run a very long time.Example: Multi-terabyte data warehousesOne disk drive: About 20MB/s

  17. Data Reduction: Attribute Selection • Select the “relevant” attributes for the data mining task • If there are k attributes, there are 2k-1 different subsets • Example: {salary,children,byear}, {salary,children}, {salary,byear}, {children,byear}, {salary}, {children}, {byear} • Choice of the right subset depends on: • Data mining task • Underlying probability distribution

  18. Attribute Selection (Contd.) • How to choose relevant attributes: • Forward selection: Select greedily one attribute at a time • Backward elimination: Start with all attributes, eliminate irrelevant attributes • Combination of forward selection and backward elimination

  19. Data Reduction: Parametric Models • Main idea: • Fit a parametric model to the data (e.g., multivariate normal distribution) • Store the model parameters, discard the data (except for outliers)

  20. Instead of storing (x,y) pairs, store only the x-value.Then recompute the y-value usingy = ax + b Parametric Models: Example y x

  21. Data Reduction: Sampling • Choose a representative subset of the data • Simple random sampling may have very poor performance in the presence of skew

  22. Data Reduction: Sampling (Contd.) Stratified sampling: Biased sampling • Example: Keep population group ratios • Example: Keep minority population group count

  23. Data Reduction: Histograms • Divide data into buckets and store average (sum) for each bucket • Can be constructed “optimally” for one attribute using dynamic programming • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1-2,12,16), (3-6,8,36), (7-9,6,48), (10-12,6,66)

  24. Histograms (Contd.) • Equal-width histogram • Divides the domain of an attribute into k intervals of equal size • Interval width = (Max – Min)/k • Computationally easy • Problems with data skew and outliers • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1-3,14,22), (4-6,6,30), (7-9,6,48), (10-12,6,66)

  25. Histograms (Contd.) • Equal-depth histogram • Divides the domain of an attribute into k intervals, each containing the same number of records • Variable interval width • Computationally easy • Example:Dataset: 1,1,1,1,1,1,1,1,2,2,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10,11,11,12,12Histogram: (range, count, sum)(1,8,8), (2-4,8,22), (5-8,8,52), (9-12,8,84)

  26. Data Reduction: Discretization • Same concept as histograms • Divide domain of a numerical attribute into intervals. • Replace attribute value with label for interval. • Example: • Dataset (age; salary):(25;30,000),(30;80,000),(27;50,000),(60;70,000),(50;55,000),(28;25,000) • Discretized dataset (age, discretizedSalary):(25,low),(30,high),(27,medium),(60,high),(50,medium),(28,low)

  27. Discretization: Natural Hierarchies • Replace low-level concepts with high-level concepts • Example replacements: • Product SKU by category • City by state Year Industry Country Quarter Category State Month Week Product City Day

  28. Data Reduction: Aggregation • Natural hierarchies on attributes can be used to aggregate data along the hierarchy Year Industry Country Quarter Category State Month Week Product City Day

  29. Aggregation: Example

  30. Data Reduction: Other Methods • Principal component analysis • Fourier transformation • Wavelet transformation

  31. Problems during data integration: Different attribute names Different units Different scales Derived attributes Redundant data Missing values Imputation Prediction Noisy data: Outlier removal Smoothing Normalization Data Reduction: Attribute selection Fitting parametric models Sampling Histograms Discretization Aggregation Data Preprocessing: Summary

  32. Data Analysis • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering

  33. Multidimensional Data Analysis Recall: Transactions(ckey, timekey, pkey, units, price) Customers(ckey, cid, name, byear, city, state, country) Time(tkey, day, month, quarter, year) Products(pkey, pname, price, pid, category, industry) Hierarchies on dimensions: Year Industry Country Quarter Category State Month Week Product City Day

  34. Multidimensional Data Analysis Year Industry Country=“USA” Quarter Category State Month Week Product City Day

  35. Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.country = “USA”GROUP BY P.industry, C.state • We think that Industry3 in CA is interesting. Year Industry Country=“USA” Quarter Category State Month Week Product City Day

  36. Slice and Drill-Down Year Industry=“Industry3” Country Quarter Category State=“CA” Month Week Product City Day

  37. Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND P.industry = “Industry3” AND C.state = “CA”GROUP BY P.category, C.city • We think that Category3 is interesting. Year Industry=“Industry3” Country Quarter Category State=“CA” Month Week Product City Day

  38. Slice and Drill-Down Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Product Day

  39. Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3”GROUP BY P.product, C.city • Nothing new in this view of the data. Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Product Day

  40. Pivot To (City, Year) Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Day Product

  41. Corresponding Query in SQL • SELECT SUM(units)FROM Transactions T, Products P, Customers CWHERE T.pkey = P.pkey AND T.ckey = C.ckey AND C.state = “CA” AND P.category = “Category3”GROUP BY C.city, T.year Year Country Industry Quarter State=“CA” Category=“Category3” Month Week City Day Product

  42. Multidimensional Data Analysis Set of data manipulation operators • Roll-up: Go up one step in a dimension hierarchy. Example: month -> quarter • Drill-down: Go down one step in a dimension hierarchy. Example: quarter -> month • Slice: Select a subset of the values of a dimension. Example: All categories -> only Category3 • Dice: Select all values of a dimension. Example: Only Category3 -> all categories • Pivot: Select new dimensions to visualize the data. Example: Pivot to Time(quarter) and Customer(state)

  43. Visual Intuition: Cube roll-up to category Customer Data Mart roll-up to state SH SF LA Product1 Product2Product3 Product4Product5Product6 20 30 20 15 10 50 Product roll-up to week M T W Th F S S Time 50 Units of Product6 sold on Monday in LA

  44. OLAP Server Architectures • Relational OLAP (ROLAP) • Relational DBMS stores data mart • OLAP middleware: • Aggregation and navigation logic • Optimized for DBMS in the background, but slow and complex • Basically only one vendor: Microstrategy • Multidimensional OLAP (MOLAP) • Specialized array-based storage structure • Vendors: Hyperion (Essbase), Appix (iTM1), Oracle, Microsoft

  45. OLAP Server Architectures • Desktop OLAP (DOLAP) • Performs OLAP directly at your PC • Vendors: Cognos (Powerplay), Business Objects, Brio Technology, Hummingbird • Hybrids and Application OLAP • More: www.olapreport.com

  46. The OLAP Market

  47. Enterprise Reporting Tools

  48. Summary: Multidimensional Analysis • Spreadsheet style data analysis • Roll-up, drill-down, slice, dice, and pivot your way to interesting cells in the CUBE • Mainstream technology • Established enterprises already have OLAP installations • When establishing your e-business, OLAP will be your first step in data analysis

  49. Data Mining

  50. Definition Data mining is the exploration and analysis of large quantities of data in order to discover valid, novel, potentially useful, and ultimately understandable patterns in data. Example pattern (Census Bureau Data):If (relationship = husband), then (gender = male). 99.6%

More Related