
The Software Infrastructure for Electronic Commerce




Presentation Transcript


  1. The Software Infrastructure for Electronic Commerce Databases and Data Mining Lecture 4: An Introduction To Data Mining (II) Johannes Gehrke johannes@cs.cornell.edu http://www.cs.cornell.edu/johannes

  2. Lectures Three and Four • Data preprocessing • Multidimensional data analysis • Data mining • Association rules • Classification trees • Clustering

  3. Types of Attributes • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income) • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race) • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)

  4. Classification Goal: Learn a function that assigns a record to one of several predefined classes.

  5. Classification Example • Example training database • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) • Age is ordered, Car-type is a categorical attribute • Class label indicates whether the person bought the product • Dependent attribute is categorical

  6. Regression Example • Example training database • Two predictor attributes: Age and Car-type (Sport, Minivan and Truck) • Spent indicates how much the person spent during a recent visit to the web site • Dependent attribute is numerical

  7. Types of Variables (Review) • Numerical: Domain is ordered and can be represented on the real line (e.g., age, income) • Nominal or categorical: Domain is a finite set without any natural ordering (e.g., occupation, marital status, race) • Ordinal: Domain is ordered, but absolute differences between values are unknown (e.g., preference scale, severity of an injury)

  8. Definitions • Random variables X1, …, Xk (predictor variables) and Y (dependent variable) • Xi has domain dom(Xi), Y has domain dom(Y) • P is a probability distribution on dom(X1) x … x dom(Xk) x dom(Y); the training database D is a random sample from P • A predictor d is a function d: dom(X1) x … x dom(Xk) → dom(Y)

  9. Classification Problem • If Y is categorical, the problem is a classification problem, and we use C instead of Y; |dom(C)| = J • C is called the class label, d is called a classifier • Let r be a record randomly drawn from P. Define the misclassification rate of d: RT(d,P) = P(d(r.X1, …, r.Xk) != r.C) • Problem definition: Given a dataset D that is a random sample from probability distribution P, find a classifier d such that RT(d,P) is minimized

  10. Regression Problem • If Y is numerical, the problem is a regression problem • Y is called the dependent variable, d is called a regression function • Let r be a record randomly drawn from P. Define the mean squared error of d: RT(d,P) = E[(r.Y - d(r.X1, …, r.Xk))^2] • Problem definition: Given a dataset D that is a random sample from probability distribution P, find a regression function d such that RT(d,P) is minimized
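Both definitions reduce to simple sample averages once a concrete predictor and a finite sample are in hand. Below is a minimal sketch (not from the lecture) of estimating RT(d, P) empirically with numpy; the toy predictor d, the sample values, and the function names are illustrative assumptions.

```python
# Minimal sketch (not from the lecture): estimating RT(d, P) by a sample
# average with numpy; the toy predictor d and the sample are assumptions.
import numpy as np

def misclassification_rate(d, X, y):
    """Fraction of sample records for which d(x) differs from the class label."""
    return float(np.mean(np.array([d(x) for x in X]) != y))

def mean_squared_error(d, X, y):
    """Average of (y - d(x))^2 over the sample (regression case)."""
    return float(np.mean((y - np.array([d(x) for x in X])) ** 2))

# Toy classification example with one numerical predictor (Age).
X = np.array([[23.0], [31.0], [36.0], [25.0], [30.0]])
y = np.array([1, 0, 0, 1, 0])                 # class labels, dom(C) = {0, 1}
d = lambda x: 1 if x[0] < 30 else 0           # a hand-written classifier
print(misclassification_rate(d, X, y))        # 0.0 on this particular sample
```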

  11. Goals and Requirements • Goals: • To produce an accurate classifier/regression function • To understand the structure of the problem • Requirements on the model: • High accuracy • Understandable by humans, interpretable • Fast construction for very large training databases

  12. Different Types of Classifiers • Linear discriminant analysis (LDA) • Quadratic discriminant analysis (QDA) • Density estimation methods • Nearest neighbor methods • Logistic regression • Neural networks • Fuzzy set theory • Decision Trees

  13. Difficulties with LDA and QDA • Multivariate normal assumption often not true • Not designed for categorical variables • Form of classifier in terms of linear or quadratic discriminant functions is hard to interpret

  14. Histogram Density Estimation • Curse of dimensionality • Cell boundaries are discontinuities. Beyond boundary cells, estimate falls abruptly to zero.

  15. Kernel Density Estimation • How to choose the kernel bandwidth h? • The optimal h depends on a criterion • The optimal h depends on the form of the kernel • The optimal h might depend on the class label • The optimal h might depend on the part of the predictor space • How to choose the form of the kernel?
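One common way to pick h in practice is cross-validation over a grid of candidate bandwidths. The sketch below uses scikit-learn's KernelDensity as a modern stand-in for the kernel estimators discussed here; the Gaussian kernel, the bandwidth grid, and the synthetic 2-d data are illustrative assumptions, not the lecture's prescription.

```python
# A sketch of bandwidth selection by cross-validation, using scikit-learn's
# KernelDensity as a stand-in; kernel, grid, and data are assumptions.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))                  # toy 2-dimensional predictor data

# Choose the h that maximizes held-out log-likelihood over a grid of candidates.
search = GridSearchCV(
    KernelDensity(kernel="gaussian"),
    param_grid={"bandwidth": np.logspace(-1, 1, 20)},
    cv=5,
)
search.fit(X)
print("selected bandwidth h:", search.best_params_["bandwidth"])
```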

  16. K-Nearest Neighbor Methods • Difficulties: • Data must be stored; for classification of a new record, all data must be available • Computationally expensive in high dimensions • Choice of k is unknown
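The same idea helps with the unknown choice of k: compare candidate values by cross-validated accuracy. A minimal sketch, assuming scikit-learn and using its bundled iris data purely for illustration; the candidate k values are arbitrary.

```python
# A sketch of choosing k by 10-fold cross-validated accuracy, with
# scikit-learn's KNeighborsClassifier standing in for the nearest neighbor
# methods above; the iris data and candidate k values are illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 11, 21):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=10)
    print(f"k={k:2d}  mean CV accuracy = {scores.mean():.3f}")
```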

  17. Difficulties with Logistic Regression • Few goodness of fit and model selection techniques • Categorical predictor variables have to be transformed into dummy vectors.

  18. Neural Networks and Fuzzy Set Theory Difficulties: • Classifiers are hard to understand • How to choose network topology and initial weights? • Categorical predictor variables?

  19. What are Decision Trees? [Figure: a decision tree that splits on Age (<30 / >=30) at the root; for Age < 30 a second split on Car Type sends Minivan to YES and Sports, Truck to NO, while Age >= 30 leads directly to YES. A companion plot shows the corresponding partition of the (Age, Car-type) plane, with the Age axis marked at 0, 30, and 60.]

  20. Decision Trees • A decision tree T encodes d (a classifier or regression function) in form of a tree. • A node t in T without children is called a leaf node. Otherwise t is called an internal node.

  21. Internal Nodes • Each internal node has an associated splitting predicate. Most common are binary predicates. Example predicates: • Age <= 20 • Profession in {student, teacher} • 5000*Age + 3*Salary – 10000 > 0

  22. Internal Nodes: Splitting Predicates • Binary univariate splits: • Numerical or ordered X: X <= c, c in dom(X) • Categorical X: X in A, A subset of dom(X) • Binary multivariate splits: • Linear combination split on numerical variables: Σ ai Xi <= c • k-ary (k>2) splits analogous
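To make split selection concrete, here is a minimal sketch (not from the lecture) that scores candidate univariate splits Age <= c by weighted Gini impurity; the impurity criterion, the candidate thresholds, and the toy data are illustrative assumptions.

```python
# A sketch that scores binary univariate splits "Age <= c" by weighted Gini
# impurity; criterion, thresholds, and toy data are assumptions.
import numpy as np

def gini(labels):
    """Gini impurity of a multiset of class labels."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - float(np.sum(p ** 2))

def split_impurity(x, y, c):
    """Weighted impurity of the two children induced by the predicate x <= c."""
    left, right = y[x <= c], y[x > c]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

age    = np.array([23, 25, 30, 31, 36, 42])
bought = np.array([1, 1, 0, 0, 0, 0])
for c in (24, 27, 33):
    print(f"Age <= {c}: weighted Gini = {split_impurity(age, bought, c):.3f}")
# The split Age <= 27 separates the two classes perfectly (impurity 0).
```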

  23. Leaf Nodes • Consider a leaf node t • Classification problem: node t is labeled with one class label c in dom(C) • Regression problem: two choices • Piecewise constant model: t is labeled with a constant y in dom(Y) • Piecewise linear model: t is labeled with a linear model Y = yt + Σ ai Xi

  24. Example • Encoded classifier: If (age < 30 and carType = Minivan) Then YES; If (age < 30 and (carType = Sports or carType = Truck)) Then NO; If (age >= 30) Then YES • [Figure: the decision tree from slide 19, with the Age < 30 / >= 30 split at the root and the Car Type split below]
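A tree of this shape can be grown automatically from a training database. The sketch below uses scikit-learn's DecisionTreeClassifier as a stand-in for the tree-construction algorithms surveyed later; the eight training records are made up for illustration, and Car-type is one-hot encoded because this particular implementation accepts only numerical inputs (compare the dummy-variable remark on slide 27 below).

```python
# A sketch that grows a tree of the same shape with scikit-learn's
# DecisionTreeClassifier; the training records are made up, and Car-type is
# one-hot encoded because this implementation accepts only numerical inputs.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

data = pd.DataFrame({
    "Age":     [23, 24, 26, 27, 33, 41, 52, 38],
    "CarType": ["Minivan", "Sports", "Minivan", "Truck",
                "Minivan", "Sports", "Truck", "Sports"],
    "Bought":  ["YES", "NO", "YES", "NO", "YES", "YES", "YES", "YES"],
})
X = pd.get_dummies(data[["Age", "CarType"]])       # Age plus 0/1 dummy columns
tree = DecisionTreeClassifier(max_depth=2).fit(X, data["Bought"])
print(export_text(tree, feature_names=list(X.columns)))
# Prints the Age <= 30 split at the root and the Minivan split below it.
```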

  25. Choice of Classification Algorithm? • Example study: (Lim, Loh, and Shih, Machine Learning 2000) • 33 classification algorithms • 16 (small) data sets (UC Irvine ML Repository) • Each algorithm applied to each data set • Experimental measurements: • Classification accuracy • Computational speed • Classifier complexity

  26. Classification Algorithms • Tree-structure classifiers: • IND, S-Plus Trees, C4.5, FACT, QUEST, CART, OC1, LMDT, CAL5, T1 • Statistical methods: • LDA, QDA, NN, LOG, FDA, PDA, MDA, POL • Neural networks: • LVQ, RBF

  27. Experimental Details • 16 primary data sets, created 16 more data sets by adding noise • Converted categorical predictor variables to 0-1 dummy variables if necessary • Error rates for 6 data sets estimated from supplied test sets, 10-fold cross-validation used for the other data sets

  28. Ranking by Mean Error Rate
  Rank  Algorithm            Mean Error  Time
     1  Polyclass            0.195       3 hours
     2  Quest Multivariate   0.202       4 min
     3  Logistic Regression  0.204       4 min
     6  LDA                  0.208       10 s
     8  IND CART             0.215       47 s
    12  C4.5 Rules           0.220       20 s
    16  Quest Univariate     0.221       40 s
  …

  29. Other Results • Number of leaves for tree-based classifiers varied widely (median number of leaves between 5 and 32, after removing some outliers) • Mean misclassification rates for the top 26 algorithms are not statistically significantly different; the bottom 7 algorithms have significantly higher error rates

  30. Decision Trees: Summary • Powerful data mining model for classification (and regression) problems • Easy to understand and to present to non-specialists • TIPS: • Even if black-box models sometimes give higher accuracy, construct a decision tree anyway • Construct decision trees with different splitting variables at the root of the tree

  31. Clustering • Input: Relational database with fixed schema • Output: k groups of records called clusters, such that the records within a group are more similar to each other than to records in other groups • More difficult than classification (unsupervised learning: no record labels are given) • Usage: • Exploratory data mining • Preprocessing step (e.g., outlier detection)

  32. Clustering (Contd.) • In clustering we partition a set of records into meaningful sub-classes called clusters. • Cluster: a collection of data objects that are “similar” to one another and thus can be treated collectively as one group. • Clustering helps users to detect inherent groupings and structure in a data set.

  33. Clustering (Contd.) • Example input database: two numerical variables • How many groups are there? • Requirement: need to define “similarity” between records

  34. Graphical Representation

  35. Clustering (Contd.) • Output of clustering: • Representative points for each cluster • Labeling of each record with its cluster number • Other descriptions of each cluster • Important: Use the “right” distance function • Scale or normalize all attributes. Example: seconds, hours, days • Assign different weights to attributes according to their importance
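The normalization point matters because a distance-based algorithm otherwise lets the attribute with the largest scale dominate. A minimal sketch, assuming scikit-learn and k-means as the clustering method (the lecture does not prescribe a particular algorithm); the two attributes and k = 2 are made up for illustration.

```python
# A minimal sketch of scaling attributes before a distance-based clustering;
# scikit-learn, k-means, and the synthetic attributes are assumptions.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Two attributes on very different scales: session length in seconds and
# visits per month. Without scaling, the seconds column dominates distances.
X = np.column_stack([
    np.concatenate([rng.normal(300, 50, 100), rng.normal(2000, 200, 100)]),
    np.concatenate([rng.normal(2, 1, 100), rng.normal(20, 3, 100)]),
])

X_scaled = StandardScaler().fit_transform(X)     # each attribute now weighs equally
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(labels))                       # cluster sizes
```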

  36. Clustering: Summary • Finding natural groups in data • Common post-processing steps: • Build a decision tree with the cluster label as class label • Try to explain the groups using the decision tree • Visualize the clusters • Examine the differences between the clusters with respect to the fields of the dataset • Try different numbers of clusters
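The first two post-processing steps can be wired together directly: cluster first, then fit a decision tree with the cluster label as the class label so each group is described by readable rules. A minimal sketch, with scikit-learn and its iris data standing in for a real customer table; k = 3 and the tree depth are illustrative choices.

```python
# A sketch of the post-processing step above: cluster, then explain the
# clusters with a decision tree; scikit-learn and iris are stand-ins.
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = load_iris(return_X_y=True)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(
    StandardScaler().fit_transform(X))

explainer = DecisionTreeClassifier(max_depth=2).fit(X, clusters)
print(export_text(explainer))        # rules that approximately describe each cluster
```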

  37. Web Usage Mining • Data sources: • Web server log • Information about the web site: • Site graph • Metadata about each page (type, objects shown) • Object concept hierarchies • Preprocessing: • Detect sessions and user context (cookies, user authentication, personalization)
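Session detection is typically done by grouping requests per user (for example per cookie) and starting a new session after a period of inactivity. The sketch below illustrates that idea; the 30-minute timeout, the field names, and the log records are assumptions, not part of the lecture.

```python
# A sketch of session detection: group requests per cookie and start a new
# session after 30 minutes of inactivity. Timeout, fields, and records are
# illustrative assumptions.
from datetime import datetime, timedelta

TIMEOUT = timedelta(minutes=30)
log = [  # (cookie, timestamp, url) triples, assumed already parsed from the log
    ("u1", datetime(2001, 5, 1, 9, 0), "/home"),
    ("u1", datetime(2001, 5, 1, 9, 5), "/catalog"),
    ("u1", datetime(2001, 5, 1, 11, 0), "/home"),
    ("u2", datetime(2001, 5, 1, 9, 2), "/home"),
]

sessions, last_seen, idx = {}, {}, {}
for cookie, ts, url in sorted(log, key=lambda r: (r[0], r[1])):
    if cookie not in last_seen or ts - last_seen[cookie] > TIMEOUT:
        idx[cookie] = idx.get(cookie, -1) + 1        # inactivity gap: new session
    last_seen[cookie] = ts
    sessions.setdefault((cookie, idx[cookie]), []).append(url)

for (cookie, session_no), urls in sessions.items():
    print(cookie, session_no, urls)
```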

  38. Web Usage Mining (Contd.) • Data Mining • Association Rules • Sequential Patterns • Classification • Action • Personalized pages • Cross-selling • Evaluation and Measurement • Deploy personalized pages selectively • Measure effectiveness of each implemented action

  39. Large Case Study: Churn • Telecommunications industry • Try to predict churn (whether customer will switch long-distance carrier) • Dataset: • 5000 records (tiny dataset, but manageable here in class) • 21 attributes, both numerical and categorical attributes (very few attributes) • Data is already cleaned! No missing values, inconsistencies, etc. (again, for classroom purposes)

  40. Churn Example: Dataset Columns • State • Account length: Number of months the customer has been with the company • Area code • Phone number • International plan: yes/no • Voice mail: yes/no • Number of voice messages: Average number of voice messages per day • Total (day, evening, night, international) minutes: Average number of minutes charged • Total (day, evening, night, international) calls: Average number of calls made • Total (day, evening, night, international) charge: Average amount charged per day • Number customer service calls: Number of calls made to customer support in the last six months • Churned: Did the customer switch long-distance carriers in the last six months
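Getting familiar with a table like this is quickest in a data-frame tool. A minimal sketch with pandas; the file name churn.csv and the exact column spellings are assumptions and should be adjusted to the copy of the dataset at hand.

```python
# A sketch of the "getting familiar" step with pandas; the file name and
# column spellings are assumptions.
import pandas as pd

churn = pd.read_csv("churn.csv")
print(churn.shape)                                    # expect about 5000 x 21
print(churn.dtypes)                                   # numerical vs. categorical
print(churn["Churned"].value_counts(normalize=True))  # overall churn rate
print(pd.crosstab(churn["International plan"], churn["Churned"], normalize="index"))
```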

  41. Churn Example: Analysis • We start out by getting familiar with the dataset • Record viewer • Statistics visualization • Evidence classifier • Visualizing joint distributions • Visualizing geographic distribution of churn

  42. Churn Example: Analysis (Contd.) • Building and interpreting data mining models • Decision trees • Clustering
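For the model-building step, a small end-to-end sketch: predict Churned from the remaining columns with a decision tree and report the misclassification rate on a held-out test set. It continues the hypothetical churn.csv from above; the dropped identifier column, the dummy encoding, and the tree depth are assumptions, not the lecture's recipe.

```python
# A sketch of the model-building step: a decision tree predicting Churned,
# evaluated on a held-out test set. File name, columns, and parameters are
# assumptions.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

churn = pd.read_csv("churn.csv")
X = pd.get_dummies(churn.drop(columns=["Churned", "Phone number"]))
y = churn["Churned"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

tree = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
print("test misclassification rate:", 1 - tree.score(X_test, y_test))
```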

  43. Evaluating Data Mining Tools

  44. Evaluating Data Mining Tools • Checklist: • Integration with current applications and your data management infrastructure • Ease of usage • Automation • Scalability to large datasets • Number of records • Number of attributes • Datasets larger than main memory • Support of sampling • Export of models into your enterprise • Stability of the company that offers the product

  45. Integration With Data Management • Proprietary storage format? • Native support of major database systems: • IBM DB2, Informix, Oracle, SQL Server, Sybase • ODBC • Support of parallel database systems • Integration with your data warehouse

  46. Cost Considerations • Proprietary or commodity hardware and operating system • Client and server might be different • What server platforms are supported? • Support staff needed • Training of your staff members • Online training, tutorials • On-site training • Books, course material

  47. Data Mining Projects • Checklist: • Start with well-defined business questions • Have a champion within the company • Define measures of success and failure • Main difficulty: No automation • Understanding the business problem • Selecting the relevant data • Data transformation • Selection of the right mining methods • Interpretation

  48. Understand the Business Problem Important questions: • What is the problem that we need to solve? • Are there certain aspects of the problem that are especially interesting? • Do we need data mining to solve the problem? • What information is actionable, and when? • Are there important business rules that constrain our solution? • Which people should we keep in the loop, and with whom should we discuss intermediate results? • Who are the (internal) customers of the effort?

  49. Hiring Outside Experts? Factors: • One-time problem versus ongoing process • Source of data • Deployment of data mining models • Availability and skills of your own staff

  50. Hiring Experts • Types of experts: • Your software vendor • Consulting companies/centers/individuals • Your goal: Develop in-house expertise
