
2013/2014 Summer

Data Mining and Knowledge Acquisition, Chapter 7: Data Mining Overview and Exam Questions. 2013/2014 Summer. Data Mining. Methodology: problem definition, data set selection, preprocessing transformations. Functionalities: classification/prediction, clustering, association.

lane

Presentation Transcript


  1. Data Mining and Knowledge Acquisition, Chapter 7: Data Mining Overview and Exam Questions, 2013/2014 Summer

  2. Data Mining • Methodology • Problem definition • Data set selection • Preprocessing transformations • Functionalities • Classification/prediction • Clustering • Association • Sequential analysis • others

  3. Methodology cont. • Algorithms • For classification you can use • Decision trees: ID3, C4.5, and CHAID are algorithms • For clustering you can use • Partitioning methods: k-means, k-medoids • Hierarchical methods: AGNES • Probabilistic methods: EM is an algorithm • Presenting results • Back transformations • Reports • Taking action

  4. Two basic styles of data mining • Descriptive • Cross tabulations, OLAP, attribute-oriented induction, clustering, association • Predictive • Classification, prediction • Questions answered by these styles • Difference between classification and prediction

  5. Classification • Methods • Decision trees • Neural networks • Bayesian • K-NN or memory-based reasoning • Advantages and disadvantages • Given a problem, which data processing techniques are required

  6. Classification (cont.) • Accuracy of the model • Measures for classification/numerical prediction • How to better estimate • Holdout, cross-validation, bootstrapping • How to improve • Bagging, boosting • For unbalanced classes • What to do with models • Lift charts
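
The cross-validation estimate mentioned on this slide can be sketched as an index split. This is a minimal illustration, not course material: the function name `k_fold_indices` and the sample sizes are assumptions, and no shuffling or stratification is done.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: folds are contiguous and unshuffled; the first
    n_samples % k folds get one extra observation.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Assumed example: 10 observations, 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each observation is used for testing exactly once, so the accuracy averaged over the folds uses all the data, unlike a single holdout split.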

  7. Clustering • Distance measures • Dissimilarity or similarity • For different types of variables • Ordinal, binary, nominal, ratio, interval • Why data need to be transformed • Partitioning methods • K-means, k-medoids • Advantages and disadvantages • Hierarchical • Density-based • Probabilistic

  8. Association • Apriori or FP-Growth • How to measure the strength of rules • Support and confidence • Other measures: critique of support and confidence • Multiple levels • Constraints • Sequential patterns
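
As a small illustration of the rule-strength measures listed on this slide, the sketch below computes support, confidence, and lift on a made-up transaction list. The items and transactions are assumptions chosen for the example; `Fraction` keeps the arithmetic exact.

```python
from fractions import Fraction

# Hypothetical market-basket data, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(lhs, rhs):
    """Estimated P(rhs | lhs) for the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)


def lift(lhs, rhs):
    """Confidence divided by the baseline support of rhs."""
    return confidence(lhs, rhs) / support(rhs)


rule_support = support({"bread", "milk"})
rule_confidence = confidence({"bread"}, {"milk"})
rule_lift = lift({"bread"}, {"milk"})
print(rule_support, rule_confidence, rule_lift)
```

In this toy data the rule bread -> milk has support 3/5 and confidence 3/4, yet its lift is 15/16, slightly below 1. This is the usual critique of the support-confidence framework: a rule can pass both thresholds while the items are not positively associated.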

  9. OLAP • Concept of cube • Fact table • Measures • Dimensions • Schemas • Star, snowflake • Concept hierarchies • Set grouping, such as price or age • Parent-child

  10. Preprocessing • Missing values • Inconsistencies • Redundant data • Outliers • Data reduction • Attribute elimination • Attribute combination • Sampling • Histograms

  11. Exam Questions • Introduction • Basic functionalities • Data description • Data preparation • Data warehousing/OLAP • Clustering • Classification/numerical prediction • Frequent pattern mining

  12. Introduction • Defining data mining problems • Data mining functionalities

  13. Define data mining problems • 1. Suppose that a data warehouse for Big-University Library consists of the following three dimensions: users, books, time, and each dimension has four levels not including the all level. There are three measures: You are asked to perform a data mining study on that warehouse (25 pts). • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation?

  14. Define data mining problems • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are possible data inconsistencies? • Do you perform any discretization? • Do you perform any data transformations? • Do you apply any data reduction strategies?

  15. Define data mining problems • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm and the interestingness measures or constraints, if any.

  16. Data mining on MIS • A data warehouse for the MIS department consists of the following four dimensions: student, course, instructor, semester, and each dimension has five levels including the all level. There are two measures: count and average grade. At the lowest level, average grade is the actual grade of a student. You are asked to perform a data mining study on that warehouse (25 pts).

  17. Data mining on MIS 2 • Define three data mining problems on that warehouse, involving association, classification and clustering functionalities respectively. Clearly state the importance of each problem. What is the advantage of the data being organized as OLAP cubes compared to relational table organisation? • In the data preprocessing stage of KDD • What are the reasons for missing values, and how do you handle them? • What are possible data inconsistencies? • Do you perform any discretization? • Do you perform any data transformations? • Do you apply any data reduction strategies?

  18. Data mining on MIS 3 • Define your target and input variables in classification. Which classification techniques and algorithms do you use in solving the classification problem? Support your answer. • Define your variables, indicating their categories, in clustering. Which clustering techniques and algorithms do you use in solving the clustering problem? Support your answer. • Describe the association task in detail, specifying the algorithm and the interestingness measures or constraints, if any.

  19. Final 2010/2011 Spring (MIS) • 3. (35 pts) The aim of Knowledge Discovery from Databases (KDD) is to extract interesting, potentially useful, …, knowledge from data. The extracted knowledge can be represented in a knowledge base similar to a database. Considering the data mining functionalities and algorithms we covered in this course, describe five different knowledge types. For each type discuss the following aspects: • a) From which functionality and algorithm are they obtained? • b) How are they represented in the knowledge base? (Do not consider data structures.) • c) What are their quality characteristics? • d) How are they used in the deployment phase?

  20. BIS 541 2011/2012 Final • 1. For each of the following problems, identify the relevant data mining tasks • a) A weather analyst is interested in calculating the likely change in temperature for the coming days. • b) A marketing analyst is looking for groups of customers so as to apply different CRM strategies for each group. • c) A medical doctor must decide whether a set of symptoms is an indication of a particular disease. • d) An educational psychologist would like to identify exceptional students to suggest them for special educational programs.

  21. BIS 541 2012/2013 Final • For each of the following problems, identify the relevant data mining tasks with a brief explanation • a) A weather analyst is interested in whether the temperature will be up or down for the coming day. • b) An insurance analyst intends to group policy holders according to characteristics of customers and policies. • c) A medical researcher is looking for symptoms that occur together among a large set of patients. • d) An educational program director would like to determine the likely GPA of applicants to an MA program from their ALES scores, undergraduate GPAs and entrance exam scores.

  22. Basic Functionalities • Decision tree - ID3 • Information gain • Association – Apriori • Clustering – k-means

  23. Information gain • Consider a data set of two attributes A and B. A is continuous, whereas B is categorical, having two values, “y” and “n”, and can be considered the class of each observation. When attribute A is discretized into two equi-width intervals, it provides no information about the class attribute B, but when it is discretized into three equi-width intervals, it provides perfect information about B. Construct a simple data set obeying these characteristics.
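
One candidate data set for this question can be checked numerically. The sketch below is an illustration under stated assumptions, not an official solution: the four (A, B) pairs are made up, equi-width bins are taken over the observed range of A, and the helper names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def equal_width_bin(value, low, high, n_bins):
    """Index of the equi-width bin holding value; the maximum goes in the last bin.

    Assumes high > low.
    """
    width = (high - low) / n_bins
    return min(int((value - low) / width), n_bins - 1)


def information_gain(values, labels, n_bins):
    """Entropy of the labels minus their expected entropy after binning values."""
    low, high = min(values), max(values)
    bins = {}
    for v, y in zip(values, labels):
        bins.setdefault(equal_width_bin(v, low, high, n_bins), []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in bins.values())
    return entropy(labels) - remainder


# Assumed data: class "y" sits in the middle of the range of A, "n" at both ends.
A = [0.5, 1.25, 1.75, 2.5]
B = ["n", "y", "y", "n"]

gain_two_bins = information_gain(A, B, 2)
gain_three_bins = information_gain(A, B, 3)
print(gain_two_bins, gain_three_bins)
```

With two bins each half contains one "y" and one "n", so the split tells nothing about B. With three bins the outer bins are pure "n" and the middle bin is pure "y", so the gain equals the full entropy of B (1 bit).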

  24. Decision tree • 2. a) Construct a data set that generates the tree shown below. In addition, the following conditions are satisfied • [Figure: decision tree with Node 2 (A=a1, decision Y), Node 3 (A=a2), Node 4 (B=b1, decision N), Node 5 (B=b2, decision Y)]

  25. Midterm 2006/2007 Spring (MIS) • 2. Show that entropy is not a symmetric measure of association like the correlation coefficient is. Construct a simple data set of two categorical attributes A and B such that i) knowing the values of A provides perfect information to predict B, but ii) knowing the values of B does not provide perfect information to predict A.
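
The asymmetry asked about here can be checked with conditional entropy. The sketch below is one possible illustration, not an official answer: the four-row data set is made up, with B a function of A that merges pairs of A values.

```python
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def conditional_entropy(target, given):
    """H(target | given): expected entropy of target within each value of given."""
    groups = defaultdict(list)
    for t, g in zip(target, given):
        groups[g].append(t)
    n = len(target)
    return sum(len(ts) / n * entropy(ts) for ts in groups.values())


# Assumed data: each value of A maps to exactly one value of B, but not vice versa.
A = ["a1", "a2", "a3", "a4"]
B = ["b1", "b1", "b2", "b2"]

h_b_given_a = conditional_entropy(B, A)  # uncertainty left in B once A is known
h_a_given_b = conditional_entropy(A, B)  # uncertainty left in A once B is known
print(h_b_given_a, h_a_given_b)
```

Knowing A leaves no uncertainty about B (0 bits), while knowing B still leaves 1 bit of uncertainty about A, so the entropy reduction is direction dependent.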

  26. at a particular node • when information gain is 0 • when it gets maximum value

  27. Associations • In a particular database, A→C and B→C are strong association rules based on the support-confidence measure. A and B are independent items. Does this imply that A  BC is also a strong rule based on the lift measure? A, B, C are items in a transaction database. • If A→B and B→C are strong, is A→C a strong rule? • If A→B and A→C are strong, is B→C a strong rule?

  28. Data Description/Preprocessing

  29. Midterm 2004/2005 Spring (MIS) • Consider the correlation coefficient between two numerical variables. Is its numerical value affected by the units of measure of these variables (such as measuring temperature in °C or °F)?
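
A quick numerical check of this question, as a sketch: the temperature and sales figures below are made up, and `pearson` is a hand-written helper rather than a library call.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical observations: temperature in Celsius and some sales figure.
celsius = [12.0, 15.5, 19.0, 23.5, 28.0]
sales = [110.0, 135.0, 160.0, 210.0, 240.0]

# The same temperatures expressed in Fahrenheit (a positive linear transformation).
fahrenheit = [c * 9 / 5 + 32 for c in celsius]

r_celsius = pearson(celsius, sales)
r_fahrenheit = pearson(fahrenheit, sales)
print(r_celsius, r_fahrenheit)
```

The two coefficients agree up to floating-point rounding. Shifting a variable leaves its deviations from the mean unchanged, and a positive scale factor multiplies both the covariance and the standard deviation by the same amount, so it cancels. A negative scale factor would flip the sign of the coefficient.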

  30. Midterm 2011/2012 Fall generate data • 5. (10 points) Consider two continuous variables X and Y. Generate data sets • a) where PCA (principal component analysis) cannot reduce the dimensionality from two to one • b) where, although the two variables are related (a functional relationship exists between them), PCA is not able to reduce the dimensionality from two to one
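
For part (b), one possible construction is a symmetric nonlinear relationship such as Y = X². The sketch below is an assumption-laden illustration, not an official answer: it computes the share of variance on the first principal component from the closed-form eigenvalues of the 2x2 covariance matrix (no scaling of the variables), and contrasts the parabola with a linear relationship.

```python
from math import sqrt


def first_component_share(xs, ys):
    """Share of total variance captured by the first principal component.

    Uses the closed-form largest eigenvalue of the 2x2 covariance matrix.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    half_trace = (var_x + var_y) / 2
    spread = sqrt(((var_x - var_y) / 2) ** 2 + cov_xy ** 2)
    largest_eigenvalue = half_trace + spread
    return largest_eigenvalue / (var_x + var_y)


# Assumed data: X symmetric around zero.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
linear = [2 * x + 1 for x in xs]   # Y is a linear function of X
parabola = [x ** 2 for x in xs]    # Y is a nonlinear function of X

share_linear = first_component_share(xs, linear)
share_parabola = first_component_share(xs, parabola)
print(share_linear, share_parabola)
```

For the linear data the first component carries all the variance, so one dimension suffices. For the parabola the covariance between X and Y is zero even though Y is fully determined by X, so the first component carries well under all of the variance and PCA keeps two dimensions.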

  31. Midterm 2010/2011 Spring (MIS) • 3. (25 points) Consider a data set of two continuous variables X and Y. X is right skewed and Y is left skewed. Both represent measures of the same quantity (sales categories, exam grades, …) • a) Draw typical distributions of X and Y separately. • b) Draw box plots of X and Y separately. • c) Draw q-plots (quantile) of X and Y separately. • d) Draw q-q plot of X and Y.

  32. MIS 541 2012/2013 Final • 1. (20 pts) Consider a data set of two continuous variables X and Y. X and Y both have the same mean and no skewness (symmetric); X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades, …) • a) Draw typical distributions of X and Y on the same graph. • b) Draw box plots of X and Y separately.

  33. Final 2011/2012 Fall data description • 1. (20 points) Give two examples of outliers: • a) Where outliers are useful and essential patterns to be mined. • b) Where outliers are useless, stemming from error or noise.

  34. Final 2011/2012 Fall preprocessing • 2. (20 points) Considering the classification methods we covered in class, describe two distinct reasons why continuous input variables have to be normalized for classification problems (each reason 10 points).

  35. Midterm 2008/2009 Spring • 4. (20 points) Principal components analysis is used for dimensionality reduction and may then be followed by cluster analysis, say for segmentation purposes. Consider a problem with two continuous variables. Using scatter plots • a) Generate a data set where PCA reduces the dimensionality from two to one • b) Generate a data set where, although there is a relation between the two variables, PCA is not able to reduce the dimensionality to one • c) Generate a data set where there are natural clusters and PCA can reduce the dimensionality • d) Generate a data set where there are natural clusters but PCA is not the appropriate method for reducing the dimensionality

  36. Midterm 2012/2013 Fall (MIS) • 1. (20 pts) Consider a data set of two continuous variables X and Y. X and Y both have the same mean and no skewness (symmetric); X has a higher variance than Y. Both represent measures of the same quantity (sales categories, exam grades, …) • a) Draw typical distributions of X and Y on the same graph. • b) Draw box plots of X and Y separately. • c) Draw q-plots (quantile) of X and Y separately. • d) Draw q-q plot of X and Y.

  37. Data Warehousing/OLAP • Design of OLAP cubes • Measures

  38. Midterm 2005/2006 Spring (MIS) • A large hypermarket has many branches throughout the country. Quantity purchased Qi and price Pi for each item i are stored in a warehouse. The top management is interested in finding the cheapest of the items sold in the largest quantities, minp(maxq item i). Is it possible to accomplish this in a distributive manner? In other words, is minp(maxq item i) a distributive measure?

  39. Final 2007/2008 Spring (MIS) • 1. (25 pts) Suppose an aggregation is to be designed to obtain weekly dollar values from daily values in the two different ways described below. Can they be computed in a distributive manner? (The database has day ID and dollar value fields. Records are randomly selected and assigned to different processing units.) • a) Taking the daily averages • b) Taking the last day’s value of the week
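
The two aggregations in this question can be simulated on partitioned data. The sketch below is an illustration under assumptions, not an official answer: the seven (day ID, dollar value) records and their assignment to two processing units are made up.

```python
# Hypothetical week of (day_id, dollar_value) records, split unevenly
# across two processing units in arbitrary order.
partitions = [
    [(5, 130.0), (1, 100.0)],
    [(7, 70.0), (2, 120.0), (3, 90.0), (6, 80.0), (4, 110.0)],
]


def partial_average_state(records):
    """Per-unit state for the average: (sum, count), both distributive."""
    return sum(value for _, value in records), len(records)


def partial_last_day(records):
    """Per-unit state for the last-day value: the record with the largest day ID."""
    return max(records)  # tuples compare by day_id first


# (a) Average: combine sums and counts, then divide once at the end.
states = [partial_average_state(p) for p in partitions]
weekly_average = sum(s for s, _ in states) / sum(c for _, c in states)

# Averaging the per-unit averages is not the same thing when units differ in size.
naive_average = sum(s / c for s, c in states) / len(states)

# (b) Last day's value: the maximum of the per-unit maxima by day ID.
weekly_last_value = max(partial_last_day(p) for p in partitions)[1]

print(weekly_average, naive_average, weekly_last_value)
```

In the usual terminology the average is algebraic rather than distributive: it cannot be combined from partial averages, but it can be derived from two distributive measures, sum and count. The last-day value behaves like a maximum over day IDs, which combines directly across partitions.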

  40. Data warehouse for library • A data warehouse is constructed for the library of a university to be used as a multi-purpose DSS. Suppose this warehouse consists of the following dimensions: user, books, time (time_ID, year, quarter, month, week, academic year, semester, day), and . “Week” is considered not to be less than “month”. Each academic semester starts and ends at the beginning and end of a week respectively. Hence, week<semester. • Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. Describe at least two meaningful measures in the fact table. Each dimension can be looked at its ALL level as well. • What is the total number of cuboids for the library cube? • Describe three meaningful OLAP queries and write SQL expressions for one of them.

  41. OLAP Big University • 2. (Han, page 100, 2.4) Suppose that the data warehouse for the Big-University consists of the following dimensions: student, course, instructor, semester, and two measures, count and average_grade. At the lowest conceptual level (for a given student, instructor, course, and semester) the average_grade measure stores the actual grade of the student. At higher conceptual levels average_grade stores the average grade for the given combination. (When student is MIS, semester is 2005 all terms, course is MIS 541, and instructor is Ahmet Ak, average_grade is the average of students' grades in that course by that instructor in all semesters in 2005.)

  42. cont. • a) Draw a snowflake schema diagram for that warehouse • What are the concept hierarchies for the dimensions? • b) What is the total number of cuboids?
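
The cuboid count asked for here follows the standard product formula: with L_i levels in the hierarchy of dimension i (not counting the all level), the total is the product of (L_i + 1) over all dimensions. The sketch below applies it to the two level configurations stated in earlier questions of this deck; the function name is a hypothetical helper, and this slide's own question does not state level counts.

```python
from math import prod


def total_cuboids(levels_per_dimension):
    """Number of cuboids in a cube.

    levels_per_dimension lists, for each dimension, the number of hierarchy
    levels excluding the 'all' level; the +1 accounts for 'all'.
    """
    return prod(levels + 1 for levels in levels_per_dimension)


# Library warehouse: three dimensions, four levels each, not including 'all'.
library_cuboids = total_cuboids([4, 4, 4])

# MIS warehouse: four dimensions, five levels each including 'all',
# which means four levels each excluding 'all'.
mis_cuboids = total_cuboids([4, 4, 4, 4])

print(library_cuboids, mis_cuboids)
```

This gives 5³ = 125 cuboids for the library cube and 5⁴ = 625 for the MIS cube. The main pitfall is whether the stated level count already includes the all level.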

  43. MIS 542 Final S06 1 OLAP • 1. The MIS department wants to revise its academic strategies for the following ten years. Relevant questions are: What portion of the courses are required or elective? What is the full-time/part-time distribution of instructors? What is the course load of instructors? What percent of technical or managerial courses are taught by part-time instructors? How have all these things

  44. MIS 542 Final S06 1 cont. • changed over the years? You can add similar strategic questions of your own. Do not consider the student aspects of the problem for the time being. Design an OLAP schema to be used as a strategic tool. You are free to decide the dimensions and the fact table. Describe the concept hierarchies, virtual dimensions and calculated members. Finally, show OLAP operations to answer three such strategic questions.

  45. Midterm 2006/2007 Spring • 1. A data warehouse is constructed for the web site of an e-commerce company to be used for customer segmentation. Each visitor's click stream data is recorded. Each session has an ID. Suppose this warehouse consists of the following dimensions: visitor, time, product. There is a concept hierarchy for products which is reflected in the design of the web site, so that products can be seen in a hierarchical manner. When a product is seen it can be purchased. Only registered customers can use the system, so each visitor has an ID. When registering, a form is filled out so that socio-demographic information is taken from the customer. Suppose income (a numerical variable), birthday, gender, profession, and marital status are asked.

  46. cont. • a) Describe concept hierarchies for the three dimensions. Construct meaningful attributes for each of the dimension tables above. (What transformations are required before constructing these attributes?) Describe at least two meaningful measures in the fact table. • b) Each dimension can be looked at its ALL level as well. • Describe three meaningful OLAP queries and write SQL expressions for one of them. • c) Define a clustering problem: Which variables are important? Is there a missing value problem? What data transformations are needed? Which algorithm would you suggest?

  47. Midterm 2007/2008 Spring • 1. (20 points) Consider a shipment company responsible for shipping items from one location to another on predetermined due dates. Design a star schema OLAP cube for this problem to be used by managers for decision making purposes. The dimensions are time, item to be shipped, person responsible for shipping the item, and location. For each of these dimensions determine three levels in the concept hierarchy. Design the fact table with appropriate measures and keys (include two measures and at least one calculated member in the fact table). • Show one drill-down and one roll-up operation. • Show the SQL query for one of the cuboids.

  48. Midterm 2008/2009 Spring • 1. (25 points) In an organization a data warehouse is to be designed for evaluating the performance of employees. To evaluate the performance of an employee, a survey questionnaire consisting of a set of questions on a 5-point Likert scale is answered by other employees in the same company at specified times. That is, the performance of employees is rated by other employees. • Each employee has a set of characteristics including department, education, … Each survey is conducted on a particular date and applied to some of the employees. Questions are aimed at evaluating broad categories of performance such as motivation, cooperation ability, … • Typically, a question in a survey, aiming to measure a specific attitude of an employee, is evaluated by another employee (rated from 1 to 5). Data is available at question level.

  49. cont. • Cube design: a star schema • Fact table: Design the fact table; it should contain one calculated member. What are the measures and keys? • Dimension tables: Employee and Time are the two essential dimensions; include Survey and Question dimensions as well. For each dimension show a concept hierarchy. • State three questions that can be answered by that OLAP cube. • Show drill-down and roll-up operations related to these questions.

  50. MIS 541 2012/2013 Final • 2. (20 pts) Suppose that a data warehouse for a hospital consists of the following dimensions: time, doctor and patient, and the two measures count and charge, where charge is the fee a doctor charges a patient for a visit. • Design a warehouse with a star schema: • a) Fact table: Design the fact table. • b) Dimension tables: For each dimension show a reasonable concept hierarchy. • c) State two questions that can be answered by that OLAP cube. • d) Show drill-down and roll-up operations related to one of these questions.
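
One way to picture the star schema and a roll-up for this question is a small in-memory SQLite database. Everything below is an assumption made for illustration: table and column names, the day/month/year hierarchy, and the sample rows are invented, and this is a sketch rather than a model answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables, each carrying its concept hierarchy as columns (assumed design).
CREATE TABLE time_dim (time_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE doctor_dim (doctor_id INTEGER PRIMARY KEY, name TEXT, specialty TEXT);
CREATE TABLE patient_dim (patient_id INTEGER PRIMARY KEY, name TEXT, city TEXT);

-- Fact table: one row per visit, keyed by the three dimensions.
CREATE TABLE visit_fact (
    time_id INTEGER REFERENCES time_dim(time_id),
    doctor_id INTEGER REFERENCES doctor_dim(doctor_id),
    patient_id INTEGER REFERENCES patient_dim(patient_id),
    charge REAL
);

-- Hypothetical sample rows.
INSERT INTO time_dim VALUES
    (1, '2013-01-10', '2013-01', 2013),
    (2, '2013-01-20', '2013-01', 2013),
    (3, '2013-02-05', '2013-02', 2013);
INSERT INTO doctor_dim VALUES (1, 'Dr. A', 'cardiology'), (2, 'Dr. B', 'neurology');
INSERT INTO patient_dim VALUES (1, 'P1', 'City X'), (2, 'P2', 'City Y');
INSERT INTO visit_fact VALUES
    (1, 1, 1, 100.0), (2, 1, 2, 150.0), (3, 2, 1, 200.0), (3, 1, 2, 50.0);
""")


def charges_by(level):
    """Visit count and total charge grouped at one level of the time hierarchy."""
    if level not in {"day", "month", "year"}:
        raise ValueError("unknown time level")
    query = (
        f"SELECT t.{level}, COUNT(*), SUM(f.charge) "
        "FROM visit_fact f JOIN time_dim t ON f.time_id = t.time_id "
        f"GROUP BY t.{level} ORDER BY t.{level}"
    )
    return conn.execute(query).fetchall()


by_month = charges_by("month")  # finer level
by_year = charges_by("year")    # roll-up from month to year
print(by_month)
print(by_year)
```

Moving from `by_month` to `by_year` is a roll-up along the time hierarchy, and the reverse direction is a drill-down. Both count and total charge can be rolled up by re-aggregating, whereas an average charge would have to be derived from the sum and the count.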
