
Data Mining

Data Mining. Seonggyu Kim. Data – recorded facts. Information – set of patterns or expectations. Data mining – extraction of implicit, previously unknown, and potentially useful information from data.


Presentation Transcript


  1. Data Mining Seonggyu Kim

  2. Data Mining • Data – recorded facts • Information – set of patterns or expectations • Data mining – extraction of implicit, previously unknown, and potentially useful information from data • Based on machine learning: programs that sift through databases automatically, seeking regularities or structural patterns for making accurate predictions on future data • Most patterns are not interesting • Patterns may be inexact (or even completely spurious) if data is garbled or missing: garbage in, garbage out

  3. Structural Patterns • Rule: If tear production rate = reduced then recommendation = none • Otherwise, if age = young and astigmatic = no then recommendation = soft
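
The two rules on this slide form an ordered rule list: the first rule whose condition holds decides the recommendation. A minimal sketch in Python, assuming string-valued attributes; the function name `recommend_lenses` and the `None` fallback are illustrative choices, not part of the slides:

```python
def recommend_lenses(tear_production_rate, age, astigmatic):
    """Apply the slide's rules in order; the first rule that fires wins."""
    if tear_production_rate == "reduced":
        return "none"
    if age == "young" and astigmatic == "no":
        return "soft"
    # The slide lists no further rules, so this sketch leaves the case undecided.
    return None


print(recommend_lenses("reduced", "young", "no"))  # none
print(recommend_lenses("normal", "young", "no"))   # soft
```

Because the second rule is only reached when the first one fails, it implicitly carries the extra condition "tear production rate is not reduced".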

  4. Database • A collection of interrelated data • Data Sharing • Data Independence

  5. Database System

  6. Traditional OLTP • DBMS used for on-line transaction processing (OLTP) • order entry, balance transfer, etc. • Clerical data processing tasks • Detailed up-to-date data • Structured, repetitive tasks • Short transactions • Isolation, recovery, and integrity

  7. OLTP to OLAP OnLine Transaction Processing OnLine Analytical Processing

  8. Relational Source Data

  9. OLAP Cube

  10. Transactions • Transaction data: name or identifier of customer, items purchased, price paid, dates on which items are sold • Business decisions: what to stock, when to stock, what discount to offer • Decision support systems • SQL extensions • Statistical analyses of data • Knowledge discovery, in other words, data mining • Data warehouse – allows a company to retrieve diverse data on demand

  11. OLAP (OnLine Analytical Processing) • Data is viewed as a table or, with more than three axes, as a hypercube • Dimension: category, location • Slice: single, condo • Level: year, month or city, state • Measure: source data • Data warehouse • Multidatabase approach
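
The terms dimension, slice, level, and measure can be illustrated with a tiny in-memory cube. This is only a sketch under stated assumptions: the fact rows, the city names, and the helper names `slice_cube` and `roll_up` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical fact rows: four dimension values followed by one measure.
DIMENSIONS = ("year", "month", "city", "category")
FACTS = [
    (2023, 1, "Austin", "condo", 120.0),
    (2023, 1, "Austin", "single", 200.0),
    (2023, 2, "Dallas", "condo", 90.0),
    (2024, 1, "Austin", "condo", 150.0),
    (2024, 3, "Dallas", "single", 310.0),
]


def slice_cube(facts, **fixed):
    """Slice: keep only the cells whose dimension values match `fixed`."""
    index = {name: i for i, name in enumerate(DIMENSIONS)}
    return [row for row in facts
            if all(row[index[dim]] == value for dim, value in fixed.items())]


def roll_up(facts, *dims):
    """Roll up: sum the measure over every dimension not named in `dims`."""
    positions = [DIMENSIONS.index(dim) for dim in dims]
    totals = defaultdict(float)
    for row in facts:
        totals[tuple(row[i] for i in positions)] += row[-1]
    return dict(totals)


condo = slice_cube(FACTS, category="condo")   # slice on the category dimension
by_year = roll_up(condo, "year")              # move from month level to year level
print(by_year)  # {(2023,): 210.0, (2024,): 150.0}
```

Rolling up to `("year", "city")` instead would keep two axes of the cube and aggregate away month and category.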

  12. Web Interface to Database • Network database applications • Database applications on Internet and intranet • Static report publishing • DB query publishing • Private network • Public network via firewall

  13. Database Access Standards • Three-tier architecture: browser, web server, database server • Standard, DBMS-independent interfaces for accessing the database server • ODBC • OLE DB • ADO (ActiveX Data Objects) • JDBC • Native calls • Browser formats: HTML, DHTML, XML

  14. Role of ADO • Diagram labels: browsers, web server, ADO, OLE DB, ODBC, native interface, DB server; data sources: RDB, NRDB, file, e-mail, multimedia data • Enables programmers in almost any language to access OLE DB • Works in conjunction with Remote Data Services objects

  15. Markup Languages • Specify the appearance and behavior of Web pages • SGML • HTML (www.w3.org) • No way to separate contents from layout and format • Lack of style definitions • No way to access web page elements with scripts • No construct to facilitate caching and data manipulation on clients • DHTML (Dynamic HTML) • Microsoft implementation of HTML 4.0 • Document Object Model (DOM) enables contents to be altered without refreshing them from the server • Cascading Style Sheets • Remote Data Services (RDS), ActiveX objects for exchanging data with the server

  16. XML (eXtensible Markup Language) • Clear separation of contents, layout, and materialization • Standardized, but extensible by developers • Standard means for expressing the structure of database views using DTD or XML Schema • No built-in compression scheme • A standardized facility to describe, validate, and materialize any database view
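
To make "materializing a database view as XML" concrete, here is a rough sketch using Python's standard `xml.etree.ElementTree` module. The view name, column names, and rows are hypothetical. ElementTree does not validate against a DTD or XML Schema, so the validation step mentioned on the slide is not shown.

```python
import xml.etree.ElementTree as ET

# Hypothetical rows of a database view; names and values are made up.
rows = [
    {"customer_id": "C001", "name": "Ada", "city": "Seoul"},
    {"customer_id": "C002", "name": "Lin", "city": "Busan"},
]


def view_to_xml(view_name, rows):
    """Build one <row> element per record, with one child element per column."""
    root = ET.Element(view_name)
    for row in rows:
        record = ET.SubElement(root, "row")
        for column, value in row.items():
            ET.SubElement(record, column).text = str(value)
    return root


xml_text = ET.tostring(view_to_xml("customer_view", rows), encoding="unicode")

# Any XML-aware client can read the content back without knowing the layout.
parsed = ET.fromstring(xml_text)
names = [record.findtext("name") for record in parsed.findall("row")]
print(names)  # ['Ada', 'Lin']
```

The document carries only content and structure; how it is laid out on screen would be decided separately, for example by a style sheet.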

  17. Multidatabase Systems • Manipulation of databases located in a heterogeneous collection of hardware and software environments • Each local database may have a different logical model, DDL, and DML • A multidatabase system integrates local databases logically without requiring physical database integration • Common conceptual schema • Local and global transactions • Two-level serializability

  18. Data Warehouse • A repository of information gathered from multiple sources, stored under a unified schema at a single site • A store of enterprise data for decision making • Multidimensional, subject-oriented, integrated, time-variant, nonvolatile data

  19. Warehouse, Mart, Mine • Data warehouse: where corporate data is stored • Data mart: where departmental data is stored • Star schemas or normalized form • Data mine: where data is reorganized for analysis and information is extracted from the data • For specific business objectives • Enriched by additional external data

  20. Characteristics of Data Warehouse • Subject-oriented • What to analyze, e.g., customer, sales • Integrated • Time-variant • History • Nonvolatile • Load only, almost no update

  21. Information Flow in Data Warehouse Environment • In flow • Information to Data warehouse • Up flow • Summary of Information • Down flow • Small OLTP storage media • Out flow • Information to customer • Meta flow • Control flow for operation

  22. Purpose of Data Warehouse • OLTP systems are designed to capture data at the detailed level of the individual transaction • Data warehouses are designed for getting data out • What were the best-selling products last week? • What’s the quarterly trend in orders by region? • They answer business questions by interpreting business requirements in a dimensional way

  23. Dimensional Modeling: Star Schema Approach • A dimensional model is: • An analytic tool in planning the data warehouse • A physical design for its implementation in a relational database • A direct reflection of the manner in which a business process is viewed • A representation of important business measurements and the parameters by which the measurements are broken out

  24. Business Vocabulary 1 • Gross margin by product category • Average account balance by educational level • Compare inventory level with sales by product by warehouse • Outstanding 180-day receivables by general ledger account • Return rate by supplier

  25. Business Vocabulary 2 • Business process – orders, account management, inventory management, accounts receivable, returns • Summarizing a lot of transactions • Subject area – margin, product category, account balance, education level • How the business measures the process

  26. Measurements • Margin is broken out or rolled up by • Product category • Individual product • Salesperson • Customer • Time period • Broken out or rolled up by numerous parameters → dimension tables

  27. Measures and Dimensions

  28. A Dimensional Model • Customer: Customer Name, Customer Code, Billing Address, Shipping Address • Product: Product Name, SKU, Brand • Order Measures: Order Dollars, Cost, Margin Dollars, Quantity Sold • Order Date: Date, Month, Quarter, Year • Salesperson: Salesperson name, Territory name, Region name • Order Measures is the fact table; the others are dimension tables; together they form a star schema

  29. A Star Schema • Fact table and its associated dimensions • Product: Product_Key, Product, SKU, Brand • Customer: Customer_key, Name, Customer_id, Billing_Address, Billing_city, Billing_state, Billing_zip, Shipping_Address • Order_Facts: Product_key, Order_date_key, Salesperson_key, Customer_key, Order_dollars, Extended_cost, Margin_dollars, Quantity_ordered, Order_number, Order_line • Date: Date_key, Date, Day_of_week_no, Day_of_week_name, Day_of_month, Month_number, Month_name, Fiscal_period, Year • Salesperson: Salesperson_key, Salesperson_name, Salesperson_code, Territory_name, Region_name • Same level of detail
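
A cut-down version of this star schema can be tried out with Python's built-in `sqlite3` module. The sketch below keeps only two dimensions and a few columns; the sample rows, the shortened table `date_dim`, and its column `full_date` are assumptions made for illustration. The query answers a "margin by brand by month" question by joining the fact table to its dimensions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_key INTEGER PRIMARY KEY, product TEXT, sku TEXT, brand TEXT);
CREATE TABLE date_dim (date_key INTEGER PRIMARY KEY, full_date TEXT, month_name TEXT, year INTEGER);
CREATE TABLE order_facts (
    product_key      INTEGER REFERENCES product(product_key),
    order_date_key   INTEGER REFERENCES date_dim(date_key),
    order_dollars    REAL,
    extended_cost    REAL,
    margin_dollars   REAL,
    quantity_ordered INTEGER
);
""")

# Hypothetical sample data, one fact row per order line.
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "Widget", "W-1", "Acme"),
    (2, "Gadget", "G-1", "Acme"),
    (3, "Gizmo", "Z-1", "Zenith"),
])
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)", [
    (10, "2024-01-15", "January", 2024),
    (11, "2024-02-03", "February", 2024),
])
conn.executemany("INSERT INTO order_facts VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 10, 100.0, 60.0, 40.0, 2),
    (2, 10, 50.0, 30.0, 20.0, 1),
    (3, 11, 80.0, 70.0, 10.0, 4),
    (1, 11, 200.0, 120.0, 80.0, 4),
])

# Dimension attributes qualify and group the query; the fact table supplies the measure.
margin_by_brand = conn.execute("""
    SELECT p.brand, d.month_name, SUM(f.margin_dollars)
    FROM order_facts AS f
    JOIN product  AS p ON p.product_key = f.product_key
    JOIN date_dim AS d ON d.date_key    = f.order_date_key
    GROUP BY p.brand, d.month_name
    ORDER BY p.brand, d.month_name
""").fetchall()

for row in margin_by_brand:
    print(row)
```

Every business question of the form "measure by dimension attribute" follows the same pattern: join the fact table to the relevant dimensions, then group by the chosen attributes.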

  30. Some Remarks • The fact table has foreign keys that relate each measurement to the appropriate rows in each of the dimension tables • Grain – the level of detail at which measures are recorded • A dimension table holds attributes • Used to qualify queries • Used to break out measures • The primary key of a dimension table is a single system-defined attribute

  31. Where Data Mining Happens • Above the warehouse, as a set of conceptual views • Beside the warehouse, as a separate repository • Within the warehouse, as a distinct set of resources

  32. Above the warehouse • Minimal architecture for the discovery and analysis • Data Mining is Not a key objective

  33. Most Effective Approach Beside the Warehouse

  34. Within the Warehouse • Massively Parallel Processing Computer

  35. Information is Crucial • In vitro fertilization case • Given: embryos described by 60 features • Problem: selection of embryos that will survive • Data: historical records of embryos and outcome • Cow culling case • Given: cows described by 700 features • Problem: selection of cows that should be culled • Data: historical records and farmers’ decisions

  36. Machine Learning Techniques • Technical basis for data mining – algorithms for acquiring structural descriptions from examples • Structural descriptions represent patterns explicitly • Can be used to predict the outcome in a new situation • Can be used to understand and explain how a prediction is derived • Methods originate from artificial intelligence, statistics, and database research

  37. Structural Descriptions • Classification rule: predicts the value of a pre-specified attribute (the classification of an example) • If outlook = sunny and humidity = high then play = no • Association rule: predicts the value of an arbitrary attribute or combination of attributes • If temperature = cool then humidity = normal • If humidity > 85 and windy = false then play = yes • If outlook = sunny and play = no then humidity = high • If windy = false and play = no then outlook = sunny and humidity = high

  38. Visualizing ML as Search • Inductive learning: finding a concept description that fits the data • Example: rule sets as description language • Enormous, but finite, search space • Simple solution: enumerate the concept space, eliminating descriptions that do not fit the examples • Surviving descriptions contain the target concept, the result of learning

  39. Enumerating the concept space • Search space for the problem • 288 possibilities for each of 14 rules • 2.7 × 10^34 different rule sets • Possible remedy: candidate-elimination algorithm • Other practical problems: • More than one description may survive • No description may survive • Language may not be able to describe the target concept • Data may contain noise
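
The numbers on this slide can be reproduced with a short enumeration. The sketch assumes a weather-style problem with four attributes whose domains have 3, 3, 2, and 2 values, each test optionally left as "don't care", and two classes; these domains are an assumption, since the slides give only the totals. It also shows the naive enumerate-and-eliminate idea for single rules, which is not the candidate-elimination algorithm itself.

```python
from itertools import product

# Assumed attribute domains; ANY means the rule does not test that attribute.
DOMAINS = {
    "outlook": ["sunny", "overcast", "rainy"],
    "temperature": ["hot", "mild", "cool"],
    "humidity": ["high", "normal"],
    "windy": [True, False],
}
CLASSES = ["yes", "no"]
ANY = None


def enumerate_rules():
    """Yield every (conditions, label) pair in the single-rule concept space."""
    choices = [[ANY] + values for values in DOMAINS.values()]
    for conditions in product(*choices):
        for label in CLASSES:
            yield dict(zip(DOMAINS, conditions)), label


def consistent(rule, examples):
    """A rule survives unless it fires on an example and predicts the wrong class."""
    conditions, label = rule
    for attributes, actual in examples:
        fires = all(v is ANY or attributes[k] == v for k, v in conditions.items())
        if fires and actual != label:
            return False
    return True


rules = list(enumerate_rules())
print(len(rules))        # 288 = 4 * 4 * 3 * 3 * 2
print(len(rules) ** 14)  # number of rule sets with 14 rules, about 2.7e34

# Two hypothetical training examples eliminate the rules that contradict them.
examples = [
    ({"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": False}, "no"),
    ({"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": False}, "yes"),
]
survivors = [rule for rule in rules if consistent(rule, examples)]
print(len(survivors))    # 256
```

Enumerating 288 single rules is trivial, but the same approach over all rule sets is hopeless, which is why the slide points to smarter search such as candidate elimination.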

  40. Bias • The most important decisions in learning systems: • The concept description language • The order in which the space is searched • The way that overfitting to the particular training data is avoided • These properties form the bias of the search • Language bias: e.g., whether the language can represent negation or disjunction • Search bias: search heuristics or search direction • Overfitting-avoidance bias

  41. Data Mining & Ethics • Are resources put to good use? • Many ethical issues arise in practical applications • Data mining is often used to discriminate • e.g., loan applications: using information such as sex, religion, or race is unethical • The ethical situation depends on the application • e.g., the same information is OK in a medical application • Attributes may contain problematic information • e.g., area code may correlate with race

  42. Input & Output Knowledge Representation for Concepts

  43. What to Learn

  44. Components of the input • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Instances: the individual, independent examples of a concept • Note: more complicated forms of input are possible • Attributes: measuring aspects of an instance • We will focus on nominal and numeric ones • Practical issue: a file format for the input

  45. What’s a Concept? • Styles of learning: • Classification learning: predicting a discrete class • Association learning: detecting associations between features • Clustering: grouping similar instances into clusters • Numeric prediction: predicting a numeric quantity • Concept: thing to be learned • Concept description: output of learning scheme

  46. Classification vs Association Learning • Classification learning • Supervised, because the scheme is provided with the actual outcome • The class is the outcome • Success can be measured on fresh data for which class labels are known • Association learning • Can be applied if no class is specified and any kind of structure is interesting • Far more association rules than classification rules • Hence minimum coverage and minimum accuracy constraints
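
Coverage and accuracy can be computed directly from a table of instances. In this sketch, coverage is taken to be the number of instances a rule predicts correctly, and accuracy is that number divided by the number of instances the rule applies to; both definitions are assumptions, since the slide does not spell them out. The 14 rows are an illustrative weather-style table and are not taken from the slides.

```python
FIELDS = ("outlook", "temperature", "humidity", "windy", "play")
ROWS = [
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", True, "yes"),
    ("rainy", "mild", "high", True, "no"),
]
DATA = [dict(zip(FIELDS, row)) for row in ROWS]


def coverage_and_accuracy(data, antecedent, consequent):
    """Return (correctly predicted instances, fraction correct among matches)."""
    matches = [r for r in data if all(r[k] == v for k, v in antecedent.items())]
    correct = [r for r in matches if all(r[k] == v for k, v in consequent.items())]
    accuracy = len(correct) / len(matches) if matches else 0.0
    return len(correct), accuracy


# "If temperature = cool then humidity = normal"
print(coverage_and_accuracy(DATA, {"temperature": "cool"}, {"humidity": "normal"}))
# (4, 1.0)

# A weaker rule: "If humidity = high then play = no"
print(coverage_and_accuracy(DATA, {"humidity": "high"}, {"play": "no"}))
```

Setting a minimum on both numbers prunes the very large set of candidate association rules down to the ones that apply often and are usually right.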

  47. Clustering, Numeric Prediction • Clustering • Finding groups of items that are similar • Unsupervised, because the class is not known • Numeric prediction • Like classification learning but with a numeric class • Supervised, since the scheme is provided with the target value • Success is measured on test data

  48. What’s in an Example? • Most common form in data mining • Instance: specific type of example • Thing to be classified, associated, or clustered • Individual, independent example of the target concept • Characterized by a predetermined set of attributes • Input to learning scheme: set of instances / dataset • Represented as a single relation / flat file • Rather restricted form of input • No relationships between objects

  49. A Family Tree

  50. The Family Tree in a Table • Sister of? • Closed world assumption
