Data Mining Seonggyu Kim
Data Mining • Data – recorded facts • Information – set of patterns or expectations • Data Mining – extraction of implicit, previously unknown, and potentially useful information from data • Based on Machine Learning: programs that sift through databases automatically, seeking regularities or structural patterns for making accurate predictions on future data • Most patterns are not interesting • Patterns may be inexact (or even completely spurious) if data is garbled or missing • Garbage in, garbage out
Structural Patterns Rule If tear production rate = reduced then recommendation = none Otherwise, if age = young and astigmatic = no then recommendation = soft
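The rule on this slide can be read as an ordered list of tests. A minimal sketch, assuming the attribute values are plain strings; the function name `recommend` and the `None` fallback when neither rule fires are illustrative choices, not part of the slide:

```python
def recommend(tear_production_rate, age, astigmatic):
    """Apply the two rules from the slide in order.

    Returns None when neither rule fires (the slide does not say
    what happens in that case, so this fallback is an assumption).
    """
    if tear_production_rate == "reduced":
        return "none"
    # "Otherwise": only reached when the first rule did not fire.
    if age == "young" and astigmatic == "no":
        return "soft"
    return None
```

Because the second rule begins with "otherwise", the order of the tests matters: a young, non-astigmatic patient with reduced tear production still gets `"none"`.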
Database • A collection of interrelated data • Data Sharing • Data Independence
Traditional OLTP • DBMS used for on-line transaction processing (OLTP) • order entry, balance transfer, etc. • Clerical data processing tasks • Detailed up-to-date data • Structured, repetitive tasks • Short transactions • Isolation, recovery, and integrity
OLTP to OLAP • From OnLine Transaction Processing to OnLine Analytical Processing
Transactions • Name or identifier of customer • Items purchased • Price paid • Dates on which items are sold Business decisions • What to stock • When to stock • What discount to offer Decision Support Systems • SQL extensions • Statistical analyses of data • Knowledge discovery, in other words, Data Mining • Data warehouse – allows a company to retrieve diverse data on demand
OLAP (OnLine Analytical Processing) • Data is viewed as a table or, with more than three axes, as a hypercube • Dimension: Category, Location • Slice: single, condo • Level: year, month or city, state • Measure: source data • Data warehouse • Multidatabase approach
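The OLAP terms above (dimension, slice, level, measure) can be sketched with plain Python. The rows, the `sales` measure, and the helper names `slice_cube` and `roll_up` are all hypothetical, chosen only to mirror the slide's Category and Location example:

```python
from collections import defaultdict

# Hypothetical fact rows: dimensions (category, location, time) plus one measure.
FACTS = [
    {"category": "single", "city": "Austin", "state": "TX", "year": 2001, "sales": 3},
    {"category": "single", "city": "Dallas", "state": "TX", "year": 2001, "sales": 5},
    {"category": "condo", "city": "Austin", "state": "TX", "year": 2001, "sales": 2},
    {"category": "condo", "city": "Boston", "state": "MA", "year": 2002, "sales": 4},
]

def slice_cube(facts, dimension, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in facts if row[dimension] == value]

def roll_up(facts, levels, measure):
    """Roll up: sum the measure at the chosen levels (e.g. state instead of city)."""
    totals = defaultdict(int)
    for row in facts:
        key = tuple(row[level] for level in levels)
        totals[key] += row[measure]
    return dict(totals)
```

For example, `roll_up(slice_cube(FACTS, "category", "condo"), ["year"], "sales")` takes the condo slice and summarizes it at the year level.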
Web Interface to Database Network Database Applications Database applications on Internet and Intranet Static Report Publishing DB Query Publishing Private Network Public Network via Firewall
Database Access Standards • Three-tier architecture • Standard & DBMS-independent interfaces for accessing database server [Figure labels: Browser, Web Server, Database Server, DB; interfaces: ODBC, ADO (ActiveX Data Objects), OLE/DB, JDBC, native calls; page formats: HTML, DHTML, XML]
[Figure labels: RDB, Native Interface, DB Server, NRDB, ODBC, File, OLE DB, ADO, Web Server, Browser, E-mail, Multimedia Data] Role of ADO • Enables programmers in almost any language to access OLE DB • Works in conjunction with Remote Data Services objects
Markup Languages • Specify the appearance and behavior of Web pages • SGML • HTML • www.w3.org • No way to separate contents from layout & format • Lack of style definitions • No way to access Web page elements with scripts • No construct to facilitate caching & data manipulation on clients • DHTML (Dynamic HTML) • Microsoft implementation of HTML 4.0 • Document Object Model (DOM) enables contents to be altered without refreshing them from server • Cascading Style Sheets • Remote Data Services (RDS), ActiveX objects for exchanging data with server
XML (eXtensible Markup Language) • Clear separation of contents, layout and materialization • Standardized, but extensible by developers • Standard means for expressing the structure of database views using DTD or XML Schema • No built-in compression scheme • A standardized facility to describe, validate and materialize any database view
Multidatabase Systems • Manipulation of databases located in a heterogeneous collection of hardware and software environments • Local databases may have different logical models, DDL, DML • A multidatabase system integrates local databases logically without requiring physical database integration • Common Conceptual Schema • Local, Global Transaction • Two-level Serializability
Data Warehouse • A repository of information gathered from multiple sources, stored under a unified schema at a single site • A store of enterprise data for decision making • Multidimensional, Subject-oriented, Integrated, Time-variant, Nonvolatile data [Figure labels: checking, saving]
Warehouse, Mart, Mine • Data Warehouse: where corporate data is stored • Data Mart: where departmental data is stored • Star schemas or normalized form • Data Mine: where data is re-organized for analysis and information is extracted from the data • For specific business objectives • Enriched by additional external data
Characteristics of Data Warehouse • Subject-oriented • What to analyze, i.e., customer, sales • Integrated • Time-variant • History • Nonvolatile • Load only, almost no update
Information Flow in Data Warehouse Environment • In flow • Information to Data warehouse • Up flow • Summary of Information • Down flow • Small OLTP storage media • Out flow • Information to customer • Meta flow • Control flow for operation
Purpose of Data Warehouse • OLTP systems are designed to capture data – detailed level of the individual transaction • Data warehouses are designed for getting data out What were the best-selling products last week? What’s the quarterly trend in orders by region? To answer business questions by interpreting business requirement in dimensional way
Dimensional Modeling Star Schema Approach Dimensional Model • An analytic tool in planning the data warehouse • A physical design for its implementation in a relational database • A direct reflection of the manner in which a business process is viewed • A representation of Important business measurements and parameters by which the measurements are broken out
Business Vocabulary 1 • Gross margin by product category • Average account balance by educational level • Compare inventory level with sales by product by warehouse • Outstanding 180-day receivables by general ledger account • Return rate by supplier
Business Vocabulary 2 • Business process – orders, account management, inventory management, accounts receivable, returns • Summarizing a lot of transactions • Subject area – margin, product category, account balance, education level • How business measures the process
Measurements • Margin is broken out or rolled up by • Product category • Individual product • Sales person • Customer • Time period • Broken out or rolled up by numerous parameters: dimension tables
A Dimensional Model [Figure: star schema with one fact table and four dimension tables]
• Order Measures (fact table): Order Dollars, Cost, Margin Dollars, Quantity Sold
• Customer (dimension table): Customer Name, Customer Code, Billing Address, Shipping Address
• Product (dimension table): Product Name, SKU, Brand
• Order Date (dimension table): Date, Month, Quarter, Year
• Salesperson (dimension table): Salesperson name, Territory name, Region name
A Star Schema • Fact table & its associated dimensions
• Order_Facts (fact table): Product_key, Order_date_key, Salesperson_key, Customer_key, Order_dollars, Extended_cost, Margin_dollars, Quantity_ordered, Order_number, Order_line
• Product: Product_Key, Product, SKU, Brand
• Customer: Customer_key, Name, Customer_id, Billing_Address, Billing_city, Billing_state, Billing_zip, Shipping_Address
• Date: Date_key, Date, Day_of_week_no, Day_of_week_name, Day_of_month, Month_number, Month_name, Fiscal_period, Year
• Salesperson: Salesperson_key, Salesperson_name, Salesperson_code, Territory_name, Region_name
• Figure annotation: same level of detail
Some Remarks • Fact table has foreign keys that relate each measurement to the appropriate rows in each of the dimension tables • Grain – the level of detail at which measures will be recorded • Dimension table holds attributes • To be used to qualify queries • To be used to break out measures • Primary key of a dimension table is a single system-defined attribute
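The remarks above can be illustrated with a small in-memory database. This is a sketch using Python's standard `sqlite3` module; the table and column names follow the star schema slide, while the rows and the helper `margin_by_brand` are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension table: a single system-defined key plus descriptive attributes.
CREATE TABLE Product (Product_key INTEGER PRIMARY KEY, SKU TEXT, Brand TEXT);
-- Fact table: a foreign key to the dimension plus measures.
CREATE TABLE Order_Facts (
    Product_key INTEGER REFERENCES Product(Product_key),
    Order_dollars REAL,
    Margin_dollars REAL
);
-- Hypothetical sample rows.
INSERT INTO Product VALUES (1, 'W-1', 'Acme'), (2, 'G-1', 'Acme'), (3, 'Z-1', 'Zenith');
INSERT INTO Order_Facts VALUES (1, 100.0, 20.0), (2, 50.0, 10.0), (3, 80.0, 30.0), (1, 40.0, 5.0);
""")

def margin_by_brand(connection):
    """Break out the margin measure by a dimension attribute (Brand)."""
    return connection.execute("""
        SELECT p.Brand, SUM(f.Margin_dollars)
        FROM Order_Facts AS f
        JOIN Product AS p ON p.Product_key = f.Product_key
        GROUP BY p.Brand
        ORDER BY p.Brand
    """).fetchall()
```

The join follows the foreign key from the fact table to the dimension table, and the `GROUP BY` on a dimension attribute is what "breaking out a measure" means in query terms.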
Where Data Mining Happens • Above the warehouse, as a set of conceptual views • Beside the warehouse, as a separate repository • Within the warehouse, as a distinct set of resources
Above the warehouse • Minimal architecture for the discovery and analysis • Data Mining is Not a key objective
Beside the Warehouse • Most effective approach
Within the Warehouse • Massively Parallel Processing Computer
Information is Crucial • In vitro fertilization case • Given: embryos described by 60 features • Problem: selection of embryos that will survive • Data: historical records of embryos and outcome • Cow culling case • Given: cows described by 700 features • Problem: selection of cows that should be culled • Data: historical records and farmer’s decisions
Machine Learning Techniques • Technical basis for Data Mining – algorithms for acquiring structural descriptions from examples • Structural descriptions represent patterns explicitly • Can be used to predict the outcome in a new situation • Can be used to understand and explain how a prediction is derived • Methods originate from artificial intelligence, statistics and database research
Structural Descriptions • Classification rule: predicts value of pre-specified attribute (the classification of an example) • If outlook = sunny and humidity = high then play = no • Association rule: predicts value of arbitrary attribute or combination of attributes • If temperature = cool then humidity = normal • If humidity > 85 and windy = false then play = yes • If outlook = sunny and play = no then humidity = high • If windy = false and play = no then outlook = sunny and humidity = high
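Both kinds of rule on this slide share the same shape: a set of conditions and a set of predictions. A minimal sketch, assuming nominal attribute values stored as strings; the `(antecedent, consequent)` tuple encoding and the helper `fire` are illustrative, not taken from the slides:

```python
# A rule is (antecedent, consequent); each maps attribute name -> value.
# The classification rule always predicts the pre-specified class attribute "play".
CLASSIFICATION_RULE = ({"outlook": "sunny", "humidity": "high"}, {"play": "no"})
# The association rule predicts an arbitrary combination of attributes.
ASSOCIATION_RULE = ({"windy": "false", "play": "no"},
                    {"outlook": "sunny", "humidity": "high"})

def fire(rule, instance):
    """Return the rule's predictions if its antecedent matches the instance, else None."""
    antecedent, consequent = rule
    if all(instance.get(attr) == value for attr, value in antecedent.items()):
        return dict(consequent)
    return None
```

The only structural difference is the consequent: one fixed attribute for the classification rule, any attributes for the association rule.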
Visualizing ML as Search • Inductive learning: finding a concept description that fits the data • Example: rule sets as description language • Enormous, but finite, search space • Simple solution: enumerate the concept space, eliminating descriptions that do not fit the examples • Surviving descriptions contain the target concept: the result of learning
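The "enumerate and eliminate" idea can be sketched on a deliberately tiny concept space. Everything here is a hypothetical toy setup: two attributes, a description language of conjunctions in which `"?"` means "don't care", and a concept that predicts `True` exactly when all of its conditions match:

```python
from itertools import product

# Hypothetical attributes and their possible values.
ATTRIBUTES = {"outlook": ["sunny", "rainy"], "windy": ["true", "false"]}

def enumerate_concepts():
    """Yield every conjunction: each attribute is fixed to a value or left as '?'."""
    names = list(ATTRIBUTES)
    choices = [ATTRIBUTES[name] + ["?"] for name in names]
    for combo in product(*choices):
        yield dict(zip(names, combo))

def predicts_yes(concept, instance):
    """A concept covers an instance when every non-'?' condition matches."""
    return all(v == "?" or instance[a] == v for a, v in concept.items())

def surviving_concepts(examples):
    """Eliminate every description that disagrees with any labeled example."""
    return [c for c in enumerate_concepts()
            if all(predicts_yes(c, inst) == label for inst, label in examples)]

EXAMPLES = [
    ({"outlook": "sunny", "windy": "false"}, True),
    ({"outlook": "rainy", "windy": "false"}, False),
    ({"outlook": "sunny", "windy": "true"}, True),
]
```

With these three examples only one of the nine descriptions survives; with just the first two, more than one remains, which is why exhaustive enumeration alone does not always pin down a single concept.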
Enumerating the concept space • Search space for the problem • 288 possibilities for each of 14 rules • 2.7 × 10^34 different rule sets • Possible remedy: candidate-elimination algorithm • Other practical problems: • More than one description may survive • No description may survive • Language may not be able to describe the target concept • Data may contain noise
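The size of the search space follows directly from the two numbers on the slide: 288 choices for each of 14 rules gives 288 to the 14th power rule sets. The factorization 4 × 4 × 3 × 3 × 2 in the sketch below is an assumption about where 288 comes from (each attribute value plus a "don't care" option, times two classes); the slide gives only the totals.

```python
def rule_set_count(possibilities_per_rule=288, max_rules=14):
    """Number of distinct rule sets when each of the rules is chosen independently."""
    return possibilities_per_rule ** max_rules

# Assumed breakdown of 288 (not stated on the slide).
ASSUMED_FACTORS = 4 * 4 * 3 * 3 * 2

# Scientific notation, one decimal place, for comparison with the slide's figure.
print(f"{rule_set_count():.1e}")
```

A space of this size rules out naive enumeration in practice, which is why the slide points to remedies such as candidate elimination.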
BIAS • The most important decisions in learning systems: • The concept description language • The order in which the space is searched • The way that overfitting to the particular training data is avoided • These properties form the bias of the search • Language bias: e.g., whether the language can represent negation or disjunction • Search bias: search heuristics or search direction • Overfitting-avoidance bias
Data Mining & Ethics • Are resources put to good use? • Many ethical issues arise in practical applications • Data mining is often used to discriminate • e.g., loan applications: using some information such as sex, religion, or race is unethical • Ethical situation depends on application • e.g., same information is OK in a medical application • Attributes may contain problematic information • e.g., area code may correlate with race
Input & Output Knowledge Representation for Concepts
Components of the input • Concepts: kinds of things that can be learned • Aim: intelligible and operational concept description • Instances: the individual, independent examples of a concept • Note: more complicated forms of input are possible • Attributes: measuring aspects of an instance • We will focus on nominal and numeric ones • Practical issue: a file format for the input
What’s a Concept? • Styles of learning: • Classification learning: predicting a discrete class • Association learning: detecting associations between features • Clustering: grouping similar instances into clusters • Numeric prediction: predicting a numeric quantity • Concept: thing to be learned • Concept description: output of learning scheme
Classification vs Association Learning • Classification learning • Supervised: the scheme is provided with the actual outcome • Class is the outcome • Success can be measured on fresh data for which class labels are known • Association learning • Can be applied if no class is specified and any kind of structure is interesting • Far more association rules than classification rules • Hence minimum coverage and minimum accuracy constraints
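Minimum coverage and accuracy constraints can be sketched as a filter over candidate rules. The definitions used here are an assumption, since the slide does not spell them out: coverage is the number of instances a rule predicts correctly, and accuracy is that number divided by the number of instances the rule applies to. The instances and helper names are hypothetical:

```python
def coverage_and_accuracy(antecedent, consequent, instances):
    """Return (coverage, accuracy) of a rule under the assumed definitions."""
    matched = [row for row in instances
               if all(row[a] == v for a, v in antecedent.items())]
    correct = [row for row in matched
               if all(row[a] == v for a, v in consequent.items())]
    accuracy = len(correct) / len(matched) if matched else 0.0
    return len(correct), accuracy

def passes(antecedent, consequent, instances, min_coverage, min_accuracy):
    """Keep a rule only if it meets both minimum constraints."""
    coverage, accuracy = coverage_and_accuracy(antecedent, consequent, instances)
    return coverage >= min_coverage and accuracy >= min_accuracy

# Hypothetical instances for the rule "if temperature = cool then humidity = normal".
INSTANCES = [
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "normal"},
    {"temperature": "cool", "humidity": "high"},
    {"temperature": "hot", "humidity": "high"},
]
```

On these four instances the rule applies to three and is right on two, so it passes a threshold of coverage 2 and accuracy 0.6 but fails if full accuracy is demanded.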
Clustering, Numeric Prediction • Clustering • Finding groups of items that are similar • Unsupervised: the class is not known • Numeric prediction • Like classification learning but with numeric class • Supervised: the scheme is provided with the target value • Success is measured on test data
What’s in an Example? • Most common form in data mining • Instance: specific type of example • Things to be classified, associated or clustered • Individual, independent example of target concept • Characterized by a predetermined set of attributes • Input to learning scheme: set of instances / dataset • Represented as a single relation / flat file • Rather restricted form of input • No relationships between objects
The Family Tree In Table Sister of? Closed World Assumption