1 / 54

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会. 今日安排. 微软 SQL 数据挖掘技术概述 左洪 微软公司 数据仓库在电信的应用 贝志城 明天高科 数据挖掘在 CRM 中的应用 王立军 中圣公司 灵通 IT Service 维护管理服务系统 邹雄文 广州灵通. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司. Agenda. What is Data Mining The Data Mining Market

neola
Télécharger la présentation

欢迎光临 微软 SQL 数据挖掘 / 数据仓库 技术研讨会

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. 欢迎光临微软SQL数据挖掘/数据仓库技术研讨会欢迎光临微软SQL数据挖掘/数据仓库技术研讨会

  2. 今日安排 • 微软SQL数据挖掘技术概述 • 左洪 微软公司 • 数据仓库在电信的应用 • 贝志城 明天高科 • 数据挖掘在CRM中的应用 • 王立军 中圣公司 • 灵通IT Service维护管理服务系统 • 邹雄文 广州灵通

  3. Introduction to Data Mining with SQL Server 2000 左洪 高级产品市场经理 微软(中国)有限公司

  4. Agenda • What is Data Mining • The Data Mining Market • OLE DB for Data Mining • Overview of the Data Mining Features in SQL Server 2000 • Q&A

  5. What Is Data Mining?

  6. What is DM? • A process of data exploration and analysis using automatic or semi-automatic means • “Exploring data” – scanning samples of known facts about “cases”. • “knowledge”: Clusters, Rules, Decision trees, Equations, Association rules… • Once the “knowledge” is extracted it: • Can be browsed • Provides a very useful insight on the cases behavior • Can be used to predict values of other cases • Can serve as a key element in closed loop analysis

  7. What drive high school students to attend college?

  8. IQ=High IQ=Low Attend College: 79% Yes 11% No Attend College: 45% Yes 55% No Wealth = False Parents Encourage = Yes Parents Encourage = No Wealth = True Attend College: 70% Yes 30% No Attend College: 31% Yes 69% No Attend College: 94% Yes 6% No Attend College: 69% Yes 21% No The deciding factors for high school students to attend college are… All Students Attend College: 55% Yes 45% No IQ ? Wealth Parents Encourage?

  9. Business Oriented DM Problems • Targeted ads • “What banner should I display to this visitor?” • Cross sells • “What other products is this customer likely to buy? • Fraud detection • “Is this insurance claim a fraud?” • Churn analysis • “Who are those customers likely to churn?” • Risk Management • “Should I approve the loan to this customer?” • …

  10. Http://www.tunes.com

  11. Data To Predict Training Data Mining Model Mining Model Mining Model Predicted Data Mining Process - Illustrated DM Engine DM Engine

  12. The Data Mining Market

  13. The $$$: Y2000 Market Size • DM Tools Market: $250M • 40% - license fees • 60% consulting * Gartner

  14. The Players • Leading vendors • SAS • SPSS • IBM • Hundreds of smaller vendors offering DM algorithms… • Oracle –Thinking Machines acquisition

  15. The Products • End-to-end Data Mining tools • Extraction, Cleansing, Loading, Modeling, Algorithms (dozens), Analysts workbench, Reporting, Charting…. • The customer is the power-analyst • PhD in statistics is usually required… • Closed tools – no standard API • Total vendor lock-in • Limited integration with applications • DM an “outsider” in the Data Warehouse • Extensive consulting required • Sky rocketing prices • $60K+ for a single user license

  16. What the analysts say… • “Stand-alone Data Mining Is Dead” - Forrester • “The demise of [stand alone] data mining” – Gartner

  17. The Microsoft Approach

  18. DataPro Users Survey 1999-2001 “Data mining will be the fastest-growing BI technology…”

  19. The $$$: 2000 Market Size • DM Applications Market Size: $1.5B * IDC

  20. SQL Server 2000 - The Analysis Platform • SQL 2000 provides a complete Analysis Platform • Not an isolated, stand alone DM product • Platform means: • The infrastructure for applications • Not an application by itself • Integrated vision for all technologies, tools • Standard based API’s (OLE DB for DM) • Extensible • Scaleable

  21. Reports & Analysis Data Flow DM Apps DM OLTP DW OLAP

  22. DM Analysis Services 2000 - Architecture Manager UI DM Wizards DM DTS Task DSO DMM OLE DB OLAP Client Analysis Server OLAP Engine (local) DM Engine (local) OLAP Engine DM Engine Ext. Ext.

  23. OLE DB for Data Mining…

  24. Why OLE DB for DM? Make DM a mass market technology by: • Leverage existing technologies and knowledge • SQL and OLE DB • Common industry wide concepts and data presentation • Changing DM market perception from “proprietary” to “open” • Increasing the number of players: • Reduce the cost and risk of becoming a consumer – one tool works with multiple providers • Reduce the cost and risk of becoming a provider – focus on expertise and find many partners to complement offering • Dramatically increase the number of DM developers

  25. Integration With RDBMS • Customers would like to • Build DM models from within their RDBMS • Train the models directly off their relational tables • Perform predictions as relational queries (tables in, tables out) • Feel that DM is a native part of their database. • Therefore… • Data mining models are relational objects • All operations on the models are relational • The language used is SQL (w/Extensions) • The effect: every DBA and VB developer can become a DM developer

  26. Creating a Data Mining Model (DMM)

  27. Identifying the “Cases” • DM algorithms analyze “cases” • The “case” is the entity being categorized and classified • Examples • Customer credit risk analysis: Case = Customer • Product profitability analysis: Case = Product • Promotion success analysis: Case = Promotion • Each case encapsulate all we know about the entity

  28. A Simple Set of Cases

  29. More Complicated Cases

  30. A DMM is a Table! • A DMM structure is defined as a table • Training a DMM means inserting data into the table • Predicting from a DMM means querying the table • All information describing the case are contained in columns

  31. Creating a Mining Model CREATE MINING MODEL [Plans Prediction] ( StudentID LONG KEY, Gender TEXT DISCRETE, ParentIncome LONG CONTINUOUS, IQ DOUBLE CONTINUOUS, Encouragement TEXT DISCRETE, CollegePlans TEXT DISCRETE PREDICT ) USING Microsoft_Decision_Trees

  32. Creating a mining model with nested table Create Mining Model MoviePrediction ( CutomerId long key, Age long continuous, Gender discrete, Education discrete, MovieList table predict ( MovieName text key ) ) using microsoft_decision_trees

  33. Training a DMM

  34. Training a DMM • Training a DMM means passing it data for which the attributes to be predicted are known • Multiple passes are handled internally by the provider! • Use an INSERT INTO statement • The DMM will not persist the inserted data • Instead it will analyze the given cases and build the DMM content (decision tree, segmentation model, association rules) INSERT [INTO] <mining model name> [(columns list)] <source data query>

  35. INSERT INTO INSERT INTO [Plans Prediction] ( StudentID, Gender, ParentIncome, IQ, Encouragement, CollegePlans ) SELECT [StudentID], [Gender], [ParentIncome], [IQ], [Encouragement], [CollegePlans] FROM [CollegePlans]

  36. When Insert Into Is Done… • The DMM is trained • The model can be retrained • Content (rules, trees, formulas) can be explored • OLE DB Schema rowset • SELECT * FROM <dmm>.CONTENT • XML string (PMML) • Prediction queries can be executed

  37. Predictions

  38. What are Predictions? • Predictions apply the rules of a trained model to a new set of data in order to estimate missing attributes or values • Predictions = queries • The syntax is SQL - like • The output is a rowset • In order to predict you need: • Input data set • A trained DMM • Binding (mapping) information between the input data and the DMM • Specification of what to predict

  39. The Truth Table Concept

  40. Prediction It’s a JOIN!

  41. The Prediction Query Syntax SELECT <columns to return or predict> FROM <dmm> PREDICTION JOIN <input data set> ON <dmm column> = <dmm input column>…

  42. Example SELECT [New Students].[StudentID], [Plans Prediction].[CollegePlans], PredictProbability([CollegePlans]) FROM [Plans Prediction] PREDICTION JOIN [New Students] ON [Plans Prediction].[Gender] = [New Students].[Gender] AND [Plans Prediction].[IQ] = [New Students].[IQ] AND ...

  43. OLE DB DM Sample Provider with Source • All required OLE DB objects, such as session, command, and rowset • The OLE DB for Data Mining syntax parser • Tokenization of input data • Query processing engine • A sample Naïve Bayes algorithm • Model persistence in XML and binary formats • Available at www.microsoft.com/data/oledb/DMResKit.htm

  44. Integrated OLAP and DM Analysis

  45. Why Use DM with OLAP • Relational DM is designed for: • Reports of patterns • Batch predictions fed into an OLTP system • Real-time singleton prediction in an operational environment • OLAP is designed for • interactive analysis by a knowledge worker • Consistent and convenient navigational model • Pre-aggregations of OLAP allow faster performance

  46. Understanding DM Content – Decision Trees Customers having high debt and college education: Filter([Individual Customers].Members, Customers.CurrentMember.Properties(“Debt”) = “High” And Customers.CurrentMember.Properties(“Education”) = “College”) All Customers Credit Risk: 65% Good 35% Bad Debt ? Customers having low debt and are self employed: Filter([Individual Customers].Members, Customers.CurrentMember.Properties(“Debt”) = Low And Customers.CurrentMember.Properties(“Employment Type”) = “Self Employed”) Debt=Low Debt=High Credit Risk: 89% Good 11% Bad Credit Risk: 45% Good 55% Bad Employ- -ment Type? Education? ET = Self Employed Education= College Education= High School ET = Salaried Credit Risk: 70% Good 30% Bad Credit Risk: 31% Good 69% Bad Credit Risk: 94% Good 6% Bad Credit Risk: 79% Good 21% Bad

  47. …Equivalent DM Dimension All Customers Customers with low debt Customers with low debt and self employed Customers with low debt and salaried Customers with high debt Customers with high debt and college education Customers with high debt and high school education

  48. Tree = Dimension • Every node on the tree is a dimension member • The node statistics are the member properties • All members are calculated • Formula aggregates the case dimension members that apply to this node • The MDX is generated by the DM algorithm • Analysis Service will automatically generate the calculated dimension based on the DM content and also a virtual cube • Applies to • Classification (decision trees) • Segmentation (clusters)

  49. Browsing the Virtual Cube • Pivot the DM dimension: Credit Risk: 70% Good, 30% Bad

More Related