
Advanced Analytics

Advanced Analytics: Data Mining using SQL Server. Tuesday, April 17, 2012, from 5:30 PM to 7:30 PM (CT). Thomas Arehart, Microsoft Technology Center.





Presentation Transcript


  1. Advanced Analytics: Data Mining using SQL Server. Tuesday, April 17, 2012, from 5:30 PM to 7:30 PM (CT). Thomas Arehart, Microsoft Technology Center.

  2. Growing Business Use: Whether delivered as dashboards, scorecards, or standalone tools, business intelligence (BI) and analytics tools are reaching a rapidly growing number of users. Once limited to a few number crunchers with degrees in advanced mathematics, these tools are being deployed to all professionals in many organizations, and to everyone in a substantial number of companies, according to analysts and recent surveys. Traditional BI tools were complex and expensive, but powerful BI and analytics capabilities are no longer out of reach for the masses. Today, BI capabilities are increasingly embedded in a wide range of software applications. Another reason for the broader use of these tools is that the market has evolved into a broad ecosystem. A wide swath of vendors in a variety of fields have, in effect, collaborated to simplify the technology front ends and to focus the tools on specific vertical markets such as retailing, telecom, and consumer packaged goods manufacturing. The business analytics (BA) market ranges from platform technologies such as data warehouse management to end user-facing analytic applications and BI tools. (1)

  3. Business Need: With BI capabilities now found in a wide range of software applications as well as in lighter-weight, standalone packages, new-generation BI is often invisible to its users. This lets them focus on making better decisions and serving customers more effectively rather than on keeping up with the latest technology acronyms. Knowledge workers need analytical tools to explore the gaps in a process when things break. Analytical software that analyzes a multitude of databases and transaction histories can provide guidance and predictions about future customer needs and behavior. This guidance empowers employees to anticipate customer needs, reduce costs, and improve overall efficiency. Companies want more automation and consistency around the decisions employees make on a daily basis. (1)

  4. Table 1. Steps in the Evolution of Data Mining

  5. Analytic Algorithm Categories

  6. Analytic Algorithm Categories

  7. Analytic Algorithm Categories

  8. Analytic Algorithm Categories

  9. Analytic Algorithm Categories

  10. Analytic Algorithm Categories

  11. Analytic Algorithm Categories

  12. Conventional BI Reporting Architecture

  13. Microsoft SQL Server 2008 Data Mining Add-Ins for Office 2007: Analyze Key Influencers (Table Analysis Tools for Excel). The Analyze Key Influencers tool enables you to select a column that contains a desired outcome or target value, and then analyze the patterns in your data to determine which factors had the strongest influence on the outcome. For example, if you have a customer list that includes a column that shows the total purchases for each customer over the past year, you could analyze the table to determine the customer demographics for your top purchasers.

  14. Microsoft Office 2007 Data Mining Tasks (4)

  15. Data Analysis Expressions (DAX) is the standard PowerPivot formula language that supports custom calculations in PowerPivot tables and Excel PivotTables. While many of the functions used in Excel are included, DAX also offers additional functions for carrying out dynamic aggregation and other operations with your data. (8) (7)

  16. Reporting is a fundamental activity in most businesses, and SQL Server 2008 Reporting Services provides a comprehensive solution for creating, rendering, and deploying reports throughout the enterprise. SQL Server Reporting Services can render reports directly from a data mining model by using a Data Mining Extensions (DMX) query. This enables users to visualize the content of data mining models for optimized data representation. Furthermore, the ability to query directly against the data mining structure enables users to easily include attributes beyond the scope of the mining model requirements, presenting complete and meaningful information. (4)
      Figure: The DMX query editor for SQL Server Reporting Services

  17. You can also call VBA functions, or create your own functions. For more information, see Functions (DMX).

  18. Prediction Queries (Data Mining) (9)
      SELECT
        PredictTimeSeries([Forecasting].[Amount]) AS [PredictedAmount],
        PredictTimeSeries([Forecasting].[Quantity]) AS [PredictedQty]
      FROM [Forecasting]

  19. SQL Server 2008 data mining supports a number of application programming interfaces (APIs) that developers can use to build custom solutions that take advantage of the predictive analysis capabilities in SQL Server. DMX, XMLA, OLEDB and ADOMD.NET, and Analysis Management Objects (AMO) offer a rich, fully documented development platform. With it, developers can build data mining-aware applications that provide real-time discovery and recommendation through familiar tools. This extensibility creates an opportunity for business organizations and independent software vendors (ISVs) to embed predictive analysis into line-of-business applications, introducing insight and forecasting that inform business decisions and processes. For example, the Analytics Foundation adds predictive scoring to Microsoft Dynamics® CRM, so that information workers across sales, marketing, and service organizations can identify attainable opportunities that are more likely to lead to a sale, increasing efficiency and improving productivity (for more information, see the Microsoft Dynamics site).

  20. Integration Services Data Mining Tasks and Transformations
      • SQL Server Integration Services provides many components that support data mining.
      • Some tools in Integration Services are designed to help automate common data mining tasks, including prediction, model building, and processing. For example:
          • Create an Integration Services package that automatically updates the model every time the dataset is updated with new customers.
          • Perform custom segmentation or custom sampling of case records.
          • Automatically generate models based on parameters.
      • However, you can also use data mining in a package workflow, as an input to other processes. For example:
          • Use probability values generated by the model to weight scores for text mining or other classification tasks.
          • Automatically generate predictions based on prior data and use those values to assess the validity of new data.
          • Use logistic regression to segment incoming customers by risk.

  21. Microsoft SQL Server 2008 Integration Services provides a powerful, extensible ETL platform that Business Intelligence solution developers can use to implement ETL operations. SQL Server Integration Services includes a Data Mining Model Training destination for training data mining models, and a Data Mining Query transformation that can be used to perform predictive analysis on data as it is passed through the data flow. Integrating predictive analysis with SQL Server Integration Services enables organizations to flag unusual data, classify business entities, perform text mining, and fill in missing values on the fly, based on the power and insight of the data mining algorithms. (4)
      Figure: Data mining in SQL Server Integration Services

  22. After you have created a mining structure and mining model by using the Data Mining Wizard, you can use the Data Mining Designer from either SQL Server Data Tools (SSDT) or SQL Server Management Studio to work with existing models and structures. The designer includes tools for these tasks:
      • Modify the properties of mining structures, add columns and create column aliases, change the binning method or expected distribution of values.
      • Add new models to an existing structure; copy models, change model properties or metadata, or define filters on a mining model.
      • Browse the patterns and rules within the model; explore associations or decision trees. Get detailed statistics about
      • Custom viewers are provided for each different type of model, to help you analyze your data and explore the patterns revealed by data mining.
      • Validate models by creating lift charts, or analyzing the profit curve for models. Compare models using classification matrices, or validate a data set and its models by using cross-validation.
      • Create predictions and content queries against existing mining models. Build one-off queries, or set up queries to generate predictions for entire tables of external data.

  23. SQL Server 2008 Analysis Services provides a highly scalable platform for multidimensional OLAP analysis. Many customers are already reaping the benefits of creating a unified dimensional model (UDM) in Analysis Services and using it to slice and dice business measures by multiple dimensions. Predictive analysis, as part of SQL Server 2008 Analysis Services, provides a richer OLAP experience, featuring data mining dimensions that slice your data by the hidden patterns within. (4)
      Figure: A data mining dimension in an OLAP cube

  24. Data Mining Algorithms (Analysis Services - Data Mining): Choosing an Algorithm by Task. To help you select an algorithm for use with a specific task, the following table provides suggestions for the types of tasks for which each algorithm is traditionally used.

  25. Many businesses use KPIs to evaluate critical business metrics against targets. SQL Server 2008 Analysis Services provides a centralized platform for KPIs across the organization, and integration with Microsoft Office PerformancePoint® Server 2007 enables decision makers to build business dashboards from which they can monitor the company’s performance. KPIs are traditionally retrospective, for example showing last month’s sales total compared to the sales target. However, with the insights made possible through data mining, organizations can build predictive KPIs that forecast future performance against targets, giving the business an opportunity to detect and resolve potential problems proactively. Predictive analysis can detect attributes that influence KPIs. Using Office PerformancePoint Server 2007, users can monitor trends in key influencers to recognize those attributes that have a sustained effect. Such insights enable businesses to inform and improve their response strategy. (4)
      Figure: Microsoft Office PerformancePoint Server 2007

  26. The SQL Server data mining toolset is fully extensible through Microsoft .NET stored procedures, plug-in algorithms, custom visualizations, and PMML. This enables developers to extend the out-of-the-box data mining technologies of SQL Server 2008 to meet uncommon business needs that are specific to the organization by:
      • Creating custom data mining algorithms to solve business-specific analytical problems.
      • Using data mining algorithms from other software vendors.
      • Creating custom visualizations of data mining models through plug-in viewer APIs.
      Although the data mining functionality provided with SQL Server 2008 is comprehensive enough to meet the needs of a wide range of business scenarios, its extensibility means that it can be adapted to predictive problems the built-in algorithms do not cover. The ability to extend the data mining technologies of SQL Server through custom algorithms and visualizations, together with the ability to embed predictive functionality into line-of-business applications, makes SQL Server 2008 a powerful platform for introducing predictive analysis into existing business processes and adding insight and recommendations to everyday operations. (4)

  27. Plugin Algorithms: In addition to the algorithms that Microsoft SQL Server Analysis Services provides, there are many other algorithms that you can use for data mining. Accordingly, Analysis Services provides a mechanism for "plugging in" algorithms that are created by third parties. As long as the algorithms follow certain standards, you can use them within Analysis Services just as you use the Microsoft algorithms. Plugin algorithms have all the capabilities of algorithms that SQL Server Analysis Services provides. For a full description of the interfaces that Analysis Services uses to communicate with plugin algorithms, see the samples for creating a custom algorithm and custom model viewer that are published on the CodePlex Web site.

  28. One Way ANOVA (Analysis of Variance): When to Use One-Way, Single Factor ANOVA. In a manufacturing or service environment, you might wonder if changing a formula, process, or material might deliver a better product at a lower cost. Saving a penny a pound on five million pounds a month can really add up. Saving ten minutes of wait time in a hospital might add $100,000 to the bottom line and deliver better patient outcomes. Comparing two or more drug formulations might pinpoint the best drug for a desired result. How can you compare the old formula with a new one and be confident that you have an opportunity to improve? Use one-way ANOVA (also known as single factor ANOVA) to determine if there's a statistically significant difference between two or more alternatives.

  29. One Way ANOVA (Analysis of Variance): Imagine that you manufacture paper bags and that you want to improve the tensile strength of the bag. You suspect that changing the concentration of hardwood in the bag will change the tensile strength, which you measure in pounds per square inch (PSI). You decide to test this at 5%, 10%, 15%, and 20% hardwood concentration levels. These "levels" are also called "treatments." Since we are only evaluating a single factor (hardwood concentration), this is called one-way ANOVA.
      The null hypothesis is that the means are equal:
      H0: Mean1 = Mean2 = Mean3 = Mean4
      The alternate hypothesis is that at least one of the means is different:
      Ha: At least one of the means is different
      To conduct the one-way ANOVA test, you need to randomize the trials (assumption #1). Imagine that we've conducted these trials at each of the four levels of hardwood concentration.

  30. One Way ANOVA (Analysis of Variance): You'll find the results of these trials in the ANOVA test data provided with the QI Macros at c:\qimacros\testdata\anova.xls. The QI Macros will prompt you for the significance level you desire. While the default is 0.05 (95% confidence), in this example we want to be even more certain, so we use 0.01 (99% confidence).

  31. One Way ANOVA (Analysis of Variance): Interpreting the Anova One Way test results. The QI Macros automatically compares the p-value to alpha (the significance level), but you might want to know how to do this manually. The "null" hypothesis assumes that there is no difference between the hardwood concentrations. The p-value, reported as 0.000, is less than the significance level (0.01), so we can reject the null hypothesis and conclude that hardwood concentration affects tensile strength. F (19.60521) is greater than F crit (4.938193), so again, we can reject the null hypothesis.
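The manual comparison of F against F crit can be sketched in a few lines of Python. This is a minimal illustration, not the QI Macros implementation: the function name `one_way_anova` is hypothetical, the tensile-strength numbers are made up for the example (they are not the data in anova.xls), and the critical value 7.59 is read from a standard F table for (3, 8) degrees of freedom at alpha = 0.01.

```python
from statistics import mean


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of samples, one per level."""
    k = len(groups)                          # number of treatment levels
    n_total = sum(len(g) for g in groups)    # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n_total

    # Between-group sum of squares: spread of level means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their level mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical tensile strengths (PSI) at each hardwood concentration.
samples = {
    "5%": [7, 9, 8],
    "10%": [12, 14, 13],
    "15%": [15, 17, 16],
    "20%": [19, 21, 20],
}

# Assumed critical value from an F table: alpha = 0.01, df = (3, 8).
F_CRIT = 7.59

f_stat, df_between, df_within = one_way_anova(list(samples.values()))
reject_null = f_stat > F_CRIT
print(f"F = {f_stat:.2f}, df = ({df_between}, {df_within}), reject H0: {reject_null}")
```

With these made-up numbers F is far above the critical value, so the null hypothesis of equal means would be rejected. Note that F crit depends on both the significance level and the degrees of freedom, so it has to be looked up again whenever the number of levels or trials changes.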

  32. One Way ANOVA (Analysis of Variance): Now we can look at the average tensile strength and variances. The average tensile strength increases, but we cannot say for certain which means differ. The variance at the 15% level looks substantially lower than the other levels. We might need to do additional analysis. If we reran the one-way Anova test with just 10% and 15%, we'd discover there is no statistically significant difference between the two means. The p-value (0.349) is greater than the significance level (0.01), so we cannot reject the null hypothesis that the means are equivalent. And F (0.963855) is less than F crit (10.04429), so we cannot reject the null hypothesis. Based on this analysis, if we were aiming for a tensile strength of 15 PSI or greater, the 10% level might be more cost effective.

  33. Two Way ANOVA (Analysis of Variance) - Without Replication: What's cool about QI Macros Two-Way ANOVA? Unlike other statistical software, the QI Macros is the only SPC software that compares the p-values to the significance level and tells you when to "Accept or Reject the Null Hypothesis" and what that tells you: "Means are Same or Different". Two Way Anova (Analysis of Variance), also known as two factor Anova, can help you determine if two factors have the same "mean" or average. This is a form of "hypothesis testing."

  34. Two Way ANOVA (Analysis of Variance) - Without Replication: The null hypothesis is that the means are equal:
      • H0: Factor 1's Means = Factor 2's Means
      The alternate hypothesis is:
      • Ha: The means are different.
      The goal is to decide whether to reject the null hypothesis (i.e., that the means are equal) at a certain confidence level (95% or 99%).

  35. Two Way ANOVA (Analysis of Variance) - Without Replication: Using Excel and the QI Macros, run a two-way analysis without replication (alpha=0.05 for 95% confidence). Click on the QI Macros menu and select: ANOVA Two Factor without replication.

  36. Two Way ANOVA (Analysis of Variance) - Without Replication: Interpreting the Anova Two Way Without Replication Results. In case you want to know how to do this manually, use these instructions. Here, the p-value for Rows (i.e., golfers) is less than alpha (0.05), so we can reject the hypothesis that all of the golfers are the same. The p-value for Columns (i.e., golf balls) is also less than alpha, so we can reject the hypothesis that all of the golf balls are the same.
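As a rough sketch of what the two-factor analysis without replication computes, the Python below derives one F statistic for rows (golfers) and one for columns (golf ball brands). The function name, the 3 x 3 table of driving distances, and the brand labels are all hypothetical; the critical value 6.94 is taken from a standard F table for (2, 4) degrees of freedom at alpha = 0.05.

```python
def two_way_anova_no_replication(table):
    """table[i][j] holds the single observation for row level i, column level j."""
    r, c = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (r * c)
    row_means = [sum(row) / c for row in table]
    col_means = [sum(table[i][j] for i in range(r)) / r for j in range(c)]

    ss_rows = c * sum((m - grand) ** 2 for m in row_means)
    ss_cols = r * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    # With one observation per cell, whatever is left over is the error term.
    ss_error = ss_total - ss_rows - ss_cols

    df_rows, df_cols = r - 1, c - 1
    df_error = df_rows * df_cols
    ms_error = ss_error / df_error
    return {
        "F_rows": (ss_rows / df_rows) / ms_error,
        "F_cols": (ss_cols / df_cols) / ms_error,
        "df": (df_rows, df_cols, df_error),
    }


# Hypothetical driving distances: rows are golfers, columns are brands A, B, C.
distances = [
    [200, 210, 211],
    [220, 229, 232],
    [240, 251, 250],
]

# Assumed critical value from an F table: alpha = 0.05, df = (2, 4).
F_CRIT = 6.94

result = two_way_anova_no_replication(distances)
golfers_differ = result["F_rows"] > F_CRIT
balls_differ = result["F_cols"] > F_CRIT
print(result, golfers_differ, balls_differ)
```

Comparing each F to F crit is equivalent to comparing the corresponding p-value to alpha. In this made-up table both F statistics exceed the critical value, which mirrors the slide's conclusion that both the golfers and the golf balls differ.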

  37. Two Way ANOVA (Analysis of Variance) - Without Replication: It does look like Brands B and C are similar. We could run a paired two-sample t test on Brands B and C to determine if they deliver the same average distance. Since the p-values are greater than alpha (0.05), we cannot reject the null hypothesis that there is no difference between the two brands of golf balls, except perhaps price.
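A paired two-sample t test of this kind works on the per-golfer differences between the two brands. The sketch below is illustrative only: `paired_t` is a hypothetical helper, the distances are made up, and the two-tailed critical value 3.182 comes from a standard t table for 3 degrees of freedom at alpha = 0.05.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """Return (t, df) for paired samples a and b of equal length."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    # t = mean difference divided by its standard error.
    t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t_stat, n - 1


# Hypothetical distances for the same four golfers hitting each brand.
brand_b = [210, 229, 251, 240]
brand_c = [211, 232, 250, 238]

# Assumed two-tailed critical value from a t table: alpha = 0.05, df = 3.
T_CRIT = 3.182

t_stat, df = paired_t(brand_b, brand_c)
reject_null = abs(t_stat) > T_CRIT
print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_null}")
```

Here the absolute value of t is well below the critical value, so the null hypothesis of no difference between the brands is not rejected. Failing to reject is not proof that the brands are identical; it only means this sample shows no detectable difference.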

  38. Two Way ANOVA (Analysis of Variance) With Replication: When to Use Two Way Anova. Two Way Anova (Analysis of Variance), also known as two factor Anova, can help you determine if two or more samples have the same "mean" or average. This is a form of "hypothesis testing." The null hypothesis is that the means are equal. The alternate hypothesis is that the means are not all equal.
      • H0: Mean1 = Mean2 = Mean3
      • Ha: At least one of the means is different
      The goal is to decide whether to reject the null hypothesis (i.e., that the samples have the same mean) at a certain confidence level (95% or 99%).

  39. Two Way ANOVA (Analysis of Variance) With Replication What if you have two populations of patients (male/female) and three different kinds of medications, and you want to evaluate their effectiveness? You might run a study with three "replications", three men and three women.

  40. Two Way ANOVA (Analysis of Variance) With Replication: What's cool about QI Macros ANOVA? Unlike other statistical software, the QI Macros is the only SPC software that compares the p-values (0.179) to the significance level (0.05) and tells you to "Accept the Null Hypothesis because p>0.05" and that the "Means are the same". Using the QI Macros, run a two-way Anova analysis with replication (alpha=0.05 for 95% confidence).
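For readers who want to see what a two-way analysis with replication involves, here is a minimal Python sketch for a balanced design (the same number of replicates in every cell). The function name, the response values, and the layout (two sexes by three medications, three patients per cell) are hypothetical and chosen only to echo the scenario above; the critical values 4.75 and 3.89 are taken from a standard F table at alpha = 0.05 for (1, 12) and (2, 12) degrees of freedom.

```python
def two_way_anova_with_replication(cells):
    """cells[i][j] is the list of replicates for row level i and column level j.

    Assumes a balanced design: every cell has the same number of replicates.
    """
    a, b, n = len(cells), len(cells[0]), len(cells[0][0])
    values = [x for row in cells for cell in row for x in cell]
    grand = sum(values) / len(values)

    cell_means = [[sum(cell) / n for cell in row] for row in cells]
    row_means = [sum(m) / b for m in cell_means]
    col_means = [sum(cell_means[i][j] for i in range(a)) / a for j in range(b)]

    ss_rows = b * n * sum((m - grand) ** 2 for m in row_means)
    ss_cols = a * n * sum((m - grand) ** 2 for m in col_means)
    # Interaction: what the cell means show beyond the row and column effects.
    ss_inter = n * sum(
        (cell_means[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(a) for j in range(b)
    )
    # Within: spread of the replicates around their own cell mean.
    ss_within = sum(
        (x - cell_means[i][j]) ** 2
        for i in range(a) for j in range(b) for x in cells[i][j]
    )

    df_rows, df_cols = a - 1, b - 1
    df_inter = df_rows * df_cols
    df_within = a * b * (n - 1)
    ms_within = ss_within / df_within
    return {
        "F_rows": (ss_rows / df_rows) / ms_within,
        "F_cols": (ss_cols / df_cols) / ms_within,
        "F_interaction": (ss_inter / df_inter) / ms_within,
        "df": (df_rows, df_cols, df_inter, df_within),
    }


# Hypothetical responses: rows are male/female, columns are medications 1-3.
responses = [
    [[4, 5, 6], [7, 8, 9], [10, 11, 12]],   # male
    [[5, 6, 7], [7, 8, 9], [9, 10, 11]],    # female
]

# Assumed critical values from an F table at alpha = 0.05.
F_CRIT_ROWS = 4.75         # df = (1, 12)
F_CRIT_COLS = 3.89         # df = (2, 12), also used for the interaction

result = two_way_anova_with_replication(responses)
sex_differs = result["F_rows"] > F_CRIT_ROWS
medication_differs = result["F_cols"] > F_CRIT_COLS
interaction_present = result["F_interaction"] > F_CRIT_COLS
print(result, sex_differs, medication_differs, interaction_present)
```

Replication is what makes the interaction term testable: with several observations per cell, the within-cell variation gives an error estimate that is separate from the interaction. In this made-up data only the medication effect exceeds its critical value.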

  41. Two Way ANOVA (Analysis of Variance) With Replication Interpreting the Anova Two Way Results In case you want to know how to do this manually, use these instructions:

  42. References
      1) Pervasive insights produce better business decisions: Opening access to business intelligence by embedding analytics capabilities into everyday software tools pays substantial dividends. By Lauren Gibbons Paul.
      2) Data Mining Algorithms (Analysis Services - Data Mining). http://msdn.microsoft.com/en-us/library/ms175595.aspx
      3) Data Mining Query Task. http://msdn.microsoft.com/en-us/library/ms141728.aspx
      4) Predictive Analysis with SQL Server 2008. White Paper, Microsoft. Published: November 2007.
      5) Predictive Analytics for the Retail Industry. White Paper, Microsoft. Writer: Matt Adams. Technical Reviewer: Roni Karassik. Published: May 2008.
      6) Breakthrough Insights using Microsoft SQL Server 2012 - Analysis Services. https://www.microsoftvirtualacademy.com/tracks/breakthrough-insights-using-microsoft-sql-server-2012-analysis-services
      7) Useful DAX Starter Functions and Expressions. http://thomasivarssonmalmo.wordpress.com/category/powerpivot-and-dax/
      8) Stairway to PowerPivot and DAX - Level 1: Getting Started with PowerPivot and DAX. By Bill_Pearson, 2011/12/21.
      9) Data Mining Tool. http://technet.microsoft.com/en-us/library/ms174467.aspx
      10) DAX Cheat Sheet. http://powerpivot-info.com/post/439-dax-cheat-sheet
