Big Data and Data Mining. Professor Tom Fomby Director Richard B. Johnson Center for Economic Studies Department of Economics SMU May 23, 2013. Big Data: Many Observations on Many Variables . Data File. Types of Problems. Customer and Student Retention Employee Churn
Big Data and Data Mining Professor Tom Fomby Director Richard B. Johnson Center for Economic Studies Department of Economics SMU May 23, 2013
Types of Problems • Customer and Student Retention • Employee Churn • Credit Scoring (Auto or Home Loans) • Bond Ratings • What Characteristics Make for a Successful Mary Kay Representative? • Detection of Fraudulent Insurance Claims • Is a Newly Introduced Product Meeting with Consumer Acceptance or Rejection? • Who is a likely Donor to your Charity? • Early Detection of a Stolen or Compromised Credit Card
Types of Problems • What kinds of genetic markers imply certain susceptibilities to specific diseases? • Netflix and recommendations of Related and Suggested Movies • Recommendations for Book Purchases: Amazon Side-Bars • Click Stream Analysis of Optimal Web-Based Design
Example of Statistical Hypothesis Testing A Clinical Trial of 400 people – 200 randomly selected into a Control (Placebo) Group and the Other 200 into a Treatment Group Question: Does the Drug Treatment Significantly Reduce a Person’s Cholesterol Count? Method: Conventional Statistical Methods Like T-Test Of Significant Difference in Population Means
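The clinical-trial comparison above can be sketched in a few lines. This is only an illustration under stated assumptions: the cholesterol readings are simulated (the means 220 and 205 and the standard deviation 30 are made up), the Welch (unequal-variance) form of the t statistic is used, and the p-value relies on a normal approximation that is reasonable only because each group has 200 people.

```python
import math
import random
from statistics import NormalDist, mean, variance

def welch_t(sample_a, sample_b):
    """Welch t statistic for the difference in means (a minus b)."""
    se = math.sqrt(variance(sample_a) / len(sample_a)
                   + variance(sample_b) / len(sample_b))
    return (mean(sample_a) - mean(sample_b)) / se

# Simulated cholesterol counts; these numbers are hypothetical, not trial data.
rng = random.Random(0)
control = [rng.gauss(220, 30) for _ in range(200)]    # placebo group
treatment = [rng.gauss(205, 30) for _ in range(200)]  # drug group

t_stat = welch_t(treatment, control)
# One-sided p-value for "treatment mean is lower", using a normal
# approximation to the t distribution (assumes large samples).
p_value = NormalDist().cdf(t_stat)
print(f"t = {t_stat:.2f}, one-sided p = {p_value:.4g}")
```

A strongly negative t statistic with a small p-value would support the claim that the drug lowers cholesterol.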
Example of a Prediction Problem Early Detection of a Stolen or Compromised Credit Card Not So Interested in How or Why the Credit Card was Stolen but Instead Whether Recent Transactions are Indicative of a Stolen or Compromised Credit Card Tool – Box Plot
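One way to turn the box plot into a screening rule is to use its whiskers (Tukey fences) as cutoffs. The sketch below assumes the conventional 1.5 x IQR fence and uses made-up transaction amounts; the function names are hypothetical, and a real card-monitoring system would use far more than the purchase amount.

```python
from statistics import quantiles

def tukey_fences(values, k=1.5):
    """Return the (lower, upper) box plot fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4)  # quartiles of the purchase history
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def flag_unusual(history, new_amounts, k=1.5):
    """Return the new amounts that fall outside the fences of the history."""
    low, high = tukey_fences(history, k)
    return [x for x in new_amounts if x < low or x > high]

# Hypothetical past purchase amounts for one cardholder, in dollars.
history = [12, 18, 20, 22, 25, 27, 30, 31, 35, 40, 45]
print(flag_unusual(history, [23, 480, 38, 950]))  # → [480, 950]
```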
Data Rich, Information Poor • The Amount of Raw Data Stored in Corporate Databases is Exploding • Most of this information is recorded instantaneously and at minimal cost • Databases are measured in gigabytes and terabytes (One terabyte = one trillion bytes. A terabyte is equivalent to about 2 million books!) • Walmart uploads 20 million point-of-sale transactions to 500 parallel processing storage devices each day. • Raw data by itself, however, does not provide much information. That is where Data Mining Comes in!
What is Data Mining? • “Extracting useful information from large datasets” (Hand et al., 2001) • “Data mining is the process of exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns and rules.” (Berry and Linoff, 1997, 2000) • “Data mining is the process of discovering meaningful new correlations, patterns and trends by sifting through large amounts of data stored in repositories, using pattern recognition technologies as well as statistical and mathematical techniques” (Gartner Group, 2004)
Four Distinct Characteristics of Data Mining Projects • Partitioning given data into Training, Validation, and Test Parts • Cross Validation – using the Validation and Test Parts to gauge the worthiness of competing models • Using Ensemble Methods to increase predictive accuracy. (There is no such thing as a correct model!) • Continual Monitoring of a predictive analytics (PA) system to guard against structural change and to maintain predictive accuracy
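The partitioning step can be sketched as a random shuffle followed by slicing. The 60/20/20 proportions and the function name are assumptions made for illustration; the slides do not prescribe particular shares.

```python
import random

def partition(rows, train=0.6, valid=0.2, seed=0):
    """Randomly split rows into training, validation, and test parts.

    The test part receives whatever remains after the first two shares.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility
    n_train = round(len(shuffled) * train)
    n_valid = round(len(shuffled) * valid)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

training, validation, test = partition(range(100))
print(len(training), len(validation), len(test))  # → 60 20 20
```

Competing models are fit on the training part, compared on the validation part, and the chosen model is given a final check on the test part.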
More Detailed Discussion of Specific Data Mining Applications • Text Mining (Classification of Documents and Evolution of Opinions on Blogs) • Target Marketing • Credit Scoring • Bond Ratings: Calculating Default Probabilities on Bonds (Bond rating services like Moody’s, Standard & Poor’s, Fitch, etc.) • Fraud Detection • Customer Retention • Franchise Locations and Performance • Customer Segmentation • Affinity Analysis (i.e. “Market Basket” Analysis) • Link Analysis (Webpage design) • Many Other Fields including Clinical Science, Statistical Genetics, Political Science, Real Estate Assessment, and College Admissions Practices
Who Wrote the Federalist Papers? Frederick Mosteller and David Wallace, “Inference in an Authorship Problem,” JASA, June 1963
Comparing Two Documents (figure): Doc 1 and Doc 2 are compared on Word Size, Word Freq, Sentence Size, Sentence Freq, Paragraph Size, and Paragraph Freq.
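A word-frequency comparison of two documents might look like the sketch below. It is only a rough illustration: the marker words, the sample sentences, and the simple absolute-difference distance are all assumptions, and this is not the statistical procedure Mosteller and Wallace used.

```python
import re
from collections import Counter

def rates_per_thousand(text, marker_words):
    """Occurrences of each marker word per 1,000 words (text must be non-empty)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {w: 1000 * counts[w] / len(tokens) for w in marker_words}

def profile_distance(text_a, text_b, marker_words):
    """Sum of absolute differences between the two documents' word rates."""
    rates_a = rates_per_thousand(text_a, marker_words)
    rates_b = rates_per_thousand(text_b, marker_words)
    return sum(abs(rates_a[w] - rates_b[w]) for w in marker_words)

# Hypothetical snippets standing in for two documents.
doc1 = "Upon the hill, upon the road"
doc2 = "Whilst the rain fell on the road"
markers = ["upon", "whilst"]  # illustrative marker words
print(rates_per_thousand(doc1, markers))
print(profile_distance(doc1, doc2, markers))
```

A small distance suggests similar word habits; in practice many marker words and much longer texts would be needed.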
Target Marketing • Target Marketing is the process of choosing specific customers to advertise to and/or to offer discounts to in order to increase the sales of the company • Target Marketing usually proceeds in two stages: (1) Determining the probability that the solicited customer will purchase products from the company once solicited and (2) Once the solicited customer decides to purchase items from the company, estimating the profit that will likely be generated by the customer’s purchases. • Thus the goal is to advertise only to those potential customers that represent expected profits that exceed the cost of advertising to the customer • We then need to use data mining techniques to determine (1) the probability of purchase and (2) conditional on purchase, the expected profit of purchase. • Expected Profit of Purchase = (Probability of Purchase) x (Expected profits from purchase, conditional on purchase)
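The two-stage rule above reduces to one multiplication and one comparison. In this sketch the customer names, probabilities, profits, and the $1.50 solicitation cost are all hypothetical; in practice the two inputs would come from fitted models.

```python
def expected_profit(p_purchase, profit_if_purchase):
    """Expected profit = P(purchase) x expected profit conditional on purchase."""
    return p_purchase * profit_if_purchase

def should_solicit(p_purchase, profit_if_purchase, cost_of_solicitation):
    """Advertise only when expected profit exceeds the cost of advertising."""
    return expected_profit(p_purchase, profit_if_purchase) > cost_of_solicitation

# Hypothetical customers: (name, probability of purchase, profit if they buy).
customers = [("A", 0.25, 40.0), ("B", 0.5, 2.0)]
cost = 1.5  # assumed cost of one solicitation, in dollars

targets = [name for name, p, profit in customers
           if should_solicit(p, profit, cost)]
print(targets)  # → ['A']
```

Customer B is more likely to buy than customer A, yet only A is worth soliciting because B's expected profit (1.00) falls below the cost.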
Credit Scoring • Credit scoring involves using data mining tools to determine the creditworthiness of loan applicants • The task is determining the probability that a potential borrower will default on his or her obligations, given the personal characteristics of the borrower and the macroeconomic conditions of the economy at the time • Some Examples: Citibank and Credit Card Issuers reviewing applicants for credit cards; Banks considering loaning money for mortgages
Bond Ratings: Calculating Default Probabilities on Bonds • Given the financial characteristics of a bond issuer and the macroeconomic conditions at the time, what is the probability that the bond issuer will, at some time in the future, not be able to service the obligations of the bond? • Bond rating services like Moody’s, Standard and Poor’s, and Fitch build probability of default models and use them to give bonds their credit ratings (AAA, AA, …, BBB, etc.). The lower the probability of default, the higher the bond rating and vice versa. In turn, these ratings give rise to differential interest rates paid by the bond issuers. (See Town and Gown PPT for example.)
Fraud Detection • Of interest to the IRS, Credit Card Companies, and Auditors • Given a history of transactions, or a record of “typical” income tax reports, income statements, or balance sheets, which transactions or reports appear to be “outliers”? • Basic Tool: Statistical Outlier Analysis. Roughly speaking: “What is three or more standard deviations from the norm?”
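The three-standard-deviation rule of thumb can be sketched as a z-score check of new observations against a history of typical ones. The amounts below are made up, and the function name is hypothetical; estimating the norm from a clean history (rather than from data that already contains the outliers) is a design choice of this sketch.

```python
from statistics import mean, stdev

def flag_outliers(history, new_values, threshold=3.0):
    """Return new values lying `threshold` or more standard deviations
    away from the mean of the historical ("typical") values."""
    center = mean(history)
    spread = stdev(history)  # sample standard deviation of the history
    return [x for x in new_values if abs(x - center) / spread >= threshold]

# Hypothetical history of typical reported amounts.
history = [100, 102, 98, 101, 99, 100, 103, 97]  # mean 100, std dev 2
print(flag_outliers(history, [101, 107, 93.5, 105]))  # → [107, 93.5]
```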
Customer Retention • What factors determine the loyalty displayed by a customer? • When is a customer likely to “jump ship”? • Would loyalty programs be useful? • Basic Tool: Duration Modeling. This method determines what factors extend or limit the durations of customers with companies. • Purpose: To identify potential “fragile” customers and then “incentivize” them so that they will remain loyal • Result: Higher profits
Facets of a Data Mining Job • Development of Problem Statement and Consultation with Domain Experts • Data Acquisition • Data Preparation and Cleaning • Data Visualization and Summarization • Type of Task? Supervised Learning (Prediction, Classification), or Unsupervised Learning • Evaluation of Models (Data Partitioning and Cross Validation) • Scoring of New Data • Continual Review of Model Usefulness
Franchise Locations and Performance • What location factors affect the eventual profitability and success of franchises? • Even within a set of franchises, should the product mix be the same for all franchises or should franchises be treated differently? • Can franchisees be put into “Clusters” and treated differently so as to maximize the profits of the entire franchise operation?
Customer Segmentation • Suppose you are a giant publisher of magazines of various types. How do your subscribers differ across your portfolio of magazines? • When soliciting advertising for your magazines, how do you match your potential advertisers with your magazines so that the advertisers receive the maximum benefit for their advertising expenditures? • Is there a niche market (customer segment) that none of your magazines (or those of your competitors) is currently serving? Is this niche market substantial enough to warrant introducing a new magazine? • Also, retailers often like to be able to distinguish between customers with low versus high elasticities of demand for their products so that they will know to whom to offer discounts in order to increase their revenues and profits. • Basic Tool: Cluster Analysis
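The slide names only "Cluster Analysis"; k-means is one common clustering method, sketched here as an illustration rather than as the method used in the lecture. The subscriber data (age, annual spending) are made up, and starting from the first k points is a simplification that keeps the example deterministic.

```python
def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: sum((a - b) ** 2
                                 for a, b in zip(point, centroids[i])))

def kmeans(points, k, iterations=20):
    """Plain k-means (Lloyd's algorithm) with a naive deterministic start."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Move each centroid to the mean of its members; keep it if empty.
        centroids = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [nearest(p, centroids) for p in points]
    return centroids, labels

# Hypothetical subscribers: (age, annual spending in dollars).
subscribers = [(25, 30), (27, 35), (24, 28), (60, 90), (62, 95), (58, 88)]
centroids, labels = kmeans(subscribers, k=2)
print(labels)  # → [0, 0, 0, 1, 1, 1]
```

Each resulting cluster is a candidate customer segment; in practice the variables would be standardized first and several starting points tried.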
Affinity Analysis • Given that a customer purchases a given set of items, what is the probability that they will purchase another set of items? That is, what does the customer’s final market basket look like, given a partially-filled one? • Purpose: Arrange the store shelves of a retail store so as to make it most convenient for customers to purchase related goods and minimize the time of search and shopping. We want the customer to be able to shop quickly but at the same time buy a lot! • On book seller web pages, once you have indicated an interest in purchasing a given book, several related books are often brought to your attention by “advertisements” in the margins of the page you are currently on. Affinity analysis is helpful in generating “associated” sales on retail web pages. This increases the profits of the web retailer. • Major Tool: Association Rules – The Apriori Algorithm.
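The quantities behind an association rule "if A then B" can be sketched directly. The code below computes support, confidence, and lift for a single rule over made-up baskets; it does not implement the candidate-generation and pruning steps of the full association-rule algorithm, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Share of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)

def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    Assumes the antecedent and consequent each appear in at least one basket.
    """
    s_both = support(transactions, set(antecedent) | set(consequent))
    confidence = s_both / support(transactions, antecedent)  # P(B | A)
    lift = confidence / support(transactions, consequent)    # P(B | A) / P(B)
    return s_both, confidence, lift

# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
s, c, l = rule_metrics(baskets, {"diapers"}, {"beer"})
print(f"support={s:.2f} confidence={c:.2f} lift={l:.2f}")
```

Confidence is the conditional probability asked about in the first bullet; a lift above 1 indicates the items appear together more often than independence would predict.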
Link Analysis • Explores Associations between groups (individuals, organizations, web sites, nation-states and the like) • Uses: To improve webpage design, to facilitate criminal investigations, and to benefit medical research in epidemiology and pharmacology, among other uses
Text Mining • To Understand Textual Content • For Finding Interesting Regularities in Text • Help Classify Documents by Type and Content • Useful for Medical Science Search Engines seeking most current research on particular maladies seen in patients • Beneficial in Building Spam Filters • Help Examine Evolution of Opinion vis-à-vis Blogs
Other Fields Where Data Mining is Used • Clinical Science and Providing Baseline Guidance for Clinical Treatment • Political Science (Modeling Voting Patterns, Election Outcomes and Appeal and Supreme Court Decisions) • Statistical Genetics – Relating Genetic characteristics with medical outcomes • Real Estate Assessment Models – County Assessors using predictive models to gauge the current value of houses for the purpose of assessing real estate taxes • College Admissions Practices – Which students should be admitted and how much financial aid is needed to ensure that the chosen student will matriculate?
Figure 1.2: Data mining from a process perspective. Panels: Data Preparation & Exploration (Sampling, Cleaning, Summaries, Visualization, Partitioning, Dimension reduction) • Prediction (MLR, K-Nearest Neighbor, Regression Trees, Neural Nets) • Classification (K-Nearest Neighbor, Naïve Bayes, Logistic Regression, Classification Trees, Neural Nets, Discriminant Analysis) • Segmentation/Clustering • Affinity Analysis/Association Rules • Model Evaluation & Selection • Deriving Insight. Source: G. Shmueli, N. R. Patel and P. C. Bruce, Data Mining for Business Intelligence (2007).
Available Software Packages • XLMINER (Frontline Systems) • SAS Enterprise Miner (SAS Product) • SPSS Modeler (IBM Product) • R (Open Source) • Data Mining Certificates are available for SAS EM and SPSS Modeler
The Shortage of Trained Personnel for Doing Data Mining • “Big data: The next frontier for innovation, competition, and productivity,” McKinsey Global Institute, May 2011 • 140,000 – 190,000 more deep analytical talent positions over the next decade • 1.5 Million more data-savvy managers to take advantage of insights offered by Data Mining
What is SMU doing about this shortage? • Department of Economics: MS in Applied Economics and Predictive Analytics – Starting Fall of 2013 • Department of Statistics: MS in Statistics and Data Analytics – Started Fall of 2012 • Cox School of Business: MS in Business Analytics – Starting Fall of 2013