A data mining approach to the prediction of corporate failure

A data mining approach to the prediction of corporate failure Advisor : Dr. Hsu Presenter : ching-wen Hong Author : Feng Yu Lin, Sally McClean

Outline • Motivation • Objection • Classifier technique • The five steps of Data Mining: the SAS SEMMA methodology • Sampling • Data exploration • Data manipulation • Modelling and results • Assessment • Conclusion • My opinion

Motivation • Due to recent changes in the world economy and as more firms, large or small, seem to fail now more than ever corporate failure prediction is of increasing importance.

Objection • This paper uses a data mining approach to the prediction of corporate failure.

Classifier technique • The classifiers include numerous statistical methods and machine learning methods. • The statistical methods include discriminant analysis (DA) and logistic regression (LG). • The machine learning methods include artificial neural networks (NN) and decision tree method (C5.0). • Another one, the hybrid method is proposed by the author.

The five steps of Data Mining: the SAS SEMMA methodology • Sampling: The data sampling consists of company financial data from the UK. • Explore: Preprocessing of data • Manipulate: Feature selection • Model: The classification models used to the prediction of corporate failure • Assess: Comparison of prediction accuracy

Sampling • The data sampling consists of company financial data from the UK. The financial data were accessed from Datastream/ICV. • The companies are divided into two groups:one is the failed companies group and the other is the nonfailed companies group. • This training sample consists of 690 nonfailed companies and 106 failed companies.(1980-1990) • The test dataset consists of 289 nonfailed companies and 48 failed companies.(1991-1999)

Data exploration • Exploration or preprocessing of data is very important and is sometimes the most time-consuming part. • Exploration of data is included: (1)The data is presented in a ready to use state. (2)The preprocessing of missing data. (3)We must filter out the redundant records such as duplicated data. • We decide to delete x719,x734,x735,x761,x766and x792, since these variables include too many missing values.

Data manipulation • One main purpose of data manipulation is to feature selection. • We will use two feature selection methods. The first one is based on financial theory and human judgement( Feature selection І ), the second is based on ANOVA ( Feature selection Ⅱ ),.

Feature selection І- human judgement based on financial theory • Laurent[21],Ezzamuel et al.[13] and Clarke[10] attempted to reduce the wide variety of financial ratios into several major groups (e.g. profitability, liquidity,etc), the suggestion being that a researcher need only select one ratio from each of the groups to obtain an indication of the companies overall performance. • The features selected are from these categories: return on capital employed(ROCE);Turnover to total assets employed(turnover/TA);capital gearing(CG);working capital(WC)ratio.

Feature selection Ⅱ- ANOVA statistical method • To classify failed companies and the nonfailed companies effectively, the predictors should be chosen to enable us to distinguish between these two groups. • We use ANOVA to select the variables that are significantly different between these two groups. • Table 3, the ANOVA output of the training set, we can get the variables indicated by *, which indicate a significantly different between the failed and nonfailed group.

Modelling and results • The classification models used in this study are: • Statistics:DA,LG; • Machine learning methods:decision trees,NN.

Hybrid models combining different classifiers • A hybrid method usually integrates two or more technologies. The purpose of integrating technologies is to strengthen the best features of each. • A hybrid method that combines the best features of several classification models is developed to increase the prediction performance.

Our purpose is to minimize the number of wrong predictions: Min δ=∑i=1n (Ti*Oi) s.t. Oi=f(w1Vi1+ w2Vi2+ w3Vi3+…+ wmVim-θ),i=1,…,n n=the number of classifiers. m=the number of companies in the sample. Ti the actual outcome of ith company in the test sample. Oi the predicted outcome of ith company in the test sample based on the hybrid method Oi=1 if w1Vi1+ w2Vi2+ w3Vi3+…+ wmVim>θ, Oi=0 else Vij the predicted output of ith company in the test sample by classifier j. wj is the weight of classifier j,Θ is the threshold. *returns 1 if both sides of *are different, *returns 0 if both sides of *are same.

The problem of finding such a combination of classifiers is to try to find (1)the combination of weight,(2)their respective classifiers and (3) the threshold θ such that the miss hit δ between actual output Ti and Oi is as small as possible.

The hybrid algorithm • 1.Compte the total hit ratio (total accuracy) for the training sample itself for all m independent classifiers.(w1, w2 ,…, wm ) • 2.Take the outcome prediction Vij of classifier j for all the companies i in the test sample for j=1,…,m, i=1,…,n • 3.Take all the population of classifiers, or subsets of it. Compute w1Vi1+ w2Vi2+ w3Vi3+…+ wmVim for each companies i in the test sample.

4. if w1Vi1+ w2Vi2+ w3Vi3+…+ wmVim>θ,then Oi=1 , else Oi=0. Adjust the parameter θ, where 0<θ< w1+ w2+ w3+…+ wm , such that the misclassification δ is as small as possible.

We present three hybrid classifiers. Hybrid1-DA+LG+NN+C5.0 Hybrid2-DA+NN+C5.0 Hybrid3-LG+C5.0

Assessment

Conclusions • For all the models,DA,LG,NN,C5.0 and hybrid classifiers, we found the ANOVA feature selection is better than human judgement feature selection except for DA. • The machine learning methods (NN and decision trees) show better performance than the statistical approach. • We present three hybrid classifiers:Hybrid1-DA+LG +NN+C5.0;Hybrid2-DA+NN+C5.0; Hybrid3- LG + C5.0.The empirical tests show that a hybrid classifiers produces higher prediction accuracy than individual classifiers.

My opinion • Advantage: A hybrid method that combines the best features of several classification models is developed to increase the prediction performance. • Disadvantage: (1)The hybrid algorithm is time-consuming in computing w1Vi1+ w2Vi2+ w3Vi3+…+ wmVim ,i=1,…,n for subsets of classifiers. (2) The hybrid method is not good in the application.

A data mining approach to the prediction of corporate failure

A data mining approach to the prediction of corporate failure

Presentation Transcript

Data Mining: Classification and Prediction

Music Recommendation A Data Mining Approach

A prediction approach to representative sampling

Data Mining for Crystal Structure Prediction

On a dynamic approach to the analysis of multivariate failure time data

Data Mining for Protein Structure Prediction

DATA MINING APPROACH IGES 2008

A Wordification Approach to Relational Data Mining : Early Results

An adaptive modular approach to the mining of sensor network data

Effective Prediction of Web-user Accesses: A Data Mining Approach

Data Mining: Classification and Prediction

Data Mining on New Road Prediction

A Novel Approach to Event Duration Prediction

A Clinical Approach to Acute Renal Failure

A Clinical Approach to Acute Renal Failure

A data mining approach to developing the profiles of hotel customers

The Case for A Data Mining Approach to Technical Analysis

Data Mining: Classification and Prediction

A Clinical Approach to Acute Renal Failure

Ceng 714 Data Mining Classification and Prediction