
Evaluating Performance for Data Mining Techniques



  1. Evaluating Performance for Data Mining Techniques

  2. Evaluating Numeric Output
  • Mean absolute error (MAE)
  • Mean square error (MSE)
  • Root mean square error (RMSE)

  3. Mean Absolute Error (MAE) The average absolute difference between classifier predicted output and actual output.

  4. Mean Square Error (MSE) The average of the sum of squared differences between classifier predicted output and actual output.

  5. Root Mean Square Error (RMSE) The square root of the mean square error.
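The three error measures above can be sketched in a few lines of Python (the function names and sample values are illustrative, not from the slides):

```python
import math

def mae(actual, predicted):
    # Mean absolute error: average of |actual - predicted|
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def mse(actual, predicted):
    # Mean square error: average of (actual - predicted)^2
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean square error: square root of the MSE
    return math.sqrt(mse(actual, predicted))

actual = [3.0, 5.0, 2.5, 7.0]      # illustrative actual outputs
predicted = [2.5, 5.0, 4.0, 8.0]   # illustrative classifier predictions
print(mae(actual, predicted))   # 0.75
print(mse(actual, predicted))   # 0.875
print(rmse(actual, predicted))  # ~0.935
```

Note that MSE penalizes large individual errors more heavily than MAE, which is why the two measures can rank two classifiers differently.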

  6. Clustering Techniques

  7. Clustering Techniques
  • Clustering techniques apply some measure of similarity to divide the instances of the data to be analyzed into disjoint partitions.
  • The partitions are generalized by computing a group mean for each cluster or by listing a most typical subset of instances from each cluster.

  8. Clustering Techniques
  • 1st approach: unsupervised clustering.
  • 2nd approach: partition the data in a hierarchical fashion, where each level of the hierarchy is a generalization of the data at some level of abstraction.

  9. Clustering Techniques

  10. The K-Means Algorithm
  • The K-means algorithm is a simple but widely used statistical technique for unsupervised clustering.
  • The K-means algorithm divides the instances of the data to be analyzed into K disjoint partitions (clusters).
  • Proposed by S.P. Lloyd in 1957; first published in 1982.

  11. The K-Means Algorithm
  1. Choose a value for K, the total number of clusters.
  2. Randomly choose K points as cluster centers.
  3. Assign the remaining instances to their closest cluster center (for example, using Euclidean distance as the criterion).
  4. Calculate a new cluster center for each cluster.
  5. Repeat steps 3-5 until the cluster centers do not change.
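The steps above can be sketched in plain Python. This is a minimal sketch: the helper names and sample points are illustrative, and a production implementation would handle empty clusters and multiple random restarts more carefully.

```python
import math
import random

def euclidean(a, b):
    # Euclidean distance between two n-dimensional points
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_means(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # step 2: random initial cluster centers
    for _ in range(max_iter):
        # step 3: assign each instance to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            closest = min(range(k), key=lambda i: euclidean(p, centers[i]))
            clusters[closest].append(p)
        # step 4: recompute each center as the mean of its cluster members
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
        # step 5: stop once an iteration leaves the centers unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5),
          (8.0, 8.0), (9.0, 8.0), (8.5, 9.0)]   # two obvious groups
centers, clusters = k_means(points, 2)
```

Because the initial centers are random, different seeds can converge to different partitions; running the algorithm several times and keeping the best result is common practice.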

  12. The K-Means Algorithm: Analysis • Choose a value for K, the total number of clusters – this step requires an initial discussion about how many clusters can be distinguished within a data set

  13. The K-Means Algorithm: Analysis • Randomly choose K points as cluster centers – the initial cluster centers are selected randomly, but this is not essential if K was chosen properly; the resulting clustering in this case should not depend on the selection of the initial cluster centers

  14. The K-Means Algorithm: Analysis • Calculate a new cluster center for each cluster – the new cluster centers are the means of the cluster members that were placed in their clusters in the previous step

  15. The K-Means Algorithm: Analysis
  • Repeat steps 3-5 until the cluster centers do not change – the process of instance classification and cluster center computation continues until an iteration of the algorithm shows no change in the cluster centers.
  • The algorithm terminates after j iterations if, for each cluster Ci, all instances found in Ci after iteration j-1 remain in cluster Ci upon the completion of iteration j.

  16. Euclidean Distance The Euclidean distance between two n-dimensional vectors x and y is determined as
  d(x, y) = sqrt((x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2)

  17. Cluster Quality
  • How can we evaluate the quality and reliability of the clusters?
  • One evaluation method, more suitable for clusters of approximately equal size, is to calculate the sum of squared differences between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.

  18. Cluster Quality • Another evaluation method is to calculate the mean squared difference between the instances of each cluster and their cluster center. Smaller values indicate clusters of higher quality.
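Both quality measures can be sketched as follows. This is a minimal sketch assuming clusters are lists of numeric tuples and centers are the corresponding mean vectors; the sample data are illustrative.

```python
def cluster_sse(clusters, centers):
    # Sum of squared differences between each instance and its cluster center
    return sum(
        sum((x - c) ** 2 for x, c in zip(point, center))
        for cluster, center in zip(clusters, centers)
        for point in cluster
    )

def cluster_mse(clusters, centers):
    # Mean squared difference: SSE divided by the total number of instances
    n = sum(len(cluster) for cluster in clusters)
    return cluster_sse(clusters, centers) / n

clusters = [[(1.0, 1.0), (1.0, 3.0)], [(8.0, 8.0)]]  # illustrative clustering
centers = [(1.0, 2.0), (8.0, 8.0)]                   # per-cluster means
print(cluster_sse(clusters, centers))  # 2.0
print(cluster_mse(clusters, centers))  # ~0.667
```

Dividing by the number of instances makes the MSE variant comparable across clusterings of different sizes, which is why it suits clusters of unequal size better than the raw SSE.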

  19. Optimal Clustering Criterion
  • Clustering is considered optimal when the average (taken over all clusters) mean square deviation of the cluster members from their center is either:
  • minimal over several (s) experiments,
  • or less than some predetermined acceptable value.

  20. An Example Using the K-Means Algorithm

  21. Unsupervised Model Evaluation

  22. The K-Means Algorithm: General Considerations
  • Requires real-valued data.
  • We must select the number of clusters present in the data.
  • Works best when the clusters that exist in the data are of approximately equal size. If an optimal solution is represented by clusters of unequal size, the K-means algorithm is not likely to find it.
  • Attribute significance cannot be determined.
  • A supervised data mining tool must be used to gain insight into the nature of the clusters formed by a clustering tool.

  23. Supervised Learning for Unsupervised Model Evaluation
  1. Designate each formed cluster as a class and assign each class an arbitrary name.
  2. Choose a random sample of instances from each class for supervised learning.
  3. Build a supervised model from the chosen instances.
  4. Employ the remaining instances to test the correctness of the model.
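The procedure above can be sketched in plain Python. The 1-nearest-neighbour classifier used as the supervised model is an assumption (the slides do not prescribe a particular model), and the helper names and sample clusters are illustrative.

```python
import math
import random

def nearest_label(point, examples):
    # 1-NN classifier (an assumed supervised model): return the label
    # of the closest training example
    return min(examples, key=lambda e: math.dist(point, e[0]))[1]

def evaluate_clusters(clusters, train_fraction=0.5, seed=0):
    rng = random.Random(seed)
    train, test = [], []
    # Steps 1-2: name each cluster by its index and draw a random sample
    # of its instances for supervised learning
    for label, cluster in enumerate(clusters):
        members = list(cluster)
        rng.shuffle(members)
        cut = max(1, int(train_fraction * len(members)))
        train += [(p, label) for p in members[:cut]]
        test += [(p, label) for p in members[cut:]]
    # Steps 3-4: build the model from the chosen instances and test it
    # on the remaining ones; high accuracy suggests well-formed clusters
    correct = sum(1 for p, label in test if nearest_label(p, train) == label)
    return correct / len(test) if test else 1.0

clusters = [[(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0)],
            [(8.0, 8.0), (9.0, 8.0), (8.0, 9.0), (9.0, 9.0)]]
print(evaluate_clusters(clusters))  # 1.0 - well-separated clusters are easy to model
```

If the supervised model predicts the cluster labels accurately on the held-out instances, the clusters capture real structure in the data; poor accuracy suggests the partition is arbitrary.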
