College of Science & Technology, Dept. of Computer Science & IT, BSc in Information Technology. Data Mining. Chapter 6: Clustering Methods. Prepared by: Mahmoud Rafeek Al-Farra, 2013. www.cst.ps/staff/mfarra
Course Outline • Introduction • Data Preparation and Preprocessing • Data Representation • Classification Methods • Evaluation • Clustering Methods • Mid Exam • Association Rules • Knowledge Representation • Special case study: Document clustering • Discussion of case studies by students
Outline • Definition of Clustering • Why clustering? • Where to use clustering? • Next: Types of Data in Cluster Analysis • Next: A Categorization of Major Clustering Methods
Definition of Clustering • Clustering can be considered the most important unsupervised learning technique; like every other problem of this kind, it deals with finding structure in a collection of unlabeled data. • Clustering is “the process of organizing objects into groups whose members are similar in some way”. • A cluster is therefore a collection of objects that are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
Definition of Clustering • Cluster: a collection of data objects • Similar to one another within the same cluster • Dissimilar to the objects in other clusters • Cluster analysis • Grouping a set of data objects into clusters • Clustering is unsupervised classification: no predefined classes
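Example (a sketch, not part of the original slides): the definition above can be illustrated by grouping a few unlabeled 2-D points so that each point ends up with the neighbours it is most similar to. The sketch assumes Euclidean distance as the similarity measure and uses a basic k-means loop as one possible method; the function names, the sample points, and the choice of initial centroids are all hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+

def assign(points, centroids):
    """For each point, return the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: dist(p, centroids[c]))
            for p in points]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(
            sum(m[d] for m in members) / len(members) for d in range(len(old))
        ))
    return new_centroids

def kmeans(points, initial_centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    centroids = list(initial_centroids)
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    return labels, centroids

# Unlabeled data: no classes are given in advance.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]

# Assumption: one starting centroid is taken from each visible group.
labels, centroids = kmeans(points, initial_centroids=[points[0], points[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

The first three points fall into one cluster and the last three into another, even though no class labels were supplied. With a different similarity measure or a poor choice of starting centroids the grouping could differ.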
Why clustering? • Simplification • Pattern detection • Useful in data concept construction • Unsupervised learning process
Where to use clustering? • Data mining • Information retrieval • Text mining • Web analysis • Marketing • Medical diagnosis
Which method should I use? • Type of attributes in the data • Scalability to larger datasets • Ability to work with irregular data • Time cost • Complexity • Data order dependency • Result presentation
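Example (a sketch, not part of the original slides): "data order dependency" means that some methods return different clusters when the same objects are presented in a different order. The sketch below assumes a simple single-pass "leader" scheme on 1-D values, picked only because it makes the effect easy to see; the function name, the threshold, and the sample values are hypothetical.

```python
def leader_clustering(values, threshold):
    """Single-pass clustering of 1-D values.

    Each value joins the first cluster whose leader (its first member)
    lies within `threshold`; otherwise it starts a new cluster.
    """
    leaders = []
    clusters = []
    for v in values:
        for i, lead in enumerate(leaders):
            if abs(v - lead) <= threshold:
                clusters[i].append(v)
                break
        else:
            # No existing leader is close enough: v becomes a new leader.
            leaders.append(v)
            clusters.append([v])
    return clusters

# Same three values, two different presentation orders.
first_order = leader_clustering([1.0, 2.0, 3.0], threshold=1.0)
second_order = leader_clustering([2.0, 1.0, 3.0], threshold=1.0)

print(first_order)   # [[1.0, 2.0], [3.0]]
print(second_order)  # [[2.0, 1.0, 3.0]]
```

The first order yields two clusters and the second yields one, so a method of this kind would score poorly on the order-dependency criterion. Methods that look at all objects before committing to a grouping are generally less sensitive to input order.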