
A Kit For Knowledge Discovery



Presentation Transcript


  1. A Kit For Knowledge Discovery

  2. Data, data everywhere, yet ... • I can’t find the data I need • data is scattered over the network • many versions, subtle differences • I can’t get the data I need • need an expert to get the data • I can’t understand the data I found • available data is poorly documented • I can’t use the data I found • results are unexpected • data needs to be transformed from one form to another

  3. There is a sequence of steps (with eventual feedback loops) that should be followed to discover knowledge (e.g., patterns) in data. • Achieving a standardized process model

  4. What is KDD? Knowledge Discovery in Data is the significant method of evaluating legitimate, innovative, probably useful, accurate and understandable patterns in data.

  5. Knowledge Discovery Process (diagram labels): Raw Data; Understanding; Knowledge Integration; Data Warehouse; Selection & Cleaning; Target Data; Transformation; Transformed Data; Data Mining; Patterns and Rules; Interpretation & Evaluation; Knowledge

  6. Outcomes of Data Mining • Clustering: based on attributes • Association: events correlation • Sequencing: events ~ later predictions • Forecasting: future • Classification: on recognizing patterns

  7. Data Mining • Look for hidden patterns and trends in data that are not immediately apparent from summarizing the data

  8. Data Mining: Data + Interestingness criteria = Hidden patterns

  9. Data Mining: Data + Interestingness criteria = Hidden patterns (type of patterns)

  10. Data Mining: Data (type of data) + Interestingness criteria (type of interestingness criteria) = Hidden patterns
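
The idea on the slides above, that data combined with an interestingness criterion yields hidden patterns, can be illustrated with a small Python sketch. It is not taken from the presentation: it assumes the data are market-basket transactions and that the interestingness criterion is a minimum support threshold. The function name `frequent_itemsets` and the toy baskets are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets whose support >= min_support.

    Assumption for this sketch: "interesting" simply means "frequent".
    """
    n = len(transactions)
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # ignore duplicates, fix item order
        for size in range(1, max_size + 1):
            for combo in combinations(items, size):
                counts[combo] += 1
    return {s: c / n for s, c in counts.items() if c / n >= min_support}


# Hypothetical toy data: each list is one shopping basket.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]

patterns = frequent_itemsets(baskets, min_support=0.5)
strict_patterns = frequent_itemsets(baskets, min_support=0.6)
```

Raising `min_support` tightens the criterion, so fewer patterns survive. Other criteria (confidence, lift, novelty) would slot into the same filter step.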

  11. What is a Data Warehouse? A single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a way they can understand and use in a business context.

  12. What is Data Warehousing? A process of transforming data into information and making it available to users in a timely enough manner to make a difference. (Data → Information)

  13. Data Mining Process: 1. Problem Definition 2. Data Integration & Cleaning 3. Model Framing & Evaluation 4. Knowledge Discovery

  14. Data Mining Tasks: Basic Operations in DM • Descriptive: • Clustering / Similarity Matching • Association rules • Deviation detection • Predictive: • Regression • Classification • Collaborative Filtering

  15. Why Machine Learning • Growing flood of online data • Budding industry • Progress in algorithms and theory • Data mining: using historical data to improve decisions • medical records ⇒ medical knowledge • log data to model users • Software applications we can’t program by hand • autonomous driving • speech recognition • Self-customizing programs • newsreader that learns user interests

  16. Machine Learning and Data Mining • Supervised: a target attribute is present; discover patterns in the data. • Unsupervised: data have no target attribute; explore data to find patterns.

  17. Applications Of Data Mining

  18. Applications of Data Mining • Fraud/Non-Compliance Anomaly detection • Isolate the factors that lead to fraud, waste and abuse • Target auditing and investigative efforts more effectively • Credit/Risk Scoring • Intrusion detection • Recruiting/Attracting customers • Maximizing profitability (cross selling, identifying profitable customers) • Service Delivery and Customer Retention • Build profiles of customers likely to use which services

  19. Tools For Data Mining: LinkOut, NCBI Sequin, RapidMiner, LIBSVM, ADaM, etc.

  20. Why Weka? Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

  21. About WEKA • Waikato Environment for Knowledge Analysis (WEKA) • Developed by the Department of Computer Science, University of Waikato, New Zealand • Machine learning/data mining software coded in Java • Used for research, education, and applications • Exclusively for KDD • Various versions are available, such as Version 2.3, 1998; Version 3.0, 1999; Version 3.4, 2003; Version 3.6, 2008.

  22. Weka GUI Chooser

  23. A Vital Part In Weka: Explorer

  24. Weka • Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either be applied directly to a dataset or called from your own Java code. Weka contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well suited for developing new machine learning schemes.

  25. Weka’s Structural Layout • Explorer: an environment for exploring data with WEKA • Experimenter: performing experiments and conducting statistical tests between learning schemes • Knowledge Flow: supports the same functions as the Explorer but with drag-and-drop • Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands

  26. Algorithms

  27. WEKA File • WEKA stores data in flat files (ARFF format). • It is easy to transform an Excel file to ARFF format. • An ARFF file consists of a list of instances. • An ARFF file can be created using Notepad or Word. • Attribute-Relation File Format (ARFF): • The name of the dataset is given with @relation • Attribute information is given with @attribute • Data is given with @data.
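
The three ARFF sections named on the slide above (@relation, @attribute, @data) can be illustrated with a short Python sketch that writes a tiny file body by hand. It is not part of the presentation. The helper `to_arff` and the toy weather data are hypothetical, and the sketch assumes simple attribute names and values that need no quoting.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text.

    attributes: list of (name, kind) pairs, where kind is either the string
    "numeric" or a list of nominal labels.
    rows: list of tuples; None marks a missing value, written as "?".
    Assumption: names and values contain no spaces or commas, so no quoting.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"  # nominal attribute
        else:
            spec = kind  # e.g. "numeric"
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if v is None else str(v) for v in row))
    return "\n".join(lines) + "\n"


# Hypothetical toy dataset.
arff_text = to_arff(
    "weather",
    [
        ("outlook", ["sunny", "overcast", "rainy"]),
        ("temperature", "numeric"),
        ("play", ["yes", "no"]),
    ],
    [("sunny", 85, "no"), ("overcast", None, "yes")],
)
print(arff_text)
```

Saving `arff_text` to a file with an `.arff` extension should give something the Explorer can open. Real exports from Excel usually also need quoting and type checks, which this sketch leaves out.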

  28. Sample ARFF

  29. Intrinsic Operations: 1. Preprocess 2. Classify 3. Cluster 4. Associate 5. Select Attributes

  30. Pre-Processing

  31. Preprocessing • Changes data formats as needed. • Varies with the dataset being mined. • Some of the preprocessing steps: • Adding/removing attributes • Attribute value substitution • Discretization (MDL, Kononenko, etc.) • Time series filters (delta, shift) • Sampling, randomization • Missing value management • Normalization and other numeric transformations
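
Three of the preprocessing steps listed above can be sketched in plain Python. This is an illustration, not Weka's implementation. Missing values are filled with the column mean, normalization is min-max scaling to [0, 1], and discretization uses simple equal-width bins rather than the MDL or Kononenko methods the slide mentions. All function names are hypothetical.

```python
def fill_missing_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def equal_width_bins(values, bins):
    """Map each value to a bin index 0..bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 for a constant column
    return [min(int((v - lo) / width), bins - 1) for v in values]


# Hypothetical temperature column with one missing reading.
temperatures = [10.0, None, 30.0, 20.0]
filled = fill_missing_with_mean(temperatures)
normalized = min_max_normalize(filled)
binned = equal_width_bins(filled, bins=2)
```

The order matters: imputing first means the later steps never see `None`.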

  32. Algorithms

  33. Pre-Processing: Opening Files • Current Relation • Operations • Browse for the data file in the local file system. • Relations • Instances • Schema • Attributes • Filters

  34. Weka – Formulating Files

  35. Dataset: .txt Format

  36. Weka ~ Datasets

  37. Missing Values

  38. GenericObjectEditor • A property editor for objects that are themselves defined as editable in the GenericObjectEditor configuration file, which lists the possible values that can be selected and then themselves configured. • The configuration file is called "GenericObjectEditor.props" and may live in either the location given by "user.home" or the current directory (the latter takes precedence); a default properties file is read from the Weka distribution.

  39. Weka ~ GenericObjectEditor • This editor allows you to configure a filter. • The same kind of dialog box is used to configure other objects, such as classifiers and clusterers.

  40. Sample - Cluster Attributes for Cluster

  41. Weka’s Viewer

  42. PCA Analysis

  43. Pre-Processing Retrievals Before After

  44. Retrieving Significant Attributes

  45. Select Attribute !

  46. Algorithms

  47. Feature Selection • Some columns are noisy or redundant. This noise makes it more difficult to discover meaningful patterns from the data. • To discover quality patterns, most data mining algorithms require a much larger training data set when the data are high-dimensional. • Feature selection, also known as variable selection, feature reduction, attribute selection or variable subset selection, is the technique of selecting a subset of relevant features for building robust learning models.

  48. Attribute Selection • Attribute selection involves searching through all possible combinations of attributes in the data to find which subset of attributes works best for prediction. • To do this, two objects must be set up: • The evaluator determines what method is used to assign a worth to each subset of attributes. • The search method determines what style of search is performed. • The Attribute Selection Mode box has two options: • 1. Use full training set. • 2. Cross-validation.
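
The split between an evaluator and a search method described above can be sketched in Python. This is an illustration only, not Weka's code. The evaluator here is assumed to be the training-set accuracy of a majority-vote lookup table (roughly the "use full training set" mode), and the search method is a greedy forward search. The names `lookup_accuracy` and `forward_search` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict


def lookup_accuracy(rows, labels, subset):
    """Evaluator: worth of an attribute subset, measured as the training-set
    accuracy of predicting the majority label for each value combination."""
    groups = defaultdict(Counter)
    for row, label in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][label] += 1
    correct = sum(c.most_common(1)[0][1] for c in groups.values())
    return correct / len(labels)


def forward_search(rows, labels, evaluator):
    """Search method: start with no attributes and greedily add the one that
    improves the evaluator's score the most; stop when nothing improves."""
    n_attrs = len(rows[0])
    chosen, best = [], evaluator(rows, labels, [])
    while len(chosen) < n_attrs:
        candidates = [
            (evaluator(rows, labels, chosen + [i]), i)
            for i in range(n_attrs)
            if i not in chosen
        ]
        # Prefer the highest score; break ties by the lowest attribute index.
        score, attr = max(candidates, key=lambda c: (c[0], -c[1]))
        if score <= best:
            break
        chosen.append(attr)
        best = score
    return chosen, best


# Hypothetical toy data: only the middle attribute determines the label.
rows = [
    ("a", "x", "p"), ("b", "x", "p"), ("a", "y", "q"),
    ("b", "y", "p"), ("a", "x", "q"), ("b", "y", "q"),
]
labels = ["yes", "yes", "no", "no", "yes", "no"]

chosen, best = forward_search(rows, labels, lookup_accuracy)
```

Because the evaluator is passed in as a function, any other worth measure can be combined with the same search, which is the flexibility the slide points to. Scoring on the training set alone favours overfitting, so a cross-validated evaluator would be the more careful choice.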

  49. Attribute Selection • Very flexible: arbitrary combination of search and evaluation methods • Both filtering and wrapping methods • Search methods • best-first • genetic • ranking ... • Evaluation measures • Relief • information gain • gain ratio ...
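
Two of the evaluation measures named above, information gain and gain ratio, can be combined with a ranking search in a few lines of Python. This is a sketch for nominal attributes only, using the standard entropy-based definitions. It is not Weka's implementation, and the attribute names and toy data are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a list of nominal values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on the attribute column."""
    n = len(labels)
    remainder = 0.0
    for value in set(column):
        subset = [lab for v, lab in zip(column, labels) if v == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


def gain_ratio(column, labels):
    """Information gain divided by the entropy of the attribute itself."""
    split_info = entropy(column)
    return information_gain(column, labels) / split_info if split_info > 0 else 0.0


def rank_attributes(rows, labels, names, measure=information_gain):
    """Ranking search: score every attribute on its own, best first."""
    scores = [
        (measure([row[i] for row in rows], labels), name)
        for i, name in enumerate(names)
    ]
    return sorted(scores, key=lambda s: -s[0])


# Hypothetical toy data: "shape" fully determines the label.
names = ["colour", "shape", "size"]
rows = [
    ("a", "x", "p"), ("b", "x", "p"), ("a", "y", "q"),
    ("b", "y", "p"), ("a", "x", "q"), ("b", "y", "q"),
]
labels = ["yes", "yes", "no", "no", "yes", "no"]

ranking = rank_attributes(rows, labels, names)
ratio_ranking = rank_attributes(rows, labels, names, measure=gain_ratio)
```

Ranking scores each attribute in isolation, so it is cheap but cannot detect attributes that are only useful in combination. Subset searches such as best-first address that at a higher cost.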

  50. Applying Algorithm
