
Lecture 2



  1. Lecture 2 Themes in this session • Knowledge discovery in databases • Data mining • Multidimensional analysis and OLAP

  2. Knowledge discovery in databases

  3. What is Knowledge? • Data • symbols representing properties of events and their environments • Information • contained in descriptions; provides the answers to a number of basic questions • Knowledge • basic know-how that facilitates action • Understanding • achieved through diagnosis and prescription • Wisdom • judgement of what is efficient and effective

  4. Characteristics of discovered knowledge • non-trivial • valid • novel • potentially useful • understandable • An aggregated measure is “interestingness” • validity • novelty • usefulness • simplicity

  5. A more formal definition of knowledge • Pattern • A pattern is an expression E in a language L describing facts in a subset F_E of F. E is called a pattern if it is simpler than the enumeration of all the facts in F_E • Knowledge • A pattern E ∈ L is called knowledge if, for some user-specified threshold i ∈ M_I, I(E, F, C, N, U, S) > i • where C = validity, N = novelty, U = usefulness, S = simplicity
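
As a rough illustration only: a pattern's interestingness I could be aggregated from the four component scores and tested against the user's threshold. The weighted sum below is an assumed aggregation, since the definition above leaves the form of I unspecified:

```python
# Hypothetical sketch: combine validity (C), novelty (N), usefulness (U)
# and simplicity (S) into one interestingness score I and compare it
# against a user-specified threshold i. The weights are assumptions.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.6):
    return interestingness(*pattern_scores) > threshold

print(is_knowledge((0.9, 0.5, 0.8, 0.7)))  # True: score 0.77 > 0.6
```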

  6. What is KDD? • Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data. • KDD is a process • involves the extraction, organisation and presentation of discovered information • KDD is effected by a human-centred system • is in itself a knowledge intensive task consisting of complex interactions between a human and a (large) database.

  7. Overview of the analyst’s tasks [Diagram: the analyst formulates Goals, which generate Queries against the DB; Analyses of the resulting Dataset produce Output, and the Insight gained enriches the Goals]

  8. Characteristics of the KDD process • highly iterative • protracted over time • numerous sub-tasks • highly complex • numerous input systems

  9. A description of the KDD process • Goal formulation • Data discovery • Task discovery • Data cleaning • Model development • Data analysis • Output generation

  10. Goal formulation Based on a means-ends chain extending into the workings of the organisation • Formulate a goal for improving the operations of the business • Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner • On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process • Revise all of the goals above as needed on the basis of iterative discovery

  11. Data discovery • Try and understand the domain in order to determine which entities are relevant to the discovery process • Check the coverage and content of the data • sift through the source data to see what is available • sift through the source data to see what is not available • Determine the quality of the data • Determine the structure of the data
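
In practice this step often starts with a quick programmatic inspection. A minimal pandas sketch; the file name customers.csv stands in for a hypothetical source extract:

```python
import pandas as pd

df = pd.read_csv("customers.csv")      # hypothetical source extract

print(df.dtypes)                       # structure of the data
print(df.isna().mean())                # fraction missing per column: quality
print(df.describe(include="all"))      # value ranges: coverage and content
```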

  12. Task discovery • Find the means stipulated by the ends contained in the knowledge discovery goals • Find out what the real requirements are on the tasks and on the performance of these tasks • Refine the requirements and choice of tasks until you’re sure you’re setting about answering the correct questions

  13. Data cleaning • Ensure the quality of the data that will be used in the KDD process • Eliminate data quality problems in the data such as… • inconsistencies due to differences between various data sources • missing data • different forms of data representation • data incompatibility
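
A minimal pandas sketch of the cleaning steps listed above; the frame and its column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records merged from two source systems.
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cy"],
    "country":  ["UK", "United Kingdom", "SE", None],  # inconsistent coding
    "revenue":  [120.0, 120.0, np.nan, 80.0],          # missing value
})

# Harmonise different forms of data representation.
df["country"] = df["country"].replace({"United Kingdom": "UK"})

# Remove duplicates introduced by merging the sources.
df = df.drop_duplicates()

# Handle missing data, e.g. by imputing the column mean.
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())
print(df)
```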

  14. Model development Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals • Select the parameters for the model • formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable) • select a set of independent variables which are deemed to have relevance to the outcome variables • Segment the data • find possible relevant subsets in the population • Choose an analysis model which fits the problem domain NOTE: This whole phase demands background knowledge of the domain

  15. Data analysis Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal • specify the chosen model • use some form of formal expression • fit the model to the data • perform initial adjustments to some of the parameters • evaluate the model • check the soundness of the model against the data • refine the model • modify the model on the basis of its discrepancies with the evidence presented by the data
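
The specify/fit/evaluate/refine cycle maps naturally onto a modern library workflow. A sketch using scikit-learn (an illustrative choice; the lecture prescribes no tool), refining the regularisation parameter of a ridge model:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared KDD dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Specify the model, fit it, evaluate its soundness against the data,
# then refine a parameter and re-evaluate (the refinement loop).
for alpha in (10.0, 1.0, 0.1):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha}: mean R^2 across folds = {score:.3f}")
```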

  16. Output generation • Reports of findings in the analysis • Action suggestions on the basis of the findings • Models for use in similar analysis scenarios • Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
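
A toy sketch of such a trigger mechanism; the function names and the threshold condition are illustrative assumptions:

```python
# Hypothetical monitoring mechanism: observe a variable covered in the
# analysis and trigger a notification when a condition is noted.
def monitor(readings, threshold, notify):
    for value in readings:
        if value > threshold:
            notify(f"threshold {threshold} exceeded: {value}")

monitor([0.2, 0.4, 0.9], threshold=0.8, notify=print)
```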

  17. Developing KDD applications Purpose: an application to answer a key business question • a labour intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed • encoding of the discovered knowledge within a specific problem solving architecture • application of the knowledge in the context of a real world task by a well understood class of end-users • Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data

  18. Data mining

  19. What is data mining? Rather formal definition: • Data mining involves fitting models to, and determining patterns from, observed data through the application of specific algorithms. Less formally: • Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification

  20. Goals for data mining • Prediction • involves using some variables or fields in the database to predict unknown or future values of other variables of interest • Description • focuses on finding human-interpretable patterns describing the data

  21. Rationale for data mining • Dramatic increase in the amount of data available (the data explosion) • Increasing competition in the world’s markets • The low relative value of easily discovered information • Increasing cleverness • Emergence of new enabling technology

  22. Enabling factors for data mining • Increased data storage ability • Increased data gathering ability • Increased processing power • The introduction of new computationally intensive methods of machine learning

  23. Background to data mining • Inductive learning • supervised learning • unsupervised learning • Statistics • Machine learning • Differences between DM and ML • DM finds understandable knowledge, ML improves the performance of an agent • DM is concerned with large, real-world databases, ML with smaller data sets • ML is a broader field, not only learning by example

  24. Data mining algorithms Specific mix of three components: • The model • function • representational form • parameters from the data • The model evaluation (preference) criterion • preference of one set of models or set of parameters over another • based on goodness-of-fit function • The search method • a method for finding particular models and parameters • Given: data, family of models, preference criterion
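
These three components can be made concrete with scikit-learn (an illustrative choice), where the model family, the preference criterion and the search method are separate, explicit objects:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(),      # the model (family)
    param_grid={"max_depth": [2, 3, 4, 5]},  # parameters from the data
    scoring="accuracy",                      # evaluation / preference criterion
    cv=5,                                    # ...estimated by cross-validation
)                                            # grid search = the search method
search.fit(X, y)
print(search.best_params_, search.best_score_)
```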

  25. Primary operations in data mining A number of basic operations can be used for prediction and depiction • Classification • Regression • Clustering • Summarisation • Dependency modelling • Change and deviation detection

  26. Classification • Learning a function that maps (classifies) a data item into one of several predefined classes • In supervised learning it is the user that defines the classes. • The classification is applied in the form of one or more attributes that denote the class of the data item. • These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class • Other attributes of the data item are known as predicting attributes
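
A minimal sketch with hypothetical data: income and age are the predicting attributes, the credit-risk class is the predicted attribute (k-nearest-neighbours is an illustrative choice of learner):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: predicting attributes are (income, age);
# the predicted attribute is the credit-risk class.
X = [[30, 25], [85, 50], [20, 40], [90, 35], [25, 30], [70, 45]]
y = ["high", "low", "high", "low", "high", "low"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[60, 33]]))   # classify a new data item -> ['low']
```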

  27. Regression • A common statistical technique for modelling the relationship between two or more variables • Learning a function which maps a data item to a real-valued prediction variable • Simple linear regression uses the straight-line model Y = β0 + β1X + ε, where Y is the prediction variable (dependent variable) and X is the predictive variable (independent variable) • Multiple regression involves more than two variables and uses the model Y = β0 + β1X1 + β2X2 + … + βnXn + ε, where Y is the prediction variable and X1 … Xn are the predictive variables
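
A minimal sketch of fitting the simple linear model above by least squares; the data are synthetic with known β0 = 2.0 and β1 = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 50)
Y = 2.0 + 0.5 * X + rng.normal(0, 0.5, 50)   # Y = b0 + b1*X + noise

beta1, beta0 = np.polyfit(X, Y, deg=1)       # least-squares estimates
print(f"estimated: Y = {beta0:.2f} + {beta1:.2f}*X")
```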

  28. Clustering • A common descriptive task for determining a finite set of categories or clusters to describe the data • Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories • A cluster is a group of objects grouped together because of their similarity or proximity. Data units in a cluster are homogeneous among themselves and differ significantly from other groups • Correlations and functions of distance between elements are used in defining the clusters
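
A k-means sketch over synthetic two-dimensional points, showing distance-based grouping (the algorithm choice is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of points around different centres.
points = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
                    rng.normal((5, 5), 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster membership of each point
print(km.cluster_centers_)   # the two discovered centres
```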

  29. Summarisation • Methods for finding a compact description for a subset of data • Often relies on statistical methods such as the calculation of means and standard deviations • Often applied to interactive exploratory data analysis and automated report generation.
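
A pandas sketch of compact summarisation over a hypothetical sales subset:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "amount": [100, 140, 90, 75, 110],
})

# Compact description of each subset: mean and standard deviation.
summary = sales.groupby("region")["amount"].agg(["mean", "std"])
print(summary)
```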

  30. Dependency modelling • Consists of finding a model which describes significant dependencies between variables • There are two levels of dependency in dependency models: • The structural level specifies which variables are locally dependent on each other • The quantitative level specifies the strengths of the dependencies using some numerical scale • Often in the form: x% of all records containing items A and B also contain items D and E
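
The quantitative form above is an association rule; a minimal confidence calculation over hypothetical market baskets for the rule {A, B} → {D, E}:

```python
# Hypothetical transactions; the rule is {A, B} -> {D, E}.
baskets = [
    {"A", "B", "D", "E"},
    {"A", "B", "D"},
    {"A", "B", "D", "E", "F"},
    {"B", "C"},
    {"A", "B"},
]

antecedent, consequent = {"A", "B"}, {"D", "E"}
with_ab = [b for b in baskets if antecedent <= b]
with_both = [b for b in with_ab if consequent <= b]

confidence = len(with_both) / len(with_ab)
print(f"{confidence:.0%} of records with A and B also contain D and E")
```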

  31. Change and deviation detection • Focuses on discovering the most significant changes in the data from previously measured or normative values • Often used on a long time series of records in order to discover trends • Often used to discover sequential patterns occurring over extended time periods
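
A minimal sketch over a hypothetical time series: flag values that deviate from a rolling mean by more than two standard deviations (both the window and the threshold are arbitrary illustrative choices):

```python
import pandas as pd

# Hypothetical daily measurements with one abnormal spike.
series = pd.Series([10, 11, 10, 12, 11, 10, 30, 11, 10, 12])

rolling_mean = series.rolling(window=5, min_periods=3).mean()
deviation = (series - rolling_mean).abs()
flags = deviation > 2 * series.std()

print(series[flags])   # the values flagged as significant deviations
```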

  32. Problems and issues in data mining • Limited information • Noise and missing values • Uncertainty • Size of databases • Irrelevance of certain fields • Updates to databases

  33. Multidimensional analysis and OLAP

  34. OLAP vs OLTP • OLTP servers handle mission-critical production data accessed through simple queries • usually handles queries of an automated nature • OLTP applications consist of a large number of relatively simple transactions. • Most often contains data organised on the basis of logical relations between normalised tables • OLAP servers handle management-critical data accessed through an iterative analytical investigation • usually handles queries of an ad-hoc nature • supports more complex and demanding transactions • contains logically organised data in multiple dimensions

  35. What is OLAP? Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data. • Flexible information synthesis • Multiple data dimensions/consolidation paths • Dynamic data analysis

  36. Codd’s four data models for data analysis • Categorical data models • Exegetical data models • Contemplative data models • Formulaic data models

  37. Dimensionality revisited

  38. OLAP Tool evaluation criteria (1-6) • Multidimensional conceptual view • Transparency • Accessibility • Consistent reporting performance • Client-Server architecture • Generic dimensionality

  39. OLAP Tool evaluation criteria (7-12) • Dynamic Sparse Matrix handling • Multi-user support • Unrestricted cross-dimensional analysis • Intuitive data manipulation • Flexible reporting • Unlimited dimensions and aggregation levels

  40. Functionality of OLAP tools • Drill-down • Drill-up • Roll-up or consolidation • “Slicing and dicing” by pivoting • Drill-through • Drill-across
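
Several of these operations can be mimicked in a few lines of pandas over a hypothetical sales cube (dimensions region, product, quarter; measure sales):

```python
import pandas as pd

cube = pd.DataFrame({
    "region":  ["N", "N", "S", "S", "N", "S"],
    "product": ["X", "Y", "X", "Y", "X", "X"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [100, 80, 90, 70, 120, 95],
})

# "Slice and dice" by pivoting: a region x quarter view of the measure.
print(cube.pivot_table(values="sales", index="region",
                       columns="quarter", aggfunc="sum"))

# Roll-up (consolidation): aggregate away the product dimension.
print(cube.groupby(["region", "quarter"])["sales"].sum())

# Drill-down: return to the finer product level.
print(cube.groupby(["region", "quarter", "product"])["sales"].sum())
```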

  41. An OLAP “answer set”

  42. Different forms of OLAP • True OLAP • ROLAP (relational OLAP) • MOLAP (multidimensional OLAP)
