
Lecture 2



  1. Lecture 2 Themes in this session • Knowledge discovery in databases • Data mining • Multidimensional analysis and OLAP

  2. Knowledge discovery in databases

  3. What is Knowledge? • Data • symbols representing properties of events and their environments • Information • contained in descriptions; provides the answers to a number of basic questions • Knowledge • basic know-how that facilitates action • Understanding • achieved through diagnosis and prescription • Wisdom • judgement of what is efficient and effective

  4. Characteristics of discovered knowledge • non-trivial • valid • novel • potentially useful • understandable • An aggregated measure is “interestingness” • validity • novelty • usefulness • simplicity

  5. A more formal definition of knowledge • Pattern • A pattern is an expression E in a language L describing facts in a subset F_E of F. E is called a pattern if it is simpler than the enumeration of all the facts in F_E • Knowledge • A pattern E ∈ L is called knowledge if, for some user-specified threshold i ∈ M_I, I(E, F, C, N, U, S) > i • where C = validity, N = novelty, U = usefulness, S = simplicity
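
As a rough illustration only: a pattern's interestingness I could be aggregated from the four component scores and tested against the user's threshold. The weighted sum below is an assumed aggregation, since the definition above leaves the form of I unspecified:

```python
# Hypothetical sketch: combine validity (C), novelty (N), usefulness (U)
# and simplicity (S) into one interestingness score I and compare it
# against a user-specified threshold i. The weights are assumptions.

def interestingness(validity, novelty, usefulness, simplicity,
                    weights=(0.4, 0.2, 0.3, 0.1)):
    scores = (validity, novelty, usefulness, simplicity)
    return sum(w * s for w, s in zip(weights, scores))

def is_knowledge(pattern_scores, threshold=0.6):
    return interestingness(*pattern_scores) > threshold

print(is_knowledge((0.9, 0.5, 0.8, 0.7)))  # True: score 0.77 > 0.6
```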

  6. What is KDD? • Knowledge Discovery in Databases involves the extraction of implicit, previously unknown and potentially useful information from data. • KDD is a process • involves the extraction, organisation and presentation of discovered information • KDD is effected by a human-centred system • is in itself a knowledge intensive task consisting of complex interactions between a human and a (large) database.

  7. Overview of the analyst’s tasks [Diagram: the analyst formulates Goals, which generate Queries against the DB; Analyses of the resulting Dataset produce Output, and the Insight gained enriches the Goals]

  8. Characteristics of the KDD process • highly iterative • protracted over time • numerous sub-tasks • highly complex • numerous input systems

  9. A description of the KDD process • Goal formulation • Data discovery • Task discovery • Data cleaning • Model development • Data analysis • Output generation

  10. Goal formulation Based on a means-ends chain extending into the workings of the organisation • Formulate a goal for improving the operations of the business • Decide what one needs to know in order to fulfil this goal and perform the business activity in a better manner • On the basis of what one needs to know, formulate goals for how to discover this information by using the KDD process • Revise all of the goals above as needed on the basis of iterative discovery

  11. Data discovery • Try and understand the domain in order to determine which entities are relevant to the discovery process • Check the coverage and content of the data • sift through the source data to see what is available • sift through the source data to see what is not available • Determine the quality of the data • Determine the structure of the data
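
In practice this step often starts with a quick programmatic inspection. A minimal pandas sketch; the file name customers.csv stands in for a hypothetical source extract:

```python
import pandas as pd

df = pd.read_csv("customers.csv")      # hypothetical source extract

print(df.dtypes)                       # structure of the data
print(df.isna().mean())                # fraction missing per column: quality
print(df.describe(include="all"))      # value ranges: coverage and content
```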

  12. Task discovery • Find the means stipulated by the ends contained in the knowledge discovery goals • Find out what the real requirements are on the tasks and on the performance of these tasks • Refine the requirements and choice of tasks until you’re sure you’re setting about answering the correct questions

  13. Data cleaning • Ensure the quality of the data that will be used in the KDD process • Eliminate data quality problems in the data such as… • inconsistencies due to differences between various data sources • missing data • different forms of data representation • data incompatibility
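
A minimal pandas sketch of the cleaning steps listed above; the frame and its column names are hypothetical:

```python
import pandas as pd
import numpy as np

# Hypothetical customer records merged from two source systems.
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", "Cy"],
    "country":  ["UK", "United Kingdom", "SE", None],  # inconsistent coding
    "revenue":  [120.0, 120.0, np.nan, 80.0],          # missing value
})

# Harmonise different forms of data representation.
df["country"] = df["country"].replace({"United Kingdom": "UK"})

# Remove duplicates introduced by merging the sources.
df = df.drop_duplicates()

# Handle missing data, e.g. by imputing the column mean.
df["revenue"] = df["revenue"].fillna(df["revenue"].mean())
print(df)
```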

  14. Model development Involves activities concerned with forming a basic hypothesis which can satisfy the knowledge discovery goals • Select the parameters for the model • formulate measures that can be used to quantify achievement of the goal (outcome variable or dependent variable) • select a set of independent variables which are deemed to have relevance to the outcome variables • Segment the data • find possible relevant subsets in the population • Choose an analysis model which fits the problem domain NOTE: This whole phase demands background knowledge of the domain

  15. Data analysis Involves activities aimed at determining the rules/reasons governing the behaviour of those entities focused on by the knowledge discovery goal • specify the chosen model • use some form of formal expression • fit the model to the data • perform initial adjustments to some of the parameters • evaluate the model • check the soundness of the model against the data • refine the model • modify the model on the basis of its discrepancies with the evidence presented by the data
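
The specify/fit/evaluate/refine cycle maps naturally onto a modern library workflow. A sketch using scikit-learn (an illustrative choice; the lecture prescribes no tool), refining the regularisation parameter of a ridge model:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for the prepared KDD dataset.
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Specify the model, fit it, evaluate its soundness against the data,
# then refine a parameter and re-evaluate (the refinement loop).
for alpha in (10.0, 1.0, 0.1):
    score = cross_val_score(Ridge(alpha=alpha), X, y, cv=5).mean()
    print(f"alpha={alpha}: mean R^2 across folds = {score:.3f}")
```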

  16. Output generation • Reports of findings in the analysis • Action suggestions on the basis of the findings • Models for use in similar analysis scenarios • Monitoring mechanisms which observe the variables covered in the analysis and “trigger” notifications when certain conditions are noted in the data.
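
A toy sketch of such a trigger mechanism; the function names and the threshold condition are illustrative assumptions:

```python
# Hypothetical monitoring mechanism: observe a variable covered in the
# analysis and trigger a notification when a condition is noted.
def monitor(readings, threshold, notify):
    for value in readings:
        if value > threshold:
            notify(f"threshold {threshold} exceeded: {value}")

monitor([0.2, 0.4, 0.9], threshold=0.8, notify=print)
```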

  17. Developing KDD applications Purpose: an application to answer a key business question • a labour intensive initial discovery of knowledge by someone who understands the domain as well as the specific data analysis techniques needed • encoding of the discovered knowledge within a specific problem solving architecture • application of the knowledge in the context of a real world task by a well understood class of end-users • Installation of analysis, monitoring, and reporting mechanisms as a base for continual evaluation of data

  18. Data mining

  19. What is data mining? Rather formal definition: • Data mining involves fitting models to, and determining patterns from, observed data through the application of specific algorithms. Less formally: • Data analysis in order to explain an aspect of a complex reality by expressing it as an understandable simplification

  20. Goals for data mining • Prediction • involves using some variables or fields in the database to predict unknown or future values of other variables of interest • Description • focuses on finding human-interpretable patterns describing the data

  21. Rationale for data mining • Dramatic increase in the amount of data available (the data explosion) • Increasing competition in the world’s markets • The low relative value of easily discovered information • Increasing cleverness • Emergence of new enabling technology

  22. Enabling factors for data mining • Increased data storage ability • Increased data gathering ability • Increased processing power • The introduction of new computationally intensive methods of machine learning

  23. Background to data mining • Inductive learning • supervised learning • unsupervised learning • Statistics • Machine learning • Differences between DM and ML • DM finds understandable knowledge, ML improves the performance of an agent • DM is concerned with large, real-world databases, ML with smaller data sets • ML is a broader field, not only learning by example

  24. Data mining algorithms Specific mix of three components: • The model • function • representational form • parameters from the data • The model evaluation (preference) criterion • preference of one set of models or set of parameters over another • based on goodness-of-fit function • The search method • a method for finding particular models and parameters • Given: data, family of models, preference criterion
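
These three components can be made concrete with scikit-learn (an illustrative choice), where the model family, the preference criterion and the search method are separate, explicit objects:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(
    estimator=DecisionTreeClassifier(),      # the model (family)
    param_grid={"max_depth": [2, 3, 4, 5]},  # parameters from the data
    scoring="accuracy",                      # evaluation / preference criterion
    cv=5,                                    # ...estimated by cross-validation
)                                            # grid search = the search method
search.fit(X, y)
print(search.best_params_, search.best_score_)
```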

  25. Primary operations in data mining A number of basic operations can be used for prediction and depiction • Classification • Regression • Clustering • Summarisation • Dependency modelling • Change and deviation detection

  26. Classification • Learning a function that maps (classifies) a data item into one of several predefined classes • In supervised learning it is the user that defines the classes. • The classification is applied in the form of one or more attributes that denote the class of the data item. • These classifying attributes are known as predicted attributes. A combination of values for the predicted attributes defines a class • Other attributes of the data item are known as predicting attributes
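
A minimal sketch with hypothetical data: income and age are the predicting attributes, the credit-risk class is the predicted attribute (k-nearest-neighbours is an illustrative choice of learner):

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data: predicting attributes are (income, age);
# the predicted attribute is the credit-risk class.
X = [[30, 25], [85, 50], [20, 40], [90, 35], [25, 30], [70, 45]]
y = ["high", "low", "high", "low", "high", "low"]

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[60, 33]]))   # classify a new data item -> ['low']
```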

  27. Regression • A common statistical technique for modelling the relationship between two or more variables • Learning a function which maps a data item to a real-valued prediction variable • Simple linear regression uses the straight-line model Y = β0 + β1X + ε, where Y is the prediction variable (dependent variable) and X is the predictive variable (independent variable) • Multiple regression involves more than two variables and uses the model Y = β0 + β1X1 + β2X2 + … + βnXn + ε, where Y is the prediction variable and X1 … Xn are the predictive variables
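
A minimal sketch of fitting the simple linear model above by least squares; the data are synthetic with known β0 = 2.0 and β1 = 0.5:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, 50)
Y = 2.0 + 0.5 * X + rng.normal(0, 0.5, 50)   # Y = b0 + b1*X + noise

beta1, beta0 = np.polyfit(X, Y, deg=1)       # least-squares estimates
print(f"estimated: Y = {beta0:.2f} + {beta1:.2f}*X")
```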

  28. Clustering • A common descriptive task for determining a finite set of categories or clusters to describe the data • Categories may be mutually exclusive and exhaustive, or consist of richer representations such as hierarchical or overlapping categories • A cluster is a group of objects grouped together because of their similarity or proximity. Data units in a cluster are homogeneous among themselves and differ significantly from other groups • Correlations and functions of distance between elements are used in defining the clusters
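
A k-means sketch over synthetic two-dimensional points, showing distance-based grouping (the algorithm choice is illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of points around different centres.
points = np.vstack([rng.normal((0, 0), 0.5, (20, 2)),
                    rng.normal((5, 5), 0.5, (20, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(km.labels_)            # cluster membership of each point
print(km.cluster_centers_)   # the two discovered centres
```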

  29. Summarisation • Methods for finding a compact description for a subset of data • Often relies on statistical methods such as the calculation of means and standard deviations • Often applied to interactive exploratory data analysis and automated report generation.
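
A pandas sketch of compact summarisation over a hypothetical sales subset:

```python
import pandas as pd

sales = pd.DataFrame({
    "region": ["N", "N", "S", "S", "S"],
    "amount": [100, 140, 90, 75, 110],
})

# Compact description of each subset: mean and standard deviation.
summary = sales.groupby("region")["amount"].agg(["mean", "std"])
print(summary)
```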

  30. Dependency modelling • Consists of finding a model which describes significant dependencies between variables • There are two levels of dependency in dependency models: • The structural level specifies which variables are locally dependent on each other • The quantitative level specifies the strengths of the dependencies using some numerical scale • Often in the form: x% of all records containing items A and B also contain items D and E
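
The quantitative form above is an association rule; a minimal confidence calculation over hypothetical market baskets for the rule {A, B} → {D, E}:

```python
# Hypothetical transactions; the rule is {A, B} -> {D, E}.
baskets = [
    {"A", "B", "D", "E"},
    {"A", "B", "D"},
    {"A", "B", "D", "E", "F"},
    {"B", "C"},
    {"A", "B"},
]

antecedent, consequent = {"A", "B"}, {"D", "E"}
with_ab = [b for b in baskets if antecedent <= b]
with_both = [b for b in with_ab if consequent <= b]

confidence = len(with_both) / len(with_ab)
print(f"{confidence:.0%} of records with A and B also contain D and E")
```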

  31. Change and deviation detection • Focuses on discovering the most significant changes in the data from previously measured or normative values • Often used on a long time series of records in order to discover trends • Often used to discover sequential patterns occurring over extended time periods
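
A minimal sketch over a hypothetical time series: flag values that deviate from a rolling mean by more than two standard deviations (both the window and the threshold are arbitrary illustrative choices):

```python
import pandas as pd

# Hypothetical daily measurements with one abnormal spike.
series = pd.Series([10, 11, 10, 12, 11, 10, 30, 11, 10, 12])

rolling_mean = series.rolling(window=5, min_periods=3).mean()
deviation = (series - rolling_mean).abs()
flags = deviation > 2 * series.std()

print(series[flags])   # the values flagged as significant deviations
```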

  32. Problems and issues in data mining • Limited information • Noise and missing values • Uncertainty • Size of databases • Irrelevance of certain fields • Updates to databases

  33. Multidimensional analysis and OLAP

  34. OLAP vs OLTP • OLTP servers handle mission-critical production data accessed through simple queries • usually handles queries of an automated nature • OLTP applications consist of a large number of relatively simple transactions. • Most often contains data organised on the basis of logical relations between normalised tables • OLAP servers handle management-critical data accessed through an iterative analytical investigation • usually handles queries of an ad-hoc nature • supports more complex and demanding transactions • contains logically organised data in multiple dimensions

  35. What is OLAP? Definition: The dynamic synthesis, analysis and consolidation of large volumes of multidimensional data. • Flexible information synthesis • Multiple data dimensions/consolidation paths • Dynamic data analysis

  36. Codd’s four data models for data analysis • Categorical data models • Exegetical data models • Contemplative data models • Formulaic data models

  37. Dimensionality revisited

  38. OLAP Tool evaluation criteria (1-6) • Multidimensional conceptual view • Transparency • Accessibility • Consistent reporting performance • Client-Server architecture • Generic dimensionality

  39. OLAP Tool evaluation criteria (7-12) • Dynamic Sparse Matrix handling • Multi-user support • Unrestricted cross-dimensional analysis • Intuitive data manipulation • Flexible reporting • Unlimited dimensions and aggregation levels

  40. Functionality of OLAP tools • Drill-down • Drill-up • Roll-up or consolidation • “Slicing and dicing” by pivoting • Drill-through • Drill-across
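
Several of these operations can be mimicked in a few lines of pandas over a hypothetical sales cube (dimensions region, product, quarter; measure sales):

```python
import pandas as pd

cube = pd.DataFrame({
    "region":  ["N", "N", "S", "S", "N", "S"],
    "product": ["X", "Y", "X", "Y", "X", "X"],
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "sales":   [100, 80, 90, 70, 120, 95],
})

# "Slice and dice" by pivoting: a region x quarter view of the measure.
print(cube.pivot_table(values="sales", index="region",
                       columns="quarter", aggfunc="sum"))

# Roll-up (consolidation): aggregate away the product dimension.
print(cube.groupby(["region", "quarter"])["sales"].sum())

# Drill-down: return to the finer product level.
print(cube.groupby(["region", "quarter", "product"])["sales"].sum())
```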

  41. An OLAP “answer set”

  42. Different forms of OLAP • True OLAP • ROLAP (relational OLAP) • MOLAP (multidimensional OLAP)
