This report by Qi Liu surveys theoretical frameworks for data mining. It discusses how these frameworks encompass typical data mining tasks, with attention to probabilistic methods, inductive generalization, and different data types. The frameworks explored include statistics, machine learning, probabilistic models, data compression, microeconomics, inductive databases, and information theory. Each framework highlights specific goals and methodologies, emphasizing the iterative nature of the data mining process and the importance of context and background knowledge in deciding which discoveries in the data are interesting.
Theoretic Frameworks for Data Mining Reporter: Qi Liu
What should a framework for data mining do? • Encompass all or most typical data mining tasks • Have a probabilistic nature • Be able to talk about inductive generalizations • Deal with different types of data • Recognize data mining as an iterative and interactive process • Account for background knowledge in deciding what is an interesting discovery
Statistics Framework • From the statistics viewpoint, the key concerns are: • Volume of data • Computational feasibility • Database integration • Simplicity of use • Understandability of results
Machine Learning Framework • Data mining is applied machine learning • Machine learning focuses on prediction, based on known properties learned from training data • Data mining focuses on the discovery of (previously) unknown properties of the data • Data mining often cannot use supervised methods, because labeled training data is unavailable
Probabilistic Framework • Goal: to find the underlying joint distribution (e.g., a Bayesian network) of the variables in the data. • Advantages: • Solid statistical foundation • Clustering and classification fit easily into this framework • Drawback: • Cannot take the iterative and interactive nature of the data mining process into account
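As a minimal sketch of this framework (with hypothetical toy data), the joint distribution of the variables can be estimated by empirical frequencies, after which queries such as marginals and the conditionals used in classification follow directly from it:

```python
from collections import Counter

# Hypothetical toy data set: observations of two binary variables (A, B).
data = [(0, 0), (0, 1), (1, 1), (1, 1), (0, 0), (1, 0), (1, 1), (0, 1)]

# Estimate the joint distribution P(A, B) by empirical frequencies.
counts = Counter(data)
n = len(data)
joint = {x: c / n for x, c in counts.items()}

# Any probabilistic query follows from the joint, e.g. the marginal P(A=1)
# and the conditional P(B=1 | A=1) used in classification.
p_a1 = sum(p for (a, _), p in joint.items() if a == 1)
p_b1_given_a1 = sum(p for (a, b), p in joint.items() if a == 1 and b == 1) / p_a1
```

A Bayesian network would factor this joint distribution into conditional tables, but the principle is the same: once the joint is known, every mining task reduces to a probabilistic query.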
Data Compression Framework • Goal: to compress the data set by finding some structure in it and then encoding the data using as few bits as possible. • Minimum description length (MDL) principle • Instances: association rules, decision trees, clusterings
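The MDL principle can be illustrated on a hypothetical binary sequence: the total description length is the cost of transmitting the model plus the cost of the data encoded under that model, and the preferred hypothesis is the one with the shorter total (the 8-bit parameter cost below is an assumed illustration):

```python
import math

# Hypothetical binary sequence to be compressed (18 ones, 2 zeros).
data = [1] * 9 + [0] + [1] * 9 + [0]

def code_length_bits(p_one, model_bits):
    """Total description length: model cost plus -log2 likelihood of the data."""
    data_bits = sum(-math.log2(p_one if x == 1 else 1 - p_one) for x in data)
    return model_bits + data_bits

# Model A: assume a fair coin (no parameters to transmit).
fair = code_length_bits(0.5, model_bits=0)

# Model B: transmit one parameter (assume 8 bits) and use the empirical rate of 1s.
p_hat = sum(data) / len(data)
fitted = code_length_bits(p_hat, model_bits=8)

# MDL prefers whichever hypothesis yields the shorter total description.
best = "fitted" if fitted < fair else "fair"
```

Here the data is skewed enough that paying 8 bits for the parameter still shortens the overall code; on near-random data the fair-coin model would win instead, which is exactly the overfitting guard MDL provides.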
Microeconomic Framework • To find actionable patterns that increase utility • Define utility function from a perspective of customers
Inductive Database Framework • Store both data and patterns • An inductive database I(D, P) consists of a data component D and a pattern component P. • We assume that both components D and P are sets of sets; this assumption is motivated by an analogy with traditional relational databases. • PS: compare deductive databases, which additionally store rules
Information-Theoretic Framework • Data mining is a process of information transmission from an algorithm to the data miner. • Model the data miner's state of mind as a probability distribution, called the background distribution, which represents her uncertainty and misconceptions. • In the data mining process, properties of the data (referred to as patterns) are revealed.
Attention! • Focus on the data miner as much as on the data: an interesting pattern should be defined subjectively, rather than objectively. • The primary concern is understanding the data itself, rather than the stochastic source that generated it.
Bird’s-eye view of the IT framework • The data miner formalizes her beliefs as a background distribution, denoted P* • Kraft’s inequality holds with equality • The code length of x with probability P(x) is -log(P(x)) • The entropy of P* may be small because the data miner is overly confident • Update P* to a new background distribution P*’ • Measure the reduction in code length: the information gain
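The quantities above can be sketched numerically (the four-outcome background distribution and the particular update are assumed for illustration): code length is -log2 of a probability, entropy is its expectation, and information gain is the drop in code length of the true data after the update:

```python
import math

def code_length(p):
    """Shannon code length (in bits) of an outcome with probability p."""
    return -math.log2(p)

def entropy(dist):
    return sum(p * code_length(p) for p in dist.values() if p > 0)

# Hypothetical background distribution P* over four possible data sets:
# initially the miner considers all four equally likely.
p_star = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

# After a pattern is revealed, the miner updates to P*'. Suppose the pattern
# rules out "c" and "d" and makes the true data set "a" more likely.
p_star_new = {"a": 0.75, "b": 0.25, "c": 0.0, "d": 0.0}

# Information gain about the true data set "a": the reduction in code length.
gain = code_length(p_star["a"]) - code_length(p_star_new["a"])
```

An overconfident miner would start from a low-entropy P*, leaving little room for any pattern to produce a large gain.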
Trade-off • Good data mining algorithms are those that can pinpoint the patterns that lead to a large information gain. • A pattern’s interest to the data miner should be defined by a trade-off between the information gain from revealing the pattern and the description length of the pattern.
How to determine P* and P*’? • Given a set of constraints expressing the miner’s prior beliefs, take the probability distribution of maximum entropy subject to them. • The resulting distribution P is a good surrogate for P*
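A small sketch of the maximum-entropy step under an assumed setup: over the outcomes of a six-sided die with a prescribed mean, the max-ent distribution has the exponential form p(k) ∝ exp(λ·k), and λ can be found by bisection on the implied mean:

```python
import math

# Assumed setup: outcomes of a six-sided die, with a mean-value constraint.
outcomes = [1, 2, 3, 4, 5, 6]

def tilted(lam):
    """Exponential-family distribution p(k) proportional to exp(lam * k)."""
    w = [math.exp(lam * k) for k in outcomes]
    z = sum(w)
    return [wi / z for wi in w]

def mean(dist):
    return sum(k * p for k, p in zip(outcomes, dist))

def max_entropy_dist(target_mean, lo=-5.0, hi=5.0, iters=100):
    """Bisect on lam: the tilted mean is monotonically increasing in lam."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if mean(tilted(mid)) < target_mean:
            lo = mid
        else:
            hi = mid
    return tilted((lo + hi) / 2)

# With target mean 3.5 the max-ent answer is the uniform distribution (lam = 0).
p = max_entropy_dist(3.5)
```

With a skewed target mean (say 4.5) the same routine yields a non-uniform tilt, which is how prior beliefs reshape the surrogate background distribution.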
Patterns • Formalize a pattern as a constraint on the data, i.e., membership in some subset X’ • For the pattern above: the updated distribution assigns probability 0 to all x outside X’ • Update P to P’ (called the updated surrogate background distribution) • Self-information (w.r.t. P’) of the pattern
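A minimal sketch of the update step, with an assumed four-element data space and an assumed pattern: the surrogate P is conditioned on the pattern holding, and the pattern's self-information is the code length of that event:

```python
import math

# Hypothetical surrogate background distribution P over four data sets.
p = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}

# A pattern is a constraint that restricts the data to a subset X':
# assume the revealed pattern holds only for "a" and "b".
satisfies = {"a", "b"}

# Probability that the pattern holds under P.
p_pattern = sum(q for x, q in p.items() if x in satisfies)

# Updated surrogate P': condition P on the pattern, zeroing everything outside X'.
p_updated = {x: (q / p_pattern if x in satisfies else 0.0) for x, q in p.items()}

# Self-information of the pattern: -log2 of its probability, i.e. the number
# of bits the miner saves once the pattern is known to hold.
self_info = -math.log2(p_pattern)
```

A pattern that was already near-certain under P carries almost no self-information, matching the subjective notion of interestingness above.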
More issues about the framework • The cost of a pattern should be specified in advance by the data miner. • Joint patterns • Use cases: • Clustering and alternative clustering • Dimensionality reduction (PCA) • Frequent pattern mining • Community detection • Subgroup discovery and supervised learning