Data mining: some basic ideas

Data mining: some basic ideas • Francisco Moreno • Excerpts from Fundamentals of DB Systems, Elmasri & Navathe, and other sources

Data mining • For many years, organizations have generated large amounts of data in the form of files and databases • These data can be processed using database technology with languages such as SQL • SQL drawbacks: it assumes that the user knows the DB schema, and some queries can become very complex, for example, those oriented to discovering information…

Data mining • Data mining refers to the discovery of information in terms of patterns or rules from vast amounts of data • To be useful, data mining must be carried out efficiently on large files and databases • Data mining uses techniques from areas such as machine learning, statistics, neural networks, and genetic algorithms, among others.

Data mining • We will highlight the nature of the information that is discovered, the types of problems faced in databases, and potential applications • Data mining is related to a broader area called knowledge discovery (see below)

Data mining • Remember: the goal of a Data Warehouse (DW) is to support decision making with data. Data mining can be used in conjunction with a DW to help with decision-making processes • It is possible to apply data mining to operational databases (or files) with individual transactions • However, to make data mining more efficient, a DW could be used, where we could take advantage of the aggregated collection of data

Data mining • Data mining helps in extracting meaningful patterns that cannot necessarily be found by merely querying or processing data in the DW • Data mining requirements should be considered early, during the design of a DW • Indeed, for very large databases, successful use of data mining will depend first on the construction of the DW

Data mining • Data mining is a part of the knowledge discovery process • Knowledge discovery in databases (KDD) typically encompasses more than data mining

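KDD • The KDD process comprises six phases: • Data cleansing • Enrichment • Data transformation and encoding • Data selection • Data mining • Reporting and display of the discovered information • Note: the first three phases (data cleansing, enrichment, data transformation and encoding) are grouped under data integration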

KDD • [Figure: the KDD process as a pipeline: Databases → Data integration (data cleansing, enrichment, data transformation, encoding) → Data Warehouse → Selection → Data Mining → Pattern Evaluation → Knowledge]

KDD: Data integration • During data cleansing, invalid data can be fixed or removed, for example, by correcting zip codes or eliminating records with wrong phone prefixes
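
A minimal Python sketch of this cleansing step. The record layout (`zip` and `phone` fields), the five-digit zip format, and the set of valid phone prefixes are all assumptions made for illustration; none of them come from the source.

```python
import re

# Hypothetical customer records; field names and values are illustrative only.
records = [
    {"name": "Ana", "zip": " 05001", "phone": "604-5550101"},
    {"name": "Luis", "zip": "5001", "phone": "604-5550102"},
    {"name": "Eva", "zip": "05002", "phone": "999-5550103"},
]

# Assumption: the set of valid phone prefixes is known in advance.
VALID_PREFIXES = {"604", "601"}


def fix_zip(zip_code: str) -> str:
    """Strip non-digit characters and left-pad to five digits (assumed format)."""
    digits = re.sub(r"\D", "", zip_code)
    return digits.zfill(5)


def cleanse(rows):
    """Fix zip codes and drop records whose phone prefix is not valid."""
    cleaned = []
    for row in rows:
        prefix = row["phone"].split("-")[0]
        if prefix not in VALID_PREFIXES:
            continue  # eliminate records with a wrong phone prefix
        cleaned.append({**row, "zip": fix_zip(row["zip"])})
    return cleaned


cleaned_records = cleanse(records)
```

Here two records survive, both with the zip code normalized to `05001`; the record with the unknown prefix `999` is dropped.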

KDD: Data integration • Enrichment typically enhances the data with additional information from other sources. For example, given the customer names and phone numbers, an organization can get (perhaps buy) other data such as age, income, and credit card rating and then append them to each customer record.

KDD: Data integration • Data transformation and encoding may be done to reduce the amount of data. For example, product codes may be grouped in terms of product categories. Zip codes may be aggregated into geographic regions, incomes may be divided into ranges, and so on.
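
The three groupings mentioned above can be sketched in Python as simple lookups. The product codes, regions, and income boundaries below are hypothetical values chosen only to illustrate the idea.

```python
# Hypothetical lookup tables; codes, categories, and boundaries are illustrative.
PRODUCT_CATEGORY = {"P100": "video", "P101": "video", "P200": "audio"}
ZIP_REGION = {"05001": "north", "05002": "north", "76001": "south"}
INCOME_BOUNDS = [(20000, "low"), (60000, "medium")]  # upper bounds, ascending


def income_range(income: float) -> str:
    """Map an income to the first range whose upper bound exceeds it."""
    for upper, label in INCOME_BOUNDS:
        if income < upper:
            return label
    return "high"


def encode(row: dict) -> dict:
    """Replace detailed values with coarser codes to reduce the data."""
    return {
        "category": PRODUCT_CATEGORY.get(row["product_code"], "other"),
        "region": ZIP_REGION.get(row["zip"], "unknown"),
        "income_range": income_range(row["income"]),
    }


encoded = encode({"product_code": "P101", "zip": "76001", "income": 45000})
# encoded == {"category": "video", "region": "south", "income_range": "medium"}
```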

Data mining • During data selection, data about specific products or categories of specific products, or from stores in a specific region, may be selected • After such preprocessing, data mining techniques are used to discover rules and patterns

Data mining • For example, mining could discover: • Association rules: whenever a customer buys video equipment, he also buys another electronic gadget • Sequential patterns: a customer who buys a camera will usually buy photographic supplies within the next three months and an accessory item within six months. A customer who buys more than twice in the lean periods* may be likely to buy at least once during the Christmas period * Lean periods: periods of scarcity

Data mining • Classification trees: customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items → some revealing statistics may be generated for such classes

Data mining • This information can then be used • to plan additional store locations based on demographics • to run store promotions • to combine products in advertisements • to plan seasonal marketing strategies

Goals of data mining and knowledge discovery • The goals of data mining fall into the following classes: • Prediction • Identification • Classification • Optimization

Goals of data mining and knowledge discovery • Prediction: Data mining can show how certain attributes within the data will behave in the future: analysis of buying transactions to predict what consumers will buy under certain discounts, how much sales volume a store would generate in a given period, and whether deleting a product line would yield more profits

Goals of data mining and knowledge discovery • Identification: to identify the existence of an item, an event, or an activity: intruders may be identified by the programs executed, files accessed, and CPU time per session; a gene can be identified by certain sequences of nucleotide symbols in the DNA sequence.

Goals of data mining and knowledge discovery • Classification: Data mining can partition the data so that different classes can be identified based on combinations of parameters: customers in a supermarket can be classified into discount-seekers or shoppers in a rush.

Goals of data mining and knowledge discovery • Optimization: to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales under a given set of constraints → This strongly resembles the objective function in the field of operations research (there is no sharp line separating data mining from this and other related disciplines)

Data mining • Some types of knowledge discovered during data mining: • Association rules • Sequential patterns • Patterns within time series • Categorization and segmentation

Data mining • Association rules*: correlate the presence of a set of items with a range of values for another set of variables: when a female retail shopper buys a handbag, she is likely to buy shoes. * Later, we will focus on this type of knowledge.

Data mining • Sequential patterns: a sequence of actions or events is sought: if a patient underwent cardiac bypass surgery and developed high blood urea within a year of surgery, he is likely to suffer from kidney failure within the next year. • Note that detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships

Data mining • Patterns within time series: similarities can be detected within positions of a time series: stocks of a utility (service) company A and a financial company B show the same pattern during a year; two products show the same selling price pattern in summer but a different one in winter.

Data mining • Categorization and segmentation: a given population of events or items can be partitioned into sets of “similar” elements: • a population of treatment data may be divided into groups based on similarity of side effects • a population may be categorized into groups from “most likely to buy” to “least likely to buy” • web accesses made by users may be analyzed in terms of keywords to reveal clusters of users (Web usage mining)

Association rules • The database is regarded as a collection of transactions (for example, purchases), each involving a set of items • A common example is that of market-basket data • Consider the following example with four transactions:

Association rules
Transaction_id | Items_bought
1 | milk, bread, juice
2 | milk, juice
3 | milk, eggs
4 | bread, cookies, coffee
Note: Some important information is not considered, for example, the quantity of each item purchased in each transaction

Association rules • Another example: a text document data set, where each document is treated as a set of keywords: • Doc 1: {student, teach, school} • Doc 2: {student, school} • Doc 3: {teach, school, city, game} • Doc 4: {baseball, basketball} • Doc 5: {basketball, team, city, game} (Text mining, Web content mining)

Association rules • An association rule is of the form $X \Rightarrow Y$, that is, LHS (left-hand side) $\Rightarrow$ RHS (right-hand side), where $X = \{x_1, x_2, \ldots, x_n\}$ and $Y = \{y_1, y_2, \ldots, y_m\}$ are sets of items, $x_i$ and $y_j$ being distinct items for all $i$ and $j$, and $X \cap Y = \emptyset$ • This association states that if a customer buys $X$, he is also likely to buy $Y$.

Association rules • Association rules should include both support (prevalence) and confidence (strength) • The support for a rule LHS $\Rightarrow$ RHS is the percentage of transactions that hold all the items in the set LHS $\cup$ RHS. • If the support is low, there is no overwhelming evidence that the items in LHS $\cup$ RHS occur together.

Association rules: Support examples • Milk $\Rightarrow$ Juice has 50% support. • Bread $\Rightarrow$ Juice has 25% support.
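
These two support values can be checked with a short Python sketch over the four example transactions. The function name `support` and the use of Python sets to hold each transaction are illustrative choices, not part of the source.

```python
# The four example transactions, each stored as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def support(lhs: set, rhs: set, db: list) -> float:
    """Fraction of transactions that contain every item in LHS union RHS."""
    items = lhs | rhs
    return sum(1 for t in db if items <= t) / len(db)


print(support({"milk"}, {"juice"}, transactions))   # 0.5
print(support({"bread"}, {"juice"}, transactions))  # 0.25
```

Milk and juice appear together in transactions 1 and 2 (2 out of 4), while bread and juice appear together only in transaction 1 (1 out of 4).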

Association rules • To compute confidence, we consider all transactions that include the items in LHS. The confidence for LHS $\Rightarrow$ RHS is the percentage of such transactions that also include RHS.

Association rules: Confidence examples • Milk $\Rightarrow$ Juice has 66.7% confidence. • Bread $\Rightarrow$ Juice has 50% confidence.
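
A minimal Python sketch of the confidence computation on the same four example transactions. The function name `confidence` and the error raised for an LHS that never occurs are assumptions made for illustration.

```python
# The four example transactions, each stored as a set of items.
transactions = [
    {"milk", "bread", "juice"},
    {"milk", "juice"},
    {"milk", "eggs"},
    {"bread", "cookies", "coffee"},
]


def confidence(lhs: set, rhs: set, db: list) -> float:
    """Among transactions containing LHS, the fraction that also contain RHS."""
    with_lhs = [t for t in db if lhs <= t]
    if not with_lhs:
        raise ValueError("no transaction contains LHS")
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)


print(round(confidence({"milk"}, {"juice"}, transactions), 3))  # 0.667
print(confidence({"bread"}, {"juice"}, transactions))           # 0.5
```

Milk appears in three transactions, two of which also contain juice (2 out of 3); bread appears in two transactions, one of which also contains juice (1 out of 2).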

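Association rules • Let $n$ be the number of transactions and let $Z.\mathrm{count}$ denote the number of transactions that contain all the items in $Z$; then: • $\mathrm{Support} = \dfrac{(X \cup Y).\mathrm{count}}{n}$ • $\mathrm{Confidence} = \dfrac{(X \cup Y).\mathrm{count}}{X.\mathrm{count}}$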
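
The same count-based formulas apply to any transaction data. As a rough sketch, the Python code below applies them to the five keyword sets from the text document example; the helper names `count` and `rule_metrics` are hypothetical.

```python
# The five keyword sets from the text document example.
docs = [
    {"student", "teach", "school"},
    {"student", "school"},
    {"teach", "school", "city", "game"},
    {"baseball", "basketball"},
    {"basketball", "team", "city", "game"},
]


def count(itemset: set, db: list) -> int:
    """Number of transactions (here, documents) containing every item."""
    return sum(1 for t in db if itemset <= t)


def rule_metrics(x: set, y: set, db: list) -> tuple:
    """Return (support, confidence) for X => Y using the count-based formulas."""
    if x & y:
        raise ValueError("X and Y must be disjoint")
    n = len(db)
    xy_count = count(x | y, db)
    x_count = count(x, db)
    if x_count == 0:
        raise ValueError("confidence is undefined when X.count is 0")
    return xy_count / n, xy_count / x_count


sup, conf = rule_metrics({"student"}, {"school"}, docs)
print(sup, conf)  # 0.4 1.0
```

The keyword student occurs in two of the five documents, and both also contain school, so the rule {student} => {school} has 40% support and 100% confidence.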
