Introduction to Data Analysis and Mining

1. Introduction to Data Analysis and Mining By Laura Jordana

2. Decision-Support Systems Database applications can be classified as either transaction-processing or decision-support systems. Transaction-processing systems are extensively used today: bank transactions, online sale transactions, etc. These systems generate a large amount of information.

3. Decision-Support Systems Decision-support systems attempt to extract useful information from the data that transaction-processing systems generate, in order to support business decisions. For example, they can analyze customer behavior to help managers decide what products to stock in a store or which market to advertise their products to.

4. OLAP Many decision-support queries can be written in SQL. However, others cannot be expressed in SQL at all, or cannot be expressed easily. Extensions are available to make data analysis easier. OLAP (Online Analytical Processing) consists of tools for data analysis. Examples: statistical analysis such as finding percentiles and cumulative distributions.
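To make the two statistics named on this slide concrete, here is a minimal Python sketch. The sales figures and the function names are hypothetical, and the nearest-rank definition of a percentile is only one of several in common use; an OLAP tool may use a different one.

```python
import math
from collections import Counter

def percentile(values, p):
    """Nearest-rank percentile: the smallest value such that at least
    p percent of the data is less than or equal to it."""
    if not values or not 0 < p <= 100:
        raise ValueError("need a non-empty list and 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

def cumulative_distribution(values):
    """Return (value, fraction of data <= value) for each distinct value."""
    counts = Counter(values)
    total = len(values)
    running = 0
    result = []
    for value in sorted(counts):
        running += counts[value]
        result.append((value, running / total))
    return result

# Hypothetical daily sales totals.
sales = [120, 80, 200, 80, 150]
median_sale = percentile(sales, 50)
sales_cdf = cumulative_distribution(sales)
```

With this data, `median_sale` is 120, and `sales_cdf` shows that 40% of the days had sales of 80 or less.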

5. Data Warehousing A data warehouse is an archive of information gathered from multiple data sources. A company may have different databases for different purposes. These databases might only contain current data. The purpose of the data warehouse is to store ALL the data for a long time. Decision-support queries are easier to write, and online transaction-processing systems are not affected by this additional workload.

6. Components of a Data Warehouse

7. Issues When and how to gather data: source-driven (from the data source to the warehouse) or destination-driven (warehouse sends requests for new data). Schema to be used: different data sources are likely to have different schemas. Data transformation and cleansing: correcting minor errors such as a street name being spelled incorrectly.

8. Issues (cont.) Propagating updates: how to update the data warehouse when an update occurs at the data source. Summarizing data: the warehouse may not need, or have room to store, all of the raw data.
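Two of the issues listed above, cleansing and summarizing, can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the street list, the row layout, and the similarity cutoff of 0.8 are hypothetical, and fuzzy matching with the standard library's `difflib` is just one possible way to catch a misspelled street name.

```python
from collections import defaultdict
from difflib import get_close_matches

# Hypothetical list of canonical street names known to the warehouse.
KNOWN_STREETS = ["Main Street", "Maple Avenue", "Oak Street"]

def clean_street(name, known=KNOWN_STREETS, cutoff=0.8):
    """Map a possibly misspelled street name to its closest known spelling.
    Names with no sufficiently close match are returned unchanged."""
    lookup = {street.lower(): street for street in known}
    matches = get_close_matches(name.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else name

def summarize(rows):
    """Collapse raw (street, amount_cents) rows into one summary per street,
    so the raw rows themselves need not be stored."""
    summary = defaultdict(lambda: {"orders": 0, "total_cents": 0})
    for street, amount_cents in rows:
        entry = summary[clean_street(street)]
        entry["orders"] += 1
        entry["total_cents"] += amount_cents
    return dict(summary)

# Hypothetical raw rows arriving from a data source.
raw_rows = [
    ("Main Street", 1250),
    ("Mian Street", 800),  # misspelled; cleaned to "Main Street"
    ("Oak Street", 400),
]
summary = summarize(raw_rows)
```

The three raw rows collapse into two summary entries, because the misspelled street is folded into "Main Street" before the totals are computed.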

9. Data Mining The process of analyzing large databases to find useful patterns. Data mining attempts to discover rules and patterns from data. Also called "knowledge discovery".

10. Knowledge Discovery A rule can be the result of knowledge discovery. For example: "Young women with annual incomes are most likely to buy small sports cars." These rules are not universally true, and have degrees of "support" and "confidence".

11. Applications of Knowledge Discovery Predictions: For example, a credit-card company may want to predict a person's credit risk based on known factors. Associations: Suggesting books to a customer who has purchased books at an online bookstore, or suggesting accessories to go with an item. Real-World Example: The National Basketball Association uses a data-mining application in conjunction with video recordings of basketball games to analyze plays and discover interesting patterns in game data. (Source: http://citeseer.ist.psu.edu/cachedpage/421882/1)

12. Classification Items belong to one of several classes. The problem is to predict what class a new item belongs to (e.g. predicting a person's credit risk). Attributes of the item are used to predict its class (e.g. age, education, annual income, current debts). A decision tree is one way to perform classification.

13. Decision-Tree A decision tree has leaf nodes that represent classes. Each internal node is associated with a predicate or function which is used to determine which child to traverse to. Basically, a decision-tree is a flow chart of if-then scenarios.

14. Decision-Tree
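The structure described above (leaf nodes holding classes, internal nodes holding predicates) can be sketched in Python as follows. The attributes, thresholds, and class labels are hypothetical values chosen for illustration, not ones taken from the presentation, and the tree is written by hand rather than learned from data.

```python
from dataclasses import dataclass
from typing import Callable, Union

@dataclass
class Leaf:
    label: str  # the class this leaf represents

@dataclass
class Node:
    predicate: Callable[[dict], bool]  # decides which child to traverse to
    if_true: "Tree"
    if_false: "Tree"

Tree = Union[Leaf, Node]

def classify(tree, item):
    """Walk from the root to a leaf, applying each node's predicate to the item."""
    while isinstance(tree, Node):
        tree = tree.if_true if tree.predicate(item) else tree.if_false
    return tree.label

# Hypothetical hand-built tree for credit risk.
credit_tree = Node(
    predicate=lambda person: person["income"] >= 50_000,
    if_true=Node(
        predicate=lambda person: person["debts"] < 10_000,
        if_true=Leaf("low risk"),
        if_false=Leaf("medium risk"),
    ),
    if_false=Node(
        predicate=lambda person: person["degree"] in {"bachelors", "masters"},
        if_true=Leaf("medium risk"),
        if_false=Leaf("high risk"),
    ),
)

applicant = {"income": 60_000, "debts": 5_000, "degree": "masters"}
risk = classify(credit_tree, applicant)
```

Here `risk` is "low risk": the income test passes at the root, then the debt test passes at the next level. In practice the tree would be built from training data by a learning algorithm instead of being written by hand.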

15. Association Association is a topic of interest particularly in the retail industry. Companies are interested in the associations among different items that people purchase. For example: Someone who buys bread will probably buy milk. Someone who bought a book on PHP is likely to purchase a book on MySQL.

16. Association Rules Examples: bread => milk; PHP => MySQL. As mentioned before, rules have degrees of "support" and "confidence". Support measures what percentage of the population satisfies both sides of the rule (e.g. what percentage of all purchases include both milk and bread). Confidence measures how often the population satisfies the right-hand side of the rule when the left-hand side is true (e.g. what percentage of the purchases that include bread also include milk). Note: The confidence of bread => milk can differ from that of milk => bread, although both rules have the same support.
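The two measures can be computed directly from a list of purchases. The sketch below uses a small, made-up set of five transactions; the function names are hypothetical.

```python
def support(transactions, lhs, rhs):
    """Fraction of all transactions that contain both sides of the rule."""
    both = lhs | rhs
    return sum(1 for t in transactions if both <= t) / len(transactions)

def confidence(transactions, lhs, rhs):
    """Among transactions containing the left-hand side,
    the fraction that also contain the right-hand side."""
    with_lhs = [t for t in transactions if lhs <= t]
    if not with_lhs:
        raise ValueError("no transaction contains the left-hand side")
    return sum(1 for t in with_lhs if rhs <= t) / len(with_lhs)

# Hypothetical purchases, each represented as a set of items.
purchases = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread"},
    {"milk"},
    {"milk", "eggs"},
]

bread, milk = {"bread"}, {"milk"}
support_both = support(purchases, bread, milk)             # 2 of 5 purchases
conf_bread_milk = confidence(purchases, bread, milk)       # 2 of 3 bread purchases
conf_milk_bread = confidence(purchases, milk, bread)       # 2 of 4 milk purchases
```

Both rules have a support of 0.4, but the confidence of bread => milk is 2/3 while the confidence of milk => bread is 0.5, which illustrates the note above.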

17. Other Types Of Mining Text mining: uses data mining techniques on text documents. Data visualization: helps users observe patterns visually.

18. References http://www.purdue.edu/UNS/html4ever/2004/041018.Caruthers.discover.html; A. Silberschatz, H.F. Korth, S. Sudarshan: Database System Concepts, 5th ed., McGraw-Hill, 2006
