Data Mining: Current Status and Directions
What is Data Mining? • Data mining (also called knowledge discovery in databases) • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories • The goal is to understand and use data, to make data itself something of value and strategic importance
Data is everywhere! • Relational databases: a commodity of every enterprise • POS (Point of Sale): Transactional DBs are often terabytes in size • Legacy databases • Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases • Time-series data (e.g., stock trading) and temporal data • Text (documents, emails) and multimedia databases • WWW: A huge, hyper-linked, dynamic, global information system
The potential for Data Mining Is Everywhere, too! • Knowledge to be mined • Characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc. • Techniques utilized • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc. • Applications adapted • Retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.
Data Mining: A Confluence of Multiple Disciplines • Database Technology • Statistics • Machine Learning (AI) • Visualization • Information Science • Other Disciplines
Multi-Dimensional Data Analysis • Data warehousing: integration from heterogeneous or semi-structured databases • Multi-dimensional modeling of data: star & snowflake schemas (in Relational DBMS) • Efficient and scalable computation of data cubes or iceberg cubes (in MDDB) • OLAP (on-line analytical processing): drilling, dicing, slicing, etc. • Discovery-driven (data driven) exploration of data cubes
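The OLAP operations named above can be illustrated on a tiny in-memory fact table. This is only a sketch: the rows, the dimension names (`region`, `product`, `quarter`) and the helper names `roll_up` and `slice_rows` are assumptions made up for illustration, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical fact rows: three dimensions and one measure ("sales").
facts = [
    {"region": "East", "product": "computer", "quarter": "Q1", "sales": 100.0},
    {"region": "East", "product": "software", "quarter": "Q1", "sales": 40.0},
    {"region": "West", "product": "computer", "quarter": "Q1", "sales": 80.0},
    {"region": "West", "product": "computer", "quarter": "Q2", "sales": 90.0},
]

def roll_up(rows, dims, measure="sales"):
    """Aggregate the measure over the given dimensions (one cuboid)."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[measure]
    return dict(totals)

def slice_rows(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dim] == value]

by_region = roll_up(facts, ["region"])            # roll up to one dimension
grand_total = roll_up(facts, [])                  # apex cuboid: no dimensions
q1_by_product = roll_up(slice_rows(facts, "quarter", "Q1"), ["product"])
```

Each call to `roll_up` with a different list of dimensions produces one cuboid of the data cube; a real OLAP engine would precompute and index many of these rather than scanning the fact rows each time.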
Creating Multi-dimensional Data Warehouses • Start with standard normalized relational database tables.
Data Warehouse ‘Star’ Schema • In order to reduce the number of joins that must be performed, data is reformatted into ‘fact’ tables. • Fact tables typically consist of many foreign keys, each referencing a dimension table, together with the numeric measures being analyzed.
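A star schema can be sketched with Python's built-in `sqlite3` module. The table and column names below (`fact_sales`, `dim_product`, `dim_store`) and the sample rows are hypothetical; the point is only that the fact table holds foreign keys plus measures, and each analysis needs a single join per dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (hypothetical names and contents).
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

-- Fact table: foreign keys into each dimension plus numeric measures.
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    units      INTEGER,
    revenue    REAL
);

INSERT INTO dim_product VALUES (1, 'Laptop', 'Hardware'), (2, 'Office Suite', 'Software');
INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO fact_sales  VALUES (1, 1, 2, 2000.0), (2, 1, 3, 300.0), (1, 2, 1, 1000.0);
""")

# One join from the fact table to a dimension answers a typical question.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
conn.close()
```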
Data Warehouse ‘Snowflake’ Schema • Very similar to the star schema. Can you tell what this schema lets us see that the star schema did not?
Making optimal use of storage space • Many cuboids can be materialized by analyzing another cuboid as opposed to the entire data set • Example: Consider analyzing sales based on the dimensions of Route, Source, and Time. The number of rows in each view is given in millions:
• Route, Source, Time: 6 M
• Route, Time: 6 M
• Route, Source: 0.8 M
• Source, Time: 6 M
• Time: 0.1 M
• Route: 0.2 M
• Source: 0.01 M
• None (the apex view, with no grouping dimensions)
• Materialization of all views would require roughly 19.1 million rows
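The total above, and the saving available from skipping views that are no smaller than the base cuboid, can be checked with a few lines of arithmetic. The row counts are copied from the slide; the selection rule (skip any view as large as the base, since it can be answered from the base at the same cost) is a simplifying assumption for illustration, not a full view-selection algorithm.

```python
# Row counts in millions, as listed on the slide.
rows_millions = {
    ("Route", "Source", "Time"): 6.0,
    ("Route", "Time"): 6.0,
    ("Route", "Source"): 0.8,
    ("Source", "Time"): 6.0,
    ("Time",): 0.1,
    ("Route",): 0.2,
    ("Source",): 0.01,
}
base = ("Route", "Source", "Time")

# Cost of materializing every view.
total_all = sum(rows_millions.values())   # about 19.11 million rows

# Assumed rule: a view as large as the base cuboid gains nothing from storage.
redundant = [
    view for view, size in rows_millions.items()
    if view != base and size >= rows_millions[base]
]
saved = sum(rows_millions[view] for view in redundant)
total_selective = total_all - saved
```

Under that rule the two 6 M two-dimensional views are skipped, which accounts for a saving of 12 million rows.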
Dependent Cuboids • Selective materialization in this case can reduce the number of stored rows by 12 million • Assume that ‘Part’ can be further partitioned into ‘Size’ and ‘Color’, and ‘Customer’ can be partitioned into ‘Individual’, ‘State’, and ‘Country’
• Part, Supplier, Customer: 6 M
• Part, Supplier: 0.8 M
• Supplier, Customer: 6 M
• Part, Customer: 6 M
• Part (Color), Customer (State)
• Part (Size), Customer (State)
• Part (Color), Customer (Country)
• Part (Size), Customer (Country)
• Part (Color), Customer (Individual)
• Part (Size), Customer (Individual)
• Part
• Customer
Association and Frequent Pattern Analysis • Objective is to find patterns in the tendency of items to be found together. • A typical 2-item association rule output will generally look something like this: • Computer → Software (7%, 72%) • This is telling you that 7% (a.k.a. support level) of your sales transactions involved computers AND software, and that 72% (a.k.a. confidence level) of all computer sales involved the sale of software.
Association and Frequent Pattern Analysis • Associations can also be found among 3, 4, or more item sets, for example: • (Computers, Software) → Mouse Pad (8%, 65%) • This tells you that 8% of transactions involved computers, software, and mouse pads, and that 65% of transactions involving computers and software also involved the purchase of a mouse pad.
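The two numbers attached to a rule can be computed directly from a list of transactions. In the sketch below, support is the fraction of all transactions that contain every item in the rule, and confidence is the fraction of transactions containing the left-hand side that also contain the right-hand side. The five sample baskets are invented for illustration and do not reproduce the percentages on the slides.

```python
# Hypothetical market baskets, one set of items per transaction.
transactions = [
    {"computer", "software", "mouse pad"},
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"software"},
]

def support(itemset, transactions):
    """Fraction of all transactions containing every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the rule never applies in this data
    return support(set(antecedent) | set(consequent), transactions) / base

# 2-item rule: computer -> software
sup_2 = support({"computer", "software"}, transactions)
conf_2 = confidence({"computer"}, {"software"}, transactions)

# 3-item rule: (computer, software) -> mouse pad
sup_3 = support({"computer", "software", "mouse pad"}, transactions)
conf_3 = confidence({"computer", "software"}, {"mouse pad"}, transactions)
```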
Association and Frequent Pattern Analysis • The problem with unguided associative analysis is that the number of associations can be enormous. • Consider a store like L.L. Bean trying to identify meaningful associations. The output could number in the millions. • In order to “filter” the output, users will frequently set parameters for confidence and support thresholds.
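The effect of thresholds can be seen even on a toy data set. The sketch below enumerates every ordered pair of items as a candidate rule and keeps only those that meet a minimum support and confidence. The function name `mine_pair_rules`, the baskets, and the threshold values are assumptions chosen for illustration; production miners use algorithms that avoid enumerating every candidate.

```python
from itertools import permutations

def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (lhs, rhs, support, confidence) for 2-item rules above both thresholds."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    rules = []
    for lhs, rhs in permutations(items, 2):
        count_lhs = sum(1 for basket in transactions if lhs in basket)
        count_both = sum(1 for basket in transactions if lhs in basket and rhs in basket)
        sup = count_both / n
        conf = count_both / count_lhs  # count_lhs > 0: every item occurs somewhere
        if sup >= min_support and conf >= min_confidence:
            rules.append((lhs, rhs, sup, conf))
    return rules

# Hypothetical baskets.
baskets = [
    {"computer", "software", "mouse pad"},
    {"computer", "software"},
    {"computer"},
    {"printer", "paper"},
    {"software"},
]

all_rules = mine_pair_rules(baskets, 0.0, 0.0)      # unguided: every candidate
strong_rules = mine_pair_rules(baskets, 0.4, 0.6)   # filtered by thresholds
```

With only five distinct items the unguided run already yields 20 candidate rules, while the thresholds leave two; the gap grows quickly as the catalog grows.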
Clustering and Outlier Analysis • Attributes of interest are plotted on a graph whose axes represent the dimensions of interest. Cluster analysis is frequently two-dimensional, but does not have to be. • The objective of the data mining algorithm is to find cluster centers that maximize the distance between cluster centers while minimizing the distance between the points in a cluster and the center of that cluster. • The center of the cluster typically defines the cluster (e.g., males between 30 and 35 years old with incomes between 50K and 75K) and axes are usually parametric rather than continuous
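One common way to search for such centers is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so treating k-means as the method is an assumption; note also that it only minimizes the distance from points to their own center and does not explicitly maximize the distance between centers. The (age, income in $K) points and the starting centers are invented.

```python
import math

def kmeans(points, centers, iterations=10):
    """Plain k-means: assign each point to its nearest center, then recompute centers."""
    centers = [tuple(c) for c in centers]
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        new_centers = []
        for old, members in zip(centers, clusters):
            if members:
                # Mean of each coordinate across the cluster's members.
                new_centers.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centers.append(old)  # keep an empty cluster's center where it was
        centers = new_centers
    return centers, clusters

# Hypothetical customers as (age, income in $K).
customers = [(31, 55), (33, 60), (34, 70), (52, 120), (55, 130), (58, 125)]
centers, clusters = kmeans(customers, centers=[(30, 50), (60, 130)])
```

The result depends on the starting centers and on the scale of each axis, so real use normally standardizes the attributes and tries several initializations.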
Clustering Analysis • Can include user-specified constraints (e.g., no cluster has fewer than 1,000 customers)
Sequential Patterns and Time-Series Analysis • Trend analysis • Trend movement vs. cyclic variations, seasonal variations and random fluctuations • Similarity search in time-series database • Handling gaps, scaling, etc. • Indexing methods and query languages for time-series • Sequential pattern mining • Various kinds of sequences, various methods • Periodicity analysis • Full periodicity, partial periodicity, cyclic association rules
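Separating trend movement from seasonal variation can be illustrated with a simple moving average, which is one basic smoothing technique among many. The series below is synthetic: a straight-line trend plus a repeating four-period seasonal pattern that sums to zero, so a window of four periods cancels the seasonal part exactly. The function name and the data are assumptions for illustration.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

# Synthetic data: trend 10 + t, plus seasonal offsets [2, -1, -2, 1] repeating.
seasonal = [2, -1, -2, 1]
series = [10 + t + seasonal[t % 4] for t in range(12)]

# A window equal to the seasonal period removes the seasonal component.
trend = moving_average(series, window=4)

# Step between consecutive smoothed values recovers the slope of the trend.
steps = [b - a for a, b in zip(trend, trend[1:])]
```

On real data the seasonal pattern is not perfectly regular and random fluctuations remain, so the smoothed series only approximates the trend.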
Data Mining Industry and Applications • Industry has grown rapidly over the past few years • From research prototypes to data mining products, languages, and standards • IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS SQL Server 2000, etc. • A few data mining languages and standards (esp. MS OLE DB for Data Mining). • Application achievements in many domains • Market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.
The data mining industry • Data mining is growing rapidly • R & D has seen huge increases • Applications have been broadened substantially • But not as rapidly as some may have hoped. Why not? • Value is easy to objectively measure • It is difficult to sell on hype alone, although they try! • Not off-the-shelf in nature • Need training, understanding, and customization • Definite learning curve associated with effective use • Benefit of effective use not seen immediately
Trends in data mining • Web mining (and incorporating data from outside the organization into the analysis of internal data) • Towards integrated data mining environments and tools • “Vertical” (or application-specific) data mining • Invisible data mining • Towards intelligent, efficient, and scalable data mining methods
Web Mining: A Rapidly Expanding area in Data Mining • Mine what the Web search engine finds • Automatic classification of Web documents • Discovery of authoritative Web pages, Web structures and Web communities • Meta-Web Warehousing: Web yellow page service • Web usage mining
Mining What the Web Search Engine Finds • Current Web search engines: • keyword-based, return too many answers, often of low quality, still miss a lot, are not customized, etc. • Data mining will help: • coverage: “Enlarge and then shrink,” using synonyms and conceptual hierarchies • better search primitives: user preferences/hints • linkage analysis: authoritative pages and clusters • customization: home page + Weblog + user profiles • Identification of “hub” pages
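Linkage analysis for authoritative and hub pages is often illustrated with the HITS algorithm (hubs and authorities). The slides do not say which method they have in mind, so using HITS here is an assumption, and the tiny link graph and page names are invented. A page's authority score sums the hub scores of pages linking to it; its hub score sums the authority scores of pages it links to.

```python
import math

def hits(links, iterations=20):
    """Iteratively compute hub and authority scores for a directed link graph."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of hub scores of the pages that link to p.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # Hub: sum of authority scores of the pages that p links to.
        hub = {p: sum(auth[t] for t in links.get(p, ())) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth

# Hypothetical link graph: page -> pages it links to.
links = {
    "portal": ["docs", "news"],
    "directory": ["docs", "news", "blog"],
    "blog": ["docs"],
}
hub, auth = hits(links)
best_authority = max(auth, key=auth.get)
best_hub = max(hub, key=hub.get)
```

In this graph `docs` is linked from every other linking page and `directory` points at the most pages, so they come out as the top authority and top hub respectively.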
A Layered Meta-Web Architecture • Layer n: More Generalized Descriptions • ... • Layer 1: Generalized Descriptions • Layer 0
Importance of Constructing Multi-Layer Meta Web • Benefits of Multi-Layer Meta-Web: • Multi-dimensional Web info summary analysis • Approximate and intelligent query answering • Web high-level query answering (WebSQL, WebML) • Web content and structure mining • Observing the dynamics/evolution of the Web • Is it realistic to construct such a meta-Web? • It benefits even if it is partially constructed • The benefit may justify the cost of tool development, standardization, and partial restructuring
Web Usage (Click-Stream) Mining • Web-log provides rich information about Web dynamics • Multidimensional Web-log analysis: • disclose potential customers, users, markets, etc. • Plan mining (mining general Web accessing regularities): • Web linkage adjustment, performance improvements • Trend analysis: • Dynamics of the Web: what has been changing? • Customized to individual users
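A first step in mining Web accessing regularities is to count how often one page follows another in the click stream. The sketch below assumes a simplified log of `(user, timestamp, page)` tuples; that record layout, the sample clicks, and the function name are hypothetical, and real Web logs need parsing and session splitting first.

```python
from collections import Counter, defaultdict

def transition_counts(clicks):
    """Count page-to-page transitions within each user's time-ordered clicks."""
    sessions = defaultdict(list)
    for user, _timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        sessions[user].append(page)
    counts = Counter()
    for pages in sessions.values():
        counts.update(zip(pages, pages[1:]))  # consecutive page pairs
    return counts

# Hypothetical click-stream records: (user, timestamp, page).
clicks = [
    ("u1", 1, "/home"), ("u1", 2, "/products"), ("u1", 3, "/cart"),
    ("u2", 1, "/home"), ("u2", 2, "/products"),
    ("u3", 5, "/home"), ("u3", 6, "/help"),
]

counts = transition_counts(clicks)
most_common_path = counts.most_common(1)[0]
```

Frequent transitions like these are the raw material for the linkage adjustments and performance improvements mentioned above, for example adding a direct link along a heavily used path.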
Intelligent Tools for Data Mining • Integration of users and mining algorithms paves the way to intelligent mining • Smart interface brings intelligence • Easy to use, understand and manipulate • One picture may be worth 1,000 words • Visual and audio data mining • Towards self-tuning, self-managing, self-triggering data mining