Lecture 16

Lecture 16.     Data Mining. Microsoft Synchronization Services for ADO.NET

kottley

Presentation Transcript


  1. Lecture 16     Data Mining

  2. Microsoft Synchronization Services for ADO.NET
     Microsoft Synchronization Services for ADO.NET* provides the ability to synchronize data from disparate sources over two-tier, N-tier, and service-based architectures. It is a set of DLLs that provides a composable API. The Synchronization Services API provides a set of components to synchronize data between data services and a local store.
     Synchronization Services uses a hub-and-spoke model. All changes from each client are synchronized with the server before the changes are sent from the server to other clients (clients do not exchange changes directly with each other).
     Synchronization Services provides snapshot, download-only, upload-only, and bidirectional synchronization:
     Snapshot and download-only synchronization are typically used to store and update reference data, such as a product list, on a client. Data changes that are made at the server are downloaded to the client database during synchronization. Snapshot synchronization refreshes data every time that the client is synchronized. Download-only synchronization downloads only the incremental changes that have occurred since the previous synchronization.
     Upload-only synchronization is typically used to insert data, such as a sales order, on a client.
     Bidirectional synchronization is typically used for data that can be updated at the client and server. Any conflicting changes must be handled during synchronization.
     *ADO: ActiveX Data Objects. http://msdn.microsoft.com/en-us/library/bb725998.aspx

  3. Client & Server Database The client database for Synchronization Services applications is SQL Server Compact 3.5. Synchronization Services provides an infrastructure to track incremental changes in the client database. This infrastructure is enabled the first time any table is synchronized by using a method other than snapshot synchronization. The server database can be any database for which an ADO.NET provider is available.

  4. ADO.NET ADO.NET is a set of computer software components that can be used by programmers to access data and data services. It is a part of the base class library that is included with the Microsoft .NET Framework. It is commonly used by programmers to access and modify data stored in relational database systems, though it can also be used to access data in non-relational sources. Functionality exists in the Visual Studio IDE to create specialized subclasses of the DataSet classes for a particular database schema, allowing convenient access to each field through strongly-typed properties. http://en.wikipedia.org/wiki/ADO.NET

  5. Defining Data Mining http://www.thearling.com/

  6. A Sample Problem http://www.thearling.com/

  7. A Solution http://www.thearling.com/

  8. The Big Picture http://www.thearling.com/

  9. Tools of Modern Data Mining The "What animal am I thinking of?" game. If it walks like a duck and talks like a duck... If you've got tons of data and no clue what to do, who you gonna call? Neural Nets!! Genetic Algorithms would fit here as well. Clustering without numbers: let the data group itself. http://www.thearling.com/

  10. Data Mining Exposed
     Basically, data mining is the application of standard pattern classification techniques to the detection, extraction, and interpretation of information from very large and diverse data sources. The more someone describes data mining as extraordinary and revolutionary, the less that person understands it.
     Data analysis and pattern classification can be divided into three major levels of operation:
     Signal Level - At this level we separate the signals (data elements) of interest from the background clutter (noise). What is signal and what is noise depends on the application.
     Syntactic Level - The initial task at this level (or the final task of the previous level) is to reduce the amount of data needed to represent the information. The end result is a relatively small list of features that describe the characteristics of the objects of interest.
     Semantic Level - The goal of semantic-level processing is to extract knowledge (understanding) from the data collected. The relationships between the syntactic elements and the context in which they appear (situational awareness) permit us to generate an explanation of the observations.

  11. Rule Induction
     Rule induction is an area of machine learning in which formal rules are extracted from a set of observations. The rules extracted may represent a full scientific model of the data, or merely represent local patterns in the data.
     Some popular methods and tools related to rule induction:
     Association rule algorithms - {onions, potatoes} -> {beef}. Where's the beef?
     Decision rule algorithms - Such as those based on Bayes' rule.
     Hypothesis testing algorithms - The difference between coincidence and causality.
     Inductive Logic Programming - Who's your uncle?
     Version spaces - Separating the wheat from the chaff.
     http://en.wikipedia.org/wiki/Rule_induction
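As a rough illustration of how an association rule such as {onions, potatoes} -> {beef} is usually scored, the sketch below computes its support and confidence. The baskets are invented for this example and the function names are hypothetical; nothing here comes from the lecture itself.

```python
from typing import FrozenSet, List, Set

# Hypothetical market baskets, made up purely for illustration.
BASKETS: List[FrozenSet[str]] = [
    frozenset({"onions", "potatoes", "beef"}),
    frozenset({"onions", "potatoes", "beef", "milk"}),
    frozenset({"onions", "potatoes"}),
    frozenset({"potatoes", "milk"}),
    frozenset({"beef", "milk"}),
]


def support(itemset: Set[str], baskets: List[FrozenSet[str]]) -> float:
    """Fraction of baskets that contain every item in itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent: Set[str], consequent: Set[str],
               baskets: List[FrozenSet[str]]) -> float:
    """Estimated P(consequent | antecedent) over the baskets."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


rule_support = support({"onions", "potatoes", "beef"}, BASKETS)
rule_confidence = confidence({"onions", "potatoes"}, {"beef"}, BASKETS)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

An association rule algorithm would search over many candidate itemsets and keep only the rules whose support and confidence exceed chosen thresholds; this sketch scores a single rule.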

  12. Bayes' Formula Suppose that E is an event from a sample space S and that F1, F2, ..., Fn are mutually exclusive events such that the union of all Fi = S, and that p(E) > 0 and p(Fi) > 0 for all i. Bayes' formula states: $$p(F_j \mid E) = \frac{p(E \mid F_j)\,p(F_j)}{\sum_{i=1}^{n} p(E \mid F_i)\,p(F_i)}$$ That is, the probability that Fj is true given that E has occurred is the probability that E is true given that Fj has occurred, times the probability that Fj is true, divided by the total probability of E. [Figure: the partition F1, F2, F3, ..., Fi, Fj, ..., Fn of S, with E and p(Fj|E) labeled.]

  13. The Willies A medical journal announces the availability of a new diagnostic test. The announcement states the following: "An incredibly accurate indicator for the presence of the Willies has recently been developed by Hokes Laboratories that will give a positive reading on an infected patient with probability 0.998 and has a false positive reading in only 2 out of 1000 patients. This modern miracle will revolutionize..." Even though you know that only 1 in 10,000 people in the world have the disease, you have long suspected that you have the Willies. You rush to your doctor and demand to be tested using the new Hokes testing method. Confirming your suspicions, the test comes back positive. Based on this one test, what is the probability that you have the Willies? In this example, the probability space is partitioned into two regions: A1 = you have the Willies, and A2 = you do not have the Willies. You also know that 1 in 10,000 persons in the population actually has the Willies, which is equivalent to an a priori (prior) probability of 0.0001 that you are infected. This means that the a priori probability that you do not have the Willies is 0.9999.

  14. The Willies (continued)
     P(A1) - the probability that a person, chosen at random from the general population, has the Willies
     P(A2) - the probability that a person, chosen at random from the general population, does not have the Willies
     P(B|A1) - the probability that the Hokes test will return a positive result, given that a person has the Willies
     P(B|A2) - the probability that the Hokes test will return a positive result, given that a person does not have the Willies
     Since the test is positive, B is true in our example. In this case we can assign values to the following probabilities:
     P(A1) = 0.0001
     P(A2) = 0.9999
     P(B|A1) = 0.998
     P(B|A2) = 0.002
     Before you continue, take a moment to be sure you understand the meaning of each of these probabilities. Now we can determine the probability that you actually have the Willies by direct application of Bayes' rule:
     $$P(A_1 \mid B) = \frac{P(B \mid A_1)\,P(A_1)}{P(B \mid A_1)\,P(A_1) + P(B \mid A_2)\,P(A_2)} = \frac{(0.998)(0.0001)}{(0.998)(0.0001) + (0.002)(0.9999)} = \frac{0.0000998}{0.0020996} \approx 0.0475$$
     In other words, based on the results of a single test, there is less than a 5% chance that you actually have the Willies.
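The Willies calculation can be checked numerically. The sketch below is a minimal Python version of Bayes' formula for a finite partition; the function name `posterior` and its argument layout are illustrative choices, not part of the lecture.

```python
from typing import Sequence


def posterior(priors: Sequence[float], likelihoods: Sequence[float], j: int) -> float:
    """Return p(F_j | E) given the priors p(F_i) and the likelihoods p(E | F_i)."""
    # Total probability of E, summed over the whole partition.
    total = sum(lik * prior for lik, prior in zip(likelihoods, priors))
    return likelihoods[j] * priors[j] / total


# Values from the Willies example: A1 = infected, A2 = not infected.
priors = [0.0001, 0.9999]        # P(A1), P(A2)
likelihoods = [0.998, 0.002]     # P(B|A1), P(B|A2)

p_willies = posterior(priors, likelihoods, 0)
print(f"P(A1|B) = {p_willies:.4f}")  # prints "P(A1|B) = 0.0475"
```

The small prior dominates the result: even with a 0.2% false positive rate, false positives from the healthy majority far outnumber true positives from the rare infected group.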

  15. First-Order Inductive Logic Programming (ILP)
     The purpose of ILP is to infer rules such as
     uncle(X,Y) :- brother(X,Z), parent(Z,Y).
     uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y).
     given lots of instance data such as
     uncle(tom,frank) uncle(bob,john)
     not uncle(tom,cindy) not uncle(bob,tom)
     parent(bob,frank) parent(cindy,frank) parent(alice,john) parent(tom,john)
     brother(tom,cindy) sister(cindy,tom)
     husband(tom,alice) husband(bob,cindy)
     Relational Data Mining with Inductive Logic Programming for Link Discovery, R. J. Mooney, submitted to Data Mining: Next Generation Challenges and Future Directions, H. Kargupta and A. Joshi (eds.), AAAI/MIT Press
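An ILP system searches for rules that cover the positive examples and none of the negative ones. The sketch below does not perform that search; it only checks, in plain Python, that the two uncle rules on this slide are consistent with the slide's instance data. The set and function names are illustrative.

```python
# Background facts from the slide, stored as (first, second) argument pairs.
PARENT = {("bob", "frank"), ("cindy", "frank"), ("alice", "john"), ("tom", "john")}
BROTHER = {("tom", "cindy")}
SISTER = {("cindy", "tom")}
HUSBAND = {("tom", "alice"), ("bob", "cindy")}

# Labeled examples of the target relation uncle(X, Y).
POSITIVE = {("tom", "frank"), ("bob", "john")}
NEGATIVE = {("tom", "cindy"), ("bob", "tom")}


def derive_uncles() -> set:
    """Apply both candidate rules to the facts and collect every derived uncle(X, Y)."""
    derived = set()
    # Rule 1: uncle(X,Y) :- brother(X,Z), parent(Z,Y).
    for x, z in BROTHER:
        for z2, y in PARENT:
            if z == z2:
                derived.add((x, y))
    # Rule 2: uncle(X,Y) :- husband(X,Z), sister(Z,W), parent(W,Y).
    for x, z in HUSBAND:
        for z2, w in SISTER:
            if z != z2:
                continue
            for w2, y in PARENT:
                if w == w2:
                    derived.add((x, y))
    return derived


derived = derive_uncles()
covers_all_positives = POSITIVE <= derived
covers_no_negatives = not (NEGATIVE & derived)
print(sorted(derived), covers_all_positives, covers_no_negatives)
```

On this data the first rule derives uncle(tom,frank) and the second derives uncle(bob,john), so together they cover both positive examples and neither negative one.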

  16. Version Spaces A version space in concept learning or induction is the subset of all hypotheses that are consistent with the observed training examples. This set contains all hypotheses that have not been eliminated as a result of being in conflict with observed data. GB represents the most general hypothesis that has not been contradicted. SB represents the most specific hypothesis that is consistent with all positive observations. The version space is represented by the region between SB and GB. http://en.wikipedia.org/wiki/Version_spaces
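For a very small hypothesis space, the version space can be found by brute force: enumerate every hypothesis and keep those that agree with all training examples. The sketch below assumes a made-up two-attribute domain and conjunctive hypotheses with "?" as a wildcard; the attributes, examples, and names are hypothetical and not taken from the lecture.

```python
from itertools import product
from typing import Dict, List, Tuple

# Hypothetical attribute domains and labeled examples, invented for illustration.
DOMAINS: Dict[str, List[str]] = {"sky": ["sunny", "rainy"], "wind": ["strong", "weak"]}
EXAMPLES: List[Tuple[Tuple[str, str], bool]] = [
    (("sunny", "strong"), True),    # positive observation
    (("rainy", "strong"), False),   # negative observation
]


def matches(hypothesis: Tuple[str, ...], instance: Tuple[str, ...]) -> bool:
    """A hypothesis covers an instance if each slot is '?' or equals the instance value."""
    return all(h == "?" or h == x for h, x in zip(hypothesis, instance))


def version_space(domains, examples) -> List[Tuple[str, ...]]:
    """Return every hypothesis that classifies all examples correctly."""
    choices = [values + ["?"] for values in domains.values()]
    return [h for h in product(*choices)
            if all(matches(h, x) == label for x, label in examples)]


space = version_space(DOMAINS, EXAMPLES)
# Fewer wildcards means more specific; more wildcards means more general.
most_specific = min(space, key=lambda h: h.count("?"))
most_general = max(space, key=lambda h: h.count("?"))
print(space, most_specific, most_general)
```

Here the most specific and most general survivors play the roles of SB and GB. Practical methods such as candidate elimination maintain only those two boundaries instead of enumerating the whole space.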

  17. Data Mining in the Weeds

  18. When the Target & Non-Target Groups Have ~= Means In addition to load reduction, a non-parametric classifier is useful when the feature means of the non-target objects are co-located with the feature means of the target objects. [Figure: PDFs of the tgt group and the non-tgt group, with minval, maxval, and mean marked.]

  19. Distribution in Feature Space [Figure: four plots titled Perimeter/Bounding Box, Orientation, Area, and Perimeter; axis labels include Fraction, Pixels, and Square Pixels; horizontal axes run from 0 to 40.]

  20. Feature Set (concluded)
     [Figure: two plots titled Width and Length; axis labels include Pixels and Length; horizontal axes run from 0 to 40.]
     Feature ranges:
              hue      length   width    area    perimeter  orient   per/box
     minval   79.8889  57.28    6.19055  449.5   129.443    30.9829  0.201416
     maxval   98.5     243.988  15.5095  2781.5  563.21     148.958  0.803482
     Results (small3):
              BG     nonBG
     BG       23/23  0/23
     nonBG    1/492  491/492

  21. Summary
     Data Mining
     Microsoft Synchronization Services for ADO.NET
     Tools of Data Mining: Decision Trees, Nearest Neighbor Classification, Neural Net & Genetic Algorithm, Rule Induction, K-Means Clustering
     Rule Induction: Association rule, Decision rule, Hypothesis testing, Inductive Logic Programming, Version spaces
