

INDUCTIVE DATA BASES and KNOWLEDGE SCOUTS. Ryszard S. Michalski, Machine Learning and Inference Laboratory, School of Computational Sciences, George Mason University. Topics: Data Mining vs. Knowledge Mining; What is an Inductive Database; The INLEN Methodology; A Simple Medical Knowledge Scout.





Presentation Transcript


  1. INDUCTIVE DATA BASES and KNOWLEDGE SCOUTS. Ryszard S. Michalski, Machine Learning and Inference Laboratory, School of Computational Sciences, George Mason University. PLAN: • Data Mining vs. Knowledge Mining • What is an Inductive Database • The INLEN Methodology • A Simple Medical Knowledge Scout • Conclusion • Demo of Natural Induction

  2. MOTIVATION: Data Mining vs. Knowledge Mining. “All human beings desire to know” (Aristotle, Metaphysics, I.1). • The widespread use of databases and the ubiquity of the internet have created extraordinary opportunities for monitoring, analyzing, and predicting processes in many areas of human activity, such as the economy, medicine, agriculture, science, and defense. • Current data mining technologies are, however, insufficient for such tasks; there is a need for new methods capable of “knowledge mining,” which can be approximately described as a process of deriving goal-oriented knowledge from data by employing advanced methods of inference and prior knowledge. • The development of effective methods and systems for knowledge mining emerges as a central challenge on the research agenda for the 21st century.

  3. Evolution of Database Technologies • Conventional databases: provide answers only if the answers are already in the DB. Research concerns problems of efficiency of storage and retrieval, and of extending capabilities, such as handling flexible queries (e.g., Bosc & Pivert, 92), vague queries (e.g., Motro, 88; D’Atri & Tarantino, 89), answer annotations (Motro, 96); developing mediation systems (Wiederhold, 96); quasi-cubes (Barbara & Sullivan, 98) • Deductive databases: add to a database the capability of drawing deductive consequences from data using rule bases (e.g., Minker, 87; Zaniolo, 92) • Data mining and knowledge discovery: current methods can be viewed as initial steps in building modern tools for deriving knowledge from data; they use very simple knowledge representations, are not integrated with DBMSs, and employ only very limited domain knowledge (Agrawal et al., 98; Michalski & Kaufman, 98) • A natural next step, inductive databases: aim at answering queries that require drawing inductive inferences and deriving plausible conclusions; databases are integrated with knowledge bases and inductive inference capabilities.

  4. Steps in Knowledge Discovery: The Current Approach. Data → Selection → Target data → Preprocessing → Pre-processed data → Transformation → Transformed data → Data mining → Patterns/Knowledge → Evaluation & Interpretation → Tested Knowledge

  5. Popular Techniques • Decision trees • Rule induction • Neural networks • Nearest neighbors • Evolutionary computation • Clustering techniques • Bayesian belief networks

  6. Strengths and Weaknesses of Current Methods. Strengths: • Simple to implement • Easy to use • Efficient to apply • Can scale up. Weaknesses: • Able to find only simple patterns • Not integrated with the database • Unable to employ sophisticated domain knowledge • Knowledge generated may be difficult to interpret

  7. A Motivational Example. The U.S. trade records show that in the early 1980s the import of trucks from Japan sharply declined while the import of auto parts significantly increased. The significance of these facts was only noticed several years later, when it was determined that Japanese companies were assembling trucks in the U.S. to avoid a high U.S. tariff on imported trucks. This discovery led to a new trade agreement between the U.S. and Japan. How could such a discovery be made by a data mining system? A data mining system would have to be able to conduct inductive reasoning with abstract concepts (goals), to represent and reason with background domain knowledge, and to have access to relevant facts. Building such a system requires an integration of several technologies, such as databases, knowledge bases, and advanced methods of inductive reasoning.

  8. Research Challenges • Need for effectively and efficiently usingboth data and background knowledge, and reasoning with more advanced knowledge representations than currently used • Since data may be insufficiently relevant to the task, there is a need for constructive induction • In many applications, knowledge generated needs to be easy to interpret and understand: this calls for natural induction • The process of knowledge discovery needs to be more automated: this leads to the concept of a knowledge scout.

  9. Knowledge Mining. To address such challenges we introduced the concept of knowledge mining, defined as a process of deriving knowledge from data that satisfies three conditions: • It is guided by goals (knowledge needs) of the user that can be defined abstractly • It strives to generate knowledge in “natural forms,” that is, in forms similar to those used by human experts, such as NL descriptions and/or graphical representations; this facilitates its understandability and ease of interpretation • It involves a significant amount of prior knowledge.

  10. What is an Inductive Database? • An inductive database is an approach to knowledge mining • It closely integrates a DB with a KB and with methods for inductive inference and hypothesis testing • Requirements: • Integrates diverse learning and discovery methods into a unified knowledge mining and maintenance environment • The methods can be invoked automatically according to scripts, called knowledge scouts, defined in a high-level knowledge generation language • Employs general and domain-specific background knowledge in the process of knowledge mining • Generates knowledge in forms understandable and easy to interpret, by employing a logic-style language and visualization • Can test the generated knowledge and hypotheses, and use them to support decision-making processes.

  11. Another View of an Inductive Database “An inductive database contains data and inductive generalizations about the data” “The goal for inductive databases is that in addition to the facts, the database will contain a potentially infinite set of induced rules” Heikki Mannila “Inductive Databases and Condensed representations for Data Mining” Proceedings of the International Logic Programming Symposium, Jan Maluszynski (ed.), MIT Press, 1997

  12. A General Schema of an Inductive Database. Knowledge Scouts 1, 2, ..., n operate on top of the Inductive and Conventional Database Operators, which connect the Database with the Knowledge Base (domain knowledge and user models).

  13. Knowledge Mining: A Goal-oriented, Knowledge-intensive Derivation of Comprehensible Knowledge from Data. Inputs: Domain Knowledge, Natural Induction, Knowledge Generation Language. Process: Data → Selection → Target data → Preprocessing → Pre-processed data → Transformation → Transformed data → Knowledge generation → Patterns/Knowledge → Evaluation & Interpretation → Tested Knowledge

  14. INLEN: The First Step Toward an IDB. DATA MANAGEMENT OPERATORS (applied to the DB) and KNOWLEDGE MANAGEMENT OPERATORS (applied to the KB): SELECT, CREATE, PROJECT, INSERT, JOIN, CHANGE, COMBINE, DELETE, INTERSECT. KNOWLEDGE GENERATION OPERATORS: TRANSFORM, GENRULE, GENEQ, GENTREE, GENHIER, GENEVE, GENATR, TEST/APPLY, ANALYZE, VISUALIZE.

  15. Basic Forms of Inference (where P, BK, and C can each be a single sentence or a set of sentences): • Deduction: given P and BK, derive C • Induction: given C and BK, hypothesize P • Analogy: if P’ is similar to P, hypothesize C’ similar to C

  16. NATURAL INDUCTION: The AQ Learning Approach • AQ (Algorithm Quasi-optimal) initiated the “separate-and-conquer” (a.k.a. progressive covering) method for rule induction • The recent version, AQ-20, is arguably the most advanced current rule learning system • Given a set of concept examples and counter-examples, it creates a task-optimized description of the examples in the form of rules in attributional calculus • Unlike conventional attribute-value or attribute-rel-value rules, attributional calculus uses more elaborate conditions, such as: color = red or blue or green; x = 2..8; x1 = x2; x1 & x3 ≥ x5; #VarsIn{x1,x2,x4,x6,x7}=2 ≥ 3 (at least three of the listed variables equal 2) • By using a more expressive description language, AQ learning is more complex, but may be able to discover simple, easy-to-understand patterns that conventional rule or decision tree learning programs may not be able to discover • AQ rules can be optimized according to different combinations of criteria • AQ rules can be evaluated using an exact match or a flexible match
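The extended condition types listed above can be made concrete with a small sketch. This is illustrative only, not AQ-20's actual representation: examples are dicts of attribute values, and the ambiguous form "x1 & x3 ≥ x5" is read here as "both x1 and x3 are ≥ x5".

```python
# Sketch: evaluating attributional-calculus-style conditions on one example,
# represented as a dict of attribute values. Names are illustrative, not the
# AQ-20 API; "x1 & x3 >= x5" is read as "both x1 and x3 are >= x5".

def eval_conditions(ex):
    c1 = ex["color"] in {"red", "blue", "green"}         # color = red or blue or green
    c2 = 2 <= ex["x"] <= 8                               # x = 2..8
    c3 = ex["x1"] == ex["x2"]                            # x1 = x2
    c4 = ex["x1"] >= ex["x5"] and ex["x3"] >= ex["x5"]   # x1 & x3 >= x5
    # counting condition: at least 3 of the listed variables equal 2
    c5 = sum(ex[v] == 2 for v in ("x1", "x2", "x4", "x6", "x7")) >= 3
    return all((c1, c2, c3, c4, c5))

example = {"color": "red", "x": 5, "x1": 2, "x2": 2, "x3": 2,
           "x4": 2, "x5": 1, "x6": 0, "x7": 9}
```

Note how each attributional condition becomes an ordinary boolean test; the expressiveness comes from allowing value disjunctions, ranges, attribute comparisons, and counts inside a single rule condition.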

  17. Example of an Attributional Rule • Consider a rule: If x1 ≤ x2, x3 ≠ x4, and x3 is red or blue, then the decision is A (1) • If the variables xi, i = 1,2,3,4, are five-valued, then representing (1) would require a decision tree with 810 leaves and 190 nodes, or 600 conventional rules • A logically equivalent attributional calculus rule is: [Decision = A] <= [x1 ≤ x2] & [x3 ≠ x4] & [x3 = red v blue] (2) • To provide a user with more information about the rule, AQ adds annotations to the rule: [Decision = A] if [x1 ≤ x2 : 3899, 266] & [x3 ≠ x4 : 803, 19] & [x3 = red or blue : 780, 40] (t=750, u=700, n=14, f=4, q=.9), where: t - the total number of examples covered by the rule (rule coverage); u - the number of examples covered only by this rule, and not by any other rule associated with Decision=A; n - the number of negative examples covered by the rule (“negative coverage”); f - the number of examples in the training set matched flexibly; q - the rule quality, combining coverage and training accuracy gain
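The annotation counts can be reproduced on a toy dataset. In this sketch (helper names invented; the quality measure q and the flexible-match count f are omitted), a rule is a list of boolean conditions, and t, u, n are computed exactly as defined above:

```python
# Sketch: computing the t / u / n annotations of a rule on a toy dataset.
# t = positives covered by the rule, u = positives covered by this rule and
# no other rule of the same class, n = negatives covered. Names illustrative.

def covers(rule, ex):
    return all(cond(ex) for cond in rule)

def annotate(rule, other_rules, pos, neg):
    covered = [ex for ex in pos if covers(rule, ex)]
    t = len(covered)
    u = sum(1 for ex in covered
            if not any(covers(r, ex) for r in other_rules))
    n = sum(1 for ex in neg if covers(rule, ex))
    return {"t": t, "u": u, "n": n}

# toy rule: [x1 <= x2] & [x3 = red v blue]
rule = [lambda ex: ex["x1"] <= ex["x2"],
        lambda ex: ex["x3"] in {"red", "blue"}]
pos = [{"x1": 1, "x2": 2, "x3": "red"},
       {"x1": 3, "x2": 1, "x3": "blue"},
       {"x1": 0, "x2": 5, "x3": "green"}]
neg = [{"x1": 1, "x2": 1, "x3": "blue"}]
```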

  18. AQ Algorithm (the simplest version). Input: Pos, Neg, LEF. Repeat until Pos = 0: select seed e from Pos; generate star G(e/N); select the LEF-best rule R from G(e/N); Pos = Pos - R-covered. Output: Cov(Pos/Neg)
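The loop above can be written out in a few lines. In this sketch the star generation is a deliberately naive stand-in (single-condition rules built from the seed's own attribute values, not AQ's real star-generation procedure), and the LEF is replaced by plain positive-minus-negative coverage:

```python
# Simplified separate-and-conquer loop in the shape of the AQ flowchart:
# pick a seed, build a star of candidate rules excluding the negatives,
# keep the best rule, remove the positives it covers, repeat until none
# remain. A rule here is a dict of required attribute values.

def satisfies(ex, rule):
    return all(ex[a] == v for a, v in rule.items())

def generate_star(seed, neg):
    # naive stand-in for AQ star generation: single-condition rules from
    # the seed that cover no negative example
    star = [{a: seed[a]} for a in seed
            if not any(satisfies(nx, {a: seed[a]}) for nx in neg)]
    return star or [dict(seed)]   # fall back to the full seed description

def aq_cover(pos, neg):
    rules, pos = [], list(pos)
    quality = lambda r: (sum(satisfies(e, r) for e in pos)
                         - sum(satisfies(e, r) for e in neg))
    while pos:
        star = generate_star(pos[0], neg)   # seed = first uncovered positive
        best = max(star, key=quality)       # LEF stand-in: best net coverage
        rules.append(best)
        pos = [e for e in pos if not satisfies(e, best)]
    return rules
```

On consistent data the loop always terminates, because every rule covers at least its own seed, so Pos shrinks on each iteration.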

  19. Knowledge Scouts • Knowledge scouts utilize resources of an inductive database to create personal knowledge agents that synthesize knowledge of interest to a particular user • They “live” in an inductive database, in the sense that they may continuously and autonomously search for knowledge of interest in the database (which can be distributed). In the course of their existence, they learn about the patron’s interests, habits, and past experiences, and use that knowledge in knowledge synthesis and maintenance • They execute abstract scripts defined by a patron using a knowledge generation language (KGL) • For example, a knowledge scout may be designed to continuously monitor a changing database and hypothesize knowledge of interest.

  20. Basic Functions • Knowledge generation -- searches for desirable pieces of knowledge by executing the scout’s script • Knowledge testing -- tests the generated knowledge • Data or knowledge visualization -- creates visual structures (generalized logic diagrams--GLDs, or concept association graphs--CAGs) representing the found patterns • User communication -- communicates with users in ways natural to them (NL and/or graphical forms).
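A minimal sketch of such a scout's monitor-and-relearn cycle, with invented names throughout (this is not the INLEN/KGL interface): the scout watches a data source, re-runs its learning step when the data changes, and keeps only the rules meeting the patron's quality threshold.

```python
# Sketch of a knowledge scout's monitoring loop. `learn` stands in for an
# inductive operator returning (rule, quality) pairs; all names here are
# illustrative, not part of INLEN or KGL.

class KnowledgeScout:
    def __init__(self, learn, min_quality):
        self.learn = learn
        self.min_quality = min_quality
        self.last_snapshot = None
        self.knowledge = []

    def observe(self, data):
        snapshot = tuple(sorted(map(str, data)))
        if snapshot == self.last_snapshot:
            return self.knowledge        # data unchanged: keep old knowledge
        self.last_snapshot = snapshot
        self.knowledge = [(rule, q) for rule, q in self.learn(data)
                          if q >= self.min_quality]   # patron's threshold
        return self.knowledge

scout = KnowledgeScout(
    learn=lambda data: [("size-of-data-rule", len(data) / 10)],
    min_quality=0.5)
```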

  21. Knowledge Generation Language: KGL-1 • Data mining programs typically require the user to guide the process and manually select a program to apply, given the outcome of the previous operation • This process can be very laborious, time-consuming, and prone to errors • To partially automate this process, we have developed a knowledge generation language, KGL-1, which allows a user to specify scripts for automated knowledge mining and to implement knowledge scouts.

  22. KGL Design Specifications • Individual operators of KGL can invoke programs for machine learning, inference, and related functions, automatically set their parameters, and call procedures for data processing, transformation, and statistical analysis • Looping and branching can be done on the basis of properties of the data and of the results of previous steps of data exploration • Specifically, conditional statements can be based on the size and type of the data, the presence of noise or missing values, changes in the data, the domain size and type of the attributes, the kinds of attribute values in the rule conditions, and the properties of the current rules or rulesets, e.g., their type, complexity, coverage, predictive accuracy, or other characteristics.
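Written out in plain Python, the kind of data-driven branching described above might look like this (the thresholds and helper names are invented for illustration, not taken from KGL-1):

```python
# Sketch: branching on properties of the data and of previously learned
# rules, as a KGL script would. Data is a list of rows (None = missing
# value); a rule is a list of conditions. All thresholds are invented.

def missing_ratio(data):
    cells = [v for row in data for v in row]
    return sum(v is None for v in cells) / len(cells)

def choose_next_step(data, rules):
    if missing_ratio(data) > 0.2:
        return "impute-missing-values"       # too many missing values
    if not rules:
        return "learn-initial-ruleset"       # no knowledge yet
    avg_conds = sum(len(r) for r in rules) / len(rules)
    if avg_conds > 5:
        return "relearn-with-simplicity-bias"  # rules too complex to read
    return "test-and-report"
```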

  23. An Application to MedicineDiscovering Relationships among Diseases and Lifestyles • The target dataset consists of nearly 75,000 survey reports selected from a database collected by the American Cancer Society (ACS) • Characteristics of the target dataset: • Describes white male non-smokers, aged 50-65, in terms of 35 attributes • Includes lifestyle and personal information (e.g., nightly sleep, mouthwash use, level of exercise, rotundity) • Represents 25 diseases, such as stroke, kidney stones, high blood pressure, gall stones, bladder disease, and others.

  24. Defining the LDR Knowledge Scout for Determining Lifestyle-Disease Relationships
begin
open CANCER {Select target set}
do SELEVE(criterion = random(.1), out = CANCER-STUDY) {Randomly select about 1/10 of the examples and put them in table CANCER-STUDY}
for i = 0 to 32 {Repeat for each decision attribute in the target dataset}
do DIFFSET(decision=i, Q-value=.25,.5,.75) {Create a knowledge base by learning attributional rules for three different Q criteria}
if (i > 7) {Check if the decision is a disease attribute}
forall rules(decision-class=i, rulerank = 1) {For the strongest rule from each learning run that indicates the disease}
do VISUALIZE(rule, CAG) {Visualize the rule as a concept association graph}
integrate(CAG, CAGSET)
add-to(rule, RULESET) {Add the found rule to RULESET}
end
do COMBINEGRAPHS(CAGSET) {Build a combined concept association graph}

  25. An Example of a Learned Rule
Arthritis is present if:
1 [HighBloodPressure=present] & (pos: 432, neg: 1765)
  [education <= college_grad] (pos: 940, neg: 4529)
  (t: 355, u: 352, n: 1332, q: 0.136516)
2 [Stomach ulcer] & (pos: 104, neg: 299)
  [education = 8th_or_less..vocational, grad_school] & (pos: 768, neg: 3876)
  [exercise = none..medium] & (pos: 1003, neg: 5353)
  [y_i_n = 1..67] & (pos: 1114, neg: 6049)
  [Diabetes = no] & (pos: 1089, neg: 5939)
  [HighBloodPressure = no] (pos: 739, neg: 4475)
  (t: 51, u: 51, n: 92, q: 0.101225)
Conditions are annotated with their positive and negative coverage levels. Rules are annotated with their total, unique, and negative coverage levels, and with a description quality measure based on their completeness and consistency. These annotations serve as measures of strength.

  26. An Example of Data (selected columns)
Selected records and attribute values (from the 1171 positive and 6240 negative cases of arthritis, originally expressed in terms of 35 attributes):
#  rotundity  exercise  sleep  education     y_i_n  mouthwash  HBP  TB  Arthritis
1  low        medium    6      college_grad  15     yes        yes  no  yes
2  very_low   heavy     7      college_grad  37     yes        no   no  yes
3  low        medium    8      hs_grad       20     yes        no   no  no
4  average    medium    7      vocational    34     yes        yes  no  no
where: rotundity - a function of height and weight that characterizes body shape; sleep is measured in hours; y_i_n - years in neighborhood; HBP - high blood pressure; TB - tuberculosis
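The per-condition (pos, neg) annotations can be computed directly from records like these. A sketch over the four sample rows shown on this slide, with Arthritis as the decision attribute (attribute names abbreviated as in the table):

```python
# Sketch: computing a condition's (pos, neg) coverage annotation over the
# four sample records from the slide; Arthritis is the decision attribute.

records = [
    {"rotundity": "low", "exercise": "medium", "sleep": 6,
     "education": "college_grad", "y_i_n": 15, "mouthwash": "yes",
     "HBP": "yes", "TB": "no", "Arthritis": "yes"},
    {"rotundity": "very_low", "exercise": "heavy", "sleep": 7,
     "education": "college_grad", "y_i_n": 37, "mouthwash": "yes",
     "HBP": "no", "TB": "no", "Arthritis": "yes"},
    {"rotundity": "low", "exercise": "medium", "sleep": 8,
     "education": "hs_grad", "y_i_n": 20, "mouthwash": "yes",
     "HBP": "no", "TB": "no", "Arthritis": "no"},
    {"rotundity": "average", "exercise": "medium", "sleep": 7,
     "education": "vocational", "y_i_n": 34, "mouthwash": "yes",
     "HBP": "yes", "TB": "no", "Arthritis": "no"},
]

def condition_coverage(cond, records, decision="Arthritis"):
    pos = sum(1 for r in records if r[decision] == "yes" and cond(r))
    neg = sum(1 for r in records if r[decision] == "no" and cond(r))
    return pos, neg
```

For example, the condition [HBP = yes] covers one arthritis-positive and one arthritis-negative record here; on the full 7411-record dataset these counts become the (pos: 432, neg: 1765) style annotations shown in the learned rules.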

  27. Arthritis Rule Representation at Two Levels of Abstraction
Rule: Arthritis is associated with the following conditions:
[HighBloodPressure = yes] & (pos: 432, neg: 1765)
[Education <= college_grad] & (pos: 940, neg: 4529)
[Rotundity = low..very_high] & (pos: 1070, neg: 5578)
[y_i_n = 47..56] (pos: 1109, neg: 5910)
(t: 325, u: 257, n: 1156, q: 0.14)
Association graph (figure): High Blood Pressure, Education, Rotundity, and Years in Neighborhood linked to Arthritis, with links labeled by sign (+/-).

  28. A Hypothesized Concept Association Graph

  29. Corresponding Attributional Rules
Rule 1: Arthritis is associated with conditions: [High Blood Pressure: 432, 765] [Rotundity ≥ low: 1070, 5548] [Education ≤ college grad: 940, 4529] [Years in Neighborhood ≤ 47: 1109, 5910] (t: 325; u: 257; n: 1156; P: 1171, N: 6240)
Rule 2: Asthma is associated with condition: [Hay Fever present: 170, 787] (t: 170; u: 161; n: 787; P: 331, N: 7047)
Rule 3: Colon Polyps are associated with conditions: [Prostate problem: 34, 967] [Sleep = <5 or >9: 16, 515] [Years in Neighborhood = 8..12 or 45..67: 33, 1477] [Rotundity = average: 58, 2693] [Education ≤ some college: 83, 4146] (t: 5; u: 5; n: 0; P: 147, N: 7383)

  30. Rule 4: Diverticulosis is associated with conditions: [Arthritis = yes: 70, 1033] [Rotundity ≥ average: 170, 4202] [Stroke = no: 257, 7037] [Sleep = 7..9: 205, 5743] [Years in Neighborhood ≥ 12: 1109, 5910] [Education ≥ some college: 176, 4412] (t: 24; u: 21; n: 115; P: 262, N: 7117)
Rule 5: Hay Fever is associated with conditions: [Education ≥ vocational: 772, 4231] [Years in Neighborhood ≥ 53: 939, 6073] (t: 763; u: 721; n: 4141; P: 965, N: 6304)
Rule 6: Rectal Polyps are associated with conditions: [Prostate = yes: 73, 893] [Rotundity ≠ high: 275, 5967] [Mouthwash = yes: 194, 3509] [Education = some hs..some college or grad school: 252, 5246] [Years in Neighborhood = 2..41 or 56..63: 296, 6173] (t: 38; u: 30; n: 271; P: 334, N: 6951)
Rule 7: Stomach Ulcer is associated with conditions: [Arthritis: 107, 1041] [Education ≤ college grad: 305, 5276] [Exercise ≥ medium: 298, 5606] (t: 79; u: 67; n: 668; P: 367, N: 7108)

  31. Final Remark: The Importance of an Appropriate Specification of the Scout’s Mission • A scout scripted to discover patterns in Parent-Children survey data ran for four days and generated 72,203 rules for 415 decision classes. These rules occupied 30 MB of disk space. • Clearly, such a volume of rules is “indigestible.” This was a signal that the task was defined too generally. • After refining the task specification by adding more criteria, the scout generated only 217 rules.

  32. Application Areas • Diagnostic problems (medical, agricultural, technical) • Biochemistry and computational biology • Hypothesis formation from scientific databases • Earth observation and global change prediction • Image and video interpretation • Document classification and retrieval • Web information access and filtering • Business (insurance, investment, advertising, consumer retention) • Non-Darwinian evolutionary computation • Complex engineering design

  33. Acknowledgements This research is supported in part by the National Science Foundation under Grants No. IIS-9906858 and IIS-0097476, and in part by the UMBC/LUCITE #32 grant. Previous support that enabled this research was provided in part by the National Science Foundation under Grants No. IRI-9510644, CDA-9309725, IRI-9020266 and DMI-9496192; in part by the Defense Advanced Research Projects Agency under Grants No. F49620-95-1-0462, F49620-92-J-0549 and No. N00014-91-J-1354; and in part by the Office of Naval Research under Grant No. N00014-91-J-1351.

  34. For publications or additional information about the topics of the lecture, see www.mli.gmu.edu, or contact: Ryszard Michalski (michalski@gmu.edu) or Ken Kaufman (kaufman@gmu.edu)

  35. DEMO Illustrating Natural Induction: Emerald-AQ (can be downloaded from www.mli.gmu.edu, under MLI software)

  36. MACHINE LEARNING AND INFERENCE LABORATORY Projects • Natural Induction Systems (AQ20) • Inductive Databases and Knowledge Scouts (IDB) • Non-Darwinian Evolutionary Computation (LEM) • Computer Intrusion Detection (LUS) • Knowledge Visualization (KV and CAG) • Intelligent Systems for Engineering Design (ISHED) • Plausible Inference and Intelligent Guessing (DIH) • Multistrategy Learning (MSL) • Image Interpretation (MIST)

  37. END
