
CrossMine: Efficient Classification Across Multiple Database Relations




  1. CrossMine: Efficient Classification Across Multiple Database Relations Xiaoxin Yin, Jiawei Han, Jiong Yang University of Illinois at Urbana-Champaign Philip S. Yu IBM T. J. Watson Research Center

  2. Roadmap • Introduction, definitions • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  3. Introduction, definitions • Most real-world data are stored in relational databases • Multi-relational classification: building a classifier from information stored in multiple relations of a database • ILP approaches are the most widely used, but they are not scalable

  4. An Example: Loan Applications A customer applies for a loan; the bank consults the backend database to decide whether to approve it.

  5. The Backend Database How to make decisions on loan applications? The target relation is Loan: each tuple has a class label, indicating whether the loan is paid on time. The schema:
• Loan(loan-id, account-id, date, amount, duration, payment)
• Account(account-id, district-id, frequency, date)
• Order(order-id, account-id, bank-to, account-to, amount, type)
• Transaction(trans-id, account-id, date, type, operation, amount, balance, symbol)
• Disposition(disp-id, account-id, client-id, type)
• Card(card-id, disp-id, type, issue-date)
• Client(client-id, birth-date, gender, district-id)
• District(district-id, dist-name, region, #people, #lt-500, #lt-2000, #lt-10000, #gt-10000, #city, ratio-urban, avg-salary, unemploy95, unemploy96, den-enter, #crime95, #crime96)

  6. Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  7. Preliminaries • Target relation • Class labels • Predicates • Rules • Decision Trees • Searching for Predicates by Joins

  8. The problem The joined relation of Loan, Account, Order, and Transaction (“x-y” represents attribute y in relation x)

  9. Rule Generation • Search for good predicates across multiple relations [Figure: applicants #1–#4 linked through the Loan Applications, Accounts, Orders, Districts, and other relations]

  10. Previous Approaches • Inductive Logic Programming (ILP): to build a rule, repeatedly find the best predicate; to evaluate a predicate on relation R, first join the target relation with R • Not scalable, because of the huge search space (numerous candidate predicates) and the cost of evaluating each predicate • Example: to evaluate the predicate Loan(L, +) :- Loan(L, A, ?, ?, ?, ?), Account(A, ?, ‘monthly’, ?), first join the Loan relation with the Account relation • CrossMine is more scalable and more than one hundred times faster on datasets of reasonable size

  11. CrossMine: An Efficient and Accurate Multi-relational Classifier • Tuple-ID propagation: an efficient and flexible method for virtually joining relations • Confine the rule search process in promising directions • Look-one-ahead: a more powerful search strategy • Negative tuple sampling: improve efficiency while maintaining accuracy

  12. Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  13. Tuple ID Propagation Instead of performing a physical join, the IDs and class labels of target tuples can be propagated to the Account relation

  14. Tuple ID Propagation • Propagate the tuple IDs of the target relation to non-target relations • Virtually join relations to avoid the high cost of physical joins

Account ID | Frequency | Open date | Propagated IDs | Labels
124        | monthly   | 02/27/93  | 1, 2           | 2+, 0–
108        | weekly    | 09/23/97  | 3              | 0+, 1–
45         | monthly   | 12/09/96  | 4              | 0+, 1–
67         | weekly    | 01/01/97  | null           | 0+, 0–

• Possible predicates: Frequency = ‘monthly’: 2+, 1– • Open date < 01/01/95: 2+, 0–
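Tuple ID propagation can be sketched in a few lines of Python. This is a minimal illustration using hypothetical toy data mirroring slide 14, not CrossMine's actual code: each Account tuple accumulates the IDs and class labels of the Loan tuples that join to it, with no physical join materialized.

```python
from collections import defaultdict

# Hypothetical target relation Loan: (loan_id, account_id, class_label)
loans = [(1, 124, '+'), (2, 124, '+'), (3, 108, '-'), (4, 45, '-')]

def propagate(target):
    """Attach the IDs and class labels of target tuples to each joinable
    Account tuple, instead of materializing the physical join."""
    prop = defaultdict(lambda: {'ids': [], 'pos': 0, 'neg': 0})
    for tid, account_id, label in target:
        entry = prop[account_id]
        entry['ids'].append(tid)
        entry['pos' if label == '+' else 'neg'] += 1
    return prop

prop = propagate(loans)
# Account 124 now carries loan IDs [1, 2] with counts 2+, 0-
```

Account 67, which no loan joins to, gets no propagated IDs, matching the "null" row of the table.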

  15. Tuple ID Propagation (cont.) • Efficient • Only propagate the tuple IDs • Time and space usage is low • Flexible • Can propagate IDs among non-target relations • Many sets of IDs can be kept on one relation, which are propagated from different join paths

  16. Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  17. Overall Procedure • Sequential covering algorithm: while (enough target tuples left) { generate a rule; remove positive target tuples satisfying this rule } [Figure: positive examples partitioned into the sets covered by Rule 1, Rule 2, and Rule 3]
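The sequential covering loop above can be sketched as follows. This is a generic illustration, not CrossMine's implementation; `generate_rule` is a hypothetical callback standing in for the rule generation described on the next slide, and rules are modeled as plain predicates over tuples.

```python
def sequential_covering(positives, negatives, generate_rule, min_left=1):
    """Repeatedly generate a rule, then drop the positive tuples it covers."""
    rules, remaining = [], list(positives)
    while len(remaining) >= min_left:          # enough target tuples left
        rule = generate_rule(remaining, negatives)
        if rule is None:
            break
        rules.append(rule)
        # Remove the positive target tuples satisfying this rule
        remaining = [t for t in remaining if not rule(t)]
    return rules

# Toy usage: tuples are integers; each generated "rule" covers the numbers
# sharing the parity of the first remaining positive example.
def toy_generator(pos, neg):
    parity = pos[0] % 2
    return lambda t, p=parity: t % 2 == p

rules = sequential_covering([2, 4, 3, 5], [1, 6], toy_generator)
# Two rules suffice: one covering the evens, one covering the odds
```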

  18. Rule Generation • To generate a rule: while (true) { find the best predicate p; if foil-gain(p) > threshold then add p to the current rule, else break } Example: A3=1 → A3=1 && A1=2 → A3=1 && A1=2 && A8=5 [Figure: the growing rule progressively separating positive from negative examples]

  19. Evaluating Predicates • All predicates in a relation can be evaluated based on the propagated IDs • Use foil-gain to evaluate predicates • Suppose the current rule is r, and P(r), N(r) are the numbers of positive and negative target tuples satisfying r. For a predicate p, foil-gain(p) = P(r+p) · [ log( P(r+p) / (P(r+p) + N(r+p)) ) − log( P(r) / (P(r) + N(r)) ) ] • Categorical attributes: compute foil-gain directly • Numerical attributes: discretize, considering every possible split value
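Under the standard FOIL-gain definition (assumed here, since the slide only names the measure), the evaluation is a one-liner over the positive/negative counts that tuple ID propagation already provides:

```python
from math import log2

def foil_gain(pos_r, neg_r, pos_rp, neg_rp):
    """FOIL gain of appending predicate p to the current rule r.

    pos_r, neg_r:   positive / negative target tuples satisfying r
    pos_rp, neg_rp: those that also satisfy the extended rule r + p
    """
    if pos_rp == 0:
        return 0.0
    return pos_rp * (log2(pos_rp / (pos_rp + neg_rp))
                     - log2(pos_r / (pos_r + neg_r)))

# Slide 14's example, starting from the empty rule (2+, 2-):
gain_freq = foil_gain(2, 2, 2, 1)   # Frequency = 'monthly' covers 2+, 1-
gain_date = foil_gain(2, 2, 2, 0)   # Open date < 01/01/95 covers 2+, 0-
# gain_date = 2 * (log2(1) - log2(1/2)) = 2.0, the better predicate
```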

  20. Rule Generation • Start from the target relation • Only the target relation is active • Repeat • Search in all active relations • Search in all relations joinable to active relations • Add the best predicate to the current rule • Set the involved relation to active • Until • The best predicate does not have enough gain • Current rule is too long

  21. Rule Generation: Example [Figure: the loan database schema of slide 5 again, highlighting the target relation (Loan), the range of search, and the first and second predicates; the best predicate is added to the rule]

  22. Look-one-ahead in Rule Generation • Two types of relations: entity and relationship • Useful predicates often cannot be found on relationship relations • CrossMine's solution: when propagating IDs to a relationship relation, propagate one more step, to the next entity relation
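Look-one-ahead amounts to chaining one extra propagation step. The sketch below assumes a hypothetical fragment where Disposition is a relationship relation linking Account (entity) to Client (entity); the IDs already sitting on Account are pushed through it onto Client in the same pass:

```python
from collections import defaultdict

# Hypothetical Disposition tuples: (disp_id, account_id, client_id)
dispositions = [(901, 124, 7), (902, 108, 8), (903, 124, 9)]

# Loan IDs already propagated onto Account (as on slide 14)
ids_on_account = {124: [1, 2], 108: [3], 45: [4]}

def look_one_ahead(relationship, ids_on_entity):
    """Propagate IDs through the relationship relation one extra step,
    so predicates can be searched directly on the next entity relation."""
    ids_next = defaultdict(list)
    for _, account_id, client_id in relationship:
        ids_next[client_id].extend(ids_on_entity.get(account_id, []))
    return ids_next

ids_on_client = look_one_ahead(dispositions, ids_on_account)
# Clients 7 and 9 both inherit loan IDs [1, 2] through account 124
```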

  23. Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  24. Negative Tuple Sampling • A rule covers some positive examples • Positive examples are removed once covered • After many rules have been generated, far fewer positive examples remain than negative ones

  25. Negative Tuple Sampling (cont.) • When there are many more negative examples than positive ones: good rules cannot be built (low support), and rule generation is still time-consuming (large number of negative examples) • Sample the negative examples: improves efficiency without affecting rule quality • Keep T(–) < Neg_Pos_Ratio × T(+) and T(–) < Max_Num_Negative
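The sampling condition above can be sketched directly, using the parameter values reported in the performance study (Neg_Pos_Ratio = 1, Max_Num_Negative = 600); the uniform-sampling choice is an assumption, as the slide does not specify the sampling scheme:

```python
import random

def sample_negatives(positives, negatives,
                     neg_pos_ratio=1, max_num_negative=600):
    """Cap the negative set at min(neg_pos_ratio * |positives|,
    max_num_negative), sampling uniformly when it is larger."""
    cap = min(int(neg_pos_ratio * len(positives)), max_num_negative)
    if len(negatives) <= cap:
        return list(negatives)
    return random.sample(negatives, cap)

pos = list(range(10))
sampled = sample_negatives(pos, list(range(1000)))
# With these defaults, only 10 of the 1000 negatives are kept
```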

  26. Roadmap • Motivation • Problem definition - preliminaries • Tuple ID Propagation • Rule Generation • Negative Tuple Sampling • Performance Study

  27. Performance Study • 1.7 GHz Pentium 4 PC, Windows 2000 • CrossMine-Rule parameters: Min_Foil_Gain = 2.5, Max_Rule_Length = 6, Neg_Pos_Ratio = 1, Max_Num_Negative = 600

  28. Performance Study • Synthetic relational databases are generated, varying the number of relations, the number of tuples in each relation, and the number of foreign keys • Running time and accuracy are compared • CrossMine also runs efficiently on data stored on disk, as in real applications

  29. Synthetic datasets [Figures: scalability w.r.t. the number of relations, and scalability w.r.t. the number of tuples]

  30. Real Dataset • PKDD Cup 99 dataset – Loan Application • Mutagenesis dataset (4 relations)

  31. References
• H. Blockeel, L. De Raedt, and J. Ramon. Top-down induction of logical decision trees. In Proc. of the Fifteenth Int. Conf. on Machine Learning, Madison, WI, 1998.
• C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 1998.
• L. Dehaspe and H. Toivonen. Discovery of relational association rules. In Relational Data Mining, Springer-Verlag, 2000.
• L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In Proc. of the 18th Int. Conf. on Machine Learning, Williamstown, MA, 2001.
• H. A. Leiva. MRDTL: a multi-relational decision tree learning algorithm. M.S. thesis, Iowa State Univ., 2002.
• T. Mitchell. Machine Learning. McGraw-Hill, 1997.
• S. Muggleton. Inverse entailment and Progol. New Generation Computing, special issue on Inductive Logic Programming, 1995.
• S. Muggleton and C. Feng. Efficient induction of logic programs. In Proc. of the First Conf. on Algorithmic Learning Theory, Tokyo, Japan, 1990.
• A. Popescul, L. Ungar, S. Lawrence, and M. Pennock. Towards structural logistic regression: Combining relational and statistical learning. In Proc. of the Multi-Relational Data Mining Workshop, Alberta, Canada, 2002.
• J. R. Quinlan. FOIL: A midterm report. In Proc. of the Sixth European Conf. on Machine Learning, Springer-Verlag, 1993.
• J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
• B. Taskar, E. Segal, and D. Koller. Probabilistic classification and clustering in relational data. In Proc. of the 17th Int. Joint Conf. on Artificial Intelligence, Seattle, WA, 2001.
