
Decision List








  1. Decision List LING 572 Fei Xia 1/18/06

  2. Outline • Basic concepts and properties • Case study

  3. Definitions • A decision list (DL) is an ordered list of conjunctive rules. • Rules can overlap, so the order is important. • A decision list determines an example’s class by using the first matched rule.

  4. An example A simple DL: x = (f1, f2, f3) • If f1=v11 && f2=v21 then c1 • If f2=v21 && f3=v34 then c2 Classify the example (v11, v21, v34): c1 or c2? Both rules match, but the first matching rule wins, so the answer is c1.
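A minimal Python sketch of first-match classification, using the hypothetical feature names and values from the example above (an illustration, not code from the course):

```python
def classify(example, decision_list, default=None):
    """Return the class given by the first rule whose conditions all match."""
    for conditions, label in decision_list:
        if all(example.get(f) == v for f, v in conditions.items()):
            return label
    return default

# The two-rule DL from the slide, with each example encoded as a dict.
dl = [
    ({"f1": "v11", "f2": "v21"}, "c1"),
    ({"f2": "v21", "f3": "v34"}, "c2"),
]
print(classify({"f1": "v11", "f2": "v21", "f3": "v34"}, dl))  # -> c1
```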

  5. Decision list • A decision list is a list of pairs (t1, v1), …, (tr, vr), where the ti are terms and tr = true. • A “term” in this context is a conjunction of literals: • f1=v11 is a literal. • “f1=v11 && f2=v21” is a term.

  6. How to build a decision list • Decision tree → Decision list • Greedy, iterative algorithm that builds DLs directly.

  7. Decision tree → Decision list [Figure: a one-level decision tree testing Income, with branches low and high leading to the leaf classes Nothing and Respond; each root-to-leaf path becomes one rule of the DL.]

  8. The greedy algorithm • RuleList = [ ], E = training_data • Repeat until E is empty or the gain is small: • t = Find_best_term(E) • Let E’ be the examples covered by t • Let c be the most common class in E’ • Add (t, c) to RuleList • E ← E – E’
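A minimal Python sketch of this greedy covering loop. To keep it short I assume each term is a single feature=value test and that Find_best_term simply picks the test covering the most remaining examples; the real quality measure can be anything (entropy, log-likelihood, etc.):

```python
from collections import Counter

def learn_decision_list(examples, min_cover=1):
    """examples: list of (feature_dict, label) pairs.
    Returns an ordered list of rules [(term_dict, label), ...]."""
    rules, remaining = [], list(examples)
    while remaining:
        # Candidate terms: every single feature=value test seen in the data.
        candidates = {(f, v) for feats, _ in remaining for f, v in feats.items()}
        # Find_best_term: the test that covers the most remaining examples.
        best = max(candidates,
                   key=lambda t: sum(feats.get(t[0]) == t[1]
                                     for feats, _ in remaining))
        covered = [(feats, y) for feats, y in remaining
                   if feats.get(best[0]) == best[1]]
        if len(covered) < min_cover:
            break                      # gain too small: stop early
        # Label the rule with the most common class among the covered examples.
        label = Counter(y for _, y in covered).most_common(1)[0][0]
        rules.append(({best[0]: best[1]}, label))
        remaining = [ex for ex in remaining if ex not in covered]  # E <- E - E'
    return rules
```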

  9. Problem of the greedy algorithm • The interpretation of rules depends on the preceding rules. • Each iteration reduces the number of training examples. • Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL. → Several papers propose alternative algorithms

  10. Algorithms for building DL • AQ algorithm (Michalski, 1969) • CN2 algorithm (Clark and Niblett, 1989) • Segal and Etzioni (1994) • Goodman (2002) • …

  11. Probabilistic DL • DL: a rule is (t, v) • Probabilistic DL: a rule is (t, c1/p1, c2/p2, …, cn/pn), i.e., the rule assigns a probability to each class.
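A tiny illustrative sketch (my own rule format, not from the slides) showing how a probabilistic DL returns a class distribution instead of a single label, reusing the example rule from the summary slide (slide 28):

```python
def classify_prob(example, rules):
    """Return the class distribution of the first matching rule."""
    for conditions, dist in rules:
        if all(example.get(f) == v for f, v in conditions.items()):
            return dist

prob_dl = [
    ({"A": 1, "B": 1}, {"c1": 0.8, "c2": 0.2}),  # if A & B then (c1, 0.8) (c2, 0.2)
    ({},               {"c1": 0.5, "c2": 0.5}),  # final default rule (term = true)
]
print(classify_prob({"A": 1, "B": 1}, prob_dl))  # {'c1': 0.8, 'c2': 0.2}
```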

  12. Case study (Yarowsky, 1994)

  13. Case study: accent restoration • Task: to restore accents in Spanish and French → a special case of WSD • Ex: ambiguous de-accented forms: • cesse → cesse, cessé • cote → côté, côte, cote, coté • Algorithm: build a DL for each ambiguous de-accented form: e.g., one for cesse, another one for cote • Attributes: words within a window

  14. The algorithm • Training: • Find the list of de-accented forms that are ambiguous. • For each ambiguous form, build a decision list. • Testing: check each word in a sentence; • if it is ambiguous, restore the accented form according to the DL

  15. Algorithm for building DLs • Select feature templates • Build an attribute-value table • Find the feature ft that maximizes the scoring criterion (here, the log-likelihood ratio of Step 4) • Split the data and iterate.

  16. In this paper • Binary classification problem: each form has only two possible accent patterns. • Each rule tests only one feature • Very high baseline: 98.7% • Notation: • Accent pattern: label/target/y • Collocation: feature

  17. Step 1: Identify forms that are ambiguous

  18. Step 2: Collecting training contexts Context: the previous three and next three words. Strip the accents from the data. Why? Because the test input is de-accented, so the training contexts must match what will be seen at test time.
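A small Python sketch of this step, assuming whitespace-tokenized text and a window of three words on each side; the helper names are mine, not the paper's:

```python
import unicodedata

def strip_accents(word):
    """Remove diacritics, e.g. 'cessé' -> 'cesse'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collect_contexts(tokens, target_deaccented, k=3):
    """Yield (accented_form, context_words) for each occurrence of the target.
    The context words are de-accented so they match what is seen at test time."""
    for i, tok in enumerate(tokens):
        if strip_accents(tok) == target_deaccented:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            yield tok, [strip_accents(w) for w in window]
```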

  19. Step 3: Measure collocational distributions Feature types are pre-defined.

  20. Collocations (a.k.a. features)

  21. Step 4: Rank decision rules by log-likelihood There are many alternatives. [The slide’s example table is omitted; the collocations include both individual words and word classes.]
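For the two-class setting here, the ranking score is the absolute log-likelihood ratio of the two accent patterns given the collocation (a LaTeX reconstruction; smoothing of small counts is omitted):

```latex
% Score used to rank a rule whose term is the collocation t (two-class case)
\mathrm{score}(t) \;=\; \left|\, \log \frac{P(c_1 \mid t)}{P(c_2 \mid t)} \,\right|
```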

  22. Step 5: Pruning DLs • Pruning: • Cross-validation • Remove redundant rules: e.g., if a “WEEKDAY” (word-class) rule precedes the “domingo” rule, the more specific “domingo” rule can never change a decision and can be dropped.

  23. Summary of the algorithm • For a de-accented form w, find all possible accented forms • Collect training contexts: • collect k words on each side of w • strip the accents from the data • Measure collocational distributions: • use pre-defined attribute combinations: • Ex: “-1 w” (the previous word), “+1 w, +2 w” (the next two words) • Rank decision rules by log-likelihood • Optional pruning and interpolation
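Putting the steps together, a compact Python sketch. I make several simplifications the paper does not: only one feature template (a word anywhere in the ±k window), simple add-alpha smoothing, and no pruning or interpolation:

```python
import math
from collections import defaultdict

def train_decision_list(contexts, alpha=0.1):
    """contexts: list of (accented_form, context_words) pairs for one
    ambiguous de-accented form, with exactly two accent patterns.
    Returns rules sorted by |log-likelihood ratio|, most reliable first."""
    labels = sorted({label for label, _ in contexts})
    assert len(labels) == 2, "binary case, as in the paper"
    counts = defaultdict(lambda: {lab: 0.0 for lab in labels})
    for label, words in contexts:
        for w in set(words):                       # feature: word w in the window
            counts[("window", w)][label] += 1
    rules = []
    for feat, c in counts.items():
        p1 = (c[labels[0]] + alpha) / (c[labels[0]] + c[labels[1]] + 2 * alpha)
        score = abs(math.log(p1 / (1 - p1)))       # |log P(c1|feat) / P(c2|feat)|
        best = labels[0] if p1 >= 0.5 else labels[1]
        rules.append((score, feat, best))
    rules.sort(reverse=True)
    return rules
```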

  24. Experiments Prior (baseline): choose the most common form.

  25. Global probabilities vs. Residual probabilities • Two ways to estimate P(ci | ft) for the log-likelihood: • Global probabilities: estimated from the full data set • Residual probabilities: estimated from the residual training data, i.e., the examples not already handled by higher-ranked rules • More relevant, but less data and more expensive to compute. • Interpolation: use both • In practice, global probabilities work better.

  26. Combining vs. Not combining evidence • Each decision is based on a single piece of evidence (i.e., one feature). • Run-time efficiency and easy modeling • It works well, at least for this task, but why? • Combining all available evidence rarely produces a different result • “The gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided” (cf. Naïve Bayes)

  27. Summary of case study • It allows a wider context (compared to n-gram methods) • It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods) • kitchen-sink approach of the best kind (at that time)

  28. Summary of decision list • Rules are easily understood by humans (but remember the order factor) • DL tends to be relatively small, and fast and easy to apply in practice. • Learning: greedy algorithm and other improved algorithms • Extension: probabilistic DL • Ex: if A & B then (c1, 0.8) (c2, 0.2) • DL is related to DT, CNF, DNF, and TBL (see “additional slides”).

  29. Additional slides

  30. Rivest’s paper • It assumes that all attributes (including the goal attribute) are binary. • It shows that DLs are easily learnable from examples.

  31. Assignment and formula • Input attributes: x1, …, xn • An assignment gives each input attribute a value (1 or 0): e.g., 10001 • A boolean formula (function) maps each assignment to a value (1 or 0): f : {0,1}^n → {0,1}

  32. Two formulae are equivalent if they give the same value for every assignment. • Total number of different formulae over n attributes: 2^(2^n) → Classification problem: learn a formula given a partial (truth) table

  33. CNF and DNF • Literal: a variable or its negation (e.g., x1 or ¬x1) • Term: conjunction (“and”) of literals • Clause: disjunction (“or”) of literals • CNF (conjunctive normal form): the conjunction of clauses. • DNF (disjunctive normal form): the disjunction of terms. • k-CNF and k-DNF: each clause/term has at most k literals

  34. A slightly different definition of DT • A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1. • k-DT: the depth of a DT is at most k. • A DT defines a boolean formula: look at the paths whose leaf node is 1. • An example

  35. Decision list • A decision list is a list of pairs (f1, v1), …, (fr, vr), where the fi are terms and fr = true. • A decision list defines a boolean function: given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.

  36. Relations among different representations • CNF, DNF, DT, DL • k-CNF, k-DNF, k-DT, k-DL • For any k < n, k-DL is a proper superset of the other three. • Compared to DT, DL has a simple structure, but the complexity of the decisions allowed at each node is greater.

  37. k-CNF and k-DNF are proper subsets of k-DL • k-DNF is a subset of k-DL: • Each term t of a DNF is converted into a decision rule (t, 1). • Ex: • k-CNF is a subset of k-DL: • Every k-CNF formula is the complement of some k-DNF formula: k-CNF and k-DNF are duals of each other. • The complement of a k-DL is also a k-DL. • Ex: • Neither k-CNF nor k-DNF is a subset of the other • Ex: 1-DNF:
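A worked example of the k-DNF → k-DL direction (my own illustration; the slide's "Ex:" items are not preserved in the transcript): each DNF term becomes a rule that outputs 1, and a final default rule outputs 0.

```latex
% Converting a 2-DNF formula into an equivalent 2-DL (illustration):
(x_1 \wedge x_2) \,\vee\, (\neg x_3 \wedge x_4)
\;\;\Longrightarrow\;\;
\bigl[(x_1 \wedge x_2,\ 1),\ (\neg x_3 \wedge x_4,\ 1),\ (\mathrm{true},\ 0)\bigr]
```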

  38. k-DT is a proper subset of k-DL • k-DT is a subset of k-DNF • Each leaf labeled with “1” maps to a term in k-DNF. • k-DT is a subset of k-CNF • Each leaf labeled with “0” maps to a clause in k-CNF → k-DT is a subset of k-DNF ∩ k-CNF

  39. k-DT, k-CNF, k-DNF and k-DL [Venn diagram: k-DT lies inside both k-CNF and k-DNF, and all of them lie inside k-DL.]

  40. Learnability • Positive examples vs. negative examples of the concept being learned. • In some domains, positive examples are easier to collect. • A sample is a set of examples. • A boolean function is consistent with a sample if it does not contradict any example in the sample.

  41. Two properties of a learning algorithm • A learning algorithm is economical if it requires few examples to identify the correct concept. • A learning algorithm is efficient if it requires little computational effort to identify the correct concept. → We prefer algorithms that are both economical and efficient.

  42. Hypothesis space • Hypothesis space F: a set of concepts that are being considered. • Hopefully, the concept being learned should be in the hypothesis space of a learning algorithm. • The goal of a learning algorithm is to select the right concept from F given the training data.

  43. Discrepancy between two functions f and g: the probability, under the example distribution Pn, that f and g disagree on a randomly drawn assignment. • Ideally, we want this discrepancy (the accuracy parameter ε) to be as small as possible. • To deal with ‘bad luck’ in drawing examples according to Pn, we define a confidence parameter δ.

  44. “Polynomially learnable” • A set F of Boolean functions is polynomially learnable if there exist an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that, when given a sample of f of size at least m(n, 1/ε, 1/δ) drawn according to Pn, A will with probability at least 1 − δ output a g in F s.t. the discrepancy between g and f is at most ε. Furthermore, A’s running time is polynomially bounded in n and m. • k-DL is polynomially learnable.
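A LaTeX reconstruction of the condition in PAC-learning style (my rendering; Rivest's notation may differ in details):

```latex
% For every f in F, every distribution P_n, and every epsilon, delta > 0:
% given a sample of f of size m >= m(n, 1/epsilon, 1/delta) drawn from P_n,
\Pr\Bigl[\, d_{P_n}\bigl(A(\text{sample}),\, f\bigr) \le \varepsilon \,\Bigr] \;\ge\; 1 - \delta,
\qquad\text{where}\quad
d_{P_n}(g, f) \;=\; \Pr_{x \sim P_n}\bigl[\, g(x) \ne f(x) \,\bigr].
```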

  45. The algorithm in (Rivest, 1987) • If the example set S is empty, halt. • Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v. • Add (t, v) to the decision list and remove those examples from S. • Repeat 1-3.
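A small Python sketch of this procedure, assuming binary attributes indexed 0..n−1 and examples given as (assignment, label) pairs; the function name and term encoding are mine:

```python
from itertools import combinations, product

def rivest_k_dl(examples, n, k):
    """examples: list of (assignment_tuple, label) with 0/1 values.
    Greedily builds a k-DL in the style of Rivest (1987): repeatedly find a
    term of length <= k whose covered examples all share one label."""
    # All terms of length <= k: choose attributes, then a 0/1 value for each.
    terms = [tuple(zip(attrs, vals))
             for r in range(k + 1)
             for attrs in combinations(range(n), r)
             for vals in product((0, 1), repeat=r)]
    rules, S = [], list(examples)
    while S:
        for term in terms:
            covered = [(x, y) for x, y in S
                       if all(x[i] == v for i, v in term)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:
                rules.append((term, labels.pop()))
                S = [ex for ex in S if ex not in covered]
                break
        else:
            return None  # no consistent k-DL exists for this sample
    return rules
```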

  46. Summary of (Rivest, 1987) • Formal definition of DL • Show the relation between k-DT, k-CNF, k-DNF and k-DL. • Prove that k-DL is polynomially learnable. • Give a simple greedy algorithm to build k-DL.

  47. In practice • Input attributes and the goal are not necessarily binary. • Ex: the previous word • A term → a feature (it is not necessarily a conjunction of literals) • Ex: the word appears in a k-word window • Only some feature types are considered, instead of all possible features: • Ex: previous word and next word • Greedy algorithm: quality measure • Ex: a feature with minimum entropy
