
Decision List








  1. Decision List LING 572 Fei Xia 1/18/06

  2. Outline • Basic concepts and properties • Case study

  3. Definitions • A decision list (DL) is an ordered list of conjunctive rules. • Rules can overlap, so the order is important. • A decision list determines an example’s class by using the first matched rule.

  4. An example A simple DL: x = (f1, f2, f3) • If f1=v11 && f2=v21 then c1 • If f2=v21 && f3=v34 then c2 Classify the example (v11, v21, v34): c1 or c2? Both rules match, but the first matching rule wins, so the answer is c1.
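A minimal Python sketch of first-match classification, using the hypothetical feature names and values from the example above (an illustration, not code from the course):

```python
def classify(example, decision_list, default=None):
    """Return the class given by the first rule whose conditions all match."""
    for conditions, label in decision_list:
        if all(example.get(f) == v for f, v in conditions.items()):
            return label
    return default

# The two-rule DL from the slide, with each example encoded as a dict.
dl = [
    ({"f1": "v11", "f2": "v21"}, "c1"),
    ({"f2": "v21", "f3": "v34"}, "c2"),
]
print(classify({"f1": "v11", "f2": "v21", "f3": "v34"}, dl))  # -> c1
```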

  5. Decision list • A decision list is a list of pairs (t1, v1), …, (tr, vr), where the ti are terms and tr = true. • A “term” in this context is a conjunction of literals: • f1=v11 is a literal. • “f1=v11 && f2=v21” is a term.

  6. How to build a decision list • Decision tree → Decision list • Greedy, iterative algorithm that builds DLs directly.

  7. Decision tree → Decision list [Figure: a one-level decision tree testing Income, with branches low and high leading to the leaf classes Nothing and Respond; each root-to-leaf path becomes one rule of the DL.]

  8. The greedy algorithm • RuleList = [ ], E = training_data • Repeat until E is empty or the gain is small: • t = Find_best_term(E) • Let E’ be the examples covered by t • Let c be the most common class in E’ • Add (t, c) to RuleList • E ← E – E’
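A minimal Python sketch of this greedy covering loop. To keep it short I assume each term is a single feature=value test and that Find_best_term simply picks the test covering the most remaining examples; the real quality measure can be anything (entropy, log-likelihood, etc.):

```python
from collections import Counter

def learn_decision_list(examples, min_cover=1):
    """examples: list of (feature_dict, label) pairs.
    Returns an ordered list of rules [(term_dict, label), ...]."""
    rules, remaining = [], list(examples)
    while remaining:
        # Candidate terms: every single feature=value test seen in the data.
        candidates = {(f, v) for feats, _ in remaining for f, v in feats.items()}
        # Find_best_term: the test that covers the most remaining examples.
        best = max(candidates,
                   key=lambda t: sum(feats.get(t[0]) == t[1]
                                     for feats, _ in remaining))
        covered = [(feats, y) for feats, y in remaining
                   if feats.get(best[0]) == best[1]]
        if len(covered) < min_cover:
            break                      # gain too small: stop early
        # Label the rule with the most common class among the covered examples.
        label = Counter(y for _, y in covered).most_common(1)[0][0]
        rules.append(({best[0]: best[1]}, label))
        remaining = [ex for ex in remaining if ex not in covered]  # E <- E - E'
    return rules
```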

  9. Problem of the greedy algorithm • The interpretation of rules depends on the preceding rules. • Each iteration reduces the number of training examples. • Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL. → Several papers propose alternative algorithms

  10. Algorithms for building DL • AQ algorithm (Michalski, 1969) • CN2 algorithm (Clark and Niblett, 1989) • Segal and Etzioni (1994) • Goodman (2002) • …

  11. Probabilistic DL • DL: a rule is (t, v) • Probabilistic DL: a rule is (t, c1/p1, c2/p2, …, cn/pn), i.e., the rule assigns a probability to each class.
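A tiny illustrative sketch (my own rule format, not from the slides) showing how a probabilistic DL returns a class distribution instead of a single label, reusing the example rule from the summary slide (slide 28):

```python
def classify_prob(example, rules):
    """Return the class distribution of the first matching rule."""
    for conditions, dist in rules:
        if all(example.get(f) == v for f, v in conditions.items()):
            return dist

prob_dl = [
    ({"A": 1, "B": 1}, {"c1": 0.8, "c2": 0.2}),  # if A & B then (c1, 0.8) (c2, 0.2)
    ({},               {"c1": 0.5, "c2": 0.5}),  # final default rule (term = true)
]
print(classify_prob({"A": 1, "B": 1}, prob_dl))  # {'c1': 0.8, 'c2': 0.2}
```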

  12. Case study (Yarowsky, 1994)

  13. Case study: accent restoration • Task: to restore accents in Spanish and French → a special case of WSD • Ex: ambiguous de-accented forms: • cesse → cesse, cessé • cote → côté, côte, cote, coté • Algorithm: build a DL for each ambiguous de-accented form: e.g., one for cesse, another one for cote • Attributes: words within a window

  14. The algorithm • Training: • Find the list of de-accented forms that are ambiguous. • For each ambiguous form, build a decision list. • Testing: check each word in a sentence; • if it is ambiguous, restore the accented form according to the DL

  15. Algorithm for building DLs • Select feature templates • Build an attribute-value table • Find the feature ft that maximizes the scoring criterion (here, the log-likelihood ratio of Step 4) • Split the data and iterate.

  16. In this paper • Binary classification problem: each form has only two possible accent patterns. • Each rule tests only one feature • Very high baseline: 98.7% • Notation: • Accent pattern: label/target/y • Collocation: feature

  17. Step 1: Identify forms that are ambiguous

  18. Step 2: Collecting training contexts Context: the previous three and next three words. Strip the accents from the data. Why? Because the test input is de-accented, so the training contexts must match what will be seen at test time.
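A small Python sketch of this step, assuming whitespace-tokenized text and a window of three words on each side; the helper names are mine, not the paper's:

```python
import unicodedata

def strip_accents(word):
    """Remove diacritics, e.g. 'cessé' -> 'cesse'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def collect_contexts(tokens, target_deaccented, k=3):
    """Yield (accented_form, context_words) for each occurrence of the target.
    The context words are de-accented so they match what is seen at test time."""
    for i, tok in enumerate(tokens):
        if strip_accents(tok) == target_deaccented:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            yield tok, [strip_accents(w) for w in window]
```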

  19. Step 3: Measure collocational distributions Feature types are pre-defined.

  20. Collocations (a.k.a. features)

  21. Step 4: Rank decision rules by log-likelihood There are many alternatives. [The slide’s example table is omitted; the collocations include both individual words and word classes.]
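For the two-class setting here, the ranking score is the absolute log-likelihood ratio of the two accent patterns given the collocation (a LaTeX reconstruction; smoothing of small counts is omitted):

```latex
% Score used to rank a rule whose term is the collocation t (two-class case)
\mathrm{score}(t) \;=\; \left|\, \log \frac{P(c_1 \mid t)}{P(c_2 \mid t)} \,\right|
```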

  22. Step 5: Pruning DLs • Pruning: • Cross-validation • Remove redundant rules: e.g., if a “WEEKDAY” (word-class) rule precedes the “domingo” rule, the more specific “domingo” rule can never change a decision and can be dropped.

  23. Summary of the algorithm • For a de-accented form w, find all possible accented forms • Collect training contexts: • collect k words on each side of w • strip the accents from the data • Measure collocational distributions: • use pre-defined attribute combinations: • Ex: “-1 w” (the previous word), “+1 w, +2 w” (the next two words) • Rank decision rules by log-likelihood • Optional pruning and interpolation
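Putting the steps together, a compact Python sketch. I make several simplifications the paper does not: only one feature template (a word anywhere in the ±k window), simple add-alpha smoothing, and no pruning or interpolation:

```python
import math
from collections import defaultdict

def train_decision_list(contexts, alpha=0.1):
    """contexts: list of (accented_form, context_words) pairs for one
    ambiguous de-accented form, with exactly two accent patterns.
    Returns rules sorted by |log-likelihood ratio|, most reliable first."""
    labels = sorted({label for label, _ in contexts})
    assert len(labels) == 2, "binary case, as in the paper"
    counts = defaultdict(lambda: {lab: 0.0 for lab in labels})
    for label, words in contexts:
        for w in set(words):                       # feature: word w in the window
            counts[("window", w)][label] += 1
    rules = []
    for feat, c in counts.items():
        p1 = (c[labels[0]] + alpha) / (c[labels[0]] + c[labels[1]] + 2 * alpha)
        score = abs(math.log(p1 / (1 - p1)))       # |log P(c1|feat) / P(c2|feat)|
        best = labels[0] if p1 >= 0.5 else labels[1]
        rules.append((score, feat, best))
    rules.sort(reverse=True)
    return rules
```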

  24. Experiments Prior (baseline): choose the most common form.

  25. Global probabilities vs. Residual probabilities • Two ways to estimate P(ci | ft) for the log-likelihood: • Global probabilities: estimated from the full data set • Residual probabilities: estimated from the residual training data, i.e., the examples not already handled by higher-ranked rules • More relevant, but less data and more expensive to compute. • Interpolation: use both • In practice, global probabilities work better.

  26. Combining vs. Not combining evidence • Each decision is based on a single piece of evidence (i.e., one feature). • Run-time efficiency and easy modeling • It works well, at least for this task, but why? • Combining all available evidence rarely produces a different result • “The gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided” (cf. Naïve Bayes)

  27. Summary of case study • It allows a wider context (compared to n-gram methods) • It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods) • kitchen-sink approach of the best kind (at that time)

  28. Summary of decision list • Rules are easily understood by humans (but remember the order factor) • DL tends to be relatively small, and fast and easy to apply in practice. • Learning: greedy algorithm and other improved algorithms • Extension: probabilistic DL • Ex: if A & B then (c1, 0.8) (c2, 0.2) • DL is related to DT, CNF, DNF, and TBL (see “additional slides”).

  29. Additional slides

  30. Rivest’s paper • It assumes that all attributes (including the goal attribute) are binary. • It shows that DLs are easily learnable from examples.

  31. Assignment and formula • Input attributes: x1, …, xn • An assignment gives each input attribute a value (1 or 0): e.g., 10001 • A boolean formula (function) maps each assignment to a value (1 or 0): f : {0,1}^n → {0,1}

  32. Two formulae are equivalent if they give the same value for every assignment. • Total number of different formulae over n attributes: 2^(2^n) → Classification problem: learn a formula given a partial (truth) table

  33. CNF and DNF • Literal: a variable or its negation (e.g., x1 or ¬x1) • Term: conjunction (“and”) of literals • Clause: disjunction (“or”) of literals • CNF (conjunctive normal form): the conjunction of clauses. • DNF (disjunctive normal form): the disjunction of terms. • k-CNF and k-DNF: each clause/term has at most k literals

  34. A slightly different definition of DT • A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1. • k-DT: the depth of a DT is at most k. • A DT defines a boolean formula: look at the paths whose leaf node is 1. • An example

  35. Decision list • A decision list is a list of pairs (f1, v1), …, (fr, vr), where the fi are terms and fr = true. • A decision list defines a boolean function: given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.

  36. Relations among different representations • CNF, DNF, DT, DL • k-CNF, k-DNF, k-DT, k-DL • For any k < n, k-DL is a proper superset of the other three. • Compared to DT, DL has a simple structure, but the complexity of the decisions allowed at each node is greater.

  37. k-CNF and k-DNF are proper subsets of k-DL • k-DNF is a subset of k-DL: • Each term t of a DNF is converted into a decision rule (t, 1). • Ex: • k-CNF is a subset of k-DL: • Every k-CNF formula is the complement of some k-DNF formula: k-CNF and k-DNF are duals of each other. • The complement of a k-DL is also a k-DL. • Ex: • Neither k-CNF nor k-DNF is a subset of the other • Ex: 1-DNF:
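A worked example of the k-DNF → k-DL direction (my own illustration; the slide's "Ex:" items are not preserved in the transcript): each DNF term becomes a rule that outputs 1, and a final default rule outputs 0.

```latex
% Converting a 2-DNF formula into an equivalent 2-DL (illustration):
(x_1 \wedge x_2) \,\vee\, (\neg x_3 \wedge x_4)
\;\;\Longrightarrow\;\;
\bigl[(x_1 \wedge x_2,\ 1),\ (\neg x_3 \wedge x_4,\ 1),\ (\mathrm{true},\ 0)\bigr]
```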

  38. k-DT is a proper subset of k-DL • k-DT is a subset of k-DNF • Each leaf labeled with “1” maps to a term in k-DNF. • k-DT is a subset of k-CNF • Each leaf labeled with “0” maps to a clause in k-CNF → k-DT is a subset of k-DNF ∩ k-CNF

  39. k-DT, k-CNF, k-DNF and k-DL [Venn diagram: k-DT lies inside both k-CNF and k-DNF, and all of them lie inside k-DL.]

  40. Learnability • Positive examples vs. negative examples of the concept being learned. • In some domains, positive examples are easier to collect. • A sample is a set of examples. • A boolean function is consistent with a sample if it does not contradict any example in the sample.

  41. Two properties of a learning algorithm • A learning algorithm is economical if it requires few examples to identify the correct concept. • A learning algorithm is efficient if it requires little computational effort to identify the correct concept. → We prefer algorithms that are both economical and efficient.

  42. Hypothesis space • Hypothesis space F: a set of concepts that are being considered. • Hopefully, the concept being learned should be in the hypothesis space of a learning algorithm. • The goal of a learning algorithm is to select the right concept from F given the training data.

  43. Discrepancy between two functions f and g: the probability, under the example distribution Pn, that f and g disagree on a randomly drawn assignment. • Ideally, we want this discrepancy (the accuracy parameter ε) to be as small as possible. • To deal with ‘bad luck’ in drawing examples according to Pn, we define a confidence parameter δ.

  44. “Polynomially learnable” • A set F of Boolean functions is polynomially learnable if there exist an algorithm A and a polynomial function m(n, 1/ε, 1/δ) such that, when given a sample of f of size at least m(n, 1/ε, 1/δ) drawn according to Pn, A will with probability at least 1 − δ output a g in F s.t. the discrepancy between g and f is at most ε. Furthermore, A’s running time is polynomially bounded in n and m. • k-DL is polynomially learnable.
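A LaTeX reconstruction of the condition in PAC-learning style (my rendering; Rivest's notation may differ in details):

```latex
% For every f in F, every distribution P_n, and every epsilon, delta > 0:
% given a sample of f of size m >= m(n, 1/epsilon, 1/delta) drawn from P_n,
\Pr\Bigl[\, d_{P_n}\bigl(A(\text{sample}),\, f\bigr) \le \varepsilon \,\Bigr] \;\ge\; 1 - \delta,
\qquad\text{where}\quad
d_{P_n}(g, f) \;=\; \Pr_{x \sim P_n}\bigl[\, g(x) \ne f(x) \,\bigr].
```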

  45. The algorithm in (Rivest, 1987) • If the example set S is empty, halt. • Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v. • Add (t, v) to the decision list and remove those examples from S. • Repeat 1-3.
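A small Python sketch of this procedure, assuming binary attributes indexed 0..n−1 and examples given as (assignment, label) pairs; the function name and term encoding are mine:

```python
from itertools import combinations, product

def rivest_k_dl(examples, n, k):
    """examples: list of (assignment_tuple, label) with 0/1 values.
    Greedily builds a k-DL in the style of Rivest (1987): repeatedly find a
    term of length <= k whose covered examples all share one label."""
    # All terms of length <= k: choose attributes, then a 0/1 value for each.
    terms = [tuple(zip(attrs, vals))
             for r in range(k + 1)
             for attrs in combinations(range(n), r)
             for vals in product((0, 1), repeat=r)]
    rules, S = [], list(examples)
    while S:
        for term in terms:
            covered = [(x, y) for x, y in S
                       if all(x[i] == v for i, v in term)]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:
                rules.append((term, labels.pop()))
                S = [ex for ex in S if ex not in covered]
                break
        else:
            return None  # no consistent k-DL exists for this sample
    return rules
```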

  46. Summary of (Rivest, 1987) • Formal definition of DL • Show the relation between k-DT, k-CNF, k-DNF and k-DL. • Prove that k-DL is polynomially learnable. • Give a simple greedy algorithm to build k-DL.

  47. In practice • Input attributes and the goal are not necessarily binary. • Ex: the previous word • A term → a feature (it is not necessarily a conjunction of literals) • Ex: the word appears in a k-word window • Only some feature types are considered, instead of all possible features: • Ex: previous word and next word • Greedy algorithm: quality measure • Ex: a feature with minimum entropy
