On Agnostic Boosting and Parity Learning Adam Tauman Kalai, Georgia Tech. Yishay Mansour, Google and Tel-Aviv Elad Verbin, Tsinghua
Defs
• Agnostic Learning = learning with adversarial noise
• Boosting = turn weak learner into strong learner
• Parities = parities of subsets of the bits
  • f : {0,1}^n → {0,1}, e.g. f(x) = x1 ⊕ x3 ⊕ x7

Outline
• Agnostic Boosting
  • Turning a weak agnostic learner into a strong agnostic learner
• A 2^{O(n/log n)}-time algorithm for agnostically learning parities over any distribution
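To make the parity class concrete, here is a minimal Python sketch (my illustration, not from the talk) of a parity of a subset of the bits:

```python
# A minimal sketch (not from the talk): a parity of a subset of the bits,
# e.g. f(x) = x1 XOR x3 XOR x7 for x in {0,1}^n.

def parity(x, subset):
    """Return the XOR of the bits of x indexed by `subset` (0-based)."""
    total = 0
    for i in subset:
        total ^= x[i]
    return total

# f(x) = x1 XOR x3 XOR x7 (1-based on the slide; 0-based indices here).
x = [1, 0, 1, 1, 0, 0, 1, 0]
print(parity(x, [0, 2, 6]))  # 1 ^ 1 ^ 1 = 1
```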
Agnostic boosting
• Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis
• Agnostic Booster: runs the weak learner as a black box
• Strong learner: produces an almost-optimal hypothesis
Learning with Noise
It’s, like, a really hard model!!!*
* up to well-studied open problems (i.e. we know where we’re stuck)
Agnostic Learning: some known results
Due to hardness, or lack of tools???
Agnostic boosting: strong tool, makes it easier to design algorithms.
Why care about agnostic learning? • More relevant in practice • Impossibility results might be useful for building cryptosystems
Noisy learning
f : {0,1}^n → {0,1} from class F. The algorithm gets samples <x, f(x)> where x is drawn from distribution D.
• No noise: the learning algorithm should approximate f up to error ε.
• Random noise: an η-fraction of the labels is flipped at random; the algorithm should approximate f up to error ε.
• Adversarial (≈ agnostic) noise: an adversary may corrupt an η-fraction, turning f into g; the algorithm should approximate g up to error η + ε.
Agnostic learning (geometric view)
[Figure: the class F drawn as a region; g lies at distance opt from its nearest point in F, and the ball of radius opt + ε around g is highlighted.]
PROPER LEARNING. Parameters: F, metric. Input: oracle for g. Goal: return some element of the highlighted ball.
Agnostic boosting: definition
• Weak learner: given distribution D and samples from g with opt ≤ ½ − ε, outputs (w.h.p.) a hypothesis h with err_D(g,h) ≤ ½ − ε^100.

Agnostic boosting
• Weak learner: as above, with err_D(g,h) ≤ ½ − ε^100 whenever opt ≤ ½ − ε.
• Agnostic Booster: given samples from g, outputs (w.h.p.) h' with err_D(g,h') ≤ opt + ε. Runs the weak learner poly(1/ε^100) times.

Agnostic boosting (general form)
• (α,γ)-weak learner: given distribution D and samples from g with opt ≤ ½ − α, outputs (w.h.p.) h with err_D(g,h) ≤ ½ − γ.
• Agnostic Booster: given samples from g, outputs (w.h.p.) h' with err_D(g,h') ≤ opt + α + ε. Runs the weak learner poly(1/γ, 1/ε) times.
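The following toy Python sketch (hypothetical names; not the paper's algorithm) only illustrates the black-box contract above: the booster never inspects the weak learner, it just feeds it reweighted samples and combines the returned hypotheses.

```python
# Toy illustration of the black-box contract (hypothetical names, not the
# paper's algorithm): the booster repeatedly calls a weak learner on
# reweighted/filtered samples and combines its hypotheses. The real booster
# assembles the weak hypotheses into a branching program (next slides);
# here we simply keep the best hypothesis seen, to show the interface.

def agnostic_boost(weak_learner, sample, rounds):
    """weak_learner: callable taking a list of (x, label) pairs and
    returning a hypothesis h: x -> {0,1}.  sample: list of (x, label)."""
    def err(h):
        return sum(h(x) != y for x, y in sample) / len(sample)

    best = None
    for _ in range(rounds):
        # Stand-in for the booster's reweighting: focus on the examples the
        # current hypothesis gets wrong (or the whole sample initially).
        sub = sample if best is None else \
              ([(x, y) for x, y in sample if best(x) != y] or sample)
        h = weak_learner(sub)
        if best is None or err(h) < err(best):
            best = h
    return best
```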
Agnostic boosting
• Weak learner: for any noise rate < ½, produces a better-than-trivial hypothesis
• Agnostic Booster → Strong learner: produces an almost-optimal hypothesis
“Approximation Booster” Analogy
• Given: a poly-time MAX-3-SAT algorithm that, when opt = 7/8 + ε, produces a solution with value 7/8 + ε^100.
• Get: an algorithm for MAX-3-SAT that produces a solution with value within an additive ε of opt, with running time poly(n, 1/ε).
Gap
[Figure: the error range from 0 to 1 with ½ marked. No hardness gap close to ½, plus the booster, implies no gap anywhere (an additive PTAS).]
Agnostic boosting
• New analysis for the Mansour-McAllester booster
  • uses branching programs; nodes are weak hypotheses
• Previous agnostic boosting:
  • Ben-David+Long+Mansour, and Gavinsky, defined agnostic boosting differently
  • Their result cannot be used for our application
Booster
[Figure: a single node h1; an example x is routed by h1(x) = 0 / h1(x) = 1 to leaves labeled 0 and 1.]

Booster: Split step
[Figure: each leaf of h1 sees a different distribution; the weak learner is run on each, giving candidate hypotheses h2 and h2'; choose the “better” option.]

Booster: Split step
[Figure: further splits add nodes h2 and h3 below h1.]

Booster: Split step
[Figure: another split adds node h4; the program now routes x through h1, h2, h3, h4 to 0/1 leaves.]

Booster: Merge step
[Figure: two nodes whose behavior is “similar” are merged.]

Booster: Merge step
[Figure: the branching program after the merge.]

Booster: Another split step
[Figure: the split/merge process continues, adding node h5, and so on.]

Booster: final result
[Figure: the final branching program, a layered graph of weak hypotheses routing each x to a 0/1 leaf.]
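As a rough structural illustration of the construction just walked through (my sketch, not the paper's pseudocode), the booster's output can be viewed as a branching program whose internal nodes hold weak hypotheses:

```python
# Structural sketch of the booster's branching program (my illustration,
# not the paper's pseudocode). Internal nodes hold weak hypotheses; an
# example x is routed by each node's prediction until it reaches a 0/1 leaf.

class Node:
    def __init__(self, hypothesis=None, label=None):
        self.hypothesis = hypothesis   # weak hypothesis h(x) -> {0,1}; None at a leaf
        self.label = label             # 0/1 output at a leaf; None at an internal node
        self.children = [None, None]   # children[h(x)] is the next node

def evaluate(root, x):
    """Route x through the branching program and return its 0/1 prediction."""
    node = root
    while node.label is None:
        node = node.children[node.hypothesis(x)]
    return node.label

def split(leaf, weak_learner, examples_reaching_leaf):
    """Split step (sketch): run the weak learner on the examples that reach
    this leaf and replace the leaf by a node branching on the new hypothesis."""
    h = weak_learner(examples_reaching_leaf)
    leaf.hypothesis, leaf.label = h, None
    leaf.children = [Node(label=0), Node(label=1)]

# Merge step (sketch): merge two leaves whose conditional label statistics
# are "similar", keeping the branching program's width bounded.
```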
Application: Parity with Noise
• Theorem: for every ε, there is a weak learner* that, for noise rate ½ − ε, produces a hypothesis which is wrong on a ½ − (2ε)^{n^0.001}/2 fraction of the space. Running time 2^{O(n/log n)}.
  * non-proper learner: the hypothesis is a circuit with 2^{O(n/log n)} gates
• Feldman et al. give a black-box reduction to the random-noise case. We give a direct result.
Corollary: Learners for many classes (without noise)
• Can learn, without noise, any class with a “guaranteed correlated parity” in time 2^{O(n/log n)}
  • e.g. DNF; any others?
• A weak parity learner running in 2^{O(n^0.32)} time would beat the best algorithm known for learning DNF
• Good evidence that parity with noise is hard? Its hardness would yield efficient cryptosystems [Hopper-Blum, Blum-Furst et al., and many others]
Idea of weak agnostic parity learner
Main idea:
1. Take a learner which resists random noise (BKW)
2. Add randomness to its behavior, until you get a weak agnostic learner
“Between two evils, I pick the one I haven’t tried before” – Mae West
“Between two evils, I pick uniformly at random” – CS folklore
Summary
Problem: It is difficult but perhaps possible to design agnostic learning algorithms.
Proposed solution: Agnostic Boosting.
Contributions:
• Right(er) definition for a weak agnostic learner
• Agnostic boosting
• Learning parity with noise in the hardest noise model
• Entertaining STOC ’08 participants
Open Problems
• Find other applications for Agnostic Boosting
• Improve PwN algorithms
• Get a proper learner for parity with noise
• Reduce PwN with agnostic noise to PwN with random noise
• Get evidence that PwN is hard
• Prove that if parity with noise is easy then FACTORING is easy. $128 reward!
May the parity be with you! The end.
Weak parity learner
• Sample labeled points from the distribution; sample an unlabeled x; let’s guess f(x)
• Bucket according to the last 2n/log n bits
[Figure: samples in the same bucket are added (XORed) in pairs and passed to the next round.]

Weak parity learner: last round
• √n vectors with sum = 0 give a guess for f(x)
• by symmetry, prob. of mistake = %mistakes
• Claim: %mistakes is noticeably below ½ (by Cauchy-Schwarz)
[Figure: triples of vectors summing to 0.]
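A rough Python sketch of one bucketing round in the spirit of the BKW-style procedure above (block size, pairing rule, and names are my assumptions, not the paper's exact algorithm):

```python
# One BKW-style bucketing round (illustrative sketch): group labeled samples
# by a chosen block of bits, XOR pairs within each bucket so that block
# cancels to zero, and pass the combined samples to the next round.
from collections import defaultdict

def bucket_and_cancel(samples, block):
    """samples: list of (x, label) with x a tuple of 0/1 bits.
    block: indices of the bits to zero out in this round."""
    buckets = defaultdict(list)
    for x, y in samples:
        buckets[tuple(x[i] for i in block)].append((x, y))

    combined = []
    for group in buckets.values():
        # XOR consecutive pairs within a bucket: the bits in `block` cancel,
        # and the labels XOR because the target is a parity (noise accumulates).
        for (x1, y1), (x2, y2) in zip(group[::2], group[1::2]):
            combined.append((tuple(a ^ b for a, b in zip(x1, x2)), y1 ^ y2))
    return combined
```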
Intuition behind Boosting
[Figure: decrease the weight of points the current hypothesis classifies correctly; increase the weight of misclassified points.]
• Run, reweight, run, reweight, … . Take the majority of the hypotheses.
• An algorithmic & efficient version of the Yao-von Neumann Minimax Principle.
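To make the reweight-and-vote intuition concrete, here is a standard AdaBoost-style sketch in Python; it illustrates only the classical intuition on this slide, not the agnostic branching-program booster of this work, and the weighted weak-learner interface is an assumption:

```python
# Classical reweighting intuition (AdaBoost-style sketch, for illustration
# only; not the agnostic booster of this work): up-weight the examples the
# current hypothesis gets wrong, down-weight the rest, and output a
# weighted majority vote of all hypotheses.
import math

def boost_by_reweighting(weak_learner, sample, rounds):
    """weak_learner(sample, weights) -> hypothesis h: x -> {0,1}."""
    n = len(sample)
    weights = [1.0 / n] * n
    voters = []
    for _ in range(rounds):
        h = weak_learner(sample, weights)
        err = sum(w for w, (x, y) in zip(weights, sample) if h(x) != y)
        err = min(max(err, 1e-9), 1 - 1e-9)          # avoid division by zero
        alpha = 0.5 * math.log((1 - err) / err)      # vote strength
        voters.append((alpha, h))
        # Increase the weight of mistakes, decrease the weight of correct points.
        weights = [w * math.exp(alpha if h(x) != y else -alpha)
                   for w, (x, y) in zip(weights, sample)]
        total = sum(weights)
        weights = [w / total for w in weights]

    def majority(x):
        vote = sum(a if h(x) == 1 else -a for a, h in voters)
        return 1 if vote >= 0 else 0
    return majority
```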