
Learning intersections and thresholds of halfspaces


Presentation Transcript


  1. Learning intersections and thresholds of halfspaces Adam Klivans (MIT/Harvard) Ryan O’Donnell (MIT) Rocco Servedio (Harvard)

  2. Learning We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means: - a function f in C is selected, and also a probability distribution D over {+1,−1}^n - the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D - goal: efficiently output a hypothesis h such that w.h.p., Pr_{x←D}[f(x) ≠ h(x)] < ε.

  3. Learning example Example: C is the class of all conjunctions of variables. Perhaps the concept selected is: x1 AND x2 AND x4. One might see examples: < (+ + − + − +), + > < (− + − + − −), − > < (+ + + − − +), − > What is a learning algorithm for this class?
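A minimal sketch (my illustration, not from the slides) of the classic elimination algorithm for conjunctions, shown here for monotone conjunctions: start with the AND of all variables and delete every variable that some positive example sets to −1. Negated literals are handled the same way by tracking both literals per variable.

    def learn_conjunction(examples, n):
        # examples: list of (x, label) pairs, x a tuple of +1/-1 values, label +1/-1
        candidates = set(range(n))                   # start with the AND of all n variables
        for x, label in examples:
            if label == +1:
                # a positive example rules out every variable it sets to -1
                candidates -= {i for i in range(n) if x[i] == -1}
        return lambda x: +1 if all(x[i] == +1 for i in candidates) else -1

On the three examples above it keeps variables x1, x2, x4, x6, a conjunction consistent with all three labels.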

  4. Halfspaces Let h be a hyperplane in R^n: h = {x : ∑_{i=1}^{n} wi xi = θ}. h naturally induces a boolean function: f : {+1,−1}^n → {+1,−1}, f(x) = sgn(∑_{i=1}^{n} wi xi − θ). We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (wi ≡ 1, θ = 0).

  5. Learning halfspaces Learning halfspaces is a very old problem; it dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62]. The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89]. Indeed, this works over any distribution on R^n, including distributions supported on {+1,−1}^n.

  6. Learning halfspaces Basic idea: given a bunch of examples, find a halfspace which classifies them correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm. Consider the coefficients of a hypothesis halfspace to be unknowns, a1, …, an, θ. Each example induces one linear constraint: e.g., < (+ + − + − −), + > induces a1 + a2 − a3 + a4 − a5 − a6 > θ. Solve the LP.
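A minimal sketch of this LP step (my illustration; it assumes numpy and scipy are available, and the names are mine). Each labeled example becomes one linear constraint on the unknowns a1, …, an, θ; a margin of 1 stands in for the strict inequality (any consistent integer-weight halfspace can be rescaled to satisfy it), and any feasible point is a consistent hypothesis.

    import numpy as np
    from scipy.optimize import linprog

    def fit_halfspace(X, y):
        # X: m x n matrix of +1/-1 example points; y: length-m vector of +1/-1 labels.
        # Encode "y_i * (a . x_i - theta) >= 1" as A_ub @ [a, theta] <= b_ub and ask
        # the LP solver for any feasible point (the objective is identically zero).
        X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
        m, n = X.shape
        A_ub = np.hstack([-y[:, None] * X, y[:, None]])
        b_ub = -np.ones(m)
        res = linprog(c=np.zeros(n + 1), A_ub=A_ub, b_ub=b_ub,
                      bounds=[(None, None)] * (n + 1), method="highs")
        return (res.x[:n], res.x[n]) if res.success else None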

  7. Learning intersections of halfspaces The next logical extension of this, and a very important one, is learning intersections of halfspaces. Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas… Learning them is also an important problem for computer vision, study of perceptrons. But very little is known.

  8. Prior work - [Baum-91]: poly-time algorithm for an intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x) = D(−x)). - [Blum-Kannan-97, Vempala-97]: learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere; not relevant for boolean halfspaces. - [Kwek-Pitt-98]: polynomial-time algorithm, but it requires membership queries; also not relevant for boolean halfspaces.

  9. Our results Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}^n is learnable under the uniform distribution to accuracy 1−ε in time n^{O(k²/ε²)}. This is polynomial time if k = O(1), ε = Ω(1). (Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

  10. Our results Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time n^{O(k log k log W)}/ε. So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.

  11. More results

  12. Sketch of techniques For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations. For uniform distribution results: show that functions of halfspaces have low noise sensitivity. Both conclusions imply learning results generically.

  13. Talk outline Plan for the rest of the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbitrary-distribution results.) 2. Prove the n^{O(k²/ε²)} bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other uniform-distribution results.)

  14. Polynomial threshold functions A (multilinear) polynomial p : R^n → R is a PTF for f if it sign-represents f: f(x) = sgn(p(x)) for all x ∈ {+1,−1}^n. - every boolean halfspace is a degree 1 PTF for itself - every boolean function has a degree n PTF By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time n^{O(d)}/ε (run the halfspace LP over the n^{O(d)} monomials of degree at most d, treated as features).

  15. PTFs for intersections of halfspaces Suppose f(x) = ∑ wi xi − θ and g(x) = ∑ wi' xi − θ' are the linear forms of two halfspaces. We would like a PTF for sgn(f) ∧ sgn(g). Failed attempt 1: try f(x)g(x): it is > 0 if f(x) > 0 and g(x) > 0 ✓, but it is also > 0 if f(x) < 0 and g(x) < 0 ✗. Failed attempt 2: try f(x) + g(x): it is > 0 if f(x) > 0 and g(x) > 0 ✓, and < 0 if f(x) < 0 and g(x) < 0 ✓, but it can have either sign if f(x) > 0 and g(x) < 0 ✗.

  16. PTFs for intersections of halfspaces The solution: apply a (polynomial?) function to f and g to make them look more like their sign. Assume the weights are integers with ∑|wi| ≤ W. Then for all x ∈ {+1,−1}^n, f(x), g(x) ∈ [−W,−1] ∪ [1,W]. Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [−W,−1] ∪ [1,W].

  17. BRS’s sgn-approximator Example: p(x) = (x−1)(x−2)²(x−4)²(x−8)²(x−16)²(x−32)² (roots at powers of 2 up to 32), q(x) = (p(−x) − p(x)) / (p(−x) + p(x)). In general, Q is a rational function of degree O(log k log W) such that: Q(x) ∈ [1, 1+1/k] for x ∈ [1,W], and Q(x) ∈ [−1−1/k, −1] for x ∈ [−W,−1].
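A quick numerical sanity check (my addition) of the example above: q should sit in a narrow band just above +1 on [1, 32] and, since q is odd, just below −1 on [−32, −1].

    import numpy as np

    def p(t):   # the example polynomial above, with roots at 1, 2, 4, ..., 32
        return (t - 1) * (t - 2)**2 * (t - 4)**2 * (t - 8)**2 * (t - 16)**2 * (t - 32)**2

    def q(t):   # the rational sgn-approximator q(t) = (p(-t) - p(t)) / (p(-t) + p(t))
        return (p(-t) - p(t)) / (p(-t) + p(t))

    ts = np.linspace(1, 32, 2000)
    print(q(ts).min(), q(ts).max())     # a tight band just above +1
    print(q(-ts).max(), q(-ts).min())   # the mirror image just below -1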

  18. PTFs for intersections of halfspaces Now given weight-W halfspaces h1, …, hk, sgn(Q(h1(x)) + … + Q(hk(x)) − (k−½)) is a rational function which sign-represents the intersection. Once taken to a common denominator, it has degree O(k log k log W). Easy to get a polynomial: sgn(p/q) = sgn(pq). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time n^{O(k log k log W)}.
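To make the construction concrete, here is a small check (my illustration; the halfspaces are arbitrary choices within the weight bound): with odd integer weights of total magnitude at most 27 and threshold 0, each hi(x) is a nonzero integer in [−32, −1] ∪ [1, 32], so the rational threshold above should reproduce the intersection exactly on every point of {+1,−1}^n.

    import itertools
    import numpy as np

    def p(t):   # the W = 32 sgn-approximator polynomial from the previous slide
        return (t - 1) * (t - 2)**2 * (t - 4)**2 * (t - 8)**2 * (t - 16)**2 * (t - 32)**2

    def Q(t):
        return (p(-t) - p(t)) / (p(-t) + p(t))

    rng = np.random.default_rng(0)
    n, k = 9, 3
    weights = rng.choice([-3, -1, 1, 3], size=(k, n))      # odd weights, total magnitude <= 27

    for x in itertools.product([-1, 1], repeat=n):
        h = weights @ np.array(x)                          # the k values h_1(x), ..., h_k(x)
        rational_sign = np.sign(Q(h).sum() - (k - 0.5))    # the threshold from the slide
        intersection = 1 if np.all(h > 0) else -1          # AND of the k halfspaces
        assert rational_sign == intersection
    print("agrees with the intersection on all", 2**n, "points")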

  19. Talk outline Plan for the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. ✓ 2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution. (next)

  20. Noise sensitivity Let f : {+1,−1}^n → {+1,−1} be a boolean function. Pick x ∈ {+1,−1}^n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε. defn: The noise sensitivity of f is: NS_ε(f) = Pr[f(x) ≠ f(y)].
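A Monte Carlo estimator of this definition (my illustration), handy for checking the examples on the next slide: sample x, flip each coordinate independently with probability ε, and count disagreements.

    import numpy as np

    def noise_sensitivity(f, n, eps, trials=50_000, seed=0):
        # Estimate NS_eps(f) = Pr[f(x) != f(y)] for f : {+1,-1}^n -> {+1,-1}.
        rng = np.random.default_rng(seed)
        x = rng.choice([-1, 1], size=(trials, n))
        flips = rng.random((trials, n)) < eps            # flip each bit independently w.p. eps
        y = np.where(flips, -x, x)
        fx = np.array([f(row) for row in x])
        fy = np.array([f(row) for row in y])
        return np.mean(fx != fy)

    # Example: majority on 51 bits; the estimates grow roughly like sqrt(eps).
    majority = lambda z: 1 if z.sum() > 0 else -1
    print(noise_sensitivity(majority, 51, 0.01), noise_sensitivity(majority, 51, 0.04))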

  21. Noise sensitivity examples • Let f be a projection to one bit, f(x1, …, xn) = x1. Then NS_ε(f) = ε. • Suppose f depends on only k bits. Then NS_ε(f) ≤ kε. • PARITY is the most noise-sensitive function: NS_ε(PARITY_n) = ½ − ½(1−2ε)^n.
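A quick check of the PARITY formula (my addition): parity changes exactly when an odd number of coordinates are flipped, and Pr[odd number of flips among n independent ε-coin flips] is ½ − ½(1−2ε)^n.

    import numpy as np

    rng = np.random.default_rng(1)
    n, eps, trials = 10, 0.1, 200_000
    flips = rng.random((trials, n)) < eps
    empirical = np.mean(flips.sum(axis=1) % 2 == 1)    # parity flips iff an odd number of bits flip
    print(empirical, 0.5 - 0.5 * (1 - 2 * eps) ** n)   # the two values should nearly agree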

  22. Noise sensitivity – study and apps. • [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity • [Kahn-Kalai-Linial-88] – random walks on the hypercube • [Håstad-97] – probabilistically checkable proofs • [Bshouty-Jackson-Tamon-99] – learning theory under noise • [O-02] – Yao’s XOR Lemma, average case hardness of NP • [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

  23. Low noise sensitivity ⇒ fast learning We show that if the noise sensitivity of all f in C is uniformly bounded, NS_ε(f) ≤ α(ε), then C is learnable under the uniform distribution in time n^{O(1)/α^{−1}(ε/3)}. Intuition: if f is not too noise sensitive, nearby points are highly correlated, so a net of examples works.

  24. Proof of NS-learning connection Actually, the intuition is wrong. Here is the proper proof sketch: Low noise sensitivity ⇒ Fourier spectrum concentrated at low levels; this uses the formula NS_ε(f) = ½ − ½ Σ_S (1−2ε)^{|S|} f̂(S)² and a Markov-type inequality. Low-level Fourier concentration ⇒ efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93].
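To spell out the Markov-type step (my reconstruction, using Parseval’s identity Σ_S f̂(S)² = 1): since 1 − (1−2δ)^{|S|} ≥ 1 − e^{−2} whenever |S| ≥ 1/δ, the formula above gives, for noise rate δ, 2·NS_δ(f) = Σ_S (1 − (1−2δ)^{|S|}) f̂(S)² ≥ (1 − e^{−2}) Σ_{|S| ≥ 1/δ} f̂(S)². So the Fourier weight above level 1/δ is at most 2·NS_δ(f)/(1 − e^{−2}) < 3·α(δ). Taking δ = α^{−1}(ε/3) makes this weight at most ε, i.e., the spectrum is ε-concentrated up to degree d = 1/α^{−1}(ε/3), and the [LMN-93] low-degree algorithm then runs in time n^{O(d)}.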

  25. Noise sensitivity of halfspaces Key fact (due to Peres; proof sketched below): a single boolean halfspace h has NS_ε(h) = O(√ε). Since a function of k halfspaces can change value only if one of the k halfspaces does, any function of k halfspaces has NS_ε at most O(k√ε).

  26. Consequences Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f ∈ C have NS_ε(f) ≤ α(ε). Then α^{−1}(ε/3) = Θ(ε²/k²). Hence we get Theorem 1: a uniform distribution learning algorithm running in time n^{O(k²/ε²)}.

  27. Noise sensitivity of a halfspace We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε). Suppose the halfspace is f = sgn(∑ wi xi − θ). Without (much) loss of generality, one can assume θ = 0. Recall that the xi’s are selected randomly from {+1,−1} and the sum is formed; then each xi is flipped independently with probability ε. We want to show that the probability that the two sums (before and after flipping) land on opposite sides of 0 – call this a “flop”, probability P – is O(√ε).

  28. Noise sensitivity of a halfspace With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.) We now model the problem thus: Pick signs xi at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X1 = ∑_{i=1…k} wi xi, X2 = ∑_{i=k+1…2k} wi xi, etc.

  29. Noise sensitivity of a halfspace Write S = X1 + … + Xn/k for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X1, so the sum before flipping is S'+X1, and the sum after flipping is S'−X1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X1|.

  30. Noise sensitivity of a halfspace sgn(X1) and S' are independent, so: Pr[sgn(X1) ≠ sgn(S')] = ½. sgn(X1) and |X1| are independent, so: Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½ ⇒ Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½ ⇒ Pr[sgn(X1) ≠ sgn(S) & no flop] = ½(1−P) ⇒ Pr[sgn(X1) ≠ sgn(S)] = ½(1−P) ⇒ P = 2 E[½ − I[sgn(X1) ≠ sgn(S)]].

  31. Noise sensitivity of a halfspace Of course, there was nothing special about block X1 as opposed to any other block. So in fact, P = 2 E[½ − I[sgn(Xi) ≠ sgn(S)]] for all i = 1…n/k. Write τ = sgn(S), σi = sgn(Xi), and average over i: P = 2 E[½ − (k/n) ∑i I[τ ≠ σi]].

  32. Noise sensitivity of a halfspace P = 2 E[½ − (k/n) ∑i I[τ ≠ σi]] The quantity inside the expectation is some random variable, a number which is either ½ − (k/n) ∑i I[1 ≠ σi] or ½ − (k/n) ∑i I[−1 ≠ σi]. If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise: P ≤ 2 E[ |½ − (k/n) ∑i I[σi = 1]| + |½ − (k/n) ∑i I[σi = −1]| ].

  33. Noise sensitivity of a halfspace P ≤ 2 E[ |½ − ε ∑_{i=1…1/ε} I[σi = 1]| + |½ − ε ∑_{i=1…1/ε} I[σi = −1]| ] But the σi’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/ε samples of an unbiased 0/1 random variable – i.e., O(√ε).

  34. Extensions This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows. To get the extended learning algorithms, we must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of + is p < ½ – then: NS_ε(h) ≤ min{2p, Cp(ε log(1/p))^{1/2}}.

  35. Talk outline Plan for the talk: 1. Prove the n^{O(k log k log W)} bound for learning intersections of k weight-W halfspaces under arbitrary distributions. ✓ 2. Prove the n^{O(k²/ε²)} bound for learning functions of k halfspaces under the uniform distribution. ✓

  36. Open technical challenges • Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces. (For a new lower bound, see my talk tomorrow!) • Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)^{1/2})?

  37. The huge open problem It still remains open how to learn an intersection of two arbitrary boolean halfspaces under an arbitrary distribution in subexponential time!
