Learning intersections and thresholds of halfspaces

Learning intersections and thresholds of halfspaces Adam Klivans (MIT/Harvard) Ryan O’Donnell (MIT) Rocco Servedio (Harvard)

Learning We consider the PAC model of [Valiant-84], in which learning a “concept class” C of boolean functions means: - a function f in C is selected, and also a probability distribution D over {+1,−1}n - the learning algorithm gets access to random examples <x, f(x)>, where the x’s are drawn from D - goal: efficiently output a hypothesis h such that w.h.p., Prx←D[f(x) ≠ h(x)] < ε.

Learning example Example: C is the class of all conjunctions of variables. Perhaps the concept selected is: x1AND x2AND x4. One might see examples: < (+ + − + − +), + > < (− + − + − −), − > < (+ + + − − +), − > What is a learning algorithm for this class?

Halfspaces Let h be a hyperplane in Rn: h = {x : ∑wixi = θ}. h naturally induces a boolean function: f :{+1,−1}n→{+1,−1}, f(x) = sgn(∑wixi − θ). We call such a function a boolean halfspace, or a weighted majority. The majority function itself is an example (wi≡1, θ = 0). n i=1

Learning halfspaces Learning halfspaces is a very old problem; dates back to models for the brain from the ’50s: [Agmon-54, Rosenblatt-58, Block-62]. The concept class of halfspaces has long been known to be PAC learnable in polynomial time via Linear Programming [BEHW-89]. Indeed, this works over any distribution on Rn, including those singling out {+1,−1}n.

Learning halfspaces Basic idea: given a bunch of examples, find a halfspace which classifies them correctly. By some learning theory technology (“Occam’s Razor”), this is a good algorithm. Consider the coefficients of a hypothesis halfspace to be unknowns, a1, …, an,θ. Each example induces some linear constraints: e.g., < (+ + − + − −), + > induces a1+a2−a3+a4−a5−a6 > θ. Solve LP.

Learning intersections of halfspaces The next logical extension of this, and a very important one, is learning intersections of halfspaces. Intersections of halfspaces form a very rich concept class: all convex bodies, CNF formulas… Learning them is also an important problem for computer vision, study of perceptrons. But very little is known.

Prior work - [Baum91]: poly time algorithm for intersection of two halfspaces through the origin under symmetric distributions (those satisfying D(x)= D(−x)). - [BlumKannan,Vempala97] learn an intersection of O(1) halfspaces in poly time over near-uniform distributions on the Euclidean sphere: - not relevant for boolean halfspaces • [KwekPitt98] gave a polynomial time alg., but requires membership queries • also not relevant for boolean halfspaces

Our results Theorem 1: The concept class of arbitrary functions of k boolean halfspaces over {+1,−1}n is learnable under the uniform distribution to accuracy 1−ε in time: nO(k²/ε²). This is polynomial time if k = O(1), ε = Ω(1). (Prior to this, no algorithm could learn even an intersection of 2 arbitrary boolean halfspaces under the uniform distribution in subexponential time.)

Our results Theorem 2: The concept class of intersections of k boolean halfspaces with weight bound W is learnable under any probability distribution to accuracy 1−ε in time: nO(k log k log W)/ε. So if the weights are polynomially bounded, one can learn an intersection of log many halfspaces in quasipolynomial time.

More results 4

Sketch of techniques For arbitrary distribution results: show that functions of low weight halfspaces have low degree polynomial threshold representations. For uniform distribution results: show that functions of halfspaces have low noise sensitivity. Both conclusions imply learning results generically.

Talk outline Plan for the rest of the talk: • Prove nO(klogk logW) bound for learning intersections of k weight-W halfspaces under arbitrary distributions. (Sketch other arbit. dist. results.) 2. Prove nO(k²/ε²) bound for learning arbitrary functions of k halfspaces under the uniform distribution. (Sketch other unif. dist. results.)

Polynomial threshold functions A (multilinear) polynomial p :Rn→R is a PTF for fif it sign-represents f : f(x) = sgn(p(x)) for all x{+1,−1}n. - every boolean halfspace is a degree 1 PTF for itself - every boolean function has a degree n PTF By linear programming [KS01]: if every function in a class C has a PTF of degree d, then C is learnable in time nO(d)/ε.

PTFs for intersections of halfspaces Suppose f and g are hyperplanes, f(x) = ∑wi xi−θ, g(x) = ∑wi'xi−θ'. We would like a PTF for sgn(f) sgn(g). Failed attempt 1:- try f(x)g(x): is >0if f(x)>0and g(x)>0 is >0iff(x)<0and g(x)<0  Failed attempt 2:- try f(x)+g(x): is >0if f(x)>0and g(x)>0 is <0iff(x)<0and g(x)<0 is ??iff(x)>0and g(x)<0 

PTFs for intersections of halfspaces The solution: apply a (polynomial?) function to f and g to make them look more like their sign. Assume ∑|wi|< W. Then for all x  {+1,−1}n, f(x), g(x)  [-W,-1] ∪ [1,W]. Beigel et al. [BRS95] showed how to construct a univariate rational function which is an essentially optimal approximator of the sgn function on [-W,-1] ∪ [1,W].

p(-x)-p(x) p(-x)+p(x) BRS’s sgn-approximator p(x)=(x-1)(x-2)2(x-4)2(x-8)2(x-16)2(x-32)2 q(x) = Q is a rational function of degree O(log k log W) such that: Q(x)  [1, 1+1/k] for x  [1,W],Q(x)  [-1-1/k, -1] for x  [-W,-1].

PTFs for intersections of halfspaces Now given weight W halfspaces h1, …, hk, sgn(Q(h1(x)) + … + Q(hk(x)) − (k−½)) is a rational function which sign-represents the intersection. Once taken to a common denominator, it has degree O(k log k log W). Easy to get a polynomial: sgn(p/q)=sgn(pq). So we have a PTF for the intersection of k weight-W halfspaces of degree O(k log k log W). Hence a learning algorithm running in time nO(klogk logW).

Talk outline Plan for the talk: • Prove nO(klogk logW) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.  2. Prove nO(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution. 

Noise sensitivity Let f :{+1,−1}n→{+1,−1} be a boolean function. Pick x {+1,−1}n uniformly at random, and let y be an ε-corruption of x: flip each bit of x independently with probability ε. defn: The noise sensitivity of fis: NSε(f) = Pr[f(x) ≠ f(y)].

Noise sensitivity examples • Let f be a projection to one bit, f(x1, …, xn) = x1. Then NSε(f) = ε. • Suppose f depends on only k bits.Then NSε(f) ≤ kε. • PARITY is the most noise-sensitive function: NSε(PARITYn) = ½ − ½(1−2ε)n.

Noise sensitivity – study and apps. • [Benjamini-Kalai-Schramm-98] – percolation, low-level circuit complexity • [Kahn-Kalai-Linial-88] – random walks on the hypercube • [Håstad-97] – probabilistically checkable proofs • [Bshouty-Jackson-Tamon-99] – learning theory under noise • [O-02] – Yao’s XOR Lemma, average case hardness of NP • [Bourgain-02, Kindler-Safra-02, FKRSS-02] – study of juntas, Fourier analysis of boolean fcns.

Low noise sens.  fast learning We show that if the noise sensitivity of all finC is uniformly bounded: NSε(f) ≤ α(ε), then C is learnable under the uniform distribution in time: nO(1)/α (ε/3). Intuition: if fis not too noise sensitive, nearby points are highly correlated, so a net of examples works. −1

Proof of NS-learning connection Actually, the intuition is wrong. Here is the proper proof sketch: Low noise sensitivity  Fourier spectrum concentrated at low levels; this uses the formula: NSε(f) = ½−½Σ(1−2ε)|S| f(S)2 and a Markovish inequality. Low level Fourier concentration  efficient uniform distribution learning; this is by the “Low degree” Fourier sampling learning algorithm of [Linial-Mansour-Nisan-93]. ˆ

Noise sensitivity of halfspaces

Consequences Let C be the class of functions of k boolean halfspaces. Take α(ε) = O(k√ε), so all f  C have NSε(f) ≤ α(ε). α−1(ε/3) = O(ε2/k2). Hence we get Theorem 1: a uniform distribution learning algorithm running in time nO(k²/ε²).

Noise sensitivity of a halfspace We now sketch Peres’s beautiful proof that the noise sensitivity of a single halfspace is O(√ε). Suppose the halfspace is f = sgn(∑wi xi−θ). Without (much) loss of generality, one can assume θ = 0. Recall that xi’s are selected randomly from {+1,−1} and the sum is formed; then each xi is flipped indep. with prob. ε. We want to show that the prob. the sums land on opposite sides of 0 – call this a “flop”, prob. P – is O(√ε).

Noise sensitivity of a halfspace With high probability, the number of flipped bits is about k := εn. Let’s assume we always flip exactly k random bits, and that k divides n. (Both assumptions are easily removed.) We now model the problem thus: Pick signs xi at random. Randomly permute the weights. Divide the weights into n/k blocks of size k. Form the n/k block sums, X1= ∑wi xi, X2= ∑wi xi, etc. i=k+1…2k i=1…k

Noise sensitivity of a halfspace Write S = X1 + … + Xn/k for the initial sum. Because of the permutation, we may assume that the random signs in the first block are the “flips”. Put S' = S − X1, so the sum before flipping is S'+X1, and the sum after flipping is S'−X1. We are trying to bound the probability P that these two sums have opposite signs (a flop). Note that this happens iff |S'| < |X1|.

Noise sensitivity of a halfspace sgn(X1) and S' are independent, so: Pr[sgn(X1) ≠ sgn(S')] = ½. sgn(X1) and |X1| are independent, so: Pr[sgn(X1) ≠ sgn(S') | |S'| > |X1|] = ½  Pr[sgn(X1) ≠ sgn(S) | |S'| > |X1|] = ½  Pr[sgn(X1) ≠ sgn(S) &no flop] = ½(1−P)  Pr[sgn(X1) ≠ sgn(S)] = ½(1−P)  P = 2 E[½ – I[sgn(X1) ≠ sgn(S)]].

Noise sensitivity of a halfspace Of course, there was nothing special about block X1 as opposed to any other block. So in fact, P = 2 E[½ – I[sgn(Xi) ≠ sgn(S)]]. for all i = 1…n/k. Write τ=sgn(S), σi=sgn(Xi), and average: P = 2 E[½ – (k/n) ∑i I[τ≠σi]].

Noise sensitivity of a halfspace P = 2 E[½ – (k/n) ∑i I[τ≠σi]] The quantity inside the expectation is some random variable, a number which is either ½ – (k/n) ∑i I[1 ≠σi] or ½ – (k/n) ∑i I[−1 ≠σi]. If I tell you a number is either a or b, then assuredly it’s at most |a| + |b|. Applying this to the expectation, pointwise: P≤ 2 E[|½ – (k/n) ∑i I[σi=1]| + |½ – (k/n) ∑i I[σi=−1]|].

Noise sensitivity of a halfspace P≤ 2 E[ |½ – ε∑I[σi=1]| + |½ – ε ∑I[σi=−1]| ] But the σi’s are simply independent, uniformly random signs. Hence both quantities in the expectation are merely the expected absolute deviation from the mean in 1/εsamples of an unbiased 0/1 random variable – i.e., O(√ε). i=1…1/ε i=1…1/ε

Extensions This concludes the proof that a single halfspace has noise sensitivity O(√ε), from which the uniform distribution learning algorithm for functions of k halfspaces follows. To get the extended learning algorithms, must work harder at analyzing noise sensitivity. Key result: if a halfspace h is biased – say, the probability of + is p < ½, then: NSε(h) ≤ min{2p, Cp(ε log(1/p))½}.

Talk outline Plan for the talk: • Prove nO(klogk logW) bound for learning intersections of k weight-W halfspaces under arbitrary distributions.  2. Prove nO(k²/ε²) bound for learning functions of k halfspaces under the uniform distribution. 

Open technical challenges • Give an upper bound on the degree necessary for a PTF which represents the AND of two arbitrary halfspaces.(For a new lower bound, see my talk tomorrow!) • Give a better analysis of the noise sensitivity of the intersection of k halfspaces on n bits. Is it O((ε log k)½)?

The huge open problem It still remains open how to learn an intersection of twoarbitrary boolean halfspaces under an arbitrary distribution in subexponential time!

Learning intersections and thresholds of halfspaces