
Learning and testing k-modal distributions


Presentation Transcript


  1. Learning and testing k-modal distributions Rocco A. Servedio (Columbia University) Joint work (in progress) with Costis Daskalakis (MIT) and Ilias Diakonikolas (UC Berkeley)

  2. What this talk is about Probability distributions over [N] = {1,2,…,N}. Monotone increasing distribution: p(i) ≤ p(i+1) for all i. (Whole talk: "increasing" means "non-decreasing".)

  3. k-modal distributions k-modal: k "peaks and valleys". Monotone distribution: 0-modal. (Figures: a unimodal distribution, another one, and a 3-modal distribution.)

  4. The learning problem Target distribution p is an unknown k-modal distribution over [N]. Algorithm gets samples from p. Goal: output a hypothesis h that's ε-close to p in total variation distance. Want an algorithm that uses few samples & is computationally efficient.

  5. The testing problem q is a known k-modal distribution over [N]. p is an unknown k-modal distribution over [N]. Algorithm gets samples from p. Goal: output "yes" w.h.p. if p = q, "no" w.h.p. if d_TV(p, q) ≥ ε.

  6. Please note Testing problem is not: given samples from an unknown distribution p, determine if p is k-modal versus ε-far from every k-modal distribution. This problem requires Ω(√N) samples, even for k = 0: it is hard to distinguish the uniform distribution over [N] from the uniform distribution over a random half-size subset of [N].

  7. Why study these questions? • k-modal distributions seem natural • would be nice if k-modal structure were exploitable by efficient learning / testing algorithms • post hoc justification: solutions exhibit interesting connections between testing and learning

  8. The general case: learning If we drop the k-modal assumption, the learning problem becomes: learn an arbitrary distribution over [N] to total variation distance ε. Θ(N/ε²) samples are necessary and sufficient.

  9. The general case: testing If we drop the k-modal assumption, the testing problem becomes: q is a known, arbitrary distribution over [N]; p is an unknown, arbitrary distribution over [N]; the algorithm gets samples from p. Goal: output "yes" if p = q, "no" if d_TV(p, q) ≥ ε. Roughly √N samples (up to poly(1/ε) factors) are necessary and sufficient [GR00, BFFKRW02, P08].

  10. This work: main learning result We give an algorithm that learns any k-modal distribution over [N] to accuracy ε. It is computationally efficient, and its sample complexity is close to optimal: any algorithm needs Ω(k·log(N/k)/ε³) samples.

  11. Main testing result We give a computationally efficient algorithm that solves the k-modal testing problem over [N] to accuracy ε, together with a lower bound on the number of samples any testing algorithm must use. The testing sample complexity is lower than the learning sample complexity: testing is easier than learning!

  12. Prior work k = 0, 1: [BKR04] gave a sample-efficient algorithm for the testing problem (with p and q both available only via sample access). k = 0, 1: [Birge87, Birge87a] gave an O(log(N)/ε³)-sample efficient algorithm for learning, and a matching lower bound. We'll use this learning algorithm as a black box in our results.

  13. Outline of rest of talk • Background: some tools • Learning k-modal distributions • Testing k-modal distributions

  14. First tool: Learning monotone distributions Theorem [B87]: There is an efficient algorithm that learns any monotone decreasing distribution over [N] to accuracy ε. It uses O(log(N)/ε³) samples and runs in time linear in its input size. [B87b] also gave a matching Ω(log(N)/ε³) lower bound for learning a monotone distribution.

  15. Second tool: Learning a CDF – the Dvoretzky-Kiefer-Wolfowitz inequality Theorem [DKW56]: Let p be any distribution over [N] with CDF F. Let F̂ be the empirical estimate of F obtained from m samples. Then with probability 1 − δ, max_x |F̂(x) − F(x)| ≤ ε once m = O(log(1/δ)/ε²). (Figure: true CDF vs. empirical CDF.) Note: Õ(1/ε²) samples suffice (by an easy Chernoff bound argument). Morally, this means you can partition [N] into O(1/ε) intervals, each of mass O(ε) under p, using Õ(1/ε²) samples.
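For concreteness, here is a minimal Python sketch (my own illustration, not from the talk) of the empirical CDF and of the "partition [N] into intervals of mass roughly ε" use of DKW; the names empirical_cdf and partition_by_mass and the toy data are mine.

import random

def empirical_cdf(samples, N):
    # F_hat(x) = fraction of samples that are <= x, for x = 1..N
    counts = [0] * (N + 1)
    for s in samples:
        counts[s] += 1
    m = len(samples)
    cdf = [0.0] * (N + 1)
    running = 0
    for x in range(1, N + 1):
        running += counts[x]
        cdf[x] = running / m
    return cdf

def partition_by_mass(cdf, eps):
    # Greedily cut {1,...,N} into consecutive intervals of empirical mass about eps.
    # By DKW, each interval's true mass is within the DKW error of its empirical mass.
    N = len(cdf) - 1
    intervals, start, base = [], 1, 0.0
    for x in range(1, N + 1):
        if cdf[x] - base >= eps:
            intervals.append((start, x))
            start, base = x + 1, cdf[x]
    if start <= N:
        intervals.append((start, N))
    return intervals

# toy usage on a synthetic right-skewed distribution over [N]
random.seed(0)
N, eps = 1000, 0.1
samples = [min(N, 1 + int(random.expovariate(1.0) * 50)) for _ in range(int(10 / eps ** 2))]
print(partition_by_mass(empirical_cdf(samples, N), eps))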

  16. Learning k-modal distributions

  17. The problem Learn an unknown k-modal distribution over [N].

  18. What should we shoot for? Easy lower bound: need Ω(k·log(N/k)/ε³) samples (an algorithm essentially has to solve Ω(k) separate monotone-distribution-learning problems, each over a domain of size about N/k, to accuracy ε). Want an algorithm that uses roughly this many samples and is computationally efficient.

  19. The problem, again Goal: learn an unknown k-modal distribution over [N]. We know how to efficiently learn an unknown monotone distribution… Would be easy if we knew the locations of the k peaks/valleys (marked with X's in the figure). Guessing them exactly: infeasible. Guessing them approximately: not too great either.

  20. A first approach Break up [N] into many intervals. Since p is k-modal, p is non-monotone on at most k of the intervals, so running the monotone distribution learner on each interval will usually give a good answer.

  21. First approach in more detail • Use [DKW] to divide [N] into intervals I_1, …, I_t and obtain estimates of their weights p(I_j). (Assumes no single point has large mass; heavier points are easy to detect and deal with separately.) • Run the monotone distribution learner on the conditional distribution of p over each I_j to get a hypothesis h_j. (Actually run it twice, once for increasing and once for decreasing, and do hypothesis testing to pick one as h_j.) • Combine the hypotheses in the obvious way: scale each h_j by the estimated weight of I_j and sum.

  22. Sketch of analysis • Use [DKW] to divide [N] into intervals I_1, …, I_t and obtain estimates of their weights: this step needs relatively few samples. • Run the monotone distribution learner on each interval: this dominates the sample cost, since the learner is invoked once per interval. • Combine the hypotheses in the obvious way (scale each per-interval hypothesis by its estimated weight and sum). • Total error: O(ε) from the (at most k) non-monotone intervals, plus O(ε) from the scaling factors, plus O(ε) from estimating the true interval weights with the empirical ones. (A code sketch of this approach follows below.)
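A structural sketch of this first approach in Python (again my own illustration, not the speakers' code): the monotone learner is passed in as a black box, and flat_learner is just a trivial stand-in so the example runs.

def learn_piecewise(samples, intervals, learn_monotone_on_interval):
    # First approach in code form: for each interval of the partition, estimate its
    # weight empirically, learn p restricted to the interval with a monotone-distribution
    # learner (a black box here), and recombine the scaled per-interval hypotheses.
    m = len(samples)
    h = {}
    for (a, b) in intervals:
        sub = [s for s in samples if a <= s <= b]
        if not sub:
            continue
        weight_hat = len(sub) / m                     # estimate of p([a, b])
        h_j = learn_monotone_on_interval(sub, a, b)   # conditional hypothesis over [a, b]
        for x, mass in h_j.items():
            h[x] = h.get(x, 0.0) + weight_hat * mass  # scaling factor = estimated weight
    return h

# A trivial stand-in for the monotone learner, only to make the sketch executable;
# the real algorithm would plug in the [Birge87] learner described on slide 14.
def flat_learner(sub_samples, a, b):
    return {x: 1.0 / (b - a + 1) for x in range(a, b + 1)}

h = learn_piecewise([1, 2, 2, 3, 7, 8, 9, 9], [(1, 5), (6, 10)], flat_learner)
print(round(sum(h.values()), 3))   # the combined hypothesis is a probability distribution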

  23. Improving the approach The extra cost came from running the monotone distribution learner on every single interval, far more invocations than the roughly k that should intuitively be needed. If we could somehow check (more cheaply than learning) whether an interval is monotone before running the learner, we could run the learner fewer times and save… …this is a property testing problem! More sophisticated algorithm: two new ingredients.

  24. First ingredient: testing k-modal distributions for monotonicity Consider the following property testing problem: Algorithm gets samples from unknown k-modal distribution p over [N]. Goal: output "yes" w.h.p. if p is monotone increasing, "no" w.h.p. if p is ε-far from monotone increasing. Note: the k-modal promise on p might save us from the Ω(√N) lower bound that holds without it (hard instances like the uniform-over-a-random-subset example from before).

  25. Efficiently testing k-modal distributions for monotonicity Algorithm gets samples from unknown k-modal distribution p over [N]. Goal: output "yes" w.h.p. if p is monotone increasing, "no" w.h.p. if p is ε-far from monotone increasing. Theorem: There is a tester for this problem whose sample complexity depends only on k and ε, not on N. We'll use this to identify sub-intervals of [N] where p is close to monotone… can we efficiently learn close-to-monotone distributions?

  26. Second ingredient: agnostically learning monotone distributions Consider the following "agnostic learning" problem: Algorithm gets samples from an unknown distribution p over [N] that is opt-close to monotone. Goal: output a hypothesis distribution h such that d_TV(p, h) ≤ opt + ε. If opt = 0, this is the original "learn a monotone distribution" problem. Want to handle the general case as efficiently as the opt = 0 case.

  27. Agnostically learning monotone distributions Algorithm gets samples from an unknown distribution p over [N] that is opt-close to monotone. Goal: output a hypothesis distribution h such that d_TV(p, h) ≤ opt + ε. Theorem: There is a computationally efficient learning algorithm for this problem that uses O(log(N)/ε³) samples.

  28. Semi-agnostically learning monotone distributions Algorithm gets samples from an unknown distribution p over [N] that is opt-close to monotone. Goal: output a hypothesis distribution h such that d_TV(p, h) ≤ O(opt) + ε. Theorem: There is a computationally efficient learning algorithm for this semi-agnostic problem that uses O(log(N)/ε³) samples. The [Birge87] monotone distribution learner does the job. We will apply it with opt = O(ε), so the gap between opt and O(opt) doesn't matter.

  29. The learning algorithm: first phase • Use [DKW] to divide [N] into intervals I_1, …, I_t and obtain estimates of their weights. • Run both monotonicity testers (increasing and decreasing) on I_1, then on I_1 ∪ I_2, then on I_1 ∪ I_2 ∪ I_3, etc., until the first time both testers say "no", at some interval I_j. Mark I_j and continue the scan from I_{j+1}. This uses at most one pair of tester invocations per interval. (Alternative: use binary search to locate each "no" point, reducing the number of tester invocations.)

  30. The algorithm • Run the testers on growing unions of consecutive intervals as above, until the first time both say "no"; mark that interval and continue. • Each time an interval is marked: the block of unmarked intervals right before it is close-to-monotone; call this a superinterval. Also, (at least) one of the k peaks/valleys of p is "used up".

  31. The learning algorithm: second phase • After the first phase, [N] is partitioned into: superintervals, each close to monotone (there are at most k+1 of them), and "marked" intervals, each of small weight (at most k of them, since each uses up a peak/valley). • Rest of algorithm: run the semi-agnostic monotone distribution learner on each superinterval to get an accurate hypothesis for p restricted to that superinterval. • Output the final hypothesis: combine the per-superinterval hypotheses, scaled by their estimated weights. (A code sketch of both phases follows below.)
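Putting the two phases together, a hedged Python sketch of the overall control flow (my own rendering; test_mono_up, test_mono_down, and semi_agnostic_learn are assumed black boxes standing in for the tester and the semi-agnostic learner described on the previous slides).

def learn_kmodal(intervals, samples, test_mono_up, test_mono_down, semi_agnostic_learn):
    # Phase 1: scan the DKW intervals left to right, growing a candidate superinterval
    # until BOTH monotonicity testers say "no"; then mark the offending interval
    # (it uses up a peak/valley of p) and start a fresh superinterval.
    m = len(samples)
    superintervals, marked, current = [], [], []
    for (a, b) in intervals:
        lo = current[0][0] if current else a
        sub = [s for s in samples if lo <= s <= b]
        if test_mono_up(sub, lo, b) or test_mono_down(sub, lo, b):
            current.append((a, b))
        else:
            if current:
                superintervals.append((current[0][0], current[-1][1]))
            marked.append((a, b))
            current = []
    if current:
        superintervals.append((current[0][0], current[-1][1]))

    # Phase 2: semi-agnostically learn p on each (close-to-monotone) superinterval and
    # combine, weighting by empirical mass; the marked intervals are simply dropped.
    h = {}
    for (lo, hi) in superintervals:
        sub = [s for s in samples if lo <= s <= hi]
        if not sub:
            continue
        w_hat = len(sub) / m
        for x, mass in semi_agnostic_learn(sub, lo, hi).items():
            h[x] = h.get(x, 0.0) + w_hat * mass
    return h, superintervals, marked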

  32. Analysis of the algorithm • Sample complexity: the runs of the tester are cheap compared to runs of the learner, and the semi-agnostic monotone learner is now run only once per superinterval (at most k+1 times) rather than once per interval; this is where the savings over the first approach comes from. • Error rate: O(ε) error from the marked intervals, plus O(ε) total error from estimating the true weights with the empirical ones, plus O(ε) total error from the scaling factors.

  33. I owe you a tester Algorithm gets samples from unknown k-modal distribution p over [N]. Goal: output "yes" w.h.p. if p is monotone increasing, "no" w.h.p. if p is ε-far from monotone increasing. Theorem: There is a tester for this problem whose sample complexity depends only on k and ε, not on N.

  34. The testing algorithm • Algorithm: Run [DKW] with accuracy set as a suitable function of ε and k; let p̂ be the resulting empirical PDF. • If there exist a ≤ b < c such that the average value of p̂ over [a, b] exceeds the average value of p̂ over [b+1, c] by more than the allowed slack, then output "no"; otherwise output "yes". • Completeness: if p is monotone increasing, the test passes w.h.p.
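A hedged Python sketch of a tester in this style (my own rendering, not the authors' algorithm; the rejection threshold `slack` and the brute-force search over all triples are simplifications of mine).

def looks_monotone_increasing(samples, N, slack):
    # Build the empirical PDF p_hat (as DKW licenses), then search for a "violation":
    # an interval [a, b] whose average p_hat-value exceeds the average over the adjacent
    # interval [b+1, c] by more than `slack`. Reject ("no") if any violation is found.
    m = len(samples)
    p_hat = [0.0] * (N + 1)
    for s in samples:
        p_hat[s] += 1.0 / m
    pre = [0.0] * (N + 1)                      # prefix sums, so averages are O(1)
    for x in range(1, N + 1):
        pre[x] = pre[x - 1] + p_hat[x]
    def avg(a, b):
        return (pre[b] - pre[a - 1]) / (b - a + 1)
    # Brute-force search over triples for clarity; the real tester only needs to look
    # at the interval endpoints produced by the DKW partition, not all of [N]^3.
    for a in range(1, N + 1):
        for b in range(a, N):
            for c in range(b + 1, N + 1):
                if avg(a, b) > avg(b + 1, c) + slack:
                    return False               # "no": evidence against monotone increasing
    return True                                # "yes"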

  35. Soundness • Algorithm: Run [DKW] with accuracy as above; let p̂ be the resulting empirical PDF. If there exist a ≤ b < c such that the average of p̂ over [a, b] exceeds the average of p̂ over [b+1, c] by more than the allowed slack, then output "no"; otherwise output "yes". • Soundness lemma: If a k-modal distribution has no such violating triple (a, b, c), then it is close to monotone increasing. To prove the soundness lemma: show that under the lemma's hypothesis, each peak/valley can be "corrected" by "spending" only a small amount of variation distance, small enough that the k corrections together cost at most O(ε).

  36. Correcting a peak of p • Lemma: If a k-modal distribution has no such violating triple, then it is close to monotone increasing. • Consider a peak of p: draw a horizontal line at a height chosen so that (mass of the "hill" above the line) = (missing mass of the "valley" below the line). • Correct the peak by bulldozing the hill into the valley: replace the hill-and-valley stretch by the constant value at that height.
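A small sketch of the bulldozing step, under my own simplifying assumption that the window [a, c] consists of exactly one hill followed by one valley (index 0 of the list is unused padding; the function name bulldoze_peak is mine).

def bulldoze_peak(pdf, a, c):
    # Flatten a hill-then-valley stretch pdf[a..c]: the balancing height t (hill mass
    # above t equals missing valley mass below t) is exactly the average value on [a, c]
    # when the whole window is replaced by the constant t, so total mass is preserved.
    t = sum(pdf[a:c + 1]) / (c - a + 1)
    out = list(pdf)
    for x in range(a, c + 1):
        out[x] = t
    cost = sum(abs(out[x] - pdf[x]) for x in range(a, c + 1)) / 2.0   # variation distance spent
    return out, cost

# toy example over {1,...,9}: a peak around positions 3-4 followed by a valley at 5-6
pdf = [0.0, 0.02, 0.08, 0.20, 0.14, 0.02, 0.04, 0.12, 0.18, 0.20]
fixed, cost = bulldoze_peak(pdf, 3, 6)
print([round(v, 3) for v in fixed], round(cost, 3))   # fixed is monotone increasing; cost = 0.14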

  37. Why it works • Lemma: If a k-modal distribution has no such violating triple, then it is close to monotone increasing. • After a correction, the stretch around the former peak is constant, so that peak no longer violates monotonicity; and since no violating triple exists, the hill cannot hold much excess mass over the adjacent valley, so each correction moves only a small amount of mass. Summing the cost over the (at most k) peaks/valleys gives the lemma.

  38. Summary • Sample- and time-efficient algorithms for learning and testing k-modal distributions over [N]. • Upper bounds pretty close to lower bounds for these problems. • Testing is easier than learning. • Learning algorithms have a testing component.

  39. Future work • More efficient algorithms for restricted classes of k-modal distributions? • [DDS11]: any sum of n independent Bernoulli random variables is learnable using a number of samples independent of n; such a sum is a special type of unimodal distribution, a "Poisson Binomial Distribution".

  40. Thank you

  41. Key ingredient: oblivious decomposition Decompose [N] into intervals whose widths increase as powers of (1+ε), so there are about log(N)/ε of them. Call these the oblivious buckets.

  42. Flattening a monotone distribution using the oblivious decomposition Given a monotone decreasing distribution p, the flattened version of p spreads p's weight uniformly within each bucket of the oblivious decomposition (figure: true pdf vs. flattened version). Lemma [B87]: for any monotone decreasing distribution p, the flattened version is O(ε)-close to p in total variation distance.

  43. Learning monotone distributions using the oblivious decomposition [B87] Reduce learning monotone distributions over [N] to accuracy ε to learning arbitrary distributions over an O(log(N)/ε)-element domain to accuracy ε. Algorithm: • Draw samples from p. • Output hypothesis: the flattened empirical distribution, i.e. spread each bucket's empirical mass uniformly within that bucket. • View the flattened distribution as an arbitrary distribution over the O(log(N)/ε)-element set of buckets. Analysis: learning an arbitrary distribution over ℓ elements takes O(ℓ/ε²) samples, and flattening costs only O(ε) extra error, which gives the O(log(N)/ε³) bound.
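Combining the decomposition, the flattening, and the learning step above, a minimal Python sketch (my own code; the exact bucket widths and constants in [B87] may differ from the simple (1+eps)^j rule used here).

import math

def oblivious_buckets(N, eps):
    # Intervals of {1,...,N} whose widths grow like (1+eps)^j; about log(N)/eps of them.
    buckets, left, width = [], 1, 1.0
    while left <= N:
        right = min(N, left + int(math.floor(width)) - 1)
        buckets.append((left, right))
        left, width = right + 1, width * (1.0 + eps)
    return buckets

def birge_style_learner(samples, N, eps):
    # Estimate each bucket's mass empirically, then spread it uniformly within the bucket.
    # For a monotone decreasing p this is a good hypothesis because (i) the flattened
    # version of p is O(eps)-close to p, and (ii) only about log(N)/eps bucket masses
    # need to be estimated.
    m = len(samples)
    h = [0.0] * (N + 1)
    for (a, b) in oblivious_buckets(N, eps):
        mass = sum(1 for s in samples if a <= s <= b) / m
        for x in range(a, b + 1):
            h[x] = mass / (b - a + 1)
    return h    # h[1..N] is the hypothesis PDF (index 0 unused)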

  44. Testing monotone distributions using the oblivious decomposition Can use the learning algorithm to get an O(log(N)/ε³)-sample algorithm for the testing problem. But we can do better by using the oblivious decomposition directly: testing equality of monotone distributions over [N] to accuracy ε reduces to testing equality of arbitrary distributions over the O(log(N)/ε) buckets (q: known monotone distribution over [N] becomes a known distribution over the buckets; p: unknown monotone distribution over [N] becomes an unknown distribution over the buckets). Using [BFFKRW02], this gives a testing algorithm whose sample complexity grows roughly like the square root of the number of buckets, i.e. roughly √(log N) up to poly(1/ε) factors. Can also show a lower bound for any tester.
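A sketch of this reduction in the same style (my illustration; identity_tester is an assumed black box playing the role of the [BFFKRW02] tester over the small bucket domain, and is not implemented here).

import bisect
import math

def test_monotone_identity(samples_from_p, q_pdf, N, eps, identity_tester):
    # Reduce "is p = q?" for monotone p, q over [N] to identity testing over the
    # O(log(N)/eps) oblivious buckets: map each p-sample to its bucket index, and
    # collapse the known q onto the buckets the same way.
    buckets, left, width = [], 1, 1.0
    while left <= N:                                   # same buckets as in the sketch above
        right = min(N, left + int(math.floor(width)) - 1)
        buckets.append((left, right))
        left, width = right + 1, width * (1.0 + eps)
    starts = [a for (a, _) in buckets]
    reduced_samples = [bisect.bisect_right(starts, s) - 1 for s in samples_from_p]
    reduced_q = [sum(q_pdf[x] for x in range(a, b + 1)) for (a, b) in buckets]
    return identity_tester(reduced_samples, reduced_q, eps)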

  45. [BKR04] implicitly gave an O(log²(N)·log log(N)/ε⁵)-sample algorithm for learning a monotone distribution.
