Sublinear Algorithms

Artur Czumaj
DIMAP and Department of Computer Science
University of Warwick

What can we do in sublinear time?
• Common knowledge: nothing. How can we say anything about the input if we don't have time to read it?
• Negative results: if I have $10^{12}$ numbers and I want to verify whether one of them is even, I need to check each number (in the worst case)
• On the other hand, statistical studies tell us that sometimes we can make approximate claims
• Positive experience: election forecasts give fairly good predictions of the election results without counting all the votes

Why do we want sublinear time?
• Increasing role of computer/digital technologies in all aspects of our life → we are overwhelmed with information to be processed
• "Massive data" a decade ago, "BIG DATA" now (structures with billions of nodes)

Why do we want sublinear time?
• Modern data sets are frequently prohibitively huge
• Examples of modern big data sets:
  • Packet transactions in routers
  • Credit card transactions
  • Internet traffic logs, clickstreams
  • Web data
  • …
• Even linear-time algorithms are too slow
• If data is of size $10^{15}$, how do we process it?

Why do we want sublinear time?
• When dealing with such big data, it is critically important not only to be able to analyze it, but to analyze it very efficiently
• In many emerging applications we have to cope with inputs of enormous size → managing and analyzing such data sets forces us to re-examine the traditional notions of efficiency
• What is often needed: sublinear algorithms, that is, algorithms that use resources (time and/or storage) significantly smaller than the input size

How can we achieve sublinear time?
• We can't read the whole input
• We can get approximate solutions only (in most non-trivial cases)
• We need randomized algorithms (in almost all non-trivial cases)

How can we achieve sublinear time?
• We can do random sampling to obtain partial information about the input
• The answer is (typically) approximate
• Refinement: we can do adaptive sampling, or even define some stochastic process to select part of the input

How can we achieve sublinear time?
Classical (early/easy) results:
• approximate counting of elements from $\{1, \dots, W\}$
  • how many votes went to Obama / to Romney
• approximate the median
• approximate the average of elements from $\{1, \dots, W\}$
All these are easy problems: can we deal with more complex ones?
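
The classical sampling results can be illustrated with a minimal sketch. It assumes the input supports uniform random access; the names `estimate_fraction` and `samples_needed` are hypothetical, and the sample size comes from the standard Hoeffding bound rather than from the slides.

```python
import math
import random


def samples_needed(eps, delta):
    """Hoeffding bound: this many samples give additive error <= eps
    with probability >= 1 - delta, independent of the input size."""
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))


def estimate_fraction(items, predicate, num_samples, rng=None):
    """Estimate the fraction of items satisfying `predicate`
    by probing `num_samples` uniformly random positions."""
    rng = rng or random.Random()
    hits = 0
    for _ in range(num_samples):
        # Sampling with replacement: each probe reads one input position.
        if predicate(items[rng.randrange(len(items))]):
            hits += 1
    return hits / num_samples


if __name__ == "__main__":
    votes = ["A"] * 600_000 + ["B"] * 400_000  # synthetic ballots
    k = samples_needed(eps=0.02, delta=0.01)
    print(k, estimate_fraction(votes, lambda v: v == "A", k))
```

The number of probes depends only on the accuracy and confidence parameters, which is what makes the approach sublinear for large inputs.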

Plan of the talk • A few examples of non-trivial sublinear-time algorithms

Searching
• Input: key and $n$ numbers
• Is key among the numbers?
• Key factor: input representation
  • Numbers are in an unsorted array/list
  • Numbers are in a sorted array
  • Numbers are in a sorted list

Searching
• Input: key and $n$ numbers
• Is key among the numbers?
• Key factor: input representation
  • Numbers are in an unsorted array/list: $\Theta(n)$ time necessary
  • …
  • …

Searching
• Input: key and $n$ numbers
• Is key among the numbers?
• Key factor: input representation
  • …
  • Numbers are in a sorted array: $\Theta(\log n)$ time necessary
  • …

Searching
• Input: key and $n$ numbers
• Is key among the numbers?
• Key factor: input representation
  • …
  • …
  • Numbers are in a sorted list: more tricky

Searching in a sorted list
• If we don't have access to intermediate elements in the list: hopeless, $\Omega(n)$ time necessary
• What if we have "random" access to intermediate elements?

Searching in a sorted list
• Access to intermediate elements
• All elements are distinct
[Figure: example of a sorted list with access to intermediate elements]
We can do better than linear time!

Searching in a sorted list
• Pick $\sqrt{n}$ random elements $x_1, \dots, x_{\sqrt{n}}$ (this splits the list into sublists)
• Wlog, $x_1 < x_2 < \dots < x_{\sqrt{n}}$
• Check if there is $i$ such that $x_i = \text{key}$
• Else, find $i$ such that $x_i < \text{key} < x_{i+1}$ (check which sublist can contain key)
• Start traversing the sorted list from $x_i$ until either key is found or we reach $x_{i+1}$ (traverse the unique sublist that can contain key)
Correctness is trivial. What's the runtime?

Searching in a sorted list
• Pick $\sqrt{n}$ random elements: $O(\sqrt{n})$ time
• Find the sampled element $x_i$ closest to key from below: $O(\sqrt{n})$ time (since we don't need to sort)
• Traverse the sorted list from $x_i$ until either key is found or we reach $x_{i+1}$: expected $O(\sqrt{n})$ time
Expected running time is $O(\sqrt{n})$. Cannot be improved.
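
The search procedure can be sketched as follows. The representation is an assumption made for illustration: cell `i` of an array stores a value in `values[i]` and the array index of its successor in sorted order in `next_idx[i]` (with `-1` at the end), and `head` is the cell of the smallest element. The function names are hypothetical. Instead of sorting the sample, the sketch only keeps the sampled cell with the largest value not exceeding key.

```python
import math
import random


def build_sorted_list(sorted_values, rng):
    """Scatter a non-empty sorted sequence over array cells in random order,
    linking each cell to its successor. Returns (values, next_idx, head)."""
    n = len(sorted_values)
    cells = list(range(n))
    rng.shuffle(cells)  # cells[r] = array cell holding the r-th smallest value
    values, next_idx = [None] * n, [-1] * n
    for r, c in enumerate(cells):
        values[c] = sorted_values[r]
        next_idx[c] = cells[r + 1] if r + 1 < n else -1
    return values, next_idx, cells[0]


def search_sorted_list(values, next_idx, head, key, rng=None):
    """Return the array cell holding `key`, or None if it is absent.
    Assumes all values are distinct."""
    rng = rng or random.Random()
    n = len(values)
    if n == 0 or values[head] > key:
        return None
    start = head  # best known cell with value <= key
    for _ in range(math.isqrt(n) + 1):
        i = rng.randrange(n)
        if values[start] < values[i] <= key:
            start = i
    # Walk the (short, in expectation) sublist that can contain key.
    cur = start
    while cur != -1 and values[cur] < key:
        cur = next_idx[cur]
    return cur if cur != -1 and values[cur] == key else None


if __name__ == "__main__":
    rng = random.Random(1)
    values, next_idx, head = build_sorted_list(list(range(0, 2000, 2)), rng)
    print(search_sorted_list(values, next_idx, head, 1000, rng))
    print(search_sorted_list(values, next_idx, head, 1001, rng))
```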

Searching in a sorted list
• Access to intermediate elements
• What if NOT all elements are distinct?
• Let key = 2. Distinguish between two inputs (with #1s ≈ #3s):
  1, 1, 1, …, 1, 3, 3, 3, …, 3
  1, 1, 1, …, 1, 2, 3, 3, …, 3
• $\Omega(n)$ time is necessary

Searching in a sorted list
• Nontrivial application:
  • Input: two convex polygons given as "chains" (sequences of consecutive points)
  • Output: do these two polygons intersect?
• Chazelle et al. used the searching algorithm to solve the problem in $O(\sqrt{n})$ time

If the points of each polygon are stored in an array in arbitrary order and each polygon is represented by a list, then we can detect whether the two (convex) polygons intersect in $O(\sqrt{n})$ time

Searching in a sorted list
• Chazelle et al. used a similar approach to get $O(\sqrt{n})$-time algorithms for a number of other geometric problems

Sublinear time graph algorithms

Average degree in a graph
• Given a connected graph $G$ with $n$ vertices
• We have access (oracle) to the degree of each single vertex
• What is the average degree of $G$? (Equivalently: estimate the number of edges)
• Can we do it in $o(n)$ time?

Related problem
• Given $n$ integers from a bounded interval
• Estimate their average
• Can we get a 7-approximation in $o(n)$ time? NO!
• How to distinguish between the following two inputs:
  • All numbers are 1 (average is 1)
  • 8 numbers are large (of order $n$) and the other $n - 8$ numbers are 1 (average is greater than 7)
• $\Omega(n)$ time necessary

Related problem
• Given $n$ integers from a bounded interval
• Estimate their average
• Can we get a 7-approximation in $o(n)$ time?
• If this problem requires $\Omega(n)$ time, then can we estimate the average degree faster?
• Remember that the input graph is connected
• Feige gave a $(2+\varepsilon)$-approximation algorithm running in $O(\sqrt{n}/\varepsilon)$ time

Graphs versus numbers
• The reason for the lower bound for numbers:
  • Large numbers can hide
• But in a graph:
  • Vertices with large degrees cannot hide

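Feige's algorithm*
• Repeat $O(1/\varepsilon)$ times: sample a set $S$ of $\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon)$ vertices i.u.r. (independently and uniformly at random)
• For each sample set, compute the average degree of the sampled vertices
• Return the smallest average degree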
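
The loop on this slide can be sketched as below. This is an illustration only: `num_rounds` and `sample_size` are hypothetical parameters supplied by the caller rather than the constants from the talk, and `degree` stands for the degree oracle. The star graph in the usage example shows why plain sampling can underestimate by a factor close to 2: the single high-degree vertex is almost never sampled.

```python
import math
import random


def estimate_average_degree(degree, n, num_rounds, sample_size, rng=None):
    """Return the smallest sample-average degree over `num_rounds` rounds.
    `degree(v)` is the degree oracle for vertices 0..n-1."""
    rng = rng or random.Random()
    best = math.inf
    for _ in range(num_rounds):
        # One round: sample vertices independently and uniformly at random.
        total = sum(degree(rng.randrange(n)) for _ in range(sample_size))
        best = min(best, total / sample_size)
    return best


if __name__ == "__main__":
    n = 10_000
    star_degree = lambda v: n - 1 if v == 0 else 1  # star: vertex 0 is the centre
    true_avg = 2 * (n - 1) / n
    est = estimate_average_degree(star_degree, n, num_rounds=8,
                                  sample_size=math.isqrt(n), rng=random.Random(1))
    print(true_avg, est)
```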

Notation
• $d$ = average degree
• $S$ = sampled set of vertices
• $d_S$ = average degree of vertices in $S$

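Easy upper bound
• Clearly, $\mathbb{E}[d_S] = d$ (where $d$ is the average degree and $d_S$ the average degree in the sample $S$)
• Hence, Markov inequality yields: $\Pr[d_S \ge (1+\varepsilon)\,d] \le \frac{1}{1+\varepsilon}$
• Random sampling won't overestimate
• We'll take $O(1/\varepsilon)$ samples $S$, so we expect the smallest $d_S$ to be smaller than $(1+\varepsilon)\,d$
Our next goal: show that it won't underestimate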

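Why we won't underestimate
• Goal: prove $d_S \ge (\frac{1}{2} - \varepsilon)\,d$ with good probability
• $H$ = the $\sqrt{\varepsilon n}$ vertices with highest degree
• $L = V \setminus H$
• Claim 1: sum of degrees of vertices in $L$ $\ge (\frac{1}{2} - \varepsilon) \cdot$ sum of degrees of vertices in $V$
• Intuition: random sample won't take any vertex from $H$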

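Sum of degrees of vertices in $L$
• $|H| = \sqrt{\varepsilon n}$ → there are at most $|H|^2/2 = \varepsilon n/2$ edges between vertices in $H$
• Every other edge has $\ge 1$ endpoint in $L$
• Sum of the degrees of vertices in $L$ $\ge m - \varepsilon n/2 \ge (\frac{1}{2} - \varepsilon) \cdot 2m$, where $m$ is the number of edges (using $m \ge n - 1$ for a connected graph)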

Why we won't underestimate
• Bound for average degree in $L$ ⟹ bound for expected average degree in $S$
• Chernoff-Hoeffding bound gives $d_S \ge (\frac{1}{2} - \varepsilon)\,d$ with good probability
To use Chernoff-Hoeffding we needed a good upper bound for the maximum degree of a node in $L$ (and that's why we treated $H$ and $L$ separately)

Summarizing
• Average degree $d_S$ in $S$ satisfies $(\frac{1}{2} - \varepsilon)\,d \le d_S \le (1+\varepsilon)\,d$
• ⟹ $(2+\varepsilon)$-approximation

Average degree in graphs
Feige: a $(2+\varepsilon)$-approximation algorithm running in $O(\sqrt{n}/\varepsilon)$ time

Average degree in graphs
• Feige also proved that
  • $\Omega(\sqrt{n})$ time is necessary
  • no $o(n)$-time algorithm can get a $(2-\varepsilon)$-approximation
Goldreich and Ron "improved" it!

Neighbourhood model
• Goldreich & Ron: we can also access a neighbour of a vertex (access to individual edges and their endpoints)
• $(1+\varepsilon)$-approximation in $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$ time

Ideas of improvement
• Why did Feige get only a 2-approximation?
  • Random sample got only nodes from $L$ (the low-degree vertices; $H$ denotes the few highest-degree vertices)
  • An edge between a node in $L$ and a node in $H$ contributed only 1 to the degree sum
  • and should have contributed 2
• Goldreich and Ron count twice each edge between a node in $L$ and a node in $H$
  • each edge with both nodes in $L$ is seen twice;
  • each edge with one node in $L$ is seen once;
  • each edge with both nodes in $H$ is not seen.
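
The counting argument can be checked exactly on a small graph. In this sketch the split into high-degree vertices (called `high` here, a hypothetical name for the small set of highest-degree vertices) and the remaining low-degree vertices is given explicitly, and nothing is sampled. It only verifies the identity behind the idea: if every edge from a low vertex to a high vertex is counted twice, the low vertices account for every edge except those inside the high set.

```python
def corrected_degree_sum(adj, high):
    """Sum over low vertices v of: 1 per edge to a low neighbour,
    2 per edge to a high neighbour. `adj` maps vertex -> list of neighbours."""
    total = 0
    for v, neighbours in adj.items():
        if v in high:
            continue  # high vertices are never the counting endpoint
        for u in neighbours:
            total += 2 if u in high else 1
    return total


def build_adjacency(edges):
    """Undirected adjacency lists from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)
    return adj


if __name__ == "__main__":
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (3, 4), (0, 4)]
    high = {0, 1}
    inside_high = sum(1 for u, v in edges if u in high and v in high)
    # Both values equal twice the number of edges not inside `high`.
    print(corrected_degree_sum(build_adjacency(edges), high),
          2 * (len(edges) - inside_high))
```

When the high set is small, the edges inside it are few, so the corrected sum is close to the full degree sum $2m$.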

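Idea of the algorithm
• Suppose that we know $H$ (high-degree nodes) and $L$ (the remaining nodes)
• Randomly sample nodes to $S$
• For each vertex $v \in S$:
  • Count edges from $v$ to nodes in $L$ → each counted once
  • Count edges from $v$ to nodes in $H$ → each counted twice
• … but we don't know $H$:
  • Choose nodes at random
  • If $d^*$ is the max-degree in this set, then "set" $H$ = nodes of degree larger than $d^*$
• Isn't this too expensive? The two counts are estimated by random sampling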

Average degree in graphs: neighbourhood model
Goldreich and Ron ('06) gave a $(1+\varepsilon)$-approximation algorithm running in $\tilde{O}(\sqrt{n} \cdot \mathrm{poly}(1/\varepsilon))$ time

Next graph problem

Minimum Spanning Tree (MST)
• $G = (V, E)$: undirected connected weighted graph, $w: E \to \mathbb{R}$
• MST Problem:
  • Find a minimum spanning tree (MST) of $G$

Minimum Spanning Tree (MST)
• Classical problem:
  • Well-studied
  • $O(m)$-time randomized algorithm (Karger, Klein, Tarjan '94)
  • Unknown if we can solve it in deterministic $O(m)$ time
  • Best known: $O(m\,\alpha(m, n))$ runtime (Chazelle '97)

Estimating the weight of MST
• If we don't want to find the MST, only its weight, we can (sometimes) do better
• Chazelle, Rubinfeld, Trevisan '01:
  • $G$ is represented by adjacency lists
  • Average degree is $d$
  • All weights are known to be in the interval $[1, W]$
  • Randomized $(1+\varepsilon)$-approximation of the weight of the MST in $O(d\,W\,\varepsilon^{-2} \log(dW/\varepsilon))$ time
• Sublinear if $d$ and $W$ are small
  • even constant if $d$, $W$ and $\varepsilon$ are constant
• Doesn't have to read the entire input
• … but might be slow if either $d$ or $W$ is large

Idea behind the algorithm
• Characterize the MST weight in terms of the number of connected components in certain auxiliary graphs
• Show that the number of connected components can be approximated quickly

MST weight vs. #Connected Components
• $W = 2$ is the largest weight
[Figure: example graph with edge weights 1 and 2]

MST weight vs. #Connected Components
• $W = 2$ is the largest weight
• There are $c = 4$ connected components induced by weight 1 edges
[Figure: example graph with edge weights 1 and 2]

MST weight vs. #Connected Components
• $W = 2$ is the largest weight
• There are $c = 4$ connected components induced by weight 1 edges
• MST must have $c - 1 = 3$ edges of weight 2
[Figure: example graph with edge weights 1 and 2]

MST weight vs. #Connected Components
• $c_i$ = number of connected components induced by edges of weight at most $i$
• Then we get for arbitrary $W$: $\mathrm{MST} = n - W + \sum_{i=1}^{W-1} c_i$
(assuming all weights are integers between $1$ and $W$)
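
A short derivation sketch of this characterization, under the slide's assumption of integer weights between 1 and $W$. The symbol $\alpha_j$ (number of MST edges of weight $j$) and the convention $c_0 = n$ are introduced here for illustration only. Running Kruskal's algorithm shows that the MST has exactly $c_i - 1$ edges of weight greater than $i$, where $c_i$ is the number of connected components induced by edges of weight at most $i$, and therefore:

```latex
\begin{align*}
\mathrm{MST}
  &= \sum_{j=1}^{W} j\,\alpha_j
   = \sum_{i=0}^{W-1} \sum_{j>i} \alpha_j
   = \sum_{i=0}^{W-1} (c_i - 1) \\
  &= (n - 1) + \sum_{i=1}^{W-1} c_i - (W - 1)
   = n - W + \sum_{i=1}^{W-1} c_i .
\end{align*}
```

For $W = 2$ this reads $\mathrm{MST} = n - 2 + c_1$: the tree has $n - 1$ edges, and $c_1 - 1$ of them cost one unit more than the rest.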

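The Algorithm
ApproxMST($G$, $W$)
  for $i = 1$ to $W - 1$ do
    $\hat{c}_i$ = CountConnectedComponents($G_i$), where $G_i$ is the subgraph of edges of weight at most $i$
  Output: $n - W + \sum_{i=1}^{W-1} \hat{c}_i$
How to compute/approximate the number of connected components?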
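
The structure of the algorithm can be sketched as below. This is not the sublinear algorithm itself: the component counter used here is exact (union-find over all edges) and serves only as a stand-in, passed as the hypothetical parameter `counter`, for the fast approximate counter that the slide asks about. The sketch checks the output against Kruskal's algorithm.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def count_components(n, edges, max_weight):
    """Exact number of connected components of the graph on vertices 0..n-1
    restricted to edges (u, v, w) with w <= max_weight."""
    parent = list(range(n))
    components = n
    for u, v, w in edges:
        if w <= max_weight:
            ru, rv = _find(parent, u), _find(parent, v)
            if ru != rv:
                parent[ru] = rv
                components -= 1
    return components


def approx_mst_weight(n, edges, W, counter=count_components):
    """MST weight as n - W + sum of component counts for thresholds 1..W-1.
    Assumes a connected graph with integer weights in 1..W."""
    return n - W + sum(counter(n, edges, i) for i in range(1, W))


def kruskal_weight(n, edges):
    """Reference MST weight via Kruskal's algorithm."""
    parent = list(range(n))
    total = 0
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = _find(parent, u), _find(parent, v)
        if ru != rv:
            parent[ru] = rv
            total += w
    return total


if __name__ == "__main__":
    edges = [(0, 1, 1), (1, 2, 2), (2, 3, 1), (3, 4, 3), (4, 0, 2), (1, 3, 3)]
    print(approx_mst_weight(5, edges, W=3), kruskal_weight(5, edges))
```

Swapping `counter` for an estimator that inspects only a few vertices and their neighbourhoods is where the sublinear running time would come from.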
