Improve sketching of Hamming Distance with Error Correcting

Ely Porat (Bar-Ilan University, Google Inc)
Ohad Lipsky (Bar-Ilan University, Check Point Inc)
December 2003

Problem Definition (1)
Alice holds TA and Bob holds TB, each of length n. The goal is to compute hamm(TA,TB), given k, a bound on the number of mismatches.

Problem Definition (2)
A sketching function S maps TA and TB (each of length n) to short sketches SA and SB.
• Calculate hamm(TA,TB) given only SA, SB.
• Find the mistakes.
• Given: k, a bound on the number of mismatches.

Motivations
• Databases
• Internet
• Error correcting
[Figure: Routers A, B, C and D]

Outline:
• Simple Solution
• Error Correcting
• Improved Solution
• Improve more
• Recursion
• File sharing

Simplest Solution - O(k^2 log(1/δ))
• Binary alphabet.
• Allocate k^2 cells.
• Hash each bit of the input array to one of the cells.
• In each cell, keep the XOR of all the values hashed to it.

Simplest Solution - O(k^2 log(1/δ))
• By the birthday paradox, the probability that two errors fall into the same cell is < 1/2.
• Repeating log(1/δ) times brings the probability of failure down to δ.
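The hashing step above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the helper names (`make_hash`, `xor_sketch`, `estimate_hamming`), the use of a seeded `random.Random` as the shared hash function, and the choice to combine repetitions by taking the maximum are all assumptions made for the example. Collisions can only hide mismatches, so the count of differing cells never exceeds the true distance.

```python
import random

def make_hash(n, num_cells, seed):
    """Hypothetical shared hash: maps each position 0..n-1 to a cell.
    Alice and Bob are assumed to agree on the seed in advance."""
    rng = random.Random(seed)
    return [rng.randrange(num_cells) for _ in range(n)]

def xor_sketch(bits, cell_of, num_cells):
    """Keep, in each cell, the XOR of all bits hashed to it."""
    cells = [0] * num_cells
    for i, b in enumerate(bits):
        cells[cell_of[i]] ^= b
    return cells

def estimate_hamming(ta, tb, k, repetitions=8):
    """Compare sketches only; repeat with fresh hashes and keep the max."""
    n = len(ta)
    num_cells = max(1, k * k)
    best = 0
    for seed in range(repetitions):
        cell_of = make_hash(n, num_cells, seed)
        sa = xor_sketch(ta, cell_of, num_cells)  # Alice's sketch
        sb = xor_sketch(tb, cell_of, num_cells)  # Bob's sketch
        differing = sum(x != y for x, y in zip(sa, sb))
        best = max(best, differing)
    return best
```

In a real protocol each party would compute its own sketch and send only the cells; the function above computes both sides in one place purely for demonstration.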

Alphabet
• Denote by S the size of the alphabet.
• We can encode each letter by its unary representation.
• The only effect is that each mistake will be counted twice.
0 - 1000000….0
1 - 0100000….0
.
S-1 - 0000000….1
Example: 0 - 1000000….0, 5 - 0000010….0
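A small Python sketch of the unary encoding, to show why each mismatch is counted twice. The function names and the representation of letters as integers in 0..S-1 are assumptions made for the example.

```python
def unary_encode(text, alphabet_size):
    """Replace each letter c (an int in 0..S-1) by S bits with a single 1 at position c."""
    bits = []
    for symbol in text:
        block = [0] * alphabet_size
        block[symbol] = 1
        bits.extend(block)
    return bits

def hamming(a, b):
    """Number of positions where a and b differ."""
    return sum(x != y for x, y in zip(a, b))

ta = [0, 5, 2]
tb = [0, 3, 2]          # one mismatch, at position 1
ea = unary_encode(ta, 6)
eb = unary_encode(tb, 6)
print(hamming(ta, tb))  # 1
print(hamming(ea, eb))  # 2
```

A mismatch between letters 5 and 3 turns off one bit and turns on another, so the binary distance is exactly twice the original one.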

Error correcting - O(k^2 log NS)
• Here we allocate two kinds of k^2 cells: k^2 cells of log S bits and k^2 cells of log NS bits.
C1[h(A[i])] += A[i]     (example cells: 5 8 3 2)
C2[h(A[i])] += i*A[i]   (example cells: 15 6 7 8)

Error correcting - O(k^2 log NS)
• As before, with probability > 1/2 no two errors fall into the same cell.
C1[h(A[i])] += A[i]     (example cells: 5 8 3 2)
C2[h(A[i])] += i*A[i]   (example cells: 15 6 7 8)

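Error correcting - O(k^2 log NS)
• From the red cells, C1[h(A[i])] += A[i], we get the difference of the mismatched values.
Alice's cells: 5 8 3 2 (her value at the mismatch is 5)
Bob's cells: 5 6 3 2 (his value at the mismatch is 3)
8 - 6 = 5 - 3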

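Error correcting - O(k^2 log NS)
• From the blue cells, C2[h(A[i])] += i*A[i], we get the position of the mismatch.
The cell that differs holds i*(5 - 3), the position times the value difference found in the red cells. Dividing by 5 - 3 recovers the position: i = 2 in the slide's example, where Alice's value 5 sits at index 2.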

Error correcting - O(k^2 log NS)
• The probability of success is about 1/2.
• To lower the failure probability we run it 3 times.
• Each run yields a list of possible mistakes.
• Output all the mistakes that appear in at least 2 of the 3 runs.
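The two-counter scheme (a value counter C1, an index-weighted counter C2, and a vote over three runs) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it hashes the position i to a cell, uses unbounded Python integers instead of cells of log S and log NS bits, and models the shared hash with a seeded `random.Random`. All function names are hypothetical.

```python
import random
from collections import Counter

def build_cells(text, cell_of, num_cells):
    """C1 accumulates values, C2 accumulates index * value, per cell."""
    c1 = [0] * num_cells
    c2 = [0] * num_cells
    for i, value in enumerate(text):
        c1[cell_of[i]] += value
        c2[cell_of[i]] += i * value
    return c1, c2

def recover_once(sketch_a, sketch_b, cell_of, n):
    """Return candidate (position, a_i - b_i) pairs from one pair of sketches."""
    (a1, a2), (b1, b2) = sketch_a, sketch_b
    found = set()
    for cell in range(len(a1)):
        d1 = a1[cell] - b1[cell]   # a_i - b_i if the cell holds one mismatch
        d2 = a2[cell] - b2[cell]   # i * (a_i - b_i) in that case
        if d1 == 0 or d2 % d1 != 0:
            continue               # no mismatch here, or a detectable collision
        i = d2 // d1
        if 0 <= i < n and cell_of[i] == cell:
            found.add((i, d1))
    return found

def recover_mismatches(ta, tb, k, runs=3):
    """Keep the candidates reported by at least 2 of the runs."""
    n = len(ta)
    num_cells = max(1, k * k)
    votes = Counter()
    for seed in range(runs):
        rng = random.Random(seed)  # assumed shared randomness
        cell_of = [rng.randrange(num_cells) for _ in range(n)]
        sa = build_cells(ta, cell_of, num_cells)
        sb = build_cells(tb, cell_of, num_cells)
        votes.update(recover_once(sa, sb, cell_of, n))
    return sorted(m for m, count in votes.items() if count >= 2)

print(recover_mismatches([5, 8, 5, 2], [5, 8, 3, 2], k=2))  # [(2, 2)]
```

With a single mismatch the cell containing it always yields the exact position, whatever the hash; collisions between several mismatches are what the vote across runs is meant to filter out.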

O(k log^2 k) - Solution
• The idea is a two-stage hash: first hash into k/log k buckets, so that w.h.p. each bucket receives O(log k) mismatches.
(Bar-Yossef, Jayram, Kumar, Sivakumar 03)

O(k log^2 k) - Solution
• Inside each bucket, which holds O(log k) mismatches, hash into O(log^2 k) cells and keep the accumulated XOR in each cell.
• The probability of failure is less than 1/2.
• Run it 2 log k times and take the max => failure probability less than 1/k^2.
• Space per bucket = O(log^3 k).
(Bar-Yossef, Jayram, Kumar, Sivakumar 03)

O(k log^2 k) - Solution
• k/log k buckets of O(log^3 k) space each give O(k log^2 k) in total.
• P(Failure) ≤ (k/log k) * (1/k^2) < 1/k
(Bar-Yossef, Jayram, Kumar, Sivakumar 03)

O(k 2^(log* k) log k) - Idea (recursion)
• k/log k buckets
• Pr(F) < 1/log^c k
• log k/log log k
• log k/log log k runs, take max

Error Correcting O(k log NS)
[Figure: Alice (TA) and Bob (TB), each of length n; r0, r1, r2, …; p = Θ(N^3 S); constant probability]

Error Correcting O(k log NS)
[Figure: Alice (TA) and Bob (TB), each of length n]
If we are wrong, then w.h.p. j > n.

Error Correcting O(k log NS)
[Figure: Alice (TA) and Bob (TB), each of length n; rj, aj - bj]

Error Correcting O(k log NS)
[Figure: Alice (TA) and Bob (TB), each of length n; O(k ln k)]

Recursion
[Figure: Alice (TA) and Bob (TB), each of length n; ck; TA and TB, each of length n]

Recursion
[Figure: Alice (TA) and Bob (TB), each of length n; ck; O(k log NS)]

Complexity
[Figure: S maps TA and TB, each of length n, to the sketches SA and SB]
• Size: O(k log NS)
• Computing sketch: O(n log k)
• Comparing sketches: O(k log k)

O(k log k) - Solution
• We can just encode in unary, hash the input to k^3 cells, and then run the O(k log NS) = O(k log k) algorithm.

Reed-Solomon Codes
We managed to develop a deterministic algorithm based on Reed-Solomon codes, but its encoding and decoding are slower.
Amir, Farach 95; Feigenbaum, Ishai, Malkin, Nissim, Strauss, Wright 01; Bar-Yossef, Jayram, Kumar, Sivakumar 03; Efremenko, Porat, Rothschild 06; Efremenko, Porat 07

File Sharing - Napster
[Figure: a source holding a file of n blocks]
• The source needs to stay until someone has the whole file (and is willing to stay).
• There is a bottleneck at the end.

File Sharing - emule/kazaa/torrent
[Figure: a source holding a file of n blocks]
• The source has to send n ln n blocks before disconnecting.
• Sometimes there are bottlenecks.

Improved File Sharing - Ver 1
[Figure: a source holding the n blocks a0 a1 a2 … an-1; label n^6]

Improved File Sharing - Ver 1
[Figure label: n^6]
• Each client that has received n points can recreate the file.
• The n ln n overhead is gone.
• Almost no bottlenecks.
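One way to read "n points recreate the file" is polynomial evaluation and interpolation: treat the blocks a0 … an-1 as coefficients of a polynomial and hand out evaluations at distinct points. The Python sketch below illustrates that reading only. The slides do not spell out the construction, so the polynomial interpretation, the field modulus `P`, and the function names are all assumptions.

```python
P = 2**31 - 1  # a prime modulus chosen for illustration only

def evaluate(blocks, x):
    """Horner evaluation of a0 + a1*x + ... + a_{n-1}*x^(n-1) mod P."""
    result = 0
    for coeff in reversed(blocks):
        result = (result * x + coeff) % P
    return result

def interpolate(points):
    """Recover the n coefficients from n points with distinct x (Lagrange)."""
    n = len(points)
    coeffs = [0] * n
    for j, (xj, yj) in enumerate(points):
        basis = [1]   # coefficients of the product of (x - xm), lowest degree first
        denom = 1
        for m, (xm, _) in enumerate(points):
            if m == j:
                continue
            nxt = [0] * (len(basis) + 1)
            for d, c in enumerate(basis):
                nxt[d] = (nxt[d] - xm * c) % P       # the -xm * c term
                nxt[d + 1] = (nxt[d + 1] + c) % P    # the x * c term
            basis = nxt
            denom = denom * (xj - xm) % P
        scale = yj * pow(denom, P - 2, P) % P        # modular inverse via Fermat
        for d, c in enumerate(basis):
            coeffs[d] = (coeffs[d] + scale * c) % P
    return coeffs

blocks = [7, 1, 4, 9]                                 # a toy "file" of n = 4 blocks
received = [(x, evaluate(blocks, x)) for x in (10, 3, 25, 8)]
print(interpolate(received))                          # [7, 1, 4, 9]
```

Any n distinct points work, which is why it no longer matters which particular packets a client happens to collect.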

Improved File Sharing - Ver 2
[Figure: a source holding the n blocks a0 a1 a2 … an-1]
• Send linear equations on the file.

Improved File Sharing - Ver 2
[Figure: a source holding the n blocks a0 a1 a2 … an-1]
Problems:
1. Heavy to encode: for each packet we need to go over the whole file.
2. Very heavy to decode: O(n^2) block operations + O(n^3) field operations.
Facts:
1. If you get n^(1/2-ε) random combinations of two blocks, you won’t have dependencies w.h.p.
2. If you have d pair combinations, you can easily reduce your system to n-d variables.
Solution: Use sparse functionals.

Improved File Sharing - Ver 2
[Figure: a source holding the n blocks a0 a1 a2 … an-1]
Features:
• Backward compatibility.
• Even if you don’t have the whole file, you can mix functionals.
