Data Compression with finite windows Fiala and Greene

Speaker: Giora Alexandron

Overview:-----------------------
• Our main purpose:
  - See how a suffix tree supports a compression algorithm.
• What we will see:
  - A data compression method that works by substituting text. It uses a modification of the basic suffix tree to support cyclic maintenance of the most recent strings seen in the file.

Outline------------------------
• 1. Compression:
  - In general
  - Our algorithm
• 2. Data structure:
  - Modification of the suffix tree.
• 3. Theoretical considerations:
  - Proofs.
• 4. Improvements.

Compression-------------------------------
• What is compression:
  - Compression is the coding of data to minimize its representation. We will focus on lossless, adaptive, one-pass methods.
• Main approaches:
  - Statistical approach: try to predict the next symbol.
  - Substitutional approach: replace blocks of text with references to earlier occurrences of identical text.
• ** We will focus on a substitutional method **

Compression-cont.------------------------------
• What characterizes a good compressor:
  - Good compression ratio.
  - Fast compression.
  - Minimal use of space.
  - Fast expansion.
• There are trade-offs among all of these.
• Naturally, we want to achieve them all, which takes a good algorithm plus a matching data structure.

Substitutional Compressing---------------------------------------
• Consider the following basic scheme:
• The compressed file contains two types of codewords:
  - literal x: pass the next x characters directly to the output.
  - copy x, y: go back y characters and copy the next x characters, starting at that position.
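The expansion side of this scheme can be sketched in a few lines. This is a minimal sketch, not the authors' code: the tuple representation of codewords and the name `expand` are our own assumptions for illustration, and displacements are given as positive counts.

```python
def expand(codewords):
    """Expand a list of codewords into the original bytes.

    Assumed (hypothetical) representation:
      ("literal", raw_bytes)  -> pass raw_bytes directly to the output
      ("copy", x, y)          -> go back y bytes and copy x bytes from there
    """
    out = bytearray()
    for cw in codewords:
        if cw[0] == "literal":
            out += cw[1]
        else:
            _, length, back = cw
            start = len(out) - back
            # Copy byte by byte so that a copy may overlap its own output.
            for k in range(length):
                out.append(out[start + k])
    return bytes(out)


codewords = [
    ("literal", b"it was the best of times,\n"),
    ("copy", 11, 26),
    ("literal", b"wor"),
    ("copy", 11, 27),
]
print(expand(codewords).decode())
```

Copying one byte at a time is a deliberate choice: it lets a copy reach into the text it is currently producing, so a short pattern can be repeated by a single codeword.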

Example------------------------------------------------
• ..it was the best of times,
• it was the worst of times..
• Would compress to:
• (literal 26) it was the best of times,
• (copy 11 -26) (literal 3) wor (copy 11 -27)
[Diagram: arrows mark the 26-character literal, the copy of 11 characters from 26 back, and the copy of 11 characters from 27 back.]

Example-cont.------------------------------------------------
• And we get a very simple lossless method.
• The compression achieved depends on the size of the copy and literal codewords.
[Diagram: compression turns the original text into the codeword stream of the previous slide; expansion turns the codeword stream back into the original text.]

A1------------------------------------------------------
• The encoding of A1:
  - 8 bits for a literal codeword: 0000xxxx (bits 0..7), where xxxx gives the literal length [1..16].
  - 16 bits for a copy codeword: xxxxyy..yy (bits 0..15), where xxxx gives the copy length [2..16] and yy..yy gives the displacement [1..4096].
  - (Can you figure out the logic behind this?)
• And we get a compression of 51 bytes to 36:
  - (literal 16) it was the best (literal 10) of times,
  - (copy 11 -26) (literal 3) wor (copy 11 -27)
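To check the 51-to-36 count, here is a minimal sketch of the A1 bit layout. The slide gives only the field widths and ranges, so the exact field values are an assumption: we store length-1 and displacement-1, which makes the top four bits zero for a literal and non-zero for a copy. The function names and the tuple codewords are hypothetical.

```python
def pack_literal(data: bytes) -> bytes:
    """One header byte 0000xxxx (assumed: xxxx = length - 1), then the raw bytes."""
    if not 1 <= len(data) <= 16:
        raise ValueError("literal length must be in 1..16")
    return bytes([len(data) - 1]) + data


def pack_copy(length: int, displacement: int) -> bytes:
    """Two bytes xxxxyy..yy (assumed: xxxx = length - 1, yy..yy = displacement - 1)."""
    if not (2 <= length <= 16 and 1 <= displacement <= 4096):
        raise ValueError("copy length must be in 2..16, displacement in 1..4096")
    word = ((length - 1) << 12) | (displacement - 1)
    return word.to_bytes(2, "big")


def encode_a1(codewords) -> bytes:
    """Concatenate the packed codewords into the compressed byte stream."""
    out = bytearray()
    for cw in codewords:
        if cw[0] == "literal":
            out += pack_literal(cw[1])
        else:
            out += pack_copy(cw[1], cw[2])
    return bytes(out)


EXAMPLE = [
    ("literal", b"it was the best "),   # 1 + 16 bytes
    ("literal", b"of times,\n"),        # 1 + 10 bytes
    ("copy", 11, 26),                   # 2 bytes
    ("literal", b"wor"),                # 1 + 3 bytes
    ("copy", 11, 27),                   # 2 bytes
]
print(len(encode_a1(EXAMPLE)))
```

Under this assumed layout a copy of length 1 would have a zero length field, which is exactly the pattern reserved for literals; that is one way to read the [2..16] range on the slide.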

A1’s policy----------------------------
• If the compressor is idle (it has just finished a codeword):
  - look for a copy of length >= 2;
  - otherwise, start a literal.
• If the compressor is in the middle of a literal:
  - extend it until a copy of length >= 3 is found.
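The policy can be sketched with a naive longest-match search standing in for the suffix tree. This is only an illustration under stated assumptions: `longest_match` and `compress` are hypothetical names, the linear scan over the window is far slower than the data structure discussed next, and we assume a literal is closed once it reaches the 16-character maximum.

```python
WINDOW = 4096   # how far back a copy may reach
MAX_LEN = 16    # maximum literal and copy length in A1


def longest_match(data: bytes, pos: int):
    """Naive search: (length, displacement) of the longest match for data[pos:]
    that starts inside the previous WINDOW bytes. Length 0 means no match."""
    best_len, best_disp = 0, 0
    limit = min(MAX_LEN, len(data) - pos)
    for start in range(max(0, pos - WINDOW), pos):
        k = 0
        while k < limit and data[start + k] == data[pos + k]:
            k += 1
        if k > best_len:
            best_len, best_disp = k, pos - start
    return best_len, best_disp


def compress(data: bytes):
    """Apply the idle / in-literal policy and return a list of codewords."""
    out = []
    lit_start = None            # start of the literal in progress, None when idle
    pos = 0
    while pos < len(data):
        length, disp = longest_match(data, pos)
        threshold = 2 if lit_start is None else 3
        if length >= threshold:
            if lit_start is not None:          # close the pending literal first
                out.append(("literal", data[lit_start:pos]))
                lit_start = None
            out.append(("copy", length, disp))
            pos += length
        else:
            if lit_start is None:
                lit_start = pos
            pos += 1
            if pos - lit_start == MAX_LEN:     # literal is full, emit it
                out.append(("literal", data[lit_start:pos]))
                lit_start = None
    if lit_start is not None:
        out.append(("literal", data[lit_start:pos]))
    return out


TEXT = b"it was the best of times,\nit was the worst of times"
for codeword in compress(TEXT):
    print(codeword)
```

On this text the sketch produces the same five codewords as the A1 example: two literals, a copy, a three-character literal, and a final copy.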

Where do we stand?
• 1. Compression: (done)
  - In general
  - Our algorithm
• 2. Data structure: (we are here)
  - Modification of the suffix tree.
• 3. Theoretical considerations:
  - Proofs.

The Data Structure-----------------------------------------
• What do we need?
  - To find the current longest match (for a copy).
• What could we use?
  - Naive solution: a suffix tree with all strings of length <= 16 in the previous 4096-byte window.

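Naive solution---------------------------------
• A suffix tree with all strings of length <= 16 in the previous 4096-byte window.
[Diagram: the 4096-byte window preceding the current position, with 16-character strings starting at successive positions.]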

The cost--------------------------------------------
• If we descended d levels to insert the string starting at position j, we will descend at least d-1 levels to insert the string starting at j+1.
• So the cost of insertion is O(nd) when every insertion starts from the root.
• But we want to eliminate the factor d.
[Diagram: the strings starting at positions j and j+1 inside the 4096-byte window, descending d and d-1 levels.]
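The "at least d-1 levels" observation can be checked on a small example. The sketch below is an assumption-laden stand-in for the naive solution: it uses an uncompressed, depth-limited trie built from nested dictionaries, it never deletes old strings (so the text must be shorter than the window), and `insert_all` is a hypothetical name.

```python
MAX_DEPTH = 16  # only strings of length <= 16 are stored


def insert_all(text: str):
    """Insert text[j:j+MAX_DEPTH] for every j, always starting at the root.

    Returns a list whose j-th entry is the number of existing levels we
    descended before new nodes had to be created for position j.
    """
    root = {}
    descended = []
    for j in range(len(text)):
        node, d = root, 0
        for ch in text[j:j + MAX_DEPTH]:
            if ch in node:
                d += 1              # this level already existed
            else:
                node[ch] = {}       # from here on we only create new nodes
            node = node[ch]
        descended.append(d)
    return descended


TEXT = "it was the best of times,\nit was the worst of times"
depths = insert_all(TEXT)
print(depths[26], depths[27])   # levels descended for the second "it was the "
print(sum(depths))              # total work spent re-descending from the root
```

The sum of the descended depths is the work that restarting at the root costs us; it is this repeated descent that the modifications on the following slides aim to avoid.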

Modifications------------------------------------
• a. Suffix links:
  - Each node representing the string aX has a pointer to the node representing the string X.
• Immediate advantage:
  - We don’t need to return to the root after each insertion.
[Diagram: a suffix link from the node for aX to the node for X.]

Suffix Links------------------------------------
• How we use and create suffix links when inserting .. aXYb ..:
• 1. Create a new node α, and insert b.
• 2. Use the suffix link to insert XYb:
  - a.1 Go up to β, and cross to γ using the suffix link.
  - a.2 Rescan to δ. If δ doesn’t exist, create it! Rescanning means we don’t need to check the string again, but go straight to δ.
  - a.3 Scan from δ to insert XYb.
[Diagram: nodes α, β, γ, δ, with the suffix link from β to γ, the rescan path from γ to δ, and the scan below δ.]

Suffix Links-cont.------------------------------------
• How we use and create suffix links:
• .. aXYb ..
• 1. Create a new node α, and insert b.
• 2. Use the suffix link to insert XYb.
• 3. Add α’s suffix link (to δ).
• And we are finished with the insertion!
• Invariant kept: every internal node has a suffix link (except the one just created).
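To see suffix links at work without the bookkeeping of rescanning, here is a simplified sketch. It is not the compact tree from the slides: it builds an uncompressed suffix trie (one node per character, quadratic space, no depth limit, no window), and the names `Node`, `build_suffix_trie` and `find` are hypothetical. What it keeps is the key idea: after extending the longest suffix we follow suffix links to the shorter ones instead of returning to the root.

```python
class Node:
    def __init__(self):
        self.children = {}   # character -> child node
        self.link = None     # suffix link: node for aX points to node for X


def build_suffix_trie(text: str) -> Node:
    root = Node()
    deepest = root                      # node for the whole text read so far
    for ch in text:
        node, prev_new = deepest, None
        # Walk from the longest suffix to shorter ones via suffix links,
        # adding ch wherever it is missing. We never restart at the root.
        while node is not None and ch not in node.children:
            new = Node()
            node.children[ch] = new
            if prev_new is not None:
                prev_new.link = new     # link the node created one step earlier
            prev_new = new
            node = node.link
        # Link the last node created: either to an existing node or to the root.
        prev_new.link = root if node is None else node.children[ch]
        deepest = deepest.children[ch]
    return root


def find(root: Node, s: str) -> Node:
    """Return the node reached by spelling s from the root."""
    node = root
    for ch in s:
        node = node.children[ch]
    return node


root = build_suffix_trie("abab")
print(find(root, "bab").link is find(root, "ab"))
```

In the compact tree of the slides the same walk is done with far fewer nodes, which is why the rescan step is needed there and not here.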

• Demands from the data structure: match, insert, and delete over the 4096-byte window.
• We explained insertion.
• What about deletion?

Modifications- cont.------------------------------------
• Deletion:
• b. Leaves in a circular buffer:
  - identify the oldest leaf and delete it.
• c. ’Son count’:
  - when a node’s son count falls to one, delete the node and combine the arcs.
[Diagram: a circular buffer of leaves for positions 1..4096, and a node with son count = 3.]

Is it enough?------------------------------------
• NO.
• We still have a problem.
• Higher pointers can become out of date.
• But climbing up to update those pointers would take away the advantages of using the suffix links!

Modifications- Last------------------------------------
• d. Percolating updates:
  - Each internal node has an update bit (true/false).

Percolating updates ------------------------------------
• d. Percolating updates:
• When updating a node:
  - If bit = true:
    1. set the bit to false.
    2. propagate the update to the parent.
  - If bit = false:
    1. set the bit to true.
    2. stop the update.
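The update rule can be sketched as follows. We assume here that a node's position field is refreshed every time an update reaches it, which is how we read "when updating a node"; the class `Node`, the function `percolate`, and the three-node chain are hypothetical and only for illustration.

```python
class Node:
    def __init__(self, parent=None):
        self.parent = parent
        self.pos = 0          # a position inside the window for this node's string
        self.update = False   # the update bit


def percolate(node, pos):
    """Send an update carrying `pos` to `node`; return how many nodes it touched."""
    touched = 0
    while node is not None:
        node.pos = pos              # assumption: refresh on every received update
        touched += 1
        if node.update:
            node.update = False     # bit was true: clear it and keep going up
            node = node.parent
        else:
            node.update = True      # bit was false: set it and stop
            break
    return touched


# A chain root <- a <- b; new leaves keep arriving below b.
root = Node()
a = Node(parent=root)
b = Node(parent=a)
touched = [percolate(b, pos) for pos in (10, 20, 30, 40)]
print(touched, root.pos, a.pos, b.pos)
```

Every second update that reaches a node is passed on to its parent, so an ancestor k levels above a busy node is refreshed once every 2^k updates on this chain.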

Percolating updates-cont.-------------------------------------------
• Effect:
  - Keeps all internal pointers on positions within the 4096-byte window of the file.
• Cost:
  - worst case: the update propagates up to the root.
  - amortized: summing over all new leaves, we get a constant cost.

Summary of the inner loop---------------------------------------------------------
• The operations:
• 1. Insert:
  - a. insert the previous string.
  - b. use the suffix link to insert the next string.
• 2. Percolate the update from the leaf:
  - if the bit is true:
    set the position field of the node to the current position.
    set the bit to false and propagate to the parent.
  - if the bit is false:
    set it to true, and stop.

Summary- cont---------------------------------------------------------
• 3. Circular buffer:
  - a. replace the oldest leaf with the new one.
  - b. if its parent has only one remaining son:
    1. delete the parent, and attach the remaining son to the grandparent.
    2. percolate the deleted node’s position (*special case: comparative percolation).

Where do we stand?
• 1. Compression: (done)
  - In general
  - Our algorithm
• 2. Data structure: (done)
  - Modification of the suffix tree.
• 3. Theoretical considerations: (we are here)
  - Proofs.

Theoretical Considerations----------------------------------------------------
• Correctness and linearity of suffix tree construction:
  - we already saw that.
• We need to be convinced about destruction:
• Theorem 1:
  - Deleting leaves in FIFO order and deleting internal nodes with single sons will never leave dangling suffix pointers.

Proof:
• Assume the contrary: α points to γ, which was deleted.
• The existence of α (a string of length l) means:
  - two strings in the window agree for l characters and differ at l+1, e.g. ……df..gb…df..gz..
• Dropping the first character of each, we get:
  - two strings that agree for l-1 characters and differ at l. Both start later than the originals, so FIFO deletion has not removed them yet.
• This contradicts the assumption that γ had only one son and was therefore deleted.
[Diagram: α with sons b and z, and its suffix link to γ, whose string is one character shorter.]

Theoretical Considerations-----------------------------------------------------
• Theorem 2:
  - Each percolated update has constant amortized cost.
• Proof:
  - Assume a ‘credit’ on each internal node whose ‘update’ flag is true.

• A new node is added with two ‘credits’:
  - One is spent to update the parent.
  - If the parent’s flag is false: the second credit is given to the parent, its flag becomes true, and we terminate.
  - If the parent’s flag is true: together with the parent’s own credit we have two credits on the parent, and we continue. Apply this recursively on the parent.
• Result:
  - the invariant is kept, and we get an amortized cost of two updates per new leaf.
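The bound of two updates per new leaf can be checked empirically. The sketch below is a simulation under our own assumptions, not part of the proof: it builds a random tree, starts with all update flags false, lets each new leaf send an update to a randomly chosen internal node, and counts how many nodes all the updates touch. The name `simulate` and its parameters are hypothetical.

```python
import random


def simulate(num_nodes: int, num_updates: int, seed: int = 0) -> int:
    """Return the total number of node updates caused by num_updates new leaves."""
    rng = random.Random(seed)
    # parent[i] is the parent of internal node i; node 0 is the root.
    parent = [None] + [rng.randrange(i) for i in range(1, num_nodes)]
    update = [False] * num_nodes        # no credits in the tree at the start
    touched = 0
    for _ in range(num_updates):
        node = rng.randrange(num_nodes)  # parent of the newly added leaf
        while node is not None:
            touched += 1
            if update[node]:
                update[node] = False     # spend the credit stored on this node
                node = parent[node]
            else:
                update[node] = True      # leave a credit here and stop
                break
    return touched


total = simulate(num_nodes=50, num_updates=10_000)
print(total, total <= 2 * 10_000)
```

Each update that stops leaves one credit behind, and each step further up consumes one such credit, so the total number of touched nodes cannot exceed twice the number of new leaves.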

Theoretical Considerations-----------------------------------------------------
• Theorem 3 (effectiveness):
  - Using the percolating update, every internal node will be updated at least once in a period (4096).
• Proof:
  - We will prove that every internal node is updated at least twice in a period, and thus propagates at least one update up.

• (By contradiction) Find β, the farthest node from the root that doesn’t propagate an update to its parent.
• 3 cases:
  - a. β has two (or more) remained* children: both are farther from the root, thus each has updated β.
  - b. β has only one remained* child: one update comes from it, the second from the new child when it is created (a new arc causes the son to update the parent).
* A remained child is a child that has remained for the entire period.
