Lecture 29. Data Compression Algorithms
Recap • Commonly, algorithms are analyzed on the basis of a probability factor, such as the average case in linear search. • Amortized analysis is not based on probability, nor does it apply to a single operation. • There are three methods for finding the amortized cost (the cost of a sequence of operations): the aggregate, accounting and potential methods. • In the aggregate method, the amortized cost per operation is T(n)/n, where T(n) is the total cost of a sequence of n operations. • In the accounting method, an overcharge assigned to an operation is known as credit for the sequence; this credit can be used later when the amortized cost of an operation is less than its actual cost. • In the potential method, the idea is the same as in the accounting method, except that the prepaid work is stored as potential in the data structure as a whole rather than as credit on individual operations.
What is Compression? Compression basically exploits redundancy in the data: • Temporal - in 1D data, 1D signals, audio, etc. • Spatial - correlation between neighbouring pixels or data items • Spectral - correlation between colour or luminance components; this uses the frequency domain to exploit relationships between frequencies of change in the data • Psycho-visual - exploits perceptual properties of the human visual system
Compression can be categorised in two broad ways: • Lossless Compression, where data is compressed and can be reconstituted (uncompressed) without loss of detail or information. These are also referred to as bit-preserving or reversible compression systems. • Lossy Compression, where the aim is to obtain the best possible fidelity for a given bit-rate, or to minimize the bit-rate needed to achieve a given fidelity. Video and audio compression techniques are most suited to this form of compression.
Cont !!! • If an image is compressed, it clearly needs to be uncompressed (decoded) before it can be viewed/listened to. Some processing of data may, however, be possible in encoded form. • Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques. • Lossy compression uses source encoding techniques that may involve transform encoding, differential encoding or vector quantisation.
Lossless Compression Algorithms (Repetitive Sequence Suppression) Simple Repetition Suppression • If a series of n successive identical tokens appears in a sequence, we can replace it with a single token and a count of the number of occurrences. We usually need a special flag to denote when the repeated token appears. For example, 894 followed by 32 zeros, 89400000000000000000000000000000000, can be replaced with 894f32, where f is the flag for zero. • Compression savings depend on the content of the data.
Cont !!! Applications of this simple compression technique include: • Suppression of zeros in a file (Zero Length Suppression) • Silence in audio data, pauses in conversation, etc. • Bitmaps • Blanks in text or program source files • Backgrounds in images • Other regular image or data tokens A small Java sketch of zero suppression follows.
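As a rough illustration, here is a minimal Java sketch of zero suppression. The class and method names (ZeroSuppression, suppressZeros) are our own, and it assumes the flag character f is reserved, i.e., never occurs in the raw data.

public class ZeroSuppression {

    // Replaces each run of '0' characters with "f" + run length,
    // where 'f' is the flag character for zero (as in the slide's example).
    public static String suppressZeros(String s) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            if (s.charAt(i) == '0') {
                int len = 0;
                while (i + len < s.length() && s.charAt(i + len) == '0') {
                    len++;
                }
                out.append('f').append(len);   // flag + count of suppressed zeros
                i += len;
            } else {
                out.append(s.charAt(i));
                i++;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // 894 followed by 32 zeros becomes 894f32, as in the slide.
        System.out.println(suppressZeros("894" + "0".repeat(32)));
    }
}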
Run-length Encoding • This encoding method is frequently applied to images (or pixels in a scan line). It is a small compression component used in JPEG compression. • In this instance, sequences of image elements X1, X2, …, Xn are mapped to pairs (c1, l1), (c2, l2), …, (cn, ln), where ci represents an image intensity or colour and li the length of the i-th run of pixels (not dissimilar to zero length suppression above).
Cont !!! For example, the original sequence 111122233333311112222 can be encoded as (1,4),(2,3),(3,6),(1,4),(2,4). The savings are dependent on the data. In the worst case (random noise), the encoding is larger than the original file: 2 integers per run rather than 1 integer per element, if the data is represented as integers. A short Java sketch of this encoding follows.
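Here is a minimal Java sketch of run-length encoding for the digit sequence above; the names RunLength and rleEncode are illustrative, not from any standard library.

import java.util.ArrayList;
import java.util.List;

public class RunLength {

    // Encodes a string of digit symbols as (symbol, run length) pairs.
    public static List<int[]> rleEncode(String s) {
        List<int[]> runs = new ArrayList<>();
        int i = 0;
        while (i < s.length()) {
            char symbol = s.charAt(i);
            int len = 1;                               // length of the current run
            while (i + len < s.length() && s.charAt(i + len) == symbol) {
                len++;
            }
            runs.add(new int[] { symbol - '0', len }); // (digit value, run length)
            i += len;
        }
        return runs;
    }

    public static void main(String[] args) {
        for (int[] run : rleEncode("111122233333311112222")) {
            System.out.print("(" + run[0] + "," + run[1] + ")");
        }
        // prints (1,4)(2,3)(3,6)(1,4)(2,4)
    }
}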
Lossless Compression Algorithms (Pattern Substitution) This is a simple form of statistical encoding. Here we substitute frequently repeating patterns with codes. Each code is shorter than the pattern it replaces, giving us compression. A simple pattern substitution scheme could employ predefined codes (for example, replace all occurrences of 'The' with the code '&').
Cont !!! More typically, codes are assigned to tokens according to the frequency of occurrence of patterns: • Count the occurrences of tokens • Sort them in descending order • Assign short symbols to the highest-count tokens (a sketch of this is given below) A predefined symbol table may be used, i.e., assign code i to token i. However, it is more usual to assign codes to tokens dynamically. The entropy encoding schemes below basically attempt to decide the optimum assignment of codes to achieve the best compression.
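As one hypothetical Java illustration of the count / sort / assign procedure for word tokens (the names and the whitespace tokenization are our own choices):

import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;

public class PatternSubstitution {

    // Counts word occurrences and assigns the given short codes
    // to the most frequent words, in descending order of count.
    public static Map<String, String> buildCodeTable(String text, String[] codes) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String token : text.split("\\s+")) {
            counts.merge(token, 1, Integer::sum);      // count occurrences of tokens
        }
        Map<String, String> codeTable = new LinkedHashMap<>();
        counts.entrySet().stream()
              .sorted(Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder()))
              .limit(codes.length)                     // highest-count tokens first
              .forEach(e -> codeTable.put(e.getKey(), codes[codeTable.size()]));
        return codeTable;
    }

    public static void main(String[] args) {
        String text = "the cat and the dog and the bird";
        System.out.println(buildCodeTable(text, new String[] { "&", "#" }));
        // prints {the=&, and=#}
    }
}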
Lossless Compression Algorithms (Entropy Encoding) Lossless compression frequently involves some form of entropy encoding and is based on information-theoretic techniques; Claude Shannon is the father of information theory.
The Shannon-Fano Algorithm • This is a basic information theoretic algorithm. A simple example will be used to illustrate the algorithm:

Symbol   A    B   C   D   E
Count    15   7   6   6   5
Cont !!! Encoding for the Shannon-Fano Algorithm: • A top-down approach 1. Sort symbols according to their frequencies/probabilities, e.g., A B C D E. 2. Recursively divide the symbols into two parts, each with approximately the same total count; append 0 to the codes of the symbols in the first part and 1 to those in the second, and recurse until each part contains a single symbol. A Java sketch of this recursion is given below.
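Here is a minimal Java sketch of this top-down recursion, applied to the counts in the example above. The splitting heuristic (advance the split point while the two halves become more balanced) is one reasonable choice among several; the names are our own.

import java.util.LinkedHashMap;
import java.util.Map;

public class ShannonFano {

    // Recursively divides symbols (already sorted by descending count) into two
    // parts with approximately equal total counts, appending 0 to the codes of
    // the first part and 1 to the codes of the second.
    static void assign(String[] syms, int[] counts, int lo, int hi,
                       String prefix, Map<String, String> codes) {
        if (lo == hi) {                                // a single symbol: code complete
            codes.put(syms[lo], prefix);
            return;
        }
        int total = 0;
        for (int i = lo; i <= hi; i++) total += counts[i];
        int half = 0, split = lo;
        // Advance the split point while doing so keeps the halves more balanced.
        while (split < hi
               && Math.abs(2 * (half + counts[split]) - total) <= Math.abs(2 * half - total)) {
            half += counts[split++];
        }
        assign(syms, counts, lo, split - 1, prefix + "0", codes);
        assign(syms, counts, split, hi, prefix + "1", codes);
    }

    public static void main(String[] args) {
        String[] syms = { "A", "B", "C", "D", "E" };   // sorted by frequency
        int[] counts = { 15, 7, 6, 6, 5 };
        Map<String, String> codes = new LinkedHashMap<>();
        assign(syms, counts, 0, syms.length - 1, "", codes);
        System.out.println(codes);                     // {A=00, B=01, C=10, D=110, E=111}
    }
}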
Introduction to LZW • As mentioned earlier, static coding schemes require some knowledge about the data before encoding takes place. • Universal coding schemes, like LZW, do not require advance knowledge and can build such knowledge on-the-fly. • LZW is the foremost technique for general purpose data compression due to its simplicity and versatility. • It is the basis of many PC utilities that claim to “double the capacity of your hard drive” • LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
Introduction to LZW (Cont !!!) • Codes 0-255 in the code table are always assigned to represent single bytes from the input file. • When encoding begins the code table contains only the first 256 entries, with the remainder of the table being blanks. • Compression is achieved by using codes 256 through 4095 to represent sequences of bytes. • As the encoding continues, LZW identifies repeated sequences in the data, and adds them to the code table. • Decoding is achieved by taking each code from the compressed file, and translating it through the code table to find what character or characters it represents.
LZW Encoding Algorithm
initialize the string table with single-character strings
P = first input character
WHILE not end of input stream
    C = next input character
    IF P + C is in the string table
        P = P + C
    ELSE
        output the code for P
        add P + C to the string table
        P = C
END WHILE
output the code for P
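The pseudocode translates fairly directly into Java. Below is a minimal sketch (the class and method names are our own); it uses a HashMap as the string table and, for simplicity, ignores the 4096-entry limit discussed later.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwEncoder {

    // Encodes the input string and returns the list of output codes.
    public static List<Integer> lzwEncode(String input) {
        // Initialize the table with single-character strings (codes 0-255).
        Map<String, Integer> table = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            table.put(String.valueOf((char) i), i);
        }
        int nextCode = 256;

        List<Integer> output = new ArrayList<>();
        String p = String.valueOf(input.charAt(0));    // P = first input character

        for (int i = 1; i < input.length(); i++) {
            char c = input.charAt(i);                  // C = next input character
            if (table.containsKey(p + c)) {
                p = p + c;                             // P + C is in the table: extend P
            } else {
                output.add(table.get(p));              // output the code for P
                table.put(p + c, nextCode++);          // add P + C to the table
                p = String.valueOf(c);                 // P = C
            }
        }
        output.add(table.get(p));                      // output the code for the final P
        return output;
    }

    public static void main(String[] args) {
        System.out.println(lzwEncode("BABAABAAA"));    // [66, 65, 256, 257, 65, 260]
    }
}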
Example 1: Compression using LZW Use the LZW algorithm to compress the string BABAABAAA. Initially the table contains all single-character strings, and P = B (the first input character).
Example 1: LZW Compression Step 1 BABAABAAA P = B, C = A. BA is not in the table: output the code for B (<66>), add BA as entry 256, set P = A.
Example 1: LZW Compression Step 2 BABAABAAA P = A, C = B. AB is not in the table: output the code for A (<65>), add AB as entry 257, set P = B.
Example 1: LZW Compression Step 3 BABAABAAA P = B, C = A. BA is in the table, so P = BA; the next C = A and BAA is not in the table: output the code for BA (<256>), add BAA as entry 258, set P = A.
Example 1: LZW Compression Step 4 BABAABAAA P = A, C = B. AB is in the table, so P = AB; the next C = A and ABA is not in the table: output the code for AB (<257>), add ABA as entry 259, set P = A.
Example 1: LZW Compression Step 5 BABAABAAA P = A, C = A. AA is not in the table: output the code for A (<65>), add AA as entry 260, set P = A.
Example 1: LZW Compression Step 6 BABAABAAA P = A, C = A. AA is in the table, so P = AA; end of input: output the code for AA (<260>). Final output: <66><65><256><257><65><260>.
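For reference, running the encoder sketch given after the algorithm on BABAABAAA should print exactly these six codes: [66, 65, 256, 257, 65, 260].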
LZW Decompression • The LZW decompressor creates the same string table during decompression. • It starts with the first 256 table entries initialized to single characters. • The string table is updated for each character in the input stream, except the first one. • Decoding is achieved by reading codes and translating them through the code table being built.
LZW Decompression Algorithm
initialize the string table with single-character strings
OLD = first input code
output the translation of OLD
WHILE not end of input stream
    NEW = next input code
    IF NEW is not in the string table
        S = translation of OLD
        S = S + C
    ELSE
        S = translation of NEW
    output S
    C = first character of S
    add translation of OLD + C to the string table
    OLD = NEW
END WHILE
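A matching minimal Java sketch of the decoder (again, the names are our own); note the special-case branch for a code that is not yet in the table when it arrives.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwDecoder {

    // Decodes a list of LZW codes back into the original string.
    public static String lzwDecode(List<Integer> codes) {
        // Initialize the table with single-character strings (codes 0-255).
        Map<Integer, String> table = new HashMap<>();
        for (int i = 0; i < 256; i++) {
            table.put(i, String.valueOf((char) i));
        }
        int nextCode = 256;

        int old = codes.get(0);                        // OLD = first input code
        StringBuilder result = new StringBuilder(table.get(old));
        String c = "";

        for (int i = 1; i < codes.size(); i++) {
            int next = codes.get(i);                   // NEW = next input code
            String s;
            if (!table.containsKey(next)) {
                s = table.get(old) + c;                // NEW not yet in the table
            } else {
                s = table.get(next);
            }
            result.append(s);                          // output S
            c = String.valueOf(s.charAt(0));           // C = first character of S
            table.put(nextCode++, table.get(old) + c); // add translation of OLD + C
            old = next;                                // OLD = NEW
        }
        return result.toString();
    }

    public static void main(String[] args) {
        System.out.println(lzwDecode(List.of(66, 65, 256, 257, 65, 260))); // BABAABAAA
    }
}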
Example 2: LZW Decompression Use the LZW algorithm to decompress the output sequence of Example 1: <66><65><256><257><65><260>. Initialization: OLD = <66>; output its translation, B.
Example 2: LZW Decompression Step 1 <66><65><256><257><65><260> NEW = <65>, which is in the table: S = A. Output A; C = A; add translation of OLD + C = BA as entry 256; OLD = 65.
Example 2: LZW Decompression Step 2 <66><65><256><257><65><260> NEW = <256>: S = BA. Output BA; C = B; add AB as entry 257; OLD = 256.
Example 2: LZW Decompression Step 3 <66><65><256><257><65><260> NEW = <257>: S = AB. Output AB; C = A; add BAA as entry 258; OLD = 257.
Example 2: LZW Decompression Step 4 <66><65><256><257><65><260> NEW = <65>: S = A. Output A; C = A; add ABA as entry 259; OLD = 65.
Example 2: LZW Decompression Step 5 <66><65><256><257><65><260> NEW = <260>, which is not yet in the table: S = translation of OLD + C = A + A = AA. Output AA; C = A; add AA as entry 260; OLD = 260. The decoded string is B A BA AB A AA = BABAABAAA.
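Feeding these six codes to the decoder sketch above should reproduce BABAABAAA, including the special case in Step 5, where <260> arrives before it has been added to the table.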
LZW: Some Notes • This algorithm compresses repetitive sequences of data well. • Since the codewords are 12 bits, any single encoded character will expand the data size rather than reduce it. • In this example, 72 bits of input (9 characters × 8 bits) are represented by 72 bits of output (6 codes × 12 bits), so nothing is gained; after a reasonable string table is built, compression improves dramatically. • Advantages of LZW over Huffman: • LZW requires no prior information about the input data stream. • LZW can compress the input stream in one single pass. • Another advantage of LZW is its simplicity, allowing fast execution.
LZW: Limitations • What happens when the dictionary gets too large (i.e., when all 4096 locations have been used)? • Here are some options usually implemented: • Simply stop adding entries and use the table as is. • Throw the dictionary away when it reaches a certain size. • Throw the dictionary away when it is no longer effective at compression. • Clear entries 256-4095 and start building the dictionary again (sketched below). • Some clever schemes rebuild the string table from the last N input characters.
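As a sketch of the fourth option, here is a hypothetical addEntry helper that clears and re-seeds the table once it fills up; the helper name and the int[] code counter are our own illustration, and in practice the encoder and decoder must perform the reset at exactly the same point.

import java.util.Map;

public class ResettableTable {

    static final int MAX_ENTRIES = 4096;               // 12-bit codes

    // Adds a learned string to the encoder's table; when the table is full,
    // clears entries 256-4095 and starts building the dictionary again.
    static void addEntry(Map<String, Integer> table, String entry, int[] nextCode) {
        if (nextCode[0] >= MAX_ENTRIES) {
            table.clear();
            for (int i = 0; i < 256; i++) {            // re-seed single-byte entries
                table.put(String.valueOf((char) i), i);
            }
            nextCode[0] = 256;                         // start learning again
        }
        table.put(entry, nextCode[0]++);
    }
}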
Home Work • Trace the LZW encoding of the string ABRACADABRA. • Write a Java program that encodes a given string using LZW.
Summary • Data compression is a set of techniques for compressing data represented in text, audio or image form. • The two broad categories of compression techniques are lossy and lossless compression. • LZW is the foremost technique for general purpose data compression due to its simplicity and versatility. • LZW compression uses a code table, with 4096 as a common choice for the number of table entries.
In Next Lecture • In the next lecture, we will discuss P, NP-complete and NP-hard problems.