Dive into Huffman compression, a technique to reduce unnecessary data and redundancy in computer files. Learn about lossless vs. lossy compression, explore how Huffman coding works through text files, and understand the intricacies of file compression algorithms.
Squishin’ Stuff: Huffman Compression
Data Compression • Begin with a computer file (text, picture, movie, sound, executable, etc.) • Most files contain extra information or redundancy • Goal: Reorganize the file to remove the excess information and redundancy • Lossless compression: compress the file in such a way that none of the information is lost (good for text files and executables) • Lossy compression: allow some information to be thrown away in order to get a better level of compression (good for pictures, movies, or sounds) • There are many, many algorithms out there to compress files • Different types of files work best with different algorithms (you need to consider the structure of the file and how things are connected) • We’re going to focus on Huffman compression, which is used in many compression programs, most notably WinZip • We’re just going to play with text files.
Text Files • Each character is represented by one byte: a sequence of 8 bits (1s and 0s), its ASCII code • ASCII is an international standard for how a character is represented • A 01000001 • B 01000010 • ~ 01111110 • 3 00110011 • Most text files use fewer than 128 distinct characters; this code has room for 256. Extra information!! • Goal: Use shorter codes to represent more frequent characters. • You have seen this before…
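To make the byte-level picture concrete, here is a minimal Python sketch (not part of the original slides; the string "HUFFMAN" is just an example) that prints each character’s 8-bit ASCII code and counts character frequencies — the frequencies are exactly the input Huffman coding works from:

# A minimal sketch: print each character's 8-bit ASCII code and
# count how often each character occurs in the text.
from collections import Counter

text = "HUFFMAN"

for ch in sorted(set(text)):
    # ord() gives the ASCII code point; format it as 8 binary digits.
    print(ch, format(ord(ch), "08b"))

# Character frequencies -- the input Huffman coding works from.
print(Counter(text))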
RAWA AWIS RINBABBE • That didn’t work. • If we use variable-length codes, we need a way to know where one letter’s code stops and the next begins. • Huffman coding provides this, though we’ll lose some compression. • Huffman Coding • Named after some guy called Huffman (David Huffman, 1952). • Use a tree to construct the code, and then use the same tree to interpret (decode) the code.
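The slides build the tree by hand; as a rough illustration, here is a hedged Python sketch of the same idea (the text "BANANA BANDANA" is a made-up example): repeatedly merge the two least-frequent nodes into one, then read each character’s code off the finished tree, using 0 for a left edge and 1 for a right edge. Because every character sits at a leaf, no code is a prefix of another — which is exactly what lets the decoder tell where a letter stops.

# A sketch of Huffman tree construction: repeatedly merge the two
# least-frequent nodes, then read codes off the tree.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, node). A leaf is a character;
    # an internal node is a (left, right) pair. The unique integer
    # tiebreaker keeps heapq from ever comparing the nodes themselves.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    # Walk the tree, recording the path to each leaf as its code.
    codes = {}
    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path or "0"   # single-character edge case
            return
        walk(node[0], path + "0")       # left edge = 0
        walk(node[1], path + "1")       # right edge = 1
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("BANANA BANDANA")
print(codes)
# Encode the text by concatenating the codes; decoding walks the tree
# bit by bit, emitting a character each time it reaches a leaf.
print("".join(codes[ch] for ch in "BANANA BANDANA"))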
What’s the best you can do? • Obviously, there is a limit to how far down you can compress a file. • Assume your file has n different characters in it, say a1…an, with probabilities p1…pn (so p1 + p2 + … + pn = 1). • The entropy of the file is defined to be H = −(p1 log2(p1) + p2 log2(p2) + … + pn log2(pn)). • Entropy measures the least number of bits, on average, needed to represent a character. • For my name, the entropy is 3.12 (it takes at least 3.12 bits per character to represent my name); Huffman gave an average of 3.19 bits per character. • Huffman compression will always give an average within one bit of the entropy.
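As a quick check of that bound, here is a small Python sketch (again not from the slides) that computes the entropy H = −Σ pi log2(pi) of a string and compares it with the average Huffman code length; it assumes the huffman_codes function from the earlier sketch is already defined:

import math
from collections import Counter

def entropy(text):
    # H = -sum over characters of p * log2(p), with p = count / length.
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

text = "BANANA BANDANA"
codes = huffman_codes(text)  # defined in the earlier sketch
avg = sum(len(codes[ch]) for ch in text) / len(text)
print(f"entropy = {entropy(text):.2f} bits/char, Huffman average = {avg:.2f}")

For this example string the entropy works out to about 1.99 bits per character and the Huffman average to 2.00 — within one bit, as the slide promises.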