Dive into Huffman compression, a technique to reduce unnecessary data and redundancy in computer files. Learn about lossless vs. lossy compression, explore how Huffman coding works through text files, and understand the intricacies of file compression algorithms.
Squishin’ Stuff: Huffman Compression
Data Compression • Begin with a computer file (text, picture, movie, sound, executable, etc.) • Most files contain extra information or redundancy • Goal: Reorganize the file to remove the excess information and redundancy • Lossless compression: compress the file in such a way that none of the information is lost (good for text files and executables) • Lossy compression: allow some information to be thrown away in order to get a better level of compression (good for pictures, movies, or sounds) • There are many, many algorithms out there to compress files • Different types of files work best with different algorithms (you need to consider the structure of the file and how things are connected) • We’re going to focus on Huffman compression, which is used in many compression programs, most notably WinZip • We’re just going to play with text files.
Text Files • Each character is represented by one byte: a sequence of 8 bits (1s and 0s), its ASCII code • ASCII is an international standard for how a character is represented • A 01000001 • B 01000010 • ~ 01111110 • 3 00110011 • Most text files use fewer than 128 distinct characters; this code has room for 256. Extra information!! • Goal: Use shorter codes to represent more frequent characters. • You have seen this before…
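To make the byte-level picture concrete, here is a minimal Python sketch (not part of the original slides; the string "HUFFMAN" is just an example) that prints each character’s 8-bit ASCII code and counts character frequencies — the frequencies are exactly the input Huffman coding works from:

# A minimal sketch: print each character's 8-bit ASCII code and
# count how often each character occurs in the text.
from collections import Counter

text = "HUFFMAN"

for ch in sorted(set(text)):
    # ord() gives the ASCII code point; format it as 8 binary digits.
    print(ch, format(ord(ch), "08b"))

# Character frequencies -- the input Huffman coding works from.
print(Counter(text))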
RAWA AWIS RINBABBE • That didn’t work. • If we use variable-length codes, we need a way to know where one letter’s code stops and the next begins. • Huffman coding provides this, though we’ll lose some compression. • Huffman Coding • Named after some guy called Huffman (David Huffman, 1952). • Use a tree to construct the code, and then use the same tree to interpret (decode) the code.
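The slides build the tree by hand; as a rough illustration, here is a hedged Python sketch of the same idea (the text "BANANA BANDANA" is a made-up example): repeatedly merge the two least-frequent nodes into one, then read each character’s code off the finished tree, using 0 for a left edge and 1 for a right edge. Because every character sits at a leaf, no code is a prefix of another — which is exactly what lets the decoder tell where a letter stops.

# A sketch of Huffman tree construction: repeatedly merge the two
# least-frequent nodes, then read codes off the tree.
import heapq
from collections import Counter

def huffman_codes(text):
    # Heap entries: (frequency, tiebreaker, node). A leaf is a character;
    # an internal node is a (left, right) pair. The unique integer
    # tiebreaker keeps heapq from ever comparing the nodes themselves.
    heap = [(freq, i, ch) for i, (ch, freq) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, count, (left, right)))
        count += 1
    # Walk the tree, recording the path to each leaf as its code.
    codes = {}
    def walk(node, path):
        if isinstance(node, str):
            codes[node] = path or "0"   # single-character edge case
            return
        walk(node[0], path + "0")       # left edge = 0
        walk(node[1], path + "1")       # right edge = 1
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("BANANA BANDANA")
print(codes)
# Encode the text by concatenating the codes; decoding walks the tree
# bit by bit, emitting a character each time it reaches a leaf.
print("".join(codes[ch] for ch in "BANANA BANDANA"))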
What’s the best you can do? • Obviously, there is a limit to how far down you can compress a file. • Assume your file has n different characters in it, say a1…an, with probabilities p1…pn (so p1 + p2 + … + pn = 1). • The entropy of the file is defined to be H = −(p1 log2(p1) + p2 log2(p2) + … + pn log2(pn)). • Entropy measures the least number of bits, on average, needed to represent a character. • For my name, the entropy is 3.12 (it takes at least 3.12 bits per character to represent my name); Huffman gave an average of 3.19 bits per character. • Huffman compression will always give an average within one bit of the entropy.
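As a quick check of that bound, here is a small Python sketch (again not from the slides) that computes the entropy H = −Σ pi log2(pi) of a string and compares it with the average Huffman code length; it assumes the huffman_codes function from the earlier sketch is already defined:

import math
from collections import Counter

def entropy(text):
    # H = -sum over characters of p * log2(p), with p = count / length.
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in Counter(text).values())

text = "BANANA BANDANA"
codes = huffman_codes(text)  # defined in the earlier sketch
avg = sum(len(codes[ch]) for ch in text) / len(text)
print(f"entropy = {entropy(text):.2f} bits/char, Huffman average = {avg:.2f}")

For this example string the entropy works out to about 1.99 bits per character and the Huffman average to 2.00 — within one bit, as the slide promises.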