Bits and Huffman Encoding

Bits and Huffman Encoding Please get a piece of paper and a pen and put your name and netid on it. Make sure you can turn in it after class without losing your notes.

How is data stored in a computer? • Several different ways are possible • In an ASCII-encoded text file (hithero known as a “text file”), each character consists of 8-bit chunks • What’s a bit?

Bits Mean What You Want Them To Mean 01001101 01001001 01001011 01000101 K M I E But why 8-bit chunks? Couldn’t we interpret the same string as 4 bit chunks? 0100 1101 0100 1001 0100 1011 0100 0101 4 13 4 9 4 11 4 5 With 8 bit chucks there are 28 (256) possible characters. With 4 bit chunks there are 24 (16) possible characters.

What if we want to save memory? • Could just use fewer bits per chunk, but that limits the number of characters we can represent • But my document contains a lot more ‘a’s than ‘%’s. What if the code for a was “01” and the code for % was “10000101”

A Problem Mike decides he wants to compress his vast collection of Harry Potter fanfic (all stored a textfiles). He notices that the character ‘H’ occurs much more frequently in the textfiles than does the character ‘%’, but they both use 8 bits. So he changes the encoding so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by ‘1010101010101010’ (16 bits). Assuming all other characters stay at 8 bits (a bit unlikely…but just pretend)…how much more frequent do ‘H’s need to be than ‘%’s for there to be a space savings? • There must be more than 3Hs for every %s • There must be more than 3Hs for every 2%s • There must be more than 4Hs for every 3%s • As long as there more Hs than %s you get a savings

We can get savings if we know what characters occur frequently • So imagine you look at some set of data with character frequencies. How much can you save?

Consider the string: • AACCCAAABAADAE What’s the best encoding?

Why will this encoding not work? What is the encoding for AAE? What is the encoding for ADC?

Consider the string: • AACCCAAABAADAE The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 Work with your row to find the most efficient possible encoding for each character. I was able to encode the entire string in 25 bits. See if you can do it that efficiently – but be careful with prefixes. Write your encoding, plus your computation of its total length, on your handin sheet.

My Encoding • AACCCAAABAADAE The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 There a several possible encodings that can give you 25. I’m pretty sure there’s no more efficient encoding, at least as far as compressing individual characters.

Given a set of frequencies, how should you select the character encoding for maximum efficiency?Huffman Encoding

The basic idea • We will make a “huffman tree”. By repeatedly combining nodes with small frequencies into “meta nodes” with larger frequency • A letter’s position in the tree will tell us what its encoded form is

Generate a huffman tree and encodings for this circumstance Follow the tutorial linked off the resources section in Sakai Write the huffman tree and the resultant encoding on your handin sheet.

The Header • The Magic Number • Counts • Or a tree for extra credit

The Psuedo-EOF character • A made up character to write to your file

Please turn your sheet in at the back of the room (try for a neat pile)

Bits and Huffman Encoding

Bits and Huffman Encoding

Presentation Transcript

Huffman code and ID3

Data Compressor---Huffman Encoding and Decoding

Huffman Codes

Huffman Encoding Example

Huffman Encoding

Judy Huffman

Huffman Encoding and Decoding

Huffman Encoding

Huffman Codes

HUFFMAN CODING

Huffman Trees and ID3

Bits and Bytes

Huffman Coding

Huffman Coding

Bits and Bytes

Huffman code and ID3

Huffman Codes

Bits and Bytes

Huffman Coding

Huffman Codes

Bits and Bytes

Huffman Encoding