160 likes | 285 Vues
This guide explains how data is stored in computers, focusing on bit representation and Huffman encoding. You'll learn about ASCII text files and the significance of using 8-bit versus fewer bit chunks for character representation. The text explores the advantages of more efficient encoding by examining character frequencies and demonstrates how to create a Huffman tree for optimal data compression. Follow along with exercises to implement Huffman encoding for a sample string and see how to maximize storage efficiency for frequently occurring characters.
E N D
Bits and Huffman Encoding Please get a piece of paper and a pen and put your name and netid on it. Make sure you can turn in it after class without losing your notes.
How is data stored in a computer? • Several different ways are possible • In an ASCII-encoded text file (hithero known as a “text file”), each character consists of 8-bit chunks • What’s a bit?
Bits Mean What You Want Them To Mean 01001101 01001001 01001011 01000101 K M I E But why 8-bit chunks? Couldn’t we interpret the same string as 4 bit chunks? 0100 1101 0100 1001 0100 1011 0100 0101 4 13 4 9 4 11 4 5 With 8 bit chucks there are 28 (256) possible characters. With 4 bit chunks there are 24 (16) possible characters.
What if we want to save memory? • Could just use fewer bits per chunk, but that limits the number of characters we can represent • But my document contains a lot more ‘a’s than ‘%’s. What if the code for a was “01” and the code for % was “10000101”
A Problem Mike decides he wants to compress his vast collection of Harry Potter fanfic (all stored a textfiles). He notices that the character ‘H’ occurs much more frequently in the textfiles than does the character ‘%’, but they both use 8 bits. So he changes the encoding so that ‘H’ is represented by ‘01’ (2 bits) and ‘%’ is represented by ‘1010101010101010’ (16 bits). Assuming all other characters stay at 8 bits (a bit unlikely…but just pretend)…how much more frequent do ‘H’s need to be than ‘%’s for there to be a space savings? • There must be more than 3Hs for every %s • There must be more than 3Hs for every 2%s • There must be more than 4Hs for every 3%s • As long as there more Hs than %s you get a savings
We can get savings if we know what characters occur frequently • So imagine you look at some set of data with character frequencies. How much can you save?
Consider the string: • AACCCAAABAADAE What’s the best encoding?
Why will this encoding not work? What is the encoding for AAE? What is the encoding for ADC?
Consider the string: • AACCCAAABAADAE The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 Work with your row to find the most efficient possible encoding for each character. I was able to encode the entire string in 25 bits. See if you can do it that efficiently – but be careful with prefixes. Write your encoding, plus your computation of its total length, on your handin sheet.
My Encoding • AACCCAAABAADAE The rule: one character’s encoding cannot be the prefix of another. So if A=01, no other encoding can begin with 01 There a several possible encodings that can give you 25. I’m pretty sure there’s no more efficient encoding, at least as far as compressing individual characters.
Given a set of frequencies, how should you select the character encoding for maximum efficiency?Huffman Encoding
The basic idea • We will make a “huffman tree”. By repeatedly combining nodes with small frequencies into “meta nodes” with larger frequency • A letter’s position in the tree will tell us what its encoded form is
Generate a huffman tree and encodings for this circumstance Follow the tutorial linked off the resources section in Sakai Write the huffman tree and the resultant encoding on your handin sheet.
The Header • The Magic Number • Counts • Or a tree for extra credit
The Psuedo-EOF character • A made up character to write to your file
Please turn your sheet in at the back of the room (try for a neat pile)