Information without the Theory: Sharpie Compression


This is Part Three of an ongoing series on Information Theory and Compression. Part One dealt with a little background, while Part Two dealt with important terms and RLE. While it's not reeeeally important to read them, it'd help a lot with understanding what's happening here!

Remember how in part one I gave you that quick spiel on probabilities, and how important they are in Information Theory, then in the second part, showed you a method that doesn't deal with that at all? Well, don't worry, that's what this week is for!

In 1952, a guy called David Huffman developed an extremely useful way to encode a source symbol (say, a character in a file). Working with the frequencies at which each symbol appears, you can build an encoding for each letter, with the more frequently appearing symbols getting shorter codes.

You've seen this before: in the Chemist vs Menswear examples I used a series of 0's and 1's to simulate males and females in a row. But this is a text file we're talking about; surely each character shouldn't take up too much room...

Does it?

Funnily enough, each letter you write takes up 8 bits of information! This is all down to the UTF-8 encoding that allows for the letters shown on your screen. Without going too far into it (I recommend you watch this really good video on it if you want to know more! Don't worry, it doesn't bore your socks off), encoding is essentially how a program turns bits into the letters that the web page or program renders. UTF-8 requires that each letter you see here takes up 8 bits. A four-letter word like JACK takes up 32 bits of information, so using the Huffman algorithm, we can easily compress the amount of data sent, yet retain the information that we want it to send.
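If you want to sanity-check that yourself, here's a rough little Python sketch (Python is just my choice here, nothing special about it) that counts the bits UTF-8 uses for JACK:

```python
# A quick sanity check: every letter here is one byte (8 bits) in UTF-8.
word = "JACK"
print(len(word.encode("utf-8")) * 8)  # prints 32
```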

The best way I thought to illustrate this method was on a sheet of paper, with a series of Sharpies. I'm no photographer, so some of them are out of focus, but hey, all in the name of compression.

Rolling down the River.

Let's pick a phrase:

This is a nice phrase - lots of S's and I's. If we count all the letters:

We get 17 characters. You'll see that I've highlighted the space character, the ˽. This is because in UTF-8, the space is not just an empty gap; it is in fact 0010 0000.
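If you don't believe me, a quick Python sketch will show the space really does get its own byte:

```python
# The space is a real character with its own byte in UTF-8.
print(format(ord(" "), "08b"))  # prints 00100000
```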

What this means is we now have 8 * 17, or 136 bits of information. As I said before, MISSISSIPPI RIVER contains a lot of repeated letters; there must be a way to utilise that somehow...

Enter Huffman, stage left. To begin, Huffman's coding works on the basis of probability, or the frequency of occurrence of the letters in the string.

So, let's count them!

The algorithm works by creating a tree from the information given. Here I've put the like occurrences together. Normally these would be decimal probabilities, but hey, who likes those?
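If you'd rather let the computer do the counting, a small sketch using Python's Counter gives the same tallies as the Sharpie version:

```python
from collections import Counter

phrase = "MISSISSIPPI RIVER"
counts = Counter(phrase)
print(counts)
# Something like: Counter({'I': 5, 'S': 4, 'P': 2, 'R': 2, 'M': 1, ' ': 1, 'V': 1, 'E': 1})
```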

Next we should start building the tree. Starting with the ones, we group them into pairs, and we see they have the same probability.

So we do the same again:

Oh, 'allo, what have we got here...

Now we can match PR up with MVE˽, giving us a total of 8.

Matching it up with the final two terms, we add them all into our final tree:
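For the curious, here's a rough Python sketch of that merging step: keep grabbing the two smallest piles and gluing them together. The computer's tie-breaking may pair things up in a slightly different order than I did on paper, but the totals come out the same:

```python
import heapq
from collections import Counter

counts = Counter("MISSISSIPPI RIVER")

# Each heap entry: (total count, a tie-breaker, the letters this pile covers).
heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
heapq.heapify(heap)
tick = len(heap)

while len(heap) > 1:
    a_n, _, a = heapq.heappop(heap)   # smallest pile
    b_n, _, b = heapq.heappop(heap)   # next smallest pile
    print(f"merge {a!r} ({a_n}) + {b!r} ({b_n}) -> {a_n + b_n}")
    heapq.heappush(heap, (a_n + b_n, tick, a + b))
    tick += 1
```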

Now we're on the home stretch. For each branch of the tree, if it goes to the right we attach a 1 to it; to the left, a 0.

And beneath each letter, we should write a key. The key will be whatever branches it took on the way down, so the first two will be:

As you can see, the most used characters get short two-bit codes. If we complete the tree, you can see that to the right, the less frequent characters have up to four bits each.
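Here's a rough Python sketch of the whole build-and-walk process. The exact 0's and 1's depend on which branch you call left or right, so they may not match my paper tree digit-for-digit, but the code lengths (two bits for I and S, up to four for the rarer letters) do:

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build the Huffman tree with a heap, then walk it: one branch gets a 0, the other a 1."""
    counts = Counter(text)
    # Leaves are single characters; internal nodes are (left, right) pairs.
    heap = [(n, i, sym) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tick = len(heap)
    while len(heap) > 1:
        a_n, _, a = heapq.heappop(heap)
        b_n, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (a_n + b_n, tick, (a, b)))
        tick += 1
    tree = heap[0][2]

    codes = {}
    def walk(node, path):
        if isinstance(node, str):      # a leaf: one character
            codes[node] = path
        else:                          # an internal node: (left, right)
            walk(node[0], path + "0")
            walk(node[1], path + "1")
    walk(tree, "")
    return codes

print(huffman_codes("MISSISSIPPI RIVER"))
```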

This means that MISSISSIPPI RIVER, when we substitute each letter for its new key, becomes this string of 1's and 0's:

But, Jack, that looks longer! That's 46 digits there. 46 * 8 is 368 bits!

Ahh, but that isn't UTF-8. These are raw bits, and as long as we store them correctly, they're just 46 bits.

As you can see by the magic of the hand computer, we now have a string of bits around 33% the size of our original string! Through the magic of probability, we have managed to achieve ~66% compression!
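Reusing the huffman_codes sketch from above, you can let Python do the hand computer's job:

```python
phrase = "MISSISSIPPI RIVER"
codes = huffman_codes(phrase)              # from the sketch above
encoded = "".join(codes[c] for c in phrase)

print(len(encoded))                        # 46 bits with the new keys
print(len(phrase.encode("utf-8")) * 8)     # 136 bits in plain UTF-8
```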

And the cool thing about it is that no letter's code is the prefix of another's, so nothing ever conflicts: as long as you have the key, you can safely reverse the algorithm and get the original input back. There's no real way to test this here, but have a go yourself, turning all the bits back into MISSISSIPPI RIVER.
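If you'd rather let the machine do the reverting, here's a small sketch that reuses codes and encoded from the snippets above and walks the bits back into letters:

```python
# Reverse the key: collect bits until they match exactly one letter's code.
reverse = {code: letter for letter, code in codes.items()}

decoded, buffer = "", ""
for bit in encoded:
    buffer += bit
    if buffer in reverse:            # no code is a prefix of another,
        decoded += reverse[buffer]   # so the first match is always right
        buffer = ""

print(decoded)                       # MISSISSIPPI RIVER
```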

Now, I know with the colours in the previous post I mentioned that they store integer numbers in an 8-bit length. There is a technique using bitwise operations where we can pack multiple numbers into a single block of integer memory. I might go over it in the future, but it's a bit of a heavy idea for now.

So, there you have it! Huffman Coding, laid out using nothing but paper and vivids. Next time I hope to go over how this method and another work together to create the well-known .zip format. I'm not gonna lie, it's about to get kinda hairy; LZ77 is NOT a nice-looking algorithm. I'm also planning on writing up some examples using code, for those of you who are curious about how to write these out / want more technical versions of it.

As always, you can follow me on twitter, @rack_jobinson. Perhaps you can tell me what I'm meant to say on there, or something.