27160 total geeks with 3531 solutions
Recent challengers:
 Welcome, you are an anonymous user! [register] [login] Get a yourname@osix.net email address 

Articles

GEEK

User's box
Username:
Password:

Forgot password?
New account

Shoutbox
omentomikel
omentomikel
[b][url=http ://www.outle tmonclerdco. com/]moncler kids outlet[/url] [/b] [b][url=http ://www.outle tmonclerdco. com/]moncler womens jackets[/url ][/b] notice I did not state favorite), and uncover a feel in your best solution there. Listen intently as the sal
omentomikel
[b][url=http ://www.links oflondonjewe llery.net/]l inks of london wholesale[/u rl][/b] [b][ url=http://w ww<strong><a href="http:/ /www.linksof londonjewell ery.net/">li nks of london wholesale</a ></strong> < strong><a href="http:/ /www.linksof londonjewell ery.net/">l
omentomikel
[b][url=http ://www.copyp anerai.com/] panerai watches[/url ][/b] [b][ur<stron g><a href="http:/ /www.copypan erai.com/">p anerai watches</a>< /strong> <br ><strong><a href="http:/ /www.copypan erai.com/">s wiss panerai watches</a>< /strong> <br ><strong><a href="http:
omentomikel
[b][url=http ://www.copyp anerai.com/] panerai watches[/url ][/b] [b][url=http ://www.copyp anerai.com/] swiss panerai watches[/url ][/b] [b][url=http ://www.copyp anerai.com/] panerai replica watches[/url ][/b] [b][url=http ://www.copyp anerai.com/] panerai re

Donate
Donate and help us fund new challenges
Donate!
Due Date: Nov 30
November Goal: $40.00
Gross: $0.00
Net Balance: $0.00
Left to go: $40.00
Contributors


News Feeds
The Register
Get a job in
Germany ? where
most activities are
precursors to
drinking
BIG FAT Lies: Porky
Pies about obesity
Culture CLASH:
Wuzhen Declaration
spurned at World
Internet Conference
in China
Pitchfork-wielding
villagers hunt
hairy shapeshifters
in One Night
Ultimate
Werewo
Post-pub nosh
neckfiller:
Bryndzové halu?ky
What We Do In
The Shadows
?
Laugh ourselves
silly, mostly
UK"s non-emergency
police and NHS
Vodafone systems go
titsup NATIONWIDE
Yahoo!
blames!
MONSTER!
email!
OUTAGE!
on!<
Criticism of Uber"s
journo-Data
Analytics plan is
an Attack on
DIGITAL FREEDOM
Ford"s B-Max:
Fiesta-based
runaround that goes
THUNK
Slashdot
Great Firewall of
China Blocks
Edgecast CDN,
Thousands of
Websites Affected
Judge Unseals 500+
Stingray Records
Samsung Seeking To
Block Nvidia Chips
From US Market
Doubling Saturated
Fat In Diet Does
Not Increase It In
Blood
Ask Slashdot: Best
Practices For
Starting and
Running a Software
Shop?
Eizo Debuts Monitor
With 1:1 Aspect
Ratio
Upgrading the
Turing Test:
Lovelace 2.0
Profanity-Laced
Academic Paper
Exposes Scam
Journal
Ukraine"s IT
Brigade Supports
the Troops
Extreme Shrimp May
Hold Clues To Alien
Life On Europa
Article viewer

Huffman Encoding Explained



Written by:jebus
Published by:bb
Published on:2004-07-27 14:11:02
Topic:Compression
Search OSI about Compression.More articles by jebus.
 viewed 19037 times send this article printer friendly

Digg this!
    Rate this article :
This article describes the technique known as Huffman Encoding, a method for compressing data.

Introduction
This article will outline the basic workings behind the Huffman Encoding technique. All materials were written by me but inspired from multiple web resources. Check the bibliography at the end of this article for some useful links.

Background
The Hufman encoding technique was named for it's inventor, David A. Huffman, a professor at the University of California, Santa Cruz. The technique involves creating a mapping of literals into key codes based on their frequency. Most archivers use some form of Huffman encoding for their compression.

Terminology
The following terms will be widely used throughout this article, so it's best if we define them in one spot.

code - a set of bits that will represent a literal
literal - a single unit of data, usually a byte from the source
source - the file or data stream we are compressing

Process
The process of compressing data using Huffman encoding is rather simple once it's broken down. First, you calculate the frequency of each literal in the source, listing them in increasing order. Then you construct a binary tree using the frequencies, with the less frequent literals being lower leaves in the tree. Using this tree, you can construct bit codes for the literals.

Since the masses generally accept learning by example, we will do just that. Here is our test string that we will compress.

I am what I am and that's all that I am

Hopefully you can see why I chose this phrase. There is a healthy mix of common characters as well as some who appear only once.

Counting the occurence of each character, and sorting them by increasing frequency, we get the following (the last node represents the ASCII space):



As the diagram shows, each node pair contains the character literal and the number of times it appears in the source stream. The pairs are ordered sequentially based on their frequency, and lexicographically in the case where the frequencies are equal. These pairs will become the leaves of the tree we will build.

To build the tree, we take the first two frequencies and make a new node that holds the sum of the frequencies. We remove the first two nodes and place the new node in the list based on it's value (2 in this case).



We continue this process again, taking the first two nodes, adding the frequencies, and making a new node with the sum.



Again, we repeat the process.



At this point we are finally combining an original node with one of our new ones. Luckily, the process remains the same.



For completeness, I will include each step hereafter without descriptions, but if you've followed thus far, you should be able to see how each step is done.



In the above figure, see that the root node of the tree, 40, is the total count of all the literals in the source stream. At this point we can generate the codes for our literals.

Code Generation
The codes for our literals will be sets of bits of varying length. The code lengths for the more frequent literals will be shorter than lengths for rare literals. To generate a code for a literal, you traverse the tree, appending a '0' for a left branch, and a '1' for a right branch. So the code for the literal 'a' would be '00' and the code for 't' will be '011'. Below I have made a table containing each literal and it's Huffman code.



As you can see, the length of a literals code is inversely proportional to it's frequency in the source stream.

Compression
At this point we can represent our source stream using our newly generated codes.

1100100011101001011110100011101100100011101000010010100010011110100011111110010
1010001111011110100111101000111011001000111011111

Counting these bits we get 128, or 16 bytes, as opposed to the original 40. We've compressed it by 60%! But, this is only half of the battle. Somewhere, sometime you'll want the original stream back, so we move onto decompression.

Decompression
If you've followed along this far, you should easily see that decompression is a snap. Given the alphabet of literal/code pairs in the above table, we can convert the stream of bits back into literals. Using the tree makes it even easier. Looking at the long line of bits above, the first bit is '1', so from the root of the tree, we go down the right branch. Since we're not in a "leaf" yet, we get the next bit from the stream, a '1'. We go down the right branch and we're still not in a leaf. The next bit '0' takes us down the left branch. Again, we must keep looking. The next bit '0' puts us in a leaf for the literal 'I'. Here we would write an I to the output. You continue this process until the entire stream is decompressed.

Implementation
One issue you may have noticed has not been mentioned. What happens when one process compresses and another process needs to decompress? How will the decompressor get the alphabet? This is how all the many Huffman derivatives differ, each packing their alphabet into some type of block header before the compressed data. This method is truly dependent on the application. The resources listed at the end of this article will lead you to some more verbose methods.

Conclusion
Hopefully you now have a general idea behind the process of how to generate Huffman codes and compress/decompress a stream of characters. Here I have listed some valuable links related to this topic.

An excellent Java animated huffman example, it shows how the tree is built and traversed
definition of Huffman encoding
another good resource

Did you like this article? There are hundreds more.

Comments:
paranoia
2004-07-27 19:03:59
very well written article jebus, and i'm glad you finally went with mspaint diagrams, as they illustrated the examples very effectively. i may now have to code my own huffman encoder/decoder (compressor/decompressor) for learning purposes.
TroPe
2004-10-23 21:35:30
Great work!
lonetech
2005-05-07 21:39:16
Good article. Not quite all compressors put the tree in a header block though; some rebalance the tree as they work.
Anonymous
2008-12-23 12:11:18
dfbdfbndfnfnd
Anonymous
2009-11-08 17:05:00
Quote Anon:
can u tell me in details about what is cookie?
A cookie is how a site recognizes your browser from other browsers; it's how when you check the remember me checkbox, you don't have to login again. Articulos Gratis
Anonymous
2011-05-03 09:11:07
Not quite all compressors put the tree in a header block though; some rebalance the tree as they work. new era hats
new era caps
DC Hats
Fox Hats
NFL Hats
MLB Hats
Red Bull Hats
New York Yankees Hats
Monster Energy Hats
Anonymously add a comment: (or register here)
(registration is really fast and we send you no spam)
BB Code is enabled.
Captcha Number:


Blogs: (People who have posted blogs on this subject..)
atreyu
JPEG compression on Fri 2nd Feb 1am
The JPEG committee was created in 1986 as a join of several groups that were working on presentation of photo quality graphics. Media storage and communication speed were big concerns at that time, and naturally JPEG put his focus on data compression. Six


     
Your Ad Here
 
Copyright Open Source Institute, 2006