Notes on gzip and DEFLATE format

June 4, 2009 at 5:59 am 3 comments

.gz file is just a wrapper arround deflate format. Nothing special. See rfc1952 if you want.

The way that gzip compress files is easy to understand. 1. If a string repeats, we can simply say: rewind x bytes and copy y bytes from there. 2. Use Haffman code to further reduce file size.

However, deflate format is complicated, largely due to the way it represents Huffman trees. With some restrictions (see RFC1951 section 3.2.1 and 3.2.2), you only need to write down how many bits each leaf node has. Thus, a Huffman tree with N leaf nodes can be represented as an N-integer sequence.

Besides, gzip code is in an old fashioned format and not very reader friendly. Variables are named as b, kb, bb. To undestand deflate format, I tried 3 files.

> echo hello > hello.txt
> gzip -c hello.txt > hello.gz

The content of hello.gz is

00000000  1F 8B 08 08 C0 C4 1D 4A 00 03 68 65 6C 6C 6F 2E .......J..hello.
00000010  74 78 74 00 CB 48 CD C9 C9 E7 02 00 20 30 3A 36 txt..H...... 0:6
00000020  06 00 00 00                                     ....

Compressed data is CB 48 CD C9 C9 E7 02 00. Deflate format is a bit stream. The RFC defined that when packing bits into bytes, store into LSB first. In our example, the first byte is CB (1100 1011). So the first bit is the right most 1. According to the RFC, this means we already reached the last data block. As we expected, we have only one data block here.

The next field is a 2-bit data field. According to the RFC, data fields are stored in their original bit order. So we got ’01’, which means this data block is compressed with fixed Huffman codes. The way to pack bits into bytes are confusing for human, but easy for computers to carry out. See section 3.2.6 for coding schema. For ASCII code, the result is simply ascii+0x30.

Now, we reached the real data. Since data are coded by Huffman codes, length of next field depends on its content. Lets cheat here. We know that we are going to get an ‘h’, which is 8-bit long. RFC says Huffman codes should be packed MSB first. That is to say, we need to reverse the bits to get the real Huffman code. Next 8-bit is 00011001 (left 3 bits from the 2nd byte). Revers it we get 10011000. Minus 0x30 we get the ascii code for ‘h’, 1101000. Repeat this step, we’ll get “ello\n” and a termination mark (7 zero bits). Gzip’s implementation does not realy reverse the Huffman code, it use pre-computed tables to accelerate this step.

The 2nd example is
> echo hello hello > hello2.txt
> gzip -c hello2.txt > hello2.gz

The content of hello2.gz is

00000000  1F 8B 08 08 22 BE 1E 4A 00 03 68 65 6C 6C 6F 32 ...."..J..hello2
00000010  2E 74 78 74 00 CB 48 CD C9 C9 57 C8 00 91 5C 00 .txt..H...W...\.
00000020  A5 6A 0A 44 0C 00 00 00                         .j.D....

Following the procedure, we can easily get “hello h”. Then we got a 7-bit Huffman code, 0x02, which represents 258. 258 means repeat a 4-byte string in the output buffer. The RFC call this kind of code (258-285) length code, but I’d rather call them repeat code. Next one or two fields specify offset. In our case, we get 4 and 1. Following the 2nd table in RFC section 3.2.5, we get offset=6. Rewind 6 bytes and read 4 bytes from output buffer, we get “ello”. In this case, gzip did some compression, but it didn’t do its best.

The 3rd file is a real case:
> wget http://www.ietf.org/rfc/rfc1951.txt
> gzip -c rfc1951.txt > rfc1951.gz

The content of rfc1951.gz is:
00000000 1F 8B 08 08 31 25 A2 31 00 03 72 66 63 31 39 35 ….1%.1..rfc195
00000010 31 2E 74 78 74 00 B5 7D 6B 77 DB 46 92 E8 77 9D 1.txt..}kw.F..w.
00000020 FB 23 7A 7D CF 9E 90 31 49 11 7C E8 69 E7 AC 2C .#z}…1I.|.i..,
00000030 CB 8E 27 B2 D7 37 72 26 99 CC 7A E7 80 64 93 C4 ..’..7r&..z..d..
……

This time, gzip use dynamic Huffman codes. To make it simpler, I would explan compress process before jumping into the file.
1) generate 2 Huffman trees( called dictionary, one for alphabet and repeat marks, one for distance) and encoding the text file with these 2 trees.
2) recall that a Huffman tree can be represented as a sequence of integers. The dictionary can be repsented as 2 sequences. Totally 256(alphabet) + 1 (termination code) + 29(repeat code) + 30(distance) = 316 elements. In this step, we encode the dictionary with another Huffman tree. Maximum value in dictionary is 15 (=maximum depth of alphabet and repeat code tree). We gave 16, 17, 18 special meanings (see RFC section 3.2.7). Thus, this new tree has 18 nodes.

Alright, let’s look at the file. Header saids we have only 1 block, use dynamic Huffman coding, has 22 + 257 = 279 nodes in alphabet and repeat code tree, 1 + 29 = 30 nodes in the distance tree and 11 + 4 = 15 nodes in the 2nd level Huffman tree. In the 2nd level tree, 8, 7, 9, 6, 10 are represented as 3-bit code. (Note: although 8 appears before 6 according to the RFC, 6 still has lower code then 8). Decode the tree, we get 6->000, 7->001, … Then, we can decode the alphabet and repeat code tree. 00-09 does not appear, 0A (\n) as 8-bit value, …, 32 (space) as 5-bit, …

Entry filed under: code reading. Tags: gzip format.

Openfire and Jwchat on Ubuntu Build Mame with Visual Studio

3 Comments Add your own

1. Daniel | March 28, 2011 at 10:08 am

Thank you very much for this explanation. This really helped me a lot to understand what deflate is doing.
2. Encoding Web Shells in PNG IDAT chunks | phpbug's life | January 30, 2015 at 1:47 am

[…] If only it were that simple Sadly, if you run DEFLATE over the above string you get a load of garbage out, the string hasn’t been compressed but the DEFLATE results don’t start on a byte boundary and are encoded using LSB rather than MSB. I won’t go into it in too much detail but you can read more on Pograph’s weblog […]
3. Crisp ie | January 23, 2021 at 9:40 pm

Whats a pograph?

Trackback this post | Subscribe to the comments via RSS Feed

Pograph's Weblog