Another way to store a canonical Huffman table
From
news@zzo38computer.org.invalid@21:1/5 to
All on Fri Feb 7 21:56:32 2020
I thought of the way to store a canonical Huffman table: First start one
bit telling if the largest code length is odd or even. And then store the number of codes of each length using truncated binary. And then store the values in truncated binary, alternating lowest and highest values, in order that the window is made narrow after each one.
For the storage of the values, consider that values already specified are
no longer possible for longer codes, so they can be skipped when making the window of possible values. Also consider that, because the number of codes
of a length is known, you might know that the next value (which is the
lowest, or maybe highest, used value with this code length) cannot be the
few highest (or maybe lowest) values in the window, because then it will
not leave enough room for any other values for codes of this length, so
this reduces the window further. Sometimes the window will be reduced to
a single value in this way, meaning it can be coded in zero bits.
For the list of how many codes per length, note that to determine the
maximum, start up to 1 for codes of zero length, and then double the number
of unused codes for the number of codes of the next length. The bit at the beginning specifies if the largest code length is odd or even, so if you
know this isn't the largest code length then the maximum can temporarily be decreased by one.
For example:
3 -- space a e
4 -- f h i m n s t
5 -- l o p r u x
Assume these are 8-bit ASCII codes, so the range of values is 0 to 255.
The maximum code length is five. Count codes per length:
0: 0/1 *
1: 0/2
2: 0/4 *
3: 3/8
4: 7/10 *
5: 6/6
The ones with asterisks are one less maximum. So the encoding will be:
1..0.00.011.1101.111 (fourteen bits)
Now we have to encode the values that the codes represent. The values to
be encoded, and the ranges that they are encoded in, are:
Length 3:
(0-255 / 0-253) 32
(33-255 / 34-255) 101
(33-100) 97
Length 4:
(0-255 except 32, 97, 101 / 0-249) 102
(103-255 / 108-255) 116
(103-115 / 103-111) 104
(105-115 / 108-115) 115
(105-114 / 105-112) 105
(106-114 / 107-114) 110
(106-109) 109
Length 5:
(0-255 except numbers used above / 0-250) 108
(111-255 except 115, 116 / 115-255) 120
(111-119 except 115, 116 / 111-114) 111
(112-119 except 115, 116 / 114-119) 117
(112-114 / 112-113) 112
(113-114) 114
Each range in parentheses first gives the range not considering the number
of codes of this length, and then the second range after the slash uses the consideration of how many codes of this length to shorten the window (with
the same exceptions listed before the slash).
The encoding into binary is left as an exercise for the reader.
Further improvements may also be possible, which I have not considered. (Possibly variable base coding, nCr coding, etc. Those might be more complicated to implement though; I am unsure.)
I have not compared this working with other schemes, such as the way used
in DEFLATE. Do you know how well it work compared with the other way? Are
there other schemes that I might not know of?
--
Note: I am not always able to read/post messages during Monday-Friday.
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
From
Harry Potter@21:1/5 to
All on Thu Mar 26 15:04:05 2020
I have my own technique to shorten Canonical Huffman codes. It follows:
* Start with the 16-code ranges to be included. Basically, 16 bits are used, where each bit contains whether a given 16-value range will be excluded from the header. I prepend a bit to determine whether any range will be excluded. This can save 15
bits.
* Write the least and most bits per entry. The most only needs to take into account the range from least to 16. Arithmetic coding can be used. I have a way of shortening values that are *not* in a range of a power of 2 but am not currently going to
reveal it.
* Write a bit to indicate whether a 0-value is used.
* Write the most-used value.
* Scan the Huffman table:
* Write a bit to determine whether the current value is the same as the previous.
* If not, write a bit to determine whether it is the most-often-used value.
* If not, write the value minus the least size plus whether the zero-length value is used, or, if a zero-value, 0, and skip the previous value and the most-occurring value.
Try this out, and tell me what you think.
BTW, I find that skipping blocks compressible with LZ helps with the size of the header.
BTW, I'd better study your idea nd see if I can add it to my technique.
--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)