• Another way to store a canonical Huffman table

    From news@zzo38computer.org.invalid@21:1/5 to All on Fri Feb 7 21:56:32 2020
    I thought of the way to store a canonical Huffman table: First start one
    bit telling if the largest code length is odd or even. And then store the number of codes of each length using truncated binary. And then store the values in truncated binary, alternating lowest and highest values, in order that the window is made narrow after each one.

    For the storage of the values, consider that values already specified are
    no longer possible for longer codes, so they can be skipped when making the window of possible values. Also consider that, because the number of codes
    of a length is known, you might know that the next value (which is the
    lowest, or maybe highest, used value with this code length) cannot be the
    few highest (or maybe lowest) values in the window, because then it will
    not leave enough room for any other values for codes of this length, so
    this reduces the window further. Sometimes the window will be reduced to
    a single value in this way, meaning it can be coded in zero bits.

    For the list of how many codes per length, note that to determine the
    maximum, start up to 1 for codes of zero length, and then double the number
    of unused codes for the number of codes of the next length. The bit at the beginning specifies if the largest code length is odd or even, so if you
    know this isn't the largest code length then the maximum can temporarily be decreased by one.

    For example:
    3 -- space a e
    4 -- f h i m n s t
    5 -- l o p r u x

    Assume these are 8-bit ASCII codes, so the range of values is 0 to 255.

    The maximum code length is five. Count codes per length:
    0: 0/1 *
    1: 0/2
    2: 0/4 *
    3: 3/8
    4: 7/10 *
    5: 6/6

    The ones with asterisks are one less maximum. So the encoding will be:
    1..0.00.011.1101.111 (fourteen bits)

    Now we have to encode the values that the codes represent. The values to
    be encoded, and the ranges that they are encoded in, are:
    Length 3:
    (0-255 / 0-253) 32
    (33-255 / 34-255) 101
    (33-100) 97
    Length 4:
    (0-255 except 32, 97, 101 / 0-249) 102
    (103-255 / 108-255) 116
    (103-115 / 103-111) 104
    (105-115 / 108-115) 115
    (105-114 / 105-112) 105
    (106-114 / 107-114) 110
    (106-109) 109
    Length 5:
    (0-255 except numbers used above / 0-250) 108
    (111-255 except 115, 116 / 115-255) 120
    (111-119 except 115, 116 / 111-114) 111
    (112-119 except 115, 116 / 114-119) 117
    (112-114 / 112-113) 112
    (113-114) 114

    Each range in parentheses first gives the range not considering the number
    of codes of this length, and then the second range after the slash uses the consideration of how many codes of this length to shorten the window (with
    the same exceptions listed before the slash).

    The encoding into binary is left as an exercise for the reader.

    Further improvements may also be possible, which I have not considered. (Possibly variable base coding, nCr coding, etc. Those might be more complicated to implement though; I am unsure.)

    I have not compared this working with other schemes, such as the way used
    in DEFLATE. Do you know how well it work compared with the other way? Are
    there other schemes that I might not know of?

    --
    Note: I am not always able to read/post messages during Monday-Friday.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Harry Potter@21:1/5 to All on Thu Mar 26 15:04:05 2020
    I have my own technique to shorten Canonical Huffman codes. It follows:

    * Start with the 16-code ranges to be included. Basically, 16 bits are used, where each bit contains whether a given 16-value range will be excluded from the header. I prepend a bit to determine whether any range will be excluded. This can save 15
    bits.

    * Write the least and most bits per entry. The most only needs to take into account the range from least to 16. Arithmetic coding can be used. I have a way of shortening values that are *not* in a range of a power of 2 but am not currently going to
    reveal it.

    * Write a bit to indicate whether a 0-value is used.

    * Write the most-used value.

    * Scan the Huffman table:
    * Write a bit to determine whether the current value is the same as the previous.
    * If not, write a bit to determine whether it is the most-often-used value.
    * If not, write the value minus the least size plus whether the zero-length value is used, or, if a zero-value, 0, and skip the previous value and the most-occurring value.

    Try this out, and tell me what you think.

    BTW, I find that skipping blocks compressible with LZ helps with the size of the header.

    BTW, I'd better study your idea nd see if I can add it to my technique.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)