• Oddities of popular archivers

    From Elhana@21:1/5 to All on Wed Jul 10 09:21:34 2019
    I used some popular archivers to compress a text file, and the results surprised me quite a bit.

    The worst contender turned out to be gzip, with an average 60% reduction. Interestingly, the UTF-8 version compresses worse than the ISO one, with about 15% overhead. I was under the impression that both files contain the same amount of information, so
    they should compress to a comparable amount.

    The next result belongs to PKZIP. It managed to compress each file about 40 bytes better than gzip (the gzip header was 25 bytes long).

    The next result belongs to xzip. It handled the UTF-8 text much better, giving only 8% overhead (which is still too much in my opinion). Average compression was 70%.

    Next comes 7-Zip, with default settings, which failed spectacularly on the UTF-8 file: its output turned out 8 KB larger than xzip's. The other files compressed about 400 bytes better.

    The silver prize went to bzip2, with its impressive 72% compression. Surprisingly, it processed the UTF-8 file even better, with only 5% overhead.

    And the undisputed champion was WinRAR, with 81% compression.

    The following questions arose:

    * Why does xzip suck?
    * Why is UTF-8 not handled well by mainstream compression software?
    * And why does proprietary compression software so easily outperform the 'free' one?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Wed Jul 10 14:51:10 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > I used some popular archivers to compress a text file, and the
    > results surprised me quite a bit.
    >
    > The worst contender turned out to be gzip, with an average 60%
    > reduction. Interestingly, the UTF-8 version compresses worse
    > than the ISO one, with about 15% overhead. I was under the
    > impression that both files contain the same amount of
    > information, so they should compress to a comparable amount.

    What do you mean by "ISO"? I'm going to guess that you're referring to something like UTF-16 or UCS-2, commonly used on Windows.

    > The next result belongs to PKZIP. It managed to compress each
    > file about 40 bytes better than gzip (the gzip header was 25
    > bytes long).
    >
    > The next result belongs to xzip. It handled the UTF-8 text
    > much better, giving only 8% overhead (which is still too much
    > in my opinion). Average compression was 70%.
    >
    > Next comes 7-Zip, with default settings, which failed
    > spectacularly on the UTF-8 file: its output turned out 8 KB
    > larger than xzip's. The other files compressed about 400 bytes
    > better.
    >
    > The silver prize went to bzip2, with its impressive 72%
    > compression. Surprisingly, it processed the UTF-8 file even
    > better, with only 5% overhead.
    >
    > And the undisputed champion was WinRAR, with 81% compression.
    >
    > The following questions arose:
    >
    > * Why does xzip suck?
    > * Why is UTF-8 not handled well by mainstream compression
    >   software?
    > * And why does proprietary compression software so easily
    >   outperform the 'free' one?

    A UTF-8 version of a given chunk of text is very likely to be smaller
    than a UTF-16 or UCS-2 version of the same text. UCS-2 represents each
    character as 16 bits. UTF-8 represents each character in the range
    0..127 in 8 bits (and in typical text, those are likely to be most
    characters).

    For a large chunk of ASCII text, it's likely that a UTF-8 representation
    will be about half the size of a UCS-2 or UTF-16 representation.
    Compressed versions of both are likely to be roughly the same size,
    since the files contain about the same amount of information. This
    could vary if the input contains a lot of non-ASCII characters.
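    For example, a quick sketch in Python (zlib implements the same DEFLATE
    compression gzip uses; the sample string here is made up for
    illustration):

    ```python
    import zlib

    # Mostly-ASCII sample text; a large chunk of English prose behaves
    # similarly.
    text = "The quick brown fox jumps over the lazy dog. " * 1000

    utf8 = text.encode("utf-8")        # 1 byte per ASCII character
    utf16 = text.encode("utf-16-le")   # 2 bytes per character

    print(len(utf8), len(utf16))       # raw UTF-16 form is twice as large

    # After DEFLATE compression the two outputs end up much closer in size.
    print(len(zlib.compress(utf8, 9)), len(zlib.compress(utf16, 9)))
    ```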

    (I don't have answers for your other questions.)

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Elhana@21:1/5 to All on Wed Jul 10 19:59:34 2019
    Keith Thompson:

    > What do you mean by "ISO"?

    ISO 8859-X family of encodings.

    > Compressed versions of both are likely to be roughly the same size,
    > since the files contain about the same amount of information.

    A 15% (DEFLATE) or 8% (LZMA) worse result for UTF-8 compared to an ISO text is quite a lot for "roughly the same" size, in my opinion.

    Maybe old primitive algorithms were tuned to English ASCII text, with
    customary match lengths, or single-order arithmetic coders, and they
    get confused when every symbol in the text is suddenly 2-3 bytes long,
    but LZMA is pretty much a modern, sophisticated, state-of-the-art
    algorithm.
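    Roughly what I did, sketched in Python with zlib (the sample text here
    is invented; my real input was a large natural-language file):

    ```python
    import zlib

    # Accented text, so ISO 8859-1 stays at 1 byte per character while
    # UTF-8 needs 2 bytes for each non-ASCII letter.
    text = "Übermäßig häufige Prüfungen ärgern müde Schüler. " * 2000

    iso = text.encode("iso-8859-1")
    utf8 = text.encode("utf-8")

    print(len(iso), len(utf8))   # the UTF-8 form is larger before compression
    print(len(zlib.compress(iso, 9)), len(zlib.compress(utf8, 9)))
    ```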

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Thu Jul 11 13:22:54 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > Keith Thompson:
    >
    >> What do you mean by "ISO"?
    >
    > ISO 8859-X family of encodings.

    OK.

    The term "ISO" for such encodings is probably incorrect, and certainly ambiguous. (Microsoft's 8-bit encodings, such as Windows-1252, are
    sometimes called "ANSI", which is also incorrect.)

    >> Compressed versions of both are likely to be roughly the same size,
    >> since the files contain about the same amount of information.
    >
    > A 15% (DEFLATE) or 8% (LZMA) worse result for UTF-8 compared to an
    > ISO text is quite a lot for "roughly the same" size, in my opinion.
    >
    > Maybe old primitive algorithms were tuned to English ASCII text, with
    > customary match lengths, or single-order arithmetic coders, and they
    > get confused when every symbol in the text is suddenly 2-3 bytes
    > long, but LZMA is pretty much a modern, sophisticated,
    > state-of-the-art algorithm.

    It's hard to tell just what you're comparing.

    Are you using multiple encodings of the same text? If not, you could be
    doing an apples-to-oranges comparison (yes, they're both fruits, but
    there's not much more you can usefully say about them).

    If you are, are most of the characters within the 7-bit ASCII set?
    UTF-8 encodes each character in 1 or more bytes; what is the
    distribution of byte counts for your input text? What are the numbers
    for (a) the number of characters in your input, (b) the number of bytes
    in each encoding you're using, and (c) the compressed size of each
    encoding?

    (Don't assume that I'll be able to say anything useful even given that information, but others might.)
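    For what it's worth, gathering (a)-(c) could look something like this
    (a Python sketch with zlib standing in for DEFLATE; the sample string
    is just a placeholder for your file's contents):

    ```python
    import zlib

    def encoding_stats(text: str) -> None:
        print("characters:", len(text))              # (a)
        for enc in ("utf-8", "iso-8859-1"):
            raw = text.encode(enc)
            comp = zlib.compress(raw, 9)
            print(f"{enc}: {len(raw)} bytes,"        # (b)
                  f" {len(comp)} compressed")        # (c)

    encoding_stats("Ein Beispieltext mit Umlauten: äöü. " * 100)
    ```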

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Elhana@21:1/5 to All on Sun Jul 14 08:35:42 2019
    Keith Thompson:

    > Are you using multiple encodings of the same text?

    Yes.

    > distribution of byte counts for your input text?

    A natural language one.

    > What are the numbers...

    The input text (in UTF-8 form) had 4023k bytes in 2252k characters. The DEFLATE algorithm reduced those to 892 or 1061 bytes correspondingly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Sun Jul 14 18:38:25 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > Keith Thompson:
    >
    >> Are you using multiple encodings of the same text?
    >
    > Yes.
    >
    >> distribution of byte counts for your input text?
    >
    > A natural language one.
    >
    >> What are the numbers...
    >
    > The input text (in UTF-8 form) had 4023k bytes in 2252k
    > characters. The DEFLATE algorithm reduced those to 892 or 1061
    > bytes correspondingly.

    That's not enough information, or at least it's unclear. And did you
    really mean 892 and 1061 bytes, or 892k and 1061k bytes? (I suggest
    quoting exact character/byte counts. "4023k" is both approximate and
    ambiguous; "k" could be either 1000 or 1024.)

    Here's my best guess at what you're saying:

    You have two files containing different encodings of the same text:
    - utf8.txt is 4023k bytes (averaging about 1.79 bytes per character).
    - latin1.txt is 2252k bytes.
    All characters have code points in the range 0..255 (otherwise a Latin-1 encoding would not be possible).

    Compressing utf8.txt with the DEFLATE algorithm (using what program?)
    yields 892k bytes of compressed output.

    Compressing latin1.txt with the DEFLATE algorithm yields 1061k bytes of compressed output.

    Since both utf8.txt and latin1.txt contain very nearly the same
    information, ideally a compression algorithm *should* yield outputs of
    similar size for both input files, but you're seeing a 19% difference,
    and you're wondering why.

    Is my description correct?

    (BTW, I got roughly similar results with a randomly generated chunk of
    text and the gzip command.)
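    A sketch of that experiment in Python (the alphabet and sizes here are
    arbitrary; zlib provides the same DEFLATE compression as the gzip
    command, minus gzip's header and trailer):

    ```python
    import random
    import zlib

    # Random word-like text including a few accented Latin-1 characters.
    random.seed(0)
    alphabet = "abcdefghij klmnopé àüö"
    text = "".join(random.choice(alphabet) for _ in range(500_000))

    for enc in ("iso-8859-1", "utf-8"):
        raw = text.encode(enc)
        comp = zlib.compress(raw, 9)
        print(enc, len(raw), len(comp))
    ```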

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)