• Oddities of popular archivers

    From Elhana@21:1/5 to All on Wed Jul 10 09:21:34 2019
    I used some popular archivers to compress a text file, and the results surprised me quite a bit.

    The worst contender turned out to be gzip, with an average 60% reduction. Interestingly, the UTF-8 version compresses worse than the ISO one, with about 15% overhead. I was under the impression that both files contain the same amount of information, so
    they should compress to a comparable amount.

    The next result belongs to PKZIP. It managed to compress each file about 40 bytes better than gzip (the gzip header was 25 bytes long).

    The next result belongs to xzip. It handled the UTF-8 text much better, giving only 8% overhead (which is still too much in my opinion). Average compression was 70%.

    Next comes 7-Zip, with default settings, which failed spectacularly on the UTF-8 file: its output turned out 8 KB larger than xzip's. The other files compressed about 400 bytes better.

    The silver prize went to bzip2, with its impressive 72% compression. Surprisingly, it processed the UTF-8 file even better, with only 5% overhead.

    And the undisputed champion was WinRAR, with 81% compression.

    The following questions arose:

    * Why does xzip suck?
    * Why is UTF-8 not handled well by mainstream compression software?
    * And why does proprietary compression software so easily outperform the 'free' one?

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Wed Jul 10 14:51:10 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > I used some popular archivers to compress a text file, and the
    > results surprised me quite a bit.
    >
    > The worst contender turned out to be gzip, with an average 60%
    > reduction. Interestingly, the UTF-8 version compresses worse
    > than the ISO one, with about 15% overhead. I was under the
    > impression that both files contain the same amount of
    > information, so they should compress to a comparable amount.

    What do you mean by "ISO"? I'm going to guess that you're referring to something like UTF-16 or UCS-2, commonly used on Windows.

    > The next result belongs to PKZIP. It managed to compress each
    > file about 40 bytes better than gzip (the gzip header was 25
    > bytes long).
    >
    > The next result belongs to xzip. It handled the UTF-8 text
    > much better, giving only 8% overhead (which is still too much
    > in my opinion). Average compression was 70%.
    >
    > Next comes 7-Zip, with default settings, which failed
    > spectacularly on the UTF-8 file: its output turned out 8 KB
    > larger than xzip's. The other files compressed about 400 bytes
    > better.
    >
    > The silver prize went to bzip2, with its impressive 72%
    > compression. Surprisingly, it processed the UTF-8 file even
    > better, with only 5% overhead.
    >
    > And the undisputed champion was WinRAR, with 81% compression.
    >
    > The following questions arose:
    >
    > * Why does xzip suck?
    > * Why is UTF-8 not handled well by mainstream compression
    >   software?
    > * And why does proprietary compression software so easily
    >   outperform the 'free' one?

    A UTF-8 version of a given chunk of text is very likely to be smaller
    than a UTF-16 or UCS-2 version of the same text. UCS-2 represents each
    character as 16 bits. UTF-8 represents each character in the range
    0..127 in 8 bits (and in typical text, those are likely to be most
    characters).

    For a large chunk of ASCII text, it's likely that a UTF-8 representation
    will be about half the size of a UCS-2 or UTF-16 representation.
    Compressed versions of both are likely to be roughly the same size,
    since the files contain about the same amount of information. This
    could vary if the input contains a lot of non-ASCII characters.
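    For example, a quick sketch in Python (zlib implements the same DEFLATE
    compression gzip uses; the sample string here is made up for
    illustration):

    ```python
    import zlib

    # Mostly-ASCII sample text; a large chunk of English prose behaves
    # similarly.
    text = "The quick brown fox jumps over the lazy dog. " * 1000

    utf8 = text.encode("utf-8")        # 1 byte per ASCII character
    utf16 = text.encode("utf-16-le")   # 2 bytes per character

    print(len(utf8), len(utf16))       # raw UTF-16 form is twice as large

    # After DEFLATE compression the two outputs end up much closer in size.
    print(len(zlib.compress(utf8, 9)), len(zlib.compress(utf16, 9)))
    ```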

    (I don't have answers for your other questions.)

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Elhana@21:1/5 to All on Wed Jul 10 19:59:34 2019
    Keith Thompson:

    > What do you mean by "ISO"?

    ISO 8859-X family of encodings.

    > Compressed versions of both are likely to be roughly the same size,
    > since the files contain about the same amount of information.

    A 15% (DEFLATE) or 8% (LZMA) worse result for UTF-8 compared to an ISO text is quite a lot for "roughly the same" size, in my opinion.

    Maybe old primitive algorithms were tuned to English ASCII text, with
    customary match lengths, or single-order arithmetic coders, and they
    get confused when every symbol in the text is suddenly 2-3 bytes long,
    but LZMA is pretty much a modern, sophisticated, state-of-the-art
    algorithm.
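    Roughly what I did, sketched in Python with zlib (the sample text here
    is invented; my real input was a large natural-language file):

    ```python
    import zlib

    # Accented text, so ISO 8859-1 stays at 1 byte per character while
    # UTF-8 needs 2 bytes for each non-ASCII letter.
    text = "Übermäßig häufige Prüfungen ärgern müde Schüler. " * 2000

    iso = text.encode("iso-8859-1")
    utf8 = text.encode("utf-8")

    print(len(iso), len(utf8))   # the UTF-8 form is larger before compression
    print(len(zlib.compress(iso, 9)), len(zlib.compress(utf8, 9)))
    ```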

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Thu Jul 11 13:22:54 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > Keith Thompson:
    >
    >> What do you mean by "ISO"?
    >
    > ISO 8859-X family of encodings.

    OK.

    The term "ISO" for such encodings is probably incorrect, and certainly ambiguous. (Microsoft's 8-bit encodings, such as Windows-1252, are
    sometimes called "ANSI", which is also incorrect.)

    >> Compressed versions of both are likely to be roughly the same size,
    >> since the files contain about the same amount of information.
    >
    > A 15% (DEFLATE) or 8% (LZMA) worse result for UTF-8 compared to an
    > ISO text is quite a lot for "roughly the same" size, in my opinion.
    >
    > Maybe old primitive algorithms were tuned to English ASCII text, with
    > customary match lengths, or single-order arithmetic coders, and they
    > get confused when every symbol in the text is suddenly 2-3 bytes
    > long, but LZMA is pretty much a modern, sophisticated,
    > state-of-the-art algorithm.

    It's hard to tell just what you're comparing.

    Are you using multiple encodings of the same text? If not, you could be
    doing an apples-to-oranges comparison (yes, they're both fruits, but
    there's not much more you can usefully say about them).

    If you are, are most of the characters within the 7-bit ASCII set?
    UTF-8 encodes each character in 1 or more bytes; what is the
    distribution of byte counts for your input text? What are the numbers
    for (a) the number of characters in your input, (b) the number of bytes
    in each encoding you're using, and (c) the compressed size of each
    encoding?

    (Don't assume that I'll be able to say anything useful even given that information, but others might.)
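    For what it's worth, gathering (a)-(c) could look something like this
    (a Python sketch with zlib standing in for DEFLATE; the sample string
    is just a placeholder for your file's contents):

    ```python
    import zlib

    def encoding_stats(text: str) -> None:
        print("characters:", len(text))              # (a)
        for enc in ("utf-8", "iso-8859-1"):
            raw = text.encode(enc)
            comp = zlib.compress(raw, 9)
            print(f"{enc}: {len(raw)} bytes,"        # (b)
                  f" {len(comp)} compressed")        # (c)

    encoding_stats("Ein Beispieltext mit Umlauten: äöü. " * 100)
    ```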

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Elhana@21:1/5 to All on Sun Jul 14 08:35:42 2019
    Keith Thompson:

    > Are you using multiple encodings of the same text?

    Yes.

    > distribution of byte counts for your input text?

    A natural language one.

    > What are the numbers...

    The input text (in UTF-8 form) had 4023k bytes in 2252k characters. The DEFLATE algorithm reduced those to 892 or 1061 bytes correspondingly.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Keith Thompson@21:1/5 to Elhana on Sun Jul 14 18:38:25 2019
    Elhana <tanarriscourge@yahoo.com> writes:
    > Keith Thompson:
    >
    >> Are you using multiple encodings of the same text?
    >
    > Yes.
    >
    >> distribution of byte counts for your input text?
    >
    > A natural language one.
    >
    >> What are the numbers...
    >
    > The input text (in UTF-8 form) had 4023k bytes in 2252k
    > characters. The DEFLATE algorithm reduced those to 892 or 1061
    > bytes correspondingly.

    That's not enough information, or at least it's unclear. And did you
    really mean 892 and 1061 bytes, or 892k and 1061k bytes? (I suggest
    quoting exact character/byte counts. "4023k" is both approximate and
    ambiguous; "k" could be either 1000 or 1024.)

    Here's my best guess at what you're saying:

    You have two files containing different encodings of the same text:
    - utf8.txt is 4023k bytes (averaging about 1.79 bytes per character).
    - latin1.txt is 2252k bytes.
    All characters have code points in the range 0..255 (otherwise a Latin-1 encoding would not be possible).

    Compressing utf8.txt with the DEFLATE algorithm (using what program?)
    yields 892k bytes of compressed output.

    Compressing latin1.txt with the DEFLATE algorithm yields 1061k bytes of compressed output.

    Since both utf8.txt and latin1.txt contain very nearly the same
    information, ideally a compression algorithm *should* yield outputs of
    similar size for both input files, but you're seeing a 19% difference,
    and you're wondering why.

    Is my description correct?

    (BTW, I got roughly similar results with a randomly generated chunk of
    text and the gzip command.)
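    A sketch of that experiment in Python (the alphabet and sizes here are
    arbitrary; zlib provides the same DEFLATE compression as the gzip
    command, minus gzip's header and trailer):

    ```python
    import random
    import zlib

    # Random word-like text including a few accented Latin-1 characters.
    random.seed(0)
    alphabet = "abcdefghij klmnopé àüö"
    text = "".join(random.choice(alphabet) for _ in range(500_000))

    for enc in ("iso-8859-1", "utf-8"):
        raw = text.encode(enc)
        comp = zlib.compress(raw, 9)
        print(enc, len(raw), len(comp))
    ```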

    --
    Keith Thompson (The_Other_Keith) kst-u@mib.org <http://www.ghoti.net/~kst>
    Will write code for food.
    void Void(void) { Void(); } /* The recursive call of the void */

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)