I used some popular archivers to compress a text file, and the
results surprised me quite a bit.
The worst contender turned out to be gzip, with an average 60%
reduction. Interestingly, the UTF-8 version compressed worse than
the ISO one, with about 15% overhead. I was under the impression
that both files contain the same amount of information, so they
should compress to a comparable size.
The next result belongs to PKZIP. It managed to compress each
file about 40 bytes better than gzip (the gzip header was 25
bytes long).
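Differences of a few dozen bytes can come from the container as
much as from the compressor itself. Here is a minimal sketch
(Python 3; "sample.txt" is a hypothetical file name) comparing a
raw DEFLATE stream with the same data wrapped in the gzip and ZIP
containers:

```python
# Minimal sketch: how much of the gzip/zip size difference is just the
# container around the DEFLATE stream. "sample.txt" is hypothetical.
import gzip
import io
import zipfile
import zlib

data = open("sample.txt", "rb").read()  # hypothetical input file

# Raw DEFLATE stream, no wrapper at all (negative wbits = no zlib header).
c = zlib.compressobj(9, zlib.DEFLATED, -15)
raw_size = len(c.compress(data) + c.flush())

gzip_size = len(gzip.compress(data, 9))  # 10-byte header + 8-byte CRC/size trailer

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED, compresslevel=9) as zf:
    zf.writestr("sample.txt", data)      # ZIP adds local header + central directory
zip_size = len(buf.getvalue())

print(f"raw DEFLATE: {raw_size}  gzip: {gzip_size}  zip: {zip_size}")
```

The DEFLATE payload is the same in all three cases here (everything
goes through zlib at level 9); only the fixed header and trailer
bytes differ, so any remaining gap between PKZIP and gzip on real
files comes from the compressor implementations themselves.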
The next result belongs to xzip. It handled the UTF-8 text much
better, giving only 8% overhead (which is still too much, in my
opinion). Average compression was 70%.
Next comes 7-zip, with default settings, which failed spectacularly
on the UTF-8 file: it came out 8k larger than the xzip one. The
other files compressed about 400 bytes better.
The silver prize went to bzip2, with its impressive 72%
compression. Surprisingly, it handled the UTF-8 file even better,
with only 5% overhead.
And the undisputed champion was WinRAR, with 81% compression.
The following questions arose:
* Why does xzip suck?
* Why is UTF-8 not handled well by mainstream compression software?
* And why does proprietary compression software so easily outperform
the 'free' one?
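These are not the exact archivers above, but here is a minimal
sketch of the same experiment using Python's zlib (DEFLATE, as in
gzip and PKZIP), bz2 and lzma (the algorithm behind xz and 7-zip)
modules; the file name "sample.txt" and the ISO-8859-1 choice are
assumptions, so swap in whichever 8859 variant your text actually
uses:

```python
# Minimal reproduction sketch: compress the same text encoded as
# ISO-8859-1 and as UTF-8 with DEFLATE, bzip2 and LZMA, and print the
# UTF-8 overhead. "sample.txt" is a hypothetical file name.
import bz2
import lzma
import zlib

text = open("sample.txt", encoding="utf-8").read()

versions = {
    "ISO-8859-1": text.encode("iso-8859-1", errors="replace"),
    "UTF-8": text.encode("utf-8"),
}
compressors = {
    "DEFLATE": lambda d: zlib.compress(d, 9),
    "bzip2":   lambda d: bz2.compress(d, 9),
    "LZMA":    lambda d: lzma.compress(d, preset=9),
}

for name, compress in compressors.items():
    sizes = {enc: len(compress(data)) for enc, data in versions.items()}
    overhead = 100.0 * (sizes["UTF-8"] / sizes["ISO-8859-1"] - 1.0)
    print(f"{name:7s}  ISO: {sizes['ISO-8859-1']:8d}  "
          f"UTF-8: {sizes['UTF-8']:8d}  overhead: {overhead:+5.1f}%")
```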
Keith Thompson:
What do you mean by "ISO"?
ISO 8859-X family of encodings.
Compressed versions of both are likely to be roughly the same size,
since the files contain about the same amount of information.
A 15% (DEFLATE) or 8% (LZMA) worse result for UTF-8 compared to ISO
text is quite a lot for "roughly the same" size, in my opinion.
Maybe old, primitive algorithms were tuned to English ASCII text,
with customary match lengths or single-order arithmetic coders, and
they get confused when every symbol in the text is suddenly 2-3
bytes long; but LZMA is a modern, sophisticated, state-of-the-art
algorithm.
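A small diagnostic sketch (again with a hypothetical "sample.txt")
makes the byte-level difference visible: it counts how many bytes
each character needs in UTF-8 and estimates the zeroth-order entropy
per byte for both encodings:

```python
# Diagnostic sketch: UTF-8 bytes per character, and the zeroth-order
# (single-byte) entropy of the two encodings of the same text.
# "sample.txt" is a hypothetical file name.
import math
from collections import Counter

text = open("sample.txt", encoding="utf-8").read()

byte_lengths = Counter(len(ch.encode("utf-8")) for ch in text)
print("UTF-8 bytes per character:", dict(sorted(byte_lengths.items())))

def entropy_bits_per_byte(data: bytes) -> float:
    counts = Counter(data)
    n = len(data)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

for name, data in (("ISO-8859-1", text.encode("iso-8859-1", errors="replace")),
                   ("UTF-8", text.encode("utf-8"))):
    print(f"{name:10s}  {len(data):9d} bytes  "
          f"{entropy_bits_per_byte(data):.2f} bits/byte (order-0)")
```

If the text is mostly non-ASCII, UTF-8 gives more bytes with lower
per-byte entropy, so the same information is just spread more
thinly; a compressor with decent context modelling should recover
most of that, which is roughly what the bzip2 and LZMA results above
show.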
Keith Thompson:
Are you using multiple encodings of the same text?
Yes.
What's the distribution of byte counts for your input text?
A natural language one.
What are the numbers...
The input text (in UTF-8 form) had 4023k bytes in 2252k
characters. The DEFLATE algorithm reduced those to 892k and 1061k
bytes respectively.