On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:
But I was trying to figure out a way to do it without uncompressing
and recompressing the data. I had hoped that the gzip header would
contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).
I'm guessing that the reason it lacks such a field is precisely so
that you can use it in a stream in just this manner. In order to
have a length in the header, gzip would need to be able to seek back to
the start of the file to modify the header, which isn't always
possible.
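For reference, the fixed part of the gzip header defined in RFC 1952 is only ten bytes: magic, compression method, flags, mtime, extra flags and OS. None of them is a length. A minimal Python sketch that unpacks those bytes (the name parse_gzip_header is made up for illustration):

```python
import gzip
import struct

def parse_gzip_header(data):
    """Unpack the 10-byte fixed gzip header (RFC 1952). It has no length field."""
    # "<" = little-endian, no padding: 2s magic, B method, B flags,
    # I mtime, B extra flags, B operating system id.
    magic, method, flags, mtime, xfl, os_id = struct.unpack("<2sBBIBB", data[:10])
    if magic != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_id}

# Demo on an in-memory gzip stream rather than a real device.
header = parse_gzip_header(gzip.compress(b"example payload"))
print(header)  # method 8 means deflate
```

Optional fields (extra data, original name, comment) may follow depending on the flags, but none of those carries the compressed size either.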
I wouldn't be surprised if it stores some kind of metadata at the end
of the file, but of course you can only find that if the end of the
file is marked in some way.
If you google the details of the gzip file format, you might be able
to figure out how to identify the end of the file, scan the image to
find this marker, and then use dd to extract just the desired range.
Unless the file is VERY large, I suspect that is going to take you
longer than just recompressing it all.
I can't imagine that there is any way around sequentially reading
the entire file to find the end, unless you have some mechanism that
can read a random block and determine whether it is valid gzip data.
If so, you could do a binary search, assuming the data on the drive
past the end of the file isn't valid gzip.
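The sequential read described above can be done without recompressing anything. Here is a rough Python sketch that inflates the stream only to find where it ends and throws the output away. The function name gzip_stream_length and the chunk size are my own choices, and it assumes the data is a single gzip member (concatenated members would need a loop):

```python
import gzip
import io
import zlib

def gzip_stream_length(stream, chunk_size=1 << 16):
    """Return how many bytes the first gzip member in `stream` occupies."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)
    consumed = 0
    while not d.eof:
        chunk = stream.read(chunk_size)
        if not chunk:
            raise EOFError("input ended before the gzip trailer")
        consumed += len(chunk)
        d.decompress(chunk)  # decompressed output is discarded
    # Bytes read past the trailer in the final chunk land in unused_data.
    return consumed - len(d.unused_data)

# Demo: a gzip stream followed by junk, like the tail of the flash drive.
member = gzip.compress(b"some data " * 5000)
image = io.BytesIO(member + b"\0" * 10000)
length = gzip_stream_length(image, chunk_size=512)
print(length, len(member))
```

On the real drive you would open /dev/sdX in binary mode instead of the BytesIO object, and then pass the resulting count to whatever you use to copy that many bytes (head -c, or dd with a byte count, for example).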
I've got a "raw" USB flash drive containing a large chunk of gzipped
data. By "raw" I mean no partition table, no filesystem. Think of it
as a tape (if you're old enough).
gzip -tv is quite happy to validate the data and says it's OK, though
it says it ignored extra bytes after the end of the "file".
The flash drive size is 128GB, but the gzipped data is only maybe
20-30GB.
Question: is there a simple way to copy just the 'gzip' data from the
drive without copying the extra bytes after the end of the 'gzip'
data?
The only thing I can think of is:
$ zcat /dev/sdX | gzip -c > data.gz
But I was trying to figure out a way to do it without uncompressing
and recompressing the data. I had hoped that the gzip header would
contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the compressed data is to parse it using the proper algorithm (deflate, in
this case).
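A single pass that inflates the data only to locate the end, while writing the original compressed bytes out untouched, might look like this in Python. This is a sketch, not something tested against a real drive: copy_gzip_member is a hypothetical name, and it assumes the drive holds exactly one gzip member.

```python
import gzip
import io
import zlib

def copy_gzip_member(src, dst, chunk_size=1 << 16):
    """Copy the first gzip member from src to dst byte-for-byte; return its size."""
    d = zlib.decompressobj(16 + zlib.MAX_WBITS)  # 16 + MAX_WBITS = gzip framing
    copied = 0
    while not d.eof:
        chunk = src.read(chunk_size)
        if not chunk:
            raise EOFError("input ended before the gzip trailer")
        d.decompress(chunk)  # inflate only to track the stream position
        # unused_data stays empty until the trailer has been seen, then holds
        # whatever followed it in this chunk.
        keep = len(chunk) - len(d.unused_data)
        dst.write(chunk[:keep])
        copied += keep
    return copied

# Demo with in-memory data standing in for /dev/sdX and data.gz.
payload = b"tape-like data\n" * 4000
source = io.BytesIO(gzip.compress(payload) + b"leftover bytes" * 500)
output = io.BytesIO()
size = copy_gzip_member(source, output, chunk_size=1000)
print(size)
```

This still costs one decompression pass, but it skips the recompression and the copy is bit-identical to what was written to the drive.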
--
Grant
You could use gzip to tell you the compressed size of the file and then
use another method to copy just those bytes (dd for example):
gzip -clt </dev/sdX
That should print the compressed size in bytes, though only after
reading through the entire stream once.
That doesn't work. It shows the size of the drive as the
"compressed" size and 0 as uncompressed:
# gzip -clt </dev/sdd
  compressed  uncompressed  ratio  uncompressed_name
 31658606592             0   0.0%  stdout
The actual size of the compressed data is about 1/3 the value shown
above.
It's not reading through the stream. It's seeking to the end and
looking at what it thinks is the trailer info. I thought that maybe
using a pipe instead of a file would make it read through the data,
but that doesn't work either:
$ ls > foo
$ ls -l foo
-rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo
$ gzip foo
$ ls -l foo.gz
-rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz
$ gzip -clt <foo.gz
  compressed  uncompressed  ratio  uncompressed_name
        6083         12923  53.1%  stdout
$ echo asdf >> foo.gz
$ gzip -clt <foo.gz
  compressed  uncompressed  ratio  uncompressed_name
        6088     174482547 100.0%  stdout
$ cat foo.gz | gzip -clt
  compressed  uncompressed  ratio  uncompressed_name
          -1            -1   0.0%  stdout
Here's the relevant portion of the strace for the 'gzip -clt <foo.gz'
where it seeks to end-8 and reads what it thinks is the uncompressed
length and the CRC:
lseek(0, -8, SEEK_END) = 6080
read(0, "2\0\0asdf\n", 8) = 8
write(1, " 6088 17"..., 54) = 54
close(0) = 0
close(1) = 0
exit_group(0) = ?
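To see where the 174482547 comes from: per RFC 1952 the trailer is a CRC32 followed by ISIZE (the uncompressed size modulo 2^32), both little-endian 32-bit values. Decoding the eight bytes from the read() call in the strace above gives exactly that number. A quick Python check:

```python
import struct

# The 8 bytes gzip read from end-8, copied from the strace output.
tail = b"2\0\0asdf\n"

# Trailer layout: CRC32, then ISIZE, each a little-endian unsigned 32-bit int.
crc32, isize = struct.unpack("<II", tail)
print(isize)  # 174482547, the bogus "uncompressed" size gzip reported

# The leading "2\0\0" is what remains of the real ISIZE field: 12923 is
# 0x0000327b, stored as the bytes 7b 32 00 00, and 0x32 is ASCII "2".
real_isize_bytes = struct.pack("<I", 12923)
print(real_isize_bytes)
```

So gzip is interpreting "sdf\n" as the length, which confirms it trusts the last eight bytes of the file rather than parsing the stream.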