• [gentoo-user] How to copy gzip data from bytestream?

    From Grant Edwards@21:1/5 to All on Tue Feb 22 02:30:02 2022
    I've got a "raw" USB flash drive containing a large chunk of gzipped
data. By "raw" I mean no partition table, no filesystem. Think of it
    as a tape (if you're old enough).

    gzip -tv is quite happy to validate the data and says it's OK, though
    it says it ignored extra bytes after the end of the "file".

    The flash drive size is 128GB, but the gzipped data is only maybe
    20-30GB.

    Question: is there a simple way to copy just the 'gzip' data from the
    drive without copying the extra bytes after the end of the 'gzip'
    data?

    The only thing I can think of is:

    $ zcat /dev/sdX | gzip -c > data.gz

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).
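A sketch of that parse-to-the-end idea, using Python's zlib (the
function name and chunk size are just illustrative; decompressobj
reports any leftover input via unused_data):

```python
import zlib

def gzip_end_offset(path, chunk_size=1 << 20):
    """Return the offset just past the gzip member (header, deflate
    stream, and 8-byte trailer) by inflating it and discarding the
    output."""
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip header
    consumed = 0
    with open(path, "rb") as f:
        while not d.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("truncated gzip stream")
            d.decompress(chunk)   # output discarded; we only want the offset
            consumed += len(chunk)
    # unused_data holds whatever was read past the end of the member
    return consumed - len(d.unused_data)
```

With the offset in hand, `dd if=/dev/sdX of=data.gz bs=1M
iflag=count_bytes count=$OFFSET` (GNU dd) would copy exactly the gzip
data: still one full sequential read, but no recompression.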

    --
    Grant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Freeman@21:1/5 to grant.b.edwards@gmail.com on Tue Feb 22 02:50:01 2022
    On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).

I'm guessing that the reason it lacks such a field is precisely so
that you can use it in a stream in just this manner. In order to put
a length in the header, gzip would need to be able to seek back to the
start of the file to modify the header, which isn't always possible.

    I wouldn't be surprised if it stores some kind of metadata at the end
    of the file, but of course you can only find that if the end of the
    file is marked in some way. Tapes sometimes have ways to seek to the
    end of a recording - the drive can record a pattern that is detectable
    while seeking at high speed. Obviously USB drives lack such a
    mechanism unless provided by a filesystem or whatever application
    wrote the data.

    If you google the details of the gzip file format you might be able to
    figure out how to identify the end of the file, scan the image to find
    this marker, and then use dd to extract just the desired range.
    Unless the file is VERY large I suspect that is going to take you
longer than just recompressing it all. I can't imagine that there is
any way around sequentially reading the entire file to find the end,
unless you have some mechanism that can read a random block and
determine whether it is valid gzip data; in that case you could do a
binary search, assuming the data on the drive past the end of the
file isn't valid gzip.

    --
    Rich

  • From Grant Edwards@21:1/5 to Rich Freeman on Tue Feb 22 04:10:01 2022
    On 2022-02-22, Rich Freeman <rich0@gentoo.org> wrote:
    On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
    dd), but it does not. Apparently, the only way to find the end of the
    compressed data is to parse it using the proper algorithm (deflate, in
    this case).

I'm guessing that the reason it lacks such a field is precisely so
that you can use it in a stream in just this manner. In order to put
a length in the header, gzip would need to be able to seek back to
the start of the file to modify the header, which isn't always
possible.

    Indeed. It's clearly designed to be used on non-seekable media/devices
    like pipes and tapes. I should have realized that would be the case
    and would preclude a length field in the header.

    I wouldn't be surprised if it stores some kind of metadata at the end
    of the file, but of course you can only find that if the end of the
    file is marked in some way.

    The gzip file format has a length and CRC field in a trailer at the
    end (after the compressed data). But, the only way to locate the end
    is to parse the data using the appropriate decompression algorithm.
    The header allows for multiple algorithms, but only one (deflate) is
    actually defined.
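To illustrate (field layout per RFC 1952; the helper name is made up):
once the end of the member is known, reading that trailer is trivial:

```python
import struct

def read_gzip_trailer(blob, end):
    """Read the 8-byte gzip trailer that ends at offset `end`:
    CRC32 of the uncompressed data, then ISIZE (uncompressed
    length mod 2**32), both little-endian uint32."""
    crc32, isize = struct.unpack("<II", blob[end - 8:end])
    return crc32, isize
```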

    If you google the details of the gzip file format

    I did -- link is below.

    you might be able to figure out how to identify the end of the file,
    scan the image to find this marker,

    I'm pretty sure the only way to find the end of the file is to parse
    the compressed data payload itself. There isn't a marker.

and then use dd to extract just the desired range. Unless the file
is VERY large I suspect that is going to take you longer than just
recompressing it all.

    Definitely. It's purely an academic question at this point.

    I can't imagine that there is any way around sequentially reading
    the entire file to find the end,

    I believe you're right.

unless you have some mechanism that can read a random block and
determine whether it is valid gzip data; in that case you could do a
binary search, assuming the data on the drive past the end of the
file isn't valid gzip.

I don't think that determining whether something is valid deflate
data is easy (and it may be impossible in the general case). I
implemented the deflate algorithm from scratch a few years ago, and
vaguely recall that you can usually inflate almost anything. It turns
out that the flash drive I used was pretty new, and almost all 0x00
bytes. Once I knew where to look, it was pretty obvious where the
gzip data ended.
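That drive-specific shortcut could even be automated. A rough sketch
(the run-length threshold is arbitrary, and the helper name is made
up) that finds the first long run of 0x00 bytes, as an upper bound on
where the written data ends:

```python
def find_zero_run(path, run_len=4096, chunk_size=1 << 20):
    """Return the offset of the first run of `run_len` zero bytes,
    or None.  On a mostly-blank drive this is a decent upper bound
    for the end of the written data."""
    needle = b"\x00" * run_len
    offset = 0
    prev = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return None
            buf = prev + chunk
            i = buf.find(needle)
            if i != -1:
                return offset - len(prev) + i
            # keep a tail in case a run straddles a chunk boundary
            prev = buf[-(run_len - 1):]
            offset += len(chunk)
```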

    I've copied it the easy way (zcat | gzip -c), and verified that the
    copy matches byte-for-byte except for the MTIME field in the gzip
    header. It appears that gzipping stdin produces an empty MTIME
    field. No surprise there.

    gzip file format:

    https://datatracker.ietf.org/doc/html/rfc1952
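As an aside, the fixed part of the header is easy to inspect, which
shows the MTIME difference directly. A sketch per RFC 1952 (the helper
name is mine):

```python
import struct

def parse_gzip_header(blob):
    """Unpack the fixed 10-byte gzip header: 2-byte magic (1f 8b),
    compression method (8 = deflate, the only one defined), flag
    byte, 4-byte MTIME (0 = not available, e.g. input from stdin),
    extra-flags byte, and OS byte, all little-endian."""
    magic, method, flags, mtime, xfl, os_byte = struct.unpack(
        "<HBBIBB", blob[:10])
    if magic != 0x8B1F:
        raise ValueError("not a gzip stream")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_byte}
```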

  • From Felix Kuperjans@21:1/5 to grant.b.edwards@gmail.com on Tue Feb 22 13:00:02 2022
On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    I've got a "raw" USB flash drive containing a large chunk of gzipped
data. By "raw" I mean no partition table, no filesystem. Think of it
    as a tape (if you're old enough).

    gzip -tv is quite happy to validate the data and says it's OK, though
    it says it ignored extra bytes after the end of the "file".

    The flash drive size is 128GB, but the gzipped data is only maybe
    20-30GB.

    Question: is there a simple way to copy just the 'gzip' data from the
    drive without copying the extra bytes after the end of the 'gzip'
    data?

    The only thing I can think of is:

    $ zcat /dev/sdX | gzip -c > data.gz

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).

    --
    Grant

    Hi Grant,

    you could use gzip to tell you the compressed size of the file and then
    use another method to copy just those bytes (dd for example):

    gzip -clt </dev/sdX

That should print the compressed size in bytes, though it has to read
through the entire stream once.

    --
    Felix

  • From Grant Edwards@21:1/5 to Felix Kuperjans on Tue Feb 22 15:10:02 2022
    On 2022-02-22, Felix Kuperjans <felix@desaster-games.com> wrote:

    you could use gzip to tell you the compressed size of the file and then
    use another method to copy just those bytes (dd for example):

    gzip -clt </dev/sdX

That should print the compressed size in bytes, though it has to read
through the entire stream once.

    That doesn't work. It shows the size of the drive as the
    "uncompressed" size and 0 as compressed:

    # gzip -clt </dev/sdd
    compressed uncompressed ratio uncompressed_name
    31658606592 0 0.0% stdout

    The actual size of the compressed data is about 1/3 the value shown
    above.

    It's not reading through the stream. It's seeking to the end and
    looking at what it thinks is the trailer info. I thought that maybe
    using a pipe instead of a file would make it read through the data,
    but that doesn't work either:

    $ ls > foo
    $ ls -l foo
    -rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo

    $ gzip foo
    $ ls -l foo.gz
    -rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6083 12923 53.1% stdout

    $ echo asdf >> foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6088 174482547 100.0% stdout

    $ cat foo.gz | gzip -clt
    compressed uncompressed ratio uncompressed_name
    -1 -1 0.0% stdout



Here's the relevant portion of the strace for 'gzip -clt <foo.gz',
where it seeks to end-8 and reads what it thinks is the uncompressed
length and the CRC:

    lseek(0, -8, SEEK_END) = 6080
    read(0, "2\0\0asdf\n", 8) = 8
    write(1, " 6088 17"..., 54) = 54
    close(0) = 0
    close(1) = 0
    exit_group(0) = ?

  • From Felix Kuperjans@21:1/5 to Grant Edwards on Wed Feb 23 01:20:01 2022
    On 2022-02-22, Grant Edwards wrote:
    That doesn't work. It shows the size of the drive as the
    "uncompressed" size and 0 as compressed:

    # gzip -clt </dev/sdd
    compressed uncompressed ratio uncompressed_name
    31658606592 0 0.0% stdout

    The actual size of the compressed data is about 1/3 the value shown
    above.

    It's not reading through the stream. It's seeking to the end and
    looking at what it thinks is the trailer info. I thought that maybe
    using a pipe instead of a file would make it read through the data,
    but that doesn't work either:

    $ ls > foo
    $ ls -l foo
    -rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo

    $ gzip foo
    $ ls -l foo.gz
    -rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6083 12923 53.1% stdout

    $ echo asdf >> foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6088 174482547 100.0% stdout

    $ cat foo.gz | gzip -clt
    compressed uncompressed ratio uncompressed_name
    -1 -1 0.0% stdout



Here's the relevant portion of the strace for 'gzip -clt <foo.gz',
where it seeks to end-8 and reads what it thinks is the uncompressed
length and the CRC:

    lseek(0, -8, SEEK_END) = 6080
    read(0, "2\0\0asdf\n", 8) = 8
    write(1, " 6088 17"..., 54) = 54
    close(0) = 0
    close(1) = 0
    exit_group(0) = ?

    Hi Grant,

you're right, it doesn't work with the trailing garbage. I wasn't
aware that it actually seeks instead of reading through the stream.

By coincidence, it seems the next release will even change this behavior:

https://git.savannah.gnu.org/cgit/gzip.git/commit/?id=cf26200380585019e927fe3cf5c0ecb7c8b3ef14

But that still doesn't solve your problem, since it only adjusts the
calculation of the uncompressed size; the compressed size is still
derived from stat().
