• [gentoo-user] How to copy gzip data from bytestream?

    From Grant Edwards@21:1/5 to All on Tue Feb 22 02:30:02 2022
    I've got a "raw" USB flash drive containing a large chunk of gzipped
data. By "raw" I mean no partition table, no filesystem. Think of it
    as a tape (if you're old enough).

    gzip -tv is quite happy to validate the data and says it's OK, though
    it says it ignored extra bytes after the end of the "file".

    The flash drive size is 128GB, but the gzipped data is only maybe
    20-30GB.

    Question: is there a simple way to copy just the 'gzip' data from the
    drive without copying the extra bytes after the end of the 'gzip'
    data?

    The only thing I can think of is:

    $ zcat /dev/sdX | gzip -c > data.gz

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).
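A sketch of that parse-to-the-end idea, using Python's zlib (the
function name and chunk size are just illustrative; decompressobj
reports any leftover input via unused_data):

```python
import zlib

def gzip_end_offset(path, chunk_size=1 << 20):
    """Return the offset just past the gzip member (header, deflate
    stream, and 8-byte trailer) by inflating it and discarding the
    output."""
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip header
    consumed = 0
    with open(path, "rb") as f:
        while not d.eof:
            chunk = f.read(chunk_size)
            if not chunk:
                raise ValueError("truncated gzip stream")
            d.decompress(chunk)   # output discarded; we only want the offset
            consumed += len(chunk)
    # unused_data holds whatever was read past the end of the member
    return consumed - len(d.unused_data)
```

With the offset in hand, `dd if=/dev/sdX of=data.gz bs=1M
iflag=count_bytes count=$OFFSET` (GNU dd) would copy exactly the gzip
data: still one full sequential read, but no recompression.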

    --
    Grant

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Rich Freeman@21:1/5 to grant.b.edwards@gmail.com on Tue Feb 22 02:50:01 2022
    On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).

I'm guessing that the reason it lacks such a field is precisely so
that you can use it in a stream in just this manner. In order to put
a length in the header, gzip would need to be able to seek back to the
start of the file to modify the header, which isn't always possible.

    I wouldn't be surprised if it stores some kind of metadata at the end
    of the file, but of course you can only find that if the end of the
    file is marked in some way. Tapes sometimes have ways to seek to the
    end of a recording - the drive can record a pattern that is detectable
    while seeking at high speed. Obviously USB drives lack such a
    mechanism unless provided by a filesystem or whatever application
    wrote the data.

    If you google the details of the gzip file format you might be able to
    figure out how to identify the end of the file, scan the image to find
    this marker, and then use dd to extract just the desired range.
    Unless the file is VERY large I suspect that is going to take you
longer than just recompressing it all. I can't imagine that there is
any way around sequentially reading the entire file to find the end,
unless you have some mechanism that can read a random block and
determine whether it is valid gzip data; in that case you could do a
binary search, assuming the data on the drive past the end of the
file isn't valid gzip.

    --
    Rich

  • From Grant Edwards@21:1/5 to Rich Freeman on Tue Feb 22 04:10:01 2022
    On 2022-02-22, Rich Freeman <rich0@gentoo.org> wrote:
    On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
    dd), but it does not. Apparently, the only way to find the end of the
    compressed data is to parse it using the proper algorithm (deflate, in
    this case).

I'm guessing that the reason it lacks such a field is precisely so
that you can use it in a stream in just this manner. In order to put
a length in the header, gzip would need to be able to seek back to
the start of the file to modify the header, which isn't always
possible.

    Indeed. It's clearly designed to be used on non-seekable media/devices
    like pipes and tapes. I should have realized that would be the case
    and would preclude a length field in the header.

    I wouldn't be surprised if it stores some kind of metadata at the end
    of the file, but of course you can only find that if the end of the
    file is marked in some way.

    The gzip file format has a length and CRC field in a trailer at the
    end (after the compressed data). But, the only way to locate the end
    is to parse the data using the appropriate decompression algorithm.
    The header allows for multiple algorithms, but only one (deflate) is
    actually defined.
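To illustrate (field layout per RFC 1952; the helper name is made up):
once the end of the member is known, reading that trailer is trivial:

```python
import struct

def read_gzip_trailer(blob, end):
    """Read the 8-byte gzip trailer that ends at offset `end`:
    CRC32 of the uncompressed data, then ISIZE (uncompressed
    length mod 2**32), both little-endian uint32."""
    crc32, isize = struct.unpack("<II", blob[end - 8:end])
    return crc32, isize
```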

    If you google the details of the gzip file format

    I did -- link is below.

    you might be able to figure out how to identify the end of the file,
    scan the image to find this marker,

    I'm pretty sure the only way to find the end of the file is to parse
    the compressed data payload itself. There isn't a marker.

and then use dd to extract just the desired range. Unless the file
is VERY large I suspect that is going to take you longer than just
recompressing it all.

    Definitely. It's purely an academic question at this point.

    I can't imagine that there is any way around sequentially reading
    the entire file to find the end,

    I believe you're right.

unless you have some mechanism that can read a random block and
determine whether it is valid gzip data; in that case you could do a
binary search, assuming the data on the drive past the end of the
file isn't valid gzip.

I don't think that determining whether something is valid deflate
data is easy (and it may be impossible in the general case). I
implemented the deflate algorithm from scratch a few years ago, and
vaguely recall that you can usually inflate almost anything. It turns
out that the flash drive I used was pretty new, and almost all 0x00
bytes. Once I knew where to look, it was pretty obvious where the
gzip data ended.
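That drive-specific shortcut could even be automated. A rough sketch
(the run-length threshold is arbitrary, and the helper name is made
up) that finds the first long run of 0x00 bytes, as an upper bound on
where the written data ends:

```python
def find_zero_run(path, run_len=4096, chunk_size=1 << 20):
    """Return the offset of the first run of `run_len` zero bytes,
    or None.  On a mostly-blank drive this is a decent upper bound
    for the end of the written data."""
    needle = b"\x00" * run_len
    offset = 0
    prev = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return None
            buf = prev + chunk
            i = buf.find(needle)
            if i != -1:
                return offset - len(prev) + i
            # keep a tail in case a run straddles a chunk boundary
            prev = buf[-(run_len - 1):]
            offset += len(chunk)
```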

    I've copied it the easy way (zcat | gzip -c), and verified that the
    copy matches byte-for-byte except for the MTIME field in the gzip
    header. It appears that gzipping stdin produces an empty MTIME
    field. No surprise there.

    gzip file format:

    https://datatracker.ietf.org/doc/html/rfc1952
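As an aside, the fixed part of the header is easy to inspect, which
shows the MTIME difference directly. A sketch per RFC 1952 (the helper
name is mine):

```python
import struct

def parse_gzip_header(blob):
    """Unpack the fixed 10-byte gzip header: 2-byte magic (1f 8b),
    compression method (8 = deflate, the only one defined), flag
    byte, 4-byte MTIME (0 = not available, e.g. input from stdin),
    extra-flags byte, and OS byte, all little-endian."""
    magic, method, flags, mtime, xfl, os_byte = struct.unpack(
        "<HBBIBB", blob[:10])
    if magic != 0x8B1F:
        raise ValueError("not a gzip stream")
    return {"method": method, "flags": flags, "mtime": mtime,
            "xfl": xfl, "os": os_byte}
```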

  • From Felix Kuperjans@21:1/5 to grant.b.edwards@gmail.com on Tue Feb 22 13:00:02 2022
On Mon, Feb 21, 2022 at 8:29 PM Grant Edwards <grant.b.edwards@gmail.com> wrote:

    I've got a "raw" USB flash drive containing a large chunk of gzipped
data. By "raw" I mean no partition table, no filesystem. Think of it
    as a tape (if you're old enough).

    gzip -tv is quite happy to validate the data and says it's OK, though
    it says it ignored extra bytes after the end of the "file".

    The flash drive size is 128GB, but the gzipped data is only maybe
    20-30GB.

    Question: is there a simple way to copy just the 'gzip' data from the
    drive without copying the extra bytes after the end of the 'gzip'
    data?

    The only thing I can think of is:

    $ zcat /dev/sdX | gzip -c > data.gz

    But I was trying to figure out a way to do it without uncompressing
    and recompressing the data. I had hoped that the gzip header would
    contain a "length" field (so I would know how many bytes to copy using
dd), but it does not. Apparently, the only way to find the end of the
compressed data is to parse it using the proper algorithm (deflate, in
this case).

    --
    Grant

    Hi Grant,

    you could use gzip to tell you the compressed size of the file and then
    use another method to copy just those bytes (dd for example):

    gzip -clt </dev/sdX

That should print the compressed size in bytes, though it has to read
through the entire stream once.

    --
    Felix

  • From Grant Edwards@21:1/5 to Felix Kuperjans on Tue Feb 22 15:10:02 2022
    On 2022-02-22, Felix Kuperjans <felix@desaster-games.com> wrote:

    you could use gzip to tell you the compressed size of the file and then
    use another method to copy just those bytes (dd for example):

    gzip -clt </dev/sdX

That should print the compressed size in bytes, though it has to read
through the entire stream once.

    That doesn't work. It shows the size of the drive as the
    "uncompressed" size and 0 as compressed:

    # gzip -clt </dev/sdd
    compressed uncompressed ratio uncompressed_name
    31658606592 0 0.0% stdout

    The actual size of the compressed data is about 1/3 the value shown
    above.

    It's not reading through the stream. It's seeking to the end and
    looking at what it thinks is the trailer info. I thought that maybe
    using a pipe instead of a file would make it read through the data,
    but that doesn't work either:

    $ ls > foo
    $ ls -l foo
    -rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo

    $ gzip foo
    $ ls -l foo.gz
    -rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6083 12923 53.1% stdout

    $ echo asdf >> foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6088 174482547 100.0% stdout

    $ cat foo.gz | gzip -clt
    compressed uncompressed ratio uncompressed_name
    -1 -1 0.0% stdout



Here's the relevant portion of the strace for 'gzip -clt <foo.gz',
where it seeks to end-8 and reads what it thinks is the uncompressed
length and the CRC:

    lseek(0, -8, SEEK_END) = 6080
    read(0, "2\0\0asdf\n", 8) = 8
    write(1, " 6088 17"..., 54) = 54
    close(0) = 0
    close(1) = 0
    exit_group(0) = ?

  • From Felix Kuperjans@21:1/5 to Grant Edwards on Wed Feb 23 01:20:01 2022
    On 2022-02-22, Grant Edwards wrote:
    That doesn't work. It shows the size of the drive as the
    "uncompressed" size and 0 as compressed:

    # gzip -clt </dev/sdd
    compressed uncompressed ratio uncompressed_name
    31658606592 0 0.0% stdout

    The actual size of the compressed data is about 1/3 the value shown
    above.

    It's not reading through the stream. It's seeking to the end and
    looking at what it thinks is the trailer info. I thought that maybe
    using a pipe instead of a file would make it read through the data,
    but that doesn't work either:

    $ ls > foo
    $ ls -l foo
    -rw-r--r-- 1 grante users 12923 Feb 22 07:51 foo

    $ gzip foo
    $ ls -l foo.gz
    -rw-r--r-- 1 grante users 6083 Feb 22 07:51 foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6083 12923 53.1% stdout

    $ echo asdf >> foo.gz

    $ gzip -clt <foo.gz
    compressed uncompressed ratio uncompressed_name
    6088 174482547 100.0% stdout

    $ cat foo.gz | gzip -clt
    compressed uncompressed ratio uncompressed_name
    -1 -1 0.0% stdout



Here's the relevant portion of the strace for 'gzip -clt <foo.gz',
where it seeks to end-8 and reads what it thinks is the uncompressed
length and the CRC:

    lseek(0, -8, SEEK_END) = 6080
    read(0, "2\0\0asdf\n", 8) = 8
    write(1, " 6088 17"..., 54) = 54
    close(0) = 0
    close(1) = 0
    exit_group(0) = ?

    Hi Grant,

you're right, it doesn't work with the trailing garbage. I wasn't
aware that it actually seeks instead of reading through the stream.

By coincidence, it seems the next release will even change this behavior:

https://git.savannah.gnu.org/cgit/gzip.git/commit/?id=cf26200380585019e927fe3cf5c0ecb7c8b3ef14

But that still doesn't solve your problem, since it only adjusts the
calculation of the uncompressed size; the compressed size is still
derived from stat().
