• format 0.939000 for breaking the 9.3GB barrier

    From Adam Borowski@21:1/5 to All on Sat Sep 16 17:00:01 2023
    Hi!
    The deb-2.0 format suffers from being unable to carry data larger than
    9.3GB (10 gibiwhatever bytes -- this limitation uses a weird radix of 10,
    unusual in computing, might be related to twice the number of fingers some
    apes have on their appendage, or half of the total number of fingers that
    species of ape has in total).

    And we're closer and closer to getting there: the last time we spoke, max
    package size was 1.7GB, it's 5.5GB today. In fact, judging by
    Installed-Size alone, some other packages would already breach this limit,
    had they shipped the data instead of fetching it from the Interwebs.

    The current max is kicad-packages3d_7.0.7-1_all.deb with data.tar
    5839452160 bytes in size.
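
    The 10¹⁰ figure follows from the ar container that format 2.0 is built
    on: each ar member header records the member size as a 10-character
    decimal field. A rough sketch of that arithmetic, assuming the common
    ar header layout (the names AR_FIELDS and max_member_size are made up
    for this illustration):

```python
# Field widths (in bytes) of the common ar member header, in order.
AR_FIELDS = [
    ("name", 16), ("mtime", 12), ("uid", 6), ("gid", 6),
    ("mode", 8), ("size", 10), ("magic", 2),
]


def max_member_size(width: int) -> int:
    """Largest value that fits in a decimal field of `width` digits."""
    return 10 ** width - 1


header_len = sum(width for _, width in AR_FIELDS)
limit = max_member_size(dict(AR_FIELDS)["size"])
kicad = 5839452160  # data.tar size of kicad-packages3d quoted above

print(header_len)                      # total header length in bytes
print(limit, round(limit / 2**30, 2))  # limit in bytes and in GiB
print(round(100 * kicad / limit, 1))   # percent of the limit already used
```

    This prints 60, then 9999999999 and 9.31, then 58.4: the largest
    package today already uses well over half of the available range.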


    There were suggestions of other package formats, but to my knowledge none
    have been implemented or even researched. Which leaves the deb-old format
    (version 0.939000; I'll round it to "1" hereafter).
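
    For readers who have not met the old format: a minimal sketch of
    reading it, assuming the layout described in deb-old(5) (a version
    line, a line giving the length of the control tarball, then the
    control and data tarballs back to back). The function name
    split_old_deb and the dummy payloads are made up for illustration;
    this is not dpkg's code.

```python
import io

AR_MAGIC = b"!<arch>\n"  # how a format 2.0 (ar based) .deb starts


def split_old_deb(blob: bytes):
    """Return (version, control, data) for an old-format .deb image."""
    if blob.startswith(AR_MAGIC):
        raise ValueError("this is an ar archive (format 2.0), not deb-old")
    stream = io.BytesIO(blob)
    version = stream.readline().rstrip(b"\n").decode("ascii")
    control_len = int(stream.readline().decode("ascii"))
    control = stream.read(control_len)
    if len(control) != control_len:
        raise ValueError("truncated control member")
    # No length is recorded for the data member: it simply runs to the
    # end of the file, so there is no fixed-width size field to overflow.
    data = stream.read()
    return version, control, data


# Dummy payloads standing in for real gzipped tarballs.
control, data = b"CONTROL", b"DATA" * 3
blob = b"0.939000\n" + str(len(control)).encode() + b"\n" + control + data
print(split_old_deb(blob))
```

    Only the control length is stored, as a free-form decimal line, which
    is where the "no 10¹⁰ limit" property comes from.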

    As far as I know, format 2.0 was devised with some undescribed extensions
    in mind; none of those extensions has appeared in the 28 years since we
    made the switch -- any new stuff has gone into control.tar instead.

    Thus, I propose we revert to the old format.

    Benefits:
    * no 10¹⁰ data.tar limit
    * it unpacks 1% faster than 2.0
    Concerns:
    * no support for compressors other than cat/ncompress/gzip yet
    * external tools may not know it

    The speed-up benefit is a bit puzzling, but consistently shows up in my
    benchmarks, using any underlying compressor. It's even present, to a
    smaller degree, in zunpack (my reimplementation using libarchive, part of
    the stalled zdebootstrap) -- which should have no overhead for ar pieces.

    As for other compressors:
    * format 2.0 explicitly names .deb components, currently supporting .tar
    .tar.{xz,gz,bz2,zst}
    * format 0.939 uses /bin/gzip to do format sniffing, currently supporting
    .tar .tar.Z .tar.gz (there's also an #ifdefed internal implementation,
    which afaik does the same but doesn't support .Z)

    It would be easy to extend that sniffing to newer compressors. Existing
    tools that already do so transparently include libarchive or my zst, but
    it would be no rocket surgery to do that in dpkg itself.
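
    To illustrate how little such sniffing needs: a sketch that guesses
    the compressor from the leading magic bytes of a stream. The table and
    the function name sniff are illustrative only; this is neither dpkg's
    nor zst's actual code.

```python
import bz2
import gzip
import lzma

# (offset, magic bytes, name); checked in order.
SIGNATURES = [
    (0, b"\x1f\x8b", "gzip"),
    (0, b"\x1f\x9d", "compress"),              # .Z
    (0, b"\xfd\x37\x7a\x58\x5a\x00", "xz"),
    (0, b"BZh", "bzip2"),
    (0, b"\x28\xb5\x2f\xfd", "zstd"),
    (257, b"ustar", "tar"),                    # uncompressed ustar archive
]


def sniff(head: bytes) -> str:
    """Guess the compressor from the first bytes of a stream."""
    for offset, magic, name in SIGNATURES:
        if head[offset:offset + len(magic)] == magic:
            return name
    return "unknown"


print(sniff(gzip.compress(b"x")),
      sniff(lzma.compress(b"x")),
      sniff(bz2.compress(b"x")))
```

    Reading the first 262 bytes is enough for every entry in this table,
    including the uncompressed tar case.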

    (zst is my project to sanitize command-line tools like /bin/gzip,
    /usr/bin/xz or /usr/bin/zstd to have consistent behaviour and supported
    options. It's a bit stalled, lacking e.g. parallelization or --rsyncable,
    but available in a working state in Bookworm.)

    Packages using the old format but new compressors would obviously fail
    to install using Bookworm's dpkg, but that's not a problem:
    * the bulk of packages would remain on 2.0, at least for X releases
    * bulky packages are few, and they can Pre-Depend a version of dpkg
    that adds support for new compressors

    As for external tools, those that properly call dpkg will work out of the
    box; fortunately, this is most of them. The rest would need to grow
    such support; I haven't done that research yet.


    So, before any of us commits more effort, please say if this is the way
    to go.


    Meow!
    --
    ⢀⣴⠾⠻⢶⣦⠀
    ⣾⠁⢠⠒⠀⣿⡁ Vat kind uf sufficiently advanced technology iz dis!?
    ⢿⡄⠘⠷⠚⠋⠀                                 -- Genghis Ht'rok'din
    ⠈⠳⣄⠀⠀⠀⠀

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Guillem Jover@21:1/5 to Adam Borowski on Tue Sep 26 12:00:01 2023
    Hi!

    On Sat, 2023-09-16 at 16:07:37 +0200, Adam Borowski wrote:
    > And we're closer and closer to getting there: the last time we spoke, max
    > package size was 1.7GB, it's 5.5GB today. In fact, judging by
    > Installed-Size alone, some other packages would already breach this limit,
    > had they shipped the data instead of fetching it from the Interwebs.
    >
    > The current max is kicad-packages3d_7.0.7-1_all.deb with data.tar
    > 5839452160 bytes in size.

    While I think this should be solved, I don't think this is as pressing
    a matter as it seems, because this really only affects binary packages
    that contain a single file that when compressed exceeds that limit;
    otherwise the packages can be split normally.

    > There were suggestions of other package formats, but to my knowledge none
    > have been implemented or even researched. Which leaves the deb-old format
    > (version 0.939000; I'll round it to "1" hereafter).

    This is really a non-starter.

    > As far as I know, format 2.0 was devised with some undescribed extensions
    > in mind; none of those extensions has appeared in the 28 years since we
    > made the switch -- any new stuff has gone into control.tar instead.

    Yes, they have: deb signatures use that, and the tdeb specification
    (that never got very far) also uses that. There could be custom
    extensions around too.

    > Thus, I propose we revert to the old format.
    >
    > Benefits:
    > * no 10¹⁰ data.tar limit
    > * it unpacks 1% faster than 2.0

    Hmm, thanks, that's actually a bug in dpkg-deb, which I've now fixed
    locally, as the old format is supposed to only use gzip, and not the
    default xz.

    > Concerns:
    > * no support for compressors other than cat/ncompress/gzip yet
    > * external tools may not know it

    AFAIK, no external tool except dpkg-deb itself supports it, not
    even file(1). Thus its coverage is extremely poor.

    > As for external tools, those that properly call dpkg will work out of the
    > box; fortunately, this is most of them. The rest would need to grow
    > such support; I haven't done that research yet.
    >
    > So, before any of us commits more effort, please say if this is the way
    > to go.

    While ar has its set of limitations:

    - Might diverge format depending on the system (AIX small and big
    formats).
    - File size limitation.
    - Filename length limitation, which could be overcome with the BSD or
    GNU variants (not relevant for .debs though).

    the BSD and GNU variants have very wide support in many libraries and
    languages. It is also extensible and quite compact.

    While I've had this problem in mind and pondered over various ideas,
    I think the better option is to use sliced data parts within an ar
    container. Using tar-in-tar seems like a waste due to the 512-byte block
    padding, and using other custom formats means having to do special
    custom handling in other tools, and makes handling this with basic
    tools extremely cumbersome.
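
    A rough estimate of the container overhead being compared here,
    assuming the usual ar layout (8-byte global magic, 60-byte member
    headers, padding to even length) and the minimal tar layout (512-byte
    headers, 512-byte block padding, two trailing zero blocks). The member
    sizes are arbitrary and the helper names are made up for illustration.

```python
def pad(n: int, block: int) -> int:
    """Round n up to a multiple of block."""
    return -(-n // block) * block


def ar_size(sizes):
    # 8-byte global magic, 60-byte header per member, data padded to even.
    return 8 + sum(60 + pad(s, 2) for s in sizes)


def tar_size(sizes):
    # 512-byte header per member, data padded to 512 bytes, two zero
    # blocks at the end. Real tar writers may pad further to their
    # record size, so this is a lower bound.
    return sum(512 + pad(s, 512) for s in sizes) + 1024


sizes = [4, 1000, 5000]  # arbitrary member sizes in bytes
payload = sum(sizes)
print(ar_size(sizes) - payload, tar_size(sizes) - payload)
```

    For these three members the overhead is 188 bytes with ar against
    3212 bytes with tar. In absolute terms it stays small either way; the
    difference is most visible on small members.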

    Such a "new format" could simply reuse the ar extensibility and it
    would actually be rather simple, and only require slicing the
    data.tar.COMP into pieces that then need to be reassembled. This means
    that «dpkg-deb --fsys-tarfile» would work transparently, and that
    handling such .deb files by hand would be trivial with cat and dd.

    I think this could be a new format similar to split packages but in a
    single .deb, with something perhaps like:

    ,---
    $ ar t pkg-lfs_1.0_arch.deb
    debian-binslice
    control.tar.xz
    data-01.tar.xz
    data-02.tar.xz
    data-03.tar.xz
    $ ar p pkg-lfs_1.0_arch.deb debian-binslice
    1.0
    3
    `---

    (or perhaps just «debian-sliced».)
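
    A sketch of how the slicing and reassembly could look, using a tiny
    in-memory ar writer and reader. The member names follow the listing
    above; the helper names, the zeroed header fields and the small
    slice_size are assumptions made for illustration (a real slice would
    only need to stay under the ar size limit).

```python
AR_MAGIC = b"!<arch>\n"


def ar_member(name: str, data: bytes) -> bytes:
    """Serialize one ar member: 60-byte header, data, pad to even length."""
    header = "%-16s%-12d%-6d%-6d%-8s%-10d`\n" % (
        name, 0, 0, 0, "100644", len(data))
    assert len(header) == 60
    return header.encode("ascii") + data + (b"\n" if len(data) % 2 else b"")


def read_ar(blob: bytes):
    """Yield (name, data) for each member of an ar archive."""
    assert blob.startswith(AR_MAGIC)
    pos = len(AR_MAGIC)
    while pos < len(blob):
        header = blob[pos:pos + 60]
        name = header[:16].decode("ascii").rstrip()
        size = int(header[48:58].decode("ascii"))
        pos += 60
        yield name, blob[pos:pos + size]
        pos += size + (size % 2)  # skip the padding byte, if any


def build_sliced(control: bytes, data: bytes, slice_size: int) -> bytes:
    """Pack control plus data cut into slices of at most slice_size bytes."""
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    out = AR_MAGIC + ar_member("debian-binslice", b"1.0\n%d\n" % len(slices))
    out += ar_member("control.tar.xz", control)
    for n, piece in enumerate(slices, 1):
        out += ar_member("data-%02d.tar.xz" % n, piece)
    return out


def reassemble(blob: bytes) -> bytes:
    """Concatenate the data slices in archive order."""
    return b"".join(d for n, d in read_ar(blob) if n.startswith("data-"))


data = bytes(range(256)) * 10  # 2560 bytes standing in for data.tar.xz
blob = build_sliced(b"ctl", data, slice_size=1000)
print([name for name, _ in read_ar(blob)])
print(reassemble(blob) == data)
```

    Reassembly is plain concatenation of the slices, which is the same
    thing cat would do by hand.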

    One problem, I guess, is that it segregates the format, which might
    mean its support ends up being poor as well, more so if there are no
    such binary packages actually in the wild. The other, more
    intrusive option would be to make a 3.0 format that includes something
    like this by default, so that then there's a single thing to support
    for everything. Say:

    ,---
    $ ar t pkg-small_3.0_arch.deb
    debian-binary
    meta.tar.xz
    fsys-01.tar.xz
    $ ar p pkg-small_3.0_arch.deb debian-binary
    3.0
    1
    `---

    (I find the -01 there annoying, but that would make the format
    uniform regardless of the number of slices.)

    ,---
    $ ar t pkg-lfs_3.0_arch.deb
    debian-binary
    meta.tar.xz
    fsys-01.tar.xz
    fsys-02.tar.xz
    fsys-03.tar.xz
    $ ar p pkg-lfs_3.0_arch.deb debian-binary
    3.0
    3
    `---
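
    With a uniform layout like this, a reader could derive the expected
    member list from debian-binary alone. A small sketch of that check,
    assuming the two-line "version, slice count" content shown above; the
    function names, the fixed xz suffix and the "at least one slice" rule
    are assumptions for illustration, not part of any specification.

```python
def expected_members(debian_binary: bytes, comp: str = "xz"):
    """Member names a 3.0-style archive would carry, per the listings above."""
    version, count = debian_binary.decode("ascii").split()
    if version != "3.0":
        raise ValueError("not the proposed 3.0 layout: %r" % version)
    n = int(count)
    if n < 1:
        # Assumption: even a small package carries one fsys slice.
        raise ValueError("need at least one fsys slice")
    names = ["debian-binary", "meta.tar." + comp]
    names += ["fsys-%02d.tar.%s" % (i, comp) for i in range(1, n + 1)]
    return names


def check_listing(listing, debian_binary: bytes) -> bool:
    """True if an ar member listing matches what debian-binary announces."""
    return list(listing) == expected_members(debian_binary)


print(expected_members(b"3.0\n1\n"))
print(expected_members(b"3.0\n3\n"))
```

    The same check covers both the small and the large package, which is
    the uniformity argument in a nutshell.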

    But this seems too much disruption, for something I'd expect would not
    be used widely anyway.

    Thanks,
    Guillem
