• De-vendoring gnulib in Debian packages

    From Simon Josefsson@21:1/5 to All on Sat May 11 16:10:01 2024
    All, (going out to both debian-devel and bug-gnulib, please be
    respectful of each community's different perspectives and trim Cc
    when focus shifts to any Debian or gnulib specific topics)
    (please pardon the accidental duplicate post to bug-gnulib...)

    The content of upstream source code releases can largely be categorized
    into 1) the actual native source-code from the upstream supplier,
    2) pre-generated artifacts from build tools (e.g., ./configure script) and
    3) third-party maintained source code (e.g., config.guess or getopt.c).
    The files in 3) may be referred to as "vendoring". The habit of
    including vendored and pre-generated artifacts is a powerful and
    successful way to make release tarballs usable for users, going back to
    the 1980's. This habit poses some challenges for packaging, though:

    1) Pre-generated files (e.g., ./configure) should be re-generated to
    make sure the package is built from source code, and to allow patches
    on the toolchain used to generate the pre-generated files to have any
    effect. Otherwise we risk using pre-generated files created using
    non-free or non-public tools, which, if I understand correctly, is against
    Debian main policy. Verifying that this happens for all
    pre-generated files in an upstream tarball is complicated, fragile
    and tedious work. I think it is simple to find examples of mistakes
    in this area even for important top-popcon Debian packages. The
    current approach of running autoreconf -fi is based on a
    misunderstanding: autoreconf -fi is documented to not replace certain
    files with newer versions:

    2) If a security problem in vendored code is discovered, the security
    team may have to patch 50+ packages if the vendor origin is popular.
    Maybe even different versions of the same vendored code has to be

    3) Auditing the difference between the tarball and what is stored in
    upstream version control system (VCS) is challenging. The xz
    incident exploited the fact that some pre-generated files aren't
    included in upstream VCS. Some upstreams react to this by adding all
    pre-generated artifacts to VCS -- OpenSSH seems to take the route of
    adding the generated ./configure script to git, which moves that file
    from 3) to 1) but the problem remains.

    4) Auditing for license compliance is challenging, since not only do we
    have to audit all upstream's code but we also have to audit the
    license of pre-generated files and vendored source-code.

    There are probably more problems involved, and probably better ways to
    articulate the problems than what I managed to do above. The Go and
    Rust ecosystems solve some of these issues, which has other consequences
    for packaging. We have largely ignored that the same challenges apply
    to many C packages, and I'm focusing on those that use gnulib --
    https://www.gnu.org/software/gnulib/ -- gzip, tar, grep, m4, sed, bison,
    awk, coreutils, grub, libiconv, libtasn1, libidn2, inetutils, etc:
    https://git.savannah.gnu.org/gitweb/?p=gnulib.git;a=blob;f=users.txt

    Solving all of the problems for all packages will require some work and
    will take time. I've started to see if we can make progress on the
    gnulib-related packages. I'm speaking as a contributor to gnulib and
    maintainer of a couple of Debian packages, but still learning to
    navigate -- the purpose of this post is to describe what I've done for
    libntlm and ask for feedback to hopefully make this into a re-usable
    pattern that can be applied to more packages. It would be great to
    improve collaboration on these topics between GNU and Debian.

    So let's turn this post into a recipe for Debian maintainers of packages
    that use gnulib to follow for their packages. I'm assuming git from now
    on, but feel free to mentally s/git/$VCS/.

    The first step is to establish an upstream tarball that you want to work
    with. There are too many opinions floating around on this to make any
    single solution a pre-requisite so here are the different approaches I
    can identify, ordered by my own preference, and the considerations with

    1) Use upstream's PGP signed git-archive tarball.

    See my recent blog posts for this new approach. The key property
    here is that there is no need to audit difference between upstream
    tarball and upstream git.


    2) Use upstream's PGP signed tarball.

    This is the current most common and recommended approach, as far as I

    3) Create a PGP signed git-archive tarball.

    If upstream doesn't publish PGP signed tarballs, or if there is a
    preference from upstream or from you as Debian package maintainer to
    not do 1) or 2), then create a minimal source-only copy of the git
    archive and sign it yourself. Could be done something like this:

    git clone https://git.savannah.gnu.org/git/inetutils.git
    cd inetutils/
    git archive --prefix=inetutils-v2.5/ -o inetutils-2.5-src.tar.gz v2.5
    # additional filtering of tarball may go here
    gpg -b inetutils-2.5-src.tar.gz

    This is your new upstream tarball. To build this particular one, use
    ./bootstrap --no-git --gnulib-srcdir=/usr/share/gnulib.

    4) Use upstream's git-archive tarball and PGP sign it.

    Download it using the GitHub or GitLab download link on the git tag
    like the cool kids. If you did this on a sunny day, the downloaded
    tarball should be identical to the git-archive tarball and you can
    sign it if you are comfortable with this.

    5) Use upstream's git-archive tarball.

    For those who want to join the really cool kids club.

    6) Use upstream's tarball without PGP signature.

    This is quite common today. It happens when upstream doesn't publish
    PGP signatures or the Debian maintainer doesn't care about them.

    Regardless of mechanism, you should end up with a tarball that we call
    the "upstream tarball". Which approach is chosen is subjective and up
    to the Debian package maintainer; people have different opinions.
    While I can't hide my own preferences I think we have to acknowledge
    that there is no single uniform answer here.

    To reach our goals in the beginning of this post, this upstream tarball
    has to be filtered to remove all pre-generated artifacts and vendored
    code. Use some mechanism, like the debian/copyright Files-Excluded
    mechanism to remove them. If you used a git-archive upstream tarball,
    chances are higher that you won't have to do a lot of work especially
    for pre-generated scripts.
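
    As a rough illustration of the filtering step (not part of the original
    post), the sketch below repacks a tarball while dropping files by
    basename. The helper name filter_tarball and the example file names are
    hypothetical, and GNU tar and gzip are assumed. In a real package the
    Files-Excluded field together with uscan would normally do this job;
    the sketch only shows what the result amounts to.

```shell
#!/bin/sh
# Sketch only: repack a tarball without selected pre-generated or vendored
# files. "filter_tarball" is a hypothetical helper, not an existing tool.
#
# Usage: filter_tarball SRC.tar.gz DEST.tar.gz NAME...
# Every file or directory whose basename matches a NAME is dropped.
filter_tarball() {
    src=$1
    dest=$2
    shift 2
    tmp=$(mktemp -d) || return 1
    tar -xzf "$src" -C "$tmp" || { rm -rf "$tmp"; return 1; }
    for name in "$@"; do
        # -prune keeps find from descending into a directory it is deleting
        find "$tmp" -name "$name" -prune -exec rm -rf {} +
    done
    # Repack the top-level entries (top-level dotfiles are not handled here);
    # gzip -n leaves file name and timestamp out of the gzip header.
    (cd "$tmp" && tar -cf - -- *) | gzip -n > "$dest"
    rm -rf "$tmp"
}
```

    A call such as "filter_tarball pkg-1.0.tar.gz pkg_1.0.orig.tar.gz
    configure getopt.c" would keep configure.ac but drop configure and any
    getopt.c, since names are matched against the whole basename.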

    This filtered tarball will be the *.orig.tar.gz used to build the Debian
    package.

    Ideally you would like for the *.orig.tar.gz tarball to be as close as
    possible to upstream's git repository for the tag release, minus any
    pre-generated scripts or vendored gnulib files that upstream put into
    git. For collaborative upstreams, you could try to convince them to not
    put pre-generated scripts and vendored gnulib files into git.

    Auditing the upstream tarball against the *.orig.tar.gz should be simple:
    use sha256sum or diffoscope to compare content. In some ideal world this
    could be bit-by-bit identical. I'm hoping this can be the new best
    recommended approach going forward. This is only possible when upstream
    agrees with these concerns, and makes an effort to publish such minimized
    source-only tarballs. This may be a pipe dream, just like Debian's
    current best recommended approach for upstream PGP signed tarballs is
    sometimes ignored.
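
    One possible shape for such an audit, as a sketch rather than an
    established tool: compare checksums first, and only when they differ
    unpack both tarballs and list the differing files. The function name
    audit_tarballs is hypothetical; sha256sum, GNU tar (for
    --strip-components) and diff are assumed to be available. diffoscope
    would give a far more detailed report than the plain diff used here.

```shell
#!/bin/sh
# Sketch only: "audit_tarballs" is a hypothetical helper.
#
# Usage: audit_tarballs A.tar.gz B.tar.gz
# Prints "identical" when the two files match byte for byte; otherwise
# prints a summary of differences between their unpacked contents.
# Returns 0 only when the tarballs are identical.
audit_tarballs() {
    a=$1
    b=$2
    sum_a=$(sha256sum < "$a") || return 2
    sum_b=$(sha256sum < "$b") || return 2
    if [ "$sum_a" = "$sum_b" ]; then
        echo "identical"
        return 0
    fi
    tmp=$(mktemp -d) || return 2
    mkdir "$tmp/a" "$tmp/b"
    # --strip-components=1 drops the top-level directory, so differently
    # named prefixes do not show up as differences.
    tar -xzf "$a" -C "$tmp/a" --strip-components=1 &&
    tar -xzf "$b" -C "$tmp/b" --strip-components=1 &&
    (cd "$tmp" && LC_ALL=C diff -rq a b)
    rm -rf "$tmp"
    return 1
}
```

    In the report, "a" stands for the first tarball and "b" for the second,
    so a line like "Only in a: configure" means the file was filtered out of
    the second one.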

    You will now be faced with the challenge of building this tarball. Your
    existing debian/rules makefile will not work any more since it assumed
    the existence of the pre-generated scripts and vendored gnulib files.
    So you have to add the required tools as Build-Depends: and update the
    debian/rules to build everything from source code.

    For libntlm the essential diff between version 1.7-1, that used upstream tarball with pre-generated content and gnulib code, and latest version
    1.8-3 that builds from a minimal source-only tarball is small:

    --- a/debian/control
    +++ b/debian/control
    @@ -6,6 +6,8 @@ Uploaders:
    Simon Josefsson <simon@josefsson.org>,
    debhelper-compat (= 13),
    + git,
    + gnulib (>= 20240412~dfb7117+stable202401.20240408~aa0aa87-3~),
    Standards-Version: 4.6.2
    Section: libs
    Homepage: https://www.nongnu.org/libntlm/
    --- a/debian/rules
    +++ b/debian/rules
    @@ -1,6 +1,16 @@
    #! /usr/bin/make -f

    +include /usr/share/gnulib/debian/gnulib-dpkg.mk
    export DEB_BUILD_MAINT_OPTIONS = hardening=+all

    - dh $@ --builddirectory=build -X.la
    + dh $@ --without autoreconf --builddirectory=build
    + ./bootstrap --gnulib-srcdir=$(GNULIB_DEB_DEBIAN_GNULIB) --pull
    + ./bootstrap --gnulib-srcdir=$(GNULIB_DEB_DEBIAN_GNULIB) --gen
    +execute_before_dh_auto_configure: dh_gnulib_clone pull dh_gnulib_patch gen

    As you can see the essential part is to add a Build-Depends on the
    gnulib Debian package to get the necessary gnulib code for building. We
    also disable dh_aut
  • From Bruno Haible@21:1/5 to Simon Josefsson on Sat May 11 18:00:02 2024
    Simon Josefsson wrote:
    Finally, while this is somewhat gnulib specific, I think the practice
    goes beyond gnulib

    Yes, gnulib-tool for modules written in C is similar to

    * 'npm install' for JavaScript source code packages [1],
    * 'cargo fetch' for Rust source code packages [2],

    except that gnulib-tool is simpler: it fetches from a single source location only.

    How does Debian handle these kinds of source-code dependencies?


    [1] https://nodejs.org/en/learn/getting-started/an-introduction-to-the-npm-package-manager
    [2] https://doc.rust-lang.org/cargo/commands/cargo-fetch.html

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Simon Josefsson@21:1/5 to Bruno Haible on Sat May 11 18:40:01 2024
    Bruno Haible <bruno@clisp.org> writes:

    Simon Josefsson wrote:
    Finally, while this is somewhat gnulib specific, I think the practice
    goes beyond gnulib

    Yes, gnulib-tool for modules written in C is similar to

    * 'npm install' for JavaScript source code packages [1],
    * 'cargo fetch' for Rust source code packages [2],

    except that gnulib-tool is simpler: it fetches from a single source location only.

    How does Debian handle these kinds of source-code dependencies?

    I don't know the details but I believe those commands are turned into
    local requests for source code, either vendored or previously packaged
    in Debian. No network access during builds. Same for Go packages,
    which I have some experience with, although for Go packages they lose
    the strict versioning so if Go package X declares a dependency on package
    Y version Z then on Debian it may build against version Z+1 or Z+2 which
    may in theory break and was not upstream's intended or supported
    configuration. We have a circular dependency situation for some core Go
    libraries in Debian right now due to this.

    I think fundamentally the shift that causes challenges for distributions
    may be the move from package dependencies that are version >= X to
    package dependencies that are version = X. If there is a desire to
    support that, some new work flow patterns are needed. Some
    package maintainers reject this approach and refuse to co-operate with
    those upstreams, but I'm not sure if this is a long-term winning
    strategy: it often just leads to useful projects not being available
    through distributions, and users suffer as a result.
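
    To make the ">= X" versus "= X" distinction concrete, here is a small
    sketch with two hypothetical helpers. It assumes GNU "sort -V" as the
    ordering, which is only an approximation and not Debian's own version
    comparison algorithm.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical illustrations.

# version_ge HAVE WANT: true if HAVE >= WANT under "sort -V" ordering.
# This models a distribution-style range dependency.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# version_eq HAVE WANT: true only on an exact match.
# This models a pinned dependency as used by Go, Cargo or npm lock files.
version_eq() {
    [ "$1" = "$2" ]
}
```

    With an archive that ships version 1.10.0 of a library, a package
    wanting ">= 1.2.0" is satisfied (version_ge 1.10.0 1.2.0 succeeds),
    while a package pinned to "= 1.2.0" is not (version_eq 1.10.0 1.2.0
    fails), which is where the extra work for distributions comes from.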



  • From Paul Eggert@21:1/5 to Simon Josefsson via Gnulib discussi on Sat May 11 18:50:01 2024
    On 2024-05-11 07:09, Simon Josefsson via Gnulib discussion list wrote:
    I would assume that (some stripped down
    version of) git is a requirement to do any useful work on any platform
    these days, so maybe it isn't a problem

    Yes, my impression also is that Git has migrated into the realm of
    cc/gcc in that everybody has it, so it can depend indirectly on a
    possibly earlier version of itself.

    Although it is worrisome that our collective trusted computing base
    keeps growing, let's face it, if there's a security bug in Git we're all
    in big trouble anyway.

  • From Theodore Ts'o@21:1/5 to Simon Josefsson on Sun May 12 14:10:01 2024
    On Sat, May 11, 2024 at 04:09:23PM +0200, Simon Josefsson wrote:
    The current approach of running autoreconf -fi is based on a
    misunderstanding: autoreconf -fi is documented to not replace certain
    files with newer versions:

    And the root cause of *this* is because historically, people put their
    own custom autoconf macros in aclocal.m4, so if autoreconf -fi
    overwrote aclocal.m4, things could break. This also means that
    programmatically always doing "rm -f aclocal.m4 ; aclocal --install"
    will break some packages.

    The best solution to this is to try to promote people to put those
    autoconf macros that they are manually maintaining that can't be
    supplied in acinclude.m4, which is now included by default by autoconf
    in addition to aclocal.m4. Personally, I think the two names are
    confusing and if it weren't for historical reasons, perhaps should
    have been swapped, but oh, well....

    (For example, I have some custom local autoconf macros needed to
    support MacOS in e2fsprogs's acinclude.m4.)
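
    A sketch of a more careful variant of that "rm -f aclocal.m4" step
    (an illustration, not anything e2fsprogs or debhelper does): remove the
    file only when its first line carries the "generated automatically by
    aclocal" banner. Both helper names are hypothetical, and the banner
    check is a heuristic; a generated file that was hand-edited afterwards
    would still pass it.

```shell
#!/bin/sh
# Sketch only: heuristic guard before regenerating aclocal.m4.

# is_generated_aclocal FILE: true if FILE starts with the banner that
# aclocal puts on the first line of the files it generates.
is_generated_aclocal() {
    head -n 1 "$1" | grep -q '^# generated automatically by aclocal'
}

# remove_generated_aclocal [FILE]: delete FILE (default aclocal.m4) only
# when it looks generated; otherwise keep it and return non-zero.
remove_generated_aclocal() {
    f=${1:-aclocal.m4}
    if is_generated_aclocal "$f"; then
        rm -f "$f"
    else
        echo "keeping $f: no aclocal banner, may hold hand-written macros" >&2
        return 1
    fi
}
```

    A build step could then run "remove_generated_aclocal && aclocal
    --install" and fall back to leaving the file alone when the check fails.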

    1) Use upstream's PGP signed git-archive tarball.

    Here's how I do it in e2fsprogs which (a) makes the git-archive
    tarball be bit-for-bit reproducible given a particular git commit ID,
    and (b) minimizes the size of the tarball when stored using


    To reach our goals in the beginning of this post, this upstream tarball
    has to be filtered to remove all pre-generated artifacts and vendored
    code. Use some mechanism, like the debian/copyright Files-Excluded
    mechanism to remove them. If you used a git-archive upstream tarball, chances are higher that you won't have to do a lot of work especially
    for pre-generated scripts.

    Why does it *have* to be filtered? For the purposes of building, if
    you really want to nuke all of the pre-generated files, you can just
    move them out of the way at the beginning of the debian/rules run, and
    then move them back as part of "debian/rules clean". Then you can use
    autoreconf -fi to your heart's content in debian/rules (modulo
    possibly breaking things if you insist on nuking aclocal.m4 and
    regenerating it without taking proper care, as discussed above).

    This also allows the *.orig.tar.gz to be the same as the upstream
    signed PGP tarball, which you've said is the ideal, no?
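
    The "move them out of the way, then move them back" idea could look
    roughly like the sketch below. The function names and the stash
    directory are hypothetical, and both functions are assumed to be run
    from the top of the unpacked source tree (for example from debian/rules
    targets before configure and during clean).

```shell
#!/bin/sh
# Sketch only: stash pre-generated files before the build and restore them
# on clean, so the unpacked tree matches the upstream tarball again.
STASH_DIR=debian/.pregenerated-stash   # hypothetical location

# stash_pregenerated FILE...: move each existing FILE into the stash,
# keeping its relative path. Missing files are skipped.
stash_pregenerated() {
    mkdir -p "$STASH_DIR" || return 1
    for f in "$@"; do
        [ -e "$f" ] || continue
        mkdir -p "$STASH_DIR/$(dirname "$f")" &&
        mv "$f" "$STASH_DIR/$f" || return 1
    done
}

# restore_pregenerated: move every stashed file back, overwriting whatever
# the build regenerated, then remove the stash.
restore_pregenerated() {
    [ -d "$STASH_DIR" ] || return 0
    (cd "$STASH_DIR" && find . ! -type d) | while IFS= read -r f; do
        mkdir -p "$(dirname "$f")"
        mv -f "$STASH_DIR/$f" "$f"
    done
    rm -rf "$STASH_DIR"
}
```

    For instance, "stash_pregenerated configure build-aux/config.guess"
    before running autoreconf, and "restore_pregenerated" from the clean
    target.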

    There is one design of gnulib that is important to understand: gnulib is
    a source-only library and is not versioned and has no release tarballs.
    Its release artifact is the git repository containing all the commits. Packages like coreutils, gzip, tar etc pin to one particular commit of gnulib.

    Note that how we treat gnulib is a bit different from how we treat
    other C shared libraries, where we claim that *all* libraries must be
    dynamically linked, and that including source code by reference is
    against Debian Policy, precisely because of the toil needed to update
    all of the binary packages should some security vulnerability get
    discovered in the library which is either linked statically or
    included by code duplication.

    And yet, we seem to have given a pass for gnulib, probably because it
    would be too awkward to enforce that rule *everywhere*, so apparently
    we've turned a blind eye.

    I personally think the "everything must be dynamically linked" to be
    not really workable in real life, and should be an aspirational goal
    --- and the fact that we treat gnulib differently is a great proof
    point about how the current debian policy is not really doable in real
    life if it were enforced strictly, everywhere, with no exceptions....

    Certainly for languages like Rust, it *can't* be enforced, so again,
    that's another place where that rule is not enforced consistently; if
    it were, we wouldn't be able to ship Rust programs.

    - Ted

  • From Simon Josefsson@21:1/5 to Theodore Ts'o on Sun May 12 16:30:02 2024
    "Theodore Ts'o" <tytso@mit.edu> writes:

    1) Use upstream's PGP signed git-archive tarball.

    Here's how I do it in e2fsprogs which (a) makes the git-archive
    tarball be bit-for-bit reproducible given a particular git commit ID,
    and (b) minimizes the size of the tarball when stored using


    Wow, written five years ago and basically the same thing that I suggest (although you store pre-generated ./configure scripts in git).

    Going into detail, you use 'gzip -9n' but I use git-archive defaults
    which is the same as -n aka --no-name. I agree adding -9 aka --best is
    an improvement. Gnulib's maint.mk also adds --rsyncable, would you agree
    that this is also an improvement? Thus what I'm arriving at is this:

    git archive --prefix=inetutils-$(git describe)/ HEAD |
    gzip --no-name --best --rsyncable > inetutils-$(git describe)-src.tar.gz
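
    The reason -n matters for reproducibility can be shown with a small
    sketch: with --no-name, gzip leaves the input's name and timestamp out
    of its header, so compressing the same bytes twice on the same system
    gives the same output. The helper name repro_gzip is hypothetical, and
    --rsyncable is left out here because not every gzip build supports it.
    Output may still differ between gzip versions.

```shell
#!/bin/sh
# Sketch only: "repro_gzip" is a hypothetical wrapper.
#
# Compress stdin to stdout with -n (--no-name) and -9 (--best), so that
# neither a file name nor a timestamp ends up in the gzip header.
repro_gzip() {
    gzip -n -9
}

# Possible use with git archive (prefix and tag are placeholders):
#   git archive --format=tar --prefix=pkg-1.0/ v1.0 | repro_gzip > pkg-1.0-src.tar.gz
```

    Running the pipeline twice on the same commit and comparing the results
    with sha256sum is a quick way to check that the tarball is stable.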

    To reach our goals in the beginning of this post, this upstream tarball
    has to be filtered to remove all pre-generated artifacts and vendored
    code. Use some mechanism, like the debian/copyright Files-Excluded
    mechanism to remove them. If you used a git-archive upstream tarball,
    chances are higher that you won't have to do a lot of work especially
    for pre-generated scripts.

    Why does it *have* to be filtered? For the purposes of building, if
    you really want to nuke all of the pre-generated files, you can just
    move them out of the way at the beginning of the debian/rules run, and
    then move them back as part of "debian/rules clean". Then you can use
    autoreconf -fi to your heart's content in debian/rules (modulo
    possibly breaking things if you insist on nuking aclocal.m4 and
    regenerating it without taking proper care, as discussed above).

    This also allows the *.orig.tar.gz to be the same as the upstream
    signed PGP tarball, which you've said is the ideal, no?

    Right, there is no requirement for orig.tar.gz to be filtered. But then
    the outcome depends on upstream, and I don't think we can convince all
    upstreams about these concerns. Most upstreams prefer to ship
    pre-generated and vendored files in their tarballs, and will continue to
    do so. Let's assume upstream doesn't ship minimized tarballs that are
    free from vendored or pre-generated files. That's the case for most
    upstream tarballs in Debian today (including e2fsprogs, openssh,
    coreutils). Without filtering that tarball we won't fulfil the goals I
    mentioned in the beginning of my post. The downsides with not filtering
    include (somewhat repeating myself):

    - Opens up for bugs causing pre-generated files not being re-generated
    even when they are used to build the package. I think this is fairly
    common in Debian packages. Making sure all pre-generated files are
    re-generated during build -- or confirming that the file is not used
    at all -- is tedious and fragile work. Work that has to be done for
    every release. Are you certain that ./configure is re-generated? If
    it is not present you would notice.

    - Auditing the pre-generated and vendored files for malicious content
    takes more time than not having to audit those files. Especially if
    those files are not stored in upstream git.

    - Pre-generated and vendored files trigger licensing concerns and
    require tedious work that doesn't improve the binary package
    deliverable. Consider files like texinfo.tex for example, wouldn't it
    be better to strip that out of tarballs and not have to add it to
    debian/copyright? If some code in a package, let's say getopt.c, is
    not used during build of the package, the license of that file doesn't
    have to be mentioned in debian/copyright if I understand correctly:
    If, a few releases later, that file starts to get used during
    compilation, the package maintainer will likely not notice. If it was
    filtered, the maintainer would notice.
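
    For the unfiltered case, the question in the first point above (is
    ./configure really re-generated?) can at least be spot-checked. The
    sketch below is one assumed approach, not an existing debhelper
    feature: touch a marker file before the build, then list the
    pre-generated files that are not newer than it. The function name
    not_regenerated is hypothetical.

```shell
#!/bin/sh
# Sketch only: "not_regenerated" is a hypothetical helper.
#
# Usage: not_regenerated MARKER FILE...
# Prints each FILE that is missing or not newer than MARKER, and returns
# 1 if there was at least one such file.
not_regenerated() {
    marker=$1
    shift
    status=0
    for f in "$@"; do
        # -prune keeps find from descending if FILE is a directory
        if [ ! -e "$f" ] || [ -z "$(find "$f" -prune -newer "$marker")" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}
```

    A build could run "touch debian/build-start" first and later
    "not_regenerated debian/build-start configure Makefile.in"; any file it
    prints was used as shipped. A file rewritten with identical content
    still counts as regenerated, since only modification times are checked.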

    The best is when upstream ships a tarball consistent with what I dream
    *.orig.tar.gz should be: free of vendored and pre-generated files.
    Debian package maintainers can take action before this happens, to reach
    nice properties within Debian. Maybe some upstreams will adapt.

    There is one design of gnulib that is important to understand: gnulib is
    a source-only library and is not versioned and has no release tarballs.
    Its release artifact is the git repository containing all the commits.
    Packages like coreutils, gzip, tar etc pin to one particular commit of

    Note that how we treat gnulib is a bit different from how we treat
    other C shared libraries, where we claim that *all* libraries must be
    dynamically linked, and that including source code by reference is
    against Debian Policy, precisely because of the toil needed to update
    all of the binary packages should some security vulnerability get
    discovered in the library which is either linked statically or
    included by code duplication.

    And yet, we seem to have given a pass for gnulib, probably because it
    would be too awkward to enforce that rule *everywhere*, so apparently
    we've turned a blind eye.

    I personally think the "everything must be dynamically linked" to be
    not really workable in real life, and should be an aspirational goal
    --- and the fact that we treat gnulib differently is a great proof
    point about how the current debian policy is not really doable in real
    life if it were enforced strictly, everywhere, with no exceptions....

    Certainly for languages like Rust, it *can't* be enforced, so again,
    that's another place where that rule is not enforced consistently; if
    it were, we wouldn't be able to ship Rust programs.

    Agreed. I think the policy is mostly a good one, but when there are
    special situations like gnulib, Rust, Go etc we need some tools to
    handle them. Debian won't turn gnulib into a shared library. Debian
    won't turn Go into a shared library ecosystem (or maybe Go will actually
    go in that direction, but it is a slow process...). I don't know Rust
    well but I suppose it is similar.



  • From Russ Allbery@21:1/5 to Theodore Ts'o on Sun May 12 17:50:01 2024
    "Theodore Ts'o" <tytso@mit.edu> writes:

    The best solution to this is to try to promote people to put those
    autoconf macros that they are manually maintaining that can't be
    supplied in acinclude.m4, which is now included by default by autoconf
    in addition to aclocal.m4.

    Or use a subdirectory named something like m4, so that you can put each
    conceptually separate macro in a separate file and not mush everything
    together, and use:


    (and set ACLOCAL_AMFLAGS = -I m4 in Makefile.am if you're also using

    Note that how we treat gnulib is a bit different from how we treat
    other C shared libraries, where we claim that *all* libraries must be
    dynamically linked, and that including source code by reference is
    against Debian Policy, precisely because of the toil needed to update
    all of the binary packages should some security vulnerability get
    discovered in the library which is either linked statically or
    included by code duplication.

    And yet, we seem to have given a pass for gnulib, probably because it
    would be too awkward to enforce that rule *everywhere*, so apparently
    we've turned a blind eye.

    No, there's an explicit exception for cases like gnulib. Policy 4.13:

    Some software packages include in their distribution convenience
    copies of code from other software packages, generally so that users
    compiling from source don’t have to download multiple packages. Debian
    packages should not make use of these convenience copies unless the
    included package is explicitly intended to be used in this way.

    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Ansgar 🙀@21:1/5 to Russ Allbery on Sun May 12 18:30:01 2024

    On Sun, 2024-05-12 at 08:41 -0700, Russ Allbery wrote:
    "Theodore Ts'o" <tytso@mit.edu> writes:
    And yet, we seem to have given a pass for gnulib, probably because it
    would be too awkward to enforce that rule *everywhere*, so apparently
    we've turned a blind eye.

    No, there's an explicit exception for cases like gnulib.  Policy 4.13:

        Some software packages include in their distribution convenience
        copies of code from other software packages, generally so that users
        compiling from source don’t have to download multiple packages. Debian
        packages should not make use of these convenience copies unless the
        included package is explicitly intended to be used in this way.

    In ecosystems like NPM, Cargo, Golang, Python and so on pinning to
    specific versions is also "explicitly intended to be used"; they just
    sometimes don't include convenience copies directly as they have
    tooling to download these (which is not allowed in Debian).

    (Arguably Debian should use those more often as keeping all software at
    the same dependency version is a futile effort IMHO...)

    Gnulib is just older and targeted at the C ecosystem which still has
    worse tooling than pretty much everything else.


  • From Russ Allbery@21:1/5 to ansgar@43-1.org on Sun May 12 19:50:01 2024
    Ansgar 🙀 <ansgar@43-1.org> writes:

    In ecosystems like NPM, Cargo, Golang, Python and so on pinning to
    specific versions is also "explicitly intended to be used"; they just sometimes don't include convenience copies directly as they have tooling
    to download these (which is not allowed in Debian).

    Yeah, this is a somewhat different case that isn't well-documented in
    Policy at the moment.

    (Arguably Debian should use those more often as keeping all software at
    the same dependency version is a futile effort IMHO...)

    There's a straight tradeoff with security effort: more security work is
    required for every additional copy of a library that exists in Debian
    stable. (And, of course, some languages have better support for having
    multiple simultaneously-installed versions of the same library than
    others. Python's support for this is not great; the ecosystem expectation
    is that one uses separate virtualenvs, which don't really solve the
    Debian build dependency problem.)

    Russ Allbery (rra@debian.org) <https://www.eyrie.org/~eagle/>

  • From Theodore Ts'o@21:1/5 to Simon Josefsson on Sun May 12 21:00:01 2024
    On Sun, May 12, 2024 at 04:27:06PM +0200, Simon Josefsson wrote:
    Going into detail, you use 'gzip -9n' but I use git-archive defaults
    which is the same as -n aka --no-name. I agree adding -9 aka --best is
    an improvement. Gnulib's maint.mk also add --rsyncable, would you agree
    that this is also an improvement?

    I'm not convinced --rsyncable is an improvement. It makes the
    compressed object slightly larger, and in exchange, if the compressed
    object changes slightly, it's possible that when you rsync the changed
    file, it might be more efficient. But in the case of PGP signed
    release tarballs, the file is constant; it's never going to change,
    and even if there are slight changes between say, e2fsprogs v1.47.0
    and e2fsprogs v1.47.1, in practice, this is not something --rsyncable
    can take advantage of, unless you manually copy
    e2fsprogs-v1.47.0.tar.gz to e2fsprogs-v1.47.1.tar.gz, and then rsync
    e2fsprogs-v1.47.1.tar.gz.... and I don't think anyone is doing this,
    either automatically or manually.

    That being said, --rsyncable is mostly harmless, so I don't have
    strong feelings about changing it to add or remove in someone's
    release workflow.

    Right, there is no requirement for orig.tar.gz to be filtered. But then
    the outcome depends on upstream, and I don't think we can convince all upstreams about these concerns. Most upstream prefer to ship
    pre-generated and vendored files in their tarballs, and will continue to
    do so.

    Well, your blog entry does recognize some of the strong reasons why
    upstreams will probably want to continue shipping them. First of all,
    not all compilation targets are guaranteed to have autoconf, automake,
    et al. installed. E2fsprogs is portable to Windows, MacOS, AIX,
    Solaris, HPUX, NetBSD, FreeBSD, and GNU/Hurd, in addition to Linux.
    If the package subscribes to the 'all the world's Linux, and nothing
    else exists/we have no interest in supporting anything else', I'd ask
    the question, why are they using autoconf in the first place? :-)

    Secondly, I have gotten burned with older versions of either autoconf
    or the aclocal macros changing in incompatible ways between versions.
    So my practice is to check into git the configure script as generated
    by autoconf on Debian testing, which is my development system; and if
    it fails on anything else, or when a new version of autoconf or
    automake, etc. causes my configure script to break, I can curse, and
    fix it myself instead of inflicting the breakage on people who are
    downloading and trying to compile e2fsprogs.

    Let's assume upstream doesn't ship minimized tarballs that are
    free from vendored or pre-generated files. That's the case for most
    upstream tarballs in Debian today (including e2fsprogs, openssh,
    coreutils). Without filtering that tarball we won't fulfil the goals I mentioned in the beginning of my post. The downsides with not filtering include (somewhat repeating myself):


    Your arguments are made in a very general way --- there are potential
    problems for _all_ autogenerated or vendored files. However, I think
    it's possible to simplify things by explicitly restricting the problem
    domain to those files auto-generated by autoconf, automake, libtool,
    etc. For example, the argument that this opens things up for bugs
    could be fixed by having common code in a debhelper script that
    re-generates all of the autoconf and related files. This addresses your
    "tedious" and "fragile" argument.

    And if you are always regenerating those files, you don't need to
    audit the code, since they are going to be regenerated, no? And the generated
    files from autoconf and friends have well understood licensing

    And by the way, all of your concerns about vendored files, and all of
    my arguments for why it's no big deal apply to gnulib source files,
    too, no? Why are you so insistent on saying that upstream must never,
    ever ship vendored files --- but I don't believe you are making this
    argument for gnulib?

    Yes, it's simpler if we have procrustean rules of the form "everything
    MUST be shared libraries", and "never, EVER have generated or vendored
    files". However, I think we're much better off if we have targeted
    solutions which fix the 80 to 90% of the cases. We agree that gnulib
    isn't going to be a shared library; but the argument in favor of it
    means that there are exceptions, and I think we need to have similar
    accommodations for files like configure, config.{guess,sub}.

    Upstream *is* going to be shipping those files, and I don't think it's
    worth it to deviate from upstream tarballs just to filter out those
    files, even if it makes some things simpler from your perspective. So
    I do hear your arguments; it's just on balance, my opinion is that it's
    not worth it.


    - Ted

  • From Simon Josefsson@21:1/5 to Theodore Ts'o on Mon May 20 10:30:01 2024
    "Theodore Ts'o" <tytso@mit.edu> writes:

    On Sun, May 12, 2024 at 04:27:06PM +0200, Simon Josefsson wrote:
    Going into detail, you use 'gzip -9n' but I use git-archive defaults
    which is the same as -n aka --no-name. I agree adding -9 aka --best is
    an improvement. Gnulib's maint.mk also add --rsyncable, would you agree
    that this is also an improvement?

    I'm not convinced --rsyncable is an improvement. It makes the
    compressed object slightly larger, and in exchange, if the compressed
    object changes slightly, it's possible that when you rsync the changed
    file, it might be more efficient. But in the case of PGP signed
    release tarballs, the file is constant; it's never going to change,
    and even if there are slight changes between say, e2fsprogs v1.47.0
    and e2fsprogs v1.47.1, in practice, this is not something --rsyncable
    can take advantage of, unless you manually copy
    e2fsprogs-v1.47.0.tar.gz to e2fsprogs-v1.47.1.tar.bz, and then
    rsync e2fsprogs-v1.471.tar.g.... and I don't think anyone is doing
    this, either automatically or manually.

    That being said, --rsyncable is mostly harmless, so I don't have
    strong feelings about changing it to add or remove in someone's
    release workflow.

    Your example had me convinced, and I thought some more about why we
    really should keep using it as it consumes a small percentage more CPU
    and disk space. I have realized that another common operation is
    storing and transferring _several_ different releases of e2fsprogs. I
    would suspect that most releases of software are fairly similar to the
    previous release when uncompressed. With gzip --rsyncable, the tarballs
    should then be mostly similar. Without --rsyncable, they will largely
    be different if I understand correctly. This affects dedup-able storage
    and transfer methods, and some anecdotal evidence suggests this
    improvement is significant - going from 215GB to 176GB vs 13GB:


    Maybe someone could do some experiment to see if there is substance to
    this argument; it's not clear to me that the example is comparable.
    Storing/transferring several releases of the same software could add
    up to significant savings for a larger set of archives.

    As the downside seems fairly small, and the potential upside may be
    significant, I will use and recommend --rsyncable for git-archive
    release tarballs:

    git archive --format=tar --prefix=$PACKAGE-$VERSION/ HEAD | \
    env GZIP= gzip --no-name --best --rsyncable \
    > $PACKAGE-$VERSION-src.tar.gz
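
    A minimal sketch of how one might check that these gzip options give
    byte-identical output for identical input, even when the input file's
    timestamp changes. The payload file and the compress helper are made
    up for this example, and --rsyncable is only passed if the local gzip
    accepts it.

```shell
#!/bin/sh
# Sketch: compress the same content twice, changing the file's mtime in
# between, and compare the results. All file names here are hypothetical.
set -eu

workdir=$(mktemp -d)
trap 'rm -rf "$workdir"' EXIT

printf 'hello release\n' > "$workdir/payload"

# Not every gzip implementation accepts --rsyncable, so probe for it first.
if gzip --rsyncable -c < /dev/null > /dev/null 2>&1; then
    rsyncable=--rsyncable
else
    rsyncable=
fi

compress() {
    # --no-name keeps the original file name and timestamp out of the
    # gzip header; GZIP= clears any options inherited from the environment.
    env GZIP= gzip --no-name --best $rsyncable -c "$1"
}

compress "$workdir/payload" > "$workdir/a.gz"
touch -t 200001010000 "$workdir/payload"   # change the mtime only
compress "$workdir/payload" > "$workdir/b.gz"

if cmp -s "$workdir/a.gz" "$workdir/b.gz"; then
    result=identical
else
    result=different
fi
echo "$result"
```

    This only checks the gzip step. Whether the tar stream from
    git archive is itself stable across machines and git versions is a
    separate question that this sketch does not address.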



