• Re: Upstream dist tarball transparency (was Re: Validating tarballs against git repositories)

    From Guillem Jover@21:1/5 to James Addison on Sat Apr 6 09:51:05 2024
    Hi!

    On Wed, 2024-04-03 at 23:53:56 +0100, James Addison wrote:
    > On Wed, 3 Apr 2024 19:36:33 +0200, Guillem wrote:
    > > On Fri, 2024-03-29 at 23:29:01 -0700, Russ Allbery wrote:
    > > > On 2024-03-29 22:41, Guillem Jover wrote:
    > > > I think with my upstream hat on I'd rather ship a clear manifest
    > > > (checked into Git) that tells distributions which files in the
    > > > distribution tarball are build artifacts, and guarantee that if
    > > > you delete all of those files, the remaining tree should be
    > > > byte-for-byte identical with the corresponding signed Git tag.
    > > > (In other words, Guillem's suggestion.) Then I can continue to
    > > > ship only one release artifact.
    > >
    > > I've been pondering this, and I think I might have come up with a
    > > protocol that to me (!) seems safe, even against a malicious
    > > upstream. It does not require two tarballs, which, as you say,
    > > seems cumbersome and makes it harder to explain to users. But I'd
    > > like to run this through the list in case I've missed something
    > > obvious.

    > Does this cater for situations where part of the preparation of a
    > source tarball involves populating a directory with a list of
    > filenames that correspond to hostnames known to the source preparer?
    >
    > If that set of hostnames changes, then even with the same source
    > VCS checkout, the resulting distribution source tarball could
    > differ.
    >
    > Yes, it's a hypothetical example; but given time and attacker
    > patience, someone will be motivated to attempt any workaround. In
    > practice the difference could be a directory of hostnames, or it
    > could be a bitflag in a macro that is only evaluated under various
    > nested conditions.

    I'm not sure whether I've perhaps misunderstood your scenario, but
    if the distributed tarball contains things not present in the VCS,
    then with this proposal those can be easily removed, which means it
    does not matter much if they differ between generations of the same
    tarball (it matters in the sense that it's an alarm sign, but it
    does not matter in the sense that you can still get to the same
    state as with a clean VCS checkout).

    The other part, then, is whether the remaining contents differ from
    what is in the VCS.

    If either of these triggers a difference, that would require manual
    review. That of course does not exempt one from reviewing the VCS;
    it just potentially removes one avenue for smuggling artifacts.
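
    For concreteness, the check being described could look something
    like the minimal Python sketch below. It takes an extracted tarball
    tree, a clean VCS checkout, and a manifest listing the generated
    files; the manifest format (one path per line) is an assumption for
    illustration, not part of the proposal:

        # Sketch: verify that a dist tarball, minus the files declared
        # in the manifest, is identical to a clean VCS checkout.
        import filecmp
        import pathlib
        import sys

        def verify(tarball_tree, vcs_tree, manifest):
            tarball = pathlib.Path(tarball_tree)
            vcs = pathlib.Path(vcs_tree)
            generated = set(pathlib.Path(manifest).read_text().split())
            ok = True
            for path in sorted(tarball.rglob("*")):
                if not path.is_file():
                    continue
                rel = path.relative_to(tarball)
                if str(rel) in generated:
                    continue  # declared build artifact: removable
                counterpart = vcs / rel
                if not counterpart.is_file() or not filecmp.cmp(
                        path, counterpart, shallow=False):
                    print("needs review:", rel, file=sys.stderr)
                    ok = False
            # Also flag VCS files missing from the tarball.
            for path in sorted(vcs.rglob("*")):
                if path.is_file():
                    rel = path.relative_to(vcs)
                    if not (tarball / rel).exists():
                        print("missing from tarball:", rel, file=sys.stderr)
                        ok = False
            return ok

        if __name__ == "__main__":
            sys.exit(0 if verify(*sys.argv[1:4]) else 1)

    Anything the script reports is exactly the "manual review" case
    above; a clean run means the non-artifact subset matches the VCS.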

    > To take a leaf from the Reproducible Builds[1] project: to achieve
    > a one-to-one mapping between a set of inputs and an output, you
    > need to record all of the inputs: not only the source code, but
    > also the build environment.
    >
    > I'm not yet convinced that source-as-was-written to
    > distributed-source-tarball is a problem any different from that of
    > distributed-source-tarball to built-package. Changes to tooling
    > do, in reality, affect the output of build processes -- and that's
    > usually good, because it allows for performance optimizations. But
    > it also necessitates recording the toolchain and environment to
    > produce repeatable results.

    In this case, the property you'd gain is that you do not need to
    trust the system of the person preparing the distribution tarball:
    you can regenerate those outputs from (supposedly) good inputs in
    the distribution tarball, using _your_ (or the distribution's)
    toolchain.

    The distinction I see from the Reproducible Builds effort is that
    in this case we can simply discard some of the inputs and outputs
    and go from original sources.

    (Not sure whether that clarifies things, or whether I've now talked
    past you. :)

    Thanks,
    Guillem

  • From James Addison@21:1/5 to Guillem on Sat Apr 6 09:51:27 2024
    Hi Guillem,

    On Wed, 3 Apr 2024 19:36:33 +0200, Guillem wrote:
    > On Fri, 2024-03-29 at 23:29:01 -0700, Russ Allbery wrote:
    > > On 2024-03-29 22:41, Guillem Jover wrote:
    > > I think with my upstream hat on I'd rather ship a clear manifest
    > > (checked into Git) that tells distributions which files in the
    > > distribution tarball are build artifacts, and guarantee that if
    > > you delete all of those files, the remaining tree should be
    > > byte-for-byte identical with the corresponding signed Git tag.
    > > (In other words, Guillem's suggestion.) Then I can continue to
    > > ship only one release artifact.
    >
    > I've been pondering this, and I think I might have come up with a
    > protocol that to me (!) seems safe, even against a malicious
    > upstream. It does not require two tarballs, which, as you say,
    > seems cumbersome and makes it harder to explain to users. But I'd
    > like to run this through the list in case I've missed something
    > obvious.

    Does this cater for situations where part of the preparation of a
    source tarball involves populating a directory with a list of
    filenames that correspond to hostnames known to the source preparer?

    If that set of hostnames changes, then even with the same source
    VCS checkout, the resulting distribution source tarball could
    differ.

    Yes, it's a hypothetical example; but given time and attacker
    patience, someone will be motivated to attempt any workaround. In
    practice the difference could be a directory of hostnames, or it
    could be a bitflag in a macro that is only evaluated under various
    nested conditions.

    To take a leaf from the Reproducible Builds[1] project: to achieve
    a one-to-one mapping between a set of inputs and an output, you
    need to record all of the inputs: not only the source code, but
    also the build environment.

    I'm not yet convinced that source-as-was-written to
    distributed-source-tarball is a problem any different from that of
    distributed-source-tarball to built-package. Changes to tooling do,
    in reality, affect the output of build processes -- and that's
    usually good, because it allows for performance optimizations. But
    it also necessitates recording the toolchain and environment to
    produce repeatable results.
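
    As a sketch of what recording the environment alongside a dist
    tarball might look like, in the spirit of Debian's .buildinfo
    files: capture the version of each tool involved in generating the
    tarball. The tool list and the output filename here are assumptions
    for illustration:

        # Sketch: snapshot the toolchain used to generate a dist
        # tarball into a JSON record shipped (or published) with it.
        import json
        import platform
        import shutil
        import subprocess

        # Illustrative only; a real record would enumerate every input.
        TOOLS = ["autoconf", "automake", "libtool", "make", "tar", "gzip"]

        def snapshot_environment():
            env = {"system": platform.platform(), "tools": {}}
            for tool in TOOLS:
                if shutil.which(tool) is None:
                    continue  # absent tools are simply not recorded here
                first_line = subprocess.run(
                    [tool, "--version"],
                    capture_output=True, text=True, check=True,
                ).stdout.splitlines()[0]
                env["tools"][tool] = first_line
            return env

        if __name__ == "__main__":
            with open("dist-environment.json", "w") as fh:
                json.dump(snapshot_environment(), fh, indent=2,
                          sort_keys=True)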

    Regards,
    James

    [1] - https://reproducible-builds.org/

  • From James Addison@21:1/5 to Guillem on Sat Apr 6 13:30:01 2024
    Thanks for the response!

    On Fri, 5 Apr 2024 11:12:33 +0200, Guillem wrote:
    > On Wed, 2024-04-03 at 23:53:56 +0100, James Addison wrote:
    > > On Wed, 3 Apr 2024 19:36:33 +0200, Guillem wrote:
    > > > On Fri, 2024-03-29 at 23:29:01 -0700, Russ Allbery wrote:
    > > > > On 2024-03-29 22:41, Guillem Jover wrote:
    > > > > I think with my upstream hat on I'd rather ship a clear
    > > > > manifest (checked into Git) that tells distributions which
    > > > > files in the distribution tarball are build artifacts, and
    > > > > guarantee that if you delete all of those files, the
    > > > > remaining tree should be byte-for-byte identical with the
    > > > > corresponding signed Git tag. (In other words, Guillem's
    > > > > suggestion.) Then I can continue to ship only one release
    > > > > artifact.
    > > >
    > > > I've been pondering this, and I think I might have come up
    > > > with a protocol that to me (!) seems safe, even against a
    > > > malicious upstream. It does not require two tarballs, which,
    > > > as you say, seems cumbersome and makes it harder to explain
    > > > to users. But I'd like to run this through the list in case
    > > > I've missed something obvious.

    Ok, after a bit more time to process the details, this makes more
    sense to me now. It's a fairly strong assertion about the precise
    VCS origin and commit that a _subset_ of the files in a dist
    tarball originate from.

    And the strength of the claim (I think) varies based on how
    feasible it would be for an attacker to take control of the origin
    and write a substitute commit with the same VCS commit ID and file
    list - so it's based on fairly well-understood principles about
    cryptographic hash strength.

    (This seems similar in some ways to the existing .dsc file format,
    although in relation to a 'source package source' and not the
    sources of binary packages.)

    In any case: I'm reasonably convinced that the provenance
    (claim-of-origin) that this would provide for a source tarball is
    fairly strong. That's not my only concern, though (in particular,
    goto: regeneration).
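
    As a rough sketch of how such a claim could be checked
    mechanically - assuming the provenance data names an origin URL, a
    signed tag, and a commit ID, which are field choices of my own
    here, not part of the proposal:

        # Sketch: confirm that the signed tag at the claimed origin
        # resolves to the commit ID recorded for the dist tarball.
        import subprocess
        import tempfile

        def verify_claim(origin_url, tag, claimed_commit):
            with tempfile.TemporaryDirectory() as workdir:
                # Fetch only the claimed tag from the claimed origin.
                subprocess.run(["git", "clone", "--depth=1",
                                "--branch", tag, origin_url, workdir],
                               check=True)
                # Check the tag signature (keyring setup not shown).
                subprocess.run(["git", "-C", workdir, "verify-tag", tag],
                               check=True)
                # Peel the tag to a commit and compare with the claim.
                actual = subprocess.run(
                    ["git", "-C", workdir, "rev-parse",
                     tag + "^{commit}"],
                    capture_output=True, text=True, check=True,
                ).stdout.strip()
            return actual == claimed_commit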

    > > Does this cater for situations where part of the preparation of
    > > a source tarball involves populating a directory with a list of
    > > filenames that correspond to hostnames known to the source
    > > preparer?
    > >
    > > If that set of hostnames changes, then even with the same source
    > > VCS checkout, the resulting distribution source tarball could
    > > differ.
    > >
    > > Yes, it's a hypothetical example; but given time and attacker
    > > patience, someone will be motivated to attempt any workaround.
    > > In practice the difference could be a directory of hostnames, or
    > > it could be a bitflag in a macro that is only evaluated under
    > > various nested conditions.

    > I'm not sure whether I've perhaps misunderstood your scenario, but
    > if the distributed tarball contains things not present in the VCS,
    > then with this proposal those can be easily removed, which means
    > it does not matter much if they differ between generations of the
    > same tarball (it matters in the sense that it's an alarm sign, but
    > it does not matter in the sense that you can still get to the same
    > state as with a clean VCS checkout).

    Yep, you managed to translate my baffling scenario description into a clearer problem statement :)

    > The other part, then, is whether the remaining contents differ
    > from what is in the VCS.
    >
    > If either of these triggers a difference, that would require
    > manual review. That of course does not exempt one from reviewing
    > the VCS; it just potentially removes one avenue for smuggling
    > artifacts.

    Why not reject an upload automatically if a difference is detected between the source package source and the dist tarball?

    > > To take a leaf from the Reproducible Builds[1] project: to
    > > achieve a one-to-one mapping between a set of inputs and an
    > > output, you need to record all of the inputs: not only the
    > > source code, but also the build environment.
    > >
    > > I'm not yet convinced that source-as-was-written to
    > > distributed-source-tarball is a problem any different from that
    > > of distributed-source-tarball to built-package. Changes to
    > > tooling do, in reality, affect the output of build processes --
    > > and that's usually good, because it allows for performance
    > > optimizations. But it also necessitates recording the toolchain
    > > and environment to produce repeatable results.
    >
    > In this case, the property you'd gain is that you do not need to
    > trust the system of the person preparing the distribution tarball:
    > you can regenerate those outputs from (supposedly) good inputs in
    > the distribution tarball, using _your_ (or the distribution's)
    > toolchain.

    regeneration:

    Here is the problem. Let's say that in the future, with this
    transparency code in place, a security bug is discovered in
    versions of autotools that were available in the testing or
    unstable distributions (and may have been used by some Debian
    maintainers/developers). It could be useful to determine a number
    of things (a sketch for the first appears below):

    * What dist tarballs were built using the affected versions?
    * Do those dist tarballs differ when rebuilt with the fixed
      autotools?

    It's similar to the discovery of a security-related problem in a
    compiler or other toolchain component: we want to identify and
    rebuild the affected artifacts.
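
    If environment records like the snapshot sketched earlier existed
    for each dist tarball, answering the first question becomes a query
    over those records. A minimal illustration - the record layout and
    the affected version string are assumptions:

        # Sketch: list dist tarballs whose recorded environment names
        # an affected autoconf version.
        import json
        import pathlib

        # Hypothetical version string(s) of the affected builds.
        AFFECTED = {"autoconf (GNU Autoconf) 2.71"}

        def affected_tarballs(record_dir):
            # One JSON environment record per dist tarball.
            for record in sorted(pathlib.Path(record_dir).glob("*.json")):
                env = json.loads(record.read_text())
                if env.get("tools", {}).get("autoconf") in AFFECTED:
                    yield record.stem

        if __name__ == "__main__":
            for name in affected_tarballs("records"):
                print(name)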

    I think your proposal is good at providing provenance for the
    subset of files that originate from the VCS; what I'm less sure it
    provides is a way to confirm regeneration of the remainder of the
    dist files.

    Another way to think about it: let's say you _do_ attempt to
    regenerate a dist tarball because you're curious -- and the results
    differ somehow. How would you find the cause of the difference?
    Without the source-build-deps, I'd argue that's nearly intractable
    in some cases (although probably possible, to an unreliable degree
    of trust, by simply re-attempting the build with other versions of
    the build tools). With the source-build-deps provided, it should
    either become trivial (e.g.: oh, I used the wrong version, I'll
    rebuild again), or it becomes a question of bug-finding (ok, I used
    the same version, but the output differed: that means something
    here is non-reproducible, and that can be considered a bug).
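
    Narrowing down where a regeneration attempt diverged is itself
    fairly mechanical, at least to the level of individual files; a
    sketch that compares the member contents of two tarballs while
    ignoring metadata such as timestamps and ordering:

        # Sketch: report per-file content differences between the
        # distributed tarball and a rebuilt one.
        import hashlib
        import sys
        import tarfile

        def member_digests(path):
            digests = {}
            with tarfile.open(path) as tar:
                for member in tar:
                    if member.isfile():
                        data = tar.extractfile(member).read()
                        digests[member.name] = hashlib.sha256(data).hexdigest()
            return digests

        if __name__ == "__main__":
            original = member_digests(sys.argv[1])
            rebuilt = member_digests(sys.argv[2])
            for name in sorted(original.keys() | rebuilt.keys()):
                if original.get(name) != rebuilt.get(name):
                    print("differs:", name)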

    > The distinction I see from the Reproducible Builds effort is that
    > in this case we can simply discard some of the inputs and outputs
    > and go from original sources.

    Ok, understood. I counter-claim that that is _almost_ enough, but
    that it's helpful, when unexpected problems occur in the future, to
    be able to go back and recreate the preconditions accurately (and,
    when that accuracy isn't enough, to determine why _that_ is and
    resolve it separately).

    Regards,
    James
