[in reply to a gentoo-project@ post, but it was asked to continue this
on gentoo-dev@]
On 28/06/2023 16.46, Sam James wrote:
Florian Schmaus <flow@gentoo.org> writes:
On 17/06/2023 10.37, Arthur Zamarin wrote:
I also want to nominate people who I feel contribute a lot to Gentoo and
I have a lot of interaction with (ordered by name, not priority):
[…]
flow
I apologize for the late reply, and thank you for the nomination. I am
honored and accept.
That's fine and it's great to see more people running!
As many of you know, I am spending a lot of time on the EGO_SUM
situation, as it is one of the most critical issues to solve.
I have used the last few days to carefully consider whether a seat on
the council is more harmful or beneficial to my efforts regarding
EGO_SUM. On the one hand, council work means I have less time to
improve the EGO_SUM situation. On the other hand, a seat in the
council increases the probability of positively influencing Gentoo's
future, also regarding EGO_SUM.
Excellent that we share this view. :)
But with regard to EGO_SUM: you didn't appear at the meeting where we discussed
your previous EGO_SUM proposal,
Naive as I am, I expected that the mailing list would be used for
discussion and that the council meeting would be used chiefly for
voting and intra-council discussion. And since the request to the
council to vote on a concrete proposal was preceded by a
multiple-week, if not month-long, mailing list discussion, I assumed
that my presence in the council meeting was optional.
Had I known that my presence was required, or that the absence in the
meeting would be blamed on me afterward, I would have appeared if
possible.
and questions remain unanswered on the
ML (why not implement a check in pkgcheck similar to what is in Portage,
for example)?
On 2023-05-30 [1], I proposed a limit in the range of 2 to 1.5 MiB for
the total package-directory size. I only care a little about the tool
that checks this limit, but pkgcheck is an obvious choice. I also
suggested that we review this policy once the number of Go packages
has doubled or two years after this policy was established (whatever
comes first).
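To make the proposed limit concrete, here is a minimal sketch of such a size check. It is not pkgcheck code; the function names and the choice of the 2 MiB upper bound as the default are assumptions made purely for illustration.

```python
import os

# Assumption: use the upper end of the proposed 1.5-2 MiB range as default.
LIMIT_BYTES = 2 * 1024 * 1024


def package_dir_size(path):
    """Sum the sizes of all regular files below a package directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so a link is never counted as its target.
            if not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def exceeds_limit(path, limit=LIMIT_BYTES):
    """Return True if the package directory is larger than the limit."""
    return package_dir_size(path) > limit
```

A tool would call exceeds_limit() once per category/package directory (ebuilds, Manifest, metadata.xml and files/ together) and report the offenders.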
But I fear you may be referring to another kind of check. You may be
talking about a check that forbids EGO_SUM in ::gentoo but allows it in
overlays.
Intelligibly, EGO_SUM can be considered ugly. Compared to a
traditional Gentoo package, EGO_SUM-based ones are larger. The same is
true for Rust packages. However, looking at the bigger picture,
EGO_SUM's advantages outweigh its disadvantages.
My position on this has been consistent: a check is needed to statically determine when the environment size is too big. Copying the Portage
check into pkgcheck (in terms of the metrics) would satisfy this.
That is, regardless of raw size, I'm asking for a calculation based on
the contents of EGO_SUM where, if exceeded, the package will not be installable on some systems. You didn't have an issue implementing this
for Portage and I've mentioned this a bunch of times since, so I thought
it was clear what I was hoping to see.
I would also like (which is not what I was referring to here) some
limit on the size, given that we already have a limit on the size of ${FILESDIR}, but this is less of a concern for me given it's bounded
by the aforementioned environment size check.
Why do we have to keep exporting the related variables that generally
cause these size issues to the environment?
On 30/06/2023 13.33, Eray Aslan wrote:
On Fri, Jun 30, 2023 at 03:38:11AM -0600, Tim Harder wrote:
Why do we have to keep exporting the related variables that generally
cause these size issues to the environment?
I really do not want to make a +1 response but this is an excellent
question that we need to answer before implementing EGO_SUM.
Could you please discuss why you make the reintroduction of EGO_SUM
dependent on this question?
On 2023-07-03 Mon 04:17, Florian Schmaus wrote:
On 30/06/2023 13.33, Eray Aslan wrote:
On Fri, Jun 30, 2023 at 03:38:11AM -0600, Tim Harder wrote:
Why do we have to keep exporting the related variables that generally
cause these size issues to the environment?
I really do not want to make a +1 response but this is an excellent
question that we need to answer before implementing EGO_SUM.
Could you please discuss why you make the reintroduction of EGO_SUM
dependent on this question?
Just to be clear, I don't particularly care about EGO_SUM enough to gate
its reintroduction (and don't have any leverage to do so anyway). I'm
just tired of the circular discussions around env issues that all seem
to avoid actual fixes, catering instead to functionality used by a vanishingly small subset of ebuilds in the main repo that compels a
certain design mostly due to how portage functioned before EAPI 0.
Other than that, supporting EGO_SUM (or any other language ecosystem
trending towards distro-unfriendly releases) is fine as long as devs are
cognizant of how the related global-scope eclass design affects everyone
running or working on the raw repo. I hope devs continue leveraging the
relatively recent benchmark tooling (and perhaps more future support) to
improve their work. Along those lines, it could be nice to see sample
benchmark data in commit messages for large, global-scope eclass work
just to reinforce that it was taken into account.
Tim
just to be curious about the whole discussion. I did not follow in the
deepest detail but what I got is:
- EGO_SUM blows up the Manifest file, since every little Go module needs
to be respected. A lot of these Manifest files lead to an extremely
increased Portage tree size. EGO_SUM is just one example (though the
biggest one). Statically linked languages like Rust etc. have the same
problem.
- The current solution is to prepackage all modules, put it somewhere on
a webserver and just manifest that file. This makes the Portage tree
small in size again, but requires a webserver/mirror and is thus
unfriendly for overlay devs.
I'm not sure if it was mentioned before but has anyone considered hash
trees / Merkle trees for the manifest file? The idea would be to hash
the standard manifest file a second time if it gets too big and write
down that hash as new manifest file and leave EGO_SUM as is.
On Tue, Jul 04, 2023 at 12:44:39PM +0200, Gerion Entrup wrote:
just to be curious about the whole discussion. I did not follow in the
deepest detail but what I got is:
- EGO_SUM blows up the Manifest file, since every little Go module needs
to be respected. A lot of these Manifest files lead to an extremely
increased Portage tree size. EGO_SUM is just one example (though the
biggest one). Statically linked languages like Rust etc. have the same
problem.
- The current solution is to prepackage all modules, put it somewhere on
a webserver and just manifest that file. This makes the Portage tree
small in size again, but requires a webserver/mirror and is thus
unfriendly for overlay devs.
I'm not sure if it was mentioned before but has anyone considered hash
trees / Merkle trees for the manifest file? The idea would be to hash
the standard manifest file a second time if it gets too big and write
down that hash as new manifest file and leave EGO_SUM as is.
This is out-of-tree/indirect Manifests, that I proposed here, more than
a year ago:
https://marc.info/?l=gentoo-dev&m=168280762310716&w=2
https://marc.info/?l=gentoo-dev&m=165472088822215&w=2
Developing it requires PMS work in addition to package manager
development, because it introduces phases.
- primary fetch of $SRC_URI per ebuild, including indirect Manifest
- primary validation of distfiles
- secondary fetch of $SRC_URI per indirect Manifest
- secondary validation of additional distfiles
A significantly impacted use case is "emerge -f", it now needs to run downloads twice.
The rest of the posts also go into the matter of duplication within
EGO_SUM & the indirect Manifests: limiting the growth requires some form
of content-addressed layout.
It's absolutely something we should get developed, but it's a lot of
work.
The indirect Manifests still provide a hosting challenge for overlays.
--
Robin Hugh Johnson
Gentoo Linux: Dev, Infra Lead, Foundation Treasurer
E-Mail : robbat2@gentoo.org
GnuPG FP : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
GnuPG FP : 7D0B3CEB E9B85B1F 825BCECF EE05E6F6 A48F6136
On Tue, Jul 04, 2023 at 21:56:26 +0000, Robin H. Johnson wrote:
Developing it requires PMS work in addition to package manager
development, because it introduces phases.
- primary fetch of $SRC_URI per ebuild, including indirect Manifest
- primary validation of distfiles
- secondary fetch of $SRC_URI per indirect Manifest
- secondary validation of additional distfiles
A significantly impacted use case is "emerge -f", it now needs to run downloads twice.
I'm not sure double downloading is required. Consider a flow similar to
this:
1. distfiles are fetched as per the ebuild
2. distfiles are hashed into a temporary Manifest
3. temporary Manifest is hashed and compared with the hashes stored in
the in-tree Manifest for the direct Manifest
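The three steps above could be sketched roughly as follows. This is only an illustration, not Portage code: the function names are made up, the DIST line layout follows the existing Manifest format (name, size, BLAKE2B, SHA512), and sorting by file name is merely one assumed way to get a stable ordering.

```python
import hashlib
from pathlib import Path


def dist_line(path):
    """Build one DIST entry for a fetched distfile (read fully into memory)."""
    data = Path(path).read_bytes()
    return "DIST {} {} BLAKE2B {} SHA512 {}".format(
        Path(path).name,
        len(data),
        hashlib.blake2b(data).hexdigest(),
        hashlib.sha512(data).hexdigest(),
    )


def temporary_manifest(distfiles):
    """Step 2: hash the fetched distfiles into a temporary Manifest.

    Assumption: entries are sorted by file name so that the result does
    not depend on the order in which the files were fetched.
    """
    ordered = sorted(distfiles, key=lambda p: Path(p).name)
    return "".join(dist_line(p) + "\n" for p in ordered)


def manifest_digest(manifest_text):
    """Hash the temporary Manifest itself (BLAKE2B chosen for the sketch)."""
    return hashlib.blake2b(manifest_text.encode("utf-8")).hexdigest()


def verify(distfiles, expected_digest):
    """Step 3: compare against the digest stored in the in-tree Manifest."""
    return manifest_digest(temporary_manifest(distfiles)) == expected_digest
```

Note that a mismatch only says that some distfile is wrong, not which one; finding the culprit would need the full direct Manifest from somewhere.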
A new Manifest format would be required in order to differentiate the
current ones from an indirect one. This may require PMS changes,
although I suspect amending GLEP 74 may be enough since the PMS seems
to just refer to the GLEP for a description of Manifests.
This would also either rely on a stable ordering of Manifest contents
when generating it or having a separate file listing in the indirect
Manifest which corresponds to the order in the direct Manifest. For the latter, it should also have separate entries for different package
versions so that every single distfile for every single version of said package does not need to be fetched in order to build the direct
Manifest.
I'm imagining something along these lines:
INDIRECT true
PACKAGE category/package-version distfile1 distfile2 ... ALGO1 hash1 ALGO2 hash2 ...
PACKAGE ...
Here `ALGO1` and `hash1` correspond to the hash of the direct Manifest containing the distfiles (and potentially other files if a repo does not
have thin-manifests enabled) and their hashes in the order specified previously.
The indirect Manifest as described above would be large-ish for a
package that has lots of distfiles, but likely much smaller than if each distfile had its set of hashes stored directly.
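For illustration, a PACKAGE entry in the format imagined above could be split into its parts like this. The format is only a proposal from this mail, so everything here is hypothetical, including the function name and the assumption that the algorithm tags are BLAKE2B and SHA512.

```python
# Assumption: the set of hash tags that may follow the distfile names.
KNOWN_ALGOS = {"BLAKE2B", "SHA512"}


def parse_package_line(line):
    """Split 'PACKAGE cpv distfile... ALGO hash ...' into its components."""
    tokens = line.split()
    if len(tokens) < 2 or tokens[0] != "PACKAGE":
        raise ValueError("not a PACKAGE entry")
    cpv, rest = tokens[1], tokens[2:]
    # Distfile names run until the first known algorithm tag appears.
    split = next(
        (i for i, tok in enumerate(rest) if tok in KNOWN_ALGOS), len(rest)
    )
    distfiles, hash_tokens = rest[:split], rest[split:]
    if len(hash_tokens) % 2:
        raise ValueError("hash tag without a value")
    hashes = dict(zip(hash_tokens[0::2], hash_tokens[1::2]))
    return cpv, distfiles, hashes
```

One weakness of this layout is visible in the sketch: a distfile that happens to be named like an algorithm tag would be misparsed, so a real format would probably want an explicit separator or a count.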
Please correct me if there's some detail I've overlooked.
- Oskari
On Wednesday, 5 July 2023, 01:09:30 CEST, Oskari Pirhonen wrote:
On Tue, Jul 04, 2023 at 21:56:26 +0000, Robin H. Johnson wrote:
Developing it requires PMS work in addition to package manager development, because it introduces phases.
- primary fetch of $SRC_URI per ebuild, including indirect Manifest
- primary validation of distfiles
- secondary fetch of $SRC_URI per indirect Manifest
- secondary validation of additional distfiles
A significantly impacted use case is "emerge -f", it now needs to run downloads twice.
I'm not sure double downloading is required. Consider a flow similar to this:
1. distfiles are fetched as per the ebuild
2. distfiles are hashed into a temporary Manifest
3. temporary Manifest is hashed and compared with the hashes stored in
the in-tree Manifest for the direct Manifest
This is exactly what I meant. A webstorage is not needed. A second
download process is also not needed. Just an additional Manifest format
is needed for ebuilds with more than n distfiles.
A new Manifest format would be required in order to differentiate the current ones from an indirect one. This may require PMS changes,
although I suspect amending GLEP 74 may be enough since the PMS seems
to just refer to the GLEP for a description of Manifests.
This would also either rely on a stable ordering of Manifest contents
when generating it or having a separate file listing in the indirect Manifest which corresponds to the order in the direct Manifest. For the latter, it should also have separate entries for different package
versions so that every single distfile for every single version of said package does not need to be fetched in order to build the direct
Manifest.
I'm imagining something along these lines:
INDIRECT true
PACKAGE category/package-version distfile1 distfile2 ... ALGO1 hash1 ALGO2 hash2 ...
PACKAGE ...
Maybe it is reasonable to skip the distfile names altogether (or just
provide a hash value of the concatenated file names). Then the manifest
would just contain two/three hashes (for as many distfiles as the ebuild
needs). Since these kinds of indirect Manifests should be rarer than
the normal ones, a slightly longer processing time does not have much
impact I would say.
Here `ALGO1` and `hash1` correspond to the hash of the direct Manifest containing the distfiles (and potentially other files if a repo does not have thin-manifests enabled) and their hashes in the order specified previously.
The indirect Manifest as described above would be large-ish for a
package that has lots of distfiles, but likely much smaller than if each distfile had its set of hashes stored directly.
Without storing the filenames, the Manifest file would have the same
small size for any amount of distfiles needed.
On Tue, Jul 04, 2023 at 01:13:30AM -0600, Tim Harder wrote:
Other than that, supporting EGO_SUM (or any other language ecosystem
trending towards distro-unfriendly releases) is fine as long as devs are
cognizant of how the related global-scope eclass design affects everyone
running or working on the raw repo. I hope devs continue leveraging the
relatively recent benchmark tooling (and perhaps more future support) to
improve their work. Along those lines, it could be nice to see sample
benchmark data in commit messages for large, global-scope eclass work
just to reinforce that it was taken into account.
Tim
I've been following the EGO_SUM thread for quite some time now. One
other thing I did not see mentioned in favour of EGO_SUM so far:
reproducibility.
The problem with external tarballs is that they are gone once the
ebuild is dropped from the tree. Should a user ever want to roll back
to a previous version of an application, either by checking out an
older version of the portage tree or copying said ebuild into their
local overlay, they still cannot simply run an emerge on it, as
they have to somehow recreate the tarball itself too.
While upstream may not host everything forever, it's pretty much
guaranteed to be available for much longer than Gentoo's custom
tarball bundles of dependencies.
On 30/06/2023 10.22, Sam James wrote:
Florian Schmaus <flow@gentoo.org> writes:
[in reply to a gentoo-project@ post, but it was asked to continue this
on gentoo-dev@]
On 28/06/2023 16.46, Sam James wrote:
and questions remain unanswered on the
ML (why not implement a check in pkgcheck similar to what is in Portage,
for example)?
On 2023-05-30 [1], I proposed a limit in the range of 2 to 1.5 MiB for
the total package-directory size. I only care a little about the tool
that checks this limit, but pkgcheck is an obvious choice. I also
suggested that we review this policy once the number of Go packages
has doubled or two years after this policy was established (whatever
comes first).
But I fear you may be referring to another kind of check. You may be
talking about a check that forbids EGO_SUM in ::gentoo but allows it in
overlays.
My position on this has been consistent: a check is needed to statically
determine when the environment size is too big. Copying the Portage
check into pkgcheck (in terms of the metrics) would satisfy this.
It is not as easy as merely copying existing portage code into
pkgcheck (unless I am missing something).
I've talked to arthurzam, and there appears to be a .environment file
created by pkgcheck, which we could use to approximate the exported environment.
Another option would be to have pkgcheck count the EGO_SUM
entries. The tree-sitter API for Bash, which pkgcheck already uses,
seems to allow for that. But that would be different from the check in portage. Although, IMHO, counting EGO_SUM entries would be sufficient.
That is, regardless of raw size, I'm asking for a calculation based on
the contents of EGO_SUM where, if exceeded, the package will not be
installable on some systems. You didn't have an issue implementing this
for Portage and I've mentioned this a bunch of times since, so I thought
it was clear what I was hoping to see.
So pkgcheck counting EGO_SUM entries would be sufficient for the
purpose of having a static check that notices if the ebuild would
likely run into the environment limit?
To find a common compromise, I would possibly invest my time in
developing such a test. Even though I do not deem such a check a
strict prerequisite to reintroduce EGO_SUM.
Intelligibly, EGO_SUM can be considered ugly. Compared to a
traditional Gentoo package, EGO_SUM-based ones are larger. The same is
true for Rust packages. However, looking at the bigger picture,
EGO_SUM's advantages outweigh its disadvantages.
Again, am on record as being fine with the general EGO_SUM approach,
even if I wish we didn't need it, as I see it as inevitable for things
like yarn, .NET, and of course Rust as we already have it.
Just ideally not huge ones, and certainly not huge ones which then
aren't even reliably installable because of environment size.
Talking about "reliably installable" makes it sound to me like there
are cases where installing an EGO_SUM-based package sometimes works and
sometimes does not. But the kernel limit is fixed and not even
configurable, short of patching the source (and in the
absence of architectures with a page size below 4 KiB) [1].
Any developer testing whether or not an ebuild is installable would
immediately become aware if the ebuild runs into the environment
limit.
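To show why the limit is deterministic, here is a small sketch. It assumes that the limit meant here is Linux's per-string MAX_ARG_STRLEN, which is 32 pages (128 KiB with 4 KiB pages) and applies to each "NAME=value" string including its terminating NUL byte. The function names are made up for the example.

```python
import os


def max_arg_strlen(page_size=None):
    """Per-string exec limit, assuming Linux's MAX_ARG_STRLEN = 32 pages."""
    if page_size is None:
        page_size = os.sysconf("SC_PAGE_SIZE")
    return 32 * page_size


def oversized_variables(env, page_size=None):
    """Return the names of variables too large to be exported on exec."""
    limit = max_arg_strlen(page_size)
    # The kernel measures 'NAME=value' plus the terminating NUL byte.
    return sorted(
        name
        for name, value in env.items()
        if len(f"{name}={value}".encode("utf-8")) + 1 > limit
    )
```

With a 4 KiB page size the answer for a given ebuild is the same on every machine; only a larger page size raises the limit.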