• Bug#1067440: Compression makes searching packages very slow

    From =?UTF-8?Q?Lauren=C8=9Biu?= Nicola@21:1/5 to All on Thu Mar 21 17:10:01 2024
    Package: apt
    Version: 2.7.12

    I noticed that searching for packages is very slow if the package lists are compressed. To reproduce, remove `/var/lib/apt/lists`, enable

    Acquire::GzipIndexes "true"; Acquire::CompressionTypes::Order:: "gz";

    , run `apt update`. This enables LZ4 compression on my systems, but I don't think the exact method matters. You can then run `apt search librust`, which takes about 19 seconds in a Debian 12 container (docker.io/debian:12 has compression already set up),
    compared to 0.4 seconds without compression.

    Also tested on Ubuntu 22.04 and 24.04, so the exact APT version shouldn't matter too much.

    I tried to look into it, and `strace -e trace=openat apt-cache search librust` shows it reopen and re-read one of the package lists:

    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+default-dev - Cross-platform symbolication library - feature "default"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+object-dev - Cross-platform symbolication library - feature "object"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+rustc-demangle-dev - Cross-platform symbolication library - feature "rustc-demangle"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+std-dev - Cross-platform symbolication library - feature "std"

    (you can use -e trace=openat,read to confirm that it's actually reading the file)

    I believe it's quadratic in the number of search results, and this is related to the pseudo-indexing mechanism used by APT (see `pkgRecords::Lookup` in apt-pkg). Each lookup call will have to decompress the file in order to seek to the destination.

    Unfortunately, I suspect this isn't exactly an easy fix, given the current design.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From =?UTF-8?Q?Lauren=C8=9Biu?= Nicola@21:1/5 to All on Thu Mar 21 17:40:02 2024
    --e59e970ae1854a2fb173c1cd6c990194
    Content-Type: text/plain

    Correction: because of full-text search, it might actually be quadratic in the number of packages (I didn't check). And it might be possible to fix it, by going through the compressed stream just once, instead of restarting (assuming the results are
    returned in the file order, which seems reasonable). --e59e970ae1854a2fb173c1cd6c990194
    Content-Type: text/html

    <!DOCTYPE html><html><head><title></title><style type="text/css">p.MsoNormal,p.MsoNoSpacing{margin:0}
    p.MsoNormal,p.MsoNoSpacing{margin:0}</style></head><body><div style="font-family:Arial;">Correction: because of full-text search, it might actually be quadratic in the number of packages (I didn't check). And it might be possible to fix it, by going
    through the compressed stream just once, instead of restarting (assuming the results are returned in the file order, which seems reasonable).<br></div></body></html>
    --e59e970ae1854a2fb173c1cd6c990194--

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julian Andres Klode@21:1/5 to All on Thu Mar 21 21:30:01 2024
    On Thu, Mar 21, 2024 at 06:01:12PM +0200, Laurențiu Nicola wrote:
    Package: apt
    Version: 2.7.12

    I noticed that searching for packages is very slow if the package lists are compressed. To reproduce, remove `/var/lib/apt/lists`, enable

    Acquire::GzipIndexes "true"; Acquire::CompressionTypes::Order:: "gz";

    , run `apt update`. This enables LZ4 compression on my systems, but I don't think the exact method matters. You can then run `apt search librust`, which takes about 19 seconds in a Debian 12 container (docker.io/debian:12 has compression already set up)
    , compared to 0.4 seconds without compression.

    Also tested on Ubuntu 22.04 and 24.04, so the exact APT version shouldn't matter too much.

    I tried to look into it, and `strace -e trace=openat apt-cache search librust` shows it reopen and re-read one of the package lists:

    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+default-dev - Cross-platform symbolication library - feature "default"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+object-dev - Cross-platform symbolication library - feature "object"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+rustc-demangle-dev - Cross-platform symbolication library - feature "rustc-demangle"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+std-dev - Cross-platform symbolication library - feature "std"

    (you can use -e trace=openat,read to confirm that it's actually reading the file)

    I believe it's quadratic in the number of search results, and this is related to the pseudo-indexing mechanism used by APT (see `pkgRecords::Lookup` in apt-pkg). Each lookup call will have to decompress the file in order to seek to the destination.

    Unfortunately, I suspect this isn't exactly an easy fix, given the current design.


    Going to respond to this but also including responses to your followup email which has a broken Subject:


    Searching works by ordering the packages based on file, offset
    and then iterating over them and looking them up. Seeking forward
    to a higher offset does not involve a reopen, we just skip content
    in betwene.

    Full-text search is inside the description in the section parsed
    for each package.

    It's not clear why this fails on bookworm - I can reproduce that -
    t certainly is fine in git main on my Ubuntu 24.04 system.


    --
    debian developer - deb.li/jak | jak-linux.org - free software dev
    ubuntu core developer i speak de, en

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Julian Andres Klode@21:1/5 to Julian Andres Klode on Thu Mar 21 22:40:01 2024
    On Thu, Mar 21, 2024 at 09:25:47PM +0100, Julian Andres Klode wrote:
    On Thu, Mar 21, 2024 at 06:01:12PM +0200, Laurențiu Nicola wrote:
    Package: apt
    Version: 2.7.12

    I noticed that searching for packages is very slow if the package lists are compressed. To reproduce, remove `/var/lib/apt/lists`, enable

    Acquire::GzipIndexes "true"; Acquire::CompressionTypes::Order:: "gz";

    , run `apt update`. This enables LZ4 compression on my systems, but I don't think the exact method matters. You can then run `apt search librust`, which takes about 19 seconds in a Debian 12 container (docker.io/debian:12 has compression already set
    up), compared to 0.4 seconds without compression.

    Also tested on Ubuntu 22.04 and 24.04, so the exact APT version shouldn't matter too much.

    I tried to look into it, and `strace -e trace=openat apt-cache search librust` shows it reopen and re-read one of the package lists:

    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+default-dev - Cross-platform symbolication library - feature "default"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+object-dev - Cross-platform symbolication library - feature "object"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+rustc-demangle-dev - Cross-platform symbolication library - feature "rustc-demangle"
    openat(AT_FDCWD, "/var/lib/apt/lists/archive.ubuntu.com_ubuntu_dists_jammy_universe_binary-amd64_Packages.lz4", O_RDONLY) = 16
    librust-addr2line+std-dev - Cross-platform symbolication library - feature "std"

    (you can use -e trace=openat,read to confirm that it's actually reading the file)

    I believe it's quadratic in the number of search results, and this is related to the pseudo-indexing mechanism used by APT (see `pkgRecords::Lookup` in apt-pkg). Each lookup call will have to decompress the file in order to seek to the destination.

    Unfortunately, I suspect this isn't exactly an easy fix, given the current design.


    Going to respond to this but also including responses to your followup email which has a broken Subject:


    Searching works by ordering the packages based on file, offset
    and then iterating over them and looking them up. Seeking forward
    to a higher offset does not involve a reopen, we just skip content
    in betwene.

    Full-text search is inside the description in the section parsed
    for each package.

    It's not clear why this fails on bookworm - I can reproduce that -
    t certainly is fine in git main on my Ubuntu 24.04 system.

    This should be fixed in

    https://salsa.debian.org/apt-team/apt/-/merge_requests/336

    What happened was that we got called for the same offset twice, e.g.
    let's say 42.

    Now our section size is let's say 7.

    So what happened is that we did:

    - Jump to 42 => offset = 42
    - Parse => offset = 49
    - Jump to 42 => ugh gotta go back

    Luckily we still have the currently used section in the Buffer; i.e.
    we only looked at bytes 49...<something> but our buffer still starts
    at 42, so we can just look back to find our section again.

    Sadly we can't avoid the reparse of the section here, because the
    pkgTagSection passed to pkgTagFile::Jump() could be a different one
    than the last one.

    I have a patch to avoid the reparse by storing offset a level higher
    where the pkgTagSection is fixed but I'm in favour of not shipping that
    - those would only be marginal gains and we want to exercise this code
    path.
    --
    debian developer - deb.li/jak | jak-linux.org - free software dev
    ubuntu core developer i speak de, en

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From =?UTF-8?Q?Lauren=C8=9Biu?= Nicola@21:1/5 to All on Fri Mar 22 16:30:01 2024
    Thanks for the quick fix, I can confirm it's much faster now:

    # apt 2.7.13, trixie
    $ time apt search librust-
    real 0m30.185s
    user 0m28.286s
    sys 0m1.729s

    # apt 2.7.14, trixie
    $ time apt search librust-
    real 0m0.640s
    user 0m0.490s
    sys 0m0.035s

    And sorry for the empty subject, it was my first time using bugs.debian.org. It told me to add comments by sending an email to the bug address and I didn't know whether to copy-paste the original subject. It's not like I can manually set In-Reply-To and
    References from my email client, so it would have been broken anyway.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)