• kernel configs in Debian

    From Ryutaroh Matsumoto@21:1/5 to All on Mon Apr 26 09:50:01 2021
    Hi,

    For (ARM) SBCs with limited computational power, stripping out
    unused features from the kernel sometimes improves the performance,
    depending on usage.

    For my use case of packet filtering by RPi4B,

    CONFIG_PARAVIRT=n
    CONFIG_DEBUG_KERNEL=n

    each of the above increases the throughput of the packet filtering router
    by about 100 Mbps, from the baseline of 600 Mbps with linux-image-rt-arm64 5.10.
    These options cannot be disabled in the Debian kernel package because it has
    to serve wider use cases. I rebuilt linux-image-rt-arm64 with
    https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh

    On the other hand, I am wondering why the following options are currently
    disabled in the Debian arm64 kernel 5.10 package:

    CONFIG_CLEANCACHE:
    Cleancache can be thought of as a page-granularity victim cache for
    clean pages that the kernel's pageframe replacement algorithm (PFRA)
    would like to keep around, but can't since there isn't enough
    memory. So when the PFRA "evicts" a page, it first attempts to use
    cleancache code to put the data contained in that page into
    "transcendent memory", memory that is not directly accessible or
    addressable by the kernel and is of unknown and possibly time-varying
    size. And when a cleancache-enabled filesystem wishes to access a page
    in a file on disk, it first checks cleancache to see if it already
    contains it; if it does, the page is copied into the kernel and a disk
    access is avoided. When a transcendent memory driver is available
    (such as zcache or Xen transcendent memory), a significant I/O
    reduction may be achieved. When none is available, all cleancache
    calls are reduced to a single pointer-compare-against-NULL resulting
    in a negligible performance hit.

    If unsure, say Y to enable cleancache

    This is enabled by other distros: https://hlandau.github.io/kconfigreport/option/CONFIG_CLEANCACHE.xhtml

    CONFIG_ZONE_DEVICE:
    Device memory hotplug support allows for establishing pmem, or other
    device driver discovered memory regions, in the memmap. This allows
    pfn_to_page() lookups of otherwise "device-physical" addresses which
    is needed for using a DAX mapping in an O_DIRECT operation, among
    other things.

    If FS_DAX is enabled, then say Y.

    (FS_DAX is enabled in Debian arm64 kernel 5.10 package)

    CONFIG_IRQ_TIME_ACCOUNTING:
    Select this option to enable fine granularity task irq time
    accounting. This is done by reading a timestamp on each transitions
    between softirq and hardirq state, so there can be a small performance
    impact.

    (My observation suggests CONFIG_PARAVIRT=y having much higher overhead.)

    If in doubt, say N here.

    The above CONFIG_IRQ_TIME_ACCOUNTING enables %hi in "top".
    See also "Is Your Linux Version Hiding Interrupt CPU Usage From You?" https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usage/
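
    The %hi and %si columns in "top" are derived from the irq and softirq
    fields of /proc/stat. A minimal sketch, assuming two copies of /proc/stat
    saved a few seconds apart (the function name irq_share is hypothetical),
    of how the share of interrupt time over that interval could be computed:

```shell
#!/bin/sh
# irq_share BEFORE AFTER
# Print the share of hard-IRQ and soft-IRQ time between two saved copies of
# /proc/stat, using the aggregate "cpu" line. The fields after "cpu" are:
# user nice system idle iowait irq softirq steal guest guest_nice
irq_share() {
    awk '
        $1 == "cpu" {
            total = 0
            # Sum user..steal; guest time is already counted in user/nice.
            for (i = 2; i <= 9; i++) total += $i
            seen++
            if (seen == 1) {
                total0 = total; hi0 = $7; si0 = $8
            } else if (total > total0) {
                printf "hi=%.1f%% si=%.1f%%\n",
                       100 * ($7 - hi0) / (total - total0),
                       100 * ($8 - si0) / (total - total0)
            }
        }
    ' "$1" "$2"
}
```

    The two snapshots can be taken with "cat /proc/stat > before", a short
    sleep under load, and "cat /proc/stat > after" (the file names are
    placeholders). According to the message above, the irq figure is only
    meaningful on a kernel built with CONFIG_IRQ_TIME_ACCOUNTING.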


    CONFIG_PARAVIRT_TIME_ACCOUNTING has a similar role for
    linux-image-cloud-arm64:
    Select this option to enable fine granularity task steal time
    accounting. Time spent executing other tasks in parallel with the
    current vCPU is discounted from the vCPU power. To account for that,
    there can be a small performance impact.

    If in doubt, say N here.

    The above enables "%st" in "top". Some other distros seem to enable it:
    https://hlandau.github.io/kconfigreport/option/CONFIG_PARAVIRT_TIME_ACCOUNTING.xhtml


    Best regards, Ryutaroh Matsumoto

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Arnd Bergmann@21:1/5 to ryutaroh@ict.e.titech.ac.jp on Mon Apr 26 11:20:01 2021
    On Mon, Apr 26, 2021 at 9:43 AM Ryutaroh Matsumoto <ryutaroh@ict.e.titech.ac.jp> wrote:

    > For (ARM) SBCs with limited computational power, stripping out
    > unused features from the kernel sometimes improves the performance,
    > depending on usage.
    >
    > For my use case of packet filtering by RPi4B,
    >
    > CONFIG_PARAVIRT=n
    > CONFIG_DEBUG_KERNEL=n
    >
    > each of the above increases the throughput of the packet filtering router
    > by about 100 Mbps, from the baseline of 600 Mbps with linux-image-rt-arm64 5.10.
    > These options cannot be disabled in the Debian kernel package because it has
    > to serve wider use cases. I rebuilt linux-image-rt-arm64 with
    > https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh

    Interesting analysis. I would have expected neither of those two options
    to have a measurable effect on network throughput, so it is possible that
    these are hitting a bug somewhere that leads to bad performance.

    The only effect that CONFIG_PARAVIRT is supposed to have is the steal
    time accounting. Incidentally that has just changed to a static_call
    in linux-5.13 with commit a0e2bf7cb700 ("x86/paravirt: Switch time pvops
    functions to use static_call()") on all architectures, so maybe that also
    addresses the problem.

    CONFIG_DEBUG_KERNEL by itself does not do anything, but instead it
    controls a number of other configuration options. You should be able to
    see which options changed by comparing the config file before and after
    turning this off.
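
    A minimal sketch of the comparison suggested above, assuming the two
    configurations have been saved as plain-text files (the function name
    config_diff is hypothetical). It lists only the option lines that differ,
    including "# CONFIG_FOO is not set" lines, which is how kconfig records a
    disabled option:

```shell
#!/bin/sh
# config_diff OLD NEW
# Print option lines that appear in only one of two kernel .config files.
config_diff() {
    old_sorted=$(mktemp)
    new_sorted=$(mktemp)
    # Keep only option lines; "# CONFIG_FOO is not set" marks a disabled option.
    grep -E '^(# )?CONFIG_' "$1" | LC_ALL=C sort > "$old_sorted"
    grep -E '^(# )?CONFIG_' "$2" | LC_ALL=C sort > "$new_sorted"
    # Lines only in OLD are prefixed with "-", lines only in NEW with "+".
    LC_ALL=C comm -23 "$old_sorted" "$new_sorted" | sed 's/^/-/'
    LC_ALL=C comm -13 "$old_sorted" "$new_sorted" | sed 's/^/+/'
    rm -f "$old_sorted" "$new_sorted"
}
```

    For example, "config_diff config.before config.after" (both file names
    are placeholders). The kernel source tree also ships scripts/diffconfig
    for the same purpose.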

    Generally I think at least CONFIG_DEBUG_INFO should be enabled in
    a distro kernel in order to analyse bug reports better, but this is not
    supposed to change executable code. What other options are disabled
    when you turn this off?

    Also, do you see the same performance difference with the non-rt kernel?
    Most people would not run the -rt kernel because of the inherent
    performance overhead, and it's not clear whether the slowdown you
    see is the result of a combination of CONFIG_PREEMPT_RT with some
    other option, or if this is something that hurts normal users as well.

    > On the other hand, I am wondering why the following options are currently
    > disabled in the Debian arm64 kernel 5.10 package:
    >
    > CONFIG_CLEANCACHE:
    > Cleancache can be thought of as a page-granularity victim cache for
    > clean pages that the kernel's pageframe replacement algorithm (PFRA)
    > would like to keep around, but can't since there isn't enough
    > memory. So when the PFRA "evicts" a page, it first attempts to use
    > cleancache code to put the data contained in that page into
    > "transcendent memory", memory that is not directly accessible or
    > addressable by the kernel and is of unknown and possibly time-varying
    > size. And when a cleancache-enabled filesystem wishes to access a page
    > in a file on disk, it first checks cleancache to see if it already
    > contains it; if it does, the page is copied into the kernel and a disk
    > access is avoided. When a transcendent memory driver is available
    > (such as zcache or Xen transcendent memory), a significant I/O
    > reduction may be achieved. When none is available, all cleancache
    > calls are reduced to a single pointer-compare-against-NULL resulting
    > in a negligible performance hit.
    >
    > If unsure, say Y to enable cleancache
    >
    > This is enabled by other distros: https://hlandau.github.io/kconfigreport/option/CONFIG_CLEANCACHE.xhtml

    This seems like a useful thing to enable.

    > CONFIG_ZONE_DEVICE:
    > Device memory hotplug support allows for establishing pmem, or other
    > device driver discovered memory regions, in the memmap. This allows
    > pfn_to_page() lookups of otherwise "device-physical" addresses which
    > is needed for using a DAX mapping in an O_DIRECT operation, among
    > other things.
    >
    > If FS_DAX is enabled, then say Y.
    >
    > (FS_DAX is enabled in Debian arm64 kernel 5.10 package)

    This should probably be an architecture-independent setting.
    It does sound useful to enable either both ZONE_DEVICE and
    FS_DAX or neither. I'm not aware of any arm64 hardware supporting
    nvdimm or similar technology that needs these, but there is probably
    someone who has it, if only in a lab.

    > CONFIG_IRQ_TIME_ACCOUNTING:
    > Select this option to enable fine granularity task irq time
    > accounting. This is done by reading a timestamp on each transitions
    > between softirq and hardirq state, so there can be a small performance
    > impact.
    >
    > (My observation suggests CONFIG_PARAVIRT=y having much higher overhead.)
    >
    > If in doubt, say N here.
    >
    > The above CONFIG_IRQ_TIME_ACCOUNTING enables %hi in "top".
    > See also "Is Your Linux Version Hiding Interrupt CPU Usage From You?" https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usage/

    Indeed, reading the hardware clock on arm64 is usually cheap compared
    to other architectures, so enabling this seems reasonable.

    Arnd

  • From Ryutaroh Matsumoto@21:1/5 to All on Tue Apr 27 02:20:01 2021
    Hi Arnd,

    > Also, do you see the same performance difference with the non-rt kernel?
    > Most people would not run the -rt kernel because of the inherent
    > performance overhead, and it's not clear whether the slowdown you
    > see is the result of a combination of CONFIG_PREEMPT_RT with some
    > other option, or if this is something that hurts normal users as well.

    Thank you for your interest.
    I will check the differences made by the kernel compile options for
    the non-rt kernel (linux-image-arm64).
    Hopefully I can report back with additional information within one week.

    Best regards, Ryutaroh

  • From Alan Corey@21:1/5 to Ryutaroh Matsumoto on Tue Apr 27 13:10:02 2021
    Also look for /proc/config.gz. If you have it, it's a dump of the
    config options of the running kernel. Whether it gets generated or not
    is itself a config option.
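
    A minimal sketch of that lookup, assuming a fallback to the
    /boot/config-<release> file that Debian kernel packages install when
    /proc/config.gz is absent. The function name running_config and its two
    optional arguments (a root prefix and a release string, added only to make
    the function testable) are hypothetical:

```shell
#!/bin/sh
# running_config [ROOT] [RELEASE]
# Print the configuration of the running kernel. /proc/config.gz is tried
# first (it exists only when CONFIG_IKCONFIG_PROC is enabled), then the
# /boot/config-<release> file shipped by the kernel package.
running_config() {
    root=${1:-}
    release=${2:-$(uname -r)}
    if [ -r "$root/proc/config.gz" ]; then
        gzip -dc "$root/proc/config.gz"
    elif [ -r "$root/boot/config-$release" ]; then
        cat "$root/boot/config-$release"
    else
        echo "no kernel config found for $release" >&2
        return 1
    fi
}
```

    Called without arguments, it reads from the live system, for example
    "running_config | grep CONFIG_PARAVIRT".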

    --
    -------------
    Education is contagious.

  • From Ryutaroh Matsumoto@21:1/5 to All on Wed Apr 28 02:00:01 2021
    Hi Alan, thank you for your interest.

    > Also look for /proc/config.gz. If you have it, it's a dump of the
    > config options of the running kernel. Whether it gets generated or not
    > is itself a config option.

    I plan to make minimal changes to the config and rebuild the kernel as follows:

    apt-get source linux/sid
    cd linux-5.10.28
    fakeroot make -f debian/rules.gen setup_arm64_none_arm64
    cat >>debian/build/build_arm64_none_arm64/.config <<'EOF'
    CONFIG_XEN=n
    CONFIG_PARAVIRT=n
    EOF
    fakeroot debian/rules source
    fakeroot make -j 3 -f debian/rules.gen binary-arch_arm64_none_arm64

    I do not expect to have /proc/config.gz, as CONFIG_IKCONFIG is disabled
    in the Debian kernel.
    I will include the output of diff -u between the .config in
    debian/build/build_arm64_none_arm64
    and /usr/src/linux-config-5.10/config.arm64_rt_arm64

    As CONFIG_XEN selects CONFIG_PARAVIRT, CONFIG_XEN=n is required
    to build a kernel with CONFIG_PARAVIRT=n.
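
    Because CONFIG_XEN selects CONFIG_PARAVIRT, kconfig keeps CONFIG_PARAVIRT
    on for as long as CONFIG_XEN stays enabled, so it may be worth confirming
    that the .config used for the build really has both options off. A minimal
    sketch, assuming the last assignment in the file wins (the helper names
    last_setting and check_disabled are hypothetical):

```shell
#!/bin/sh
# last_setting OPTION CONFIG_FILE
# Print the value that the last matching line assigns to OPTION, or "n" when
# the option is recorded as "# OPTION is not set" or does not appear at all.
last_setting() {
    awk -v opt="$1" '
        index($0, opt "=") == 1      { val = substr($0, length(opt) + 2) }
        $0 == "# " opt " is not set" { val = "n" }
        END { print (val == "" ? "n" : val) }
    ' "$2"
}

# check_disabled CONFIG_FILE OPTION...
# Return non-zero and name the offender if any OPTION is still enabled.
check_disabled() {
    file=$1
    shift
    status=0
    for opt in "$@"; do
        if [ "$(last_setting "$opt" "$file")" != "n" ]; then
            echo "$opt is still enabled in $file" >&2
            status=1
        fi
    done
    return $status
}
```

    With the paths from the steps above, the check would be
    "check_disabled debian/build/build_arm64_none_arm64/.config CONFIG_XEN CONFIG_PARAVIRT".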

    The last build with the above steps failed with ".btf.vmlinux.bin.o: file not
    recognized: file format not recognized". I am retrying the build after adding
    CONFIG_DEBUG_INFO_BTF=n

    As a single build takes 6 hours on the RPi4B, it can take several days to find
    the correct build steps. The above steps seem to follow exactly the instructions at

    https://www.debian.org/doc/manuals/debian-kernel-handbook/ch-common-tasks.html#s4.2.3
    and https://www.debian.org/doc/manuals/debian-kernel-handbook/ch-common-tasks.html#s4.2.5

    Best regards, Ryutaroh

  • From Alan Corey@21:1/5 to Ryutaroh Matsumoto on Wed Apr 28 03:30:02 2021
    I think you can probably enable CONFIG_IKCONFIG; I'm running a
    Bullseye kernel that has a /proc/config.gz. But the kernel did come
    from Manjaro, I think, so it's a little strange. It's on a Pinebook Pro
    and there's no official Debian release for it yet; this came from
    debootstrap. Getting the drivers and device tree right is a
    challenge, and a few of the drivers are blobs. Made in China, engineered
    in Hong Kong.

  • From Ryutaroh Matsumoto@21:1/5 to All on Wed Apr 28 03:40:01 2021
    Hi Alan,

    > I think you can probably enable CONFIG_IKCONFIG, I'm running a

    I am pretty sure I can,
    as I am using my rebuilt Debian RT kernel with CONFIG_IKCONFIG=m.
    I guess that Arnd wants a comparison between the original Debian kernel
    and a minimally changed kernel (I am not completely sure, of course).

    I wonder why the Debian kernel team keeps CONFIG_IKCONFIG
    and CONFIG_IKHEADERS disabled; enabling them would probably make
    the linux-headers-* and linux-config-* packages unnecessary.

    Best regards, Ryutaroh

  • From Alan Corey@21:1/5 to Ryutaroh Matsumoto on Wed Apr 28 03:50:01 2021
    I think the headers would still be necessary if you want to build modules
    for it. The linux-config package may do the same thing as a config.gz.

  • From Ryutaroh Matsumoto@21:1/5 to All on Fri Apr 30 04:20:01 2021
    Hi,

    This is a follow-up to my previous post on the impact of kernel compile
    options on kernel performance:

    Summary:
    * CONFIG_PARAVIRT=n probably has no positive impact on either
      linux-image-arm64 or linux-image-rt-arm64.

    * CONFIG_DEBUG_PREEMPT=n greatly improves the performance of linux-image-rt-arm64,
      while it is unselectable with linux-image-arm64 as CONFIG_DEBUG_PREEMPT
      depends on CONFIG_PREEMPTION.

    * linux-image-rt-arm64 is much slower than the standard linux-image-arm64,
      but its performance probably becomes comparable by omitting unnecessary
      compile options for a given hardware.

    * All kernel versions are 5.10.28.

    Experiments:
    Compile options are adjusted as follows:

    apt-get source linux
    cd linux-5.10.28
    fakeroot make -f debian/rules.gen setup_arm64_none_arm64
    cat >>debian/build/build_arm64_none_arm64/.config <<'EOF'
    CONFIG_XEN=n
    CONFIG_PARAVIRT=n
    CONFIG_DEBUG_INFO_BTF=n
    EOF
    fakeroot debian/rules source
    fakeroot make -j 4 -f debian/rules.gen binary-arch_arm64_none_arm64

    CONFIG_XEN selects CONFIG_PARAVIRT, so it must be disabled when CONFIG_PARAVIRT=n.
    CONFIG_DEBUG_INFO_BTF=y causes a build error with linux-image-arm64.

    The job of the RPi4B is taking IPv4 packets, applying NAPT, encapsulating them
    in IPv6, and vice versa. Almost no user process is involved. The CPU is mainly
    in kernel mode or handling interrupts. The CPU consumption of hard irq + softirq
    on a single CPU core spikes to 85 to 100% during the speedtest.

    CPU frequency of RPi4 is set to the lowest (600 MHz) by
    cpupower frequency-set -g powersave

    IPv6 packets can travel at around 600-800 Mbps. All IPv4 packets are
    converted to IPv6 by the RPi4, and no IPv4 packets are exchanged with the ISP.
    The ISP's network is essentially IPv6 single stack.
    All devices are wired to a single Ethernet switch.

    On another fast amd64 laptop, I run
    speedtest -v --selection-details -a -i 192.168.1.72 -s 28910

    The observed speeds are shown below:

    linux-image-arm64 with no change:
    Download: 577.23 Mbps (data used: 370.7 MB)
    Upload: 386.99 Mbps (data used: 353.0 MB)
    Download: 592.79 Mbps (data used: 1.1 GB)
    Upload: 380.41 Mbps (data used: 171.0 MB)


    linux-image-arm64 with CONFIG_PARAVIRT=n
    Download: 485.35 Mbps (data used: 406.0 MB)
    Upload: 380.57 Mbps (data used: 171.5 MB)
    Download: 514.57 Mbps (data used: 256.8 MB)
    Upload: 376.92 Mbps (data used: 169.2 MB)

    linux-image-rt-arm64 with no change:
    Download: 380.85 Mbps (data used: 422.2 MB)
    Upload: 283.87 Mbps (data used: 127.8 MB)

    linux-image-rt-arm64 with CONFIG_PARAVIRT=n
    Download: 332.95 Mbps (data used: 265.4 MB)
    Upload: 310.06 Mbps (data used: 273.7 MB)
    Download: 385.97 Mbps (data used: 400.1 MB)
    Upload: 295.57 Mbps (data used: 133.2 MB)
    Download: 379.69 Mbps (data used: 394.0 MB)
    Upload: 293.07 Mbps (data used: 139.4 MB)

    linux-image-rt-arm64 with CONFIG_PARAVIRT=n & CONFIG_DEBUG_PREEMPT=n
    Download: 425.95 Mbps (data used: 753.7 MB)
    Upload: 347.50 Mbps (data used: 382.8 MB)
    Download: 423.05 Mbps (data used: 499.4 MB)
    Upload: 332.48 Mbps (data used: 149.4 MB)

    RT kernel specialized for RPi: https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh

    Download: 488.33 Mbps (data used: 514.6 MB)
    Upload: 416.72 Mbps (data used: 330.8 MB)
    Download: 504.79 Mbps (data used: 633.5 MB)
    Upload: 404.07 Mbps (data used: 258.5 MB)
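
    One way to summarise the repeated runs above is to average the Download
    and Upload figures per kernel. A minimal sketch, assuming the speedtest
    lines for one kernel are fed on standard input (the function name
    speed_average is hypothetical):

```shell
#!/bin/sh
# speed_average
# Average the "Download:" and "Upload:" figures (in Mbps) found in speedtest
# output read from standard input, and report how many runs were averaged.
speed_average() {
    awk '
        $1 == "Download:" { down += $2; nd++ }
        $1 == "Upload:"   { up += $2; nu++ }
        END {
            if (nd) printf "Download: %.2f Mbps (n=%d)\n", down / nd, nd
            if (nu) printf "Upload: %.2f Mbps (n=%d)\n", up / nu, nu
        }
    '
}
```

    Fed the four lines listed under "linux-image-arm64 with no change", it
    reports a Download average of 585.01 Mbps and an Upload average of
    383.70 Mbps, each over two runs.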

    Best regards, Ryutaroh

  • From Arnd Bergmann@21:1/5 to ryutaroh@ict.e.titech.ac.jp on Fri Apr 30 11:50:01 2021
    On Fri, Apr 30, 2021 at 4:10 AM Ryutaroh Matsumoto <ryutaroh@ict.e.titech.ac.jp> wrote:

    > This is a follow-up to my previous post on the impact of kernel compile
    > options on kernel performance:
    >
    > Summary:
    > * CONFIG_PARAVIRT=n probably has no positive impact on either
    >   linux-image-arm64 or linux-image-rt-arm64.

    Ok

    > * CONFIG_DEBUG_PREEMPT=n greatly improves the performance of linux-image-rt-arm64,
    >   while it is unselectable with linux-image-arm64 as CONFIG_DEBUG_PREEMPT
    >   depends on CONFIG_PREEMPTION.
    >
    > * linux-image-rt-arm64 is much slower than the standard linux-image-arm64,
    >   but its performance probably becomes comparable by omitting unnecessary
    >   compile options for a given hardware.

    I would not expect any change in performance from omitting unused drivers.
    If turning off the other platforms has a performance impact, this could
    still mean that there is a serious performance regression where we do not
    expect it.

    CONFIG_DEBUG_PREEMPT is a tough choice here: in a distro kernel,
    this should probably be enabled since it may find RT specific bugs in
    arbitrary drivers. Generally speaking, PREEMPT_RT is less well tested
    than normal kernels, so having this enabled is particularly useful when
    running on hardware that nobody else has tried it on before.
    The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
    here, it may be worth asking on the linux-rt-users list about what the
    expected cost on arm64 hardware is.

    > The job of the RPi4B is taking IPv4 packets, applying NAPT, encapsulating them
    > in IPv6, and vice versa. Almost no user process is involved. The CPU is mainly
    > in kernel mode or handling interrupts. The CPU consumption of hard irq + softirq
    > on a single CPU core spikes to 85 to 100% during the speedtest.

    This is likely all driver specific, and if you just need to improve network
    throughput, tuning or hacking the driver probably makes more difference
    than the kernel.

    If this is the internal network device in the Raspberry Pi 4, I can see
    that the platform is not particularly optimized for throughput, even
    though the driver doesn't contain any serious blunders.

    The first thing I see is that the driver can support 40 bit addressing,
    but the platform doesn't declare the bus to be wider than 32 bits,
    so it will always use bounce buffers for any address above the first
    four gigabytes. Interestingly, the DTB file that comes with raspbian
    does declare a /scb/dma-ranges property for the bus that ethernet
    and PCI are attached to, which would make their kernel much
    faster than a mainline kernel!

    Another thing I see is that the ethernet device is actually able to
    use four separate transmit queues, but it seems they are all
    wired up to the same interrupt line. For rx queues, the hardware
    does seem to support it but the driver doesn't. I doubt that there
    is anything you can do about this to make it use multiple CPUs.

    Finally, I see that the TX queue is protected using a spinlock that
    prevents the bcmgenet_xmit() function from running concurrently
    with the __bcmgenet_tx_reclaim() function, so even when you
    call xmit on different CPU cores, it still won't utilize multiple cores
    at any time, but rather lead to either spinning (with the normal
    kernel) or blocking the thread (on a rt kernel). If the transmit
    path can be changed to work without spinlocks, the differences
    between rt and non-rt would get smaller for your workload,
    and it would probably be faster in both cases.

    > The observed speeds are shown below:
    >
    > linux-image-arm64 with no change:
    > Download: 577.23 Mbps (data used: 370.7 MB)
    > Upload: 386.99 Mbps (data used: 353.0 MB)
    > Download: 592.79 Mbps (data used: 1.1 GB)
    > Upload: 380.41 Mbps (data used: 171.0 MB)
    >
    > linux-image-arm64 with CONFIG_PARAVIRT=n
    > Download: 485.35 Mbps (data used: 406.0 MB)
    > Upload: 380.57 Mbps (data used: 171.5 MB)
    > Download: 514.57 Mbps (data used: 256.8 MB)
    > Upload: 376.92 Mbps (data used: 169.2 MB)

    Curiously, these numbers suggest that turning off CONFIG_PARAVIRT
    actually makes the kernel slower in the non-preempt version, while for
    the preempt-rt kernel it does not show that counterintuitive effect.
    Can you check whether there are any other differences in the .config
    file besides CONFIG_PARAVIRT that may cause the difference, and
    that you didn't mix up the results?

    > linux-image-rt-arm64 with no change:
    > Download: 380.85 Mbps (data used: 422.2 MB)
    > Upload: 283.87 Mbps (data used: 127.8 MB)
    >
    > linux-image-rt-arm64 with CONFIG_PARAVIRT=n
    > Download: 332.95 Mbps (data used: 265.4 MB)
    > Upload: 310.06 Mbps (data used: 273.7 MB)
    > Download: 385.97 Mbps (data used: 400.1 MB)
    > Upload: 295.57 Mbps (data used: 133.2 MB)
    > Download: 379.69 Mbps (data used: 394.0 MB)
    > Upload: 293.07 Mbps (data used: 139.4 MB)
    >
    > linux-image-rt-arm64 with CONFIG_PARAVIRT=n & CONFIG_DEBUG_PREEMPT=n
    > Download: 425.95 Mbps (data used: 753.7 MB)
    > Upload: 347.50 Mbps (data used: 382.8 MB)
    > Download: 423.05 Mbps (data used: 499.4 MB)
    > Upload: 332.48 Mbps (data used: 149.4 MB)

    Nice!

    > RT kernel specialized for RPi: https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh
    >
    > Download: 488.33 Mbps (data used: 514.6 MB)
    > Upload: 416.72 Mbps (data used: 330.8 MB)
    > Download: 504.79 Mbps (data used: 633.5 MB)
    > Upload: 404.07 Mbps (data used: 258.5 MB)

    I see you do a couple of things in this fragment. One of them is the
    CONFIG_BPF_JIT_ALWAYS_ON=y option that might result in
    a significant difference if you actually use BPF (otherwise it makes
    no difference).

    Given that the numbers here are actually higher than the non-RT
    kernel numbers, you clearly hit something very interesting here.

    I also see that you enable a number of debugging options, including
    CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
    the kernel significantly slower when turned on. Is this one enabled
    in the other kernels as well, or did you find that it has a positive
    effect here?

    As mentioned above, turning off the unused platforms /should/ not
    make a difference other than code size. Do you get different
    results if you drop all the CONFIG_ARCH_*=n lines from the
    fragment? If you do, I would consider that a problem in the
    upstream kernel that needs to be investigated further.
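
    A minimal sketch of that experiment, assuming the fragment is a plain
    list of CONFIG_ lines: it copies the fragment while leaving out every line
    that disables a CONFIG_ARCH_ option, so the resulting kernel keeps the
    other platforms enabled. The function name drop_arch_disables is
    hypothetical:

```shell
#!/bin/sh
# drop_arch_disables FRAGMENT
# Print FRAGMENT without the lines that switch a CONFIG_ARCH_ option off,
# written either as CONFIG_ARCH_FOO=n or as "# CONFIG_ARCH_FOO is not set".
drop_arch_disables() {
    grep -Ev '^(CONFIG_ARCH_[A-Za-z0-9_]+=n|# CONFIG_ARCH_[A-Za-z0-9_]+ is not set)$' "$1"
}
```

    Redirecting the output to a new file gives a fragment that differs from
    the original only in the platform options, which keeps the comparison
    between the two builds focused on that one change.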

    Arnd

  • From Ryutaroh Matsumoto@21:1/5 to All on Sun May 2 08:30:02 2021
    Sorry for the slightly late response.

    > I would not expect any change in performance from omitting unused drivers.
    > If turning off the other platforms has a performance impact, this could
    > still mean that there is a serious performance regression where we do not
    > expect it.

    I do not know if you meant CONFIG_ARCH_* by "drivers".
    Removal of all CONFIG_ARCH_* other than CONFIG_ARCH_BCM2835 disables
    CONFIG_GENERIC_IRQ_MIGRATION=y
    CONFIG_GENERIC_IRQ_CHIP=y
    CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS=y
    CONFIG_NUMA=n & CONFIG_HOTPLUG_CPU=n disable
    CONFIG_HAVE_SETUP_PER_CPU_AREA
    and CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK,
    and enable CONFIG_ARCH_FLATMEM_ENABLE.

    Those changes could have some impact...

    > The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
    > here, it may be worth asking on the linux-rt-users list about what the
    > expected cost on arm64 hardware is.

    I believe they are very well aware of this, see
    https://wiki.linuxfoundation.org/realtime/documentation/howto/applications/preemptrt_setup

    Their recommendation seems(?) to be CONFIG_DEBUG_PREEMPT=n
    for better performance.

    > Can you check whether there are any other differences in the .config
    > file besides CONFIG_PARAVIRT that may cause the difference, and
    > that you didn't mix up the results?

    I believe not.
    The difference may come from:
    * The number of measurements is too small (2 runs).
    * The measured speed depends on the ISP's IPv6 network, which I cannot keep
      constant.
    The RPi4B is used for processing real network traffic, and my family
    complains if it is down for too long...

    > I see you do a couple of things in this fragment. One of them is the
    > CONFIG_BPF_JIT_ALWAYS_ON=y option that might result in
    > a significant difference if you actually use BPF (otherwise it makes
    > no difference).

    I believe the measured speed depends on nftables, ipv4-ipv6 tunnel,
    macvlan driver, Ethernet driver and the general network stack, not
    including BPF.

    My network interface configuration is:
    ip6tnl1 (tunnel) binds to myve1 (macvlan), and
    myve1 binds to eth0, and eth0 has absolutely no IPv4 or IPv6 address.
    The reason for using macvlan is to have multiple macvlan and macvtap
    interfaces bound to eth0.

    "ip l" shows as follows:
    1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    2: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen 1000
        link/ether dc:a6:32:bb:99:d9 brd ff:ff:ff:ff:ff:ff
    3: myve1@eth0: <BROADCAST,MULTICAST,ALLMULTI,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
        link/ether 96:8a:a9:8d:f6:64 brd ff:ff:ff:ff:ff:ff
    4: myvtap1@eth0: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 500
        link/ether 8e:7e:4b:95:3b:59 brd ff:ff:ff:ff:ff:ff
    5: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default qlen 1000
        link/tunnel6 :: brd :: permaddr 616:be05:411::
    6: ip6tnl1@myve1: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1460 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/tunnel6 2400:4050:2ba1:ac00:99:f0ae:8600:2c00 peer 2001:380:a120::9 permaddr 9648:2668:3d4f::
    7: wlan0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000
        link/ether dc:a6:32:bb:99:da brd ff:ff:ff:ff:ff:ff

    > I also see that you enable a number of debugging options, including
    > CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
    > the kernel significantly slower when turned on. Is this one enabled
    > in the other kernels as well, or did you find that it has a positive
    > effect here?

    As far as I can see, CONFIG_UBSAN=y and CONFIG_UBSAN_SANITIZE_ALL=y
    have not decreased the performance noticeably (for my personal use cases).
    So I choose to turn them on whenever I have a chance to build a kernel.
    As far as I can recall, the CONFIG_UBSAN related options did not
    degrade YouTube playback in firefox-esr.
    For builds of user-space applications, I have not seen a "subjectively noticeable"
    performance difference with UBSAN, so I routinely use -fsanitize=undefined.
    ASAN and MSAN are terribly slow, as we know well.

    > As mentioned above, turning off the unused platforms /should/ not
    > make a difference other than code size. Do you get different
    > results if you drop all the CONFIG_ARCH_*=n lines from the
    > fragment? If you do, I would consider that a problem in the
    > upstream kernel that needs to be investigated further.

    Having looked at arch/arm64/Kconfig.platforms, I see some options
    depending on CONFIG_ARCH_*. Besides the ones
    mentioned at the beginning, they include
    IRQ_DOMAIN_HIERARCHY
    ARM_GIC

    The *IRQ* and ARM_GIC config options can have some impact on performance
    if a use case includes lots of HW interrupts, as mine does.

    I am ready to rebuild a Debian kernel with only CONFIG_ARCH_*
    (except CONFIG_ARCH_BCM2835) disabled.

    Best regards, Ryutaroh

  • From Arnd Bergmann@21:1/5 to ryutaroh@ict.e.titech.ac.jp on Mon May 3 14:20:01 2021
    On Sun, May 2, 2021 at 8:21 AM Ryutaroh Matsumoto
    <ryutaroh@ict.e.titech.ac.jp> wrote:

    Sorry for the somewhat late response.

    I would not expect any change in performance from omitting unused drivers.
    If turning off the other platforms has a performance impact, this could still
    mean that there is a serious performance regression where we do not
    expect it.

    I do not know if you meant CONFIG_ARCH_* by "drivers".
    Removal of all CONFIG_ARCH_* other than CONFIG_ARCH_BCM2835 disables
    CONFIG_GENERIC_IRQ_MIGRATION=y
    CONFIG_GENERIC_IRQ_CHIP=y
    CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS=y

    The way it generally works is that each platform option only allows you to
    enable additional platform specific drivers that don't make sense elsewhere.
    E.g. the generic irq chip infrastructure is library code that is used by certain
    drivers but not others, so the expectation is that these options would not change
    the performance. If they do, that may be considered a bug.

    CONFIG_NUMA=n & CONFIG_HOTPLUG_CPU=n disable
    CONFIG_HAVE_SETUP_PER_CPU_AREA
    and CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK,
    and enable CONFIG_ARCH_FLATMEM_ENABLE.

    Those changes could have some impact...

    These can have a small performance impact, though it should
    mostly make NUMA machines worse, rather than making non-NUMA
    machines better.

    The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
    here; it may be worth asking on the linux-rt-users list about what the
    expected cost on arm64 hardware is.

    I believe they are very well aware of this, see
    https://wiki.linuxfoundation.org/realtime/documentation/howto/applications/preemptrt_setup

    Their recommendation seems(?) to be CONFIG_DEBUG_PREEMPT=n
    for better performance.

    Ok, in that case it might help to change the Kconfig description that
    today recommends turning it on:

    |config DEBUG_PREEMPT
    | bool "Debug preemptible kernel"
    | depends on DEBUG_KERNEL && PREEMPTION && TRACE_IRQFLAGS_SUPPORT
    | default y
    | help
    | If you say Y here then the kernel will use a debug variant of the
    | commonly used smp_processor_id() function and will print warnings
    | if kernel code uses it in a preemption-unsafe way. Also, the kernel
    | will detect preemption count underflows.

    In particular the "default y" makes it sound like this has very little
    impact.

    Can you check whether there are any other differences in the .config
    file besides CONFIG_PARAVIRT that may cause the difference, and
    that you didn't mix up the results?
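
    One way to make such a comparison is sketched below in Python, as an illustration rather than anything used in this thread. It parses two .config texts and reports the options whose values differ. The names parse_config and diff_configs are made up for this sketch, and treating an option that is absent from a file as "n" is a simplifying assumption. The kernel tree also ships scripts/diffconfig for the same purpose.

```python
import re

# "CONFIG_FOO=y", "CONFIG_FOO=m", "CONFIG_FOO=123", 'CONFIG_FOO="str"'
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
# "# CONFIG_FOO is not set"
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Return {option: value} for one kernel .config given as a string."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_configs(old_text, new_text):
    """Return {option: (old_value, new_value)} for options that differ.

    Assumption of this sketch: an option missing from a file counts as "n".
    """
    old, new = parse_config(old_text), parse_config(new_text)
    changed = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name, "n"), new.get(name, "n")
        if before != after:
            changed[name] = (before, after)
    return changed
```

    Reading the two files with open(path).read() and passing the contents to diff_configs should list CONFIG_PARAVIRT plus anything else that changed unintentionally between the two builds.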

    I believe not.
    The difference may come from:
    * The number of measurements is too small (2 runs).
    * The measured speed depends on the ISP's IPv6 network, which I cannot hold
    constant.
    The RPi4B is used for processing real network traffic, and my family
    complains if it is down for too long...

    Right. In that case, the other numbers are probably also less reliable
    than the variance between runs suggests.
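
    To illustrate why two runs say little, the sketch below computes the mean, the sample standard deviation and the standard error of a handful of throughput readings. The readings in the note after the code are invented for illustration and are not measurements from this thread; summarize is a hypothetical helper name.

```python
import math
import statistics


def summarize(samples_mbps):
    """Return (mean, sample standard deviation, standard error of the mean)."""
    count = len(samples_mbps)
    if count < 2:
        raise ValueError("need at least two runs to estimate the spread")
    mean = statistics.mean(samples_mbps)
    stdev = statistics.stdev(samples_mbps)  # sample (n - 1) standard deviation
    sem = stdev / math.sqrt(count)          # uncertainty of the mean itself
    return mean, stdev, sem
```

    For two invented readings of 600 and 700 Mbps, summarize([600, 700]) gives a mean of 650 Mbps with a standard error of about 50 Mbps, which is the same order of magnitude as the roughly 100 Mbps differences being compared.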

    I see you do a couple of things in this fragment. One of them is the
    CONFIG_BPF_JIT_ALWAYS_ON=y option that might result in
    a significant difference if you actually use BPF (otherwise it makes
    no difference).

    I believe the measured speed depends on nftables, ipv4-ipv6 tunnel,
    macvlan driver, Ethernet driver and the general network stack, not
    including BPF.

    Ok.

    My network interface configuration is:
    ip6tnl1 (tunnel) binds to myve1 (macvlan), and
    myve1 binds to eth0, and eth0 has absolutely no IPv4 or IPv6 address.
    The reason for using macvlan is to be able to bind multiple macvlan and macvtap
    interfaces to eth0.

    Ok. FWIW, this driver also lacks support for IFF_UNICAST_FLT,
    which means using macvlan/macvtap puts the device into
    promiscuous mode, and every frame on the wire will have to
    be processed coming into the device.
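
    Whether an interface has actually ended up in promiscuous mode can be read off its flags word. A rough sketch follows, assuming that /sys/class/net/<interface>/flags exposes the kernel's flags in hexadecimal and that a promiscuous bit set internally by macvlan is visible there (this was not verified in this thread). IFF_PROMISC is 0x100 in <linux/if.h>; is_promiscuous and read_flags are names made up here.

```python
IFF_PROMISC = 0x100  # value from <linux/if.h>


def is_promiscuous(flags_text):
    """Return True if the hexadecimal flags string has IFF_PROMISC set."""
    return bool(int(flags_text.strip(), 16) & IFF_PROMISC)


def read_flags(interface):
    """Read the raw flags string for an interface from sysfs (Linux only)."""
    with open(f"/sys/class/net/{interface}/flags") as handle:
        return handle.read()
```

    On the router, is_promiscuous(read_flags("eth0")) would then be expected to return True while the macvlan and macvtap interfaces are up, if the assumptions above hold.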

    I also see that you enable a number of debugging options, including
    CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
    the kernel significantly slower when turned on. Is this one enabled
    in the other kernels as well, or did you find that it has a positive
    effect here?

    As far as I can see, CONFIG_UBSAN=y and CONFIG_UBSAN_SANITIZE_ALL=y
    have not decreased the performance noticeably (for my personal use cases),
    so I turn them on whenever I have a chance to build a kernel.
    As far as I can recall, the CONFIG_UBSAN related options did not
    degrade YouTube playback in firefox-esr.
    When building user-space applications, I have not seen a "subjectively noticeable"
    performance difference from UBSAN either, so I routinely use -fsanitize=undefined.
    ASAN and MSAN are terribly slow, as we know well.

    The overhead of ubsan is very workload specific, I've seen other cases
    in which it matters a lot.

    As mentioned above, turning off the unused platforms /should/ not
    make a difference other than code size. Do you get different
    results if you drop all the CONFIG_ARCH_*=n lines from the
    fragment? If you do, I would consider that a problem in the
    upstream kernel that needs to be investigated further.

    Having looked at arch/arm64/Kconfig.platforms, I see some options
    depending on CONFIG_ARCH_*. Besides the ones
    mentioned at the beginning, they include
    IRQ_DOMAIN_HIERARCHY
    ARM_GIC

    The *IRQ* and ARM_GIC config options can have some impact on the
    performance if a use case includes lots of HW interrupts, as mine does.

    I am ready to re-build a Debian kernel with only CONFIG_ARCH_*
    (except CONFIG_ARCH_BCM2835) disabled.

    Ok. As I said, I don't think the IRQ options would matter here, but
    turning off PREEMPT_RT should help a lot if there are too many
    interrupts. More importantly, you can play around with changing
    the IRQ coalescing numbers for 'ethtool -C' (if the driver supports
    that). Setting the coalescing options higher generally improves
    throughput because more work can be done per interrupt, but
    setting it too high can add latency from buffer bloat.
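
    The trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a description of how any particular driver behaves: it assumes full-size 1500-byte frames, a steady packet rate, and one interrupt per batch of frames (it ignores NAPI polling and time-based limits such as rx-usecs). The function names are made up for the illustration. On real hardware the current settings are shown by "ethtool -c eth0" and changed with "ethtool -C eth0 rx-usecs N rx-frames N"; whether the RPi4B Ethernet driver honors them is not established in this thread.

```python
def packets_per_second(throughput_bps, frame_bytes=1500):
    """Packet rate needed to carry throughput_bps with frames of frame_bytes."""
    return throughput_bps / (frame_bytes * 8)


def interrupts_per_second(throughput_bps, frames_per_irq, frame_bytes=1500):
    """Interrupt rate if one interrupt is raised per frames_per_irq frames."""
    return packets_per_second(throughput_bps, frame_bytes) / frames_per_irq


def batch_fill_time_us(throughput_bps, frames_per_irq, frame_bytes=1500):
    """Microseconds needed to receive one full batch at the given rate."""
    pps = packets_per_second(throughput_bps, frame_bytes)
    return frames_per_irq / pps * 1e6
```

    Under these assumptions, 600 Mbps corresponds to 50,000 frames per second. Batching 10 frames per interrupt cuts the interrupt rate from 50,000 to 5,000 per second, while a batch takes about 200 microseconds to fill, which is the order of the delay added to the earliest frame in each batch.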

    Arnd
