For (ARM) SBCs with limited computational power, stripping
unused features out of the kernel sometimes improves performance,
depending on usage.
For my use case of packet filtering on an RPi4B,
CONFIG_PARAVIRT=n
CONFIG_DEBUG_KERNEL=n
each of the above increases the throughput of the packet filtering router
by about 100Mbps over the baseline of 600Mbps with linux-image-rt-arm64 5.10.
The Debian kernel package cannot disable these options
because it has to cover wider use cases. I rebuilt linux-image-rt-arm64 with https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh
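Before stripping options, it helps to see how a given kernel was actually built. The following is a minimal sketch and not part of the build script linked above; the helper name config_state is made up for illustration, and it assumes the config is available as a plain-text file such as /boot/config-$(uname -r), which Debian kernel packages install.

```shell
#!/bin/sh
# config_state FILE OPTION   (hypothetical helper name)
# Print the state of a kernel option in a plain-text kernel config file:
# its value (y, m, a number or a string) if it is set, or "n" if the file
# says "# CONFIG_FOO is not set" or does not mention the option at all.
config_state() {
    file=$1
    opt=$2
    # Match "OPTION=" exactly, so CONFIG_PARAVIRT does not also match
    # CONFIG_PARAVIRT_TIME_ACCOUNTING.
    value=$(sed -n "s/^${opt}=//p" "$file" | tail -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'n\n'
    fi
}

# Example: config_state "/boot/config-$(uname -r)" CONFIG_PARAVIRT
```

Running it for CONFIG_PARAVIRT and CONFIG_DEBUG_KERNEL on the stock kernel and on the rebuilt one should show whether the rebuild really changed them.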
On the other hand, I am wondering why the following options are currently disabled in the Debian arm64 kernel 5.10 package:
CONFIG_CLEANCACHE:
Cleancache can be thought of as a page-granularity victim cache for
clean pages that the kernel's pageframe replacement algorithm (PFRA)
would like to keep around, but can't since there isn't enough
memory. So when the PFRA "evicts" a page, it first attempts to use
cleancache code to put the data contained in that page into
"transcendent memory", memory that is not directly accessible or
addressable by the kernel and is of unknown and possibly time-varying
size. And when a cleancache-enabled filesystem wishes to access a page
in a file on disk, it first checks cleancache to see if it already
contains it; if it does, the page is copied into the kernel and a disk
access is avoided. When a transcendent memory driver is available
(such as zcache or Xen transcendent memory), a significant I/O
reduction may be achieved. When none is available, all cleancache
calls are reduced to a single pointer-compare-against-NULL resulting
in a negligible performance hit.
If unsure, say Y to enable cleancache
This is enabled by other distros: https://hlandau.github.io/kconfigreport/option/CONFIG_CLEANCACHE.xhtml
CONFIG_ZONE_DEVICE:
Device memory hotplug support allows for establishing pmem, or other
device driver discovered memory regions, in the memmap. This allows
pfn_to_page() lookups of otherwise "device-physical" addresses which
is needed for using a DAX mapping in an O_DIRECT operation, among
other things.
If FS_DAX is enabled, then say Y.
(FS_DAX is enabled in Debian arm64 kernel 5.10 package)
CONFIG_IRQ_TIME_ACCOUNTING:
Select this option to enable fine granularity task irq time
accounting. This is done by reading a timestamp on each transitions
between softirq and hardirq state, so there can be a small performance impact.
(My observation suggests CONFIG_PARAVIRT=y having much higher overhead.)
If in doubt, say N here.
The above CONFIG_IRQ_TIME_ACCOUNTING enables %hi in "top".
See also "Is Your Linux Version Hiding Interrupt CPU Usage From You?" https://tanelpoder.com/posts/linux-hiding-interrupt-cpu-usage/
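"top" derives %hi and %si from the irq and softirq fields of /proc/stat. As a rough sketch of the same calculation, the function below compares two saved copies of /proc/stat; the name irq_share and the two-file interface are assumptions made for illustration.

```shell
#!/bin/sh
# irq_share BEFORE AFTER   (hypothetical helper name)
# BEFORE and AFTER are two copies of /proc/stat taken some time apart.
# Prints the share of total CPU time spent in hard irq and softirq between
# the two samples, using the aggregate "cpu" line.
# Field order on that line: cpu user nice system idle iowait irq softirq steal ...
irq_share() {
    awk '
        $1 == "cpu" {
            total = 0
            # Sum user..steal; guest time is already included in user.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            if (NR == FNR) { t0 = total; hi0 = $7; si0 = $8 }
            else           { t1 = total; hi1 = $7; si1 = $8 }
        }
        END {
            dt = t1 - t0
            if (dt <= 0) exit 1
            printf "hi %.1f%% si %.1f%%\n", 100 * (hi1 - hi0) / dt, 100 * (si1 - si0) / dt
        }
    ' "$1" "$2"
}

# Example:
#   cat /proc/stat > before; sleep 10; cat /proc/stat > after
#   irq_share before after
```

Replacing "cpu" with "cpu0", "cpu1", and so on in the awk condition would give the per-core figures instead.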
Also, do you see the same performance difference with the non-rt kernel?
Most people would not run the -rt kernel because of the inherent
performance overhead, and it's not clear whether the slowdown you
see is the result of a combination of CONFIG_PREEMPT_RT with some
other option, or if this is something that hurts normal users as well.
Hi Arnd,
Also, do you see the same performance difference with the non-rt kernel?
Most people would not run the -rt kernel because of the inherent
performance overhead, and it's not clear whether the slowdown you
see is the result of a combination of CONFIG_PREEMPT_RT with some
other option, or if this is something that hurts normal users as well.
Thank you for your interest.
I will check the effect of the kernel compile options on the
non-rt kernel (linux-image-arm64).
Hopefully I can report back with additional info within one week.
Best regards, Ryutaroh
Also look for /proc/config.gz. If you have it, it's a dump of the
config options of the running kernel. Whether it gets generated or not
is itself a config option.
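A minimal sketch of that lookup, falling back to the config file that Debian kernel packages install under /boot when /proc/config.gz is absent. The function name running_kernel_config and its optional ROOT argument are made up for illustration; ROOT exists only so the function can be tried against a scratch directory.

```shell
#!/bin/sh
# running_kernel_config [ROOT]   (hypothetical helper name)
# Print the config of the running kernel: from ROOT/proc/config.gz if it is
# readable, otherwise from ROOT/boot/config-<kernel release>.
# ROOT defaults to the empty string, i.e. the real filesystem.
running_kernel_config() {
    root=${1:-}
    if [ -r "$root/proc/config.gz" ]; then
        gzip -dc "$root/proc/config.gz"
    elif [ -r "$root/boot/config-$(uname -r)" ]; then
        cat "$root/boot/config-$(uname -r)"
    else
        echo "no kernel config found" >&2
        return 1
    fi
}

# Example: running_kernel_config | grep -E 'CONFIG_(PARAVIRT|XEN)\b'
```

If the kernel was built with CONFIG_IKCONFIG=m, the file should only appear after "modprobe configs".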
Hi Alan, thank you for your interest.
Also look for /proc/config.gz. If you have it, it's a dump of the
config options of the running kernel. Whether it gets generated or not
is itself a config option.
I plan to make minimal changes to the config and rebuild it by
apt-get source linux/sid
cd linux-5.10.28
fakeroot make -f debian/rules.gen setup_arm64_none_arm64
cat >>debian/build/build_arm64_none_arm64/.config <<'EOF'
CONFIG_XEN=n
CONFIG_PARAVIRT=n
EOF
fakeroot debian/rules source
fakeroot make -j 3 -f debian/rules.gen binary-arch_arm64_none_arm64
I expect /proc/config.gz to be absent, as CONFIG_IKCONFIG is disabled
in the Debian kernel.
I will include the diff -u between .config in debian/build/build_arm64_none_arm64
and /usr/src/linux-config-5.10/config.arm64_rt_arm64.
As CONFIG_XEN selects CONFIG_PARAVIRT, CONFIG_XEN=n is required
to build a kernel with CONFIG_PARAVIRT=n.
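A plain diff -u of two config files is noisy because options move around and a disabled option may be written either as CONFIG_FOO=n or as "# CONFIG_FOO is not set". The following sketch normalizes and sorts both files first; the function name config_diff is made up for illustration.

```shell
#!/bin/sh
# config_diff OLD NEW   (hypothetical helper name)
# Show which options differ between two kernel config files, independent of
# line order. "# CONFIG_FOO is not set" is rewritten to CONFIG_FOO=n so that
# both spellings of a disabled option compare equal.
config_diff() {
    old_sorted=$(mktemp) || return 1
    new_sorted=$(mktemp) || return 1
    for pair in "$1:$old_sorted" "$2:$new_sorted"; do
        src=${pair%%:*}
        dst=${pair#*:}
        sed 's/^# \(CONFIG_[A-Za-z0-9_]*\) is not set$/\1=n/' "$src" |
            grep '^CONFIG_' | sort > "$dst"
    done
    diff -u "$old_sorted" "$new_sorted"
    status=$?
    rm -f "$old_sorted" "$new_sorted"
    return $status
}

# Example:
#   config_diff /usr/src/linux-config-5.10/config.arm64_rt_arm64 \
#               debian/build/build_arm64_none_arm64/.config
```

Like diff itself, the function returns 0 when the two configs are equivalent and 1 when they differ. The pair splitting assumes the input paths contain no colon.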
The last build with the above steps failed with ".btf.vmlinux.bin.o: file not recognized: file format not recognized". I am retrying the build after
adding
CONFIG_DEBUG_INFO_BTF=n
As a single build takes 6 hours on the RPi4B, it can take several days to find the correct
build steps. The above steps seem to follow exactly the instructions
at
https://www.debian.org/doc/manuals/debian-kernel-handbook/ch-common-tasks.html#s4.2.3
and https://www.debian.org/doc/manuals/debian-kernel-handbook/ch-common-tasks.html#s4.2.5
Best regards, Ryutaroh
I think you can probably enable CONFIG_IKCONFIG, I'm running a
Hi Alan,
I think you can probably enable CONFIG_IKCONFIG, I'm running a
I am pretty sure I can,
as I am using my rebuilt Debian RT kernel with CONFIG_IKCONFIG=m.
I guess that Arnd wants a comparison between the original Debian kernel
and a minimally changed kernel (I am not completely sure, of course).
I wonder why the Debian kernel team keeps CONFIG_IKCONFIG
and CONFIG_IKHEADERS disabled...
Enabling them would probably make the linux-headers-* and linux-config-* packages
unnecessary.
Best regards, Ryutaroh
This is a follow-up to my previous post on the impact of kernel compile options
on kernel performance:
Summary:
* CONFIG_PARAVIRT=n probably has no positive impact on either
linux-image-arm64 or linux-image-rt-arm64.
* CONFIG_DEBUG_PREEMPT=n considerably improves the performance of linux-image-rt-arm64,
while it cannot be selected in linux-image-arm64 because CONFIG_DEBUG_PREEMPT
depends on CONFIG_PREEMPTION.
* linux-image-rt-arm64 is much slower than the standard linux-image-arm64,
but its performance probably becomes comparable by omitting compile options
that are unnecessary for a given hardware.
The job of the RPi4B is to take IPv4 packets, apply NAPT, encapsulate them in IPv6,
and vice versa. Almost no user process is involved. The CPU is mainly in kernel
mode or handling interrupts. The CPU consumption of hard irq + softirq on a single CPU core
spikes to 85 to 100% during the speedtest.
The observed speeds are shown below:
linux-image-arm64 with no change:
Download: 577.23 Mbps (data used: 370.7 MB)
Upload: 386.99 Mbps (data used: 353.0 MB)
Download: 592.79 Mbps (data used: 1.1 GB)
Upload: 380.41 Mbps (data used: 171.0 MB)
linux-image-arm64 with CONFIG_PARAVIRT=n
Download: 485.35 Mbps (data used: 406.0 MB)
Upload: 380.57 Mbps (data used: 171.5 MB)
Download: 514.57 Mbps (data used: 256.8 MB)
Upload: 376.92 Mbps (data used: 169.2 MB)
linux-image-rt-arm64 with no change:
Download: 380.85 Mbps (data used: 422.2 MB)
Upload: 283.87 Mbps (data used: 127.8 MB)
linux-image-rt-arm64 with CONFIG_PARAVIRT=n
Download: 332.95 Mbps (data used: 265.4 MB)
Upload: 310.06 Mbps (data used: 273.7 MB)
Download: 385.97 Mbps (data used: 400.1 MB)
Upload: 295.57 Mbps (data used: 133.2 MB)
Download: 379.69 Mbps (data used: 394.0 MB)
Upload: 293.07 Mbps (data used: 139.4 MB)
linux-image-rt-arm64 with CONFIG_PARAVIRT=n & CONFIG_DEBUG_PREEMPT=n
Download: 425.95 Mbps (data used: 753.7 MB)
Upload: 347.50 Mbps (data used: 382.8 MB)
Download: 423.05 Mbps (data used: 499.4 MB)
Upload: 332.48 Mbps (data used: 149.4 MB)
RT kernel specialized for RPi: https://github.com/emojifreak/debian-rpi-image-script/blob/main/build-debian-raspi-kernel.sh
Download: 488.33 Mbps (data used: 514.6 MB)
Upload: 416.72 Mbps (data used: 330.8 MB)
Download: 504.79 Mbps (data used: 633.5 MB)
Upload: 404.07 Mbps (data used: 258.5 MB)
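To compare the configurations at a glance, the runs can be averaged per kernel. The following is only a sketch: the function name summarize_speedtest is made up, and it assumes the input has the same layout as the list above (a label line followed by its "Download:" and "Upload:" lines). With one to three runs per kernel the averages are only indicative.

```shell
#!/bin/sh
# summarize_speedtest [FILE...]   (hypothetical helper name)
# Average the Download/Upload figures (Mbps) per kernel configuration.
# Any non-empty line that is not a Download/Upload line starts a new group
# and is used as its label. Reads stdin when no FILE is given.
summarize_speedtest() {
    awk '
        /^Download:/ { dl[label] += $2; ndl[label]++; next }
        /^Upload:/   { ul[label] += $2; nul[label]++; next }
        NF           { label = $0; order[++n] = label }
        END {
            for (i = 1; i <= n; i++) {
                l = order[i]
                if (ndl[l] && nul[l])
                    printf "%s | down %.1f | up %.1f | runs %d\n",
                           l, dl[l] / ndl[l], ul[l] / nul[l], ndl[l]
            }
        }
    ' "$@"
}

# Example: summarize_speedtest results.txt
```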
I would not expect any change in performance from omitting unused drivers.
If turning off the other platforms has a performance impact, this could still mean that there is a serious performance regression where we do not
expect it.
The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
here, it may be worth asking on the linux-rt-users list about what the expected cost on arm64 hardware is.
Can you check whether there are any other differences in the .config
file besides CONFIG_PARAVIRT that may cause the difference, and
that you didn't mix up the results?
I see you do a couple of things in this fragment. One of them is the CONFIG_BPF_JIT_ALWAYS_ON=y option that might result in
a significant difference if you actually use BPF (otherwise it makes
no difference).
I also see that you enable a number of debugging options, including CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
the kernel significantly slower when turned on. Is this one enabled
in the other kernels as well, or did you find that it has a positive
effect here?
As mentioned above, turning off the unused platforms /should/ not
make a difference other than code size. Do you get different
results if you drop all the CONFIG_ARCH_*=n lines from the
fragment? If you do, I would consider that a problem in the
upstream kernel that needs to be investigated further.
Sorry for the somewhat late response.
I would not expect any change in performance from omitting unused drivers. If turning off the other platforms has a performance impact, this could still
mean that there is a serious performance regression where we do not
expect it.
I do not know if you meant CONFIG_ARCH_* by "drivers".
Removal of all CONFIG_ARCH_* other than CONFIG_ARCH_BCM2835 disables
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_GENERIC_IRQ_CHIP=y
CONFIG_IRQ_FASTEOI_HIERARCHY_HANDLERS=y
CONFIG_NUMA=n & CONFIG_HOTPLUG_CPU=n disable
CONFIG_HAVE_SETUP_PER_CPU_AREA
and CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK,
and enable CONFIG_ARCH_FLATMEM_ENABLE.
Those changes could have some impact...
The impact of CONFIG_DEBUG_PREEMPT is also higher than I expected
here, it may be worth asking on the linux-rt-users list about what the expected cost on arm64 hardware is.
I believe they are very well aware of this, see https://wiki.linuxfoundation.org/realtime/documentation/howto/applications/preemptrt_setup
Their recommendation seems(?) to be CONFIG_DEBUG_PREEMPT=n
for better performance.
Can you check whether there are any other differences in the .config
file besides CONFIG_PARAVIRT that may cause the difference, and
that you didn't mix up the results?
I believe not.
The difference may come from:
* Too few measurements (only 2 per kernel).
* The measured speed depends on the ISP's IPv6 network, which I cannot hold
constant.
The RPi4B is used for processing real network traffic, and my family complains if it is down for too long...
I see you do a couple of things in this fragment. One of them is the CONFIG_BPF_JIT_ALWAYS_ON=y option that might result in
a significant difference if you actually use BPF (otherwise it makes
no difference).
I believe the measured speed depends on nftables, ipv4-ipv6 tunnel,
macvlan driver, Ethernet driver and the general network stack, not
including BPF.
My network interface configuration is:
ip6tnl1 (tunnel) binds to myve1 (macvlan), and
myve1 binds to eth0, and eth0 has absolutely no IPv4 or IPv6 address.
The reason for using macvlan is to be able to bind multiple macvlan and macvtap
interfaces to eth0.
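For readers who want to reproduce a similar stack, a rough sketch with iproute2 follows. Only the interface names come from the description above. The macvlan mode (bridge), the tunnel mode (ipip6, i.e. IPv4 in IPv6), the placement of the IPv6 address on myve1 and the 2001:db8:: documentation addresses are all assumptions, and setup_stack is a made-up name. By default the commands are only printed.

```shell
#!/bin/sh
# setup_stack [LOCAL_ADDR] [REMOTE_ADDR]   (hypothetical helper name)
# Stack: eth0 (no address) <- myve1 (macvlan) <- ip6tnl1 (tunnel).
# ASSUMPTIONS, not taken from the post: macvlan mode "bridge", tunnel mode
# "ipip6", the address being configured on myve1, and the example addresses.
# Dry run by default: commands are echoed. Set RUN= (empty) and run as root
# to execute them for real.
setup_stack() {
    run=${RUN-echo}
    local_addr=${1:-2001:db8::2}     # hypothetical local IPv6 address
    remote_addr=${2:-2001:db8::1}    # hypothetical tunnel endpoint
    $run ip link set eth0 up
    $run ip link add link eth0 name myve1 type macvlan mode bridge
    $run ip link set myve1 up
    $run ip -6 addr add "$local_addr/64" dev myve1
    $run ip -6 tunnel add ip6tnl1 mode ipip6 local "$local_addr" remote "$remote_addr" dev myve1
    $run ip link set ip6tnl1 up
}

# Example (prints the six commands without changing anything): setup_stack
```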
I also see that you enable a number of debugging options, including CONFIG_UBSAN_SANITIZE_ALL=y, which I would expect to make
the kernel significantly slower when turned on. Is this one enabled
in the other kernels as well, or did you find that it has a positive
effect here?
As far as I can see, CONFIG_UBSAN=y and CONFIG_UBSAN_SANITIZE_ALL=y
have not decreased performance noticeably (for my personal use cases). So I choose to turn them on whenever I have a chance to build a kernel.
As far as I can recall, the CONFIG_UBSAN related options did not
degrade YouTube playback in firefox-esr.
When building user-space applications, I have not seen a "subjectively noticeable"
performance difference from UBSAN either. So I routinely use -fsanitize=undefined.
ASAN and MSAN are terribly slow, as we know well.
As mentioned above, turning off the unused platforms /should/ not
make a difference other than code size. Do you get different
results if you drop all the CONFIG_ARCH_*=n lines from the
fragment? If you do, I would consider that a problem in the
upstream kernel that needs to be investigated further.
Having looked at arch/arm64/Kconfig.platforms, I see some options
depending on CONFIG_ARCH_*. Besides the ones
mentioned at the beginning, they include
IRQ_DOMAIN_HIERARCHY
ARM_GIC
The *IRQ* and ARM_GIC config options can have some impact on performance if a use case involves lots of HW interrupts, as mine does.
I am ready to rebuild a Debian kernel with only the CONFIG_ARCH_* options
(except CONFIG_ARCH_BCM2835) disabled.