I'm trying to kickstart Debian 11 (the port that you build) on my rx2660. However, no luck at all because of a memcpy bug in the 5.x kernels. Could you
please clarify what FW/BMC versions your server has?
Here are the firmware versions of my RX2660:
[mp0017a499dd1c] MP:CM> sr
SYSREV
Current firmware revisions
MP FW : F.02.17
BMC FW : 05.23
EFI FW : ROM A 05.63, ROM B 07.12
System FW : ROM A 01.00, ROM B 04.04, Boot ROM B
UCIO FW : 03.0b
PRS FW : 00.08 UpSeqRev: 02, DownSeqRev: 01
[mp0017a499dd1c] MP:CM>
Hello!
On 4/9/22 10:06, Frank Scheiner wrote:
Could you please also mention what processors you have installed,
Montecito (90xy) or Montvale (91xy[z])? I'm not sure if this makes a
difference, but suspect it could.
glaubitz@electron:~$ cat /proc/cpuinfo |head -n18
processor : 0
vendor : GenuineIntel
arch : IA-64
family : 32
model : 0
model name : Dual-Core Intel(R) Itanium(R) Processor 9050
revision : 5
archrev : 0
features : branchlong, 16-byte atomic ops
cpu number : 0
cpu regs : 4
cpu MHz : 1594.671
itc MHz : 399.166326
BogoMIPS : 3182.59
siblings : 4
physical id: 0
core id : 0
thread id : 0
glaubitz@electron:~$
Could you please also mention what processors you have installed,
Montecito (90xy) or Montvale (91xy[z])? I'm not sure if this makes a difference, but suspect it could.
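For anyone collecting the same information from several machines, the 90xy/91xy rule of thumb above can be applied mechanically to /proc/cpuinfo output. The following is only a minimal sketch: the function name `itanium_generation` is made up for illustration, and it assumes the kernel prints a "model name" line ending in "Processor NNNN".

```python
import re

def itanium_generation(cpuinfo_text):
    """Return (model_number, generation) parsed from /proc/cpuinfo text.

    Returns None when no "model name" line with a four-digit processor
    number is present.
    """
    match = re.search(r"^model name\s*:.*Processor\s+(\d{4})",
                      cpuinfo_text, re.MULTILINE)
    if match is None:
        return None
    number = match.group(1)
    # Rule of thumb from the thread: 90xy is Montecito, 91xy is Montvale.
    if number.startswith("90"):
        return number, "Montecito"
    if number.startswith("91"):
        return number, "Montvale"
    return number, "unknown"

if __name__ == "__main__":
    sample = (
        "processor : 0\n"
        "model name : Dual-Core Intel(R) Itanium(R) Processor 9050\n"
        "revision : 5\n"
    )
    print(itanium_generation(sample))
```

On a live system the text would come from reading /proc/cpuinfo. A suffix letter such as the "M" in 9140M is ignored by this sketch.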
Hi,
On 09.04.22 10:25, John Paul Adrian Glaubitz wrote:
Hello!
On 4/9/22 10:06, Frank Scheiner wrote:
Could you please also mention what processors you have installed,
Montecito (90xy) or Montvale (91xy[z])? I'm not sure if this makes a
difference, but suspect it could.
glaubitz@electron:~$ cat /proc/cpuinfo |head -n18
processor : 0
vendor : GenuineIntel
arch : IA-64
family : 32
model : 0
model name : Dual-Core Intel(R) Itanium(R) Processor 9050
revision : 5
archrev : 0
features : branchlong, 16-byte atomic ops
cpu number : 0
cpu regs : 4
cpu MHz : 1594.671
itc MHz : 399.166326
BogoMIPS : 3182.59
siblings : 4
physical id: 0
core id : 0
thread id : 0
glaubitz@electron:~$
Thanks! I have two of those in one of my rx2660s and the other one has a single 9140M IIRC. But both are in (cold) storage right now, so no way
to get the firmware levels in the near future - unless the spring
returns shortly. ;-)
But I will check my rx2620 for the firmware information and if it runs
with the latest kernel. Processors are Montecitos there (9020s IIRC).
Cheers,
Frank
System FW : ROM A 01.05, ROM B 04.30, Boot ROM B
PDH FW : 50.07
UCIO FW : 03.0b
PRS FW : 00.08 UpSeqRev: 02, DownSeqRev: 01
P/n for this cpu: AH238-2100A (18mb L2 cache)
On Sat, 9 Apr 2022 11:48 Frank Scheiner, <frank.scheiner@web.de> wrote:
Hi,
my info so far:
[root@rx2660 ~]# cat /proc/cpuinfo |less
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong, 16-byte atomic ops
cpu number : 0
cpu regs : 4
cpu MHz : 1594.000670
itc MHz : 399.165948
BogoMIPS : 3186.68
siblings : 4
physical id: 0
core id : 0
thread id : 0
P/n for this cpu: AH238-2100A (18mb L2 cache)
cpuinfo was from RHEL 5.4 environment. IIRC it had 2.6.18 kernel, that's
why such a tight model name. I'll give more details on my next boot with
a fresh snapshot...
None of the kernels from the 5.x tree worked for me. Every time I hit that
memcpy BUG. I've booted off from https://cdimage.debian.org/cdimage/ports/snapshots/2019-07-16/ - it
was perfect in terms of initial boot and launching the install shell.
But then it got stuck at "detect and mount CD" (step 3 of launcher).
On 09.04.22 13:45, Anton Borisov wrote:
my info so far:
[root@rx2660 ~]# cat /proc/cpuinfo |less
vendor : GenuineIntel
arch : IA-64
family : Itanium 2
model : 0
revision : 7
archrev : 0
features : branchlong, 16-byte atomic ops
cpu number : 0
cpu regs : 4
cpu MHz : 1594.000670
itc MHz : 399.165948
BogoMIPS : 3186.68
siblings : 4
physical id: 0
core id : 0
thread id : 0
Hm, this doesn't contain the model name for some reason.
What kernel are you running there currently?
And what 5.x kernel specifically doesn't work correctly for you?
P/n for this cpu: AH238-2100A (18mb L2 cache)
According to its frequency readout it could be a 9040 (Montecito) or a
9140N (Montvale) (see [1] for details).
[1]: https://www.cpu-world.com/CPUs/Itanium_2/index.html
Cheers,
Frank
UPDATE: Just noticed, you seem to use an old snapshot. Maybe better use
a current one.
On 09.04.22 20:43, Anton Borisov wrote:
None of the kernels from the 5.x tree worked for me. Every time I hit that
memcpy BUG. I've booted off from
https://cdimage.debian.org/cdimage/ports/snapshots/2019-07-16/ - it
was perfect in terms of initial boot and launching the install shell.
But then it got stuck at "detect and mount CD" (step 3 of launcher).
IIRC the kernel on the installer ISOs is a generic one. Maybe that makes
a difference.
It is 9040 (from installer's shell):
arch : IA-64
family : 32
model : 0
model name : Dual-Core Intel(R) Itanium(R) Processor 9040
revision : 7
archrev : 0
features : branchlong, 16-byte atomic ops
cpu number : 0
cpu regs : 4
cpu MHz : 1594.666
itc MHz : 399.164976
BogoMIPS : 3182.59
siblings : 4
physical id: 0
core id : 0
thread id : 0
UPDATE: Just noticed, you seem to use an old snapshot. Maybe better use
a current one.
Frank, I tried almost every snapshot from 2021. The only usable one is
from the Debian 10 branch (that's 2019-07-16 :).
The fresh one, i.e. the current one, dated 2022-03-28, triggers a BUG and a
kernel stack error.
One question (a rather basic one):
What is the right configuration for ‘apt’ on a machine with ia64 debian ports?
When I run ‘apt update’ I get the following error:
Get:1 http://ftp.ports.debian.org/debian-ports sid InRelease [24.2 kB]
Err:1 http://ftp.ports.debian.org/debian-ports sid InRelease
The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5A88D659DCB811BB
Reading package lists... Done
W: GPG error: http://ftp.ports.debian.org/debian-ports sid InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY 5A88D659DCB811BB
E: The repository 'http://ftp.ports.debian.org/debian-ports sid InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
Thanks Adrian.
That worked quite well. I was able to refresh the package list with no error after that.
However, when conducting the “upgrade” command, things quickly derailed. I am now in a state where not even “perl” runs.
# perl
perl: error while loading shared libraries: libcrypt.so.1: cannot open shared object file: No such file or directory
I guess it had just been too long since I had taken an upgrade, and things are fragile as it is without letting things
lapse this long. I guess the best next step for me is to install the OS clean from the most recent ISO.
[1] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=974552
So, I went ahead and tried the ISO from your last email: The 2022-03-18 (non-free).
Things didn’t go too well. I am back at having usercopy problems again. Did we have a workaround for that?
[ 1.478621] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 15466496, size 3)!
[ 1.480383] kernel BUG at mm/usercopy.c:99!
[ 1.480383] cryptomgr_test[76]: bugcheck! 0 [1]
[ 1.484383] Modules linked in:
[ 1.484383]
[ 1.484383] CPU: 3 PID: 76 Comm: cryptomgr_test Not tainted 5.16.0-5-itanium #1 Debian 5.16.14-1
[ 1.484383] Hardware name: hp server rx2660 , BIOS 04.30 03/05/2012
[ 1.484383] psr : 00001010084a6010 ifs : 8000000000000410 ip : [<a0000001013389b0>] Not tainted (5.16.0-5-itanium Debian 5.16.14-1)
[ 1.484383] ip is at usercopy_abort+0x120/0x130
[ 1.484383] unat: 0000000000000000 pfs : 0000000000000410 rsc : 0000000000000003
[ 1.484383] rnat: a000000101929380 bsps: 00000000000000ff pr : 00000005666a9655
[ 1.484383] ldrs: 0000000000000000 ccv : 00000000fffff13f fpsr: 0009804c8a70433f
[ 1.484383] csd : 0000000000000000 ssd : 0000000000000000
[ 1.484383] b0 : a0000001013389b0 b6 : a000000100cbd7c0 b7 : a000000100813460
[ 1.484383] f6 : 1003e00000000002c1e6e f7 : 1003e0044b82fa09b5a53
[ 1.484383] f8 : 1003e0000000000000bd7 f9 : 1003e000000000394424f
[ 1.484383] f10 : 1003e20c49ba5e353f7cf f11 : 1003e00000000007547f9
[ 1.484383] r1 : a000000101c1cd70 r2 : a0000001019aa680 r3 : a0000001019aa688
[ 1.484383] r8 : 000000000000001f r9 : a000000101992628 r10 : c0000000ffffefff
[ 1.484383] r11 : 0000000000000003 r12 : e000000101027c70 r13 : e000000101020000
[ 1.484383] r14 : ffffffffffd8d910 r15 : a0000001019aa688 r16 : 00000000ffffefff
[ 1.484383] r17 : 0000000000000001 r18 : e000000101027ba0 r19 : 0000000000000140
[ 1.484383] r20 : 000000000000000f r21 : 0000000000000003 r22 : 0000000000000000
[ 1.484383] r23 : 0000000000000003 r24 : 0000000000000000 r25 : ffffffffffd0c6d1
[ 1.484383] r26 : 000000000000000c r27 : a000000101992680 r28 : 0000000000001000
[ 1.484383] r29 : 0000000000000fff r30 : 0000000000000fff r31 : 0000000000001ffe
[ 1.484383]
[ 1.484383] Call Trace:
[ 1.484383] [<a000000100014c50>] show_stack+0x90/0xc0
[ 1.484383] sp=e0000001010278b0 bsp=e000000101021628
[ 1.484383] [<a000000100015360>] show_regs+0x6e0/0xa40
[ 1.484383] sp=e000000101027a80 bsp=e0000001010215b0
[ 1.484383] [<a000000100026bb0>] die+0x150/0x4c0
[ 1.484383] sp=e000000101027aa0 bsp=e000000101021568
[ 1.484383] [<a000000101366d40>] ia64_bad_break+0x740/0x760
[ 1.484383] sp=e000000101027aa0 bsp=e000000101021538
[ 1.484383] [<a00000010000ca80>] ia64_leave_kernel+0x0/0x270
[ 1.484383] sp=e000000101027aa0 bsp=e000000101021538
[ 1.484383] [<a0000001013389b0>] usercopy_abort+0x120/0x130
[ 1.484383] sp=e000000101027c70 bsp=e0000001010214b8
[ 1.484383] [<a0000001004b83f0>] __check_object_size+0x3f0/0x460
[ 1.484383] sp=e000000101027c80 bsp=e000000101021480
[ 1.484383] [<a00000010081f3e0>] build_test_sglist+0x540/0x8c0
[ 1.484383] sp=e000000101027c80 bsp=e0000001010213b8
[ 1.484383] [<a00000010081fac0>] test_shash_vec_cfg+0x1e0/0xc80
[ 1.484383] sp=e000000101027d00 bsp=e000000101021308
[ 1.484383] [<a000000100829810>] __alg_test_hash.constprop.0+0x2f0/0x760
[ 1.484383] sp=e000000101027da0 bsp=e000000101021260
[ 1.484383] [<a000000100829d90>] alg_test_hash+0x110/0x2e0
[ 1.484383] sp=e000000101027db0 bsp=e000000101021208
[ 1.484383] [<a000000100825a10>] alg_test+0xc50/0xec0
[ 1.484383] sp=e000000101027db0 bsp=e000000101021180
[ 1.484383] [<a00000010081d240>] cryptomgr_test+0x80/0xc0
[ 1.484383] sp=e000000101027e30 bsp=e000000101021160
[ 1.484383] [<a0000001000c08e0>] kthread+0x2e0/0x300
[ 1.484383] sp=e000000101027e30 bsp=e000000101021118
[ 1.484383] [<a00000010000c870>] call_payload+0x50/0x80
[ 1.484383] sp=e000000101027e30 bsp=e000000101021100
[ 1.484383] Disabling lock debugging due to kernel taint
[ 2.127275] Freeing initrd memory: 21920kB freed
[ 6.655281] random: crng init done
I also see there are a couple more recent ISOs. Should I try those first?
Same exact failure using the 2022-03-28 ISO. And it happens even with “hardened_usercopy=off”.
It might also make sense trying to update the system firmware to the latest version you can get
If I am not mistaken, last time we checked, my rx2660 FW version was actually more recent than yours…
From what I can understand by the information in the bugcheck, this is somewhat related to a violation
in parameter copy from user to kernel during some boot-time, crypto, self-test. Does that sound right?
If that is the case, how would this be related to FW?
From what I can understand by the information in the bugcheck, this is somewhat related to a violation
in parameter copy from user to kernel during some boot-time, crypto, self-test. Does that sound right?
If that is the case, how would this be related to FW?
I'm not claiming that it must be related to the firmware, I'm just saying that I don't see this problem
on my RX2660 at all and I have even reinstalled it recently with one of the latest firmware images
without having to pass any parameter to the command line.
On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pmsjt@texair.net> wrote:
On 2022/Apr/25, at 01:14, Frank Scheiner <frank.scheiner@web.de> wrote:
Hi guys,
On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:
From what I can understand by the information in the bugcheck, this is somewhat related to a violation
in parameter copy from user to kernel during some boot-time, crypto, self-test. Does that sound right?
If that is the case, how would this be related to FW?
I'm not claiming that it must be related to the firmware, I'm just saying that I don't see this problem
on my RX2660 at all and I have even reinstalled it recently with one of the latest firmware images
without having to pass any parameter to the command line.
A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
left and Montvale right.
But could still be multiple other reasons we haven't looked at yet in
detail:
* amount of memory installed
* SMT enabled or not
* number of processor modules installed
It might be possible for me to check on my rx2660s (one with Montvale
and one with Montecito(s)) tomorrow. I will then also look at my other
Itanium gear and gather relevant information.
Yes, this sounds more likely to me too.
The crypto self-tests seem to be an innocent bystander here. I tried booting the most recent kernel with the option “cryptomgr.notests” and it went much farther. Alas it still failed with another buffer copy validation for a different caller altogether:
[ 3.836466] [<a000000101353690>] usercopy_abort+0x120/0x130
[ 3.836466] sp=e0000001000cfdf0 bsp=e0000001000c9388
[ 3.836466] [<a0000001004c5660>] __check_object_size+0x3c0/0x420
[ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c9350
[ 3.836466] [<a000000100570030>] sys_getcwd+0x250/0x420
[ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c92c8
[ 3.836466] [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
[ 3.836466] sp=e0000001000cfe30 bsp=e0000001000c92c8
[ 3.836466] [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
[ 3.836466] sp=e0000001000d0000 bsp=e0000001000c92c8
This suggests the bug might be in the logic validating these buffers against the allocations (heap, span, etc).
I don’t know why hardened_usercopy=off is not being observed by the kernel. As a work-around I am compiling myself a new kernel with CONFIG_HARDENED_USERCOPY disabled at the source.
Even with kernel "Linux debian 4.19.0-5-mckinley #1 SMP Debian 4.19.37-5 (2019-06-19) ia64 GNU/Linux"
Things are still not 100%. After a few hours into building the kernel it started crashing also with usercopy validations but, this time, the other way around. And because it was the other way around, it led to process termination instead of full-blown bugcheck. This could be related or not. Could very well be a different bug that happens to manifest itself around the same validation.
CC [M] drivers/net/wireless/realtek/rtw88/rtw8822be.o
LD [M] drivers/net/wireless/realtek/rtw88/rtw88_8822be.o
CC [M] drivers/net/wireless/realtek/rtw88/rtw8822c.o
Segmentation fault
make[5]: *** [scripts/Makefile.build:293: drivers/net/wireless/realtek/rtw88/rtw8822c.o] Error 139
make[5]: *** Deleting file 'drivers/net/wireless/realtek/rtw88/rtw8822c.o'
make[4]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek/rtw88] Error 2
make[3]: *** [scripts/Makefile.build:555: drivers/net/wireless/realtek] Error 2
make[2]: *** [scripts/Makefile.build:555: drivers/net/wireless] Error 2
make[1]: *** [scripts/Makefile.build:555: drivers/net] Error 2
make: *** [Makefile:1855: drivers] Error 2
pmsjt@debian:~/linux-source-5.17$ make
Message from syslogd@debian at Apr 25 07:58:08 ...
kernel:[23420.984012] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1916912, size 8)!
Message from syslogd@debian at Apr 25 07:58:08 ...
kernel:[23421.268009] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1818608, size 8)!
HOSTCC scripts/sign-file
CALL scripts/checksyscalls.sh
<stdin>:1517:2: warning: #warning syscall clone3 not implemented [-Wcpp]
CALL scripts/atomic/check-atomics.sh
CHK include/generated/compile.h
make[2]: *** [scripts/Makefile.build:294: arch/ia64/kernel/signal.o] Segmentation fault
Message from syslogd@debian at Apr 25 07:58:11 ...
kernel:[23423.626254] usercopy: Kernel memory overwrite attempt detected to linear kernel text (offset 1933296, size 8)!
make[1]: *** [scripts/Makefile.build:555: arch/ia64/kernel] Error 2
make: *** [Makefile:1855: arch/ia64] Error 2
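When comparing reports from different machines it helps to pull the object name, offset and size out of the usercopy messages shown in the logs above. The sketch below is a hypothetical helper (`parse_usercopy` is not an existing tool). The pattern is modelled only on the message wording that appears in this thread; other wordings, such as a different direction of copy, are assumed to follow the same shape.

```python
import re

# Pattern modelled on the lines quoted in this thread, e.g.
# "usercopy: Kernel memory overwrite attempt detected to linear kernel text
#  (offset 1916912, size 8)!"
USERCOPY_RE = re.compile(
    r"usercopy: Kernel memory (?P<kind>\w+) attempt detected "
    r"(?P<direction>to|from) (?P<object>.+?) "
    r"\(offset (?P<offset>\d+), size (?P<size>\d+)\)!"
)

def parse_usercopy(line):
    """Return a dict describing one usercopy report, or None if no match."""
    match = USERCOPY_RE.search(line)
    if match is None:
        return None
    return {
        "kind": match.group("kind"),            # "overwrite" in this thread
        "direction": match.group("direction"),  # "to" in this thread
        "object": match.group("object"),        # e.g. "linear kernel text"
        "offset": int(match.group("offset")),
        "size": int(match.group("size")),
    }

if __name__ == "__main__":
    line = ("kernel:[23420.984012] usercopy: Kernel memory overwrite attempt "
            "detected to linear kernel text (offset 1916912, size 8)!")
    print(parse_usercopy(line))
```

Any timestamp or syslog prefix in front of "usercopy:" is skipped, so the same function can be fed lines from dmesg or from the console messages above.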
So, I finished compiling my kernel with CONFIG_HARDENED_USERCOPY disabled. Guess what:
pmsjt@debian:~$ uname -a
Linux debian 5.17.3-rt17 #2 SMP Mon Apr 25 16:55:00 PDT 2022 ia64 GNU/Linux
Yup, the system starts just fine with the most recent kernel. So, two things we can infer from this:
- Yes, usercopy validation appears to be broken. The contours of how broken it is are yet unknown but we’ll have to investigate to see what part of the validation is failing.
- hardened_usercopy=off seems to be ignored by current kernels. When passing this option the system was still failing just the same.
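Before concluding that the parameter is ignored, it may be worth ruling out that it never reached the kernel (for example, a boot loader entry that was not updated). The following is a minimal sketch with made-up helper names. It works on text, so it can be fed the contents of /proc/cmdline and of the kernel config file, which on Debian is assumed to live at /boot/config-<kernel release>.

```python
def cmdline_has(cmdline_text, parameter):
    """True if the exact parameter (e.g. "hardened_usercopy=off") was passed."""
    return parameter in cmdline_text.split()

def config_state(config_text, option):
    """Return the value of a kernel config option.

    Returns the text after "=" (e.g. "y"), "not set" for the commented-out
    form, or "absent" if the option does not appear at all.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if line == "# %s is not set" % option:
            return "not set"
        if line.startswith(option + "="):
            return line.split("=", 1)[1]
    return "absent"

if __name__ == "__main__":
    # Sample inputs only; real data would be read from the files named above.
    cmdline = "root=/dev/sda2 ro hardened_usercopy=off"
    config = "CONFIG_HARDENED_USERCOPY=y\n# CONFIG_HARDENED_USERCOPY_FALLBACK is not set\n"
    print(cmdline_has(cmdline, "hardened_usercopy=off"))
    print(config_state(config, "CONFIG_HARDENED_USERCOPY"))
```

If the parameter is present on the command line and the option is "y" while the failures persist, that supports the observation above. It does not by itself explain why the parameter has no effect.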
On 2022/Apr/26, at 06:34, Frank Scheiner <frank.scheiner@web.de> wrote:
@Anton:
So maybe best to give `hardened_usercopy=off` a try on your rx2660, too.
From my testing on rx2660 and rx2620 this seems to unbreak the kernel
boot and maybe also makes it less likely to hit the problem post boot. I
don't know why Adrian's rx2660 seems to be unaffected by this, though.
I did. That is why I ended up compiling 5.17 with the entire thing turned off. With 5.17, on my rx2660 Montvale with 8 cores the machine can’t get past early boot even with hardened_usercopy=off.
Those ‘warnings' are actually processes being killed. And they depend on the direction the bad copy was happening.
If you look at my prior responses, with the 4.19 kernel I was also running along fine for hours and, after some time building the kernel (a benchmark in itself), it would start producing these warnings and would not allow compilation to continue any further. I would reboot the machine and that gave me a few more hours. When I tried 'hardened_usercopy=off’ on the 4.19 kernel that worked. I no longer got these process terminations after a few hours and the machine was able to build the entire kernel.
So, 4.19 and 5.17 are different in many ways (symptom-wise):
- I never got a bugcheck (panic) level failure on the 4.19. They were all process termination level.
- On the 4.19 these took quite some time to show up. Seemed to depend on the number of processes created in the past and was mitigated by a reboot. On the 5.17 it was very aggressive, showing up early in boot, even on system threads like the crypto boot self test. Disabling the crypto boot self test made it go farther but not much. If the error is detected on a system thread, there is no process to terminate: it is game over.
- hardened_usercopy=off was observed by 4.19 but ignored by 5.17
I don’t exclude the possibility of human error in conducting all these experiments (some of the process is error prone), but I did run these experiments more than just a few times, so it would have to be a heck of a coincidence to end up with consistent results.
I'll now look at my other Itanium gear, rx2800 i2 first,
We can certainly send a pull request to the Debian kernel packaging repository to disable
CONFIG_HARDENED_USERCOPY although I'm not sure what ramifications that would have.
But since the feature is broken on Itanium anyway, I guess it won't hurt.
[1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/469
On 2022/Apr/26, at 01:01, Sergei Trofimovich <slyich@gmail.com> wrote:
On Tue, 26 Apr 2022 02:43:00 +0000
Pedro Miguel Justo <pmsjt@texair.net> wrote:
On 2022/Apr/25, at 14:27, Pedro Miguel Justo <pmsjt@texair.net> wrote:
On 2022/Apr/25, at 14:09, Sergei Trofimovich <slyich@gmail.com> wrote:
On Mon, 25 Apr 2022 15:07:58 +0000
Pedro Miguel Justo <pmsjt@texair.net> wrote:
On 2022/Apr/25, at 01:22, Pedro Miguel Justo <pmsjt@texair.net> wrote:
On 2022/Apr/25, at 01:14, Frank Scheiner <frank.scheiner@web.de> wrote:
Hi guys,
On 25.04.22 10:09, John Paul Adrian Glaubitz wrote:
From what I can understand by the information in the bugcheck, this is somewhat related to a violation
in parameter copy from user to kernel during some boot-time, crypto, self-test. Does that sound right?
If that is the case, how would this be related to FW?
I'm not claiming that it must be related to the firmware, I'm just saying that I don't see this problem
on my RX2660 at all and I have even reinstalled it recently with one of the latest firmware images
without having to pass any parameter to the command line.
A difference between Adrian's rx2660 and Pedro's rx2660 is Montecito
left and Montvale right.
But could still be multiple other reasons we haven't looked at yet in
detail:
* amount of memory installed
* SMT enabled or not
* number of processor modules installed
It might be possible for me to check on my rx2660s (one with Montvale
and one with Montecito(s)) tomorrow. I will then also look at my other
Itanium gear and gather relevant information.
Yes, this sounds more likely to me too.
The crypto self-tests seem to be an innocent bystander here. I tried booting the most recent kernel with the option “cryptomgr.notests” and it went much farther. Alas it still failed with another buffer copy validation for a different caller altogether:
[ 3.836466] [<a000000101353690>] usercopy_abort+0x120/0x130
[ 3.836466] sp=e0000001000cfdf0 bsp=e0000001000c9388
[ 3.836466] [<a0000001004c5660>] __check_object_size+0x3c0/0x420
[ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c9350
[ 3.836466] [<a000000100570030>] sys_getcwd+0x250/0x420
[ 3.836466] sp=e0000001000cfe00 bsp=e0000001000c92c8
[ 3.836466] [<a00000010000c860>] ia64_ret_from_syscall+0x0/0x20
[ 3.836466] sp=e0000001000cfe30 bsp=e0000001000c92c8
[ 3.836466] [<a000000000040720>] ia64_ivt+0xffffffff00040720/0x400
[ 3.836466] sp=e0000001000d0000 bsp=e0000001000c92c8
This suggests the bug might be in the logic validating these buffers against the allocations (heap, span, etc).
I don’t know why hardened_usercopy=off is not being observed by the kernel. As a work-around I am compiling myself a new kernel with CONFIG_HARDENED_USERCOPY disabled at the source.
So, I finished compiling my kernel with CONFIG_HARDENED_USERCOPY disabled. Guess what:
pmsjt@debian:~$ uname -a
Linux debian 5.17.3-rt17 #2 SMP Mon Apr 25 16:55:00 PDT 2022 ia64 GNU/Linux
Yup, the system starts just fine with the most recent kernel. So, two things we can infer from this:
- Yes, usercopy validation appears to be broken. The contours of how broken it is are yet unknown but we’ll have to investigate to see what part of the validation is failing.
My vague memory tells me that the breakage mechanics is the following:
1. [ok] mm/usercopy.c:check_kernel_text_object() gets called on various kernel addresses
2. [ok] ia64 address space has at least two widely used kernel addresses:
- 0xe0000000_00000000-0xe000..ff-ffffffff - proper linear mapping, 0xe0000000_00000000 + phys
(so called, RGN_KERNEL = 7)
- 0xa0000001_00000000-0xa0000000_00ffffff - kernel ~linear mapping, 0xa0000001_00000000 + phys
(so called, RGN_GATE = 5)
Both are easy to translate to proper linear address, but the translations are not identical. Note that RGN_GATE has an offset.
ia64's stack for some reason lives in RGN_KERNEL space (probably due to historical reasons before kernel switched to vmapped stacks on other arches).
You can see these addresses in backtraces as
sp=e0000001000cfdf0 bsp=e0000001000c9388
ia64's kernel code on the other hand lives in RGN_GATE.
You can see these addresses in backtraces as
[<a000000101353690>] usercopy_abort+0x120/0x130
Now, check_kernel_text_object() compares passed kernel address for 2 ranges:
- RGN_GATE:
unsigned long textlow = (unsigned long)_stext;
unsigned long texthigh = (unsigned long)_etext;
if (overlaps(ptr, n, textlow, texthigh))
usercopy_abort("kernel text", NULL, to_user, ptr - textlow, n);
I think this check is correct. There is no address translation.
- RGN_KERNEL:
textlow_linear = (unsigned long)lm_alias(textlow);
texthigh_linear = (unsigned long)lm_alias(texthigh);
if (overlaps(ptr, n, textlow_linear, texthigh_linear))
usercopy_abort("linear kernel text", NULL, to_user,...
I think this one is wrong. Looking at the definition of lm_alias()
it can be used only on RGN_KERNEL addresses (which looks useless,
they are already in correct form) and literal symbol names:
include/linux/mm.h:#define lm_alias(x) __va(__pa_symbol(x))
include/linux/mm.h:#define __pa_symbol(x) __pa(RELOC_HIDE((unsigned long)(x), 0))
arch/ia64/include/asm/page.h:#define __pa(x) ({ia64_va _v; _v.l = (long) (x); _v.f.reg = 0; _v.l;})
#define __va(x) ({ia64_va _v; _v.l = (long) (x); _v.f.reg = -1; _v.p;})
I think that means __pa_symbol(x) against local variables like textlow
is completely broken.
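To make the arithmetic of the macros quoted above easier to follow, here is a small user-space model. It is only a sketch under stated assumptions: the region number is taken to be the top three bits of a 64-bit address, as the `reg` field in the quoted `__pa()`/`__va()` definitions suggests, and `overlaps()` is a simplified interval test. The two text bounds are stand-ins taken from code addresses visible in the first backtrace in this thread, not the real `_stext`/`_etext`, and the stack pointer is just an example RGN_KERNEL address from the same backtrace, not the pointer that actually tripped the check.

```python
REGION_SHIFT = 61                       # assumed: region number in bits 63..61
OFFSET_MASK = (1 << REGION_SHIFT) - 1   # low 61 bits: offset within a region

def region(addr):
    """Region number of a 64-bit virtual address."""
    return addr >> REGION_SHIFT

def pa(addr):
    """Model of the quoted __pa(): set the region field to 0."""
    return addr & OFFSET_MASK

def va(addr):
    """Model of the quoted __va(): set the 3-bit region field to all ones."""
    return (addr & OFFSET_MASK) | (7 << REGION_SHIFT)

def lm_alias(addr):
    """Model of lm_alias(x) = __va(__pa_symbol(x)) applied to an address value."""
    return va(pa(addr))

def overlaps(ptr, n, low, high):
    """Simplified test: does [ptr, ptr + n) intersect [low, high)?"""
    return not (ptr >= high or ptr + n <= low)

# Stand-in text bounds: lowest and highest code addresses in the first
# backtrace (call_payload and ia64_bad_break). Hypothetical, for illustration.
TEXT_LOW = 0xa00000010000c870
TEXT_HIGH = 0xa000000101366d40
# Example stack pointer from the same backtrace (usercopy_abort frame).
STACK_PTR = 0xe000000101027c70

if __name__ == "__main__":
    print(region(TEXT_LOW), region(lm_alias(TEXT_LOW)))
    print(hex(lm_alias(TEXT_LOW)), hex(lm_alias(TEXT_HIGH)))
    print(overlaps(STACK_PTR, 8, lm_alias(TEXT_LOW), lm_alias(TEXT_HIGH)))
```

In this model the region bits are simply replaced, so a text address in region 5 turns into an address with the same offset in region 7. With the stand-in bounds, an ordinary region 7 address such as the example stack pointer then falls inside the computed "linear kernel text" range. That is consistent with the reports in this thread but does not by itself confirm the root cause.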
If lm_alias() works at all on ia64 maybe that is enough to unbreak usercopy:
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -133,13 +133,13 @@ static inline void check_kernel_text_object(const unsigned long ptr,
* __pa() is not just the reverse of __va(). This can be detected
* and checked:
*/
- textlow_linear = (unsigned long)lm_alias(textlow);
+ textlow_linear = (unsigned long)lm_alias(_stext);
/* No different mapping: we're done. */
if (textlow_linear == textlow)
return;
/* Check the secondary mapping... */
- texthigh_linear = (unsigned long)lm_alias(texthigh);
+ texthigh_linear = (unsigned long)lm_alias(_etext);
if (overlaps(ptr, n, textlow_linear, texthigh_linear))
usercopy_abort("linear kernel text", NULL, to_user,
ptr - textlow_linear, n);
If not, here is an ia64-specific workaround:
--- a/mm/usercopy.c
+++ b/mm/usercopy.c
@@ -133,13 +133,13 @@ static inline void check_kernel_text_object(const unsigned long ptr,
* __pa() is not just the reverse of __va(). This can be detected
* and checked:
*/
- textlow_linear = (unsigned long)lm_alias(textlow);
+ textlow_linear = (unsigned long)ia64_imva(textlow);
/* No different mapping: we're done. */
if (textlow_linear == textlow)
return;
/* Check the secondary mapping... */
- texthigh_linear = (unsigned long)lm_alias(texthigh);
+ texthigh_linear = (unsigned long)ia64_imva(texthigh);
if (overlaps(ptr, n, textlow_linear, texthigh_linear))
usercopy_abort("linear kernel text", NULL, to_user,
ptr - textlow_linear, n);
CAVEAT: I did not compile- or run- test it. Might need a bit of pointer/long casting around.
I don’t mind at all giving this a try. Sounds quite plausible.
Do you think the stacks (and RSE) will eventually have to move from RGN_KERNEL to vmapped, the same as on other arches, to avoid issues like this and others to come?
Booting `Debian GNU/Linux Sid (diskless)'
Loading Linux kernel ...
Loading initial ramdisk ...
[ 0.000000] Linux version 5.17.0-1-mckinley (debian-kernel@lists.debian.org) (gcc-11 (Debian 11.2.0-20) 11.2.0, GNU
ld (GNU Binutils for Debian) 2.38) #1 SMP Debian 5.17.3-1 (2022-04-18)
[ 0.000000] efi: EFI v2.00 by HP
[ 0.000000] efi: SALsystab=0x3ee7a000 ACPI 2.0=0x3fde6000
ESI=0x3ee7b000 SMBIOS=0x3ee7c000 HCDP=0x3fde4000
[ 0.000000] PCDP: v3 at 0x3fde4000
[...]
[ 1.199313] zbud: loaded
[ 1.199313] integrity: Platform Keyring initialized
[ 1.199313] Key type asymmetric registered
[ 1.199313] Asymmetric key parser 'x509' registered
[ 1.927433] Freeing initrd memory: 26688kB freed
[ 1.930079] usercopy: Kernel memory overwrite attempt detected to
linear kernel text (offset 450555, size 4)!
[ 1.930079] kernel BUG at mm/usercopy.c:100!
[ 1.930079] kworker/u16:1[71]: bugcheck! 0 [1]
[ 1.930079] Modules linked in:
[ 1.930079]
[ 1.930079] CPU: 3 PID: 71 Comm: kworker/u16:1 Not tainted 5.17.0-1-mckinley #1 Debian 5.17.3-1
[ 1.930079] Hardware name: hp server rx2660 , BIOS
04.04 07/15/2008
[ 1.930079] psr : 00001010084a6010 ifs : 8000000000000410 ip : [<a000000101353690>] Not tainted (5.17.0-1-mckinley Debian 5.17.3-1)
[ 1.930079] ip is at usercopy_abort+0x120/0x130
[...]
Is it possible to file a bug report at the Linux kernel bugzilla and post
the full stack trace with the description
to: linux-ia64@vger.k.o
cc: linux-kernel@vger.k.o , linux-mm@kvack.org
and CC the author of the patch as well? https://lwn.net/Articles/694470/
On 27.04.22 14:45, Anatoly Pugachev wrote:
Is it possible to file a bug report at the Linux kernel bugzilla and post
the full stack trace with the description
to: linux-ia64@vger.k.o
cc: linux-kernel@vger.k.o , linux-mm@kvack.org
and CC the author of the patch as well?
https://lwn.net/Articles/694470/
I've booted off with Gentoo (https://bouncer.gentoo.org/fetch/root/all/releases/ia64/autobuilds/20220601T030345Z/install-ia64-minimal-20220601T030345Z.iso).
Default parameters, i.e. without hardmemcpy.
No kernel faults at all.
Kernel: 5.18.1-gentoo-r1-ia64
Log is here: https://pastebin.com/gzLFyUKx
On 2022/Jun/06, at 03:44, Frank Scheiner <frank.scheiner@web.de> wrote:
On 05.06.22 20:00, Anton Borisov wrote:
I've booted off with Gentoo
(https://bouncer.gentoo.org/fetch/root/all/releases/ia64/autobuilds/20220601T030345Z/install-ia64-minimal-20220601T030345Z.iso).
Default parameters, i.e. without hardmemcpy.
Just for clarification:
Is hardened memcopy "just" deconfigured in the kernel config for the
used kernel...
Any chance this kernel was produced after Adrian’s PR [1] was merged upstream?
[1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/469