Hi Dennis,
Unless you already know that your system's memory is ok...
I am not so sure about this yet until I can rebuild the required grub
binaries with full debug info. For at least a year (or more) I have
seen "really bad things"(tm) happen when I try to make a new initrd on
sparc64. Generally the machine seems to pack up and go away with nary a
single packet out to the world. To look into this problem I use a
serial-attached good old 9600 baud console and watch what happens when
I try to do a make install from within the Linux source tree:
(...)
So I think there is a bug in /usr/sbin/grub-probe, and it kills the
whole "make install" process from within the Linux kernel source tree,
or any other way you choose to run it.
Has anyone else seen this?
Hello Dennis!
On 4/2/22 03:34, Dennis Clarke wrote:
I am not so sure about this yet until I can rebuild the required grub
binaries with full debug info. For at least a year ( or more ) I have
seen "really bad things"(tm) happen when I try to make a new initrd on
sparc64. Generally the machine seems to pack up and go away with nary a
single packet out to the world. To look into this problem I use a serial
attached good old 9600 baud console and watch what happens when I try to
do a make install from within the Linux source tree :
(...)
So therefore I think that there is a bug in /usr/sbin/grub-probe and it
really kills the whole "make install" process from within the Linux
kernel source tree or any other way you choose to run it.
Has anyone else seen this ?
This isn't a bug in GRUB but a kernel bug that affects older SPARC machines like your UltraSPARC IIIi. Unfortunately, no one has had the time yet to bisect
this issue.
But since you seem to have a reliable reproducer, you can start trying to bisect
the kernel to find the commit that introduced this regression.
That will be nearly impossible. I cannot even recall when the bug first
appeared or when I could last run update-grub without the machine
locking up. It has been at least two years now, maybe three.
Also this is an even older UltraSparc IIi type machine. Really I should
have tossed it out long ago but the next machine I have handy is a
Fujitsu M3000 unit and I thought I had heard it was impossible to get
Linux on such a beast for unknown reasons. Could be myth or rumour but I thought the M3000 was somehow "special". The larger M4000 seems to be
fine but those are just nasty large beasts to run in a home lab.
Dragging the deep waters looking for that kernel bug will take a lot of
time. Possibly even some luck.
Hello!
On 4/3/22 13:42, Dennis Clarke wrote:
But since you seem to have a reliable reproducer, you can start trying to bisect
the kernel to find the commit that introduced this regression.
That will be nearly impossible. I can not even recall when the bug first
appeared or when was the last time that I could run update-grub without
the machine locking up. At least two years now. Maybe three.
What do you mean by impossible? Bisecting the bug, or the fact that it is
a kernel bug? I know very well it's a kernel bug because it does not occur
when using the 4.19 kernel on any of the affected SPARCs, and it does not
occur on any of the newer SPARCs with a current kernel.
The SPARC T2 and T5 we are using don't have the problem at all, for example.
Also this is an even older UltraSparc IIi type machine. Really I should
have tossed it out long ago but the next machine I have handy is a
Fujitsu M3000 unit and I thought I had heard it was impossible to get
Linux on such a beast for unknown reasons. Could be myth or rumour but I
thought the M3000 was somehow "special". The larger M4000 seems to be
fine but those are just nasty large beasts to run in a home lab.
Dragging the deep waters looking for that kernel bug will take a lot of
time. Possibly even some luck.
Not really. You cross-build the kernel, transfer it to the machine and see if update-grub works.
But I can do it myself if I find the time; I have an Ultra 45 that can be
used for that. I just thought it would be nice to get a helping hand,
especially since cross-compiling and bisecting the kernel isn't really hard,
it just takes time.
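For anyone following along, here is a rough sketch of that loop. It assumes a Debian-style cross toolchain with the prefix sparc64-linux-gnu- and takes v4.19 as the known-good tag, based only on the statement above that 4.19 does not show the problem. The helper names and outcome labels are made up for illustration, and the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only: prints commands instead of running them.
# The toolchain prefix is an assumption; adjust it to whatever cross
# compiler is actually installed on the build host.
CROSS_PREFIX="sparc64-linux-gnu-"

# Print the cross-build command for the currently checked-out commit.
cross_build_cmd() {
    jobs="${1:-4}"
    echo "make ARCH=sparc64 CROSS_COMPILE=${CROSS_PREFIX} -j${jobs}"
}

# Map the result of one manual test round to the git bisect subcommand.
# The outcome labels are hypothetical names chosen for this sketch.
bisect_mark() {
    case "$1" in
        survived)     echo "git bisect good" ;;  # update-grub finished
        locked-up)    echo "git bisect bad"  ;;  # machine hung
        build-failed) echo "git bisect skip" ;;  # commit cannot be tested
        *) echo "unknown outcome: $1" >&2; return 2 ;;
    esac
}

# One-time setup, run by hand in the kernel tree:
#   git bisect start
#   git bisect good v4.19        # reported above as not affected
#   git bisect bad  <bad-ref>    # placeholder: any version that locks up
```

Each round is then: build with the printed make command, copy the kernel to the SPARC box, boot it, try update-grub, and run whatever bisect_mark prints for the outcome.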
I am curious if you can get the linux-4.19.114 kernel to compile. For
me it just blows up with :
.
.
.
arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
648 | if (!strcmp(names + ep[ret].name_offset, name))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
693 | if (!strcmp(names + ep->name_offset, name)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
720 | if (strcmp(names + ep->name_offset, arc_type))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
make: *** [Makefile:1053: arch/sparc] Error 2
Not sure what to make of that.
Hello Adrian and Dennis,
If this problem is expected to occur on an Ultra 5 or an Ultra 30,
please let me know and I'll be happy to help with a git bisect, using a
spare 9 GB disk for the installation.
Are you sure about 4.19? I see that 4.19.237 exists but I will guess the
same bug exists there also. I was going to begin with 4.19.114, which was
released 02-Apr-2020. A solid two years ago seems like as good a place
to start as any. However, building the kernel will require that I create
an initrd and also update grub, etc. I can do that manually and then
bypass the "update-grub" process entirely.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/refs/tags
Not really. You cross-build the kernel, transfer it to the machine and see if
update-grub works.
Hold on. This sounds like a chicken and egg scenario. The update-grub
will fail every time. I will need to do the process by hand with an edit
to grub.cfg and with the files needed dropped into /boot with the few
kernel modules needed in /lib/modules/foo. That should be enough to at
least boot.
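A minimal sketch of the grub.cfg part of that by-hand approach. Everything passed to the function is a placeholder (title, kernel path, initrd path, root device); none of those values come from this thread. Paths are written relative to the filesystem GRUB reads the kernel from, and the entry assumes GRUB's root is already set to that filesystem earlier in grub.cfg.

```shell
#!/bin/sh
# Sketch only: prints a GRUB menu entry for a hand-installed kernel.
# All four arguments are placeholders supplied by the caller.
manual_menuentry() {
    title="$1"    # label shown in the GRUB menu
    kernel="$2"   # kernel image path as GRUB sees it
    initrd="$3"   # initrd path as GRUB sees it
    rootdev="$4"  # root filesystem handed to the kernel
    cat <<EOF
menuentry '${title}' {
    linux ${kernel} root=${rootdev} ro
    initrd ${initrd}
}
EOF
}

# Example call with made-up values:
#   manual_menuentry 'Linux 4.19.114 (manual)' \
#       /vmlinux-test /initrd-test /dev/sda2
```

The printed stanza would be pasted into grub.cfg by hand, which sidesteps update-grub and grub-probe for the test boots.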
I have already started the process but I am starting with 4.19.114.
My intuition tells me the bug is likely in arch/sparc/kernel/syscalls.S,
which has changed slightly since the 4.19.114 days. Looking at earlier
versions, I see no change in that source file. Regardless, this is just a
hunch without a shred of proof. Yet.
No, I am not. I am going with whatever is in the Makefile.
https://github.com/torvalds/linux/commit/fc7c028dcdbfe981bca75d2a7b95f363eb691ef3
So this was seen before regardless.
On 4/3/22 17:19, Dennis Clarke wrote:
I am curious if you can get the linux-4.19.114 kernel to compile. For me it just blows up with :
.
.
.
arch/sparc/kernel/mdesc.c: In function 'mdesc_node_by_name':
arch/sparc/kernel/mdesc.c:648:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
648 | if (!strcmp(names + ep[ret].name_offset, name))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
arch/sparc/kernel/mdesc.c: In function 'mdesc_get_property':
arch/sparc/kernel/mdesc.c:693:22: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
693 | if (!strcmp(names + ep->name_offset, name)) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
arch/sparc/kernel/mdesc.c: In function 'mdesc_next_arc':
arch/sparc/kernel/mdesc.c:720:21: error: 'strcmp' reading 1 or more bytes from a region of size 0 [-Werror=stringop-overread]
720 | if (strcmp(names + ep->name_offset, arc_type))
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
arch/sparc/kernel/mdesc.c:78:33: note: at offset 16 into source object 'mdesc' of size 16
78 | struct mdesc_hdr mdesc;
| ^~~~~
cc1: all warnings being treated as errors
make[2]: *** [scripts/Makefile.build:304: arch/sparc/kernel/mdesc.o] Error 1
make[1]: *** [scripts/Makefile.build:544: arch/sparc/kernel] Error 2
make: *** [Makefile:1053: arch/sparc] Error 2
Not sure what to make of that.
Well, it's right there: you are building with -Werror enabled. You have to disable that.
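One possible way to do that, as a sketch. It assumes the flag comes from a ccflags-y line in one of the Makefiles under arch/sparc, which is not confirmed anywhere in this thread, so check first with something like grep -rn -- -Werror arch/sparc. The function name is made up, and it only prints the filtered Makefile to stdout instead of editing anything in place.

```shell
#!/bin/sh
# Sketch only: print a Makefile with a standalone -Werror token removed
# from its ccflags-y lines. Other lines are passed through unchanged.
# Assumption: -Werror is set via ccflags-y, as described in the lead-in.
strip_werror() {
    sed -e '/^ccflags-y/ s/[[:space:]]-Werror$//' \
        -e '/^ccflags-y/ s/[[:space:]]-Werror[[:space:]]/ /' "$1"
}

# Example, run by hand from the top of the kernel tree:
#   strip_werror arch/sparc/kernel/Makefile > /tmp/Makefile.nowerror
#   diff -u arch/sparc/kernel/Makefile /tmp/Makefile.nowerror
```

Reviewing the diff before copying the result back keeps the change visible, which matters during a bisect because the tree has to be returned to a clean state before moving to the next commit.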
On Apr 3, 2022, at 8:28 PM, Stan Johnson <userm57@yahoo.com> wrote:
On 4/3/22 11:04 AM, John Paul Adrian Glaubitz wrote:
Hi Stan!
On 4/3/22 16:39, Stan Johnson wrote:
If this problem is expected to occur on an Ultra 5 or an Ultra 30,
please let me know and I'll be happy to help with a git bisect, using a
spare 9 GB disk for the installation.
I think you should see the issue on both the Ultra 5 and Ultra 30.
...
I wasn't able to get my Ultra 5 working; the video signal kept cycling
on and off for some reason, and the CD drive wasn't seen, though it was
seen well enough to boot the installation and get to the point where it
said no CD drive was found.
But I was able to confirm that the "grub-probe" bug doesn't seem to
affect the Ultra 30.
-----
There were a few oddities, but only #6 is serious (apparently a libc
bug, not a kernel bug).
1) I see that /dev/sda1 is mounted as /boot, not /boot/grub. So all the
kernels will end up in /dev/sda1. I haven't tested how (or whether) that
will affect kernels for other operating systems (e.g. Gentoo).
2) Please confirm that grub-install never needs to be run. It appears
not to be needed, since update-grub updates /boot/grub/grub.cfg directly.
3) At system boot, when GRUB runs, it complains that it is out of
memory, but it seems to work anyway.
4) During installation, the disk partitioner said "The disk has 562253
cylinders which is greater than the maximum of 65536.", but that error
didn't seem to affect anything.
6) In Xfce, a login at the console worked once, but it is now failing
consistently (even after a reboot), with this error message in dmesg:
xfce4-session[3980]: segfault at 0 ip fffff8010263c9b4 (rpc fffff801020efbb8) sp 000007feff8dc451 error 1 in libc-2.33.so[fffff801025b0000+164000]