while I was able to "install" correctly using a slightly older ISO, I get not a bootable
system. The kernel appears to crash very early during boot.
John Paul Adrian Glaubitz wrote:
while I was able to "install" correctly using a slightly older ISO, I get not a bootable
system. The kernel appears to crash very early during boot.
I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.
I don't think so... I think it is a Kernel issue, since with kernel 5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux
the machine is performing fine with network, disk and compiler usage on all 32 CPUs.
Hi!
On 3/9/21 6:26 PM, Riccardo Mottola wrote:
John Paul Adrian Glaubitz wrote:
while I was able to "install" correctly using a slightly older ISO, I get not a bootable
system. The kernel appears to crash very early during boot.
I think this is more likely a hardware issue. We haven't seen any machines crashing that
early. Please make sure the RAM modules in this machine are working properly.
I don't think so... I think it is a Kernel issue, since with kernel
5.9.0-2-sparc64-smp #1 SMP Debian 5.9.6-1 (2020-11-08) sparc64 GNU/Linux
the machine is performing fine with network, disk and compiler usage on all 32 CPUs.
Then you need to bisect the kernel as I don't have any means to reproduce the issue.
I have a T1000 with which I could try to reproduce Riccardo's issues. Hardware wise they should be pretty similar. As the T1000 doesn't have a CDROM, I'll try to netboot a few newer kernels and report my findings.
Will take me until next week though, as the machine is in (cold) storage
now.
@Adrian:
Aren't there some build servers using UltraSPARC T2 or T2+? Do they run
with the latest kernels?
The oldest buildd we are running is a T5120 and that's a T2.
And these don't show the problems Riccardo's T1 powered T2000 has?
We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.
Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next week.
[1] https://salsa.debian.org/kernel-team/linux/-/merge_requests/339
I have not uploaded that kernel yet, I have it built locally, PR here [1].
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e5e8b80d352ec999d2bba3ea584f541c83f4ca3f
Frank Scheiner wrote:
We have an older UltraSPARC IIIi that has issues with newer kernels, but
usually only after longer operation and the issue might be related to the
bug that was just fixed recently by Rob Gardner.
Which kernel version will have this bug (which one?) fixed, 5.11.x? I
can also check with one of my UltraSPARC IIIi powered systems, too, next
week.
as written in the title, I have issues with:
5.10.0-4-sparc64-smp #1 Debian 5.10.19-1
If I remember there was a repository with many snapshots of different versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.
Do you have a link?
I assume you mean "http://snapshot.debian.org" .
Well, that doesn't really help you though. You want to find the commit in question,
just the range isn't enough to solve the issue.
If you have a fast second machine available, bisecting the problem shouldn't take
too long.
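For reference, a minimal sketch of the bisect workflow on a throwaway repository (not the kernel tree). The repository, the `state` file, the `known-good` tag and the function name are made up for illustration. On the real tree, the good and bad points would be kernel versions, and the test step would be a build plus a boot attempt on the affected machine rather than a `grep`.

```shell
# Build a small repository in which commits 1..10 are "good" and
# commits 11..16 are "bad", then let git bisect find the first bad one.
demo_bisect() {
    repo=$(mktemp -d) || return 1
    (
        cd "$repo" || exit 1
        git init -q . || exit 1
        i=1
        while [ "$i" -le 16 ]; do
            if [ "$i" -ge 11 ]; then echo bad > state; else echo good > state; fi
            echo "$i" > counter
            git add state counter
            git -c user.name=demo -c user.email=demo@example.invalid \
                commit -q -m "commit $i"
            if [ "$i" -eq 1 ]; then git tag known-good; fi
            i=$(( i + 1 ))
        done
        # Arguments to "bisect start": first the bad revision, then the good one.
        git bisect start HEAD known-good >/dev/null 2>&1
        # "bisect run" treats exit status 0 as good and 1 as bad.
        git bisect run sh -c 'grep -q good state' >/dev/null 2>&1
        # After the run, refs/bisect/bad points at the first bad commit.
        git log -1 --format=%s refs/bisect/bad
    )
    status=$?
    rm -rf "$repo"
    return "$status"
}
```

Calling `demo_bisect` prints the subject of the first bad commit. When a boot test cannot be automated, the same search is driven by hand with `git bisect good` and `git bisect bad` after each test boot.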
Do you know if I can via serial-console reset the system?
I tried sending a break on the serial console, but the errors just keep running.
The break is received, since I see it as an SC alert, but I am not put into the console. Maybe there is some further trick on these newer machines? I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient at every hang to power-cycle, since at every turn on,
it runs a self-test which lasts minutes :)
Hi Frank!
I suppose the Niagara CPU gives the kernel issue
Frank Scheiner wrote:
If I remember there was a repository with many snapshots of different
versions, already as package, which one can test quickly. That way we
can restrict breakage range without git bisect.
Do you have a link?
I assume you mean "http://snapshot.debian.org" .
Exactly. With this I did some more tests.
Still Works:
5.9.0-4-sparc64-smp #1 SMP Debian 5.9.11-1 (2020-11-27)
5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
Broken:
linux-image-5.10.0-trunk-sparc64-smp_5.10.2-1~exp1_sparc64.deb
So the later 5.9 series kernels continue to work, and even very early 5.10 kernels do not.
How should I proceed? Which kernel sources?
https://kernel-team.pages.debian.net/kernel-handbook/ch-common-tasks.html#s-common-official
Is section 4.3 the correct one for me, or 4.6?
Do you know if I can via serial-console reset the system?
Reset from the serial console might work via the kernel with the [magic system request] functionality.
[magic system request]: https://www.kernel.org/doc/html/v4.11/admin-guide/sysrq.html
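As a rough sketch of the software side of this: on a running Linux system, root can trigger SysRq commands by writing a single command key to `/proc/sysrq-trigger`. This only helps while a shell is still reachable, so it is no substitute for the SC on a machine that hangs early in boot. The `sysrq` function name and the `SYSRQ_TRIGGER` override are made up here so that the helper can be tried against an ordinary file.

```shell
# Hypothetical helper: send one SysRq command key via the proc interface.
# SYSRQ_TRIGGER is an assumption added for dry runs; by default the real
# trigger file is used, which requires root.
sysrq() {
    key=$1
    trigger=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
    case "$key" in
        [a-z0-9]) printf '%s' "$key" > "$trigger" ;;
        *) echo "usage: sysrq <single command key>" >&2; return 2 ;;
    esac
}

# Typical emergency sequence: sync, remount read-only, then reboot.
# sysrq s; sysrq u; sysrq b
```

On a serial console the documented equivalent is a break followed by the command key within a few seconds, provided the break actually reaches the kernel.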
But you can always reset the system using the SC. The T1000 (and the
T2000, too) has both serial (on T2000 right of the DB-9 ttya port,
should work with a blue Cisco serial cable) and network port (on T2000
above the two USB ports). The serial port of the SC automatically
switches to the system console after some (configurable) time.
I tried sending a break on the serial console, but the errors just keep
running.
Break is received, since I see it as SC Alert, but I am not put into the
console, maybe there is some further trick on these newer machine?
So you already got access to the SC. Then you can reset the machine from there, too.
breakconfirm (y/n)y
consoleconfirm (y/n)y
I am
used to old SparcStations and UltraSparc Netras, where it was sufficient.
It is inconvenient at every hang to power-cycle, since at every turn on,
it runs a self-test which lasts minutes :)
I think depending on the SC configuration, these machines also run a self-test for every X resets, but this should be configurable.
On 11.03.21 23:03, Riccardo Mottola wrote:
I suppose the Niagara CPU gives the kernel issue
From [1] I assume T2 CPUs are not affected, but yeah, the issue could
be that selective that it only affects the very first generation.
[1]: https://lists.debian.org/debian-sparc/2021/03/msg00010.html
@Adrian:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit: 33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good" closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is good? Should I switch to the stable kernel sources from GKH?
Hello Frank!
On 3/16/21 2:07 PM, Frank Scheiner wrote:
After a first cross compile run, I can confirm that 5.10-rc1 is also
broken on my T1000. I'll take this version (parent commit:
33def8498fdde180023444b08e12b72a9efed41d) as "bad". But taking v5.9 as
good means more than 5000 commits in between. Linus's tree doesn't
contain v5.9.16 or at least I didn't find it there. How can I get "good"
closer to "bad"? I don't want to check too many good versions if I know
that v5.9.16 most likely will be good, as v5.9.15 (5.9.0-5 on Debian) is
good? Should I switch to the stable kernel sources from GKH?
I'm not sure I understand your problem here. The bisecting algorithm
has a runtime of O(log n), so even with 5000 commits, it will converge quite quickly.
Just make sure you are using a fast machine when compiling the kernel
as otherwise it won't be fun.
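To put a number on this: each bisect step roughly halves the remaining range, so the step count grows with the base-2 logarithm of the number of commits. A small sketch (the function name is made up here, and the real count on a branching history is only approximately this):

```shell
# Count how many halvings are needed to narrow n candidates down to one.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))   # keep the larger half, i.e. the worst case
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

bisect_steps 5000   # prints 13
```

So moving "good" from v5.9 to a later 5.9.x release would save at most a handful of builds.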
Hi Adrian, Riccardo
so I'm finished with bisecting and it points to the following commit as
first bad commit:
```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
commit 028abd9222df0cf5855dab5014a5ebaf06f90565
Author: Christoph Hellwig <hch@lst.de>
Date: Thu Sep 17 10:22:34 2020 +0200
    fs: remove compat_sys_mount
    compat_sys_mount is identical to the regular sys_mount now, so remove it
    and use the native version everywhere.
```
On 3/17/21 1:22 PM, Frank Scheiner wrote:
```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect bad
028abd9222df0cf5855dab5014a5ebaf06f90565 is the first bad commit
[...]
```
Did you verify that by reverting this commit or - if reverting is not possible - by testing
out the revision just before the commit?
Just to be safe you found the correct commit.
If that has been verified, please report the issue to the sparclinux LKML and CC Christoph.
```
johndoe@x4270:~/git-projects/torvalds/linux$ git bisect log
[...]
# good: [67e306c6906137020267eb9bbdbc127034da3627] fs,nfs: lift compat nfs4 mount data handling into the nfs code
git bisect good 67e306c6906137020267eb9bbdbc127034da3627
# bad: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
git bisect bad 028abd9222df0cf5855dab5014a5ebaf06f90565
# first bad commit: [028abd9222df0cf5855dab5014a5ebaf06f90565] fs: remove compat_sys_mount
```
while I was able to "install" correctly using a slightly older ISO, I
get not a bootable system. The kernel appears to crash very early during
boot.
From my current testing it looks like "UltraSPARC IIIi"s are also
affected by this problem with UltraSPARC T1s in some way:
With the latest Linux 5.10.x (from Debian) the root FS can't be
successfully mounted, with the latest Linux 5.9.x (also from Debian) it
just works fine. Unfortunately the V245 doesn't fail/work for the exact
same kernels that I tested during the bisecting for the T1000, e.g. the
first bad commit version that didn't work on the T1000 seems to work on
the V245 but some good versions don't with:
```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```
Hi all,
while I was able to "install" correctly using a slightly older ISO, I
get not a bootable system. The kernel appears to crash very early during boot.
Anybody else has this issue?
 Booting `Debian GNU/Linux'
Loading Linux 5.10.0-4-sparc64-smp ...
Loading initial ramdisk ...
On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```
I seem to recall that NFS is one of those filesystems that (a) makes use of filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.
Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this binary blob is no longer aware that it is being invoked in a compat32 context.
Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in question were all on NFS mounts and the T2 one wasn't?
Hi Jan,
On 23.03.21 16:36, Jan Engelhardt wrote:
On Tuesday 2021-03-23 16:29, Frank Scheiner wrote:
```
[...]
Begin: Retrying nfs mount ... [ 41.753937] NFS: mount program didn't pass remote address
mount: Invalid argument
```
I seem to recall that NFS is one of those filesystems that (a) makes use of
filesystem-specific data, i.e. mount(2)'s 5th argument, and (b) a mount helper,
/usr/sbin/mount.nfs.
Now, with the change in Linux kernel 028abd9222df0cf5855dab5014a5ebaf06f90565,
I am postulating the hypothesis that the fs/nfs/ code for parsing this
binary blob is no longer aware that it is being invoked in a compat32 context.
That sounds interesting. Can you perhaps post your hypothesis also in
this thread:
https://marc.info/?t=161644900600003&r=1&w=2
Maybe this gives the kernel developers some ideas.
Since T2 systems were said to be fine and T1 not, perhaps the T1 systems in
question were all on NFS mounts and the T2 one wasn't?
No, the T5220 was also running diskless, actually using the same root FS
as the T1000 (in form of a btrfs subvolume snapshot) plus identical
kernel and initramfs:
```
root@nfs:/srv/tftp# ls -la $( host2hex t5220 )*
lrwxrwxrwx 1 root root 35 Feb 28 2018 AC10026E -> boot/grub/sparc64-ieee1275/core.img
lrwxrwxrwx 1 root root 38 Mar 15 18:16 AC10026E.initrd.img -> initrd.img.5.10.0-4.debian.sid.sparc64
lrwxrwxrwx 1 root root 36 Mar 15 18:16 AC10026E.vmlinuz -> linux.mp.5.10.0-4.debian.sid.sparc64
```
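`host2hex` is a local helper whose source was not posted. Assuming it resolves the host name and prints the IPv4 address as eight uppercase hex digits (which is what the `AC10026E` link names suggest, as that decodes to 172.16.2.110), the conversion step could look like the sketch below. The function name `ip2hex` is made up, and the name lookup is left out.

```shell
# Convert a dotted-quad IPv4 address to eight uppercase hex digits,
# the form used for the per-client file names in the listing above.
ip2hex() {
    old_ifs=$IFS
    IFS=.
    set -- $1            # split the address into its four octets
    IFS=$old_ifs
    printf '%02X%02X%02X%02X\n' "$1" "$2" "$3" "$4"
}

ip2hex 172.16.2.110   # prints AC10026E
```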
Cheers,
Frank
Hi,
can anyone possibly give a list of known stable kernel versions for
SPARC machines? (Is it necessary to distinguish between
architectures or older vs. newer machines, e.g. sun4u/sun4v?)
Also, does this instability show up as the machine crashing during
high workload? (Halting? Rebooting?)
I ask because on three different SPARC machines I have been
experiencing a weird effect when using Debian:
I would start a high compile load for several days (7-10), during which the machines run fine without any apparent error visible in dmesg or anywhere else.
Then when I power off and on again, the filesystem would be corrupt and sometimes impossible to repair without reinstallation.
This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off again for the night
with no high workload.
You should clone the upstream Git repo, otherwise bisecting will be much
more difficult.
I think these instructions are still valid: https://wiki.debian.org/DebianKernel/GitBisect
You can also skip the Debian-specific stuff and simply do
make -j8 && make modules_install && make install
It's better to use at least a compatible kernel config, though.
I took the config from /boot/config of a good kernel and updated it with "make oldconfig".
During compilation I see:
 CC     init/init_task.o
make[1]: *** No rule to make target 'debian/certs/debian-uefi-certs.pem', needed by 'certs/x509_certificate_list'. Stop.
make[1]: *** Waiting for unfinished jobs....
...
I think you need to remove all references to debian certs to compile a
custom kernel.
can anyone possibly give a list of known stable kernel versions for
SPARC machines? (Is it necessary to distinguish between
architectures or older vs. newer machines, e.g. sun4u/sun4v?)
Also, does this instability show up as the machine crashing during
high workload? (Halting? Rebooting?)
I ask because on three different SPARC machines I have been
experiencing a weird effect when using Debian:
I would start a high compile load for several days (7-10), during which the machines run fine without any apparent error visible in dmesg or anywhere else.
Then when I power off and on again, the filesystem would be corrupt and sometimes impossible to repair without reinstallation.
This seems to happen only when the machines do a long run with high
workload, and seemingly not when I just power them off again for the night
with no high workload.
I have a limited experience and can only share that the kernel I
currently am running on this Fire T2000
Linux narya 5.9.0-5-sparc64-smp #1 SMP Debian 5.9.15-1 (2020-12-17)
sparc64 GNU/Linux
Is quite stable for me.
However, i did not try to run for several days compiling, so I don't
know if it is stable for a long time.
Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""
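One way to apply this to a config copied from /boot is a small sed edit. This is only a sketch: the function name is made up, and it assumes the option is present as a plain `CONFIG_SYSTEM_TRUSTED_KEYS="..."` line. The kernel tree's own `scripts/config --set-str SYSTEM_TRUSTED_KEYS ""` should make the same change.

```shell
# Blank out the trusted-keys path so the build no longer looks for the
# Debian certificate file. Defaults to ./.config in the kernel tree.
clear_trusted_keys() {
    config=${1:-.config}
    tmp=$(mktemp) || return 1
    sed 's|^CONFIG_SYSTEM_TRUSTED_KEYS=.*|CONFIG_SYSTEM_TRUSTED_KEYS=""|' \
        "$config" > "$tmp" && cat "$tmp" > "$config"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Running `make oldconfig` again afterwards lets the build system pick up the change.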
multix@narya:~/code/linux-stable$ time sudo make install
sh ./arch/sparc/boot/install.sh 5.12.0-rc5+ arch/sparc/boot/zImage \
System.map "/boot"
run-parts: executing /etc/kernel/postinst.d/apt-auto-removal 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/initramfs-tools 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
update-initramfs: Generating /boot/initrd.img-5.12.0-rc5+
run-parts: executing /etc/kernel/postinst.d/zz-update-grub 5.12.0-rc5+ /boot/vmlinuz-5.12.0-rc5+
Generating grub configuration file ...
Found linux image: /boot/vmlinuz-5.12.0-rc5+
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinuz-5.12.0-rc5+.old
Found initrd image: /boot/initrd.img-5.12.0-rc5+
Found linux image: /boot/vmlinux-5.10.0-4-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-4-sparc64-smp
Found linux image: /boot/vmlinux-5.10.0-trunk-sparc64-smp
Found initrd image: /boot/initrd.img-5.10.0-trunk-sparc64-smp
Found linux image: /boot/vmlinux-5.9.0-5-sparc64-smp
Found initrd image: /boot/initrd.img-5.9.0-5-sparc64-smp
done
At boot:
Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
Loading initial ramdisk ...
error: you need to load the kernel first.
Yep, in your kernel config set:
CONFIG_SYSTEM_TRUSTED_KEYS=""
Thanks, that was it! Now the kernel builds.
Do I need to do something special?
make install
make modules_install
Loading Linux 5.12.0-rc5+ ...
error: premature end of file /vmlinuz-5.12.0-rc5+.
current grub2 version does not support compressed image kernels, do
the following:
gzip -dc /boot/vmlinuz-5.12.0-rc5+ > /boot/vmlinux-5.12.0-rc5+
rm /boot/vmlinuz-5.12.0-rc5+
update-grub
and reboot
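The same steps wrapped in a small function, as a sketch only. The function name and the optional boot directory argument are made up here; it decompresses into a temporary file first, so a failed decompression leaves the existing files alone.

```shell
# Replace a gzip-compressed vmlinuz-<version> with an uncompressed
# vmlinux-<version> in the boot directory (default: /boot).
unpack_kernel() {
    version=$1
    bootdir=${2:-/boot}
    src="$bootdir/vmlinuz-$version"
    dst="$bootdir/vmlinux-$version"
    tmp="$dst.tmp.$$"
    if gzip -dc "$src" > "$tmp" 2>/dev/null; then
        mv "$tmp" "$dst" && rm "$src"
    else
        rm -f "$tmp"
        echo "$src is missing or not gzip-compressed, nothing changed" >&2
        return 1
    fi
}

# Usage (as root), followed by regenerating the GRUB configuration:
# unpack_kernel 5.12.0-rc5+ && update-grub
```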
I remember you bisected down to the breaking commit. Has there been any progress?
Is there a better place to report this issue than this mailing list?
On 12/11/21 18:40, Riccardo Mottola wrote:
I remember you bisected about the breaking commits. Has there been any progress?
A better place where to report this issue other than this mailing list?
The proper place is to send an email to the author of the breaking commit and CC the sparclinux Linux kernel mailing list. Most kernel developers don't read
the debian-sparc mailing list.
as Frank asked, I compiled myself a kernel using his latest commit
identified as good:
67e306c6906137020267eb9bbdbc127034da3627
and this kernel works, but then fails to load initramfs.
The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.
Did you forget to create an initrd? After installing the kernel, run:
$ update-initramfs -k KERNEL_VERSION -c
The good news is that latest kernel installed seems to boot and takes
all CPUs online. How stable it is I don't know, it needs to be tested.
Please run some stress tests such as stress-ng and report back.
Not nice. I started compiling some stuff and the box froze. I connected the serial console and could not resume due to "Fast Data Access MMU miss".
So, this crash occurs with the latest 5.15 kernel on your T2000?
In my experience, the most stable kernels on the older SPARCs are still the 4.19 kernels. Thus, we should start bisecting to find out what commit actually
made the kernel unreliable on these older SPARCs.
John Paul Adrian Glaubitz wrote:
Not nice. I started compiling some stuff and the box froze, I connected
serial console and could not resume due to "Fast Data Access MMU miss"
So, this crash occurs with the latest 5.15 kernel on your T2000?
Exactly, the latest kernel.
I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.
Hi,
Riccardo Mottola wrote:
John Paul Adrian Glaubitz wrote:
Not nice. I started compiling some stuff and the box froze, I connected
serial console and could not resume due to "Fast Data Access MMU miss"
So, this crash occurs with the latest 5.15 kernel on your T2000?
Exactly, the latest kernel.
I will retest it with stress-ng as soon as I finish this email and copy
the dmesg errors.
wow, running the test suite once or twice, I am able to have the system power-cycle... wow
Frank test latest kernel on yours :)