• [gentoo-user] time to build a new machine ?

    From Philip Webb@21:1/5 to All on Fri Sep 24 12:00:02 2021
    While I was asleep yesterday, my machine reported on all 3 Konsoles :

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822

    -- end of report --

    I don't remember seeing this before : how concerned should I be ?

    The present machine is 6 years old & has always worked very well ;
    its CPU is an AMD. I plan to build a new machine in the next few months :
    should I accelerate my plans ?
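
Two fields in the report can be decoded without extra tools. The sketch below assumes that TIME is seconds since the Unix epoch and that the value after "PROCESSOR 2:" is the raw CPUID signature, split using the AMD family/model rules (the poster says the CPU is an AMD). The function names are made up for this illustration.

```python
from datetime import datetime, timezone


def decode_time(epoch_seconds: int) -> str:
    """Format the TIME field (assumed: seconds since the Unix epoch) as UTC."""
    stamp = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    return stamp.strftime("%Y-%m-%d %H:%M:%S UTC")


def decode_amd_cpuid(signature: int) -> dict:
    """Split a CPUID signature into family/model/stepping using AMD rules."""
    stepping = signature & 0xF
    base_model = (signature >> 4) & 0xF
    base_family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF
    if base_family == 0xF:
        # Extended fields only apply when the base family is 0xF.
        family = base_family + ext_family
        model = (ext_model << 4) | base_model
    else:
        family, model = base_family, base_model
    return {"family": family, "model": model, "stepping": stepping}


print(decode_time(1632440315))      # 2021-09-23 23:38:35 UTC
print(decode_amd_cpuid(0x600F20))   # {'family': 21, 'model': 2, 'stepping': 0}
```

23:38 UTC is 19:38 in Toronto during daylight time, which agrees with the syslogd timestamp. Family 21 is 15h in hexadecimal.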

    --
    ========================,,============================================
    SUPPORT ___________//___, Philip Webb
    ELECTRIC /] [] [] [] [] []| Cities Centre, University of Toronto
    TRANSIT `-O----------O---' purslowatchassdotutorontodotca

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andrew Udvare@21:1/5 to All on Fri Sep 24 12:20:02 2021
    On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:

    While I was asleep yesterday, my machine reported on all 3 Konsoles :

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000

    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822

    -- end of report --

    I don't remember seeing this before : how concerned should I be ?

    From the manpage:

    Most errors can be corrected by the CPU by internal error correction mechanisms. Uncorrected
    errors cause machine check exceptions which may kill processes or panic the machine. A small
    number of corrected errors is usually not a cause for worry, but a large number can indicate
    future failure.

    When an uncorrected machine check error happens that the kernel cannot recover from then it
    will usually panic the system. In this case when there was a warm reset after the panic
    mcelog should pick up the machine check errors after reboot. This is not possible after a
    cold reset.
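
Since the manpage separates "a small number" of corrected errors from "a large number", it helps to count them. The following is a minimal sketch, not part of mcelog: it assumes the kernel log has been saved as text and that each event starts with a line shaped like the "CPU 0: Machine Check: 0 Bank 4: ..." line in the report. The regular expression and function name are illustrative only.

```python
import re
from collections import Counter

# Matches the first line of each event as it appears in the report above.
MCE_HEADER = re.compile(
    r"mce: \[Hardware Error\]: CPU (\d+): Machine Check: \d+ Bank (\d+): ([0-9a-f]+)"
)


def count_mce_events(log_text: str) -> Counter:
    """Count machine check events per (cpu, bank) pair in saved log text."""
    counts = Counter()
    for match in MCE_HEADER.finditer(log_text):
        cpu, bank = int(match.group(1)), int(match.group(2))
        counts[(cpu, bank)] += 1
    return counts


sample = """\
: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
: mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
: mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
"""
print(count_mce_events(sample))  # Counter({(0, 4): 1})
```

A single event on one bank, as here, falls on the "small number" side; a count that keeps growing on the same bank would be the pattern the manpage warns about.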

    If you are overclocking, try disabling it.

  • From Philip Webb@21:1/5 to All on Fri Sep 24 12:50:01 2021
    210924 Andrew Udvare wrote:
    On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
    While I was asleep yesterday, my machine reported on all 3 Konsoles :
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
    -- end of report --
    From the manpage:

    Which man page is that ?

    Most errors can be corrected by the CPU
    by internal error correction mechanisms. Uncorrected errors cause
    machine check exceptions which may kill processes or panic the machine.
    A small number of corrected errors is usually not a cause for worry,
    but a large number can indicate future failure.

    So it looks as if the above was a correctable error.
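
That reading can be checked against the status word itself. The sketch below assumes 9d0b4c16001d011b is the bank's machine check status register and uses the architectural x86 flag positions as I understand them (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57). It does not try to interpret the model-specific bits, and the function name is invented for the example.

```python
# Architectural flag bits in the upper part of the status word (assumed layout).
STATUS_FLAGS = {
    63: "VAL",    # register holds a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # error was NOT corrected by hardware
    60: "EN",     # reporting was enabled for this error
    59: "MISCV",  # the MISC value is valid
    58: "ADDRV",  # the ADDR value is valid
    57: "PCC",    # processor context may be corrupt
}


def decode_status(status: int) -> dict:
    """Return the architectural flags of a machine check status word."""
    return {name: bool((status >> bit) & 1) for bit, name in STATUS_FLAGS.items()}


status = 0x9D0B4C16001D011B
flags = decode_status(status)
print(flags)
print("corrected" if flags["VAL"] and not flags["UC"] else "uncorrected or invalid")
print(hex(status & 0xFFFF))  # low 16 bits: the MCA error code
```

With UC and PCC both clear, the event was corrected by the hardware, and ADDRV and MISCV being set is consistent with the report including ADDR and MISC values.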

    When an uncorrected machine check error happens
    that the kernel cannot recover from, then it will usually panic the system.
    In this case when there was a warm reset after the panic,
    mcelog should pick up the machine check errors after reboot.
    This is not possible after a cold reset.

    No sign of any other effects : everything went on running.

    If you are overclocking, try disabling it.

    No, I never overclock anything (smile).

    --
    ========================,,============================================
    SUPPORT ___________//___, Philip Webb
    ELECTRIC /] [] [] [] [] []| Cities Centre, University of Toronto
    TRANSIT `-O----------O---' purslowatchassdotutorontodotca

  • From Andrew Udvare@21:1/5 to Philip Webb on Fri Sep 24 17:30:02 2021

    On 24/09/2021 06:48, Philip Webb wrote:
    210924 Andrew Udvare wrote:
    On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
    While I was asleep yesterday, my machine reported on all 3 Konsoles :
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
    -- end of report --
    From the manpage:

    Which man page is that ?

    man mcelog


  • From Mark Knecht@21:1/5 to audvare@gmail.com on Fri Sep 24 18:40:02 2021
    On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare <audvare@gmail.com> wrote:

    On 24/09/2021 06:48, Philip Webb wrote:
    210924 Andrew Udvare wrote:
    On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
    While I was asleep yesterday, my machine reported on all 3 Konsoles :
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
    -- end of report --
    From the manpage:

    Which man page is that ?

    man mcelog

    I have no direct experience with this error; however, I'd suggest it was
    most likely an error reading a block of DRAM and not likely the CPU itself
    failing. I periodically get mce errors on my i980 machine when running big
    PixInsight jobs and I hit thermal limits.

    I'd suggest you run extensive memory tests and, if you don't see any
    problems, don't worry too much. Of course, it's always wise to do good
    backups in case the problem gets worse.

    Good luck,
    Mark

  • From Philip Webb@21:1/5 to All on Fri Sep 24 22:00:02 2021
    210924 Mark Knecht wrote:
    On 2021-09-24, at 05:58, Philip Webb <purslow@ca.inter.net> wrote:
    While I was asleep yesterday, my machine reported on all 3 Konsoles
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 4: 9d0b4c16001d011b
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: TSC 0 ADDR 19e617980 MISC c01a000001000000
    Message from syslogd@ at Thu Sep 23 19:38:11 2021 ...
    : mce: [Hardware Error]: PROCESSOR 2:600f20 TIME 1632440315 SOCKET 0 APIC 0 microcode 6000822
    I have no direct experience with this error,
    however I'd suggest it was most likely an error
    reading a block of DRAM and not likely the CPU itself failing.
    I periodically get mce errors on my i980 machine
    when running big PixInsight jobs and I hit thermal limits.

    I thought you had written "1980 machine" (grin).

    I'd suggest you run extensive memory tests
    and if you don't see any problems don't worry too much.
    It's always wise to do good backups in case the problem gets worse.

    Everything is backed up, incl off-site.

    On Fri, Sep 24, 2021 at 8:23 AM Andrew Udvare <audvare@gmail.com> wrote:
    man mcelog

    'man mcelog' + 'man mce' find nothing. Does it need to be installed ?

    Thanks for the advice so far.

    --
    ========================,,============================================
    SUPPORT ___________//___, Philip Webb
    ELECTRIC /] [] [] [] [] []| Cities Centre, University of Toronto
    TRANSIT `-O----------O---' purslowatchassdotutorontodotca

  • From Adam Carter@21:1/5 to All on Sat Sep 25 02:10:01 2021

    man mcelog

    'man mcelog' + 'man mce' find nothing. does it need to be installed ?


    Yep, and the package is called mcelog.

    Did you check for any other messages before/after the mce errors?

    Do you also have lm-sensors installed? Running sensord?

    Genuine CPU issues seem pretty rare, so I would check for overheating or
    power issues, and lm-sensors will help with that.
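
For a quick look at temperatures without lm-sensors, the kernel's hwmon sysfs files can be read directly. This is a rough sketch, assuming the usual layout /sys/class/hwmon/hwmon*/temp*_input with values in millidegrees Celsius; which chips and channels show up depends on the loaded drivers, and the function name is hypothetical.

```python
from pathlib import Path


def read_temperatures(hwmon_root: str = "/sys/class/hwmon") -> dict:
    """Return {"chip/tempN": degrees Celsius} for every readable hwmon sensor."""
    readings = {}
    for temp_file in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        chip_dir = temp_file.parent
        try:
            chip = (chip_dir / "name").read_text().strip()
        except OSError:
            chip = chip_dir.name  # fall back to e.g. "hwmon0"
        try:
            millidegrees = int(temp_file.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now, skip it
        channel = temp_file.name[: -len("_input")]
        readings[f"{chip}/{channel}"] = millidegrees / 1000.0
    return readings


if __name__ == "__main__":
    for sensor, celsius in read_temperatures().items():
        print(f"{sensor}: {celsius:.1f} C")
```

Running it in a loop while the machine is under load would show whether temperatures climb before an mce message appears.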
