• Re: warning: "This [Crucial M.2] storage device is likely to fail soon"

    From Theo@21:1/5 to jkn on Thu Apr 28 10:12:09 2022
    jkn <jkn_gg@nicorp.f9.co.uk> wrote:
    > So a few thoughts:
    >
    > - any idea if this is related to my recent upgrade? ie. new feature in [K]Ubuntu 22.04?
    > - is this likely to be a real issue, or an over-zealous warning?

    I checked my NVMe and I don't have an 'error log' section. I don't know
    what yours means. You could try nvme-cli (that's the package in Ubuntu) eg:

    $ sudo nvme error-log /dev/nvme0n1
    $ sudo nvme smart-log /dev/nvme0n1
    (other *-log commands available)

    and see if it reports anything interesting. eg for me smart-log says:

    $ sudo nvme smart-log /dev/nvme1n1
    Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
    critical_warning : 0
    temperature : 25 C
    available_spare : 100%
    available_spare_threshold : 5%
    percentage_used : 10%
    endurance group critical warning summary: 0

    so I seem to have used 10% of my write endurance (I think).

    It is possible doing an upgrade has eaten some of your available writes and pushed it over some threshold.

    > I am thinking of doing two things: buying a new/larger (1TB) M.2 drive, and dd'ing everything over; and upgrading the firmware on this Crucial M.2 drive
    >
    > - Is it a particularly risky operation to update the M.2 firmware without backing up the drive first?

    In theory it shouldn't be a risk to update the firmware (it happens in production all the time), but if the drive is exhibiting failure signs I'd
    want to make a backup first just in case.

    Theo

    (who hadn't come across nvme-cli before and thinks it could be a useful way
    of using cheaper NVMe in servers and replacing drives when they start
    running out of writes)
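
    A minimal sketch of that kind of check, assuming the percentage_used line looks like the smart-log output above. check_wear and the 80% default limit are made-up names and values, not part of nvme-cli:

```shell
# Sketch: flag a drive whose reported wear has reached a chosen limit.
# percentage_used and check_wear are hypothetical helpers, not nvme-cli commands.

# Reads `nvme smart-log` text on stdin and prints the percentage_used number.
percentage_used() {
    awk -F: '/^[ \t]*percentage_used/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# check_wear [LIMIT]: returns 0 below the limit, 1 at or above it, 2 if the
# field is missing from the input.
check_wear() {
    local limit=${1:-80} used
    used=$(percentage_used)
    [ -n "$used" ] || { echo "percentage_used not found" >&2; return 2; }
    if [ "$used" -ge "$limit" ]; then
        echo "WARN: ${used}% of rated endurance used (limit ${limit}%)"
        return 1
    fi
    echo "OK: ${used}% of rated endurance used"
}
```

    Usage would be something like: sudo nvme smart-log /dev/nvme0n1 | check_wear 80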

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From jkn@21:1/5 to All on Thu Apr 28 01:46:09 2022
    Hi all
    Just yesterday I upgraded my main desktop from Kubuntu 20.04 to 22.04 LTS. All went pretty well ... except this morning when I powered it up I got a worrying desktop warning message along the lines of:

    The Storage Device /dev/nvme0n1 is likely to fail soon

    erk!

    I have not seen this before, but TBH yesterday was the first time I have powered this machine off in a long time; I'm checking 'quiescent' household power consumption.

    I did a quick "sudo smartctl /dev/nvme0n1 -a" which doesn't look too bad, although I haven't delved deep. See below for the output.

    So a few thoughts:

    - any idea if this is related to my recent upgrade? ie. new feature in [K]Ubuntu 22.04?
    - is this likely to be a real issue, or an over-zealous warning?

    I am thinking of doing two things: buying a new/larger(1TB) M.2 drive, and dd'ing everything over; and upgrading the firmware on this Crucial M.2 drive

    - Is it a particularly risky operation to update the M.2 firmware without backing up the drive first?

    Thanks for any thoughts
    Jon N

    {{{ output from: sudo smartctl /dev/nvme0n1 -a
    smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-27-generic] (local build)
    Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

    === START OF INFORMATION SECTION ===
    Model Number: CT500P2SSD8
    Serial Number: 2043E4BD82DC
    Firmware Version: P2CR010
    PCI Vendor/Subsystem ID: 0xc0a9
    IEEE OUI Identifier: 0x6479a7
    Total NVM Capacity: 500,107,862,016 [500 GB]
    Unallocated NVM Capacity: 0
    Controller ID: 1
    NVMe Version: 1.3
    Number of Namespaces: 1
    Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
    Namespace 1 Formatted LBA Size: 512
    Namespace 1 IEEE EUI-64: 6479a7 fff0000000
    Local Time is: Thu Apr 28 09:39:36 2022 BST
    Firmware Updates (0x12): 1 Slot, no Reset required
    Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
    Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
    Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
    Maximum Data Transfer Size: 64 Pages
    Warning Comp. Temp. Threshold: 70 Celsius
    Critical Comp. Temp. Threshold: 85 Celsius

    Supported Power States
    St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
     0 +     4.50W       -        -    0  0  0  0        0       0
     1 +     2.70W       -        -    1  1  1  1        0       0
     2 +     2.16W       -        -    2  2  2  2        0       0
     3 -   0.0700W       -        -    3  3  3  3     1000    1000
     4 -   0.0020W       -        -    4  4  4  4     5000   55000

    Supported LBA Sizes (NSID 0x1)
    Id Fmt  Data  Metadt  Rel_Perf
     0 +     512       0         1
     1 -    4096       0         0

    === START OF SMART DATA SECTION ===
    SMART overall-health self-assessment test result: PASSED

    SMART/Health Information (NVMe Log 0x02)
    Critical Warning: 0x00
    Temperature: 39 Celsius
    Available Spare: 100%
    Available Spare Threshold: 5%
    Percentage Used: 0%
    Data Units Read: 1,140,109 [583 GB]
    Data Units Written: 2,357,594 [1.20 TB]
    Host Read Commands: 14,832,947
    Host Write Commands: 25,228,486
    Controller Busy Time: 11,685
    Power Cycles: 236
    Power On Hours: 8,833
    Unsafe Shutdowns: 38
    Media and Data Integrity Errors: 0
    Error Information Log Entries: 275
    Warning Comp. Temperature Time: 0
    Critical Comp. Temperature Time: 0

    Error Information (NVMe Log 0x01, 16 of 16 entries)
    Num  ErrCount  SQId   CmdId  Status  PELoc  LBA  NSID  VS
      0       275     0  0x1008  0x4005  0x028    0     0   -
    }}}
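
    For what it's worth, the bracketed sizes on the Data Units lines can be reproduced by hand. As far as I understand the NVMe SMART log, one data unit is 1000 blocks of 512 bytes; units_to_gb below is just an illustrative helper, not a smartctl feature:

```shell
# One NVMe "data unit" is taken here to be 1000 x 512 bytes (an assumption
# based on the NVMe SMART/Health log definition).
# units_to_gb is a hypothetical helper; it prints whole decimal gigabytes.
units_to_gb() {
    local units
    units=$(printf '%s' "$1" | tr -d ',')   # accept "2,357,594" as printed
    echo $(( units * 512000 / 1000000000 ))
}

units_to_gb 2,357,594   # Data Units Written -> 1207
units_to_gb 1,140,109   # Data Units Read    -> 583
```

    That gives roughly 1207 GB written and 583 GB read, in line with the [1.20 TB] and [583 GB] that smartctl prints.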

  • From Theo@21:1/5 to jkn on Thu Apr 28 11:14:04 2022
    jkn <jkn_gg@nicorp.f9.co.uk> wrote:
    > I think I will press on with buying a new 1TB M.2 drive (I was thinking of doing that anyway, as it happens), and updating the firmware on this one
    > only after I have dd'd everything over and swapped to the new one.

    Sounds reasonable.
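
    A rough sketch of the whole-device copy being discussed, assuming GNU dd. clone_drive is a hypothetical wrapper and the device names in the usage line are only examples; check them with lsblk first, because dd will overwrite whatever target it is given:

```shell
# Whole-device copy sketch. clone_drive is a hypothetical wrapper around dd;
# SRC and DST are whatever you pass in (block devices, or plain files for a
# dry run). Neither device should be mounted while this runs.
clone_drive() {
    local src=$1 dst=$2
    # bs=4M copies in 4 MiB chunks; conv=fsync flushes the target to stable
    # storage before dd exits.
    dd if="$src" of="$dst" bs=4M conv=fsync
}
```

    Run as root from a live USB it would look like: clone_drive /dev/nvme0n1 /dev/nvme1n1 (old drive first, new drive second). The new drive has to be at least as large as the old one, and the extra space stays unpartitioned until the partitions are grown.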

    > Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off on Amazon on the Crucial P2 I have here...

    Samsung Evo are my standard fit. I've also been using Sabrent as they have been better at producing PCIe Gen4 drives at a decent price, although I'm a
    bit more uncertain about reliability. (I have 8 of them in a server, everything is fine thus far...)

    I would avoid QLC drives (cheap but slow, sometimes HDD-slow). TLC isn't a great deal more expensive.

    Previously I would have aimed for an SSD with DRAM rather than a DRAM-less
    one but they seem to be harder to find these days. DRAM-less is probably
    fine unless you're serving databases or similar.

    Theo

  • From jkn@21:1/5 to Theo on Thu Apr 28 02:24:04 2022
    Thanks a lot Theo, very useful.
    I installed nvme-cli and get this:

    {{{ $ sudo nvme error-log /dev/nvme0n1
    Error Log Entries for device:nvme0n1 entries:16
    .................
    Entry[ 0]
    .................
    error_count : 275
    sqid : 0
    cmdid : 0x1008
    status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
    phase_tag : 0x1
    parm_err_loc : 0x28
    lba : 0
    nsid : 0
    vs : 0
    trtype : The transport type is not indicated or the error is not transport related.
    cs : 0
    trtype_spec_info: 0

    # (all other log entries seem 'empty')
    }}}
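
    Note that smartctl lists Status 0x4005 for this entry while nvme-cli shows status_field 0x2002 with phase_tag 0x1. These appear to be the same 16-bit word: bit 0 is the phase tag and bits 15:1 are the status field. A small sketch that splits the word, with the bit layout taken from my reading of the NVMe base spec (decode_status is a hypothetical helper):

```shell
# Split the 16-bit status word of an NVMe error log entry into its fields.
# Assumed layout (NVMe base spec): bit 0 = phase tag, bits 15:1 = status
# field, within which SC is bits 7:0, SCT bits 10:8, More bit 13, DNR bit 14.
decode_status() {
    local raw field
    raw=$(( $1 ))
    field=$(( raw >> 1 ))
    printf 'phase_tag=%d status_field=0x%04x sct=%d sc=0x%02x more=%d dnr=%d\n' \
        $(( raw & 1 )) "$field" $(( (field >> 8) & 7 )) $(( field & 0xff )) \
        $(( (field >> 13) & 1 )) $(( (field >> 14) & 1 ))
}

decode_status 0x4005
# phase_tag=1 status_field=0x2002 sct=0 sc=0x02 more=1 dnr=0
```

    Status code 0x02 with type 0 (generic) is the "Invalid Field in Command" that nvme-cli spells out. With sqid 0 (the admin queue) and parm_err_loc 0x28 (byte 40 of the command, which should be where Command Dword 10 starts), the entry looks like a record of an admin command the drive rejected rather than a media fault.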

    {{{ $ sudo nvme smart-log /dev/nvme0n1
    Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
    critical_warning : 0
    temperature : 31 C (304 Kelvin)
    available_spare : 100%
    available_spare_threshold : 5%
    percentage_used : 0%
    endurance group critical warning summary: 0
    data_units_read : 1,140,151
    data_units_written : 2,357,758
    host_read_commands : 14,833,879
    host_write_commands : 25,231,374
    controller_busy_time : 11,687
    power_cycles : 236
    power_on_hours : 8,834
    unsafe_shutdowns : 38
    media_errors : 0
    num_err_log_entries : 275
    Warning Temperature Time : 0
    Critical Composite Temperature Time : 0
    Thermal Management T1 Trans Count : 0
    Thermal Management T2 Trans Count : 0
    Thermal Management T1 Total Time : 0
    Thermal Management T2 Total Time : 0
    }}}

    > It is possible doing an upgrade has eaten some of your available writes and pushed it over some threshold.

    That is a good thought...

    I think I will press on with buying a new 1TB M.2 drive (I was thinking of doing
    that anyway, as it happens), and updating the firmware on this one
    only after I have dd'd everything over and swapped to the new one.

    Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off
    on Amazon on the Crucial P2 I have here...

    Thanks, J^n

  • From Theo@21:1/5 to Theo on Thu Apr 28 11:33:03 2022
    Theo <theom+news@chiark.greenend.org.uk> wrote:
    > Previously I would have aimed for an SSD with DRAM rather than a DRAM-less one but they seem to be harder to find these days. DRAM-less is probably
    > fine unless you're serving databases or similar.

    One other thing... if it's going to be under any kind of intense workload
    (eg compiling) what I tend to do is look for 'performance consistency'
    graphs, eg this is a cheap and old drive:

    https://www.anandtech.com/show/9258/crucial-mx200-250gb-500gb-1tb-ssd-review/2

    You can see the IOPS fall off a cliff once the buffer cache is exhausted.

    One way to improve this is to leave some portion of the drive unwritten - eg partition it to 900GB not 1TB and leave the last 100GB as unwritten blocks. This gives the drive a bit more breathing space as it can have more spare blocks to play with. Anandtech's benchmarks sometimes incorporate such overprovisioning, eg: https://www.anandtech.com/show/9451/the-2tb-samsung-850-pro-evo-ssd-review/2
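
    As a rough sketch of the arithmetic, here is how the end of the partition could be worked out for a given amount of spare. partition_end_mib is a hypothetical helper, and the 1000204886016-byte size is just a typical figure for a "1TB" drive, not a measured one:

```shell
# Print the end offset (in MiB) for a partition that leaves SPARE_PERCENT of
# the device unpartitioned. partition_end_mib is a hypothetical helper.
partition_end_mib() {
    local size_bytes=$1 spare_percent=${2:-10}
    # Integer maths: whole MiB on the device, then keep (100 - spare)% of it.
    echo $(( size_bytes / 1048576 * (100 - spare_percent) / 100 ))
}

partition_end_mib 1000204886016 10   # -> 858482
```

    The real size can be read with: sudo blockdev --getsize64 /dev/nvme0n1. The result is then used as the partition end in whatever partitioning tool you prefer. The spare area only helps if it stays unwritten.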

    You probably don't care about performance to that level but it's something I look at when selecting drives.

    Theo

  • From jkn@21:1/5 to Theo on Thu Apr 28 04:07:34 2022
    Thanks Theo, that's all very useful. I do do some semi-intensive compiling so the point is well taken.

    As it happens I've just placed an order for a Samsung 970 EVO Plus (1TB);
    I will take your advice and leave some part unpartitioned.

    Can anyone suggest any semi-decent NVMe USB-C enclosures to move my
    Crucial P2 drive to once I have updated the firmware etc?

    Thanks again, Jon N
