Forum: >>> Magnum BBS <<<

Re: warning: "This [Crucial M.2] storage device is likely to fail soon"

From Theo@21:1/5 to jkn on Thu Apr 28 10:12:09 2022

jkn <jkn_gg@nicorp.f9.co.uk> wrote:

So a few thoughts:

- any idea if this is related to my recent upgrade? ie. new feature in [K]Ubuntu 22.04?
- is this likely to be a real issue, or an over-zealous warning?

I checked my NVMe and I don't have an 'error log' section. I don't know
what yours means. You could try nvme-cli (that's the package in Ubuntu) eg:

$ sudo nvme error-log /dev/nvme0n1
$ sudo nvme smart-log /dev/nvme0n1
(other *-log commands available)

and see if it reports anything interesting. eg for me smart-log says:

$ sudo nvme smart-log /dev/nvme1n1
Smart Log for NVME device:nvme1n1 namespace-id:ffffffff
critical_warning : 0
temperature : 25 C
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 10%
endurance group critical warning summary: 0

so I seem to have used 10% of my write endurance (I think).

It is possible doing an upgrade has eaten some of your available writes and pushed it over some threshold.

I am thinking of doing two things: buying a new/larger (1TB) M.2 drive, and dd'ing everything over; and upgrading the firmware on this Crucial M.2 drive

- Is it a particularly risky operation to update the M.2 firmware without backing up the drive first?

In theory it shouldn't be a risk to update the firmware (it happens in production all the time), but if the drive is exhibiting failure signs I'd
want to make a backup first just in case.

Theo

(who hadn't come across nvme-cli before and thinks it could be a useful way
of using cheaper NVMe in servers and replacing drives when they start
running out of writes)
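
The monitoring idea in the parenthetical above could be sketched roughly as follows. This is only an illustration: the function names and the 80% replacement limit are assumptions, not anything nvme-cli or a drive vendor prescribes, and the parsing relies on the "percentage_used : N%" line format shown in the smart-log output earlier in this message.

```shell
#!/bin/sh
# Extract the percentage_used figure from `nvme smart-log` text on stdin.
percentage_used() {
    awk -F: '/^percentage_used/ { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Report whether wear has reached a replacement limit.
# The default of 80 is an assumed policy value, not a vendor figure.
check_wear() {
    used=$1
    limit=${2:-80}
    if [ "$used" -ge "$limit" ]; then
        echo "replace soon: ${used}% of rated endurance used"
    else
        echo "ok: ${used}% of rated endurance used"
    fi
}

# Sample line copied from the smart-log output quoted above.
sample='percentage_used : 10%'
check_wear "$(printf '%s\n' "$sample" | percentage_used)"

# On real hardware (device path is an assumption):
#   sudo nvme smart-log /dev/nvme0n1 | percentage_used
```

Run from cron on each server, something like this would flag drives before they run out of rated writes. The right limit depends on how much warning you want.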

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From jkn@21:1/5 to All on Thu Apr 28 01:46:09 2022

Hi all
Just yesterday I upgraded my main desktop from Kubuntu 20.04 to 22.04 LTS. All went pretty well ... except this morning when I powered it up I got a worrying desktop warning message along the lines of:

The Storage Device /dev/nvme0n1 is likely to fail soon

erk!

I have not seen this before, but TBH yesterday was the first time I have powered this machine off in a long time; I'm checking 'quiescent' household power consumption.

I did a quick "sudo smartctl /dev/nvme0n1 -a" which doesn't look too bad, although I haven't delved deep. See below for the output.

So a few thoughts:

- any idea if this is related to my recent upgrade? ie. new feature in [K]Ubuntu 22.04?
- is this likely to be a real issue, or an over-zealous warning?

I am thinking of doing two things: buying a new/larger(1TB) M.2 drive, and dd'ing everything over; and upgrading the firmware on this Crucial M.2 drive

- Is it a particularly risky operation to update the M.2 firmware without backing up the drive first?

Thanks for any thoughts
Jon N

{{{ output from: sudo smartctl /dev/nvme0n1 -a
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.0-27-generic] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number: CT500P2SSD8
Serial Number: 2043E4BD82DC
Firmware Version: P2CR010
PCI Vendor/Subsystem ID: 0xc0a9
IEEE OUI Identifier: 0x6479a7
Total NVM Capacity: 500,107,862,016 [500 GB]
Unallocated NVM Capacity: 0
Controller ID: 1
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 500,107,862,016 [500 GB]
Namespace 1 Formatted LBA Size: 512
Namespace 1 IEEE EUI-64: 6479a7 fff0000000
Local Time is: Thu Apr 28 09:39:36 2022 BST
Firmware Updates (0x12): 1 Slot, no Reset required
Optional Admin Commands (0x001f): Security Format Frmw_DL NS_Mngmt Self_Test
Optional NVM Commands (0x005e): Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x0e): Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg
Maximum Data Transfer Size: 64 Pages
Warning Comp. Temp. Threshold: 70 Celsius
Critical Comp. Temp. Threshold: 85 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.50W       -        -    0  0  0  0        0       0
 1 +     2.70W       -        -    1  1  1  1        0       0
 2 +     2.16W       -        -    2  2  2  2        0       0
 3 -   0.0700W       -        -    3  3  3  3     1000    1000
 4 -   0.0020W       -        -    4  4  4  4     5000   55000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 39 Celsius
Available Spare: 100%
Available Spare Threshold: 5%
Percentage Used: 0%
Data Units Read: 1,140,109 [583 GB]
Data Units Written: 2,357,594 [1.20 TB]
Host Read Commands: 14,832,947
Host Write Commands: 25,228,486
Controller Busy Time: 11,685
Power Cycles: 236
Power On Hours: 8,833
Unsafe Shutdowns: 38
Media and Data Integrity Errors: 0
Error Information Log Entries: 275
Warning Comp. Temperature Time: 0
Critical Comp. Temperature Time: 0

Error Information (NVMe Log 0x01, 16 of 16 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS
  0        275     0  0x1008  0x4005  0x028            0     0     -
}}}
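
For reading the "Data Units" counters in the output above: an NVMe data unit is 1000 blocks of 512 bytes, so the bracketed figures smartctl prints can be checked by hand. A minimal sketch follows; the function names are made up for illustration.

```shell
#!/bin/sh
# Convert an NVMe "data units" counter to bytes.
# One data unit is 1000 blocks of 512 bytes, i.e. 512,000 bytes.
units_to_bytes() {
    echo $(( $1 * 512000 ))
}

# Same figure in whole decimal gigabytes, rounded down.
units_to_gb() {
    echo $(( $1 * 512000 / 1000000000 ))
}

units_to_gb 1140109   # Data Units Read from the output above
units_to_gb 2357594   # Data Units Written from the output above
```

This prints 583 and 1207, matching the "[583 GB]" and "[1.20 TB]" that smartctl shows. About 1.2 TB written to a 500 GB drive is consistent with "Percentage Used: 0%".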


From Theo@21:1/5 to jkn on Thu Apr 28 11:14:04 2022

jkn <jkn_gg@nicorp.f9.co.uk> wrote:

I think I will press on with buying a new 1TB M.2 drive (I was thinking of doing that anyway, as it happens), and updating the firmware on this one
only after I have dd'd everything over and swapped to the new one.

Sounds reasonable.

Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off on Amazon on the Crucial P2 I have here...

Samsung Evo are my standard fit. I've also been using Sabrent as they have been better at producing PCIe Gen4 drives at a decent price, although I'm a
bit more uncertain about reliability. (I have 8 of them in a server, everything is fine thus far...)

I would avoid QLC drives (cheap but slow, sometimes HDD-slow). TLC isn't a great deal more expensive.

Previously I would have aimed for an SSD with DRAM rather than a DRAM-less
one but they seem to be harder to find these days. DRAMless is probably
fine unless you're serving databases or similar.

Theo


From jkn@21:1/5 to Theo on Thu Apr 28 02:24:04 2022

Thanks a lot Theo, very useful.
I installed nvme-cli and get this:

{{{ $ sudo nvme error-log /dev/nvme0n1
Error Log Entries for device:nvme0n1 entries:16
.................
Entry[ 0]
.................
error_count : 275
sqid : 0
cmdid : 0x1008
status_field : 0x2002(INVALID_FIELD: A reserved coded value or an unsupported value in a defined field)
phase_tag : 0x1
parm_err_loc : 0x28
lba : 0
nsid : 0
vs : 0
trtype : The transport type is not indicated or the error is not transport related.
cs : 0
trtype_spec_info: 0

# (all other log entries seem 'empty')
}}}
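
The status values differ between the two tools (smartctl shows 0x4005, nvme-cli shows 0x2002), but they appear to be the same entry. As far as the NVMe completion status layout goes, smartctl prints the raw 16-bit word with the phase tag in bit 0, while nvme-cli reports the phase tag separately and shifts it out. A small sketch of that decoding, assuming the layout phase tag (bit 0), status code (bits 8:1), status code type (bits 11:9), More (bit 14) and Do Not Retry (bit 15); the function name is hypothetical:

```shell
#!/bin/sh
# Split a raw 16-bit NVMe completion status word (as smartctl prints it)
# into the fields nvme-cli reports separately.
decode_status() {
    raw=$(( $1 ))
    echo "phase_tag=$(( raw & 1 ))"
    printf 'status_field=0x%04x\n' $(( raw >> 1 ))
    printf 'sc=0x%02x\n' $(( (raw >> 1) & 0xff ))
    echo "sct=$(( (raw >> 9) & 0x7 ))"
    echo "more=$(( (raw >> 14) & 1 ))"
    echo "dnr=$(( (raw >> 15) & 1 ))"
}

decode_status 0x4005   # Status column from the smartctl error log
```

For 0x4005 this gives phase_tag=1 and status_field=0x2002, matching the nvme-cli entry, with status code 0x02 in the generic status type, which nvme-cli names INVALID_FIELD. Since sqid 0 is the admin queue and media_errors is 0, this looks more like the drive rejecting an admin command it does not support than a media fault, though that is an inference from the log and not something either tool states.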

{{{ $ sudo nvme smart-log /dev/nvme0n1
Smart Log for NVME device:nvme0n1 namespace-id:ffffffff
critical_warning : 0
temperature : 31 C (304 Kelvin)
available_spare : 100%
available_spare_threshold : 5%
percentage_used : 0%
endurance group critical warning summary: 0
data_units_read : 1,140,151
data_units_written : 2,357,758
host_read_commands : 14,833,879
host_write_commands : 25,231,374
controller_busy_time : 11,687
power_cycles : 236
power_on_hours : 8,834
unsafe_shutdowns : 38
media_errors : 0
num_err_log_entries : 275
Warning Temperature Time : 0
Critical Composite Temperature Time : 0
Thermal Management T1 Trans Count : 0
Thermal Management T2 Trans Count : 0
Thermal Management T1 Total Time : 0
Thermal Management T2 Total Time : 0
}}}

It is possible doing an upgrade has eaten some of your available writes and pushed it over some threshold.

That is a good thought...

I think I will press on with buying a new 1TB M.2 drive (I was thinking of doing
that anyway, as it happens), and updating the firmware on this one
only after I have dd'd everything over and swapped to the new one.

Any recommendations for a decent M.2 1TB drive? I see a lot of slagging off
on Amazon on the Crucial P2 I have here...

Thanks, J^n


From Theo@21:1/5 to Theo on Thu Apr 28 11:33:03 2022

Theo <theom+news@chiark.greenend.org.uk> wrote:

Previously I would have aimed for an SSD with DRAM rather than a DRAM-less one but they seem to be harder to find these days. DRAMless is probably
fine unless you're serving databases or similar.

One other thing... if it's going to be under any kind of intense workload
(eg compiling) what I tend to do is look for 'performance consistency'
graphs, eg this is a cheap and old drive:

https://www.anandtech.com/show/9258/crucial-mx200-250gb-500gb-1tb-ssd-review/2

You can see the IOPS fall off a cliff once the buffer cache is exhausted.

One way to improve this is to leave some portion of the drive unwritten - eg partition it to 900GB not 1TB and leave the last 100GB as unwritten blocks. This gives the drive a bit more breathing space as it can have more spare blocks to play with. Anandtech's benchmarks sometimes incorporate such overprovisioning, eg: https://www.anandtech.com/show/9451/the-2tb-samsung-850-pro-evo-ssd-review/2

You probably don't care about performance to that level but it's something I look at when selecting drives.
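
The arithmetic behind the "900GB not 1TB" suggestion is simple enough to sketch. The function name is made up, and the parted line in the comment is illustrative only: the device path is a placeholder, it assumes the new drive already has a partition table, and it is not something to run against a drive holding data.

```shell
#!/bin/sh
# Size of the partition to create when leaving a share of the drive unwritten.
#   $1 = drive capacity in GB
#   $2 = percentage of the drive to leave unpartitioned
usable_gb() {
    echo $(( $1 * (100 - $2) / 100 ))
}

usable_gb 1000 10   # 1TB drive, last 10% left alone

# Illustrative only (placeholder device, destructive on a drive in use):
#   sudo parted /dev/nvmeXn1 mkpart primary 1MiB 900GB
```

For a 1TB drive with 10% held back this prints 900. The spare area only helps if those blocks really are unwritten, so it is best done on a new or freshly trimmed drive.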

Theo


From jkn@21:1/5 to Theo on Thu Apr 28 04:07:34 2022

Thanks Theo, that's all very useful. I do do some semi-intensive compiling so the point is well taken.

As it happens I've just placed an order for a Samsung 970 EVO Plus (1TB),
I will take your advice and leave some part unpartitioned.

Anyone suggest any semi-decent NVME USB-C enclosures to move my
Crucial P2 drive to once I have updated the firmware etc?

Thanks again, Jon N

