Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...
On the server that has no issues:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 512 bytes
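Those numbers come from `lsblk -o NAME,LOG-SEC,PHY-SEC` or from /sys/block/&lt;dev&gt;/queue/{logical,physical}_block_size. As a minimal sketch (the helper name is mine, not a real tool), classifying a drive from the two values:

```shell
# classify_sector is a hypothetical helper: given logical and physical
# sector sizes, name the drive format. On a live system the inputs come
# from lsblk -o NAME,LOG-SEC,PHY-SEC or
# /sys/block/<dev>/queue/{logical,physical}_block_size.
classify_sector() {
    local log=$1 phy=$2
    if [ "$log" -eq 512 ] && [ "$phy" -eq 512 ]; then
        echo "512n (native 512-byte sectors)"
    elif [ "$log" -eq 512 ] && [ "$phy" -eq 4096 ]; then
        echo "512e (4k physical, 512-byte emulation)"
    elif [ "$log" -eq 4096 ]; then
        echo "4Kn (native 4k sectors)"
    else
        echo "unknown"
    fi
}
classify_sector 512 512    # both disks in the good server
classify_sector 512 4096   # sdb in the problem server: 512e emulation
```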
These are then gpt partitioned, with a small BIOS boot partition, an EFI
partition, and then a big "Linux filesystem" partition that is part of an mdadm raid:
md0 : active raid1 sda3[3] sdb3[2]
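When a mirror is suspected of returning different data from its two halves, md's built-in consistency check is one way to test for that. A sketch (root required on the live system; md0 as named above) that prints the commands rather than running them:

```shell
# Commands for an md RAID1 consistency check; run them as root on the
# live system. mismatch_cnt > 0 means the two halves disagree.
raid_check_cmds='
echo check > /sys/block/md0/md/sync_action
grep -A2 md0 /proc/mdstat
cat /sys/block/md0/md/mismatch_cnt
'
printf '%s' "$raid_check_cmds"
```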
On the server that has performance issues and occasional data corruption (both reading and writing) under heavy (disk) load:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 4096 bytes
All the
partitions start on a 4k boundary but the big partition is not an exact multiple of 4k.
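That alignment claim is simple arithmetic: partition start and size in 512-byte sectors (as in /sys/block/sda/sda3/{start,size}) are 4k-clean when divisible by 8. A sketch with a hypothetical helper and made-up values:

```shell
# aligned_4k is a hypothetical helper: takes a partition's start and size
# in 512-byte sectors (as in /sys/block/sda/sda3/{start,size}) and reports
# whether each is a multiple of 4 KiB (= 8 sectors).
aligned_4k() {
    [ $(( $1 % 8 )) -eq 0 ] && echo "start 4k-aligned" || echo "start NOT 4k-aligned"
    [ $(( $2 % 8 )) -eq 0 ] && echo "size a multiple of 4k" || echo "size NOT a multiple of 4k"
}
aligned_4k 4096 7813971935   # hypothetical values: aligned start, odd length
```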
... the "heavy load" filesystem that triggered the issue ...
There are a LOT of
partitions and filesystems in a complicated layered LVM setup ...
Booted on the problem machine but physical disk still on the OK machine: real 0m35.731s
user 0m5.291s
sys 0m4.677s
Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s
The SMART attributes from the problem machine:
sda:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033 100   100   010    Pre-fail Always  -           0
  9 Power_On_Hours          0x0032 096   096   000    Old_age  Always  -           18280
 12 Power_Cycle_Count       0x0032 099   099   000    Old_age  Always  -           54
177 Wear_Leveling_Count     0x0013 087   087   000    Pre-fail Always  -           129
179 Used_Rsvd_Blk_Cnt_Tot   0x0013 100   100   010    Pre-fail Always  -           0
181 Program_Fail_Cnt_Total  0x0032 100   100   010    Old_age  Always  -           0
182 Erase_Fail_Count_Total  0x0032 100   100   010    Old_age  Always  -           0
183 Runtime_Bad_Block       0x0013 100   100   010    Pre-fail Always  -           0
187 Uncorrectable_Error_Cnt 0x0032 100   100   000    Old_age  Always  -           0
190 Airflow_Temperature_Cel 0x0032 067   049   000    Old_age  Always  -           33
195 ECC_Error_Rate          0x001a 200   200   000    Old_age  Always  -           0
199 CRC_Error_Count         0x003e 100   100   000    Old_age  Always  -           0
235 POR_Recovery_Count      0x0012 099   099   000    Old_age  Always  -           39
241 Total_LBAs_Written      0x0032 099   099   000    Old_age  Always  -           62154466086
sdb:
ID# ATTRIBUTE_NAME          FLAG   VALUE WORST THRESH TYPE     UPDATED WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f 100   100   000    Pre-fail Always  -           0
  5 Reallocate_NAND_Blk_Cnt 0x0032 100   100   010    Old_age  Always  -           0
  9 Power_On_Hours          0x0032 100   100   000    Old_age  Always  -           18697
 12 Power_Cycle_Count       0x0032 100   100   000    Old_age  Always  -           50
171 Program_Fail_Count      0x0032 100   100   000    Old_age  Always  -           0
172 Erase_Fail_Count        0x0032 100   100   000    Old_age  Always  -           0
173 Ave_Block-Erase_Count   0x0032 067   067   000    Old_age  Always  -           433
174 Unexpect_Power_Loss_Ct  0x0032 100   100   000    Old_age  Always  -           12
180 Unused_Reserve_NAND_Blk 0x0033 000   000   000    Pre-fail Always  -           45
183 SATA_Interfac_Downshift 0x0032 100   100   000    Old_age  Always  -           0
184 Error_Correction_Count  0x0032 100   100   000    Old_age  Always  -           0
187 Reported_Uncorrect      0x0032 100   100   000    Old_age  Always  -           0
194 Temperature_Celsius     0x0022 074   052   000    Old_age  Always  -           26 (Min/Max 0/48)
196 Reallocated_Event_Count 0x0032 100   100   000    Old_age  Always  -           0
197 Current_Pending_ECC_Cnt 0x0032 100   100   000    Old_age  Always  -           0
198 Offline_Uncorrectable   0x0030 100   100   000    Old_age  Offline -           0
199 UDMA_CRC_Error_Count    0x0032 100   100   000    Old_age  Always  -           1
202 Percent_Lifetime_Remain 0x0030 067   067   001    Old_age  Offline -           33
206 Write_Error_Rate        0x000e 100   100   000    Old_age  Always  -           0
210 Success_RAIN_Recov_Cnt  0x0032 100   100   000    Old_age  Always  -           0
246 Total_LBAs_Written      0x0032 100   100   000    Old_age  Always  -           63148678276
247 Host_Program_Page_Count 0x0032 100   100   000    Old_age  Always  -           1879223820
248 FTL_Program_Page_Count  0x0032 100   100   000    Old_age  Always  -           1922002147
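One way to triage tables like these mechanically (a sketch; the attribute-name patterns and helper name are mine) is to flag any error-type attribute with a non-zero raw value, which immediately surfaces sdb's UDMA_CRC_Error_Count of 1:

```shell
# Flag error-type SMART attributes with a non-zero raw value. Feed it the
# output of `smartctl -A /dev/sdX`; two sample rows are inlined here.
flag_smart_errors() {
    awk '/Error|Uncorrect|Pending|Realloc|Bad_Block/ && $NF+0 > 0 {print $2, "raw =", $NF}'
}
printf '%s\n' \
  '199 UDMA_CRC_Error_Count    0x0032 100 100 000 Old_age Always - 1' \
  '  5 Reallocated_Sector_Ct   0x0033 100 100 010 Pre-fail Always - 0' |
  flag_smart_errors
```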
On 1/20/24 08:25, Tim Woodall wrote:
Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...
Please post a console session that demonstrates, or at least documents, the data corruption.
Please post console sessions that document the make and model of your disks, their partition tables, your md RAID configurations, and your LVM configurations.
Putting a sector size 512/512 disk and a sector size 512/4096 disk into the same mirror is unconventional. I suppose there are kernel developers who could definitively explain the consequences, but I am not one of them. The KISS solution is to use matching disks in RAID.
All the
partitions start on a 4k boundary but the big partition is not an exact
multiple of 4k.
I align my partitions to 1 MiB boundaries and suggest that you do the same.
... the "heavy load" filesystem that triggered the issue ...
Please post a console session that demonstrates how data corruption is related to I/O throughput.
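For documenting throughput alongside the corruption reports, a crude sequential-write probe could look like this (a sketch: TARGET here is a scratch file, but on the real system it should be a path on the suspect filesystem, with a size big enough to defeat caching):

```shell
# Rough sequential-write probe: dd reports elapsed time and MB/s itself
# once conv=fsync forces the data out. TARGET is a scratch file here;
# point it at the filesystem under test for a meaningful number.
TARGET=$(mktemp)
dd if=/dev/zero of="$TARGET" bs=1M count=64 conv=fsync
rm -f "$TARGET"
```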
There are a LOT of
partitions and filesystems in a complicated layered LVM setup ...
Complexity is the enemy of data integrity and system reliability. I suggest simplifying where it makes sense; but do not over-simplify.
The fast one above is running apt-get remove --purge.
Booted on the problem machine but physical disk still on the OK machine:
real 0m35.731s
user 0m5.291s
sys 0m4.677s
Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s
Please provide host names.
The disk that I'm using when I saw the above error is a straight LVM ->
iscsi -> ext3 mounted like this:
/dev/xvdb on /mnt/mirror/ftp/mirror type ext3 (rw,noatime)
Tim.
On Sat, 20 Jan 2024, David Christensen wrote:
On 1/20/24 08:25, Tim Woodall wrote:
Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...
Please post a console session that demonstrates, or at least
documents, the data corruption.
Console session is difficult - this is a script that takes around 6
hours to run - but a typical example of corruption is something like
this:
Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
 cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream
The checksum will have been verified by apt during the download, but when
it comes to read the downloaded deb to unpack and install, it doesn't get
the same data. The corruption can happen both when writing (the file on
disk is corrupted) and when reading (the file on disk has the correct checksum)
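That read-after-write verification can be scripted (a sketch on a scratch file; on the real system you would hash the .deb under /var/cache/apt/archives and, as root, drop the page cache between the two hashes so the second read really comes from disk):

```shell
# Hash a file twice and compare: a mismatch means write or read
# corruption. Demonstrated on a 1 MiB scratch file of random data.
f=$(mktemp)
head -c 1048576 /dev/urandom > "$f"
sum1=$(sha256sum "$f" | cut -d' ' -f1)
# sync && echo 3 > /proc/sys/vm/drop_caches   # root-only: force a re-read from disk
sum2=$(sha256sum "$f" | cut -d' ' -f1)
[ "$sum1" = "$sum2" ] && echo "read-back OK" || echo "read-back CORRUPT"
rm -f "$f"
```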
TL;DR: there was a firmware bug in a disk in the raid array resulting in
data corruption. A subsequent kernel workaround dramatically reduced
disk performance (probably just writes, but I didn't confirm).
Initially, under heavy disk load I got errors like:
Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive
'/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb'
(size=4015516) member 'data.tar': lzma error: compressed data is corrupt
dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive
/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb
(--unpack):
cannot copy extracted data for
'./usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to
'/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected
end of file or stream
The checksum will have been verified by apt during the download, but when
it comes to read the downloaded deb to unpack and install, it doesn't get
the same data. The corruption can happen both when writing (the file on
disk is corrupted) and when reading (the file on disk has the correct
checksum)
A second problem I got was 503 errors from apt-cacher-ng (which ran on
the same machine as the above error)
I initially assumed this was due to faulty memory, or possibly a faulty
CPU. But I assumed memory because the disk errors were happening in a VM
and no other VMs were affected. Because I always start the same VMs in
the same order I assumed they'd be using the same physical memory each
time.
However, nothing I could do would help track down where the memory
problem was. Everything worked perfectly except when using the disk
under load.
At this time I spent a significant amount of time migrating everything important, including the big job that triggered this problem, off this machine onto its paired machine. After that the corruption problems went away but
I continued to get periodic 503 errors from apt-cacher-ng.
I continued to worry at this on and off but failed to make any progress
in finding what was wrong. The version of the motherboard is no longer
available, otherwise I'd probably have bought another one. During this
time I also spent quite a lot of time ensuring that it was much easier
to move VMs between my two machines. I'd underestimated how tricky this
would be if the dodgy machine failed totally, which I became aware of
when I did migrate the VM having problems.
Late last year or early this year someone (possibly Andy Smith?) posted
a question about logical/physical sector sizes on SSDs. That set me off
investigating again, as that's not something I'd thought of. That didn't
prove fruitful either, but I did notice this in the kernel logs:
Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support
from libata-core.c
{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
ATA_HORKAGE_NO_NCQ_ON_ATI },
This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)
At this point I'd discovered that the big job that had been regularly
hitting corruption issues now completed. However, it was taking 19 hours instead of 11 hours.
I ordered some new disks - I'd assumed both disks were affected, but
while writing this I notice that "disabling queued TRIM support"
prints twice for the same disk, not once per disk.
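Counting those messages per ATA port makes the once-vs-twice question unambiguous (the two sample lines are inlined from the log above; on a live system pipe in `dmesg` instead):

```shell
# Count "disabling queued TRIM support" messages per ata port.
printf '%s\n' \
  '[    3.802581] ata1.00: disabling queued TRIM support' \
  '[    3.805074] ata1.00: disabling queued TRIM support' |
  grep 'disabling queued TRIM' | grep -o 'ata[0-9][0-9.]*' | sort | uniq -c
```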
I thought one of these was my disk but looking again now I see I had a 1000MX500, which doesn't actually match.
{ "Crucial_CT*M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM },
{ "Crucial_CT*MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM },
While waiting for my disks I started looking at the apt-cacher-ng
503 problem - which has continued to bug me. I got lucky and discovered
a way I could almost always trigger it.
I managed to track that down to a race condition when updating the
Release files if multiple machines request the same file at the same
moment.
After finding a fix I found this bug reporting the same problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022043
There is now a patch attached to that bug that I've been running for a
few weeks without a single 503 error.
And Sunday I replaced the two disks with new ones. Today that big job completed in 10h15m.
Another thing I notice, although I'm not sure I understand what is going
on, is that my iscsi disks all have
Thin-provisioning: No
This means that fstrim on the vm doesn't work. Switching them to Yes
makes it work. So I'm not exactly sure where the queued trim was coming
from in the first place.
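A device (or exported LUN) advertises discard support through a non-zero discard granularity, visible via `lsblk -Do NAME,DISC-GRAN,DISC-MAX` or /sys/block/&lt;dev&gt;/queue/discard_granularity. A tiny sketch (helper name and sample values mine):

```shell
# discard_ok: interpret a discard_granularity value; 0 means the device
# does not accept discard/TRIM, which is what a "Thin-provisioning: No"
# iSCSI LUN reports to the initiator.
discard_ok() {
    [ "$1" -gt 0 ] && echo "discard supported (granularity $1)" \
                   || echo "discard NOT supported"
}
discard_ok 4096   # e.g. a thin-provisioned LUN
discard_ok 0      # e.g. the same LUN with Thin-provisioning: No
```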
Are you using systemd ?
No, I'm not
You should not be running trim in a container/virtual machine
Here is some info: https://wiki.archlinux.org/title/Solid_state_drive
On Mon, 26 Feb 2024, Gremlin wrote:
Are you using systemd ?
No, I'm not
You should not be running trim in a container/virtual machine
Why not? That's, in my case, basically saying "you should not be running
trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
to understand why. Enabling thin_provisioning and fstrim works and gets mapped to the underlying layers all the way down to the SSD.
My underlying VG is less than 50% occupied, so I can trim the free space
by creating a LV and then removing it again (I have issue_discards set)
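That trick can be sketched like so (the VG name is hypothetical, and the block is guarded to only print the commands - clear DRY_RUN to actually run them; note blkdiscard is destructive to the throwaway LV):

```shell
# Trim a VG's free space: carve the free extents into a scratch LV,
# discard it, remove it. With issue_discards=1 in lvm.conf, lvremove
# alone also discards; the explicit blkdiscard makes the intent obvious.
VG=vg0                       # hypothetical VG name
run() { if [ -n "${DRY_RUN:-1}" ]; then echo "+ $*"; else "$@"; fi; }
run lvcreate -l 100%FREE -n trimtmp "$VG"
run blkdiscard "/dev/$VG/trimtmp"
run lvremove -f "/dev/$VG/trimtmp"
```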
FWIW, I did issue fstrim in the VMs with no visible issues at all.
Perhaps I got lucky?
Here is some info: https://wiki.archlinux.org/title/Solid_state_drive
I don't see VM or virtual machine anywhere on that page.
On Mon, 26 Feb 2024, Gremlin wrote:
re running fstrim in a vm.
The Host system takes care of it
I guess you've no idea what iscsi is. Because this makes no sense at
all. systemd or no systemd. The physical disk doesn't have to be
something the host system knows anything about.
Here's a thread of someone wanting to do fstrim from a vm with iscsi
mounted disks.
https://serverfault.com/questions/1031580/trim-unmap-zvol-over-iscsi
And another page suggesting you should.
https://gist.github.com/hostberg/86bfaa81e50cc0666f1745e1897c0a56
8.10.2. Trim/Discard
It is good practice to run fstrim (discard) regularly on VMs and
containers. This releases data blocks that the filesystem isn't using
anymore. It reduces data usage and resource load. Most modern operating
systems issue such discard commands to their disks regularly. You only
need to ensure that the Virtual Machines enable the disk discard option.
I would guess that if you use sparse file backed storage to a vm you'd
want the vm to run fstrim too but this isn't a setup I've ever used so perhaps it's nonsense.
You should not be running trim in a container/virtual machine
Why not? That's, in my case, basically saying "you should not be running
trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
to understand why. Enabling thin_provisioning and fstrim works and gets
mapped to the underlying layers all the way down to the SSD.
I guess you didn't understand the systemd timer that runs fstrim on
the host.
The Host system takes care of it
Hi,
On Mon, Feb 26, 2024 at 06:25:53PM +0000, Tim Woodall wrote:
Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support
from libata-core.c
{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
ATA_HORKAGE_NO_NCQ_ON_ATI },
This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)
That's interesting. I have quite a few of these drives and haven't
noticed any problems. What kernel version introduced the above
workarounds?
$ sudo lsblk -do NAME,MODEL
NAME MODEL
sda SAMSUNG_MZ7KM1T9HAJM-00005
sdb SAMSUNG_MZ7KM1T9HAJM-00005
sdc Samsung_SSD_870_EVO_4TB
sdd Samsung_SSD_870_EVO_4TB
sde ST4000LM016-1N2170
sdf ST4000LM016-1N2170
sdg SuperMicro_SSD
sdh SuperMicro_SSD
Thanks,
Andy