Forum: >>> Magnum BBS <<<

Disk corruption and performance issue.

From Tim Woodall@21:1/5 to All on Sat Jan 20 17:30:01 2024

This is rather long - so if you're replying to just one bit, please
consider trimming the parts that you're not responding to to make
everybody's life a little bit better!

Some time ago I wrote about a data corruption issue. I've still not
managed to track it down but I have two new datapoints one (inspired but
a recent thread) and I'm hoping someone will have ideas how I should
move forward. By avoiding heavy disk load (and important tasks/jobs!) on
the problem machine I've had no more data corruption. There are no errors/warnings anywhere. A part of me is suspecting a faulty SSD!

I have new disks on order so I can replace the existing disks soon if
that's what it will need to fix this.

Inspired from the recent thread:
On the server that has no issues:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 512 bytes

These are then gpt partitioned, a small BIOS boot and EFI partition and
then a big "Linux filesystem" partition that is part of a mdadm raid

md0 : active raid1 sda3[3] sdb3[2]

On the server that has performance issues and I get occasional data
corruption (both reading and writing) under heavy (disk) load:

sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

I'm wondering if that physical sector size is the issue. All the
partitions start on a 4k boundary but the big partition is not an exact multiple of 4k. Inside the raid is a LVM PV so I think everything is 4K
aligned anyway except the filesystems themselves and the "heavy load" filesystem that triggered the issue uses 4k blocks. But I don't know if something somewhere has "padding" so that the actual data doesn't
actually start on a 4k boundary on the disk. There are a LOT of
partitions and filesystems in a complicated layered LVM setup so it will
be easier for me to check with instructions than to try to provide the
data for someone else to check - if someone can give me instructions to
work out exactly where the data ends up on the disk. (all partitions are formatted with ext3)

The remaining setup is identical

The new disks are the same make and model as sdb in this server - I hope
that's not a problem!

The second datapoint. My VMs all use iscsi to provide their disk.
Normally the vm runs on the same server as the iscsi target but today I
did a kernel upgrade on a pair of vms (the one on the "problem" machine
took about twice as long) and then "cross booted" them and purged the
old kernel. I actually took timings here:

Booted on the problem machine but physical disk still on the OK machine:
real 0m35.731s
user 0m5.291s
sys 0m4.677s

Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s

I was running these at the same time - which I think rules out cpu
issues. (I've done other tests that also suggest that cpu/memory isn't the issue, it seems to be disk, cabling etc).

The SMART attributes from the problem machine:
sda:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 18280
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 54
177 Wear_Leveling_Count 0x0013 087 087 000 Pre-fail Always - 129
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age Always - 33
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 39
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 62154466086

sdb:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18697
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age Always - 433
174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 12
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 45
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
194 Temperature_Celsius 0x0022 074 052 000 Old_age Always - 26 (Min/Max 0/48)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 1
202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age Offline - 33
206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 63148678276
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1879223820
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1922002147

Does anything leap out at anyone? Anything I should try next? Normally I
try and avoid having disks bought at the same time from the same brand
paired together but I'll give that a try if it will fix this.

Tim.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Christensen@21:1/5 to Tim Woodall on Sat Jan 20 22:20:01 2024

On 1/20/24 08:25, Tim Woodall wrote:

Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...

Please post a console session that demonstrates, or at least documents,
the data corruption.

Please cut and paste complete console sessions into your posts --
prompt, command entered, output displayed. Redact sensitive information.

It helps if your prompt contains useful information. I set PS1 in $HOME/.profile as follows:

2024-01-20 11:31:58 dpchrist@laalaa ~
$ grep PS1 .profile | grep -v '#'
export PS1='\n\D{%Y-%m-%d %H:%M:%S} ${USER}@\h \w\n\$ '

On the server that has no issues:
sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 512 bytes

Attempting to diagnose issues without all the facts is an exercise in
futility.

Please post console sessions that document the make and model of your
disks, their partition tables, your md RAID configurations, and your LVM configurations.

These are then gpt partitioned, a small BIOS boot and EFI partition and
then a big "Linux filesystem" partition that is part of a mdadm raid

md0 : active raid1 sda3[3] sdb3[2]

On the server that has performance issues and I get occasional data corruption (both reading and writing) under heavy (disk) load:

sda: Sector size (logical/physical): 512 bytes / 512 bytes
sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

Putting a sector size 512/512 disk and a sector size 512/4096 disk into
the same mirror is unconventional. I suppose there are kernel
developers who could definitively explain the consequences, but I am not
one of them. The KISS solution is to use matching disks in RAID.

All the
partitions start on a 4k boundary but the big partition is not an exact multiple of 4k.

I align my partitions to 1 MiB boundaries and suggest that you do the same.

... the "heavy load" filesystem that triggered the issue ...

Please post a console session that demonstrates how data corruption is
related to I/O throughput.

There are a LOT of
partitions and filesystems in a complicated layered LVM setup ...

Complexity is the enemy of data integrity and system reliability. I
suggest simplifying where it makes sense; but do not over-simplify.

Booted on the problem machine but physical disk still on the OK machine: real 0m35.731s
user 0m5.291s
sys 0m4.677s

Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s

Please provide host names.

Please post a console session that demonstrates how data corruption
affects VM boot time.

The SMART attributes from the problem machine:
sda:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail
Always - 0> 12 Power_Cycle_Count 0x0032 099

099 000 Old_age

Always - 54> 179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100

100 010 Pre-fail

Always - 0> 181 Program_Fail_Cnt_Total 0x0032 100

100 010 Old_age

Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age
Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail
Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age
Always - 0
190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age
Always - 33
195 ECC_Error_Rate 0x001a 200 200 000 Old_age
Always - 0
199 CRC_Error_Count 0x003e 100 100 000 Old_age
Always - 0

Those look good.

9 Power_On_Hours 0x0032 096 096 000 Old_age
Always - 18280> 177 Wear_Leveling_Count 0x0013 087

087 000 Pre-fail

Always - 129> 241 Total_LBAs_Written 0x0032 099

099 000 Old_age

Always - 62154466086

Please compare those to the SSD specifications.

235 POR_Recovery_Count 0x0012 099 099 000 Old_age
Always - 39

https://www.overclock.net/threads/what-does-por-recovery-count-mean-in-samsung-magician.1491466/

I see a similar statistic on my Intel SSD 520 Series drives:

12 Power_Cycle_Count -O--CK 099 099 000 - 1996
174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 1994

Linux does not seem to shut down the drives the way they want to be shut
down.

sdb:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE
UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail
Always - 0
5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age
Always - 0> 12 Power_Cycle_Count 0x0032 100

100 000 Old_age

Always - 50
171 Program_Fail_Count 0x0032 100 100 000 Old_age
Always - 0
172 Erase_Fail_Count 0x0032 100 100 000 Old_age
Always - 0
183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age
Always - 0
184 Error_Correction_Count 0x0032 100 100 000 Old_age
Always - 0
187 Reported_Uncorrect 0x0032 100 100 000 Old_age
Always - 0
194 Temperature_Celsius 0x0022 074 052 000 Old_age
Always - 26 (Min/Max 0/48)
196 Reallocated_Event_Count 0x0032 100 100 000 Old_age
Always - 0
197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age
Always - 0
198 Offline_Uncorrectable 0x0030 100 100 000 Old_age
Offline - 0
206 Write_Error_Rate 0x000e 100 100 000 Old_age
Always - 0
210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age
Always - 0

Those look good.

199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age
Always - 1

I believe that indicates a SATA communications problem. I suggest using
SATA cables that are rated for SATA III 6 Gbps with locking connectors.
If you are in doubt, buy new cables that are properly identified.

9 Power_On_Hours 0x0032 100 100 000 Old_age
Always - 18697
173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age
Always - 433
180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail
Always - 45
246 Total_LBAs_Written 0x0032 100 100 000 Old_age
Always - 63148678276
247 Host_Program_Page_Count 0x0032 100 100 000 Old_age
Always - 1879223820
248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age
Always - 1922002147

Please compare those to SSD specifications.

202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age
Offline - 33

That value is not encouraging, but it is an estimate; not a hard error
count. I would monitor it over time.

174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age
Always - 12

Same comments as above.

An underlying theme is "data integrity". AIUI only btrfs and ZFS have integrity checking built-in; AIUI md, LVM, and ext[234] do not. Linux dm-integrity has not reached Debian stable yet. I suggest that you
implemented periodic runs of BSD mtree(8) to monitor your file systems
for corruption:

https://manpages.debian.org/bullseye/mtree-netbsd/mtree.8.en.html

Another underlying theme is system monitoring and failure prediction.
It is good to run SMART tests and get SMART tests on a regular basis. I
do this manually, have too many disks, and am doing a lousy job. I need
to learn smartd(8).

There have been a few posts recently by people who are running consumer
SSD's in RAID 24x7. After 2+ years, the SSD's start having problems and produce scary SMART reports. AIUI consumer drives are rated for 40
hours/week. Running them 24x7 is like "dog years" -- multiply wall
clock time by 24 * 7 / 40 to get equivalent usage time. In this case, 2
years at 24x7 is equivalent to 8.4 years of 40 hours/week usage. If you
want to run disks 24x7 and have them last 5 years with a certain I/O
load, get disks rated for that.

David

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Woodall@21:1/5 to David Christensen on Sun Jan 21 11:50:01 2024

On Sat, 20 Jan 2024, David Christensen wrote:

On 1/20/24 08:25, Tim Woodall wrote:

Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...

Please post a console session that demonstrates, or at least documents, the data corruption.

Console session is difficult - this is a script that takes around 6
hours to run - but a typical example of corruption is something like
this:

Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream

The checksum will have been verified by apt during the download but when
it comes to read the downloaded deb to unpack and install it doesn't get
the same data. The corruption can happen at both the writing (the file
on disk is corrupted) and the reading (the file on disk has the correct checksum)

Please post console sessions that document the make and model of your disks, their partition tables, your md RAID configurations, and your LVM configurations.

Can you please give a clue as to what you're looking for? This is a
machine exposing dozens of LVM volumes via iscsi targets that are then
exported into VMs that then may be used as PVs in the VM.

The disk that I'm using when I saw the above error is a straight LVM ->
iscsi -> ext3 mounted like this:

/dev/xvdb on /mnt/mirror/ftp/mirror type ext3 (rw,noatime)

That is this iscsi target:
[fd01:8b0:bfcd:100:230:18ff:fe08:5ad6]:3260,1 iqn.xen17:aptmirror-archive

configured like this:
root@xen17:~# cat /etc/tgt/conf.d/aptmirror17.conf
<target iqn.xen17:aptmirror17>
backing-store /dev/vg-xen17/aptmirror17
</target>
<target iqn.xen17:aptmirror-archive>
backing-store /dev/vg-xen17/aptmirror-archive
</target>

and configured in the vm config like this:
disk=[ 'script=block-iscsi,vdev=xvda,target=portal=xen17:3260,iqn=iqn.xen17:aptmirror17,w',
'script=block-iscsi,vdev=xvdb,target=portal=xen17:3260,iqn=iqn.xen17:aptmirror-archive,w',
]

Putting a sector size 512/512 disk and a sector size 512/4096 disk into the same mirror is unconventional. I suppose there are kernel developers who could definitively explain the consequences, but I am not one of them. The KISS solution is to use matching disks in RAID.

The problem with matching disks in the raid, which has bitten me before,
is that they can both be subject of a recall. I make a deliberate effort
to avoid matching disks for exactly that reason.

I'm happy to accept that this is "unconventional" - however, I didn't
even know it had happened. It was Andy's thread that gave me the clue to
look. I'm surprised that mdadm didn't say something - and I thought
LVM/mdadm all did everything at the 4k level anyway so I don't really
see why it should matter.

All the
partitions start on a 4k boundary but the big partition is not an exact
multiple of 4k.

I align my partitions to 1 MiB boundaries and suggest that you do the same.

They are aligned at 1M boundaries but while I could see that sub-4k
alignment could be triggering some (expected) problem, I can't really
see why 4k or 1M alignment would be different:

Device Start End Sectors Size Type
/dev/sda1 2048 4095 2048 1M BIOS boot
/dev/sda2 4096 264191 260096 127M EFI System
/dev/sda3 264192 1953525134 1953260943 931.4G Linux filesystem

... the "heavy load" filesystem that triggered the issue ...

Please post a console session that demonstrates how data corruption is related to I/O throughput.

I don't know how to do that except that I run a script every Sunday that rebuilds my entire set of packages that I have locally in a sandbox. For
each package that builds a clean sandbox, installs all of the
build-depends and then builds it. It also generates some multi-hundred
MB compressed tar archives of "clean" systems that I use to bootstrap installing new VMs. I have had the following commands report:

build-tarfiles.sh: tar -C ${BUILDCHROOT} --one-file-system -Jcf ${PDIR}/${tgt} .
build-tarfiles.sh: tar tvf ${PDIR}/${tgt} >/dev/null
build-tarfiles.sh: tar tvf ${PDIR}/${tgt} >/dev/null

Where the first tar tvf reports that the archive is corrupted while the
second works (and the archive is uncorrupted)

There are a LOT of
partitions and filesystems in a complicated layered LVM setup ...

Complexity is the enemy of data integrity and system reliability. I suggest simplifying where it makes sense; but do not over-simplify.

I don't see any opportunity to simplify. It is complicated but
conceptually easy.

For example xen17 has >30 LVs, each exported via iscsi, they
are then mounted inside various VMs (currently 14 running) and then the
virtual disk inside that VM may, or may not be a LVM PV itself.

"Just supply everything" is going to be a multi-hundred-thousand line
email though.

lvm.conf from 14 VMs is going to be 30k lines on its own. I think that
the lvm.conf in the VMs is "unchanged" but without work I don't know
that they're unchanged from a default install. The one on xen17
definitely is changed:
filter = [ "r|/dev/vg-xen17/.*|", "r|/dev/disk/by-path/ip-.*|", "r|/dev/disk/by-id/usb-.*|", "r|/dev/disk/by-id/usb-Kingston_DataTraveler_3.0_6CF049E16B59B03169C6D9ED-0:0|" ]

because I don't want the kernel looking into the various images that are intended to be used in a VM. Whether I've made other changes I don't
recall without spending time going through logs or installing a mirror
system and diffing the files.

(And yes, I know that that last exclusion is redundant but I want that
one documented explicitly in case I need/want to remove the general
one)

Booted on the problem machine but physical disk still on the OK machine:
real 0m35.731s
user 0m5.291s
sys 0m4.677s

Booted on the good machine but physical disk still on the problem
machine:
real 0m57.721s
user 0m5.446s
sys 0m4.783s

Please provide host names.

The fast one above is running apt-get remove --purge
linux-image-5<something> on debootstrap19 - which is a VM running on xen17
but with physical backing disks exported from xen19

The slow one above is running the same command on debootstrap17 - which
is a VM running on xen19 but with physical backing disks exported from
xen17

Note that these systems are optimized for power consumption, not speed,
so "slow" is relative. I don't expect anything to be fast!

When I did the kernel upgrade debootstrap17 was running on xen17 and debootstrap19 was running on xen19 - and the slowness stayed with
debootstrap17 (but I didn't take timings)

And note that the slowness is not linked to debootstrap17 - all VMs with
a backing disk on xen17 are slow relative to VMs with a backing disk on
xen19 which indicates that the problem is xen17 or the disks on xen17.

I have been assuming the problem was with xen17 itself - and I've made
sure everything important is on xen19 - but I'm starting to suspect
that there's a disk problem on xen17 (which exhibits only as corrupted
reads and writes but no errors in any logs or SMART)

Next Sunday the big rebuild job will kick off on aptmirror19 but that has
been moved to being hosted on xen17 (with backing disks still on xen19)

I've never had a data-corruption failure since I moved the entire job
(vm and backing disk) to xen19. It happened every Sunday when it was on
xen17.

Tim.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From gene heskett@21:1/5 to Tim Woodall on Sun Jan 21 19:00:02 2024

On 1/21/24 05:46, Tim Woodall wrote:

The disk that I'm using when I saw the above error is a straight LVM ->
iscsi -> ext3 mounted like this:

/dev/xvdb on /mnt/mirror/ftp/mirror type ext3 (rw,noatime)

I should stay out of this, but feel compelled to ask why ext3?

It had a very short run, a long time ago now, quickly replaced by ext4,
I presume for a good reason but do not now recall the details.

Tim.

Cheers, Gene Heskett.
--
"There are four boxes to be used in defense of liberty:
soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author, 1940)
If we desire respect for the law, we must first make the law respectable.
- Louis D. Brandeis

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From David Christensen@21:1/5 to Tim Woodall on Mon Jan 22 00:10:02 2024

On 1/21/24 02:45, Tim Woodall wrote:

On Sat, 20 Jan 2024, David Christensen wrote:

On 1/20/24 08:25, Tim Woodall wrote:

Some time ago I wrote about a data corruption issue. I've still not
managed to track it down ...

Please post a console session that demonstrates, or at least
documents, the data corruption.

Console session is difficult - this is a script that takes around 6
hours to run - but a typical example of corruption is something like
this:

Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end
of file or stream

The checksum will have been verified by apt during the download but when
it comes to read the downloaded deb to unpack and install it doesn't get
the same data. The corruption can happen at both the writing (the file
on disk is corrupted) and the reading (the file on disk has the correct checksum)

Suggestions:

1. Use the -e (errexit), -u (nounset), and/or -x (xtrace) options; or
their equivalents, if you are not using Bourne shell.

2. Add printf's to dump progress and debugging information to a file
while the script runs.

3. Add assertions to check for disk corruption, performance problems,
and any other else that concerns you; now or in the past. If any
assertion fails, the assertion should identify itself, halt the script,
and dump the relevant debugging information.

4. Refactor your code into a hierarchy (directed acyclic graph). Start
your debugging/ validation at the bottom (leaf nodes; functions,
commands) and work your way up (root node; the 6 hour script).

5. Make the script idempotent, so that when it fails and you run it
again the script will skip over previously completed steps.

David

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Woodall@21:1/5 to All on Mon Feb 26 19:30:01 2024

TLDR; there was a firmware bug in a disk in the raid array resulting in
data corruption. A subsequent kernel workaround resulted in
dramatically reducing the disk performance. (probably just writes but I
didn't confirm)

Initially, under heavy disk load I got errors like:

Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack): cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream

The checksum will have been verified by apt during the download but when
it comes to read the downloaded deb to unpack and install it doesn't get
the same data. The corruption can happen at both the writing (the file
on disk is corrupted) and the reading (the file on disk has the correct checksum)

A second problem I got was 503 errors from apt-cacher-ng (which ran on
the same machine as the above error)

I initially assumed this was due to faulty memory, or possibly a faulty
CPU. But I assumed memory because the disk errors were happening in a VM
and no other VMs were affected. Because I always start the same VMs in
the same order I assumed they'd be using the same physical memory each
time.

However, nothing I could do would help track down where the memory
problem was. Everything worked perfectly except when using the disk
under load.

At this time I spent a significant amount of time migrating everything important, including the big job that triggered this problem, off this
machine onto the pair. After that the corruption problems went away but
I continued to get periodic 503 errors from apt-cacher-ng.

I continued to worry at this on and off but failed to make any progress
in finding what was wrong. The version of the motherboard is no longer available otherwise I'd probably have bought another one. During this
time I also spent quite a lot of time ensurning that it was much easier
to move VMs between my two machines. I'd underestimated how tricky this
would be if the dodgy machine failed totally which I became aware of
when I did migrate the VM having problems.

Late last year or early this year someone (possibly Andy Smith?) posted
a question about logical/physical sector sizes on SSDs. That set me off investigating again as that's not something I'd thought of. That didn't
prove fruitful either but I did notice this in the kernel logs:

Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support

from libata-core.c

{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
ATA_HORKAGE_NO_NCQ_ON_ATI },

This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)

At this point I'd discovered that the big job that had been regularly
hitting corruption issues now completed. However, it was taking 19 hours instead of 11 hours.

I ordered some new disks - I'd assumed both disks were affected but
while writing this I notice that that "disabling queued TRIM support"
prints twice for the same disk, not once per disk.

I thought one of these was my disk but looking again now I see I had
1000MX500 which doesn't actually match.

{ "Crucial_CT*M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM },
{ "Crucial_CT*MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM },

While waiting for my disks I started looking at the apt-cacher-ng
503 problem - which has continued to bug me. I got lucky and discovered
a way I could almost always trigger it.

I managed to track that down to a race condition when updating the
Release files if multiple machines request the same file at the same
moment.

After finding a fix I found this bug reporting the same problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022043

There is now a patch attached to that bug that I've been running for a
few weeks without a single 503 error.

And Sunday I replaced the two disks with new ones. Today that big job
completed in 10h15m.

Another thing I notice, although I'm not sure I understand what is going
on, is that my iscsi disks all have
Thin-provisioning: No

This means tha fstrim on the vm doesn't work. Switching them to Yes and
it does. So I'm not exactly sure where the queued trim was coming from
in the first place.

I also need to check the version of tgt in sid because there doesn't
seem to be an option to switch this in the config although my googling suggested there should be an option
thin_provisioning=1

At some point I'll patch things to switch this automatically and/or
install a newer version of tgt-admin that does read this setting from
the config.

So I am now confident that my "pending hardware failure" that had been
haunting me is resolved and I can now start planning the final VM move
from buster to bullseye and then the upgrades to bookworm :-)

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 20:10:01 2024

On 2/26/24 13:25, Tim Woodall wrote:

TLDR; there was a firmware bug in a disk in the raid array resulting in
data corruption. A subsequent kernel workaround resulted in
dramatically reducing the disk performance. (probably just writes but I didn't confirm)

Initially, under heavy disk load I got errors like:

Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
Unpacking libperl5.34:arm64 (5.34.0-5) ...
dpkg-deb (subprocess): decompressing archive
'/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb'
(size=4015516) member 'data.tar': lzma error: compressed data is corrupt
dpkg-deb: error: <decompress> subprocess returned error exit status 2
dpkg: error processing archive
/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb
(--unpack):
cannot copy extracted data for
'./usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to
'/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected
end of file or stream

The checksum will have been verified by apt during the download but when
it comes to read the downloaded deb to unpack and install it doesn't get
the same data. The corruption can happen at both the writing (the file
on disk is corrupted) and the reading (the file on disk has the correct
checksum)

A second problem I got was 503 errors from apt-cacher-ng (which ran on
the same machine as the above error)

I initially assumed this was due to faulty memory, or possibly a faulty
CPU. But I assumed memory because the disk errors were happening in a VM
and no other VMs were affected. Because I always start the same VMs in
the same order I assumed they'd be using the same physical memory each
time.

However, nothing I could do would help track down where the memory
problem was. Everything worked perfectly except when using the disk
under load.

At this time I spent a significant amount of time migrating everything important, including the big job that triggered this problem, off this machine onto the pair. After that the corruption problems went away but
I continued to get periodic 503 errors from apt-cacher-ng.

I continued to worry at this on and off but failed to make any progress
in finding what was wrong. The version of the motherboard is no longer available otherwise I'd probably have bought another one. During this
time I also spent quite a lot of time ensurning that it was much easier
to move VMs between my two machines. I'd underestimated how tricky this
would be if the dodgy machine failed totally which I became aware of
when I did migrate the VM having problems.

Late last year or early this year someone (possibly Andy Smith?) posted
a question about logical/physical sector sizes on SSDs. That set me off investigating again as that's not something I'd thought of. That didn't
prove fruitful either but I did notice this in the kernel logs:

Feb 17 17:01:49 xen17 vmunix: [    3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [    3.805074] ata1.00: disabling queued TRIM support

from libata-core.c

{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
      ATA_HORKAGE_ZERO_AFTER_TRIM |
      ATA_HORKAGE_NO_NCQ_ON_ATI },

This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)

At this point I'd discovered that the big job that had been regularly
hitting corruption issues now completed. However, it was taking 19 hours instead of 11 hours.

I ordered some new disks - I'd assumed both disks were affected but
while writing this I notice that that "disabling queued TRIM support"
prints twice for the same disk, not once per disk.

I thought one of these was my disk but looking again now I see I had 1000MX500 which doesn't actually match.

{ "Crucial_CT*M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
      ATA_HORKAGE_ZERO_AFTER_TRIM },
{ "Crucial_CT*MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
      ATA_HORKAGE_ZERO_AFTER_TRIM },

While waiting for my disks I started looking at the apt-cacher-ng
503 problem - which has continued to bug me. I got lucky and discovered
a way I could almost always trigger it.

I managed to track that down to a race condition when updating the
Release files if multiple machines request the same file at the same
moment.

After finding a fix I found this bug reporting the same problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022043

There is now a patch attached to that bug that I've been running for a
few weeks without a single 503 error.

And Sunday I replaced the two disks with new ones. Today that big job completed in 10h15m.

Another thing I notice, although I'm not sure I understand what is going
on, is that my iscsi disks all have
           Thin-provisioning: No

This means tha fstrim on the vm doesn't work. Switching them to Yes and
it does. So I'm not exactly sure where the queued trim was coming from
in the first place.

Are you using systemd ?

/etc/systemd/system/timers.target.wants/fstrim.timer

[Unit]
Description=Discard unused blocks once a week
Documentation=man:fstrim
ConditionVirtualization=!container
ConditionPathExists=!/etc/initrd-release

[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true
RandomizedDelaySec=6000

[Install]
WantedBy=timers.target

You should not be running trim in a container/virtual machine

Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Woodall@21:1/5 to Gremlin on Mon Feb 26 20:50:01 2024

On Mon, 26 Feb 2024, Gremlin wrote:

Are you using systemd ?

No, I'm not

You should not be running trim in a container/virtual machine

Why not? That's, in my case, basically saying "you should not be running
trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
to understand why. Enabling thin_provisioning and fstrim works and gets
mapped to the underlying layers all the way down to the SSD.

My underlying VG is less than 50% occupied, so I can trim the free space
by creating a LV and then removing it again (I have issue_discards set)

FWIW, I did issue fstrim in the VMs with no visible issues at all.
Perhaps I got lucky?

Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

I don't see VM or virtual machine anywhere on that page.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 22:00:02 2024

On 2/26/24 14:40, Tim Woodall wrote:

On Mon, 26 Feb 2024, Gremlin wrote:

Are you using systemd ?

No, I'm not

You should not be running trim in a container/virtual machine

Why not? That's, in my case, basically saying "you should not be running
trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
to understand why. Enabling thin_provisioning and fstrim works and gets mapped to the underlying layers all the way down to the SSD.

I guest you didn't understand the systemd timer that runs fstrim on the
host.

My underlying VG is less than 50% occupied, so I can trim the free space
by creating a LV and then removing it again (I have issue_discards set)

FWIW, I did issue fstrim in the VMs with no visible issues at all.
Perhaps I got lucky?

Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

I don't see VM or virtual machine anywhere on that page.

Exactly, and you should not be running it in a VM/container. Which BTW
systemd will not run fstrim in a container.

The Host system takes care of it

Well you can keep shooting yourself in the butt as long as you wish, I
on the other hand tend not to do that as much I possibly can as I need
to be able to set down at times.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 22:50:01 2024

On 2/26/24 16:31, Tim Woodall wrote:

On Mon, 26 Feb 2024, Gremlin wrote:

re running fstrim in a vm.

The Host system takes care of it

I guess you've no idea what iscsi is. Because this makes no sense at
all. systemd or no systemd. The physical disk doesn't have to be
something the host system knows anything about.

Here's a thread of someone wanting to do fstrim from a vm with iscsi
mounted disks.

https://serverfault.com/questions/1031580/trim-unmap-zvol-over-iscsi

And another page suggesting you should.

https://gist.github.com/hostberg/86bfaa81e50cc0666f1745e1897c0a56

8.10.2. Trim/Discard It is good practice to run fstrim (discard)
regularly on VMs and containers. This releases data blocks that the filesystem isn't using anymore. It reduces data usage and resource load.
Most modern operating systems issue such discard commands to their disks regularly. You only need to ensure that the Virtual Machines enable the
disk discard option.

I would guess that if you use sparse file backed storage to a vm you'd
want the vm to run fstrim too but this isn't a setup I've ever used so perhaps it's nonsense.

Never mind, arguing with me will not solve your issue.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Mon Feb 26 22:50:01 2024

You should not be running trim in a container/virtual machine

Why not? That's, in my case, basically saying "you should not be running
trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
to understand why. Enabling thin_provisioning and fstrim works and gets
mapped to the underlying layers all the way down to the SSD.

I guest you didn't understand the systemd timer that runs fstrim on
the host.

How can the host properly run `fstrim` if it only sees a disk image and
may not know how that image is divided into partitions/filesystems?

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Woodall@21:1/5 to Gremlin on Mon Feb 26 22:40:02 2024

On Mon, 26 Feb 2024, Gremlin wrote:

re running fstrim in a vm.

The Host system takes care of it

I guess you've no idea what iscsi is. Because this makes no sense at
all. systemd or no systemd. The physical disk doesn't have to be
something the host system knows anything about.

Here's a thread of someone wanting to do fstrim from a vm with iscsi
mounted disks.

https://serverfault.com/questions/1031580/trim-unmap-zvol-over-iscsi

And another page suggesting you should.

https://gist.github.com/hostberg/86bfaa81e50cc0666f1745e1897c0a56

8.10.2. Trim/Discard It is good practice to run fstrim (discard)
regularly on VMs and containers. This releases data blocks that the
filesystem isn't using anymore. It reduces data usage and resource load.
Most modern operating systems issue such discard commands to their disks regularly. You only need to ensure that the Virtual Machines enable the
disk discard option.

I would guess that if you use sparse file backed storage to a vm you'd
want the vm to run fstrim too but this isn't a setup I've ever used so
perhaps it's nonsense.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Andy Smith@21:1/5 to Tim Woodall on Tue Feb 27 00:50:02 2024

Hi,

On Mon, Feb 26, 2024 at 06:25:53PM +0000, Tim Woodall wrote:

Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support

from libata-core.c

{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
ATA_HORKAGE_NO_NCQ_ON_ATI },

This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)

That's interesting. I have quite a few of these drives and haven't
noticed any problems. What kernel version introduced the above
workarounds?

$ sudo lsblk -do NAME,MODEL
NAME MODEL
sda SAMSUNG_MZ7KM1T9HAJM-00005
sdb SAMSUNG_MZ7KM1T9HAJM-00005
sdc Samsung_SSD_870_EVO_4TB
sdd Samsung_SSD_870_EVO_4TB
sde ST4000LM016-1N2170
sdf ST4000LM016-1N2170
sdg SuperMicro_SSD
sdh SuperMicro_SSD

Thanks,
Andy

--
https://bitfolk.com/ -- No-nonsense VPS hosting

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Woodall@21:1/5 to Andy Smith on Tue Feb 27 05:40:01 2024

On Mon, 26 Feb 2024, Andy Smith wrote:

Hi,

On Mon, Feb 26, 2024 at 06:25:53PM +0000, Tim Woodall wrote:

Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support

from libata-core.c

{ "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
ATA_HORKAGE_ZERO_AFTER_TRIM |
ATA_HORKAGE_NO_NCQ_ON_ATI },

This fixed the disk corruption errors at the cost of dramatically
reducing performance. (I'm not sure why because manual fstrim didn't
improve things)

That's interesting. I have quite a few of these drives and haven't
noticed any problems. What kernel version introduced the above
workarounds?

$ sudo lsblk -do NAME,MODEL
NAME MODEL
sda SAMSUNG_MZ7KM1T9HAJM-00005
sdb SAMSUNG_MZ7KM1T9HAJM-00005
sdc Samsung_SSD_870_EVO_4TB
sdd Samsung_SSD_870_EVO_4TB
sde ST4000LM016-1N2170
sdf ST4000LM016-1N2170
sdg SuperMicro_SSD
sdh SuperMicro_SSD

Thanks,
Andy

Looks like the fix was brand new around sept 2021 https://www.neowin.net/news/linux-patch-disables-trim-and-ncq-on-samsung-860870-ssds-in-intel-and-amd-systems/

I was still seeing corruption in August 2022 but it's possible the fix
wasn't backported to whatever release I was running.

Tim.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Keyop
  Sun Apr 28 20:37:53 2024
  from Huddersfield, West Yorkshire via SSH
- Keyop
  Mon Apr 29 19:16:32 2024
  from Huddersfield, West Yorkshire via SSH
- Bob Worm
  Mon Apr 29 09:04:47 2024
  from Wales, Uk via Telnet
- Smithy
  Tue Apr 30 00:43:37 2024
  from Plymouth via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	297
Nodes:	16 (2 / 14)
Uptime:	34:42:28
Calls:	6,669
Calls today:	1
Files:	12,216
Messages:	5,338,426

Disk corruption and performance issue.

Who's Online

Recent Visitors

System Info