• Disk corruption and performance issue.

    From Tim Woodall@21:1/5 to All on Sat Jan 20 17:30:01 2024
    This is rather long - so if you're replying to just one bit, please
    consider trimming the parts that you're not responding to to make
    everybody's life a little bit better!


    Some time ago I wrote about a data corruption issue. I've still not
    managed to track it down, but I have two new datapoints (one inspired
    by a recent thread) and I'm hoping someone will have ideas on how I
    should move forward. By avoiding heavy disk load (and important
    tasks/jobs!) on the problem machine I've had no more data corruption.
    There are no errors/warnings anywhere. A part of me suspects a faulty
    SSD!

    I have new disks on order so I can replace the existing disks soon if
    that's what it will need to fix this.

    Inspired from the recent thread:
    On the server that has no issues:
    sda: Sector size (logical/physical): 512 bytes / 512 bytes
    sdb: Sector size (logical/physical): 512 bytes / 512 bytes

    These are then gpt partitioned, a small BIOS boot and EFI partition and
    then a big "Linux filesystem" partition that is part of a mdadm raid

    md0 : active raid1 sda3[3] sdb3[2]


    On the server that has performance issues and I get occasional data
    corruption (both reading and writing) under heavy (disk) load:

    sda: Sector size (logical/physical): 512 bytes / 512 bytes
    sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

    I'm wondering if that physical sector size is the issue. All the
    partitions start on a 4k boundary, but the big partition is not an
    exact multiple of 4k. Inside the raid is an LVM PV, so I think
    everything is 4k aligned anyway except the filesystems themselves,
    and the "heavy load" filesystem that triggered the issue uses 4k
    blocks. But I don't know if something somewhere has "padding" so that
    the actual data doesn't start on a 4k boundary on the disk. There are
    a LOT of partitions and filesystems in a complicated layered LVM
    setup, so it will be easier for me to check with instructions than to
    try to provide the data for someone else to check - if someone can
    give me instructions to work out exactly where the data ends up on
    the disk. (All partitions are formatted with ext3.)
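    For what it's worth, here's the kind of arithmetic I can do once I know
    the start offsets; a sketch assuming offsets are reported in 512-byte
    sectors (the values below are examples, not my real layout):

```shell
#!/bin/sh
# Sketch: does a start offset (in 512-byte sectors) land on a 4 KiB
# physical boundary? 8 sectors * 512 B = 4096 B. Real offsets come
# from e.g. `fdisk -l` (partition Start column) and
# `pvs -o pv_name,pe_start` (LVM data offset); values here are examples.
check_4k_aligned() {
    if [ $(( $1 % 8 )) -eq 0 ]; then
        echo "$1: 4k-aligned"
    else
        echo "$1: NOT 4k-aligned"
    fi
}
check_4k_aligned 264192   # example partition start sector
check_4k_aligned 264193   # deliberately misaligned example
```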

    The remaining setup is identical

    The new disks are the same make and model as sdb in this server - I hope
    that's not a problem!


    The second datapoint. My VMs all use iscsi to provide their disk.
    Normally the vm runs on the same server as the iscsi target but today I
    did a kernel upgrade on a pair of vms (the one on the "problem" machine
    took about twice as long) and then "cross booted" them and purged the
    old kernel. I actually took timings here:

    Booted on the problem machine but physical disk still on the OK machine:
    real 0m35.731s
    user 0m5.291s
    sys 0m4.677s

    Booted on the good machine but physical disk still on the problem
    machine:
    real 0m57.721s
    user 0m5.446s
    sys 0m4.783s

    I was running these at the same time - which I think rules out cpu
    issues. (I've done other tests that also suggest that cpu/memory
    isn't the issue; it seems to be disk, cabling, etc.)

    The SMART attributes from the problem machine:
    sda:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
    9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 18280
    12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 54
    177 Wear_Leveling_Count 0x0013 087 087 000 Pre-fail Always - 129
    179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
    181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
    182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
    183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
    187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
    190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age Always - 33
    195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
    199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0
    235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 39
    241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 62154466086


    sdb:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
    5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
    9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18697
    12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
    171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
    172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
    173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age Always - 433
    174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 12
    180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 45
    183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
    184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
    187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
    194 Temperature_Celsius 0x0022 074 052 000 Old_age Always - 26 (Min/Max 0/48)
    196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
    197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
    198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
    199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 1
    202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age Offline - 33
    206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
    210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0
    246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 63148678276
    247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1879223820
    248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1922002147

    Does anything leap out at anyone? Anything I should try next?
    Normally I try to avoid pairing disks bought at the same time from
    the same brand, but I'll give that a try if it will fix this.

    Tim.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Christensen@21:1/5 to Tim Woodall on Sat Jan 20 22:20:01 2024
    On 1/20/24 08:25, Tim Woodall wrote:
    Some time ago I wrote about a data corruption issue. I've still not
    managed to track it down ...

    Please post a console session that demonstrates, or at least documents,
    the data corruption.


    Please cut and paste complete console sessions into your posts --
    prompt, command entered, output displayed. Redact sensitive information.


    It helps if your prompt contains useful information. I set PS1 in
    $HOME/.profile as follows:

    2024-01-20 11:31:58 dpchrist@laalaa ~
    $ grep PS1 .profile | grep -v '#'
    export PS1='\n\D{%Y-%m-%d %H:%M:%S} ${USER}@\h \w\n\$ '


    On the server that has no issues:
    sda: Sector size (logical/physical): 512 bytes / 512 bytes
    sdb: Sector size (logical/physical): 512 bytes / 512 bytes

    Attempting to diagnose issues without all the facts is an exercise in
    futility.


    Please post console sessions that document the make and model of your
    disks, their partition tables, your md RAID configurations, and your
    LVM configurations.


    These are then gpt partitioned, a small BIOS boot and EFI partition and
    then a big "Linux filesystem" partition that is part of a mdadm raid

    md0 : active raid1 sda3[3] sdb3[2]

    On the server that has performance issues and I get occasional data
    corruption (both reading and writing) under heavy (disk) load:

    sda: Sector size (logical/physical): 512 bytes / 512 bytes
    sdb: Sector size (logical/physical): 512 bytes / 4096 bytes

    Putting a sector size 512/512 disk and a sector size 512/4096 disk into
    the same mirror is unconventional. I suppose there are kernel
    developers who could definitively explain the consequences, but I am not
    one of them. The KISS solution is to use matching disks in RAID.


    All the
    partitions start on a 4k boundary but the big partition is not an
    exact multiple of 4k.

    I align my partitions to 1 MiB boundaries and suggest that you do the same.


    ... the "heavy load" filesystem that triggered the issue ...

    Please post a console session that demonstrates how data corruption is
    related to I/O throughput.


    There are a LOT of
    partitions and filesystems in a complicated layered LVM setup ...

    Complexity is the enemy of data integrity and system reliability. I
    suggest simplifying where it makes sense; but do not over-simplify.


    Booted on the problem machine but physical disk still on the OK machine:
    real 0m35.731s
    user 0m5.291s
    sys 0m4.677s

    Booted on the good machine but physical disk still on the problem
    machine:
    real 0m57.721s
    user 0m5.446s
    sys 0m4.783s

    Please provide host names.


    Please post a console session that demonstrates how data corruption
    affects VM boot time.


    The SMART attributes from the problem machine:
    sda:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
    12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 54
    179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
    181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
    182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
    183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
    187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
    190 Airflow_Temperature_Cel 0x0032 067 049 000 Old_age Always - 33
    195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
    199 CRC_Error_Count 0x003e 100 100 000 Old_age Always - 0

    Those look good.


    9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 18280
    177 Wear_Leveling_Count 0x0013 087 087 000 Pre-fail Always - 129
    241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 62154466086

    Please compare those to the SSD specifications.
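To turn attribute 241 into something comparable with a rated TBW figure --
assuming, as is common but not universal, that the raw value counts
512-byte units:

```shell
#!/bin/sh
# Sketch: convert Total_LBAs_Written into terabytes written, to set
# against the drive's rated endurance (TBW). Assumes the raw value
# counts 512-byte units; confirm against the vendor's SMART docs.
lbas=62154466086
echo "$(( lbas * 512 / 1000000000000 )) TB written"   # decimal TB, truncated
```

Roughly 31 TB here, which looks modest for a roughly 1 TB-class drive.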


    235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 39

    https://www.overclock.net/threads/what-does-por-recovery-count-mean-in-samsung-magician.1491466/


    I see a similar statistic on my Intel SSD 520 Series drives:

    12 Power_Cycle_Count -O--CK 099 099 000 - 1996
    174 Unexpect_Power_Loss_Ct -O--CK 100 100 000 - 1994


    Linux does not seem to shut down the drives the way they want to be shut
    down.


    sdb:
    ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    1 Raw_Read_Error_Rate 0x002f 100 100 000 Pre-fail Always - 0
    5 Reallocate_NAND_Blk_Cnt 0x0032 100 100 010 Old_age Always - 0
    12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 50
    171 Program_Fail_Count 0x0032 100 100 000 Old_age Always - 0
    172 Erase_Fail_Count 0x0032 100 100 000 Old_age Always - 0
    183 SATA_Interfac_Downshift 0x0032 100 100 000 Old_age Always - 0
    184 Error_Correction_Count 0x0032 100 100 000 Old_age Always - 0
    187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always - 0
    194 Temperature_Celsius 0x0022 074 052 000 Old_age Always - 26 (Min/Max 0/48)
    196 Reallocated_Event_Count 0x0032 100 100 000 Old_age Always - 0
    197 Current_Pending_ECC_Cnt 0x0032 100 100 000 Old_age Always - 0
    198 Offline_Uncorrectable 0x0030 100 100 000 Old_age Offline - 0
    206 Write_Error_Rate 0x000e 100 100 000 Old_age Always - 0
    210 Success_RAIN_Recov_Cnt 0x0032 100 100 000 Old_age Always - 0

    Those look good.


    199 UDMA_CRC_Error_Count 0x0032 100 100 000 Old_age Always - 1

    I believe that indicates a SATA communications problem. I suggest using
    SATA cables that are rated for SATA III 6 Gbps with locking connectors.
    If you are in doubt, buy new cables that are properly identified.


    9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 18697
    173 Ave_Block-Erase_Count 0x0032 067 067 000 Old_age Always - 433
    180 Unused_Reserve_NAND_Blk 0x0033 000 000 000 Pre-fail Always - 45
    246 Total_LBAs_Written 0x0032 100 100 000 Old_age Always - 63148678276
    247 Host_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1879223820
    248 FTL_Program_Page_Count 0x0032 100 100 000 Old_age Always - 1922002147

    Please compare those to SSD specifications.
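Attributes 247 and 248 also let you estimate the write-amplification
factor. Assuming these vendor counters mean what their names suggest
(host-initiated pages vs. FTL background pages) -- an assumption, not
documented fact -- a common approximation is (host + FTL) / host:

```shell
#!/bin/sh
# Sketch: write-amplification estimate from the page counters above.
# Assumes total NAND writes ~= host pages + FTL pages.
awk 'BEGIN {
    host = 1879223820    # 247 Host_Program_Page_Count
    ftl  = 1922002147    # 248 FTL_Program_Page_Count
    printf "WAF ~= %.2f\n", (host + ftl) / host
}'
```

A factor around 2 is unremarkable for a consumer SSD under mixed load.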


    202 Percent_Lifetime_Remain 0x0030 067 067 001 Old_age Offline - 33

    That value is not encouraging, but it is an estimate; not a hard error
    count. I would monitor it over time.


    174 Unexpect_Power_Loss_Ct 0x0032 100 100 000 Old_age Always - 12

    Same comments as above.


    An underlying theme is "data integrity". AIUI only btrfs and ZFS have
    integrity checking built in; AIUI md, LVM, and ext[234] do not. Linux
    dm-integrity has not reached Debian stable yet. I suggest that you
    implement periodic runs of BSD mtree(8) to monitor your file systems
    for corruption:

    https://manpages.debian.org/bullseye/mtree-netbsd/mtree.8.en.html
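Where mtree(8) is not to hand, the same idea can be sketched with a plain
checksum manifest (the paths and files below are throwaway examples):

```shell
#!/bin/sh
# Sketch of a periodic integrity check: record checksums of a tree,
# then re-verify later; silent corruption shows up as a failed entry.
# mtree(8) does this properly (plus permissions, sizes, etc.); this
# portable fallback only catches content changes.
set -eu
dir=$(mktemp -d)
manifest=$(mktemp)
printf 'hello\n' > "$dir/a"
printf 'world\n' > "$dir/b"
( cd "$dir" && find . -type f | sort | xargs sha256sum ) > "$manifest"
# ...later: re-verify (--quiet prints only failures, non-zero exit on mismatch)
( cd "$dir" && sha256sum -c --quiet "$manifest" ) && echo "tree clean"
rm -rf "$dir" "$manifest"
```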


    Another underlying theme is system monitoring and failure prediction.
    It is good to run SMART tests and review SMART reports on a regular
    basis. I do this manually, have too many disks, and am doing a lousy
    job. I need to learn smartd(8).


    There have been a few posts recently by people who are running
    consumer SSD's in RAID 24x7. After 2+ years, the SSD's start having
    problems and produce scary SMART reports. AIUI consumer drives are
    rated for 40 hours/week. Running them 24x7 is like "dog years" --
    multiply wall clock time by 24 * 7 / 40 to get equivalent usage time.
    In this case, 2 years at 24x7 is equivalent to 8.4 years of 40
    hours/week usage. If you want to run disks 24x7 and have them last 5
    years with a certain I/O load, get disks rated for that.
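The duty-cycle arithmetic above, spelled out (the 40 hours/week rating is
the assumption stated earlier):

```shell
#!/bin/sh
# Sketch of the "dog years" arithmetic: a drive rated for 40 h/week,
# run 24x7 (168 h/week), ages 168/40 = 4.2x faster than wall-clock.
awk 'BEGIN {
    factor = 24 * 7 / 40
    printf "wear factor %.1fx; 2 years 24x7 ~= %.1f years at 40 h/week\n", factor, 2 * factor
}'
```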


    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to David Christensen on Sun Jan 21 11:50:01 2024
    On Sat, 20 Jan 2024, David Christensen wrote:

    On 1/20/24 08:25, Tim Woodall wrote:
    Some time ago I wrote about a data corruption issue. I've still not
    managed to track it down ...

    Please post a console session that demonstrates, or at least documents, the data corruption.


    Console session is difficult - this is a script that takes around 6
    hours to run - but a typical example of corruption is something like
    this:

    Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
    Unpacking libperl5.34:arm64 (5.34.0-5) ...
    dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
    dpkg-deb: error: <decompress> subprocess returned error exit status 2
    dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
    cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream

    The checksum will have been verified by apt during the download, but
    when it comes to read the downloaded deb to unpack and install, it
    doesn't get the same data. The corruption can happen at both the
    writing (the file on disk is corrupted) and the reading (the file on
    disk has the correct checksum).

    Please post console sessions that document the make and model of your
    disks, their partition tables, your md RAID configurations, and your
    LVM configurations.


    Can you please give a clue as to what you're looking for? This is a
    machine exposing dozens of LVM volumes via iscsi targets that are then
    exported into VMs that then may be used as PVs in the VM.

    The disk that I'm using when I saw the above error is a straight LVM ->
    iscsi -> ext3 mounted like this:

    /dev/xvdb on /mnt/mirror/ftp/mirror type ext3 (rw,noatime)

    That is this iscsi target:
    [fd01:8b0:bfcd:100:230:18ff:fe08:5ad6]:3260,1 iqn.xen17:aptmirror-archive

    configured like this:
    root@xen17:~# cat /etc/tgt/conf.d/aptmirror17.conf
    <target iqn.xen17:aptmirror17>
    backing-store /dev/vg-xen17/aptmirror17
    </target>
    <target iqn.xen17:aptmirror-archive>
    backing-store /dev/vg-xen17/aptmirror-archive
    </target>

    and configured in the vm config like this:
    disk=[ 'script=block-iscsi,vdev=xvda,target=portal=xen17:3260,iqn=iqn.xen17:aptmirror17,w',
    'script=block-iscsi,vdev=xvdb,target=portal=xen17:3260,iqn=iqn.xen17:aptmirror-archive,w',
    ]



    Putting a sector size 512/512 disk and a sector size 512/4096 disk
    into the same mirror is unconventional. I suppose there are kernel
    developers who could definitively explain the consequences, but I am
    not one of them. The KISS solution is to use matching disks in RAID.


    The problem with matching disks in the raid, which has bitten me before,
    is that they can both be subject of a recall. I make a deliberate effort
    to avoid matching disks for exactly that reason.

    I'm happy to accept that this is "unconventional" - however, I didn't
    even know it had happened. It was Andy's thread that gave me the clue to
    look. I'm surprised that mdadm didn't say something - and I thought
    LVM/mdadm all did everything at the 4k level anyway so I don't really
    see why it should matter.

    All the
    partitions start on a 4k boundary but the big partition is not an exact
    multiple of 4k.

    I align my partitions to 1 MiB boundaries and suggest that you do the same.

    They are aligned at 1M boundaries but while I could see that sub-4k
    alignment could be triggering some (expected) problem, I can't really
    see why 4k or 1M alignment would be different:

    Device Start End Sectors Size Type
    /dev/sda1 2048 4095 2048 1M BIOS boot
    /dev/sda2 4096 264191 260096 127M EFI System
    /dev/sda3 264192 1953525134 1953260943 931.4G Linux filesystem



    ... the "heavy load" filesystem that triggered the issue ...

    Please post a console session that demonstrates how data corruption
    is related to I/O throughput.


    I don't know how to do that, except that I run a script every Sunday
    that rebuilds my entire set of local packages in a sandbox. For each
    package it builds a clean sandbox, installs all of the build-depends,
    and then builds the package. It also generates some multi-hundred-MB
    compressed tar archives of "clean" systems that I use to bootstrap
    installing new VMs. I have had the following commands report:

    build-tarfiles.sh: tar -C ${BUILDCHROOT} --one-file-system -Jcf ${PDIR}/${tgt} .
    build-tarfiles.sh: tar tvf ${PDIR}/${tgt} >/dev/null
    build-tarfiles.sh: tar tvf ${PDIR}/${tgt} >/dev/null

    Where the first tar tvf reports that the archive is corrupted while
    the second works (and the archive is uncorrupted).
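    That pattern -- first read bad, second read fine -- is why I suspect
    the read path. The check reduces to hashing the same file twice,
    ideally with the page cache dropped in between:

```shell
#!/bin/sh
# Sketch: read the same file twice and compare hashes. On a healthy
# read path they always match; dropping the page cache between reads
# (as root: echo 3 > /proc/sys/vm/drop_caches) forces the second read
# to come from the disk rather than RAM. The file here is a throwaway.
f=$(mktemp)
dd if=/dev/urandom of="$f" bs=1M count=4 2>/dev/null
h1=$(sha256sum "$f" | cut -d' ' -f1)
# echo 3 > /proc/sys/vm/drop_caches   # needs root; omitted in this sketch
h2=$(sha256sum "$f" | cut -d' ' -f1)
[ "$h1" = "$h2" ] && echo "reads consistent" || echo "READ MISMATCH"
rm -f "$f"
```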

    There are a LOT of
    partitions and filesystems in a complicated layered LVM setup ...

    Complexity is the enemy of data integrity and system reliability. I
    suggest simplifying where it makes sense; but do not over-simplify.

    I don't see any opportunity to simplify. It is complicated but
    conceptually easy.

    For example xen17 has >30 LVs, each exported via iscsi, they
    are then mounted inside various VMs (currently 14 running) and then the
    virtual disk inside that VM may, or may not be a LVM PV itself.

    "Just supply everything" is going to be a multi-hundred-thousand line
    email though.

    lvm.conf from 14 VMs is going to be 30k lines on its own. I think that
    the lvm.conf in the VMs is "unchanged" but without work I don't know
    that they're unchanged from a default install. The one on xen17
    definitely is changed:
    filter = [ "r|/dev/vg-xen17/.*|", "r|/dev/disk/by-path/ip-.*|", "r|/dev/disk/by-id/usb-.*|", "r|/dev/disk/by-id/usb-Kingston_DataTraveler_3.0_6CF049E16B59B03169C6D9ED-0:0|" ]

    because I don't want the kernel looking into the various images that
    are intended to be used in a VM. Whether I've made other changes I
    don't recall without spending time going through logs or installing a
    mirror system and diffing the files.

    (And yes, I know that that last exclusion is redundant but I want that
    one documented explicitly in case I need/want to remove the general
    one)


    Booted on the problem machine but physical disk still on the OK machine:
    real 0m35.731s
    user 0m5.291s
    sys 0m4.677s

    Booted on the good machine but physical disk still on the problem
    machine:
    real 0m57.721s
    user 0m5.446s
    sys 0m4.783s

    Please provide host names.

    The fast one above is running apt-get remove --purge
    linux-image-5<something> on debootstrap19 - which is a VM running on
    xen17 but with physical backing disks exported from xen19.

    The slow one above is running the same command on debootstrap17 - which
    is a VM running on xen19 but with physical backing disks exported from
    xen17

    Note that these systems are optimized for power consumption, not speed,
    so "slow" is relative. I don't expect anything to be fast!

    When I did the kernel upgrade, debootstrap17 was running on xen17 and
    debootstrap19 was running on xen19 - and the slowness stayed with
    debootstrap17 (but I didn't take timings).

    And note that the slowness is not linked to debootstrap17 - all VMs with
    a backing disk on xen17 are slow relative to VMs with a backing disk on
    xen19 which indicates that the problem is xen17 or the disks on xen17.

    I have been assuming the problem was with xen17 itself - and I've made
    sure everything important is on xen19 - but I'm starting to suspect
    that there's a disk problem on xen17 (which exhibits only as corrupted
    reads and writes but no errors in any logs or SMART)

    Next Sunday the big rebuild job will kick off on aptmirror19 but that has
    been moved to being hosted on xen17 (with backing disks still on xen19)

    I've never had a data-corruption failure since I moved the entire job
    (vm and backing disk) to xen19. It happened every Sunday when it was on
    xen17.

    Tim.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From gene heskett@21:1/5 to Tim Woodall on Sun Jan 21 19:00:02 2024
    On 1/21/24 05:46, Tim Woodall wrote:

    The disk that I'm using when I saw the above error is a straight LVM ->
    iscsi -> ext3 mounted like this:

    /dev/xvdb on /mnt/mirror/ftp/mirror type ext3 (rw,noatime)

    I should stay out of this, but feel compelled to ask why ext3?

    It had a very short run, a long time ago now, quickly replaced by ext4,
    I presume for a good reason but do not now recall the details.

    Tim.

    Cheers, Gene Heskett.
    --
    "There are four boxes to be used in defense of liberty:
    soap, ballot, jury, and ammo. Please use in that order."
    -Ed Howdershelt (Author, 1940)
    If we desire respect for the law, we must first make the law respectable.
    - Louis D. Brandeis

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From David Christensen@21:1/5 to Tim Woodall on Mon Jan 22 00:10:02 2024
    On 1/21/24 02:45, Tim Woodall wrote:
    On Sat, 20 Jan 2024, David Christensen wrote:
    On 1/20/24 08:25, Tim Woodall wrote:
    Some time ago I wrote about a data corruption issue. I've still not
    managed to track it down ...

    Please post a console session that demonstrates, or at least
    documents, the data corruption.

    Console session is difficult - this is a script that takes around 6
    hours to run - but a typical example of corruption is something like
    this:

    Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
    Unpacking libperl5.34:arm64 (5.34.0-5) ...
    dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
    dpkg-deb: error: <decompress> subprocess returned error exit status 2
    dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack):
     cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end
    of file or stream

    The checksum will have been verified by apt during the download but when
    it comes to read the downloaded deb to unpack and install it doesn't get
    the same data. The corruption can happen at both the writing (the file
    on disk is corrupted) and the reading (the file on disk has the correct checksum)


    Suggestions:

    1. Use the -e (errexit), -u (nounset), and/or -x (xtrace) options; or
    their equivalents, if you are not using Bourne shell.

    2. Add printf's to dump progress and debugging information to a file
    while the script runs.

    3. Add assertions to check for disk corruption, performance problems,
    and any other else that concerns you; now or in the past. If any
    assertion fails, the assertion should identify itself, halt the script,
    and dump the relevant debugging information.

    4. Refactor your code into a hierarchy (directed acyclic graph). Start
    your debugging/ validation at the bottom (leaf nodes; functions,
    commands) and work your way up (root node; the 6 hour script).

    5. Make the script idempotent, so that when it fails and you run it
    again the script will skip over previously completed steps.


    David

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to All on Mon Feb 26 19:30:01 2024
    TL;DR: there was a firmware bug in a disk in the raid array resulting
    in data corruption. A subsequent kernel workaround dramatically
    reduced disk performance. (Probably just writes, but I didn't
    confirm.)


    Initially, under heavy disk load I got errors like:

    Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
    Unpacking libperl5.34:arm64 (5.34.0-5) ...
    dpkg-deb (subprocess): decompressing archive '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb' (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
    dpkg-deb: error: <decompress> subprocess returned error exit status 2
    dpkg: error processing archive /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb (--unpack): cannot copy extracted data for './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected end of file or stream

    The checksum will have been verified by apt during the download but when
    it comes to read the downloaded deb to unpack and install it doesn't get
    the same data. The corruption can happen at both the writing (the file
    on disk is corrupted) and the reading (the file on disk has the correct checksum)


    A second problem I got was 503 errors from apt-cacher-ng (which ran on
    the same machine as the above error)



    I initially assumed this was due to faulty memory, or possibly a faulty
    CPU. But I assumed memory because the disk errors were happening in a VM
    and no other VMs were affected. Because I always start the same VMs in
    the same order I assumed they'd be using the same physical memory each
    time.

    However, nothing I could do would help track down where the memory
    problem was. Everything worked perfectly except when using the disk
    under load.

    At this time I spent a significant amount of time migrating
    everything important, including the big job that triggered this
    problem, off this machine onto its pair. After that the corruption
    problems went away, but I continued to get periodic 503 errors from
    apt-cacher-ng.


    I continued to worry at this on and off but failed to make any
    progress in finding what was wrong. The version of the motherboard is
    no longer available, otherwise I'd probably have bought another one.
    During this time I also spent quite a lot of time ensuring that it
    was much easier to move VMs between my two machines. I'd
    underestimated how tricky this would be if the dodgy machine failed
    totally, which I became aware of when I did migrate the VM having
    problems.


    Late last year or early this year someone (possibly Andy Smith?)
    posted a question about logical/physical sector sizes on SSDs. That
    set me off investigating again, as that's not something I'd thought
    of. That didn't prove fruitful either, but I did notice this in the
    kernel logs:

    Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
    Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support


    from libata-core.c

    { "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
                                ATA_HORKAGE_ZERO_AFTER_TRIM |
                                ATA_HORKAGE_NO_NCQ_ON_ATI },

    This fixed the disk corruption errors at the cost of dramatically
    reducing performance. (I'm not sure why because manual fstrim didn't
    improve things)


    At this point I'd discovered that the big job that had been regularly
    hitting corruption issues now completed. However, it was taking 19
    hours instead of 11.

    I ordered some new disks - I'd assumed both disks were affected, but
    while writing this I notice that "disabling queued TRIM support"
    prints twice for the same disk, not once per disk.

    I thought one of these entries was my disk, but looking again now I
    see I had a 1000MX500, which doesn't actually match.

    { "Crucial_CT*M500*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
                                ATA_HORKAGE_ZERO_AFTER_TRIM },
    { "Crucial_CT*MX100*", "MU01", ATA_HORKAGE_NO_NCQ_TRIM |
                                   ATA_HORKAGE_ZERO_AFTER_TRIM },

    While waiting for my disks I started looking at the apt-cacher-ng
    503 problem - which has continued to bug me. I got lucky and discovered
    a way I could almost always trigger it.

    I managed to track that down to a race condition when updating the
    Release files if multiple machines request the same file at the same
    moment.

    After finding a fix I found this bug reporting the same problem: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1022043

    There is now a patch attached to that bug that I've been running for a
    few weeks without a single 503 error.

    And Sunday I replaced the two disks with new ones. Today that big job
    completed in 10h15m.

    Another thing I notice, although I'm not sure I understand what is
    going on, is that my iscsi disks all have
    Thin-provisioning: No

    This means that fstrim on the vm doesn't work. Switching them to Yes
    and it does. So I'm not exactly sure where the queued trim was coming
    from in the first place.

    I also need to check the version of tgt in sid, because there doesn't
    seem to be an option to switch this in the config, although my
    googling suggested there should be an option
    thin_provisioning=1
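    If a newer tgt does grow the option, my guess (untested - both the
    stanza spelling and whether tgt-admin passes it through are
    assumptions; the target name just mirrors my existing config) is that
    it would look something like this, with a runtime equivalent via
    tgtadm:

```
<target iqn.xen17:aptmirror-archive>
    backing-store /dev/vg-xen17/aptmirror-archive
    params thin_provisioning=1
</target>

# runtime equivalent (tid/lun numbers are placeholders):
tgtadm --lld iscsi --mode logicalunit --op update --tid 1 --lun 1 \
       --params thin_provisioning=1
```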

    At some point I'll patch things to switch this automatically and/or
    install a newer version of tgt-admin that does read this setting from
    the config.



    So I am now confident that my "pending hardware failure" that had been
    haunting me is resolved and I can now start planning the final VM move
    from buster to bullseye and then the upgrades to bookworm :-)

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 20:10:01 2024
    On 2/26/24 13:25, Tim Woodall wrote:
    TLDR; there was a firmware bug in a disk in the raid array resulting in
    data corruption. A subsequent kernel workaround dramatically reduced
    the disk performance (probably just writes, but I didn't confirm).


    Initially, under heavy disk load I got errors like:

    Preparing to unpack .../03-libperl5.34_5.34.0-5_arm64.deb ...
    Unpacking libperl5.34:arm64 (5.34.0-5) ...
    dpkg-deb (subprocess): decompressing archive
    '/tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb'
    (size=4015516) member 'data.tar': lzma error: compressed data is corrupt
    dpkg-deb: error: <decompress> subprocess returned error exit status 2
    dpkg: error processing archive
    /tmp/apt-dpkg-install-zqY3js/03-libperl5.34_5.34.0-5_arm64.deb
    (--unpack):
    cannot copy extracted data for
    './usr/lib/aarch64-linux-gnu/libperl.so.5.34.0' to
    '/usr/lib/aarch64-linux-gnu/libperl.so.5.34.0.dpkg-new': unexpected
    end of file or stream

    The checksum will have been verified by apt during the download, but
    when it comes to reading the downloaded deb to unpack and install it,
    it doesn't get the same data. The corruption can happen both when
    writing (the file on disk is corrupted) and when reading (the file on
    disk has the correct checksum but reads back wrong).
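    One way to catch read-side corruption like this (a sketch of the
    idea, not something apt or dpkg does for you) is to record the
    checksum at download time and re-hash the file again immediately
    before unpacking it:

```python
import hashlib

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large .debs don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_before_unpack(path, expected_sha256):
    """Re-read the file from disk and compare against the checksum
    recorded at download time; a mismatch means either the write or
    the read path corrupted the data."""
    actual = sha256_file(path)
    if actual != expected_sha256:
        raise IOError(f"{path}: checksum mismatch "
                      f"(expected {expected_sha256}, got {actual})")
```

    Of course a second read can still come out of the page cache, so a
    clean re-hash doesn't prove the disk is good; a failing one proves
    something is bad.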


    A second problem I got was 503 errors from apt-cacher-ng (which ran on
    the same machine as the above error)



    I initially assumed this was due to faulty memory, or possibly a faulty
    CPU. But I assumed memory because the disk errors were happening in a VM
    and no other VMs were affected. Because I always start the same VMs in
    the same order I assumed they'd be using the same physical memory each
    time.

    However, nothing I could do would help track down where the memory
    problem was. Everything worked perfectly except when using the disk
    under load.

    At this time I spent a significant amount of time migrating everything
    important, including the big job that triggered this problem, off this
    machine onto its pair. After that the corruption problems went away but
    I continued to get periodic 503 errors from apt-cacher-ng.


    I continued to worry at this on and off but failed to make any progress
    in finding what was wrong. The version of the motherboard is no longer
    available, otherwise I'd probably have bought another one. During this
    time I also spent quite a lot of time ensuring that it was much easier
    to move VMs between my two machines. I'd underestimated how tricky this
    would be if the dodgy machine failed totally, which I became aware of
    when I did migrate the VM having problems.


    Late last year or early this year someone (possibly Andy Smith?) posted
    a question about logical/physical sector sizes on SSDs. That set me off investigating again as that's not something I'd thought of. That didn't
    prove fruitful either but I did notice this in the kernel logs:

    Feb 17 17:01:49 xen17 vmunix: [    3.802581] ata1.00: disabling queued TRIM support
    Feb 17 17:01:49 xen17 vmunix: [    3.805074] ata1.00: disabling queued TRIM support


    from libata-core.c

     { "Samsung SSD 870*",  NULL, ATA_HORKAGE_NO_NCQ_TRIM |
          ATA_HORKAGE_ZERO_AFTER_TRIM |
          ATA_HORKAGE_NO_NCQ_ON_ATI },

    This fixed the disk corruption errors at the cost of dramatically
    reducing performance. (I'm not sure why because manual fstrim didn't
    improve things)


    At this point I'd discovered that the big job that had been regularly
    hitting corruption issues now completed. However, it was taking 19 hours instead of 11 hours.

    I ordered some new disks - I'd assumed both disks were affected, but
    while writing this I notice that the "disabling queued TRIM support"
    message prints twice for the same disk, not once per disk.

    [...]

    Another thing I notice, although I'm not sure I understand what is going
    on, is that my iscsi disks all have
               Thin-provisioning: No

    This means that fstrim in the VM doesn't work; switch them to Yes and
    it does. So I'm not exactly sure where the queued TRIM was coming from
    in the first place.


    Are you using systemd ?

    /etc/systemd/system/timers.target.wants/fstrim.timer

    [Unit]
    Description=Discard unused blocks once a week
    Documentation=man:fstrim
    ConditionVirtualization=!container
    ConditionPathExists=!/etc/initrd-release

    [Timer]
    OnCalendar=weekly
    AccuracySec=1h
    Persistent=true
    RandomizedDelaySec=6000

    [Install]
    WantedBy=timers.target

    You should not be running trim in a container/virtual machine

    Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Gremlin on Mon Feb 26 20:50:01 2024
    On Mon, 26 Feb 2024, Gremlin wrote:

    Are you using systemd ?
    No, I'm not

    You should not be running trim in a container/virtual machine

    Why not? That's, in my case, basically saying "you should not be running
    trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
    to understand why. Enabling thin_provisioning and fstrim works and gets
    mapped to the underlying layers all the way down to the SSD.

    My underlying VG is less than 50% occupied, so I can trim the free space
    by creating a LV and then removing it again (I have issue_discards set)

    FWIW, I did issue fstrim in the VMs with no visible issues at all.
    Perhaps I got lucky?

    Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

    I don't see VM or virtual machine anywhere on that page.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 22:00:02 2024
    On 2/26/24 14:40, Tim Woodall wrote:
    On Mon, 26 Feb 2024, Gremlin wrote:

    Are you using systemd ?
    No, I'm not

    You should not be running trim in a container/virtual machine

    Why not? That's, in my case, basically saying "you should not be running
    trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
    to understand why. Enabling thin_provisioning and fstrim works and gets mapped to the underlying layers all the way down to the SSD.

    I guess you didn't understand the systemd timer that runs fstrim on the
    host.


    My underlying VG is less than 50% occupied, so I can trim the free space
    by creating a LV and then removing it again (I have issue_discards set)

    FWIW, I did issue fstrim in the VMs with no visible issues at all.
    Perhaps I got lucky?

    Here is some info: https://wiki.archlinux.org/title/Solid_state_drive

    I don't see VM or virtual machine anywhere on that page.



    Exactly, and you should not be running it in a VM/container. Which is
    why, BTW, systemd will not run fstrim in a container.

    The Host system takes care of it

    Well, you can keep shooting yourself in the butt as long as you wish;
    I, on the other hand, tend to avoid that as much as I possibly can, as
    I need to be able to sit down at times.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Gremlin@21:1/5 to Tim Woodall on Mon Feb 26 22:50:01 2024
    On 2/26/24 16:31, Tim Woodall wrote:
    On Mon, 26 Feb 2024, Gremlin wrote:

    re running fstrim in a vm.

    The Host system takes care of it

    [...]



    Never mind, arguing with me will not solve your issue.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Stefan Monnier@21:1/5 to All on Mon Feb 26 22:50:01 2024
    You should not be running trim in a container/virtual machine
    Why not? That's, in my case, basically saying "you should not be running
    trim on a drive exported via iscsi" Perhaps I shouldn't be but I'd like
    to understand why. Enabling thin_provisioning and fstrim works and gets
    mapped to the underlying layers all the way down to the SSD.

    I guess you didn't understand the systemd timer that runs fstrim on
    the host.

    How can the host properly run `fstrim` if it only sees a disk image and
    may not know how that image is divided into partitions/filesystems?


    Stefan

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Gremlin on Mon Feb 26 22:40:02 2024
    On Mon, 26 Feb 2024, Gremlin wrote:

    re running fstrim in a vm.

    The Host system takes care of it

    I guess you've no idea what iscsi is, because this makes no sense at
    all, systemd or no systemd. The physical disk doesn't have to be
    something the host system knows anything about.

    Here's a thread of someone wanting to do fstrim from a vm with iscsi
    mounted disks.

    https://serverfault.com/questions/1031580/trim-unmap-zvol-over-iscsi


    And another page suggesting you should.

    https://gist.github.com/hostberg/86bfaa81e50cc0666f1745e1897c0a56

    8.10.2. Trim/Discard

    It is good practice to run fstrim (discard) regularly on VMs and
    containers. This releases data blocks that the filesystem isn't using
    anymore. It reduces data usage and resource load. Most modern operating
    systems issue such discard commands to their disks regularly. You only
    need to ensure that the Virtual Machines enable the disk discard
    option.


    I would guess that if you use sparse file backed storage to a vm you'd
    want the vm to run fstrim too but this isn't a setup I've ever used so
    perhaps it's nonsense.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Andy Smith@21:1/5 to Tim Woodall on Tue Feb 27 00:50:02 2024
    Hi,

    On Mon, Feb 26, 2024 at 06:25:53PM +0000, Tim Woodall wrote:
    Feb 17 17:01:49 xen17 vmunix: [ 3.802581] ata1.00: disabling queued TRIM support
    Feb 17 17:01:49 xen17 vmunix: [ 3.805074] ata1.00: disabling queued TRIM support


    from libata-core.c

    { "Samsung SSD 870*", NULL, ATA_HORKAGE_NO_NCQ_TRIM |
    ATA_HORKAGE_ZERO_AFTER_TRIM |
    ATA_HORKAGE_NO_NCQ_ON_ATI },

    This fixed the disk corruption errors at the cost of dramatically
    reducing performance. (I'm not sure why because manual fstrim didn't
    improve things)

    That's interesting. I have quite a few of these drives and haven't
    noticed any problems. What kernel version introduced the above
    workarounds?

    $ sudo lsblk -do NAME,MODEL
    NAME MODEL
    sda SAMSUNG_MZ7KM1T9HAJM-00005
    sdb SAMSUNG_MZ7KM1T9HAJM-00005
    sdc Samsung_SSD_870_EVO_4TB
    sdd Samsung_SSD_870_EVO_4TB
    sde ST4000LM016-1N2170
    sdf ST4000LM016-1N2170
    sdg SuperMicro_SSD
    sdh SuperMicro_SSD

    Thanks,
    Andy

    --
    https://bitfolk.com/ -- No-nonsense VPS hosting

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From Tim Woodall@21:1/5 to Andy Smith on Tue Feb 27 05:40:01 2024
    On Mon, 26 Feb 2024, Andy Smith wrote:

    Hi,

    On Mon, Feb 26, 2024 at 06:25:53PM +0000, Tim Woodall wrote:
    [...]

    That's interesting. I have quite a few of these drives and haven't
    noticed any problems. What kernel version introduced the above
    workarounds?

    [...]


    Looks like the fix was brand new around September 2021:
    https://www.neowin.net/news/linux-patch-disables-trim-and-ncq-on-samsung-860870-ssds-in-intel-and-amd-systems/

    I was still seeing corruption in August 2022 but it's possible the fix
    wasn't backported to whatever release I was running.

    Tim.

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)