• Vanishing hard disk device

    From Janis Papanagnou@21:1/5 to All on Sun Feb 28 06:55:43 2016
    My first posting in this newsgroup; in case this is not the appropriate newsgroup for this question I'd welcome pointers.

    The problem: In the past I had problems with the consistency of my hard
    disk file systems (from ext2/ext3 and reiser with soft RAID up to ZFS);
    I often ended up with corrupted data under the older file systems. Now,
    with ZFS running, I can see the probable source of the faulty behaviour:
    ZFS reports one of the hard disks as 'removed' and the state of the
    whole pool as 'degraded'. Running the smartctl tool seems to indicate
    that one of the three RAID-Z disks is unavailable, as if it had simply
    been switched off. Removing and re-inserting the disk into its slot
    activates it again; the smartctl tool shows all the expected disk
    information, and after a ZFS 'online' of that disk everything is fine
    (i.e. no data loss).
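
    For reference, the recovery roughly looks like this (the pool name
    'tank' is just a placeholder; output abbreviated):

        # the pool status shows one member as REMOVED and the pool as DEGRADED
        zpool status tank
        # after re-seating the disk, smartctl again reports the full identity data
        smartctl -i /dev/sdd
        # bring the re-seated disk back online; ZFS resilvers it as needed
        zpool online tank /dev/sdd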

    I have ruled out a hard disk problem, since I have bought many different
    disks (from different vendors, as well as the same disk types), and the
    problem has always occurred only with whichever disk is connected as
    /dev/sdd.

    My suspicion is that the controller hardware in the motherboard might be faulty.

    Any advice on the source of this sort of problem? Or suggestions on how
    to prevent the disk at /dev/sdd from occasionally becoming unavailable
    and (sort of) vanishing?

    Thanks.

    Janis

  • From Aragorn@21:1/5 to All on Sun Feb 28 09:57:54 2016
    On Sunday 28 Feb 2016 06:55, Janis Papanagnou conveyed the following to comp.unix.admin...

    My first posting in this newsgroup; in case this is not the
    appropriate newsgroup for this question I'd welcome pointers.

    The problem: In the past I had problems with the consistency of my hard
    disk file systems (from ext2/ext3 and reiser with soft RAID up to ZFS);
    I often ended up with corrupted data under the older file systems. Now,
    with ZFS running, I can see the probable source of the faulty behaviour:
    ZFS reports one of the hard disks as 'removed' and the state of the
    whole pool as 'degraded'. Running the smartctl tool seems to indicate
    that one of the three RAID-Z disks is unavailable, as if it had simply
    been switched off. Removing and re-inserting the disk into its slot
    activates it again; the smartctl tool shows all the expected disk
    information, and after a ZFS 'online' of that disk everything is fine
    (i.e. no data loss).

    I have ruled out a hard disk problem, since I have bought many different
    disks (from different vendors, as well as the same disk types), and the
    problem has always occurred only with whichever disk is connected as
    /dev/sdd.

    My suspicion is that the controller hardware in the motherboard might
    be faulty.

    Any advice on the source of this sort of problem? Or suggestions on how
    to prevent the disk at /dev/sdd from occasionally becoming unavailable
    and (sort of) vanishing?

    If it is indeed the controller on the motherboard, then the only thing I
    can think of, given that you're on a software RAID, would be to get a
    PCI, PCI-X or PCIe SATA adapter card. And then for good measure, you
    should connect _all_ of your hard disks to that one.
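
    To see which controller the disks currently hang off, something like
    this should do (output will of course differ per board):

        # identify the on-board SATA/AHCI controller(s)
        lspci | grep -iE 'sata|ahci|raid'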

    On the other hand, there's also a chance ─ given that you're alluding
    to hot-swap drive bays ─ that it could be the backplane itself which is
    faulty, or the cable for that one particular drive bay. And in that
    case, the only thing you can do is replace the cable (which is cheapest)
    or the backplane (which will cost you more).

    So I would advise first checking the cable to see whether it's well-
    seated, trying another cable for a while, and then seeing whether the
    problem persists. With a bit of luck, it's only the cable which is
    faulty. ;)
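
    Before swapping any hardware, it may also be worth grepping the kernel
    log for link resets or I/O errors involving that disk; a rough sketch,
    assuming the disk still shows up as /dev/sdd:

        # look for libata link resets or errors mentioning the suspect disk
        dmesg | grep -iE 'sdd|hard resetting link'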

    --
    = Aragorn =

    http://www.linuxcounter.net - registrant #223157

  • From Janis Papanagnou@21:1/5 to Aragorn on Mon Feb 29 10:28:26 2016
    On 28.02.2016 09:57, Aragorn wrote:
    On Sunday 28 Feb 2016 06:55, Janis Papanagnou conveyed the following to comp.unix.admin...

    My first posting in this newsgroup; in case this is not the
    appropriate newsgroup for this question I'd welcome pointers.

    The problem: In the past I had problems with the consistency of my hard
    disk file systems (from ext2/ext3 and reiser with soft RAID up to ZFS);
    I often ended up with corrupted data under the older file systems. Now,
    with ZFS running, I can see the probable source of the faulty behaviour:
    ZFS reports one of the hard disks as 'removed' and the state of the
    whole pool as 'degraded'. Running the smartctl tool seems to indicate
    that one of the three RAID-Z disks is unavailable, as if it had simply
    been switched off. Removing and re-inserting the disk into its slot
    activates it again; the smartctl tool shows all the expected disk
    information, and after a ZFS 'online' of that disk everything is fine
    (i.e. no data loss).

    I have ruled out a hard disk problem, since I have bought many different
    disks (from different vendors, as well as the same disk types), and the
    problem has always occurred only with whichever disk is connected as
    /dev/sdd.

    My suspicion is that the controller hardware in the motherboard might
    be faulty.

    Any advice on the source of this sort of problem? Or suggestions on how
    to prevent the disk at /dev/sdd from occasionally becoming unavailable
    and (sort of) vanishing?

    If it is indeed the controller on the motherboard, then the only thing I
    can think of, given that you're on a software RAID, would be to get a
    PCI, PCI-X or PCIe SATA adapter card. And then for good measure, you
    should connect _all_ of your hard disks to that one.

    On the other hand, there's also a chance ─ given that you're alluding
    to hot-swap drive bays ─ that it could be the backplane itself which is
    faulty, or the cable for that one particular drive bay. And in that
    case, the only thing you can do is replace the cable (which is cheapest)
    or the backplane (which will cost you more).

    So I would advise first checking the cable to see whether it's well-
    seated, trying another cable for a while, and then seeing whether the
    problem persists. With a bit of luck, it's only the cable which is
    faulty. ;)

    Thanks for your suggestions, Aragorn!

    Sadly, while I was making a plan to follow your suggestions for
    localizing the problem, my system seems to have decided to fool me.
    Without my changing anything, the ZFS pool again became 'degraded';
    but this time (and for the first time) it is another device, /dev/sdc,
    that has become inaccessible.

    Does that now, in your experience, change the diagnosis of the problem?

    Frankly, I'm totally confused. (Previously I had at least some ideas
    of potential sources of the issue, but now...)

    Janis

  • From Aragorn@21:1/5 to All on Mon Feb 29 11:05:08 2016
    On Monday 29 Feb 2016 10:28, Janis Papanagnou conveyed the following to comp.unix.admin...

    On 28.02.2016 09:57, Aragorn wrote:

    On the other hand, there's also a chance ─ given that you're alluding
    to hot-swap drive bays ─ that it could be the backplane itself which
    is faulty, or the cable for that one particular drive bay. And in
    that case, the only thing you can do is replace the cable (which is
    cheapest) or the backplane (which will cost you more).

    So I would advise first checking the cable to see whether it's well-
    seated, trying another cable for a while, and then seeing whether the
    problem persists. With a bit of luck, it's only the cable which is
    faulty. ;)

    Thanks for your suggestions, Aragorn!

    Sadly, while I was making a plan to follow your suggestions for
    localizing the problem, my system seems to have decided to fool me.
    Without my changing anything, the ZFS pool again became 'degraded';
    but this time (and for the first time) it is another device, /dev/sdc,
    that has become inaccessible.

    Does that now, in your experience, change the diagnosis of the
    problem?

    Well, there is now something else that pops into my mind. From reading
    your contributions to comp.unix.shell, I'd imagine you to be a
    professional and thus unlikely to make the mistake I'm about to expound
    on, but there _is_ always the chance that the hard disks in your array
    are actually not RAID-certified.

    It is not uncommon for consumer-grade SATA disks ─ and most notably
    those made by Western Digital ─ to be a little slow in handling certain
    status polls from the RAID controller ─ whether hardware or software ─
    with the result that the controller may falsely detect a degraded
    state. For this purpose, Western Digital has released "RAID-certified"
    disks, which have a different timing setup and respond faster to status
    polls, so that they aren't marked as defective by software or hardware
    RAID setups when they are in fact still functioning normally but busy
    executing other instructions.
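
    Whether a given disk supports such time-limited error recovery, and how
    it is currently configured, can usually be queried with smartctl; a
    rough sketch (not every drive accepts these commands, and /dev/sdd is
    just an example):

        # query the current SCT Error Recovery Control (TLER) timeouts
        smartctl -l scterc /dev/sdd
        # if supported, limit read/write error recovery to 7 seconds (values in 0.1 s units)
        smartctl -l scterc,70,70 /dev/sdd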

    A second possibility is the following... Since you enumerate the
    devices as /dev/sdc and /dev/sdd, that tells me that you're running a
    GNU/Linux system. And then there are a few questions that pop up,
    because then more information is needed...

    1. Do you ever power the machine down, and if so, did you power down
    between your previous report on the issue and the report that I'm
    now replying to?

    2. What distribution are you running on your system?

    3. Are the devices mounted by UUID or LABEL, or do you mount them
    by way of their Linux-specific /dev/sd? designations?

    The thing is that the /dev/sd? designations are not guaranteed to be
    persistent across reboots. The udev device manager was supposed to
    provide for some consistency in that regard, but it doesn't do that
    particular job very well either. Therefore, when using multiple disks
    in the same machine, it is best to give the individual partitions a
    unique LABEL and mount them while using that, or to mount them by way of
    their unique UUID. (This does require booting with an initrd/initramfs,
    as the kernel itself does not recognize LABELs and UUIDs at boot time.)
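
    For ordinary file systems that looks roughly like this (the UUID and
    mount point below are made-up examples):

        # list the UUIDs and LABELs of all block devices
        blkid
        # /etc/fstab entry that mounts by UUID rather than by /dev/sdX
        UUID=0a1b2c3d-4e5f-6789-abcd-ef0123456789  /data  ext4  defaults  0  2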

    If you have indeed rebooted the machine, then it is possible that the
    faulty drive /dev/sdd of the last time has now become /dev/sdc. If you
    have not rebooted your machine, then you may safely discard this section
    of my reply. ;)
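
    One quick way to check for such a renumbering is to compare the
    persistent names under /dev/disk/by-id with the current sd?
    assignments:

        # each persistent (model/serial based) name is a symlink to the current /dev/sdX
        ls -l /dev/disk/by-id/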

    If it's neither of the above, then I suspect there to be a problem with
    your hot-swap backplane, as I wrote in my previous reply. It could just
    be an intermittent problem ─ e.g. a contact issue ─ or it could be
    permanent, but I have insufficient experience with such backplanes, so
    I don't really know which ones are high quality and which ones are
    prone to failure.

    Hope this helps. ;)

    --
    = Aragorn =

    http://www.linuxcounter.net - registrant #223157

  • From Janis Papanagnou@21:1/5 to Aragorn on Tue Mar 1 02:59:14 2016
    On 29.02.2016 11:05, Aragorn wrote:
    [...]
    Sadly, while I was making a plan to follow your suggestions for
    localizing the problem, my system seems to have decided to fool me.
    Without my changing anything, the ZFS pool again became 'degraded';
    but this time (and for the first time) it is another device, /dev/sdc,
    that has become inaccessible.

    Well, there is now something else that pops into my mind. From reading
    your contributions to comp.unix.shell, I'd imagine you to be a
    professional and thus unlikely to make the mistake I'm about to expound
    on, but there _is_ always the chance that the hard disks in your array
    are actually not RAID-certified.

    Well, you can safely assume that I don't know much WRT hardware or
    system administration, so every hint is welcome. :-)

    WRT "RAID-certified"; I don't really know what that means. My assumption
    would have been that such a classification would refer to hardware RAID systems, not software-RAID, but I don't know.

    What I can tell for my case is that I originally had two "server disks"
    (Seagate) that were said to be designed for continuous operation, which
    I preferred at that time because I rarely reboot. But since those disks
    failed despite the "24/7 feature", a journalling file system, and
    software RAID, I replaced them later with "desktop hard disks" (Toshiba
    and WD); the failure was all the same with every hard disk
    configuration, though.


    It is not uncommon for consumer-grade SATA disks ─ and most notably
    those made by Western Digital ─ to be a little slow in handling certain
    status polls from the RAID controller ─ whether hardware or software ─
    with the result that the controller may falsely detect a degraded
    state. For this purpose, Western Digital has released "RAID-certified"
    disks, which have a different timing setup and respond faster to status
    polls, so that they aren't marked as defective by software or hardware
    RAID setups when they are in fact still functioning normally but busy
    executing other instructions.

    A second possibility is the following... Since you enumerate the
    devices as /dev/sdc and /dev/sdd, that tells me that you're running a
    GNU/Linux system.

    Right.

    And then there are a few questions that pop up,
    because then more information is needed...

    1. Do you ever power the machine down, and if so, did you power down
    between your previous report on the issue and the report that I'm
    now replying to?

    I rarely reboot, and I haven't rebooted for 20+ days.


    2. What distribution are you running on your system?

    Xubuntu.


    3. Are the devices mounted by UUID or LABEL, or do you mount them
    by way of their Linux-specific /dev/sd? designations?

    I'm unsure about that. I have a couple of (other) ext4 disks that I
    mounted by UUID. But I was thinking that ZFS has its own way of
    accessing the disks; when listing the status of the zpool, the disks
    are identified by unique strings like
    "ata-<vendor>_<hard-disk-model>_<serial-number>".

    What I can say is that on failure the ZFS disk identification of the
    'removed' disk matched the respective /dev/sd? that was shown by
    smartctl to be defective.
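
    In case it matters, that matching was done by hand, roughly like this
    (output abbreviated):

        # zpool status lists the pool members by their persistent ata-... names
        zpool status
        # the by-id symlinks show which /dev/sd? each of those names currently resolves to
        ls -l /dev/disk/by-id/ | grep ata-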


    The thing is that the /dev/sd? designations are not guaranteed to be persistent across reboots. [...]

    If it's neither of the above, then I suspect there to be a problem with
    your hot-swap backplane, as I wrote in my previous reply. It could just
    be an intermittent problem ─ e.g. a contact issue ─ or it could be
    permanent, but I have insufficient experience with such backplanes, so
    I don't really know which ones are high quality and which ones are
    prone to failure.

    Hope this helps. ;)

    Thanks!

    Janis
