This is my first posting in this newsgroup; in case this is not the
appropriate group for this question, I'd welcome pointers.
The problem: In the past I had problems with the consistency of my
hard disk file systems (from ext2/ext3 and reiser with soft RAID, to
ZFS); with the older file systems I often ended up with corrupted
data. Now, with ZFS running, I can see the probable source of the
faulty behaviour: ZFS reports one of the hard disks as 'removed' and
the whole pool's state as 'degraded'. Running the smartctl tool seems
to indicate that one of the three RAID-Z disks is unavailable, as if
it had simply been switched off.
Removing and re-inserting the disk into its slot activates it again;
the smartctl tool shows all the expected disk information, and after
a ZFS 'online' of that disk everything is fine (i.e. no data loss).
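For illustration, the recovery steps look roughly like this; the pool
name 'tank' is made up, the affected device on my system is /dev/sdd:

    zpool status tank        # shows the pool DEGRADED, one disk REMOVED
    smartctl -i /dev/sdd     # after re-seating, the disk identifies again
    zpool online tank sdd    # bring the re-inserted disk back online
    zpool status tank        # pool resilvers and returns to ONLINE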
I have ruled out a hard disk problem, since I have tried many
different disks (different vendors as well as identical models), and
the problem always occurs only with the disk connected as /dev/sdd.
My suspicion is that the controller hardware on the motherboard might
be faulty.
Any advice about the source of this sort of problem? Or suggestions
on how to prevent the disk at /dev/sdd from occasionally becoming
unavailable and (sort of) vanishing?
On Sunday 28 Feb 2016 06:55, Janis Papanagnou conveyed the following to comp.unix.admin...
[...]
Any advice about the source of this sort of problem? Or suggestions
on how to prevent the disk at /dev/sdd from occasionally becoming
unavailable and (sort of) vanishing?
If it is indeed the controller on the motherboard, then the only thing I
can think of, given that you're on a software RAID, would be to get a
PCI, PCI-X or PCIe SATA adapter card. And then for good measure, you
should connect _all_ of your hard disks to that one.
On the other hand, there's also a chance, given that you're alluding
to hot-swap drive bays, that it could be the backplane itself which
is faulty, or the cable for that one particular drive bay. And in
that case, the only thing you can do is replace the cable (which is
cheapest) or the backplane (which will cost you more).
So I would advise checking the cable first: see whether it's well-
seated, try another cable for a while, and then see whether the
problem persists. With a bit of luck, it's only the cable which is
faulty. ;)
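While you're testing, the kernel log should tell you whether the link
to that bay is dropping; something along these lines (the grep
pattern is just an example):

    # Watch for the kernel reporting SATA link drops or resets:
    dmesg | grep -iE 'link down|link up|hard resetting link'

    # On a systemd-based distribution, the same via the journal:
    journalctl -k | grep -iE 'link down|link up|hard resetting link'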
On 28.02.2016 09:57, Aragorn wrote:
On the other hand, there's also a chance, given that you're alluding
to hot-swap drive bays, that it could be the backplane itself which
is faulty, or the cable for that one particular drive bay. And in
that case, the only thing you can do is replace the cable (which is
cheapest) or the backplane (which will cost you more).
So I would advise checking the cable first: see whether it's well-
seated, try another cable for a while, and then see whether the
problem persists. With a bit of luck, it's only the cable which is
faulty. ;)
Thanks for your suggestions, Aragorn!
Sadly, while I was making a plan to follow your suggestions for
localizing the problem, my system seems to have decided to fool me.
Without my changing anything, the ZFS pool again became 'degraded';
but this time (and for the first time) it is another device,
/dev/sdc, that became inaccessible.
Does that now, in your experience, change the diagnosis of the
problem?
[...]
Sadly, while I was making a plan to follow your suggestions for
localizing the problem, my system seems to have decided to fool me.
Without my changing anything, the ZFS pool again became 'degraded';
but this time (and for the first time) it is another device,
/dev/sdc, that became inaccessible.
Well, there is now something else that pops into my mind, and from
reading your contributions to comp.unix.shell, I'd imagine you to be
a professional and thus not one to make the mistake I'm about to
expound on, but there _is_ always the chance that the hard disks in
your array are actually not RAID-certified.
It is not uncommon for consumer-grade SATA disks, and most notably
those made by Western Digital, to be a little slow in responding to
certain status polls from the RAID controller, whether hardware or
software, with the result that the controller may falsely detect a
degraded state. For this purpose, Western Digital has released
"RAID-certified" disks, which have a different timing setup and
respond faster to status polls, so that they aren't marked as
defective by software or hardware RAID setups when they are in fact
still functioning normally but busy executing other instructions.
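If your drives support SCT Error Recovery Control, you can inspect,
and often shorten, that error-recovery timing yourself with smartctl.
A sketch, assuming the suspect drive is /dev/sdd (note that on most
drives the setting does not survive a power cycle):

    # Query the current error-recovery timeouts; consumer drives
    # often report "disabled", meaning a bad sector may stall the
    # drive for tens of seconds before it answers the controller.
    smartctl -l scterc /dev/sdd

    # Set read and write recovery timeouts to 7.0 seconds (the
    # values are given in tenths of a second), as is usual for
    # drives in RAID arrays.
    smartctl -l scterc,70,70 /dev/sdd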
A second possibility is the following... Since you enumerate the
devices as /dev/sdc and /dev/sdd, that tells me that you're running
a GNU/Linux system. And then a few questions pop up, because more
information is needed...
1. Do you ever power the machine down, and if so, did you power down
between your previous report on the issue and the report that I'm
now replying to?
2. What distribution are you running on your system?
3. Are the devices mounted by UUID or LABEL, or do you mount them
by way of their Linux-specific /dev/sd? designations?
The thing is that the /dev/sd? designations are not guaranteed to be
persistent across reboots. [...]
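ZFS on Linux can be pointed at the stable device names under
/dev/disk/by-id/ instead; a sketch, assuming your pool is named
'tank' (substitute your actual pool name):

    # The by-id names are derived from vendor and serial number,
    # so they stay the same no matter how the kernel enumerates
    # the disks at boot.
    ls -l /dev/disk/by-id/

    # Re-import the pool using the stable names:
    zpool export tank
    zpool import -d /dev/disk/by-id tank
    zpool status tank    # vdevs are now listed by their by-id names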
If it's neither of the above, then I suspect a problem with your
hot-swap backplane, as I wrote in my previous reply. It could just
be an intermittent problem, e.g. a contact issue, or it could be
permanent; but I have insufficient experience with such backplanes,
so I don't really know which ones are high quality and which ones
are prone to failure.
Hope this helps. ;)