Gather ‘round my children and let me tell you a tale
So I noticed that 4 of my 2-disk raid1 arrays were degraded after a RAM upgrade on one of my servers. I checked which device was missing and it was /dev/sdd1 (/dev/sdd2, /dev/sdd3 and /dev/sdd6 from other arrays, too, but I will only consider sdd1 from /dev/md1 for the sake of this tale). Opening the case, I saw that I had knocked off one of the SATA cables. It seems like about 1 in 10 SATA cables that I encounter do not clip properly and slide off easily. The one that had been knocked off was at the SATA3 port on the mobo (port numbering starts at SATA0, so it seemed reasonable that it was /dev/sdd). Anyway, I reconnected the cable, booted the machine, and mdadm --add’ed the partitions back to the relevant arrays, like so: mdadm /dev/md1 --add /dev/sdd1
Everything looked great in cat /proc/mdstat.
After a kernel upgrade, I noticed that the arrays started degraded again! I went ahead and mdadm --add’ed the partitions back, and all looked well in /proc/mdstat.
Then I rebooted just to make sure that it would assemble properly on boot. It booted degraded once again.
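If you want to spot this after every boot without squinting at /proc/mdstat, here is a minimal sketch. It assumes the usual mdstat layout, where each array has a header line like "md1 : active raid1 ..." followed by a status line ending in something like "[2/1] [U_]", with an underscore for each missing member. The function name degraded_arrays is mine, not anything mdadm ships.

```shell
# degraded_arrays [FILE]: print the name of every md array whose status
# line shows a missing member. FILE defaults to /proc/mdstat.
degraded_arrays() {
    awk '
        # Header line, e.g. "md1 : active raid1 sda1[0]": remember the name.
        /^md[0-9]+ :/ { name = $1 }
        # Status line, e.g. "104320 blocks [2/1] [U_]": "_" marks a missing slot.
        /\[[0-9]+\/[0-9]+\] \[[U_]+\]/ {
            if ($NF ~ /_/) print name
        }
    ' "${1:-/proc/mdstat}"
}
```

Calling degraded_arrays with no arguments reads the live /proc/mdstat and prints nothing when every array is whole.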
So after trying lots of stuff such as clearing the superblock, zeroing the whole device, and --stop’ing and --assemble’ing manually, I finally did what I should have done initially, which is a dmesg | less. (Nothing peculiar was in /var/log/messages and friends.)
In dmesg, the kernel tells you what it’s trying to do as it’s assembling the arrays. It finds all the partitions that could possibly be members of raid arrays. It then matches them up by UUID and tries to assemble them. So I first saw the kernel considering /dev/sdd1. OK, so it found a bunch of partitions that didn’t match up with it, then it mentioned that /dev/sda1 matched and would be considered (which was the working disk still in the array). Then it found more that didn’t match, and finally, to my surprise, mentioned that /dev/sdb1 matched and would be considered! So that’s three devices in a two-device array. How does the kernel handle this? Well, judging by what dmesg showed, it chooses to use the last 2 matching ones it finds, so it didn’t bother with /dev/sdd1 at all. It assembled /dev/sda1 and /dev/sdb1 into /dev/md1, but /dev/sdb1 wasn’t “fresh” because it was actually the device whose cable fell off. That’s right, the kernel had now decided that the device at SATA3 on the mobo would be /dev/sdb instead of /dev/sdd.
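You can do the same UUID matching by hand instead of reading it out of dmesg. The sketch below asks each partition's superblock which array it thinks it belongs to and lists everyone claiming a given UUID. It assumes mdadm --examine prints a "UUID :" line (0.90 metadata) or an "Array UUID :" line (1.x metadata); check the output of your own mdadm version first. The helper names array_uuid and claimants, and the MDADM override, are made up for this example.

```shell
# MDADM can be pointed at a stub for a dry run; it defaults to the real tool.
MDADM="${MDADM:-mdadm}"

# array_uuid DEV: print the array UUID recorded in DEV's md superblock.
# Assumption: --examine emits a "UUID :" or "Array UUID :" line.
array_uuid() {
    "$MDADM" --examine "$1" 2>/dev/null |
        awk '/^ *(Array )?UUID : / {
                 sub(/^ *(Array )?UUID : /, "")   # strip the label
                 print $1                          # first token is the UUID
                 exit
             }'
}

# claimants UUID DEV...: print every DEV whose superblock carries that UUID.
claimants() {
    want="$1"
    shift
    for dev in "$@"; do
        [ "$(array_uuid "$dev")" = "$want" ] && echo "$dev"
    done
    return 0
}
```

Something like claimants "$(array_uuid /dev/sda1)" /dev/sd[a-d]1 would then show whether more partitions claim the array than it has slots. Getting three names back for a two-disk array is the situation described above.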
To fix this, I simply had to
mdadm /dev/md1 --add /dev/sdd1
mdadm /dev/md1 --fail /dev/sdd1
mdadm /dev/md1 --remove /dev/sdd1
mdadm /dev/md1 --add /dev/sdb1
for each of the relevant arrays. Then it assembled properly on each boot.
The moral of this story: always do a mdadm --detail /dev/mdX BEFORE adding partitions to an array, to make sure you have the proper device name of the failed device.
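To make that habit harder to skip, here is a small sketch of a look-before-you-add helper. The --detail call is the check from the moral. The --examine call on the candidate partition is my own addition, on the assumption that seeing what the partition's superblock says is also useful. The name pre_add_check and the MDADM override are hypothetical.

```shell
# MDADM can be overridden (e.g. MDADM=echo) to print the commands instead
# of running them; it defaults to the real mdadm.
MDADM="${MDADM:-mdadm}"

# pre_add_check ARRAY PARTITION: show the state of both sides before an --add.
pre_add_check() {
    array="$1"
    part="$2"
    # What the running array currently contains and which member is missing.
    "$MDADM" --detail "$array"
    # What the candidate partition's own superblock claims.
    "$MDADM" --examine "$part"
}
```

Running pre_add_check /dev/md1 /dev/sdb1 and reading both outputs before typing mdadm /dev/md1 --add /dev/sdb1 would have saved me most of this tale.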