Wednesday, October 5, 2011

Linux SW Raid - 2x WD20EARS = 2TB Raid1

I've been using Linux software raid now for at least 6 years. What started off as 3x200GB in Raid5 eventually grew to 4x500GB in Raid5. With storage becoming cheaper (and less reliable, for that matter), I decided to shift to a Raid1 setup. At $70/disk, a pair of WD20EARS drives seemed like cheap, safe storage in Raid1.

What I guessed would be one afternoon of work turned into 4 days of troubleshooting before I got it working properly. I'm using Ubuntu 10.10 userspace with Linux 3.0. Whether this combination of new kernel and old userspace (mdadm v2.6.7) was the reason, I still don't know.

This is how it went down (don't do this):

1) Made 'fd' type partitions sdb1 and sdc1 (both starting at sector 2048)
2) Created md0 with sdb1 + missing, level=1
3) Ran mkfs.ext4 -b 4096 on md0
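
For reference, those first three steps were roughly the following commands (device names are from my box, and I've only summarised the interactive fdisk step, so treat this as a sketch):

    # fdisk /dev/sdb  ->  new primary partition starting at sector 2048, type 'fd'
    # fdisk /dev/sdc  ->  same again
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 missing
    mkfs.ext4 -b 4096 /dev/md0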

At this point sdb1 disappeared! fdisk -lu reported that sdb hadn't even been initialised!

Anyway, I went ahead and mounted md0, copied my files over, unmounted, ran partprobe and remounted for luck, and it all looked good.
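
In case it's useful, that mount-copy-remount dance was roughly this (the mount point and rsync source here are placeholders, not my real paths):

    mount /dev/md0 /mnt/raid1
    rsync -a /mnt/oldraid5/ /mnt/raid1/   # copy everything across
    umount /mnt/raid1
    partprobe                             # re-read partition tables
    mount /dev/md0 /mnt/raid1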

4) Rebooted the machine

Now I couldn't mount md0 anymore. It couldn't find the ext4 superblocks. I tried a number of backup superblock locations manually, but nothing seemed to work. I noticed that mdadm --detail /dev/md0 said the array was using "sdb" now (instead of the sdb1 it was actually created with). But then again, sdb1 hadn't existed since the mkfs on md0. Arghh.
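
For anyone in the same boat, this is the sort of poking around I mean (the 32768 offset is just the standard first backup superblock for a 4K-block ext4 filesystem, nothing specific to my setup):

    mdadm --detail /dev/md0       # reports "sdb" as the member, not sdb1
    mdadm --examine /dev/sdb      # look at the md superblock on the raw disk
    mkfs.ext4 -n -b 4096 /dev/md0 # dry run: prints where the backup superblocks would live
    fsck.ext4 -b 32768 /dev/md0   # try an alternate superblock location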

5) Created an '83' type partition sdb1 starting at the same sector 2048

Now out of nowhere, /dev/md0p1 appears! What the hell is this?

6) Ran fsck on md0 and it couldn't find the filesystem
7) Ran fsck on md0p1 and it found the filesystem, but complained about some boundary sizes being wrong
8) Forced fsck on md0p1. All OK.
9) Ran resize2fs to fix the boundary issue

Now I can mount md0. All my files are still there. fdisk -lu shows sdb1 as a Linux partition, and shows the "disk" md0 as containing a partition md0p1 of type Linux. What???
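
In command form, steps 6 to 9 were roughly the following (reconstructed from memory, so a sketch rather than a transcript):

    fsck.ext4 /dev/md0        # no filesystem found here anymore
    fsck.ext4 /dev/md0p1      # finds it, complains the filesystem size doesn't match the device
    fsck.ext4 -f /dev/md0p1   # forced check comes back clean
    resize2fs /dev/md0p1      # resize the filesystem to match the device boundary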

10) Ran mdadm /dev/md0 --add /dev/sdc1. It resynced, and the array came up clean and non-degraded.

11) Reboot again

Can't access md0 anymore! WHAT IS GOING ON! mdadm --detail /dev/md0 shows clean, but it says it's using "sdb" and "sdc". Obviously sdc1 doesn't exist anymore either.

I did all this twice to double-check I hadn't made any stupid mistakes, and I got the exact same result.

After much debugging, I decided to build the raid1 array again using the entire disks sdb and sdc, instead of the 'fd' partitions sdb1 and sdc1 I used the first time. This just worked! No problems at all. Rebooted a few times, and I can still see my ext4 fs on md0 and everything looks good. Why couldn't it deal with partitioned array members?
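
The working whole-disk setup was just this (a sketch; if the old 'fd' partitions are still visible, zero their md superblocks first so nothing tries to assemble the old array):

    mdadm --stop /dev/md0
    mdadm --zero-superblock /dev/sdb1 /dev/sdc1   # only if the old partitions still exist
    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 -b 4096 /dev/md0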

If anyone is interested, using an onboard Nvidia SATA II controller, my Raid1 re-sync speed is 100MB/s with the 2x WD20EARS. That's just under 6hrs for a full re-sync.

Copying from the 4x500GB Raid5 to the 2x2TB Raid1 ran at about 90MB/s.
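
If you want to watch or tune a re-sync yourself, the usual places to look are below (the 50000 KB/s value is only an example, not what I used):

    cat /proc/mdstat                                # progress and current speed
    sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max
    echo 50000 > /proc/sys/dev/raid/speed_limit_min # raise the floor if md is throttling itself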

One more thing: before doing any of this, I added wdidle3.exe (v1.05) to UBCD and disabled the idle timer on both green drives. The Load Cycle Count on two of my older 1TB WD10EARS drives hit 60,000+ after 1 year of usage. Although WD says these drives can handle up to 300,000 cycles, one of my old 1TB green drives has started reporting bad sectors. Whether this was because of the LCC or because these were just "Green" drives, I don't know.
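
If you want to check your own drives, smartmontools will show both the cycle count and the temperature (attribute names as they appear on my WD drives; other vendors may label them differently):

    smartctl -A /dev/sdb | egrep -i 'Load_Cycle_Count|Temperature_Celsius'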

With fewer drives in the chassis (3 now, was 5), disk temps are about 10°C lower. And with the idle timer disabled, so the LCC stops climbing, I expect these drives to last at least 2 years. Only time will tell for sure.
