RocketRaid 2680

The Rebuild

I rebuilt Fuzzserv over the past week and a half, and I had let the 10TB RAID6 array resync for about 40 hours before restoring any data to it (call me cautious).  I came home today and noticed that my filesystem was full.  A quick du showed /var/log/{messages,syslog,kern.log} to be the culprits.  The rr2680 driver had flooded Syslog with these messages:

Feb 15 21:53:46 fuzzserv kernel: [200245.998943] rr2680:Request(ffff8801355d2bf8) 2a-0-c5-70 fd-80-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998950] rr2680:Request(ffff8801355d2ca0) 2a-0-c5-70 fc-30-0-0 || 10-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998956] rr2680:Request(ffff8801355d50b8) 2a-0-c5-70 fd-98-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998963] rr2680:Request(ffff8801355d3528) 2a-0-c5-70 f7-18-0-0 || 80-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998970] rr2680:Request(ffff8801355d7e00) 2a-0-c5-70 fd-b0-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998976] rr2680:Request(ffff8801355d9e28) 2a-0-c5-70 fc-40-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998983] rr2680:Request(ffff8801355d9798) 2a-0-c5-70 fd-c8-0-0 || 10-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998989] rr2680:Request(ffff8801355d8538) 2a-0-c5-70 fb-10-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Feb 15 21:53:46 fuzzserv kernel: [200245.998995] rr2680:Request(ffff8801355d2fe8) 2a-0-c5-70 fd-d8-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6
Interesting.  I turned Syslog off for the time being, and moved /var/log/syslog to a separate drive for later perusal.
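
With gigabytes of these lines, a quick tally is easier to read than the raw log. Here is a minimal Python sketch that counts failures per device and status code. The regular expression is inferred purely from the lines quoted above, not from any rr2680 documentation, and the function name summarize is my own.

```python
import re
from collections import Counter

# Pattern inferred from the log excerpt above; this is an assumption,
# not a documented rr2680 message format.
LINE_RE = re.compile(
    r"rr2680:Request\((?P<req>[0-9a-f]+)\) .* "
    r"to device (?P<dev>\S+) failed as status (?P<status>\d+)"
)

def summarize(lines):
    """Count failed requests per (device, status) pair."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line)
        if m:
            counts[(m.group("dev"), int(m.group("status")))] += 1
    return counts

# Two of the lines from the excerpt above.
sample = [
    "Feb 15 21:53:46 fuzzserv kernel: [200245.998943] rr2680:Request(ffff8801355d2bf8) "
    "2a-0-c5-70 fd-80-0-0 || 18-0-0-0 0-0-0-0 to device S2/0 failed as status 6",
    "Feb 15 21:53:46 fuzzserv kernel: [200245.998950] rr2680:Request(ffff8801355d2ca0) "
    "2a-0-c5-70 fc-30-0-0 || 10-0-0-0 0-0-0-0 to device S2/0 failed as status 6",
]

for (dev, status), n in summarize(sample).items():
    print(f"device {dev}: {n} failures with status {status}")
```

To run it against a real file, pass an open file object instead of the sample list, since summarize only needs something it can iterate line by line.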

The Investigation

I checked /proc/mdstat and noticed that the resync had slowed to a measly 14K/s, so something was definitely wrong.  I ran mdadm -E on all of the drives in the array, and 1 out of the 7 wouldn’t return anything.  Checking the front of the server showed, sure enough, that none of the activity lights were doing anything except for that drive, which had a solid blue activity light.
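
A crawling resync like this is easy to catch automatically. The sketch below pulls the progress figures out of /proc/mdstat text. The SAMPLE string is a made-up illustration (its device names and block counts are not from my server), and resync_status is a hypothetical helper name.

```python
import re

# Matches the progress line md prints while a resync or recovery is running.
PROGRESS_RE = re.compile(
    r"(?P<action>resync|recovery|reshape|check)\s*=\s*(?P<pct>[\d.]+)%"
    r".*?finish=(?P<finish>[\d.]+)min\s+speed=(?P<speed>\d+)K/sec"
)

def resync_status(mdstat_text):
    """Return progress details for the first active operation, or None."""
    m = PROGRESS_RE.search(mdstat_text)
    if m is None:
        return None
    return {
        "action": m.group("action"),
        "percent": float(m.group("pct")),
        "finish_min": float(m.group("finish")),
        "speed_k_per_s": int(m.group("speed")),
    }

# Hypothetical mdstat excerpt, for illustration only.
SAMPLE = """\
md0 : active raid6 sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0]
      9767559680 blocks level 6, 64k chunk, algorithm 2 [7/7] [UUUUUUU]
      [=>...................]  resync =  8.1% (158700000/1953511936) finish=2136457.0min speed=14K/sec
"""

status = resync_status(SAMPLE)
if status and status["speed_k_per_s"] < 1000:
    print(f"{status['action']} is crawling at {status['speed_k_per_s']}K/sec")
```

On a live system the input would come from reading /proc/mdstat. The 1000K/sec threshold is arbitrary, so pick whatever counts as "too slow" for your drives.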
I tried to stop the array, without success, and at this point ksoftirqd was eating up 100% of one of my cores, so I figured it was time to restart and see what happened.  I wasn’t too worried, since the array didn’t have any data yet, it’s a RAID6, and I had just started storing a write-intent bitmap on my system drive.
The single problem drive stayed lit up through the restart, and the RocketRaid BIOS never completed scanning for drives, so I shut it down, shuffled all the drives up a slot (I was using slots 2-8, which was bothering me anyway) and fired it back up successfully.

The Resolution

Once Ubuntu loaded, mdadm didn’t quite assemble the array correctly.  It created a weird /dev/md0_d0 array out of only 2 of the 7 drives, so I stopped it to free up the drives and manually --assemble’d the array, which it happily did.  It immediately started a resync at ~70M/s, which it estimates will take about 8 hours.  I’m not seeing the errors hitting Syslog, which is to be expected.
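
The 8 hour estimate is easy to sanity-check. This rough sketch rests on two assumptions that are not stated above: each member drive is 2 TB (10 TB usable spread over 7 drives minus 2 for parity), and the reported speed is a per-drive rate in decimal units, so a full pass takes roughly one drive's capacity divided by that rate.

```python
def resync_hours(member_bytes, bytes_per_second):
    """Hours for one full pass over a member drive at a steady rate."""
    return member_bytes / bytes_per_second / 3600

# Assumption: 10 TB usable / (7 drives - 2 parity) = 2 TB per drive.
MEMBER_BYTES = 2e12

fast = resync_hours(MEMBER_BYTES, 70e6)  # ~70M/s after the reshuffle
slow = resync_hours(MEMBER_BYTES, 14e3)  # 14K/s while the drive was hung

print(f"{fast:.1f} hours at 70 MB/s")
print(f"{slow / 24 / 365:.1f} years at 14 KB/s")
```

Under those assumptions the healthy rate works out to just under 8 hours, which matches the estimate, while a full pass at 14K/s would take years.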
I don’t know if the controller crapped itself or if one of the drives had a bad day and the controller didn’t handle it gracefully.  Hopefully it won’t happen again.
