I've been thinking about hard drive crashes lately, since the hard disk in my desktop at work died and I also stumbled upon a data loss horror story. My desktop hard disk crash was not a major issue, costing only a few hours of my time to restore the data plus the price of a new hard disk. This is because I continually back up my data with the following cron job:
rsync -az --delete --delete-excluded --force -e ssh ~/ backup_server:backups/padraig/
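For reference, a crontab entry running that command nightly might look like the following (the 01:30 schedule is just an example, not necessarily the one I use):

```shell
# m h dom mon dow -- run the backup every night at 01:30
30 1 * * * rsync -az --delete --delete-excluded --force -e ssh ~/ backup_server:backups/padraig/
```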
But is there anything else I could do to alleviate even these costs?

Single drives

Currently my laptop has a 60GB disk, and my backup solution is to copy data off periodically to an external USB hard disk. But is there anything I can do to prolong the life of the hard disk itself, i.e. reduce the replacement cost mentioned above?

load/unload cycles

[image: ramp for loading the hard disk head]

There has been a bit of a buzz lately about Ubuntu being easy to configure in such a way as to hurt hard drive reliability through excessive head "loading" and "unloading", and this issue has been noticed before in regard to a Fedora kernel upgrade. I did some investigation using the smartctl -A /dev/sda command. In summary, Fedora at least does not change the "Advanced power management level" for hard disks in the system, but the drives' default settings can be problematic. In my laptop (which ran Fedora Core 4 for most of its life), my Hitachi hard disk has "load cycled", i.e. automatically parked the head on its ramp, on average once every 48 seconds. In other words, this laptop, which has been powered on for about 6 months of its 19 month life, has clocked up 351,675 load cycles out of a rated maximum of 600,000 (which according to the marketing blurb is twice the industry average). This is far too aggressive for my liking, so how do we change it?
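To check how your own drive is doing, you can derive the average interval between load cycles from the Power_On_Hours and Load_Cycle_Count attributes in the smartctl -A output. A small sketch; it assumes smartctl's standard attribute table layout, where the raw value is the tenth column:

```shell
# load_cycle_rate: read `smartctl -A` output on stdin and report the
# average interval between head load cycles.
load_cycle_rate() {
    awk '
        /Power_On_Hours/   { hours  = $10 }  # attribute 9: hours powered on
        /Load_Cycle_Count/ { cycles = $10 }  # attribute 193: head load cycles
        END {
            if (hours && cycles)
                printf "one load cycle every %.0f seconds on average\n",
                       hours * 3600 / cycles
        }'
}

# Example: smartctl -A /dev/sda | load_cycle_rate
```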

There are two ways to try to alleviate this. The obvious one is to change the "Advanced power management level" for the drive using the hdparm -B command. The non-obvious one is to get Fedora (or Ubuntu) to write to the disk less frequently, thus allowing the head to stay on the ramp for longer. A quick investigation showed that various processes periodically read files (usually from the cache), but by default Linux will write the updated access time back to the disk. This file access time is very rarely used and can be safely turned off, by adding the "noatime" option to the appropriate mounts in the /etc/fstab file. I also noticed that we will not need to worry about this in Fedora 8, since the relatime option will be enabled by default for all mounts. Anyway, after I changed the mount options to noatime, the load cycle frequency dropped from once every 48 seconds on average to once every 108 seconds, which is a big improvement. There are other kernel tunables to reduce drive accesses based around the /proc/sys/vm/laptop_mode setting, which you can enable by adding echo 5 > /proc/sys/vm/laptop_mode to /etc/rc.local for example.
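Putting those pieces together, the changes look something like this. The device, partition and filesystem names are illustrative and should match your own system, and 192 is just a commonly suggested less aggressive APM value, not a universal recommendation:

```shell
# /etc/fstab -- add noatime so cached reads stop triggering disk writes:
#   /dev/sda1  /  ext3  defaults,noatime  1 1

# Less aggressive head parking (255 would disable APM entirely):
hdparm -B 192 /dev/sda

# Batch up writes; add this line to /etc/rc.local to apply at boot:
echo 5 > /proc/sys/vm/laptop_mode
```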

Another issue I noticed, on a friend's Dell M1330 laptop with a 120GB Western Digital hard disk, was a very annoying and audible click every time the drive load cycled. So there we disabled the "Advanced power management level" completely using the hdparm -B 255 /dev/sda command, added to the /etc/rc.local file.

Power off retract count

Another statistic provided by the smartctl command above is Power-Off_Retract_Count, which counts how many times the heads have been "emergency parked" because power was lost to the drive. The Hitachi docs describing load cycling say:
"In the event of power loss to the drive, Hitachi GST invented a fault-tolerant retract system to move the heads to the park position by extracting energy generated from the spinning disks through a high-efficiency retract circuit. This circuit directs the current from the spindle motor back-EMF to the actuator assembly, enabling the sliders to move off the disk area to the ramp in a controlled fashion during an unexpected power down situation. In February, 2000, Hitachi GST was awarded the patent to this invention (US 6,025,968), which is used in all Hitachi GST drives and any hard drive incorporating load/unload technology."
Now, I noticed a click every time my machine shut down when it was running Fedora Core 4, which is why I have the current high count of 835. This issue has already been fixed in Fedora 7, but one still has to be careful with external USB drives, like the one I use for my laptop backup solution. In that case one can issue the sdparm --command=stop /dev/sdb command to get the disk to spin down normally before unplugging it.
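So a backup to an external USB drive might end by spinning the disk down cleanly. A sketch, in which the mount point, device node and target directory are assumptions for illustration:

```shell
#!/bin/sh
# Back up the home directory to an external USB disk, then spin the
# disk down so the heads park normally rather than emergency-retracting
# when the drive is unplugged.
MOUNT=/mnt/backup       # where the USB disk is mounted (assumed)
DEV=/dev/sdb            # the USB disk's device node (assumed)

rsync -az --delete ~/ "$MOUNT/padraig/" &&
umount "$MOUNT" &&
sdparm --command=stop "$DEV"
```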

solid state drives

[image: Samsung 2.5 inch solid state disk]

The above issues really emphasize some of the problems with traditional hard disk technology, i.e. the moving parts are really bad for performance and reliability. I really can't wait for solid state drives to become the norm, when we will look back with disdain on the rattly things we used to have in our systems. This is starting to happen, with the major laptop manufacturers making them available as an option, albeit at a bit of a price premium at present. A 32GB SSD would have ample space for me, and generally I think one should not have a large amount of storage on a laptop anyway, due to other reliability issues. Instead use external storage, like traditional external USB drives for example, which are large, cheap and easy to use. Note: be careful, as many of the first SSDs have very slow random write performance. Also, current SSDs have too much internal logic, which vendors add for maximum compatibility, but also to increase differentiation and user tie-in. This is analogous to the hardware/software RAID situation described below.

Drives in combination (RAID)

[image: redundant hard disks in parallel]

The other cost I mentioned above was the time to repair, which can be addressed using some form of redundancy, allowing the system to continue functioning in a degraded state while you replace the failing or failed disks. Note that RAID is only used to increase system availability or performance, and is not enough to ensure the integrity of data on hard disks. Additional higher-level data duplication (backup copies) is required to protect against the more common non-disk-specific issues like operating system bugs, application bugs or keyboard-chair interface errors. By the way, don't think "tape" for backup. This is a common misconception I've encountered. Think instead about the easiest way to make a copy of your data, which given today's cheap storage prices is generally just another instance of your storage media. I'm not sure tape was ever the right answer, but if it was, it was a long time ago.

reliability calculations

So how much does adding another hard disk help in regard to downtime? Well it depends on the RAID type used, the MTBF of the component drives of the RAID, and how quickly one can replace a failed drive in the array.

RAID 0, for example, actually reduces the availability of the system, as it combines drives in series for performance, so that if either drive fails, the data is lost. I notice that Dell offers RAID 0 as a performance option on some desktop systems, and it's even the default currently on the Dimension™ 9200 for example. They do not warn, however, that this effectively halves the expected lifetime of the storage; they only say it's faster. RAID 1, in contrast, increases the availability of the system, as it combines drives in parallel, so that if one fails, the system can still operate in a somewhat degraded state.

Actually quantifying the change in MTBF and availability for these series and parallel combinations of drives can be tricky, so you can use my online reliability calculator and plug in the MTBF and MTTR values for your drives. Note: be careful with hard disk MTBF values, as manufacturers report different things (sometimes even the MTBF while powered off). There are also user-dependent variables to consider, like duty cycle (how much you use the drive), temperature, etc. It's also worth noting that the effectiveness of RAID diminishes as drive sizes increase.
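For a rough feel of the numbers, the standard first-order formulas are: two identical drives in series (RAID 0) halve the MTBF, while two repairable drives in parallel (RAID 1) give approximately MTBF²/(2·MTTR), valid when the MTTR is much smaller than the MTBF. A sketch, where the per-drive figures are illustrative assumptions, not measured values:

```shell
#!/bin/sh
# Rough MTBF estimates for arrays of two identical drives.
MTBF=500000   # per-drive MTBF in hours (an assumed, illustrative figure)
MTTR=24      # hours to replace a failed drive (assumed)

awk -v m="$MTBF" -v r="$MTTR" 'BEGIN {
    # RAID 0: drives in series -- either drive failing loses the array.
    printf "RAID 0 MTBF: %.0f hours\n", m / 2
    # RAID 1: drives in parallel with repair -- first-order
    # approximation, valid when MTTR << MTBF.
    printf "RAID 1 MTBF: %.0f hours\n", (m * m) / (2 * r)
}'
```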

RAID implementations

RAID itself has an associated cost of course, and some suggest that RAID is usually a bad idea. I would qualify that though, and suggest that hardware RAID is always a bad idea, while software RAID is usually a good idea. That implies of course that one has a good software RAID implementation; I've found Linux's very good, but can't comment on other systems. Hardware RAID just adds extra complexity and dependencies, and is often buggy anyway. Personally I've seen this "RAID means hardware RAID" misconception a lot, and have had hardware RAID systems forced on me in the past. The extra operating system and hardware dependencies introduced were just ridiculous, and only aid vendor lock-in. Software RAID, as well as being simpler, cheaper, faster and easier to upgrade, can also have functional benefits from tighter integration with the system software.
© Nov 6 2007