
RAID isn't always the answer

source link: http://rachelbythebay.com/w/2012/10/06/raid/

A friend of mine has been working on setting up a brand new server. Part of his company is splitting in two, and the part which is leaving needs to have their own fileserver now. The usual concerns apply here: the box needs to be at least as reliable as what they have now. They can't "go retrograde" due to this migration. It should not be surprising that RAID entered the picture.

Over the past couple of weeks as he's been figuring out the specs for this new box, I had the opportunity to watch it all come together. It's been an interesting thing to watch, since it really seems that RAID is still the only option for most people. They don't want to solve their data storage issues any other way.

I've run RAID myself in various capacities. We had countless web hosting customers who had arrays set up, both of the mirrored variety and of the "striped with parity" type. Some had hot spares and others didn't. I also ran it on some of the servers at my school district sysadmin gig. Even a few of my home machines have had some flavor of it at various points in time.

What bothers me is that this still seems to be considered the way to go for all problems. You're committing to adding a significant amount of complexity between you and your data in the hopes that it will actually help with the usual problem of drives deciding to die. Sometimes this works out fine, but other times it just adds more things which can fail.

I mentioned this to him, and his (justified) reaction was that it was really the only way to go. I can't really argue with that. Nobody seems to be selling any alternatives, short of going all out and setting up a full-blown SAN. Then you're talking about all sorts of magic hardware, extra adapters, redundant fiber paths, and more. That seems to be too far to the other side. Something should exist in the middle.

That got me thinking about what makes me worry about RAID. My main problem is that you are entirely at the mercy of your controller. Your disks probably aren't even visible (without a hack, that is), and the data stored on them is in some magic format. If your controller blows up, you have to get an exact replacement and hope the new one is willing to bootstrap from the metadata stored on your drives. You probably can't take the drives and mount them directly in a crisis situation, in other words. They are junk without that special controller.

Normally, when you hang a drive off a Linux box, it's nothing special. You can pull that disk out of one machine and stick it into another, and it'll be just fine over there. The partition data and filesystems will make sense to the other machine, and it can be mounted, and then you can dig around on the disk to find whatever it is you want. I started wondering: why can't this apply to a distributed fault-tolerant filesystem?

Ideally, you'd be able to say "store this file", and it would chop it up into reasonable chunks and distribute it to all of the systems which were participating. Those systems would then pass it along to others until all of the chunks reached some minimum number of replicas -- let's go with three for now.
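To make that concrete, here's a rough Python sketch of what the "store this file" entry point might look like. The chunk size, the node list, and the send_chunk callback are all invented for illustration; a real system would pick targets by free space and keep track of where everything landed.

    # A minimal sketch of the "store this file" idea: split the file into
    # fixed-size chunks and hand each chunk to three different machines.
    # CHUNK_SIZE, the node list, and send_chunk() are made-up placeholders.

    import hashlib

    CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, an arbitrary choice
    TARGET_REPLICAS = 3

    def store_file(path, nodes, send_chunk):
        """Chop the file at `path` into chunks and push each one to
        TARGET_REPLICAS distinct nodes. send_chunk(node, chunk_id, data)
        is assumed to do the actual network transfer."""
        with open(path, "rb") as f:
            index = 0
            while True:
                data = f.read(CHUNK_SIZE)
                if not data:
                    break
                # Name the chunk by its content hash so replicas are easy
                # to identify later, no matter which disk they land on.
                chunk_id = hashlib.sha256(data).hexdigest()
                # Rotate through the node list; good enough for a sketch,
                # a real system would balance by free space and health.
                targets = [nodes[(index + i) % len(nodes)]
                           for i in range(TARGET_REPLICAS)]
                for node in targets:
                    send_chunk(node, chunk_id, data)
                index += 1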

Let's say one of those machines has a disk which starts doing stupid things. This is usually evident from the syslog, and SMART data can also be used to help out. Your storage software would need to start removing chunks from that disk. Some of them might move to other disks in the same machine (if any), and others might migrate to other systems. Then the disk could be unmounted and subjected to a lengthy SMART test to see if it's really going to die or if it's just being annoying. As long as you had enough space in the cluster to pick up the slack, nothing bad should happen as a result of losing it.
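A drain step like that could be as simple as this sketch, where chunks_on(), copy_chunk(), and unmount() stand in for whatever the real storage daemon would actually provide. The point is that nothing here is smarter than "copy a file somewhere else, then let go of the disk":

    # A rough sketch of the drain step: when syslog/SMART flags a disk,
    # push every chunk it holds somewhere healthier, then unmount it so
    # a long SMART self-test can run. All of the callbacks are assumed.

    def drain_disk(bad_disk, healthy_disks, chunks_on, copy_chunk, unmount):
        for chunk_id in chunks_on(bad_disk):
            moved = False
            for disk in healthy_disks:
                if copy_chunk(chunk_id, src=bad_disk, dst=disk):
                    moved = True
                    break
            if not moved:
                # No room anywhere right now: leave it, the cluster still
                # has the other replicas, and we can retry later.
                print("could not evacuate", chunk_id)
        unmount(bad_disk)   # now safe to run something like smartctl -t long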

Another scenario would be if the disk just keels over with no warning. Maybe it catches a cosmic ray in just the right place, scribbles all over the magic inaccessible tracks which tell it about disk geometry, and dies. The filesystem would obviously abort, and then all reads and writes to it would also fail. This would make all of those chunks disappear from that machine. At the cluster level, those chunks would now be below their target replication value, and that would cause their remaining replicas to be used as sources for cloning. Either of the two servers which hold copies could start this operation, and they'd both be able to provide that data to clients should a request come in at that time.
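In other words, the repair side is basically a counting loop. Here's a hypothetical version, where replica_map maps each chunk to the set of machines still holding it, and pick_target() and clone() are assumed helpers:

    # A sketch of the repair loop that runs after a disk drops out: any
    # chunk with fewer than TARGET_REPLICAS live copies gets cloned from
    # one of its survivors onto a fresh target.

    TARGET_REPLICAS = 3

    def repair(replica_map, pick_target, clone):
        for chunk_id, holders in replica_map.items():
            missing = TARGET_REPLICAS - len(holders)
            if missing <= 0 or not holders:
                continue   # healthy, or lost entirely (alert on that case)
            source = next(iter(holders))
            for _ in range(missing):
                target = pick_target(exclude=holders)
                if clone(chunk_id, source, target):
                    holders.add(target)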

This is all a matter of timing. If you can get the chunks out of harm's way before all of the copies vanish, then you win and your data remains available. It may mean derating the effective speed of your storage cluster for client accesses when there's been a catastrophic event which requires a bunch of cloning, but that's life. I'd rather have a temporarily slow array than a dead one.

So now let's bring up the "really bad day" situation. In it, all three copies of a chunk somehow become unavailable at the same time. What happens now? Well, reads for that chunk will fail. This is a given. The interesting part is that if nobody asks for it, nothing bad will happen. Maybe it's some "long tail" file and not having it available for a little while is okay. Again, it's far better than having the entire array shut down because of multiple failures. The rest of the array's files are unaffected and life goes on for them.

You get an alert about this "missing chunk" and decide to investigate. It turns out that ninja squirrels got into your data center and purposely shorted out one of your PDUs, taking out two racks, and that is responsible for two of the copies being out. What about the third? Well, that box is actually up, but it started having problems with its hard drive and it was automatically unmounted to start a SMART scan.

Guess what? The data is still there. You can just remount it manually and go digging around. Even if the filesystem is horribly damaged, if you can find just that one file somehow, you can restore it and keep on going. Maybe your directory structure is completely gone and everything wound up in /lost+found. If your chunks have some identifying information in them, then you can just scan everything until you find the right one, and then restore it.
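Here's what that brute-force scan might look like, assuming each chunk file starts with a small identifying header -- a made-up magic string followed by the chunk's ID:

    # A sketch of the "dig through lost+found" recovery path. With a tiny
    # header in every chunk file, a dumb scan can find a specific chunk
    # even if every filename and directory entry is gone.

    import os

    MAGIC = b"CHUNKv1 "            # invented header format
    ID_LEN = 64                    # hex sha256

    def find_chunk(root, wanted_id):
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        header = f.read(len(MAGIC) + ID_LEN)
                except OSError:
                    continue       # unreadable file, keep scanning
                if header.startswith(MAGIC) and \
                   header[len(MAGIC):].decode("ascii", "ignore") == wanted_id:
                    return path
        return None

    # find_chunk("/mnt/sick-disk/lost+found", "ab12...") would hand back
    # the path of the one chunk you care about, ready to copy back in.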

The point is, when things are just files, you have an opportunity to use a bunch of tools to attempt recovery. Nothing ever gets more complicated than ordinary file recovery on a normal filesystem. Sure, any one of these filesystems may be part of a much bigger storage system, but it doesn't get in your way here.

Compare that to a RAID scenario when the controller blows up and now you have a bunch of disks which are effectively black boxes. Unless you somehow know how it does its striping and can rig up some kind of software equivalent, there is no way for you to see what is actually stored on there. That's when you start talking about Fedexing drives to recovery houses and paying thousands of dollars in the hopes of getting something back.

Remember that you can also have human errors which munch your arrays. All it takes is one tech who's just a little too fast and loose while in the admin console and then it's all gone.

I have a final thought about this vaporware scheme in which a bunch of ordinary files act in concert as a pseudo-filesystem. Remember that I said we'd aim for three replicas of any given chunk. If the amount of user data is well below the total capacity of the cluster, it may be possible to have many, many more. What I'm talking about is creating extra copies using available disk space.

Linux treats unused RAM as wasted RAM. So, why not treat unused disk space as wasted disk space? Rather than having it spin around and around holding nothing useful, store a copy of a chunk from some other machine. Obviously, you'd want this level of cloning to happen as "best effort" priority so it doesn't affect actual production traffic. If your cluster has times of low utilization on a daily or weekly cycle, it might have a chance to spread that data around.

With that kind of opportunistic extra replication in place, if you really had a triple simultaneous failure, you might still make it out okay, since some chunks might have four, five, six, or however many copies out there. There's just one catch: these extra copies would need the ability to be eliminated and replaced with "real data" at a moment's notice. With some intelligence in your software to pick which copies to evict, you'd have the opportunity to selectively re-balance a cluster as it fills up. The difference here is that the data is already on the machine(s) you want. If you'd rather not have it on another system, just allow it to be bumped by a new chunk. Easy!
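For what it's worth, the opportunistic part could be sketched about like this. The bonus/ directory, the headroom number, and the helper callbacks are all assumptions rather than a real protocol, but the idea is simply "fill quiet space, and let real writes evict it":

    # A sketch of "unused disk space is wasted disk space": during quiet
    # periods, pull bonus copies of other machines' chunks into free
    # space, and throw them out the moment a real write needs the room.

    import os, random

    BONUS_DIR = "/srv/chunks/bonus"      # extra copies live apart from real ones
    MIN_FREE = 50 * 1024**3              # keep 50 GB of headroom for real data

    def opportunistic_fill(free_bytes, remote_chunk_ids, fetch_chunk):
        # Best effort: stop as soon as we'd eat into the reserved headroom.
        while free_bytes() > MIN_FREE and remote_chunk_ids:
            chunk_id = random.choice(list(remote_chunk_ids))
            remote_chunk_ids.discard(chunk_id)
            fetch_chunk(chunk_id, os.path.join(BONUS_DIR, chunk_id))

    def make_room(bytes_needed):
        # Real data always wins: evict bonus copies until the write fits.
        freed = 0
        for name in os.listdir(BONUS_DIR):
            if freed >= bytes_needed:
                break
            path = os.path.join(BONUS_DIR, name)
            freed += os.path.getsize(path)
            os.remove(path)
        return freed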

Unfortunately for my friend, this is just too flexible for them. They're going with RAID and that's that. They don't want to have a handful of random commodity whiteboxes populated with ordinary hard drives. They want to have a big expensive RAID controller and disks mounted in special enclosures instead.

While working on this post, I heard some news: the new machine had arrived a day or two before, and its RAID controller had already failed. No, I am not making this up. Something happened which rendered it unusable, and he had to contact their vendor to get another one shipped out. That means he gets to go through the "import foreign config" thing when it arrives and hope it just figures things out from his existing array. I wish him luck.

I'd love to build a system like the one I described. Trouble is, I don't know anyone who needs that kind of performance and such flexibility when it comes to data recovery. If you are such a person or know one, please drop me a line and let me know.

It might just be the right move for your data.

