Examining btrfs, Linux’s perpetually half-finished filesystem
source link: https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/
Btrfs RAID arrays are a mess
So far, we've mostly just said what btrfs is and does. Now, we're going to start talking about what's missing and what it does just plain wrong.
btrfs-raid5 and btrfs-raid6
The btrfs-raid5 and btrfs-raid6 topologies are extremely unreliable. The btrfs wiki itself describes btrfs parity RAID as "mostly implemented," and it explicitly recommends "[btrfs] parity RAID [should] be used only for testing purposes."
Although btrfs wiki users have repeatedly struggled to soften those warnings—saying it should be fine for data, although not metadata, assuming you explicitly scrub after any power outage or other unclean shutdown—senior btrfs dev and maintainer Josef Bacik wrote a much stronger warning in btrfs-progs.
SUSE maintainer David Sterba merged Bacik's warning in March: "RAID5/6 support has known problems is strongly discouraged to be used besides testing or evaluation" (sic).
When a filesystem's own senior developers and maintainers tell you not to use a feature, please do not use that feature.
btrfs-raid1
On the surface, btrfs-raid1 is an exciting and novel topology. There are lots of hobbyists and junior admins out there with rag-tag collections of working but mismatched drives and dreams of big, redundant arrays. This isn't typically possible with conventional RAID technologies, which generally require matched drive sizes and/or even numbers of drives.
Although btrfs-raid1 fits that use case very well indeed—offering a way to assemble nearly any odd collection of bits and bobs into a redundant array—it's encouraging some pretty risky practices and reducing safety levels in non-obvious ways. First and foremost, just because a disk spins up doesn't mean it's in good shape or should be relied upon.
Moving beyond the question of individual disk reliability, btrfs-raid1 can only tolerate a single disk failure, no matter how large the total array is. The remaining copies of the blocks that were on a lost disk are distributed throughout the entire array—so losing any second disk loses you the array along with it. (This is in contrast to RAID10 arrays, which can survive any number of disk failures as long as no two are from the same mirror pair.)
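To put rough numbers on that difference, here is a quick back-of-the-envelope sketch (plain awk arithmetic, not btrfs tooling): with one disk already dead, btrfs-raid1 never survives a second failure, while RAID10 dies only if the second failure happens to hit the dead disk's mirror partner.

```shell
# Odds that an n-disk array survives a *second* random disk failure.
# btrfs-raid1: the lost disk's block copies are scattered across every
# remaining disk, so any second failure is fatal (survival = 0).
# RAID10: fatal only when the second failure is the mirror partner of
# the first, so survival = (n-2)/(n-1).
for n in 4 8 12; do
  awk -v n="$n" 'BEGIN {
    printf "n=%2d  btrfs-raid1: 0.00  raid10: %.2f\n", n, (n - 2) / (n - 1)
  }'
done
```

With eight disks, RAID10 still has roughly an 86 percent chance of riding out a second random failure; btrfs-raid1 has zero.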
In short, the very promise of btrfs-raid1 is an invitation to catastrophic data loss: the typical use case involves quite a few disks, often of uncertain provenance at best, in a topology that is unusually failure-susceptible.
btrfs-raid0
I don't have anything specific to say about btrfs-raid0, other than the fact that it's raid zero. Any failure of any disk loses all data on the array. This is not a storage system, it's a virtual woodchipper. Avoid.
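That claim is easy to quantify with a quick sketch. If each disk independently fails in a given year with probability f (the 5 percent figure below is an assumption for illustration, not a measured rate), an n-disk raid0 array loses everything with probability 1 - (1 - f)^n:

```shell
# Chance of total raid0 data loss in a year, assuming each disk
# independently fails with probability f = 0.05 (illustrative only):
awk 'BEGIN {
  f = 0.05
  for (n = 2; n <= 8; n += 2)
    printf "n=%d  P(array loss) = %.2f\n", n, 1 - (1 - f) ^ n
}'
```

Striping eight such disks gives better than a one-in-three chance of losing the whole array in a year, more than six times the risk of a single drive.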
btrfs-raid10
With btrfs-raid5/6 out of contention due to severe write hole problems, btrfs-raid1 dangerous to use with more than a couple of disks, and btrfs-raid0 offering zero redundancy, that leaves us with btrfs-raid10. This is basically the only sane topology available for use with more than two or three total drives managed by btrfs-native raid.
Btrfs RAID array management is a mess
OK, so you made it through the last section unscathed—you wanted btrfs-raid10 anyway! It's fine! Now let's talk about where management and maintenance of your new array falls down. The first issue that crops up is that of storage namespaces.
Storage namespaces
When you create a hardware RAID array, that array exists independent of the original disks, and it's presented to the system as a single "virtual" drive. It's very clear when you're managing the array versus when you're managing individual disks, because the array is a completely separate thing. It has its own internal management and its own logical devicename separate from the drives themselves—the drives themselves may not even expose individual devicenames to the operating system.
The same is true of Linux kernel RAID—mdraid assembles disks into a new logical device with its own devicename, e.g., /dev/md0. Although the individual disks retain their own devicenames—e.g., /dev/sda—there's little confusion between the disks and the array. Similarly, the Linux Logical Volume Manager (LVM) maintains a separate namespace for its virtual devices, with user-configurable names for volume groups and logical volumes that do not conflict with the hardware devices underneath them.
But btrfs, for some reason, never bothered with that. When you create and mount a btrfs RAID array, it looks something like this:
root@btrfs-test:~# mkfs.btrfs -draid1 -mraid1 /dev/vdc /dev/vdd
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: 19a35765-81d4-4f5b-9d7e-393577cf842f
Node size: 16384
Sector size: 4096
Filesystem size: 20.00GiB
Block group profiles:
Data: RAID1 1.00GiB
Metadata: RAID1 256.00MiB
System: RAID1 8.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Checksum: crc32c
Number of devices: 2
Devices:
ID SIZE PATH
1 10.00GiB /dev/vdc
2 10.00GiB /dev/vdd
root@btrfs-test:~# mkdir -p /btrfs-raid1
root@btrfs-test:~# mount /dev/vdc /btrfs-raid1
root@btrfs-test:~#
Yes, you read that correctly—you mount the array using the name of any given disk in the array. No, it doesn't matter which one:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# mount /dev/vdd /btrfs-raid1
root@btrfs-test:~# grep btrfs-raid1 /etc/fstab
/dev/disk/by-uuid/19a35765-81d4-4f5b-9d7e-393577cf842f /btrfs-raid1 btrfs defaults,noauto 0 0
Yes, this is as weird as it looks. You can use the UUID shown in the mkfs.btrfs output to at least give you a way to automount the array from /etc/fstab without specifying an individual disk in what's theoretically a redundant array, as shown above in the last line. But that's less useful than you might think.
btrfs-raid is redundant—but only grudgingly
As any storage administrator worth their salt will tell you, RAID is primarily about uptime. Although it may keep your data safe, that's not its real job—the job of RAID is to minimize the number of instances in which you have to take the system down for extended periods of time to restore from proper backup.
Once you understand that fact, the way btrfs-raid handles hardware failure looks downright nuts. What happens if we yank a disk from our btrfs-raid1 array above?
First, we'll unmount the array, then tell virt-manager to pull one of the two virtual disks from the test VM—which we'll verify by checking /dev for vdc and vdd, the component drives of our little btrfs-raid1. Then, we'll try to remount the array:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev/vd* | egrep 'vdc|vdd'
/dev/vdd
root@btrfs-test:~# mount /btrfs-raid1
mount: /btrfs-raid1: wrong fs type, bad option, bad superblock on /dev/vdd, missing codepage or helper program, or other error.
Even though our array is technically "redundant," it refuses to mount with /dev/vdc missing. Even worse, it throws an extremely misleading error that would lead an admin to believe that the remaining disk had problems—which it does not.
In order to get the array to mount, you need to pass it a special option:
root@btrfs-test:~# mount -o degraded /btrfs-raid1
root@btrfs-test:~# btrfs filesystem show /btrfs-raid1
Label: none uuid: 19a35765-81d4-4f5b-9d7e-393577cf842f
Total devices 2 FS bytes used 448.00KiB
devid 2 size 10.00GiB used 1.26GiB path /dev/vdd
*** Some devices missing
Now, you might be thinking "that doesn't seem so bad," but the need to pass the -o degraded option means that any automounting btrfs-raid arrays will refuse to mount at boot.
In the worst-case scenario—a root filesystem that is itself stored "redundantly" on btrfs-raid1 or btrfs-raid10—the entire system refuses to boot. When that happens, the admin must descend into Busybox hell to manually edit grub config lines to temporarily mount the array degraded.
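For reference, the usual one-time escape from that Busybox hell is to edit the boot entry from the GRUB menu and append the btrfs degraded flag to the kernel's rootflags. This is a sketch: the exact kernel line, paths, and version strings vary by distro, and the UUID shown is the one from our test array.

```
# Press 'e' on the default entry at the GRUB menu, then append
# rootflags=degraded to the line that starts with "linux".
# Kernel path and version here are illustrative:
linux /boot/vmlinuz-5.4.0-81-generic root=UUID=19a35765-81d4-4f5b-9d7e-393577cf842f ro rootflags=degraded
```

Press Ctrl-x to boot with the edited line. The change is not persistent, which in this one case is what you want, since permanently mounting degraded is exactly what the btrfs developers warn against.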
If you're thinking, "Well, the obvious step here is just to always mount degraded," the btrfs devs would like to have a word with you. When I first brought this up as a possible mitigation eight years ago, the consensus was "oh no, don't do that."
There are several fairly good reasons to require an admin to stumble over their own feet hard when a btrfs-raid array loses a disk—reasons that can very easily add up to permanent data loss.
btrfs-raid requires unusually careful maintenance
If you lose a drive from a conventional RAID array, or an mdraid array, or a ZFS zpool, that array keeps on trucking without needing any special flags to mount it. If you then add the failed drive back to the array, your RAID manager will similarly automatically begin "resilvering" or "rebuilding" the array in order to catch the temporarily missing drive up on any data it has missed out on.
That, unfortunately, is not the case with btrfs-native RAID. Let's examine:
root@btrfs-test:~# touch /btrfs-raid1/vdc-is-missing
root@btrfs-test:~# ls -l /btrfs-raid1
total 0
-rw-r--r-- 1 root root 0 Sep 17 20:40 vdc-is-missing
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev | egrep 'vdc|vdd'
vdc
vdd
root@btrfs-test:~# mount /btrfs-raid1
root@btrfs-test:~# ls /btrfs-raid1
vdc-is-missing
First, we created a file while /dev/vdc was not present in the array. Then, we unmounted the array (no errors). Then we added /dev/vdc back into the system—simulating, for example, a drive connected via a flaky SATA cable that occasionally drops off the system bus. Then, we remounted the array.
Notice that there was no error when remounting the array—btrfs wanted to find two disks belonging to the array, found two disks belonging to the array, and therefore cheerfully mounted it. Btrfs did not ask any questions or throw any warnings, despite the fact that the array had previously been mounted degraded, and data stored non-redundantly to it.
In a normal RAID array, automounting with the missing disk included would make sense—after all, the array would automatically and immediately begin rebuilding/resilvering the missing data onto the newly reconnected disk. But that was not the case with btrfs nine years ago, and it's still not the case with btrfs today.
We'll even manually trigger a scrub—a procedure that storage admins generally understand to look for and automatically repair any data issues in an array—before proceeding:
root@btrfs-test:~# btrfs scrub start /btrfs-raid1
scrub started on /btrfs-raid1, fsid 19a35765-81d4-4f5b-9d7e-393577cf842f (pid=2037)
WARNING: errors detected during scrubbing, corrected
root@btrfs-test:~# btrfs scrub status /btrfs-raid1
UUID: 19a35765-81d4-4f5b-9d7e-393577cf842f
Scrub started: Fri Sep 17 20:45:03 2021
Status: finished
Duration: 0:00:00
Total to scrub: 816.00KiB
Rate: 0.00B/s
Error summary: super=2 csum=5
Corrected: 5
Uncorrectable: 0
Unverified: 0
Btrfs found some errors! So we should be good to go now, and our data should be stored redundantly on both drives—even though some of it was originally stored only on /dev/vdd while /dev/vdc was on temporary holiday. Let's check, this time removing /dev/vdd and then examining the array:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# mount -o degraded /btrfs-raid1
mount: /btrfs-raid1: wrong fs type, bad option, bad superblock on /dev/vdc, missing codepage or helper program, or other error.
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
Well how about that... even though we manually initiated a scrub and let it finish, our array is still inconsistent and even outright non-mountable, because it ran for a little while without a disk and then that disk was re-added.
If we'd held our mouths just right, we could have worked our way around this. Let's examine what we should have done:
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
vdd
root@btrfs-test:~# mount /btrfs-raid1
root@btrfs-test:~# btrfs balance /btrfs-raid1
Done, had to relocate 6 out of 6 chunks
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
root@btrfs-test:~# mount -o degraded /btrfs-raid1
root@btrfs-test:~# ls /btrfs-raid1
vdc-is-missing
The command that we were supposed to run was btrfs balance—with both drives connected and a btrfs balance run, it does correct the missing blocks, and we can now mount degraded on only the other disk, /dev/vdc. But this was a very tortuous path, with a lot of potential for missteps and zero discoverability.
Again, by contrast, either mdraid or ZFS will cheerfully boot and/or mount degraded without hassle as necessary, and then it will automatically add a vagabond disk back into the array when it shows back up and scan and correct it to account for any data stored while it was missing.
Btrfs' refusal to mount degraded, automatic mounting of stale disks, and lack of automatic stale disk repair/recovery do not add up to a sane way to manage a "redundant" storage system.
Conclusions
Believe it or not, we've still only scratched the surface of btrfs problems. Similar problems and papercuts lurk in the way it manages snapshots, replication, compression, and more. Once we get through that, there's performance to talk about—which in many cases can be orders of magnitude slower than either ZFS or mdraid in reasonable, common real-world conditions and configurations.
We'll return to this analysis in the near future. In the meantime, if you're going to run btrfs in any configuration in which it manages multiple disks—as opposed to, e.g., Synology and Netgear NAS devices, which crucially layer btrfs on top of traditional systems like LVM to avoid these pitfalls—please do so very carefully.