Examining btrfs, Linux’s perpetually half-finished filesystem
source link: https://arstechnica.com/gadgets/2021/09/examining-btrfs-linuxs-perpetually-half-finished-filesystem/
Btrfs RAID arrays are a mess
So far, we've mostly just said what btrfs is and does. Now, we're going to start talking about what's missing and what it does just plain wrong.
btrfs-raid5 and btrfs-raid6
The btrfs-raid5 and btrfs-raid6 topologies are extremely unreliable. The btrfs wiki itself describes btrfs parity RAID as "mostly implemented," and it explicitly recommends "[btrfs] parity RAID [should] be used only for testing purposes."
Although btrfs wiki users have repeatedly struggled to soften those warnings—saying it should be fine for data, although not metadata, assuming you explicitly scrub after any power outage or other unclean shutdown—senior btrfs dev and maintainer Josef Bacik wrote a much stronger warning in btrfs-progs.
SUSE maintainer David Sterba merged Bacik's warning in March: "RAID5/6 support has known problems is strongly discouraged to be used besides testing or evaluation" (sic).
When a filesystem's own senior developers and maintainers tell you not to use a feature, please do not use that feature.
btrfs-raid1
On the surface, btrfs-raid1 is an exciting and novel topology. There are lots of hobbyists and junior admins out there with rag-tag collections of working but mismatched drives and dreams of big, redundant arrays. This isn't typically possible with conventional RAID technologies, which generally require matched drive sizes and/or even numbers of drives.
Although btrfs-raid1 fits that use case very well indeed—offering a way to assemble nearly any odd collection of bits and bobs into a redundant array—it's encouraging some pretty risky practices and reducing safety levels in non-obvious ways. First and foremost, just because a disk spins up doesn't mean it's in good shape or should be relied upon.
Moving beyond the question of individual disk reliability, btrfs-raid1 can only tolerate a single disk failure, no matter how large the total array is. The remaining copies of the blocks that were on a lost disk are distributed throughout the entire array—so losing any second disk loses you the array along with it. (This is in contrast to RAID10 arrays, which can survive any number of disk failures as long as no two are from the same mirror pair.)
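To put rough numbers on that difference, here is a quick back-of-the-envelope sketch (plain awk arithmetic, not btrfs tooling): with one disk already dead, btrfs-raid1 never survives a second failure, while RAID10 dies only if the second failure happens to hit the dead disk's mirror partner.

```shell
# Odds that an n-disk array survives a *second* random disk failure.
# btrfs-raid1: the lost disk's block copies are scattered across every
# remaining disk, so any second failure is fatal (survival = 0).
# RAID10: fatal only when the second failure is the mirror partner of
# the first, so survival = (n-2)/(n-1).
for n in 4 8 12; do
  awk -v n="$n" 'BEGIN {
    printf "n=%2d  btrfs-raid1: 0.00  raid10: %.2f\n", n, (n - 2) / (n - 1)
  }'
done
```

With eight disks, RAID10 still has roughly an 86 percent chance of riding out a second random failure; btrfs-raid1 has zero.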
In short, the very promise of btrfs-raid1 is an invitation to catastrophic data loss: the typical use case involves quite a few disks, often of uncertain provenance at best, in a topology that is unusually failure-susceptible.
btrfs-raid0
I don't have anything specific to say about btrfs-raid0, other than the fact that it's raid zero. Any failure of any disk loses all data on the array. This is not a storage system, it's a virtual woodchipper. Avoid.
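That claim is easy to quantify with a quick sketch. If each disk independently fails in a given year with probability f (the 5 percent figure below is an assumption for illustration, not a measured rate), an n-disk raid0 array loses everything with probability 1 - (1 - f)^n:

```shell
# Chance of total raid0 data loss in a year, assuming each disk
# independently fails with probability f = 0.05 (illustrative only):
awk 'BEGIN {
  f = 0.05
  for (n = 2; n <= 8; n += 2)
    printf "n=%d  P(array loss) = %.2f\n", n, 1 - (1 - f) ^ n
}'
```

Striping eight such disks gives better than a one-in-three chance of losing the whole array in a year, more than six times the risk of a single drive.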
btrfs-raid10
With btrfs-raid5/6 out of contention due to severe write hole problems, btrfs-raid1 dangerous to use with more than a couple of disks, and btrfs-raid0 offering zero redundancy, that leaves us with btrfs-raid10. This is basically the only sane topology available for use with more than two or three total drives managed by btrfs-native raid.
Btrfs RAID array management is a mess
OK, so you made it through the last section unscathed—you wanted btrfs-raid10 anyway! It's fine! Now let's talk about where management and maintenance of your new array falls down. The first issue that crops up is that of storage namespaces.
Storage namespaces
When you create a hardware RAID array, that array exists independent of the original disks, and it's presented to the system as a single "virtual" drive. It's very clear when you're managing the array versus when you're managing individual disks, because the array is a completely separate thing. It has its own internal management and its own logical devicename separate from the drives themselves—the drives themselves may not even expose individual devicenames to the operating system.
The same is true of Linux kernel RAID—mdraid assembles disks into a new logical device with its own devicename, e.g., /dev/md0. Although the individual disks retain their own devicenames—e.g., /dev/sda—there's little confusion between the disks and the array. Similarly, the Linux Logical Volume Manager (LVM) maintains a separate namespace for its virtual devices, with user-configurable names for volume groups and logical volumes that do not conflict with the hardware devices underneath them.
But btrfs, for some reason, never bothered with that. When you create and mount a btrfs RAID array, it looks something like this:
root@btrfs-test:~# mkfs.btrfs -draid1 -mraid1 /dev/vdc /dev/vdd
btrfs-progs v5.4.1
See http://btrfs.wiki.kernel.org for more information.
Label: (null)
UUID: 19a35765-81d4-4f5b-9d7e-393577cf842f
Node size: 16384
Sector size: 4096
Filesystem size: 20.00GiB
Block group profiles:
Data: RAID1 1.00GiB
Metadata: RAID1 256.00MiB
System: RAID1 8.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Checksum: crc32c
Number of devices: 2
Devices:
ID SIZE PATH
1 10.00GiB /dev/vdc
2 10.00GiB /dev/vdd
root@btrfs-test:~# mkdir -p /btrfs-raid1
root@btrfs-test:~# mount /dev/vdc /btrfs-raid1
root@btrfs-test:~#
Yes, you read that correctly—you mount the array using the name of any given disk in the array. No, it doesn't matter which one:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# mount /dev/vdd /btrfs-raid1
root@btrfs-test:~# grep btrfs-raid1 /etc/fstab
/dev/disk/by-uuid/19a35765-81d4-4f5b-9d7e-393577cf842f /btrfs-raid1 btrfs defaults,noauto 0 0
Yes, this is as weird as it looks. You can use the UUID shown in the mkfs.btrfs output to at least give you a way to automount the array from /etc/fstab without specifying an individual disk in what's theoretically a redundant array, as shown above in the last line. But that's less useful than you might think.
btrfs-raid is redundant—but only grudgingly
As any storage administrator worth their salt will tell you, RAID is primarily about uptime. Although it may keep your data safe, that's not its real job—the job of RAID is to minimize the number of instances in which you have to take the system down for extended periods of time to restore from proper backup.
Once you understand that fact, the way btrfs-raid handles hardware failure looks downright nuts. What happens if we yank a disk from our btrfs-raid1 array above?
First, we'll unmount the array, then tell virt-manager to pull one of the two virtual disks from the test VM—which we'll verify by checking /dev for vdc and vdd, the component drives of our little btrfs-raid1. Then, we'll try to remount the array:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev/vd* | egrep 'vdc|vdd'
/dev/vdd
root@btrfs-test:~# mount /btrfs-raid1
mount: /btrfs-raid1: wrong fs type, bad option, bad superblock on /dev/vdd, missing codepage or helper program, or other error.
Even though our array is technically "redundant," it refuses to mount with /dev/vdc missing. Even worse, it throws an extremely misleading error that would lead an admin to believe that the remaining disk had problems—which it does not.
In order to get the array to mount, you need to pass it a special option:
root@btrfs-test:~# mount -o degraded /btrfs-raid1
root@btrfs-test:~# btrfs filesystem show /btrfs-raid1
Label: none uuid: 19a35765-81d4-4f5b-9d7e-393577cf842f
Total devices 2 FS bytes used 448.00KiB
devid 2 size 10.00GiB used 1.26GiB path /dev/vdd
*** Some devices missing
Now, you might be thinking "that doesn't seem so bad," but the need to pass the -o degraded option means that any automounting btrfs-raid arrays will refuse to mount at boot.
In the worst-case scenario—a root filesystem that is itself stored "redundantly" on btrfs-raid1 or btrfs-raid10—the entire system refuses to boot. When that happens, the admin must descend into Busybox hell to manually edit grub config lines to temporarily mount the array degraded.
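For reference, the usual one-time escape from that Busybox hell is to edit the boot entry from the GRUB menu and append the btrfs degraded flag to the kernel's rootflags. This is a sketch: the exact kernel line, paths, and version strings vary by distro, and the UUID shown is the one from our test array.

```
# Press 'e' on the default entry at the GRUB menu, then append
# rootflags=degraded to the line that starts with "linux".
# Kernel path and version here are illustrative:
linux /boot/vmlinuz-5.4.0-81-generic root=UUID=19a35765-81d4-4f5b-9d7e-393577cf842f ro rootflags=degraded
```

Press Ctrl-x to boot with the edited line. The change is not persistent, which in this one case is what you want, since permanently mounting degraded is exactly what the btrfs developers warn against.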
If you're thinking, "Well, the obvious step here is just to always mount degraded," the btrfs devs would like to have a word with you. When I first brought this up as a possible mitigation eight years ago, the consensus was "oh no, don't do that."
There are several fairly good reasons to require an admin to stumble over their own feet hard when a btrfs-raid array loses a disk—reasons that can very easily add up to permanent data loss.
btrfs-raid requires unusually careful maintenance
If you lose a drive from a conventional RAID array, or an mdraid array, or a ZFS zpool, that array keeps on trucking without needing any special flags to mount it. If you then add the failed drive back to the array, your RAID manager will similarly automatically begin "resilvering" or "rebuilding" the array in order to catch the temporarily missing drive up on any data it has missed out on.
That, unfortunately, is not the case with btrfs-native RAID. Let's examine:
root@btrfs-test:~# touch /btrfs-raid1/vdc-is-missing
root@btrfs-test:~# ls -l /btrfs-raid1
total 0
-rw-r--r-- 1 root root 0 Sep 17 20:40 vdc-is-missing
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev | egrep 'vdc|vdd'
vdc
vdd
root@btrfs-test:~# mount /btrfs-raid1
root@btrfs-test:~# ls /btrfs-raid1
vdc-is-missing
First, we created a file while /dev/vdc was not present in the array. Then, we unmounted the array (no errors). Then we added /dev/vdc back into the system—simulating, for example, a drive connected via a flaky SATA cable that occasionally drops off the system bus. Then, we remounted the array.
Notice that there was no error when remounting the array—btrfs wanted to find two disks belonging to the array, found two disks belonging to the array, and therefore cheerfully mounted it. Btrfs did not ask any questions or throw any warnings, despite the fact that the array had previously been mounted degraded, and data stored non-redundantly to it.
In a normal RAID array, automounting with the missing disk included would make sense—after all, the array would automatically and immediately begin rebuilding/resilvering the missing data onto the newly reconnected disk. But that was not the case with btrfs nine years ago, and it's still not the case with btrfs today.
We'll even manually trigger a scrub—a procedure that storage admins generally understand to look for and automatically repair any data issues in an array—before proceeding:
root@btrfs-test:~# btrfs scrub start /btrfs-raid1
scrub started on /btrfs-raid1, fsid 19a35765-81d4-4f5b-9d7e-393577cf842f (pid=2037)
WARNING: errors detected during scrubbing, corrected
root@btrfs-test:~# btrfs scrub status /btrfs-raid1
UUID: 19a35765-81d4-4f5b-9d7e-393577cf842f
Scrub started: Fri Sep 17 20:45:03 2021
Status: finished
Duration: 0:00:00
Total to scrub: 816.00KiB
Rate: 0.00B/s
Error summary: super=2 csum=5
Corrected: 5
Uncorrectable: 0
Unverified: 0
Btrfs found some errors! So we should be good to go now, and our data should be stored redundantly on both drives—even though some of it was originally stored only on /dev/vdd while /dev/vdc was on temporary holiday. Let's check, this time removing /dev/vdd and then examining the array:
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# mount -o degraded /btrfs-raid1
mount: /btrfs-raid1: wrong fs type, bad option, bad superblock on /dev/vdc, missing codepage or helper program, or other error.
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
Well how about that... even though we manually initiated a scrub and let it finish, our array is still inconsistent and even outright non-mountable, because it ran for a little while without a disk and then that disk was re-added.
If we'd held our mouths just right, we could have worked our way around this. Let's examine what we should have done:
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
vdd
root@btrfs-test:~# mount /btrfs-raid1
root@btrfs-test:~# btrfs balance /btrfs-raid1
Done, had to relocate 6 out of 6 chunks
root@btrfs-test:~# umount /btrfs-raid1
root@btrfs-test:~# ls /dev/ | egrep 'vdc|vdd'
vdc
root@btrfs-test:~# mount -o degraded /btrfs-raid1
root@btrfs-test:~# ls /btrfs-raid1
vdc-is-missing
The command that we were supposed to run was btrfs balance—with both drives connected and a btrfs balance run, it does correct the missing blocks, and we can now mount degraded on only the other disk, /dev/vdc. But this was a very tortuous path, with a lot of potential for missteps and zero discoverability.
Again, by contrast, either mdraid or ZFS will cheerfully boot and/or mount degraded without hassle as necessary, and then it will automatically add a vagabond disk back into the array when it shows back up and scan and correct it to account for any data stored while it was missing.
Btrfs' refusal to mount degraded, automatic mounting of stale disks, and lack of automatic stale disk repair/recovery do not add up to a sane way to manage a "redundant" storage system.
Conclusions
Believe it or not, we've still only scratched the surface of btrfs problems. Similar problems and papercuts lurk in the way it manages snapshots, replication, compression, and more. Once we get through that, there's performance to talk about—which in many cases can be orders of magnitude slower than either ZFS or mdraid in reasonable, common real-world conditions and configurations.
We'll return to this analysis in the near future. In the meantime, if you're going to run btrfs in any configuration in which it manages multiple disks—as opposed to, e.g., Synology and Netgear NAS devices, which crucially layer btrfs on top of traditional systems like LVM to avoid these pitfalls—please do so very carefully.