A kernel without buffer heads

Welcome to LWN.net

No data structures found in the Linux kernel — at least, in any version that escaped from Linus Torvalds's development machine — are older than the buffer head. Like many other legacies from the early days of Linux, buffer heads have been targeted for removal for years. They persist, though, despite the problems they present. Now, Christoph Hellwig has posted a patch series that enables the building of a kernel without buffer heads — but the cost of doing so at this point will be more than most want to pay.

The first public release of the Linux kernel was version 0.01, and struct buffer_head was a part of it:

    struct buffer_head {
        char * b_data;                      /* pointer to data block (1024 bytes) */
        unsigned short b_dev;               /* device (0 = free) */
        unsigned short b_blocknr;           /* block number */
        unsigned char b_uptodate;
        unsigned char b_dirt;               /* 0-clean,1-dirty */
        unsigned char b_count;              /* users using this block */
        unsigned char b_lock;               /* 0 - ok, 1 -locked */
        struct task_struct * b_wait;
        struct buffer_head * b_prev;
        struct buffer_head * b_next;
        struct buffer_head * b_prev_free;
        struct buffer_head * b_next_free;
    };

While the best disk drives available decades ago were nominally "fast", accessing data on disk was still slower, by several orders of magnitude, than accessing data in main memory. So the importance of caching file data was well understood long before Linux was born. The approach that was generally in use at that time was to cache disk blocks, with filesystem code operating on data in that cache; Torvalds followed that model with Linux. Thus, from the beginning, the Linux kernel included a "buffer cache" that held copies of blocks found on the system's disks.

The buffer_head structure was the key to managing the buffer cache. The combination of the b_dev and b_blocknr fields uniquely identified which block a given buffer cache entry referred to, while b_data pointed to the cached data itself. The other fields tracked whether the block needed to be written back to disk, how many users it had, and more. It was a core part of the kernel's block I/O subsystem — and of its memory management code as well.
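The way a (device, block number) pair can identify a cache entry may be easier to see in code. What follows is a minimal user-space sketch, not the kernel's implementation: the `toy_` names, the cut-down structure, the table size, and the hash function are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define NR_HASH 307  /* table size; an arbitrary prime chosen for this sketch */

/* Cut-down stand-in for the 0.01 structure: only the fields used here. */
struct toy_buffer_head {
    char *b_data;                    /* pointer to the cached block */
    unsigned short b_dev;            /* device (0 = free) */
    unsigned short b_blocknr;        /* block number */
    struct toy_buffer_head *b_next;  /* next entry in the same hash chain */
};

struct toy_buffer_head *toy_hash_table[NR_HASH];

/* Combine the two identifying fields into a bucket index. */
unsigned int toy_hash(unsigned short dev, unsigned short block)
{
    return ((unsigned int)(dev ^ block)) % NR_HASH;
}

/* Add a buffer to the cache; b_dev and b_blocknr must already be set. */
void toy_insert(struct toy_buffer_head *bh)
{
    unsigned int i;

    assert(bh != NULL && bh->b_dev != 0);
    i = toy_hash(bh->b_dev, bh->b_blocknr);
    bh->b_next = toy_hash_table[i];
    toy_hash_table[i] = bh;
}

/* Return the cached buffer for (dev, block), or NULL on a cache miss. */
struct toy_buffer_head *toy_find(unsigned short dev, unsigned short block)
{
    struct toy_buffer_head *bh;

    for (bh = toy_hash_table[toy_hash(dev, block)]; bh != NULL; bh = bh->b_next)
        if (bh->b_dev == dev && bh->b_blocknr == block)
            return bh;
    return NULL;
}
```

A caller would fill in `b_dev` and `b_blocknr`, pass the entry to `toy_insert()`, and later use `toy_find()` with the same pair to get the cached block back; a `NULL` return would mean the block has to be read from disk.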

Over time, it became clear that file caching could be done better if it were implemented as a cache of file data, rather than of disk blocks. During the 1.3 development cycle, Torvalds began implementing a new feature known as the "page cache", which would manage pages of data from files, rather than disk blocks. A number of advantages came from that change; many operations on file data could avoid calling into the filesystem code entirely if that data could be found in the cache, for example. Caching data at a higher level better matched how that data was used, and the ability to cache full pages (generally eight times larger than the 512-byte block size typically found at that time) improved efficiency.

The only problem was that the buffer cache was deeply wired into both the block subsystem and the filesystem implementations, so this cache continued to exist, alongside the page cache, for several more years until the two were unified. Even then, the buffer cache was at the core of the API used for block I/O. This was not optimal: filesystems worked hard to store data contiguously on disk, and the page cache could keep that data together in memory with at least page granularity, but the buffer-head interface required every I/O operation to be broken down into 512-byte blocks — each with its own buffer_head structure. That was a lot of overhead, much of which just added work for storage drivers, which had to try to reassemble larger chunks for reasonable I/O performance.
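The cost of one descriptor per 512-byte block can be put into numbers. The sketch below is only arithmetic under stated assumptions: a 4096-byte page (eight 512-byte blocks, as mentioned earlier), a request that starts on a page boundary, and hypothetical `toy_` helper names.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BLOCK_SIZE 512u   /* block size cited in the text */
#define TOY_PAGE_SIZE  4096u  /* assumed page size: eight 512-byte blocks */

/* Number of fixed-size units needed to cover len bytes, rounding up. */
size_t toy_units(size_t len, size_t unit)
{
    assert(unit != 0);
    return (len + unit - 1) / unit;
}

/* One descriptor per block: how many a request of len bytes would need. */
size_t toy_buffer_heads_needed(size_t len)
{
    return toy_units(len, TOY_BLOCK_SIZE);
}

/* Pages spanned by the same request, assuming a page-aligned start. */
size_t toy_pages_needed(size_t len)
{
    return toy_units(len, TOY_PAGE_SIZE);
}
```

Under these assumptions, a contiguous 1 MiB request works out to 2048 per-block descriptors but only 256 pages, which gives a sense of how much splitting and reassembly the per-block interface implied.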

The 2.5 development series (the last of the odd-numbered development kernels under the older model) addressed this problem by reworking the block layer around a new data structure called the "bio" that could represent block I/O requests more efficiently. Over the years, the bio has evolved considerably as the need to support ever-higher I/O rates has grown, but it remains the way that block I/O requests are assembled and managed.

Meanwhile, though, struct buffer_head can still be found in current kernels. And, more to the point, a number of filesystems still use it. The role that buffer heads once played in cache management has long since ended, but they still handle an important task in parts of the kernel: tracking the mapping between data cached in memory and the location on persistent storage where that data lives. The kernel has a rather more modern interface (iomap) for this purpose, but not all subsystems are using it.

One of the holdouts is ext4, which still makes heavy use of buffer heads. This filesystem, of course, is derived from ext2, which first entered the kernel with the 0.99.7 release in early 1993. Ext2 was based on block pointers; each file would have a list associated with it containing the numbers of the blocks on disk holding that file's data. Such a layout, where each block on disk is a separate entity (even if the filesystem tries to keep them together), fits the buffer-head model reasonably well. So it is not surprising that buffer heads were embedded deeply within ext2, and are still there 30 years later in ext4, even though ext4 gained support for extents (a rather more efficient representation of large files) in 2006.
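The difference between block pointers and extents can be sketched in a few lines. This is a simplified illustration, not the ext2 or ext4 on-disk format: the flat block list, the `toy_extent` structure, and both lookup functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Block-pointer style: one table entry per file block. */
uint64_t toy_map_by_pointer(const uint64_t *block_list, size_t nblocks,
                            size_t file_block)
{
    assert(file_block < nblocks);
    return block_list[file_block];
}

/* Extent style: one entry covers a whole contiguous run of blocks. */
struct toy_extent {
    uint64_t file_start;  /* first file block covered by the extent */
    uint64_t disk_start;  /* disk block where that run begins */
    uint64_t length;      /* number of contiguous blocks in the run */
};

/* Return the disk block for file_block, or UINT64_MAX if it is unmapped. */
uint64_t toy_map_by_extent(const struct toy_extent *ext, size_t count,
                           uint64_t file_block)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (file_block >= ext[i].file_start &&
            file_block - ext[i].file_start < ext[i].length)
            return ext[i].disk_start + (file_block - ext[i].file_start);
    }
    return UINT64_MAX;
}
```

In this model, a file stored in four consecutive disk blocks needs four entries in the block list but a single extent, and the gap widens as contiguous files grow. The per-block list is the shape that lines up naturally with one buffer head per block.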

Buffer heads, clearly, still work, but they still add overhead to file I/O. They also present an obstacle to changes that developers want to make to the memory-management and filesystem layers, including the ongoing folio work. So the desire to get rid of buffer heads, which has been present for a long time, seems to be getting stronger.

But, as Hellwig's patch series shows, ext4 is not the only place where buffer heads persist. That series, after a bit of refactoring, adds a new BUFFER_HEAD configuration option that controls the compilation of buffer-head support. Any code that needs buffer heads will select that option; if a kernel is built without any code needing buffer heads, then the resulting kernel will not have that support. Such a kernel will lack a few important features, though: not only the ext4 filesystem, but also F2FS, FAT, GFS2, HFS, ISO9660 (CDROM), JFS, NTFS, NTFS3, and the device-mapper layer. On the other hand, it is possible to build a buffer-head-free kernel that supports Btrfs and XFS.
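The idea of an option that is switched on by whichever code needs it can be mimicked with the C preprocessor. The following is a loose user-space analogy, not the kernel's Kconfig machinery or the code from the patch series; every `TOY_` macro and `toy_` function is hypothetical.

```c
#include <assert.h>

/*
 * Hypothetical feature switches. Uncomment one to model building a
 * filesystem that still depends on buffer heads.
 */
/* #define TOY_FS_EXT4 1 */
/* #define TOY_FS_FAT  1 */

/* Rough analogue of "select": any user that needs the support turns it on. */
#if defined(TOY_FS_EXT4) || defined(TOY_FS_FAT)
#define TOY_BUFFER_HEAD 1
#endif

#ifdef TOY_BUFFER_HEAD
/* Support compiled in: this is where a real implementation would live. */
int toy_buffer_head_support(void) { return 1; }
#else
/* Support compiled out: a stub keeps callers building unchanged. */
int toy_buffer_head_support(void) { return 0; }
#endif

/* A caller that cannot work without buffer heads would check first. */
void toy_mount_legacy_fs(void)
{
    assert(toy_buffer_head_support());
}
```

With both switches left commented out, as above, `toy_buffer_head_support()` returns 0 and nothing behind `TOY_BUFFER_HEAD` is compiled; enabling either hypothetical filesystem pulls the support back in without anyone asking for it directly.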

It seems unlikely that there will be many kernels built without buffer-head support in the near future. This work does, however, make it easier to see where the remaining users are, which should help to focus work toward getting rid of buffer heads for real. That job is still likely to take some time — one does not perform major surgery on a heavily used filesystem in a hurry — and it may accelerate the removal of some old and unloved filesystems (like JFS). One of these years, though, it will become possible to drop this core kernel data structure that has been there since the beginning.
