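A New Era for Linux Asynchronous I/O: io_uring

Linux 5.1 merged a new asynchronous I/O framework and implementation, io_uring, developed by block I/O expert Jens Axboe. This is very welcome news for asynchronous I/O on Linux: the era of Linux native aio is drawing to a close, and the era of io_uring is about to begin.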

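Starting with Linux I/O

The earliest Linux I/O system calls are read(2) and write(2). They were later extended with an offset argument as pread(2) and pwrite(2), then with vector-based variants preadv(2) and pwritev(2), and later still as preadv2(2) and pwritev2(2). Despite the variety, they share one trait: they are synchronous, meaning the system call returns only after the data has been read or written. Asynchronous I/O interfaces emerged to meet the needs of certain applications. The POSIX interfaces are aio_read(3) and aio_write(3), but their implementation is unremarkable and performs poorly. The Linux native asynchronous I/O interface, usually just called aio, also has a number of limitations: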

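The biggest limitation is that only direct I/O is truly asynchronous. O_DIRECT bypasses the page cache and imposes size and alignment constraints, which rules it out for many use cases. For buffered I/O, the interface behaves synchronously.
Even when every constraint for asynchronous I/O is met, a submission can still block, for example while waiting for metadata I/O or when all request slots of the storage device are in use.
There is extra overhead: each I/O submission copies 64 + 8 bytes and each completion copies 32 bytes, which is significant in some scenarios. Completion events must be consumed very carefully, or events are easily lost. Every I/O needs at least two system calls (submit + wait-for-completion), which hurts performance badly when the Spectre/Meltdown mitigations are enabled.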
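To make the O_DIRECT constraint in the first item concrete, here is a minimal sketch of the checks an application usually performs before issuing direct I/O. It assumes a 512-byte logical block size (the real value depends on the device and filesystem), and the helper names `dio_aligned` and `dio_alloc` are hypothetical.

```c
#define _POSIX_C_SOURCE 200112L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed logical block size; query the device for the real value. */
#define DIO_BLOCK_SIZE 512u

/* Hypothetical helper: true if the buffer address, the length, and the
 * file offset are all multiples of the block size. */
static bool dio_aligned(const void *buf, size_t len, uint64_t offset)
{
    return ((uintptr_t)buf % DIO_BLOCK_SIZE) == 0 &&
           (len % DIO_BLOCK_SIZE) == 0 &&
           (offset % DIO_BLOCK_SIZE) == 0;
}

/* Hypothetical helper: allocate a block-aligned buffer for direct I/O.
 * Returns NULL on failure; release the buffer with free(). */
static void *dio_alloc(size_t len)
{
    void *buf = NULL;

    assert(len % DIO_BLOCK_SIZE == 0);
    if (posix_memalign(&buf, DIO_BLOCK_SIZE, len) != 0)
        return NULL;
    return buf;
}
```

A buffer from plain malloc() or an arbitrary byte offset will often fail this check, which is one reason retrofitting O_DIRECT into an existing application is rarely trivial.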

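Extending and improving an existing interface usually has more advantages than developing a new one. Over the past several years, however, many attempts to fix the first limitation above came to nothing. The arrival of fast storage devices in recent years has exposed the shortcomings of the existing asynchronous interface even further.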

io_uring

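The design requirements for the new interface are listed below. Some of them appear to conflict: an efficient, scalable interface tends to be hard to use, or at least hard to use correctly, and being feature rich and efficient at the same time is difficult.

Easy to use;
Extensible, for example to accommodate future networking and non-block storage I/O;
Feature rich, serving all kinds of applications;
Efficient, especially for the 512B or 4K I/O that dominates most workloads;
Scalable.

io_uring is designed around efficiency first. To avoid memory copies when submitting requests and reaping completions, io_uring uses a pair of shared ring buffers for communication between the application and the kernel. For the submission queue (SQ), the application is the producer of I/O submissions and the kernel is the consumer; conversely, for the completion queue (CQ), the kernel is the producer of completion events and the application is the consumer.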
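The producer/consumer relationship can be sketched with a small single-producer, single-consumer ring. This is only an illustration of the idea, not the kernel's actual memory layout: the `spsc_ring` type, the function names, and the fixed size of 8 entries are all assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                 /* must be a power of two */
#define RING_MASK    (RING_ENTRIES - 1)

static_assert((RING_ENTRIES & RING_MASK) == 0, "size must be a power of two");

/* Illustrative ring: one side only writes tail, the other only writes head. */
struct spsc_ring {
    _Atomic uint32_t head;              /* advanced by the consumer */
    _Atomic uint32_t tail;              /* advanced by the producer */
    uint64_t entries[RING_ENTRIES];
};

/* Producer side: returns false if the ring is full. */
static bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;
    r->entries[tail & RING_MASK] = value;
    /* Release: the entry must be visible before the new tail is. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
static bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *value = r->entries[head & RING_MASK];
    /* Release: the slot may be reused only after it has been read. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side needs a lock; the acquire/release pairs are the memory barriers that a helper library would otherwise hide from the application.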

Data structures

/*
 * IO submission data structure (Submission Queue Entry)
 */
struct io_uring_sqe {
    __u8    opcode;         /* type of operation for this sqe */
    __u8    flags;          /* IOSQE_ flags */
    __u16   ioprio;         /* ioprio for the request */
    __s32   fd;             /* file descriptor to do IO on */
    __u64   off;            /* offset into file */
    __u64   addr;           /* pointer to buffer or iovecs */
    __u32   len;            /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;
    };
    __u64   user_data;      /* data to be passed back at completion time */
    union {
        __u16   buf_index;      /* index into fixed buffers, if used */
        __u64   __pad2[3];
    };
};

/*
 * IO completion data structure (Completion Queue Entry)
 */
struct io_uring_cqe {
    __u64   user_data;      /* sqe->data submission passed back */
    __s32   res;            /* result code for this event */
    __u32   flags;
};
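As a rough illustration of how these two structures are used, the sketch below mirrors them with <stdint.h> types so that it compiles without kernel headers, checks their sizes, and fills in a submission entry for a vectored read. The mirror types, the `fill_readv` helper, and the `OP_READV` macro are names invented for this example; real code should include <linux/io_uring.h> or use liburing.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Local mirrors of the structures above, for illustration only. */
struct sqe_mirror {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t ioprio;
    int32_t  fd;
    uint64_t off;
    uint64_t addr;
    uint32_t len;
    union {
        int32_t  rw_flags;          /* __kernel_rwf_t is an int */
        uint32_t fsync_flags;
        uint16_t poll_events;
    };
    uint64_t user_data;
    union {
        uint16_t buf_index;
        uint64_t pad2[3];
    };
};

struct cqe_mirror {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

static_assert(sizeof(struct sqe_mirror) == 64, "an sqe occupies 64 bytes");
static_assert(sizeof(struct cqe_mirror) == 16, "a cqe occupies 16 bytes");

#define OP_READV 1  /* numeric value of IORING_OP_READV in the kernel header */

/* Hypothetical helper: describe a vectored read. The tag stored in
 * user_data comes back unchanged in the matching completion entry. */
static void fill_readv(struct sqe_mirror *sqe, int fd,
                       const struct iovec *iov, unsigned nr,
                       uint64_t offset, uint64_t tag)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = OP_READV;
    sqe->fd = fd;
    sqe->off = offset;
    sqe->addr = (uint64_t)(uintptr_t)iov;
    sqe->len = nr;
    sqe->user_data = tag;
}
```

Since completions can arrive in any order, user_data is the only link between a cqe and the request that produced it; applications commonly store a pointer to their own per-request state there.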

System calls

/*
 * setup a context for performing asynchronous I/O
 *
 * The io_uring_setup() system call sets up a submission queue (SQ) and
 * completion queue (CQ) with at least entries entries, and returns a file
 * descriptor which can be used to perform subsequent operations on the
 * io_uring instance. The submission and completion queues are shared between
 * userspace and the kernel, which eliminates the need to copy data when
 * initiating and completing I/O.
 */
int io_uring_setup(u32 entries, struct io_uring_params *p);

/*
 * initiate and/or complete asynchronous I/O
 *
 * io_uring_enter() is used to initiate and complete I/O using the shared
 * submission and completion queues setup by a call to io_uring_setup(2).
 * A single call can both submit new I/O and wait for completions of I/O
 * initiated by this call or previous calls to io_uring_enter().
 */
int io_uring_enter(unsigned int fd, unsigned int to_submit,
                unsigned int min_complete, unsigned int flags,
                sigset_t *sig);

/*
 * register files or user buffers for asynchronous I/O
 *
 * The io_uring_register() system call registers user buffers or files for
 * use in an io_uring(7) instance referenced by fd. Registering files or user
 * buffers allows the kernel to take long term references to internal data
 * structures or create long term mappings of application memory, greatly
 * reducing per-I/O overhead.
 */
int io_uring_register(unsigned int fd, unsigned int opcode,
                void *arg, unsigned int nr_args);
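The following sketch exercises all three system calls once through syscall(2). It assumes kernel headers that ship <linux/io_uring.h> (5.1 or later); the `sys_` wrapper names and `io_uring_smoke_test` are hypothetical, and the whole thing is a probe rather than a usable I/O path, because it never maps the rings.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <signal.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/* Thin wrappers around the raw system calls. */
static int sys_io_uring_setup(unsigned entries, struct io_uring_params *p)
{
    return (int)syscall(__NR_io_uring_setup, entries, p);
}

static int sys_io_uring_enter(int fd, unsigned to_submit,
                              unsigned min_complete, unsigned flags)
{
    /* No signal mask is passed; the last argument is the sigset size. */
    return (int)syscall(__NR_io_uring_enter, fd, to_submit, min_complete,
                        flags, NULL, (size_t)(_NSIG / 8));
}

static int sys_io_uring_register(int fd, unsigned opcode, void *arg,
                                 unsigned nr_args)
{
    return (int)syscall(__NR_io_uring_register, fd, opcode, arg, nr_args);
}

/* Create a ring, register one buffer, make an empty enter call, clean up.
 * Returns 0 on success or a negative errno value. */
static int io_uring_smoke_test(void)
{
    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof(buf) };
    struct io_uring_params p;
    int fd, ret = 0;

    memset(&p, 0, sizeof(p));
    fd = sys_io_uring_setup(4, &p);     /* the kernel fills in p on success */
    if (fd < 0)
        return -errno;                  /* e.g. ENOSYS on kernels before 5.1 */
    assert(p.sq_entries >= 4);

    if (sys_io_uring_register(fd, IORING_REGISTER_BUFFERS, &iov, 1) < 0)
        ret = -errno;
    else if (sys_io_uring_enter(fd, 0, 0, 0) < 0)  /* nothing to submit or reap */
        ret = -errno;

    close(fd);                          /* also drops the buffer registration */
    return ret;
}
```

A negative return value does not necessarily indicate a bug: sandboxes and container runtimes frequently block these system calls, and old kernels do not have them at all.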

liburing

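To make io_uring easier to use, Jens Axboe also developed the liburing library and added ioengine=io_uring support to fio. With liburing, an application can get started without knowing many of io_uring's details, such as memory barriers or ring buffer management. A simple example: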

/* setup io_uring and do mmap */
io_uring_queue_init(ENTRIES, &ring, 0);

/* get an sqe and fill in a READV operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iovec, 1, offset);

/* tell the kernel we have an sqe ready for consumption */
io_uring_submit(&ring);

/* wait for the sqe to complete */
io_uring_wait_cqe(&ring, &cqe);

/* read and process cqe event */
io_uring_cqe_seen(&ring, cqe);

/* tear down */
io_uring_queue_exit(&ring);

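Advanced io_uring features

Fixed Files and Buffers

IORING_REGISTER_FILES/IORING_UNREGISTER_FILES: registers a set of files ahead of time through the io_uring_register() system call, reducing the fget()/fput() overhead otherwise paid on every I/O operation.

IORING_REGISTER_BUFFERS/IORING_UNREGISTER_BUFFERS: registers a fixed set of I/O buffers through the io_uring_register() system call. When the application reuses these buffers, they are mapped and unmapped only once instead of on every I/O.

Polled IO

IORING_SETUP_IOPOLL: unlike the non-polling mode, which waits to be woken by a hardware interrupt, the kernel keeps polling the hardware to check whether I/O requests have completed. This is very useful for applications that need low latency and high IOPS.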
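Polling is selected when the ring is created, by setting a flag in the parameters passed to io_uring_setup(). A minimal sketch using the raw system call, assuming kernel headers that provide <linux/io_uring.h>; the helper name `setup_polled_ring` is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: create a ring whose completions are found by
 * polling. Returns the ring fd, or a negative errno value on failure. */
static int setup_polled_ring(unsigned entries)
{
    struct io_uring_params p;
    int fd;

    assert(entries > 0);
    memset(&p, 0, sizeof(p));
    p.flags = IORING_SETUP_IOPOLL;      /* poll for completions instead of
                                           waiting for interrupts */

    fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    return fd < 0 ? -errno : fd;
}
```

With liburing the same flag goes into the third argument of io_uring_queue_init(). Creating the ring is only half of the story: requests on a polled ring generally need a file opened with O_DIRECT on a device and filesystem that support polling, and are typically rejected otherwise.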

io_uring performance

Note: the data comes from io_uring patchset v5.

3D XPoint, 4k random read

Interface	QD	Polled		Latency		IOPS
--------------------------------------------------------------------------
io_uring	1	0		 9.5usec	 77K
io_uring	2	0		 8.2usec	183K
io_uring	4	0		 8.4usec	383K
io_uring	8	0		13.3usec	449K

libaio		1	0		 9.7usec	 74K
libaio		2	0		 8.5usec	181K
libaio		4	0		 8.5usec	373K
libaio		8	0		15.4usec	402K

io_uring	1	1		 6.1usec	139K
io_uring	2	1		 6.1usec	272K	
io_uring	4	1		 6.3usec	519K
io_uring	8	1		11.5usec	592K

spdk		1	1		 6.1usec	151K
spdk		2	1		 6.2usec	293K
spdk		4	1		 6.7usec	536K
spdk		8	1		12.6usec	586K

In non-polling mode, io_uring's gain over libaio is modest. In polling mode, io_uring comes close to spdk, even outperforms it at higher queue depths, and leaves libaio far behind.

Peak IOPS, 512B random read

Interface	QD	Polled		Latency		IOPS
--------------------------------------------------------------------------
io_uring	4	1		 6.8usec	 513K
io_uring	8	1		 8.7usec	 829K
io_uring	16	1		13.1usec	1019K
io_uring	32	1		20.6usec	1161K
io_uring	64	1		32.4usec	1244K

spdk		4	1		 6.8usec	 549K
spdk		8	1		 8.6usec	 865K
spdk		16	1		14.0usec	1105K
spdk		32	1		25.0usec	1227K
spdk		64	1		47.3usec	1251K

At low queue depths io_uring trails spdk by about 7%, but at high queue depths the two are essentially on par.
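The gap quoted above can be reproduced from the table as the relative IOPS shortfall of io_uring against spdk at the same queue depth. A small sketch of that arithmetic, with hypothetical helper names and the IOPS values (in thousands) copied from the rows for QD 4 and QD 64:

```c
#include <assert.h>
#include <stdio.h>

/* Relative shortfall of `ours` against `baseline`, in percent. */
static double gap_percent(double ours, double baseline)
{
    assert(baseline > 0.0);
    return (baseline - ours) / baseline * 100.0;
}

/* Print the io_uring vs. spdk gap at the lowest and highest queue depth. */
static void print_gaps(void)
{
    printf("QD 4:  %.1f%%\n", gap_percent(513.0, 549.0));
    printf("QD 64: %.1f%%\n", gap_percent(1244.0, 1251.0));
}
```

The shortfall works out to about 6.6% at QD 4 and about 0.6% at QD 64, which matches the rounded figure in the text.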

Peak per-core, multiple devices, 4k random read

Interface	QD	Polled		IOPS
--------------------------------------------------------------------------
io_uring	128	1		1620K
libaio		128	0		 608K
spdk		128	1		1739K

Note: according to reference 1, peak per-core performance has since reached about 1700K.

Reference

High-performance asynchronous I/O with io_uring

Ringing in a new asynchronous I/O API

Efficient IO with io_uring
