A New Era for Linux Asynchronous IO: io_uring

source link: https://www.tuicool.com/articles/3uuEnmr

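Linux 5.1 merged a new asynchronous IO framework and implementation: io_uring, developed by block IO maintainer Jens Axboe. This is very welcome news for asynchronous IO: the era of Linux native aio is drawing to a close, and the era of io_uring is about to begin.

Starting with Linux IO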

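The original Linux IO system calls go back to read(2) and write(2). They were later extended with an offset argument as pread(2) and pwrite(2), and with vector-based variants preadv(2) and pwritev(2), which were in turn extended to preadv2(2) and pwritev2(2). Varied as they look, they all share one trait: they are synchronous, meaning the system call returns only after the data has been read or written. To meet the needs of certain applications, asynchronous IO interfaces emerged. The POSIX interfaces are aio_read(3) and aio_write(3), but their implementation is unremarkable and performs poorly. The Linux native asynchronous IO interface, usually just called aio, has many limitations of its own:

  • The biggest limitation is that only direct IO (O_DIRECT) is truly asynchronous. O_DIRECT bypasses the page cache and imposes size and alignment constraints, which rules it out for many use cases. For buffered IO, aio behaves synchronously.

  • Even when all the constraints for asynchronous IO are met, submission can still block, for example while waiting for metadata IO or when all request slots of the storage device are in use.

  • There is extra overhead: each IO submission copies 64 + 8 bytes and each IO completion copies 32 bytes, which is noticeable in some scenarios. Completion events must be consumed very carefully, or events are easily lost. An IO always needs at least two system calls (submit + wait-for-completion), which hurts performance badly when the Spectre/Meltdown mitigations are enabled.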
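The copy sizes quoted in the last item can be checked against the native aio ABI. The sketch below assumes that the "64 + 8" figure is one struct iocb plus the pointer to it that io_submit(2) receives, and that the 32 bytes are one struct io_event. That reading is an interpretation, and the helper names are made up for illustration. It needs the Linux UAPI header <linux/aio_abi.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <linux/aio_abi.h>   /* struct iocb, struct io_event */

/* The kernel ABI fixes the control block at 64 bytes. */
static_assert(sizeof(struct iocb) == 64, "struct iocb should be 64 bytes");

/* Bytes copied per submitted IO: the iocb itself plus the pointer to it
 * (assumed reading of the "64 + 8" figure; 8 holds on 64-bit systems). */
size_t aio_submit_copy_bytes(void)
{
    return sizeof(struct iocb) + sizeof(struct iocb *);
}

/* Bytes copied per completed IO: one io_event (four 64-bit fields). */
size_t aio_complete_copy_bytes(void)
{
    return sizeof(struct io_event);
}
```

On a 64-bit system the two helpers return 72 and 32, which matches the figures above.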

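Extending and improving an existing interface usually has advantages over developing a new one. Yet over the past several years, many attempts to address the first limitation above came to nothing. The arrival of fast devices in recent years has made the limitations of the existing asynchronous interface all the more apparent.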

io_uring

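The design goals for the new interface are listed below. Some of them appear to conflict: an efficient and scalable interface tends to be hard to use, or at least hard to use correctly, and being feature-rich and efficient at the same time is difficult.

  • Easy to use;

  • Extensible, for example with future network and non-block-storage IO in mind;

  • Feature-rich, serving all applications;

  • Efficient, especially for the 512B or 4K IOs that dominate most workloads;

  • Scalable.

io_uring is designed around efficiency first. To avoid memory copies when submitting IO and reaping completions, io_uring uses a pair of ring buffers shared between the application and the kernel. For the submission queue (SQ), the application is the producer of IO submissions and the kernel is the consumer; conversely, for the completion queue (CQ), the kernel is the producer of completion events and the application is the consumer.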
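The producer/consumer split described above can be illustrated with a generic single-producer, single-consumer ring. This is a minimal userspace sketch using C11 atomics, not the kernel's actual ring layout. The names spsc_ring, ring_push and ring_pop and the 8-entry size are assumptions made for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_ENTRIES 8u                  /* hypothetical size, power of two */
#define RING_MASK    (RING_ENTRIES - 1u)

static_assert((RING_ENTRIES & RING_MASK) == 0, "size must be a power of two");

struct spsc_ring {
    _Atomic uint32_t head;               /* advanced only by the consumer */
    _Atomic uint32_t tail;               /* advanced only by the producer */
    uint64_t         slots[RING_ENTRIES];
};

/* Producer side: returns false when the ring is full. */
bool ring_push(struct spsc_ring *r, uint64_t value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == RING_ENTRIES)
        return false;
    r->slots[tail & RING_MASK] = value;
    /* Release: the slot contents become visible before the new tail. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false when the ring is empty. */
bool ring_pop(struct spsc_ring *r, uint64_t *value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head == tail)
        return false;
    *value = r->slots[head & RING_MASK];
    /* Release: the slot is handed back to the producer for reuse. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Because each index is written by only one side, no lock is needed. The acquire/release pairs are the kind of memory barriers that liburing handles on the application's behalf.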

Data structures

/*
 * IO submission data structure (Submission Queue Entry)
 */
struct io_uring_sqe {
    __u8    opcode;         /* type of operation for this sqe */
    __u8    flags;          /* IOSQE_ flags */
    __u16   ioprio;         /* ioprio for the request */
    __s32   fd;             /* file descriptor to do IO on */
    __u64   off;            /* offset into file */
    __u64   addr;           /* pointer to buffer or iovecs */
    __u32   len;            /* buffer size or number of iovecs */
    union {
        __kernel_rwf_t  rw_flags;
        __u32           fsync_flags;
        __u16           poll_events;
    };
    __u64   user_data;      /* data to be passed back at completion time */
    union {
        __u16   buf_index;      /* index into fixed buffers, if used */
        __u64   __pad2[3];
    };
};

/*
 * IO completion data structure (Completion Queue Entry)
 */
struct io_uring_cqe {
    __u64   user_data;      /* sqe->user_data from the submission, passed back */
    __s32   res;            /* result code for this event */
    __u32   flags;
};
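To make the field roles concrete, the sketch below mirrors the two structures with fixed-width types, checks their sizes, and fills an entry for a vectored read. The names demo_sqe, demo_cqe, demo_prep_readv and DEMO_OP_READV are local stand-ins for this example. Real code would use struct io_uring_sqe from <linux/io_uring.h> and liburing's io_uring_prep_readv().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

#define DEMO_OP_READV 1   /* mirrors the value of IORING_OP_READV */

/* Local mirror of the SQE layout shown above (illustration only). */
struct demo_sqe {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t ioprio;
    int32_t  fd;
    uint64_t off;
    uint64_t addr;
    uint32_t len;
    union {
        int32_t  rw_flags;
        uint32_t fsync_flags;
        uint16_t poll_events;
    };
    uint64_t user_data;
    union {
        uint16_t buf_index;
        uint64_t pad2[3];
    };
};

/* Local mirror of the CQE layout shown above. */
struct demo_cqe {
    uint64_t user_data;
    int32_t  res;
    uint32_t flags;
};

static_assert(sizeof(struct demo_sqe) == 64, "an SQE occupies 64 bytes");
static_assert(sizeof(struct demo_cqe) == 16, "a CQE occupies 16 bytes");

/* Fill an SQE for a vectored read of nr_vecs iovecs at file offset off.
 * user_data is an opaque tag that comes back in the matching CQE. */
void demo_prep_readv(struct demo_sqe *sqe, int fd, const struct iovec *iov,
                     unsigned nr_vecs, uint64_t off, uint64_t user_data)
{
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode    = DEMO_OP_READV;
    sqe->fd        = fd;
    sqe->off       = off;
    sqe->addr      = (uint64_t)(uintptr_t)iov;  /* pointer to the iovec array */
    sqe->len       = nr_vecs;                   /* number of iovecs, not bytes */
    sqe->user_data = user_data;
}
```

Since completions may arrive in any order, user_data is how an application matches a CQE to the request that produced it, typically by storing a pointer or an index there.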

System calls

/*
 * setup a context for performing asynchronous I/O
 *
 * The io_uring_setup() system call sets up a submission queue (SQ) and
 * completion queue (CQ) with at least entries entries, and returns a file
 * descriptor which can be used to perform subsequent operations on the
 * io_uring instance. The submission and completion queues are shared between
 * userspace and the kernel, which eliminates the need to copy data when
 * initiating and completing I/O.
 */
int io_uring_setup(u32 entries, struct io_uring_params *p);

/*
 * initiate and/or complete asynchronous I/O
 *
 * io_uring_enter() is used to initiate and complete I/O using the shared
 * submission and completion queues setup by a call to io_uring_setup(2).
 * A single call can both submit new I/O and wait for completions of I/O
 * initiated by this call or previous calls to io_uring_enter().
 */
int io_uring_enter(unsigned int fd, unsigned int to_submit,
                unsigned int min_complete, unsigned int flags,
                sigset_t *sig);

/*
 * register files or user buffers for asynchronous I/O
 *
 * The io_uring_register() system call registers user buffers or files for
 * use in an io_uring(7) instance referenced by fd. Registering files or user
 * buffers allows the kernel to take long term references to internal data
 * structures or create long term mappings of application memory, greatly
 * reducing per-I/O overhead.
 */
int io_uring_register(unsigned int fd, unsigned int opcode,
                void *arg, unsigned int nr_args);
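When these calls have no libc wrapper, they can be reached through syscall(2). The sketch below is a minimal probe under stated assumptions: <linux/io_uring.h> is installed, and the fallback syscall number 425 applies (it does on most architectures, but not all). The function ring_probe is a hypothetical helper. It sets up a ring, reports the queue sizes the kernel chose, and closes it again.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Assumption: fallback number for toolchains whose headers predate 5.1. */
#ifndef __NR_io_uring_setup
#define __NR_io_uring_setup 425
#endif

/* Create a ring with room for at least `entries` submissions, report the
 * sizes the kernel picked, then tear it down.
 * Returns 0 on success or a negative errno value on failure. */
int ring_probe(unsigned entries, unsigned *sq_entries, unsigned *cq_entries)
{
    struct io_uring_params p;
    int fd;

    assert(sq_entries != NULL && cq_entries != NULL);
    memset(&p, 0, sizeof(p));     /* flags = 0: no polling options requested */
    fd = (int)syscall(__NR_io_uring_setup, entries, &p);
    if (fd < 0)
        return -errno;            /* e.g. -ENOSYS on kernels without io_uring */
    *sq_entries = p.sq_entries;
    *cq_entries = p.cq_entries;
    close(fd);
    return 0;
}
```

A real application would keep the descriptor and mmap() the SQ ring, CQ ring and SQE array at the offsets the kernel returns in the params structure. liburing's io_uring_queue_init() takes care of that step.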

liburing

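For convenience, Jens Axboe also developed the liburing library and added ioengine=io_uring support to fio. With liburing, an application can use io_uring without knowing many of its details, such as memory barriers or ring buffer management. A simple example follows: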

/* setup io_uring and do mmap */
io_uring_queue_init(ENTRIES, &ring, 0);

/* get an sqe and fill in a READV operation */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_readv(sqe, fd, &iovec, 1, offset);

/* tell the kernel we have an sqe ready for consumption */
io_uring_submit(&ring);

/* wait for the sqe to complete */
io_uring_wait_cqe(&ring, &cqe);

/* process the completion (cqe->res holds the result), then mark it as consumed */
io_uring_cqe_seen(&ring, cqe);

/* tear down */
io_uring_queue_exit(&ring);

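Advanced io_uring features

Fixed Files and Buffers

IORING_REGISTER_FILES/IORING_UNREGISTER_FILES: registering a set of files in advance through the io_uring_register() system call reduces the fget()/fput() overhead otherwise paid on every IO operation.

IORING_REGISTER_BUFFERS/IORING_UNREGISTER_BUFFERS: registers a set of fixed IO buffers through the io_uring_register() system call. When the application reuses these buffers, they are mapped and unmapped only once rather than on every IO.

Polled IO

IORING_SETUP_IOPOLL: instead of waiting for a hardware interrupt as in non-polling mode, the kernel keeps polling the hardware to check whether IO requests have completed. This is very useful for applications that need low latency and high IOPS.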
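As a rough illustration of buffer registration, the sketch below registers a single buffer through the raw system call. It assumes <linux/io_uring.h> is available and uses 427 as a fallback syscall number, which holds on most architectures. The function register_fixed_buffer is a hypothetical helper, not part of liburing.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>
#include <linux/io_uring.h>

/* Assumption: fallback number for toolchains whose headers predate 5.1. */
#ifndef __NR_io_uring_register
#define __NR_io_uring_register 427
#endif

/* Register one buffer with the ring referenced by ring_fd.
 * Returns 0 on success or a negative errno value on failure. */
int register_fixed_buffer(int ring_fd, void *buf, size_t len)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    long rc;

    assert(buf != NULL && len > 0);
    /* arg points to an array of iovecs; nr_args is the array length. */
    rc = syscall(__NR_io_uring_register, ring_fd,
                 IORING_REGISTER_BUFFERS, &iov, 1);
    return rc < 0 ? -errno : 0;
}
```

Once registered, the buffer is referenced by its position in the array: a request with opcode IORING_OP_READ_FIXED or IORING_OP_WRITE_FIXED sets buf_index to 0 here and points addr/len inside the registered range. Registration can fail with -ENOMEM when the locked-memory limit (RLIMIT_MEMLOCK) is too low.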

io_uring performance

Note: the data come from v5 of the io_uring patchset.

  • 3d xpoint, 4k random read

Interface   QD   Polled   Latency    IOPS
--------------------------------------------
io_uring     1   0         9.5usec    77K
io_uring     2   0         8.2usec   183K
io_uring     4   0         8.4usec   383K
io_uring     8   0        13.3usec   449K

libaio       1   0         9.7usec    74K
libaio       2   0         8.5usec   181K
libaio       4   0         8.5usec   373K
libaio       8   0        15.4usec   402K

io_uring     1   1         6.1usec   139K
io_uring     2   1         6.1usec   272K
io_uring     4   1         6.3usec   519K
io_uring     8   1        11.5usec   592K

spdk         1   1         6.1usec   151K
spdk         2   1         6.2usec   293K
spdk         4   1         6.7usec   536K
spdk         8   1        12.6usec   586K

In non-polling mode, io_uring shows only a modest improvement over libaio. In polling mode, io_uring comes close to spdk and even does better at higher queue depths, far outperforming libaio.

  • Peak IOPS, 512b random read

Interface   QD   Polled   Latency     IOPS
---------------------------------------------
io_uring     4   1         6.8usec    513K
io_uring     8   1         8.7usec    829K
io_uring    16   1        13.1usec   1019K
io_uring    32   1        20.6usec   1161K
io_uring    64   1        32.4usec   1244K

spdk         4   1         6.8usec    549K
spdk         8   1         8.6usec    865K
spdk        16   1        14.0usec   1105K
spdk        32   1        25.0usec   1227K
spdk        64   1        47.3usec   1251K

At low queue depths io_uring trails spdk by about 7%, but at high queue depths the two are nearly on par.
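As a quick check, the relative IOPS gap between spdk and io_uring at the two ends of the 512b table works out as follows (plain arithmetic on the table values):

```latex
\[
\frac{549\mathrm{K} - 513\mathrm{K}}{549\mathrm{K}} \approx 6.6\% \quad (\mathrm{QD} = 4),
\qquad
\frac{1251\mathrm{K} - 1244\mathrm{K}}{1251\mathrm{K}} \approx 0.6\% \quad (\mathrm{QD} = 64).
\]
```

The same calculation gives roughly 4.2%, 7.8% and 5.4% at queue depths 8, 16 and 32, so the gap only closes at the deepest queue.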

  • Peak per-core, multiple devices, 4k random read

Interface   QD    Polled   IOPS
----------------------------------
io_uring    128   1        1620K
libaio      128   0         608K
spdk        128   1        1739K

Note: according to reference 1, peak per-core performance has already reached about 1700K IOPS.

References

1. High-performance asynchronous I/O with io_uring

2. Ringing in a new asynchronous I/O API

3. Efficient IO with io_uring
