Docker and the OCI container ecosystem

Docker has transformed the way many people develop and deploy software. It wasn't the first implementation of containers on Linux, but Docker's ideas about how containers should be structured and managed were different from its predecessors. Those ideas matured into industry standards, and an ecosystem of software has grown around them. Docker continues to be a major player in the ecosystem, but it is no longer the only whale in the sea — Red Hat has also done a lot of work on container tools, and alternative implementations are now available for many of Docker's offerings.

Anatomy of a container

A container is somewhat like a lightweight virtual machine; it shares a kernel with the host, but in most other ways it appears to be an independent machine to the software running inside of it. The Linux kernel itself has no concept of containers; instead, they are created by using a combination of several kernel features:

  • Bind mounts and overlayfs may be used to construct the root filesystem of the container.
  • Control groups may be used to limit the container's share of the host's CPU, memory, and I/O resources.
  • Namespaces are used to create an isolated view of the system for processes running inside the container.

Linux's namespaces are the key feature that allows the creation of containers. Linux supports namespaces for many different aspects of the system, including user namespaces for separate views of user and group IDs, PID namespaces for distinct sets of process IDs, network namespaces for distinct sets of network interfaces, and several others. When a container is started, a runtime creates the appropriate control groups, namespaces, and filesystem mounts for the container; then it launches a process inside the environment it has created.
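
As an illustration of that last step, the following minimal sketch in Go (the language most of the tools discussed below are written in) launches a shell in fresh UTS, PID, and mount namespaces. It is a toy under assumed conditions, not how runc is implemented; a real runtime also sets up control groups, mounts, and much more:

    package main

    import (
        "os"
        "os/exec"
        "syscall"
    )

    func main() {
        // Start /bin/sh with new UTS, PID, and mount namespaces.
        // Creating these namespaces requires privileges (or a user
        // namespace, which is omitted here for brevity).
        cmd := exec.Command("/bin/sh")
        cmd.Stdin, cmd.Stdout, cmd.Stderr = os.Stdin, os.Stdout, os.Stderr
        cmd.SysProcAttr = &syscall.SysProcAttr{
            Cloneflags: syscall.CLONE_NEWUTS | syscall.CLONE_NEWPID | syscall.CLONE_NEWNS,
        }
        if err := cmd.Run(); err != nil {
            panic(err)
        }
    }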

There is some level of disagreement about what that process should be. Some prefer to start an init process like systemd and run a full Linux system inside the container. This is referred to as a "system container"; it was the most common type of container before Docker. System containers continue to be supported by software like LXC and OpenVZ.

Docker's developers had a different idea. Instead of running an entire system inside a container, Docker says that each container should only run a single application. This style of container is known as an "application container." An application container is started using a container image, which bundles the application together with its dependencies and just enough of a Linux root filesystem to run it.

A container image generally does not include an init system, and may not even include a package manager; container images are usually replaced with updated versions rather than updated in place. An image for a statically-compiled application may be as minimal as a single binary and a handful of support files in /etc. Application containers usually don't have a persistent root filesystem; instead, overlayfs is used to create a temporary layer on top of the container image, which is discarded when the container is stopped. Any persistent data that lives outside of the container image is grafted onto the container's filesystem via a bind mount to another location on the host.
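
The overlayfs arrangement boils down to a single mount call. This sketch uses hypothetical paths: the read-only image layers serve as the "lower" directory, while a throwaway "upper" directory catches everything the container writes:

    package main

    import "syscall"

    func main() {
        // Equivalent to:
        //   mount -t overlay overlay \
        //     -o lowerdir=/layers/base,upperdir=/tmp/upper,workdir=/tmp/work /merged
        err := syscall.Mount("overlay", "/merged", "overlay", 0,
            "lowerdir=/layers/base,upperdir=/tmp/upper,workdir=/tmp/work")
        if err != nil {
            panic(err)
        }
    }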

The OCI ecosystem

These days, when people talk about containers, they are likely to be talking about the style of application containers popularized by Docker. In fact, unless otherwise specified, they are probably talking about the specific container image format, run-time environment, and registry API implemented by Docker's software. Those have all been standardized by the Open Container Initiative (OCI), which is an industry body that was formed in 2015 by Docker and the Linux Foundation. Docker refactored its software into a number of smaller components; some of those components, along with their specifications, were placed under the care of the OCI. The software and specifications published by the OCI formed the seed for what is now a robust ecosystem of container-related software.

The OCI image specification defines a format for container images that consists of a JSON configuration (containing environment variables, the path to execute, and so on) and a series of tarballs called "layers". The contents of each layer are stacked on top of each other, in series, to construct the root filesystem for the container image. Layers can be shared between images; if a server is running several containers that refer to the same layer, they can potentially share the same copy of that layer. Docker provides minimal images for several popular Linux distributions that can be used as the base layer for application containers.
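
To give a sense of the format, here is an abbreviated, hypothetical image configuration showing a few of the fields the specification defines; a real configuration carries more metadata and real layer digests:

    {
        "architecture": "amd64",
        "os": "linux",
        "config": {
            "Env": ["PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],
            "Entrypoint": ["/usr/bin/myapp"]
        },
        "rootfs": {
            "type": "layers",
            "diff_ids": ["sha256:..."]
        }
    }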

The OCI also publishes a distribution specification. In this context, "distribution" does not refer to a Linux distribution; it is used in a more general sense. This specification defines an HTTP API for pushing and pulling container images to and from a server; servers that implement this API are called container registries. Docker maintains a large public registry called Docker Hub as well as a reference implementation (called "Distribution", perhaps confusingly) that can be self-hosted. Other implementations of the specification include Red Hat's Quay and VMware's Harbor, as well as hosted offerings from Amazon, GitHub, GitLab, and Google.
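
The API itself is ordinary HTTP. As a sketch, fetching an image manifest from a local, unauthenticated registry (hosted registries like Docker Hub additionally require a token) might look like this in Go; the address and image name are assumptions:

    package main

    import (
        "fmt"
        "io"
        "net/http"
    )

    func main() {
        // Request the manifest for the "alpine" image tagged "latest".
        req, err := http.NewRequest("GET",
            "http://localhost:5000/v2/alpine/manifests/latest", nil)
        if err != nil {
            panic(err)
        }
        req.Header.Set("Accept", "application/vnd.oci.image.manifest.v1+json")
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }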

A program that implements the OCI runtime specification is responsible for everything pertaining to actually running a container. It sets up any necessary mounts, control groups, and kernel namespaces, executes processes inside the container, and tears down any container-related resources once all the processes inside of it have exited. The reference implementation of the runtime specification is runc, which was created by Docker for the OCI.
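
The runtime specification is built around a "bundle": a directory containing a root filesystem and a config.json describing what to run inside it. An abbreviated, hypothetical configuration might look like:

    {
        "ociVersion": "1.0.2",
        "process": {
            "args": ["/bin/sh"],
            "cwd": "/",
            "user": {"uid": 0, "gid": 0}
        },
        "root": {"path": "rootfs"},
        "linux": {
            "namespaces": [{"type": "pid"}, {"type": "mount"}, {"type": "uts"}]
        }
    }

Pointing runc at a bundle like this with "runc run" is all it takes to start a container by hand.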

There are a number of other OCI runtimes to choose from. For example, crun offers an OCI runtime written in C that has the goal of being faster and more lightweight than runc, which, like most of the rest of the OCI ecosystem, is written in Go. Google's gVisor includes runsc, which provides greater isolation from the host by running applications on top of a user-mode kernel. Amazon's Firecracker is a minimal hypervisor written in Rust that can use KVM to give each container its own virtual machine; Kata Containers, which grew out of Intel's Clear Containers, works similarly but supports multiple hypervisors (including Firecracker).

A container engine is a program that ties these three specifications together. It implements the client side of the distribution specification to retrieve container images from registries, interprets the images it has retrieved according to the image specification, and launches containers using a program that implements the runtime specification. A container engine provides tools and/or APIs for users to manage container images, processes, and storage.

Kubernetes is a container orchestrator, capable of scheduling and running containers across hundreds or even thousands of servers. Kubernetes does not implement any of the OCI specifications itself. It needs to be used in combination with a container engine, which manages containers on behalf of Kubernetes. The interface that it uses to communicate with container engines is called the Container Runtime Interface (CRI).

Docker

Docker is the original OCI container engine. It consists of two main user-visible components: a command-line interface (CLI) client named docker, and a server. The server is named dockerd in Docker's own packages, but the repository was renamed moby when Docker created the Moby Project in 2017. The Moby Project is an umbrella organization that develops open-source components used by Docker and other container engines. When Moby was announced, many found the relationship between Docker and the Moby Project to be confusing; it has been described as being similar to the relationship between Fedora and Red Hat.

dockerd provides an HTTP API; it usually listens on a Unix socket named /var/run/docker.sock, but can be made to listen on a TCP socket as well. The docker command is merely a client to this API; the server is responsible for downloading images and starting container processes. The client supports starting containers in the foreground, so that running a container at the command-line behaves similarly to running any other program, but this is only a simulation. In this mode, the container processes are still started by the server, and input and output are streamed over the API socket; when the process exits, the server reports that to the client, and then the client sets its own exit status to match.
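
Because the API is plain HTTP over a Unix socket, it is easy to talk to dockerd directly. This sketch assumes the default socket path and sufficient permissions; it lists running containers, much like docker ps:

    package main

    import (
        "context"
        "fmt"
        "io"
        "net"
        "net/http"
    )

    func main() {
        client := &http.Client{
            Transport: &http.Transport{
                // Ignore the host in the URL and always dial the socket.
                DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
                    return (&net.Dialer{}).DialContext(ctx, "unix", "/var/run/docker.sock")
                },
            },
        }
        resp, err := client.Get("http://localhost/containers/json")
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        body, _ := io.ReadAll(resp.Body)
        fmt.Println(string(body))
    }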

This design does not play well with systemd or other process supervision tools, because the CLI never has any child processes of its own. Running the docker CLI under a process supervisor only results in supervising the CLI process. This has a variety of consequences for users of these tools. For example, any attempt to limit a container's memory usage by running the CLI as a systemd service will fail; the limits will only apply to the CLI and its non-existent children. In addition, attempts to terminate a client process may not result in terminating all of the processes in the container.

Failure to limit access to Docker's socket can be a significant security hazard. By default dockerd runs as root. Anyone who is able to connect to the Docker socket has complete access to the API. Since the API allows things like running a container as a specific UID and binding arbitrary filesystem locations, it is trivial for someone with access to the socket to become root on the host. Support for running in rootless mode was added in 2019 and stabilized in 2020, but is still not the default mode of operation.
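
The hazard is easy to demonstrate: anyone who can reach the socket can bind-mount the host's root filesystem into a container and chroot into it (the image chosen here is arbitrary):

    docker run --rm -it -v /:/host alpine chroot /host /bin/sh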

Docker can be used by Kubernetes to run containers, but it doesn't directly support the CRI specification. Originally, Kubernetes included a component called dockershim that provided a bridge between the CRI and the Docker API, but it was deprecated in 2020. The code was spun out of the Kubernetes repository and is now maintained separately as cri-dockerd.

containerd & nerdctl

Docker refactored its software into independent components in 2015; containerd is one of the fruits of that effort. In 2017, Docker donated containerd to the Cloud Native Computing Foundation (CNCF), which stewards the development of Kubernetes and other tools. It is still included in Docker, but it can also be used as a standalone container engine, or with Kubernetes via an included CRI plugin. The architecture of containerd is highly modular. This flexibility helps it to serve as a proving ground for experimental features. Plugins may provide support for different ways of storing container images and additional image formats, for example.

Without any additional plugins, containerd is effectively a subset of Docker; its core features map closely to the OCI specifications. Tools designed to work with Docker's API cannot be used with containerd. Instead, it provides an API based on Google's gRPC. Unfortunately, concerned system administrators looking for access control won't find it here; despite being incompatible with Docker's API, containerd's API appears to carry all of the same security implications.

The documentation for containerd notes that it follows a smart client model (as opposed to Docker's "dumb client"). Among other differences, this means that containerd does not communicate with container registries; instead, (smart) clients are required to download any images they need themselves. Despite the difference in client models, containerd still has a process model similar to that of Docker; container processes are forked from the containerd process. In general, without additional software, containerd doesn't do anything differently from Docker, it just does less.
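
The smart-client model is visible in containerd's Go client library: the download logic runs in the client process, which then hands the image over to containerd. This minimal sketch assumes the default socket path and an arbitrary namespace name:

    package main

    import (
        "context"
        "fmt"

        "github.com/containerd/containerd"
        "github.com/containerd/containerd/namespaces"
    )

    func main() {
        client, err := containerd.New("/run/containerd/containerd.sock")
        if err != nil {
            panic(err)
        }
        defer client.Close()

        // Every containerd resource lives in a namespace that the
        // client chooses; "example" is arbitrary.
        ctx := namespaces.WithNamespace(context.Background(), "example")

        // The client fetches the image itself and asks containerd to
        // store and unpack it.
        image, err := client.Pull(ctx, "docker.io/library/alpine:latest",
            containerd.WithPullUnpack)
        if err != nil {
            panic(err)
        }
        fmt.Println("pulled", image.Name())
    }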

When containerd is bundled with Docker, dockerd serves as the smart client, accepting Docker API calls from its own dumb client and doing any additional work needed before calling the containerd API; when used with Kubernetes, these things are handled by the CRI plugin. Other than that, containerd didn't really have its own client until relatively recently. It includes a bare-bones CLI called ctr, but this is only intended for debugging purposes.

This changed in December 2020 with the release of nerdctl. Since then, running containerd on its own has become much more practical; nerdctl features a user interface designed to be compatible with the Docker CLI and provides much of the functionality that Docker users would find missing from a standalone containerd installation. Users who don't need compatibility with the Docker API might find themselves quite happy with containerd and nerdctl.

Podman

Podman is an alternative to Docker, sponsored by Red Hat, that aims to be a drop-in replacement. Like Docker and containerd, it is written in Go and released under the Apache 2.0 license, but it is not a fork; it is an independent reimplementation. Red Hat's sponsorship of Podman is likely to be at least partially motivated by the difficulties it encountered in making Docker's software interoperate with systemd.

On a superficial level, Podman appears nearly identical to Docker. It can use the same container images, and talk to the same registries. The podman CLI is a clone of docker, with the intention that users migrating from Docker can alias docker to podman and mostly continue with their lives as if nothing had changed.

Originally, Podman provided an API based on the varlink protocol. This meant that while Podman was compatible with Docker on a CLI level, tools that used the Docker API directly could not be used with Podman. In version 3.0, the varlink API was scrapped in favor of an HTTP API, which aims to be compatible with the one provided by Docker while also adding some Podman-specific endpoints. This new API is maturing rapidly, but users of tools designed for Docker would be well-advised to test for compatibility before committing to switch to Podman.

As it is largely a copy of Docker's API, Podman's API doesn't feature any sort of access control, but Podman has some architectural differences that may make that less important. Podman gained support for running in rootless mode early in its development. In this mode, containers can be created without root or any other special privileges, aside from a small bit of help from the setuid newuidmap and newgidmap utilities, which map additional user and group IDs into the container's user namespace. Unlike Docker, when Podman is invoked by a non-root user, rootless mode is used by default.
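
Those helpers consult /etc/subuid and /etc/subgid to decide which subordinate ID ranges a user is allowed to map into a container's user namespace. An entry granting a user 65536 subordinate UIDs looks something like this (the user name and range are examples):

    alice:100000:65536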

Users of Podman can also dodge security concerns about its API socket by simply disabling it. Though its interface is largely identical to the Docker CLI, podman is no mere API client; it creates containers itself without any help from a daemon. As a result, Podman plays nicely with tools like systemd: using podman run with a process supervisor works as expected, because the processes inside the container are children of podman run. The developers of Podman encourage this style of use by providing a command, podman generate systemd, that generates systemd units for Podman containers.

Aside from its process model, Podman caters to systemd users in other ways. While running an init system such as systemd inside of a container is antithetical to the Docker philosophy of one application per container, Podman goes out of its way to make it easy. If the program the container is configured to run is an init system, Podman will automatically mount all of the kernel filesystems that systemd needs to function. It also supports reporting the status of containers to systemd via sd_notify(), or handing the notification socket off to the application inside of the container for it to use directly.

Podman also has some features designed to appeal to Kubernetes users. Like Kubernetes, it supports the notion of a "pod", a group of containers that share a common network namespace. It can run containers from Kubernetes configuration files (podman play kube) and generate Kubernetes configurations from its own containers (podman generate kube). However, unlike Docker and containerd, there is no way for Podman to be used by Kubernetes to run containers. This is a deliberate omission: instead of adding CRI support to Podman, which is a general-purpose container engine, Red Hat chose to sponsor the development of a more specialized alternative in the form of CRI-O.

CRI-O

CRI-O is built on many of the same underpinnings as Podman, so the relationship between CRI-O and Podman could be said to be similar to the one between containerd and Docker: CRI-O delivers much of the same technology as Podman, with fewer frills. The analogy doesn't stretch far, though. Unlike containerd and Docker, CRI-O and Podman are completely separate projects; neither embeds the other.

As might be suggested by its name, CRI-O implements the Kubernetes CRI. In fact, that's all that it implements; CRI-O is built specifically and only for use with Kubernetes. It is developed in lockstep with the Kubernetes release cycle, and anything that is not required by the CRI is explicitly declared to be out of scope. CRI-O cannot be used without Kubernetes and includes no CLI of its own; based on the stated goals of the project, any attempt to make CRI-O suitable for standalone use would likely be viewed as an unwelcome distraction by its developers.

Like Podman, the development of CRI-O was initially sponsored by Red Hat; like containerd, it was later donated to the CNCF, in 2019. Although both are now under the aegis of the same organization, the narrow focus of CRI-O may make it more appealing to Kubernetes administrators than containerd. The developers of CRI-O are free to make decisions solely on the basis of maximizing the benefit to users of Kubernetes, whereas the developers of containerd and other container engines have many other types of users and use cases to consider.

Conclusion

These are just a few of the most popular container engines; other projects like Apptainer and Pouch cater to different ecological niches. There are also a number of tools available for creating and manipulating container images, like Buildah, Buildpacks, skopeo, and umoci. Docker deserves a great deal of credit for the Open Container Initiative; the standards and the software that have resulted from this effort have provided the foundation for a wide array of projects. The ecosystem is robust; should one project shut down, there are multiple alternatives ready and available to take its place. As a result, the future of this technology is no longer tied to one particular company or project; the style of containers that Docker pioneered seems likely to be with us for a long time to come.
