
Bare Metal K8s Clustering at Scale


At full scale, Chick-fil-A will be running Kubernetes at the Edge in each of our 2000 restaurants. That means roughly 6000 devices at the Edge running Kubernetes.

One of the biggest challenges associated with this is bare metal clustering on-the-fly, in-restaurant.

While most Kubernetes deployments are in the cloud or benefit from skilled technicians who are physically located near their deployments (or at least equipped with remote access), our deployments are completed by installers who focus only on initial hardware installations. They never connect to the compute devices directly — rather, they connect Ethernet and power cords, and then look at an app to check the status of the cluster as it self-bootstraps. Replacements are completed by restaurant Owner/Operators or their teams, who are sometimes less technical.

On top of that, our Edge deployments are not exactly in a “datacenter environment”.

Edge Compute hardware and a typical installation

Clustering: Options we considered

To solve this clustering challenge, we surveyed the landscape and considered several options:

Kubespray — we started with the Ansible-based Kubespray but found it to be fairly brittle. When things went well, we got a cluster. When they didn’t, we created a brick that was hard to transform back into a computer. We also found that the process to initiate clusters was very slow, often taking as much as 30 minutes on our hardware stack. While we are loath to discredit the project long-term as a result, this was enough for us to move in a different direction.

OpenShift — it can create K8s clusters, but we didn’t like the idea of being closely tied to a vendor’s solution for a critical part of our infrastructure.

Kops — we are big fans of kops, and we use it to deploy our cloud “control plane” Kubernetes clusters. Unfortunately, when we began our Edge Compute journey, kops was not a viable bare metal solution. We look forward to seeing how it develops in the future.

Kubeadm — another clustering utility from the core Kubernetes project. It looks promising, but it is definitely much more complex (perhaps due to its flexibility) than some of the alternatives, including…

RKE

In our first iteration, RKE was our winner. RKE is a Kubernetes clustering engine provided by Rancher Labs. While we decided NOT to use Rancher 2.0 to manage our clusters, we do like the simplicity of using RKE to initialize and maintain a cluster.


To use RKE, you need to determine a leader node and provide it with a configuration YAML file that describes the cluster, including the host names of the nodes that will participate in the clustering activity.

If nodes in the cluster are added or removed (or die), the configuration file needs to remain an accurate representation of the current and to-be member nodes. Failure to keep this configuration up-to-date will result in failed clustering attempts. While we believe the absence of a node should not fail the clustering initialization/update, that’s the way it works today.
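For illustration, a minimal configuration along these lines (the node addresses, SSH user, and key path here are hypothetical) lists every current and to-be member node and the roles it should play; the leader then runs `rke up` against this file to create or reconcile the cluster:

```yaml
# Illustrative RKE cluster.yml; the addresses, user, and key path are
# made up. Every node that should be in the cluster must be listed here.
nodes:
  - address: edge-node-1.restaurant.local
    user: rke
    role: [controlplane, etcd, worker]
    ssh_key_path: /home/rke/.ssh/id_rsa
  - address: edge-node-2.restaurant.local
    user: rke
    role: [controlplane, etcd, worker]
    ssh_key_path: /home/rke/.ssh/id_rsa
  - address: edge-node-3.restaurant.local
    user: rke
    role: [controlplane, etcd, worker]
    ssh_key_path: /home/rke/.ssh/id_rsa
```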

Installation Process

Our installation process in restaurants is very simple to perform — unbox the devices, plug them into power and the labeled switch ports, and that’s it. They automatically boot on power, and they self-bootstrap and cluster. While this is awesome since it enables a non-technical user to execute installations and replacements without any knowledge of Kubernetes or even the overall architecture, it does create a need for a much more sophisticated bootstrapping process.

The to-be cluster nodes need to coordinate with each other to determine who is going to participate in clustering. They also need to elect a leader to execute the cluster creation via RKE.

Introducing Highlander

To solve this problem, we developed Highlander… because there can be only one. Cluster initiator, that is. One cluster initiator.


Highlander is part of our base Edge image. When each node boots, it UDP-broadcasts its presence and asks if there is an established leader. It also begins listening for broadcasts from its peers. After a few seconds with no replies, it will send another broadcast announcing that it has declared itself the leader. Are there any objections? Should nobody object, the node will soon establish itself as the cluster leader and respond to any future requests asking about an available leader.

If another node has already claimed the role of leader, the new node will acknowledge the declaration. The existing leader will “RKE up” to include the new node in the cluster.

The nodes gossip periodically to ensure the leader is still there. If the leader ever dies, a new leader will be elected through a simple protocol that uses random sleeps and leader declarations. While this is simple and unsophisticated, it is easy to reason about and understand, and it works effectively at scale.
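To give a feel for how little machinery this requires, here is a rough sketch of the broadcast-and-wait election. This is not Highlander’s actual code: the group address, port, message strings, and timings are invented, and it uses UDP multicast rather than broadcast purely so the example runs without extra socket options.

```go
// Sketch of a Highlander-style election: announce yourself, wait briefly
// for an existing leader to answer, and claim leadership if nobody does.
package main

import (
	"fmt"
	"math/rand"
	"net"
	"time"
)

// Hypothetical group and port for election traffic on the local segment.
var electionGroup = &net.UDPAddr{IP: net.ParseIP("239.0.0.57"), Port: 9999}

// send publishes a single election message to the group.
func send(msg string) {
	c, err := net.DialUDP("udp4", nil, electionGroup)
	if err != nil {
		return
	}
	defer c.Close()
	c.Write([]byte(msg))
}

func main() {
	// Join the election group so we hear our peers.
	conn, err := net.ListenMulticastUDP("udp4", nil, electionGroup)
	if err != nil {
		panic(err)
	}
	defer conn.Close()

	// Announce ourselves and ask whether a leader already exists.
	send("WHO_IS_LEADER")

	// Wait a few seconds, plus random jitter so simultaneously booting
	// nodes don't all declare leadership at the same moment.
	wait := 3*time.Second + time.Duration(rand.Intn(2000))*time.Millisecond
	conn.SetReadDeadline(time.Now().Add(wait))

	buf := make([]byte, 1024)
	for {
		n, src, err := conn.ReadFromUDP(buf)
		if err != nil {
			// Deadline expired with no objection: declare ourselves leader.
			send("I_AM_LEADER")
			fmt.Println("no leader answered; assuming leadership")
			return
		}
		if string(buf[:n]) == "I_AM_LEADER" {
			// An established leader exists; acknowledge it and join.
			fmt.Printf("found existing leader at %s\n", src.IP)
			return
		}
	}
}
```

The real Highlander keeps running after this point: the leader answers future “who is the leader?” queries, the nodes continue to gossip, and a dead leader triggers a fresh election using the random sleeps and declarations described above.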

Once leader election is complete, Highlander also ensures the cluster is properly configured. In our environment, this includes:

  • Switching out KubeDNS for CoreDNS
  • Setting up Istio / other core control plane pods
  • OAuth identity negotiation **

** Each of our nodes gets its own identity and a short-lived JWT for accessing authenticated resources. Highlander provisions this identity and makes the token available as a Kubernetes secret, as sketched below.
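As a sketch of that last step, publishing a node’s identity token as a Kubernetes secret with client-go might look roughly like the following. The kubeconfig path, namespace, secret name, and key are assumptions for illustration, not our actual values.

```go
// Sketch: store a short-lived identity JWT as a Kubernetes secret using
// client-go. The kubeconfig path, namespace, and names are hypothetical.
package identity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// PublishIdentityToken writes the node's JWT into a secret so that pods
// can mount or reference it when calling authenticated resources.
func PublishIdentityToken(ctx context.Context, jwt string) error {
	// Load the kubeconfig produced when the cluster came up
	// (the path is an assumption for this sketch).
	cfg, err := clientcmd.BuildConfigFromFlags("", "/etc/kubernetes/kube_config_cluster.yml")
	if err != nil {
		return err
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		return err
	}

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "node-identity-token"},
		StringData: map[string]string{"jwt": jwt},
	}
	// Create the secret in the default namespace; a real version would
	// update it in place as the short-lived token rotates.
	_, err = clientset.CoreV1().Secrets("default").Create(ctx, secret, metav1.CreateOptions{})
	return err
}
```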

Process in aggregate

While we mostly focused on cluster initialization in this post, here is a look at our entire process for node initialization on-the-fly, in-restaurant.

[Diagram: the end-to-end node initialization process]

(Inevitable) Failures

Infrastructure breaks, and we want to be resilient to its failure. Node failures can occur for many reasons: device failure, network switch failure, power cords accidentally unplugged. In all of these cases, we have to quickly diagnose what is a true failure and what is an unrelated anomaly. That process is complex and the subject of another post in the future. That said, when we diagnose a failure, our process is to drop-ship a base-image replacement to the restaurant (complete with visual installation instructions) and have the Owner/Operator or their team execute the replacement.

Meanwhile, our Kubernetes cluster will continue to operate on a reduced number of nodes, and will be ready to welcome the replacement node when it arrives.

What’s next?

We like RKE, but there is still a chance that we will make a shift to a different clustering engine in the future. We would consider revisiting Kubeadm as it matures, or possibly consider rolling our own cluster manager to give us more control over the process. Having developed Highlander, it would not be a stretch to complete the rest of the clustering process. We will keep a close eye on the ecosystem and see what develops.

Edit 7/2/18 — since writing this, many readers have asked “why not just use the cloud? Why computing at the Edge?”. We realized that we did not provide much context about why we’re doing what we’re doing, so we will follow up with a post about that soon.

Like what we’re doing? Chick-fil-A is always looking for talented engineers to join our team. We would love to hear from you, so contact Brian Chambers on LinkedIn if you would like to hear about opportunities to join the team.

