Repairing network hardware at scale with SRE principles
By James O’Keeffe, Senior Site Reliability Engineer
To support our Google Cloud Platform (GCP) customers, we run a complex global network that depends on multiple providers and a lot of hardware. Google network engineering uses a diverse set of vendor equipment to route user traffic from an internet service provider to one of our serving front ends inside a GCP data center. This equipment is proprietary and made by external networking vendors such as Arista, Cisco and Juniper. Each vendor has distinct operational methods, configurations and operational consoles.
With hundreds of distinct components in use across our global network, we routinely deal with hardware failures: a failed power supply, line card, or control plane card, for example. The complexity of today’s cloud networks means there are a huge number of places where failure can occur. When we first began building and operating our own data centers, Google had a team of engineers, including network engineers and site reliability engineers (SREs), who performed fault detection, mitigation, and repair on these devices using manual processes guided by a ticket system. Google’s SRE principles are prescriptive and aim to guide developers and operations teams toward better system reliability. As with DevOps, avoiding toil, the manual work that can eat up too much time, is an essential goal.
As we became familiar with common hardware problems, we realized that any ticket type we encountered repeatedly, and that followed a predetermined sequence of steps, could easily be automated. Over time, our team created a set of playbooks detailing how to deal with each hardware failure scenario, taking into account relevant software and hardware bugs and the typical steps to resolution. A playbook is consulted whenever an alert is received. Since we already knew in advance how to deal with each issue as it arose, it made sense to automate the work. Here’s how we did it.
Building the automation interface
“In the old way of doing things, we treat our servers like pets, for example, Bob the mail server. If Bob goes down, it’s all hands on deck. The CEO can’t get his email and it’s the end of the world. In the new way, servers are numbered, like cattle in a herd. For example, www001 to www100. When one server goes down, it’s taken out back, shot, and replaced on the line.”
– Randy Bias
The quote above describes a classic engineering scenario often applied within SRE: “pets vs. cattle,” a way of looking at data center hardware as either individual components or a herd of interchangeable ones. The two categories of equipment can be described as follows:
Pet:
- An individual device you work on. You’re familiar with all of its particular failure modes.
- When it gets sick, you come to the rescue.
Cattle:
- A fleet of devices with a common interface.
- You manage the “herd” of devices as a group.
- The common interface lets you perform the same basic operations on any device, regardless of its manufacturer.
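Treating the fleet as cattle led us to write a common management interface. For example, a line card can be expressed as a Go interface: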
type Linecard interface {
    Online() error
    Offline() error
    Status() error
}
The error return type in Go simply means that the function returns an error value if it fails. The underlying code implementing this interface for a Juniper line card differs significantly from the implementation for a Cisco line card, but the caller of the function is insulated from those details. Upper-level code imports the library, and when it operates on a line card, it can perform only the three actions specified above.
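To make that insulation concrete, here is a minimal sketch of what one vendor-specific implementation might look like. The juniperLinecard type, the runVendorCommand helper, and the command strings are illustrative assumptions, not Google’s production code:

import "fmt"

// runVendorCommand stands in for whatever transport (SSH, NETCONF, etc.)
// actually delivers a command to a device; assumed here for illustration.
func runVendorCommand(device, cmd string) error {
    // ... transport-specific logic elided ...
    return nil
}

// juniperLinecard satisfies the Linecard interface for one vendor. A Cisco
// implementation would issue different commands behind the same three methods.
type juniperLinecard struct {
    device string // management address of the chassis
    slot   int    // slot holding this line card
}

func (l *juniperLinecard) Online() error {
    return runVendorCommand(l.device, fmt.Sprintf("request chassis fpc slot %d online", l.slot))
}

func (l *juniperLinecard) Offline() error {
    return runVendorCommand(l.device, fmt.Sprintf("request chassis fpc slot %d offline", l.slot))
}

func (l *juniperLinecard) Status() error {
    return runVendorCommand(l.device, fmt.Sprintf("show chassis fpc %d", l.slot))
}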
We then realized that we could apply the same interface to many other hardware components, such as a fan. For certain vendors, the Online() and Offline() functions did nothing, because those vendors don’t support turning a fan off; we simply used the interface to check the fan’s status.
type Fan interface {
    Online() error
    Offline() error
    Status() error
}
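A sketch of such a vendor’s fan implementation, reusing the hypothetical runVendorCommand helper from the earlier sketch (the acmeFan type is likewise an assumption):

// acmeFan models a fan on a hypothetical vendor whose platform cannot
// power individual fans on or off, so only Status() does real work.
type acmeFan struct {
    device string
    tray   int
}

// Online and Offline are deliberate no-ops for this vendor.
func (f *acmeFan) Online() error  { return nil }
func (f *acmeFan) Offline() error { return nil }

func (f *acmeFan) Status() error {
    return runVendorCommand(f.device, fmt.Sprintf("show environment fan-tray %d", f.tray))
}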
Building on this line of thought, we realized we could generalize further and define a single common interface for all hardware components within a device.
type Component interface {
    Online() error
    Offline() error
    Status() error
}
Structuring the code this way means anyone can add support for a new vendor’s device, and anyone can add a new type of component as a library. Once a library implements this common interface, it can be registered as the handler for that specific vendor and component.
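The article doesn’t show the registration mechanism itself; one minimal way it might look is a registry keyed by vendor and component type. Register, New, and the key format below are assumptions for illustration:

// Factory builds a Component for a given device and slot.
type Factory func(device string, slot int) Component

var handlers = map[string]Factory{}

// Register installs a library's implementation as the handler for a
// specific vendor and component type, e.g. "juniper/linecard".
// (Assumed design, not the production registry.)
func Register(vendor, component string, f Factory) {
    handlers[vendor+"/"+component] = f
}

// New returns the registered Component implementation, letting the
// automation stay ignorant of vendor-specific details.
func New(vendor, component, device string, slot int) (Component, error) {
    f, ok := handlers[vendor+"/"+component]
    if !ok {
        return nil, fmt.Errorf("no handler for %s/%s", vendor, component)
    }
    return f(device, slot), nil
}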
Deciding what to automate
The system needed to interact with humans at various stages of the automation. To decide what to automate, we drew a flow chart of the normal human-driven repair sequence and drew boxes around the stages we believed we could replace with automation. We used the task of replacing a vendor control plane board as an example. Many of the steps have self-explanatory names; here are definitions of some of the more complex ones (a code sketch of the sequence follows the list):
- Determine control plane: Find the faulty control plane unit.
- Determine state: Is it the master or the backup?
- Copy image to control plane: Copy the appropriate software image to the master control plane.
- Offline control plane: Take the backup control plane offline.
- Toggle mastership: Make the replaced control plane the new master.
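As a sketch, this sequence might drive the common Component interface like so. The ControlPlane extension and its method names (IsMaster, CopyImage, ToggleMastership) are assumptions for illustration, not the article’s actual API:

import "errors"

// ControlPlane is a hypothetical extension of Component for boards that
// participate in mastership elections.
type ControlPlane interface {
    Component
    IsMaster() (bool, error)
    CopyImage(path string) error
    ToggleMastership() error
}

// replaceBackupControlPlane sketches the repair flow: verify the unit is
// the backup, stage the image on the master, offline the backup for the
// physical swap, bring it back online, then make it the new master.
func replaceBackupControlPlane(master, backup ControlPlane, image string) error {
    isMaster, err := backup.IsMaster()
    if err != nil {
        return err
    }
    if isMaster {
        return errors.New("refusing to offline the master control plane")
    }
    if err := master.CopyImage(image); err != nil {
        return err
    }
    if err := backup.Offline(); err != nil {
        return err
    }
    // ... technician physically replaces the board here ...
    if err := backup.Online(); err != nil {
        return err
    }
    return backup.ToggleMastership()
}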
Automation, before and after
After automation, the system looked quite different. Before we automated this workflow, there was a lot of manual work. When an alert came in, an engineer would stop traffic to the device and take the bad component offline by hand. Our network operations center (NOC) team would then work with the vendor, such as Juniper or Cisco, to get a replacement part on site. Next, we would file a change request in our change management system, noting the date of the operation.
On the day of the operation:
- The data center technician clicks “start” in the change management system to begin the repair.
- Our system picks up this change and prepares to run the repair.
- The technician clicks “start” in our UI.
- An “offline” state machine proceeds through the steps needed to take the component offline safely.
- The UI notifies the technician at each step along the way.
- Once the state machine has completed, it notifies the technician, who can then safely replace the component.
- Once the component is replaced and re-cabled, the technician returns to the UI and starts the “online” state machine, which safely returns the component to production.
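A minimal sketch of how such a step-by-step state machine might be structured, where Step, runStateMachine, and notify are illustrative names rather than the production system’s:

// Step is one stage of the repair, such as "drain traffic" or
// "offline line card".
type Step struct {
    Name string
    Run  func() error
}

// runStateMachine executes steps in order, pushing a UI notification
// after each transition and halting on the first failure so a human
// can take over.
func runStateMachine(steps []Step, notify func(msg string)) error {
    for _, s := range steps {
        notify("starting: " + s.Name)
        if err := s.Run(); err != nil {
            notify("failed: " + s.Name + ": " + err.Error())
            return err
        }
        notify("done: " + s.Name)
    }
    return nil
}

In this framing, the “offline” and “online” flows are simply two different slices of steps operating on the same Component.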
Automation lessons learned
Tips for reducing toil through automation
- Measure your toil.
- Tackle the biggest sources of toil first, and don’t try to solve all problems at once.
- Carefully consider whether to enhance existing tools or build new ones. Even if you can partially repurpose another team’s work, would creating a tool from scratch actually make more sense cost- or resource-wise?
- Take a design-driven approach: start small and iterate quickly, rather than trying to design the perfect solution from the start.
- Measure your time savings to determine your return on investment.