103. Chaos Engineering

Hosted by Rick Newman, with guest Mikolaj Pawlikowski.

Chaos engineering is a way of testing your software predicated on the fact that something in your system, at some point, will break. By intentionally causing disruptions--for example, dropping network connections--and observing how your system responds, you'll be better prepared for when the unexpected happens. Mikolaj Pawlikowski, author of "Chaos Engineering: Crash test your applications," explains the philosophies and best practices behind these resiliency techniques.

Show notes

Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called "Chaos Engineering: Crash test your applications." The theory behind chaos engineering is to "break things on purpose" in your operational flow. You deliberately inject the failures that might occur in production ahead of time, so that you can anticipate them and implement workarounds and corrections. The practice is most often used for large, distributed systems, because they have many points of failure, but it can be useful in any architecture.

One of the obstacles to embracing chaos engineering is getting buy-in from teammates, or even managers. Even after a feature is complete and the unit tests are passing, it can be difficult to convince someone that resiliency work needs to continue, because preparing for a disaster has no visible or tangible benefit. Mikolaj suggests clearly laying out for colleagues the ways a system can fail, and the impact each failure can have on the application or business. Rather than fearmongering, it can be useful to point to other companies' availability issues as cautionary tales. Mikolaj also says that chaos engineering doesn't need to focus solely on complicated problems like race conditions across distributed systems. Often, there's enough low-hanging fruit, such as disk space running out or an API timing out, that is worth fixing.
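As a rough illustration of the "API timing out" kind of low-hanging fruit, here is a minimal Python sketch of a fault-injection experiment. Everything in it is hypothetical and not taken from the episode or from any real tool: `FlakyDependency`, `fetch_price`, `get_price_with_fallback`, and `run_experiment` are made-up names, and the "API" is a local stand-in function. The idea is only to show the shape of an experiment: inject timeouts at a chosen rate, then check that the calling code still returns a well-formed answer.

```python
import random


class FlakyDependency:
    """Wraps a callable and injects timeouts at a configurable rate (hypothetical helper)."""

    def __init__(self, func, failure_rate, rng=None):
        self.func = func
        self.failure_rate = failure_rate
        self.rng = rng if rng is not None else random.Random()

    def __call__(self, *args, **kwargs):
        # Inject the failure before the real call, as if the request never got through.
        if self.rng.random() < self.failure_rate:
            raise TimeoutError("injected timeout")
        return self.func(*args, **kwargs)


def fetch_price(item):
    """Stand-in for a real API call (hypothetical)."""
    return {"item": item, "price": 42}


def get_price_with_fallback(api, item, retries=2):
    """Code under test: retry on timeout, then fall back to a degraded answer."""
    for _ in range(retries + 1):
        try:
            return api(item)
        except TimeoutError:
            continue
    return {"item": item, "price": None, "degraded": True}


def run_experiment(failure_rate, calls=1000, seed=0):
    """Return the fraction of calls that produced a well-formed response."""
    api = FlakyDependency(fetch_price, failure_rate, random.Random(seed))
    ok = 0
    for _ in range(calls):
        result = get_price_with_fallback(api, "book")
        if "price" in result:
            ok += 1
    return ok / calls
```

Calling `run_experiment(0.3)` simulates a dependency that times out on roughly 30% of calls. In a real experiment the interesting part is what you measure: here the assumption being verified is simply that callers always get a response containing a `price` key, even when the dependency is fully down.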

The chaos engineering mindset can also extend beyond pure software. If you think about people working across different timezones as a distributed system, you can also plan for failures in communication before they occur. Everyone works at a different pace, and communication issues can be analogous to a network loss. Rather than fixing miscommunications after they occur, establishing shared practices (like documenting every meeting, or setting up playbooks) can go a long way toward ensuring that everyone can do their best under changing circumstances.

Links from this episode

Mikolaj's book is called Chaos Engineering: Crash test your applications -- get a 40% discount using the code podish19
powerfulseal is a testing tool for Kubernetes clusters
Mikolaj distributes the Chaos Engineering Newsletter
Conf42 is a conference focusing on high-level computer science
ChaosConf is the world’s largest Chaos Engineering event
Awesome Chaos Engineering is a curated list of Chaos Engineering resources

Transcript

Rick: Hello, and welcome to the Heroku Code[ish] podcast. I'm your host today, Rick Newman, and I am here today with Mikolaj Pawlikowski, who has an upcoming book, Chaos Engineering: Site Reliability Through Controlled Disruption. Miko, thank you so much for joining us. And I wonder if you could just talk a little bit about yourself and a little bit about your upcoming book.

Mikolaj: Sure. I'm really happy to be here. Thanks for hosting me. Like I said, I just finished my book. It's called Chaos Engineering: Crash Test Your Application. I think they're going to change the title before it goes to print, but that's the temporary title for now. For those of you who have never heard of chaos engineering, you might have heard of things like chaos monkey. And probably if you Googled the term, you're going to come up with some kind of slogans like breaking things on purpose and stuff like that. But I guess chaos engineering is just a practice of experimenting on a system, and that system can be anything. It can be big, it can be massive. It can be tiny. You typically hear about the big ones because in distributed systems, there's just more stuff that can go wrong. And the practice of experimenting to increase the likelihood of things recovering the way that you want them to recover, and uncovering the things that don't recover the way you want them to recover, is basically what we do with chaos engineering.

Mikolaj: So it's the deliberate practice of injecting the kind of failure that the real world is likely to inject into your system, to verify that your assumptions are correct. It's a really fine discipline, a lot of fun.
