103. Chaos Engineering

 3 years ago
source link: https://www.heroku.com/podcasts/codeish/103-chaos-engineering?utm_campaign=changelog-news

Hosted by Rick Newman, with guest Mikolaj Pawlikowski.

Chaos engineering is a way of testing your software that starts from the premise that something in your system will, at some point, break. By intentionally causing disruptions (for example, dropping network connections) and observing how your system responds, you prepare yourself for when the unexpected happens. Mikolaj Pawlikowski, author of "Chaos Engineering: Crash test your applications," explains the philosophies and best practices behind these resiliency techniques.


Show notes

Rick Newman interviews Mikolaj Pawlikowski, who recently wrote a book called "Chaos Engineering: Crash test your applications." The theory behind chaos engineering is to "break things on purpose" in your operational flow. You deliberately inject, ahead of time, the failures that might occur in production, so that you can anticipate them and implement workarounds and corrections. The practice is most often used for large, distributed systems, because they have many points of failure, but it can be useful in any architecture.
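As a rough illustration of injecting a failure on purpose, the Python sketch below wraps a dependency call so that a configurable fraction of calls fail, then checks how a caller with retries behaves. It is only a sketch under stated assumptions: every name in it (`flaky`, `fetch_profile`, `fetch_with_retry`, `run_experiment`) is hypothetical and does not come from the episode or the book.

```python
import random


def flaky(func, failure_rate, rng):
    """Wrap func so a fraction of calls raise ConnectionError (the injected fault)."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise ConnectionError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def fetch_profile(user_id):
    # Hypothetical stand-in for a real network call.
    return {"id": user_id, "name": "example"}


def fetch_with_retry(fetch, user_id, attempts=3):
    """Code under test: retry a few times, then fall back to a degraded answer."""
    for _ in range(attempts):
        try:
            return fetch(user_id)
        except ConnectionError:
            continue
    return {"id": user_id, "name": None}  # degraded but still a valid response


def run_experiment(failure_rate, calls=100, seed=0):
    """Run the caller against the faulty dependency and count degraded responses."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    chaotic_fetch = flaky(fetch_profile, failure_rate, rng)
    results = [fetch_with_retry(chaotic_fetch, i) for i in range(calls)]
    degraded = sum(1 for r in results if r["name"] is None)
    return len(results), degraded
```

Calling `run_experiment(0.3)` returns the number of responses and how many of them were degraded. The point of the experiment is that every call still returns a response, even when the dependency fails on every attempt.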

One of the obstacles to embracing chaos engineering is getting buy-in from teammates, or even managers. Once a feature is complete and the unit tests are passing, it can be difficult to convince someone that resiliency work needs to continue, because preparing for a disaster has no visible or tangible benefit. Mikolaj suggests clearly laying out for colleagues the ways a system can fail, and the impact each failure can have on the application or the business. Rather than fearmongering, it can be useful to point to other companies' availability issues as cautionary tales. Mikolaj also says that chaos engineering doesn't need to focus solely on complicated problems like race conditions across distributed systems. Often there's enough low-hanging fruit worth fixing, such as disk space running out or an API timing out.
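The "disk space running out" case can be tried without filling a real disk. The Python sketch below uses the standard library's `unittest.mock` to make `open` raise a "no space left on device" error and checks that the code under test reports the failure instead of crashing. The function names (`save_report`, `experiment_disk_full`) and the file name are hypothetical, chosen only for illustration.

```python
import errno
from unittest import mock


def save_report(path, text):
    """Hypothetical code under test: write a report and report failure cleanly."""
    try:
        with open(path, "w") as handle:
            handle.write(text)
        return True
    except OSError as exc:
        if exc.errno == errno.ENOSPC:
            return False  # disk full: signal the caller instead of crashing
        raise  # any other I/O error is not handled here


def experiment_disk_full():
    """Inject a 'no space left on device' error and observe the result."""
    full_disk = OSError(errno.ENOSPC, "No space left on device")
    # While the patch is active, every call to open() raises the injected error,
    # so no file is actually created.
    with mock.patch("builtins.open", side_effect=full_disk):
        return save_report("report.txt", "hello")
```

Here `experiment_disk_full()` returns `False`, which shows that the write path degrades gracefully. An API timeout could be simulated the same way by patching the client call to raise a timeout exception.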

The chaos engineering mindset can also extend beyond pure software. If you think of people working across different timezones as a distributed system, you can prepare for failures in communication before they occur. Everyone works at a different pace, and communication issues can be analogous to network loss. Rather than fixing miscommunications after they occur, establishing shared practices (like writing up notes for every meeting, or setting up playbooks) can go a long way toward ensuring that everyone can do their best under changing circumstances.

Transcript

Rick: Hello, and welcome to the Heroku Code[ish] podcast. I'm your host today, Rick Newman, and I am here today with Mikolaj Pawlikowski, who has an upcoming book, Chaos Engineering: Site Reliability Through Controlled Disruption. Miko, thank you so much for joining us. And I wonder if you could just talk a little bit about yourself and a little bit about your upcoming book.

Mikolaj: Sure. I'm really happy to be here. Thanks for hosting me. Like I said, I just finished my book. It's called Chaos Engineering: Crash Test Your Applications. I think they're going to change the title before it goes to print, but that's the temporary title for now. For those of you who have never heard of chaos engineering, you might have heard of things like Chaos Monkey. And probably if you Googled the term, you're going to come up with some kind of slogans like "breaking things on purpose" and stuff like that. But I guess chaos engineering is just a practice of experimenting on a system, and that system can be anything. It can be big, it can be massive. It can be tiny. You typically hear about the big ones because in distributed systems, there's just more stuff that can go wrong. And the practice of experimenting to increase the likelihood of things recovering the way that you want them to recover, and uncovering the things that don't recover the way you want them to recover, is basically what we do with chaos engineering.

Mikolaj: So it's the deliberate practice of injecting the kind of failure that the real world is likely to inject in your system, to verify that your assumptions are correct. It's a really fine discipline, a lot of fun.
