Don't Do This in Production

Around March of 2017, I received a call asking for a code review on a product about to be launched. This company had issues with memory leaks, spontaneous crashing, slow loading, CPU spiking, and had to release in a couple of weeks. You might have heard this story before, just not from me, and not about this company. It’s surprisingly common.

We got together on the weekend and started looking through the code. It took about half a day to discover the source of the known problems, and another half day to write up a document for their engineering team to fix them. The launch succeeded, but it made me wonder how the product ever got to that place.

When I chatted with the developers, they seemed like intelligent people. The only clear issue was lack of experience, which they could really only solve by continuing to build and grow. I’ve run into that before. It’s common, and I think it’s healthy – well, most of the time.

In this case, however, it had a nefarious twist: all of these developers lacked experience.

The department that built the product had recently come into existence, and they hired a team of developers without having a technical person on staff to vet them. It’s difficult enough for a technical person to vet a developer – I can’t even imagine vetting a candidate without having a technical background. They hired the first developer, and he vetted the second developer, and so on until they had a development team.

If you’re lucky enough for your first developer to have significant experience and a desire to mentor, then you’re golden. If you’re unlucky, however – and it’s very easy to be unlucky at something like this – then you may end up with a very fast moving team that builds very fragile software.

“Move fast and break things,” they said. It turns out that’s a pretty bad idea when your business relies on a small number of large customers. Broken products tend to scare them off, which in turn tanks your business. There’s a lot to be said for building things that work, but “move slowly and steadily towards a goal” just doesn’t have the same ring.

In reality, there’s a balance between moving fast and moving slow. It’s difficult to communicate that balance because every type of product demands a different one. I suppose that intuition comes from experience, which is a terrible answer for someone trying to learn.

What’s a new developer to do?

The natural tendency seems to be asking the internet. It turns out that this is incredibly effective.

It’s also incredibly dangerous. Before I go any further, I’ll continue my story.

This company continued to work with me after that product launch. I reviewed a significant amount of code, helped mentor their developers, and built new projects for them. Everything went swimmingly.

One day, I ran into a section of code that triggered my spidey sense. I could have sworn that I had seen it before. Sure enough, after pasting a line into a search engine, I found the exact section of code in a blog post. Naturally I read the whole thing, right up to the line that said, “Don’t do this in production.”

Yet here it was, tipping its hat at me from the front lines of a production codebase.

It didn’t take long to find many sections of code from similar blog posts. Almost all of the blog posts either wrote a disclaimer or should have written one. They all solved one small piece of a problem, but took many liberties in their solution to make it simpler to read. It’s understandable. Most readers appreciate brevity when learning a concept.

The code from these blog posts had spread through the codebase like a disease, scattering issues here and there without any rhyme or reason. And there wasn’t any obvious cure other than to read everything manually and fix issues as I went along. Without unit tests or automated deployments, this took almost a year. I’m almost certain the cost of fixing the code exceeded the margin on revenue due to writing it in the first place.

But what other option did these developers have? They had to deliver something, and they had never released a production application before. So they did what any sensible person would try to do, and they learned on the job.

Production systems can fail in an incredible number of ways. Without having experienced or read about these failures, it’s difficult to have an intuition about how to prevent them or how to solve them. It’s a tall order to ask a new development team to do this, especially without any guidance.

Before going any further, I want to mention that every person involved in this mess had good intentions. The developers who wrote the code wanted to build a good product and improve themselves. The managers who hired them wanted the same thing. The blog post writers wanted to share useful solutions. Everyone did their best to help one another out, and it’s important to remember that.

This wasn’t a problem with people.

I have an overwhelming empathy for developers in this position. They have more information than they will ever need, but it’s completely disorganized. It’s like trying to build a ten-piece puzzle, except you have to find the ten pieces within a pile of 10,000,000,000 pieces, all of which are square, and you don’t know what it’s supposed to look like at the end. Good luck.

If you read this far hoping for an answer, then I’m sorry: I don’t have a simple one. This is a difficult problem to solve. The solution is too large for a single blog post, changes every day, and differs subtly for every project.

This problem motivated me to start blogging. I’ve been blessed to have learned for almost two decades from incredibly talented mentors, writers, and coworkers. Without the advice from these people, I would still be writing GOTO statements in QBasic (shudder). It’s time for me to take the ball and run with it.

I’ll summarize with this:

This blog is about building production-ready applications. It will cover every aspect: infrastructure automation, testing, design, debugging, documentation, development process, and security. Every post will stand on its own feet, ready to use in the real world – ready to use in production.

Thanks for reading! Please leave a comment if you have one, or a request for a post topic, or any suggestions for how to improve.
