How And Why We Switched from Erlang to Python (2011)

A core component of Mixpanel is the server that sits at http://api.mixpanel.com . This server is the entry point for all data that comes into the system – it’s hit every time an event is sent from a browser, phone, or backend server. Since it handles traffic from all of our customers’ customers, it must manage thousands of requests per second, reliably. It implements an interface we’ve spec’d out here, and essentially decodes the requests, cleans them up, and then puts them on a queue for further processing.

Because of these performance requirements, we originally wrote the server in Erlang (with MochiWeb ) two years ago. After two years of iteration, the code has become difficult to maintain. No one on our team is an Erlang expert, and we have had trouble debugging downtime and performance problems. So, we decided to rewrite it in Python, the de-facto language at Mixpanel.

Given how crucial this service is to our product, you can imagine my surprise when I found out that this would be my first project as an intern on the backend team. I really enjoy working on scaling problems, and the cool thing about a startup like Mixpanel is that I got to dive into one immediately. Our backend architecture is modular, so as long my service implemented the specification, I didn’t have to worry about ramping up on other Mixpanel infrastructure.

Libraries and Tradeoffs

The first thing to think about is the networking library and framework to use. This server needs to scale, which for Python means using asynchronous I/O. At Mixpanel, we use eventlet pretty widely [1], so I decided to stick with that. Furthermore, since the API server handles and responds with some interesting headers, I decided to use eventlet’s raw WSGI library. Eventlet is actually built to emulate techniques pioneered by Erlang. Its “green threads” are pretty similar to Erlang’s “actors.” The main difference is that eventlet can’t influence the Python runtime, but actors are built into Erlang at a language level, so the Erlang VM can do some cool stuff like mapping actors to kernel threads (one per core) and preemption. We get around this problem by launching one API server per core and load balancing with nginx.

Another thing to think about is the JSON library to use. Erlang is historically bad at string processing, and it turns out that string processing is very frequently the limiting factor in networked systems because you have to serialize data every time you want to transfer it. There’s not a lot of documentation online about mochijson’s performance, but switching to Python I knew that simplejson is written in C, and performs roughly 10x better than the default json library.

Finally, we use a few stateful, global data structures to track incoming requests and funnel them off to the right backend queues. In Erlang, the right way to do this is to spawn off a separate set of actors to manage each data structure and message pass with them to save and retrieve data. Our code was not set up this way at all, and it was clearly crippled by being haphazardly implemented in a functional style. It was quick and easy for me to implement a clear and easy-to-use analog in Python, and I was able to enhance some of the backing algorithms in the process. It might not sound “cool” to implement a data structure this way, but I was able to provide some important operations in constant time along with other optimizations that were cripplingly slow in the Erlang version.

Performance

Of course, a major concern with Python is performance. We receive a few thousands of requests per second, and it’s important that the API server can handle spikes in traffic. As a startup, we also need to be careful about our server costs. We could buy several servers to power the API and scale horizontally, but if we can write a fast enough server to begin with, that’s a waste of money. The optimizations I made were relatively minor, and most of my speed came from leveraging the right Python libraries. The community is extremely active, so many of my questions were already answered on Stack Overflow and in eventlet’s documentation.

Benchmarking

The setup I used to benchmark was pretty simple. I ran the API server (it only used a single core) and kestrel (queue) on the same machine, with 4 GB of RAM and a 2000 MHz AMD Opteron processor. The 4 client processes ran together on a quad-core machine with 1 GB of RAM and the same type of CPU. Each client had 25 eventlet green threads and ran requests randomly from access logs I pre-processed. Everything was running on Rackspace. Here are the results:

jY3mym.png!web YBjuMn.png!web

As you can see, we maintained roughly 1000-1200 requests per second (on a single core), with a latency of almost always less than 100 milliseconds. We plan to deploy the API on a quad-core machine with 4 GB of RAM, so these numbers are definitely fast enough.

A Little Treat

If you read our API spec, you’ll notice that we return ‘0’ on failure and ‘1’ on success. The failure could be anything from invalid base-64 encoding to an error on our end. Developers writing their own clients have complained about this, and we listened. With the rewrite you can set “verbose=1” in the query string and we’ll respond with a helpful error message if your request fails. We’ll post again when this feature is fully live.

Reflections

I’ve written a few servers like this as personal projects, but I’ve never gotten the opportunity to throw one against real scale. With actual traffic, the challenges are much different, and unlike the prototypes I’ve written before, the code has to work. Our customers rely on the API server reliably storing their requests, and we need to recover from every type of possible failure. The biggest challenge for me was pushing the server from working 99.9% of the time to 99.99% of the time, because those last few bugs were especially hard to find.

I’ve learned a lot about how to scale a real service in the couple of weeks I’ve been here. I went into Mixpanel thinking Erlang was a really cool and fast language, but after spending a significant amount of time sorting through a real implementation, I understand how important code clarity and maintainability is. Scalability is as much about being able to think through your code as it is about systems-level optimizations.

If you’re itching to learn more about real-world scaling problems, then you should work here. Mixpanel offers the unique combination of a fun startup environment with the scaling challenges that companies like Facebook and Twitter face every day.

[1] The other popular one is gevent. Avery wrote about some of the tradeoffs earlier: http://code.mixpanel.com/2010/10/29/gevent-the-good-the-bad-the-ugly/

Libraries and Tradeoffs

Performance

Benchmarking

A Little Treat

Reflections

Recommend

CMU Computer Systems: Self-Grading Lab Assignments

如何看待细谷伸之被踢出《一拳超人》动画团体一事？ - 知乎

快速学习时序图：时序图简介、画法及实例

仅需三步，从零开始找出典型用户

产品分析报告 | 大姨妈，从不止于经期管理

产品分析 | TapTap App，从游戏分发到玩家社交的发展与困境

微软设计师干货分享：3大App消息推送模式及适用场景

从积累到衰退，产品生命周期的一般性规律

解析“淘集集”的 “关系链增长”：历时20天，DAU从0到400万

怼了大学学生会会有什么后果？ - 知乎

About Joyk