The AWS and MongoDB Infrastructure of Parse: Lessons Learned

This is the extended form of a comment that got some interest on Hacker News. After a grace period of one year, Parse is now offline. This post collects lessons learned and technical decisions that might be useful for other companies running cloud services. At the very least, they directly affect the design of our own Backend-as-a-Service, Baqend.

So here are some facts and trivia that are not well known or published, which I collected by talking to Parse engineers who now work at Facebook. As I am unsure whether they were allowed to share this information, I will not mention them by name.

Users and Traction

1 million apps were deployed to Parse.
The largest Parse app had 40M users.

Parse was one of the world’s largest MongoDB users.
Clash of Kings used Parse for push notifications and made up roughly half of all pushes that went through Parse. They never moved any other parts to Parse, due to scalability concerns.
The original reason for Facebook to acquire Parse was to push its mobile SDKs and to create synergies with mobile ads. Parse was often sold as a package deal with Facebook advertising.
The static pricing model, measured in guaranteed requests per second, did not work well.
Business problem: people tended to remain in the free tier.
Technical problem I: complicated rate limiting. If the limit was exceeded by a factor of 60 for a minute, requests were dropped. Limits were tracked using a shared Memcache instance. Consequence: when developers experienced rate limits in the API, they added retries. The retries put enormous load on the Parse backend.
Technical problem II: the real bottleneck was not the API servers but almost always the shared MongoDB database cluster.
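
To see why the retries hurt, here is a minimal sketch of a per-minute request budget kept in a shared counter store, plus clients that immediately retry rejected requests. It is not Parse's actual implementation: reading "a factor of 60 for a minute" as a budget of 60 times the per-second limit per one-minute window is an assumption, the dictionary merely stands in for the shared Memcache instance, and all names (FixedWindowLimiter, requests_reaching_api) are made up for illustration.

```python
from collections import defaultdict


class FixedWindowLimiter:
    """Per-app request budget for each one-minute window.

    `counters` stands in for the shared Memcache instance (hypothetical layout).
    """

    def __init__(self, requests_per_second):
        # Assumption: the per-second limit is enforced as a per-minute budget.
        self.budget_per_minute = requests_per_second * 60
        self.counters = defaultdict(int)

    def allow(self, app_id, now_seconds):
        window = int(now_seconds // 60)
        key = (app_id, window)
        # Every request, accepted or not, costs one counter increment.
        self.counters[key] += 1
        return self.counters[key] <= self.budget_per_minute


def requests_reaching_api(logical_requests, max_retries, limiter, app_id="app"):
    """Count the requests that hit the API within a single window when each
    rejected request is retried immediately, up to `max_retries` times."""
    arrived = 0
    for _ in range(logical_requests):
        for _attempt in range(1 + max_retries):
            arrived += 1
            if limiter.allow(app_id, now_seconds=0):
                break  # accepted, no further retries needed
    return arrived


without_retries = requests_reaching_api(1000, 0, FixedWindowLimiter(10))
with_retries = requests_reaching_api(1000, 3, FixedWindowLimiter(10))
```

With a limit of 10 requests per second (600 per minute), 1,000 logical requests in one window stay at 1,000 requests without retries, but turn into 2,200 requests once every client retries up to three times. None of the retries succeed, yet each of them still has to be counted and rejected.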

Parse Server

The server was initially written in Rails (with a maximum concurrency of 24 threads) and had very little throughput per server (~15–30 requests per second).
The server was later rewritten in Go. The open-source Parse Server is written in Node.js and lacks many features of the original Parse server in Go.
The backend ran completely on Amazon Web Services.
There was a plan to migrate Parse to Facebook’s infrastructure (e.g. Haystack, Tao, F4, Extended Apache Giraph, Gorilla), but the project was abandoned.
Roughly 8 developers worked on the SDKs, 8 on the server, and 8 on DevOps, plus a few more engineers.
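
As a rough sanity check on the Rails numbers, Little's law (concurrency = throughput × time in the system) gives a feel for how long a request occupied a thread. The sketch below assumes that all 24 threads were busy at the quoted throughput, which none of the engineers confirmed, so the result is an upper bound at best. The function name is made up.

```python
def mean_seconds_per_request(busy_threads, requests_per_second):
    """Little's law: L = lambda * W, rearranged to W = L / lambda."""
    return busy_threads / requests_per_second


# Assumption: all 24 Rails threads are occupied at the quoted throughput.
slowest = mean_seconds_per_request(24, 15)  # at 15 requests per second
fastest = mean_seconds_per_request(24, 30)  # at 30 requests per second
```

Under that assumption, a request held a thread for roughly 0.8 to 1.6 seconds on average.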

Database

>40 MongoDB Replica Sets with 3 nodes each

Storage engine: RocksDB (i.e. MongoRocks), an append-only engine based on log-structured merge trees (similar to e.g. Cassandra, HBase, CouchDB, LevelDB, WiredTiger, TokuDB). Reason: it handles many collections better, in contrast to WiredTiger, which uses one file for each collection. Compression was better by a factor of 2–3 in terms of space. Writes and replication were also more efficient in terms of latency and lag. The move from MMap to MongoRocks was done by adding a replica running MongoRocks that was later promoted to be the new master.
Used only instance storage with SSDs, no EBS.
No sharding: each tenant was mapped statically to exactly one replica set using MongoDB’s primary database logic.
The Mongo write concern was 1 (!), i.e. writes were confirmed before they were replicated. Some people complained about lost data and stale reads.
Slave reads were allowed for performance reasons.
Partial updates were problematic, as small updates to large documents suffered from “write amplification” when being written to the oplog.
Frequent (daily) master re-elections on AWS EC2. Rollback files were discarded, which led to data loss.
Developed a special “flashback” tool that recorded workloads, which could later be replayed for internal load and functional testing.
JavaScript ran in a forked V8 engine to enforce a 15-second execution limit for user-provided code.
No sharding automation: manual, error-prone process for largest customers
Indexing was not exposed: indexes were generated automatically by rules from slow query logs. This did not work well for larger apps.
Slow queries were killed by a cron job that polled Mongos currentOp and maintained a limit per (API-key, query template) combination.
Backups: if important customers lost data due to human error, Facebook engineers would manually recover it from periodic backups
The object-level ACL system was highly inefficient. Numerous indexes were required that could sometimes surpass the actual data size by a factor of 3–4.
As there was no mechanism for concurrency control (except for minimal support for things like counters), applications were often inconsistent
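
The combination of write concern 1 and frequent re-elections from the list above is what made data loss possible. The following toy model illustrates the mechanism. It is not MongoDB code, and all class and function names are hypothetical: a replica set is reduced to three in-memory logs, a write is acknowledged once `w` nodes hold it, and a failover elects the most up-to-date secondary and rolls back whatever only the old primary had.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    oplog: list = field(default_factory=list)


class ToyReplicaSet:
    def __init__(self, size=3):
        self.nodes = [Node(f"node{i}") for i in range(size)]
        self.primary = self.nodes[0]

    def secondaries(self):
        return [n for n in self.nodes if n is not self.primary]

    def write(self, doc, w=1):
        """Append to the primary and acknowledge once `w` nodes hold the write."""
        self.primary.oplog.append(doc)
        needed = len(self.nodes) // 2 + 1 if w == "majority" else w
        # The primary counts as one acknowledgement; copy to the rest.
        for node in self.secondaries()[: needed - 1]:
            node.oplog = list(self.primary.oplog)
        return True

    def replicate(self):
        """Asynchronous replication catching up on all secondaries."""
        for node in self.secondaries():
            node.oplog = list(self.primary.oplog)

    def fail_over(self):
        """Elect the most up-to-date secondary and roll back the old primary.

        Returns the entries only the old primary had, i.e. the content
        that would end up in rollback files.
        """
        old = self.primary
        new = max(self.secondaries(), key=lambda n: len(n.oplog))
        rolled_back = old.oplog[len(new.oplog):]
        old.oplog = list(new.oplog)
        self.primary = new
        return rolled_back


def lost_after_failover(w):
    rs = ToyReplicaSet()
    rs.write({"_id": 1}, w=w)
    rs.replicate()
    rs.write({"_id": 2}, w=w)  # acknowledged to the client
    return rs.fail_over()      # election happens before async replication runs
```

In this model, `lost_after_failover(1)` returns the second document: the client got a confirmation, but the write only ever existed on the old primary. With `lost_after_failover("majority")` nothing is rolled back, at the price of waiting for a secondary on every write. Discarding the rolled-back entries, as described above, turns the first case into permanent data loss.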

What Parse should have done differently

Parse did a lot of things right. The documentation was great, the mobile SDKs were solid and the web UIs well-designed. However, they had an unspoken value system of not trusting their users to deal with complex database and architectural problems.

Coming from a database background, our idea is that developers should know about details such as schemas and indexes (the Parse engineers strongly agreed in hindsight). Also, we think that backend services are not limited to mobile apps but very useful for the web.

I think that providers should be open about their infrastructure and trade-offs, something Parse only did after it had already failed.

If this idea sounds interesting to you, have a look at Baqend. It is a high-performance BaaS that focuses on web performance through transparent caching and scalability through auto-sharding and polyglot persistence.

We strongly believe that architecture should not be a secret.

