
VRChat’s New Years 2021 — or, what the $%@& was that?

 3 years ago
source link: https://medium.com/vrchat/vrchats-new-years-2021-or-what-the-was-that-d84334789f77

If you’re unaware, we had a bit of an issue at what is probably one of the most popular times to hang out in VRChat, and traditionally the time when we set concurrent user records: New Year’s Eve, right around midnight EST. VRChat went down. Great timing, yeah? Let’s talk about what happened.

First off, before the issues started, we were chugging along with no problem. In fact, we shattered our previous concurrent player record, with a mind-boggling 40,000+ players online at the same time. Our DevOps team had made sure that our servers were buffed up and ready for a huge boost in players over the holidays. Both our API servers and our real-time networking servers were reporting green across the board. Everything was looking good!

About twenty minutes to midnight (EST), people suddenly were unable to get any file assets from our servers (no avatars, worlds, icons, nothin’). Our API stopped talking to people. This meant that your menu wouldn’t work, if you left the world you were in you were stuck in limbo, and the client thought that your internet was dead because it couldn’t see the configuration file it needs to get every time it starts up. Several team members noticed something was wrong in-app very quickly. Players who had been partying seconds before started alt-tabbing over to Discord. Our chat mods began sweating, excessively and immediately.

Our first thought was that our services were having problems keeping up with the numbers we had. However, every single metric, statistic, alarm, siren, alert, and klaxon we had set on our own services was saying that things were fine, and had been fine, and our servers were wondering where all those cool people had gone. That wasn’t it.

So, VRChat uses a bunch of services to ensure that our servers are safe and protected from things like DDoS attacks and various other security concerns.

One of these services had, unknown to us, set a hard limit on the number of requests that we could receive per second. From what we have found out, there was no way we could have known what that limit was, and no way we could have set it higher without their help.

When we passed that number (around 10 minutes before problems started being widely reported), our security partner assumed that we were being hit by a denial of service attack, and started taking automated measures. Of course, we weren’t being attacked — we were just quite popular.

As users attempted to log in, get their social lists, join worlds, switch avatars, and do all the things that VRChat does, the automated system began to decide that because all these requests were happening over that hard limit, they were now all bad requests. Legitimate users turned into attackers in the eyes of our security partner’s automated systems.

The automated system’s response was to immediately shut down all traffic to our systems. Obviously, this is not the correct response to things being totally fine, but very busy.
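To make that failure mode concrete, here is a toy model of a hard per-second request cap that, once exceeded, treats all further traffic as an attack. This is a sketch for illustration only: the class `HardLimitGuard`, the `simulate` helper, and the limit of 100 requests per second are all made up, and this is not how our security partner's system is actually implemented.

```python
from dataclasses import dataclass


@dataclass
class HardLimitGuard:
    """Toy model: counts requests per one-second window and trips for good once the cap is passed."""

    limit_per_second: int   # hypothetical cap; the real value is not public
    tripped: bool = False   # once True, every request is rejected
    _window: int = -1       # the whole second currently being counted
    _count: int = 0         # requests seen so far in that second

    def allow(self, timestamp: float) -> bool:
        if self.tripped:
            return False    # "attack" mode: legitimate traffic is dropped too
        second = int(timestamp)
        if second != self._window:
            # A new one-second window starts, so the counter resets.
            self._window, self._count = second, 0
        self._count += 1
        if self._count > self.limit_per_second:
            self.tripped = True   # busy traffic is now treated as hostile
            return False
        return True


def simulate(limit, requests_per_second):
    """Return how many requests were served in each second of the simulation."""
    guard = HardLimitGuard(limit_per_second=limit)
    served = []
    for second, n in enumerate(requests_per_second):
        # Spread the n requests evenly inside this second.
        served.append(sum(guard.allow(second + i / (n + 1)) for i in range(n)))
    return served


if __name__ == "__main__":
    # Traffic climbs past the cap in the third second and never recovers.
    print(simulate(100, [80, 95, 120, 60]))
```

In this model the first two seconds are served in full, the third is cut off at the cap, and the fourth gets nothing at all, even though its load is well under the limit. That last part is the behavior that hurt: once the guard trips, being "very busy" and being "under attack" look identical.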

We were able to implement a short-term solution to get services back online within an hour and a half.

On any other day, 90 minutes of downtime wouldn’t be too awful, but 90 minutes on this particular day meant that a ton of people hanging out with their friends waiting on the EST and CST New Years missed out on their turn to see the ball drop. Considering the state of things, missing out on your New Years countdown with your friends made a good number of people understandably frustrated.

Going forward, we have established a new, much higher set of limits with our security partner. We’ll be aware if we begin to approach that limit again, and can warn them in advance. There are some things the VRChat client can do to help improve this behavior as well, and those changes will be going out with the next few releases.
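Knowing when you are approaching a limit can be as simple as comparing the observed request rate against the known cap and warning before you reach it. The sketch below is a minimal illustration under assumed numbers: the function name `headroom_alert` and the 80% warning threshold are hypothetical and do not describe our actual monitoring.

```python
def headroom_alert(observed_rps, limit_rps, warn_fraction=0.8):
    """Classify the current request rate relative to a known limit.

    observed_rps:  requests per second currently being received
    limit_rps:     the agreed limit (hypothetical value supplied by the caller)
    warn_fraction: fraction of the limit at which to start warning (assumed 0.8)
    """
    if limit_rps <= 0:
        raise ValueError("limit_rps must be positive")
    usage = observed_rps / limit_rps
    if usage >= 1.0:
        return "over"   # at or past the limit
    if usage >= warn_fraction:
        return "warn"   # close enough to contact the partner in advance
    return "ok"


if __name__ == "__main__":
    for rps in (50, 80, 100):
        print(rps, headroom_alert(rps, limit_rps=100))
```

The point of warning below 100% is lead time: raising a limit like this requires the partner's help, so the alert has to fire early enough for a human to act on it.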

Finally, we are looking into ways we can improve our security detection and response systems. Our traffic patterns can look quite unusual next to other applications, so we’ll work hard to ensure our relationship with our security partner takes that into account.

