
Hosting a GPT-2 autoresponder bot

source link: https://nostalgebraist.tumblr.com/post/190086745889/in-case-anyone-was-wondering-maybe-no-one-was

In case anyone was wondering (maybe no one was wondering), here are some verbose details about how I host GPT-2 continuously for my @nostalgebraist-autoresponder bot.

If that sounds boring, keep in mind that this post will also contain some complaints about ML software and Google’s ML software specifically, which generated a lot of ~user engagement~ last time I did it :)

——–

I used to host the GPT-2 part of the bot on my unimpressive laptop, which doesn’t have a CUDA-usable GPU.  This was much less bad than it sounds – it was slow on CPU, but not that slow.  But it did slow down the laptop appreciably during generation, and spin the fans loudly.

The limiting factor there was memory.  I could only support the 774M model, because the 1.5B model literally wouldn’t fit in my RAM all at once.

Time for a digression about training, which will explain how I got to the hosting situation I’m in.

As it happens, memory is also the limiting factor I’ve encountered while training on cloud GPUs and TPUs.  Current cloud offerings make it easy to pay more money for more parallel copies of the same processor, which lets you increase your batch size, but if you can’t even fit a batch of one on a processor then this is not immediately helpful.  You can make it work if you do “model parallelism” rather than “data parallelism,” putting different layers on different devices and passing data back and forth in every pass.
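
Concretely, the kind of thing I mean looks roughly like the toy sketch below: a generic TensorFlow layer stack with the first half of the layers pinned to one device and the second half to another, so that what crosses the device boundary each pass is activations rather than whole copies of the model.  (The layer type, sizes, and device strings are placeholders, not my actual fine-tuning code.)

```python
# Toy sketch of model parallelism, not the actual GPT-2 fine-tuning code:
# the first half of a layer stack goes on one device, the second half on
# another, and activations hop across the boundary once per forward pass.
import tensorflow as tf

def build_split_stack(x, n_layers=48, width=1600,
                      devices=("/device:GPU:0", "/device:GPU:1")):
    h = x
    for i in range(n_layers):
        dev = devices[0] if i < n_layers // 2 else devices[1]
        with tf.device(dev):
            # Stand-in for a transformer block; GPT-2's blocks are fancier.
            h = tf.keras.layers.Dense(width, activation="relu",
                                      name=f"h{i}")(h)
    return h
```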

My first successful attempt at fine-tuning 1.5B used some code I hacked together to do just this, on an AWS EC2 instance.  To my surprise, this wasn’t unusably slow – it was pretty fast! – but it cost a lot of money per unit time.

Then I heard from gwern that you could fine-tune 1.5B for free using the cloud TPUs that Google Colab gives you.  There’s some code out there that shows how to do this, although it’s quite slow because it doesn’t really do data parallelism (IIUC it uses all the memory but only one of the cores?).  I naively said “oh I’ll modify this to do it the right way” and promptly went on a frustrating and unsuccessful little sidequest chronicled in the tensorflow rant linked above.  Then I gave up and went back to using the TPU not for its parallelism but really just for its memory … and price, namely zero.

——–

You may notice a pattern already: seemingly bad, shitpost-like ideas turn out to work fine, while “good” ideas don’t get off the ground.  Here’s another bad idea: what if I continuously hosted the model in Google Colab?  This seems like it shouldn’t be possible: Colab is meant for quick interactive demos, and if you want to do “long-running computations” Google will sell you the same resources for an extremely non-free price.

Well, I tried it and it … uh, it worked fine.  It was a little awkward, because I didn’t want to mess around trying to figure out the IP address or whatever of my Colab machine.  (Maybe this is easy, I never tried.)  Instead, the Colab process only sends requests, once per minute, to a service on my laptop, and these simultaneously fetch any new generation needs from my laptop (carried by the response) while sending my laptop the results of any generations that have completed (carried by the request body).
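
In sketch form, the Colab side is just a once-per-minute loop like the one below.  (The endpoint URL, the JSON fields, and generate_continuation() are made-up placeholders, not the bot’s real names.)

```python
# Rough sketch of the Colab-side loop (names and payload shapes are invented
# for illustration): deliver finished generations in the request body, pick
# up new generation needs from the response, sleep a minute, repeat.
import time
import requests

BRIDGE_URL = "http://my-laptop.example.com/poll"  # hypothetical bridge endpoint

finished = []  # completed generations waiting to be delivered
while True:
    resp = requests.post(BRIDGE_URL, json={"results": finished}, timeout=30)
    finished = []
    for job in resp.json().get("prompts", []):
        text = generate_continuation(job["prompt"])  # hypothetical GPT-2 call
        finished.append({"id": job["id"], "text": text})
    time.sleep(60)
```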

That’s it for the GPT-2 part.  All the interaction with tumblr happens in another service on my laptop; this one posts its generation needs to the aforementioned “Colab bridge service,” the one polled once per minute by Colab.
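
Conceptually the bridge is nothing more than a tiny web service wrapped around an in-memory queue, something like the Flask sketch below.  (Route names and payload shapes are invented for illustration.)

```python
# Conceptual sketch of the "Colab bridge service" (routes and payloads are
# made up): the tumblr service posts jobs in, the Colab poller trades
# finished results for queued jobs, and results get collected later.
from flask import Flask, jsonify, request

app = Flask(__name__)
pending = []    # jobs waiting for the Colab process
finished = {}   # job id -> generated text, waiting to be collected

@app.route("/request_generation", methods=["POST"])
def request_generation():
    pending.append(request.get_json())      # e.g. {"id": ..., "prompt": ...}
    return jsonify({"queued": len(pending)})

@app.route("/poll", methods=["POST"])
def poll():
    for result in request.get_json().get("results", []):
        finished[result["id"]] = result["text"]
    jobs = list(pending)
    pending.clear()                          # hand everything queued to Colab
    return jsonify({"prompts": jobs})

@app.route("/collect/<job_id>", methods=["GET"])
def collect(job_id):
    return jsonify({"text": finished.pop(job_id, None)})
```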

As a sidenote, I now have a fourth service running a BERT model (cheap enough for the laptop) that predicts how many notes a post will get.  These days, the Colab process generates several possibilities for each post, which end up back in the “Colab bridge service,” where the BERT “selector” service gets them in its once-per-minute polls, decides which one is most viral, and sends this selection back to the “Colab bridge service,” which then tells the tumblr-interacting service about it the next time it asks.
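
The selection step itself is about as simple as it sounds, roughly the sketch below, with predict_notes() standing in for the BERT predictor (not its real name):

```python
# Sketch of the selection step: among the candidate generations for a post,
# keep the one the note-count model scores highest.  predict_notes() is a
# stand-in for the BERT-based predictor, not the bot's real API.
def select_most_viral(candidates, predict_notes):
    return max(candidates, key=predict_notes)
```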

The sheer Rube Goldberg zaniness of all this is part of the fun.  It doesn’t even cause many bugs.  (Most bugs are due to me not understanding the tumblr API or something – the tumblr-interaction code is also quite ugly and due for a rewrite.)

——–

Anyway, the important part here is that I’ve got the big GPT-2 running continuously for free.

Well, not precisely: once or twice a day, the Colab notebook will time out (I’ve copy/pasted some JS trick to make this less frequent, but it still happens), and then I have to press a button and go through some Google auth flow to restart it.  Also, once in a blue moon Google will decide not to let me have a TPU for a while, because they’re prioritized for “interactive users” of Colab and not “long-running computations,” which I am advised to do the right way, in Google Cloud, for money.  I get really annoyed whenever this happens (just ask my wife) – unfairly annoyed, since I’m getting something for free – but it’s only happened 2 or 3 times, and each time lasts somewhere between 15 minutes and 24 hours.

These minor inconveniences, as well as the awkwardness of doing things on a weird janky jupyter notebook server, could be avoided if I just graduated from the “free trial” of Colab to the grown-up world of real cloud TPUs.  That would cost … uh … at minimum, about $1K a month.  That’s just the TPU, mind you; I’d also need to pay a smaller amount for use of a “VM.”

This is pretty strange.  What I’m doing is clearly not the intended use of Colab, although I’m not aware of any TOS it violates (only an FAQ that says cryptocurrency mining is disallowed).  As far as I can tell, free TPU usage on Colab is meant as a free trial or demo of how great cloud TPUs are, which will cause you to pay money for them.  Instead, it has taught me two things: that cloud TPUs are actually a $^#!ing pain in the ass to use, and that if you do manage to get them working you should not pay for them because you can get them for free at the cost of some slight awkwardness.

Presumably this was set up on the assumption that usage like mine would be infrequent.  If (when?) someone open-sources code that lets you do all this really easily in script-kiddie fashion, I imagine Google would notice and stop it from being possible.

Even then, the fact that it can exist at all is strange.  Google seems to think I value avoiding some slight inconvenience at $1000/month, and what’s more, they’ve chosen to provide not a free trial of a convenient thing (a tried and true approach) but a free inconvenient version of a convenient thing, forever.  This can’t even sell me on the convenience of the “real” thing, since I’ve never seen it!

And in fact, I value the convenience at less than $0/month, for its absence gives me some little puzzles to solve, and a slight frisson of beating the system when I succeed.

——–

Now for the moral.

As I’ve alluded to in various recent posts, the ML ecosystem of 2020 seems addicted to the idea of fast, free, extremely easy demos.  Everything out there wants to show you how easy it is.  Not to be easy, but to look easy in a demo or tutorial.

For example, Google Colab exists entirely to make demos of machine learning code that run instantly in anyone’s browser.  This is not because anyone thinks people should really write their code in this way.  “Real” use is supposed to cost money, involve configuring an environment and being the sort of person who knows what “configuring an environment” means, not doing everything in a goddamned jupyter notebook, etc.

But, for some reason, we supposedly need demos that can be used outside of the “real use” context.  We need them so badly that Google is willing to provide a basically functioning copy of an entire setup that basically suffices for real use, available instantly on demand to anyone for $0, just so the demos can work.  For people used to doing real things the usual way, various things about the demo setup will be awkward, and certainly you won’t get any official tech support for doing real things inside them, only for doing demos.  But to people used to doing real things, that is no obstacle.

It just doesn’t add up in my head.  If code in Colab is just there for demonstrative purposes and you’re supposed to copy it over to a “real” setting later, then you have to do all the “real” setup anyway.  I guess Colab lets you share code without worrying about how to run it on someone else’s machine, hence “colab”-oratory?  But that’s a research tool that could easily be sold for money, so why make it and the underlying hardware free?  If Google’s cool little demo notebooks of BERT or whatever aren’t “real,” then they don’t teach you anything that a static explainer page wouldn’t.  If they are “real,” then they’re the real thing, for free.

There is way too much ML-related code out there that has been released too early, that hasn’t had enough craftsmanship put into it, that does magic with the press of a button and is usable by a 10-year-old but doesn’t seem to have considered what serious use looks like.  Colab seems in line with this mindset, and designed to produce more of this kind of thing.

My own code to use it with GPT-2, and indeed the entirety of my bot, is terrible as code, and I can’t imagine how to improve it because it’s so coupled to the weirdness of so many other systems designed for the exact contours of the moment, of other people’s hacks to make GPT-2 work, of my hacks to make Colab serving work.

Everything has what I called a “shitpost” feel, like it’s using things out of their intended context.  Javascript snippets pasted into Chrome to robotically press a button in a jupyter notebook, half-understood tensorflow snippets that leave 7/8 of a state-of-the-art cloud computer idle so I can use the other 1/8 mostly for its RAM, etc.  Elsewhere in the cloud world, people have automated the process of booting up, say, 1000 cloud computers, then installing Conda on every one of them, just so you can run a 1000-step for loop very quickly, with all that expensively constructed and identical state vanishing at the end if you take a break for lunch.  This is hilariously inefficient but cheap, while the more sensible ways of saying “hey Amazon, I’m gonna want to do 1000 things at once with numpy a lot in the next week” cost a great deal of money.  Maybe this correctly reflects how much different things cost in AWS?  But it feels awfully unstable, as if empires are being built on the results of some middle manager at Amazon or Google forgetting something between meetings.

The cloud computing giants can do deep learning the right way, internally, perhaps.  The rest of us are left with shitpost engineering, carrying dril’s spirit with us into our code even as we automate away his art in the same breath.

