A Distributed Language

TL;DR I think it would be great to have a language where single instance of a program can run non-stop for 10 years across a dynamic, heterogeneous mix of hardware, where the program code can perform its own allocation and management of infrastructure resources such as databases, VMs, and public IP addresses, as well as allowing maintainers to safely hot-swap new code into the running program to update it.

This is a deviation from my normal topic — Microvium — to talk about a random idea1.

I recently got a new job at Flip Logistics. Like many businesses an engineer might get a job at, they offer a SaaS solution: a cloud-hosted application with a web front-end, some native apps for various devices, some public APIs for integration with 3rd parties, and backend storage and logic. So far, this aspect is the same as every company I’ve worked at, and I’m sure the same as many others.

Being a newcomer is a great time to see things with fresh eyes. In this post, I’ll go over some of the issues I see with the construction of SaaS software in general (none of this is specific to Flip in any way, and I don’t speak on behalf of Flip in this post), and propose a direction that may resolve a lot of the complexity.

The End Goal

I think the best way to express the issues I see with SaaS2 implementation is to contrast it to the implementation of a single native application. I believe that much of the implementation inefficiency/complexity of a cloud application comes from the fact that it’s distributed: it exists beyond a single OS process. A typical programming language only has first-class support for things pieces running within a single process.

Self-Contained

A native application is self-contained: it starts, it allocates all the resources it needs, including all the interfaces it uses to communicate to the outside environment (e.g. the user).

Contrast this with distributed applications where resources are imposed externally: virtual machines, databases, IP addresses, certificates, firewall rules, etc.

Imagine if simple applications worked this way: to start MS Word, you first need to go to your OS Dashboard where you instantiate a (blank) GUI window for MS Word to eventually use; then you instantiation a file for it work on; and reserve the amount of RAM you want it to run in, and the core of the processor, etc. Then you run your configuration manager to associate the MS Word “process” with the RAM, the file, the GUI window, etc. Then finally you boot the whole thing up and now you can edit your Word document.

Conversely, imagine a cloud “application” that you could just run (instantiate) by the equivalent of just “double-clicking” on it, and letting it allocate its own resources: VMs, network configuration, firewall rules, load balancing, etc.

Documentation, APIs, and Type Safety

In most application software I’ve written, little or no technical documentation needs to exist outside the source code itself. I’m a huge fan of self-documenting code, where the name and type signature of a function or class describes all you really care to know about what it does3.

The information in interface signatures is not only useful for humans, it’s useful because the type system can ensure that implementations and clients of the interface are in agreement about the contract that must be met. This is safe.

But this only applies to interfaces within the context of a single application. As soon as you enter the realm of distributed applications, you now have the issue of defining contracts between the individual distributed components. Much of the time, these are not type-checked, so producers and consumers have no guarantee that contacts are met.

These kinds of inter-service contacts are normally documented externally to the code, where it’s easy to have them get out of date. And hitting “navigate to definition” in your favorite IDE doesn’t take you from the client code to the server code where you can actually see the implementation of the API you might be invoking.

Here’s an example to illustrate the lack of type checking:
In C#, I can bring a queue into existence by new Queue<T>(). Or I could use a 3rd party queue library with similar syntax. The type checker enforces that the contract between the producer and consumer side of the queue. Contrast this with an AWS or Azure distributed application, where instantiating a queue service is done outside your program code and there is no type checking to verify the contract between producers and consumers of the queue, or to self-document the kind of things that consumers should expect to get out of the queue.

This isn’t a straightforward problem to solve. A good type system for distributed applications would probably need to account for tearing between the client and server, where the client and server are on different versions — so contracts defined by type signatures would need to span into the past and future.

Debugging

I’m not sure that any solution in the near future can solve this problem, but the debugging experience for a distributed application is severely lacking.

One area of issue is the fact that you cannot step across service boundaries. While debugging a client of an API, I can step into other functions within that client, but I cannot step from the client to the server code and see both the client and server in the same “call stack”.

Prior Solutions

I’m sure many people have tried to solve these kinds of problems before, to varying success. Two existing solutions off the top of my head:

ASP.NET Web Forms. Within the limited problem-space of a client web application and its backend server, Web Forms attempts to bridge the gap between client and server, having both in the same project and a fair amount of type safety between them. This only works for certain types of applications, and is far from a general solution to the issue.

Infrastructure as Code (IaC), in various shapes — this allows code to specify the so-called “infrastructure”, such as what VMs to create and how to route network traffic to them. This gets part of the way there, but it falls short of the ideal.

To go back to the MS Word analogy, IaC seems akin to having a script that you must run prior to running MS Word, so that the script can set up the “infrastructure” required for MS Word (reserving RAM, creating the file, instantiating the GUI, etc). This still seems unnatural, and it doesn’t give you type checking between the “infrastructure” and the other code.

My Solution

Starting at the end and working backwards, I want a solution where allocating a queue service is as easy as allocating a queue data structure. Something like:

// Instantiate a queue data structure in memory

let localQueue = new Queue<int>();

// Instantiate a queue service, which may in turn instantiate the VM it needs, etc

let infrastructureQueue = new QueueService<int>();

// Instantiate a queue data structure in memory
let localQueue = new Queue<int>();
// Instantiate a queue service, which may in turn instantiate the VM it needs, etc
let infrastructureQueue = new QueueService<int>();

(I’m using a TypeScript-esque language here for my examples)

Similarly for a database:

let db = new DynamoDB<MySchema>({ ...options });

let db = new DynamoDB<MySchema>({ ...options });

I don’t mean this to create a connection to a database, I mean that these actually create a database. The returned db variable would therefore be a local object representing the remote resource.

Application Lifetime

I also believe that the code that creates the database should be part of the application code. Just like in a simple native application, if a resource is needed by the application, the code at startup should acquire that resource (or it can be acquired on-demand when it’s later needed).

MS Word is instantiated when you double-click the icon (and this is when it acquires many of its resources, such as creating the GUI window), and it’s lifetime might be minutes, hours, or maybe days, before you shut it down and it reclaims all it’s resources.

The lifetime of a distributed cloud service is much longer — you might start it up in production one day, and then retire it 10 or 20 years later. If it allocates a database at startup, you would expect that database to be around for the full 20 years.

This should drive home a paradigm shift I’m trying to communicate: a distributed application, as I’m using the term, is not the thing we currently call a “service” or “microservice”. Rather, the code for the “distributed application” is something that might run for a decade, or a century.

An example is “the Google search engine”, which from the outside is just an application available at google.com which has been running continuously since 1998.

Today’s languages and application environments are simply not suited for this paradigm. We’re used to application code being immutable for the duration of the application: it needs to be right before you run the application. If you need to update the application code, you need to stop the application, and it releases all its resources to deploy the update.

When we move into the domain of distributed applications, and these resources include things like databases, message queues, load balancers, etc., we can’t afford to have the application tear them down when we “close” the application to update its code (nor can we afford to close the application to update its code).

But we may be heading towards a world where it’s possible to mutate a running application. An example that comes to mind with JavaScript is React-JS Hot Reloader (I think this video shows it in action). It is intended as a development tool: changes to your code files are detected when you hit “save”, and the changes are injected into the running application without stopping it. In particular, the in-memory state of your components and store are preserved.

I’m not suggesting that Hot Reloader is the solution here, but just that it shows that the paradigm is unfolding. It’s not unreasonable to think of a future where a single distributed program (single code project with a single entry point) can run for years on a cloud of physical machines and other hardware, allocating and deallocating resources according to the rules it defines in its code — such as making a new database for each client, and spinning up VMs according to its dynamic load4 — and allowing us to maintain it while it continues to run by hot-swapping code into it.

I think that likely the ultimate solution here involves a new language that is designed to handle this kind of thing. A language in which the capability of hot-swapping pieces of code is assumed from the start. A language with type-safety across a non-atomic release (patching different physical machines at different times). A language which can describe processes that run on unreliable hardware and with unreliable communication.

But possibly a lot of progress can also be made with existing languages. Maybe in future I’ll go into more detail about what that might look like.

Microvium

I know that this post isn’t about Microvium, but I want to tie this to Microvium as well, because there is certainly some overlap. I won’t dwell on it because Microvium was not designed to solve this problem.

The key feature of Microvium that has relevance here is the snapshotting feature. It plays with the relationship between code and physical hardware in a novel way:

A Microvium process (a running instance of a Microvium program) doesn’t run on just one machine. It is in some sense a “distributed” application because it runs first in one environment and then is suspended and moved to a different environment. In a typical use, it will first run on a server or development machine, and then resume execution on a client or microcontroller.
A Microvium process, and by extension the resource is it allocates (such as memory structures) can transcend the physical hardware on which it runs. For example, you can kill power to the machine, releasing all physical RAM, but yet the Microvium app can still have live objects in that powerless state. The program cannot make forward progress without some CPU, but because of the snapshotting feature, the machine (CPU and RAM) that continues execution is not necessarily the same that starts it — the program can jump between machines while keeping it’s live state.
A Microvium program can control its own compilation5, because compilation == snapshotting. I previously wrote about how snapshotting can supersede bundling and be superior to it. But looking at this more abstractly, we can think of this as internalizing an externally-imposed workflow, which is similar in principle to my proposal to bring IaC scripts into the program that runs on that infrastructure: giving a program its own control over the infrastructure and perhaps even the build process.

While Microvium may touch on some of the concepts required for a solution, just like Hot Reloading, it’s clearly not the complete picture.

Although I actually do talk about Microvium at the bottom
I’m going to use the terms “SaaS application”, “cloud application” and “distributed application” interchangeably in this post, on the assumption that most large SaaS system are complicated because they run over distributed hardware
Or at least I strive toward this end goal
All by rules defined _in the program code_, or delegated to your favorite third party library, not necessarily baked into the platform
I’m using the term “compilation” here to mean compiling the source code to bytecode

A Distributed Language

The End Goal

Self-Contained

Documentation, APIs, and Type Safety

Debugging

Prior Solutions

My Solution

Application Lifetime

Microvium

Recommend

Quality vs Quantity

Microvium BoostIt's Magic

jimmysongio

Spring-Framework-Documentation

www.flydean.com

Configuring Cron Jobs on a Synology NAS

Thymeleaf: Best Way to Display Active Navigation Item

Storing Semi-Structured Data in PostgreSQL

Earn a Free Certificate from Google Cloud: Machine Learning for Business Profess...

Six Business Use Cases for Machine Learning

About Joyk