Arcan versus Xorg: Feature parity and Beyond

This is the follow-up to the ‘Arcan versus Xorg: approaching feature parity’ article, which is recommended reading if you have not read it already.

After that article, there was only one (and a half) real feature left to safely claim parity and that can be covered rather quickly. Thereafter we can nibble on the bites that are in Arcan, but not in Xorg — the reason for the difference in scope is best saved for a different time, although it is a good one.

First, let us not forget that there are more vectors for qualities that are significant to users than just features. Client compatibility is something that has been much lower on the list of priorities, yet is an important quality.

The reason is that prematurely adding support for something like a new display server backend to a toolkit, game engine or windowing library, without both necessary and sufficient features in place, will lead to a scattered actual feature set. There will be the theoretical features, and then the features some clients actually might use some version or interpretation of. These two sets will slip further and further apart unless each affected project has exceptionally alert developers, and unless the reference implementation has basic hygiene in place regarding conformance verification and validation tools.

We have experimented with such implementations, of course, but that was mainly to test the design and current implementation; to find out where our APIs get annoying to use, so that nobody else suffers.

The other reason for downplaying compatibility is that client applications have strong biases on how things are “supposed” to work, with the worst offenders being just ‘GUI toolkits’. Small changes can have a large impact on application design, and you are forced to adopt a certain worldview that narrows your perspective. As a trivial thought exercise, just consider something like a smartphone in a world where touch input was not invented, but eye tracking was flawless.

To re-emphasise the point of the Arcan project as such: it is not, and never was, to reinvent X. The principle program (12 principles for a diverging desktop future) says as much. The project is about showing that there were other paths not taken by the big players, and hinting at what those paths offered when followed to their logical conclusion.

Most of the lessons in here are not amazing discoveries from diligent research. Rather, they uncover wisdom hidden inside the circuits of elderly arcade machines. It is the reason the header-bar image here is still a cropped PCB from that era.

Enough editorialising, towards the Xorgcism! This will again be painfully long and dry. Appy-polly-loggies in advance for that.

Network Transparency

The main missing feature was about network support. Luckily I do not have to cover that again — there are already two articles on the topic: one that digs into the situation in Xorg; one that covers our solution. If there is anything left to re-emphasise, it would be that the networking part is a completely separate process that reinterprets the networked client data.

There are three reasons for this – the first is that optimal local rendering and optimal remote rendering are incompatible ideas that were historically similar enough that the discrepancy could be tolerated. Nowadays we want both, and that takes a different architecture where everyone involved adapts to whichever of the two is currently desired.

The second is for security; the client IPC is a privilege boundary. Network access is yet another privilege boundary. Mixing multiple privilege boundaries within the same process practically merges them into one, greatly reducing security. Even if you have a separate network process, if all it does is proxy-patch packets, you gain very little actual protection. Your ‘network process’ is then just another router or proxy on the way to the real parser in the privileged process, and the only primitive it needs (writing to a file descriptor) is hard or impossible to ‘sandbox’.

The third is for resilience. Recall the lengths we have previously gone to in order to have crash resilience. There has been much more work done in that direction since the linked article, and one of the final stages is to make sure client state can gracefully survive GPU loss and, should the circumstances permit, machine loss. These things do not work cleanly as an after-thought – the consequences permeate everything.

Drawing API

The other feature left was the fabled drawing API – under which I would also sort font management, as that lies somewhere in the middle; fonts were already covered in the previous article as part of the section on ‘mixed-DPI’ drawing.

The thing is, Arcan started with a drawing API – initially for specifying interactive animated presentations, documents and some, ehrm, darker things which only beer and cocktails in shady bars might, briefly, bring back to the surface. The display server angle turned into a detour and necessary evil after certain actors could not ‘leave well-enough alone’. All the projects shown here over the years have been using this API and driving its development, in lieu of a requirements specification / waterfall-like design phase.

Anyhow, the ‘animated presentations’ bit brought in all kinds of input devices, as well as audio and video. Video parsing and decoding is a finicky process at the best of times, and the typical libraries for this were obvious exploitation vectors. This meant that process separation with least-privilege was clearly needed. With that you get a realtime/media IPC system, so clean that up, add buttons and there is your basic display server.

As a side-note: Having a ‘drawing API’ on the consumer side is not hard per se and, in fact, is the strategy that won, with local raster being a distant fallback. If you want accelerated rendering through OpenGL or Vulkan, it ends up being command buffers (hey, draw commands!) sent to the GPU that will actually turn it into a signal and/or pixels – the drawing won’t happen in-process through the grace of putting pixels in a buffer. You could add non-accelerated drawing to Wayland with a one line addition to wl_shm defining a buffer format as SVG, wouldn’t that be interesting? ( ͡° ͜ʖ ͡°)

J’accuse! – I hear you zoladats triumphantly proclaim. That is not for the clients to use now, is it? So lwa-ts me explain the two sides to this story. A longer version with charts and code and everything can be found in the article on AWK for Multimedia.

The big bit about a drawing API is the level of abstraction it provides. If it is at the low level of “putPixel()”, you lose — the framebuffer is dead, long live the framebuffer. If it is about “bind a set of uniforms, buffers and a GPU program, then supply commands”, you get what the high-end graphics libraries already provide. The real challenge is finding the right set of mid-level functions that provide something more, without getting morbidly obese at the scale of whatever flavour of ‘absorb all’ UI toolkit is in fashion these days; to understand how, where and when the JS/HTML5/CSS/… soup fails.

Thus, a network friendly, compact, mid-level drawing++ related API is the target. One that can be used out-of-process and as an “intermediate language” (think LLVM-IR) for higher level UI components or document interpreters to render to, while at the same time having the feature set expected of ‘web-app’ like targets.

The “++” represents the process control, audio and meta-IPC routing that is also necessary to cover the span of what is expected of ‘a desktop’. The processing of these commands translates to another intermediate layer, ‘AGP’, a simplified subset of OpenGL that was needed as a decoupling mechanism to eventually switch to-/provide- Vulkan and software rasterisation. It aims at a network-friendly level at about the quality of DOOM3/WebGL; not great, but past that point video compression starts to win out.

The following figure illustrates the different ways of running Arcan to cover all combinations of ‘drawing API’:

So Arcan can be a client to itself, although through a separate binary (arcan_lwa – ‘lightweight application’ – marked as ArcanLW in the figure).

ArcanLW will connect to Arcan ‘as a display server’ and act like a normal client. The side benefit here is that if you know how to write an app with the one, you can write or patch the window manager in the other. You get one API to control, manipulate or inspect all the routing for the entirety of your desktop. It privilege-separates by design, and it has been stable for years.

The last piece to this puzzle is to be able to provide RPC-like access to this API. Here is why Lua matters even more.

For those not in the know, exposing a C API to the Lua VM is about as simple as it gets and it fits other low level calling conventions; register the symbol name and function pointer of the function in the VM namespace; pop arguments from the stack based on the desired types; process; push the function return to the stack.

It can get more involved than this, but not by much. That actually doubles as a protocol decoder, thus it can be used as a decoupling mechanism. Define a bytecode and there is your ‘out of process’ drawing/WM API to bind to other languages.
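
As a sketch of the idea – with an invented opcode table and argument layout standing in for any real wire format – the decoding side could be as small as:

-- sketch: a trivial opcode dispatcher standing in for a bytecode decoder;
-- opcodes and argument layout are invented for illustration, while
-- move_image/blend_image are ordinary engine functions
local opcodes = {
    [0x01] = function(args) move_image(args[1], args[2], args[3]) end,
    [0x02] = function(args) blend_image(args[1], args[2]) end
}

local function decode(opcode, args)
    local handler = opcodes[opcode]
    if handler then
        handler(args)
    else
        warning("unknown opcode: " .. tostring(opcode))
    end
end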

Migration and Recovery

While we did brush upon crash recovery in the previous post, as well as a previous article on the subject – there is still a lot to it. Enough work has been done since to warrant yet another article.

People aware of some GTK history know that it actually has the underused ability to jump between X Displays. The ‘underused’ part can probably be blamed on the, ehrm, ergonomics of spinning up and maintaining multiple X servers to jump between. Politely put, it was not for the faint of heart.

It is a good idea in principle, although there is not much benefit to doing it at the toolkit level versus at the IPC level where, as it turns out, it is much easier — the feature will always apply regardless of whether the toolkit had enough foresight and resources to implement it or not.

In X, if a window manager crashes — it is not that big of a deal since a new window-manager-as-client can reconnect and start rebuilding its view of the world. In reality it is rather janky, and there can be left-overs when the window manager's additions to the scene graph in Xorg were of a more complex nature.

To learn from this, we come back to resource allocation in Arcan: each client gets a primary segment (window++) and then dynamically renegotiates for more (with rejection being the default). The window manager can kill any and all secondary allocations and ‘reset’ the primary — forcing it to try and renegotiate new resources.

This means you have one guaranteed ‘window’ and the rest that can come and go. This is partly to allow for much crazier window management schemes, while at the same time not losing the more simple ones. A client developer gets a known and guaranteed set of available features to serve as a baseline, with sets of extras to add- or react- to.
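
In WM script terms, that negotiation might sketch out as follows – a hedged example where the event field names are approximate, and where doing nothing in the handler is the rejecting default:

-- sketch: default-reject secondary allocations, with an exception for
-- a clipboard segment; field names approximate, sizing arguments elided
local function client_event_handler(source, status)
    if status.kind == "segment_request" then
        if status.segkind == "clipboard" then
            local new_vid = accept_target()
            -- track, position and route the new segment here
        end
        -- all other requests are implicitly rejected
    end
end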

The WM also gets access to a key/value store for tracking WM-specific information about its clients. Both clients and the WM get GUIDs for re-pairing clients to this information so things re-appear where expected. It also gets the ability to dynamically redirect clients to other display servers – as well as to inform them about alternate ones should the present one disappear.

The WM also gets a specific event entry point, ‘adopt’, which references clients of unknown origin. This allows controlled handover between WMs for live-switching only relevant clients, as well as a mechanism for the WM to recover from errors in itself.

This brings us to a handful of possible error scenarios and recoveries:

  1. Bad WM code → Lua virtual machine error handler → engine reloads WM (same WM or a ‘fallback’ one) → clients receive ‘RESET’ event → WM ‘adopt’ entry point called (sketched below).
  2. Live-lock in WM → watchdog process sends signal → (same as 1).
  3. Crash in engine → client-side IPC library detects, connect loop (same or new address) → injects ‘RESET’ event.
  4. Live-lock in engine → watchdog process sends kill → (same as 3).
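
For scenario 1, the ‘adopt’ entry point on the WM side might look roughly like the following sketch (an appl name of ‘mywm’ and a simplified signature are assumed):

-- sketch: called once per surviving client after the WM was reloaded;
-- signature simplified, returning true is assumed to keep the client
function mywm_adopt(vid, segkind, title)
    if segkind == "application" or segkind == "tui" then
        target_updatehandler(vid, client_event_handler)
        show_image(vid) -- make it visible again
        return true
    end
    return false -- let the engine clean it up
end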

The extra twist is that the WM can ‘fake’ a crash for one or many clients, so that the procedure from 3 can be triggered on command. It can also send a new address for the client to connect to. This opens up for dynamically migrating clients between local and remote servers. The short clip below shows this going from local to remote:
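
A hedged sketch of triggering that from the WM – treat the function arguments as approximate:

-- sketch: force the recovery path for one client, or suggest an
-- alternate connection point for it to migrate to
reset_target(client_vid)                   -- client receives 'RESET'
target_devicehint(client_vid, "remote_cp") -- offer a new address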

While the internals for how all this works are quite complicated, both the WM- and the client-facing code are quite trivial.

Hook Scripts

A lesson to learn from XTEST, XRANDR and the tools that take advantage of these extensions- is that no matter how well you do your requirements engineering, people will disagree with your world view, and sometimes loudly so. Some just need their ‘hacks’ for the sake of it — the act of hacking and the feeling of agency it brings has value that should not be dismissed outright.

Is it possible to provide an outlet for such expressions without painting yourself into a design corner? The answer here is ‘hook scripts’. To understand how and why, we need something of a schematic.

This provides two possible paths for intercepting and manipulating WM behaviour. In the pre- stage the API that the WM sees can be intercepted and patched before the WM scripts themselves have a chance to run.

The other happens after the initialisation entry point but before the actual event loop starts. This allows you to intercept the event handlers themselves and both pre- and post- hook those.

When combined, these let you ‘monkey patch’ any part of the WM's view of things, and work around the “stuff-you-don’t-like”. It can also be used to inject features that were not allowed as part of Arcan itself, for whatever reason.
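
Since a hook script is plain Lua running in the same VM as the WM, both stages come down to ordinary function wrapping. A sketch, where the WM entry point name ‘mywm_input’ is an assumption:

-- pre- stage sketch: patch an engine function before the WM scripts run
local real_alloc = target_alloc
function target_alloc(name, ...)
    warning("WM opening connection point: " .. tostring(name))
    return real_alloc(name, ...)
end

-- post- stage sketch: wrap a WM event handler after initialisation,
-- but before the event loop starts
local real_input = mywm_input
function mywm_input(iotbl)
    -- pre-hook: inspect or rewrite the input table here
    real_input(iotbl)
    -- post-hook: react to what the WM just processed
end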

A practical example is how to add external input drivers:

arcan -H hook/external_input.lua myapp & ARCAN_CONNPATH=extio_1 my_input

This opens up a single-use connection point (extio_1) and additional ones (_2, _3, …) for each time the hook script is added. Then your external input driver is set to connect to that name, whereby ‘myapp’ sees the inputs as coming from the engine itself.
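
Conceptually, the core of such a hook script is small. A simplified sketch of the idea – not the verbatim hook/external_input.lua, and with approximate event fields:

-- sketch: expose a connection point and forward whatever input the
-- external driver sends, as if it came from the engine itself
target_alloc("extio_1", function(source, status)
    if status.kind == "input" then
        _G[APPLID .. "_input"](status) -- inject into the WM input entry point
    elseif status.kind == "terminated" then
        delete_image(source)
    end
end)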

This is yet another point where the Lua tactic shows its strength — if hundreds of thousands of beginner- to intermediate-level developers managed to do it for World of Warcraft and lots of games thereafter, anyone who can juggle things as clumsy as Bourne shell scripts should be able to ‘scratch their itch’ with ease.

Alternate Representations and Accessibility

Now we are at the more unique stuff. In most other display systems, a client performs some version of the following:

  1. Allocate a window.
  2. Tie it to a pixel buffer.
  3. Signal when it is ready to be used.

Add the ability to control parts of its position and composition order and you are surprisingly close to done — the ‘odd man out’ is practically Wayland where even this is convoluted to the extreme.

Arcan has a different view on clients. While the clients can perform the sequence above, the WM can also tell the client to have a new window — and know if the client accepted it or not.

This was brushed upon in the article on ‘Leveraging the Display Server to improve debugging‘ — go there for the longer explanation. The short form is that if we add the notion of more abstract types to our windows, like ‘Debug’ or ‘Accessibility’ and so on – it implies asking the client ‘can you provide such a view of yourself’.

Thus there is no need for a separate ‘input-accessibility-bus’ or complexities of that nature — it is all about re-use of existing mechanisms to form new features.

The obvious normal case for a feature like this is ‘screen sharing’ – there is strictly no need for the client to be able to enumerate resources and displays in order to implement the sharing and collaborations features; it should not need to know how the sausage gets made.

The rest of the desktop is perfectly capable of providing the UI language to annoy a user for such details, or to automatically map and provide the contextually relevant subset. The role of the sharing client should be to receive and serve, not to define and control.

Application Launching and Handover allocation

Arcan can act as a launcher for applications. As such, it knows if an application dies, and will respond to that. This is done in order to improve security, as well as to be able to provide control flows that are common to gaming consoles, smart phones, set-top boxes and so on.

Most of this feature did not come from the strict idea of acting as a ‘launching frontend’ but as a side effect of wanting to be able to privilege-separate any parsing of untrusted contents. This practically means spawning new limited processes that can send the results back, and controlling their life-cycle.

With the launcher path, it means that a client does not ‘connect’ as an external source, but rather inherits its IPC connection as part of being run. In fact, locking down the ‘external connect’ path is a single ‘Hook Script’ that blocks out the ‘target_alloc(str)’ function.
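
A sketch of such a script, in its entirety:

-- sketch: replace target_alloc before the WM scripts load, so that no
-- external connection points can be opened
function target_alloc(...)
    warning("external connections are disabled")
end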

What it affords us is client identity and authenticity. This is a hard problem to implement as an afterthought with the primitives that a POSIX-style userspace provides you; it ultimately ends up either failing spectacularly, or juggling cryptographic primitives and failing subtly — the chain of trust is broken.

It also affords us a client-unique key-value store for the metadata config that is needed to ‘remember’ what some window related to a client did – across crashes, reboots and restarts. It also lets us inform our scheduler, resource allocation and the like to treat these clients specially, in order to make ‘fork-bomb’ like Denial of Service attacks much harder.

The twist to this is ‘handover allocation’. Clients themselves may want to repeat the cycle by spawning new processes, as part of the intended feature set, as well as privilege separation / sandboxing of its own. The alternative would be ‘premature composition’ where the client becomes a display server of its own, which is a bad idea.

The following figure illustrates both the ‘launcher’ path as well as the ‘handover’ form:

The user defines the set of allowed targets that ‘MyWM’ can execute (binary + arguments + metadata) through the ‘arcan_db’ tool and the ‘add_target’ command shown at the bottom of the figure.

MyWM refers to this by calling the launch_target builtin function, which sets resources up and spawns the new process with ‘mybin’ running based on what is in the database as ‘thething’. The code in mybin decides that it wants a new window to delegate something to, say embedded privilege-separated video playback. It requests a new window of the special handover type.

MyWM is friendly enough and ‘accepts’ by calling accept_target as part of the event loop tied to the launch_target call. New primitives are allocated and pushed to mybin. Mybin launches the new program ‘somebin’, which inherits and opens the connection to Arcan the same way ‘mybin’ did. At the same time, mybin gets an abstract cookie that can be used to define and reposition ‘somebin’ within its own coordinate system and scene-graph subgraph. X-style reparenting, but without most of the chaos.
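
Sketched in WM code – with the launch_target signature and event fields abbreviated:

-- sketch: run 'thething' from the database, then honour a handover
-- request from the resulting client; fields approximate
launch_target("thething", LAUNCH_INTERNAL,
    function(source, status)
        if status.kind == "segment_request" and
           status.segkind == "handover" then
            local somebin = accept_target()
            -- 'somebin' can now be positioned within mybin's
            -- coordinate system and scene-graph subgraph
        end
    end
)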

More details on some use of this feature can be found in the article on: ‘Another low-level Arcan application: Tray Icon Handler’

State Controls and Transfer

A (not) so big gamble on client compatibility is that for the future, we really need to improve how virtual machines (a class of applications that include ‘terminals’, emulators and ‘modern’ browsers) integrate with the grander system.

All the big actors are moving in this ominous direction as a preface to ‘clouding’ these virtual machines and then keeping things there.

One of the special needs of virtual machine class clients is that of state controls – suspend, resume, snapshot and restore.

Arcan is ahead of the curve there, since we used emulators as the reference model for the ‘needs’ of such a client from the start. They have been leveraged for testing performance, throughput and latency ever since – emulation has hard real-time requirements after all, and a whole bunch of free test cases in the shape of ‘Tool-Assisted Speedruns’. State controls, likewise, are a natural feature of well-written emulators.

In the Arcan API, this boils down to trivial commands in the WM layer – something like:

bond_target(vid_client_a, vid_client_b)

is literally enough to tell client_a to pack up its state and forward it to client_b, and to tell client_b to restore from the incoming state – even if client_b is networked.

This also means that the ‘sharing’ window feature can be combined with the ‘alternate representation’ feature to fake new devices inside of the virtual machine. Something like the following:

define_recordtarget(vid_client_a, {src1, src2}, more_controls_here)

would create a composition (of src1, src2) and inject it into vid_client_a, which can then present it as a ‘camera’ or ‘screen sharing’ source and so on.

Audio

This might come as a surprise to some, but audio and video processing are really quite similar. If you only do graphics and have not studied audio, or the other way around, you are missing out. It might take a course or two on ‘Digital Signal Processing’ or something to that effect to not hit any grand barriers, but that should be part of basic computing hygiene by now.

Not only do they have a large overlap in the theoretical foundation, but their internal structure from an operating system position does as well. Separating the two is something that has increased systemic complexity by a fair amount. Instead of using the same IPC primitives; transfer modes; device navigation; security; hotplug notification; output routing and redirection; we traditionally maintain two or more ecosystems for doing 90% of the same thing – even though the HDMI cables themselves tell us to stop.

What then follows is that a lot of applications need to combine the two in a coordinated and synchronised way. So instead of having these two domains work over the same IPC system that has already done the painfully heavy lifting, a third one is added (Binder, D-Bus and so on), increasing systemic complexity by magnitudes.

While the feature itself is in a more primitive stage, much of it piggybacks on the work done on video and input latency – so when those stages are closer to complete, the last bits of audio will get filled in as well. Similarly to graphics, where we cannot realistically aspire to be the next ‘Unreal Engine’, the role of audio support here is not to try to replace ‘pro’ audio, but to fill out the ‘simple to middle’ level with coherent controls that mesh well with desktop-style coordination – and to wrap it in controls that can delegate to ‘pro’ audio consumers, much like we handle features like colour management.

This becomes especially important for the networked desktop use case, for accessibility (the ‘screen reader’) as well as for VR and AR experiences where positional audio suddenly becomes one of the most powerful tools available.

Exotic Input and VR

This is another chapter both for the future, for the present and for improving the past. The previous article mostly skimmed over the input model, and as with the other features — there is a lot of nuance to it.

The input model in Arcan covers both the expected basics (mouse, touch and ‘translated’ – i.e. keyboards) as well as game (‘assistive’) devices, eye trackers and similar sensors. This goes all the way to weird hybrid-I/O (see Interfacing with a stream deck). It is possible to safely and securely attach external input devices as shown in the section on hook scripts.

Two other “input” parts are client-announced input labels and VR.

Starting with the client-announced input labels – any client can announce a set of active input labels. These are short high-level tags, e.g. “Copy_Clipboard”, with some kind of description in the current language settings (“Copy the word at the cursor”), a note of its type (analog or digital), as well as a suggested default keybinding (Ctrl+C) if digital.

The WM can leverage this feature to provide a coherent binding interface, work around conflicts with existing bindings from other clients or the WM itself. Subsequently it can be used to implement your ‘hotkey’ manager (best done as a hook script) without overreaching and giving it access to anything other than what is strictly necessary.

The WM simply sets the appropriate label ‘tag’ to the next input event that gets routed to a client. This also meshes with other accessibility features, such as routing through a voice synthesizer to say that you pressed ‘copy clipboard’ rather than ctrl-down-C and then guessing whatever that meant in the context of the active application.
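
From the WM side, the bookkeeping might sketch out like this (event and field names approximate):

-- sketch: remember the labels a client announces, then tag an input
-- event with the high-level label before routing it to the client
local labels = {}

local function client_event_handler(source, status)
    if status.kind == "input_label" then
        labels[status.labelhint] = status.description
    end
end

local function route_copy(dst, iotbl)
    iotbl.label = "Copy_Clipboard" -- matches an announced label
    target_input(dst, iotbl)
end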

VR is an interesting example of a complex input model with a short window of opportunity; the ultimate avatar matches your every muscle at high sample rates.

This is a complex operation that aggregates sensors from multiple timing sources, including arrays of video cameras and laser projectors. It does not mesh well with the traditional event loop structure, as queues get saturated instantly.

The span between ‘present’ and ‘possible’ inputs is also much wider – not everyone should need a full body suit, or have all fingers and toes present – and the sample streams themselves are sensitive, as a number of strong biometrics can be extracted from them.

The following diagram tries to take a stab at explaining how the layers mesh together to provide VR composition and input management.

Akin to the launch_target function mentioned previously, there is a launch_avfeed function available to the WM for launching the ‘decode’ frameserver. This provides a privilege-separated producer process that provides a single input feed. Internally it works like any old client would, but since we have a well-defined role and interface for it, the consequences of forcibly killing/restarting it are predictable.

It can be used to retrieve video feeds from webcams, and the file browser in Durden uses it for live previews of images and video. For VR, it allows us to access and map the many possible tracking cameras (inside out, outside in, hands, …) then compose/embed (AR), filter and perform computer vision to improve position tracking and so on.

For device control itself, we have yet another process/binary that is mapped to the function vr_setup. This one provides display metadata about the display panel and lens configuration, as well as hotplug events of ‘limbs’ (slots matched to the average human anatomy). The WM decides which ‘limbs’ map to which objects in 3d space, and how or if they should be forwarded to any clients.
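
Bringing a device up from the WM might sketch out as follows – the limb event fields and mapping details are approximate, and ‘camera_rig’ is a placeholder for whatever the WM renders the scene through:

-- sketch: start the VR bridge and bind the head limb to the camera rig
vr_setup("", function(source, status)
    if status.kind == "limb_added" and status.name == "head" then
        vr_map_limb(source, camera_rig, status.id)
    elseif status.kind == "limb_removed" then
        -- unbind whatever was mapped to status.id
    end
end)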

The sample streams will not be processed as a normal input event, that would be too costly. Instead they continuously update designated locations in shared memory, along with timestamps and checksums – based on the currently set limb-map.

A client connects and identifies its primary segment as the L eye, the R eye or LR in a side-by-side configuration. For the L and R cases, a secondary segment is then allocated through the primary, to cover for the missing eye. It requests an extended mapping for VR data transfer (similar to how colour management transfers its metadata back and forth).

A Path from Curses – TUI

On top of the mid/high-level drawing API and the low-level client API on the client-facing side of the ‘SHMIF’ IPC system, there is also a separate client API for text-dominant clients. This API is built on top of SHMIF, and exposes a stable subset of its features.

The included terminal emulator, afsrv_terminal, is built using this API – and it has a basic alternate shell that completely skips terminal protocols and pseudo-terminals.

The API provides basic window segmentation and positioning that integrates with the window manager of the desktop, and similar integration features specifically for the command line, so that the WM can decide how to present and interpret completions and suggestions. It can trigger alerts and notifications, as well as announce inputs in a similar fashion as described in the ‘exotic input and VR’ section.

The output buffer format, tpack, has been mentioned before. It enables server-side rendering, both for efficient network remoting (since SHMIF is involved, the network transparency parts mentioned earlier apply) and for optimal sharing of font caches and resolved glyphs (and GPU texture atlases) transparently between clients, allowing for zero overdraw and minimal input-to-output latency.

Colours and palette can be defined and redefined by your window manager; TUI clients then have the option of 24-bit colours that they pick themselves – but also the preferred option of semantic labels that map to the active palette. This means a way out of the terminal-legacy “Red is now Green” problem of specifying colours.

Content synchronisation is always explicit, so that no window should tear, draw output that will be replaced milliseconds later, or waste a vsync on an incomplete update.

All text is Unicode, and the clipboard has a side channel so large paste operations do not block or otherwise interfere with stdin/stdout. Locale (input/output) language can be provided, and there is a getopt-like replacement for input arguments, packed over an environment variable in order to not complicate legacy argc/argv further.

Attention alerts can be sent, as well as notifications. There are controls for describing content length and byte-precise seeking, as well as job progress status.

Speaking of stdin/stdout/stderr – those can be redirected dynamically by the server. Additional input/output pipes can be dynamically announced and added. As part of that feature a TUI client can also announce support for input and output types, both as immediate requests or capability hints for universal open/save and file browser integration.

There are a few basic primitives to get started. Examples of such primitives are readline, buffer view (everyone deserves a hex editor) and list view, but the core will not go deeper than that. The focus, instead, is on providing discoverable command line interfaces and one-off commands that focus on processing and providing data for data-flow graphs instead of pipelines.

