
Reducing UDP Latency

source link: https://medium.com/@deryugin.denis/reducing-udp-latency-ce60d98c7bff


Hi! I’m one of Embox RTOS developers, and in this article I’ll tell you about one of the typical problems in the world of embedded systems and how we were solving it.


Stating the problem

Control and responsiveness are key requirements for a wide range of embedded systems: on the one hand, sensors and detectors must notify other devices that some event occurred; on the other hand, those systems should react as quickly as possible. Examples of such systems include CNC, vehicle control, avionics, distributed sensor systems and many others.

At the same time, it’s really hard to develop bare-metal programs for a number of reasons:

  • Developers don't have much choice of frameworks and languages: it will probably be ANSI C and assembly even for the non-time-critical parts of the code that could be developed faster with something else (for example, debug output, statistics collection, a diagnostic user interface and so on)
  • There are lots of solutions that require various hardware drivers: network, interrupt controller, timer and UART drivers are the bare minimum
  • Some systems have both an FPGA and an HPS, which requires additional steps to “glue” all the parts together

This is one reason for the popularity of the Linux kernel in embedded systems, and it works great in lots of applications, as it provides a portable and stable code base.

But let's look at one specific case: time-critical applications that rely on the network.

“Time-critical” may mean different things:

  • Applications that require high bandwidth
  • Applications that require low latency

Linux works great in the first case, as there are a number of possible optimizations (tuning interrupt coalescing and so on), but can you achieve better results in terms of low latency? Let's find out!

Real-life example

We had the following task: minimize the latency of every single UDP response over Ethernet. A DE0-Nano-SoC board was used as the embedded system core; it would control peripheral devices in reaction to commands arriving in those UDP packets.

The network topology is point-to-point, so there are no intermediate hubs, routers or other network devices.

The maximum acceptable latency is 0.1 ms, while the basic Linux solution could only provide 0.5 ms.

At the same time it was necessary to support POSIX-compatible programs.


To measure the response time we will use two hosts.

The first host will be a desktop computer running a GNU/Linux operating system; the second host will be the DE0-Nano-SoC development board. This board has an FPGA and an HPS (Hard Processor System, basically an ARM core), and we are going to minimize the response time for the HPS running Embox RTOS.

We will use a simple test application that looks like this:

while (1) {
  char buf[BUFLEN];
  struct sockaddr_in src;
  socklen_t src_len = sizeof(src);
  ssize_t len = recvfrom(s, buf, BUFLEN, 0, (struct sockaddr *)&src, &src_len);
  sendto(s, buf, len, 0, (struct sockaddr *)&src, src_len);
}

This program will run on the second host, i.e. DE0-Nano-SoC.
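The loop above omits the socket setup it relies on. A minimal sketch of what it assumes (the port number and buffer size here are placeholders, error checks are left out):

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

#define PORT   12345   /* placeholder port, not from the article */
#define BUFLEN 1024    /* placeholder buffer size */

int main(void) {
  int s = socket(AF_INET, SOCK_DGRAM, 0);

  struct sockaddr_in addr;
  memset(&addr, 0, sizeof(addr));
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);
  addr.sin_port        = htons(PORT);
  bind(s, (struct sockaddr *)&addr, sizeof(addr));

  /* ... the echo loop from above goes here ... */
  return 0;
}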

The first host will send UDP packets and wait for a response to each of them, measuring the time until the response arrives.

for (int i = 0; i < N; i++) {
  char buf_tx[BUFLEN], buf_rx[BUFLEN];
  sprintf(buf_tx, "This is packet %d\n", i);
  time_t time_begin = time_now();
  sendto(s, buf_tx, BUFLEN, 0, (struct sockaddr *)&server, sizeof(server));
  recvfrom(s, buf_rx, BUFLEN, 0, NULL, NULL);
  time_t time_end = time_now();
  if (memcmp(buf_tx, buf_rx, BUFLEN)) {
    printf("%d: Buffer mismatch\n", i);
  }
  if (time_end - time_begin > TIME_LIMIT) {
    printf("Slow answer #%d: %ld\n", i, (long)(time_end - time_begin));
  }
}

We also measure the average, minimum and maximum response times.
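time_now() in the client loop is not a standard libc call; one plausible implementation (an assumption on our side, the actual helper lives in the repository) returns a monotonic timestamp in microseconds:

#include <time.h>

/* Hypothetical helper: monotonic timestamp in microseconds. */
static time_t time_now(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return (time_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}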

The source code is available on GitHub.

A test run confirmed that packets were received successfully, so we started making some basic optimizations:

  • Get rid of all debug UART output: it turned out to be the slowest part
  • Compiling with -O2
  • Enabling the PL310 L2 cache controller (this had the least effect)

After sending 500 000 packets we got the following measurements:

Avg: 4.52ms
Min: 3.12ms
Max: 12.24ms

This is still several times slower than the limit we need to meet, and the average response time would have to be almost ten times lower just to compete with Linux.

Finding out the reason

One possible source of slow data processing is other processes using system resources, but in this case nothing else was running.

Maybe there are too many interrupts from some peripherals? But that's not the case either: we only process network and timer interrupts. The former are needed to handle Ethernet frames, and the latter have no real effect: slowing the timer down doesn't reduce the response time anyway.

Eventually we found out that the high latency was caused by low link speed: we were using a 100 Mbit/s USB-to-Ethernet adapter, and the network driver didn't support a 1 Gbit/s link either.
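On the Linux host, the negotiated link speed can be checked with ethtool (the interface name eth0 is an assumption here):

ethtool eth0 | grep -i speed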

After patching the driver and replacing the Ethernet adapter with a faster one, we got the following results:

Avg: 0.08ms
Min: 0.07ms
Max: 4.31ms

Linux comparison

As we are using a POSIX-compatible application for our measurements, it's very easy to cross-build it for Linux:

arm-linux-gnueabihf-gcc server.c -O2, which builds an ELF file.
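A slightly fuller invocation might look like this (the output name and static linking are our additions, to keep the binary self-contained on the board):

arm-linux-gnueabihf-gcc -O2 -static -o server server.c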

Running with the same client on the host side:

Avg: 0.77ms
Min: 0.74ms
Max: 5.31ms

As you can see, in this test Embox is able to respond almost 9 times faster than Linux, which is a pretty good result.

Dispersion

While the average response time is pretty good, the maximum time kills the positive effect for two reasons:

  • It is long enough to miss the time limit, but even more importantly
  • It introduces significant uncertainty into system behavior

How can we investigate the reason for such dispersion? We decided to start by measuring the time it takes for an Ethernet frame to be fully processed between being received and being answered. We could collect these statistics on the development board for later analysis, but it's much simpler to just send the data in the UDP packet itself and process it on the desktop computer.

The receive time is written to a variable inside the interrupt handler; the send time is written just before activating the network card's DMA.

int net_tx(…) {
  if (is_udp_packet()) {
    timestamp2 = timer_get();
    memcpy(&packet[UDP_OFFT],
           &timestamp1,
           sizeof(timestamp1));
    memcpy(&packet[UDP_OFFT + sizeof(timestamp1)],
           &timestamp2,
           sizeof(timestamp2));
    …
  }
}
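On the desktop side the two on-board timestamps can then be pulled back out of the echoed payload. A minimal sketch, assuming they are 32-bit values placed at the very start of the UDP payload (the exact offset and width must match the board code, so treat these as assumptions):

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: print how long the board spent on one packet,
 * in board timer units, given the echoed UDP payload. */
static void print_board_time(const char *buf_rx) {
  uint32_t ts_rx, ts_tx;
  memcpy(&ts_rx, buf_rx, sizeof(ts_rx));                 /* written in the IRQ handler */
  memcpy(&ts_tx, buf_rx + sizeof(ts_rx), sizeof(ts_tx)); /* written right before DMA start */
  printf("on-board processing: %u timer units\n", ts_tx - ts_rx);
}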

This time we got the following results:

Avg: 8673
Min: 6191 
Max: 11950

It turned out that the dispersion of Embox's own UDP packet processing is not big at all: just about 25%, which hardly explains the final ~5000% dispersion (Avg: 0.08 ms, Max: 4.31 ms).

Even if Embox processed every UDP packet in exactly the same amount of time, that would cut the spread by only about a quarter, which would still be too much, so we started looking for another explanation for this behavior.

What if the problem is on the other side?

So now we have two potential problems:

  • Hardware issues
  • Linux host latency

The first problem would be much harder to solve, so, hoping it wasn't the case, we focused on the latency of the Linux host.

How do we check it?

First of all, we can simply try giving the test the highest priority on the host system:

nice -n -20 ./client

However, this didn't have any significant effect. The average time seemed to drop slightly, but the change was small compared to the dispersion.

Another option is to change the scheduling policy to round-robin. You can do that with the chrt command like this:

chrt --rr 99 ./client

Finally, it worked!

The number of "slow" responses decreased dramatically. The histogram below shows the difference between round-robin and the default scheduling:

[Histogram: distribution of response times with round-robin vs. default scheduling]
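The same effect can be requested from inside the client instead of via chrt; a minimal sketch using the POSIX sched_setscheduler() call (the priority value mirrors the chrt example above, and root privileges are required):

#include <sched.h>
#include <stdio.h>

/* Switch the calling process to the SCHED_RR real-time policy. */
static int enable_round_robin(void) {
  struct sched_param sp = { .sched_priority = 99 };
  if (sched_setscheduler(0 /* this process */, SCHED_RR, &sp) != 0) {
    perror("sched_setscheduler");
    return -1;
  }
  return 0;
}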

Other ways to reduce latency for Linux host

  • Using raw sockets. It's not exactly the same task, but if you really need the lowest possible latency, UDP is probably not the best choice at all :)
  • Interrupt coalescing may increase network latency, so it can be helpful to turn it off (see the example after this list)
  • You can use libpcap with TPACKET_V3, which is supported by the Linux kernel. The speedup comes from removing the overhead of copying packets from kernel space to user space; pcap also lets you apply packet filtering
  • XDP (eXpress Data Path) is an eBPF-based mechanism that also lowers the overhead
  • Some other approaches are covered in a Cloudflare blog post
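For example, interrupt coalescing can be inspected and disabled with ethtool (the interface name and the exact set of supported parameters depend on the NIC driver, so this is only an illustration):

ethtool -c eth0
ethtool -C eth0 rx-usecs 0 rx-frames 1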
