0

Implementing ‘strlen’ using SVE

 1 year ago
source link: https://lemire.me/blog/2022/12/19/implementing-strlen-using-sve/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Skip to content

Daniel Lemire's blog

Daniel Lemire is a computer science professor at the University of Quebec (TELUQ) in Montreal. His research is focused on software performance and data engineering. He is a techno-optimist and a free-speech advocate.

Implementing ‘strlen’ using SVE

In C, the length of a string in marked by a 0 byte at the end of the string. Thus to determine the length of the string, one must scan it, looking for the 0 byte. Recent ARM processors have a powerful instruction set (SVE) that is well suited for such problems. It allows you to load large registers at once and to do wide comparisons (comparing many bytes at once).

Yet we do not want to read too much data. If you read beyond the string, you could hit another memory page and trigger a segmentation fault. This could crash your program.

Thankfully, SVE comes with a load instruction that would only fault on the ‘first active element’: as long as the first element you are loading is valid, then there is no fault.

With this in mind, a simple algorithm to compute the length of a C string is as follows:

  1. Load a register.
  2. Compare each byte in it to 0.
  3. If any comparison matches, then locate the match and return the corresponding length.
  4. If not, increment by the register size (given by svcntb()), and repeat.

Using intrinsics, the code looks as follows…

size_t sve_strlen(const char *s) {
  size_t len = 0;
  while (true) {
    svuint8_t input = svldff1_u8(svptrue_b8(), (const uint8_t *)s + len);
    svbool_t matches = svcmpeq_n_u8(svptrue_b8(), input, 0);
    if (svptest_any(svptrue_b8(), matches)) {
      return len + svlastb_u8(svbrka_z(matches, matches), svindex_u8(0, 1));
    }
    len += svcntb();
  }
}

In assembly, the code looks as follows…

mainloop:
        ldff1b  { z0.b }, p0/z, [x10, x8]
        add     x8, x8, x9
        cmpeq   p1.b, p0/z, z0.b, #0
        b.eq    .mainloop

I do not discuss the details of my function, but it assumes implicitly that the underlying register size is no larger than 256. I would recommend adding a check that svcntb() is no larger than 256 before using this function in a real system.

Benchmarking against your system’s strlen function is difficult methodologically. Nevertheless, we can measure the number of instructions retired for sizeable strings as an indication of the relative speed. Using GCC 11 on a graviton 3 system (Amazon), I get the following metrics:

system’s strlen 0.23 instructions per byte
SVE 0.15 instructions per byte

Though I do not advocate adopting SVE as a replacement for strlen at this time, the potential is interesting, considering that I threw together my implementation in minutes.

My source code is available.

Credit: Thanks to Denis Yaroshevskiy for inviting me to look at non-faulting loads.

Update: It turns out that strlen was one of the examples that ARM used in its slides presenting SVE. At a glance, their implementation looks like mine but with more sophistication.

Published by

2ca999bef9535950f5b84281a4dab006?s=56&d=mm&r=g

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire

Leave a Reply Cancel reply

Your email address will not be published. The comment form expects plain text. If you need to format your text, you can use HTML elements such strong, blockquote, cite, code and em. For formatting code as HTML automatically, I recommend tohtml.com.

Comment *

Name *

Email *

Website

Save my name, email, and website in this browser for the next time I comment.

Receive Email Notifications?

You may subscribe to this blog by email.

Post navigation


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK