Counting the number of matching characters in two ASCII strings

3 years ago

source link: https://lemire.me/blog/2021/05/21/counting-the-number-of-matching-characters-in-two-ascii-strings/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Counting the number of matching characters in two ASCII strings

Suppose that you give me two ASCII strings having the same number of characters. I wish to compute efficiently the number of matching characters (same position, same character). E.g., the strings ‘012c’ and ‘021c’ have two matching characters (‘0’ and ‘c’).

The conventional approach in C would look as follow:

uint64_t standard_matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    }
    return count;
}

There is nothing wrong with this code. An optimizing compiler can auto-vectorize this code so that it will do far fewer than one instruction per byte, given long enough strings.

However, it does appear that the routine looks at every character, one by one. So it looks like you are loading two values, then you are comparing and then incrementing a counter, for each character. So it might compile to over 5 instructions per character (prior to auto-vectorization).

What you can do instead is load the data in blocks of 8 bytes, into 64-bit integers as in the following code. Do not be mislead by the apparently expensive memcpy calls: an optimizing compiler will turn these function calls into a single load instruction.

uint64_t matching_bytes(char * c1, char * c2, size_t n) {
    size_t count = 0;
    size_t i = 0;
    uint64_t x, y;
    for(; i + sizeof(uint64_t) <= n; i+= sizeof(uint64_t)) {
      memcpy(&x, c1 + i, sizeof(uint64_t) );
      memcpy(&y, c2 + i, sizeof(uint64_t) );
      count += matching_bytes_in_word(x,y);
    }
    for(; i < n; i++) {
        if(c1[i]  == c2[i]) { count++; }
    }
    return count;
}

So we just need a function that can compare two 64-bit integers and find how many matching bytes there are. Thankfully there are fairly standard techniques to do so such as the following. (I borrowed part of the routine from Wojciech Muła.)

uint64_t matching_bytes_in_word(uint64_t x, uint64_t y) {
  uint64_t xor_xy = x ^ y;
  const uint64_t t0 = (~xor_xy & 0x7f7f7f7f7f7f7f7fllu) + 0x0101010101010101llu;
  const uint64_t t1 = (~xor_xy & 0x8080808080808080llu);
  uint64_t zeros = t0 & t1;
  return ((zeros >> 7) * 0x0101010101010101ULL) >> 56;
}

With this routine, you can bring down the instruction count to about 2 per character, including all the overhead and the data loading. It is strictly better than what you could with character-by-character processing by a factor of two (for long strings).

Though I seem to restrict the problem to ASCII inputs, my code actually counts the number of matching bytes. If you know that the input is ASCII, you can further optimize the routine.

I leave it as an exercise for the reader to write a function that counts the number of matching characters within a range, or to determine whether all characters in a given range match.

The proper way to solve this problem is with SIMD instructions, and most optimizing compilers should do that for you starting from a simple loop. However, if it is not possible and you have relatively long strings, then the approach I described could be beneficial.

My source code is available.

Published by

Daniel Lemire

A computer science professor at the University of Quebec (TELUQ). View all posts by Daniel Lemire

Recommend

thomasrayner.ca 4 years ago
Cache

Splitting Strings On Escaped Characters In PowerShell - Literal vs. Dynamic Cont...

Splitting Strings On Escaped Characters In PowerShell - Literal vs. Dynamic Content Before we get into this post, here’s a little required reading:

reactnativeexample.com 4 years ago
Cache

React Native component and hook to animate counting up or down to a number

use-count-up React/React Native component and hook to animate counting up or down to a number. Key features :trophy: Lighter implementation and smaller bundle size

www.codesd.com 4 years ago
Cache

Find matching strings in the same text string separated by a word using the word...

Find matching strings in the same text string separated by a word using the word office vba advertisements I need your help on this problem. I...

blogs.sas.com 3 years ago
Cache

Removing leading characters from SAS strings

Removing leading characters from SAS strings 4

blogs.sas.com 3 years ago
Cache

Removing trailing characters from SAS strings

Removing trailing characters from SAS strings 0

blogs.sas.com 3 years ago
Cache

Introducing TRIMS function to remove any leading and/or trailing characters from...

Introducing TRIMS function to remove any leading and/or trailing characters from SAS strings

siongui.github.io 3 years ago
Cache

[Golang] Iterate Over UTF-8 Strings (non-ASCII strings)

Iterate over UTF-8 or non-ASCII strings in Golang. Iterations by for or range keyword.

www.geeksforgeeks.org 2 years ago
Cache

Remove minimum number of characters so that two strings become anagram

Remove minimum number of characters so that two strings become anagramSkip to content

blog.appsignal.com 2 years ago
Cache

Sanitize Strings in Elixir with Pattern Matching and Recursion

In this post, we'll use two of Elixir's most powerful features - pattern matching and recursion - to sanitize a string by removing invalid Unicode "Specials" char...

www.geeksforgeeks.org 2 years ago
Cache

Maximize ascii sum removing K characters in String

Maximize ascii sum removing K characters in StringGiven a string s and an integer K, the task is to remove K characters from the string such that the sum of ASCII va...

Counting the number of matching characters in two ASCII strings

Counting the number of matching characters in two ASCII strings

Published by

Daniel Lemire

Recommend

About Joyk