33

Preprocessing Phase for C++17's Searchers

 5 years ago
source link: https://www.tuicool.com/articles/hit/uuEnQvV
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Searchers from C++17 are a new way to perform efficient pattern lookups. The new standard offers three searchers: default_searcher , boyer_moore_searcher , and boyer_moore_horspool_searcher . The last two implements algorithms that require some additional preprocessing for the input pattern. Is there a way to separate preprocessing time from the search time?

Short Reminder

In mylast article, I introduced searchers that were added into C++17.

Quoting the standard :

Searchers provide function object types for operations that search for a sequence [pat_first, pat_­last) in another sequence [first, last) that is provided to the object's function call operator. The first sequence (the pattern to be searched for) is provided to the object's constructor, and the second (the sequence to be searched) is provided to the function call operator.

Each specialization of a class template specified in this subclause shall meet the CopyConstructible and CopyAssignable requirements.

template<class ForwardIterator, class Searcher>
ForwardIterator search( ForwardIterator first, ForwardIterator last,
                        const Searcher& searcher );

For now we have three searchers:

default_searcher
boyer_moore_searcher
boyer_moore_horspool_searcher

Last time, however, I didn't correctly summarise what a searcher is. This is because it's not immediately clear if you just look at the std::search reference.

The basic idea is that each Searcher wraps the pattern you'd like to search. That means also doing some necessary preprocessing. Later — inside std::search — each searcher exposes operator()(first, last) — a way to look for a pattern into the [first, last) range.

Additionally, since searchers are copyable and assignable, you can pass them around in your application.

Since a searcher is a separate object, we can do a little experiment and measure how much time it takes... let's see.

Demo Application

Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp

How the test works:

  • The app loads a file, like a book sample - 500kb of text.
  • The whole file content is stored in one input string.
  • A pattern is selected.
    • You can look for a string.
    • You can look for N characters from the input string (from the start, the center, or the end).
  • The app uses several algorithms and runs each search ITER times.

Command line:

searcher.exe file iterations N|string Pos
file - text file to load
iterations - the number of iterations
N|string - number of letters or a given string
Pos - optional parameter when N is specified:
    0 - the start of the input string
    1 - the centre of the input string
    > 1 - end of the input string

For example:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"

The above command will look for "the town" string in the input file "book-test.txt" and will perform 1000 iterations.

Another command:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10 1

This will look for 10 characters from the centre (pos=1).

Here's the code for the boyer_moore_horspool version:

Searcher Preprocessing

In the first version of the demo application I used the following code:

RunAndMeasure("boyer_moore_horspool_searcher", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        auto it = std::search(testString.begin(), testString.end(),
            std::boyer_moore_horspool_searcher(
                needle.begin(), needle.end()));
        if (it == testString.end())
            std::cout << "The string " << needle << " not found\n";
    }
});

The above code measured the whole search. But now we can split it, or extract the preprocessing phase.

For example:

RunAndMeasure("boyer_moore_searcher init only", [&]() {
    for (size_t i = 0; i < ITERS; ++i)
    {
        std::boyer_moore_searcher b(needle.begin(), needle.end());
        DoNotOptimizeAway(&b);
    }
    return 0;
});

All data structures must be initialized in the constructor of the searcher objects. Later, only the operator() is used to perform the search.

Some Performance Results

Here's what I got from running a few tests.

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 1000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 1000
string::find: 207.235 ms
default searcher: 336.469 ms
boyer_moore_searcher init only: 4.65379 ms
boyer_moore_searcher: 33.383 ms
boyer_moore_horspool_searcher init only: 0.926099 ms
boyer_moore_horspool_searcher: 31.652 ms

When searching for 100 letters from the centre of the input string, both of the new algorithms were faster than the default searcher and string::find . boyer_moore uses more time to perform the initialisation than boyer_moore_horspool (it creates two lookup tables, rather than one, so it will use more space and preprocessing). But it looks like boyer_moore search time is a bit faster: 33ms - 4.6ms vs 31.6 - 0.92ms .

The cost of preprocessing in boyer_moore is more visible if you make the pattern even larger:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000 1
string length: 547412
test iterations: 1000
needle from the center...
pattern length: 10000
string::find: 154.501 ms
default searcher: 291.107 ms
boyer_moore_searcher init only: 104.912 ms
boyer_moore_searcher: 126.098 ms
boyer_moore_horspool_searcher init only: 6.35085 ms
boyer_moore_horspool_searcher: 25.0702 ms

104ms vs. 6ms!

How about more realistic searchers and patterns. It's probably quite rare to look for 1000 letters...

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 8
string::find: 32.6278 ms
default searcher: 57.7197 ms
boyer_moore_searcher init only: 0.446075 ms
boyer_moore_searcher: 21.9445 ms
boyer_moore_horspool_searcher init only: 0.337122 ms

When looking for "the town" (appears in line 711 out of 9469 line). The preprocessing seems to be super fast, and the new algorithms could beat the string::find version.

If the string is longer and positioned in near the end of the file:

.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "This Web site
 includes information about Project"
string length: 547412
test iterations: 1000
needle is a string...
pattern length: 48
string::find: 60.324 ms
default searcher: 408.87 ms
boyer_moore_searcher init only: 0.670692 ms
boyer_moore_searcher: 125.899 ms
boyer_moore_horspool_searcher init only: 0.326464 ms
boyer_moore_horspool_searcher: 127.477 ms

Here, when looking for "This Web site includes information about Project" - which is located at the end of the file (a single occurrence) Boyer-Moore algorithms are 2x slower than string::find .

As usual, I encourage you to perform your own tests.

Summary

In this post, I wanted to stress that each searcher can perform some initialization in its constructor. What's more, searchers can be initialized once and then passed around in the application — might be useful when you search for the same pattern over and over.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK