Preprocessing Phase for C++17's Searchers
source link: https://www.tuicool.com/articles/hit/uuEnQvV
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Searchers from C++17 are a new way to perform efficient pattern lookups. The new standard offers three searchers: default_searcher
, boyer_moore_searcher
, and boyer_moore_horspool_searcher
. The last two implements algorithms that require some additional preprocessing for the input pattern. Is there a way to separate preprocessing time from the search time?
Short Reminder
In mylast article, I introduced searchers that were added into C++17.
Quoting the standard :
Searchers provide function object types for operations that search for a sequence [pat_first, pat_last)
in another sequence [first, last)
that is provided to the object's function call operator. The first sequence (the pattern to be searched for) is provided to the object's constructor, and the second (the sequence to be searched) is provided to the function call operator.
Each specialization of a class template specified in this subclause shall meet the CopyConstructible
and CopyAssignable
requirements.
template<class ForwardIterator, class Searcher> ForwardIterator search( ForwardIterator first, ForwardIterator last, const Searcher& searcher );
For now we have three searchers:
default_searcher boyer_moore_searcher boyer_moore_horspool_searcher
Last time, however, I didn't correctly summarise what a searcher is. This is because it's not immediately clear if you just look at the std::search
reference.
The basic idea is that each Searcher wraps the pattern you'd like to search. That means also doing some necessary preprocessing. Later — inside std::search
— each searcher exposes operator()(first, last)
— a way to look for a pattern into the [first, last)
range.
Additionally, since searchers are copyable and assignable, you can pass them around in your application.
Since a searcher is a separate object, we can do a little experiment and measure how much time it takes... let's see.
Demo Application
Source code: github.com/fenbf/articles/cpp17/searchers/searchers.cpp
How the test works:
- The app loads a file, like a book sample - 500kb of text.
- The whole file content is stored in one input string.
-
A pattern is selected.
- You can look for a string.
- You can look for N characters from the input string (from the start, the center, or the end).
-
The app uses several algorithms and runs each search
ITER
times.
Command line:
searcher.exe file iterations N|string Pos file - text file to load iterations - the number of iterations N|string - number of letters or a given string Pos - optional parameter when N is specified: 0 - the start of the input string 1 - the centre of the input string > 1 - end of the input string
For example:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town"
The above command will look for "the town" string in the input file "book-test.txt" and will perform 1000 iterations.
Another command:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10 1
This will look for 10 characters from the centre (pos=1).
Here's the code for the boyer_moore_horspool
version:
Searcher Preprocessing
In the first version of the demo application I used the following code:
RunAndMeasure("boyer_moore_horspool_searcher", [&]() { for (size_t i = 0; i < ITERS; ++i) { auto it = std::search(testString.begin(), testString.end(), std::boyer_moore_horspool_searcher( needle.begin(), needle.end())); if (it == testString.end()) std::cout << "The string " << needle << " not found\n"; } });
The above code measured the whole search. But now we can split it, or extract the preprocessing phase.
For example:
RunAndMeasure("boyer_moore_searcher init only", [&]() { for (size_t i = 0; i < ITERS; ++i) { std::boyer_moore_searcher b(needle.begin(), needle.end()); DoNotOptimizeAway(&b); } return 0; });
All data structures must be initialized in the constructor of the searcher objects. Later, only the operator()
is used to perform the search.
Some Performance Results
Here's what I got from running a few tests.
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 1000 1 string length: 547412 test iterations: 1000 needle from the center... pattern length: 1000 string::find: 207.235 ms default searcher: 336.469 ms boyer_moore_searcher init only: 4.65379 ms boyer_moore_searcher: 33.383 ms boyer_moore_horspool_searcher init only: 0.926099 ms boyer_moore_horspool_searcher: 31.652 ms
When searching for 100 letters from the centre of the input string, both of the new algorithms were faster than the default searcher and string::find
. boyer_moore
uses more time to perform the initialisation than boyer_moore_horspool
(it creates two lookup tables, rather than one, so it will use more space and preprocessing). But it looks like boyer_moore
search time is a bit faster: 33ms - 4.6ms
vs 31.6 - 0.92ms
.
The cost of preprocessing in boyer_moore
is more visible if you make the pattern even larger:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 10000 1 string length: 547412 test iterations: 1000 needle from the center... pattern length: 10000 string::find: 154.501 ms default searcher: 291.107 ms boyer_moore_searcher init only: 104.912 ms boyer_moore_searcher: 126.098 ms boyer_moore_horspool_searcher init only: 6.35085 ms boyer_moore_horspool_searcher: 25.0702 ms
104ms vs. 6ms!
How about more realistic searchers and patterns. It's probably quite rare to look for 1000 letters...
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "the town" string length: 547412 test iterations: 1000 needle is a string... pattern length: 8 string::find: 32.6278 ms default searcher: 57.7197 ms boyer_moore_searcher init only: 0.446075 ms boyer_moore_searcher: 21.9445 ms boyer_moore_horspool_searcher init only: 0.337122 ms
When looking for "the town" (appears in line 711 out of 9469 line). The preprocessing seems to be super fast, and the new algorithms could beat the string::find
version.
If the string is longer and positioned in near the end of the file:
.\searchers.exe ..\..\SampleBooks\book-test.txt 1000 "This Web site includes information about Project" string length: 547412 test iterations: 1000 needle is a string... pattern length: 48 string::find: 60.324 ms default searcher: 408.87 ms boyer_moore_searcher init only: 0.670692 ms boyer_moore_searcher: 125.899 ms boyer_moore_horspool_searcher init only: 0.326464 ms boyer_moore_horspool_searcher: 127.477 ms
Here, when looking for "This Web site includes information about Project" - which is located at the end of the file (a single occurrence) Boyer-Moore algorithms are 2x slower than string::find
.
As usual, I encourage you to perform your own tests.
Summary
In this post, I wanted to stress that each searcher can perform some initialization in its constructor. What's more, searchers can be initialized once and then passed around in the application — might be useful when you search for the same pattern over and over.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK