C++ For C# Developers: Part 49 – Ranges and Parallel Algorithms
source link: https://www.jacksondunstan.com/articles/6973
April 19, 2021 Tags: algorithm, generic, iterator, threading
Generic algorithms have been available in C++ for decades, but the last two versions of the language have really ramped up the functionality. C++17 added support for parallel execution of generic algorithms to easily take advantage of multi-core CPUs. Then C++20 added support for ranges, a composable version of generic algorithms that’s even closer to LINQ in C#. Today we’ll explore both of these!
Parallel Algorithms
Similar to ParallelEnumerable in .NET 5.0's version of LINQ, C++17 added overloads to many generic algorithms to enable specifying their "execution policy." C++ allows specifying two related factors: parallelism and sequencing. "Parallel" refers to whether the operations that make up the algorithm may be performed in parallel on multiple threads. "Unsequenced" refers to whether the operations may be vectorized, such as with the CPU's SIMD instructions, to operate on multiple elements of the collection at once. This table shows the options:
| Type | Instance | Parallel? | Unsequenced? | Version |
|------|----------|-----------|--------------|---------|
| `std::execution::sequenced_policy` | `std::execution::seq` | No | No | C++17 |
| `std::execution::unsequenced_policy` | `std::execution::unseq` | No | Yes | C++20 |
| `std::execution::parallel_policy` | `std::execution::par` | Yes | No | C++17 |
| `std::execution::parallel_unsequenced_policy` | `std::execution::par_unseq` | Yes | Yes | C++17 |
The fewer restrictions we put on the algorithm, the faster it may perform. The defaults we've had for decades are equivalent to seq, but we can enable both parallelism and vectorization with par_unseq, or just one of them with par or unseq.
The following program benchmarks std::sort on a std::vector&lt;uint32_t&gt; containing one million random integers. It uses std::sort in its default form as well as with all of the execution policies. It makes use of several C++ Standard Library features, including random number generation and time-checking:
```cpp
#include <execution>
#include <algorithm>
#include <chrono>
#include <vector>
#include <cassert>
#include <cstdint>
#include <random>
#include <limits>

template <typename TExecutionPolicy>
void Sort(std::vector<uint32_t>& v, const char* label, const TExecutionPolicy& ep)
{
    // Randomize the array
    std::random_device rd;
    std::mt19937 gen{ rd() };
    std::uniform_int_distribution dist{
        std::numeric_limits<uint32_t>::min(),
        std::numeric_limits<uint32_t>::max() };
    for (uint32_t i = 0; i < static_cast<uint32_t>(v.size()); ++i)
    {
        v[i] = dist(gen);
    }

    // Take the time before sorting
    std::chrono::steady_clock clock{};
    auto before{ clock.now() };

    // Use normal sort if execution policy is null
    // Otherwise use sort that has an execution policy
    if constexpr (std::is_same_v<TExecutionPolicy, std::nullptr_t>)
    {
        std::sort(v.begin(), v.end());
    }
    else
    {
        std::sort(ep, v.begin(), v.end());
    }

    // Take the time after sorting
    auto after{ clock.now() };
    auto elapsed{ after - before };
    auto ms{ std::chrono::duration_cast<std::chrono::milliseconds>(elapsed) };
    DebugLog(label, ms);

    // Make sure the array is actually sorted
    assert(std::is_sorted(v.begin(), v.end()));
}

int main()
{
    std::vector<uint32_t> v{};
    v.resize(1000000);
    Sort(v, "default", nullptr);
    Sort(v, "sequenced", std::execution::seq);
    Sort(v, "parallel", std::execution::par);
    Sort(v, "unseq", std::execution::unseq);
    Sort(v, "parallel_unsequenced", std::execution::par_unseq);
}
```
Compiling this with Microsoft Visual Studio 16.9.4 in Release mode and running on Windows 10 20H2 with an Intel Core i5-1135G7 yielded these results:
| Execution Policy | Time |
|------------------|------|
| default | 74ms |
| sequenced | 74ms |
| parallel | 23ms |
| unseq | 73ms |
| parallel_unsequenced | 16ms |
We see here that sequenced really is the same as not specifying an execution policy at all. Opting for unseq hardly made an improvement, but parallel cut the time by nearly 70%! parallel_unsequenced went even further and cut the time by over 78%!
A lot of these results will depend on compiler, Standard Library implementation, data type, container contents, OS, CPU and other hardware, and the chosen algorithm. As always, it’s best to measure the actual scenario being optimized to gain the most relevant timing data.
That said, it's clear from this timing that at least the std::sort algorithm on integers lends itself to parallelism. Rather than painstakingly implementing our own parallel sort function, we can simply pass an execution policy to std::sort and take advantage of the library version with any data type and any container type. Here's a few examples:
```cpp
#include <execution>
#include <algorithm>
#include <numeric>
#include <cstdint>
#include <vector>
#include <deque>

std::vector<uint32_t> v{};
v.resize(1000000);
std::iota(v.begin(), v.end(), 10);

// Parallel check that a function returns true for all elements
DebugLog(
    std::all_of(
        std::execution::par_unseq,
        v.begin(),
        v.end(),
        [](uint32_t x) { return x >= 10; })); // true

// Parallel search for an element that a function returns true for
DebugLog(
    std::distance(
        v.begin(),
        std::find_if(
            std::execution::par_unseq,
            v.begin(),
            v.end(),
            [](uint32_t x) { return x == 100; }))); // 90

// Parallel transform of elements into another container
// Output container even has a different type: a double-ended queue
std::deque<uint32_t> d{};
d.resize(v.size());
std::transform(
    std::execution::par_unseq,
    v.begin(),
    v.end(),
    d.begin(),
    [](uint32_t x) { return x * 2; });
DebugLog(d[0], d[1], d[2], d[3], d[4]); // 20, 22, 24, 26, 28
```
That's really all there is to parallel and unsequenced algorithms. The goal is to make parallelism easy: there's no need to manage threads or synchronize the algorithm's own work. Just pass an appropriate execution policy and let the Standard Library take care of the implementation, keeping in mind that any shared state our own callables touch must still be free of data races.
Ranges
The "ranges" portion of the C++20 Standard Library adds an alternative approach to the same generic algorithms we've already seen. For example, std::ranges::sort acts like std::sort except that it takes a "range" rather than a pair of iterators:
```cpp
#include <ranges>
#include <algorithm>
#include <vector>

std::vector<int> v{ 4, 5, 2, 1, 3 };

// Only pass a range, not a pair of iterators
std::ranges::sort(v);

DebugLog(v[0], v[1], v[2], v[3], v[4]); // 1, 2, 3, 4, 5
```
While not as flexible as the pairs of iterators we've passed to algorithms so far, this approach is less error-prone: we can't accidentally pass mismatched iterators and cause undefined behavior as an algorithm reads beyond the beginning or end of a container:
```cpp
#include <algorithm>
#include <vector>

std::vector<int> v{ 4, 5, 2, 1, 3 };
std::vector<int> v2{ 4, 5, 2, 1, 3 };

// Pass an iterator to the beginning of v but the end of v2
// find() can never get to the second iterator from the first by calling ++
// It'll read off the end of v, causing undefined behavior like a crash
auto it{ std::find(v.begin(), v2.end(), 6) };
DebugLog("found?", it == v.end()); // ???
```
Another new feature is support for "projection" functions. We can specify a function that each element is passed to, and whose return value the algorithm uses in place of the element, similar to std::transform. Here we use a projection function to sort by each player's points rather than by the Player object itself:
```cpp
#include <ranges>
#include <algorithm>
#include <vector>

struct Player
{
    const char* Name;
    int NumPoints;
};

std::vector<Player> p{ {"Alice", 30}, {"Bob", 20}, {"Carol", 100} };

std::ranges::sort(
    p,  // Range to sort
    {}, // Default comparison
    [](const Player& x) { return x.NumPoints; }); // Projection function

DebugLog(p[0].Name, p[1].Name, p[2].Name); // Bob, Alice, Carol
```
These are incremental improvements to the classic generic algorithms library, but where ranges really shine is with their system of "views." A view is like an IEnumerable in C# LINQ. It's not a collection, but instead a lazily-evaluated view of something that might be a collection. Combined with some operator overloading, we can easily compose many algorithms into one:
```cpp
#include <ranges>
#include <algorithm>
#include <vector>

using namespace std::ranges::views;

struct Player
{
    const char* Name;
    int NumPoints;
};

std::vector<Player> players{ {"Alice", 30}, {"Bob", 20}, {"Carol", 100} };

// Filter for players with more than 25 points
// Then transform players to remove a point
// Then reverse the order of the players
auto result = players
    | filter([](Player p) { return p.NumPoints > 25; })
    | transform([](Player p) { p.NumPoints--; return p; })
    | reverse;

for (const Player& p : result)
{
    DebugLog(p.Name, p.NumPoints); // "Carol, 99" then "Alice, 29"
}
```
The result here isn't evaluated until the range-based for loop actually iterates through the view. The view itself is composed via the overloaded | operator out of several "view adapters": filter, transform, and reverse. These view adapters are really just terse ways to create views. In this case, the filter function creates and returns a filter_view, transform creates a transform_view, and reverse creates a reverse_view.
It’s worth noting that this lazy evaluation can be more efficient than the eager version we’ve seen so far. That’s because there’s no need to allocate full-size temporary buffers to hold intermediate results. Instead, we simply need the current value being evaluated. Consider the classic version of the above ranges code:
```cpp
#include <algorithm>
#include <iterator>
#include <vector>

struct Player
{
    const char* Name;
    int NumPoints;
};

std::vector<Player> players{ {"Alice", 30}, {"Bob", 20}, {"Carol", 100} };

// Filter for players with more than 25 points
std::vector<Player> output{};
std::copy_if(
    players.begin(),
    players.end(),
    std::back_inserter(output),
    [](Player p) { return p.NumPoints > 25; });

// Then transform players to remove a point
std::transform(
    output.begin(),
    output.end(),
    output.begin(),
    [](Player p) { p.NumPoints--; return p; });

// Then reverse the order of the players
std::reverse(output.begin(), output.end());

for (const Player& p : output)
{
    DebugLog(p.Name, p.NumPoints); // "Carol, 99" then "Alice, 29"
}
```
In order to use std::copy_if and std::transform, we needed to provide a place to put the output. We didn't want to overwrite the existing players, so we needed to create output. This could have been really inefficient for a large amount of data. It also would have required a lot of dynamic growing of output as elements were inserted by copy_if. We could approximate with reserve to limit the number of grow operations, but there's often no fast, precise, and general way to know how much space to reserve.
Ranges eliminate this problem by operating on one element at a time. There’s never a need for a buffer that can hold all the elements. Instead, we’re just allocating enough for the current element.
We've seen that each kind of iterator has its own "tag" class that it derives from, and that C++20 provides concept equivalents of each of these. The same is also true for ranges. The major difference is that input_or_output_iterator has been replaced with just range:

| Iterator Concept | Range Concept |
|------------------|---------------|
| `std::input_or_output_iterator` | `std::ranges::range` |
| `std::input_iterator` | `std::ranges::input_range` |
| `std::output_iterator` | `std::ranges::output_range` |
| `std::forward_iterator` | `std::ranges::forward_range` |
| `std::bidirectional_iterator` | `std::ranges::bidirectional_range` |
| `std::random_access_iterator` | `std::ranges::random_access_range` |
| `std::contiguous_iterator` | `std::ranges::contiguous_range` |
A wide variety of ranges, views, and view adapters are available. The ranges library provides what’s essentially a mirror version of all of the generic algorithms we’ve seen so far. Thankfully, all of the naming also mirrors the existing functions so it’s easy to find a range equivalent.
Conclusion
C++17 allows us to specify how we want our generic algorithms to run. We can run them in the classic single-threaded way with no vector operations, or we can opt in to multi-threading and vector operations. Doing so is dead simple and can greatly improve the performance of the algorithms.
C++20 adds ranges, a lazy-evaluated version of generic algorithms. It’s much more composable, less error-prone, and sometimes even more efficient than the classic algorithms. It ends up resembling a much faster version of LINQ that eschews all the garbage creation and virtual function calls.
In the end, it’s our choice which of these options we want to use. All of them support a wide variety of common algorithms. We’re even free to build our own and have them work with the algorithms provided by the C++ Standard Library. All these options combine to significantly reduce the number of “raw” loops we need to write in C++.