Destroying C with 20 lines of Haskell: wc
source link: https://0xd34df00d.me/posts/2020/02/destroying-c-with-20-lines-of-haskell.html
tl;dr: today we’ll look at implementing a toy `wc` command that is about 4-5 times faster than the corresponding GNU Coreutils implementation.
So I’ve recently come across a post by Chris Penner describing a Haskell implementation of the Unix `wc` command. Chris did a great job optimizing the Haskell version as well as showing how some high-level primitives (monoids and streaming, for one) turn out to be useful here, although the result was still a bit slower than C. There’s also a parallel version that relies a lot on the monoidal structure of the problem, and that one actually beats C.
But that post left me wondering: is it possible to do better without resorting to parallel processing?
Turns out the answer is yes. With some quite minor tweaks, the Haskell version manages to beat the hell out of the C version that presumably has decades of man-hours put into it.
Experimental setup
As usual, let’s go through the benchmarking procedure first.
Input data
I’ve downloaded this file and concatenated it with itself so that the total size is about 1.8 gigabytes:
```
% for i in `seq 1 10`; cat part.txt >> test.txt
% du -sh test.txt
1.8G    test.txt
```
By the way, the file resides on a tmpfs partition, so there is no disk IO involved.
Hardware and software
I’m running Gentoo Linux on a Core i7 4770 with 32 gigabytes of RAM, most of it free during the benchmarks.
All the Haskell code is built using ghc 8.8.2.
I’m competing against coreutils 8.31 built using gcc 9.2 with `-O2 -march=native`:
```
% wc --version
wc (GNU coreutils) 8.31
Packaged by Gentoo (8.31-r1 (p0))
```
By the way, that `-march=native` part is a little bit of cheating in favour of C, since the resulting Haskell binary can run on any x86-64 machine, while `wc` compiled with that flag can only run on my CPU and newer ones (that is, assuming the compiler actually used some of the newer SIMD extensions). But, well, let’s be kind and give the C version a bit of an advantage.
Measuring
Measurements are done using `time`. `time` shows both user and system time, but I only consider user time for comparison, since:
- System time proves to be quite consistently equal to about 0.35 s.
- I’m not benchmarking the kernel anyway, so I’m curious how much time is spent in my code.
As with any fully deterministic benchmark, I run each executable several times (5 in this case) and report the minimal (user) time.
Baseline
So how does the C code perform? Let’s run `wc` to find out!
Benchmarking as described above gives us 7.20 seconds of user time. So, this sets the baseline for our implementation.
Haskell
Let’s see where we start. I’ll take this version from Chris’ post, changing the function name and renaming `cs` to `bs` since we’re counting bytes anyway:
```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

wc :: BS.ByteString -> (Int, Int, Int)
wc s = let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
        in (bs, ws, ls)
  where
    go :: (Int, Int, Int, Bool) -> Char -> (Int, Int, Int, Bool)
    go (!bs, !ws, !ls, !wasSpace) c =
      let addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace = 0
                  | isSpace c = 1
                  | otherwise = 0
       in (bs + 1, ws + addWord, ls + addLine, isSpace c)
```
Sure, there are better-performing single-threaded versions mentioned in Chris’ post, but this one is a bit friendlier to the kind of changes I’ll be making later on. Also, I don’t really need the monoidal structure if I’m not going to use associativity, and I’m not going to use associativity as long as I keep this single-threaded.
Anyway, according to the numbers in Chris’ post, this version is about 9 times slower than `wc`. I was unable to reproduce this: on my machine it’s 433% of `wc`, taking 31.23 seconds to run.
Also, this version has a tiny bug (it doesn’t count the last word if it’s not followed by a space or a newline), but I’m not fixing it just yet, especially since the fix is trivial and amounts to considering the last `Bool` field of the tuple that’s currently being merely dropped, so it shouldn’t affect the performance that much. We’ll fix this in the final version.
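To make the bug concrete, here’s a quick sketch of my own (not from the original post) that runs the tuple-based version above on input whose last word isn’t followed by whitespace; the word count comes out one short:

```haskell
{-# LANGUAGE BangPatterns #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char (isSpace)

-- the tuple-based version from above, verbatim
wc :: BS.ByteString -> (Int, Int, Int)
wc s = let (bs, ws, ls, _) = BS.foldl' go (0, 0, 0, False) s
        in (bs, ws, ls)
  where
    go (!bs, !ws, !ls, !wasSpace) c =
      let addLine | c == '\n' = 1
                  | otherwise = 0
          addWord | wasSpace = 0
                  | isSpace c = 1
                  | otherwise = 0
       in (bs + 1, ws + addWord, ls + addLine, isSpace c)

main :: IO ()
main = do
  print $ wc (BS.pack "hello world\n")  -- (12,2,1): correct
  print $ wc (BS.pack "hello world")    -- (11,1,0): "world" is never counted
```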
Anyway, how can this be improved?
Records to the rescue
Firstly, I don’t like big tuples, especially tuples having elements of the same type — it’s just too error-prone for my small attention span. So let’s replace the tuple with a record:
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString.Lazy.Char8 as BS
import Data.Char

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Bool
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 False) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) (isSpace c)
      where
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace = 0
                | isSpace c = 1
                | otherwise = 0
```
We don’t use bang patterns anymore, since we effectively made the fields of the record strict via the `{-# LANGUAGE Strict #-}` pragma. This wouldn’t change the performance much, would it?
Turns out it does, and quite a lot: by a factor of four! This version takes 7.56 seconds, or 105% of `wc`, so we’re almost as fast as the baseline now! How come?
Well, the answer is simple, and it’s also hinted at in the original post: once we’ve defined a record type with strict fields, the compiler has an easier time figuring out that it can unpack the contents of those fields. So we effectively have
```haskell
data State = State
  { bs :: {-# UNPACK #-} !Int
  , ws :: {-# UNPACK #-} !Int
  , ls :: {-# UNPACK #-} !Int
  , wasSpace :: Bool
  }
```
and this saves us an indirection and a memory allocation per `Int` field (even if the allocated values never leave the nursery and are thus quite cheap).
CSE
Next, note that we’re computing `isSpace c` at least once for every character, and most of the time we do it twice. Let’s make sure we indeed only call `isSpace` once!
So we change our `go` function like this:
```haskell
go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
  where
    isSp = isSpace c

    addLine | c == '\n' = 1
            | otherwise = 0
    addWord | wasSpace = 0
            | isSp = 1
            | otherwise = 0
```
The result? The run time more than halves, taking 2.93 seconds, or 41% of `wc`. So we’re more than twice as fast as `wc` now, still with pure and idiomatic code!
I’m not sure why ghc doesn’t do this optimization for us, especially given that it never increases the amount of work done in this particular case, while doing it manually (as in the code above) improves the run time considerably.
Compiler options and pragmas
Another thing worth trying is to play around with compiler options. The two biggest knobs are the optimization level and the codegen backend, and, in addition to those, we could try inlining `wc`. There is no good systematic approach here, so we’ll just try these things in all possible combinations:
- LLVM codegen (via `-fllvm`): it sometimes does better on hardcore number-crunching code, so it might help here too.
- Optimization level (via `-O2`): most of the time the default `-O` is good enough, and there is no observable difference between that and `-O2`, but why not try?
- Inlining the function via the `{-# INLINE wc #-}` pragma. I don’t really think it would help, since this function is called only once, so the corresponding overhead is negligible, and I don’t expect any further optimization opportunities to kick in after inlining. But, again, let’s try and see.
I’m not even showing the resulting code since the function body isn’t affected at all.
So here are the results in tabular form:
| LLVM | -O2 | Inlining | Time, s | % of `wc` time |
|:----:|:---:|:--------:|--------:|---------------:|
|      |     |          |    2.93 |             41 |
|  ✓   |     |          |    3.96 |             55 |
|  ✓   |  ✓  |          |    2.61 |             36 |
|      |  ✓  |          |    2.59 |             36 |
|      |     |    ✓     |    2.23 |             31 |
|  ✓   |  ✓  |    ✓     |    2.02 |             28 |
|      |  ✓  |    ✓     |    2.01 |             28 |
|  ✓   |     |    ✓     |    2.01 |             28 |
Not sure why inlining helps, since this function is called only once (so the function call overhead doesn’t matter), and I don’t really see any opportunity for other optimizations to kick in after inlining, but I won’t argue with the numbers.
On the other hand, `-O2` really helps, while LLVM doesn’t give any improvement; in fact, it makes things worse when used alone.
Anyway, we could just as well stop here and push this version to prod. This is still quite idiomatic Haskell, and C already looks pretty beaten up (28%!). If I were to present the results somewhere while showing off the elegance of functional programming, this is the version I’d present.
But let’s keep going (and let’s keep `-fllvm` in our flags, cause why not).
More unpacking
The speedup we got after replacing the tuple with a custom record suggests another possible optimization.
Note that the `Bool` we have in the record doesn’t get unpacked, since it has more than one constructor. What if we replace it with an `Int`, where `1` denotes `True` and `0` denotes `False`? This is a bit ugly, but, you know, we’re competing with C, so let’s do it!
```haskell
data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | isSpace c = 1
             | otherwise = 0
        addLine | c == '\n' = 1
                | otherwise = 0
        addWord | wasSpace == 1 = 0
                | isSp == 1 = 1
                | otherwise = 0
```
This by itself doesn’t change much (I’m still getting the same results), but it opens up some further opportunities. Namely, notice that the expression for `addWord` can be translated from guards (which are essentially nested `if`s) into a single arithmetic expression:
```haskell
addWord = (1 - wasSpace) * isSp
```
Proving the equivalence is left as an exercise to the reader.
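Instead of doing the proof, we can also check all four cases mechanically. A tiny sketch of mine (not from the article), where `guarded` is the original guard-based definition and `arith` is the arithmetic one:

```haskell
-- Both flags are 0 or 1, so four cases cover everything.
guarded :: Int -> Int -> Int
guarded wasSpace isSp
  | wasSpace == 1 = 0
  | isSp == 1     = 1
  | otherwise     = 0

arith :: Int -> Int -> Int
arith wasSpace isSp = (1 - wasSpace) * isSp

main :: IO ()
main = print $ and [ guarded w s == arith w s | w <- [0, 1], s <- [0, 1] ]
-- prints True
```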
Anyway, this change further reduces the run time to 1.91 s, or 26% of `wc`.
Less Unicode
What if we wanted to push just a little further?
A reasonable compromise seems to be to ignore the presence of Unicode spaces outside of ASCII, avoiding the `isSpace` function (which didn’t do its Unicode job anyway, since we are looking at the input a byte at a time). It’s crucial to note that this does not prevent us from correctly counting characters in later developments of this little `wc` substitute, but I’ll elaborate on this in subsequent posts.
While we’re at it, we’ll also import `Data.ByteString.Lazy` instead of `Data.ByteString.Lazy.Char8` to avoid unpacking and repacking bytes into `Char`s.
```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

module Data.WordCount where

import qualified Data.ByteString.Lazy as BS

data State = State
  { bs :: Int
  , ws :: Int
  , ls :: Int
  , wasSpace :: Int
  }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 0) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp
{-# INLINE wc #-}
```
This change results in quite a dramatic improvement, reducing the run time to 1.45 s, or 20% of the `wc` time, even with the bit of an advantage given to `wc` in the form of `-march=native`.
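The `c == 32 || c - 9 <= 4` trick relies on `Word8` arithmetic wrapping around: for `c < 9`, `c - 9` wraps to a value far above 4. Here is a quick check of my own (not from the article) confirming that it agrees with `Data.Char.isSpace` on every ASCII byte:

```haskell
import Data.Char (chr, isSpace)
import Data.Word (Word8)

-- the byte-level space predicate from the final version above
isSpB :: Word8 -> Bool
isSpB c = c == 32 || c - 9 <= 4  -- space, or one of \t \n \v \f \r

main :: IO ()
main = print $ all (\c -> isSpB c == isSpace (chr (fromIntegral c))) [0 .. 127 :: Word8]
-- prints True
```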
And, I think, that’s about as good as it can get with reasonable effort, so let’s stop here.
Fixing the off-by-one error
The problem is that we don’t count the last word as a word if it isn’t followed by a newline or a space character. Luckily, the fix is easy: just take the `wasSpace` field into account and set its initial value to `1`, resulting in code like
```haskell
wc s = (bs, ws + 1 - wasSpace, ls)
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s
    ...
```
Obviously, this does not affect performance at all.
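As a sanity check, here’s a sketch of my own (assembling the final byte-level version with the fix, over a strict `ByteString`) showing that a trailing word is now counted whether or not it’s followed by whitespace:

```haskell
{-# LANGUAGE Strict #-}
{-# LANGUAGE RecordWildCards #-}

import qualified Data.ByteString as BS
import qualified Data.ByteString.Char8 as C8

data State = State { bs :: Int, ws :: Int, ls :: Int, wasSpace :: Int }

wc :: BS.ByteString -> (Int, Int, Int)
wc s = (bs, ws + 1 - wasSpace, ls)  -- account for a possible unterminated last word
  where
    State { .. } = BS.foldl' go (State 0 0 0 1) s

    go State { .. } c = State (bs + 1) (ws + addWord) (ls + addLine) isSp
      where
        isSp | c == 32 || c - 9 <= 4 = 1
             | otherwise = 0
        addLine | c == 10 = 1
                | otherwise = 0
        addWord = (1 - wasSpace) * isSp

main :: IO ()
main = do
  print $ wc (C8.pack "hello world\n")  -- (12,2,1)
  print $ wc (C8.pack "hello world")    -- (11,2,0): the last word is counted now
  print $ wc C8.empty                   -- (0,0,0): no spurious word on empty input
```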
Bonus: reducing system time
Remember those 0.35 seconds spent in the kernel that we agreed not to consider? How about trying to cut that down as well?
I haven’t yet shown how I actually use the `wc` function, so here’s the `main`:
```haskell
main :: IO ()
main = do
  [path] <- getArgs
  contents <- BS.readFile path
  print $ wc contents
```
`readFile` wastes some cycles unnecessarily copying the data from the kernel to userspace. We could eliminate that overhead by using something like `mmap`. Let’s take, for instance, the bytestring-mmap package, which provides `ByteString`s over `mmap`ed files with zero copying. We’ll change the `main` to go along the lines of
```haskell
main :: IO ()
main = do
  [path] <- getArgs
  contents <- unsafeMMapFile path
  print $ wc contents
```
and since `unsafeMMapFile` returns a strict `ByteString`, we’ll also change our word counting module to import `Data.ByteString` instead of `Data.ByteString.Lazy`.
The result? We’ve significantly reduced the time spent in the kernel: it’s now 0.04-0.06 seconds instead of the 0.35 s we had previously. This has no statistically significant effect on our user time, though.
One might say the downside is that we’ve lost referential transparency: if the file changes, our `ByteString` will change as well. But I’d say that doesn’t matter in our case, since we only observe the `ByteString` once.
Conclusion
This section will be quite short.
So we’ve managed to smash a C program that has been looked at by thousands of eyes of quite hardcore low-level Unix hackers over a few decades. We did this with a handful of lines of pure, mutation-free, idiomatic Haskell, achieving about 4 to 5 times the throughput of the C version and spending less than an hour on all the optimizations.
The resulting code is available on GitHub.
Stay tuned for the second part, where we will investigate shaping all this into an actual `wc` substitute, where different statistics can be turned on or off, all while not computing what the user hasn’t requested.