

Even quicker byte count
source link: https://llogiq.github.io/2016/09/27/count.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Even quicker byte count
27 September 2016
Last time, I thought I’d sped up line counting (which both xi-rope and ripgrep use) pretty good. I had two versions (one using POPCNT and being faster on newer machines, and one using old-fashioned bit-hacks running faster on older ones). Along comes the phenomenal Veedrac and sends a PR with a completely unsafe version that nonetheless not only works, but smokes every benchmark competitor by a factor of at least two.
I’ve never been so happy for being beaten in a performance contest.
Veedrac’s code is a tour de force. Pointer arithmetic, array reduction, bit
manipulation, you name it. Like my fastest version, it reduces the count in
steps, but the highest step increment is a whopping 8K! Not only that, it
manages to squeeze the intermediate count for those 8192 bytes into 4 usize
s.
Now, how does that work? There’s actually a simple trick to it. Apart from using multiple usizes to keep intermediate counts, unlike my solution in words instead of bytes, which allows it to add more than 255 values and combining those counts only when the ~8k bytes have been screamed through. Hyper-screaming indeed.
Apart from using 4 counters in parallel (which complicates matters just a bit), the algorithm is as follows:
- Per usize, get the one-bits (via subtraction/masking, as my count did)
- Collect up to 255×8 one-bit values by simple addition
- combine each two bytes into one short word by shift + mask + addition (this is how it looks like on 64 bit machines, on 32 bit machines, only half the size is used):
- Now we have four 16-bit entity per 64-bit word that carry numbers. Note that the number of each value is strictly less than 255 which means we can add two of them in a 16-bit half-word without overflow. This is exactly what the algorithm does (if the slice is large enough).
- finally, the counts are reduced by our multiply + shift trick from last time,
only that we multiply by a value that has one bit set per 16bit instead of per
byte (it looks like
0x1000100010001
in hex on 32bit machines, and we get that number portably by dividingstd::usize::MAX / 0xFFFF
):
(for 32-bit architectures, we obviously only shift by 16)
Remember we do this by four 64-bit words at once, so each loop iteration counts eight kilobytes. We could in theory save even more by doing the multiply-reduction only every 128 loops (because after that the counts could overflow), but the savings are probably negligible.
Since not all slices are that big, we do another round for 1 kilobyte,
sixtyfour bytes and finally single usize
s (Note that on 32 bit machine, we
need to half the numbers).
In conclusion: A clever way to use the available bits that I did not see before allowed Veedrac to amortize the count reductions with much more bytes counted. No wonder the code is faster.
PS.: I tried reproducing this with much less unsafe code, but couldn’t even get close. So the unsafe code stays for now. I have reviewed and tested the code and am confident it will run perfectly fine.
Recommend
-
112
I found previously that CDS and AOT can have a dramatic effect when used to speed up JVM startup for a simple “Hello World”. Now, I want to see how effective that ac...
-
53
It's not just Android P getting an update today. An update is rolling out to the Wear OS app in Google Play. It'll add a handful of minor changes to your w... by Ryan Whitwam in Applications, Google, News, Wear OS
-
45
SSH Generator Simple tool to generate ssh keys. Every time when i used to create ssh keys it generally took at least 10mins as a beginner. It will not take more than 2mins to create your ssh keys using this.
-
53
Quicker 是一款 Windows 平台的效率工具,由独立开发者@崔亮开发。Quicker 的使用逻辑和软件启动器比较接近,在快捷调用的基础上增加了更深度的功能。它的亮点在于用户可以通过 可视化编程 的方式编写属于自己的动作,官方共享库...
-
40
Windows - @depress - 欢迎用了比较久的大佬留言,我想知道,第一,是否真的使效率提升很多?第二,学习成本如何?第三,这三者功能上是否有冲突?推荐安装哪个?简单看了介绍,有点心动。
-
7
Quicker Local Maven Builds Maven is a depth-first recursive build technology. Projects can be multiple modules in one repo, and if ‘mvn install’ is lau...
-
16
Advanced Mobile Location makes your emergency calls quicker and easier
-
9
Wednesday, 19 May 2021 12:07 There are Now Quicker, Better Ways to Find that Perfect Employee By Hiremii Limited ...
-
9
Birchtree By Matt Birchler I've been writing here since 2010! Back when personal blogs were all the rage. Kids, a...
-
8
Count of pairs having even and odd LCM from an arraySkip to content
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK