

A modulo multiplication algorithm that is 2x faster than compiler implementation
source link: https://codeforces.com/blog/entry/111566

Given $k$, $P$, and $n$ numbers $a_1, a_2, \ldots, a_n$ ($0 \le a_i < P$), compute each $k a_i \bmod P$.
I came up with an algorithm that is almost twice as fast as the compiler's implementation (when $P$ is a compile-time constant), which can effectively speed up NTT and some DPs.
First of all,
$$ka \bmod P = P\left\{\frac{ka}{P}\right\},$$
where $\{x\}$ is the fractional-part function. The point of this identity is that $ka \bmod P$ depends only on the fractional part of $\frac{ka}{P}$.
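As a tiny concrete check (my own example, not from the post), take $k = 5$, $a = 4$, $P = 7$:
$$ka = 20, \qquad P\left\{\frac{ka}{P}\right\} = 7 \cdot \frac{6}{7} = 6 = 20 \bmod 7.$$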
By analogy with Barrett reduction, let $p \approx \frac{2^{64}k}{P}$; then
$$ka \bmod P \approx P\left\{\frac{pa}{2^{64}}\right\} = \frac{(pa \bmod 2^{64})\,P}{2^{64}}.$$
There are two rounding choices to make:
- $\frac{2^{64}k}{P}$ does not necessarily work out as an integer, so $p$ must be rounded down or up;
- the last formula does not necessarily work out as an integer, so it must be rounded down or up.
We choose $p = \left\lceil\frac{2^{64}k}{P}\right\rceil$ and round the final formula down. The former biases the result upward and the latter biases it downward, so intuitively the two errors cancel and the result is more accurate.
Theorem: let $p = \left\lceil\frac{2^{64}k}{P}\right\rceil$. When $0 \le a \le \frac{2^{64}}{P}$, rounding the formula down gives the exact answer: $\left\lfloor\frac{(pa \bmod 2^{64})\,P}{2^{64}}\right\rfloor = ka \bmod P$.
Proof: written up separately and not reproduced here; a short sketch follows.
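Since the proof itself is not included on this page, here is a short error-analysis sketch (my reconstruction under the stated bound on $a$, not the author's write-up). Write $p = \left\lceil\frac{2^{64}k}{P}\right\rceil = \frac{2^{64}k + r}{P}$ with $r = (-2^{64}k) \bmod P$, so $0 \le r < P$, and write $ka = qP + m$ with $m = ka \bmod P$. Then
$$\frac{pa}{2^{64}} = \frac{ka + \frac{ar}{2^{64}}}{P} = q + \frac{m + \frac{ar}{2^{64}}}{P}.$$
If $ar < 2^{64}$, then $m + \frac{ar}{2^{64}} < P$, so the fractional part of $\frac{pa}{2^{64}}$ is exactly $\frac{m + ar/2^{64}}{P}$, and
$$\left\lfloor P\left\{\frac{pa}{2^{64}}\right\}\right\rfloor = \left\lfloor m + \frac{ar}{2^{64}}\right\rfloor = m = ka \bmod P,$$
where $P\left\{\frac{pa}{2^{64}}\right\} = \frac{(pa \bmod 2^{64})\,P}{2^{64}}$ is exactly what the code computes. Since $r \le P - 1$, the condition $a \le \frac{2^{64}}{P}$ guarantees $ar < 2^{64}$ (the case $a = 0$ is trivial).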
const int P = 998244353;
// Computes a[i] = k * a[i] % P for all i, using one 64x64 -> high-64 multiply
// per element instead of a division-based modulo.
void calc(int n, int k, int a[]) {
    // p = ceil(2^64 * k / P), computed in 128-bit arithmetic.
    unsigned long long p = (((unsigned __int128)k << 64) + P - 1) / P;
    for (int i = 0; i < n; i++)
        // (unsigned)a[i] * p wraps modulo 2^64; multiplying by P and keeping
        // the high 64 bits gives floor((p*a[i] mod 2^64) * P / 2^64).
        a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}
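As a quick sanity check, here is a minimal test harness one might run to compare calc against the plain % P computation. This is my own sketch, not part of the original post; the array size and the value of k are arbitrary, and calc is repeated so the snippet compiles on its own.

#include <cstdio>
#include <cstdlib>

const int P = 998244353;

void calc(int n, int k, int a[]) {
    unsigned long long p = (((unsigned __int128)k << 64) + P - 1) / P;
    for (int i = 0; i < n; i++)
        a[i] = (unsigned)a[i] * p * (unsigned __int128)P >> 64;
}

int main() {
    const int n = 1000000, k = 123456789;   // arbitrary test parameters
    static int a[1000000], b[1000000];
    for (int i = 0; i < n; i++) a[i] = b[i] = rand() % P;
    calc(n, k, a);
    for (int i = 0; i < n; i++)
        if (a[i] != (int)((long long)b[i] * k % P)) {
            printf("mismatch at index %d\n", i);
            return 1;
        }
    printf("all %d results match\n", n);
    return 0;
}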
A few notes:
- The code uses unsigned __int128, so it only works on a 64-bit architecture.
- (unsigned)a[i] * p is automatically reduced modulo 2^64.
- * (unsigned __int128)P >> 64 is a 64 x 64 -> 128-bit multiplication keeping only the high 64 bits (which land in the rdx register on x86-64); it is as fast as multiplying two unsigned long longs together.
- a[i] is converted to unsigned because going int -> unsigned -> unsigned long long is faster than going directly from int to unsigned long long, which requires sign extension; the last two points are illustrated in the sketch below.
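To make the last two notes concrete, here is a small standalone sketch of my own, assuming GCC/Clang on a 64-bit target where unsigned __int128 is available:

#include <cstdint>

// High 64 bits of a 64x64-bit product; on x86-64 this typically compiles to a
// single mul/mulx, with the high half delivered in rdx.
static inline uint64_t mulhi_u64(uint64_t x, uint64_t y) {
    return (uint64_t)(((unsigned __int128)x * y) >> 64);
}

// For a non-negative v both conversions give the same value, but the detour
// through unsigned is a zero extension (typically free on x86-64), while the
// direct conversion of a signed int needs a sign extension (movsxd).
static inline uint64_t widen_via_unsigned(int v) { return (uint64_t)(unsigned)v; }
static inline uint64_t widen_direct(int v)       { return (uint64_t)v; }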
Speed test. The benchmark code contains two parts:
- The first part measures reciprocal throughput: the time taken when the operations are independent, so the CPU can exploit instruction-level parallelism on a single core. It performs $50000 \times 50000$ modulo multiplications in total.
- The second part measures latency: the time taken when each modulo multiplication depends on the previous one and must execute sequentially, with no parallelism. It performs $50000 \times 25000$ modulo multiplications in total.
(A minimal sketch of this throughput-vs-latency structure is given after the sample output below.)
Possible output:
Throughput test(50000 * 50000):
Compiler's signed modulo: 1954.83 ms
Compiler's unsigned modulo: 1746.73 ms
My modulo: 1160.47 ms
Latency test(50000 * 25000):
Compiler's signed modulo: 4329.33 ms
Compiler's unsigned modulo: 3945.29 ms
My modulo: 2397.97 ms
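For reference, here is a minimal sketch of how such a throughput-vs-latency benchmark can be structured. This is my own illustration with smaller, arbitrary counts, not the author's benchmark code; argc is mixed into the data only to keep the compiler from folding the loops at compile time.

#include <chrono>
#include <cstdio>

const unsigned P = 998244353;

int main(int argc, char **) {
    const int N = 100000, REP = 1000;          // arbitrary sizes for illustration
    static unsigned a[100000];
    for (int i = 0; i < N; i++) a[i] = (i * 2654435761u + argc) % P;

    // Throughput: the operations are independent, so the CPU can overlap them.
    auto t0 = std::chrono::steady_clock::now();
    unsigned long long sink = 0;
    for (int r = 0; r < REP; r++)
        for (int i = 0; i < N; i++)
            sink += (unsigned long long)a[i] * 12345 % P;
    auto t1 = std::chrono::steady_clock::now();

    // Latency: each operation depends on the previous result, so they run
    // strictly one after another.
    unsigned long long x = a[0] + 1;
    for (int r = 0; r < REP; r++)
        for (int i = 0; i < N; i++)
            x = x * 12345 % P;
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto d) {
        return (long long)std::chrono::duration_cast<std::chrono::milliseconds>(d).count();
    };
    printf("throughput: %lld ms, latency: %lld ms (checksums %llu %llu)\n",
           ms(t1 - t0), ms(t2 - t1), sink, x);
    return 0;
}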
By the way, a few general facts:
- int -> unsigned -> long long is faster than int -> long long, but it gives wrong values for negative numbers (a tiny demonstration is given below).
- unsigned long long modulo a constant is faster than long long modulo a constant.
- Modular multiplication by a constant modulus is almost 4 times faster in parallel (throughput) than serially (latency), and the same holds for my modulo multiplication.
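To illustrate the first fact, here is a tiny standalone example of my own (not from the post) showing how the unsigned detour changes the value of a negative int:

#include <cstdio>

int main() {
    int v = -5;
    long long direct = (long long)v;                 // sign-extended: -5
    long long via_unsigned = (long long)(unsigned)v; // zero-extended: 4294967291
    printf("direct = %lld, via unsigned = %lld\n", direct, via_unsigned);
    return 0;
}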
Extensions
It is also possible to compute $ka \bmod P$ for inputs $a$ larger than $P$ with the same code, but $a$ cannot exceed $\frac{2^{64}}{P}$ (the bound from the theorem above).