CUDA 8.0, GTX 1080: why is vector addition 5x slower than matrix multiplication?
source link: https://www.codesd.com/item/cuda-8-0-gtx-1080-why-is-vector-addition-slower-than-5x-matrix-multiplication.html
I am using the latest CUDA 8.0 with a GTX 1080, and running the bundled samples to test speed. (I know they do not reflect optimal performance; I just want a horizontal comparison.)
In 0_Simple/matrixMul, the speed is measured by the sample code itself, which reports:
Performance= 1029.91 GFlop/s, Time= 0.127 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Then I ran 0_Simple/vectorAdd and copied the speed-testing code from the above sample, i.e.:
// Measure speed using CUDA events
cudaEvent_t start;
cudaEventCreate(&start);
cudaEvent_t stop;
cudaEventCreate(&stop);
cudaEventRecord(start, NULL);

// Launch the kernel nIter times and average the elapsed time
int nIter = 300;
for (int i = 0; i < nIter; i++) {
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
}
cudaEventRecord(stop, NULL);
cudaEventSynchronize(stop);  // wait until all launches have finished

float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);
float msecPerAdd = msecTotal / nIter;

// One add per element, so the FLOP count equals numElements
double flopsPerAdd = numElements;
double gigaFlops = (flopsPerAdd * 1.0e-9f) / (msecPerAdd / 1000.0f);
printf("Performance= %.2f GFLOPS, Time= %.3f ms, Size= %.0f Ops\n", gigaFlops, msecPerAdd, flopsPerAdd);
I also enlarged numElements from 50000 to 67108864. The speed result is:
Performance= 19.85 GFLOPS, Time= 3.380 ms, Size= 67108864 Ops
which is almost 5x slower.
I know that the sample code may be suboptimal, so could anyone tell me why the vectorAdd code is so slow, and how to optimize it?
I am using CUDA 8.0 and a GTX 1080.
Unlike matrix multiplication, vector addition is a memory-bandwidth-bound operation. The correct way to measure its performance is to measure the global memory bandwidth it achieves. Vector addition touches two input vectors and one output vector, so the effective bandwidth can be calculated as follows.
3 * numElements * sizeof(d_A[0]) / kernel_running_time
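Plugging in the numbers from the question (assuming float elements, so sizeof(d_A[0]) is 4 bytes):

3 * 67108864 * 4 bytes / 3.380 ms ≈ 0.805 GB / 0.00338 s ≈ 238 GB/s

That is roughly 75% of the GTX 1080's theoretical peak memory bandwidth of about 320 GB/s, so the kernel is already close to the hardware limit rather than being poorly optimized.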
You could compare it with the bandwidth of a simple device-to-device (D2D) copy to see whether you have reached the peak.
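A minimal, self-contained sketch of that comparison (the 256-thread block size, 300 iterations, and the timeMs helper are assumptions not taken from the original; compile with nvcc -std=c++11):

#include <cstdio>
#include <cuda_runtime.h>

// Same one-add-per-element kernel as in the question's sample.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Time a host callable with CUDA events; return average milliseconds per iteration.
template <typename F>
float timeMs(F f, int nIter)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < nIter; i++) f();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / nIter;
}

int main()
{
    const int n = 67108864;
    const size_t bytes = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // vectorAdd moves 3 vectors per launch (2 reads + 1 write).
    float msAdd = timeMs([&] {
        vectorAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);
    }, 300);
    printf("vectorAdd: %.3f ms, %.1f GB/s\n",
           msAdd, 3.0 * bytes * 1e-9 / (msAdd * 1e-3));

    // A D2D copy moves 2 vectors per call (1 read + 1 write).
    float msCopy = timeMs([&] {
        cudaMemcpy(d_C, d_A, bytes, cudaMemcpyDeviceToDevice);
    }, 300);
    printf("D2D copy:  %.3f ms, %.1f GB/s\n",
           msCopy, 2.0 * bytes * 1e-9 / (msCopy * 1e-3));

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}

If the two GB/s figures come out close to each other, the vectorAdd kernel is saturating memory bandwidth and there is nothing meaningful left to optimize in the kernel itself.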