CUDA 8.0, GTX 1080: why is vector addition 5x slower than matrix multiplication?
source link: https://www.codesd.com/item/cuda-8-0-gtx-1080-why-is-vector-addition-slower-than-5x-matrix-multiplication.html
I am using the latest CUDA 8.0 with a GTX 1080, and running the bundled samples to test speed. (I know they do not reflect optimal performance; I just want a horizontal comparison.)
In 0_Simple/matrixMul, the speed is measured by the sample code itself, which reports:
Performance= 1029.91 GFlop/s, Time= 0.127 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Then I ran 0_Simple/vectorAdd and copied the speed-testing code from the above sample, i.e.:
// Measure speed using CUDA events
cudaEvent_t start;
cudaEventCreate(&start);
cudaEvent_t stop;
cudaEventCreate(&stop);
cudaEventRecord(start, NULL);

// Launch the kernel nIter times and average the elapsed time
int nIter = 300;
for (int i = 0; i < nIter; i++) {
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
}
cudaEventRecord(stop, NULL);
cudaEventSynchronize(stop);  // wait until all launches have finished

float msecTotal = 0.0f;
cudaEventElapsedTime(&msecTotal, start, stop);
float msecPerAdd = msecTotal / nIter;

// One add per element, so the FLOP count equals numElements
double flopsPerAdd = numElements;
double gigaFlops = (flopsPerAdd * 1.0e-9f) / (msecPerAdd / 1000.0f);
printf("Performance= %.2f GFLOPS, Time= %.3f ms, Size= %.0f Ops\n", gigaFlops, msecPerAdd, flopsPerAdd);
I also enlarged numElements from 50000 to 67108864. The speed result is:
Performance= 19.85 GFLOPS, Time= 3.380 ms, Size= 67108864 Ops
which is almost 5x slower.
I know that the sample code may be suboptimal, so could anyone tell me why the vectorAdd code is so slow, and how to optimize it?
I am using CUDA 8.0 and a GTX 1080.
Unlike matrix multiplication, vector addition is a memory-bandwidth-bound operation. The correct way to measure its performance is to measure the global memory bandwidth it achieves. Vector addition touches two input vectors and one output vector, so the effective bandwidth can be calculated as follows.
3 * numElements * sizeof(d_A[0]) / kernel_running_time
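Plugging in the numbers from the question (assuming float elements, so sizeof(d_A[0]) is 4 bytes):

3 * 67108864 * 4 bytes / 3.380 ms ≈ 0.805 GB / 0.00338 s ≈ 238 GB/s

That is roughly 75% of the GTX 1080's theoretical peak memory bandwidth of about 320 GB/s, so the kernel is already close to the hardware limit rather than being poorly optimized.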
You could compare it with the bandwidth of a simple device-to-device (D2D) copy to see whether you have reached the peak.
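A minimal, self-contained sketch of that comparison (the 256-thread block size, 300 iterations, and the timeMs helper are assumptions not taken from the original; compile with nvcc -std=c++11):

#include <cstdio>
#include <cuda_runtime.h>

// Same one-add-per-element kernel as in the question's sample.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Time a host callable with CUDA events; return average milliseconds per iteration.
template <typename F>
float timeMs(F f, int nIter)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < nIter; i++) f();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / nIter;
}

int main()
{
    const int n = 67108864;
    const size_t bytes = n * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;

    // vectorAdd moves 3 vectors per launch (2 reads + 1 write).
    float msAdd = timeMs([&] {
        vectorAdd<<<blocks, threads>>>(d_A, d_B, d_C, n);
    }, 300);
    printf("vectorAdd: %.3f ms, %.1f GB/s\n",
           msAdd, 3.0 * bytes * 1e-9 / (msAdd * 1e-3));

    // A D2D copy moves 2 vectors per call (1 read + 1 write).
    float msCopy = timeMs([&] {
        cudaMemcpy(d_C, d_A, bytes, cudaMemcpyDeviceToDevice);
    }, 300);
    printf("D2D copy:  %.3f ms, %.1f GB/s\n",
           msCopy, 2.0 * bytes * 1e-9 / (msCopy * 1e-3));

    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}

If the two GB/s figures come out close to each other, the vectorAdd kernel is saturating memory bandwidth and there is nothing meaningful left to optimize in the kernel itself.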