
CUDA 8.0, GTX 1080, why is vector addition slower than 5x matrix multiplication?

 2 years ago
source link: https://www.codesd.com/item/cuda-8-0-gtx-1080-why-is-vector-addition-slower-than-5x-matrix-multiplication.html


I am using the latest CUDA 8.0 with a GTX 1080 and running the bundled samples to test speed. (I know they do not reflect optimal performance; I just want to compare them against each other.)

The 0_Simple/matrixMul sample measures its own speed and reports:

Performance= 1029.91 GFlop/s, Time= 0.127 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block

Then I ran 0_Simple/vectorAdd and copied the timing code from the sample above into it, i.e.:

    // Measure speed with CUDA events
    cudaEvent_t start;
    cudaEventCreate(&start);
    cudaEvent_t stop;
    cudaEventCreate(&stop);

    // Time nIter back-to-back kernel launches on the default stream
    cudaEventRecord(start, NULL);
    int nIter = 300;
    for (int i = 0; i < nIter; i++) {
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    }
    cudaEventRecord(stop, NULL);

    // Wait for the stop event, then read the elapsed time in milliseconds
    cudaEventSynchronize(stop);
    float msecTotal = 0.0f;
    cudaEventElapsedTime(&msecTotal, start, stop);

    // One floating-point add per element
    float msecPerAdd = msecTotal / nIter;
    double flopsPerAdd = numElements;
    double gigaFlops = (flopsPerAdd * 1.0e-9f) / (msecPerAdd / 1000.0f);
    printf("Performance= %.2f GFLOPS, Time= %.3f ms, Size= %.0f Ops\n", gigaFlops, msecPerAdd, flopsPerAdd);

I also increased numElements from 50000 to 67108864. The result is:

Performance= 19.85 GFLOPS, Time= 3.380 ms, Size= 67108864 Ops

which is far slower: roughly 50x lower in GFLOP/s than the matrixMul result (19.85 vs. 1029.91).

I know that sample code may be suboptimal, so could anyone tell me why the vectorAdd code is so slow, and how to optimize it?

I am using CUDA 8.0, and GTX 1080


Unlike matrix multiplication, vector addition is memory-bandwidth-bound: each element needs two loads and one store but only a single floating-point add, so GFLOP/s says little about how well the kernel is running. The right metric is the achieved global-memory bandwidth. The kernel reads 2 input vectors and writes 1 output vector, so the bandwidth is:

3 * numElements * sizeof(d_A[0]) / kernel_running_time
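
As a worked example, the formula can be applied to the numbers reported in the question. This assumes the vectors hold 4-byte float elements, as in the stock vectorAdd sample; the question does not state the element type explicitly.

```latex
\[
\text{BW}
  = \frac{3 \times 67108864 \times 4\ \text{B}}{3.380 \times 10^{-3}\ \text{s}}
  = \frac{805306368\ \text{B}}{3.380 \times 10^{-3}\ \text{s}}
  \approx 238.3\ \text{GB/s}
\]
```

For reference, NVIDIA lists the GTX 1080's peak memory bandwidth as 320 GB/s, so under that assumption the kernel already reaches roughly 74% of the theoretical peak.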

Compare the result with the bandwidth of a simple device-to-device (D2D) copy to see whether the kernel has reached the achievable peak.
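
The comparison could be set up roughly as follows. This is a minimal sketch, not the sample's own code: the helper `bandwidthGBs` and the `CHECK` macro are hypothetical names introduced here, the elements are assumed to be float, the block size of 256 is an arbitrary choice, and the D2D copy is counted as one read plus one write per byte.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro: abort on any CUDA runtime error.
#define CHECK(call)                                                     \
    do {                                                                \
        cudaError_t err_ = (call);                                      \
        if (err_ != cudaSuccess) {                                      \
            fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err_)); \
            exit(EXIT_FAILURE);                                         \
        }                                                               \
    } while (0)

// Element-wise add, equivalent to the kernel in the vectorAdd sample.
__global__ void vectorAdd(const float *A, const float *B, float *C, int n) {
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

// Hypothetical helper: bytes moved in `ms` milliseconds -> GB/s (1 GB = 1e9 B).
double bandwidthGBs(double bytes, float ms) {
    return bytes * 1.0e-9 / (ms * 1.0e-3);
}

int main() {
    const int numElements = 67108864;                 // size from the question
    const size_t size = numElements * sizeof(float);  // assumes float elements
    const int nIter = 300;
    const int threadsPerBlock = 256;                  // arbitrary choice
    const int blocksPerGrid = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    float *d_A, *d_B, *d_C;
    CHECK(cudaMalloc(&d_A, size));
    CHECK(cudaMalloc(&d_B, size));
    CHECK(cudaMalloc(&d_C, size));
    CHECK(cudaMemset(d_A, 0, size));
    CHECK(cudaMemset(d_B, 0, size));

    cudaEvent_t start, stop;
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    float ms = 0.0f;

    // Warm-up launch so one-time startup cost is not included in the timing.
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    CHECK(cudaGetLastError());

    // vectorAdd: 2 reads + 1 write per element.
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < nIter; i++) {
        vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, numElements);
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("vectorAdd: %.2f GB/s\n", bandwidthGBs(3.0 * size, ms / nIter));

    // D2D copy baseline: 1 read + 1 write per element.
    CHECK(cudaMemcpy(d_C, d_A, size, cudaMemcpyDeviceToDevice));  // warm-up
    CHECK(cudaEventRecord(start, 0));
    for (int i = 0; i < nIter; i++) {
        CHECK(cudaMemcpy(d_C, d_A, size, cudaMemcpyDeviceToDevice));
    }
    CHECK(cudaEventRecord(stop, 0));
    CHECK(cudaEventSynchronize(stop));
    CHECK(cudaEventElapsedTime(&ms, start, stop));
    printf("D2D copy:  %.2f GB/s\n", bandwidthGBs(2.0 * size, ms / nIter));

    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaFree(d_A));
    CHECK(cudaFree(d_B));
    CHECK(cudaFree(d_C));
    return 0;
}
```

If the two printed figures are close, the kernel is already limited by memory bandwidth and there is little left to optimize in the kernel itself.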

