5

Cache, Loops and Performance

 3 years ago
source link: https://www.codesd.com/item/cache-loops-and-performance.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Cache, Loops and Performance

advertisements

Some time ago I wrote a little piece of code to ask about on interviews and see how people understand concepts of cache and memory:

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <iostream>

#define TOTAL 0x20000000

using namespace std;

__int64 count(int INNER, int OUTER)
{
    int a = 0;
    int* arr = (int*) HeapAlloc(GetProcessHeap(), 0, INNER * sizeof(int));
    if (!arr) {
        cerr << "HeapAlloc failed\n";
        return 1;
    }
    LARGE_INTEGER freq;
    LARGE_INTEGER startTime, endTime;
    __int64 elapsedTime, elapsedMilliseconds;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startTime);

    /* Начало работы */

    for (int i = 0; i < OUTER; i++) {
        for (int j = 0; j < INNER; j++) {
            a |= i;
            arr[j] = a;
        }
    }

    /* Конец работы */

    QueryPerformanceCounter(&endTime);
    elapsedTime = endTime.QuadPart - startTime.QuadPart;
    elapsedMilliseconds = (1000 * elapsedTime) / freq.QuadPart;
    HeapFree(GetProcessHeap(), 0, arr);
    return elapsedMilliseconds;
}

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 time;
    for (int INNER = 0x10; INNER <= 0x2000000; INNER <<= 1) {
        int OUTER = TOTAL / INNER;
        time = count(INNER, OUTER);
        cout << INNER << "\t" << OUTER << "\t" << time << "\n";
    }
}

That's what it compiles to (the loop itself):

00401062  xor         ecx,ecx
00401064  test        ebp,ebp
00401066  jle         count+83h (401083h)
00401068  xor         eax,eax
0040106A  test        ebx,ebx
0040106C  jle         count+7Ch (40107Ch)
0040106E  mov         edi,edi
00401070  or          esi,ecx
00401072  mov         dword ptr [edi+eax*4],esi
00401075  add         eax,1
00401078  cmp         eax,ebx
0040107A  jl          count+70h (401070h)
0040107C  add         ecx,1
0040107F  cmp         ecx,ebp
00401081  jl          count+68h (401068h)

That's what the program outputs on my machine:


LOG2(INNER) LOG2(OUTER)  Time, ms
4           25           523
5           24           569
6           23           441
7           22           400
8           21           367
9           20           358
10          19           349
11          18           364
12          17           378
13          16           384
14          15           357
15          14           377
16          13           379
17          12           390
18          11           386
19          10           419
20          9            995
21          8            1,015
22          7            1,038
23          6            1,071
24          5            1,082
25          4            1,119


I ask people to explain what's going on. As inner array grows, the number of cycles decreases, as the time does. As inner array outgrows the cache, cache misses begin to happen and time raises. It's all right so far.

BUT: when the INNER array size is 16 (that gives us 64 bytes of data), there is little performance boost, despite the greater number of jmps in code. It's little (523 vs. 569), but reproducible.

The question is: why this boost?


Probably because 64 is the cache line size on your machine, and you basically run each iteration fully out of a single cache line.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK