Cache, Loops and Performance

advertisements

Some time ago I wrote a little piece of code to ask about on interviews and see how people understand concepts of cache and memory:

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <iostream>

#define TOTAL 0x20000000

using namespace std;

__int64 count(int INNER, int OUTER)
{
    int a = 0;
    int* arr = (int*) HeapAlloc(GetProcessHeap(), 0, INNER * sizeof(int));
    if (!arr) {
        cerr << "HeapAlloc failed\n";
        return 1;
    }
    LARGE_INTEGER freq;
    LARGE_INTEGER startTime, endTime;
    __int64 elapsedTime, elapsedMilliseconds;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&startTime);

    /* Начало работы */

    for (int i = 0; i < OUTER; i++) {
        for (int j = 0; j < INNER; j++) {
            a |= i;
            arr[j] = a;
        }
    }

    /* Конец работы */

    QueryPerformanceCounter(&endTime);
    elapsedTime = endTime.QuadPart - startTime.QuadPart;
    elapsedMilliseconds = (1000 * elapsedTime) / freq.QuadPart;
    HeapFree(GetProcessHeap(), 0, arr);
    return elapsedMilliseconds;
}

int _tmain(int argc, _TCHAR* argv[])
{
    __int64 time;
    for (int INNER = 0x10; INNER <= 0x2000000; INNER <<= 1) {
        int OUTER = TOTAL / INNER;
        time = count(INNER, OUTER);
        cout << INNER << "\t" << OUTER << "\t" << time << "\n";
    }
}

That's what it compiles to (the loop itself):

00401062  xor         ecx,ecx
00401064  test        ebp,ebp
00401066  jle         count+83h (401083h)
00401068  xor         eax,eax
0040106A  test        ebx,ebx
0040106C  jle         count+7Ch (40107Ch)
0040106E  mov         edi,edi
00401070  or          esi,ecx
00401072  mov         dword ptr [edi+eax*4],esi
00401075  add         eax,1
00401078  cmp         eax,ebx
0040107A  jl          count+70h (401070h)
0040107C  add         ecx,1
0040107F  cmp         ecx,ebp
00401081  jl          count+68h (401068h)

That's what the program outputs on my machine:

LOG2(INNER) LOG2(OUTER)  Time, ms
4           25           523
5           24           569
6           23           441
7           22           400
8           21           367
9           20           358
10          19           349
11          18           364
12          17           378
13          16           384
14          15           357
15          14           377
16          13           379
17          12           390
18          11           386
19          10           419
20          9            995
21          8            1,015
22          7            1,038
23          6            1,071
24          5            1,082
25          4            1,119

I ask people to explain what's going on. As inner array grows, the number of cycles decreases, as the time does. As inner array outgrows the cache, cache misses begin to happen and time raises. It's all right so far.

BUT: when the INNER array size is 16 (that gives us 64 bytes of data), there is little performance boost, despite the greater number of jmps in code. It's little (523 vs. 569), but reproducible.

The question is: why this boost?

Probably because 64 is the cache line size on your machine, and you basically run each iteration fully out of a single cache line.

Cache, Loops and Performance

Cache, Loops and Performance

Recommend

龙8官网正版在线-疫情之下，美国还在应对每年必发的山火

tmpmail – A temporary email right from Linux / Unix terminal

Quick tip: Display banner/message before OpenSSH authentication

How to evaluate 2 query output data in one with double

苹果大幅增加iPhone 12产能：国人已对这个版本无爱！

Linux set up bluetooth speaker to stream audio from your Android / iOS mobile ph...

Crystal Reports Error in Formula

Raspberry Pi 400 Desktop PC: Complete specs and pricing

How to get the header of the site?

龙8官网正版-青海格尔木境内5名枸杞采摘工人遭3只棕熊袭击

About Joyk