0

How to Use GPU in VBA/Excel

 1 month ago
source link: https://www.codeproject.com/Articles/5332060/How-to-Use-GPU-in-VBA-Excel
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Introduction

OpenCL is a C99 based language to program the GPGPU and CPU (the Wikipedia mentioned also DSPs and FPGAs). The positive of OpenCL is that the same code can be executed on GPGPU and CPU without any changes and the number of supported platforms is enormous. From the other side, the most widespread programming language for non-professionals, VBA, supports only calculations at one processor and cannot asynchronously execute the code.

I like Excel very much because of its interactivity, but sometimes for massive calculations, one needs a little more power. So, I hope this article will close this gap and will show how to add the support of multiplatform / multithreading calculations on GPGPU / CPU to Excel.

The purpose of this short article is to show how to add the support of OpenCL to the VBA. Since my needs were limited only by increasing of computational power of Excel, no image processing and corresponding topics were implemented / discussed. For those, who like the solution-from-the-box, there is an installer (Windows only) that installs the ClooWrapperVBA library and registers it. For others, who do not trust installers, the same content is available also as zip-file. You will have only to register DLL using "register.bat".

Who benefits from the library?

  • The hobby programmers who make their models in Excel. That's me!
  • The people who would like to learn OpenCL, but do not want to install additional programs like Visual Studio.
  • The scientists of old school, who do their investigations in Excel. I have heard about that samurais.

Requirements:

  • Windows
  • .NET 4.0 Framework
  • Installed Excel
  • OpenCL.dll in Windows folder (normally exist on each Windows computer).

How the code is organized:

  • Sources contain sources of the wrapper (in C#) and VBA code. Demo Excel table is not attached to the sources.
  • Demo Excel table can be found after installation in "C:\Program Files (x86)\ClooWrapperVBA\demo".
  • If you want to compile the C# sources by yourself, the demo Excel table can be found in zip file in folder "demo".

Let's Burn the GPU/CPU!

Common Notes

The DLL is written in C# and contains COM interface to Cloo, accessible from Excel. Cloo is an open source wrapper to execute OpenCL code from .NET.

Hello World!

Let's start with a simple program that performs matrix multiplication code written in OpenCL.

VBScript
Copy Code
Dim clooConfiguration As New ClooWrapperVBA.Configuration

At first, that would be interesting to check which hardware is available to us. The code snippet below iterates through all available platforms and all devices inside each platform and gets their configurations:

VBScript
Shrink ▲   Copy Code
nPlatforms = clooConfiguration.Platforms

For i = 1 To nPlatforms
    result = clooConfiguration.SetPlatform(i - 1)
    If result Then
        platformName = clooConfiguration.Platform.PlatformName
        platformVendor = clooConfiguration.Platform.PlatformVendor
        platformVersion = clooConfiguration.Platform.PlatformVersion
        
        nDevices = clooConfiguration.Platform.Devices
        For j = 1 To nDevices
            result = clooConfiguration.Platform.SetDevice(j - 1)
            
            If result Then
                deviceType = clooConfiguration.Platform.Device.DeviceType
                deviceName = clooConfiguration.Platform.device.DeviceName
                deviceVendor = clooConfiguration.Platform.device.DeviceVendor
                maxComputeUnits = clooConfiguration.Platform.device.MaxComputeUnits
                deviceAvailable = clooConfiguration.Platform.device.DeviceAvailable
                compilerAvailable = clooConfiguration.Platform.device.CompilerAvailable
                deviceVersion = clooConfiguration.Platform.device.DeviceVersion
                driverVersion = clooConfiguration.Platform.device.DriverVersion
                globalMemorySize = clooConfiguration.Platform.device.GlobalMemorySize
                maxClockFrequency = clooConfiguration.Platform.device.MaxClockFrequency
                maxMemoryAllocationSize = _
                   clooConfiguration.Platform.device.MaxMemoryAllocationSize
                openCLCVersionString = _
                    clooConfiguration.Platform.device.OpenCLCVersionString
            End If
        Next j
    End If
Next i 

The important configuration settings are:

  • deviceType ("GPU" / "CPU")
  • maxComputeUnits - number of processors / threads that can be used
  • deviceAvailable
  • compilerAvailable

Note, that SetPlatform and SetDevice functions return true if the functions were executed successfully, false, if not.

If no platforms or devices were found, then check if OpenCL.dll exist in "Windows" folder (normally in "C:\Windows\"). If you have no OpenCL.dll in "Windows" folder, then copy it from other computer or Google for it. Do antivirus check in any case! Another possible reason, the GPGPU / CPU drivers are too old and not supported by OpenCL. Google the latest drivers for your CPU / GPGPU.

Matrix Multiplication in OpenCL

Now, let us multiply two floating point matrices M1[p, q] and M2[q, r] using the first available device that can compile OpenCL sources.

First of all, the OpenCL sources will be read.

VBScript
Copy Code
Open Application.ActiveWorkbook.Path & "\cl\MatrixMultiplication.cl" For Binary As #1
sources = Space$(LOF(1))
Get #1, , sources
Close #1

The matrix multiplication OpenCL code of two floating point matrices is shown below:

Copy Code
__kernel void FloatMatrixMult_
(__global float* MResp, __global float* M1, __global float* M2, __global int* q)
{
    // Vector element index
    int i = get_global_id(0);
    int j = get_global_id(1);
    int p = get_global_size(0);
    int r = get_global_size(1);
    MResp[i + p * j] = 0;
    int QQ = q[0];
    for (int k = 0; k < QQ; k++)
    {
        MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j];
    }
}

If the found device has a compiler (compilerAvailable = true), we will try to compile the OpenCL sources:

VBScript
Copy Code
Set progDevice = New ClooWrapperVBA.ProgramDevice
result = progDevice.Build(sources, "", platformId, deviceId, cpuCounter, buildLogs)

The parameters are:

  • sources - the sources as a plain text, read from the file at previous step.
  • Second parameter ("") contains compiler options. In the simplest case, it can be empty ("", not null or Nothing). Among the common compiler options, like "-w" (Inhibit all warning messages), you can also define here commonly used constants ("-D name=definition") and use them in the OpenCL code. The complete list of compiler options can be found at Khronos web page.
  • The fifth parameter, cpuCounter, defines the device index of some certain type ("CPU" or "GPU"). Since your platform can have thousands of devices of different type, this parameter distinguishes between different devices of the same type.
  • buildLogs shows the current compiler logs. It's important to check if it contains any errors. For me, it was very helpful to check the build logs for warnings when I developed my first own OpenCL code. The warnings were mostly the reasons of the kernel crash during execution. The errors are also accumulated in property ErrorString:
VBScript
Copy Code
errorString = progDevice.ErrorString

If the compilation was successful (result = true), we have to define the kernel which we want to execute. The single parameter of the CreateKernel function is a kernel name (string):

VBScript
Copy Code
result = progDevice.CreateKernel("DoubleMatrixMult")

Then we have to set input and output arrays of the kernel. OpenCL typically uses vectors (1-dimensional arrays) as input parameters:

VBScript
Copy Code
result = progDevice.SetMemoryArgument_Double(0, vecResp)
result = progDevice.SetMemoryArgument_Double(1, vecM1)
result = progDevice.SetMemoryArgument_Double(2, vecM2)
result = progDevice.SetMemoryArgument_Long(3, vecQ)

To set kernel arguments, there are six functions:

  • To set arrays:
    • SetMemoryArgument_Long (corresponds to array of integers in C#)
    • SetMemoryArgument_Single (corresponds to array of floats in C#)
    • SetMemoryArgument_Double
  • To set values:
    • SetValueArgument_Long (corresponds to integer in C#)
    • SetValueArgument_Single (corresponds to float in C#)
    • SetValueArgument_Double

The first parameter, argument index, starts with 0 for first argument and must be increased for the next arguments. It is also very important to set variables in a right sequence. First, the variable with argument index 0, then with argument index 1 and so on.

Now, the input array sizes must be set in the globalWorkSize array:

VBScript
Copy Code
globalWorkSize(0) = p
globalWorkSize(1) = r 

Finally, we can start kernel execution. The function ExecuteSync returns to VBA only when its execution is finished. If you want to run your OpenCL code in asynchronous mode on several devices, then you have to read "Advanced topics":

VBScript
Copy Code
result = progDevice.ExecuteSync(globalWorkOffset, globalWorkSize, localWorkSize)

The results of execution must be taken from the kernel using appropriate Get functions:

VBScript
Copy Code
result = progDevice.GetMemoryArgument_Double(0, vecResp)

The first parameter, argument index, has the same meaning as the argument index parameter of "Set" functions, but you can use the "Get" functions in arbitrary order.

Finally, the memory should be cleaned from the arrays and all instantiated OpenCL objects:

VBScript
Copy Code
result = progDevice.ReleaseMemObject(3)
result = progDevice.ReleaseMemObject(2)
result = progDevice.ReleaseMemObject(1)
result = progDevice.ReleaseMemObject(0)
result = progDevice.ReleaseKernel
result = progDevice.ReleaseProgram

The releasing functions have to be called in the following sequence:

  • Release memory arguments, starting with highest argument index
  • Release kernel
  • ReleaseProgram

Finally, note, that all COM-visible functions of DLL return boolean: true, if the function was executed successfully, false, if not. Any function returned false, then it is helpful to check errors in property ErrorString.

Advanced Topics

The basic example discussed in details before contains only a small part of possibilities of Cloo. The complete possibilities of ClooWrapperVBA can be tested in corresponding Excel sheets ("Configuration", "Performance", "Asynchronous").

Configuration

Simply press "Configuration" button and you will get the truth about your system. :-)

Performance Measurements

The sheet performs simple performance measurement using and OpenCL code from the article "How to Use Your GPU in .NET". The performance test is done at first found CPU and GPGPU device in single and double precisions.

Another performance test is done for multiplication of two 1200*1200 double precision matrices in VBA and first found CPU / GPGPU. The results of calculations on CPU / GPGPU are compared to VBA results (see cells C3:C4) and the computation time was measured (cells B2:B4). The results show that the OpenCL code executed on GPGPU is 300 times faster than VBA (cell B9) and also 8 times faster that native C# code! (Simply imagine, my Excel is 8 times faster than native C#!!!) The "native" calculations were done in C# and contain the same multiplication of two matrices 1200*1200. The native performance measurement was performed only on my computer and therefore stay unchanged. If you want, you can write this part on any programming language you want (C#/C/C++) and update the execution time manually. The reason I didn't add it to ClooWrapperVBA is because it was a single test case and I wanted to have only a clean source of ClooWrapperVBA.

Asynchronous Execution of OpenCL Code

There are two additional functions to execute the kernel in asynchronous mode:

1. ExecuteBackground

VBScript
Copy Code
result = programDevice.ExecuteBackground_
         (globalWorkOffset, globalWorkSize, localWorkSize, THREAD_PRIORITY)

In the snippet below, the OpenCL code is executed at CPU and GPGPU simultaneously in infinite loop:

  • Every 100 ms, the state of the execution at CPU / GPGPU is checked using ExecutionCompleted function (true - execution is completed, false - the OpenCL code is still running).
  • If execution was completed, the output array will be read from the kernel, the new input array will be written to the kernel and the OpenCL code will be started at desired device using ExecuteBackground.
  • The infinite loop will run until the MAX_TASKS (=20) will be reached.
VBScript
Shrink ▲   Copy Code
While Not allTasks_Completed
    For i = 1 To progDevices.Count
        If progDevices.Item(i).ProgramDevice.ExecutionCompleted Then
            result = progDevices.Item(i).ProgramDevice.GetMemoryArgument_Double_
            (0, vecResp) ' Extract the results and do something with received data here.
            
            finishedTasks = finishedTasks + 1
            
            ' Start new task
            If startedTasks < MAX_TASKS Then
                ReDim vecResp(UBound(vecResp))  ' Erase output vector.
                result = progDevices.Item(i).ProgramDevice.SetMemoryArgument_Double_
                         (0, vecResp)
                
                ' If you want to use callbacks, than use function below
                ' "CPU_Task_Completed" is a function that will obtain the callback.
                ' Call progDevices.Item(i).ProgramDevice.ExecuteAsync_
                (globalWorkOffset, globalWorkSize, localWorkSize, _
                 THREAD_PRIORITY, AddressOf Asynchronous.CPU_Task_Completed)
                
                result = progDevices.Item(i).ProgramDevice.ExecuteBackground_
                (globalWorkOffset, globalWorkSize, localWorkSize, THREAD_PRIORITY)
                startedTasks = startedTasks + 1
                currentTaskId(i) = startedTasks
            Else
                ' If the maximal number of tasks is reached, 
                ' then set "ExecutionCompleted" to false to avoid additional outputs.
                progDevices.Item(i).ProgramDevice.ExecutionCompleted = False
            End If
            
            If startedTasks = finishedTasks Then
                allTasks_Completed = True
            End If
        End If
    Next i
    
    DoEvents
    Sleep (100)
Wend 

2. ExecuteAsync

VBScript
Copy Code
result = programDevice.ExecuteAsync(globalWorkOffset, globalWorkSize, _
localWorkSize, THREAD_PRIORITY, AddressOf Asynchronous.CPU_Task_Completed)
  • The first versions of Excel demo table used the function ExecuteAsync to run OpenCL code in asynchronous mode. The function used callback to the VBA function when the execution was completed. But later, I found that it is rather problematic to debug the code which uses callback functions. Excel, for example, crashes if you are currently in debugger and the code received a callback. Even writing the results to the sheet can cause an Excel crash. Therefore, in the test table, I do not use this function even if it is still in the COM interface.

The common parameter of both functions, THREAD_PRIORITY, sets one of five priorities (0 - "Lowest", 1 - "BelowNormal", 2 - "Normal", 3 - "AboveNormal", 4 - "Highest"), but I'm pretty sure that this parameter has no influence on the execution of the OpenCL code. I simply added it because I had such possibility :-)

The sheet initializes OpenCL program to multiply two 2000*2000 matrices at first found CPU and GPGPU devices and run it 20 times in asynchronous mode. So, every time the program is executed at CPU and GPGPU asynchronously. The state of execution is shown in the sheet. On my computer, the number of finished CPU calculations is 4 times smaller than a number of GPGPU processes, which correspond to performance measurements done before.

Run ClooWrapperVBA from VBScript

Actually, it was not a big surprise for me that VBScript can work with COM-DLLs, but it was still quite surprising. The ClooWrapperVBA can obtain the platform/device configurations and write them out. The calculations using ClooWrapperVBA are at the moment not possible because the variables in VBScript are of type Variant and must be sent to DLL using ArrayList, which will make DLL more complicated. Since the usage from VBScript was not intended, the ClooWrapperVBA can be used in VBScript only to write out your current platform/device configuration.

Conclusion and Points of Interest

The sources are available at my GitHub page and it would be great if you (yes, especially YOU!) will create a pull request. Any improvements are highly appreciated!

History

  • 15th May, 2022: Initial version

About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK