
An Introduction to GPU Programming in Julia

source link: https://www.tuicool.com/articles/hit/rUrA3eU
["^ ","~:view-data",["^ ","~:article-access?",false,"~:article",["^ ","~:db/id",17592186649289,"~:nextjournal/id","~u0287c2e0-a8ac-4929-b5ab-9efe94bac3c4","~:article/name","julia-gpu-programming","~:article/published-at","~m1539888386536","~:article/profile",["^ ","^3",17592186045492,"~:profile/handle","sdanisch","~:profile/name","Simon Danisch"],"~:article/change",["^ ","^3",17592187039877,"^4","~u5bc8d4e7-dd64-44c1-bb80-8271c2baa849","~:change/inserted-at","~m1539888359297"],"~:article/published-change",["^ ","^3",17592187039877,"^4","~u5bc8d4e7-dd64-44c1-bb80-8271c2baa849","^;","~m1539888359297"],"~:article/title","An Introduction to GPU Programming in Julia ","~:article/preview","QmQtLzEYY97Rme9aS5Wsacg6PGwdPLdvmV1e6k2d8bj8y8"],"~:article-contents",["^ ","^2",["^ ","~:root","3950cb96-1122-4d9f-8b74-7168fb77849a","~:nodes",["^ ","403a6184-13b5-4d48-8e5c-3612ce2ea5d1",["^ ","~:id","403a6184-13b5-4d48-8e5c-3612ce2ea5d1","~:kind","text","~:content","

"],"d0caaf8f-5c0b-4c0e-8b9e-1bd2e2e99aed",["^ ","^C","d0caaf8f-5c0b-4c0e-8b9e-1bd2e2e99aed","^D","text","^E","

This article aims to give a quick introduction to how GPUs work, give an overview of the current Julia GPU ecosystem, and show how easy it is to get simple GPU programs running. To make things easier, you can run all the code samples directly in the article if you have a Nextjournal account and click on "edit".

"],"e01448e0-7630-4c6a-b415-050bd4b37efa",["^ ","^C","e01448e0-7630-4c6a-b415-050bd4b37efa","^D","text","^E","

The function that makes this possible is named gpu_call . It can be called as gpu_call(kernel, A::GPUArray, args) and will call kernel with the arguments (state, args...) on the GPU. State is a backend specific object to implement functionality like getting the thread index. A GPUArray needs to get passed as the second argument to dispatch to the correct backend and supply the defaults for the launch parameters.

Let's use gpu_call to implement a simple map kernel:

"],"5b06a745-4830-4b8e-a5b2-df8449cef3b3",["^ ","^E","struct Test # an immutable struct\n# that only contains other immutable, which makes \n# isbitstype(Test) == true\n\tx::Float32 \nend\n\n# the isbits property is important, since those types can be used\n# without constraints on the GPU!\n@assert isbitstype(Test) == true\nx = (2, 2)\nisa(x, Tuple{Int, Int}) # tuples are also immutable\nmutable struct Test2 #-> mutable, isbits(Test2) == false\n\tx::Float32\nend\nstruct Test3\n\t# contains a heap allocation/ reference, not isbits\n\tx::Vector{Float32}\n\ty::Test2 # Test2 is mutable and also heap allocated / a reference\nend\nVector{Test} #<- Array with isbits elements is contigious in memory\nVector{Test2} #<

- Array with mutable elements is basically a linked list, since it just contains the pointer to its elements and won't work with the GPU.","~:refs",["~#list",[]],"~:name","Structs","~:output-log-lines",["^ "],"~:language","julia","^C","5b06a745-4830-4b8e-a5b2-df8449cef3b3","~:compute-ref","~u550ca310-d2e7-11e8-b7e3-d5a7297efcb7","~:runtime",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","~:outputs",["^ ","~_",["^ ","^D","data","^K",null,"~:coder","json","~:blob",["^ ","^C","Qmf95KPCz5ETEtBLXzBqZ3rh6zFk3eQNWQog7nBVwJotNS","~:size",16,"~:content-type","application/json"]]],"~:error",null,"~:exec-duration",843,"~:bucket",null],"4e09f815-feab-4f09-8d9f-e63eba843aa2",["^ ","^C","4e09f815-feab-4f09-8d9f-e63eba843aa2","^D","section","~:title"," The GPUArray Constructors","^E",["^J",["07d732ed-7e58-4970-af4a-16701813f0ee"]],"~:sections",["^J",[]]],"90e93106-c175-437c-a107-fb911bf59788",["^ ","^C","90e93106-c175-437c-a107-fb911bf59788","^D","section","^Y","Garbage Collection","^E",["^J",["fe642bca-b571-4183-a38a-1e937833a4d8","24e4194b-b6fd-43c4-b876-cc6b6aea8403","f9c22c0d-fa74-4a04-a4a8-52ccfea9b2d7"]],"^Z",["^J",[]]],"8d00da3d-dc4d-47b0-9f8b-eb26e505d889",["^ ","^C","8d00da3d-dc4d-47b0-9f8b-eb26e505d889","^D","text","^E","

There are many more use cases, including solving differential equations, FEM simulations, and solving PDEs.

"],"6b8c8d02-f072-4454-9687-1e0b795d9e47",["^ ","^C","6b8c8d02-f072-4454-9687-1e0b795d9e47","^D","text","^E","

  • A GPU is a separate piece of hardware with its own memory space and a different architecture. As a result, there are long transfer times from RAM to the GPU's memory (VRAM). Even launching a kernel on the GPU (in other words, scheduling a function call) introduces large latencies: around ~10 µs for a GPU, compared to a few nanoseconds on a CPU
  • Setting up a kernel can quickly become complicated without a high-level wrapper
  • Lower precision is the default, and higher-precision computations can easily void all performance gains
  • GPU functions (kernels) are inherently parallel, so writing GPU kernels is at least as difficult as writing parallel CPU code, but the difference in hardware adds quite a bit of complexity
  • Related to the above, a lot of algorithms won't port nicely to the GPU. For more details on what to expect, have a look at this blog post
  • Kernels are usually written in a C/C++ dialect, which is not the nicest language to write your algorithms in
  • There is a divide between CUDA and OpenCL, the dominant frameworks used to write low-level GPU code. While CUDA only supports Nvidia hardware, OpenCL supports all hardware but is a bit rough around the edges. One needs to decide what to use, and will be pretty much stuck with that decision

"],"b15e66ec-cbe7-4b87-94fc-64a1bf79ff3b",["^ ","^C","b15e66ec-cbe7-4b87-94fc-64a1bf79ff3b","^D","text","^E","

A GPU is a massively parallel processor with a couple of thousand parallel processing units. For example, the Tesla K80, which is used in this article, offers 4992 parallel CUDA cores. GPUs are quite different from CPUs in terms of frequencies, latencies and hardware capabilities, but it is somewhat similar to a slow CPU with 4992 cores!

"],"e4267f2c-06e0-4ae1-b250-66a9565b765b",["^ ","^C","e4267f2c-06e0-4ae1-b250-66a9565b765b","^D","section","^Y","GPUArrays in the real world","^E",["^J",["bfa64579-c73b-48b7-b6c4-88fd69aaf9d1","e2ce55ae-a15c-49dc-ac05-239bd226b07a","7568cecf-0eaa-4fe3-bc80-9cf89360c8fb","8d00da3d-dc4d-47b0-9f8b-eb26e505d889","56322bde-814a-4056-8601-6ca5c8434825","4a7aa2b0-fb74-40b6-8b52-cbedeaf7597a","7bf252ce-e3a3-4312-b3c8-5c6d74dbd62e","dd74954e-f344-4533-8ff4-e8e072c669aa","b4e4c855-1f1c-4bbc-8b4c-704550497733","f6e7762e-b6c5-4960-840a-8c46fb255143"]]],"d612c50e-123b-46d2-bb0f-acb93ea1e9a5",["^ ","^C","d612c50e-123b-46d2-bb0f-acb93ea1e9a5","^D","text","^E","

Some more operations supported by GPUArrays:
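As a hedged sketch (assuming the CuArrays setup used throughout this article; the exact set of accelerated methods varied between releases), a few of these look like:

```julia
using CuArrays

A = cu(rand(Float32, 1024))
B = cu(rand(Float32, 32, 32))

sum(A)                 # reductions run on the GPU
maximum(A)
mapreduce(abs2, +, A)  # map and reduce fused into one pass
B * B'                 # matrix multiply dispatches to CUBLAS
```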

"],"7bf252ce-e3a3-4312-b3c8-5c6d74dbd62e",["^ ","^E","# train\nfor i = 1:10\n Flux.train!(loss, train, opt, cb = evalcb)\nend","^I",["^J",[]],"^L",["^ ","~:stdout",3],"^M","julia","^C","7bf252ce-e3a3-4312-b3c8-5c6d74dbd62e","^N","~ud7f89130-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ "],"^U",null,"^V",29622,"^W",null],"63a71268-3337-4bf2-8d47-b7e641d16dbb",["^ ","^C","63a71268-3337-4bf2-8d47-b7e641d16dbb","^D","text","^E","

As you can see, for large arrays one gets a solid 60-80x speed-up by moving the calculation to the GPU. Getting this speed-up was as simple as converting the Julia array to a GPUArray.

"],"bd97b005-9736-4628-a5db-26d82089d15a",["^ ","^C","bd97b005-9736-4628-a5db-26d82089d15a","^D","section","^Y"," Array Operations","^E",["a8566d64-5ace-4acf-ad8e-2021d18c225d","298522f5-fb0c-43df-aa38-e8949ef6685e","e60c9e23-2f00-46ad-9973-a31470112212","8d37d54c-4774-45ab-a1bc-bad9f73a0231","8b302ae5-daef-427a-8cd2-b1dbbf1679f3","62a9ecf5-18af-4956-98e1-a95cd0c47859","d612c50e-123b-46d2-bb0f-acb93ea1e9a5","bd6e2023-447e-4062-a7d8-9c25f509d4c8"],"^Z",["^J",[]]],"5a85c88c-f0a9-4064-bf8a-62c80bf270ae",["^ ","^C","5a85c88c-f0a9-4064-bf8a-62c80bf270ae","^D","text","^E","

In the above example you can see the iteration order of a more complex launch configuration. Figuring out the right iteration + launch configuration is crucial to achieve state of the art GPU performance – but won't be part of this simple introduction. There are plenty of GPU tutorials for CUDA and OpenCL which explain this in great detail and those principles are identical when programming the GPU in Julia.

"],"fdf4733c-a4d3-4665-9137-2e15e54f5181",["^ ","^C","fdf4733c-a4d3-4665-9137-2e15e54f5181","^D","section","^Y","How does the GPU work","^E",["^J",["d0caaf8f-5c0b-4c0e-8b9e-1bd2e2e99aed","12260210-0c7d-4cbc-9262-30ac57e4110f","b15e66ec-cbe7-4b87-94fc-64a1bf79ff3b","6494cb51-4ad3-4327-9e27-c2c804df6c81","a034c062-a45b-467f-a953-cb7702a9bad4","6b8c8d02-f072-4454-9687-1e0b795d9e47","0f5d9919-e5ed-4259-8a25-9c835cd74ab1"]],"^Z",["^J",["095f7d06-d4fc-49e2-a68c-96cad2fb0a62","830cd8cc-8d91-412d-b90e-6f1d81a3a8a1","2eba3a00-d38e-45f7-b5c0-8a1bba7f5ce2","bd97b005-9736-4628-a5db-26d82089d15a"]]],"f9c22c0d-fa74-4a04-a4a8-52ccfea9b2d7",["^ ","^C","f9c22c0d-fa74-4a04-a4a8-52ccfea9b2d7","^D","text","^E","

Consequently, one can only use stack allocation on the device, and for the rest pre-allocated GPU buffers are used. As transfers are expensive, it is common to reuse and pre-allocate as much as possible when programming the GPU.
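A minimal sketch of that pattern, assuming the CuArrays setup used in this article (the same idiom works for plain Julia arrays):

```julia
using CuArrays

a = cu(rand(Float32, 10_000))
out = similar(a)            # allocate the GPU buffer once, up front
for i in 1:100
    # reuse the buffer: in-place broadcast writes into `out`,
    # so no new GPU memory is allocated inside the loop
    out .= sin.(a) .* Float32(i)
end
```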

"],"80bd4385-cd68-4fad-ada0-981d4b98b100",["^ ","^E","using CuArrays\n\nthreads = (2, 2)\nblocks = (2, 2)\nT = fill(CuArray, (0, 0), (4, 4))\nB = fill(CuArray, (0, 0), (4, 4))\ngpu_call(T, (B, T), (blocks, threads)) do state, A, B\n # those names pretty much refer to the cuda names\n b = (blockidx_x(state), blockidx_y(state))\n bdim = (blockdim_x(state), blockdim_y(state))\n t = (threadidx_x(state), threadidx_y(state))\n idx = (bdim .* (b .- 1)) .+ t\n A[idx...] = b\n B[idx...] = t\n return\nend\nprintln(\"Threads index: \\n\", T)\nprintln(\"Block index: \\n\", B)","^I",["^J",[]],"^L",["^ ","^16",4],"^M","julia","^C","80bd4385-cd68-4fad-ada0-981d4b98b100","^N","~ua8e7ec10-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ "],"^U",null,"^V",1986,"^W",null],"298522f5-fb0c-43df-aa38-e8949ef6685e",["^ ","^E","x = zeros(4, 4) # 4x4 array of zeros\ny = zeros(4) # 4 element array\nz = 2 # a scalar\n# y's 1st dimension gets repeated for the 2nd dimension in x\n# and the scalar z get's repeated for all dimensions\n# the below is equal to `broadcast(+, broadcast(+, xx, y), z)`\nx .+ y .+ z","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","298522f5-fb0c-43df-aa38-e8949ef6685e","^N","~u582c0b80-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","data","^K",null,"^Q","json","^R",["^ ","^C","QmXbQ4AUGGFrYmA4yfUt8PKEiFumpg1QQJvxWstV9zhzEU","^S",73,"^T","application/json"]]],"^U",null,"^V",703,"^W",null],"8b302ae5-daef-427a-8cd2-b1dbbf1679f3",["^ ","^C","8b302ae5-daef-427a-8cd2-b1dbbf1679f3","^D","text","^E","

This means any Julia function that runs without allocating heap memory (only creating isbits types), can be applied to each element of a GPUArray and multiple dot calls will get fused into one kernel call. As kernel call latency is high, this fusion is a very important optimization.
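As a CPU-side illustration (broadcast fusion behaves the same for plain Julia arrays; the function and variable names here are made up for the example):

```julia
# a scalar function containing only isbits operations
f(x) = 3x^2 + 5x + 2

u = collect(1.0:4.0)
out = similar(u)
# all the dot calls below fuse into a single loop over `u`,
# with no intermediate arrays allocated
out .= f.(2 .* u.^2 .+ 6 .* u.^3 .- sqrt.(u))
```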

"],"3950cb96-1122-4d9f-8b74-7168fb77849a",["^ ","^C","3950cb96-1122-4d9f-8b74-7168fb77849a","^Y","An Introduction to GPU Programming in Julia ","^D","section","~:version",2,"^E",["^J",[]],"^Z",["^J",["fdf4733c-a4d3-4665-9137-2e15e54f5181","e4267f2c-06e0-4ae1-b250-66a9565b765b","ea004e0d-2b7d-4d1d-87b0-58bac63971d8","33457f06-ffd8-4e64-a9e3-f336e875b947"]],"~:settings",["^ ","~:numbered?",false,"~:sidebar?",false,"~:subtitle?",false,"~:centered?",false,"~:authors?",false]],"24e4194b-b6fd-43c4-b876-cc6b6aea8403",["^ ","^C","24e4194b-b6fd-43c4-b876-cc6b6aea8403","^D","text","^E","

As an alternative to heap allocated arrays inside the kernel, you can use GPUArrays. The GPUArray constructor will create GPU buffers and transfer the data to VRAM. If you call Array(gpu_array) the array will get transferred back to RAM, represented as a normal Julia Array. The Julia handle to those gpu arrays is tracked by Julia's GC and if it's not used anymore, the GPU memory will be freed.

"],"b9afd6ac-9711-4b4b-9f88-89e40405af04",["^ ","^C","b9afd6ac-9711-4b4b-9f88-89e40405af04","^D","text","^E","

Location of creation describes if the object was created on the CPU and then transferred to the GPU kernel, or if it was created on the GPU inside the kernel. The table shows if it is possible to create an instance of a type, and for the transfer from CPU to GPU, the table also indicates if the object gets copied or passed by reference.

"],"778cc725-33cc-4cb1-a388-6a9da1daff5d",["^ ","^E","using CuArrays, FileIO, Colors, GPUArrays, BenchmarkTools\nusing CuArrays: CuArray\n\"\"\"\nThe function calculating the Julia set\n\"\"\"\nfunction juliaset(z0, maxiter)\n c = ComplexF32(-0.5, 0.75)\n z = z0\n for i in 1:maxiter\n abs2(z) > 4f0 && return (i - 1) % UInt8\n z = z * z + c\n end\n return maxiter % UInt8 # % is used to convert without overflow check\nend\nrange = 100:50:2^12\ncutimes, jltimes = Float64[], Float64[]\nfunction run_bench(in, out)\n # use dot syntax to apply `juliaset` to each elemt of q_converted \n # and write the output to result\n out .= juliaset.(in, 16)\n # all calls to the GPU are scheduled asynchronous, \n # so we need to synchronize\n GPUArrays.synchronize(out)\nend\n# store a reference to the last results for plotting\nlast_jl, last_cu = nothing, nothing\nfor N in range\n w, h = N, N\n q = [ComplexF32(r, i) for i=1:-(2.0/w):-1, r=-1.5:(3.0/h):1.5]\n for (times, Typ) in ((cutimes, CuArray), (jltimes, Array))\n # convert to Array or CuArray - moving the calculation to CPU/GPU\n q_converted = Typ(q)\n result = Typ(zeros(UInt8, size(q)))\n for i in 1:10 # 5 samples per size\n # benchmarking macro, all variables need to be prefixed with $\n t = Base.@elapsed begin\n\t\t\t\trun_bench(q_converted, result)\n end\n global last_jl, last_cu # we're in local scope\n if result isa CuArray\n last_cu = result\n else\n \tlast_jl = result\n end\n push!(times, t)\n end\n end\nend\n\ncu_jl = hcat(Array(last_cu), last_jl)\ncmap = colormap(\"Blues\", 16 + 1)\ncolor_lookup(val, cmap) = cmap[val + 1]\nsave(\"results/juliaset.png\", color_lookup.(cu_jl, (cmap,)))","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","778cc725-33cc-4cb1-a388-6a9da1daff5d","^N","~ufbc54280-d2e6-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","juliaset.png",["^ ","^D","file","^K","juliaset.png","~:last-modified",1539875132000,"~:idx",null,"^R",["^ 
","^C","QmQXdNWEJhYMUpoJueuHCg2eUgjonmoP8pBTTWZTtgj5X8","^S",481341,"^W",null,"^T","image/png"]]],"^U",null,"^V",118649,"^W",null],"3428b837-5a8e-44d8-91e3-2e8361152cb0",["^ ","^E","using BenchmarkTools\nfunction threadded_map!(f::Function, A::Array, B::Array)\n Threads.@threads for i in 1:length(A)\n A[i] = f(B[i])\n end\n A\nend\nx, y = rand(10^7), rand(10^7)\nkernel(y) = (y / 33f0) * (732.f0/y)\n# on the cpu without threads:\nsingle_t = @belapsed map!($kernel, $x, $y)\n\n# \"on the CPU with 4 threads (2 real cores):\nthread_t = @belapsed threadded_map!($kernel, $x, $y)\n\n# on the GPU:\nxgpu, ygpu = cu(x), cu(y)\ngpu_t = @belapsed begin\n map!($kernel, $xgpu, $ygpu)\n GPUArrays.synchronize($xgpu)\nend\ntimes = [single_t, thread_t, gpu_t]\nspeedup = maximum(times) ./ times\nprintln(\"speedup: $speedup\")\nbar([\"1 core\", \"2 cores\", \"gpu\"], speedup, legend = false, fillcolor = :grey, ylabel = \"speedup\")","^I",["^J",[]],"^L",["^ ","^16",1],"^M","julia","^C","3428b837-5a8e-44d8-91e3-2e8361152cb0","^N","~u8ebeafe0-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","plotly","^K",null,"^Q","plotly","^R",["^ ","^C","QmNn8cip4FgfKL3myRkvVHE3uSdfKYHdHw3yUQtBqVzcYB","^S",2776,"^T","application/json"]]],"^U",null,"^V",43890,"^W",null],"62a9ecf5-18af-4956-98e1-a95cd0c47859",["^ ","^E","using CuArrays\nA = cu([1, 2, 3])\nB = cu([1, 2, 3])\nC = rand(CuArray, Float32, 3)\nresult = A .+ B .- C\ntest(a::T) where T = a * convert(T, 2) # convert to same type as `a`\n\n# inplace broadcast, writes directly into `result`\nresult .= test.(A) # custom function work\n\n# The cool thing is that this composes well with custom types and custom functions.\n# Let's go back to our Point type and define addition for it\nBase.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)\n\n# now this works:\ncustom_types = cu(Point[(1, 2), (4, 3), (2, 2)])\n\n# This particular example also shows the power of broadcasting: 
\n# Non array types are broadcasted and repeated for the whole length\nresult = custom_types .+ Ref(Point(2, 2))\n\n# So the above is equal to (minus all the allocations):\n# this allocates a new array on the gpu, which we can avoid with the above broadcast\nbroadcasted = fill(CuArray, Point(2, 2), (3,))\n\nresult == custom_types .+ broadcasted","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","62a9ecf5-18af-4956-98e1-a95cd0c47859","^N","~u58977780-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","data","^K",null,"^Q","json","^R",["^ ","^C","Qmaa779KVD5z1LkzSccCMjJmYtd11DXA4gtTgsNFo8365G","^S",4,"^T","application/json"]]],"^U",null,"^V",3700,"^W",null],"a8566d64-5ace-4acf-ad8e-2021d18c225d",["^ ","^C","a8566d64-5ace-4acf-ad8e-2021d18c225d","^D","text","^E","

Lots of operations are already defined. Most importantly, GPUArrays supports Julia's fusing dot-broadcasting notation. This notation allows you to apply a function f to each element of an array and create a new array out of its return values. This functionality is usually referred to as a map. The name broadcast refers to the fact that arrays with different shapes get broadcast to the same shape.

This is how it works:

"],"2eba3a00-d38e-45f7-b5c0-8a1bba7f5ce2",["^ ","^C","2eba3a00-d38e-45f7-b5c0-8a1bba7f5ce2","^D","section","^Y","Memory","^E",["^J",["6c575ff1-1123-4a22-adbb-292a3fe99d73","5b06a745-4830-4b8e-a5b2-df8449cef3b3","d412f192-1c45-46ba-aa09-657fd3d22093","1da131e8-d5a9-4e7b-a605-14391ed24107","b9afd6ac-9711-4b4b-9f88-89e40405af04"]],"^Z",["^J",["90e93106-c175-437c-a107-fb911bf59788","4e09f815-feab-4f09-8d9f-e63eba843aa2"]]],"b41071e8-02ab-47f3-946b-0b802fff459b",["^ ","^C","b41071e8-02ab-47f3-946b-0b802fff459b","^D","text","^E","

Let's try to figure out what this is doing! In simple terms, this will call the function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we'd need to make sure that we don't have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:

"],"f70ec7d4-93d1-4076-843e-3ecd79642556",["^ ","^C","f70ec7d4-93d1-4076-843e-3ecd79642556","^D","text","^E","

GPUArrays.jl is that foundation in Julia. It offers an abstract array implementation tailored towards using the raw power of highly parallel hardware. It contains all the necessary functionality to set up the GPU, launch Julia GPU functions and offers some basic array algorithms.

Being abstract means that it needs a concrete implementation coming in the form of CuArrays and CLArrays . They both offer exactly the same interface, thanks to inheriting all functionality from GPUArrays. The only difference shows up when allocating an array, which forces one to decide if the array lives on a CUDA or OpenCL device. More about this in the Memory section.

GPUArrays helps to reduce code duplication, because it allows one to write hardware independent GPU kernels which can be compiled to native GPU code by either CuArrays or CLArrays. So, lots of generic kernels can be shared between all packages inheriting from GPUArrays.

To help you choose: CuArrays only works with Nvidia GPUs, while CLArrays works with most available GPUs. CuArrays is more stable than CLArrays and already works on Julia 0.7. The speed differences are mixed, with no clear winner. I would suggest trying both and seeing what works best.

For this article I'm going to choose CuArrays, since this article is written for Julia 0.7 / 1.0, which still isn't supported by CLArrays.

"],"3bf52ba8-8341-4248-a628-72df5304880c",["^ ","^C","3bf52ba8-8341-4248-a628-72df5304880c","^D","text","^E","

Julia has come a long way to bring composable high-level programming to the high performance world. Now it's time to do the same for the GPU.

"],"e60c9e23-2f00-46ad-9973-a31470112212",["^ ","^C","e60c9e23-2f00-46ad-9973-a31470112212","^D","text","^E","

The fusion happens because the Julia compiler rewrites this expression into one lazy broadcast call that gets passed the whole call tree, which it can then fuse into a single function before looping over the array.
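To see this lazy representation directly, one can call into Base.Broadcast by hand (a slightly simplified sketch of what the dot syntax lowers to):

```julia
x = [1.0, 2.0, 3.0]
# `broadcasted` only records the call tree; it does not compute anything yet
lazy = Broadcast.broadcasted(+, Broadcast.broadcasted(*, x, 2), 1)
# `materialize` fuses the recorded tree into one loop and runs it
result = Broadcast.materialize(lazy)  # equivalent to x .* 2 .+ 1
```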

"],"f6e7762e-b6c5-4960-840a-8c46fb255143",["^ ","^C","f6e7762e-b6c5-4960-840a-8c46fb255143","^D","text","^E","

Another nice property to look at is that GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of a neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort, and makes Flux GPU support work efficiently even for user-defined functions. That this works out of the box, without coordination between GPUArrays and Flux, is a pretty unique property of Julia, which is explained in great detail in: Why Numba and Cython are no substitute for Julia
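A toy sketch of why this composes (not how Flux's AD is actually implemented): generic Julia code is compiled for whatever number type flows through it, so a dual-number type, the core trick of forward-mode AD, works in any plain function, and the same mechanism lets AD types flow through GPU kernels:

```julia
# minimal forward-mode dual number (toy illustration only)
struct Dual{T<:Real} <: Real
    val::T   # function value
    der::T   # derivative value
end
Base.:+(a::Dual, b::Dual) = Dual(a.val + b.val, a.der + b.der)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.der * b.val + a.val * b.der)
Base.:*(a::Real, b::Dual) = Dual(a * b.val, a * b.der)

# a generic user function, written with no knowledge of AD
g(x) = 3 * x * x + x

d = g(Dual(2.0, 1.0))  # seed the derivative of x with 1
```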

"],"f58a7aba-f416-41a0-bd08-8fe07b0df022",["^ ","^C","f58a7aba-f416-41a0-bd08-8fe07b0df022","^D","text","^E","

Most highly parallel algorithms need to churn through quite a bit of data to overcome all the threading and latency overheads. So most algorithms will need arrays to manage all that data, which calls for a good GPU array library as a crucial foundation.

"],"4b7926e0-6502-4410-92eb-1b34d89f2d92",["^ ","^C","4b7926e0-6502-4410-92eb-1b34d89f2d92","^D","text","^E","

One can get pretty far by just using the generic abstract array interface of GPUArrays without ever writing any GPU kernels. However, at some point one might need to implement an algorithm that needs to run on the GPU and can't be expressed by a combination of generic array algorithms!

"],"a034c062-a45b-467f-a953-cb7702a9bad4",["^ ","^C","a034c062-a45b-467f-a953-cb7702a9bad4","^D","text","^E","

The sheer number of parallel threads one can launch can yield massive speed-ups, but also makes it harder to utilize the GPU. Let's have a detailed look at the disadvantages one buys into when utilizing this raw power:

"],"830cd8cc-8d91-412d-b90e-6f1d81a3a8a1",["^ ","^C","830cd8cc-8d91-412d-b90e-6f1d81a3a8a1","^D","section","^Y","Performance","^E",["^J",["fc35d0a2-6efb-4971-9728-e5b6198ad57e","778cc725-33cc-4cb1-a388-6a9da1daff5d","f5616db3-d53e-4347-a8c4-8be45e7308f2","63a71268-3337-4bf2-8d47-b7e641d16dbb","1b96bbbf-1dba-4b5b-9187-ab5a76361a19","7390d157-1854-412e-bba7-0aea472566e0"]]],"1da131e8-d5a9-4e7b-a605-14391ed24107",["^ ","^C","1da131e8-d5a9-4e7b-a605-14391ed24107","^D","formula","^E","\\tiny\n\\left[\n\\begin{array}{ccccccc}\n\\text{} & \\text{ isbits type} & \\text{immutable type} & \\text{mutable type} & \\text{julia array} & \\text{gpu array} & \\text{(device) local memory} \\\\\n\\text{location of creation} & \\text{} & \\text{} & \\text{} & \\text{} & \\text{} & \\text{} \\\\\n\\text{on CPU} & \\text{yes, copy} & \\text{no ptrs, copy} & \\text{copy} & \\text{no} & \\text{by reference} & \\text{no} \\\\\n\\text{on GPU} & \\text{yes} & \\text{may contain device ptr} & \\text{no (on 0.7 if elimated)} & \\text{no} & \\text{no} & \\text{yes} \\\\\n\\end{array}\n\\right]"],"33457f06-ffd8-4e64-a9e3-f336e875b947",["^ ","^C","33457f06-ffd8-4e64-a9e3-f336e875b947","^D","section","^Y","Conclusion","^E",["3bf52ba8-8341-4248-a628-72df5304880c","0b7718a8-066d-4e2b-a020-e3530b4284ec","403a6184-13b5-4d48-8e5c-3612ce2ea5d1"]],"bfa64579-c73b-48b7-b6c4-88fd69aaf9d1",["^ ","^C","bfa64579-c73b-48b7-b6c4-88fd69aaf9d1","^D","text","^E","

Let's jump right into some cool use cases.

"],"4a7aa2b0-fb74-40b6-8b52-cbedeaf7597a",["^ ","^E","using Flux, Flux.Data.MNIST, Statistics\nusing Flux: onehotbatch, onecold, crossentropy, throttle\nusing Base.Iterators: repeated, partition\nusing CuArrays\n\n# Classify MNIST digits with a convolutional network\n\nimgs = MNIST.images()\n\nlabels = onehotbatch(MNIST.labels(), 0:9)\n\n# Partition into batches of size 1,000\ntrain = [(cat(float.(imgs[i])..., dims = 4), labels[:,i])\n for i in partition(1:60_000, 1000)]\n\nuse_gpu = true # helper to easily switch between gpu/cpu\n\ntodevice(x) = use_gpu ? gpu(x) : x\n\ntrain = todevice.(train)\n\n# Prepare test set (first 1,000 images)\ntX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevice\ntY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevice\n\nm = Chain(\n Conv((2,2), 1=>16, relu),\n x -> maxpool(x, (2,2)),\n Conv((2,2), 16=>8, relu),\n x -> maxpool(x, (2,2)),\n x -> reshape(x, :, size(x, 4)),\n Dense(288, 10), softmax) |> todevice\n\nm(train[1][1])\n\nloss(x, y) = crossentropy(m(x), y)\n\naccuracy(x, y) = mean(onecold(m(x)) .== onecold(y))\n\nevalcb = throttle(() -> @show(accuracy(tX, tY)), 10)\nopt = ADAM(Flux.params(m));","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","4a7aa2b0-fb74-40b6-8b52-cbedeaf7597a","^N","~ud224b720-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ "],"^U",null,"^V",6031,"^W",null],"0f5d9919-e5ed-4259-8a25-9c835cd74ab1",["^ ","^C","0f5d9919-e5ed-4259-8a25-9c835cd74ab1","^D","text","^E","

The good news is that Julia, a high-level scripting language, allows you to write both the kernels and the surrounding code in Julia itself, while running on most GPU hardware!

"],"b4e4c855-1f1c-4bbc-8b4c-704550497733",["^ ","^C","b4e4c855-1f1c-4bbc-8b4c-704550497733","^D","text","^E","

Just by converting the arrays to GPUArrays (with gpu(array)) we were able to move the entire computation to the GPU and get a pretty nice speed improvement. This is thanks to Julia's sophisticated AbstractArray infrastructure, into which GPUArrays seamlessly integrates. Subsequently, if you leave out the conversion to a GPUArray, the code will also run with normal Julia arrays, but then of course on the CPU. You can try this out by changing use_gpu = true to use_gpu = false and rerunning the initialization and training cells. Comparing GPU and CPU, I get 975 seconds for a CPU run and 29 seconds for the GPU, which is a nice speed-up of ~33x.

"],"fe642bca-b571-4183-a38a-1e937833a4d8",["^ ","^C","fe642bca-b571-4183-a38a-1e937833a4d8","^D","text","^E","

A big difference when working with the GPU is that there is no garbage collector (GC) on the GPU. Fortunately, this is not a big issue, since the kind of high performance kernel one writes for the GPU shouldn't create any GC-tracked memory to begin with.

Implementing a GC for the GPU is possible, but remember that every kernel executed is massively parallel. Creating and tracking lots of heap memory in every one of the ~1000 gpu threads will quickly destroy any performance gains, so it's really just not worth the effort.
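You can check on the CPU that kernel-style code stays allocation-free, using the standard `@allocated` macro (a hedged sketch; `saxpy` and `apply!` are made-up names for the example):

```julia
# kernel-style code: only isbits scalars flow through, nothing is heap allocated
saxpy(a, x, y) = a * x + y

function apply!(out, a, x, y)
    out .= saxpy.(a, x, y)  # in-place broadcast into a pre-allocated buffer
    return out
end

x, y = rand(Float32, 10^6), rand(Float32, 10^6)
out = similar(x)
apply!(out, 2f0, x, y)                 # first call compiles
allocs = @allocated apply!(out, 2f0, x, y)
```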

"],"ea004e0d-2b7d-4d1d-87b0-58bac63971d8",["^ ","^C","ea004e0d-2b7d-4d1d-87b0-58bac63971d8","^D","section","^Y","Writing GPU Kernels","^E",["^J",["4b7926e0-6502-4410-92eb-1b34d89f2d92","dc1691b3-b051-4078-8105-be5693dde633","e01448e0-7630-4c6a-b415-050bd4b37efa","e542cbee-022a-4da4-91bb-f683b86080f3","b41071e8-02ab-47f3-946b-0b802fff459b","3428b837-5a8e-44d8-91e3-2e8361152cb0","f21ef039-a847-43e7-97be-55b8936c1cba","d8fb28e2-49aa-4c56-83e4-1dfeb06c2af9","80bd4385-cd68-4fad-ada0-981d4b98b100","5a85c88c-f0a9-4064-bf8a-62c80bf270ae"]],"^Z",["^J",[]]],"f21ef039-a847-43e7-97be-55b8936c1cba",["^ ","^C","f21ef039-a847-43e7-97be-55b8936c1cba","^D","text","^E","

Because the function isn't doing a lot of work, we don't see perfect scaling, but the threaded and GPU version still give us a nice speed-up.

"],"fac98232-a8eb-42d6-89e2-1b6e373a87c1",["^ ","~:runtime/inherited-environment-variables",["^J",[["^ ","^K","SERVICE_9998_NAME","~:value","runtime-7f9b8ecb-113f-41ec-a965-450fd1c0faa7"],["^ ","^K","SERVICE_9998_CHECK_TCP","^29","true"],["^ ","^K","SERVICE_TAGS=urlprefix-/runner/17592186656385/runtime/7f9b8ecb-113f-41ec-a965-450fd1c0faa7 strip","^29","/runner/17592186656385/runtime/7f9b8ecb-113f-41ec-a965-450fd1c0faa7"],["^ ","^K","NEXTJOURNAL_RUNTIME_SERVICE_URL","^29","https://nextjournal.com/runner/17592186656385/runtime/7f9b8ecb-113f-41ec-a965-450fd1c0faa7"],["^ ","^K","NEXTJOURNAL_MOUNT_CUDA","^29","9.2-cudnn7-devel-ubuntu18.04"],["^ ","^K","DISPLAY","^29",":0"],["^ ","^K","NVIDIA_VISIBLE_DEVICES","^29","all"],["^ ","^K","NVIDIA_DRIVER_CAPABILITIES","^29","all"],["^ ","^K","PATH","^29","/usr/local/julia/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],["^ ","^K","JULIA_PATH","^29","/usr/local/julia"],["^ ","^K","JULIA_GPG","^29","3673DF529D9049477F76B37566E3C7DC03D6E495"],["^ ","^K","JULIA_VERSION","^29","1.0.0"],["^ ","^K","LC_ALL","^29","en_US.UTF-8"],["^ ","^K","LANGUAGE","^29","en_US.en"],["^ ","^K","LANG","^29","en_US.UTF-8"],["^ ","^K","DEBIAN_FRONTEND","^29","noninteractive"],["^ ","^K","BASH_ENV","^29","/.bash_profile"]]],"~:type","~:nextjournal","^M","julia","^C","fac98232-a8eb-42d6-89e2-1b6e373a87c1","^D","runtime","^U",null,"~:environment",["^2<

",["^ ","~:article/nextjournal.id","~u0287ca51-4717-45ef-9a18-70153befcf15","~:change/nextjournal.id","~u5bc4663e-a617-4cad-a5fd-bcf13a88c7c2","~:node/id","7f9b8ecb-113f-41ec-a965-450fd1c0faa7"]],"~:runtime/environment-variables",[["^ ","^K","JULIA_NUM_THREADS","^29","8"],["^ ","^K","GKS_WSTYPE","^29","png"]],"~:resources","~:gpu"],"fc35d0a2-6efb-4971-9728-e5b6198ad57e",["^ ","^C","fc35d0a2-6efb-4971-9728-e5b6198ad57e","^D","text","^E","

Let's quickly motivate why we would want to move our calculations to the GPU, with a simple interactive code example calculating the Julia set:

"],"bd6e2023-447e-4062-a7d8-9c25f509d4c8",["^ ","^C","bd6e2023-447e-4062-a7d8-9c25f509d4c8","^D","text","^E","

"],"e542cbee-022a-4da4-91bb-f683b86080f3",["^ ","^E","using GPUArrays, CuArrays\n# Overloading the Julia Base map! function for GPUArrays\nfunction Base.map!(f::Function, A::GPUArray, B::GPUArray)\n # our function that will run on the gpu\n function kernel(state, f, A, B)\n # If launch parameters aren't specified, linear_index gets the index\n # into the Array passed as second argument to gpu_call (`A`)\n i = linear_index(state)\n \t\tif i<

= length(A)\n @inbounds A[i] = f(B[i])\n end\n return\n end\n # call kernel on the gpu\n gpu_call(kernel, A, (f, A, B))\nend","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","e542cbee-022a-4da4-91bb-f683b86080f3","^N","~u8ebcdb20-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ "],"^U",null,"^V",11,"^W",null],"dc1691b3-b051-4078-8105-be5693dde633",["^ ","^C","dc1691b3-b051-4078-8105-be5693dde633","^D","text","^E","

The nice thing is that GPUArrays takes away quite a bit of work with a layered approach that lets you start off with high level code, but allows you to pretty much write low-level kernels similarly to what one would find in most OpenCL/CUDA examples. It also allows you to execute kernels both on OpenCL or CUDA devices, abstracting away any differences in those frameworks.

"],"dd74954e-f344-4533-8ff4-e8e072c669aa",["^ ","^E","using Colors, FileIO, ImageShow\nN = 22\nimg = tX[:, :, 1:1, N:N]\nprintln(\"Predicted: \", Flux.onecold(m(img)) .- 1)\nGray.(collect(tX[:, :, 1, N]))","^I",["^J",[]],"^L",["^ ","^16",1],"^M","julia","^C","dd74954e-f344-4533-8ff4-e8e072c669aa","^N","~u8ad25a80-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","file","^K",null,"^Q","display","^R",["^ ","^C","QmUmz4tfrmBaNWXa3iRFMrKHYKPrD5T4wNVMHYRyq6PMxX","^S",606,"^T","image/png"]]],"^U",null,"^V",6569,"^W",null],"0b7718a8-066d-4e2b-a020-e3530b4284ec",["^ ","^C","0b7718a8-066d-4e2b-a020-e3530b4284ec","^D","text","^E","

The hope is that Julia lowers the bar for people to start programming on GPUs, and that we can grow an extendable platform for open source GPU computing. The first success story, of automatic differentiation working out of the box via Julia packages that haven't even been written for the GPU, gives a lot of reason to believe in the success of Julia's extendable and generic design in the domain of GPU computing.

"],"6494cb51-4ad3-4327-9e27-c2c804df6c81",["^ ","^E","using CUDAdrv; CUDAdrv.name(CuDevice(0))","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","6494cb51-4ad3-4327-9e27-c2c804df6c81","^N","~ufb4da180-d2e6-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","data","^K",null,"^Q","json","^R",["^ ","^C","QmYFmTkHeERryshLEYiyHmHukLBjwsjKHDPiNrzrZRt4M5","^S",11,"^T","application/json"]]],"^U",null,"^V",777,"^W",null],"095f7d06-d4fc-49e2-a68c-96cad2fb0a62",["^ ","^C","095f7d06-d4fc-49e2-a68c-96cad2fb0a62","^D","section","^Y","GPUArrays","^E",["^J",["f58a7aba-f416-41a0-bd08-8fe07b0df022","f70ec7d4-93d1-4076-843e-3ecd79642556"]],"^Z",["^J",[]]],"e2ce55ae-a15c-49dc-ac05-239bd226b07a",["^ ","^C","e2ce55ae-a15c-49dc-ac05-239bd226b07a","^D","text","^E","

This GPU accelerated smoke simulation was created with GPUArrays + CLArrays and runs on both GPU or CPU, with the GPU version being up to 15x faster:

"],"12260210-0c7d-4cbc-9262-30ac57e4110f",["^ ","^C","12260210-0c7d-4cbc-9262-30ac57e4110f","^D","text","^E","

First of all, what is a GPU anyways?

"],"07d732ed-7e58-4970-af4a-16701813f0ee",["^ ","^E","using CuArrays, LinearAlgebra\n\n# GPU Arrays can be constructed from all Julia arrays containing isbits types!\nA1D = cu([1, 2, 3]) # cl for CLArrays\nA1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays\n# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64\ndiagonal_matrix = CuArray{Float32}(I, 100, 100)\nfilled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77\nrandy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU\n# The array constructor also accepts isbits iterators with a known size\n# Note, that since you can also pass isbits types to a gpu kernel directly, in most cases you won't need to materialize them as an gpu array\nfrom_iter = CuArray(1:10)\n# let's create a point type to further illustrate what can be done:\nstruct Point\n x::Float32\n y::Float32\nend\nBase.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])\n# because we defined the above convert from a tuple to a point\n# [Point(2, 2)] can be written as Point[(2,2)] since all array \n# elements will get converted to Point\ncustom_types = cu(Point[(1, 2), (4, 3), (2, 2)])\ntypeof(custom_types)","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","07d732ed-7e58-4970-af4a-16701813f0ee","^N","~u558d6bd0-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","data","^K",null,"^Q","json","^R",["^ ","^C","QmQxCHHamFSPxeHGaz6mwcQbRfh79LdFc7DBeLT6TXZ9iD","^S",18,"^T","application/json"]]],"^U",null,"^V",4394,"^W",null],"1b96bbbf-1dba-4b5b-9187-ab5a76361a19",["^ ","^C","1b96bbbf-1dba-4b5b-9187-ab5a76361a19","^D","text","^E","

One might think that the GPU performance suffers from being written in a dynamic language like Julia, but Julia's GPU performance should be pretty much on par with the raw performance of CUDA or OpenCL. Tim Besard did a great job at integrating the LLVM Nvidia compilation pipeline to achieve the same – or sometimes even better – performance as pure CUDA C code. Tim published a pretty detailedblog post in which he explains this further. CLArrays approach is a bit different and generates OpenCL C code directly from Julia, which has the same performance as OpenCL C!

"],"56322bde-814a-4056-8601-6ca5c8434825",["^ ","^C","56322bde-814a-4056-8601-6ca5c8434825","^D","text","^E","

Let's walk through a simple Machine Learning example, to see how GPUArrays can be used:

"],"d8fb28e2-49aa-4c56-83e4-1dfeb06c2af9",["^ ","^C","d8fb28e2-49aa-4c56-83e4-1dfeb06c2af9","^D","text","^E","

The GPU is a bit more complex than what the thread example allows us to show, since the hardware threads are laid out inblocks of threads – gpu_call abstracts that away in the simple version, but it can also be used with more complex launch configurations:

"],"7568cecf-0eaa-4fe3-bc80-9cf89360c8fb",["^ ","^C","7568cecf-0eaa-4fe3-bc80-9cf89360c8fb","^D","embed","~:url","https://vimeo.com/235601956"],"f5616db3-d53e-4347-a8c4-8be45e7308f2",["^ ","^E","using Plots; plotly()\nx = repeat(range, inner = 10)\nspeedup = jltimes ./ cutimes\nPlots.scatter(\n log2.(x), [speedup, fill(1.0, length(speedup))], \n label = [\"cuda\" \"cpu\"], markersize = 2, markerstrokewidth = 0,\n legend = :right, xlabel = \"2^N\", ylabel = \"speedup\"\n)","^I",["^J",[]],"^L",["^ "],"^M","julia","^C","f5616db3-d53e-4347-a8c4-8be45e7308f2","^N","~u42b5a900-d2e7-11e8-b7e3-d5a7297efcb7","^O",["^O","fac98232-a8eb-42d6-89e2-1b6e373a87c1"],"^D","code","^P",["^ ","~_",["^ ","^D","plotly","^K",null,"^Q","plotly","^R",["^ ","^C","QmYdUiaWGXPacP8DMfKY7pV4aewoW8S6RkatCtNCpqHrnQ","^S",57676,"^T","application/json"]]],"^U",null,"^V",30768,"^W",null],"7390d157-1854-412e-bba7-0aea472566e0",["^ ","^C","7390d157-1854-412e-bba7-0aea472566e0","^D","text","^E","

To get a better idea of the performance and see some comparisons to multithreadded CPU code, I collected some more benchmarks .

"],"d412f192-1c45-46ba-aa09-657fd3d22093",["^ ","^C","d412f192-1c45-46ba-aa09-657fd3d22093","^D","text","^E","

All those Julia types behave differently when transferred to the GPU or when created on the GPU. You can use the following table to get an overview of what to expect:

"],"8d37d54c-4774-45ab-a1bc-bad9f73a0231",["^ ","^C","8d37d54c-4774-45ab-a1bc-bad9f73a0231","^D","text","^E","

If you want a more throrough and interactive explanation of how broadcasting works, you can have a look at this great guide: julia.guide/broadcasting

"],"6c575ff1-1123-4a22-adbb-292a3fe99d73",["^ ","^C","6c575ff1-1123-4a22-adbb-292a3fe99d73","^D","text","^E","

GPUs have their own memory space with video memory (VRAM), different caches, and registers. Whatever you do, any Julia object must get transferred to the GPU before you can work with it. Not all types in Julia work on the GPU.

To give you an overview first let's see what Julia types there are:

"]],"~:transclusions",["~#cmap",[["^ ","^2=","~u0287ca51-4717-45ef-9a18-70153befcf15","^2>","~u5bc4663e-a617-4cad-a5fd-bcf13a88c7c2","^2?","7f9b8ecb-113f-41ec-a965-450fd1c0faa7"],["^ ","^28",["^J",[["^ ","^K","PATH","^29","/usr/local/julia/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"],["^ ","^K","JULIA_PATH","^29","/usr/local/julia"],["^ ","^K","JULIA_GPG","^29","3673DF529D9049477F76B37566E3C7DC03D6E495"],["^ ","^K","JULIA_VERSION","^29","1.0.0"],["^ ","^K","LC_ALL","^29","en_US.UTF-8"],["^ ","^K","LANGUAGE","^29","en_US.en"],["^ ","^K","LANG","^29","en_US.UTF-8"],["^ ","^K","DEBIAN_FRONTEND","^29","noninteractive"],["^ ","^K","BASH_ENV","^29","/.bash_profile"]]],"^;","~m1539597886404","~:transclusion",["^ ","^2=","~u0287ca51-4717-45ef-9a18-70153befcf15","^2>","~u5bc4663e-a617-4cad-a5fd-bcf13a88c7c2","^2?","7f9b8ecb-113f-41ec-a965-450fd1c0faa7"],"^I",["^J",[]],"^K","Julia GPU+Flux","~:docker/environment-image","eu.gcr.io/nextjournal-com/environment@sha256:4dfe6b9d4a85cc91cdf94c349f8f3d8028a513a9c8b58cb0d71346ccdfdbda0c","^2:","^2;","~:environment?",true,"^M","julia","^C","7f9b8ecb-113f-41ec-a965-450fd1c0faa7","^D","runtime","~:changed?",false,"^U",null,"^2<",["^2<

",["^ ","^2=","~u0289217e-a1bb-440d-9224-f32b960a271a","^2>","~u5b7430ce-e0bf-4c81-b2c1-aee21676992b","^2?","e8f2328c-62cf-4dca-849f-73746f78304c"]],"^2@",[["^ ","^K","NEXTJOURNAL_MOUNT_CUDA","^29","9.2-cudnn7-devel-ubuntu18.04"],["^ ","^K","DISPLAY","^29",":0"]],"^2A","^2B","~:diff",""]]]],"^C",17592186649289,"^4","~u0287c2e0-a8ac-4929-b5ab-9efe94bac3c4"],"~:article-settings",["^ ","^1B",false,"^1C",false,"^1D",false,"^1E",false,"^1F",false],"~:preview?",false,"~:change-id","CLNNwCxCMobywDuw6EznMN","^1@","published","~:html","

An Introduction to GPU Programming in Julia

How does the GPU work

This article aims to give a quick introduction to how GPUs work, give an overview of the current Julia GPU ecosystem, and show how easy it is to get simple GPU programs running. To make things easier, you can run all the code samples directly in the article if you have an account and click on edit.

First of all, what is a GPU anyways?

A GPU is a massively parallel processor with a couple of thousand parallel processing units. For example, the Tesla K80, which is used in this article, offers 4992 parallel CUDA cores. GPUs are quite different from CPUs in terms of frequencies, latencies and hardware capabilities, but a GPU is somewhat similar to a slow CPU with 4992 cores!

```julia
using CUDAdrv; CUDAdrv.name(CuDevice(0))
```

The sheer number of parallel threads one can launch can yield massive speed-ups, but also makes it harder to utilize the GPU. Let's have a detailed look at the disadvantages one buys into when utilizing this raw power:

  • A GPU is a separate piece of hardware with its own memory space and a different architecture. As a result, there are long transfer times from RAM to the GPU's memory (VRAM). Even launching a kernel on the GPU (in other words, scheduling a function call) introduces large latencies: around ~10µs for GPUs, compared to a few nanoseconds on a CPU
  • Setting up a kernel can quickly become complicated without a high level wrapper
  • Lower precision is the default and higher precision computations can easily void all performance gains
  • GPU functions (kernels) are inherently parallel, so writing GPU kernels is at least as difficult as writing parallel CPU code, but the difference in hardware adds quite a bit of complexity
  • Related to the above, a lot of algorithms won't port nicely to the GPU. For more details on what to expect, have a look at this blog post
  • Kernels are usually written in a C/C++ dialect, which is not the nicest language to write down your algorithms
  • There is a divide between CUDA and OpenCL, which are the dominant frameworks used to write low-level GPU code. While CUDA only supports Nvidia hardware, OpenCL supports all hardware but is a bit rough around the edges. One needs to decide what to use, and will get pretty much stuck with that decision

The good news is Julia, a high level scripting language, allows you to write both kernel and surrounding code in Julia itself, while running on most GPU hardware!

GPUArrays

Most highly parallel algorithms need to churn through quite a bit of data to overcome all the threading and latency overheads. So most algorithms will need arrays to manage all that data, which calls for a good GPU array library as a crucial foundation.

GPUArrays.jl is that foundation in Julia. It offers an abstract array implementation tailored towards using the raw power of highly parallel hardware. It contains all the necessary functionality to set up the GPU, launch Julia GPU functions and offers some basic array algorithms.

Being abstract means that it needs a concrete implementation coming in the form of CuArrays and CLArrays . They both offer exactly the same interface, thanks to inheriting all functionality from GPUArrays. The only difference shows up when allocating an array, which forces one to decide if the array lives on a CUDA or OpenCL device. More about this in the Memory section.

GPUArrays helps to reduce code duplication, because it allows one to write hardware independent GPU kernels which can be compiled to native GPU code by either CuArrays or CLArrays. So, lots of generic kernels can be shared between all packages inheriting from GPUArrays.

To help you choose: CuArrays only works with Nvidia GPUs, while CLArrays works with most available GPUs. CuArrays is more stable than CLArrays and works on Julia 0.7 already. The speed differences are mixed, with no clear winner. I would suggest trying both and seeing what works best.

For this article I'm going to choose CuArrays, since this article is written for Julia 0.7 / 1.0, which still isn't supported by CLArrays.

Performance

Let's quickly motivate why we would want to move our calculations to the GPU, with a simple interactive code example calculating the Julia set:

```julia
using CuArrays, FileIO, Colors, GPUArrays, BenchmarkTools
using CuArrays: CuArray
"""
The function calculating the Julia set
"""
function juliaset(z0, maxiter)
    c = ComplexF32(-0.5, 0.75)
    z = z0
    for i in 1:maxiter
        abs2(z) > 4f0 && return (i - 1) % UInt8
        z = z * z + c
    end
    return maxiter % UInt8 # % is used to convert without overflow check
end
range = 100:50:2^12
cutimes, jltimes = Float64[], Float64[]
function run_bench(in, out)
    # use dot syntax to apply `juliaset` to each element of q_converted
    # and write the output to result
    out .= juliaset.(in, 16)
    # all calls to the GPU are scheduled asynchronously,
    # so we need to synchronize
    GPUArrays.synchronize(out)
end
# store a reference to the last results for plotting
last_jl, last_cu = nothing, nothing
for N in range
    w, h = N, N
    q = [ComplexF32(r, i) for i = 1:-(2.0/w):-1, r = -1.5:(3.0/h):1.5]
    for (times, Typ) in ((cutimes, CuArray), (jltimes, Array))
        # convert to Array or CuArray - moving the calculation to CPU/GPU
        q_converted = Typ(q)
        result = Typ(zeros(UInt8, size(q)))
        for i in 1:10 # 10 samples per size
            # measure the elapsed time of one run
            t = Base.@elapsed begin
                run_bench(q_converted, result)
            end
            global last_jl, last_cu # we're in local scope
            if result isa CuArray
                last_cu = result
            else
                last_jl = result
            end
            push!(times, t)
        end
    end
end

cu_jl = hcat(Array(last_cu), last_jl)
cmap = colormap("Blues", 16 + 1)
color_lookup(val, cmap) = cmap[val + 1]
save("results/juliaset.png", color_lookup.(cu_jl, (cmap,)))
```

```julia
using Plots; plotly()
x = repeat(range, inner = 10)
speedup = jltimes ./ cutimes
Plots.scatter(
    log2.(x), [speedup, fill(1.0, length(speedup))],
    label = ["cuda" "cpu"], markersize = 2, markerstrokewidth = 0,
    legend = :right, xlabel = "2^N", ylabel = "speedup"
)
```

As you can see, for large arrays one gets a solid 60-80x speed-up by moving the calculation to the GPU. Getting this speed-up was as simple as converting the Julia array to a GPUArray.

One might think that GPU performance suffers from Julia being a dynamic language, but Julia's GPU performance should be pretty much on par with the raw performance of CUDA or OpenCL. Tim Besard did a great job at integrating the LLVM Nvidia compilation pipeline to achieve the same – or sometimes even better – performance as pure CUDA C code. Tim published a pretty detailed blog post in which he explains this further. CLArrays' approach is a bit different: it generates OpenCL C code directly from Julia, which has the same performance as OpenCL C!

To get a better idea of the performance and see some comparisons to multithreaded CPU code, I collected some more benchmarks .

Memory

GPUs have their own memory space with video memory (VRAM), different caches, and registers. Whatever you do, any Julia object must get transferred to the GPU before you can work with it. Not all types in Julia work on the GPU.

To give you an overview first let's see what Julia types there are:

```julia
struct Test # an immutable struct
# that only contains other immutable/isbits types, which makes
# isbitstype(Test) == true
    x::Float32
end

# the isbits property is important, since those types can be used
# without constraints on the GPU!
@assert isbitstype(Test) == true
x = (2, 2)
isa(x, Tuple{Int, Int}) # tuples are also immutable
mutable struct Test2 #-> mutable, isbits(Test2) == false
    x::Float32
end
struct Test3
    # contains a heap allocation / reference, not isbits
    x::Vector{Float32}
    y::Test2 # Test2 is mutable and also heap allocated / a reference
end
Vector{Test}  # <- Array with isbits elements is contiguous in memory
Vector{Test2} # <- Array with mutable elements is basically a linked list, since it just contains the pointers to its elements and won't work with the GPU.
```

All those Julia types behave differently when transferred to the GPU or when created on the GPU. You can use the following table to get an overview of what to expect:

| location of creation | isbits type | immutable type | mutable type | julia array | gpu array | (device) local memory |
|---|---|---|---|---|---|---|
| on CPU | yes, copy | no ptrs, copy | copy | no | by reference | no |
| on GPU | yes | may contain device ptr | no (on 0.7 if eliminated) | no | no | yes |

Location of creation describes if the object was created on the CPU and then transferred to the GPU kernel, or if it was created on the GPU inside the kernel. The table shows if it is possible to create an instance of a type, and for the transfer from CPU to GPU, the table also indicates if the object gets copied or passed by reference.

Garbage Collection

A big difference when working with the GPU is that there is no garbage collector (GC) on the GPU. Thankfully, this is not a big issue, since the kind of high-performance kernel one writes for the GPU shouldn't create any GC-tracked memory to begin with.

Implementing a GC for the GPU is possible, but remember that every kernel executed is massively parallel. Creating and tracking lots of heap memory in every one of the ~1000 gpu threads will quickly destroy any performance gains, so it's really just not worth the effort.

As an alternative to heap allocated arrays inside the kernel, you can use GPUArrays. The GPUArray constructor will create GPU buffers and transfer the data to VRAM. If you call Array(gpu_array) the array will get transferred back to RAM, represented as a normal Julia Array. The Julia handle to those gpu arrays is tracked by Julia's GC and if it's not used anymore, the GPU memory will be freed.

Consequently, one can only use stack allocation on the device, and for the rest pre-allocated GPU buffers are used. As transfers are expensive, it is common to reuse and pre-allocate as much as possible when programming the GPU.
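To make the pre-allocate-and-reuse pattern concrete, here is a minimal sketch using plain CPU `Array`s so it runs anywhere; the same in-place dot-assignment works unchanged on a `CuArray` (e.g. with `buf = cu(zeros(Float32, 1024))`):

```julia
# Allocate buffers once, outside the hot loop.
buf = zeros(Float32, 1024)
src = rand(Float32, 1024)

for step in 1:100
    # In-place broadcast: writes into the pre-allocated buffer,
    # no new array is allocated per iteration.
    buf .= sin.(src) .* 2f0
end
```

On the GPU this matters even more, since every fresh allocation implies a new VRAM buffer and transfers are expensive.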

The GPUArray Constructors

```julia
using CuArrays, LinearAlgebra

# GPU Arrays can be constructed from all Julia arrays containing isbits types!
A1D = cu([1, 2, 3]) # cl for CLArrays
A1D = fill(CuArray{Int}, 0, (100,)) # CLArray for CLArrays
# Float32 array - Float32 is usually preferred and can be up to 30x faster on most GPUs than Float64
diagonal_matrix = CuArray{Float32}(I, 100, 100)
filled = fill(CuArray, 77f0, (4, 4, 4)) # 3D array filled with Float32 77
randy = rand(CuArray, Float32, 42, 42) # random numbers generated on the GPU
# The array constructor also accepts isbits iterators with a known size
# Note that since you can also pass isbits types to a gpu kernel directly, in most cases you won't need to materialize them as a gpu array
from_iter = CuArray(1:10)
# let's create a point type to further illustrate what can be done:
struct Point
    x::Float32
    y::Float32
end
Base.convert(::Type{Point}, x::NTuple{2, Any}) = Point(x[1], x[2])
# because we defined the above convert from a tuple to a point
# [Point(2, 2)] can be written as Point[(2,2)] since all array
# elements will get converted to Point
custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])
typeof(custom_types)
```

Array Operations

Lots of operations are already defined. Most importantly, GPUArrays support Julia's fusing dot broadcasting notation . This notation allows you to apply a function to each element of an array, and create a new array out of the return values of f . This functionality is usually referred to as a map. The broadcast refers to the fact that arrays with different shapes get broadcasted to the same shape.

This is how it works:

```julia
x = zeros(4, 4) # 4x4 array of zeros
y = zeros(4)    # 4 element array
z = 2           # a scalar
# y's 1st dimension gets repeated for the 2nd dimension in x
# and the scalar z gets repeated for all dimensions
# the below is equal to `broadcast(+, broadcast(+, x, y), z)`
x .+ y .+ z
```

The fusion happens because the Julia compiler will rewrite this expression into one lazy broadcast call that gets the call tree passed, which then can fuse the whole call tree into one function before looping over the array.

If you want a more thorough and interactive explanation of how broadcasting works, you can have a look at this great guide: julia.guide/broadcasting

This means any Julia function that runs without allocating heap memory (only creating isbits types), can be applied to each element of a GPUArray and multiple dot calls will get fused into one kernel call. As kernel call latency is high, this fusion is a very important optimization.
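As a small CPU-side illustration of that fusion (the mechanism is identical for GPUArrays), chained dot calls compile to a single loop that writes into a pre-allocated output, instead of materializing a temporary array for each operation:

```julia
x = rand(Float32, 1000)
out = similar(x)
f(v) = v * v
g(v) = sqrt(v)

# The three dot calls below fuse into one loop, roughly:
#   for i in eachindex(x); out[i] = f(x[i]) + g(x[i]); end
# On a GPUArray this becomes a single kernel launch.
out .= f.(x) .+ g.(x)
```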

```julia
using CuArrays
A = cu([1, 2, 3])
B = cu([1, 2, 3])
C = rand(CuArray, Float32, 3)
result = A .+ B .- C
test(a::T) where T = a * convert(T, 2) # convert to same type as `a`

# inplace broadcast, writes directly into `result`
result .= test.(A) # custom functions work

# The cool thing is that this composes well with custom types and custom functions.
# Let's go back to our Point type and define addition for it
Base.:(+)(p1::Point, p2::Point) = Point(p1.x + p2.x, p1.y + p2.y)

# now this works:
custom_types = cu(Point[(1, 2), (4, 3), (2, 2)])

# This particular example also shows the power of broadcasting:
# Non array types are broadcasted and repeated for the whole length
result = custom_types .+ Ref(Point(2, 2))

# So the above is equal to (minus all the allocations):
# this allocates a new array on the gpu, which we can avoid with the above broadcast
broadcasted = fill(CuArray, Point(2, 2), (3,))

result == custom_types .+ broadcasted
```

Some more operations supported by GPUArrays:
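A few representative examples (a sketch using plain CPU Arrays here, since the same generic calls dispatch to GPU implementations when given a CuArray or CLArray):

```julia
A = rand(Float32, 4, 4)
B = rand(Float32, 4, 4)

total = sum(A)     # reduction
mx = maximum(A)    # reduction with a different operator
C = A * B          # matrix multiplication (dispatches to a GPU BLAS for GPU arrays)
D = map(+, A, B)   # elementwise map
r = reduce(+, A)   # generic reduce
```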

GPUArrays in the real world

Let's jump right into some cool use cases.

This GPU accelerated smoke simulation (video: https://vimeo.com/235601956) was created with GPUArrays + CLArrays and runs on both GPU and CPU, with the GPU version being up to 15x faster:

There are many more use cases , including solving differential equations, FEM simulations, and solving PDEs .

Let's walk through a simple Machine Learning example, to see how GPUArrays can be used:

```julia
using Flux, Flux.Data.MNIST, Statistics
using Flux: onehotbatch, onecold, crossentropy, throttle
using Base.Iterators: repeated, partition
using CuArrays

# Classify MNIST digits with a convolutional network

imgs = MNIST.images()

labels = onehotbatch(MNIST.labels(), 0:9)

# Partition into batches of size 1,000
train = [(cat(float.(imgs[i])..., dims = 4), labels[:, i])
         for i in partition(1:60_000, 1000)]

use_gpu = true # helper to easily switch between gpu/cpu

todevice(x) = use_gpu ? gpu(x) : x

train = todevice.(train)

# Prepare test set (first 1,000 images)
tX = cat(float.(MNIST.images(:test)[1:1000])..., dims = 4) |> todevice
tY = onehotbatch(MNIST.labels(:test)[1:1000], 0:9) |> todevice

m = Chain(
  Conv((2,2), 1=>16, relu),
  x -> maxpool(x, (2,2)),
  Conv((2,2), 16=>8, relu),
  x -> maxpool(x, (2,2)),
  x -> reshape(x, :, size(x, 4)),
  Dense(288, 10), softmax) |> todevice

m(train[1][1])

loss(x, y) = crossentropy(m(x), y)

accuracy(x, y) = mean(onecold(m(x)) .== onecold(y))

evalcb = throttle(() -> @show(accuracy(tX, tY)), 10)
opt = ADAM(Flux.params(m));
```

```julia
# train
for i = 1:10
    Flux.train!(loss, train, opt, cb = evalcb)
end
```

```julia
using Colors, FileIO, ImageShow
N = 22
img = tX[:, :, 1:1, N:N]
println("Predicted: ", Flux.onecold(m(img)) .- 1)
Gray.(collect(tX[:, :, 1, N]))
```

Just by converting the arrays to GPUArrays (with gpu(array)) we were able to move the entire computation to the GPU and get a pretty nice speed improvement. This is thanks to Julia's sophisticated AbstractArray infrastructure, into which GPUArrays seamlessly integrates. Subsequently, if you leave out the conversion to a GPUArray, the code will also run with normal Julia arrays – but then of course on the CPU. You can try this out by changing use_gpu = true to use_gpu = false and rerunning the initialization and training cells. Comparing GPU and CPU, I get 975 seconds for a CPU run and 29 seconds for the GPU – a nice speed-up of ~33x.

Another nice property to look at is that GPUArrays never had to implement automatic differentiation explicitly to support the backward pass of the neural network efficiently. This is because Julia's automatic differentiation libraries work for arbitrary functions and emit code that can run efficiently on the GPU. This helps a lot to get Flux working on the GPU with minimal developer effort – and makes Flux GPU support work efficiently even for user-defined functions. That this works out of the box without coordination between GPUArrays + Flux is a pretty unique property of Julia, which is explained in great detail in: Why Numba and Cython are no substitute for Julia

Writing GPU Kernels

One can get pretty far by just using the generic abstract array interface of GPUArrays without ever writing any GPU kernels. However, at some point one might need to implement an algorithm that needs to run on the GPU and can't be expressed by a combination of generic array algorithms!

The nice thing is that GPUArrays takes away quite a bit of work with a layered approach that lets you start off with high level code, but allows you to pretty much write low-level kernels similarly to what one would find in most OpenCL/CUDA examples. It also allows you to execute kernels both on OpenCL or CUDA devices, abstracting away any differences in those frameworks.

The function that makes this possible is named gpu_call . It can be called as gpu_call(kernel, A::GPUArray, args) and will call kernel with the arguments (state, args...) on the GPU. State is a backend specific object to implement functionality like getting the thread index. A GPUArray needs to get passed as the second argument to dispatch to the correct backend and supply the defaults for the launch parameters.

Lets use gpu_call to implement a simple map kernel:

```julia
using GPUArrays, CuArrays
# Overloading the Julia Base map! function for GPUArrays
function Base.map!(f::Function, A::GPUArray, B::GPUArray)
    # our function that will run on the gpu
    function kernel(state, f, A, B)
        # If launch parameters aren't specified, linear_index gets the index
        # into the Array passed as second argument to gpu_call (`A`)
        i = linear_index(state)
        if i <= length(A)
            @inbounds A[i] = f(B[i])
        end
        return
    end
    # call kernel on the gpu
    gpu_call(kernel, A, (f, A, B))
end
```

Let's try to figure out what this is doing! In simple terms, this will call the julia function kernel length(A) times in parallel on the GPU. Each parallel invocation of kernel has a thread index, which we can use to safely index into the arrays A and B. If we calculated our own indices instead of using linear_index, we'd need to make sure that we don't have multiple threads reading and writing to the same array locations. So, if we wrote this in pure Julia with threads, an equivalent version would look like this:

```julia
using BenchmarkTools
function threadded_map!(f::Function, A::Array, B::Array)
    Threads.@threads for i in 1:length(A)
        A[i] = f(B[i])
    end
    A
end
x, y = rand(10^7), rand(10^7)
kernel(y) = (y / 33f0) * (732f0 / y)
# on the cpu without threads:
single_t = @belapsed map!($kernel, $x, $y)

# on the CPU with 4 threads (2 real cores):
thread_t = @belapsed threadded_map!($kernel, $x, $y)

# on the GPU:
xgpu, ygpu = cu(x), cu(y)
gpu_t = @belapsed begin
    map!($kernel, $xgpu, $ygpu)
    GPUArrays.synchronize($xgpu)
end
times = [single_t, thread_t, gpu_t]
speedup = maximum(times) ./ times
println("speedup: $speedup")
bar(["1 core", "2 cores", "gpu"], speedup, legend = false, fillcolor = :grey, ylabel = "speedup")
```

Because the function isn't doing a lot of work, we don't see perfect scaling, but the threaded and GPU version still give us a nice speed-up.

The GPU is a bit more complex than what the thread example allows us to show, since the hardware threads are laid out in blocks of threads – gpu_call abstracts that away in the simple version, but it can also be used with more complex launch configurations:

```julia
using CuArrays

threads = (2, 2)
blocks = (2, 2)
T = fill(CuArray, (0, 0), (4, 4))
B = fill(CuArray, (0, 0), (4, 4))
gpu_call(T, (B, T), (blocks, threads)) do state, A, B
    # those names pretty much refer to the cuda names
    b = (blockidx_x(state), blockidx_y(state))
    bdim = (blockdim_x(state), blockdim_y(state))
    t = (threadidx_x(state), threadidx_y(state))
    idx = (bdim .* (b .- 1)) .+ t
    A[idx...] = b
    B[idx...] = t
    return
end
println("Threads index: \n", T)
println("Block index: \n", B)
```

In the above example you can see the iteration order of a more complex launch configuration. Figuring out the right iteration + launch configuration is crucial to achieve state of the art GPU performance – but won't be part of this simple introduction. There are plenty of GPU tutorials for CUDA and OpenCL which explain this in great detail and those principles are identical when programming the GPU in Julia.

Conclusion

Julia has come a long way to bring composable high-level programming to the high performance world. Now it's time to do the same for the GPU.

The hope is that Julia lowers the bar for people to start programming on GPUs, and that we can grow an extendable platform for open source GPU computing. The first success story, of automatic differentiation working out of the box via Julia packages that haven't even been written for the GPU, gives a lot of reason to believe in the success of Julia's extendable and generic design in the domain of GPU computing.

"]]


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK