A 3-step processing flow: nvcc separates the source code into host and device components. Device functions such as kernels are compiled by the NVIDIA compiler; host functions are compiled by the standard host compiler.
Uses the __global__ keyword to declare a function as device code.
The syntax add<<< 1, 1 >>>(); is a kernel launch: a call to run the function add on the GPU.
add<<< N, 1 >>>(); means we’re launching N blocks. The second parameter, 1, is the number of threads per block.
The combination of blocks and threads is collectively called a grid.
Uses “built-in” variables such as blockIdx.
CUDA guarantees that the sizes of primitive types match between host and device.
CUDA aims to “harmonize” with the host code.
<<< BLOCKS, THREADS >>>
Blocks can be indexed with blockIdx.x, blockIdx.y, and blockIdx.z.
Threads can be indexed with threadIdx.x, threadIdx.y, and threadIdx.z.
With M threads per block, a unique index for each thread (worker) is given by:
int index = threadIdx.x + blockIdx.x * M;
The built-in variable blockDim.x gives the number of threads per block, so the more canonical way is:
int index = threadIdx.x + blockIdx.x * blockDim.x;
Sometimes the length of the data and the number of threads are not evenly divisible, which means we might spin up more threads than there are data elements. It’s common to add a bounds check to avoid accessing invalid memory:
__global__ void add(int* a, int* b, int* c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}
Threads within a block have some extra capabilities: they can communicate and synchronize with each other, e.g. via shared memory and __syncthreads(). Threads in different blocks do not have that capability.