Video 1 - Introduction to CUDA C++

A 3-step processing flow:

  1. Copy data from CPU memory to GPU memory.
  2. Load the GPU program (the kernel) and execute it, caching data on chip for performance.
  3. Copy the results from GPU back to CPU.
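The three steps above can be sketched end to end. This is a minimal sketch, assuming the trivial single-value `add` kernel used in this video series (the `d_` prefix for device pointers is a common convention, not required by CUDA):

```cuda
#include <cstdio>

// Device code: adds two ints, runs on the GPU.
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

int main(void) {
    int a = 2, b = 7, c = 0;
    int *d_a, *d_b, *d_c;    // device (GPU) pointers

    // Allocate GPU memory.
    cudaMalloc(&d_a, sizeof(int));
    cudaMalloc(&d_b, sizeof(int));
    cudaMalloc(&d_c, sizeof(int));

    // 1. Copy inputs from CPU memory to GPU memory.
    cudaMemcpy(d_a, &a, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, sizeof(int), cudaMemcpyHostToDevice);

    // 2. Execute the kernel on the GPU.
    add<<<1, 1>>>(d_a, d_b, d_c);

    // 3. Copy the result from GPU back to CPU.
    cudaMemcpy(&c, d_c, sizeof(int), cudaMemcpyDeviceToHost);

    printf("c = %d\n", c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```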

nvcc separates the source code into host and device components:

Device functions, like kernels, are processed by the NVIDIA compiler.

Host functions (e.g. main()) are processed by the standard host compiler (e.g. gcc, cl.exe).

Uses the __global__ keyword to declare a function as device code: it runs on the device and is called from host code.
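A minimal sketch of the host/device split that nvcc sees (the kernel name mykernel is illustrative):

```cuda
#include <cstdio>

// Device code: __global__ marks this for the NVIDIA compiler.
__global__ void mykernel(void) {
}

// Host code: handed off to the standard host compiler.
int main(void) {
    mykernel<<<1, 1>>>();     // launch the (empty) kernel on the GPU
    printf("Hello World!\n");
    return 0;
}
```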

The syntax add<<< 1, 1 >>>(); is the kernel launch syntax: a call that runs the function add on the GPU.

add<<< N, 1 >>>(); means we’re launching N blocks. The second parameter, 1, is the number of threads per block.

The combination of threads and blocks collectively is called a grid.

Uses “built-in” variables such as blockIdx, which identifies the block a thread belongs to within the grid.
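With N blocks, each block can handle one element of a vector addition, using blockIdx.x as the index. A runnable sketch (N = 8 is an illustrative size):

```cuda
#include <cstdio>
#define N 8   // illustrative vector size

__global__ void add(int *a, int *b, int *c) {
    // blockIdx.x identifies this block within the grid.
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int a[N], b[N], c[N];
    int *d_a, *d_b, *d_c;
    size_t size = N * sizeof(int);

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaMalloc(&d_a, size); cudaMalloc(&d_b, size); cudaMalloc(&d_c, size);
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    add<<<N, 1>>>(d_a, d_b, d_c);   // N blocks, 1 thread per block

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d ", c[i]);
    printf("\n");

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
```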

CUDA guarantees that the sizes of basic types (e.g. int, float) match between host and device code.

CUDA aims to “harmonize” with the host code.

<<< BLOCKS, THREADS >>>
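When both parameters are greater than 1, each thread derives a unique global index by combining the built-in variables. A sketch of the conventional pattern (THREADS_PER_BLOCK is an assumed constant):

```cuda
// With multiple blocks AND multiple threads per block,
// combine blockIdx, blockDim, and threadIdx into one global index.
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

// Launched from the host as, e.g.:
//   add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
```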

Video 2 - Shared Memory