GPUs
Introduction
Continuing with the other posts1 about how computers work at an ELI24 level, this post aims to explain GPU devices: how their architecture differs from CPUs (and why), as well as some rudimentary lessons in writing software for these devices.
I'll begin with an architectural comparison against the maybe-more-familiar CPU:
First of all, we can see that a GPU consists of a larger number of cores and a shallower cache hierarchy. It's worth noting that CPU cores, in general, are much more capable than those present in GPU devices, and the presence of additional caches vastly reduces memory access latency. The GPU paradigm of parallelization is evidenced at the hardware design level by the presence of singular control units (purple) responsible for entire thread groups of cores, implying that each core in a thread group must execute the same instruction.
The differences in architectures serve different purposes. CPUs are designed for low-latency, low-throughput serial operations, whereas GPUs are optimized for high-latency, high-throughput parallel operations.
Sample Benchmarks
We can gain some intuition about the ramifications of these hardware differences via benchmarks on modern devices2,3:
| | CPU | GPU |
|---|---|---|
| cores | 8 | 10,752 |
| L3 cache | 32 MB | N/A |
| L2 cache | 512 KB/core | 6 MB (shared) |
| L1 cache | 64 KB/core | 128 KB/SM† or 1 KB/core |
† per Streaming Multiprocessor (a row of connected cores)
Parallel Operations
An example class of a parallel operation for which GPUs outperform CPUs is vector arithmetic:
An implementation of vector addition designed for execution on a CPU might look something like:
void add(int n, float* x, float* y, float* z) {
for (int i = 0; i < n; i++)
z[i] = x[i] + y[i];
}
this code iterates over every single element in each input vector x and y, which will be O(n). However, since each element in z is independent of every other element in z, we can see that this approach can be improved by parallelizing the summation and computing each entry simultaneously. The GPU code which achieves this (which we call a kernel) looks like:
__global__ void add(int n, float* x, float* y, float* z) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
z[i] = x[i] + y[i];
}
Here, we use the __global__ keyword, which indicates to the compiler that this function should be compiled for execution on the GPU. The next bit of GPU-spice is the index variable i, which is computed from "thread local" variables that are contextually available in each instance of this function running in parallel. The way CUDA achieves this is with the concept of thread blocks.4 When we launch our GPU kernel, we must specify the number of blocks to launch and how many threads each block contains – this pattern informs how we index into all of the data structures to be accessed by the GPU. For example, if we launch three blocks each containing two threads, the resulting thread organization will resemble:
Each thread executes the same code and is now able to compute the appropriate vector index via its own block- and thread-index.
The last notable difference between the two implementations is the existence of a bounds check on the computed index i. This is necessary in the event that the dimension of our input vectors does not align with our block dimension, e.g. for an odd value of n with a block size of two, our block alignment would be:
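Since the total thread count is the block count times the block size, the usual way to guarantee at least one thread per element is integer "ceiling division" when computing the block count. A minimal sketch, assuming x_d, y_d, and z_d are device copies of the input and output vectors (allocation is covered just below):
int N = 7;          // an odd-sized input
int BLOCK_SIZE = 2;
// round up so that numBlocks * BLOCK_SIZE >= N
int numBlocks = (N + BLOCK_SIZE - 1) / BLOCK_SIZE; // 4 blocks -> 8 threads
add<<<numBlocks, BLOCK_SIZE>>>(N, x_d, y_d, z_d);  // thread 7 fails the i < n check and does nothing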
With a GPU kernel in hand, we still need to actually delegate execution from the CPU. This is a multistep process following the procedure:
- Allocate memory on the GPU
- Copy the data to the GPU (since it doesn't have direct access itself)
- Run the kernel
- Copy the results back to the CPU
- Free the memory
1. Allocate memory on the GPU
// we use the _d suffix to denote that this will be sent to the device (GPU)
float* a_d;
float* b_d;
float* c_d;
int N = 4096;
cudaMalloc((void**) &a_d, N * sizeof(float));
cudaMalloc((void**) &b_d, N * sizeof(float));
cudaMalloc((void**) &c_d, N * sizeof(float));
2. Copy the data to the GPU
cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b, N * sizeof(float), cudaMemcpyHostToDevice);
this function is nearly identical to its C standard library counterpart, with an added argument denoting the direction of the copy. Read more about it here.5
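A practical aside: CUDA API calls like cudaMalloc and cudaMemcpy report failures through their cudaError_t return value rather than by aborting, so it's common to wrap them in a small checking macro. A minimal sketch (the CHECK name is just illustrative):
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// abort with a readable message if a CUDA API call fails
#define CHECK(call)                                          \
  do {                                                       \
    cudaError_t err = (call);                                \
    if (err != cudaSuccess) {                                \
      fprintf(stderr, "CUDA error '%s' at %s:%d\n",          \
              cudaGetErrorString(err), __FILE__, __LINE__);  \
      exit(EXIT_FAILURE);                                    \
    }                                                        \
  } while (0)

// usage:
// CHECK(cudaMemcpy(a_d, a, N * sizeof(float), cudaMemcpyHostToDevice));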
3. Run the kernel
to run the kernel, we specify how many blocks to launch and how many threads each block contains. CUDA exposes these configurables with some funky syntax,6 but it's basically just division.
As an example, a function declared as
__global__ void Func(float* parameter);
must be called like this: Func<<< Dg, Db, Ns >>>(parameter);
where Dg, Db, and Ns are the dimensions of the grid, the dimensions of each block within the grid, and the number of bytes of shared memory allocated per block (which defaults to 0).
int BLOCK_SIZE = 1024; // the maximum number of threads per block on current NVIDIA hardware
add<<<ceil(N / (float) BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d);
4. Copy the results back to the CPU
Same as step 2, we just reverse the direction:
cudaMemcpy(c, c_d, N * sizeof(float), cudaMemcpyDeviceToHost);
5. Free the memory
Similar to standard memory management APIs, CUDA provides a method to release those device allocations: cudaFree.
cudaFree(a_d);
cudaFree(b_d);
cudaFree(c_d);
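Putting the five steps together, a complete host-side driver for the add kernel defined above might look roughly like this sketch (error checking omitted; the host arrays and their contents are arbitrary):
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    int N = 4096;
    size_t bytes = N * sizeof(float);

    // host-side buffers
    float* a = (float*) malloc(bytes);
    float* b = (float*) malloc(bytes);
    float* c = (float*) malloc(bytes);
    for (int i = 0; i < N; i++) { a[i] = 1.f; b[i] = 2.f; }

    // 1. allocate memory on the GPU
    float *a_d, *b_d, *c_d;
    cudaMalloc((void**) &a_d, bytes);
    cudaMalloc((void**) &b_d, bytes);
    cudaMalloc((void**) &c_d, bytes);

    // 2. copy the inputs to the GPU
    cudaMemcpy(a_d, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b, bytes, cudaMemcpyHostToDevice);

    // 3. run the kernel
    int BLOCK_SIZE = 256;
    add<<<(N + BLOCK_SIZE - 1) / BLOCK_SIZE, BLOCK_SIZE>>>(N, a_d, b_d, c_d);

    // 4. copy the results back to the CPU
    cudaMemcpy(c, c_d, bytes, cudaMemcpyDeviceToHost);

    // 5. free the memory
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a); free(b); free(c);
    return 0;
}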
Comparison
We can observe how the two implementations compare on each respective processing unit, noting that the GPU pays a penalty up front for its fewer, slower caches, but remains leveled off for far longer than the CPU, and also that for sufficiently large n, CPU-based computation incurs dramatically growing time penalties.
Comparing the ratio of execution times, this tradeoff becomes more evident:
Here we can see that even our naive GPU kernel is nearly 200x faster for larger inputs. We can drive this number even higher by reducing memory accesses per operation.
Kernel Grids
Returning to a fixed instance of our vector addition kernel operating on vectors of size 6:
The code to delegate execution of our kernel to the GPU would resemble:
int N = 6;
int BLOCK_SIZE = 2;
add<<<ceil(N / (float) BLOCK_SIZE), BLOCK_SIZE>>>(N, a_d, b_d, c_d);
and our resulting kernel grid would be:
Note that the blockDim along the x dimension is constant for all blocks, so the only values which change within our kernel code from earlier are the block index (valued 0, 1, or 2) and the block's thread index (valued 0 or 1). We can then observe that our computed index is a bijection onto the indices of our input vectors:
__global__ void add(int n, float* x, float* y, float* z) {
int i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n)
z[i] = x[i] + y[i];
}
e.g. the computed index of the first thread in the second block would correspond to
int i = 1 * 2 + 0;
yielding i = 2, i.e. the third element of each input vector.
Our example kernel currently only references the .x
dimension of our thread- and block-index variables, but CUDA allows us to configure multidimensional kernel grids. In some scenarios, we might prefer a two or even three dimensional kernel grid:
For n-dimensional kernel grids, we specify the kernel launch parameters via dim3
types like so:
dim3 dimGrid(3, 3, 3);
dim3 dimBlock(3, 3, 3);
add<<<dimGrid, dimBlock>>>(N, a_d, b_d, c_d);
This abstraction is mainly syntactic sugar for index readability. Since arrays of arbitrary dimension are still stored as flattened, contiguous chunks of memory, we still need to make use of the dimension variables in order to compute our row/column/... indices. Indexing into a two-dimensional array via dimension variables is given by the formula:
$\text{index} = \text{row} \times \text{width} + \text{col}$
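As a quick sanity check of that formula, here's a tiny (hypothetical) example of flattening a 3x4 matrix stored in row-major order:
int width = 4;               // number of columns
float matrix[3 * 4];         // a 3x4 matrix stored as one contiguous array
int row = 2, col = 1;
int idx = row * width + col; // = 9, the flattened position of element (2, 1)
matrix[idx] = 42.f;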
A naive (oh, so terribly, terribly naive7) matrix multiplication kernel on a 2-dimensional kernel grid might then resemble:
__global__ void matmul(int n, float* a, float* b, float* c) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < n && col < n) {
float dot_prod = 0.f;
for (int i =0; i < n; i++) {
dot_prod += a[row*n + i] * b[i*n + col];
}
c[row * n + col] = dot_prod;
}
}
We could similarly achieve the same result with a 1-dimensional kernel grid like so:
__global__ void matmul(int n, float* a, float* b, float* c) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int row = idx / n;
int col = idx % n;
if (row < n && col < n) {
float dot_prod = 0.f;
for (int i =0; i < n; i++) {
dot_prod += a[row*n + i] * b[i*n + col];
}
c[row * n + col] = dot_prod;
}
}
This adds a negligible amount of arithmetic overhead, as well as a bit more cognitive overhead, hence the preference for leveraging the kernel grid syntactic sugar wherever possible.
This formula generalizes to higher dimensions as well, e.g. for a three-dimensional array indexed by (dep, row, col): $\text{index} = (\text{dep} \times \text{height} + \text{row}) \times \text{width} + \text{col}$
Practical GPU usage: writing kernels for neural nets
Feed Forward
The feed forward step of an individual neural network neuron boils down to computing the dot product of an input vector $\mathbf{x}$ and the neuron's weights $\mathbf{w}$, plus some scalar bias $b$ associated with the neuron:
$z = \mathbf{x} \cdot \mathbf{w} + b$
We can generalize the computations for all neurons (and a whole batch of inputs) by treating both the inputs and the weights as matrices, and the bias values as a vector:
$Z = XW + \mathbf{b}$
The kernel for this computation is similar to the matrix multiplication kernel, except our matrices are no longer uniform/square, so we need to take additional params to describe their dimensions:
__global__ void forward(
int batch_sz, int n, int out_w,
float* inputs, float* weights, float* biases, float* outputs
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < batch_sz && col < out_w) {
// initialize the outputs to be the corresponding biases
outputs[row * out_w + col] = biases[col];
for (int i =0; i < n; i++) {
outputs[row * out_w + col] +=
weights[i * out_w + col] * inputs[row * n + i];
}
}
}
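For reference, launching this kernel requires a 2-D grid covering the batch dimension in y and the output width in x. A hedged sketch of the host-side call, assuming the device pointers and sizes are already set up (the 16x16 block shape is just a common choice):
dim3 dimBlock(16, 16);
// round the grid up so every output element gets a thread
dim3 dimGrid((out_w + dimBlock.x - 1) / dimBlock.x,
             (batch_sz + dimBlock.y - 1) / dimBlock.y);
forward<<<dimGrid, dimBlock>>>(batch_sz, n, out_w,
                               inputs_d, weights_d, biases_d, outputs_d);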
Activation
In order to prevent the network from collapsing to a single linear layer, we introduce a non-linearity such as ReLU:8
$\text{ReLU}(x) = \max(0, x)$
The kernel is pretty straightforward:
__global__ void relu(int w, int h, float* inputs, float* outputs) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
float activation = inputs[row * w + col];
outputs[row * w + col] = activation > 0.f ? activation : 0.f;
}
}
we apply this activation to every layer in the network barring the final output representing e.g. the probability that a given input corresponds to a given output.
Clamping
We need to translate a ReLU'd activation into a probability, such as with the softmax function:
$\text{softmax}(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$
The caveat of using an exponential translation like softmax is that, for many inputs, we run the risk of numerical overflow, since we're summing over many potentially-huge numbers. The workaround is to subtract the largest element of the input vector we're applying the softmax to from the exponent of each element within:
$\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_j e^{x_j - \max(x)}}$
The corresponding kernel (hopefully starting to look like a familiar pattern) is:
__global__ void softmax(int w, int h, float* inputs, float* outputs) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
// find the maximum
float max_val = inputs[row * w];
for (int i = 1; i < w; i++) {
max_val = max(max_val, inputs[row * w + i]);
}
float divisor = 1e-9f; // prevent div by zero
for (int i = 0; i < w; i++) {
divisor += exp(inputs[row * w + i] - max_val);
}
outputs[row * w + col] =
exp(inputs[row * w + col] - max_val) / divisor;
}
}
Loss
The final function we might be interested in for a feed forward pass would be a loss function to measure the network's performance. A cross-entropy loss9 function takes two vectors and iterates over all possible output values, comparing the predicted probability $\hat{y}_i$ of each to the actual output $y_i$, assuming the outputs of the model are probabilities:
$L = -\sum_i y_i \log(\hat{y}_i)$
The lower our predicted probability is for a target score, the higher the loss:
Our loss kernel can be unidimensional since our output is a single dimension:
__global__ void cross_entropy_loss(
int w, int h, float* preds, float* reals, float* outputs
) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < h) {
float loss = 0.f;
for (int i = 0; i < w; i++) {
loss -= reals[idx * w + i] * log(fmaxf(1e-6f, preds[idx * w + i]));
}
outputs[idx] = loss;
}
}
Backpropagation
Backpropagation is the process of making incremental updates to the initially-random weights and biases of a neural network, based on the partial derivative of the loss with respect to the weight or bias of each neuron:
We can compute the partials at each layer by leveraging the chain rule: starting with the derivative of the loss with respect to the activations connecting the last layer to the output, and working our way backwards, multiplying each interior partial by the partials computed for the layer in front of it:
For our softmax activation and cross-entropy loss function L, we can compute the partials with respect to the inputs:
Since each element of the softmax depends on every other element in the input, the derivative will be an n×n (Jacobian) matrix. To simplify all the constituent partials, we make use of properties of logarithms to rearrange the inner expressions like so:
This lets us express the derivative of a softmax element as that element times the partial derivative of the logarithm of that element. This is useful since it also lets us remove the division operator as well as one of the exponentiations:
Now, taking the partial derivative:
The first term simplifies to 1 if the element index matches the derivative index, otherwise it disappears:
As for the more complicated second term, we can again apply the chain rule to simplify the computation for ourselves:
The term inside the summation depends on $x_j$ for only a single element; the rest are constants whose derivatives become zero, so we can further simplify:
So the complete equation for the partial derivative of the softmax becomes:
$\frac{\partial\, \text{softmax}(x_i)}{\partial x_j} = \text{softmax}(x_i)\left(\delta_{ij} - \text{softmax}(x_j)\right)$
and we can plug this back into our matrix of softmax back propagation updates:
Now for the derivative of our loss function w.r.t. the activations, which is a bit simpler:
$\frac{\partial L}{\partial\, \text{softmax}(x_i)} = -\frac{y_i}{\text{softmax}(x_i)}$
Multiplying the two collections, we attain the derivative of our loss w.r.t. the inputs:
and since $y$ is a probability distribution, the sum over all of its elements is 1, so the resultant expression simplifies to $\frac{\partial L}{\partial x_j} = \text{softmax}(x_j) - y_j$,
so our loss w.r.t. inputs is just the vector difference between our softmax'd outputs and the true distribution. Accordingly, the kernel to compute this is much friendlier than all the algebra above:
__global__ void cross_entropy_backwards(
int w, int h, float* preds, float* reals, float* outputs
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w)
outputs[row * w + col] = preds[row * w + col] - reals[row * w + col];
}
With the derivative of our loss w.r.t. the inputs from the output layers, we can compute each preceding derivative as well via the chain rule by multiplying the expression by the partial derivative of the inputs with respect to the activations of the preceding layer:
Recalling that our equation for the inputs to the previous layers is the matrix multiplication of the activations and the weights plus the biases:
Now here I've made an egregious mess of notation – the superscripts are indices, not exponents. Taking the derivative of this expression yields just the weights of that layer.
The corresponding kernel then just needs to perform matrix multiplication between the weights and the error from the next layer:
__global__ void backward(
int batch_size, int n, int out_w,
float* weights, float* biases, float* d_loss, float* out_d_loss
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < batch_size && col < out_w) {
float dloss = 0.f;
for (int i = 0; i < n; i++) {
float w = weights[i * out_w + col];
dloss += w * d_loss[row * n + i];
}
out_d_loss[row * out_w + col] = dloss;
}
}
The derivative of our chosen non-linearity is straightforward: $\text{ReLU}'(x) = 1$ for $x > 0$ and $0$ otherwise.
To apply backpropagation, we again apply the chain rule: multiplying by the partial derivative of our loss function w.r.t. the previous activation:
in kernel form:
__global__ void relu_backwards(
int w, int h, int ns, float* a, float* d_loss, float* b
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
float activation = a[row * w + col];
b[row * w + col] = activation > 0.f ? d_loss[row * w + col] : 0.f;
}
}
Loss of weights and biases
To compute the error of our weights and biases, we take the partial derivative with respect to the desired quantity:
To update these values, we adjust the existing weight or bias by its derivative times some learning rate $\eta$, divided by our batch size:
$w \leftarrow w - \frac{\eta}{\text{batch size}}\frac{\partial L}{\partial w}, \qquad b \leftarrow b - \frac{\eta}{\text{batch size}}\frac{\partial L}{\partial b}$
and the last kernel to perform this gradient descent:
__global__ void update_layer(
int w, int h, int batch_size, float η,
float* weights, float* biases, float* activations, float* d_loss
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
float d_w = 0.f;
float d_b = 0.f;
for (int i = 0; i < batch_size; i++) {
float activation = activations[i * h + row];
float d_l = d_loss[i * w + col];
d_w += activation * d_l;
d_b += d_l;
}
weights[row * w + col] -= η * d_w / batch_size;
biases[col] -= η * d_b / batch_size;
}
}
Measuring GPU Kernel Performance
When we run code designed for the GPU, there are three primary devices that we're concerned with:
- the GPU itself
- the GPU's high bandwidth memory (DRAM)
- the CPU responsible for scheduling kernels and copying data to/from the HBM
Each of these devices has a corresponding penalty, referred to as compute time, memory access latency, and overhead, respectively.
Continuing with the arbitrary example of an MNIST handwritten digit classifier, suppose we profiled our application and found that it takes 2s to load the data and 9.5s to train our neural net for 10 epochs:
We might be able to bring down the CPU overhead by using more sophisticated methods of optimized string parsing to ingest the MNIST csv data,10 bringing its overall time down to 1s, but we're still left with ~1s epochs:
The next step might be to observe that we don't need all of the data during the first epoch, just the data required for the first batch, so we can run these two tasks nearly simultaneously:
However, notice that this does not eliminate the overhead time, so even if we were (hypothetically *cough* *cough*) able to optimize our GPU-based training computations, our first epoch would still be bottlenecked by the time required to load the first batch of training data:
CUDA API
A brief note about the CUDA API helps clarify precisely how we're able to parallelize these first two tasks. By design, CUDA is not a synchronous API: when we launch a kernel with the <<< ... >>>(params) syntax, the CPU does not await the results of that kernel. Calling additional kernels adds them to a queue of work to be executed on the GPU, which the CUDA driver manages for us. Making a synchronous/CPU-bound call (such as cudaMemcpy) after kernels have been scheduled does wait for the scheduled kernels to complete before executing.
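To make the asynchrony concrete, here's a minimal sketch using the earlier vector-addition setup (numBlocks and BLOCK_SIZE as in the earlier launch): the launch returns almost immediately, and the host only blocks at an explicit synchronization point or a blocking call like cudaMemcpy:
add<<<numBlocks, BLOCK_SIZE>>>(N, a_d, b_d, c_d); // returns immediately; the kernel is queued
// ... the CPU is free to do other work here, e.g. load the next batch of data ...
cudaDeviceSynchronize();                          // blocks until all queued kernels have finished
// a cudaMemcpy(..., cudaMemcpyDeviceToHost) here would act as a similar sync point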
E.g. using the neural net methods implemented above, our training loop might look something like this:
load_mnist();
for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
softmax<<<dimGrid, dimBlock>>>(...);
cross_entropy_loss<<<dimGrid, dimBlock>>>(...);
cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
compute_accuracy(...);
cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
}
}
So, in order to parallelize the training and loading tasks, we can first schedule all of our kernels up front:
load_mnist();
for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
softmax<<<dimGrid, dimBlock>>>(...);
cross_entropy_loss<<<dimGrid, dimBlock>>>(...);
cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
compute_accuracy(...);
}
}
Then, instead of loading the full data set, we can instead define a batch-loader to be called once before scheduling our kernels, and then call it again before our synchronization point:
load_next_batch();
for (int epoch = 0; epoch < N_EPOCHS; epoch++) {
for (int batch = 0; batch < train_length / BATCH_SIZE; batch++) {
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
relu<<<dimGrid, dimBlock>>>(...);
forward<<<dimGrid, dimBlock>>>(...);
softmax<<<dimGrid, dimBlock>>>(...);
cross_entropy_loss<<<dimGrid, dimBlock>>>(...);
cross_entropy_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
backward<<<dimGrid, dimBlock>>>(...);
relu_backwards<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
update_layer<<<dimGrid, dimBlock>>>(...);
if (epoch == 0) load_next_batch();
cudaMemcpy(out_h, out_d, out_size, cudaMemcpyDeviceToHost);
compute_accuracy(...);
}
}
GPU perf
When evaluating and improving code targeting the GPU, we're interested in the two other metrics besides CPU overhead. Our GPU compute time is bound by the throughput of the datatype being used on the device, typically measured in floating point operations per second (FLOPS). The 3090 Ti3 we've been using as a benchmark has a theoretical limit of 40 TFLOPS on 32-bit floating point numbers, and a memory bandwidth of 1.01 TB/s. We can therefore perform computations much faster than we can actually load data from memory, so we might introduce a third meta-metric of computational intensity to measure how many FLOPs we can execute per byte loaded from memory.
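As a rough worked example using the 3090 Ti numbers above, the computational intensity at which the card stops being memory-bound is approximately:
$$\frac{40\ \text{TFLOPS}}{1.01\ \text{TB/s}} \approx 40\ \text{FLOPs per byte loaded}$$
so a kernel performing fewer than roughly 40 floating point operations per byte of memory traffic on this card will be limited by memory bandwidth rather than by compute.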
To find the optimal performance ratio, we can plot the theoretically attainable FLOPS against computational intensity (a roofline plot) to find the desired intersection:
If we're on the right side, we're compute-bound: the GPU is working at full power. This is good. Here, we gain performance by using datatypes with higher throughput, optimizing our algorithms, or using more expensive/sophisticated hardware such as tensor cores for a higher throughput limit.
The left side is bad: the GPU is underworked, waiting on data from memory. To move from left to right, we can use different kinds of memory, or do some kernel fusion.
Kernel fusion
Consider one of the numerous bottlenecks in our relatively naive CUDA implementation of a neural network:
__global__ void relu(int w, int h, float* inputs, float* outputs) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
float activation = inputs[row * w + col];
outputs[row * w + col] = activation > 0.f ? activation : 0.f;
}
}
This function has two memory accesses (reading a float from inputs and writing a float to outputs), both of size 4 bytes. We're also only doing two operations: the comparison and assignment of activation. This gives us a computational intensity of 2 ops / 8 bytes = 0.25 (not very intense imo).
Since our relu
function is exclusively used after invoking the forward
function, we can fuse these two functions together, completely eliminating the need for the memory accesses and operations added by relu
:
__global__ void forward_relu(
int batch_sz, int n, int out_w,
float* inputs, float* weights, float* biases, float* outputs
) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < batch_sz && col < out_w) {
float out = biases[col];
for (int i =0; i < n; i++) {
out += weights[i * out_w + col] * inputs[row * n + i];
}
outputs[row * out_w + col] = out > 0.f ? out : 0.f;
}
}
We can similarly fuse the backwards pass. This achieves the "hypothetical" improvements alluded to earlier, reducing our total execution time by a factor of 2. At the moment, the rinky-dink example is hardly utilizing the GPU. Since the data in our network is very small, the kernels aren't even filling all possible threads. If we increase the batch_size
from 16 to 64, we can achieve an even higher rate of parallelization, bringing our per-epoch training time down from ~half a second to a tenth of a second, yielding a 5x improvement over the initial implementation.
GPU Memory Hierarchy
With respect to programming for a GPU with CUDA, there are two kinds of memory that we're concerned with: on- and off-chip. There are some segments of memory connected to the actual GPU chip, as well as smaller amounts of memory within the chip itself. Each thread within a block is able to access the global VRAM which resides off-chip:
Memory allocated via e.g. cudaMalloc(...)
or statically initialized globally (with the __device__
keyword) lives in global memory, which is the largest and slowest memory pool the GPU can access. Each thread also has access to a local set of on-chip registers; variables created within our kernels are stored in these registers. Of course, if we use too many variables within our kernels (more than ~255 registers per thread on modern consumer hardware), we exceed the register file's capacity, which causes values to be spilled to "local memory" – which is actually off-chip ("local" only indicates that the variables are local to that thread).
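For illustration, here's a sketch of the kind of kernel that can trigger spilling – a large per-thread array (especially one indexed with values not known at compile time) won't fit in registers and typically ends up in off-chip local memory:
__global__ void spill_example(int n, float* out) {
    // 1 KB of per-thread storage: far more than fits in registers,
    // so the compiler places it in (off-chip) local memory
    float scratch[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int j = 0; j < 256; j++)
        scratch[j] = (float) (i * j);
    if (i < n)
        out[i] = scratch[i % 256];  // dynamic indexing also typically prevents promotion to registers
}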
We also have a small (on the order of 64 KB) section of VRAM called "constant memory" which is off-chip, cached, and read-only. Accesses to different addresses within this special segment are serialized, but when multiple threads access the same address, we can achieve faster reads than if we made identical accesses to global memory.
To use constant memory, we use the __constant__
keyword when declaring a variable, and then use the cudaMemcpyToSymbol
function to move data from the CPU to constant memory.
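A minimal sketch of that pattern (the coefficient table and its size are purely illustrative):
#define N_COEFFS 64

// read-only lookup table placed in constant memory
__constant__ float coeffs[N_COEFFS];

void upload_coeffs(const float* host_coeffs) {
    // copy from host memory to the constant-memory symbol
    cudaMemcpyToSymbol(coeffs, host_coeffs, N_COEFFS * sizeof(float));
}

__global__ void scale(int n, float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // every thread in the block reads the same address, so the value is broadcast
    float k = coeffs[blockIdx.x % N_COEFFS];
    if (i < n)
        data[i] *= k;
}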
Finally, we have shared memory which is block-local memory:
Shared memory is on-chip, and therefore preferable to local, constant, and global memory, especially when the data being stored will be used by multiple threads. We specify __shared__
when declaring kernel variables in order to store them in this region.
kind | on-/off-chip | access | scope | lifetime |
---|---|---|---|---|
registers | on | r/w | thread | thread |
shared | on | r/w | block | block |
local | off | r/w | thread | thread |
global | off | r/w | global | dynamic |
constant | off | r | global | dynamic |
Optimizing w.r.t. memory layout
With deeper knowledge of the memory hierarchy on our GPU, we can return to one of our previous kernels to see how we might take advantage of the layout in order to milk out some performance gains.
Recall the naive matrix multiplication kernel from earlier:
__global__ void matmul(int n, float* a, float* b, float* c) {
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
if (row < n && col < n) {
float dot_prod = 0.f;
for (int i =0; i < n; i++) {
dot_prod += a[row*n + i] * b[i*n + col];
}
c[row * n + col] = dot_prod;
}
}
For two square n×n matrices, this code will access each row of a n times and each column of b n times, storing the running dot product between corresponding elements in the thread's registers.
This is bad! To reduce the number of reads from global memory, we can utilize shared memory by splitting our input matrices into block-sized tiles:
Each thread in the block now reads the corresponding values from each tile into shared memory, and proceeds to compute the partial dot product for that tile. We then shift our tiles along a and b to finish the partial dot products for those rows and columns:
We repeat this process of tiling till we've covered the whole matrix. This vastly reduces the number of reads from global memory and improves our kernel's performance. The code is a bit psychopathic, but it looks like this:
__shared__ float a_tile[TILE_WIDTH][TILE_WIDTH];
__shared__ float b_tile[TILE_WIDTH][TILE_WIDTH];
int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
int tx = threadIdx.x;
int ty = threadIdx.y;
float dot_prod = 0.f;
for (int tile_offset = 0; tile_offset < n; tile_offset += TILE_WIDTH) {
int a_chk = tile_offset + tx < n && row < n;
a_tile[ty][tx] = a_chk ? a[row * n + tile_offset + tx] : 0.f;
int b_chk = tile_offset + ty < n && col < n;
b_tile[ty][tx] = b_chk ? b[(tile_offset + ty) * n + col] : 0.f;
// wait for all threads to complete the above boundary checks for tiles
__syncthreads();
for (int i = 0; i < TILE_WIDTH; i++) {
dot_prod += a_tile[ty][i] * b_tile[i][tx];
}
// wait for all threads to compute their partial `dot_prod`
__syncthreads();
}
if (row < n && col < n) {
c[row * n + col] = dot_prod;
}
This performs dramatically better than the naive implementation for large n:
The naive GPU matrix multiplication is about 20% slower than the tiled approach, and the CPU implementation is about 15,000x slower even for the smaller-order matrices.
Modern GPU Architecture
In this section, we're going to elaborate on the simplified architecture diagram presented earlier:
Omitted from the diagrams are the nonsense ECE voltage controllers, power circuits, etc. about which I know next to nothing. The rest of the guesswork is supplemented by NVIDIA's white paper on their AD102 chip11 (used in the 4090). Buckle up.
PCB
Beginning at the outermost layer, we have our DRAM and the GPU's chip itself:
The chip contains the L2 cache shared between all cores. There are also 12 memory controllers which handle data transfers between layers of memory. Surrounding the cache are the Graphics Processing Clusters.
Graphics and Texture Processing Clusters
Within each GPC are six Texture Processing Clusters, as well as some additional components for rasterization: namely the raster engine, which generates pixel information from triangles, and sixteen render output units divided into two Raster Operation Partitions.
Streaming Multiprocessors
Each SM contains a dedicated ray tracing core for those operations, four texture units which perform operations on textures, and 128 KB of SRAM which is partitioned into an L1 cache and shared memory (since these are colocated, the more data we cram into shared memory, the smaller our available L1 cache). There's an additional 8 KB constant cache, and maybe even an L1.5 constant cache (disputed).12 More on the constant cache in the next section. Each SM also contains four processing blocks.
Processing Blocks
Inside the processing blocks, we can identify the smallest components which actually execute our kernels. Each block contains 16 cores designed for 32-bit floating point operations, another 16 cores which can additionally execute 32-bit integer operations, and a specialized tensor core for matrix multiplication and other accumulation operations.
Asterisk on the FP64 cores, since it's not 100% clear how many there are or how they get distributed. I saw conflicting numbers, but the aforementioned NVIDIA A100 Tensor Core GPU Architecture white paper says 32 per Streaming Multiprocessor – I assume they also live on the processing block dies, though evenly distributing them (8 per block) seems prone to misaligned kernel blocks, so idk ¯\(ツ)/¯ – maybe only some PBs have FP64 cores.
Warps
Ignoring the nebulous FP64 cores, assume we have 32 cores per block. We can now introduce the notion of a warp. When we actually launch our kernel grid, all the blocks inside are further divided into groups of 32 threads that our processing blocks can execute; these thread collections are called warps. Accordingly, it's best practice to ensure that our block sizes are always multiples of 32 so as to accommodate our hardware (see the sketch below). For example, Warp 2 would be underutilizing all the cores available on the processing block:
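A small sketch of that multiple-of-32 rule of thumb: if the problem suggests an awkward block size, round it up to a whole number of warps and let the usual bounds check mask off the extra threads:
int desired = 48;                             // block size suggested by the problem
int block_size = ((desired + 31) / 32) * 32;  // round up to a whole number of 32-thread warps (64)
int num_blocks = (N + block_size - 1) / block_size;
add<<<num_blocks, block_size>>>(N, a_d, b_d, c_d);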
The Processing Block contains two control components. The Warp Scheduler is responsible for scheduling and context switching between warps while waiting on e.g. data from global memory, and the Dispatch Unit passes the relevant instructions from our kernel to the cores that execute them. The L0 cache is reserved for caching those instructions, and we also have a 64 KB register file for the data needed by a warp.
We also have four load/store units responsible for retrieving data from higher up in the memory hierarchy, as well as four Special Function Units which are like high-end ALUs designated for graphical interpolation operations as well as trigonometric and transcendental operations.
All together, our piecemeal interpretation of the various white papers, forum discussions, and developer guides about a somewhat-standard GPU architecture looks like this:
Constant Headache13 Memory
Referring back to our diagram of a Streaming Multiprocessor, we can make some observations about how to optimize our code for the hardware:
Usage of Constant Memory helps free up space in the shared L1 Cache. We also get the added benefit of broadcasting. To understand broadcasting and its tradeoffs, we'll look at a simplification of the relationship between a Streaming Multiprocessor and global memory:
When a value is read from DRAM to the constant cache, that value is broadcast to all of the cores:
This is useful when all the threads in the SM will use that value, but if even one thread needs a different value from memory, we have to pay the cost of broadcasting it to all threads again. If the value is adjacent in memory it's not that bad, since values are loaded into the constant cache from DRAM in blocks and there's a good chance it's already cached; but the point is that all threads are constrained by the slowest read required.
DRAM
I fended off the electrical engineering material for as long as possible, but now it's time to pull back the curtain and see how the sausage gets made, in order to more fully understand the ramifications of which region of memory our kernels employ.
Transistors
Starting with the most fundamental electronic component in a memory cell, we have transistors which have three voltage entry points: the base, the source, and the drain:
The "main idea" with transistors is that voltage is unable to traverse the source/drain line unless additional voltage is also applied to the base:
Capacitors
Capacitors "store" charge.
If we pass voltage through the input, it will start to accumulate charge which will remain there even if we remove voltage from the input:
Memory Cells
We can combine these two components to create a very simple memory cell. The transistor acts like a switch, gating voltage in and out of the capacitor:
We can arrange these cells in a grid to form an array:
The bases of each transistor in a row are connected to a word line, and the outputs of each capacitor in a column are connected to a bit line. The word lines are governed by a row decoder, and the bit lines feed into sense amplifiers which are also controlled by a column decoder. The sense amplifiers output to ~wherever we want our memory array to go.
The array is controlled by a number of address lines corresponding to rows and columns for high and low chunks of an address.
To read a value from the array, we must pre-charge the bit lines with a non-zero charge lower than the voltage stored in the capacitors. The high part of the address is fed into the row decoder, which charges the entirety of the corresponding word line. If a cell in that row was holding a charge, the now-open transistor allows that charge to exit onto the bit line, increasing its voltage. Conversely, if the cell was empty, it starts draining charge from the bit line, decreasing the line's voltage.
The sense amplifier then reads the voltage from the bit line. If the voltage on a given bit line is sufficiently high, then the corresponding sense amplifier loads a value of one, otherwise it loads a zero. The amount of data loaded into the sense amplifiers per read from the array is called a page size and it's typically in the range of 1-4KB. We then read the remaining, lower half of the address via the column decoder which goes directly onto the data output line.
If that wasn't complicated enough, during the process of transmitting charge via the bit lines, each of the cells either gained or lost some charge. To correct this, we have to re-write the values from our sense amplifiers back into the memory cells. When we want to access a cell in a different row, we incur a row miss, which requires us to go through the whole process again. Luckily, we can also have the opposite scenario, where we read from a row that has already been "loaded" into the sense amplifiers; this allows us to skip the row-decoding step and just decode the column.
Because of the desire to avoid row misses, as well as to take advantage of caching hierarchies, memory is almost always accessed in bigger chunks than an application actually requests. Memory is divided into 32-byte segments. When we attempt to access memory, the warp owning the access request generates a transaction on the order of 32, 64, or 128 bytes. The size of the transaction depends on the region being accessed: e.g. requests to the L1 cache use 128-byte segments, and requests from L1 to L2 are made in 32-byte segments.
Additionally, our accesses must be aligned to segment boundaries. So even if we wanted to load just two contiguous bytes which straddle the boundary between two segments, we'd have to load both segments in full.
We can observe why the hell we care by benchmarking some contrived nuisance kernels with misaligned memory access patterns:
__global__ void copy(int n, float* in, float* out, int offset) {
unsigned long i = blockIdx.x * blockDim.x + threadIdx.x;
if (i < n) {
out[i + offset] = in[i + offset];
}
}
This kernel accesses 128 bytes of memory per warp at some offset. If the offset is a multiple of 128, a whole millisecond drops off the execution time compared to the other 127 possible offsets.
Another kernel which copies memories in strides exhibits similar behavior:
__global__ void copy(int n, float* in, float* out, int stride) {
unsigned long i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
if (i < n) {
out[i] = in[i];
}
}
Strides of one copy contiguous blocks of memory, and higher strides space out the bytes we're copying accordingly.
As expected, the bigger the stride, the more expensive the copy. There are two plateaus of note, though, which directly correspond to some of the magic breakpoints hidden within the memory hierarchy.
Optimizing Kernels
With all this info sitting fresh at the top of mind, let's revisit one of the non-trivial kernels we implemented earlier: softmax
A Reference Point
To get a baseline measurement, we'll first take note of the FLOPS executed and bytes loaded by our current implementation.
- bytes: we pay the cost of reading and writing one 4-byte floating point number per element, i.e. 8 bytes per element
To compute FLOPs we have to break down our function implementation:
- computing the maximum element of the input requires n operations
- subtracting the maximum from each element is another n operations
- exponentiating each element by the above difference is another n operations
- summing over all the exponentiated elements is another n operations
- and finally, dividing each exponentiated element by the sum is another n operations
leaving us with roughly 5 FLOPs per 8 bytes loaded. With memory bandwidth (~1 TB/s on our GPU) as the main bottleneck, this gives a softmax
kernel a theoretical upper bound of roughly 1 TB/s ÷ 8 B × 5 FLOPs ≈ 625 GFLOPS.
Before getting our hands dirty, let's take a peek at how various implementations from common libraries14,15 perform:
aaaand that doesn't make any sense. How're we already outperforming our theoretical max??? Turns out NVIDIA GPUs use a writeback cache pattern, meaning we're only writing to the L2 cache during kernel execution; global memory receives the data later, when the cache block is evicted. And since our L2 write speed is much faster than our global memory read speed, the only bottleneck is reading from global memory, which raises our theoretical maximum by a factor of two:
Adjusting our y-axis for this juicy piece of intel, our benchmarks look a lot more reasonable:
Recalling our own softmax
implementation:
__global__ void softmax(int w, int h, float* inputs, float* outputs) {
int row = blockIdx.y * blockDim.y + threadIdx.y;
int col = blockIdx.x * blockDim.x + threadIdx.x;
if (row < h && col < w) {
// find the maximum
float max_val = inputs[row * w];
for (int i = 1; i < w; i++) {
max_val = max(max_val, inputs[row * w + i]);
}
float divisor = 1e-9f; // prevent div by zero
for (int i = 0; i < w; i++) {
divisor += exp(inputs[row * w + i] - max_val);
}
outputs[row * w + col] =
exp(inputs[row * w + col] - max_val) / divisor;
}
}
Reduce
the maybe-obvious bottleneck is the computation of the max_val and the divisor. This was negligible for the sample MNIST solver, but for larger inputs (up to 1,024) we get a whopping 8.9 GFLOPs.
To fix this, we'll rewrite our softmax in the form of a reduce, where some operation is applied cumulatively to every element in the input:
To parallelize nicely, the operation needs to be associative so that the order does not matter. Within the implementation of any softmax
we have two associative operations: max
and summation.
Let's think about how our implementation behaves on the thread level. Currently, each thread operates on the entirety of the input:
which is a lot of repeated work. Instead, we can have each thread perform a portion of the reduction on its own unique slice of the input vector:
We then need to transmit the data to a "leader" thread to complete the reduction:
and, in the case of softmax
broadcast the result back to the rest of the threads. Here's what that looks like for computing the maximum:
__shared__ float reduction[BLOCK_DIM_Y];
float max_val = FLOAT_MIN;
// perform reduction
for (int i = ty * BLOCK_DIM_Y; i < min(w, (ty+1)*BLOCK_DIM_Y); i++) {
max_val = fmaxf(max_val, a[row * w + i]);
}
// exchange data b/w threads
reduction[ty] = max_val;
for (int stride = BLOCK_DIM_Y/2; stride >= 1; stride /= 2) {
__syncthreads();
if (ty < stride) {
// block level reduction
reduction[ty] = fmaxf(reduction[ty], reduction[ty + stride]);
}
}
// broadcast
__syncthreads();
max_val = reduction[0];
We can see how this stacks up against the SoTA:
Still not quite there.
Coalescing
The next improvement we could make would be to STOP! accessing memory in strides. Instead, we want each warp to access contiguous chunks of memory. To achieve this, we simply modify the read loop from a per-thread stride of 1 to a stride equal to our block dimension, so that adjacent threads read adjacent elements:
// perform reduction
for (int i = ty; i < w; i += BLOCK_DIM_Y) {
max_val = fmaxf(max_val, a[row * w + i]);
}
This brings us up into the league of the big dawgs:
Registers
Next, let's take a look at how our reduction is occurring within memory. At the moment, several of the reductions are taking place in shared memory between the threads.
Shared memory is fast, but not as fast as registers. Recall that each processing block has a register file which is shared between all of its threads. The way to exchange values between threads directly through those registers is with a cursed CUDA primitive16 called __shfl_xor_sync
# define MASK 0xffffffff
__shfl_xor_sync(MASK, variable, offset, warp_size)
which effectively does the following:
int laneId = threadIdx.x % 32;
if ((MASK >> laneId) & 1 && laneId < warp_size) {
int target_laneId = laneId ^ offset;
return get_variable_from_other(variable, target_laneId);
}
yes yes, mixing camel case and snake case in shitty pseudo code, pseu me. We're passing in a mask which specifies which threads should participate in the exchange, the variable we want to exchange, an offset used to calculate the lane index of the thread we want to get the variable from, and the warp size.
Sparing myself the trouble of creating a 4x larger diagram, suppose our warp size is 8 instead of 32 threads as stated earlier. What we want to do is perform the reduction within each warp and store the result in the processing block's registers, sync up in shared memory, and complete the reduction in registers again:
The kernel code gets a bit hairy, but it's worth it (maybe?):
# define MASK 0xffffffff
float max_val = FLOAT_MIN;
// perform reduction
for (int i = ty; i < w; i += BLOCK_DIM_Y) {
max_val = fmaxf(max_val, a[row * w + i]);
}
// warp-level reduction
for (int i = 16; i > 0; i /= 2) {
max_val = fmaxf(max_val, __shfl_xor_sync(MASK, max_val, i, 32));
}
// shared memory sync
if (ty % 32 == 0) {
reduction[warp_id] = max_val;
}
__syncthreads();
if (warp_id == 0) {
max_val = ty < BLOCK_DIM_Y/32 ? reduction[ty] : 0.f;
// finish reduction in registers of first warp
for (int i = 16; i > 0; i /= 2) {
max_val = fmaxf(max_val, __shfl_xor_sync(MASK, max_val, i, 32));
}
}
This brings us on par with torch:
float4
Where do we go from here? We can dig deep into the bag of tricks to purchase some more memory management micro optimizations, such as the use of float4
to reduce our memory accesses by a factor of 4. Whereas the initial level of reduction took place in this loop:
// perform reduction
for (int i = ty; i < w; i += BLOCK_DIM_Y) {
max_val = fmaxf(max_val, a[row * w + i]);
}
We can replace it with the following, which requires a single memory access to load four contiguous floating point numbers from memory:
// perform reduction
for (int i = ty; i < w/4; i += BLOCK_DIM_Y) {
float4 val = reinterpret_cast<float4*>(&a[row * w + i * 4])[0];
max_val = fmaxf(max_val, val.x);
max_val = fmaxf(max_val, val.y);
max_val = fmaxf(max_val, val.z);
max_val = fmaxf(max_val, val.w);
}
The improvement gained by this optimization is hard to distinguish on my plot except in a few places where it BEATS torch.
Fine-tuned Unrolling
If we add #pragma unroll UNROLL_FACTOR
directives to each of our loops, we can instruct the compiler exactly how to unroll them (which it normally decides automatically depending on the compiler optimization level). Since we know the dimensions of our blocks, which determine the trip counts of all our loops – and with a little bit of fine tuning – we can outperform both Triton and torch:
Still doesn't scratch the theoretical maximum, but we're ahead of the curve now.
Online Normalization
It seems a bit clunky for us to load the value of a[row * w + i]
in two locations in our kernel code (once when computing max_val
and again when computing divisor
), but it also seems unavoidable since we're exponentiating by the max value when we compute the divisor... In other words, the second location is dependent on the first, right?
wrong [sent with slam effect]. Them boys over at NVIDIA have already encountered this very degree of micro-optimization purgatory and came up with a solution17 (and you know it's cracked because their notation is so god awful). In their paper, they demonstrate that we can compute the running maximum and the divisor in a single pass, correcting the partial divisor whenever a new maximum is encountered:
Suppose we were to compute the divisor and the max value of the input vector in the same loop. We might run into a scenario where we encounter a larger max value after having already accumulated some divisor value(s), like so:
We need to determine how much the contribution of the already-accumulated terms to the divisor has changed w.r.t. the new max value (discovered in the second iteration). We can express the correction as:
$d_{\text{new}} = d_{\text{old}} \cdot e^{m_{\text{old}} - m_{\text{new}}}$
which, notably, is independent of our input vector and only depends on the previous and current maximum values, which are known. To incorporate this accumulation technique into our kernel, we need to make two modifications.
The first being the computation of max_val
:
// perform reduction
for (int i = ty; i < w/4; i += BLOCK_DIM_Y) {
float4 val = reinterpret_cast<float4*>(&a[row * w + i * 4])[0];
max_val = fmaxf(max_val, val.x);
max_val = fmaxf(max_val, val.y);
max_val = fmaxf(max_val, val.z);
max_val = fmaxf(max_val, val.w);
// fix and update our max_val
if (max_val > old_max_val) {
divisor *= __expf(old_max_val - max_val);
old_max_val = max_val;
}
divisor += __expf(val.x - max_val);
divisor += __expf(val.y - max_val);
divisor += __expf(val.z - max_val);
divisor += __expf(val.w - max_val);
}
The second modification adjusts how we perform the reduction across the warp. The important difference is that we need to determine which divisor to correct – the thread's own or the incoming one – depending on which max_val is larger:
float incoming_divisor;
float incoming_max_val;
for (int i = 16; i > 0; i /= 2) {
incoming_max_val = __shfl_xor_sync(MASK, max_val, i, 32);
incoming_divisor = __shfl_xor_sync(MASK, divisor, i, 32);
if (incoming_max_val > max_val) {
divisor *= __expf(max_val - incoming_max_val);
max_val = incoming_max_val;
} else {
incoming_divisor *= __expf(incoming_max_val - max_val);
}
divisor += incoming_divisor;
}
if (ty % 32 == 0) {
reduction_max[warp_id] = max_val;
reduction_divisor[warp_id] = divisor;
}
aaaand this gives us much better performance for bigger inputs, at the penalty of slightly worse performance on smaller inputs, since those are small enough to fit entirely within the cache, at which point the additional operations above become noticeable:
Footnotes
https://en.wikipedia.org/wiki/Thread_block_(CUDA_programming)#Indexing ↩
Boehm, Simon. "How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance: a Worklog." Siboehm, 2022.
This is something of a grail blog post. ↩
Turns out that you don't technically need a non-linearity because floating point imprecision is sufficient: GradIEEEnt half decent ↩
https://ml-cheatsheet.readthedocs.io/en/latest/loss_functions.html ↩
Zhe Jia et al. "Dissecting the NVIDIA Volta GPU Architecture via Microbenchmarking" arXiv, 2018.
each SMX possesses two private levels of constant caches, which we denote as L1 and L1.5 constant cache (accesses to either of each level within an SMX do not affect the same cache levels on other SMXs);
though this is not corroborated anywhere else that I can find... ↩
PyTorch softmax source: https://pytorch.org/docs/stable/_modules/torch/nn/modules/activation.html#Softmax ↩
Triton softmax source: https://triton-lang.org/main/getting-started/tutorials/02-fused-softmax.html#compute-kernel ↩
https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/ ↩
Maxim Milakov and Natalia Gimelshein. "Online normalizer calculation for softmax." arXiv, 2018. ↩