Part I: The Genesis and Definition of CUDA
1.1 From Pixels to Particles
The GPU began as a fixed-function graphics accelerator. The shift to GPGPU (General-Purpose computing on GPUs) occurred when researchers realized that the architecture used to render millions of pixels simultaneously could be repurposed for scientific simulations. Initially, scientists had to "trick" the GPU by disguising data as textures, a hurdle overcome by the release of Brook at Stanford and subsequently CUDA by NVIDIA in 2007.
1.2 The CUDA Platform
CUDA is not just a language; it is a parallel computing platform.
Language Extensions: Keywords such as `__global__` and the `<<< >>>` kernel-launch syntax.
NVCC Compiler: Separates host (CPU) code from device (GPU) code.
Accelerated Libraries: cuBLAS (linear algebra), cuDNN (deep learning), and cuFFT (fast Fourier transforms).
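A minimal sketch tying these pieces together (hypothetical kernel name `scale`, compiled with `nvcc`):

```cuda
// Compile with: nvcc scale.cu -o scale
__global__ void scale(float *data, float s, int n) {   // __global__: runs on the device
    int i = blockIdx.x * blockDim.x + threadIdx.x;     // built-in thread coordinates
    if (i < n) data[i] *= s;
}
// The host launches it with the <<<grid, block>>> syntax:
//   scale<<<4, 256>>>(d_data, 2.0f, n);
```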
1.3 The Heterogeneous Model
CUDA utilizes a Host-Device relationship:
Host (CPU): Optimized for complex control logic and sequential tasks.
Device (GPU): Optimized for massive data-parallel throughput.
Part II: The Programming Model
2.1 Kernels and the Thread Hierarchy
A Kernel is a function executed on the GPU by many threads in parallel. Execution is organized in a three-level hierarchy:
Thread: The smallest unit of execution.
Thread Block: A group of threads (up to 1024) that can cooperate via Shared Memory.
Grid: The collection of all blocks for a single kernel launch.
2.2 SIMT and Warps
On the hardware level, GPUs use SIMT (Single Instruction, Multiple Thread) execution: threads are scheduled in groups of 32 called warps, which execute the same instruction in lockstep.
Thread Divergence: If threads within a warp take different paths (e.g., an `if-else` statement), the hardware serializes execution of each path, significantly dropping performance.
Part III: The Memory Hierarchy
Memory management is the most critical aspect of CUDA performance. Because the GPU has thousands of cores, feeding them data is the primary bottleneck.
| Memory Type | Location | Access Speed | Visibility |
| --- | --- | --- | --- |
| Registers | On-chip | Fastest | Per thread |
| Shared Memory | On-chip | Very fast | Per block |
| Global Memory | Off-chip (VRAM) | Slowest | All threads + Host |
| Constant Memory | Off-chip (cached) | Fast | All threads |
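These memory spaces map directly onto CUDA declarations; a brief sketch with illustrative names:

```cuda
__constant__ float coeffs[16];             // Constant memory: written by the host, cached on-chip

__global__ void demo(const float *g_in) {  // g_in points into Global Memory (VRAM)
    __shared__ float tile[256];            // Shared Memory: one copy per block
    float r = g_in[threadIdx.x];           // plain locals live in per-thread Registers
    tile[threadIdx.x] = r * coeffs[0];
    __syncthreads();                       // make the tile visible to the whole block
}
```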
3.1 Shared Memory Pattern
A common optimization is Tiling:
1. Threads cooperatively load a "tile" of data from Global Memory into Shared Memory.
2. Synchronize threads using `__syncthreads()`.
3. Perform multiple calculations using the high-speed Shared Memory.
4. Write results back to Global Memory.
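The steps above can be sketched as a per-block sum reduction (illustrative only; assumes `blockDim.x` is a power of two no larger than the tile size):

```cuda
#define TILE 256

__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                   // step 1: tile lives in Shared Memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;    // cooperative load from Global Memory
    __syncthreads();                               // step 2: barrier before reuse

    // step 3: tree reduction entirely in fast Shared Memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            tile[threadIdx.x] += tile[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)                          // step 4: single write back to Global Memory
        out[blockIdx.x] = tile[0];
}
```

Each element is read from Global Memory once but touched many times in Shared Memory, which is the whole point of the pattern.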
Part IV: Practical Implementation
The standard workflow for a CUDA program follows five steps:
1. Allocate memory on the Device (`cudaMalloc`).
2. Copy data from Host to Device (`cudaMemcpy`).
3. Launch the Kernel (`kernel<<<grid, block>>>`).
4. Copy results back to Host (`cudaMemcpy`).
5. Free Device memory (`cudaFree`).
Example: Vector Addition Kernel
In a vector addition ($C = A + B$), each thread calculates its unique global index to determine which element to process:
```cuda
__global__ void vectorAdd(float *C, const float *A, const float *B, int n) {
    // Calculate unique global thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Boundary check: the last block may have surplus threads
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
```
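The kernel is driven from the host by the five-step workflow; a sketch (assumes the `vectorAdd` kernel is in the same file, and omits error checking for brevity):

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// vectorAdd, as defined above, is assumed to be in this file.

int main(void) {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host buffers with known inputs
    float *hA = (float *)malloc(bytes), *hB = (float *)malloc(bytes), *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    // 1. Allocate device memory
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes); cudaMalloc(&dB, bytes); cudaMalloc(&dC, bytes);

    // 2. Copy inputs Host -> Device
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // 3. Launch: enough 256-thread blocks to cover n elements
    int block = 256, grid = (n + block - 1) / block;
    vectorAdd<<<grid, block>>>(dC, dA, dB, n);

    // 4. Copy the result Device -> Host
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);   // 1.0 + 2.0

    // 5. Free device memory
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```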
Key Tools
Nsight Systems: For system-wide profiling (CPU/GPU interactions).
Nsight Compute: For deep-dive kernel-level debugging and instruction analysis.