Part I: The Genesis and Definition of CUDA


1.1 From Pixels to Particles

The GPU began as a fixed-function graphics accelerator. The shift to GPGPU (General-Purpose computing on GPUs) occurred when researchers realized that an architecture built to render millions of pixels simultaneously could be repurposed for scientific simulation. Initially, scientists had to "trick" the GPU by disguising data as textures and computations as shader passes, a hurdle removed first by Stanford's Brook streaming language and then by NVIDIA's CUDA in 2007.

1.2 The CUDA Platform

CUDA is not just a language; it is a parallel computing platform. It consists of:

  • Language Extensions: Keywords like __global__ and <<< >>>.

  • NVCC Compiler: Separates host (CPU) and device (GPU) code.

  • Accelerated Libraries: cuBLAS (math), cuDNN (deep learning), and cuFFT.
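
The library layer often removes the need to write kernels at all. As a sketch, a single-precision AXPY ($y = \alpha x + y$) can be delegated to cuBLAS; the device pointers d_x and d_y are hypothetical here and assumed to be already allocated and populated:

C++
#include <cublas_v2.h>

// Sketch: y = alpha * x + y computed on the device via cuBLAS.
// Assumes d_x and d_y are device pointers each holding n floats.
void saxpyOnDevice(float *d_y, const float *d_x, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                           // initialize the library context

    const float alpha = 2.0f;
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);  // stride 1 through both vectors

    cublasDestroy(handle);
}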

1.3 The Heterogeneous Model

CUDA utilizes a Host-Device relationship:

  • Host (CPU): Optimized for complex control logic and sequential tasks.

  • Device (GPU): Optimized for massive data-parallel throughput.


Part II: The Programming Model

2.1 Kernels and the Thread Hierarchy

A Kernel is a function executed $N$ times in parallel by $N$ different threads. These threads are organized into a three-level hierarchy to ensure scalability across different GPU hardware:

  1. Thread: The smallest unit of execution.

  2. Thread Block: A group of threads (up to 1024) that can cooperate via Shared Memory.

  3. Grid: The collection of all blocks for a single kernel launch.
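
The hierarchy is configured at launch time. A minimal sketch, assuming a hypothetical kernel named process and a device array d_data of n elements, uses ceiling division so the grid covers every element even when n is not a multiple of the block size:

C++
int n = 1 << 20;               // one million elements
int threadsPerBlock = 256;     // per-block thread count (hardware limit: 1024)

// Ceiling division: enough blocks to cover all n elements.
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

// Grids and blocks may have up to three dimensions (via dim3); 1D shown here.
process<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);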

2.2 SIMT and Warps

On the hardware level, GPUs use SIMT (Single Instruction, Multiple Thread) execution. Threads are managed in groups of 32 called Warps.

  • Thread Divergence: If threads in a warp take different paths (e.g., an if-else statement), the hardware serializes execution, significantly dropping performance.
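
A sketch of how divergence arises: the first kernel below branches on the lane index, splitting every warp (each warp executes both paths back to back), while the second branches on a warp-uniform value, so no warp diverges. The array data is a hypothetical device pointer:

C++
__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: odd and even lanes of the same warp take different paths,
    // so the hardware serializes the two paths.
    if (i % 2 == 0) data[i] *= 2.0f;
    else            data[i] += 1.0f;
}

__global__ void uniform(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Warp-uniform: i / 32 is the warp index, identical for all 32 lanes,
    // so each warp takes exactly one path.
    if ((i / 32) % 2 == 0) data[i] *= 2.0f;
    else                   data[i] += 1.0f;
}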


Part III: The Memory Hierarchy

Memory management is the most critical aspect of CUDA performance. Because the GPU has thousands of cores, feeding them data is the primary bottleneck.

Memory Type        Location            Access Speed   Visibility
Registers          On-chip             Fastest        Per thread
Shared Memory      On-chip             Very Fast      Per block
Global Memory      Off-chip (VRAM)     Slowest        All threads + Host
Constant Memory    Off-chip (Cached)   Fast           All threads

3.1 Shared Memory Pattern

A common optimization is Tiling:

  1. Threads cooperatively load a "tile" of data from Global Memory into Shared Memory.

  2. Synchronize threads using __syncthreads().

  3. Perform multiple calculations using the high-speed Shared Memory.

  4. Write results back to Global Memory.
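
The four steps above map directly onto code. A minimal sketch for a tiled square matrix multiply, assuming for simplicity that the matrix width is a multiple of TILE:

C++
#define TILE 16

// C = A * B for width x width matrices; width assumed to be a multiple of TILE.
__global__ void tiledMatMul(float *C, const float *A, const float *B, int width) {
    __shared__ float tileA[TILE][TILE];   // staging areas in Shared Memory
    __shared__ float tileB[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < width / TILE; ++t) {
        // 1. Each thread cooperatively loads one element of each tile.
        tileA[threadIdx.y][threadIdx.x] = A[row * width + t * TILE + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * width + col];

        // 2. Wait until the whole tile is resident in Shared Memory.
        __syncthreads();

        // 3. Many calculations against the fast on-chip copy.
        for (int k = 0; k < TILE; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        // Wait before the next iteration overwrites the tiles.
        __syncthreads();
    }

    // 4. Write the result back to Global Memory.
    C[row * width + col] = sum;
}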


Part IV: Practical Implementation

The standard workflow for a CUDA program follows five steps:

  1. Allocate memory on the Device (cudaMalloc).

  2. Copy data from Host to Device (cudaMemcpy).

  3. Launch the Kernel (kernel<<<grid, block>>>).

  4. Copy results back to Host.

  5. Free Device memory (cudaFree).

Example: Vector Addition Kernel

In a vector addition ($C = A + B$), each thread calculates its unique global index to determine which element to process:

C++
__global__ void vectorAdd(float *C, const float *A, const float *B, int n) {
    // Calculate unique global thread ID
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    
    // Boundary check
    if (i < n) {
        C[i] = A[i] + B[i];
    }
}
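
The five-step workflow from the start of this Part, driving this kernel, might look like the following host-side sketch (error checking and data initialization omitted for brevity):

C++
#include <cuda_runtime.h>
#include <stdlib.h>

int main() {
    int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host arrays (initialization omitted).
    float *h_A = (float *)malloc(bytes);
    float *h_B = (float *)malloc(bytes);
    float *h_C = (float *)malloc(bytes);

    // 1. Allocate memory on the Device.
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, bytes);
    cudaMalloc(&d_B, bytes);
    cudaMalloc(&d_C, bytes);

    // 2. Copy data from Host to Device.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    // 3. Launch the Kernel.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vectorAdd<<<blocks, threads>>>(d_C, d_A, d_B, n);

    // 4. Copy results back to Host.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    // 5. Free Device memory.
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return 0;
}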

Key Tools

  • Nsight Systems: For system-wide profiling (CPU/GPU interactions).

  • Nsight Compute: For deep-dive kernel-level debugging and instruction analysis.
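
Typical invocations, assuming the built binary is named ./vector_add (the output report names are illustrative):

Shell
nsys profile -o timeline ./vector_add      # Nsight Systems: system-wide timeline
ncu -o kernel_report ./vector_add          # Nsight Compute: per-kernel metrics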
