GPUs

Jul 6, 2024
#computer systems


- Graphics processing involves a series of steps:
  - project vertices onto 2D
  - remove triangles that are not visible
  - convert triangles into pixels
  - compute the color of each pixel
- Traditional GPUs followed a sequential hardware graphics pipeline model:
  - each step of the graphics pipeline was implemented in hardware and could not be programmed or customized
- Over time, all the stages came to be implemented on a uniform set of general-purpose processors instead of each having its own specialized hardware. This paved the way for GPUs to be programmable.
- OpenGL is an API specification for GPUs, and GLSL is the shader language that is part of the OpenGL project. The OpenGL API declares functions like glCreateShader. It is meant for high-level graphics applications, not for general-purpose programming.
- GPGPU (General-Purpose GPU) is a paradigm of using graphics APIs like OpenGL to implement general-purpose algorithms like matrix multiplication. This can be considered a hacky solution: e.g. input matrices would be passed as "textures", the multiplication code would be implemented as a "shader", and the output would be written to the "framebuffer".
- Eventually OpenCL and CUDA were developed, which are "platforms" (a vague word that encompasses API specifications, tools, libraries, etc.) for writing general-purpose code on GPUs.
- Modern GPUs have both general-purpose cores and fixed-function units. The Nvidia driver, which implements the OpenGL interface, uses both: e.g. programmable shaders run on general-purpose cores, but rasterization runs on a fixed-function unit.
- When you run CUDA code, it runs only on the general-purpose cores.
- In SIMD (single-instruction, multiple-data) programming frameworks like CUDA, the programmer writes code for a single thread and the GPU runs many instances of that thread in parallel. This is a natural programming model that evolved from shader languages, in which programs were descriptions of how to color a single pixel.
- GPU architecture:
  - Each general-purpose core (not fixed-function) is called a Streaming Processor (SP)
  - The SPs are grouped into Streaming Multiprocessors (SMs)
  - Each SM has some caches and shared memory
  - All the processors are connected to the GPU's main memory
- Textures are a big part of the graphics rendering pipeline. Essentially, a texture is a 2D or 3D array of colors. Shaders look up points in a texture to determine the color of a pixel. To make this efficient, Streaming Multiprocessors have access to a texture cache. But this is irrelevant for general-purpose GPU computing.
- Because of this two-layer GPU architecture (Streaming Multiprocessors containing Streaming Processors), when writing CUDA code it is often good to break the problem down into two levels of granularity: e.g. break a problem into blocks, and break each block into elements. Then blocks are assigned to SMs and each element is assigned to a thread.
- The CUDA programming model has three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.
- When a CUDA program is compiled, it is not tied to a specific processor count. The runtime system is responsible for making sure the program executes efficiently on the specific hardware it is using.
- CUDA is an extension to C/C++, i.e. the syntax is based on C/C++ with additional keywords. The CUDA compiler (nvcc) can actually compile any C++ code (as long as the version and libraries are supported), because nvcc is a wrapper around gcc/clang. NVCC relies on a host C++ compiler to compile host code, and it is responsible for compiling device code itself. NVCC "links" the device code into the final executable so that the device kernels are available. The CUDA runtime sends the compiled kernel to the GPU when the binary is executed.
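The SIMD model described above, where you write code for one thread and the GPU runs many instances of it, can be sketched as a minimal CUDA kernel. Vector addition is my own illustrative example here, not one from the text:

```cuda
// Minimal sketch (illustrative example): each thread computes one output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) c[i] = a[i] + b[i];                  // guard: the grid may overshoot n
}

// Launch with the two-level decomposition: a grid of blocks, each block 256 threads.
// The <<<grid, block>>> values are ordinary runtime expressions, not template arguments.
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
```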
- A CUDA kernel is a function that is run by each thread. When you execute a kernel, you specify the number of blocks and the number of threads in each block. The blocks are arranged in a grid, so the number of blocks is the same thing as the dimension of the grid. Within a block, threads can use synchronization primitives; threads in different blocks are expected to be independent. Blocks map to Streaming Multiprocessors. Picking the right grid and block dimensions is important for code to run efficiently, and there are techniques for picking them dynamically based on the GPU architecture. Even though the syntax for invoking a kernel looks like the dimensions are passed as template parameters, they don't actually have to be known at compile time.
- Each thread has local memory. Threads in a block have access to that block's shared memory. All threads have access to global memory.
- A CUDA kernel has access to the blockIdx, blockDim, and threadIdx constants so it knows which thread it is.
- A CPU core can run at most 2 threads concurrently if hyperthreading is enabled. A GPU Streaming Multiprocessor can run many threads in parallel, one per SP core. The restriction is that all the threads have to execute the same instruction on the same clock cycle (though the instruction can reference registers that are local to each thread). This restriction simplifies the hardware because it means the SM can use one instruction loader to fetch the instruction for all the threads.
- A warp is a group of threads that execute the same instruction on each cycle. On Nvidia devices, a warp is generally 32 threads. On older GPUs that could only run 4 threads in parallel, it would take multiple clock cycles to execute an instruction for a warp. Older GPUs could still run multiple warps per SM though; the warps would just be interleaved (i.e. concurrent instead of truly parallel). There would still need to be one instruction loader per tracked warp.
- Each SP has a register file. If the register file can store N registers, each thread requires T registers (typically a CUDA program requires 32 registers per thread), and there are P SPs per SM, then theoretically you should have at least (N / T) * P threads per thread block to saturate the SM.
- If a warp executes a memory read, the SM can switch to a different warp and switch back once the read is complete, hiding memory latency.
- (Correction to some of the above: it is the SM, not the SP, that tracks warps and schedules them onto SPs, and each SP runs only a single thread per clock cycle.) In the GeForce 8800, there are 16 SMs, each with 8 SPs. Each SM can manage 768 concurrent threads. Each issue cycle, it picks one of its 24 warps (32 threads per warp) to execute. It takes 4 processor cycles to execute all the threads in the warp (because there are only 8 SPs), so issue cycles happen every 4th processor cycle.
- The Special Function Unit implements operations like inverse square root, sin/cos, log, etc.
- To summarize: a warp is a group of 32 CUDA threads. A CUDA thread is analogous to a vector lane on a CPU (e.g. one 4-byte float in a 512-bit AVX vector). A GPU consists of Streaming Multiprocessors, and each Streaming Multiprocessor has some number of warp schedulers. A warp scheduler can run some number of warps concurrently, like hyperthreading on CPUs. The reason for this two-tier hierarchy is that warp schedulers in the same SM share an L1 cache.
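The warp and occupancy arithmetic in these notes can be checked with a quick back-of-envelope script. The GeForce 8800 figures come from the text; the register-file size N = 1024 is an illustrative assumption, not a figure from the text:

```python
# Back-of-envelope arithmetic for the GeForce 8800 figures in the notes.
warp_size = 32
warps_per_sm = 24
threads_per_sm = warps_per_sm * warp_size    # concurrent threads managed per SM
sps_per_sm = 8
cycles_per_warp = warp_size // sps_per_sm    # processor cycles to issue one warp

# Register-file rule of thumb: N registers per SP, T registers per thread,
# P SPs per SM -> roughly (N / T) * P resident threads to saturate the SM.
N, T, P = 1024, 32, sps_per_sm               # N is an assumed register-file size
threads_to_saturate = (N // T) * P

print(threads_per_sm)       # 768
print(cycles_per_warp)      # 4
print(threads_to_saturate)  # 256
```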

What is a GPU?

Disclaimer: most of what I know comes from reading the section on GPUs in Computer Organization and Design 5th Edition and talking to ChatGPT. A GPU (Graphics Processing Unit) is a hardware device that is optimized for parallel computation. As their name suggests, GPUs were originally designed to render computer graphics, because many computations in computer graphics are highly parallelizable: projecting vertices onto the screen, determining which triangles are visible, converting triangles into pixels, and computing the color of each pixel can all be done for many vertices, triangles, or pixels at once.

Then, with the advent of cryptocurrencies, GPUs found a new use case in cryptocurrency mining. Many cryptocurrencies, including Bitcoin, use a Proof-of-Work consensus protocol, in which miners compete to be the first to find a number that satisfies certain properties when hashed. GPUs are vital to mining because they can compute hashes on multiple inputs in parallel.

Now, the Generative AI boom has created a new wave of demand for GPUs. This is because training large neural networks involves performing a lot of matrix multiplication operations. Since each output term in the product of two matrices can be computed in parallel, GPUs excel at these calculations.
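To make that parallelism concrete, here is a naive matrix multiply in Python (my own sketch, not from the text): each output element C[i][j] depends only on row i of A and column j of B, so every output could be computed simultaneously.

```python
# Naive matrix multiply: each output element is independent of the others,
# so all of them could be computed in parallel (e.g. one GPU thread each).
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][t] * B[t][j] for t in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```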

GPU Architecture

At a high level, a GPU has many slow and simple cores, whereas a CPU has a few fast and complex cores. Therefore, GPUs are better at simple tasks that can be divided into many independent pieces. As an analogy, think of a GPU as a group of 1,000 middle school students and a CPU as a single math PhD student. The middle school students are going to be faster at some tasks, like solving 1,000 addition problems. For other tasks, like solving an integral, the middle school students would be a lot slower, and may even find the task impossible.

A GPU has many Streaming Multiprocessors (SMs). Each SM has many Arithmetic Units (AUs), which perform operations. Even though the clock speed of a CPU core is often about twice that of an Arithmetic Unit (~4 GHz vs ~2 GHz), GPUs often have an order of magnitude more Arithmetic Units than CPUs have cores (~1,000 vs ~100), which lets GPUs perform operations at several times the throughput of CPUs. GPUs have a SIMD (single instruction, multiple data) architecture, which means that all of the Arithmetic Units on a Streaming Multiprocessor must perform the same operation on the same clock cycle.
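Using the round numbers above, here is a back-of-envelope comparison of raw operation throughput (one operation per unit per cycle is an idealizing assumption):

```python
# Rough throughput comparison using the example figures above,
# assuming one operation per core/AU per clock cycle.
cpu_cores, cpu_ghz = 100, 4.0
gpu_aus, gpu_ghz = 1000, 2.0

cpu_gops = cpu_cores * cpu_ghz   # 400 billion ops/s
gpu_gops = gpu_aus * gpu_ghz     # 2000 billion ops/s

print(gpu_gops / cpu_gops)  # 5.0
```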

A thread is the smallest unit of work that can be performed on a GPU; it is a sequence of instructions. A Thread Block is a group of threads that gets assigned to an SM. A Thread Block is divided into Warps, which are fixed-size groups of threads, typically 32. All threads in a warp execute the same instruction at the same time, but on different data.
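A small Python sketch of how this hierarchy maps threads to data, mirroring CUDA's blockIdx.x * blockDim.x + threadIdx.x convention (the block size of 64 is an illustrative choice):

```python
BLOCK_DIM = 64   # threads per block (illustrative choice)
WARP_SIZE = 32   # typical Nvidia warp size

def global_index(block_idx, thread_idx):
    # Which data element this thread handles across the whole grid.
    return block_idx * BLOCK_DIM + thread_idx

def warp_id(thread_idx):
    # Which warp within its block this thread belongs to.
    return thread_idx // WARP_SIZE

print(global_index(2, 5))  # 133
print(warp_id(40))         # 1
```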

GPUs have their own memory. Each SM has its own L1 cache, and all the SMs share an L2 cache. When you want to perform computation on the GPU, you have to transfer the data from main memory to the GPU's memory. This is typically done over the PCIe bus by issuing a command to the DMA controller. After the GPU is finished with the computation, you have to transfer the results back to main memory.
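The round trip described above looks like this with the CUDA runtime API (a minimal sketch; error handling omitted, and the function name is my own):

```cuda
#include <cuda_runtime.h>

// Sketch of moving data to the GPU and back (error checks omitted).
void roundTrip(float* host_data, int n) {
    size_t bytes = n * sizeof(float);
    float* device_data;
    cudaMalloc(&device_data, bytes);                 // allocate GPU memory
    cudaMemcpy(device_data, host_data, bytes,
               cudaMemcpyHostToDevice);              // CPU -> GPU, over PCIe via DMA
    // ... launch kernels that read/write device_data ...
    cudaMemcpy(host_data, device_data, bytes,
               cudaMemcpyDeviceToHost);              // GPU -> CPU
    cudaFree(device_data);
}
```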

What is a device driver?

A device driver is a piece of code that tells the operating system how to communicate with hardware devices. On Linux, drivers are implemented as kernel modules that get loaded into the running kernel.

- Architecture
- Interfacing with CPU and Memory
- Driver
- Code


