GPUs

Jul 6, 2024
#computer systems


What is a GPU?

A GPU (Graphics Processing Unit) is a hardware device optimized for parallel computation. As the name suggests, GPUs were originally designed to render computer graphics, because many computations in graphics are highly parallelizable: for example, the color of each pixel on the screen can usually be computed independently of every other pixel.
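To make the per-pixel independence concrete, here is a toy sketch (in plain Python, with a made-up `shade_pixel` function standing in for a real shader). Because no pixel depends on any other, every loop iteration below could run in parallel on a separate GPU core:

```python
def shade_pixel(x, y, width, height):
    """Hypothetical per-pixel shader: a simple horizontal gradient."""
    return int(255 * x / (width - 1))

width, height = 4, 2
# Each pixel is computed from (x, y) alone -- no iteration reads
# another pixel's result, so all of them could run simultaneously.
image = [[shade_pixel(x, y, width, height) for x in range(width)]
         for y in range(height)]
print(image)  # [[0, 85, 170, 255], [0, 85, 170, 255]]
```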

Then, with the advent of cryptocurrencies, GPUs found a new use case in cryptocurrency mining. Many cryptocurrencies, including Bitcoin, use a Proof-of-Work consensus protocol, in which miners compete to be the first to find a number that satisfies certain properties when hashed. GPUs are vital to mining because they can compute hashes on multiple inputs in parallel.
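A toy version of that nonce search, using SHA-256 from Python's standard library (the `difficulty` rule here, requiring leading zero hex digits, is a simplification of Bitcoin's actual target check):

```python
import hashlib

def mine(data: bytes, difficulty: int) -> int:
    """Find a nonce whose SHA-256 digest of data+nonce starts with
    `difficulty` zero hex digits (a toy proof-of-work)."""
    nonce = 0
    while True:
        digest = hashlib.sha256(data + nonce.to_bytes(8, "big")).hexdigest()
        if digest.startswith("0" * difficulty):
            return nonce
        nonce += 1  # on a GPU, thousands of nonces are hashed in parallel

nonce = mine(b"block header", 2)
```

Each candidate nonce is hashed independently of all the others, which is exactly why the search maps so well onto a GPU's many cores.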

Now, the Generative AI boom has created a new wave of demand for GPUs. This is because training large neural networks involves performing an enormous number of matrix multiplications. Since each element in the product of two matrices can be computed independently of the others, GPUs excel at these calculations.
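The independence of the output elements is easy to see in a plain-Python matrix multiply (a sketch for illustration, not how a real GPU kernel is written):

```python
def matmul(A, B):
    n, k, m = len(A), len(B), len(B[0])
    # C[i][j] depends only on row i of A and column j of B, so all
    # n*m output elements could be computed in parallel on a GPU.
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(matmul(A, B))  # [[19, 22], [43, 50]]
```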

GPU Architecture

At a high level, a GPU has many slow, simple cores, whereas a CPU has a few fast, complex cores. GPUs are therefore better at simple tasks that can be divided into many independent pieces. As an analogy, think of a GPU as a group of 1,000 middle school students and a CPU as a single math PhD student. The middle schoolers will be faster at some tasks, like solving 1,000 addition problems. At other tasks, like solving an integral, they would be a lot slower, and might find the task impossible.

A GPU has many Streaming Multiprocessors (SMs). Each SM has many Arithmetic Units (AUs), which perform the actual operations. Even though the clock speed of a CPU core is often about twice that of an Arithmetic Unit (~4 GHz vs ~2 GHz), GPUs often have an order of magnitude more Arithmetic Units than CPUs have cores (~1,000 vs ~100), which lets GPUs sustain several times the aggregate arithmetic throughput of a CPU. GPUs have a SIMD (single instruction, multiple data) architecture, which means that all of the Arithmetic Units on a Streaming Multiprocessor must perform the same operation on the same clock cycle.
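A quick back-of-the-envelope check with the rough figures above (these are illustrative round numbers, not measurements of any specific chip):

```python
# Illustrative round numbers, not the specs of a real chip.
cpu_cores, cpu_clock_ghz = 100, 4.0
gpu_aus, gpu_clock_ghz = 1_000, 2.0

# Peak simple operations per nanosecond, assuming every unit is busy.
cpu_ops_per_ns = cpu_cores * cpu_clock_ghz   # 400
gpu_ops_per_ns = gpu_aus * gpu_clock_ghz     # 2000

ratio = gpu_ops_per_ns / cpu_ops_per_ns
print(ratio)  # 5.0: many slow units outpace a few fast ones on parallel work
```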

A thread is the smallest unit of work that can be performed on a GPU: a sequence of instructions. A Thread Block is a group of threads that gets assigned to an SM. A Thread Block is divided into Warps, which are fixed-size groups of threads, typically 32. All threads in a warp execute the same instruction at the same time, but on different data.
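The way threads map onto data can be simulated on the CPU. The sketch below mimics the standard GPU indexing pattern, where each thread derives a global index from its block number and its position within the block (on a real GPU the two loops would not exist; all the threads would run at once):

```python
BLOCK_DIM = 4      # threads per block (real blocks are often 128-1024 threads)
N = 10             # total elements to process

num_blocks = (N + BLOCK_DIM - 1) // BLOCK_DIM  # ceiling division
out = [0] * N
for block_idx in range(num_blocks):             # blocks are assigned to SMs
    for thread_idx in range(BLOCK_DIM):         # threads execute in warps
        i = block_idx * BLOCK_DIM + thread_idx  # global thread index
        if i < N:                               # guard: last block is partial
            out[i] = i * i                      # each thread handles one element
print(out)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

The bounds check `i < N` is needed because the number of threads (blocks × block size) usually rounds up past the data size.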

GPUs have their own memory. Each SM has its own L1 cache, and all the SMs share an L2 cache. To perform computation on the GPU, you first have to transfer the data from main memory to the GPU's memory. This is typically done over the PCIe bus by issuing a command to the DMA controller. After the GPU has finished the computation, you have to transfer the results back to main memory.
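The overall pattern looks like the following sketch. The functions here are stand-ins I've made up for illustration: on a real system, the copies go over PCIe via the DMA controller (e.g. `cudaMemcpy` in CUDA), and the "device memory" is the GPU's own DRAM:

```python
device_memory = {}  # stand-in for the GPU's DRAM

def copy_host_to_device(name, host_data):
    device_memory[name] = list(host_data)       # stand-in for a DMA transfer

def run_kernel(name):
    device_memory[name] = [x * 2 for x in device_memory[name]]  # GPU-side work

def copy_device_to_host(name):
    return list(device_memory[name])            # stand-in for the return transfer

host_data = [1, 2, 3]
copy_host_to_device("buf", host_data)  # 1. main memory -> GPU memory
run_kernel("buf")                      # 2. compute on the GPU
result = copy_device_to_host("buf")    # 3. GPU memory -> main memory
print(result)  # [2, 4, 6]
```

Because these transfers are slow relative to the GPU's compute speed, real programs try to move data once and reuse it across many operations.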

What is a device driver?

A device driver is a piece of code that tells the operating system how to communicate with hardware devices. On Linux, drivers are implemented as kernel modules that get loaded into the operating system.



