With the advent of large language models such as ChatGPT and “artistic” generative AI models such as DALL-E, artificial intelligence is on everyone’s mind. Companies are desperate to develop the next big product, and a lot of them are attempting to do this by throwing money and hardware at the problem. For example, Meta recently announced that they would be dedicating $10 billion to computing infrastructure that would allow them to train an AGI (artificial general intelligence) model. However, I often find that these announcements and the resulting articles don’t do a great job of explaining why companies think this is a good investment, or whether there are other ways they could boost their AI capabilities. In addition, these articles never mention the environmental impact of this sort of hardware in terms of carbon emissions or its contribution to the world sand shortage. Today, I want to focus on the first issue; I will cover the environmental impact in more detail later.
Neural Network Refresher
Those who read my inaugural article may remember my diagram of a neural network:
Each of the labelled circles is called a node. To train the network, you give it data in the form of a matrix. A matrix is simply a set of numbers arranged in rows and columns, like so:
This matrix can be any sort of data we want to use to train the network. Common examples would be the pixels of an image, or a spreadsheet-like group of numbers. For example, the rows could represent the genes in your DNA, and the columns could represent different diseases. Then, the “1” in the matrix above would represent the effect of a medicine on Gene A if you had Disease A, while the “2” would represent the effect of the medicine on Gene A if you had Disease B. At each node, the model will perform a mathematical operation on the data, then forward it to the next node. By the end, the numbers have been transformed into whatever you want them to be. I recommend re-reading my first article linked above for a refresher on this. Once you have a basic understanding of what the neural network is actually doing, it is much easier to understand how hardware can make an impact.
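The matrix from the figure isn’t reproduced here, but a small stand-in (with made-up numbers) might look like this in NumPy, with genes as rows and diseases as columns:

```python
import numpy as np

# A made-up genes-by-diseases matrix of the kind described above: each row is
# a gene, each column is a disease, and each entry is the measured effect of
# a medicine for that gene/disease pair. The values are purely illustrative.
effects = np.array([
    [1, 2],   # Gene A: effect under Disease A, effect under Disease B
    [3, 4],   # Gene B
    [5, 6],   # Gene C
])

print(effects.shape)   # (3, 2) -> 3 genes x 2 diseases
print(effects[0, 1])   # 2: Gene A under Disease B
```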
CPU vs GPU - What’s the Difference?
There are two main processor architectures to consider when talking about neural network training - the central processing unit (CPU) and the graphics processing unit (GPU).
The CPU is commonly referred to as the “brain” of the computer, and it performs most of the commands that make your computer useful, like logical and arithmetic operations, memory retrieval, and information input/output. CPUs are latency-optimized, which means they are designed to perform individual tasks quickly, but largely one at a time. For normal computer operations, this is desirable - anyone who has waited 30+ seconds for this article to load knows how annoying a slow computer is. However, it is less desirable for training a neural network. If we want to train the one in the picture above, we would first get the data we need from memory, then give it to node A. Node A would perform its math, then the CPU would store that result and give the original data to node B. After nodes A, B, and C are done, the CPU then has to find all of that data and give it to node D. The larger the network, the longer this takes. To train a model as big as ChatGPT, it could take days for a single CPU to go through the model just once.
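To make that sequential bottleneck concrete, here is a toy forward pass written the way a single CPU core would execute it, one node at a time. The node names and weights are hypothetical, matching the diagram only in spirit:

```python
import numpy as np

def node(inputs, weights, bias):
    """One node: a weighted sum of its inputs followed by a ReLU activation."""
    return max(0.0, float(np.dot(inputs, weights) + bias))

x = np.array([0.5, -1.2, 3.0])  # input data fetched from memory

# Each step has to wait for the previous one to finish.
a = node(x, np.array([0.1, 0.4, -0.2]), 0.05)
b = node(x, np.array([-0.3, 0.2, 0.1]), 0.00)
c = node(x, np.array([0.7, -0.1, 0.3]), -0.10)

# Node D cannot start until A, B, and C have all been computed and stored.
d = node(np.array([a, b, c]), np.array([0.5, 0.5, 0.5]), 0.00)
print(d)
```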
GPUs, as the name suggests, were originally developed for rendering images or 3D scenes. Unlike CPUs, they are bandwidth-optimized; although they are slower to perform individual tasks or fetch data from memory, they can perform many tasks at once. This parallelization makes sense for rendering images - if you break the picture into blocks, those chunks won’t overlap and can be drawn on the screen at the same time. It is also useful for training neural networks - although it might take longer to get the data we need from memory, we can then give that data to nodes A, B, and C all at the same time. This speeds up training considerably for big models. When companies announce they are investing a lot of money in hardware, it usually means they are buying a lot of GPUs. This allows them to train bigger models with more data, which lets them get better results without having to design better models. Intel has a good overview of CPUs vs GPUs, and while I usually steer away from Quora, I found this answer particularly insightful.
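The same toy layer from the sketch above can be rewritten as a single matrix multiplication, which is exactly the kind of operation a GPU (or any vectorized library) can compute for every node at once. The weights here are the same hypothetical numbers as before:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])

# Stack the weights for nodes A, B, and C into one matrix.
W = np.array([
    [0.1,  0.4, -0.2],   # node A
    [-0.3, 0.2,  0.1],   # node B
    [0.7, -0.1,  0.3],   # node C
])
b = np.array([0.05, 0.0, -0.1])

# One matrix-vector product computes all three nodes "at the same time".
hidden = np.maximum(0.0, W @ x + b)

# Node D then consumes all three outputs in a single dot product.
d = max(0.0, float(np.dot(np.array([0.5, 0.5, 0.5]), hidden)))
print(d)
```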
GPUs are great for advancing machine learning, but they are not perfect. NVIDIA controls almost all of the GPU market, and companies don’t like being reliant on them. In addition, GPUs are great for parallelizing tasks, but not all math can be parallelized, and the matrix multiplications that appear at least once in the majority of neural network nodes are still expensive. This led companies to develop their own hardware that is optimized for the type of data used to train neural networks, usually called tensors. Google’s tensor processing unit (TPU) is one of the better-known examples, and works best on neural networks that use images.
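To get a feel for why dedicated matrix hardware is appealing, here is a back-of-the-envelope count of the multiply-add operations in a dense layer; the sizes and layer count below are invented, but the scaling is the point:

```python
# Rough count of multiply-accumulate operations in a dense layer: one
# multiply-add per weight. The sizes below are made up for illustration.
inputs, outputs = 4096, 4096
macs_per_layer = inputs * outputs        # ~16.8 million multiply-adds
layers = 96                              # large models stack many such layers

total = macs_per_layer * layers
print(f"{total:,} multiply-adds for one forward pass of a single example")
```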
Alternatively, some companies have looked to adapt preexisting hardware for neural network training, just like they did with GPUs. Field Programmable Gate Arrays (FPGAs) are a type of hardware that can perform mathematical operations quickly and with low power consumption. There has been some noise about replacing GPUs with them, but in my personal experience, that can be difficult to do. The skill set for programming FPGAs is completely different from the traditional machine learning skill set, so multiple people have to work together. Changing the structure of the network, which is commonly done during experimentation, can also be difficult. However, FPGAs are lightweight and sturdy, which makes them a good fit for inference in robotics applications. In that case, the machine learning engineer would train the network using traditional GPUs, then work with the FPGA engineer to port that network to the FPGA, which would then go on the robot/spacecraft/etc.
What about software improvements?
However, hardware is not the only way to solve this problem. While pushes for large infrastructure systems are currently the best way to train massive models such as ChatGPT, the majority of applications do not require that much computational power, and the majority of companies and universities do not have access to that level of resources. In addition, once a model is done training, significantly less computing power is needed, since it is only being called on to do inference (make guesses). It doesn’t make sense to require a huge, memory-intensive GPU every time we want to ask ChatGPT to write a poem; it would be much better if we could run these models on less powerful computers. This is especially important in the field of robotics - obviously, we can’t put a massive desktop computer on a firefighting robot or an emergency rescue drone.
There are a few ways to reduce the computational power, memory, or both that are needed during training and inference. The first is to pick an architecture and optimize for it. There are models designed to take advantage of a CPU’s strengths, and these can train faster on cheap CPUs than on expensive GPUs. Alternatively, you can train your model using GPUs, then use tools such as ONNX to allow it to run with different operating systems, programming languages, and libraries. This is a complicated topic that warrants its own article, but for now you can think of ONNX as a translator - it changes parts of the network into different formats so they run more efficiently or on less powerful hardware.
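As a rough sketch of what that translation step looks like in practice, here is a minimal export of a PyTorch model to ONNX and a run through ONNX Runtime; the toy model and file name are placeholders for whatever you actually trained:

```python
import numpy as np
import torch
import torch.nn as nn
import onnxruntime as ort

# A stand-in for whatever model you actually trained.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
model.eval()

# Export the trained network to the ONNX format.
dummy_input = torch.randn(1, 4)  # an example input with the right shape
torch.onnx.export(model, dummy_input, "model.onnx",
                  input_names=["input"], output_names=["output"])

# Elsewhere (e.g. on a less powerful machine), run the exported file with
# ONNX Runtime instead of the full training framework.
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {"input": np.random.randn(1, 4).astype(np.float32)})
print(outputs[0])
```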
Another way to increase the efficiency of models is to dig into the math behind them and make that more efficient. Quantizing a model means working to get the same level of accuracy while reducing the numerical precision of the model, such as using 3.14 for pi instead of 3.14159. Sparsification, or pruning, is a process in which a network is examined for redundant information and those nodes or parts of the data are removed. Companies such as NeuralMagic provide these services, taking powerful open source models and sparsifying and quantizing them while maintaining as much accuracy as possible.
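Here is a hedged sketch of both ideas using PyTorch’s built-in utilities on a toy model (this is not how NeuralMagic’s own tooling works, just the underlying concepts):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A toy model standing in for a real trained network.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Pruning / sparsification: zero out the 30% of first-layer weights with the
# smallest magnitude, on the theory that they carry redundant information.
prune.l1_unstructured(model[0], name="weight", amount=0.3)
prune.remove(model[0], "weight")  # make the pruning permanent

# Quantization: store linear-layer weights as 8-bit integers instead of
# 32-bit floats, trading a little precision for a smaller, faster model.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
print(quantized)
```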
All of these software modifications can offer improvements, but they are not as popular as increasing the amount of hardware because i) they often work best during the inference stage, not the time- and memory-intensive training stage, and ii) they require knowledge and time, while buying hardware is easy.