Why are GPUs better for deep learning?
Over the past decade, GPUs have come into the picture more and more frequently in fields like HPC (high-performance computing) and, most popularly, gaming. GPUs have improved year after year and are now capable of some incredibly demanding work, but in the past few years they have been attracting even more attention because of deep learning.
Deep learning models spend a large amount of time in training, and even powerful CPUs are not efficient enough to handle so many computations at once. This is the area where GPUs simply outperform CPUs, thanks to their parallelism. But before diving deeper, let's first understand a few things about the GPU.
What is a GPU?
A GPU, or Graphics Processing Unit, is like a mini version of an entire computer, but one dedicated to a specific kind of task, unlike a CPU, which carries out many different tasks at the same time. A GPU comes with its own processor embedded on its own board, coupled with VRAM (video RAM), plus a proper thermal design for ventilation and cooling.
In the term "Graphics Processing Unit", "graphics" refers to rendering an image at specified coordinates in 2D or 3D space. A viewport or viewpoint is the viewer's perspective on an object, which depends on the type of projection used. Rasterisation and ray tracing are two common ways of rendering 3D scenes, and both are based on a type of projection called perspective projection. What is perspective projection?
In short, it is the way an image is formed on a view plane or canvas: parallel lines converge towards a point called the "center of projection", and as an object moves away from the viewpoint it appears smaller, exactly as our eyes perceive the real world. This also helps convey depth in an image, which is why it produces realistic results.
Moreover, GPUs also process complex geometry, vectors, light sources and illumination, textures, shapes, and so on. Now that we have a basic idea of what a GPU is, let us understand why it is so heavily used for deep learning.
Why are GPUs better for deep learning?
One of the most admired characteristics of a GPU is its ability to compute many operations in parallel. This is where the concept of parallel computing kicks in. A CPU generally completes its work in a sequential manner. It can be divided into cores, and each core takes on one task at a time. Suppose a CPU has 2 cores: then two different tasks can run on these two cores, thereby achieving multitasking.
Even so, on each core those processes still execute in a serial fashion.
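To make "sequential" concrete, here is a minimal plain-C sketch (the array contents and size are made up purely for illustration): a single CPU thread walks through the data one element after another, so the total time grows with the number of elements.

#include <stdio.h>

#define N 8   /* illustrative size only */

int main(void) {
    float a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[N] = {8, 7, 6, 5, 4, 3, 2, 1};
    float c[N];

    /* One core, one loop: element i+1 is only processed after element i. */
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[0] = %.1f, c[N-1] = %.1f\n", c[0], c[N - 1]);
    return 0;
}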
This doesn't mean that CPUs aren't good enough. In fact, CPUs are really good at handling many different kinds of work at once: running the operating system, handling spreadsheets, playing HD video, extracting large zip files, all at the same time. These are things a GPU simply cannot do.
Where does the difference lie?
As discussed previously, a CPU is divided into a handful of cores so that it can take on multiple tasks at the same time, whereas a GPU has hundreds or thousands of cores, all dedicated to a single task. The computations they run are simple, performed very frequently, and independent of each other. Both CPUs and GPUs store frequently needed data in their respective cache memories, following the principle of "locality of reference".
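As a rough sketch of what "thousands of cores, each doing one simple computation" looks like in code, here is a minimal CUDA kernel (the name vector_add and the array layout are made up for illustration): each GPU thread adds exactly one pair of elements, independently of every other thread.

/* Each thread computes one output element; thousands of them can run at once. */
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   /* this thread's element index */
    if (i < n) {                                     /* guard against surplus threads */
        c[i] = a[i] + b[i];
    }
}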
There are many applications and games that can take advantage of GPUs for execution. The idea is to make some parts of the task or application code parallel, but not the entire process, because most of a task's steps have to be executed sequentially anyway. For example, logging into a system or application does not need to be parallelised.
When there is a part of the execution that can be done in parallel, it is simply shifted to the GPU for processing while the sequential part executes on the CPU at the same time; then the two parts of the task are combined again.
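Continuing the sketch on the host side (assuming the vector_add kernel above is defined in the same file; sizes are illustrative and error handling is omitted), the CPU handles the sequential steps, hands the parallel step to the GPU, and then collects the result back.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    const int n = 1 << 20;                     /* illustrative problem size */
    size_t bytes = n * sizeof(float);

    /* Sequential part on the CPU: prepare the input data. */
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes), *c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    /* Copy the inputs to the GPU. */
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);

    /* Parallel part on the GPU: launch enough threads to cover all n elements. */
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    /* Copy the result back and continue sequentially on the CPU. */
    cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(a); free(b); free(c);
    return 0;
}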
In the GPU market there are two main players, AMD and Nvidia. Nvidia GPUs are widely used for deep learning because they have extensive support in terms of forums, software, drivers, CUDA, and cuDNN. So in terms of AI and deep learning, Nvidia has been the pioneer for a long time.
Neural networks are said to be embarrassingly parallel, which means the computations in a neural network can easily be executed in parallel because they are independent of each other.
Computations such as calculating the weights and activation functions of each layer, as well as backpropagation, can be carried out in parallel; there are many research papers available on this as well.
Nvidia GPUs also come with specialized cores known as CUDA cores, which help accelerate deep learning.
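As a small illustration of that independence, here is a hedged sketch of a ReLU activation kernel (the kernel name is made up for this example): every neuron's output depends only on its own input, so all of them can be computed at the same time on the GPU.

/* ReLU activation for one layer: out[i] = max(0, in[i]) for every neuron i. */
__global__ void relu_forward(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] > 0.0f ? in[i] : 0.0f;   /* independent of every other i */
    }
}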
What is CUDA?
CUDA stands for "Compute Unified Device Architecture" and was launched in 2007. It is a way to achieve parallel computing and get the most out of your GPU's power in an optimized fashion, which results in much better performance when executing tasks.
The CUDA toolkit is a complete package that provides a development environment for building applications that make use of GPUs. The toolkit mainly contains a C/C++ compiler, a debugger, and libraries, and the CUDA runtime has its own drivers so that it can communicate with the GPU. CUDA also defines a programming language, an extension of C/C++, made specifically for instructing the GPU to perform a task; this is also known as GPU programming.
Below is a simple hello world program, just to get an idea of what CUDA code looks like.
/* hello world program in CUDA */
#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>

/* __global__ marks a kernel: a function that runs on the GPU */
__global__ void demo() {
    printf("hello world!, my first cuda program\n");
}

int main() {
    printf("From main!\n");
    demo<<<1, 1>>>();        /* launch the kernel with 1 block of 1 thread */
    cudaDeviceSynchronize(); /* wait for the kernel so its output is flushed */
    return 0;
}
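Assuming the CUDA toolkit is installed and the code is saved as, for example, hello.cu (the file name is just for illustration), it can be compiled and run with the nvcc compiler that ships with the toolkit:

nvcc hello.cu -o hello
./hello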
Output: the message from main() is printed first, followed by the hello-world message from the kernel once it has finished.
What is cuDNN?
cuDNN is a GPU-optimized neural network library that can take full advantage of Nvidia GPUs. It provides implementations of convolution, forward and backward propagation, activation functions, and pooling. It is a must-have library, without which you cannot use the GPU for training neural networks.
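To give a feel for the library, here is a hedged sketch of applying a ReLU activation through cuDNN (this is only a fragment, not a full program: error checking is elided, the tensor shape values are made up, and d_in / d_out are assumed to be float buffers already allocated on the GPU).

#include <cudnn.h>

/* Create a cuDNN handle: the context all cuDNN calls go through. */
cudnnHandle_t handle;
cudnnCreate(&handle);

/* Describe the data as a 4D tensor: batch x channels x height x width. */
cudnnTensorDescriptor_t desc;
cudnnCreateTensorDescriptor(&desc);
cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, 1, 64, 28, 28);

/* Describe the activation (ReLU) and run it forward on the GPU. */
cudnnActivationDescriptor_t act;
cudnnCreateActivationDescriptor(&act);
cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

float alpha = 1.0f, beta = 0.0f;
cudnnActivationForward(handle, act, &alpha, desc, d_in, &beta, desc, d_out);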
A big leap with Tensor cores!
Back in 2018, Nvidia launched a new lineup of GPUs, the 2000 series. Also called RTX, these cards come with tensor cores dedicated to deep learning, a feature first introduced with the Volta architecture.
Tensor cores are special cores that perform the operation D = A x B + C on 4 x 4 matrices: A and B are FP16 matrices, while the addend C and the output D can be 4 x 4 FP16 or FP32 matrices; the multiplication takes half-precision inputs, but the result is accumulated at full precision.
Note: "FP" stands for floating point; see a dedicated post on floating point and precision to learn more.
As stated by Nvidia, the new generation of tensor cores introduced with the Volta architecture is much faster than the CUDA cores of the Pascal architecture. This gave a huge boost to deep learning.
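On the programming side, CUDA exposes these units to kernels through the WMMA API. Below is a minimal hedged sketch: the API works on 16 x 16 x 16 tiles built from the 4 x 4 hardware operations, it requires a tensor-core GPU (compiled for compute capability 7.0 or newer), and the kernel name and leading dimensions are illustrative only.

#include <mma.h>
using namespace nvcuda;

/* One warp computes a 16x16 tile of D = A * B + C using tensor cores:
   A and B are FP16 inputs, the accumulator C/D is FP32. */
__global__ void wmma_tile(const half *A, const half *B, const float *C, float *D) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, A, 16);                        /* load FP16 inputs  */
    wmma::load_matrix_sync(b_frag, B, 16);
    wmma::load_matrix_sync(acc_frag, C, 16, wmma::mem_row_major); /* load FP32 addend  */

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);           /* D = A * B + C     */

    wmma::store_matrix_sync(D, acc_frag, 16, wmma::mem_row_major);
}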
At the time of writing, Nvidia has announced the latest 3000 series of its GPU lineup, which comes with the Ampere architecture. With it, they improved the performance of the tensor cores by 2x and introduced new precision formats such as TF32 (tensor float 32) and FP64 (floating point 64). TF32 works like FP32 in code but with speedups of up to 20x; as a result of all this, Nvidia claims that the inference and training times of models will be reduced from weeks to hours.
AMD vs Nvidia
AMD GPUs are decent for gaming, but as soon as deep learning comes into the picture, Nvidia is simply way ahead. This does not mean that AMD GPUs are bad; it comes down to software optimization and drivers that are not updated as actively. On the Nvidia side, the drivers are better and updated frequently, and on top of that, CUDA and cuDNN help accelerate the computation.
Well-known libraries such as TensorFlow and PyTorch support CUDA, which means even entry-level GPUs of the GTX 1000 series can be used. On the AMD side, there is very little software support for its GPUs. On the hardware side, Nvidia has introduced dedicated tensor cores; AMD has ROCm for acceleration, but it is not as good as tensor cores, and many deep learning libraries do not support ROCm. Over the past few years, no big leap has been seen there in terms of performance.
Due to all these points, Nvidia simply excels in deep learning.
This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. When reposting, please include a link to the original article along with this notice.
Original article: https://www.flyai.com/article/715