Understanding the Overheads of Launching CUDA Kernels



  • Understanding the Overheads of Launching CUDA Kernels
  • 1. INTRODUCTION
  • 2. MICRO-BENCHMARKS USED IN OUR STUDY
  • 3. OVERHEAD OF LAUNCHING KERNELS
    • 3.1. Experimental Environment
    • 3.2. Launch Overhead in Small Kernels
    • 3.3. Launch Overhead in Large Kernels
    • 3.4. Other Launch Overheads
    • 3.5. Conclusion
  • 4. REFERENCES
  • Understanding the Overheads of Launching CUDA Kernels
  • 1. Motivation
  • 2. Background
  • 3. Micro-benchmark
  • 4. Launch Overhead in Small Kernels
  • 5. Launch Overhead in Large Kernels
  • 6. Other Overheads
  • 7. Conclusion
  • 8. References
  • References

Understanding the Overheads of Launching CUDA Kernels

https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-abst.pdf

Lingqi Zhang^1, Mohamed Wahib^2, Satoshi Matsuoka^{1,3}
zhang.l.ai@m.titech.ac.jp
^1 Tokyo Institute of Technology, Dept. of Mathematical and Computing Science, Tokyo, Japan
^2 AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory (RWBC-OIL)
^3 RIKEN Center for Computational Science, Hyogo, Japan


1. INTRODUCTION

GPU computing is becoming increasingly important in general-purpose computing, and many scientific areas rely on the performance of GPUs. Several classes of algorithms require device-wide synchronization through barriers. However, the thousands of threads running on independent SMs (Streaming Multiprocessors) make this difficult. Previous research [3] proposed two kinds of device-wide barriers: software barriers and implicit barriers. Recently, Nvidia introduced new methods for device-wide barriers, namely grid synchronization and multi-grid synchronization [1]. Given the possibility of achieving high performance at lower occupancy [2], we envision using a single kernel with several barriers instead of multiple kernels acting as implicit barriers. But first we need to understand the penalty of using the different kinds of barriers, i.e. the new explicit barrier functions versus the implicit barrier.


Additionally, Nvidia has introduced new launch functions (e.g. cooperative launch and multi-device cooperative launch) to support grid synchronization and multi-grid synchronization [1], i.e. the new explicit barrier functions. To use these features, programmers must switch to the new launch functions, yet no prior work has studied the penalty of doing so.


In this research we use micro-benchmarks to understand the overheads hidden in the launch functions, to identify the cases in which launching additional kernels is not profitable, and to better understand the differences between the various CUDA launch functions.

2. MICRO-BENCHMARKS USED IN OUR STUDY

Throughout this abstract, we use the following terminologies:

  • Kernel Latency: Total latency to run kernels, starting when the CPU thread launches a kernel and ending when the CPU thread notices that the kernel has finished.
  • Kernel Overhead: Latency that is not related to kernel execution.
  • Additional Latency: Given that the CPU thread has just called a kernel launch function, the additional latency is the extra latency incurred by launching one more kernel.
  • CPU Launch Overhead: Latency of the CPU calling a launch function.
  • Small Kernel: A kernel whose execution time is not the main contributor to the additional latency.
  • Large Kernel: A kernel whose execution time is the main contributor to the additional latency.

Currently, researchers tend to use either the execution time of an empty kernel or the execution time of a CPU kernel launch function as the overhead of launching a kernel. Although these methods may be adequate when considering a single GPU kernel, they are not sufficient in the multi-kernel case. We therefore focus on the overhead of launching an additional kernel.

We use the sleep instruction to control the kernel latency. The sleep instruction is only available on the Volta architecture. The instruction is very lightweight: in our experiments, the kernel overhead remains unaffected no matter how many times we repeat it.

We use several sleep instructions to compose a wait unit. Varying the number of wait units inside a single kernel produces different, well-defined kernel execution latencies.

This micro-benchmark has two variables:

  • the number of times the kernel is launched;
  • the number of wait units inside a single kernel. Within a single experiment, the number of wait units is fixed.

To test the overhead of small kernels, we use a null kernel (no code inside, i.e. zero wait units) as the representative small kernel. In this situation, the overhead can be computed with formula 1.

O = \frac{Latency_{i0} - Latency_{j0}}{i - j} \tag{1}

O represents the overhead; i and j are the numbers of launch-function calls; the subscript 0 indicates zero wait units inside a kernel.
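As a sanity check of formula 1, the following sketch models null-kernel latency with hypothetical numbers (the constant `O_TRUE` is an assumed value, not a measurement) and shows that the formula recovers the per-launch overhead:

```python
# Hypothetical latency model for null kernels (zero wait units):
# each launch costs a fixed per-kernel overhead O_TRUE.
O_TRUE = 8.0  # assumed per-launch overhead in microseconds (illustrative)

def latency(num_launches, wait_units=0):
    # Null-kernel latency: pure launch overhead, no execution time.
    assert wait_units == 0
    return num_launches * O_TRUE

# Formula (1): O = (Latency_i0 - Latency_j0) / (i - j)
i, j = 100, 10
overhead = (latency(i) - latency(j)) / (i - j)  # recovers O_TRUE = 8.0
```

Subtracting two launch counts cancels any fixed measurement cost shared by both runs, which is why the formula uses a difference rather than a single measurement.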

To test the overhead of a large kernel (one whose execution time dominates the additional latency), we use kernel fusion to unveil the overhead hidden in the kernel latency. The details of this method are shown in Figure 1. In this situation, the overhead can be computed with formula 2.

O = \frac{Latency_{ij} - Latency_{ji}}{i - j} \tag{2}

O represents the overhead. The two subscripts of Latency carry different information: in Latency_{ij}, i is the number of launch-function calls and j is the number of wait units repeated inside each kernel.
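The kernel-fusion idea behind formula 2 can be illustrated numerically (all constants below are assumptions for illustration): Latency_{ij} and Latency_{ji} execute the same total work of i*j wait units, so their difference isolates the per-launch overhead.

```python
# Hypothetical model: each launch pays a per-kernel overhead O_TRUE plus
# WAIT_US microseconds per wait unit executed inside the kernel.
O_TRUE = 2.5   # assumed per-kernel overhead (us), illustrative
WAIT_US = 1.0  # one wait unit = sleep 1000 ns

def latency(num_launches, wait_units):
    return num_launches * (O_TRUE + wait_units * WAIT_US)

# Formula (2): O = (Latency_ij - Latency_ji) / (i - j).
# Both measurements execute i*j wait units in total, so the work cancels
# and only (i - j) copies of the per-launch overhead remain.
i, j = 20, 5
overhead = (latency(i, j) - latency(j, i)) / (i - j)  # recovers O_TRUE = 2.5
```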

Figure 1: Using kernel fusion to test the execution overhead

A wait unit is a sleep call inside the kernel; each call sleeps for 1 µs (see the micro-benchmark sample code). The sleep is built on the nanosleep.u32 assembly instruction provided by the Volta architecture. Its cost is extremely low, and our tests confirm that the sleep duration (i.e. the kernel execution time) does not affect the other overheads.

Table 1: Environment Information

Figure 2: Comparison of null kernel overhead for different launch functions

3. OVERHEAD OF LAUNCHING KERNELS

3.1. Experimental Environment

Since we use the sleep instruction, which CUDA provides only on the Volta platform, as our tool for analyzing launch overhead, we conduct experiments only on the V100 GPU. Table 1 shows the environment information. Each reported result is the average of 100 runs.

3.2. Launch Overhead in Small Kernels

We found the CPU launch overhead to be nearly equal to the additional latency of launching an additional kernel. We therefore also plot the latency of the launch function in Figure 2.

Allowing for measurement error, it is relatively safe to conclude that the time the CPU spends calling the launch function is the main source of latency among all the steps of a kernel launch.

3.3. Launch Overhead in Large Kernels

In a single node, we use 5 wait units (sleep 5000 ns). Figure 3 shows that the additional latency is larger than the CPU launch overhead, which means the CPU launch overhead alone does not explain the additional latency. Using the kernel fusion method, we find that an execution overhead does exist.

In this work we only establish that this kind of overhead exists. The relationship between the execution overhead, the complexity of the kernel, and the launch parameters is left as future work. In real-world workloads, the actual execution overhead may be larger than what we report here.

3.4. Other Launch Overheads

We observe that, apart from the CPU launch overhead and the GPU execution overhead, there is remaining overhead.

We use formula 3 to compute this remaining overhead.

O_{Other} = O_{Total} - (O_{CPU\,Launch\,Kernel} + O_{Execution}) \tag{3}

where each O represents an overhead.

The result is shown in Figure 4. Although this overhead appears large, it does not play an important role when launching a large number of kernels.

Figure 3: Large kernel launch overhead of different launch methods

Figure 4: Comparison of different overheads in different launch functions

When a kernel is launched once, rather than as one of a long sequence of kernels, there is, besides the execution overhead and the launch overhead, a third "other" overhead, which can even exceed the launch overhead. In frequent-launch scenarios this overhead can be ignored.

3.5. Conclusion

In this work, we use micro-benchmarks to analyze the launch overhead behaviour of different launch functions, for both small kernels and large kernels. The results reveal two different kinds of kernel overhead, plus an additional overhead that only stands out when a single kernel is launched. The CPU launch overhead matters mainly for small kernels, while the execution overhead matters mainly for large kernels. We conclude that launching a new kernel is only profitable when the performance improvement surpasses the overhead of the new kernel. Additionally, we observed that Cooperative Multi-Device Launch is slightly slower than Cooperative Launch, which is in turn slightly slower than Traditional Launch; this extra latency is trivial compared with the benefit of grid synchronization. This research focuses on the V100 GPUs in a DGX-1, but we observe similar behaviour on the P100 platform.

4. REFERENCES

[1] CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[2] Better Performance at Lower Occupancy, https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf
[3] Inter-block GPU communication via fast barrier synchronization, https://ieeexplore.ieee.org/document/5470477

Understanding the Overheads of Launching CUDA Kernels

https://www.hpcs.cs.tsukuba.ac.jp/icpp2019/data/posters/Poster17-moc.pdf

Lingqi Zhang^1, Mohamed Wahib^2, Satoshi Matsuoka^{1,3}
zhang.l.ai@m.titech.ac.jp, mohamed.attia@aist.go.jp, matsu@is.titech.ac.jp
^1 Tokyo Institute of Technology, Dept. of Mathematical and Computing Science, Tokyo, Japan
^2 AIST-Tokyo Tech Real World Big-Data Computation Open Innovation Laboratory
^3 RIKEN Center for Computational Science, Hyogo, Japan

1. Motivation

  • Nvidia GPUs can run 10,000s of threads on independent SMs (Streaming Multi-processors)

Not ideal for device-wide barriers

  • Method for device-wide barriers in GPUs

Software barriers (example in [1])
Implicit barriers: launching separate kernels (impacts performance)

  • Alternative ways to achieve the same goal

Grid synchronization or multi-grid synchronization [2]
Higher performance might come from lower occupancy [3]

  • Implicit barrier (additional kernels) vs. single kernel
  • Question:

When not to launch an additional kernel?
What is the penalty of using different kinds of barriers in CUDA?


2. Background

  • Different kinds of kernel launch methods.

Traditional Launch
Cooperative Launch (CUDA 9): Introduced to support grid synchronization
Cooperative Multi-Device Launch (CUDA 9): Introduced to support multi-grid synchronization

  • Sleep instruction: waits a specified number of nanoseconds inside a GPU kernel.

3. Micro-benchmark

  • Definition

Kernel Latency: Total latency to run kernels, starting when the CPU thread launches a kernel and ending when the CPU thread notices that the kernel has finished.
Kernel Overhead: Latency that is not related to kernel execution.
Additional Latency: Given that the CPU thread has just called a kernel launch function, the additional latency is the extra latency incurred by launching one more kernel.
CPU Launch Overhead: Latency of the CPU calling a launch function.
Small Kernel: Kernel execution time is not the main reason for the additional latency.
Large Kernel: Kernel execution time is the main reason for the additional latency.

Figure 1: Sample code of the micro-benchmark, which calls the launch function 5 times and repeats a wait unit (sleep 1000 ns) 10 times.

For convenience, the kernels used in this experiment take no parameters. Since real-world kernels usually have several parameters, the actual launch overhead should be larger than what is measured here.

  • Additional wait units (sleep 1000 ns) do not increase the kernel overhead (within measurement error)

Figure 2: Gradient of latency per wait unit (sleep 1000 ns) in a single kernel

  • Test overhead in small kernels

Method: use a null kernel (no code inside) to represent a Small Kernel; the wait-unit count is 0, i.e. the kernel performs no work.

  • Test overhead in large kernels

Method: Using kernel fusion to unveil the overhead.


Figure 3: Using kernel fusion to test overhead hidden in kernel execution

4. Launch Overhead in Small Kernels

Figure 4: Comparison of null kernel overhead using three different launch functions that employ different types of barriers (left), and Cooperative Multi-Device Launch across different devices (right).

CPU Launch Overhead is the main overhead in Small Kernel.

5. Launch Overhead in Large Kernels

Figure 5: Comparison of Large Kernel Overhead among different launch functions (left), Cooperative Multi-Device Launch among different devices (right).

  • CPU launch overhead is recorded to show that it is not the dominant factor here (the measurement is not as precise as in the "Small Kernel" section)
  • GPU execution overhead does exist.

6. Other Overheads

An empty kernel takes about 8 µs, which is still longer than the overheads we reported.

Figure 6: Comparison of different overheads in different launch functions

The Other Overhead stands out for a single kernel (it is larger than the two kinds of overhead reported above).


7. Conclusion

  • Main overheads:

Small Kernels: CPU Launch Overhead
Large Kernels: GPU Execution Overhead
Single Kernel: Other Overhead

  • Overhead of different launch functions

Cooperative Multi-Device Launch > Cooperative Launch > Traditional Launch

  • Launch a new kernel only when the performance improvement surpasses the overhead of the new kernel.

8. References

[1] Inter-block GPU communication via fast barrier synchronization, https://ieeexplore.ieee.org/document/5470477
[2] CUDA C++ Programming Guide, https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[3] Better Performance at Lower Occupancy, https://www.nvidia.com/content/gtc-2010/pdfs/2238_gtc2010.pdf

