文章目录

Memory
- shared memory
- global memory
- Transfer Data
- 异步预取
Threads
- thread block
- warp
GPU 性能
- 查看性能
- 测试性能
CUDA 流

Memory

在这里插入图片描述

GPU 在 CUDA Mode 下，有多种存储类型：

register：
- 位于 SM 上，共 $8192$ 个。作用范围是 thread，读取速度 $1$ - cycle latency
- R/W dependencies：写后读，至少 $22$ - cycles 才能稳定（如果同时有 $192$ 个线程，也就是 $6$ 个 warps，那么运算速度为 $24$ 转，此时可以忽略读写依赖）
local memory：
- 位于 global memory 上，作用范围是 thread，读取速度 $200$ - cycle latency
- 模拟 register，如果 register 不够用，编译器自动设置。要注意控制各个线程中的 register 的使用数量。
shared memory：
- 位于 shared memory 上，作用范围是 block，读取速度 $1$ - cycle latency
- bank conflicts：库冲突，读取延迟变大
global memory：
- 位于 global memory 上，作用范围是 grid，读取速度 $200$ - cycle latency
- coalescing accessing：合并访问，可隐藏读取延迟
constant and texture memories：
- 位于 global memory 上，作用范围是 grid，读取速度 $200$ - cycle latency
- 由硬件自动调度到对应的 caches 里，hit & miss（如果 hit，那么就如同 register 一样）

shared memory

在古早的 GPU 上，shared memory 只有 $16$ 个 banks，因此一个 warp 里的 $32$ 个线程需要按照 “half-warp” 方式来访存。

如果没有 bank conflict，那么每个 cycle 可以 $16$ 个 threads 同时读取数据， $2$ 个 cycle 就可以读取完毕。
如果存在 bank conflict，那么这些冲突的 threads 只能串行读取数据了，读取速度慢得多。

在这里插入图片描述

现在的 GPU 已经有了足够多的 banks，如果没有冲突，则可以在 $1$ - cycle 内完成一个 warp 的数据读取。另外，如果 warp 里的线程同时读取同一个数据，使用 broadcast 机制，也可以在 $1$ - cycle 内完成，这也是 no bank conflict 的情况。

global memory

由于 global memory 的读取延迟很大， $200$ - cycle latency，因此将数据一个一个读取是低效的。可以使用内存合并技术（Coalescing accessing）：调度数据时，一次性调度地址整除 $64$ 的长度 $16$ 个 words 的内存块，那么 $16$ 个 half-warp 就可以各自从这个内存块中获取到自己的 word 了。

在这里插入图片描述

现在的 GPU 的一次性调度的内存块大小也变大了，足够支持一个 warp 内的 $32$ 个线程。我们可以使用 _align_(8) 或者 _align_(16) 来指定任意数据结构大小对齐。

Transfer Data

编程时，一般仅使用 shared memory 以及 global memory。使用 cudaMemcpy() 来调度数据。数据调度流程为：

将数据从 host 调度到 device 上
将数据从 device memory 调度到 shared memory 上
使用 __syncthreads() 同步一个 block 上的所有线程，保证数据全部 ready
在 shared memory 上执行程序
如有必要，再次使用 __syncthreads() 同步一个 block 上的所有线程，保证数据全部 updated
将计算结果从 shared memory 调度到 device memory 上
将计算结果从 device 调度到 host 上

在这里插入图片描述

为了提高访存效率，应当将数据仔细分类：

R/Only：常用的只读变量，放在 constant memory 里，会被自动调度到 cache 里。
R/W shared within block：被一个 block 共享的数据，调度到 shared memory 里，每个 thread 各自调度其中的一部分，然后大家共同访问这些数据。
R/W within each thread：只被线程自己使用，放在 register 里。但需要注意使用的数量，不要用超了。
R/W inputs/results：程序的输入输出，放在 global memory 里。

异步预取

如果使用 cudaMallocManaged()，那么将会创建一致内存 UM，自动在 host 和 device 之间调度数据。每当 device 需要某个数据，而在 device memory 上页面缺失，那么启动一次 HtoD 的 cudaMemcpy；每当 host 需要某个数据，而在 host memory 上页面缺失，那么启动一次 DtoH 的 cudaMemcpy。如果需要大量数据，这种一次次的数据调度会特别慢。将数据异步预取到 device / host 里，大幅降低页面缺失，提高程序的运行速度。

比如，数组相加 $a + b = c$

首先把数组 $a, b, c$ 异步预取到 device 上，进行初始化赋值
然后在 device 上计算 $c [i] = a [i] + b [i]$
最后把数组 $c$ 异步预取到 host 上，进行正确性检查

#include <stdio.h>

__global__
void initWith(float num, float *a, int N)
{

  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    a[i] = num;
  }
}

__global__
void addVectorsInto(float *result, float *a, float *b, int N)
{
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < N; i += stride)
  {
    result[i] = a[i] + b[i];
  }
}

void checkElementsAre(float target, float *vector, int N)
{
  for(int i = 0; i < N; i++)
  {
    if(vector[i] != target)
    {
      printf("FAIL: vector[%d] - %0.0f does not equal %0.0f\n", i, vector[i], target);
      exit(1);
    }
  }
  printf("Success! All values calculated correctly.\n");
}

int main()
{
  int deviceId;
  int numberOfSMs;

  cudaGetDevice(&deviceId);
  cudaDeviceGetAttribute(&numberOfSMs, cudaDevAttrMultiProcessorCount, deviceId);

  const int N = 2<<24;
  size_t size = N * sizeof(float);

  float *a;
  float *b;
  float *c;

  cudaMallocManaged(&a, size);
  cudaMallocManaged(&b, size);
  cudaMallocManaged(&c, size);

  //异步预取到 GPU(device) 上
  cudaMemPrefetchAsync(a, size, deviceId);
  cudaMemPrefetchAsync(b, size, deviceId);
  cudaMemPrefetchAsync(c, size, deviceId);

  size_t threadsPerBlock;
  size_t numberOfBlocks;

  threadsPerBlock = 256;
  numberOfBlocks = 32 * numberOfSMs;

  cudaError_t addVectorsErr;
  cudaError_t asyncErr;

  initWith<<<numberOfBlocks, threadsPerBlock>>>(3, a, N);
  initWith<<<numberOfBlocks, threadsPerBlock>>>(4, b, N);
  initWith<<<numberOfBlocks, threadsPerBlock>>>(0, c, N);

  addVectorsInto<<<numberOfBlocks, threadsPerBlock>>>(c, a, b, N);

  addVectorsErr = cudaGetLastError();
  if(addVectorsErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(addVectorsErr));

  asyncErr = cudaDeviceSynchronize();
  if(asyncErr != cudaSuccess) printf("Error: %s\n", cudaGetErrorString(asyncErr));

  //异步预取到 CPU(host) 上
  cudaMemPrefetchAsync(c, size, cudaCpuDeviceId); 

  checkElementsAre(7, c, N);

  cudaFree(a);
  cudaFree(b);
  cudaFree(c);
}

在使用异步预取进行了一系列重构之后，您应该看到内存传输次数减少了，但是每次传输的量增加了，并且内核执行时间大大减少了。

Threads

thread block

CUDA 里的 block，有如下性质：

一个 grid 里的所有 threads 执行同一个 kernel 程序（SPMD），使用 block ID 以及 thread ID 来挑选任务。
同一个 block 里的 threads，可以 share data，可以 synchronize。
不同 block 里的 threads 无法协作，因为 block 由硬件随机调度，执行顺序是任意的。

同一时间，多个 block 被调度到同一个 SM 上，按照 warp 打散来执行。

warp

一个 warp 内的 threads 实际上是以 SIMD 的方式执行的：

一个 warp 内的 $32$ 个 threads 共享指令。
关于某条 instruction，一旦 warp 内的 $32$ 个 threads 的数据全部 ready 了（打分板 scoreboard），那么 $4$ cycles 将 threads 的这条指令执行完毕。
这个 $4$ cycles 是由工程师人为设定的，因为一条指令的执行需要取址、译码、执行、写回等操作，执行速度过快没什么意义。

在这里插入图片描述

现在的 GPU 可以在 $2$ cycles 内执行完一个 warp 的一条指令。

GPU 性能

查看性能

使用 !nvidia-smi 命令可以查看本机 GPU 的性能：

Tue Dec 13 10:24:33 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.118.02   Driver Version: 440.118.02   CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:00:1E.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

由于 GPU 上的 SM 数量会因所用的特定 GPU 而异，因此为支持可移植性，您不得将 SM 数量硬编码到代码库中。相反，应该以编程方式获取此信息。为获取操作支持并查看相关介绍，请参阅 CUDA 运行时文档以帮助识别设备属性结构中的相关属性。

#include <stdio.h>

int main()
{
	int deviceId;
	cudaGetDevice(&deviceId);	// `deviceId` now points to the id of the currently active GPU.

    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, deviceId); // `props` now has many useful properties about the active GPU device.
    
    int computeCapabilityMajor = props.major;
    int computeCapabilityMinor = props.minor;
    int multiProcessorCount = props.multiProcessorCount;
    int warpSize = props.warpSize;
    int maxThreadsPerBlock = props.maxThreadsPerBlock;
    int maxThreadsPerMultiProcessor = props.maxThreadsPerMultiProcessor;

	printf("Device ID: %d\nNumber of SMs: %d\nCompute Capability Major: %d\nCompute Capability Minor: %d\nWarp Size: %d\nMax Threads Per Block: %d\nMax Threads Per MultiProcessor: %d", 
           deviceId, multiProcessorCount, computeCapabilityMajor, computeCapabilityMinor, warpSize, maxThreadsPerBlock, maxThreadsPerMultiProcessor);
    
    return 0;
}

结果为

Device ID: 0
Number of SMs: 40
Compute Capability Major: 7
Compute Capability Minor: 5
Warp Size: 32
Max Threads Per Block: 1024
Max Threads Per MultiProcessor: 1024

测试性能

如要确保优化加速代码库的尝试真正取得成功，唯一方法便是分析应用程序以获取有关其性能的定量信息。nsys 是指 NVIDIA 的Nsight System命令行分析器。该分析器附带于CUDA工具包中，提供分析被加速的应用程序性能的强大功能。

nsys 使用起来十分简单，最基本用法是向其传递使用 nvcc 编译的可执行文件的路径。随后 nsys 会继续执行应用程序，并在此之后打印应用程序 GPU 活动的摘要输出、CUDA API 调用以及统一内存（UM）活动的相关信息。

例如，一个向量加法的 CUDA 程序，先编译它：

!nvcc -std=c++11 -o vector-add-no-prefetch 01-vector-add/01-vector-add.cu -run

接着使用 nsys profile --stats = true 创建一个报告文件，可以在Nsight Systems可视化分析器中打开该文件：

!nsys profile --stats=true -o vector-add-no-prefetch-report ./vector-add-no-prefetch

**** collection configuration ****
	output_filename = /dli/task/vector-add-no-prefetch-report
	force-overwrite = false
	stop-on-exit = true
	export_sqlite = true
	stats = true
	capture-range = none
	stop-on-range-end = false
	Beta: ftrace events:
	ftrace-keep-user-config = false
	trace-GPU-context-switch = false
	delay = 0 seconds
	duration = 0 seconds
	kill = signal number 15
	inherit-environment = true
	show-output = true
	trace-fork-before-exec = false
	sample_cpu = true
	backtrace_method = LBR
	wait = all
	trace_cublas = false
	trace_cuda = true
	trace_cudnn = false
	trace_nvtx = true
	trace_mpi = false
	trace_openacc = false
	trace_vulkan = false
	trace_opengl = true
	trace_osrt = true
	osrt-threshold = 0 nanoseconds
	cudabacktrace = false
	cudabacktrace-threshold = 0 nanoseconds
	profile_processes = tree
	application command = ./vector-add-no-prefetch
	application arguments = 
	application working directory = /dli/task
	NVTX profiler range trigger = 
	NVTX profiler domain trigger = 
	environment variables:
	Collecting data...
Success! All values calculated correctly.
	Generating the /dli/task/vector-add-no-prefetch-report.qdstrm file.
	Capturing raw events...
	9885 total events collected.
	Saving diagnostics...
	Saving qdstrm file to disk...
	Finished saving file.


Importing the qdstrm file using /opt/nvidia/nsight-systems/2019.5.2/host-linux-x64/QdstrmImporter.

Importing...

Importing [==================================================100%]
Saving report to file "/dli/task/vector-add-no-prefetch-report.qdrep"
Report file saved.
Please discard the qdstrm file and use the qdrep file instead.

Removed /dli/task/vector-add-no-prefetch-report.qdstrm as it was successfully imported.
Please use the qdrep file instead.

Exporting the qdrep file to SQLite database using /opt/nvidia/nsight-systems/2019.5.2/host-linux-x64/nsys-exporter.

Exporting 9845 events:

0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************

Exported successfully to
/dli/task/vector-add-no-prefetch-report.sqlite

Generating CUDA API Statistics...
CUDA API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   57.6       219929639           3      73309879.7           29951       219841843  cudaMallocManaged                                                               
   37.0       141133435           1     141133435.0       141133435       141133435  cudaDeviceSynchronize                                                           
    5.4        20680929           3       6893643.0         6152223         8231765  cudaFree                                                                        
    0.0           53469           1         53469.0           53469           53469  cudaLaunchKernel                                                                


Generating CUDA Kernel Statistics...

Generating CUDA Memory Operation Statistics...
CUDA Kernel Statistics (nanoseconds)

Time(%)      Total Time   Instances         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
  100.0       141118979           1     141118979.0       141118979       141118979  addVectorsInto                                                                  


CUDA Memory Operation Statistics (nanoseconds)

Time(%)      Total Time  Operations         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   78.8        78380960        7923          9892.8            1824          128320  [CUDA Unified Memory memcpy HtoD]                                               
   21.2        21123232         768         27504.2            1600          159968  [CUDA Unified Memory memcpy DtoH]                                               


CUDA Memory Operation Statistics (KiB)

            Total      Operations            Average            Minimum            Maximum  Name                                                                            
-----------------  --------------  -----------------  -----------------  -----------------  --------------------------------------------------------------------------------
         393216.0            7923               49.6              4.000              764.0  [CUDA Unified Memory memcpy HtoD]                                               
         131072.0             768              170.7              4.000             1020.0  [CUDA Unified Memory memcpy DtoH]                                               


Generating Operating System Runtime API Statistics...
Operating System Runtime API Statistics (nanoseconds)

Time(%)      Total Time       Calls         Average         Minimum         Maximum  Name                                                                            
-------  --------------  ----------  --------------  --------------  --------------  --------------------------------------------------------------------------------
   55.3      1669733453          87      19192338.5           18686       100126111  poll                                                                            
   40.9      1235491628          83      14885441.3           11203       100244995  sem_timedwait                                                                   
    2.9        88329907         660        133833.2            1006        17394521  ioctl                                                                           
    0.8        22674769          90        251941.9            1396         8168908  mmap                                                                            
    0.0          638962          77          8298.2            2436           21519  open64                                                                          
    0.0          137399          11         12490.8            4708           24628  write                                                                           
    0.0          117980           4         29495.0           23124           37223  pthread_create                                                                  
    0.0          104548          23          4545.6            1412           15906  fopen                                                                           
    0.0           98832           3         32944.0           24763           46457  fgets                                                                           
    0.0           45682          14          3263.0            1697            4494  munmap                                                                          
    0.0           33118           5          6623.6            3206            9203  open                                                                            
    0.0           28823          16          1801.4            1092            3102  fclose                                                                          
    0.0           25011          13          1923.9            1152            3125  read                                                                            
    0.0           13682           2          6841.0            5868            7814  socket                                                                          
    0.0           10983           3          3661.0            3612            3732  pipe2                                                                           
    0.0            8008           5          1601.6            1015            3533  fcntl                                                                           
    0.0            7634           4          1908.5            1767            2072  mprotect                                                                        
    0.0            6467           2          3233.5            2726            3741  fread                                                                           
    0.0            5210           1          5210.0            5210            5210  connect                                                                         
    0.0            3447           1          3447.0            3447            3447  bind                                                                            
    0.0            1841           1          1841.0            1841            1841  listen                                                                          


Generating NVTX Push-Pop Range Statistics...
NVTX Push-Pop Range Statistics (nanoseconds)

然后，在 cmd 里执行 nsight-sys 命令，启动 Nsight Systems：

直观地描述由GPU加速的CUDA应用程序的时间表，
识别和利用CUDA应用程序中的优化机会。

在这里插入图片描述

CUDA 流

控制 CUDA 流行为的规则：

给定流中的所有 kernel 会按序执行。
就不同非默认流中的 kernel 而言，无法保证其会按彼此之间的任何特定顺序执行。
默认流具有阻断能力，即，它会等待其它已在运行的所有流完成当前 kernel 之后才运行，但在其自身运行完毕之前亦会阻碍其它流上的 kernel 运行。

在这里插入图片描述

创建、使用、销毁 stream：

cudaStream_t stream;   // CUDA流的类型为 `cudaStream_t`
cudaStreamCreate(&stream); // 注意，必须将一个指针传递给 `cudaCreateStream`

someKernel<<<number_of_blocks, threads_per_block, 0, stream>>>();   // `stream` 作为第4个EC参数传递

cudaStreamDestroy(stream); // 注意，将值（而不是指针）传递给 `cudaDestroyStream`

不过，在 CUDA 的云平台上，不同的非默认 stream 上，它们还是串行执行的。一个流执行完，另一个流才开始执行。

在这里插入图片描述