Table of contents
- 1. Setting the wait mode for CUDA synchronize
- 2. The flag-setting function
- 3. Implementing the stream sync wait with streamQuery
- Reference
1. Setting the wait mode for CUDA synchronize
- Reference material: https://docs.nvidia.com/cuda/pdf/CUDA_Runtime_API.pdf
CUDA synchronize can wait in one of three modes: yield, busy waiting (spin), and blocking.
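- busy waiting (spin): the thread keeps the CPU the whole time and waits by polling
- yield: the thread gives up its time slice on each pass, which can cause many context switches in and out
- blocking: the thread blocks and releases the CPU. When the GPU work on the stream finishes, the blocked CPU thread/process is woken up and resumes. Because this wake-up is passive, the blocked thread may resume late, which leaves idle gaps and increases the thread's overall execution time.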
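In the first two modes, the CPU main thread responds promptly once the GPU work completes and carries on. The third mode introduces a blocking gap. If the main thread is a real-time thread (for example, scheduled as FIFO) with high priority that preempts CPU resources, and CPU resources are plentiful, the blocked thread resumes fairly quickly, although delays cannot be ruled out.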
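To make the difference between the three modes concrete, here is a minimal CPU-only analogy in standard C++. It does not show how the CUDA runtime implements these modes; `Completion`, `wait_spin`, `wait_yield` and `wait_blocking` are hypothetical names, and the `signal()` call stands in for "the GPU work on the stream has finished".

```cpp
#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for "the GPU work on a stream has finished".
struct Completion {
    std::atomic<bool> done{false};
    std::mutex m;
    std::condition_variable cv;

    // Called by the producer (stand-in for the GPU/driver) when work ends.
    void signal() {
        {
            std::lock_guard<std::mutex> lock(m);
            done.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }
};

// Spin: poll continuously; the waiting core stays fully busy.
inline void wait_spin(Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
        // busy loop, nothing else to do
    }
}

// Yield: still polling, but offer the time slice to other threads each pass.
inline void wait_yield(Completion& c) {
    while (!c.done.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
}

// Blocking: sleep on a synchronization primitive until notified.
// The thread resumes only after the OS scheduler wakes it up.
inline void wait_blocking(Completion& c) {
    std::unique_lock<std::mutex> lock(c.m);
    c.cv.wait(lock, [&] { return c.done.load(std::memory_order_acquire); });
}
```

In the spin and yield variants the waiting thread notices completion the next time it polls; in the blocking variant it must first be rescheduled, which is where the wake-up delay described above comes from.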
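After switching to blocking mode, Nsight shows the following:
- GPU context switches become more frequent, probably because of the blocking
- resuming from the block is delayed, which leaves some idle GPU time (marked by the red box in the figure)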
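You can choose whether a CUDA stream synchronize releases the CPU or holds on to it. According to the official documentation, the default heuristic compares the number of active CUDA contexts in the process (C) with the number of logical processors in the system (P). When C > P, CPU resources are scarce, so CUDA yields the time slice. In the usual case, where CPU cores outnumber the contexts, it will "spin on the processor". Spinning is a form of polling wait.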
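The documented default rule can be written down as a tiny function. This is only a sketch of the rule as stated in the documentation: the real decision is made inside the CUDA runtime, and `WaitMode` and `auto_schedule` are hypothetical names, not part of any CUDA API.

```cpp
// Hypothetical mirror of the two outcomes of the default heuristic;
// this is not CUDA's own enum.
enum class WaitMode { Spin, Yield };

// C: active CUDA contexts in the process.
// P: logical processors in the system.
// Documented rule: yield when C > P, otherwise spin.
constexpr WaitMode auto_schedule(unsigned active_contexts,
                                 unsigned logical_processors) {
    return active_contexts > logical_processors ? WaitMode::Yield
                                                : WaitMode::Spin;
}
```

For example, one context on an 8-core machine gives `auto_schedule(1, 8) == WaitMode::Spin`, which matches the common case described above.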
2. The flag-setting function
official doc
- When cudaDeviceScheduleBlockingSync is set through this function, cudaDeviceMapHost may be set at the same time
__host__ cudaError_t cudaSetDeviceFlags(unsigned int flags);
// flags:
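- cudaDeviceScheduleAuto: chooses between cudaDeviceScheduleSpin and cudaDeviceScheduleYield based on the number of active CUDA contexts and logical processors
- cudaDeviceScheduleSpin: polling (spin) wait
- cudaDeviceScheduleYield: yields the time slice while waiting
- cudaDeviceScheduleBlockingSync: blocking wait
- cudaDeviceBlockingSync: deprecated (use cudaDeviceScheduleBlockingSync)
- cudaDeviceMapHost:
- cudaDeviceLmemResizeToMax: deprecated
- cudaDeviceSyncMemops: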
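- cudaDeviceScheduleAuto: "If C > P, then CUDA will yield to other OS threads when waiting for the device, otherwise CUDA will not yield while waiting for results and actively spin on the processor." The number of contexts may change while the program is running, so C > P could become true and yield would then be used unpredictably.
- cudaDeviceScheduleBlockingSync: "Instruct CUDA to block the CPU thread on a synchronization primitive when waiting for the device to finish work." Note that blocking sync is not the same as spin: a blocked thread enters the blocked state, releases the CPU, and is woken up later, whereas a spinning thread holds on to the CPU. Both make the program wait, but they use the hardware differently.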
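3. Implementing the stream sync wait with streamQuery
- You can also write the wait logic yourself, for example with std::this_thread::yield or busy waiting, using cudaStreamQuery to check for completion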
- "In my experience, you can’t make the CPU activity level lower, if the CPU has nothing else to do, and it is spinning at a CUDA sync point. If you really want to do something like that, my suggestion would be that instead of doing a CUDA device or stream sync, put your GPU work into a stream, and then in a loop you do cudaStreamQuery alternating with an OS command to put the thread to sleep. You decide what level of responsiveness you want/need based on how long you put the CPU thread to sleep."
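The query-then-sleep loop suggested in the quote can be sketched as a small helper. To keep the example runnable without a GPU, the completion check is passed in as a callable; `poll_with_sleep` and its parameters are hypothetical names, not part of CUDA.

```cpp
#include <chrono>
#include <thread>

// Poll `is_done` until it returns true, sleeping `interval` between polls.
// A longer interval lowers CPU usage but makes the wait less responsive.
// Returns how many times `is_done` was called.
template <typename Query>
unsigned poll_with_sleep(Query is_done, std::chrono::microseconds interval) {
    unsigned polls = 1;            // counts the call made in the loop condition
    while (!is_done()) {
        std::this_thread::sleep_for(interval);
        ++polls;
    }
    return polls;
}
```

In real CUDA code the callable would wrap cudaStreamQuery, which returns cudaSuccess once all work in the stream has completed and cudaErrorNotReady while work is still pending. One option is a lambda such as `[stream] { return cudaStreamQuery(stream) != cudaErrorNotReady; }`, with the final status checked for errors afterwards. Replacing the sleep with std::this_thread::yield() or an empty loop body gives the yield and busy-waiting variants.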
Reference
- Wikipedia: "In computer science and software engineering, busy-waiting, busy-looping or spinning is a technique in which a process repeatedly checks to see if a condition is true, such as whether keyboard input or a lock is available."
https://en.wikipedia.org/wiki/Busy_waiting#:~:text=In%20computer%20science%20and%20software,or%20a%20lock%20is%20available.