The GPU-Puzzles project teaches the concepts of GPU programming and CUDA parallel kernels through a series of small puzzles. Working through them you learn how CUDA kernels are written and launched, how to allocate shared memory, and how to implement operations such as convolution and matrix multiplication. After each puzzle you are even rewarded with a short puppy video 😁
Project repository: https://github.com/srush/GPU-Puzzles
I have completed all of the puzzles myself, so here is a walkthrough for reference.
Puzzle 1: Map
Implement a "kernel" (GPU function) that adds 10 to each position of vector a
and stores it in vector out
. You have 1 thread per position.
The goal is for out to hold every element of a plus 10.
def map_spec(a):
    return a + 10

def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + 10

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()
Do not read this as ordinary sequential Python: the kernel body runs once per thread, and each thread sees its own value of cuda.threadIdx.x, which is what produces the parallel effect (see the sketch after the score table). The visualized run is below.
# Map
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 0 | 0 |
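To make the execution model concrete, here is a purely sequential mental-model sketch of what the Puzzle 1 launch means: the same kernel body runs once per thread, each invocation seeing a different thread index. The helper emulate_launch below is made up for illustration and is not part of GPU-Puzzles or Numba; it just serializes what the GPU does in parallel.

import numpy as np

# Mental-model sketch only: each loop iteration plays the role of one thread.
def emulate_launch(kernel_body, threads_per_block):
    for thread_x in range(threads_per_block):
        kernel_body(thread_x)  # one "thread" per iteration

a = np.arange(4)
out = np.zeros(4)

def kernel_body(local_i):
    out[local_i] = a[local_i] + 10  # same body as the Puzzle 1 solution

emulate_launch(kernel_body, threads_per_block=4)
print(out)  # [10. 11. 12. 13.]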
Puzzle 2 - Zip
Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.
Each element of out is the sum of the elements of a and b at the same position. Nothing extra is needed here: use local_i directly as the index into both vectors.
def zip_spec(a, b):
    return a + b

def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + b[local_i]

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()
Visualized output:
# Zip
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 2 | 1 | 0 | 0 |
Puzzle 3 - Guards
Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.
This is an upgraded version of Map: we now have more CUDA threads than positions, but that does not matter; an if check keeps the local_i index within the valid range (a short sketch of why the guard matters follows the score table).
def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if local_i < size:
            out[local_i] = a[local_i] + 10

    return call

SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()
Visualized output:
# Guard
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 0 | 0 |
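As a quick illustration of why the guard is needed, here is a minimal plain-Python sketch (not part of the puzzle): with 8 threads and only 4 positions, the extra threads would otherwise index past the end of the arrays. NumPy would raise an error here; on a real GPU an out-of-bounds access is undefined behaviour rather than a clean error.

import numpy as np

a = np.arange(4)
out = np.zeros(4)
for local_i in range(8):      # 8 threads, only 4 valid positions
    if local_i < len(a):      # the guard from the kernel
        out[local_i] = a[local_i] + 10
print(out)  # [10. 11. 12. 13.]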
Puzzle 4 - Map 2D
Implement a kernel that adds 10 to each position of a and stores it in out. Input a is 2D and square. You have more threads than positions.
This goes a step further and shows that CUDA threads can be arranged in two dimensions. We launch more threads than there are positions, so bounds checks are needed in both dimensions.
def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()
Visualized output:
# Map 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 0 | 0 |
Puzzle 5 - Broadcast
Implement a kernel that adds a and b and stores it in out. Inputs a and b are vectors. You have more threads than positions.
This is essentially a 2D version of Zip: a column vector and a row vector are added with broadcasting, so out[i, j] = a[i, 0] + b[0, j]. The thread grid is larger than the output matrix, so mind the bounds (a NumPy check of the broadcast follows the score table).
def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

    return call

SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()
Visualized output:
# Broadcast
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 2 | 1 | 0 | 0 |
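For reference, NumPy performs the same broadcast that each thread computes element by element as a[i, 0] + b[0, j]; a quick check with the same shapes:

import numpy as np

a = np.arange(2).reshape(2, 1)  # column vector [[0], [1]]
b = np.arange(2).reshape(1, 2)  # row vector    [[0, 1]]
print(a + b)                    # [[0 1]
                                #  [1 2]]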
Puzzle 6 - Blocks
Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.
Unlike the previous puzzles, each block now has fewer threads than there are positions. That is still fine: blocks group threads, each block contains a fixed number of them, and the grid launches enough blocks to cover the whole array. Compute the global index from the block and thread indices and guard the boundary (a small index check follows the score table).
def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = a[i] + 10

    return call

SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()
Visualized output:
# Blocks
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 0 | 0 |
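To see how the formula i = blockIdx.x * blockDim.x + threadIdx.x covers the 9 positions with 3 blocks of 4 threads, here is a small plain-Python check (the last three threads fall outside the array and are masked out by the i < size guard):

block_dim, blocks_per_grid, size = 4, 3, 9
for block_idx in range(blocks_per_grid):
    for thread_idx in range(block_dim):
        i = block_idx * block_dim + thread_idx
        print(f"block {block_idx}, thread {thread_idx} -> i = {i}, active = {i < size}")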
Puzzle 7 - Blocks 2D
Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.
Here the per-block thread grid is smaller than the actual matrix, but that is fine: compute the global index in both dimensions and apply the boundary checks.
def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        if i < size and j < size:
            out[i, j] = a[i, j] + 10

    return call

SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))
problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()
Visualized output:
# Blocks 2D
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 0 | 0 |
Puzzle 8 - Shared
Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.
This puzzle introduces shared memory. In CUDA, shared memory must be allocated with a compile-time-constant size (in Numba, cuda.shared.array takes a constant shape), which is why TPB is a module-level constant here.
TPB = 4

def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = shared[local_i] + 10

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()
Visualized output:
# Shared
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 1 | 1 |
Puzzle 9 - Pooling
Implement a kernel that sums together the last 3 positions of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.
This puzzle teaches us to limit the number of global reads and writes: values that are read repeatedly are staged once into shared memory, which is much cheaper to access. cuda.syncthreads() ensures every thread has written its slot before neighbouring threads read it (a quick check of the expected output follows the score table).
def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out

TPB = 8

def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
        if i < size:
            shared[local_i] = a[i]
        cuda.syncthreads()
        if i == 0:
            out[i] = shared[local_i]
        elif i == 1:
            out[i] = shared[local_i] + shared[local_i - 1]
        else:
            out[i] = shared[local_i] + shared[local_i - 1] + shared[local_i - 2]

    return call

SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()
Visualized output:
# Pooling
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 1 | 1 | 3 | 1 |
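A quick sanity check of what the spec expects for a = np.arange(8): each output entry is the sum of up to the three most recent inputs.

import numpy as np

a = np.arange(8)
expected = np.array([a[max(i - 2, 0): i + 1].sum() for i in range(len(a))])
print(expected)  # [ 0  1  3  6  9 12 15 18]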
Puzzle 10 - Dot Product
Implement a kernel that computes the dot-product of a and b and stores it in out. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.
This implements the dot product by hand. Each thread stores its elementwise product in shared memory, and at the end a single thread sums the shared buffer and writes the result, which keeps the global reads and writes per thread within the required limit (a quick check of the expected value follows the score table).
def dot_spec(a, b):
    return a @ b

TPB = 8

def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
        if i < size:
            shared[local_i] = a[i] * b[i]
        cuda.syncthreads()
        if local_i == 0:
            total = 0.0
            for j in range(TPB):
                total += shared[j]
            out[0] = total

    return call

SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()
Visualized output:
# Dot
Score (Max Per Thread):
| Global Reads | Global Writes | Shared Reads | Shared Writes |
| 2 | 1 | 8 | 1 |
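And a quick check of the expected value: with a = b = np.arange(8) the dot product is 0 + 1 + 4 + 9 + 16 + 25 + 36 + 49 = 140, so out[0] should hold 140.

import numpy as np

a = np.arange(8)
b = np.arange(8)
print(a @ b)  # 140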
Puzzles 11-14 remain; they are harder and their solutions are more involved, so they are covered in the next post.
👉🏻 GPU Puzzles Walkthrough (Part 2) - CSDN blog