GPU Puzzles Walkthrough (Part 1)

The GPU-Puzzles project teaches GPU programming and the core concepts of CUDA parallel programming through a series of small problems: writing and launching kernels, allocating shared memory, and building up to convolution and matrix multiplication. As a bonus, each solved puzzle rewards you with a short dog video 😁

Project repository: https://github.com/srush/GPU-Puzzles

I have worked through all of the puzzles myself, and this walkthrough is offered for reference.

Puzzle 1: Map 

Implement a "kernel" (GPU function) that adds 10 to each position of vector a and stores it in vector out. You have 1 thread per position.

The goal is for out to hold every element of a plus 10.

def map_spec(a):
    return a + 10


def map_test(cuda):
    def call(out, a) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + 10

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Map", map_test, [a], out, threadsperblock=Coord(SIZE, 1), spec=map_spec
)
problem.show()

Don't read this entirely with a Python mindset: the kernel body runs once on every CUDA thread, and cuda.threadIdx.x gives each thread its own index, which is what produces the parallel effect. The visualized run is below.

# Map
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             0 |             0 | 
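
To build the mental model, here is a rough sequential emulation of what the launch does (plain Python, my own sketch; emulate_map_launch is a hypothetical name, not part of the puzzle framework). Every thread runs the same kernel body with its own threadIdx.x; on a real GPU all iterations of this loop would execute concurrently.

import numpy as np

def emulate_map_launch(a, threads_per_block):
    # Sequential stand-in for the parallel launch: one "thread" per iteration.
    out = np.zeros_like(a)
    for local_i in range(threads_per_block):  # concurrent on a real GPU
        out[local_i] = a[local_i] + 10
    return out

print(emulate_map_launch(np.arange(4), 4))  # [10 11 12 13]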

Puzzle 2 - Zip

Implement a kernel that adds together each position of a and b and stores it in out. You have 1 thread per position.

Each element of out is the sum of the corresponding elements of a and b. No extra work is needed: local_i serves directly as the index.

def zip_spec(a, b):
    return a + b


def zip_test(cuda):
    def call(out, a, b) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 1 lines)
        out[local_i] = a[local_i] + b[local_i]

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Zip", zip_test, [a, b], out, threadsperblock=Coord(SIZE, 1), spec=zip_spec
)
problem.show()

Visualization:


# Zip
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             2 |             1 |             0 |             0 | 

Puzzle 3 - Guards 

Implement a kernel that adds 10 to each position of a and stores it in out. You have more threads than positions.

This is Map with a twist: we have more CUDA threads than positions. That is fine; an if guard keeps local_i within the valid range.

def map_guard_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if local_i < size:
            out[local_i] = a[local_i] + 10

    return call


SIZE = 4
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Guard",
    map_guard_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(8, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Guard
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             0 |             0 | 

Puzzle 4 - Map 2D

Implement a kernel that adds 10 to each position of a and stores it in out. Input a is 2D and square. You have more threads than positions.

This goes a step further and shows that CUDA threads can be arranged in 2D. We launch more threads than there are positions, so boundary checks are needed in both dimensions.

def map_2D_test(cuda):
    def call(out, a, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, local_j] + 10

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE * SIZE).reshape((SIZE, SIZE))
problem = CudaProblem(
    "Map 2D", map_2D_test, [a], out, [SIZE], threadsperblock=Coord(3, 3), spec=map_spec
)
problem.show()

Visualization:

# Map 2D
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             0 |             0 | 

Puzzle 5 - Broadcast

Implement a kernel that adds a and b and stores it in out. Inputs a and b are vectors. You have more threads than positions.

This is really the 2D version of Zip: add the column vector and the row vector at matching indices. The thread grid is larger than the output matrix, so mind the bounds.

def broadcast_test(cuda):
    def call(out, a, b, size) -> None:
        local_i = cuda.threadIdx.x
        local_j = cuda.threadIdx.y
        # FILL ME IN (roughly 2 lines)
        if local_i < size and local_j < size:
            out[local_i, local_j] = a[local_i, 0] + b[0, local_j]

    return call


SIZE = 2
out = np.zeros((SIZE, SIZE))
a = np.arange(SIZE).reshape(SIZE, 1)
b = np.arange(SIZE).reshape(1, SIZE)
problem = CudaProblem(
    "Broadcast",
    broadcast_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    spec=zip_spec,
)
problem.show()

Visualization:

# Broadcast
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             2 |             1 |             0 |             0 | 
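
The spec here is just NumPy broadcasting: a (SIZE, 1) column plus a (1, SIZE) row broadcasts to a (SIZE, SIZE) matrix, which is exactly what the kernel computes one element per thread. A quick check in plain NumPy:

import numpy as np

a = np.arange(2).reshape(2, 1)  # column vector
b = np.arange(2).reshape(1, 2)  # row vector
print(a + b)                    # [[0 1]
                                #  [1 2]]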

Puzzle 6 - Blocks

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

Unlike the earlier puzzles, each block here has fewer threads than the array. That is still fine: blocks partition the threads into fixed-size groups, and the grid supplies enough blocks to cover the whole array. Compute the global index from the block and thread indices, then guard the boundary.

def map_block_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = a[i] + 10
    return call


SIZE = 9
out = np.zeros((SIZE,))
a = np.arange(SIZE)
problem = CudaProblem(
    "Blocks",
    map_block_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(4, 1),
    blockspergrid=Coord(3, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Blocks
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             0 |             0 | 
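
To make the indexing concrete, this small sketch (my own illustration) enumerates the global index i = blockIdx.x * blockDim.x + threadIdx.x for SIZE = 9 with 3 blocks of 4 threads; the guard deactivates the three surplus threads.

SIZE, BLOCK_DIM, N_BLOCKS = 9, 4, 3

for block in range(N_BLOCKS):
    for thread in range(BLOCK_DIM):
        i = block * BLOCK_DIM + thread  # same formula as the kernel
        print(f"block {block} thread {thread} -> i = {i}, active = {i < SIZE}")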

Puzzle 7 - Blocks 2D 

Implement the same kernel in 2D. You have fewer threads per block than the size of a in both directions.

Here the per-block thread grid is smaller than the matrix in both directions, but with enough blocks and a boundary check the whole matrix is still covered.

def map_block2D_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        # FILL ME IN (roughly 4 lines)
        j = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
        if i < size and j < size:
            out[i, j] = a[i, j] + 10

    return call


SIZE = 5
out = np.zeros((SIZE, SIZE))
a = np.ones((SIZE, SIZE))

problem = CudaProblem(
    "Blocks 2D",
    map_block2D_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(3, 3),
    blockspergrid=Coord(2, 2),
    spec=map_spec,
)
problem.show()

Visualization:

# Blocks 2D
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             0 |             0 | 

Puzzle 8 - Shared 

Implement a kernel that adds 10 to each position of a and stores it in out. You have fewer threads per block than the size of a.

This puzzle teaches shared memory. Note that the shared array must be declared with a size that is a compile-time constant, which is why TPB lives at module scope.

TPB = 4
def shared_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x

        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        # FILL ME IN (roughly 2 lines)
        if i < size:
            out[i] = shared[local_i] + 10

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.ones(SIZE)
problem = CudaProblem(
    "Shared",
    shared_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(2, 1),
    spec=map_spec,
)
problem.show()

Visualization:

# Shared
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             1 |             1 | 
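
For reference, here is the same pattern written as a real Numba CUDA kernel — a minimal sketch assuming numba is installed, not part of the puzzle framework. In real CUDA, cuda.syncthreads() is usually placed outside the guard so that every thread in the block reaches it.

from numba import cuda, float32

TPB = 4  # compile-time constant: required by cuda.shared.array

@cuda.jit
def add_ten(out, a, size):
    shared = cuda.shared.array(TPB, float32)  # per-block scratch space
    i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
    local_i = cuda.threadIdx.x
    if i < size:
        shared[local_i] = a[i]
    cuda.syncthreads()  # all threads in the block wait here
    if i < size:
        out[i] = shared[local_i] + 10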

Puzzle 9 - Pooling

Implement a kernel that sums together the last 3 position of a and stores it in out. You have 1 thread per position. You only need 1 global read and 1 global write per thread.

The lesson here is to control the number of global reads and writes: when values will be read repeatedly, staging them in shared memory first improves efficiency.

def pool_spec(a):
    out = np.zeros(*a.shape)
    for i in range(a.shape[0]):
        out[i] = a[max(i - 2, 0) : i + 1].sum()
    return out


TPB = 8
def pool_test(cuda):
    def call(out, a, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 8 lines)
        if i < size:
            shared[local_i] = a[i]
            cuda.syncthreads()

        if i == 0:
            out[i] = shared[local_i]
        elif i == 1:
            out[i] = shared[local_i] + shared[local_i - 1]
        else:
            out[i] = shared[local_i] + shared[local_i - 1] + shared[local_i - 2]

    return call


SIZE = 8
out = np.zeros(SIZE)
a = np.arange(SIZE)
problem = CudaProblem(
    "Pooling",
    pool_test,
    [a],
    out,
    [SIZE],
    threadsperblock=Coord(TPB, 1),
    blockspergrid=Coord(1, 1),
    spec=pool_spec,
)
problem.show()

Visualization:

# Pooling
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             1 |             1 |             3 |             1 | 
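
For contrast, a naive variant without shared memory (my own sketch; pool_naive_test is a hypothetical name) would cost up to 3 global reads per thread instead of 1; the staging step trades those for cheap shared reads.

def pool_naive_test(cuda):
    def call(out, a, size) -> None:
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        if i < size:
            # Up to 3 global reads per thread; the score table would
            # report Global Reads = 3 instead of 1.
            total = 0.0
            for j in range(max(i - 2, 0), i + 1):
                total += a[j]
            out[i] = total

    return call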

Puzzle 10 - Dot Product

Implement a kernel that computes the dot-product of a and b and stores it in out. You have 1 thread per position. You only need 2 global reads and 1 global write per thread.

This implements the dot product by hand: each thread stores its partial product in shared memory, and then a single thread sums the shared buffer and writes the result. This keeps global traffic to 2 reads and 1 write per thread.

def dot_spec(a, b):
    return a @ b

TPB = 8
def dot_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)

        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        # FILL ME IN (roughly 9 lines)
        if i < size:
            shared[local_i] = a[i] * b[i]
            cuda.syncthreads()

        if local_i == 0:
            total = 0.0
            for j in range(TPB):
                total += shared[j]
            out[0] = total

    return call


SIZE = 8
out = np.zeros(1)
a = np.arange(SIZE)
b = np.arange(SIZE)
problem = CudaProblem(
    "Dot",
    dot_test,
    [a, b],
    out,
    [SIZE],
    threadsperblock=Coord(SIZE, 1),
    blockspergrid=Coord(1, 1),
    spec=dot_spec,
)
problem.show()

Visualization:

# Dot
 
   Score (Max Per Thread):
   |  Global Reads | Global Writes |  Shared Reads | Shared Writes |
   |             2 |             1 |             8 |             1 | 
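
The single-thread sum at the end is simple but serial: thread 0 does all 8 shared reads. A common alternative — shown here as my own sketch under the same puzzle conventions, not the official solution — is a tree reduction that halves the active range each round, finishing in log2(TPB) steps.

TPB = 8
def dot_tree_test(cuda):
    def call(out, a, b, size) -> None:
        shared = cuda.shared.array(TPB, numba.float32)
        i = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
        local_i = cuda.threadIdx.x
        if i < size:
            shared[local_i] = a[i] * b[i]
        cuda.syncthreads()
        # Pairwise (tree) reduction: halve the active range each round.
        stride = TPB // 2
        while stride > 0:
            if local_i < stride:
                shared[local_i] += shared[local_i + stride]
            cuda.syncthreads()
            stride //= 2
        if local_i == 0:
            out[0] = shared[0]

    return call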

Puzzles 11-14 remain; they are harder and their solutions more involved, so they are covered in the next post:

👉🏻 GPU Puzzles Walkthrough (Part 2)
