Unveiling performance insights with PyTorch Profiler on an AMD GPU

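May 29, 2024, by Phillip Dang.
In machine learning, optimizing performance is often as important as refining the model architecture. In this blog, we take a close look at PyTorch Profiler, a handy tool for peeking under the hood of a PyTorch model and exposing bottlenecks and inefficiencies. We cover the basics of how PyTorch Profiler works and how to use it on an AMD GPU + ROCm system to make your models more efficient.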

What is PyTorch Profiler?

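PyTorch Profiler is a performance analysis tool that lets developers examine various aspects of training and inference for PyTorch models. It collects detailed profiling information, including GPU/CPU utilization, memory usage, and the execution time of individual operations in the model. With this information, developers can understand the runtime behavior of their models and identify opportunities for optimization.
Using PyTorch Profiler takes only a few steps:
1. Annotate your code: To start profiling PyTorch code, mark the regions or operations you want to profile. PyTorch Profiler provides context managers and decorators for this purpose.
2. Configure the profiler: Set the profiler options to match your needs. You can specify parameters such as the level of detail, the profiling mode (for example, CPU or GPU), and the output format.
3. Run the profiler: Once the code is annotated and the profiler is configured, run your PyTorch code as usual. The profiler collects performance data during execution.
4. Analyze the results: After execution, inspect the results with the visualization tools that PyTorch Profiler provides. Explore timelines, flame graphs, and memory usage graphs to identify performance bottlenecks and optimization opportunities.
5. Iterate and optimize: Use the insights from profiling to refine your code step by step. Make targeted optimizations based on the profiling data, then rerun the profiler to measure the effect of your changes.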
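
The steps above can be sketched in a few lines. This is a minimal illustration rather than the workflow used later in this blog: it records CPU activity only so that it runs on any machine, the region label `matmul_region` and the trace file name are arbitrary choices made for this example, and on a ROCm system `ProfilerActivity.CUDA` could be added to the list of activities.

```python
import os
import tempfile

import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Steps 1-2: decide what to record. CPU only here (an assumption made so the
# sketch runs without a GPU).
activities = [ProfilerActivity.CPU]

x = torch.randn(128, 128)

# Step 3: run the code under the profiler and label a region of interest.
with profile(activities=activities, record_shapes=True) as prof:
    with record_function("matmul_region"):  # hypothetical label
        y = x @ x

# Step 4: inspect the aggregated results and export a timeline trace.
summary = prof.key_averages().table(sort_by="cpu_time_total", row_limit=5)
print(summary)

trace_path = os.path.join(tempfile.gettempdir(), "profiler_trace.json")
prof.export_chrome_trace(trace_path)
```

The exported JSON file can be opened in a Chrome-trace-compatible viewer to browse the timeline.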

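Prerequisites

To follow along with this blog, you need the following software:
- ROCm
- PyTorch
- Linux OS

For a list of supported GPUs and operating systems, refer to the ROCm documentation. For convenience and stability, we recommend pulling and running the rocm/pytorch Docker image directly on your Linux system with the following command: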

docker run -it --ipc=host --network=host --device=/dev/kfd --device=/dev/dri \
           --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
           --name=olmo rocm/pytorch:rocm6.0_ubuntu20.04_py3.9_pytorch_2.1.1 /bin/bash

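To check your hardware and confirm that the system recognizes your GPU, run the following command (the leading ! is for notebook cells; omit it in a regular shell):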

! rocm-smi --showproductname

Your output should look like this:

================= ROCm System Management Interface ================
========================= Product Info ============================
GPU[0] : Card series: Instinct MI210
GPU[0] : Card model: 0x0c34
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D67301
===================================================================
===================== End of ROCm SMI Log =========================

Next, make sure PyTorch detects your GPU:

import torch
print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])

Your output should look like this:

number of GPUs: 1
['AMD Radeon Graphics']
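
ROCm builds of PyTorch expose AMD GPUs through the same `torch.cuda` namespace used above, which is why the rest of this blog uses `'cuda'` as the device name. As a small optional check, the sketch below reports whether the installed PyTorch was built with HIP support and picks a device accordingly. The CPU fallback is an addition for this example only; the blog itself assumes a GPU is present.

```python
import torch

# torch.version.hip is a version string on ROCm builds and None otherwise.
hip_version = getattr(torch.version, "hip", None)
gpu_available = torch.cuda.is_available()

print(f"HIP version: {hip_version}")
print(f"GPU available: {gpu_available}")

# Fall back to the CPU when no GPU is visible (assumption for this sketch).
device = "cuda" if gpu_available else "cpu"
print(f"selected device: {device}")
```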

Instrumenting your code

Libraries

Import the libraries and modules we will use.

import torch
import torch.nn as nn
import torchvision
from torchvision import transforms
from torch.profiler import profile, record_function, ProfilerActivity

Model

First, we create a very simple convolutional neural network to profile.

class SimpleCNN(nn.Module):
    def __init__(self):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)
        self.fc1 = nn.Linear(32 * 8 * 8, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = torch.relu(self.conv1(x))
        x = torch.max_pool2d(x, kernel_size=2, stride=2)
        x = torch.relu(self.conv2(x))
        x = torch.max_pool2d(x, kernel_size=2, stride=2)
        x = x.view(-1, 32 * 8 * 8)
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
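
The `32 * 8 * 8` input size of `fc1` follows from the layer shapes: the padded 3x3 convolutions preserve the 32x32 spatial size, and each 2x2 max pooling halves it. The sketch below traces these shapes on the CPU with a random batch; the batch size of 4 is an arbitrary choice for illustration.

```python
import torch
import torch.nn as nn

# A random batch shaped like CIFAR-10 images: (batch, channels, height, width).
x = torch.randn(4, 3, 32, 32)

conv1 = nn.Conv2d(3, 16, kernel_size=3, padding=1)
conv2 = nn.Conv2d(16, 32, kernel_size=3, padding=1)

# Convolution keeps 32x32 (padding=1); pooling halves it to 16x16.
h1 = torch.max_pool2d(torch.relu(conv1(x)), kernel_size=2, stride=2)
print(h1.shape)

# Second convolution keeps 16x16; pooling halves it to 8x8.
h2 = torch.max_pool2d(torch.relu(conv2(h1)), kernel_size=2, stride=2)
print(h2.shape)

# Flatten to (batch, 32 * 8 * 8) before the fully connected layers.
flat = h2.view(-1, 32 * 8 * 8)
print(flat.shape)
```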

Data

Next, we download a simple dataset.

# Load CIFAR-10 dataset 
transform = transforms.Compose([
    transforms.Resize((32, 32)),
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])
trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

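Training loop

We create a simple training loop with a forward and a backward pass and profile it. For the purposes of this blog, we profile roughly 200 batches instead of iterating over the entire dataset. The loop below stops once the batch index reaches 200, so 201 batches are processed, which matches the 201 calls reported in the profiler tables later on.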

# Function to train the model
def train(model, trainloader, criterion, optimizer, device, epochs=1):
    for epoch in range(epochs):
        for i, data in enumerate(trainloader, 0):
            inputs, labels = data
            inputs = inputs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            outputs = model(inputs)
            loss = criterion(outputs, labels)
            loss.backward()
            optimizer.step()
            # exit once the batch index reaches 200 (201 batches in total)
            if i == 200:
                break

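In addition, let's write a utility function that sets up the optimizer and the loss function, moves the model to the GPU, and runs the actual profiling.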

# utility function for running the profiler 
def run_profiler(trainloader, model, profile_memory=False):
    device = 'cuda'
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
    activities = [ProfilerActivity.CPU, ProfilerActivity.CUDA]
    
    with profile(activities=activities, record_shapes=True, profile_memory=profile_memory) as prof:
        with record_function("training"):
            train(model, trainloader, criterion, optimizer, device, epochs=1)

    # Sort by GPU time for timing runs and by CPU memory for memory runs
    if not profile_memory:
        print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
    else:
        print(prof.key_averages().table(sort_by="self_cpu_memory_usage", row_limit=10))

Profiling is straightforward: we only need to wrap the training loop in the profiler context manager.
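
Wrapping the whole loop records every batch. For longer runs, `torch.profiler` also offers a `schedule` that skips some steps, warms up, and then records only a window of steps. The sketch below is an optional variation that is not used elsewhere in this blog. It runs on the CPU with a toy linear model and random data, all of which are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn
from torch.profiler import profile, schedule, ProfilerActivity

# Toy model and optimizer (stand-ins for illustration only).
model = nn.Linear(32, 10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

tables = []

def on_trace_ready(prof):
    # Called by the profiler each time an "active" window finishes.
    tables.append(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))

# Skip 1 step, warm up for 1 step, record 2 steps, and do this once.
my_schedule = schedule(wait=1, warmup=1, active=2, repeat=1)

with profile(activities=[ProfilerActivity.CPU],
             schedule=my_schedule,
             on_trace_ready=on_trace_ready) as prof:
    for step in range(6):
        inputs = torch.randn(8, 32)
        labels = torch.randint(0, 10, (8,))
        optimizer.zero_grad()
        loss = criterion(model(inputs), labels)
        loss.backward()
        optimizer.step()
        prof.step()  # tell the profiler that one training step has finished

print(tables[0])
```

With this pattern only the recorded window contributes to the table, so the profiler adds less overhead to the other training steps.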

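Running the profiler

With the model, the training loop, and the profiling utility function in place, we can use PyTorch Profiler to analyze execution time and memory consumption.

Profiling execution time

Let's first look at the execution time of the training loop.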

model = SimpleCNN()
trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
run_profiler(trainloader, model)

The output looks like this:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               training        23.76%     360.249ms        71.31%        1.081s        1.081s       0.000us         0.00%      68.837ms      68.837ms             1  
autograd::engine::evaluate_function: ConvolutionBack...         0.15%       2.271ms         3.63%      55.037ms     136.908us       0.000us         0.00%      34.770ms      86.493us           402  
                             aten::convolution_backward         2.34%      35.480ms         3.34%      50.615ms     125.908us      18.366ms        16.60%      34.770ms      86.493us           402  
                                   ConvolutionBackward0         0.14%       2.151ms         3.46%      52.431ms     130.425us       0.000us         0.00%      34.486ms      85.786us           402  
    autograd::engine::evaluate_function: AddmmBackward0         0.33%       4.960ms         7.98%     120.946ms     300.861us       0.000us         0.00%      16.764ms      41.701us           402  
                                            aten::copy_         0.44%       6.674ms         2.08%      31.585ms      77.037us      15.762ms        14.25%      16.408ms      40.020us           410  
                                         aten::_to_copy         0.14%       2.079ms         2.31%      34.972ms      86.995us       0.000us         0.00%      16.306ms      40.562us           402  
                                              aten::sum         0.78%      11.818ms         0.93%      14.160ms      17.612us      14.723ms        13.31%      16.162ms      20.102us           804  
                                               aten::to         0.13%       2.031ms         2.36%      35.852ms      35.674us       0.000us         0.00%      15.783ms      15.704us          1005  
                                       CopyHostToDevice         0.00%       0.000us         0.00%       0.000us       0.000us      15.739ms        14.23%      15.739ms      39.152us           402  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.516s
Self CUDA time total: 110.639ms

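Note the difference between self cpu time and cpu time. According to the PyTorch Profiler tutorial in the PyTorch Tutorials documentation, "operators can call other operators, self cpu time excludes time spent in children operator calls, while total cpu time includes it. You can choose to sort by the self cpu time by passing sort_by="self_cpu_time_total" into the table call."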
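
To see the difference between the two metrics in isolation, the following sketch profiles a single operator on the CPU and sorts the table by self CPU time. The tensor sizes and the repeat count are arbitrary. `aten::linear` is a convenient example because it dispatches to child operators, so its self time is typically only a small part of its total time.

```python
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(256, 256)
w = torch.randn(256, 256)

with profile(activities=[ProfilerActivity.CPU]) as prof:
    for _ in range(10):
        torch.nn.functional.linear(x, w)

averages = prof.key_averages()

# Sort by time spent in each operator itself, excluding child operators.
print(averages.table(sort_by="self_cpu_time_total", row_limit=5))

# Compare self and total CPU time (in microseconds) for aten::linear.
for evt in averages:
    if evt.key == "aten::linear":
        print(evt.key, "self:", evt.self_cpu_time_total, "total:", evt.cpu_time_total)
```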

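Next, we reduce the convolutional neural network (CNN) to a single linear layer and run the profiler again. We expect to see a significant reduction in total CUDA time.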

class SimpleNet(nn.Module):
    def __init__(self):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(3 * 32 * 32, 10)

    def forward(self, x):
        x = x.view(-1, 3 * 32 * 32)
        x = self.fc1(x)
        return x

model = SimpleNet()
run_profiler(trainloader, model)

Here is the output:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                               training        23.91%     192.128ms        84.59%     679.785ms     679.785ms       0.000us         0.00%      39.361ms      39.361ms             1  
                                           aten::linear         0.10%     768.000us         1.57%      12.605ms      62.711us       0.000us         0.00%      16.955ms      84.353us           201  
                                            aten::addmm         0.99%       7.943ms         1.28%      10.247ms      50.980us      16.955ms        37.52%      16.955ms      84.353us           201  
Cijk_Alik_Bljk_SB_MT64x64x32_MI32x32x2x1_SE_1LDSB0_A...         0.00%       0.000us         0.00%       0.000us       0.000us      15.556ms        34.42%      15.556ms      77.393us           201  
                                            aten::copy_         0.25%       2.028ms         3.07%      24.636ms      60.980us      14.614ms        32.34%      14.614ms      36.173us           404  
                                       CopyHostToDevice         0.00%       0.000us         0.00%       0.000us       0.000us      14.608ms        32.32%      14.608ms      36.338us           402  
                                         aten::_to_copy         0.27%       2.130ms         3.50%      28.122ms      69.955us       0.000us         0.00%      14.554ms      36.204us           402  
                                               aten::to         0.31%       2.460ms         3.61%      28.972ms      28.771us       0.000us         0.00%      13.586ms      13.492us          1007  
                                Optimizer.step#SGD.step         2.09%      16.809ms         2.94%      23.664ms     117.731us       0.000us         0.00%       5.557ms      27.647us           201  
    autograd::engine::evaluate_function: AddmmBackward0         0.28%       2.236ms         1.64%      13.185ms      65.597us       0.000us         0.00%       3.691ms      18.363us           201  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 803.604ms
Self CUDA time total: 45.193ms

As expected, the total CUDA time drops significantly (Self CUDA time total goes from 110.639 ms to 45.193 ms).
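
A quick calculation puts the two runs side by side. The numbers below are copied from the two tables above; nothing is measured here. It also shows that the host-to-device copy (`CopyHostToDevice`) costs about the same in both runs, so it takes a larger share of GPU time once the model itself becomes cheaper.

```python
# Values copied from the profiler tables above (milliseconds).
cnn_total_ms = 110.639
linear_total_ms = 45.193
cnn_copy_ms = 15.739      # CopyHostToDevice, SimpleCNN run
linear_copy_ms = 14.608   # CopyHostToDevice, SimpleNet run

speedup = cnn_total_ms / linear_total_ms
reduction_pct = 100 * (cnn_total_ms - linear_total_ms) / cnn_total_ms
cnn_copy_share = 100 * cnn_copy_ms / cnn_total_ms
linear_copy_share = 100 * linear_copy_ms / linear_total_ms

print(f"GPU time ratio (CNN / linear): {speedup:.2f}x")
print(f"GPU time reduction: {reduction_pct:.1f}%")
print(f"host-to-device copy share, SimpleCNN: {cnn_copy_share:.2f}%")
print(f"host-to-device copy share, SimpleNet: {linear_copy_share:.2f}%")
```

The computed copy shares agree with the Self CUDA % column reported for `CopyHostToDevice` in each table.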

Profiling memory consumption

We can also profile the amount of memory used by tensors that are allocated or released while the model runs.

trainloader = torch.utils.data.DataLoader(trainset, batch_size=32, shuffle=True, num_workers=4)
model = SimpleCNN()
run_profiler(trainloader, model, profile_memory=True)

The output table looks like this:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        22.44%     224.849ms        22.74%     227.911ms       1.134ms       0.000us         0.00%       0.000us       0.000us      75.42 Mb      75.42 Mb           0 b           0 b           201  
                                            aten::empty         0.22%       2.204ms         0.22%       2.204ms       2.731us       0.000us         0.00%       0.000us       0.000us     390.64 Kb     390.64 Kb       3.79 Mb       3.79 Mb           807  
                                    aten::scalar_tensor         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us           8 b           8 b           0 b           0 b             1  
                                          aten::random_         0.00%      25.000us         0.00%      25.000us      12.500us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                             aten::item         0.00%       9.000us         0.00%      13.000us       6.500us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                              aten::_local_scalar_dense         0.00%       4.000us         0.00%       4.000us       2.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                          aten::resize_         0.00%       6.000us         0.00%       6.000us       0.002us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          2615  
                                     aten::resolve_conj         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                      aten::resolve_neg         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                               aten::to         0.22%       2.206ms         3.73%      37.335ms      37.149us       0.000us         0.00%      14.821ms      14.747us           0 b           0 b      75.47 Mb       2.63 Mb          1005  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 1.002s
Self CUDA time total: 109.871ms

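If we are not satisfied with the memory consumption of the data loader, we can try various strategies to address the memory bottleneck, such as reducing the batch size, simplifying the model architecture, or using mixed-precision training. Let's reduce the batch size from 32 to 4 and run the profiler again: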

trainloader = torch.utils.data.DataLoader(trainset, batch_size=4, shuffle=True, num_workers=4)
model = SimpleCNN()
run_profiler(trainloader, model, profile_memory=True)

The new output is as follows:

-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
                                                   Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg     Self CUDA   Self CUDA %    CUDA total  CUDA time avg       CPU Mem  Self CPU Mem      CUDA Mem  Self CUDA Mem    # of Calls  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
enumerate(DataLoader)#_MultiProcessingDataLoaderIter...        13.45%     127.135ms        13.74%     129.910ms     646.318us       0.000us         0.00%       0.000us       0.000us       9.43 Mb       9.43 Mb           0 b           0 b           201  
                                            aten::empty         0.23%       2.193ms         0.23%       2.193ms       2.717us       0.000us         0.00%       0.000us       0.000us     390.64 Kb     390.64 Kb       3.87 Mb       3.87 Mb           807  
                                    aten::scalar_tensor         0.00%       9.000us         0.00%       9.000us       9.000us       0.000us         0.00%       0.000us       0.000us           8 b           8 b           0 b           0 b             1  
                                          aten::random_         0.00%      22.000us         0.00%      22.000us      11.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                             aten::item         0.00%       6.000us         0.00%      10.000us       5.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                              aten::_local_scalar_dense         0.00%       4.000us         0.00%       4.000us       2.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             2  
                                          aten::resize_         0.00%       7.000us         0.00%       7.000us       0.003us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b          2615  
                                     aten::resolve_conj         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                      aten::resolve_neg         0.00%       0.000us         0.00%       0.000us       0.000us       0.000us         0.00%       0.000us       0.000us           0 b           0 b           0 b           0 b             1  
                                               aten::to         0.21%       2.013ms         2.86%      27.042ms      26.907us       0.000us         0.00%       5.850ms       5.821us           0 b           0 b       9.52 Mb     481.50 Kb          1005  
-------------------------------------------------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  ------------  
Self CPU time total: 945.407ms
Self CUDA time total: 83.583ms

Here, we significantly reduced the CPU memory needed to load the data, from 75.42 MB to 9.43 MB.
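
One plausible reading of these two figures is that they are the size of the batches delivered by the data loader over the 201 profiled iterations. The sketch below checks that arithmetic under two assumptions that are not stated in the profiler output: images are float32 tensors of shape 3x32x32 (4 bytes per value) and labels are int64 (8 bytes each). The helper name `loader_cpu_bytes` is made up for this example.

```python
def loader_cpu_bytes(batch_size, num_batches=201):
    """Estimate the bytes of batch tensors produced over num_batches iterations."""
    image_bytes = batch_size * 3 * 32 * 32 * 4  # float32 images (assumed)
    label_bytes = batch_size * 8                # int64 labels (assumed)
    return (image_bytes + label_bytes) * num_batches

estimates = {}
for batch_size in (32, 4):
    mib = loader_cpu_bytes(batch_size) / 1024**2
    estimates[batch_size] = mib
    print(f"batch_size={batch_size}: {mib:.2f} MiB")
```

Under these assumptions the estimates come out at 75.42 and 9.43, matching the CPU Mem column for the data loader row in both tables. Because the number of profiled batches stays at 201, the saving comes from loading one-eighth as many samples, not from each sample using less memory.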

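In this blog, we showed how profiling execution time and memory consumption can help make the model training process more efficient. We encourage readers to experiment with different optimization strategies to get the best results.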