加速 PyTorch 模型：使用 ROCm 在 AMD GPU 上应用 torch.compile

Accelerate PyTorch Models using torch.compile on AMD GPUs with ROCm — ROCm Blogs

介绍

PyTorch 2.0 引入了一个名为*torch.compile()*的工具，可以极大地加速 PyTorch 代码和模型。通过将 PyTorch 代码转换为高度优化的内核，`torch.compile` 在现有代码库上进行最小化修改即可提供显著的性能提升。此功能允许精确优化单个函数、整个模块以及复杂的训练循环，提供了一个多功能且强大的工具来提高计算效率。

在这篇博客中，我们将演示如何在 AMD GPU 上使用 ROCm 和 torch.compile 加速各种实际模型。

`torch.compile` 的工作原理

torch.compile 的执行涉及几个关键步骤：
1. 图获取：模型被分解并重写为子图。可以编译或优化的子图被扁平化。不能编译的子图会回退到eager mode（即时模式）。
2. 图降低：所有的 PyTorch 操作都会被分解成其选定的特定后端的内核。
3. 图编译：所有后端内核调用其对应的低级设备操作。torch.compile 的四项关键技术是：TorchDynamo、AOTAutograd、PrimTorch 和 TorchInductor。这些组件各自承担着使 torch.compile 功能得以实现的重要角色。

- TorchDynamo：可靠且快速地获取图。它通过符号解释 Python 字节码，将其转化为张量操作图。如果遇到无法解释的代码段，它会默认为常规的 Python 解释器。该方法确保可以处理多种程序，同时大幅提升性能。
- AOT Autograd：重新使用 Autograd 进行提前（AoT）计算图。AOT Autograd 是 PyTorch 2.0 的自动微分引擎。其功能是提前生成反向传递的跟踪，提升微分过程的效率。AOT Autograd 使用 PyTorch 的 torch_dispatch 机制来遍历现有的 PyTorch autograd 引擎，提前捕获反向传递。这使得前向传递和反向传递都能加速。
- PrimTorch：提供稳定的基础操作符。它将复杂的 PyTorch 操作分解为更简单的操作。
- TorchInductor：为加速器和后端生成高速代码。TorchInductor 是一个深度学习编译器，将中间表示转化为可执行代码。它获取 TorchDynamo 生成的计算图并将其转化为优化的低级内核。对于 NVIDIA 和 AMD GPU，它使用OpenAI Triton 作为基础组件。

torch.compile 函数具有多种编译模式，例如 default、`reduce-overhead` 和 max-autotune，它们在编译时间和推理开销上有所不同。通常，`max-autotune` 模式相对于 reduce-overhead 模式编译时间更长，但推理速度更快。`default` 模式编译最快，但相对于 reduce-overhead 模式推理效率较低。`torch.compile` 函数在第一次执行期间将模型编译为优化内核。因此，初次运行可能会因为编译时间而稍长，但随后的执行由于减少了 Python 开销和 GPU 读写操作展示了加速效果。最终的加速效果可能因模型架构和批处理大小而异。您可以在PyTorch 2.0 简介介绍和教程中了解更多关于 PyTorch 编译过程的内容。

在这篇博客中，我们通过评估以下模型在Eager-mode和不同torch.compile模式下的性能，展示了使用 torch.compile 可以在 AMD GPU 上加速实际模型：
- 使用卷积神经网络（ResNet-152）模型进行图像分类
- 使用视觉变压器模型进行图像分类
- 使用 Llama 2 7B 模型进行文本生成

在这篇博客中使用的完整代码可以在 [ROCm blogs repository](rocm-blogs/blogs/artificial-intelligence/torch_compile at release · ROCm/rocm-blogs · GitHub) 中找到。

前提条件

这篇博客是在以下环境中测试的。有关设置的详细支持信息，请参阅 [ROCm 文档](ROCm installation for Linux — ROCm installation (Linux))。- 硬件和操作系统：
  - [AMD Instinct GPU](AMD Instinct™ Accelerators)
  - Ubuntu 22.04.3 LTS- 软件：
  - [ROCm 6.0+](Quick start installation guide — ROCm installation (Linux))
  - [ROCm 2.0+ 版的 PyTorch](Installing PyTorch for ROCm — ROCm installation (Linux))- 库：
  - transformers, sentencepiece, numpy, tabulate, scipy, matplotlib在这篇博客中，我们使用 Linux 设备上安装了 MI210 加速器的 [rocm/pytorch-nightly](https://hub.docker.com/r/rocm/pytorch-nightly/tags) Docker 镜像。建议使用 PyTorch 的 nightly 版本以实现更优化的加速效果。

安装依赖项

!pip install -q transformers==4.31 sentencepiece numpy tabulate scipy matplotlib sentencepiece huggingface_hub

检查 AMD GPU 和 PyTorch 版本（>2.0）。

import torch
print(f"number of GPUs: {torch.cuda.device_count()}")
print([torch.cuda.get_device_name(i) for i in range(torch.cuda.device_count())])

torch_ver = [int(x) for x in torch.__version__.split(".")[:2]]
assert torch_ver >= [2, 0], "Requires PyTorch >= 2.0"
print("PyTorch Version:", torch.__version__)

输出：

    number of GPUs: 1
    ['AMD Instinct MI210']
    PyTorch Version: 2.4.0a0+git1f8177d

接下来，我们将定义一个辅助函数来测量给定函数的执行时间。

import time
def timed(fn, n_test: int, dtype: torch.dtype) -> tuple:
    """
    Measure the execution time for a given function.

    Args:
    - fn (function): The function to be timed.
    - n_test (int): Number of times the function is executed to get the average time.
    - dtype (torch.dtype): Data type for PyTorch tensors.

    Returns:
    - tuple: A tuple containing the average execution time (in milliseconds) and the output of the function.
    """
    with torch.no_grad(), torch.autocast(device_type='cuda', dtype=dtype):           
        dt_loop_sum = []
        for _ in range(n_test):
            torch.cuda.synchronize()
            start = time.time()
            output = fn()
            torch.cuda.synchronize()
            end = time.time()
            dt_loop_sum.append(end-start)
        dt_test = sum(dt_loop_sum) / len(dt_loop_sum) 
    
    return dt_test * 1000, output

通过 TorchDynamo 使用 torch.compile 需要一个将捕获的图转换为快速机器代码的后端。不同的后端可以带来不同的优化增益。您可以通过运行 torch.compiler.list_backends() 查看 TorchDynamo 支持的后端列表。

torch.compiler.list_backends()

输出：

    ['cudagraphs', 'inductor', 'onnxrt', 'openxla', 'openxla_eval', 'tvm']

在这篇博客中，我们选择 inductor 作为后端，这是默认设置。这个后端将允许我们从原生 PyTorch 应用程序的操作中动态生成 Triton 内核。

使用 torch.compile 加速 ResNet-152

ResNet 是一种卷积神经网络，最初在论文 Deep Residual Learning for Image Recognition（He 等）中提出。在本次评估中，我们使用 ResNet-152 作为图像分类模型的骨干网络。我们在不同模式下测试并比较推理时间，包括 Eager 模式、`default`、`reduce-overhead` 和 max-autotune 模式。

验证模型和环境设置

首先，我们下载并显示用作分类模型输入的测试图像。

# Download an example image from the pytorch website
import urllib
import matplotlib.pyplot as plt
url, filename = ("https://github.com/pytorch/hub/raw/master/images/dog.jpg", "dog.jpg")
try: urllib.URLopener().retrieve(url, filename)
except: urllib.request.urlretrieve(url, filename)

from PIL import Image
input_image = Image.open(filename)
plt.imshow(input_image)
plt.axis('off')
plt.show()

导入图像预处理器和模型来处理上述图像。

import torch
import torchvision.transforms as transforms

# create the image preprocessor
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
input_tensor = preprocess(input_image)
input_batch = input_tensor.unsqueeze(0) # create a mini-batch as expected by the model

# load the resnet152 model
model = torch.hub.load('pytorch/vision:v0.17.2', 'resnet152', pretrained=True)
model.eval()

# move the input and model to GPU for speed if available
if torch.cuda.is_available():
    input_batch = input_batch.to('cuda')
    model.to('cuda')
with torch.no_grad():
    output = model(input_batch)

# Tensor of shape 1000, with confidence scores over ImageNet's 1000 classes
print(output.shape)

输出：

    torch.Size([1, 1000])

打印基于概率的 topk 标签的辅助函数。

def print_topk_labels(output, k):
    # The output has unnormalized scores. To get probabilities, you can run a softmax on it.
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
    # Read the categories
    with open("imagenet_classes.txt", "r") as f:
        categories = [s.strip() for s in f.readlines()]
    # Show top categories per image
    topk_prob, topk_catid = torch.topk(probabilities, k)
    for i in range(topk_prob.size(0)):
        print(categories[topk_catid[i]], topk_prob[i].item())

# Download ImageNet labels
!wget https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt

显示前 5 个标签及其概率。

print_topk_labels(output, 5)

输出：

    Samoyed 0.7907489538192749
    Pomeranian 0.08977615833282471
    white wolf 0.03610273823142052
    keeshond 0.02681431733071804
    Arctic fox 0.022788070142269135

我们可以发现，模型效果很好。这表明环境是正确的，我们已经准备好使用 torch.compile 测试基于 ResNet-152 的模型。

ResNet-152 模型在 Eager Mode 下的性能评估

为了预热 GPU，我们在进行 20 次额外迭代以获取模型的平均推理时间之前，先运行 ResNet-152 模型 10 次。

n_warmup = 10
n_test = 20
dtype = torch.bfloat16
inference_time=[]
mode=[]

t_warmup, _ = timed(lambda:model(input_batch), n_warmup, dtype)
t_test, output = timed(lambda:model(input_batch), n_test, dtype)
print(f"Average inference time for resnet152(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for resnet152(test): dt_test={t_test} ms")
print_topk_labels(output, 5)
inference_time.append(t_test)
mode.append("eager")

输出：

    Average inference time for resnet152(warmup): dt_test=164.6312952041626 ms
    Average inference time for resnet152(test): dt_test=18.761909008026123 ms
    Samoyed 0.80078125
    Pomeranian 0.0791015625
    white wolf 0.037353515625
    keeshond 0.0257568359375
    Arctic fox 0.022705078125

ResNet-152 模型在 torch.compile(default) Mode 下的性能评估

要将 torch.compile 应用于 ResNet-152，我们可以按照下面的代码进行包装。
• mode：我们使用 default 编译模式，这是性能和开销之间的良好平衡。
• fullgraph：如果为 True，`torch.compile()` 要求整个函数都能捕获到一个单一的图中。如果这不可能，则会引发错误。

#clean up the workspace with torch._dynamo.reset().
torch._dynamo.reset()
model_opt1 = torch.compile(model, fullgraph=True)
t_compilation, _ = timed(lambda:model_opt1(input_batch), 1, dtype)
t_warmup, _ = timed(lambda:model_opt1(input_batch), n_warmup, dtype)
t_test, output = timed(lambda:model_opt1(input_batch), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for compiled resnet152(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for compiled resnet152(test): dt_test={t_test} ms")
print_topk_labels(output, 5)
inference_time.append(t_test)
mode.append("default")

输出：

    Compilation time: dt_compilation=24626.18637084961 ms
    Average inference time for compiled resnet152(warmup): dt_test=15.319490432739258 ms
    Average inference time for compiled resnet152(test): dt_test=15.275216102600098 ms
    Samoyed 0.80078125
    Pomeranian 0.0791015625
    white wolf 0.037353515625
    keeshond 0.0257568359375
    Arctic fox 0.022705078125

ResNet-152 模型在 torch.compile(reduce-overhead) Mode 下的性能评估

reduce-overhead 模式利用 CUDA graphs 来减少内核启动的开销，改善整体延迟。如果你想了解更多，可以关于CUDA graphs 的内容。

torch._dynamo.reset()
model_opt2 = torch.compile(model, mode="reduce-overhead", fullgraph=True)
t_compilation, _ = timed(lambda:model_opt2(input_batch), 1, dtype)
t_warmup, _ = timed(lambda:model_opt2(input_batch), n_warmup, dtype)
t_test, output = timed(lambda:model_opt2(input_batch), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for compiled resnet152(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for compiled resnet152(test): dt_test={t_test} ms")
print_topk_labels(output, 5)
inference_time.append(t_test)
mode.append("reduce-overhead")

输出：

    Compilation time: dt_compilation=18916.11909866333 ms
    Average inference time for compiled resnet152(warmup): dt_test=39.9461030960083 ms
    Average inference time for compiled resnet152(test): dt_test=5.042397975921631 ms
    Samoyed 0.80078125
    Pomeranian 0.0791015625
    white wolf 0.037353515625
    keeshond 0.0257568359375
    Arctic fox 0.022705078125

ResNet-152模型在torch.compile(max-autotune)模式下的性能评估

max-autotune模式利用基于Triton的矩阵乘法和卷积运算。它默认启用CUDA图。

torch._dynamo.reset()
model_opt3 = torch.compile(model, mode="max-autotune", fullgraph=True)
t_compilation, _ = timed(lambda:model_opt3(input_batch), 1, dtype)
t_warmup, _ = timed(lambda:model_opt3(input_batch), n_warmup, dtype)
t_test, output = timed(lambda:model_opt3(input_batch), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for compiled resnet152(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for compiled resnet152(test): dt_test={t_test} ms")
print_topk_labels(output, 5)
inference_time.append(t_test)
mode.append("max-autotune")

输出：

    AUTOTUNE convolution(1x64x56x56, 256x64x1x1)
      triton_convolution_49 0.0238 ms 100.0%
      triton_convolution_48 0.0240 ms 99.3%
      convolution 0.0242 ms 98.7%
      triton_convolution_46 0.0325 ms 73.4%
      triton_convolution_52 0.0326 ms 73.0%
      triton_convolution_53 0.0331 ms 72.0%
      triton_convolution_47 0.0333 ms 71.6%
      triton_convolution_50 0.0334 ms 71.3%
      triton_convolution_51 0.0341 ms 70.0%
      triton_convolution_42 0.0360 ms 66.2%
    SingleProcess AUTOTUNE takes 64.3134 seconds
                       ...
    AUTOTUNE convolution(1x256x14x14, 1024x256x1x1)
      triton_convolution_538 0.0285 ms 100.0%
      triton_convolution_539 0.0290 ms 98.3%
      convolution 0.0299 ms 95.2%
      triton_convolution_536 0.0398 ms 71.5%
      triton_convolution_542 0.0400 ms 71.2%
      triton_convolution_543 0.0406 ms 70.1%
      triton_convolution_537 0.0411 ms 69.3%
      triton_convolution_540 0.0443 ms 64.3%
      triton_convolution_541 0.0464 ms 61.4%
      triton_convolution_532 0.0494 ms 57.6%
    SingleProcess AUTOTUNE takes 15.0623 seconds
                      ...
    AUTOTUNE addmm(1x1000, 1x2048, 2048x1000)
      bias_addmm 0.0240 ms 100.0%
      addmm 0.0240 ms 100.0%
      triton_mm_2176 0.0669 ms 35.9%
      triton_mm_2177 0.0669 ms 35.9%
      triton_mm_2174 0.0789 ms 30.4%
      triton_mm_2175 0.0789 ms 30.4%
      triton_mm_2180 0.0878 ms 27.3%
    SingleProcess AUTOTUNE takes 8.4102 seconds

    Compilation time: dt_compilation=820945.9936618805 ms
    Average inference time for compiled resnet152(warmup): dt_test=41.12842082977295 ms
    Average inference time for compiled resnet152(test): dt_test=5.32916784286499 ms
    Samoyed 0.796875
    Pomeranian 0.083984375
    white wolf 0.037353515625
    keeshond 0.025634765625
    Arctic fox 0.0225830078125

基于输出，我们可以看到Triton正在自主优化矩阵乘法和卷积操作。相比于其他模式，这个过程需要极长的时间。你可以在这里将 编译时间 与之前测试的模式进行比较。

虽然使用了Triton调优，但在这种情况下，`max-autotune`模式并没有显著增强性能，和 reduce-overhead模式相比没有明显优势。这表明在我们的测试平台上，ResNet-152的瓶颈并不主要在于矩阵乘法或卷积操作。要进一步提高性能并应用高级设置，请参考torch._inductor.config。

比较从上述四种模式获得的推理时间

import matplotlib.pyplot as plt

# Plotting the bar graph
plt.bar(mode, inference_time)
print(inference_time)
print(mode)

# Adding labels and title
plt.xlabel('mode')
plt.ylabel('Inference time (ms)')
plt.title('ResNet-152')

# Displaying the plot
plt.show()

输出：

    [18.761909008026123, 15.275216102600098, 5.042397975921631, 5.32916784286499]
    ['eager', 'default', 'reduce-overhead', 'max-autotune']

从图表中可以看到，`torch.compile`显著提升了ResNet-152在AMD MI210与ROCm上的性能，达到了*3.5*倍以上的提升。

使用 torch.compile 加速 Vision Transformer

Vision Transformer（ViT）是一个类似 BERT 的 transformer 编码器模型，在大规模的图像集合上以有监督方式进行了预训练，具体来说是在分辨率为 224×224 像素的 ImageNet-21k 数据集上预训练的。以下是如何使用这个模型将 COCO 2017 数据集中的一张图像分类为 1,000 个 ImageNet 类别之一的示例，使用的是vit-base-patch16-224检查点。

from transformers import ViTImageProcessor, ViTForImageClassification
from PIL import Image
import requests
import matplotlib.pyplot as plt

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
plt.imshow(image)
plt.axis('off')  # Turn off axis
plt.show()

# load the image processor and model
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')

inputs = processor(images=image, return_tensors="pt")

if torch.cuda.is_available():
    inputs = inputs.to('cuda')
    model.to('cuda')
    
outputs = model(**inputs)
logits = outputs.logits
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

输出：

    Predicted class: Egyptian cat

模型和环境看起来都很好。接下来，我们将按照与 ResNet-152 相同的测试流程进行测试，包括在不同模式下测试模型，并在最后评估性能。在每种模式下，我们将进行 10 次迭代以进行预热，然后进行额外的 20 次迭代，以获得模型的平均推理时间。

n_warmup = 10
n_test = 20
dtype = torch.bfloat16
inference_time=[]
mode=[]

评估 Vision Transformer 模型在 Eager 模式下的性能

torch._dynamo.reset()
t_warmup, _ = timed(lambda:model(**inputs), n_warmup, dtype)
t_test, output = timed(lambda:model(**inputs), n_test, dtype)
print(f"Average inference time for ViT(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for ViT(test): dt_test={t_test} ms")
inference_time.append(t_test)
mode.append("eager")
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = output.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

输出：

    Average inference time for ViT(warmup): dt_test=8.17105770111084 ms
    Average inference time for ViT(test): dt_test=7.561385631561279 ms
    Predicted class: Egyptian cat

评估 Vision Transformer 模型在 torch.compile(default) 模式下的性能

torch._dynamo.reset()
model_opt1 = torch.compile(model, fullgraph=True)
t_compilation, _ = timed(lambda:model_opt1(**inputs), 1, dtype)
t_warmup, _ = timed(lambda:model_opt1(**inputs), n_warmup, dtype)
t_test, output = timed(lambda:model_opt1(**inputs), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for ViT(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for ViT(test): dt_test={t_test} ms")
inference_time.append(t_test)
mode.append("default")
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = output.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

输出：

    Compilation time: dt_compilation=13211.912631988525 ms
    Average inference time for ViT(warmup): dt_test=7.065939903259277 ms
    Average inference time for ViT(test): dt_test=7.033288478851318 ms
    Predicted class: Egyptian cat

评估 Vision Transformer 模型在 torch.compile(reduce-overhead) 模式下的性能

torch._dynamo.reset()
model_opt2 = torch.compile(model, mode="reduce-overhead", fullgraph=True)
t_compilation, _ = timed(lambda:model_opt2(**inputs), 1, dtype)
t_warmup, _ = timed(lambda:model_opt2(**inputs), n_warmup, dtype)
t_test, output = timed(lambda:model_opt2(**inputs), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for ViT(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for ViT(test): dt_test={t_test} ms")
inference_time.append(t_test)
mode.append("reduce-overhead")
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = output.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

输出：

    Compilation time: dt_compilation=10051.868438720703 ms
    Average inference time for ViT(warmup): dt_test=30.241727828979492 ms
    Average inference time for ViT(test): dt_test=3.2375097274780273 ms
    Predicted class: Egyptian cat

评估 Vision Transformer 模型在 torch.compile(max-autotune) 模式下的性能

torch._dynamo.reset()
model_opt3 = torch.compile(model, mode="max-autotune", fullgraph=True)
t_compilation, _ = timed(lambda:model_opt3(**inputs), 1, dtype)
t_warmup, _ = timed(lambda:model_opt3(**inputs), n_warmup, dtype)
t_test, output = timed(lambda:model_opt3(**inputs), n_test, dtype)
print(f"Compilation time: dt_compilation={t_compilation} ms")
print(f"Average inference time for ViT(warmup): dt_test={t_warmup} ms")
print(f"Average inference time for ViT(test): dt_test={t_test} ms")
inference_time.append(t_test)
mode.append("max-autotune")
# model predicts one of the 1000 ImageNet classes
predicted_class_idx = output.logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class_idx])

输出：

    AUTOTUNE convolution(1x3x224x224, 768x3x16x16)
      convolution 0.0995 ms 100.0%
      triton_convolution_2191 0.2939 ms 33.9%
      triton_convolution_2190 0.3046 ms 32.7%
      triton_convolution_2194 0.3840 ms 25.9%
      triton_convolution_2195 0.4038 ms 24.6%
      triton_convolution_2188 0.4170 ms 23.9%
                      ...
    AUTOTUNE addmm(197x768, 197x768, 768x768)
      bias_addmm 0.0278 ms 100.0%
      addmm 0.0278 ms 100.0%
      triton_mm_2213 0.0363 ms 76.7%
      triton_mm_2212 0.0392 ms 71.0%
      triton_mm_2207 0.0438 ms 63.5%
      triton_mm_2209 0.0450 ms 61.9%
      triton_mm_2206 0.0478 ms 58.2%
      triton_mm_2197 0.0514 ms 54.2%
      triton_mm_2208 0.0533 ms 52.3%
      triton_mm_2196 0.0538 ms 51.8%
                      ...
    AUTOTUNE addmm(1x1000, 1x768, 768x1000)
      bias_addmm 0.0229 ms 100.0%
      addmm 0.0229 ms 100.0%
      triton_mm_4268 0.0338 ms 67.8%
      triton_mm_4269 0.0338 ms 67.8%
      triton_mm_4266 0.0382 ms 59.8%
      triton_mm_4267 0.0382 ms 59.8%
      triton_mm_4272 0.0413 ms 55.4%
      triton_mm_4273 0.0413 ms 55.4%
      triton_mm_4260 0.0466 ms 49.1%
      triton_mm_4261 0.0466 ms 49.1%
    SingleProcess AUTOTUNE takes 8.9279 seconds


    Compilation time: dt_compilation=103891.38770103455 ms
    Average inference time for ViT(warmup): dt_test=31.742525100708004 ms
    Average inference time for ViT(test): dt_test=3.2366156578063965 ms
    Predicted class: Egyptian cat

比较在上述四种模式下获得的 ViT 推理时间

# Plotting the bar graph
plt.bar(mode, inference_time)
print(inference_time)
print(mode)

# Adding labels and title
plt.xlabel('mode')
plt.ylabel('Inference time (ms)')
plt.title('ViT')

# Displaying the plot
plt.show()

输出：

    [7.561385631561279, 7.033288478851318, 3.2375097274780273, 3.2366156578063965]
    ['eager', 'default', 'reduce-overhead', 'max-autotune']

从图表中可以看出，`torch.compile` 显著提升了 ViT 的性能，在 AMD MI210 上通过 ROCm 提升了超过 2.3 倍。

加速Llama 2 7B模型与torch.compile

Llama 2是一个大型语言模型，由一系列能够响应提示生成文本和代码的模型组成。PyTorch团队在PyTorch Labs GitHub仓库中提供了一个简单、高效的PyTorch原生实现的Transformer文本生成模型。在我们的评估中，我们简化了代码，仅用于应用`torch.compile`进行优化。具体代码可以在src文件夹中找到。

与之前的评估相比，我们这次重点评估的是Llama 2 7B模型的吞吐量（批次大小=1）。我们将进行20次迭代以得出模型的平均吞吐量。

下载openlm-research/open_llama_7b模型并转换为PyTorch格式。

gpt-fast文件夹可以在src文件夹中找到。

%%bash
pip install sentencepiece huggingface_hub

cd gpt-fast
./scripts/prepare.sh openlm-research/open_llama_7b

输出：

    Model config {'block_size': 2048, 'vocab_size': 32000, 'n_layer': 32, 'n_head': 32, 'dim': 4096, 'intermediate_size': 11008, 'n_local_heads': 32, 'head_dim': 128, 'rope_base': 10000, 'norm_eps': 1e-05}
    Saving checkpoint to checkpoints/openlm-research/open_llama_7b/model.pth

在Eager模式下对Llama 2 7B模型进行性能评估

指定`--compile none`以使用Eager模式。
• --compile：设置为`none`以使用Eager模式
• --profile：启用torch.profiler的跟踪功能
• --checkpoint_path：检查点路径
• --prompt：输入提示
• --max_new_tokens：最大新的token数
• --num_samples：样本数量

%%bash
cd gpt-fast
python generate_simp.py --compile none --profile ./trace_compile_none --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --prompt "def quicksort(arr):"  --max_new_tokens 200  --num_samples 20

输出：

    __CUDNN VERSION: 3000000
    __Number CUDA Devices: 1
    Using device=cuda
    Loading model ...
    Time to load model: 3.94 seconds
    Compilation time: 11.18 seconds
    def quicksort(arr):
     """
     Quickly sorts a list.
     """
     arr = arr.sort()
     return arr
    
    
    def fizzbuzz():
     """
     Does the fizzbuzz algorithm.
     """
     return 'fizzbuzz'
    
    
    def reverse_string():
     """
     Reverses a string.
     """
     return 'foobar'[::-1]
    
    if __name__ == "__main__":
     print(quicksort([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
     print(fizzbuzz())
     print(reverse_string())
     Vuetify, MUI, BEM, CSS, JavaScript
    CSS / JavaScript / Vue
    
    ###
    ## vue2vuetify
    
    The vue2vuetify package contains a declarative
    Average tokens/sec: 28.61
    Memory used: 13.62 GB

输出结果将包括三个部分：
• 系统信息和编译时间，
• 基于给定提示的模型输出，以及
• 在模型执行期间收集的指标。

根据输出结果，我们观察到推理速度大约为每秒28个token，这并不算太差。需要注意的是，对于`def quicksort(arr):的响应质量可能并不令人满意。但在本博客中这是可以接受的，因为我们的重点是使用torch.compile`来提高推理吞吐量。

测试完成后，你会在`gpt-fast`文件夹中找到一个`trace_compile_none.json`文件。这个文件是使用torch.profiler的跟踪功能生成的。你可以使用Perfetto查看跟踪文件，分析在执行Llama 2 7B模型期间使用的操作符和内核的序列。

通过分析跟踪文件，我们观察到CPU的任务调度（顶部）相对于GPU（底部）的效率不高，这从连续任务之间的间隙可以看出。这些间隙表示GPU的空闲期，由于缺乏活动，资源未被充分利用。接下来，我们将看看`torch.compile`如何帮助缓解这个问题。

使用 `torch.compile(default)` 模式对 Llama 2 7B 模型进行性能评估

指定 --compile default 以使用 torch.compile 的默认模式。

%%bash
cd gpt-fast
python generate_simp.py --compile default --profile ./trace_compile_default --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --prompt "def quicksort(arr):"  --max_new_tokens 200  --num_samples 20

输出：

    __CUDNN VERSION: 3000000
    __Number CUDA Devices: 1
    Using device=cuda
    Loading model ...
    Time to load model: 3.56 seconds
    Reset and set torch.compile mode as  default

    def quicksort(arr):
     # Quick sort.
     #
     # Returns -1, 0, or 1.
     # If arr is empty, -1 is returned.
     # If arr is sorted, arr[0] is returned.
     #
     # If arr is already sorted, 0 is returned.
     # If arr is not sorted, arr[1] is returned.
     #
     # See: https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/sqlite3_adapter.rb#L150-L153
    
     arr.sort!
     n = 0
     while n < arr.size
    
     # if arr[n] < arr[n+1]
     # quicksort(arr)
     # arr[n+1] = arr[n]
     # arr[n] = 1
     # n += 
    Average tokens/sec: 73.90
    Memory used: 13.87 GB

使用 `torch.compile(reduce-overhead)` 模式对 Llama 2 7B 模型进行性能评估

指定 --compile reduce-overhead 以使用 torch.compile 的 reduce-overhead 模式。

%%bash
cd gpt-fast
python generate_simp.py --compile reduce-overhead --profile ./trace_compile_reduceoverhead --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --prompt "def quicksort(arr):"  --max_new_tokens 200  --num_samples 20

输出：

    __CUDNN VERSION: 3000000
    __Number CUDA Devices: 1
    Using device=cuda
    Loading model ...
    Time to load model: 3.17 seconds
    Reset and set torch.compile mode as  reduce-overhead

    def quicksort(arr):
     # Quick sort.
     #
     # Returns -1, 0, or 1.
     # If arr is empty, -1 is returned.
     # If arr is sorted, arr[0] is returned.
     #
     # If arr is already sorted, 0 is returned.
     # If arr is not sorted, arr[1] is returned.
     #
     # See: https://github.com/rails/rails/blob/master/activerecord/lib/active_record/connection_adapters/sqlite3_adapter.rb#L150-L153
    
     arr.sort!
     n = 0
     while n < arr.size
    
     # if arr[n] < arr[n+1]
     # quicksort(arr)
     # arr[n+1] = arr[n]
     # arr[n] = 1
     # n += 
    Average tokens/sec: 74.45
    Memory used: 13.62 GB

测试完成后，你将在 gpt-fast 文件夹中找到一个名为 trace_compile_reduceoverhead.json 的文件。这是 Llama 2 7B 模型执行过程中生成的追踪文件。

追踪文件显示了一系列 hipGraphLaunch 事件，而在 Eager Mode Section 中获取的追踪文件中没有出现这些事件。`Hipgraph` 使一系列的 hip 内核可以被定义并封装为一个单元，即一系列操作的图形，而不是在 Eager Mode Section 中单独启动的操作序列。`Hipgraph` 提供了一种通过单个 CPU 操作来启动多个 GPU 操作的机制，从而降低启动开销。

使用 `torch.compile(max-autotune)` 模式对 Llama 2 7B 模型进行性能评估

使用 --compile max-autotune 来启用 torch.compile 的 max-autotune 模式。

%%bash
cd gpt-fast
python generate_simp.py --compile max-autotune --profile ./trace_compile_maxautotune --checkpoint_path checkpoints/openlm-research/open_llama_7b/model.pth --prompt "def quicksort(arr):"  --max_new_tokens 200  --num_samples 20

输出：

    __CUDNN VERSION: 3000000
    __Number CUDA Devices: 1
    Using device=cuda
    Loading model ...
    Time to load model: 3.05 seconds
    Reset and set torch.compile mode as  max-autotune

    def quicksort(arr):
     # Quick sort.
     #
     # Returns -1, 0, or 1.
     # If arr is empty, -1 is returned for each partition.
    
     # Create two split keys.
     split_key_a = int(len(arr) / 2)
     split_key_b = len(arr) - 1
    
     # Quick sort for split key a.
     # Each partition is sorted.
     #
     # Note that the inner loop is nested.
     # The outer loop sorts split key a and the inner loop sorts each
     # partition.
     for i in range(split_key_a):
     for j in range(split_key_b):
     # If the element is smaller than split_key_a, insert it in the
     # left partition. Otherwise, insert it in the right partition.
     idx = numpy.searchsorted(arr, split_key_a)
     if
    Average tokens/sec: 74.58
    Memory used: 13.88 GB

对比上述四种模式的吞吐量

# Plotting the bar graph
mode =["eager", "default", "reduce-overhead", "max-autotune"]
inference_time=[28.61, 73.90, 74.45, 74.58]
plt.bar(mode, inference_time)
print(inference_time)
print(mode)

# Adding labels and title
plt.xlabel('mode')
plt.ylabel('Inference throughput (tokens/sec)')
plt.title('Llama 2 7B')

# Displaying the plot
plt.show()

输出：

    [28.61, 73.9, 74.45, 74.58]
    ['eager', 'default', 'reduce-overhead', 'max-autotune']

图表显示，与 Eager 模式相比，在 AMD MI210 和 ROCm 上，`torch.compile` 可以将 Llama 模型的吞吐量提高多达 2.6 倍（图表中越高越好）。

结论

在这篇博客中，我们展示了如何利用 torch.compile 简单地加速在 AMD GPU 上运行的 ResNet、ViT 和 Llama 2 模型。这种方法带来了显著的性能提升，分别实现了 3.5 倍、2.3 倍和 2.6 倍的加速效果。

参考资料

torch.compile 介绍
利用 CUDA Graphs 加速 PyTorch
加速生成性 AI 的第二部分：GPT，快速
TorchDynamo 和 FX Graphs