[Object Detection] YOLO Series: Triton Inference Server with Ultralytics YOLO11

2025/4/7 14:10:05

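Triton Inference Server

1. Introduction
2. Triton Server
- 2.1 What is Triton Inference Server
- 2.2 Exporting YOLO11 to ONNX format
- 2.3 Setting up the Triton model repository
  - 2.3.1 Creating the directory structure
  - 2.3.2 Moving the exported ONNX model to the Triton repository
- 2.4 Running the Triton Inference Server
  - 2.4.1 Running Triton Inference Server with Docker
  - 2.4.2 Running inference with the Triton Server model
  - 2.4.3 Cleaning up the container
- 2.4.5 How to set up Ultralytics YOLO11 with NVIDIA Triton Inference Server
3. Summary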

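1. Introduction

Xiao Diaosi: Brother Yu, it's really cold today.
Xiao Yu: You bet. This weather is perfect for an iron-pot stew.
Xiao Diaosi: Iron-pot stew... of what?
Xiao Yu: Let's go play ring toss.
Xiao Diaosi: What's so great about ring toss?
Xiao Yu: I hear you can win a goose.
Xiao Diaosi: ... Brother Yu, let's skip the middle step and just go eat.
Xiao Yu: ... No goose from the ring toss? Then what do we eat?
Xiao Diaosi: That's just one phone call away.
Xiao Yu: Oh my, I'd like to see where that phone call of yours gets us.
Xiao Diaosi: Don't worry, it will be stewing by the time we get there.
Xiao Yu: We're going right now?
Xiao Diaosi: Or would you rather go later?
Xiao Yu: The roads are slippery in the snow. I wouldn't feel right letting you go alone.
Xiao Diaosi: ... Brother Yu, I've noticed there are only two things you're really keen on.
Xiao Yu: What things?
Xiao Diaosi: Soaking in a bath and eating.
Xiao Yu: No comment.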

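2. Triton Server

2.1 What is Triton Inference Server

Triton Inference Server (formerly TensorRT Inference Server) is an open-source software solution developed by NVIDIA.

Triton Inference Server is designed for deploying a variety of AI models in production. It supports a wide range of deep learning and machine learning frameworks, including TensorFlow, PyTorch, and ONNX Runtime. Its main use cases include:

- Serving multiple models from a single server instance.
- Dynamically loading and unloading models without restarting the server.
- Ensemble inference, which allows multiple models to be used together to produce a result.
- Model versioning for A/B testing and rolling updates.

2.2 Exporting YOLO11 to ONNX format

Before deploying the model on Triton in this guide, export it to ONNX format. ONNX (Open Neural Network Exchange) is a format that allows models to be transferred between different deep learning frameworks. Use the export method of the YOLO class: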

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Retrieve metadata during export
metadata = []


def export_cb(exporter):
    metadata.append(exporter.metadata)


model.add_callback("on_export_end", export_cb)

# Export the model
onnx_file = model.export(format="onnx", dynamic=True)

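2.3 Setting up the Triton model repository

The Triton model repository is the storage location from which Triton accesses and loads models.

2.3.1 Creating the directory structure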

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
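
Triton expects each model in its own directory, with numbered subdirectories holding the versions. A minimal sketch, using only the standard library, that builds this layout in a temporary directory and lists the version folders it finds. The helper name `list_versions` and the second version folder are hypothetical and only serve the illustration; they are not part of Triton or Ultralytics.

```python
import tempfile
from pathlib import Path


def list_versions(model_dir: Path) -> list[int]:
    """Return the numeric version subdirectories of a model directory, sorted ascending."""
    return sorted(int(p.name) for p in model_dir.iterdir() if p.is_dir() and p.name.isdecimal())


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp) / "triton_repo" / "yolo"
    # Same layout as above: <repo>/<model_name>/<version>/
    (model_dir / "1").mkdir(parents=True, exist_ok=True)
    # Hypothetical second version, added only to show the sorting
    (model_dir / "2").mkdir(parents=True, exist_ok=True)
    versions = list_versions(model_dir)
    print(versions)
```

A check like this can be run before starting the server to catch a misplaced model file or a version folder with a non-numeric name.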

2.3.2 Moving the exported ONNX model to the Triton repository

from pathlib import Path

# Move ONNX model to Triton Model path
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")

# Create config file
(triton_model_path / "config.pbtxt").touch()

# (Optional) Enable TensorRT for GPU inference
# First run will be slow due to TensorRT engine conversion
data = """
optimization {
  execution_accelerators {
    gpu_execution_accelerator {
      name: "tensorrt"
      parameters {
        key: "precision_mode"
        value: "FP16"
      }
      parameters {
        key: "max_workspace_size_bytes"
        value: "3221225472"
      }
      parameters {
        key: "trt_engine_cache_enable"
        value: "1"
      }
      parameters {
        key: "trt_engine_cache_path"
        value: "/models/yolo/1"
      }
    }
  }
}
parameters {
  key: "metadata"
  value: {
    string_value: "%s"
  }
}
""" % metadata[0]

with open(triton_model_path / "config.pbtxt", "w") as f:
    f.write(data)
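
The repeated `parameters` blocks in the config above all follow one pattern, so they could also be generated from a dictionary. A sketch under that assumption: `render_trt_parameters` is a hypothetical helper, not part of Triton or Ultralytics, and the keys and values are the same ones used in the config above.

```python
def render_trt_parameters(params: dict[str, str], indent: int = 6) -> str:
    """Render key/value pairs as pbtxt `parameters { ... }` blocks."""
    pad = " " * indent
    blocks = []
    for key, value in params.items():
        blocks.append(
            f"{pad}parameters {{\n"
            f'{pad}  key: "{key}"\n'
            f'{pad}  value: "{value}"\n'
            f"{pad}}}"
        )
    return "\n".join(blocks)


trt_params = {
    "precision_mode": "FP16",
    "max_workspace_size_bytes": str(3 * 1024**3),  # 3 GiB, the same value as above
    "trt_engine_cache_enable": "1",
    "trt_engine_cache_path": "/models/yolo/1",
}
snippet = render_trt_parameters(trt_params)
print(snippet)
```

Keeping the values in a dictionary makes it easier to change the precision mode or the cache path in one place.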

2.4 Running the Triton Inference Server

2.4.1 Running Triton Inference Server with Docker

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"  # 8.57 GB

# Pull the image
subprocess.call(f"docker pull {tag}", shell=True)

# Run the Triton server and capture the container ID
container_id = (
    subprocess.check_output(
        # Docker bind mounts need an absolute host path, hence .resolve()
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

# Wait for the Triton server to start
triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

# Wait until model is ready
for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)
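
The loop above stops after ten attempts without reporting whether the model ever became ready. A sketch of a polling helper that returns an explicit result instead. `wait_until` is a hypothetical name, and `fake_is_ready` stands in for `triton_client.is_model_ready(model_name)` so the example runs without a server.

```python
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 10.0, interval: float = 1.0) -> bool:
    """Poll `check` until it returns True or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            if check():
                return True
        except Exception:
            pass  # the server may still be starting; treat errors as "not ready yet"
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Stand-in for the server: reports ready on the third poll.
calls = {"n": 0}


def fake_is_ready() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


ready = wait_until(fake_is_ready, timeout=2.0, interval=0.01)
print(ready)
```

With the real client this could be called as `wait_until(lambda: triton_client.is_model_ready(model_name), timeout=60)`. The 60-second value is an assumption; the first start with TensorRT enabled may take longer because of the engine conversion.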

2.4.2 Running inference with the Triton Server model

from ultralytics import YOLO

# Load the Triton Server model
model = YOLO("http://localhost:8000/yolo", task="detect")

# Run inference on the server
results = model("path/to/image.jpg")
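
The model string passed to `YOLO` above packs three things into one URL: the protocol, the server address, and the model name from the repository. A sketch of how such a URL can be taken apart with the standard library. It illustrates the convention only and is not the parsing code that Ultralytics itself runs; `split_triton_url` is a hypothetical helper.

```python
from urllib.parse import urlsplit


def split_triton_url(url: str) -> tuple[str, str, str]:
    """Split '<scheme>://<host>:<port>/<model_name>' into its three parts."""
    parts = urlsplit(url)
    model_name = parts.path.strip("/").split("/")[0]
    return parts.scheme, parts.netloc, model_name


scheme, address, name = split_triton_url("http://localhost:8000/yolo")
print(scheme, address, name)
```

The last part has to match the directory name used in the model repository, which is `yolo` in this walkthrough.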

2.4.3 Cleaning up the container

# Kill and remove the container at the end of the test
subprocess.call(f"docker kill {container_id}", shell=True)
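
If inference raises an exception, the `docker kill` line above is never reached and the container keeps running. A sketch of a context manager that always issues the kill command. `container_cleanup` and its `run` parameter are hypothetical; `run` is injectable so the example can be exercised without Docker, and `"abc123"` is a made-up container ID.

```python
import contextlib
import subprocess
from typing import Callable, Iterator


@contextlib.contextmanager
def container_cleanup(
    container_id: str,
    run: Callable[[list[str]], object] = subprocess.call,
) -> Iterator[str]:
    """Yield the container ID and kill the container on exit, even after an error."""
    try:
        yield container_id
    finally:
        run(["docker", "kill", container_id])


# Demonstration with a fake runner that only records the command.
issued: list[list[str]] = []
try:
    with container_cleanup("abc123", run=issued.append):
        raise RuntimeError("inference failed")
except RuntimeError:
    pass
print(issued)
```

In real use the default `run=subprocess.call` would execute the command, and the inference code would go inside the `with` block.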

2.4.5 How to set up Ultralytics YOLO11 with NVIDIA Triton Inference Server

Setting up Ultralytics YOLO11 with NVIDIA Triton Inference Server involves a few key steps:

1. Export YOLO11 to ONNX format:

from ultralytics import YOLO

# Load a model
model = YOLO("yolo11n.pt")  # load an official model

# Export the model to ONNX format
onnx_file = model.export(format="onnx", dynamic=True)

2. Set up the Triton model repository:

from pathlib import Path

# Define paths
model_name = "yolo"
triton_repo_path = Path("tmp") / "triton_repo"
triton_model_path = triton_repo_path / model_name

# Create directories
(triton_model_path / "1").mkdir(parents=True, exist_ok=True)
Path(onnx_file).rename(triton_model_path / "1" / "model.onnx")
(triton_model_path / "config.pbtxt").touch()

3. Run the Triton server:

import contextlib
import subprocess
import time

from tritonclient.http import InferenceServerClient

# Define image https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver
tag = "nvcr.io/nvidia/tritonserver:24.09-py3"

subprocess.call(f"docker pull {tag}", shell=True)

container_id = (
    subprocess.check_output(
        # Mount the repository at /models; Docker bind mounts need an absolute host path
        f"docker run -d --rm --gpus 0 -v {triton_repo_path.resolve()}:/models -p 8000:8000 {tag} tritonserver --model-repository=/models",
        shell=True,
    )
    .decode("utf-8")
    .strip()
)

triton_client = InferenceServerClient(url="localhost:8000", verbose=False, ssl=False)

for _ in range(10):
    with contextlib.suppress(Exception):
        assert triton_client.is_model_ready(model_name)
        break
    time.sleep(1)