https://github.com/triton-inference-server/tutorials/blob/main/Conceptual_Guide/Part_1-model_deployment/README.md
1. We want to use onnx-runtime as the inference backend, so the models must first be converted to ONNX format;
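For reference, a minimal sketch of such a conversion for a PyTorch model via torch.onnx.export (the stand-in network, tensor names, and shapes below are hypothetical, not the tutorial's detection/recognition models):

import torch

# Hypothetical stand-in network; the tutorial's actual models are converted
# with their own export scripts.
model = torch.nn.Conv2d(3, 1, kernel_size=3, padding=1).eval()
dummy = torch.randn(1, 3, 480, 640)  # N x C x H x W placeholder input

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["output"],  # names must match config.pbtxt and client code
    dynamic_axes={"input": {0: "batch", 2: "height", 3: "width"}},
)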
2. Model repo: create a directory (a local directory, a remote one, or Azure Blob storage all work) that holds each model's name (text_detection, text_recognition), versions (1, 2), config file (config.pbtxt), and model file (model.onnx). For example (a small script that assembles this layout follows the tree below):
model_repository/
├── text_detection
│   ├── 1
│   │   └── model.onnx
│   ├── 2
│   │   └── model.onnx
│   └── config.pbtxt
└── text_recognition
    ├── 1
    │   └── model.onnx
    └── config.pbtxt
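A small sketch (with hypothetical source paths under exported/ and configs/) that assembles this layout programmatically:

import os
import shutil

repo = "model_repository"
models = {"text_detection": ["1", "2"], "text_recognition": ["1"]}
for model_name, versions in models.items():
    for version in versions:
        version_dir = os.path.join(repo, model_name, version)
        os.makedirs(version_dir, exist_ok=True)
        # "exported/..." is a hypothetical location for the converted ONNX files.
        shutil.copy(f"exported/{model_name}_v{version}.onnx",
                    os.path.join(version_dir, "model.onnx"))
    # config.pbtxt sits next to the version directories.
    shutil.copy(f"configs/{model_name}.pbtxt",
                os.path.join(repo, model_name, "config.pbtxt"))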
3. config.pbtxt format:
name: "text_detection"
backend: "onnxruntime"
max_batch_size : 256
input [
{
name: "input_images:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 3 ]
}
]
output [
{
name: "feature_fusion/Conv_7/Sigmoid:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 1 ]
}
]
output [
{
name: "feature_fusion/concat_3:0"
data_type: TYPE_FP32
dims: [ -1, -1, -1, 5 ]
}
]
backend and max_batch_size must be written; input and output should be derivable by Triton automatically from the model file, so they can also be omitted;
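Once the server from steps 4-5 is running, the configuration Triton actually loaded (including any auto-completed input/output sections) can be inspected from the client; a small sketch:

import json
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
# Configuration as the server sees it, after any auto-completion.
print(json.dumps(client.get_model_config("text_detection"), indent=2))
# Metadata lists the model's inputs/outputs with datatypes and shapes.
print(client.get_model_metadata("text_detection"))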
4. Pull and start the Triton Server image from nvcr.io:
docker run --gpus=all -it --shm-size=256m --rm -p8000:8000 -p8001:8001 -p8002:8002 -v $(pwd)/model_repository:/models nvcr.io/nvidia/tritonserver:<yy.mm>-py3
5. Start the Triton server:
tritonserver --model-repository=/models
After a successful start, it prints output like the following (which models are READY; server version, host/GPU memory pool sizes and other options; the two inference ports and one metrics/status port):
I0712 16:37:18.246487 128 server.cc:626]
+------------------+---------+--------+
| Model | Version | Status |
+------------------+---------+--------+
| text_detection | 1 | READY |
| text_recognition | 1 | READY |
+------------------+---------+--------+
I0712 16:37:18.267625 128 metrics.cc:650] Collecting metrics for GPU 0: NVIDIA GeForce RTX 3090
I0712 16:37:18.268041 128 tritonserver.cc:2159]
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.23.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace |
| model_repository_path[0] | /models |
| model_control_mode | MODE_NONE |
| strict_model_config | 1 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0712 16:37:18.269464 128 grpc_server.cc:4587] Started GRPCInferenceService at 0.0.0.0:8001
I0712 16:37:18.269956 128 http_server.cc:3303] Started HTTPService at 0.0.0.0:8000
I0712 16:37:18.311686 128 http_server.cc:178] Started Metrics Service at 0.0.0.0:8002
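Liveness and readiness can also be checked programmatically before sending any work; a short sketch with the HTTP client:

import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")
print(client.is_server_live())                  # GET /v2/health/live
print(client.is_server_ready())                 # GET /v2/health/ready
print(client.is_model_ready("text_detection"))  # per-model readiness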
6. Inference requests can be sent with raw curl or through the wrapper client objects;
for example, the httpclient class in Triton's own Python package tritonclient (pip install tritonclient[http] first). A raw-HTTP example follows the client snippet at the end of this step:
import cv2
import tritonclient.http as httpclient

# detection_preprocessing / detection_postprocessing come with the tutorial's client code.
client = httpclient.InferenceServerClient(url="localhost:8000")

raw_image = cv2.imread("./img2.jpg")
preprocessed_image = detection_preprocessing(raw_image)

# Describe the input tensor, attach the numpy data, and run inference.
detection_input = httpclient.InferInput("input_images:0", preprocessed_image.shape, datatype="FP32")
detection_input.set_data_from_numpy(preprocessed_image, binary_data=True)
detection_response = client.infer(model_name="text_detection", inputs=[detection_input])

# Read both outputs by name and crop detected text regions for the recognition model.
scores = detection_response.as_numpy('feature_fusion/Conv_7/Sigmoid:0')
geometry = detection_response.as_numpy('feature_fusion/concat_3:0')
cropped_images = detection_postprocessing(scores, geometry, preprocessed_image)
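As an aside to the "raw curl" remark above, the same request can be sent over Triton's HTTP/REST (KServe v2) inference protocol without the client library; a minimal sketch using the requests library with JSON-encoded tensor data (the input shape here is a placeholder):

import numpy as np
import requests

# Placeholder tensor standing in for a preprocessed image batch.
image = np.random.rand(1, 480, 640, 3).astype(np.float32)
payload = {
    "inputs": [{
        "name": "input_images:0",
        "shape": list(image.shape),
        "datatype": "FP32",
        "data": image.flatten().tolist(),  # JSON payload; a binary extension also exists
    }]
}
resp = requests.post("http://localhost:8000/v2/models/text_detection/infer", json=payload)
outputs = resp.json()["outputs"]  # list of {name, shape, datatype, data}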
7. Then feed the cropped_images produced by the first model to the second model as its input;
# Create input object for recognition model
recognition_input = httpclient.InferInput("input.1", cropped_images.shape, datatype="FP32")
recognition_input.set_data_from_numpy(cropped_images, binary_data=True)
# Query the server
recognition_response = client.infer(model_name="text_recognition", inputs=[recognition_input])
# Process response from recognition model
text = recognition_postprocessing(recognition_response.as_numpy('308'))
print(text)
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_2-improving_resource_utilization
Only config.pbtxt needs to be modified to enable dynamic batching and multiple model instances;
1. Dynamic batching
Merging small batches into a larger one can improve both throughput and latency;
You can cap how long the server waits before it runs whatever requests have already been queued (a client-side sketch follows the config snippet below):
dynamic_batching {
  max_queue_delay_microseconds: 100
}
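From the client side, dynamic batching only helps when requests arrive close together; a hypothetical sketch that fires many concurrent single-image requests so the server can merge them (the input shape is a placeholder):

import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000", concurrency=8)
image = np.random.rand(1, 480, 640, 3).astype(np.float32)  # placeholder input

pending = []
for _ in range(32):
    inp = httpclient.InferInput("input_images:0", image.shape, datatype="FP32")
    inp.set_data_from_numpy(image, binary_data=True)
    # async_infer returns immediately; the server decides how requests are batched.
    pending.append(client.async_infer(model_name="text_detection", inputs=[inp]))

results = [p.get_result() for p in pending]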
2. Multiple model instances
Launch on GPUs 0 and 1, with two instances per GPU:
instance_group [
  {
    count: 2
    kind: KIND_GPU
    gpus: [ 0, 1 ]
  }
]
https://github.com/triton-inference-server/tutorials/tree/main/Conceptual_Guide/Part_3-optimizing_triton_configuration
Model Analyzer
It is essentially a profiler;
the user specifies the variables to explore, and the tool grid-searches over every configuration, doing trial runs on Triton;
the results are rendered as charts or tables for the user to analyze and pick the configuration that best meets the product's throughput, latency, and hardware-resource requirements.
The main parameters swept: the dynamic-batching maximum queue delay; the number of model instances;