TensorRT-LLM in Seven Days: Day 1

2025/2/22 8:46:59

Goal

Spend seven days getting familiar with the TensorRT-LLM code architecture, how cuBLAS is used, and how Flash Attention is tuned.

Project link

https://github.com/NVIDIA/TensorRT-LLM

Installation

https://nvidia.github.io/TensorRT-LLM/installation/linux.html

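Installation mainly comes down to pulling the corresponding Docker image and starting a container from it. The command below is the one given in the official documentation. Note that because of the --rm flag, the container is deleted automatically once you exit it.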

docker run --rm --ipc=host --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.5.1-devel-ubuntu22.04

Without a proxy, the installation is very slow: it can run from morning to night, and it may well get interrupted along the way.

Once the TensorRT-LLM installation has finished, you can see:
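One quick way to confirm the install from inside the container is to check whether the package imports and what version it reports. The snippet below is only a minimal sketch: the helper name installed_version is made up for this post, and it assumes the package is importable as tensorrt_llm (as in the example further down) and exposes a __version__ attribute, falling back to "unknown" if it does not.

```python
import importlib
import importlib.util


def installed_version(package: str = "tensorrt_llm"):
    """Return the package's version string, or None if it cannot be found.

    Hypothetical helper for this post, not part of TensorRT-LLM.
    """
    # find_spec returns None when a top-level package is not installed.
    if importlib.util.find_spec(package) is None:
        return None
    module = importlib.import_module(package)
    # Assumption: the package exposes __version__; fall back if it does not.
    return getattr(module, "__version__", "unknown")


if __name__ == "__main__":
    version = installed_version()
    if version is None:
        print("tensorrt_llm is not installed in this environment")
    else:
        print(f"tensorrt_llm version: {version}")
```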

API example

LLM Examples Introduction — tensorrt_llm documentation

from tensorrt_llm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

llm = LLM(model="TinyLlama/TinyLlama-1.1B-Chat-v1.0")

outputs = llm.generate(prompts, sampling_params)

# Print the outputs.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

What makes this example intriguing: how is generate actually implemented? And how does it connect to TensorRT underneath?

Reading the source shows that generate calls the generate_async function. But what does the call chain look like after that? We'll pick that up tomorrow.
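To make the generate / generate_async relationship concrete before digging further, here is a toy sketch of the general pattern: a synchronous method that submits each prompt through an asynchronous one and then waits for the results. This is not TensorRT-LLM code. ToyLLM, _run_request, and the thread pool are all hypothetical stand-ins, and the assumption that each prompt becomes one asynchronous request is mine, not something confirmed by the source so far.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List


class ToyLLM:
    """Hypothetical stand-in used only to illustrate the sync-over-async pattern."""

    def __init__(self, max_workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=max_workers)

    def _run_request(self, prompt: str) -> str:
        # Placeholder for the real engine call; here we just echo the prompt.
        return prompt + " ..."

    def generate_async(self, prompt: str) -> Future:
        # Submit one request and return immediately with a future.
        return self._pool.submit(self._run_request, prompt)

    def generate(self, prompts: List[str]) -> List[str]:
        # Enqueue every prompt first, then block until each result is ready.
        futures = [self.generate_async(p) for p in prompts]
        return [f.result() for f in futures]

    def shutdown(self) -> None:
        self._pool.shutdown()


if __name__ == "__main__":
    llm = ToyLLM()
    for text in llm.generate(["Hello, my name is", "The capital of France is"]):
        print(repr(text))
    llm.shutdown()
```

Submitting all prompts before waiting on any of them lets the requests overlap, and the results still come back in the same order as the prompts.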