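LLM Inference: First Impressions of Qwen2.5-Omni on an A100

Over the past week Qwen2.5-Omni has drawn a lot of attention. Its multimodal capability is genuinely appealing, and the released demo is passable (although the latency of voice conversation is still too high). So I deployed it on an A100 PCIe card and ran some initial speed tests. My steps are recorded below for reference.

1. Downloading resources

1.1 Cloning the git repository

First, clone the official git repository: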

git clone https://github.com/QwenLM/Qwen2.5-Omni

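The repository itself does not contain much of value, mainly a Docker deployment file and a web_demo. However, almost all of the following steps are based on its README, so it is worth reading the README carefully after cloning.

1.2 Downloading the model files

The repository above recommends downloading from the ModelScope community, so I downloaded the model from modelscope with the following commands: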

pip install modelscope
modelscope download --model Qwen/Qwen2.5-Omni-7B

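The model is large, so the download will probably take more than an hour; be patient. The downloaded model can be placed inside the git repository from the previous step.

2. Environment setup

Apart from not having a GPU that can run the model at all, the most troublesome part of running a large model is probably environment setup, which often takes half a day or more. Fortunately, the repository provides an official Docker image that saves us from building the environment ourselves. My advice is to use that image directly; otherwise you will run into a lot of trouble, as I did. The repository also provides a Dockerfile, which you can use to build a working image by hand. I am still sharing my own setup process here so that you know what to expect if you hit the same problems.

2.1 Creating the Docker container

Run the following commands: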

docker container run --net=host -v ~/Qwen2.5-Omni/:/workspace/Qwen2.5-Omni/ -it -d --name qwen2.5-omni --gpus device=7 cu12_torch250:1.0
docker container start 1602a
docker container exec -it 1602a bash

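These three commands create a container in the background that uses the A100 with id 7. The image, cu12_torch250, is one I built earlier myself, not the official image. To keep the container around for long-term use, I usually start it in the background with -d and then enter it with exec. To leave the container without stopping it, press ctrl+p followed by ctrl+q.

2.2 Configuring the software environment inside the container

Following the README in the repository, uninstall the existing transformers library and install a specific version of transformers from source with the following commands: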

pip uninstall transformers
pip install git+https://github.com/huggingface/transformers@f742a644ca32e65758c3adb36225aef1731bd2a8
pip install accelerate
pip install qwen-omni-utils[decord]

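After the second command, pip list shows transformers version 4.50.0.dev0. Pay special attention to the commit hash in the second command: the README in the repository appears to give two different versions, and one of them does not work.

In my Docker container, the fourth command reports: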

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
cugraph 23.12.0 requires cudf==23.12.*, which is not installed.
cugraph 23.12.0 requires dask-cudf==23.12.*, which is not installed.
cugraph-service-server 23.12.0 requires cudf==23.12.*, which is not installed.
cugraph-service-server 23.12.0 requires dask-cudf==23.12.*, which is not installed.
cuml 23.12.0 requires cudf==23.12.*, which is not installed.
cuml 23.12.0 requires dask-cudf==23.12.*, which is not installed.
dask-cuda 23.12.0 requires pynvml<11.5,>=11.0.0, which is not installed.

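None of the packages named in the error are used in the later steps, so I did not bother to fix it.

The official repository also recommends installing flash-attn to speed up inference, with the following command: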

pip install -U flash-attn --no-build-isolation

A fresh install should give flash-attn version 2.7.4.post1. I already had 2.7.2.post1 installed, so I did not reinstall it. flash-attn comes up again in the speed tests later.
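Because several exact versions matter in this setup (transformers 4.50.0.dev0, flash-attn 2.7.2.post1 versus 2.7.4.post1), it helps to check what is actually installed in the container before going further. The following is a minimal sketch of my own, not something taken from the official README; the helper name installed_versions is hypothetical, and it only uses the standard-library importlib.metadata module.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is missing}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Packages that matter for the steps in this post.
for name, ver in installed_versions(
    ["torch", "transformers", "accelerate", "flash-attn"]
).items():
    print(f"{name}: {ver if ver is not None else 'not installed'}")
```

Running this before and after each pip command makes it easy to notice when an install has silently replaced a package you care about.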

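3. Getting the web demo running

3.1 Installing the missing packages

You can install everything in one go with the requirements.txt in the repository. Most of the packages were already installed in my environment, though, and to avoid disturbing their versions I installed only the two that were missing: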

pip install gradio
pip install modelscope_studio

3.2 Running the demo

The launch command is:

python web_demo.py --flash-attn2 --server-name '10.192.2.1'

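In my Docker container the demo fails to start without --flash-attn2. After startup it reports that the page is available at http://10.192.2.1:7860/. When I opened the page, however, the browser refused access to the camera and microphone because it considered the page insecure (it is served over plain HTTP from a non-localhost address), so I could only try the demo by uploading audio files on the offline page. I uploaded an audio file of just under 4 s and got a result back after more than ten seconds of waiting. That is absurdly slow, far behind the same demo on ModelScope and HF. I then uploaded a few more audio files in a row, and the backend logged the input of the last round of inference as: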

[{'role': 'system', 'content': 'You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.'}, {'role': 'user', 'content': [{'type': 'audio', 'audio': '/tmp/gradio/af6e0/test.wav'}]}, {'role': 'assistant', 'content': '嗯，那你好好洗个澡哈。洗完澡咱们再聊，有啥事儿都可以跟我说哦。'}, {'role': 'user', 'content': [{'type': 'audio', 'audio': '/tmp/gradio/4d4f8ecdb21/output5.wav'}]}, {'role': 'assistant', 'content': '中国最长的河流是长江。它全长约6300千米呢。你要是还想知道关于长江或者其他河流的事儿，可以再问我呀。'}, {'role': 'user', 'content': [{'type': 'audio', 'audio': '/tmp/gradio/f57e72267b/output6.wav'}]}, {'role': 'assistant', 'content': '我是阿里云研发的大规模语言模型，我叫通义千问，有什么我可以帮助你的吗？'}, {'role': 'user', 'content': [{'type': 'audio', 'audio': '/tmp/gradio/b83b0b07986/question.wav'}]}]

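As you can see, multi-turn dialogue is implemented by keeping the previous history and feeding all of it to the model again on every turn. This has a serious drawback: without a prefix cache, the history keeps growing and inference keeps getting slower. I am not sure how https://chat.qwen.ai/ implements multi-turn dialogue.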
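To make the cost of resending the history concrete, here is a small back-of-the-envelope sketch. It is my own illustration and not code from the demo: the function name total_prefill_tokens and the figure of 100 tokens per turn are hypothetical, and it only counts how many prompt tokens have to be processed over a dialogue with and without a prefix cache.

```python
def total_prefill_tokens(tokens_added_per_turn, prefix_cache=False):
    """Sum of prompt tokens the model must process over a whole dialogue."""
    total = 0
    history = 0  # tokens already in the conversation before this turn
    for added in tokens_added_per_turn:
        # With a prefix cache only the newly added tokens need a forward pass;
        # without one, the entire history is processed again on every turn.
        total += added if prefix_cache else history + added
        history += added
    return total


turns = [100] * 4  # hypothetical: each turn adds 100 tokens to the prompt
print("without prefix cache:", total_prefill_tokens(turns))
print("with prefix cache   :", total_prefill_tokens(turns, prefix_cache=True))
```

Without a cache the work grows roughly quadratically with the number of turns, while with a cache it grows linearly, which is why long conversations get slower and slower in the web demo.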

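There is another open question. I have not read the relevant source code yet, so I am not sure how the wav files from earlier turns in the prompt above are handled: do they stay in the text as plain file paths, or are audio embeddings extracted from them again? If embeddings are extracted again, that is yet more repeated computation; if not, a bare file path does not seem to carry any meaning.

4. Speed tests

4.1 Original demo

Since the web_demo in section 3 was so slow, I also measured the speed locally with a script modeled on the example code in the repository. The demo code is: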

import time
import torch
import soundfile as sf

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto", enable_audio_output=True)

# We recommend enabling flash_attention_2 for faster inference and lower GPU memory usage.
#model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2", enable_audio_output=False)

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen2.5-Omni-7B")

conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test1.wav"},
        ],
    },
]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
s = time.time()
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
e = time.time()
print(f'prepare time: {(e-s)*1000:.2f}ms')

# Inference: Generation of the output text and audio
text_ids, audio = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=True)
s = time.time()
print(f'generate time: {(s-e)*1000:.2f}ms')
print(text_ids)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
e = time.time()
print(f'batch_decode time: {(e-s)*1000:.2f}ms')
print(text)

sf.write(
    "output.wav",
    audio.reshape(-1).detach().cpu().numpy(),
    samplerate=24000,
)

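The code above does not use flash-attn for now, and sets enable_audio_output=True to synthesize audio. With the same audio file of just under 4 s used in the web_demo test, the demo prints prepare time: 943ms and generate time: 11723ms. That is essentially the same speed as the web_demo: far too slow.

Setting enable_audio_output=False and passing return_audio=False to generate, so that no audio is produced, cuts the generate time from 11.7 s to 1.7 s (generate then returns only text_ids, so the sf.write call has to be removed). In other words, roughly 85% of the inference time is spent in the TTS module, and skipping TTS speeds up inference dramatically.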
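The demo measures time by juggling the s and e variables, which is easy to get wrong when more stages are added. A small context manager can do the same job. The sketch below is an assumption of mine about how one might structure it, not part of the official example: the names timed and _cuda_sync are hypothetical, and the CUDA synchronization is only attempted when torch is installed and a GPU is available, so that pending GPU work is counted in the right stage.

```python
import time
from contextlib import contextmanager


def _cuda_sync():
    """Wait for pending GPU work, if torch with CUDA is available."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.synchronize()


@contextmanager
def timed(label, results=None):
    """Print (and optionally record) the wall-clock time of a code block in ms."""
    _cuda_sync()
    start = time.perf_counter()
    try:
        yield
    finally:
        _cuda_sync()
        elapsed_ms = (time.perf_counter() - start) * 1000
        if results is not None:
            results[label] = elapsed_ms
        print(f"{label} time: {elapsed_ms:.2f}ms")


results = {}
with timed("sleep", results):
    time.sleep(0.05)  # stand-in for model.generate(...)
```

In the demo, each stage (preparation, generate, batch_decode) would be wrapped in its own with timed(...) block in place of the manual subtraction.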

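4.2 Stream demo

In the original demo, generate takes 1.7 s for an audio clip of under 4 s. Is that fast enough? It still feels slow to me. So I changed the original demo to streaming output to see how fast Qwen2.5-Omni produces each token. The stream demo code is: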

import time
import torch
import datetime
import builtins
import soundfile as sf
from threading import Thread

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor, AutoTokenizer, TextIteratorStreamer
from qwen_omni_utils import process_mm_info

def custom_print(*args, **kwargs):
    current_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
    original_print(f'[{current_time}]', *args, **kwargs)


# change print function to add time stamp
original_print = builtins.print
builtins.print = custom_print

# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto", enable_audio_output=False)

# We recommend enabling flash_attention_2 for faster inference and lower GPU memory usage.
#model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2", enable_audio_output=False)

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen2.5-Omni-7B")
tokenizer = AutoTokenizer.from_pretrained("Qwen2.5-Omni-7B")
streamer = TextIteratorStreamer(tokenizer)

conversation = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test1.wav"},
        ],
    },
]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
s = time.time()
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversation, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
e = time.time()
print(f'prepare time: {(e-s)*1000:.2f}ms')

# Inference: Generation of the output text and audio
generation_kwargs = {
    'streamer':streamer,
    'use_audio_in_video':USE_AUDIO_IN_VIDEO,
    'return_audio':False,
    **inputs
}
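# Run generate in a background thread so the main thread can consume the streamer
thread = Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()
for token in streamer:
    print(token)
thread.join()  # wait for the generation thread to finish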

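This demo is similar to the original one, but uses TextIteratorStreamer to emit output token by token. I also patched print so that it prefixes a timestamp. Running it produces the following output: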

[2025-04-03 09:06:07.465] 嗯
[2025-04-03 09:06:07.542]
[2025-04-03 09:06:07.577] ，那你
[2025-04-03 09:06:07.610] 好好
[2025-04-03 09:06:07.644] 洗
[2025-04-03 09:06:07.679] 个
[2025-04-03 09:06:07.727] 澡
[2025-04-03 09:06:07.769] 哈
[2025-04-03 09:06:07.808]
[2025-04-03 09:06:07.847] 。洗
[2025-04-03 09:06:07.886] 完
[2025-04-03 09:06:07.924] 澡
[2025-04-03 09:06:07.963] 咱们
[2025-04-03 09:06:08.001] 再
[2025-04-03 09:06:08.034] 聊
[2025-04-03 09:06:08.067]
[2025-04-03 09:06:08.108] ，有
[2025-04-03 09:06:08.151] 啥
[2025-04-03 09:06:08.194] 事儿
[2025-04-03 09:06:08.228] 都可以
[2025-04-03 09:06:08.262] 跟我说
[2025-04-03 09:06:08.295] 哦
[2025-04-03 09:06:08.329]
[2025-04-03 09:06:08.362]
[2025-04-03 09:06:08.363] 。<|im_end|>

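From the time difference between consecutive lines, each token takes roughly 30 to 40 ms to generate, that is, about 25 to 33 tokens per second. That still feels rather slow to me.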
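The per-token estimate can be checked mechanically rather than by eye. The sketch below parses timestamps in the format produced by the patched print and computes the gap between consecutive lines. It is only an illustration: the function name token_gaps_ms is hypothetical, the sample contains just the first six lines of the log above, and each printed line is a streamer chunk, which I am treating as an approximation of one token.

```python
import re
from datetime import datetime
from statistics import median

# Matches the "[YYYY-MM-DD HH:MM:SS.mmm]" prefix added by the patched print.
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")


def token_gaps_ms(log_lines):
    """Return the time gaps, in milliseconds, between consecutive log lines."""
    stamps = []
    for line in log_lines:
        match = STAMP.match(line)
        if match:
            stamps.append(datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f"))
    return [(b - a).total_seconds() * 1000 for a, b in zip(stamps, stamps[1:])]


sample_log = [
    "[2025-04-03 09:06:07.465] 嗯",
    "[2025-04-03 09:06:07.542]",
    "[2025-04-03 09:06:07.577] ，那你",
    "[2025-04-03 09:06:07.610] 好好",
    "[2025-04-03 09:06:07.644] 洗",
    "[2025-04-03 09:06:07.679] 个",
]

gaps = token_gaps_ms(sample_log)
print("gaps (ms):", [round(g) for g in gaps])
print("median gap (ms):", round(median(gaps)))
print("approx. tokens per second:", round(1000 / median(gaps), 1))
```

Feeding the full log through the same function is a quick way to confirm the estimate of 30 to 40 ms per token.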

4.3 Batch demo

The official repository mentions that, when no audio is generated, inference can also be run in batches. I wrote a batch demo modeled on the official example code:

import time
import torch
import soundfile as sf

from transformers import Qwen2_5OmniModel, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info

# default: Load the model on the available device(s)
model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype="auto", device_map="auto", enable_audio_output=False)

# We recommend enabling flash_attention_2 for faster inference and lower GPU memory usage.
#model = Qwen2_5OmniModel.from_pretrained("Qwen2.5-Omni-7B", torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="flash_attention_2", enable_audio_output=False)

processor = Qwen2_5OmniProcessor.from_pretrained("Qwen2.5-Omni-7B")

conversation1 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test1.wav"},
        ],
    },
]

conversation2 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test2.wav"},
        ],
    },
]

conversation3 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test3.wav"},
        ],
    },
]

conversation4 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test4.wav"},
        ],
    },
]

conversation5 = [
    {
        "role": "system",
        "content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
    },
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": "wav/test5.wav"},
        ],
    },
]

conversations=[conversation1, conversation2, conversation3, conversation4, conversation5]

# set use audio in video
USE_AUDIO_IN_VIDEO = True

# Preparation for inference
s = time.time()
text = processor.apply_chat_template(conversations, add_generation_prompt=True, tokenize=False)
audios, images, videos = process_mm_info(conversations, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = processor(text=text, audios=audios, images=images, videos=videos, return_tensors="pt", padding=True, use_audio_in_video=USE_AUDIO_IN_VIDEO)
inputs = inputs.to(model.device).to(model.dtype)
e = time.time()
print(f'prepare time: {(e-s)*1000:.2f}ms')

# Inference: Generation of the output text and audio
text_ids = model.generate(**inputs, use_audio_in_video=USE_AUDIO_IN_VIDEO, return_audio=False)
s = time.time()
print(f'generate time: {(s-e)*1000:.2f}ms')
print(text_ids)

text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
e = time.time()
print(f'batch_decode time: {(e-s)*1000:.2f}ms')
print(text)

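In the code above I feed five audio files, each between 1 and 4 s long, to Qwen2.5-Omni in a single batch. Preparation takes about 2.2 s and generate about 4.4 s. That is certainly faster than processing the five files one after another, but it is still slow overall. In short, batch inference over five audio files of at most 4 s each takes about 7 s in total (2.2 s + 4.4 s), which is unusable for real-time scenarios and only suitable for offline use.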
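The five conversation literals in the batch demo differ only in the audio path, so they can be produced by a small helper. This is a minimal sketch of that refactoring, assuming the same wav/testN.wav file names as above; the helper name make_conversation is hypothetical and not part of the official example.

```python
SYSTEM_PROMPT = (
    "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
    "capable of perceiving auditory and visual inputs, as well as generating "
    "text and speech."
)


def make_conversation(audio_path):
    """Build a single-turn conversation with one audio file as the user input."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": [{"type": "audio", "audio": audio_path}]},
    ]


# Same five inputs as the batch demo: wav/test1.wav ... wav/test5.wav
conversations = [make_conversation(f"wav/test{i}.wav") for i in range(1, 6)]
print(len(conversations), "conversations prepared")
```

The resulting conversations list can be passed to processor.apply_chat_template and process_mm_info exactly as in the batch demo, and changing the batch size becomes a one-line edit.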

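5. Inference acceleration

I have not yet managed to get either of the two official acceleration options, flash-attn and vLLM, to work. I will update this post once I do; for now, here are the lessons I learned.

5.1 flash-attn

As mentioned earlier, I did not install the latest flash-attn as the official instructions ask, because I already had an older version installed. The demos in the previous section include a commented-out model-loading line with flash-attn enabled, but when I tried it the speed did not change at all, and I do not know why. I then tried to install the latest flash-attn with pip install -U flash-attn --no-build-isolation, but the installation was extremely slow. It still had not finished when I left work, so I killed it and will retry next week.

5.2 vLLM

The official repository also provides a vLLM example. The vLLM version it uses is not the mainline release, however, so we have to build it from source. The build commands are: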

git clone -b qwen2_omni_public_v1 https://github.com/fyabc/vllm.git
cd vllm
pip install .

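It looks simple, but once you run it you find that this path does not work in practice. The problem is that the installation takes a very long time and consumes an enormous amount of resources; if you are not careful it will bring your machine down. First, when the installer prints Installing build dependencies, be patient: this step takes more than 3 hours. Second, once that step finishes, it triggers updates to torch 2.6.0, transformers 4.50.3 and a series of other packages, and if the installation completes, your Docker container environment will look very different. Worse, the updated transformers overwrites the version installed from source earlier, so the demos above stop working and you have to reinstall transformers following the earlier steps. Fortunately, my installation never completed. Finally, when the installation reaches the last step, Building wheels for collected packages, open top right away: you will see every core fully loaded and memory usage climbing steadily. My server has 500 GB of memory, and it filled up within a short time. I had been through this once before when installing flash-attn, and that time the server hung and had to be rebooted. So this time I killed the vLLM installation when memory was about to run out, and in the end I never got vLLM installed.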
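One thing worth trying before the next attempt is to cap the number of parallel compile jobs. The flash-attn README documents a MAX_JOBS environment variable for exactly this situation, where too many parallel compilation jobs exhaust memory. The sketch below is my own wrapper, not an official script: the function name pip_install_limited is hypothetical, the value 4 is an arbitrary starting point, and I have not verified that this is enough to make the vLLM fork build succeed on this server.

```python
import os
import subprocess
import sys


def pip_install_limited(pip_args, max_jobs=4, dry_run=True):
    """Run `pip install` with MAX_JOBS set to cap parallel compilation jobs.

    With dry_run=True nothing is installed; the command and environment
    are only returned for inspection.
    """
    env = dict(os.environ, MAX_JOBS=str(max_jobs))
    cmd = [sys.executable, "-m", "pip", "install", *pip_args]
    if not dry_run:
        subprocess.run(cmd, env=env, check=True)
    return cmd, env


# Dry run: show what would be executed for the flash-attn install.
cmd, env = pip_install_limited(["-U", "flash-attn", "--no-build-isolation"], max_jobs=4)
print("MAX_JOBS=" + env["MAX_JOBS"], " ".join(cmd))
```

Passing dry_run=False actually runs the install. A lower max_jobs value makes the build slower but keeps memory use bounded, which is a better trade-off than a server reboot.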

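So the best approach is still to use the official Docker image. I will share new results as soon as I have them.

6. Open questions

Issues are piling up on both ModelScope and the git repository, and it feels as if the maintainers have been caught off guard. In my opinion this release of Qwen2.5-Omni looks rushed and is far from complete. Many people, myself included, share one question: why does the demo at https://chat.qwen.ai/ support streaming conversation fairly well (although the latency is high, typically 2 to 4 s), while the demos on HF and ModelScope require uploading files? How can pure streaming inference be achieved? Uploading files is of little practical use. Another question: why is the web_demo I set up on an A100 so slow, far slower than the ones on HF and ModelScope? It would help if the maintainers documented a clear way to run it along with the hardware configuration. Also, flash-attn made no difference to speed in my tests; is that because my flash-attn is not the latest version? Finally, how is multi-turn dialogue implemented in Qwen2.5-Omni, and how does it avoid re-running inference over the repeated prompt?

You may have noticed that Wang Xiong, the author of Freeze-Omni, is also an author of Qwen2.5-Omni. In principle, Qwen2.5-Omni should be able to gain Freeze-Omni-style streaming inference fairly quickly (I suspect https://chat.qwen.ai/ is implemented along the lines of Freeze-Omni). Freeze-Omni is a good project, but it is built on Qwen2, so its quality is slightly below that of Qwen2.5-Omni; I hope Wang Xiong will soon open-source a new version of Freeze-Omni. Freeze-Omni's other problem is cost: one GPU per user, which makes it impossible to put into production. Qwen2.5-Omni improves on Freeze-Omni but is not perfect either. Its knowledge may be the same as Qwen2's and is rather dated. I asked it: is Trump still the President of the United States? It answered that he no longer is. So, broadly speaking, Qwen2.5-Omni is still a toy whose quality cannot match production models with tens of billions of parameters. I hope Qwen3, due this month, will substantially improve text quality.

Right after the release of Qwen2.5-Omni, Baidu released its own end-to-end speech model, claiming that two L20 cards can serve several hundred or more concurrent sessions. If that is true, end-to-end speech models have finally reached a major milestone for commercial deployment. However, Baidu has habitually kept its technology closed-source, which has left it largely talking to itself, with nobody in the industry really following its work. I hope that Robin Li, after being spurred by DeepSeek this time, will open-source more in the future so that small companies like ours can truly benefit.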