A. Results
B. Setup and Deployment
- If a direct install from GitHub works in your environment, run:
pip install git+https://github.com/huggingface/transformers accelerate
- Otherwise, clone the repository and install in separate steps:
git clone https://github.com/huggingface/transformers
cd transformers
pip install . accelerate
- Then install the remaining dependencies:
pip install qwen-vl-utils
pip install torchvision
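After the install commands above, it can help to confirm what actually ended up in the active environment. The following is a minimal sketch using only the standard library; the package list simply mirrors the pip commands above, and the helper name installed_versions is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip commands above.
REQUIRED = ["transformers", "accelerate", "qwen-vl-utils", "torchvision"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED).items():
        print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Running it inside the same conda or virtual environment used for the test script shows whether any of the four packages is missing.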
C. Model Test
C.1 Test Code and Caveats
import os
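os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # use GPU 0
# ⚠️ Note 1: on a machine with mixed GPUs where one card does not support FlashAttention-2, select the usable card here, at the very start of the script (before CUDA is initialized)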
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info
# default: Load the model on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    # "/home/lgk/Downloads/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="auto"
    "/home/lgk/Downloads/Qwen2-VL-2B-Instruct", torch_dtype="auto", device_map="balanced_low_0"
)
# ⚠️ Note 2: the model and the inputs must be placed on the device selected at the top of the script (the tokenizer has no such requirement); device_map is changed to "balanced_low_0" here
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2VLForConditionalGeneration.from_pretrained(
#     "/home/lgk/Downloads/Qwen2-VL-2B-Instruct",
#     torch_dtype=torch.bfloat16,  # requires `import torch`
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("/home/lgk/Downloads/Qwen2-VL-2B-Instruct")
# The default range for the number of visual tokens per image in the model is 4-16384.
# You can set min_pixels and max_pixels according to your needs, such as a token range of 256-1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("/home/lgk/Downloads/Qwen2-VL-2B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# Preparation for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
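The trimming step near the end of the script is easy to misread: generate() returns each sequence with the prompt tokens still at the front, so the list comprehension slices them off before decoding. Below is a small sketch of the same slicing on plain Python lists; the token ids are made up for illustration and the helper name trim_prompt_tokens is not part of any library.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids (made up): two prompts and the sequences "generated" from them.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 40, 41], [101, 9, 50]]

print(trim_prompt_tokens(prompts, outputs))  # [[40, 41], [50]]
```

Without this step, the decoded text would begin with the chat template and the user's question rather than the model's answer.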
C.2 Test Runs
- mode=1
(qwen2-vl) (base) lgk@WIN-20240401VAM:~/Projects/transformers$ python -u "/home/lgk/Projects/transformers/test.py"
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████| 2/2 [00:03<00:00, 1.58s/it]
['The image depicts a serene beach scene with a woman and her dog. The woman is sitting on the sand, wearing a plaid shirt and black pants, and appears to be smiling. She is holding up her hand in a high-five gesture towards the dog, which is also sitting on the sand. The dog has a harness on, and its front paws are raised in a playful manner. The background shows the ocean with gentle waves, and the sky is clear with a soft glow from the setting or rising sun, casting a warm light over the entire scene. The overall atmosphere is peaceful and joyful.']
- mode=2
(qwen2-vl) (base) lgk@WIN-20240401VAM:~/Projects/transformers$ python test.py
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.03s/it]
['The image depicts a serene beach scene with a woman and her dog. The woman is sitting on the sand, wearing a plaid shirt and black pants, and appears to be smiling. She is holding up her hand in a high-five gesture towards the dog, which is also sitting on the sand. The dog has a harness on, and its front paws are raised in a playful manner. The background shows the ocean with gentle waves, and the sky is clear with a soft glow from the setting or rising sun, casting a warm light over the entire scene. The overall atmosphere is peaceful and joyful, capturing a moment of connection between the']
D. Problem Analysis
D.1 Choosing a Flash-Attention Build
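What is the difference between flash_attn-2.3.5+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl and flash_attn-2.3.5+cu117torch2.0cxx11abiTrue-cp310-cp310-linux_x86_64.whl?
The two flash_attn wheels differ only in how they were compiled, specifically in the libstdc++ C++11 ABI (Application Binary Interface) setting, which determines which other binaries they can be linked and loaded with:
- cxx11abiFALSE:
  - Compiled with the C++11 ABI disabled (_GLIBCXX_USE_CXX11_ABI=0).
  - Uses the older libstdc++ ABI that predates GCC 5, as found in GCC 4.x toolchains.
  - Suitable for environments whose other C++ libraries were built against that older ABI.
- cxx11abiTrue:
  - Compiled with the C++11 ABI enabled (_GLIBCXX_USE_CXX11_ABI=1).
  - Uses the newer ABI, which is the default in GCC 5.1 and later.
  - Suitable for environments whose other C++ libraries were built with the newer ABI.
Main differences:
- Binary compatibility: the cxx11abiTrue build is compatible with libraries built using the newer ABI, while the cxx11abiFALSE build exists for compatibility with binaries built using the older one. Mixing the two typically surfaces as undefined-symbol errors at link or import time.
- Performance and features: any performance difference between the two builds is marginal at most; the flag is primarily a compatibility setting.
Which build to choose:
- If the C++ libraries the wheel is loaded with (most importantly the installed PyTorch build) use the C++11 ABI, choose cxx11abiTrue.
- If they were built with the older ABI, or you depend on other libraries that have the C++11 ABI disabled, choose cxx11abiFALSE.
In short, base the choice on the ABI of the libraries already in your environment, above all PyTorch; the compiler version is only a hint, and performance is not a deciding factor.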
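The wheel file names quoted above encode the CUDA version, the PyTorch version, the ABI flag, and the Python tag. As a rough sketch, they can be pulled apart with a regular expression before picking a file; the pattern below is an assumption fitted to the file names shown in this post, not an official naming specification, and parse_flash_attn_wheel is a hypothetical helper.

```python
import re

# Pattern fitted to the flash_attn wheel names quoted in this post;
# other naming schemes are not handled by this sketch.
WHEEL_RE = re.compile(
    r"flash_attn-(?P<version>[\d.]+)"
    r"\+cu(?P<cuda>\d+)"
    r"torch(?P<torch>[\d.]+)"
    r"cxx11abi(?P<abi>TRUE|FALSE)"
    r"-(?P<python>cp\d+)-cp\d+-(?P<platform>\w+)\.whl",
    re.IGNORECASE,
)


def parse_flash_attn_wheel(filename):
    """Split a flash_attn wheel file name into its build tags."""
    match = WHEEL_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized wheel name: {filename!r}")
    info = match.groupdict()
    # Normalize the ABI flag ("FALSE", "True", ...) to a real boolean.
    info["abi"] = info["abi"].upper() == "TRUE"
    return info


if __name__ == "__main__":
    name = "flash_attn-2.3.5+cu117torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"
    print(parse_flash_attn_wheel(name))
```

Each field can then be compared with the local environment: the CUDA and PyTorch versions against the installed PyTorch build, and the cp tag against the running Python version.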
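D.2 How to Check
To determine whether the compilers and C++ libraries in your environment support C++11 and use the C++11 ABI, work through the following checks:
1. Check the compiler version
On most systems the C++ compiler is GCC or Clang. To check the version:
- GCC (GNU Compiler Collection):
gcc --version
Version 5.1 or later uses the C++11 ABI by default, although a distribution may configure its GCC differently.
- Clang:
clang --version
Clang 3.3 and later support the C++11 language standard. The ABI, however, is determined by the C++ standard library Clang is used with rather than by Clang itself: with libstdc++ it follows that library's default.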
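The GCC version check above can also be scripted. The sketch below parses the first line of gcc --version and compares it with the 5.1 threshold mentioned above; the function names are made up for this example, and the result is only a hint about the default, since a distribution may override it.

```python
import re
import shutil
import subprocess


def parse_gcc_version(version_output):
    """Return (major, minor) from the first line of `gcc --version` output."""
    lines = version_output.splitlines()
    first_line = lines[0] if lines else ""
    # Drop parenthesized vendor strings such as "(Ubuntu 11.4.0-1ubuntu1~22.04)".
    without_parens = re.sub(r"\([^)]*\)", "", first_line)
    match = re.search(r"(\d+)\.(\d+)", without_parens)
    if match is None:
        raise ValueError(f"no version found in: {first_line!r}")
    return int(match.group(1)), int(match.group(2))


def defaults_to_cxx11_abi(version):
    """Upstream GCC enables the C++11 ABI by default from 5.1 onwards."""
    return version >= (5, 1)


if __name__ == "__main__":
    gcc = shutil.which("gcc")
    if gcc is None:
        print("gcc not found on PATH")
    else:
        out = subprocess.run(
            [gcc, "--version"], capture_output=True, text=True, check=True
        ).stdout
        ver = parse_gcc_version(out)
        print(f"gcc {ver[0]}.{ver[1]}, C++11 ABI by default: {defaults_to_cxx11_abi(ver)}")
```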
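2. Check the compiler's default ABI setting
To see whether your compiler enables the C++11 ABI by default, compile a small test program that prints the libstdc++ macro _GLIBCXX_USE_CXX11_ABI (the macro is specific to libstdc++):
- Write the test program:
Create a small C++ file (for example abi_check.cpp):
#include <iostream>
int main() {
    std::cout << "_GLIBCXX_USE_CXX11_ABI = " << _GLIBCXX_USE_CXX11_ABI << std::endl;
    return 0;
}
- Compile and run:
g++ abi_check.cpp -o abi_check
./abi_check
An output of _GLIBCXX_USE_CXX11_ABI = 1 means the C++11 ABI is enabled; 0 means it is not.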
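3. Check the C++ libraries installed on the system
Other C++ libraries that end up in the same process also need a matching ABI. To check the installed libraries:
- Check installed library versions: use your package manager (apt, yum, dnf, and so on). For example, to list the libstdc++ packages:
apt list --installed | grep libstdc++
- Inspect symbols: run nm or objdump on an installed library and look at its symbol table. Symbols built with the C++11 ABI carry the __cxx11 namespace marker, for example std::__cxx11::basic_string.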
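The symbol inspection described above can be automated in a few lines. This is a sketch that assumes GNU nm is on the PATH and simply counts symbols containing the __cxx11 marker; the function names are hypothetical. A count of zero is not conclusive, because a library that never exposes std::string or std::list in its interface has no such symbols under either ABI.

```python
import shutil
import subprocess


def count_cxx11_symbols(nm_output):
    """Count symbol lines that mention the __cxx11 namespace marker."""
    return sum(1 for line in nm_output.splitlines() if "__cxx11" in line)


def scan_library(path):
    """Run `nm -D -C` on a shared library; return the count, or None on failure."""
    nm = shutil.which("nm")
    if nm is None:
        return None
    result = subprocess.run([nm, "-D", "-C", path], capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return count_cxx11_symbols(result.stdout)


if __name__ == "__main__":
    import sys

    for lib in sys.argv[1:]:
        print(lib, scan_library(lib))
```

It would be invoked with one or more shared-library paths as arguments, for instance the .so file inside an installed flash_attn package.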
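4. Check the build toolchain configuration
If your project uses CMake, a Makefile, or another build system:
- CMake: make sure CMAKE_CXX_STANDARD is set to 11 or higher:
set(CMAKE_CXX_STANDARD 11)
- Makefile: add -std=c++11 or a later standard to the compile flags:
CXXFLAGS = -std=c++11
Note that these settings select the language standard only. The libstdc++ ABI is a separate setting, controlled by the _GLIBCXX_USE_CXX11_ABI macro (for example -D_GLIBCXX_USE_CXX11_ABI=0).
Summary
These checks tell you whether your compiler, libraries, and build toolchain support C++11 and which ABI they use by default. If they all use the C++11 ABI, and the installed PyTorch build does as well, the cxx11abiTrue wheel is the matching choice; otherwise use cxx11abiFALSE.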
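Because the flash_attn wheel is loaded next to PyTorch's own C++ libraries, the most direct check is to ask PyTorch which ABI it was built with. The sketch below uses torch.compiled_with_cxx11_abi() and falls back to None when PyTorch is not installed; the helper names and the tag spellings (copied from the file names quoted in D.1) are assumptions made for this example.

```python
def torch_cxx11_abi():
    """Return True/False for the ABI PyTorch was built with, or None if torch is missing."""
    try:
        import torch
    except ImportError:
        return None
    return bool(torch.compiled_with_cxx11_abi())


def wheel_tag_for(abi):
    """Map the ABI flag to the tag spelling used in the wheel names quoted in D.1."""
    if abi is None:
        return None
    return "cxx11abiTrue" if abi else "cxx11abiFALSE"


if __name__ == "__main__":
    abi = torch_cxx11_abi()
    if abi is None:
        print("PyTorch is not installed in this environment")
    else:
        print(f"PyTorch built with C++11 ABI: {abi} -> pick the {wheel_tag_for(abi)} wheel")
```

This replaces trial and error between the two wheels with a single check in the target environment.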
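E. References
- Note that CUDA_VISIBLE_DEVICES only takes effect when set at the very start of the script, before CUDA is initialized; changing it later has no effect.
- [⚠️ The long road to running large models] On using device_map with multiple GPUs (CSDN blog)
- Loading local model files with from_pretrained (Zhihu)
- ⚠️ flash-attn installation errors (Zhihu)
- ⚠️ Best practices for fine-tuning Qwen2-VL, swift 2.4.0.dev0 documentation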
- vllm-project/vllm: A high-throughput and memory-efficient inference and serving engine for LLMs
- QwenLM/Qwen2-VL: Qwen2-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
- Dao-AILab/flash-attention: Fast and memory-efficient exact attention
- ⚠️ Download a prebuilt Flash-Attention wheel; the True and False variants can each be tried
- ValueError: Flash Attention 2.0 only supports torch.float16 and torch.bfloat16 dtypes. You passed torch.float32, this might lead to unexpected behaviour. · Issue #28052 · huggingface/transformers
pip install flash_attn-2.6.3+cu123torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl --no-build-isolation