Fine-tuning Qwen2-VL for LaTeX OCR — single-image inference with the original model
flyish
Input: the local image ./QueHuaQiuSe1.png loaded in the code below (not reproduced here).
Output:
['This is a traditional Chinese landscape painting depicting a towering mountain peak surrounded by trees and vegetation. Below it lies an open field, and distant mountains fade in and out of a light mist. Several lines of calligraphy appear at the top, possibly the painter\'s signature or a poem. A few seals sit in the upper-right and lower-left corners, likely belonging to the painter or to collectors. The overall palette is a subdued blue-green, giving a calm, natural feel.']
Use flash-attn to reduce resource consumption:
pip install flash-attn --no-build-isolation
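Before loading the model, you can optionally confirm that the installation is visible to transformers. A minimal check, assuming a recent transformers version that exposes is_flash_attn_2_available:

import torch
from transformers.utils import is_flash_attn_2_available

# Both should print True for attn_implementation="flash_attention_2" to work below
print("CUDA available:", torch.cuda.is_available())
print("FlashAttention-2 available:", is_flash_attn_2_available())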
The full inference script is as follows:
from PIL import Image
import requests  # only needed if loading the image from a URL
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from modelscope import snapshot_download

# Download a snapshot of the model and get its local directory
model_dir = snapshot_download("qwen/Qwen2-VL-7B-Instruct")

# Load the model across the available devices (device_map="auto"), let the dtype be chosen
# automatically from the checkpoint, and enable FlashAttention-2
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_dir, torch_dtype="auto", device_map="auto", attn_implementation="flash_attention_2"
)

# Print all non-callable attributes of the model and their values
print("\nModel Attributes and Their Values:")
for attr in dir(model):
    try:
        value = getattr(model, attr)
        if not callable(value):
            print(f"{attr}: {value}")
    except AttributeError:
        continue

# Print the dtype actually in use
print(f"Actual torch_dtype: {model.dtype}")

# Print the model configuration
print("\nModel Configuration:")
print(model.config)

# Load the processor (tokenizer + image processor)
processor = AutoProcessor.from_pretrained(model_dir)

# Load the image from a URL:
# url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
# image = Image.open(requests.get(url, stream=True).raw)

# Or load the image from a local file
image_path = './QueHuaQiuSe1.png'
image = Image.open(image_path)

# Define the conversation: a single user turn containing the image and a text prompt
# ("描述这张图像。" means "Describe this image.")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "描述这张图像。"},
        ],
    }
]

# Apply the chat template and append the generation prompt
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# Preprocess the text and image into tensors the model accepts
inputs = processor(
    text=[text_prompt], images=[image], padding=True, return_tensors="pt"
)

# Move the inputs to the GPU
inputs = inputs.to("cuda")

# Inference: generate at most 128 new tokens
output_ids = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens, keeping only the newly generated token IDs
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, output_ids)
]

# Decode the generated token IDs into human-readable text
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
)

# Print the generated description
print(output_text)
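To substantiate the resource savings from flash-attn, one can compare peak GPU memory between a run with attn_implementation="flash_attention_2" and a run with the default attention. A minimal sketch, appended after generation:

import torch

# Peak GPU memory per device since the start of the process
# (call torch.cuda.reset_peak_memory_stats() before generation for a clean per-run figure)
if torch.cuda.is_available():
    for device_id in range(torch.cuda.device_count()):
        peak_gb = torch.cuda.max_memory_allocated(device_id) / 1024**3
        print(f"cuda:{device_id} peak memory: {peak_gb:.2f} GiB")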
The model's attributes (console output):
Downloading Model to directory: /home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct
`Qwen2VLRotaryEmbedding` can now be fully parameterized by passing the model config through the `config` argument. All other arguments will be removed in v4.46
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:03<00:00, 1.45it/s]
Model Attributes and Their Values:
T_destination: ~T_destination
__annotations__: {'dump_patches': <class 'bool'>, '_version': <class 'int'>, 'training': <class 'bool'>, '_parameters': typing.Dict[str, typing.Optional[torch.nn.parameter.Parameter]], '_buffers': typing.Dict[str, typing.Optional[torch.Tensor]], '_non_persistent_buffers_set': typing.Set[str], '_backward_pre_hooks': typing.Dict[int, typing.Callable], '_backward_hooks': typing.Dict[int, typing.Callable], '_is_full_backward_hook': typing.Optional[bool], '_forward_hooks': typing.Dict[int, typing.Callable], '_forward_hooks_with_kwargs': typing.Dict[int, bool], '_forward_hooks_always_called': typing.Dict[int, bool], '_forward_pre_hooks': typing.Dict[int, typing.Callable], '_forward_pre_hooks_with_kwargs': typing.Dict[int, bool], '_state_dict_hooks': typing.Dict[int, typing.Callable], '_load_state_dict_pre_hooks': typing.Dict[int, typing.Callable], '_state_dict_pre_hooks': typing.Dict[int, typing.Callable], '_load_state_dict_post_hooks': typing.Dict[int, typing.Callable], '_modules': typing.Dict[str, typing.Optional[ForwardRef('Module')]], 'call_super_init': <class 'bool'>, '_compiled_call_impl': typing.Optional[typing.Callable], 'forward': typing.Callable[..., typing.Any], '__call__': typing.Callable[..., typing.Any]}
__dict__: {'training': False, '_parameters': {}, '_buffers': {}, '_non_persistent_buffers_set': set(), '_backward_pre_hooks': OrderedDict(), '_backward_hooks': OrderedDict(), '_is_full_backward_hook': None, '_forward_hooks': OrderedDict(), '_forward_hooks_with_kwargs': OrderedDict(), '_forward_hooks_always_called': OrderedDict(), '_forward_pre_hooks': OrderedDict(), '_forward_pre_hooks_with_kwargs': OrderedDict(), '_state_dict_hooks': OrderedDict(), '_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_pre_hooks': OrderedDict(), '_load_state_dict_post_hooks': OrderedDict(), '_modules': {'visual': Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionFlashAttention2(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=3584, bias=True)
)
)
), 'model': Qwen2VLModel(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLFlashAttention2(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
), 'lm_head': Linear(in_features=3584, out_features=152064, bias=False)}, 'config': Qwen2VLConfig {
"_attn_implementation_autoset": true,
"_name_or_path": "/home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}
, 'name_or_path': '/home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct', 'warnings_issued': {}, 'generation_config': GenerationConfig {
"attn_implementation": "flash_attention_2",
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001
}
, '_keep_in_fp32_modules': None, 'vocab_size': 152064, 'rope_deltas': None, '_is_hf_initialized': True, '_old_forward': <bound method Qwen2VLForConditionalGeneration.forward of Qwen2VLForConditionalGeneration(
(visual): Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionFlashAttention2(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=3584, bias=True)
)
)
)
(model): Qwen2VLModel(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLFlashAttention2(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)>, '_hf_hook': AlignDevicesHook(execution_device=0, offload=False, io_same_device=True, offload_buffers=False, place_submodules=False, skip_keys='past_key_values'), 'forward': functools.partial(<function add_hook_to_module.<locals>.new_forward at 0x727f2aec7e20>, Qwen2VLForConditionalGeneration(
(visual): Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionFlashAttention2(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=3584, bias=True)
)
)
)
(model): Qwen2VLModel(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLFlashAttention2(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(lm_head): Linear(in_features=3584, out_features=152064, bias=False)
)), 'to': <function Module.to at 0x727f28b1c7c0>, 'cuda': <function Module.cuda at 0x727f28b1c860>, 'hf_device_map': {'visual': 0, 'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}}
__doc__: None
__module__: transformers.models.qwen2_vl.modeling_qwen2_vl
__weakref__: None
_auto_class: None
_backward_hooks: OrderedDict()
_backward_pre_hooks: OrderedDict()
_buffers: {}
_compiled_call_impl: None
_forward_hooks: OrderedDict()
_forward_hooks_always_called: OrderedDict()
_forward_hooks_with_kwargs: OrderedDict()
_forward_pre_hooks: OrderedDict()
_forward_pre_hooks_with_kwargs: OrderedDict()
_hf_hook: AlignDevicesHook(execution_device=0, offload=False, io_same_device=True, offload_buffers=False, place_submodules=False, skip_keys='past_key_values')
_hf_peft_config_loaded: False
_is_full_backward_hook: None
_is_hf_initialized: True
/home/sss/anaconda3/lib/python3.12/site-packages/transformers/modeling_utils.py:5055: FutureWarning: `_is_quantized_training_enabled` is going to be deprecated in transformers 4.39.0. Please use `model.hf_quantizer.is_trainable` instead
warnings.warn(
_is_quantized_training_enabled: False
_is_stateful: False
_keep_in_fp32_modules: None
_keys_to_ignore_on_load_missing: None
_keys_to_ignore_on_load_unexpected: None
_keys_to_ignore_on_save: None
_load_state_dict_post_hooks: OrderedDict()
_load_state_dict_pre_hooks: OrderedDict()
_modules: {'visual': Qwen2VisionTransformerPretrainedModel(
(patch_embed): PatchEmbed(
(proj): Conv3d(3, 1280, kernel_size=(2, 14, 14), stride=(2, 14, 14), bias=False)
)
(rotary_pos_emb): VisionRotaryEmbedding()
(blocks): ModuleList(
(0-31): 32 x Qwen2VLVisionBlock(
(norm1): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(attn): VisionFlashAttention2(
(qkv): Linear(in_features=1280, out_features=3840, bias=True)
(proj): Linear(in_features=1280, out_features=1280, bias=True)
)
(mlp): VisionMlp(
(fc1): Linear(in_features=1280, out_features=5120, bias=True)
(act): QuickGELUActivation()
(fc2): Linear(in_features=5120, out_features=1280, bias=True)
)
)
)
(merger): PatchMerger(
(ln_q): LayerNorm((1280,), eps=1e-06, elementwise_affine=True)
(mlp): Sequential(
(0): Linear(in_features=5120, out_features=5120, bias=True)
(1): GELU(approximate='none')
(2): Linear(in_features=5120, out_features=3584, bias=True)
)
)
), 'model': Qwen2VLModel(
(embed_tokens): Embedding(152064, 3584)
(layers): ModuleList(
(0-27): 28 x Qwen2VLDecoderLayer(
(self_attn): Qwen2VLFlashAttention2(
(q_proj): Linear(in_features=3584, out_features=3584, bias=True)
(k_proj): Linear(in_features=3584, out_features=512, bias=True)
(v_proj): Linear(in_features=3584, out_features=512, bias=True)
(o_proj): Linear(in_features=3584, out_features=3584, bias=False)
(rotary_emb): Qwen2VLRotaryEmbedding()
)
(mlp): Qwen2MLP(
(gate_proj): Linear(in_features=3584, out_features=18944, bias=False)
(up_proj): Linear(in_features=3584, out_features=18944, bias=False)
(down_proj): Linear(in_features=18944, out_features=3584, bias=False)
(act_fn): SiLU()
)
(input_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
(post_attention_layernorm): Qwen2RMSNorm((3584,), eps=1e-06)
)
)
(norm): Qwen2RMSNorm((3584,), eps=1e-06)
(rotary_emb): Qwen2VLRotaryEmbedding()
), 'lm_head': Linear(in_features=3584, out_features=152064, bias=False)}
_no_split_modules: ['Qwen2VLDecoderLayer', 'Qwen2VLVisionBlock']
_non_persistent_buffers_set: set()
_parameters: {}
_skip_keys_device_placement: past_key_values
_state_dict_hooks: OrderedDict()
_state_dict_pre_hooks: OrderedDict()
_supports_cache_class: True
_supports_flash_attn_2: True
_supports_flex_attn: False
_supports_quantized_cache: False
_supports_sdpa: True
_supports_static_cache: True
_tied_weights_keys: ['lm_head.weight']
_tp_plan: None
_version: 1
base_model_prefix: model
call_super_init: False
config: Qwen2VLConfig {
"_attn_implementation_autoset": true,
"_name_or_path": "/home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}
device: cuda:0
dtype: torch.bfloat16
dummy_inputs: {'input_ids': tensor([[7, 6, 0, 0, 1],
[1, 2, 3, 0, 0],
[0, 0, 0, 4, 5]])}
dump_patches: False
framework: pt
generation_config: GenerationConfig {
"attn_implementation": "flash_attention_2",
"bos_token_id": 151643,
"do_sample": true,
"eos_token_id": [
151645,
151643
],
"pad_token_id": 151643,
"temperature": 0.01,
"top_k": 1,
"top_p": 0.001
}
hf_device_map: {'visual': 0, 'model.embed_tokens': 0, 'model.layers.0': 0, 'model.layers.1': 0, 'model.layers.2': 0, 'model.layers.3': 0, 'model.layers.4': 0, 'model.layers.5': 0, 'model.layers.6': 0, 'model.layers.7': 0, 'model.layers.8': 0, 'model.layers.9': 0, 'model.layers.10': 0, 'model.layers.11': 1, 'model.layers.12': 1, 'model.layers.13': 1, 'model.layers.14': 1, 'model.layers.15': 1, 'model.layers.16': 1, 'model.layers.17': 1, 'model.layers.18': 1, 'model.layers.19': 1, 'model.layers.20': 1, 'model.layers.21': 1, 'model.layers.22': 1, 'model.layers.23': 1, 'model.layers.24': 1, 'model.layers.25': 1, 'model.layers.26': 1, 'model.layers.27': 1, 'model.norm': 1, 'model.rotary_emb': 1, 'lm_head': 1}
is_gradient_checkpointing: False
is_parallelizable: False
`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.
main_input_name: input_ids
model_tags: None
name_or_path: /home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct
rope_deltas: None
supports_gradient_checkpointing: True
supports_tp_plan: False
training: False
vocab_size: 152064
warnings_issued: {}
Actual torch_dtype: torch.bfloat16
Model Configuration:
Qwen2VLConfig {
"_attn_implementation_autoset": true,
"_name_or_path": "/home/sss/.cache/modelscope/hub/qwen/Qwen2-VL-7B-Instruct",
"architectures": [
"Qwen2VLForConditionalGeneration"
],
"attention_dropout": 0.0,
"bos_token_id": 151643,
"eos_token_id": 151645,
"hidden_act": "silu",
"hidden_size": 3584,
"image_token_id": 151655,
"initializer_range": 0.02,
"intermediate_size": 18944,
"max_position_embeddings": 32768,
"max_window_layers": 28,
"model_type": "qwen2_vl",
"num_attention_heads": 28,
"num_hidden_layers": 28,
"num_key_value_heads": 4,
"rms_norm_eps": 1e-06,
"rope_scaling": {
"mrope_section": [
16,
24,
24
],
"rope_type": "default",
"type": "default"
},
"rope_theta": 1000000.0,
"sliding_window": 32768,
"tie_word_embeddings": false,
"torch_dtype": "bfloat16",
"transformers_version": "4.47.0",
"use_cache": true,
"use_sliding_window": false,
"video_token_id": 151656,
"vision_config": {
"in_chans": 3,
"model_type": "qwen2_vl",
"spatial_patch_size": 14
},
"vision_end_token_id": 151653,
"vision_start_token_id": 151652,
"vision_token_id": 151654,
"vocab_size": 152064
}
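Dumping dir(model) is verbose, as the output above shows. A more targeted sketch that prints only the loading details of interest, reusing the model object from the script (note that _attn_implementation is a private transformers attribute and may change across versions, and hf_device_map is only set when device_map is used):

# Concise summary of how the checkpoint was actually loaded
print("dtype:", model.dtype)
print("attention implementation:", getattr(model.config, "_attn_implementation", "unknown"))
print("device map:", getattr(model, "hf_device_map", None))  # layer-to-GPU placement from device_map="auto"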
Input: a second test image (not reproduced here).
Output:
['This picture shows a tranquil lake scene. The surface of the lake is calm, reflecting the sky and the distant mountains. In the middle of the lake there is a small island with some trees and shrubs. The water around the island is clear, and some aquatic plants are visible. In the distance there are a few buildings, possibly a residential area or part of a city. The sky is clear and pale blue, giving a sense of peace and calm.']
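To try further images without repeating the boilerplate, the steps above can be wrapped in a small helper. This is a sketch that reuses the model and processor objects already loaded by the script; the function name describe_image is illustrative, not part of any API:

from PIL import Image

def describe_image(image_path: str, prompt: str = "描述这张图像。", max_new_tokens: int = 128) -> str:
    # Build the single-turn conversation with one image and one text prompt
    image = Image.open(image_path)
    conversation = [
        {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": prompt}]}
    ]
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
    inputs = processor(
        text=[text_prompt], images=[image], padding=True, return_tensors="pt"
    ).to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Keep only the newly generated tokens and decode them
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
    return processor.batch_decode(
        generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True
    )[0]

print(describe_image("./QueHuaQiuSe1.png"))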