MiniCPM-V 2.6 面壁“小钢炮”,多图、视频理解多模态模型,部署和推理实战教程

news2025/1/7 7:29:02

MiniCPM-V 2.6是清华和面壁智能最新发布的多模态模型,亦称面壁“小钢炮”,它是 MiniCPM-V 系列中最新、性能最佳的模型。该模型基于 SigLip-400MQwen2-7B 构建,仅 8B 参数,但却取得 20B 以下单图、多图、视频理解 3 SOTA 成绩,一举将端侧 AI 多模态能力拉升至全面对标 GPT-4V 水平。

MiniCPM-V 2.6 的主要特点包括:

  1. 8B 参数,单图、多图、视频理解全面超越 GPT-4V !
  2. 小钢炮一口气将实时视频理解、多图联合理解、多图 ICL 等能力首次搬上端侧多模态模型。
  3. 端侧友好:量化后端侧内存仅占 6 GB,个人笔记本电脑可部署和推理。

更多性能和功能介绍,参见 GitHub 官网:https://github.com/OpenBMB/MiniCPM-V

这么小的参数量,却能带来这么强悍的能力,老牛同学决定部署,和大家一起一探究竟:

  1. 准备环境、下载源代码和模型权重文件
  2. 模型部署,进行图片理解推理、WebUI 可视化部署和推理

环境准备和模型下载

环境准备分为 3 部分:Miniconda配置、下载 GitHub 源代码、下载MiniCPM-V 2.6模型权重文件。

Miniconda 配置

工欲善其事,必先利其器,大模型研发环境先准备好,为后面部署和推理做好准备。详细教程,大家可以参考之前的文章:大模型应用研发基础环境配置(Miniconda、Python、Jupyter Lab、Ollama 等)

conda create --name MiniCPM-V python=3.10 -y

我们创建 Python 虚拟环境:MiniCPM-V,同时 Python 的主版本:3.10

完成之后,我们激活虚拟环境:conda activate MiniCPM-V

GitHub 源代码下载

GitHub 源代码地址:https://github.com/OpenBMB/MiniCPM-V.git

源代码下载的目录:MiniCPM-V

git clone https://github.com/OpenBMB/MiniCPM-V.git MiniCPM-V

源代码下载完成之后,我们就可以安装 Python 依赖包了,包依赖列表文件:MiniCPM-V/requirements.txt

packaging==23.2
addict==2.4.0
editdistance==0.6.2
einops==0.7.0
fairscale==0.4.0
jsonlines==4.0.0
markdown2==2.4.10
matplotlib==3.7.4
more_itertools==10.1.0
nltk==3.8.1
numpy==1.24.4
opencv_python_headless==4.5.5.64
openpyxl==3.1.2
Pillow==10.1.0
sacrebleu==2.3.2
seaborn==0.13.0
shortuuid==1.0.11
timm==0.9.10
torch==2.1.2
torchvision==0.16.2
tqdm==4.66.1
protobuf==4.25.0
transformers==4.40.0
typing_extensions==4.8.0
uvicorn==0.24.0.post1
sentencepiece==0.1.99
accelerate==0.30.1
socksio==1.0.0
gradio==4.41.0
gradio_client
http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/modelscope_studio-0.4.0.9-py3-none-any.whl
eva-decord

特别注意:最后一个依赖包decord通过pip install decord安装可能会报如下错误,因此,老牛同学找到了替代的依赖包eva-decord,可正常安装。

(MiniCPM-V) $ pip install decord
ERROR: Could not find a version that satisfies the requirement decord (from versions: none)
ERROR: No matching distribution found for decord

模型权重文件下载

模型权重文件比较大,我们需要通过 Git 大文件系统下载:

权重文件下载的目录:MiniCPM-V2.6

git lfs install
git clone https://www.modelscope.cn/openbmb/minicpm-v-2_6.git MiniCPM-V2.6

下载过程中,如果因网络等原因中断,我们可以继续断点下载:

cd MiniCPM-V2.6
git lfs pull

小试牛刀:单图理解体验

老牛同学网上找了一张汽车图片,先试一下“小钢炮”的威力:

汽车尾部图

由于模型推理过程,需要用到权重模型中的 Python 模块,因此我们把推理的 Python 代码放到模型权重文件目录中:MiniCPM-V2.6/MiniCPM-V2.6-01.py

# MiniCPM-V2.6-01.py
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer

# 模型权重文件目录
model_dir = '.'

# 加载模型:local_files_only 加载本地模型,trust_remote_code 执行远程代码(必须)
model = AutoModel.from_pretrained(
    model_dir,
    local_files_only=True,
    trust_remote_code=True,
)

# 设置推理模式,如果有卡:model = model.eval().cuda()
model = model.eval()

# 加载分词器
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    local_files_only=True,
    trust_remote_code=True,
)

# 测试的汽车尾部图片,可以指定其它目录
image = Image.open('Car-01.jpeg').convert('RGB')

# 图片理解:自然语言理解 + 图片理解
question = '请问这是一张什么图片?'
msgs = [{'role': 'user', 'content': [image, question]}]

res = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    sampling=True,
    stream=True
)

# 理解结果
generated_text = ""
for new_text in res:
    generated_text += new_text
    print(new_text, flush=True, end='')

图片理解的输出结果如下:

汽车尾部图理解

可以看出,MiniCPM-V 2.6对图片的理解非常详尽:汽车、奥迪、A6L、尾部、黑色、中国、牌照区域、牌照内容等。

如果要给图片理解推理的结果打分的话,老牛同学打99 分,另外1 分的不足是给老牛同学自己的:推理速度实在太慢了,只能怪老牛同学的笔记本电脑配置太低了!

WebUI 可视化,推理自由

我们本地完成图片理解推理之后,在来看看 WebUI 可视化推理界面,体验会更好。同样的,我们把 WebUI 代码放到模型权重文件目录中:MiniCPM-V2.6/MiniCPM-V2.6-WebUI.py

# MiniCPM-V2.6-WebUI.py
#!/usr/bin/env python
# encoding: utf-8
import torch
import argparse
from transformers import AutoModel, AutoTokenizer
import gradio as gr
from PIL import Image
from decord import VideoReader, cpu
import io
import os
import copy
import requests
import base64
import json
import traceback
import re
import modelscope_studio as mgr

# 解析参数
parser = argparse.ArgumentParser(description='demo')
parser.add_argument('--device', type=str, default='cuda', help='cuda or mps')
parser.add_argument('--multi-gpus', action='store_true', default=False, help='use multi-gpus')
args = parser.parse_args()
device = args.device
assert device in ['cuda', 'mps']

# 模型权重文件目录
model_path = '.'

# 加载模型和分词器
if 'int4' in model_path:
    if device == 'mps':
        print('Error: running int4 model with bitsandbytes on Mac is not supported right now.')
        exit()
    model = AutoModel.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
else:
    if args.multi_gpus:
        from accelerate import load_checkpoint_and_dispatch, init_empty_weights, infer_auto_device_map
        with init_empty_weights():
            model = AutoModel.from_pretrained(model_path, local_files_only=True, trust_remote_code=True, attn_implementation='sdpa', torch_dtype=torch.bfloat16)
        device_map = infer_auto_device_map(model, max_memory={0: "10GB", 1: "10GB"},
            no_split_module_classes=['SiglipVisionTransformer', 'Qwen2DecoderLayer'])
        device_id = device_map["llm.model.embed_tokens"]
        device_map["llm.lm_head"] = device_id # firtt and last layer should be in same device
        device_map["vpm"] = device_id
        device_map["resampler"] = device_id
        device_id2 = device_map["llm.model.layers.26"]
        device_map["llm.model.layers.8"] = device_id2
        device_map["llm.model.layers.9"] = device_id2
        device_map["llm.model.layers.10"] = device_id2
        device_map["llm.model.layers.11"] = device_id2
        device_map["llm.model.layers.12"] = device_id2
        device_map["llm.model.layers.13"] = device_id2
        device_map["llm.model.layers.14"] = device_id2
        device_map["llm.model.layers.15"] = device_id2
        device_map["llm.model.layers.16"] = device_id2
        #print(device_map)

        model = load_checkpoint_and_dispatch(model, model_path, local_files_only=True, dtype=torch.bfloat16, device_map=device_map)
    else:
        model = AutoModel.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)
        model = model.to(device=device)
tokenizer = AutoTokenizer.from_pretrained(model_path, local_files_only=True, trust_remote_code=True)

# 设置推理模式
model.eval()

ERROR_MSG = "Error, please retry"
model_name = 'MiniCPM-V 2.6'
MAX_NUM_FRAMES = 64
IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.webp'}
VIDEO_EXTENSIONS = {'.mp4', '.mkv', '.mov', '.avi', '.flv', '.wmv', '.webm', '.m4v'}

def get_file_extension(filename):
    return os.path.splitext(filename)[1].lower()

def is_image(filename):
    return get_file_extension(filename) in IMAGE_EXTENSIONS

def is_video(filename):
    return get_file_extension(filename) in VIDEO_EXTENSIONS


form_radio = {
    'choices': ['Beam Search', 'Sampling'],
    #'value': 'Beam Search',
    'value': 'Sampling',
    'interactive': True,
    'label': 'Decode Type'
}


def create_component(params, comp='Slider'):
    if comp == 'Slider':
        return gr.Slider(
            minimum=params['minimum'],
            maximum=params['maximum'],
            value=params['value'],
            step=params['step'],
            interactive=params['interactive'],
            label=params['label']
        )
    elif comp == 'Radio':
        return gr.Radio(
            choices=params['choices'],
            value=params['value'],
            interactive=params['interactive'],
            label=params['label']
        )
    elif comp == 'Button':
        return gr.Button(
            value=params['value'],
            interactive=True
        )


def create_multimodal_input(upload_image_disabled=False, upload_video_disabled=False):
    return mgr.MultimodalInput(upload_image_button_props={'label': 'Upload Image', 'disabled': upload_image_disabled, 'file_count': 'multiple'},
                                        upload_video_button_props={'label': 'Upload Video', 'disabled': upload_video_disabled, 'file_count': 'single'},
                                        submit_button_props={'label': 'Submit'})


def chat(img, msgs, ctx, params=None, vision_hidden_states=None):
    try:
        print('msgs:', msgs)
        answer = model.chat(
            image=None,
            msgs=msgs,
            tokenizer=tokenizer,
            **params
        )
        res = re.sub(r'(<box>.*</box>)', '', answer)
        res = res.replace('<ref>', '')
        res = res.replace('</ref>', '')
        res = res.replace('<box>', '')
        answer = res.replace('</box>', '')
        print('answer:', answer)
        return 0, answer, None, None
    except Exception as e:
        print(e)
        traceback.print_exc()
        return -1, ERROR_MSG, None, None


def encode_image(image):
    if not isinstance(image, Image.Image):
        if hasattr(image, 'path'):
            image = Image.open(image.path).convert("RGB")
        else:
            image = Image.open(image.file.path).convert("RGB")
    # resize to max_size
    max_size = 448*16
    if max(image.size) > max_size:
        w,h = image.size
        if w > h:
            new_w = max_size
            new_h = int(h * max_size / w)
        else:
            new_h = max_size
            new_w = int(w * max_size / h)
        image = image.resize((new_w, new_h), resample=Image.BICUBIC)
    return image
    ## save by BytesIO and convert to base64
    #buffered = io.BytesIO()
    #image.save(buffered, format="png")
    #im_b64 = base64.b64encode(buffered.getvalue()).decode()
    #return {"type": "image", "pairs": im_b64}


def encode_video(video):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    if hasattr(video, 'path'):
        vr = VideoReader(video.path, ctx=cpu(0))
    else:
        vr = VideoReader(video.file.path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx)>MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    video = vr.get_batch(frame_idx).asnumpy()
    video = [Image.fromarray(v.astype('uint8')) for v in video]
    video = [encode_image(v) for v in video]
    print('video frames:', len(video))
    return video


def check_mm_type(mm_file):
    if hasattr(mm_file, 'path'):
        path = mm_file.path
    else:
        path = mm_file.file.path
    if is_image(path):
        return "image"
    if is_video(path):
        return "video"
    return None


def encode_mm_file(mm_file):
    if check_mm_type(mm_file) == 'image':
        return [encode_image(mm_file)]
    if check_mm_type(mm_file) == 'video':
        return encode_video(mm_file)
    return None

def make_text(text):
    #return {"type": "text", "pairs": text} # # For remote call
    return text

def encode_message(_question):
    files = _question.files
    question = _question.text
    pattern = r"\[mm_media\]\d+\[/mm_media\]"
    matches = re.split(pattern, question)
    message = []
    if len(matches) != len(files) + 1:
        gr.Warning("Number of Images not match the placeholder in text, please refresh the page to restart!")
    assert len(matches) == len(files) + 1

    text = matches[0].strip()
    if text:
        message.append(make_text(text))
    for i in range(len(files)):
        message += encode_mm_file(files[i])
        text = matches[i + 1].strip()
        if text:
            message.append(make_text(text))
    return message


def check_has_videos(_question):
    images_cnt = 0
    videos_cnt = 0
    for file in _question.files:
        if check_mm_type(file) == "image":
            images_cnt += 1
        else:
            videos_cnt += 1
    return images_cnt, videos_cnt


def count_video_frames(_context):
    num_frames = 0
    for message in _context:
        for item in message["content"]:
            #if item["type"] == "image": # For remote call
            if isinstance(item, Image.Image):
                num_frames += 1
    return num_frames


def respond(_question, _chat_bot, _app_cfg, params_form):
    _context = _app_cfg['ctx'].copy()
    _context.append({'role': 'user', 'content': encode_message(_question)})

    images_cnt = _app_cfg['images_cnt']
    videos_cnt = _app_cfg['videos_cnt']
    files_cnts = check_has_videos(_question)
    if files_cnts[1] + videos_cnt > 1 or (files_cnts[1] + videos_cnt == 1 and files_cnts[0] + images_cnt > 0):
        gr.Warning("Only supports single video file input right now!")
        return _question, _chat_bot, _app_cfg

    if params_form == 'Beam Search':
        params = {
            'sampling': False,
            'num_beams': 3,
            'repetition_penalty': 1.2,
            "max_new_tokens": 2048
        }
    else:
        params = {
            'sampling': True,
            'top_p': 0.8,
            'top_k': 100,
            'temperature': 0.7,
            'repetition_penalty': 1.05,
            "max_new_tokens": 2048
        }

    if files_cnts[1] + videos_cnt > 0:
        params["max_inp_length"] = 4352 # 4096+256
        params["use_image_id"] = False
        params["max_slice_nums"] = 1 if count_video_frames(_context) > 16 else 2

    code, _answer, _, sts = chat("", _context, None, params)

    images_cnt += files_cnts[0]
    videos_cnt += files_cnts[1]
    _context.append({"role": "assistant", "content": [make_text(_answer)]})
    _chat_bot.append((_question, _answer))
    if code == 0:
        _app_cfg['ctx']=_context
        _app_cfg['sts']=sts
    _app_cfg['images_cnt'] = images_cnt
    _app_cfg['videos_cnt'] = videos_cnt

    upload_image_disabled = videos_cnt > 0
    upload_video_disabled = videos_cnt > 0 or images_cnt > 0
    return create_multimodal_input(upload_image_disabled, upload_video_disabled), _chat_bot, _app_cfg


def fewshot_add_demonstration(_image, _user_message, _assistant_message, _chat_bot, _app_cfg):
    ctx = _app_cfg["ctx"]
    message_item = []
    if _image is not None:
        image = Image.open(_image).convert("RGB")
        ctx.append({"role": "user", "content": [encode_image(image), make_text(_user_message)]})
        message_item.append({"text": "[mm_media]1[/mm_media]" + _user_message, "files": [_image]})
    else:
        if _user_message:
            ctx.append({"role": "user", "content": [make_text(_user_message)]})
            message_item.append({"text": _user_message, "files": []})
        else:
            message_item.append(None)
    if _assistant_message:
        ctx.append({"role": "assistant", "content": [make_text(_assistant_message)]})
        message_item.append({"text": _assistant_message, "files": []})
    else:
        message_item.append(None)

    _chat_bot.append(message_item)
    return None, "", "", _chat_bot, _app_cfg


def fewshot_respond(_image, _user_message, _chat_bot, _app_cfg, params_form):
    user_message_contents = []
    _context = _app_cfg["ctx"].copy()
    if _image:
        image = Image.open(_image).convert("RGB")
        user_message_contents += [encode_image(image)]
    if _user_message:
        user_message_contents += [make_text(_user_message)]
    if user_message_contents:
        _context.append({"role": "user", "content": user_message_contents})

    if params_form == 'Beam Search':
        params = {
            'sampling': False,
            'num_beams': 3,
            'repetition_penalty': 1.2,
            "max_new_tokens": 2048
        }
    else:
        params = {
            'sampling': True,
            'top_p': 0.8,
            'top_k': 100,
            'temperature': 0.7,
            'repetition_penalty': 1.05,
            "max_new_tokens": 2048
        }

    code, _answer, _, sts = chat("", _context, None, params)

    _context.append({"role": "assistant", "content": [make_text(_answer)]})

    if _image:
        _chat_bot.append([
            {"text": "[mm_media]1[/mm_media]" + _user_message, "files": [_image]},
            {"text": _answer, "files": []}
        ])
    else:
        _chat_bot.append([
            {"text": _user_message, "files": [_image]},
            {"text": _answer, "files": []}
        ])
    if code == 0:
        _app_cfg['ctx']=_context
        _app_cfg['sts']=sts
    return None, '', '', _chat_bot, _app_cfg


def regenerate_button_clicked(_question, _image, _user_message, _assistant_message, _chat_bot, _app_cfg, params_form):
    if len(_chat_bot) <= 1 or not _chat_bot[-1][1]:
        gr.Warning('No question for regeneration.')
        return '', _image, _user_message, _assistant_message, _chat_bot, _app_cfg
    if _app_cfg["chat_type"] == "Chat":
        images_cnt = _app_cfg['images_cnt']
        videos_cnt = _app_cfg['videos_cnt']
        _question = _chat_bot[-1][0]
        _chat_bot = _chat_bot[:-1]
        _app_cfg['ctx'] = _app_cfg['ctx'][:-2]
        files_cnts = check_has_videos(_question)
        images_cnt -= files_cnts[0]
        videos_cnt -= files_cnts[1]
        _app_cfg['images_cnt'] = images_cnt
        _app_cfg['videos_cnt'] = videos_cnt
        upload_image_disabled = videos_cnt > 0
        upload_video_disabled = videos_cnt > 0 or images_cnt > 0
        _question, _chat_bot, _app_cfg = respond(_question, _chat_bot, _app_cfg, params_form)
        return _question, _image, _user_message, _assistant_message, _chat_bot, _app_cfg
    else:
        last_message = _chat_bot[-1][0]
        last_image = None
        last_user_message = ''
        if last_message.text:
            last_user_message = last_message.text
        if last_message.files:
            last_image = last_message.files[0].file.path
        _chat_bot = _chat_bot[:-1]
        _app_cfg['ctx'] = _app_cfg['ctx'][:-2]
        _image, _user_message, _assistant_message, _chat_bot, _app_cfg = fewshot_respond(last_image, last_user_message, _chat_bot, _app_cfg, params_form)
        return _question, _image, _user_message, _assistant_message, _chat_bot, _app_cfg


def flushed():
    return gr.update(interactive=True)


def clear(txt_message, chat_bot, app_session):
    txt_message.files.clear()
    txt_message.text = ''
    chat_bot = copy.deepcopy(init_conversation)
    app_session['sts'] = None
    app_session['ctx'] = []
    app_session['images_cnt'] = 0
    app_session['videos_cnt'] = 0
    return create_multimodal_input(), chat_bot, app_session, None, '', ''


def select_chat_type(_tab, _app_cfg):
    _app_cfg["chat_type"] = _tab
    return _app_cfg


init_conversation = [
    [
        None,
        {
            # The first message of bot closes the typewriter.
            "text": "You can talk to me now",
            "flushing": False
        }
    ],
]


css = """
video { height: auto !important; }
.example label { font-size: 16px;}
"""

introduction = """

## Features:
1. Chat with single image
2. Chat with multiple images
3. Chat with video
4. In-context few-shot learning

Click `How to use` tab to see examples.
"""


with gr.Blocks(css=css) as demo:
    with gr.Tab(model_name):
        with gr.Row():
            with gr.Column(scale=1, min_width=300):
                gr.Markdown(value=introduction)
                params_form = create_component(form_radio, comp='Radio')
                regenerate = create_component({'value': 'Regenerate'}, comp='Button')
                clear_button = create_component({'value': 'Clear History'}, comp='Button')

            with gr.Column(scale=3, min_width=500):
                app_session = gr.State({'sts':None,'ctx':[], 'images_cnt': 0, 'videos_cnt': 0, 'chat_type': 'Chat'})
                chat_bot = mgr.Chatbot(label=f"Chat with {model_name}", value=copy.deepcopy(init_conversation), height=600, flushing=False, bubble_full_width=False)

                with gr.Tab("Chat") as chat_tab:
                    txt_message = create_multimodal_input()
                    chat_tab_label = gr.Textbox(value="Chat", interactive=False, visible=False)

                    txt_message.submit(
                        respond,
                        [txt_message, chat_bot, app_session, params_form],
                        [txt_message, chat_bot, app_session]
                    )

                with gr.Tab("Few Shot") as fewshot_tab:
                    fewshot_tab_label = gr.Textbox(value="Few Shot", interactive=False, visible=False)
                    with gr.Row():
                        with gr.Column(scale=1):
                            image_input = gr.Image(type="filepath", sources=["upload"])
                        with gr.Column(scale=3):
                            user_message = gr.Textbox(label="User")
                            assistant_message = gr.Textbox(label="Assistant")
                            with gr.Row():
                                add_demonstration_button = gr.Button("Add Example")
                                generate_button = gr.Button(value="Generate", variant="primary")
                    add_demonstration_button.click(
                        fewshot_add_demonstration,
                        [image_input, user_message, assistant_message, chat_bot, app_session],
                        [image_input, user_message, assistant_message, chat_bot, app_session]
                    )
                    generate_button.click(
                        fewshot_respond,
                        [image_input, user_message, chat_bot, app_session, params_form],
                        [image_input, user_message, assistant_message, chat_bot, app_session]
                    )

                chat_tab.select(
                    select_chat_type,
                    [chat_tab_label, app_session],
                    [app_session]
                )
                chat_tab.select( # do clear
                    clear,
                    [txt_message, chat_bot, app_session],
                    [txt_message, chat_bot, app_session, image_input, user_message, assistant_message]
                )
                fewshot_tab.select(
                    select_chat_type,
                    [fewshot_tab_label, app_session],
                    [app_session]
                )
                fewshot_tab.select( # do clear
                    clear,
                    [txt_message, chat_bot, app_session],
                    [txt_message, chat_bot, app_session, image_input, user_message, assistant_message]
                )
                chat_bot.flushed(
                    flushed,
                    outputs=[txt_message]
                )
                regenerate.click(
                    regenerate_button_clicked,
                    [txt_message, image_input, user_message, assistant_message, chat_bot, app_session, params_form],
                    [txt_message, image_input, user_message, assistant_message, chat_bot, app_session]
                )
                clear_button.click(
                    clear,
                    [txt_message, chat_bot, app_session],
                    [txt_message, chat_bot, app_session, image_input, user_message, assistant_message]
                )

    with gr.Tab("How to use"):
        with gr.Column():
            with gr.Row():
                image_example = gr.Image(value="http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/m_bear2.gif", label='1. Chat with single or multiple images', interactive=False, width=400, elem_classes="example")
                example2 = gr.Image(value="http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/video2.gif", label='2. Chat with video', interactive=False, width=400, elem_classes="example")
                example3 = gr.Image(value="http://thunlp.oss-cn-qingdao.aliyuncs.com/multi_modal/never_delete/fshot.gif", label='3. Few shot', interactive=False, width=400, elem_classes="example")


# 启动WebUI: http://127.0.0.1:8885
demo.launch(share=False, debug=True, show_api=False, server_port=8885, server_name="0.0.0.0")

WebUI 支持 GPU 和苹果 CPU 推理,启动方式分别为:

  • GPU:python web_demo_2.6.py --device cuda
  • 苹果 MPS:PYTORCH_ENABLE_MPS_FALLBACK=1 PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 python web_demo_2.6.py --device mps

WebUI启动界面

浏览器打开地址:http://0.0.0.0:8885/ ,可以看到可视化界面:

WebUI对话界面

WebUI 功能支持:自然语言对话、上传图片、上传视频等理解。

终极考验:视频理解体验

老牛同学受限于电脑配置,视频预计推理速度将会极慢,因此不做演示。根据官方代码,视频推理和图片推理类似,代码样例如下:

import torch
from PIL import Image
from modelscope import AutoModel, AutoTokenizer
from decord import VideoReader, cpu

# 模型权重文件目录
model_dir = '.'

# 加载模型:local_files_only 加载本地模型,trust_remote_code 执行远程代码(必须)
model = AutoModel.from_pretrained(
    model_dir,
    local_files_only=True,
    trust_remote_code=True,
)

# 设置推理模式,如果有卡:model = model.eval().cuda()
model = model.eval()

# 加载分词器
tokenizer = AutoTokenizer.from_pretrained(
    model_dir,
    local_files_only=True,
    trust_remote_code=True,
)


MAX_NUM_FRAMES=64

def encode_video(video_path):
    def uniform_sample(l, n):
        gap = len(l) / n
        idxs = [int(i * gap + gap / 2) for i in range(n)]
        return [l[i] for i in idxs]

    vr = VideoReader(video_path, ctx=cpu(0))
    sample_fps = round(vr.get_avg_fps() / 1)  # FPS
    frame_idx = [i for i in range(0, len(vr), sample_fps)]
    if len(frame_idx) > MAX_NUM_FRAMES:
        frame_idx = uniform_sample(frame_idx, MAX_NUM_FRAMES)
    frames = vr.get_batch(frame_idx).asnumpy()
    frames = [Image.fromarray(v.astype('uint8')) for v in frames]
    print('num frames:', len(frames))
    return frames

# 视频文件路径
video_path="~/Car.mp4"

frames = encode_video(video_path)

question = "请问这是一个什么视频?"
msgs = [
    {'role': 'user', 'content': frames + [question]},
]

# Set decode params for video
params={}
params["use_image_id"] = False
params["max_slice_nums"] = 2 # 如果cuda OOM且视频分辨率大于448*448 可设为1

answer = model.chat(
    image=None,
    msgs=msgs,
    tokenizer=tokenizer,
    **params
)

print(answer)

老牛同学在这里,引用了官方的绘制兔子头像的一个视频理解:

|

基于 Qwen2/Lllama3 等大模型,部署团队私有化 RAG 知识库系统的详细教程(Docker+AnythingLLM)

使用 Llama3/Qwen2 等开源大模型,部署团队私有化 Code Copilot 和使用教程

基于 Qwen2 大模型微调技术详细教程(LoRA 参数高效微调和 SwanLab 可视化监控)

LivePortrait 数字人:开源的图生视频模型,本地部署和专业视频制作详细教程

本地部署 GLM-4-9B 清华智谱开源大模型方法和对话效果体验

玩转 AI,笔记本电脑安装属于自己的 Llama 3 8B 大模型和对话客户端

ChatTTS 开源文本转语音模型本地部署、API 使用和搭建 WebUI 界面

Ollama 完整教程:本地 LLM 管理、WebUI 对话、Python/Java 客户端 API 应用

微信公众号:老牛同学

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/2040781.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

leetcode300. 最长递增子序列,动态规划附状态转移方程

leetcode300. 最长递增子序列 给你一个整数数组 nums &#xff0c;找到其中最长严格递增子序列的长度。 子序列 是由数组派生而来的序列&#xff0c;删除&#xff08;或不删除&#xff09;数组中的元素而不改变其余元素的顺序。例如&#xff0c;[3,6,2,7] 是数组 [0,3,1,6,2,2…

【扩散模型(七)】Stable Diffusion 3 diffusers 源码详解2 - DiT 与 MMDiT 相关代码(上)

系列文章目录 【扩散模型&#xff08;一&#xff09;】中介绍了 Stable Diffusion 可以被理解为重建分支&#xff08;reconstruction branch&#xff09;和条件分支&#xff08;condition branch&#xff09;【扩散模型&#xff08;二&#xff09;】IP-Adapter 从条件分支的视…

“从零开始的HTML 表格”——WEB开发系列09

HTML 表格是一种用于在网页上组织和显示信息的结构性元素&#xff0c;它能够将数据以行和列的形式呈现&#xff0c;帮助用户更清晰地理解数据关系。表格在展示统计数据、产品列表、日程安排等方面非常实用。 一、HTML 表格的基本结构 HTML 表格用 ​​<table>​​ 标签来…

创意无限!2024年热门视频剪辑软件精选

从专业级电影剪辑工具到简单易用的手机APP&#xff0c;再到集创意与高效于一身的桌面应用&#xff0c;各类剪辑软件如雨后春笋般涌现。本文将带你一窥2024年火热的剪辑视频的软件。 1.福昕视频剪辑 连接直达>>https://www.pdf365.cn/foxit-clip/ 这款视频编辑工具凭…

Oracle 12.2集群搭建遇到ORA-ORA-15227,ORA-15031,ORA-15018问题处理

报错&#xff1a; [FATAL] [DBT-30056] Labeling of disks failed. ORA-15227: could not perform label set/clear operation ORA-15031: disk specification /dev/asmdisk/ocr01 matches no disks [FATAL] [DBT-30002] Disk group OCR creation failed. ORA-15018: diskgrou…

用Python实现9大回归算法详解——03. 岭回归算法

1. 岭回归的基本概念与动机 1.1 为什么使用岭回归&#xff1f; 在线性回归中&#xff0c;当特征之间存在强烈的相关性&#xff08;即多重共线性&#xff09;时&#xff0c;回归系数会变得不稳定&#xff0c;导致模型在新数据上的表现很差。多重共线性会导致普通最小二乘法&am…

stm32f407新建项目工程及烧录

1、新建一个文件夹&#xff0c;打开keil5将项目工程放入文件夹中 2、弹出选择对应型号设备 3、弹出选择对应库 可以看见出现下图&#xff1a;感叹号表示有错 最后如图所示&#xff1a;点击ok就行了 4、创建对应的文件夹存放文件 4、建立main.c 5、添加对应的设置 最后写一个空白…

sp-eric靶机

端口扫描 靶机ip地址为192.168.7.46 目录扫描 访问80端口 拼接访问 /admin.php 发现登录框 测试sql注入&#xff0c;弱口令等&#xff0c;无结果 扫描目录发现了.git文件&#xff0c;存在源码泄漏 将其下载到kali上读取 python2 GitHack.py -u http://192.168.7.180/.git/…

Linux11

Linux运行级别 graphical.target图形化模式 runlevel查看运行级别 init 6自动重启 centos7单用户模式修改密码 Windows安全模式可用来删除木马&#xff0c;更为方便 单用户模式修改密码 选择第一个 按e键进入编辑模式&#xff0c;并完成以下修改&#xff08;注意&#xff0…

linux上Java生成图片中文乱码

在生成图形二维码时&#xff0c;设置底部中文导出空白乱码&#xff0c;效果如下&#xff1a; 这里服务器使用的是centos7&#xff0c;解决方案下载simsun.ttc文件&#xff0c;放入至jdk安装目录“/opt/jdk/jre/lib/fonts”中&#xff0c;具体根据自身本机jdk安装路径存放&…

ZOOKEEPER+KAFKA消息队列群集

前言 消息队列(Message Queue)&#xff0c;是分布式系统中重要的组件&#xff0c;其通用的使用场景可以简单地描述为:当不需要立即获得结果&#xff0c;但是并发量又需要进行控制的时候&#xff0c;差不多就是需要使用消息队列的时候。 消息队列 什么是消息队列 消息(Messa…

CAD二次开发IFoxCAD框架系列(19)-图层操作

1. 根据名称查询指定的图层 查看层表中是否含有名为“MyLayer”的图层。 using var tr new DBTrans();if(tr.LayerTable.Has("MyLayer")){//要执行的操作}2. 遍历图层名称 遍历图层表并打印每个图层的名字 。 using var tr new DBTrans();tr.LayerTable.GetRecor…

MySQL源码安装与MySQL基础学习

1、安装MySQL ​ 本次安装使用的是绿色硬盘版本&#xff0c;无需额外安装依赖环境&#xff0c;比较简单 修改相关配置文件&#xff1a; 设置环境变量&#xff0c;声明/宣告MySQL命令便于系统识别&#xff1a; 初始化数据库&#xff1a; 设置系统识别&#xff0c;进行操作&…

【代码随想录】数组总结篇

本博文为《代码随想录》的学习笔记&#xff0c;原文链接&#xff1a;代码随想录 数组理论基础 首先要知道数组在内存中的存储方式&#xff0c;数组时存放在连续内存空间上的相同类型数据的集合。数组可以方便地通过下标索引的方式获取到下标对应的数据。举例如下&#xff1a;…

电信优惠套餐到期会自动延续吗?这个问题你了解过吗?

电信优惠套餐到期会自动延续吗&#xff1f;看选择的套餐&#xff0c;不同的套餐情况不同。 对于电信流量卡的优惠期限&#xff0c;有以下几种情况&#xff1a; 短期套餐&#xff1a; 6个月、12个月、24个月等&#xff0c;套餐到期后会恢复原来的资费&#xff0c;不会自动延续…

分组汇总时保留不变列

Excel表格的ID列是分类&#xff0c;Value1和Value2是数值&#xff0c;ID相同时Descr 1和Descr 2保持不变。 ABCDE1IDValue 1Value 2Descr 1Descr 22112.51.8ax13112.31.1ax14111.91.6ax15123.73.5bx26123.91.5bx27132.50.2cx38132.64.1cx391324.8cx310132.71.8cx3 要求&#…

Linux Shell实例

1.查空行 答案&#xff1a; awk /^$/{print NR} file1.txt#awk:一个强大的文本分析工具&#xff0c;把文件逐行的读入&#xff0c;以空格为默认分隔符将每行切片&#xff0c;切开的部分再进行分析#处理。 #1&#xff09;基本语法 #awk [选项参数]/pattern1/{action1} /pattern…

【数据结构详解】——计数排序(动图详解)

目录 &#x1f552; 1. 计数排序 &#x1f552; 1. 计数排序 &#x1f4a1; 算法思想&#xff1a;计数排序又称为鸽巢原理&#xff0c;是对哈希直接定址法的变形应用&#xff0c;操作步骤&#xff1a; 统计相同元素出现次数根据统计的结果将序列回收到原来的序列中 void Coun…

Windows11 WSL2 Ubuntu编译安装perf工具

在Windows 11上通过WSL2安装并编译perf工具&#xff08;Linux性能分析工具&#xff09;可以按以下步骤进行。perf工具通常与Linux内核一起发布&#xff0c;因此你需要确保你的内核版本和perf版本匹配。以下是安装和编译perf的步骤&#xff1a; 1. 更新并升级系统 首先&#x…

【算法】并查集的介绍与使用

1.并查集的概论 定义&#xff1a; 并查集是一种树型的数据结构&#xff0c;用于处理一些不相交集合的合并及查询问题&#xff08;即所谓的并、查&#xff09;。比如说&#xff0c;我们可以用并查集来判断一个森林中有几棵树、某个节点是否属于某棵树等。 主要构成&#xff1a; …