Grounded-Segment-Anything Local Deployment
- 1. Deploy the source code locally
- 1.1 Clone the source code
- 1.2 Download the pretrained weights
- 2. Create and configure the virtual environment
- 3. Test the demo files
- 3.1 Run the `grounding_dino_demo.py` file
- 3.2 Run the `grounded_sam_demo.py` file
- 3.3 Run the `grounded_sam_simple_demo.py` file
- 3.4 Run the `grounded_sam_inpainting_demo.py` file
- 3.5 Run the `automatic_label_ram_demo.py` file
- 3.6 Run the `automatic_label_demo.py` file
- 3.7 Batch auto-labeling of images
- 4. Summary
- Source code: https://github.com/IDEA-Research/Grounded-Segment-Anything
- Introduction
- Segment Anything Model, or SAM for short.
- SAM has learned a general notion of what objects are, and it can generate masks for any object in any image or video, even objects and image types it never encountered during training.
- SAM is general enough to cover a broad range of use cases and can be used out of the box on new image domains without additional training.
1. Deploy the source code locally
1.1 Clone the source code
- Clone command
git clone https://github.com/IDEA-Research/Grounded-Segment-Anything.git
- Problem 1
- After cloning, three of the subfolders turn out to be empty.
- The repo page lists two commands that are supposed to pull the contents of these three folders, but both failed with errors for me.
- Fix: download the three repos manually and place them in the corresponding folders (the standard git commands are worth trying first, see below).
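- Note: if the two failing commands were the usual submodule ones, the standard git way to fill the empty folders (assuming they are tracked as submodules) is:
git submodule update --init --recursive
- or, when cloning from scratch:
git clone --recurse-submodules https://github.com/IDEA-Research/Grounded-Segment-Anything.git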
- Problem 2: after placing the folders manually, the file paths referenced by the code are wrong
- Fix: run the code and correct every bad path it reports
- Fix the paths in `segment_anything`
- Fix the paths in `GroundingDINO`
1.2 Download the pretrained weights
- Weight links
- groundingdino_swint_ogc.pth
- sam_vit_h_4b8939.pth
- sam_hq_vit_h.pth
- ram_swin_large_14m.pth
- tag2text_swin_14m.pth
- Where to put the weight files: the project root directory
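A quick way to confirm that all five files actually landed in the project root (a throwaway sketch, not part of the repo; printing the size makes truncated downloads easy to spot):

```python
# run from the project root
from pathlib import Path

CHECKPOINTS = [
    "groundingdino_swint_ogc.pth",
    "sam_vit_h_4b8939.pth",
    "sam_hq_vit_h.pth",
    "ram_swin_large_14m.pth",
    "tag2text_swin_14m.pth",
]

for name in CHECKPOINTS:
    path = Path(name)
    # report the size so a truncated download stands out
    status = f"{path.stat().st_size / 2**30:.2f} GiB" if path.exists() else "MISSING"
    print(f"{name}: {status}")
```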
2. Create and configure the virtual environment
- Create the virtual environment
conda create -n env_grounded_segment_anything python==3.8.10
- Activate the virtual environment
conda activate env_grounded_segment_anything
- Install PyTorch
pip install torch==1.10.0+cu113 torchvision==0.11.0+cu113 torchaudio==0.10.0 -f https://download.pytorch.org/whl/torch_stable.html
- Install the libraries from requirements.txt
pip install -r requirements.txt
- Open the project in PyCharm
- Select the virtual environment as the project interpreter
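To verify the environment before running anything heavy (a quick check, not part of the repo):

```python
import torch
import torchvision

print(torch.__version__)          # expect 1.10.0+cu113
print(torchvision.__version__)    # expect 0.11.0+cu113
print(torch.cuda.is_available())  # False means every demo must run with device set to cpu
```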
3. Test the demo files
3.1 Run the `grounding_dino_demo.py` file
- No GPU: change the DEVICE value to cpu (see the sketch below)
- With a GPU: no parameter changes needed
- It generates the annotated image
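A minimal version of that edit, assuming the demo defines a module-level DEVICE constant (hard-coding "cpu" works just as well):

```python
import torch

# fall back to CPU automatically when no CUDA device is present
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
```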
3.2 Run the `grounded_sam_demo.py` file
- Add the launch parameters; my machine has no GPU, so the device parameter stays at its default, cpu:
--config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/demo1.jpg --output_dir "outputs" --box_threshold 0.3 --text_threshold 0.25 --text_prompt "bear"
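Equivalently, run it from the project root as one command (the same flags as above, just with the script name in front):

python grounded_sam_demo.py --config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/demo1.jpg --output_dir "outputs" --box_threshold 0.3 --text_threshold 0.25 --text_prompt "bear"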
- Output results and file descriptions
- Image display
3.3 Run the `grounded_sam_simple_demo.py` file
- "No CUDA" error: in inference.py, change the device value to cpu
- Test demo4.jpg [no code changes needed]
- Test demo7.jpg: edit the demo image settings in the script (see the sketch below)
- Result
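The edit boils down to pointing the script's image path and prompt classes at the new picture. The variable names below are assumptions based on the simple demo's style, and the class list for demo7.jpg follows the repo README's example; adjust both to match your copy:

```python
# assumed variable names near the top of grounded_sam_simple_demo.py
SOURCE_IMAGE_PATH = "assets/demo7.jpg"
CLASSES = ["horse", "clouds", "grasses", "sky", "hill"]
```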
3.4 Run the `grounded_sam_inpainting_demo.py` file
- This demo inpaints (repairs) part of an image
- Add the launch parameters:
--config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/inpaint_demo.jpg --output_dir "outputs" --box_threshold 0.3 --text_threshold 0.25 --det_prompt "bench" --inpaint_prompt "A sofa, high quality, detailed"
- Error: downloading the remote model files failed
- Fix: download them manually
- Download link: https://huggingface.co/runwayml/stable-diffusion-inpainting/tree/main
- Put the downloaded files in the config_data folder
- Change the model path in the code to the local path
- Re-run: another error [cause: I have no GPU]
- Fix: change cuda to cpu
- Re-run: yet another error:
RuntimeError: "LayerNormKernelImpl" not implemented for 'Half'
- Fix (solution link): change float16 to float32 (a sketch combining the three changes is below)
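Putting the three changes together (local model path, CPU device, float32), the pipeline construction ends up looking roughly like this (a sketch; the variable name and local subfolder name may differ in your copy):

```python
import torch
from diffusers import StableDiffusionInpaintPipeline

# load from the manually downloaded local folder instead of the Hugging Face Hub
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "config_data/stable-diffusion-inpainting",  # assumed local path
    torch_dtype=torch.float32,  # float16 LayerNorm is not implemented on CPU
)
pipe = pipe.to("cpu")  # was "cuda" in the original demo
```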
- Result
3.5 Run the `automatic_label_ram_demo.py` file
- Add the launch parameters (full command below):
--ram_checkpoint ram_swin_large_14m.pth --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/demo9.jpg --output_dir "outputs" --box_threshold 0.25 --text_threshold 0.2 --iou_threshold 0.5
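Run from the project root as:

python automatic_label_ram_demo.py --ram_checkpoint ram_swin_large_14m.pth --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/demo9.jpg --output_dir "outputs" --box_threshold 0.25 --text_threshold 0.2 --iou_threshold 0.5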
- Result
3.6 Run the `automatic_label_demo.py` file
- This demo auto-labels an image
- Add the launch parameters:
--config GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py --grounded_checkpoint groundingdino_swint_ogc.pth --sam_checkpoint sam_vit_h_4b8939.pth --input_image assets/demo9.jpg --output_dir "outputs" --box_threshold 0.25 --text_threshold 0.2 --iou_threshold 0.5
- Error: downloading the BLIP model files failed
- Fix: download them manually
- Download link: https://huggingface.co/Salesforce/blip-image-captioning-large/tree/main
- Put the files in config_data
- Change the model path in the code to the local path
- Error:
Resource punkt not found. Please use the NLTK Downloader to obtain the resources
- Manual download link: http://www.nltk.org/nltk_data/
- Download wordnet, punkt, and averaged_perceptron_tagger
- Put them in the appropriate nltk_data directory; both the zip and the unzipped folder need to be there (or let NLTK fetch them, see below)
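If the machine has internet access, letting NLTK download and place the resources itself is simpler than the manual route:

```python
import nltk

# installs into the default nltk_data directory, unzipping included
nltk.download("punkt")
nltk.download("wordnet")
nltk.download("averaged_perceptron_tagger")
```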
- Result
3.7 Batch auto-labeling of images
- Modify the `automatic_label_demo.py` file. The file is long and other parts need changes too; the rewritten `__main__` block is below.
```python
if __name__ == "__main__":
    root_path = ''          # root directory
    images_name = 'images'  # name of the image folder
    images_path = os.path.join(root_path, images_name)
    images_outputs_path = os.path.join(root_path, 'grounded_segment_anything_images')
    output_json = os.path.join(images_outputs_path, 'json')
    output_orig = os.path.join(images_outputs_path, 'orig')
    output_mask = os.path.join(images_outputs_path, 'mask')
    output_automatic_label = os.path.join(images_outputs_path, 'automatic_label')
    for i in [output_json, output_mask, output_orig, output_automatic_label]:
        os.makedirs(i, exist_ok=True)
    images_list = os.listdir(images_path)

    parser = argparse.ArgumentParser("Grounded-Segment-Anything Demo", add_help=True)
    parser.add_argument("--config", type=str,
                        default='GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py',
                        help="path to config file")
    parser.add_argument("--grounded_checkpoint", type=str, default='groundingdino_swint_ogc.pth',
                        help="path to checkpoint file")
    parser.add_argument("--sam_checkpoint", type=str, default='sam_vit_h_4b8939.pth',
                        help="path to checkpoint file")
    parser.add_argument("--split", default=",", type=str, help="split for text prompt")
    parser.add_argument("--openai_key", type=str, help="key for chatgpt")
    parser.add_argument("--openai_proxy", default=None, type=str, help="proxy for chatgpt")
    parser.add_argument("--box_threshold", type=float, default=0.25, help="box threshold")
    parser.add_argument("--text_threshold", type=float, default=0.2, help="text threshold")
    parser.add_argument("--iou_threshold", type=float, default=0.5, help="iou threshold")
    parser.add_argument("--device", type=str, default="cpu", help="running on cpu only!, default=False")
    args = parser.parse_args()

    # cfg
    config_file = args.config
    grounded_checkpoint = args.grounded_checkpoint
    sam_checkpoint = args.sam_checkpoint
    # image_path = args.input_image  # replaced by the per-image loop below
    split = args.split
    openai_key = args.openai_key
    openai_proxy = args.openai_proxy
    box_threshold = args.box_threshold
    text_threshold = args.text_threshold
    iou_threshold = args.iou_threshold
    device = args.device

    openai.api_key = openai_key
    if openai_proxy:
        openai.proxy = {"http": openai_proxy, "https": openai_proxy}

    # load the models once, outside the loop
    model = load_model(config_file, grounded_checkpoint, device=device)
    processor = BlipProcessor.from_pretrained("config_data/blip-image-captioning-large")
    if device == "cuda":
        blip_model = BlipForConditionalGeneration.from_pretrained(
            "config_data/blip-image-captioning-large", torch_dtype=torch.float16).to("cuda")
    else:
        blip_model = BlipForConditionalGeneration.from_pretrained(
            "config_data/blip-image-captioning-large")

    for img_name in images_list:
        image_path = os.path.join(images_path, img_name)
        image_pil, image = load_image(image_path)

        caption = generate_caption(image_pil, device=device)
        text_prompt = generate_tags(caption, split=split)
        print(f"Caption: {caption}")
        print(f"Tags: {text_prompt}")

        # save the raw image
        image_pil.save(os.path.join(output_orig, img_name))

        # run the GroundingDINO model
        boxes_filt, scores, pred_phrases = get_grounding_output(
            model, image, text_prompt, box_threshold, text_threshold, device=device
        )

        # initialize SAM
        predictor = SamPredictor(build_sam(checkpoint=sam_checkpoint).to(device))
        image = cv2.imread(image_path)
        image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
        predictor.set_image(image)

        # convert boxes from normalized cxcywh to absolute xyxy
        size = image_pil.size
        H, W = size[1], size[0]
        for i in range(boxes_filt.size(0)):
            boxes_filt[i] = boxes_filt[i] * torch.Tensor([W, H, W, H])
            boxes_filt[i][:2] -= boxes_filt[i][2:] / 2
            boxes_filt[i][2:] += boxes_filt[i][:2]
        boxes_filt = boxes_filt.cpu()

        # use NMS to handle overlapped boxes
        print(f"Before NMS: {boxes_filt.shape[0]} boxes")
        nms_idx = torchvision.ops.nms(boxes_filt, scores, iou_threshold).numpy().tolist()
        boxes_filt = boxes_filt[nms_idx]
        pred_phrases = [pred_phrases[idx] for idx in nms_idx]
        print(f"After NMS: {boxes_filt.shape[0]} boxes")
        caption = check_caption(caption, pred_phrases)
        print(f"Revise caption with number: {caption}")

        transformed_boxes = predictor.transform.apply_boxes_torch(boxes_filt, image.shape[:2]).to(device)
        masks, _, _ = predictor.predict_torch(
            point_coords=None,
            point_labels=None,
            boxes=transformed_boxes.to(device),
            multimask_output=False,
        )

        # draw the output image
        plt.figure(figsize=(10, 10))
        plt.imshow(image)
        for mask in masks:
            show_mask(mask.cpu().numpy(), plt.gca(), random_color=True)
        for box, label in zip(boxes_filt, pred_phrases):
            show_box(box.numpy(), plt.gca(), label)
        plt.title(caption)
        plt.axis('off')
        plt.savefig(
            os.path.join(output_automatic_label, img_name),
            bbox_inches="tight", dpi=300, pad_inches=0.0
        )
        save_mask_data(output_mask, output_json, img_name, caption, masks, boxes_filt, pred_phrases)
```
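With root_path pointed at a folder containing the images/ subfolder, the script runs with no extra command-line arguments and writes four output sets per image under grounded_segment_anything_images/: the original image (orig), the mask image (mask), the mask data as JSON (json), and the annotated visualization (automatic_label).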
4. Summary
- There are too many demo files; I don't feel like testing the rest, but they should all work without major issues.
- The pretrained weights are huge; this one project takes up tens of gigabytes.
- The results are not great either, so I'm giving up on it for now.