ARTrack Reading Notes


Contents

Environment Setup and Script Writing

Forward Pass

Network Architecture


Environment Setup and Script Writing

Following the official instructions did not complete cleanly, so I installed the pip entries from the environment yaml file by hand:

conda create -n artrack python=3.9
# activate the environment and cd into the project root
conda activate artrack

# install the pip entries from the yaml file manually
pip install astor==0.8.1 configparser==5.2.0 \
    data==0.4 docker-pycreds==0.4.0 easydict==1.9 einops==0.4.1 formulaic==0.5.2 funcsigs==1.0.2 future==0.18.2 \
    gitdb==4.0.9 gitpython==3.1.27 interface-meta==1.3.0 iopath==0.1.9 jpeg4py==0.1.4 jsonpatch==1.32 jsonpointer==2.3 latex==0.7.0 \
    libarchive-c==2.9 linearmodels==4.29 lmdb==1.3.0 loguru==0.6.0 mat73==0.59 memory-profiler==0.60.0 msgpack==1.0.2 ninja==1.11.1 \
    opencv-python==4.5.5.64 pathtools==0.1.2 promise==2.3 property-cached==1.6.4 protobuf==3.20.0 pycocotools==2.0.4 pyhdfe==0.1.2 \
    ruamel-yaml-conda==0.15.100 sentry-sdk==1.5.8 setproctitle==1.2.2 setuptools-scm==7.1.0 shapely==1.8.1.post1 shortuuid==1.0.8 \
    shutilwhich==1.1.0 smmap==5.0.0 tables==3.6.1 tempdir==0.7.1 tensorboardx==2.5.1 thop==0.1.0.post2207010342 tikzplotlib==0.10.1 \
    timm==0.5.4 tomli==2.0.1 torch==1.11.0 torchfile==0.1.0 visdom==0.1.8.9 wandb==0.12.11 webcolors==1.12 yaspin==2.1.0

The default paths inside this script need to be rewritten:

python tracking/create_default_local_file.py --workspace_dir . --data_dir ./data --save_dir ./output

Download the trained model from the official repo, create the following path, and place the model there:

ARTrack-main/output/checkpoints/train/artrack_seq/artrack_seq_256_full/ARTrackSeq_ep0060.pth.tar

Create a directory for the encoder's pretrained model, put the pretrained weights in it, and change the yaml file accordingly; the source script artrack_seq.py also needs changes:

mkdir pretrained_model
# file name: mae_pretrain_vit_base.pth

# in artrack_seq_256_full.yaml, use an absolute path
PRETRAIN_PTH: "/root/data/zjx/Code-subject/ARTrack/ARTrack-main/pretrained_models"

# also, in artrack_seq.py (line ~100), change
load_from = cfg.MODEL.PRETRAIN_PTH
# to
load_from = cfg.MODEL.PRETRAIN_PTH + '/' + cfg.MODEL.PRETRAIN_FILE
# and in artrack_seq.py (line ~103), change
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["net"], strict=False)
# to
missing_keys, unexpected_keys = model.load_state_dict(checkpoint["model"], strict=False)

The repo does not provide a run-video script, so a custom one is needed:

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function
from __future__ import unicode_literals

import os
import sys
import argparse
import multiprocessing

import torch

# make the project root importable *before* importing from lib
prj_path = os.path.join(os.path.dirname(__file__), '..')
if prj_path not in sys.path:
    sys.path.append(prj_path)

from lib.test.evaluation.tracker import Tracker

torch.set_num_threads(1)

parser = argparse.ArgumentParser(description='Run tracker on a video file.')
parser.add_argument('tracker_name', type=str, help='Name of tracking method.')
parser.add_argument('tracker_param', type=str, help='Name of config file.')
parser.add_argument('--runid', type=int, default=None, help='The run id.')
parser.add_argument('--video_path', type=str, default='None', help='Path to the video file.')
parser.add_argument('--sequence', type=str, default=None, help='Sequence number or name.')
parser.add_argument('--debug', type=int, default=0, help='Debug level.')
parser.add_argument('--threads', type=int, default=0, help='Number of threads.')
parser.add_argument('--num_gpus', type=int, default=8)

args = parser.parse_args()


def main():
    print('[INFO] Loading the model')

    # build the tracker from its name and config
    tracker = Tracker(args.tracker_name, args.tracker_param, None, args.runid)

    # when launched from a multiprocessing pool, pin each worker to a GPU
    try:
        worker_name = multiprocessing.current_process().name
        worker_id = int(worker_name[worker_name.find('-') + 1:]) - 1
        gpu_id = worker_id % args.num_gpus
        torch.cuda.set_device(gpu_id)
    except Exception:
        pass

    # run_video reads the frames and visualizes the predictions itself
    tracker.run_video(args.video_path, None, None, None, False)


if __name__ == '__main__':
    main()

Run it:

python tracking/run_video.py artrack_seq artrack_seq_256_full --video_path /root/data/zjx/Code-subject/OSTrack-main/experiments/video/soccer1.avi 

Forward Pass

Template-region cropping is the same as in the OSTrack code. At initialization, a buffer self.store_result is created to hold the bbox coordinates of the N frames to be kept; it is initially filled entirely with the init bbox, and N is set to 7 here.

        for i in range(self.save_all - 1):
            self.store_result.append(info['init_bbox'].copy())

Search-region cropping is also the same as in OSTrack. The stored coordinates from earlier frames are then transformed into the crop frame, using the previous frame's prediction as the reference point. Because the current frame's search region is cropped around the previous frame's predicted bbox, the center of the search region is exactly the center of that bbox. The stored predictions live at the original image scale while the search region lives at the crop-size scale, so it suffices to compute, at image scale, the offset of each earlier prediction from the previous frame's prediction and multiply it by the resize factor to bring the relative coordinates to the crop-size scale. Under this transform, the previous frame's predicted bbox itself lands at the center of the search region, i.e. (crop_size/2, crop_size/2).
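
A minimal numpy sketch of this transform (hypothetical helper; prev_boxes holds the stored [x, y, w, h] boxes at original-image scale, last_box is the previous frame's prediction):

import numpy as np

def to_crop_coords(prev_boxes, last_box, resize_factor, crop_size):
    """Map stored [x, y, w, h] boxes from original-image scale into the
    search-crop frame (a sketch of the transform described above).
    The crop is centered on the previous prediction, so that box lands
    at (crop_size / 2, crop_size / 2)."""
    boxes = np.asarray(prev_boxes, dtype=np.float32)
    # reference point: center of the previous frame's predicted box
    ref_cx = last_box[0] + 0.5 * last_box[2]
    ref_cy = last_box[1] + 0.5 * last_box[3]
    out = boxes.copy()
    # relative offset at image scale, rescaled into crop scale
    out[:, 0] = (boxes[:, 0] + 0.5 * boxes[:, 2] - ref_cx) * resize_factor + 0.5 * crop_size
    out[:, 1] = (boxes[:, 1] + 0.5 * boxes[:, 3] - ref_cy) * resize_factor + 0.5 * crop_size
    out[:, 2:] = boxes[:, 2:] * resize_factor  # width / height just rescale
    return out  # [cx, cy, w, h] in crop coordinates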

After the transform, the coordinates are divided by the crop size for normalization; the values may still fall below 0 or above 1, since the transform can push boxes past the crop boundary. The boxes are then converted from xywh to xyxy and clipped so that only values in (-0.5, 1.5) are kept. Finally the coordinates are quantized: 0.5 is added first to avoid negative values, mapping each bbox coordinate into [0, 2*(bins-1)]. The resulting coordinate input carrying the spatio-temporal context is

seqs_out = seqs_out.unsqueeze(0)  # (1,28)
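
A sketch of the normalization and quantization just described (continuing from the previous sketch; the helper is hypothetical, and bins is the number of quantization bins):

import numpy as np

def quantize_boxes(boxes_crop, crop_size, bins):
    """Normalize crop-frame [cx, cy, w, h] boxes, clip to (-0.5, 1.5),
    and quantize (a sketch of the step above, not the verbatim source)."""
    b = np.asarray(boxes_crop, dtype=np.float32) / crop_size  # may leave [0, 1]
    # xywh (center) -> xyxy
    xyxy = np.concatenate([b[:, :2] - 0.5 * b[:, 2:],
                           b[:, :2] + 0.5 * b[:, 2:]], axis=1)
    xyxy = np.clip(xyxy, -0.5, 1.5)          # keep only the (-0.5, 1.5) range
    tokens = (xyxy + 0.5) * (bins - 1)       # +0.5 avoids negatives; range [0, 2*(bins-1)]
    return tokens.round().astype(np.int64).ravel()  # 7 boxes * 4 coords = 28 tokens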

The template and search region are fed into the ViT backbone for feature extraction, with 16x downsampling overall. The extracted patch sequence, the positional encodings, and the transformed previous-frame bbox information are then fed into the Transformer that follows.

The first stage is an encoder: the FeatureFusionEncoder class does some preprocessing, and its main building block is the FeatureFusion module. The per-layer computation is sketched below; the encoder finally returns feature patches with the same shapes as z and x.
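
Inferred from the module tree printed in the Network Architecture section below, one FeatureFusion layer computes roughly the following (a sketch; the untied positional encodings and relative-position biases that the real module consumes are omitted):

def feature_fusion_forward(layer, z, x):
    """Sketch of one FeatureFusion layer, inferred from the printed module
    structure; positional-encoding and relative-position-bias arguments
    are omitted for readability."""
    # 1) independent self-attention within template (z) and search (x) tokens
    z = z + layer.drop_path(layer.z_self_attn(layer.z_norm1(z)))
    x = x + layer.drop_path(layer.x_self_attn(layer.x_norm1(x)))
    # 2) bidirectional cross-attention: z queries x, and x queries z
    z = z + layer.drop_path(layer.z_x_cross_attention(layer.z_norm2_1(z), layer.x_norm2_2(x)))
    x = x + layer.drop_path(layer.x_z_cross_attention(layer.x_norm2_1(x), layer.z_norm2_2(z)))
    # 3) per-branch feed-forward
    z = z + layer.drop_path(layer.z_mlp(layer.z_norm3(z)))
    x = x + layer.drop_path(layer.x_mlp(layer.x_norm3(x)))
    return z, x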

Next, the bbox coordinate sequence from the previous frames and a start token are concatenated to form the decoder input sequence. Since only the four bbox coordinates need to be predicted, no extra end token is needed; the output sequence length is simply 4.

1. The input sequence is embedded by a word-embedding layer; the embedding size corresponds to the resolution of the feature patches obtained by downsampling the cropped image.
2. The initial input tgt, the template features, the search features, the positional encodings of the z patches and x patches, the identity truncated-Gaussian distribution, the truncated-Gaussian distribution, the query embeddings, and the input-sequence mask are fed into the decoder.

The decoder is built mainly from TargetQueryDecoderLayer modules, six layers in total; the forward pass of one layer is sketched below.
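
A sketch of one layer, inferred from the printed structure (the real signatures also carry masks and positional terms; names like query_pos and pos_enc here are illustrative):

def target_query_decoder_layer(layer, tgt, memory, query_pos, pos_enc, tgt_mask):
    """Sketch of one TargetQueryDecoderLayer; argument plumbing simplified.
    tgt holds the coordinate tokens, memory the fused z/x features."""
    # causal self-attention over the already-generated coordinate tokens
    q = k = layer.norm_1(tgt) + query_pos
    tgt = tgt + layer.self_attn1(q, k, value=layer.norm_1(tgt), attn_mask=tgt_mask)[0]
    # cross-attention: coordinate queries attend to the encoder memory
    tgt = tgt + layer.multihead_attn(query=layer.norm_2_query(tgt) + query_pos,
                                     key=layer.norm_2_memory(memory) + pos_enc,
                                     value=layer.norm_2_memory(memory))[0]
    # feed-forward
    tgt = tgt + layer.mlpz(layer.norm_3(tgt))
    return tgt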

The final output is a token sequence with the same shape as tgt, i.e. (1, length, 768), where length is the current length of tgt and grows as the sequence prediction proceeds. Then:

1. Take the last token embedding produced for the query and multiply it with the word-embedding weight matrix, yielding a score for every quantized position.
2. Apply softmax to obtain a probability distribution over the quantized coordinates.
3. Use argmax sampling, i.e. pick the highest-probability position.
4. Append the predicted quantized coordinate to tgt and repeat the loop.
5. This yields the predicted bbox in quantized coordinates (see the sketch below).
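
Putting steps 1-5 together, the decode loop looks roughly like this (a sketch with simplified, partly hypothetical arguments; decoder and embed stand for the TargetQueryDecoderBlock and the shared word-embedding layer):

import torch

def decode_bbox(decoder, embed, memory, prev_tokens, start_token, steps=4):
    """Greedy (argmax) coordinate decoding, following steps 1-5 above.
    The real call also passes positional encodings, truncated-Gaussian
    embeddings, and attention masks."""
    # decoder input: previous-frame coordinate tokens + start flag
    tgt = torch.cat([prev_tokens,
                     torch.tensor([[start_token]], device=prev_tokens.device)], dim=1)
    preds = []
    for _ in range(steps):                             # one coordinate per step
        hs = decoder(embed(tgt), memory)               # (1, len, 768)
        logits = hs[:, -1] @ embed.weight.t()          # project with the embedding weights
        token = logits.softmax(-1).argmax(-1)          # argmax sampling
        preds.append(token)
        tgt = torch.cat([tgt, token[:, None]], dim=1)  # feed the prediction back in
    return torch.stack(preds, dim=1)                   # (1, 4) quantized xyxy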

After the network produces its prediction:

1. De-quantize the bbox coordinates.
2. Convert xyxy to xywh (center point plus width and height).
3. Map back to the original image scale and convert to xywh (top-left corner plus width and height).
4. Clip the bbox, removing any part that extends beyond the image.
5. Update the stored coordinates: pop the oldest entry and append the current prediction behind the previous frame's, a FIFO-style update (see the sketch below).
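
A sketch of these five steps (hypothetical helper; the scale handling follows the description above, and step 4's clipping is omitted for brevity):

import numpy as np

def update_state(pred_tokens, bins, crop_size, resize_factor, prev_box, store_result):
    """Post-processing sketch for steps 1-5 above.
    pred_tokens: the 4 predicted quantized xyxy coordinates;
    prev_box: previous frame's [x, y, w, h] at original-image scale."""
    # 1) de-quantize back to crop-normalized coordinates in (-0.5, 1.5)
    x1, y1, x2, y2 = np.asarray(pred_tokens, dtype=np.float32) / (bins - 1) - 0.5
    # 2) xyxy -> center + size (still normalized; the crop center sits at 0.5)
    cx, cy, w, h = (x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1
    # 3) back to original-image scale, top-left + size; the crop was centered
    #    on the previous prediction's center
    prev_cx = prev_box[0] + prev_box[2] / 2
    prev_cy = prev_box[1] + prev_box[3] / 2
    scale = crop_size / resize_factor
    new_box = [prev_cx + (cx - 0.5) * scale - w * scale / 2,
               prev_cy + (cy - 0.5) * scale - h * scale / 2,
               w * scale, h * scale]           # step 4 (clipping) omitted
    # 5) FIFO buffer update: drop the oldest box, append the newest
    store_result.pop(0)
    store_result.append(new_box)
    return new_box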

 

Network Architecture

ARTrackSeq(
  (backbone): VisionTransformer(
    (patch_embed): PatchEmbed(
      (proj): Conv2d(3, 768, kernel_size=(16, 16), stride=(16, 16))
      (norm): Identity()
    )
    (pos_drop): Dropout(p=0.0, inplace=False)
    (blocks): Sequential(
      (0): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): Identity()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (1): Block(
        (norm1): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (attn): Attention(
          (qkv): Linear(in_features=768, out_features=2304, bias=True)
          (attn_drop): Dropout(p=0.0, inplace=False)
          (proj): Linear(in_features=768, out_features=768, bias=True)
          (proj_drop): Dropout(p=0.0, inplace=False)
        )
        (drop_path): DropPath()
        (norm2): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
        (mlp): Mlp(
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (act): GELU()
          (drop1): Dropout(p=0.0, inplace=False)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
          (drop2): Dropout(p=0.0, inplace=False)
        )
      )
      (2)-(11): 10 more Block modules, identical to (1) above (repeats omitted)
    )
    (norm): LayerNorm((768,), eps=1e-06, elementwise_affine=True)
  )
  (pix_head): Pix2Track(
    (word_embeddings): Embedding(802, 768, padding_idx=800, max_norm=1)
    (position_embeddings): Embedding(5, 768)
    (prev_position_embeddings): Embedding(28, 768)
    (encoder): FeatureFusionEncoder(
      (layers): ModuleList(
        (0): FeatureFusion(
          (z_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_self_attn): SelfAttention(
            (qkv): Linear(in_features=768, out_features=2304, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm2_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_x_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (x_z_cross_attention): CrossAttention(
            (q): Linear(in_features=768, out_features=768, bias=True)
            (kv): Linear(in_features=768, out_features=1536, bias=True)
            (attn_drop): Dropout(p=0.0, inplace=False)
            (proj): Linear(in_features=768, out_features=768, bias=True)
            (proj_drop): Dropout(p=0.1, inplace=False)
          )
          (z_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (x_norm3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (z_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (x_mlp): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (1)-(2): 2 more FeatureFusion modules, identical to (0) above (repeats omitted)
      )
      (z_pos_enc): Untied2DPositionalEncoder(
        (pos): Learned2DPositionalEncoder()
        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
        (pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
      )
      (x_pos_enc): Untied2DPositionalEncoder(
        (pos): Learned2DPositionalEncoder()
        (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (pos_q_linear): Linear(in_features=768, out_features=768, bias=True)
        (pos_k_linear): Linear(in_features=768, out_features=768, bias=True)
      )
      (z_rel_pos_bias_table): RelativePosition2DEncoder()
      (x_rel_pos_bias_table): RelativePosition2DEncoder()
      (z_x_rel_pos_bias_table): RelativePosition2DEncoder()
      (x_z_rel_pos_bias_table): RelativePosition2DEncoder()
    )
    (decoder): TargetQueryDecoderBlock(
      (layers): ModuleList(
        (0): TargetQueryDecoderLayer(
          (norm_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (self_attn1): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_2_query): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (norm_2_memory): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (multihead_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=768, out_features=768, bias=True)
          )
          (norm_3): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (mlpz): Mlp(
            (fc1): Linear(in_features=768, out_features=3072, bias=True)
            (act): GELU()
            (fc2): Linear(in_features=3072, out_features=768, bias=True)
            (drop): Dropout(p=0.1, inplace=False)
          )
          (drop_path): Identity()
        )
        (1)-(5): 5 more TargetQueryDecoderLayer modules, identical to (0) above (repeats omitted)
      )
      (norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
    )
  )
)
