YOLOv5 Network Initialization Questions

2025/7/4 3:23:39

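If you print the three feature maps of the Detect layer during startup, you will see three different heights and widths, as shown in the figure below.
This raises three questions:
Why do the sizes differ?
What does each input size mean?
Why is the network run (forward) four times during initialization?
Let's take them one at a time.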

[Figure: printed shapes of the Detect-layer feature maps]

1. torch.Size([1, 3, 32, 32, 8]) (the largest, highest-resolution feature map output by the Detect layer)

First-level call: train.py

model = Model(cfg or ckpt["model"].yaml, ch=3, nc=nc, anchors=hyp.get("anchors")).to(device)  # create

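Second-level call:

This happens in DetectionModel in yolo.py, which feeds a fixed dummy input of shape [1, 3, 256, 256] through the network; after the convolutions, the largest feature map has the shape shown above.

256 is chosen mainly because (1) it is a multiple of every stride the model might use (8, 16, 32, 64, ...), so each downsampling step divides evenly, and (2) the downsampled feature maps come out at exact integer sizes while the input stays small, so no resources are wasted.

This pass exists only to finish building the network: it measures the stride of each detection head.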

        if isinstance(m, (Detect, Segment)):
            s = 256  # 2x min stride   256
            m.inplace = self.inplace
            car_detect=[0,0,0,0]
            forward = lambda x: self.forward(x)[0] if isinstance(m, Segment) else self.forward(x)
            _,rs=forward(torch.zeros(1, ch, s, s))  #forward
            m.stride = torch.tensor([s / x.shape[-2] for x in rs[0]])  # forward torch.Size([1, 3, 32, 32, 8])
            # if m.stride==torch.tensor([]):
            #     m.stride = torch.tensor([8, 16, 32])
            check_anchor_order(m)
            m.anchors /= m.stride.view(-1, 1, 1)
            self.stride = m.stride
            self._initialize_biases()  # only run once
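
The stride computation in the snippet above can be reproduced without running a model. The following is a minimal sketch, not the YOLOv5 code itself: `strides_from_shapes` is a hypothetical helper, and the two smaller shapes are assumed values for a standard three-head model (only the 32 x 32 shape appears in the printout above).

```python
def strides_from_shapes(input_size, feature_shapes):
    """Stride of each head = dummy input size / feature-map size.

    Mirrors `s / x.shape[-2]` from the snippet above; hypothetical helper.
    """
    return [input_size / shape[-2] for shape in feature_shapes]


# Detect outputs for a 256 x 256 dummy input.
# Only the first shape comes from the printout; the other two are assumed.
shapes = [(1, 3, 32, 32, 8), (1, 3, 16, 16, 8), (1, 3, 8, 8, 8)]
strides = strides_from_shapes(256, shapes)
print(strides)  # [8.0, 16.0, 32.0]
```

Under these assumptions the three heads downsample the input by 8, 16 and 32, which is why the largest feature map of a 256 x 256 input is 32 x 32.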

2. torch.Size([1, 3, 4, 4, 8])

First-level call: train.py

model = Model(cfg or ckpt["model"].yaml, ch=3, nc=nc, anchors=hyp.get("anchors")).to(device)  # create

Second-level call:

This is again implemented in DetectionModel in yolo.py.

        # Init weights, biases
        initialize_weights(self)
        self.info()  # second forward pass: reports layers, parameters, gradients and GFLOPs, e.g. YOLOv5s summary: 245 layers, 8091510 parameters, 8091510 gradients, 16.8 GFLOPs
        LOGGER.info("")

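The call that matters here is self.info().

Inside it, im is the input, with shape [1, 3, 32, 32]; after the convolutions, the largest feature map has the shape shown above.

Why 32? (1) It equals the maximum stride, so even the coarsest head still gets a 1x1 feature map and every downsampling step succeeds. (2) Larger multiples of 32 would also work, but 32 is the cheapest input the network can still process correctly.

This pass collects model statistics such as the number of layers, the number of parameters, and the compute cost.

The FLOPs figure counts the floating-point operations in one forward pass. It is a hardware-independent measure of compute cost, not an execution time.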

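        p = next(model.parameters())  # first parameter tensor of the model, here with shape [32, 3, 6, 6]
        stride = max(int(model.stride.max()), 32) if hasattr(model, "stride") else 32  # max stride (total downsampling factor)
        # torch.empty allocates an uninitialized tensor; the values do not matter because only operations are counted
        # 32 is used because (1) it is the max stride, so every downsampling step succeeds, and (2) it is the cheapest input the network can still process correctly
        im = torch.empty((1, p.shape[1], stride, stride), device=p.device)  # input image in BCHW format
        # FLOPs = number of floating-point operations, a measure of model complexity; 1 GFLOPs = 10^9 FLOPs
        # compute cost (time complexity, FLOPs) depends on the input size as well as the network
        # parameter count (space complexity, params) drives memory use and depends only on the network
        # the * 2 converts thop's multiply-accumulate count (MACs) into FLOPs, since one MAC is one multiply plus one add; a larger value means more compute, not a better model
        flops = thop.profile(deepcopy(model), inputs=(im,), verbose=False)[0] / 1e9 * 2  # GFLOPs for a stride x stride input; verbose controls thop's log output
        imgsz = imgsz if isinstance(imgsz, list) else [imgsz, imgsz]  # expand if int/float
        fs = f", {flops * imgsz[0] / stride * imgsz[1] / stride:.1f} GFLOPs"  # 640x640 GFLOPs: scale the stride x stride result to the real image size; profiling at max stride keeps this pass cheap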
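
To make the two arithmetic steps in this snippet concrete, here is a small sketch under stated assumptions: the helper names are hypothetical, and the MAC count is back-computed from the 16.8 GFLOPs in the summary line rather than measured.

```python
def macs_to_gflops(macs):
    """Convert a multiply-accumulate count to GFLOPs (1 MAC = 2 FLOPs)."""
    return macs / 1e9 * 2


def scale_gflops(gflops, stride, imgsz):
    """Scale GFLOPs measured on a stride x stride input to imgsz (h, w)."""
    return gflops * imgsz[0] / stride * imgsz[1] / stride


# Assumed MAC count for a 32 x 32 input, chosen to reproduce 16.8 GFLOPs at 640 x 640.
gflops_32 = macs_to_gflops(21_000_000)
gflops_640 = scale_gflops(gflops_32, 32, [640, 640])
print(f"{gflops_32:.3f} GFLOPs at 32x32")
print(f"{gflops_640:.1f} GFLOPs at 640x640")
```

Because a convolutional network's cost grows with the number of pixels, a 640 x 640 image costs (640 / 32) x (640 / 32) = 400 times as much as the 32 x 32 profiling input.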

3. torch.Size([1, 3, 80, 60, 8])

First-level call:

This is called from train.py to decide whether AMP (automatic mixed precision) should be used.

amp = check_amp(model)  # check AMP: third forward pass, decides whether to use automatic mixed precision (float16 alongside float32)

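Second-level call:

check_amp goes on to call AutoShape. The input im is data/images/bus.jpg, a sample image shipped with YOLOv5, and it is used for this check. After resizing, im has shape [1, 3, 640, 480].

The idea is to run one image through both FP32 inference and AMP inference and compare the two results; if they agree closely enough, AMP is used.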
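
The comparison can be sketched in a few lines. This is an illustration of the idea only: `outputs_agree` is a hypothetical helper, the box values are made up, and the tolerance of 0.1 is an assumed value rather than one quoted from the text.

```python
def outputs_agree(fp32_out, amp_out, atol=0.1):
    """True if both outputs have the same length and each pair differs by at most atol."""
    if len(fp32_out) != len(amp_out):
        return False
    return all(abs(a - b) <= atol for a, b in zip(fp32_out, amp_out))


# Made-up box coordinates from an FP32 run and an AMP run of the same image.
fp32_boxes = [0.50, 0.48, 0.32, 0.61]
amp_boxes = [0.51, 0.47, 0.32, 0.60]

use_amp = outputs_agree(fp32_boxes, amp_boxes)
print(use_amp)  # True
```

If the half-precision run drifted further than the tolerance, or returned a different number of detections, the check would fail and training would stay in FP32.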

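Why the AutoShape class? Because it is very tolerant of input types: a file path, a URI, a PIL image, a numpy array, a torch tensor and so on are all preprocessed the same way and passed through one prediction path that returns the results.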

            n, ims = (len(ims), list(ims)) if isinstance(ims, (list, tuple)) else (1, [ims])  # number, list of images
            shape0, shape1, files = [], [], []  # image and inference shapes, filenames
            for i, im in enumerate(ims):
                f = f"image{i}"  # filename
                if isinstance(im, (str, Path)):  # filename or uri
                    im, f = Image.open(requests.get(im, stream=True).raw if str(im).startswith("http") else im), im
                    im = np.asarray(exif_transpose(im))
                elif isinstance(im, Image.Image):  # PIL Image
                    im, f = np.asarray(exif_transpose(im)), getattr(im, "filename", f) or f
                files.append(Path(f).with_suffix(".jpg").name)
                if im.shape[0] < 5:  # image in CHW
                    im = im.transpose((1, 2, 0))  # reverse dataloader .transpose(2, 0, 1)
                im = im[..., :3] if im.ndim == 3 else cv2.cvtColor(im, cv2.COLOR_GRAY2BGR)  # enforce 3ch input
                s = im.shape[:2]  # HWC
                shape0.append(s)  # image shape
                g = max(size) / max(s)  # gain
                shape1.append([int(y * g) for y in s])
                ims[i] = im if im.data.contiguous else np.ascontiguousarray(im)  # update
            shape1 = [make_divisible(x, self.stride) for x in np.array(shape1).max(0)]  # inf shape, here [640, 480]
            x = [letterbox(im, shape1, auto=False)[0] for im in ims]  # pad
            x = np.ascontiguousarray(np.array(x).transpose((0, 3, 1, 2)))  # stack and BHWC to BCHW
            x = torch.from_numpy(x).to(p.device).type_as(p) / 255  # uint8 to fp16/32

        with amp.autocast(autocast):
            # Inference
            with dt[1]:
                y = self.model(x, augment=augment)  # forward
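
The shape arithmetic in the preprocessing code explains where [640, 480] comes from. Below is a minimal sketch, assuming the sample image is 1080 pixels high and 810 pixels wide and that `make_divisible` rounds up to the next multiple of the stride; both are assumptions, and `inference_shape` is a hypothetical name.

```python
import math


def make_divisible(x, divisor):
    """Round x up to the nearest multiple of divisor (assumed behaviour)."""
    return math.ceil(x / divisor) * divisor


def inference_shape(image_hw, size=640, stride=32):
    """Scale the longer side to `size`, then round both sides up to the stride."""
    gain = size / max(image_hw)
    scaled = [int(side * gain) for side in image_hw]
    return [make_divisible(side, stride) for side in scaled]


print(inference_shape((1080, 810)))  # [640, 480]
```

The aspect ratio is preserved rather than squashed to a square, which is why this forward pass sees a 640 x 480 input and produces a non-square 80 x 60 feature map.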

Summary

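forward call	Input size	Purpose
1st call	torch.Size([1, 3, 256, 256])	Builds the network and computes the stride values
2nd call	torch.Size([1, 3, 32, 32])	Computes model statistics such as layers, parameters, and compute cost
3rd call	torch.Size([1, 3, 640, 480])	Decides whether to use AMP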

Note: the figures here assume that the maximum stride is 32.
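
As a cross-check of the table, the largest feature map for each call follows from dividing the input size by the smallest stride. This sketch assumes the smallest stride is 8 (256 / 32 from the first call) and that the last dimension, 8, is the number of classes plus 5, so three classes; the class count is not stated above, and `largest_feature_shape` is a hypothetical helper.

```python
def largest_feature_shape(input_hw, min_stride=8, anchors=3, outputs=8):
    """Shape of the highest-resolution Detect output for a given input size."""
    h, w = input_hw
    return (1, anchors, h // min_stride, w // min_stride, outputs)


for call, hw in enumerate([(256, 256), (32, 32), (640, 480)], start=1):
    print(call, largest_feature_shape(hw))
# 1 (1, 3, 32, 32, 8)
# 2 (1, 3, 4, 4, 8)
# 3 (1, 3, 80, 60, 8)
```

These are the three shapes discussed in sections 1 to 3, so the differing sizes come entirely from the three different inputs.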