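LaneATT Inference Explained and Deployment (Part 2)

- Preface
- I. LaneATT Inference (Python)
  - 1. LaneATT Prediction
  - 2. LaneATT Preprocessing
  - 3. LaneATT Postprocessing
  - 4. LaneATT Inference
- II. LaneATT Inference (C++)
  - 1. ONNX Export
  - 2. LaneATT Preprocessing
  - 3. LaneATT Postprocessing
  - 4. LaneATT Inference
- III. LaneATT Deployment
  - 1. Source Download
  - 2. Environment Setup
    - 2.1 Configuring CMakeLists.txt
    - 2.2 Configuring the Makefile
  - 3. ONNX Export
  - 4. Engine Generation
  - 5. Source Modification
  - 6. Running
- Conclusion
- Download Links
- References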

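Preface

In LaneATT Inference Explained and Deployment (Part 1) we covered how to export the LaneATT ONNX model. This article looks at how to run that model with TensorRT and get inference results.

Note: before starting, follow LaneATT Inference Explained and Deployment (Part 1) to set up the environment and export the LaneATT ONNX model; those steps are not repeated here.

repo: https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8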


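I. LaneATT Inference (Python)

1. LaneATT Prediction

Let's first try running the official pretrained weights on a single image and saving the result, to check that inference works.

Create a predict.py file in the LaneATT directory with the following content: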

import cv2
import torch
import numpy as np
from lib.models.laneatt import LaneATT

def preprocess(img, dst_width=640, dst_height=360):
    img_pre = cv2.resize(img, (dst_width, dst_height))
    img_pre = (img_pre / 255.0).astype(np.float32)
    img_pre = img_pre.transpose(2, 0, 1)[None]
    img_pre = torch.from_numpy(img_pre)
    return img_pre

if __name__ == "__main__":

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 

    img = cv2.imread("datasets/culane/driver_37_30frame/05181432_0203.MP4/00210.jpg")
    img_pre = preprocess(img).to(device)

    model = LaneATT(anchors_freq_path="data/culane_anchors_freq.pt", topk_anchors=1000)
    # map_location lets the checkpoint load on CPU-only machines as well
    state_dict = torch.load("experiments/laneatt_r34_culane/models/model_0015.pt", map_location=device)['model']
    model.load_state_dict(state_dict)
    model = model.to(device)

    model.eval()
    with torch.no_grad():
        output = model(img_pre, conf_threshold=0.5, nms_thres=50.0, nms_topk=4)
        pred = model.decode(output, as_lanes=True)[0]
        for line in pred:
            points = line.points
            points[:, 0] *= img.shape[1]
            points[:, 1] *= img.shape[0]
            points = points.round().astype(int)
            for point in points:
                cv2.circle(img, point, 3, color=(0, 255, 0), thickness=-1)
        cv2.imwrite("result.jpg", img)

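Note: the code, weights, and dataset can be downloaded by clicking here. Running the script requires the compiled NMS extension; see the previous article for the environment setup, which is not repeated here.

After running the script, the inference result image result.jpg is generated in the current directory, as shown below:

[Image: inference result]

2. LaneATT Preprocessing

With inference working, the next step is to work through LaneATT's preprocessing and postprocessing so that they can be reimplemented in C++. We start with preprocessing.

Debugging shows that LaneATT's preprocessing lives in lib/datasets/lane_dataset.py; see lane_dataset.py#L270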

def __getitem__(self, idx):
    item = self.dataset[idx]
    img_org = cv2.imread(item['path'])
    line_strings_org = self.lane_to_linestrings(item['old_anno']['lanes'])
    line_strings_org = LineStringsOnImage(line_strings_org, shape=img_org.shape)
    for i in range(30):
        img, line_strings = self.transform(image=img_org.copy(), line_strings=line_strings_org)
        line_strings.clip_out_of_image_()
        new_anno = {'path': item['path'], 'lanes': self.linestrings_to_lanes(line_strings)}
        try:
            label = self.transform_annotation(new_anno, img_wh=(self.img_w, self.img_h))['label']
            break
        except:
            if (i + 1) == 30:
                self.logger.critical('Transform annotation failed 30 times :(')
                exit()

    img = img / 255.
    if self.normalize:
        img = (img - IMAGENET_MEAN) / IMAGENET_STD
    img = self.to_tensor(img.astype(np.float32))
    return (img, label, idx)

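It consists of the following steps:

self.transform: the resize
img = img / 255.: divide by 255 to scale pixel values to [0, 1]
self.to_tensor: HWC -> CHW; the batch dimension is added afterwards, giving BCHW

This is similar to the preprocessing of a classification model, just simpler because there is no center crop. The corresponding preprocessing code is therefore straightforward to write: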

def preprocess(self, img):
    # 1. resize
    img = cv2.resize(img, (self.img_w, self.img_h))
    # 2. normalize
    img = (img / 255.0).astype(np.float32)
    # 3. to bchw
    img = img.transpose(2, 0, 1)[None]
    return img

Note: the resize target is width=640, height=360, and there is no BGR->RGB conversion, so the LaneATT model input is 1x3x360x640 with the channels left in BGR order.
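To make the layout change concrete, here is a minimal pure-Python sketch of the last two steps (scaling by 255 and reordering HWC to 1xCxHxW). It deliberately avoids OpenCV and NumPy; the 2x2 toy image and the helper name to_bchw are made up for illustration and are not part of the LaneATT code.

```python
# Toy 2x2 image in HWC layout: image_hwc[y][x] = [b, g, r], values in 0..255.
image_hwc = [
    [[0, 51, 255], [255, 0, 0]],
    [[102, 102, 102], [0, 255, 0]],
]

def to_bchw(image):
    """Scale to [0, 1] and reorder HWC -> 1xCxHxW; the channel order is untouched."""
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    chw = [
        [[image[y][x][c] / 255.0 for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
    return [chw]  # leading batch dimension of size 1

batch = to_bchw(image_hwc)
shape = (len(batch), len(batch[0]), len(batch[0][0]), len(batch[0][0][0]))
print(shape)  # (1, 3, 2, 2)
```

With a real 360x640 input the same reordering gives the 1x3x360x640 tensor mentioned above.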

3. LaneATT Postprocessing

Now let's look at the postprocessing.

Debugging shows that LaneATT's postprocessing lives in lib/models/laneatt.py; see laneatt.py#L112

def nms(self, batch_proposals, batch_attention_matrix, nms_thres, nms_topk, conf_threshold):
    softmax = nn.Softmax(dim=1)
    proposals_list = []
    for proposals, attention_matrix in zip(batch_proposals, batch_attention_matrix):
        anchor_inds = torch.arange(batch_proposals.shape[1], device=proposals.device)
        # The gradients do not have to (and can't) be calculated for the NMS procedure
        with torch.no_grad():
            scores = softmax(proposals[:, :2])[:, 1]
            if conf_threshold is not None:
                # apply confidence threshold
                above_threshold = scores > conf_threshold
                proposals = proposals[above_threshold]
                scores = scores[above_threshold]
                anchor_inds = anchor_inds[above_threshold]
            if proposals.shape[0] == 0:
                proposals_list.append((proposals[[]], self.anchors[[]], attention_matrix[[]], None))
                continue
            keep, num_to_keep, _ = nms(proposals, scores, overlap=nms_thres, top_k=nms_topk)
            keep = keep[:num_to_keep]
        proposals = proposals[keep]
        anchor_inds = anchor_inds[keep]
        attention_matrix = attention_matrix[anchor_inds]
        proposals_list.append((proposals, self.anchors[keep], attention_matrix, anchor_inds))

    return proposals_list

def proposals_to_pred(self, proposals):
    self.anchor_ys = self.anchor_ys.to(proposals.device)
    self.anchor_ys = self.anchor_ys.double()
    lanes = []
    for lane in proposals:
        lane_xs = lane[5:] / self.img_w
        start = int(round(lane[2].item() * self.n_strips))
        length = int(round(lane[4].item()))
        end = start + length - 1
        end = min(end, len(self.anchor_ys) - 1)
        # end = label_end
        # if the proposal does not start at the bottom of the image,
        # extend its proposal until the x is outside the image
        mask = ~((((lane_xs[:start] >= 0.) &
                    (lane_xs[:start] <= 1.)).cpu().numpy()[::-1].cumprod()[::-1]).astype(np.bool))
        lane_xs[end + 1:] = -2
        lane_xs[:start][mask] = -2
        lane_ys = self.anchor_ys[lane_xs >= 0]
        lane_xs = lane_xs[lane_xs >= 0]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)
        if len(lane_xs) <= 1:
            continue
        points = torch.stack((lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)), dim=1).squeeze(2)
        lane = Lane(points=points.cpu().numpy(),
                    metadata={
                        'start_x': lane[3],
                        'start_y': lane[2],
                        'conf': lane[1]
                    })
        lanes.append(lane)
    return lanes

def decode(self, proposals_list, as_lanes=False):
    softmax = nn.Softmax(dim=1)
    decoded = []
    for proposals, _, _, _ in proposals_list:
        proposals[:, :2] = softmax(proposals[:, :2])
        proposals[:, 4] = torch.round(proposals[:, 4])
        if proposals.shape[0] == 0:
            decoded.append([])
            continue
        if as_lanes:
            pred = self.proposals_to_pred(proposals)
        else:
            pred = proposals
        decoded.append(pred)

    return decoded

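It consists of the following steps:

nms: non-maximum suppression
decode: decoding the lane proposals

Before analyzing the postprocessing code, let's look at what the model output means. The LaneATT model output is 1x1000x77, where:

1: the batch dimension
1000: the number of lane proposals predicted per image
77: the length of each lane proposal's feature vector, made up of the following four parts
- cls (2 values): classification probabilities, for background and lane respectively
- start_y, start_x (2 values): the starting point of the lane
- length (1 value): the length of the lane, counted in anchor rows
- x_offset (72 values): the lane's horizontal position at each of the 72 anchor rows, in pixels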
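As a quick illustration of this layout, the sketch below splits one 77-value row into named fields. The Proposal type and the split_proposal helper are hypothetical names, not part of the LaneATT code, and the sample values are invented.

```python
from typing import List, NamedTuple

N_OFFSETS = 72  # number of x values per proposal, as listed above

class Proposal(NamedTuple):
    bg_score: float    # cls[0]: background
    lane_score: float  # cls[1]: lane
    start_y: float
    start_x: float
    length: float
    xs: List[float]    # 72 x positions, in pixels

def split_proposal(row):
    """Split one 77-value row into named fields (hypothetical helper)."""
    if len(row) != 5 + N_OFFSETS:
        raise ValueError(f"expected {5 + N_OFFSETS} values, got {len(row)}")
    return Proposal(row[0], row[1], row[2], row[3], row[4], list(row[5:]))

# Invented sample row: 2 scores, start_y, start_x, length, then 72 x values.
row = [0.1, 0.9, 0.25, 320.0, 40.0] + [float(i) for i in range(N_OFFSETS)]
proposal = split_proposal(row)
print(proposal.lane_score, proposal.start_y, proposal.length, len(proposal.xs))
# 0.9 0.25 40.0 72
```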

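OK, with the meaning of each output dimension clear, the postprocessing code is much easier to follow.

NMS itself is familiar by now. What is worth noting is that the "IoU" used in lane NMS differs from the usual box IoU: to fit the shape of lanes it is really a distance, where a smaller value means the two lanes are closer together. The official implementation is as follows: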

template <typename scalar_t>
// __device__ inline scalar_t devIoU(scalar_t const * const a, scalar_t const * const b) {
__device__ inline bool devIoU(scalar_t const * const a, scalar_t const * const b, const float threshold) {
  const int start_a = (int) (a[2] * N_STRIPS - DATASET_OFFSET + 0.5); // 0.5 rounding trick
  const int start_b = (int) (b[2] * N_STRIPS - DATASET_OFFSET + 0.5);
  const int start = max(start_a, start_b);
  const int end_a = start_a + a[4] - 1 + 0.5 - ((a[4] - 1) < 0); //  - (x<0) trick to adjust for negative numbers (in case length is 0)
  const int end_b = start_b + b[4] - 1 + 0.5 - ((b[4] - 1) < 0);
  const int end = min(min(end_a, end_b), N_OFFSETS - 1);
  // if (end < start) return 1e9;
  if (end < start) return false;
  scalar_t dist = 0;
  for(unsigned char i = 5 + start; i <= 5 + end; ++i) {
    if (a[i] < b[i]) {
      dist += b[i] - a[i];
    } else {
      dist += a[i] - b[i];
    }
  }
  // return (dist / (end - start + 1)) < threshold;
  return dist < (threshold * (end - start + 1));
  // return dist / (end - start + 1);
}

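The function takes two lane proposals and the threshold, and measures how much the two proposals differ horizontally over the rows they share, which suits lane detection. Unlike box IoU, a smaller value means more overlap, so the function returns true when the mean difference is below the threshold. The steps are:

1. Compute the start and end indices of the two lane proposals
2. Check whether the two proposals share any rows
3. Sum the horizontal differences between the two proposals over the shared rows
4. Compare that sum with the threshold scaled by the number of shared rows to decide whether the two proposals overlap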
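The four steps can be sketched in plain Python on a toy example. This is an illustrative sketch, not the official kernel: it uses 5 rows instead of 72, assumes DATASET_OFFSET is 0, returns the mean distance instead of a boolean, and the sample lanes are invented.

```python
import math

N_STRIPS = 4              # toy value; the real model uses 71
N_OFFSETS = N_STRIPS + 1  # rows per lane

def lane_distance(a, b):
    """Mean |x_a - x_b| over the shared rows; math.inf when no row is shared."""
    # Step 1: start/end row of each lane (+0.5 then int() rounds to nearest).
    start_a = int(a[2] * N_STRIPS + 0.5)
    start_b = int(b[2] * N_STRIPS + 0.5)
    end_a = start_a + int(a[4] + 0.5) - 1
    end_b = start_b + int(b[4] + 0.5) - 1
    start = max(start_a, start_b)
    end = min(end_a, end_b, N_OFFSETS - 1)
    # Step 2: no shared rows means the lanes cannot suppress each other.
    if end < start:
        return math.inf
    # Step 3: sum of horizontal differences over the shared rows.
    total = sum(abs(a[5 + i] - b[5 + i]) for i in range(start, end + 1))
    # Step 4: normalise by the number of shared rows before thresholding.
    return total / (end - start + 1)

# Layout: [bg, lane, start_y, start_x, length, x0..x4]
lane_a = [0.1, 0.9, 0.0, 100.0, 5.0, 100.0, 110.0, 120.0, 130.0, 140.0]
lane_b = [0.2, 0.8, 0.5, 124.0, 3.0, 0.0, 0.0, 124.0, 136.0, 142.0]
lane_c = [0.3, 0.7, 0.0, 400.0, 2.0, 400.0, 410.0, 0.0, 0.0, 0.0]

print(lane_distance(lane_a, lane_b))  # 4.0   (rows 2..4 are shared)
print(lane_distance(lane_b, lane_c))  # inf   (no shared rows)
print(lane_distance(lane_a, lane_c))  # 300.0 (rows 0..1 are shared)
```

With a threshold of 50, lane_b would be suppressed by lane_a, while lane_c would survive.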

After NMS, the remaining lane proposals still have to be decoded into actual lane coordinates. The code is as follows:

def proposals_to_pred(self, proposals):
    self.anchor_ys = self.anchor_ys.to(proposals.device)
    self.anchor_ys = self.anchor_ys.double()
    lanes = []
    for lane in proposals:
        lane_xs = lane[5:] / self.img_w
        start = int(round(lane[2].item() * self.n_strips))
        length = int(round(lane[4].item()))
        end = start + length - 1
        end = min(end, len(self.anchor_ys) - 1)
        # if the proposal does not start at the bottom of the image,
        # extend its proposal until the x is outside the image
        mask = ~((((lane_xs[:start] >= 0.) &
                   (lane_xs[:start] <= 1.)).cpu().numpy()[::-1].cumprod()[::-1]).astype(np.bool))
        lane_xs[end + 1:] = -2
        lane_xs[:start][mask] = -2
        lane_ys = self.anchor_ys[lane_xs >= 0]
        lane_xs = lane_xs[lane_xs >= 0]
        lane_xs = lane_xs.flip(0).double()
        lane_ys = lane_ys.flip(0)
        if len(lane_xs) <= 1:
            continue
        points = torch.stack((lane_xs.reshape(-1, 1), lane_ys.reshape(-1, 1)), dim=1).squeeze(2)
        lane = Lane(points=points.cpu().numpy(),
                    metadata={
                        'start_x': lane[3],
                        'start_y': lane[2],
                        'conf': lane[1]
                    })
        lanes.append(lane)
    return lanes

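It mainly consists of the following steps: (from ChatGPT)

1. Prepare the anchor y coordinates
- anchor_ys are fixed, evenly spaced values. LaneATT exploits prior knowledge about lanes: most lanes start at the bottom of the image and extend upward, so the distribution of y coordinates is relatively fixed
2. Process each proposal
- Normalize the x coordinates: lane_xs = lane[5:] / self.img_w
- Compute the start and length:
  - start: index of the lane's starting point among the anchor rows
  - length: length of the lane
  - end: index of the lane's end point among the anchor rows, clamped to the anchor range
- Handle the rows before the start: if the proposal does not start at the bottom of the image, extend it downward until x falls outside the image, and set the remaining invalid points to -2
- Extract the valid lane points: filter out the invalid points and keep only the valid ones
- Flip the lane points: reverse their order so that they run from the top of the image to the bottom (ascending y)
- Build the lane object: wrap the points in a Lane object together with some metadata (such as the start point and confidence)
3. Return the lane list: return all lane objects.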
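The reversed-cumprod mask is the least obvious part of proposals_to_pred, so here is a minimal sketch of the same marking logic in plain Python. mark_invalid_rows is a hypothetical helper and the x values are invented; row 0 is the bottom of the image, as with anchor_ys.

```python
def mark_invalid_rows(lane_xs, start, length):
    """Return a copy of lane_xs with rows that are not part of the lane set to -2.0."""
    xs = list(lane_xs)
    end = min(start + length - 1, len(xs) - 1)
    # Rows past the end of the lane are never part of it.
    for i in range(end + 1, len(xs)):
        xs[i] = -2.0
    # Walk from the start row down towards the image bottom (row 0).
    inside = True
    for i in range(start - 1, -1, -1):
        inside = inside and 0.0 <= xs[i] <= 1.0
        if not inside:
            # Once x has left the image, every remaining row is dropped,
            # even if a later x happens to fall back inside [0, 1].
            xs[i] = -2.0
    return xs

lane_xs = [1.2, 0.9, -0.1, 0.7, 0.6, 0.5, 0.4, 0.3]  # normalized x, row 0 = bottom
marked = mark_invalid_rows(lane_xs, start=4, length=3)
kept_rows = [i for i, x in enumerate(marked) if x >= 0]
print(kept_rows)  # [3, 4, 5, 6]
```

Row 3 is kept because the extension below the start row is still inside the image; rows 0 to 2 are dropped because row 2 already left the image, and row 7 lies past the lane end.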

With the analysis above, the corresponding postprocessing code is straightforward to write:

def postprocess(self, pred):
    # pred->1x1000x77
    lanes = []
    for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
        lane = pred[img_id, lane_id]
        lanes.append(lane.tolist())
    lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
    lanes = self._nms(lanes)
    lanes_points = self._decode(lanes)
    return lanes_points[:self.nms_topk]

def _nms(self, lanes):
    
    remove_flags = [False] * len(lanes)
    
    keep_lanes = []
    for i, ilane in enumerate(lanes):
        if remove_flags[i]:
            continue
            
        keep_lanes.append(ilane)
        for j in range(i + 1, len(lanes)):
            if remove_flags[j]:
                continue
            
            jlane = lanes[j]
            if self._lane_iou(ilane, jlane) < self.nms_thres:
                remove_flags[j] = True
    return keep_lanes

def _lane_iou(self, lane_a, lane_b):
    # lane = (_, conf, start_y, start_x, length, ...) = 2+2+1+72 = 77
    start_a = int(lane_a[2] * self.n_strips + 0.5)
    start_b = int(lane_b[2] * self.n_strips + 0.5)
    start   = max(start_a, start_b)
    end_a   = start_a + int(lane_a[4] + 0.5) - 1
    end_b   = start_b + int(lane_b[4] + 0.5) - 1
    end     = min(min(end_a, end_b), self.n_strips)
    if end < start:
        # no shared rows: report an infinite distance so neither lane suppresses the other
        return float("inf")
    dist = 0
    for i in range(start, end + 1):
        dist += abs(lane_a[i + 5] - lane_b[i + 5])
    dist = dist / float(end - start + 1)
    return dist

def _decode(self, lanes):
    lanes_points = []
    for lane in lanes:
        start  = int(lane[2] * self.n_strips + 0.5)
        end    = start + int(lane[4] + 0.5) - 1
        end    = min(end, self.n_strips)
        points = []
        for i in range(start, end + 1):
            points.append([lane[i + 5] / self.img_w, self.anchor_ys[i]])
        points = torch.from_numpy(np.array(points))
        lanes_points.append(points)
    return lanes_points

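4. LaneATT Inference

With the preprocessing and postprocessing analyzed, the whole inference process is clear. LaneATT inference has three parts: image preprocessing, model inference, and postprocessing of the predictions. Preprocessing is mainly a resize, and postprocessing consists of NMS and decode.

The complete inference code is as follows: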

import cv2
import torch
import numpy as np
import onnxruntime as ort

class LaneATT(object):
    def __init__(self, model_path, S=72, img_w=640, img_h=360, conf_thresh=0.5, nms_thres=50., nms_topk=4) -> None:
        self.predictor   = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])
        self.n_strips    = S - 1
        self.n_offsets   = S
        self.img_w       = img_w
        self.img_h       = img_h
        self.conf_thresh = conf_thresh
        self.nms_thres   = nms_thres
        self.nms_topk    = nms_topk
        self.anchor_ys   = [1 - i / self.n_strips for i in range(self.n_offsets)]

    def preprocess(self, img):
        # 1. resize
        img = cv2.resize(img, (self.img_w, self.img_h))
        # 2. normalize
        img = (img / 255.0).astype(np.float32)
        # 3. to bchw
        img = img.transpose(2, 0, 1)[None]
        return img
    
    def forward(self, input):
        # input->1x3x360x640
        output = self.predictor.run(None, {"images": input})[0]
        return output

    def postprocess(self, pred):
        # pred->1x1000x77
        lanes = []
        for img_id, lane_id in zip(*np.where(pred[..., 1] > self.conf_thresh)):
            lane = pred[img_id, lane_id]
            lanes.append(lane.tolist())
        lanes = sorted(lanes, key=lambda x:x[1], reverse=True)
        lanes = self._nms(lanes)
        lanes_points = self._decode(lanes)
        return lanes_points[:self.nms_topk]

    def _nms(self, lanes):
        
        remove_flags = [False] * len(lanes)
        
        keep_lanes = []
        for i, ilane in enumerate(lanes):
            if remove_flags[i]:
                continue
                
            keep_lanes.append(ilane)
            for j in range(i + 1, len(lanes)):
                if remove_flags[j]:
                    continue
                
                jlane = lanes[j]
                if self._lane_iou(ilane, jlane) < self.nms_thres:
                    remove_flags[j] = True
        return keep_lanes
    
    def _lane_iou(self, lane_a, lane_b):
        # lane = (_, conf, start_y, start_x, length, ...) = 2+2+1+72 = 77
        start_a = int(lane_a[2] * self.n_strips + 0.5)
        start_b = int(lane_b[2] * self.n_strips + 0.5)
        start   = max(start_a, start_b)
        end_a   = start_a + int(lane_a[4] + 0.5) - 1
        end_b   = start_b + int(lane_b[4] + 0.5) - 1
        end     = min(min(end_a, end_b), self.n_strips)
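        if end < start:
            # no shared rows: report an infinite distance so neither lane suppresses the other
            return float("inf")
        dist = 0
        for i in range(start, end + 1):
            dist += abs(lane_a[i + 5] - lane_b[i + 5])
        dist = dist / float(end - start + 1)
        return dist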

    def _decode(self, lanes):
        lanes_points = []
        for lane in lanes:
            start  = int(lane[2] * self.n_strips + 0.5)
            end    = start + int(lane[4] + 0.5) - 1
            end    = min(end, self.n_strips)
            points = []
            for i in range(start, end + 1):
                points.append([lane[i + 5] / self.img_w, self.anchor_ys[i]])
            points = torch.from_numpy(np.array(points))
            lanes_points.append(points)
        return lanes_points

if __name__ == "__main__":
    
    image = cv2.imread("02610.jpg")

    model_file_path = "laneatt.sim.onnx"
    model   = LaneATT(model_file_path)
    img_pre = model.preprocess(image)
    pred    = model.forward(img_pre)
    lanes_points = model.postprocess(pred)

    for points in lanes_points:
        points[:, 0] *= image.shape[1]
        points[:, 1] *= image.shape[0]
        points = points.numpy().round().astype(int)
        # for curr_p, next_p in zip(points[:-1], points[1:]):
        #     cv2.line(image, tuple(curr_p), tuple(next_p), color=(0, 255, 0), thickness=3)
        for point in points:
            cv2.circle(image, point, 3, color=(0, 255, 0), thickness=-1)
    
    cv2.imwrite("result.jpg", image)

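Note: this runs inference directly on the ONNX model, which avoids having to compile the NMS extension as the torch model requires. For the ONNX export, see the previous article.

The inference result is shown below:

[Image: inference result]

At this point we have completed the whole LaneATT inference process in Python. Next we implement it in C++.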

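II. LaneATT Inference (C++)

For the C++ implementation we again use the tensorRT_Pro repo, and build LaneATT inference in C++ on top of it.

1. ONNX Export

For the details of the ONNX export, see LaneATT Inference Explained and Deployment (Part 1); they are not repeated here.

2. LaneATT Preprocessing

As mentioned earlier, LaneATT preprocessing is just a resize, so in tensorRT_Pro we can directly use the CUDA resize kernel for it. The only thing to watch is that the CUDAKernel::Norm passed in must not enable channel invert.

The tensorRT_Pro preprocessing code is as follows: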

// same to opencv
// reference: https://github.com/opencv/opencv/blob/24fcb7f8131f707717a9f1871b17d95e7cf519ee/modules/imgproc/src/resize.cpp
// reference: https://github.com/openppl-public/ppl.cv/blob/04ef4ca48262601b99f1bb918dcd005311f331da/src/ppl/cv/cuda/resize.cu
/*
    Consider training with a resize function implemented the same way; the Python code is in tools/test_resize.py
*/
__global__ void resize_bilinear_and_normalize_kernel(
    uint8_t* src, int src_line_size, int src_width, int src_height, float* dst, int dst_width, int dst_height, 
    float sx, float sy, Norm norm, int edge
){
    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= edge) return;

    int dx      = position % dst_width;
    int dy      = position / dst_width;
    float src_x = (dx + 0.5f) * sx - 0.5f;
    float src_y = (dy + 0.5f) * sy - 0.5f;
    float c0, c1, c2;

    int y_low = floorf(src_y);
    int x_low = floorf(src_x);
    int y_high = limit(y_low + 1, 0, src_height - 1);
    int x_high = limit(x_low + 1, 0, src_width - 1);
    y_low = limit(y_low, 0, src_height - 1);
    x_low = limit(x_low, 0, src_width - 1);

    int ly    = rint((src_y - y_low) * INTER_RESIZE_COEF_SCALE);
    int lx    = rint((src_x - x_low) * INTER_RESIZE_COEF_SCALE);
    int hy    = INTER_RESIZE_COEF_SCALE - ly;
    int hx    = INTER_RESIZE_COEF_SCALE - lx;
    int w1    = hy * hx, w2 = hy * lx, w3 = ly * hx, w4 = ly * lx;
    float* pdst = dst + dy * dst_width + dx * 3;
    uint8_t* v1 = src + y_low * src_line_size + x_low * 3;
    uint8_t* v2 = src + y_low * src_line_size + x_high * 3;
    uint8_t* v3 = src + y_high * src_line_size + x_low * 3;
    uint8_t* v4 = src + y_high * src_line_size + x_high * 3;

    c0 = resize_cast(w1 * v1[0] + w2 * v2[0] + w3 * v3[0] + w4 * v4[0]);
    c1 = resize_cast(w1 * v1[1] + w2 * v2[1] + w3 * v3[1] + w4 * v4[1]);
    c2 = resize_cast(w1 * v1[2] + w2 * v2[2] + w3 * v3[2] + w4 * v4[2]);

    if(norm.channel_type == ChannelType::Invert){
        float t = c2;
        c2 = c0;  c0 = t;
    }

    if(norm.type == NormType::MeanStd){
        c0 = (c0 * norm.alpha - norm.mean[0]) / norm.std[0];
        c1 = (c1 * norm.alpha - norm.mean[1]) / norm.std[1];
        c2 = (c2 * norm.alpha - norm.mean[2]) / norm.std[2];
    }else if(norm.type == NormType::AlphaBeta){
        c0 = c0 * norm.alpha + norm.beta;
        c1 = c1 * norm.alpha + norm.beta;
        c2 = c2 * norm.alpha + norm.beta;
    }

    int area = dst_width * dst_height;
    float* pdst_c0 = dst + dy * dst_width + dx;
    float* pdst_c1 = pdst_c0 + area;
    float* pdst_c2 = pdst_c1 + area;
    *pdst_c0 = c0;
    *pdst_c1 = c1;
    *pdst_c2 = c2;
}

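The preprocessing simply calls the CUDA kernel above to perform the resize. Because the kernel works on each pixel individually, operations such as BGR->RGB and /255.0 are easy to fold into the same pass.

3. LaneATT Postprocessing

As mentioned earlier, LaneATT postprocessing mainly consists of NMS and decode. Here I mainly followed the TensorRT-LaneATT implementation.

The NMS runs on the GPU and reuses the tensorRT_Pro code, as shown below: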

static __device__ float LaneIoU(float* a, float* b){
    int start_a = (int)(a[2] * N_STRIPS + 0.5f);
    int start_b = (int)(b[2] * N_STRIPS + 0.5f);
    int start   = max(start_a, start_b);
    int end_a   = start_a + (int)(a[4] + 0.5f) - 1;
    int end_b   = start_b + (int)(b[4] + 0.5f) - 1;
    int end     = min(min(end_a, end_b), N_STRIPS);
    float dist  = 0.0f;
    for(int i = 6 + start; i <= 6 + end; ++i){
        dist += fabsf(a[i] - b[i]);
    }
    return dist / (float)(end - start + 1);
}

static __global__ void nms_kernel(float* lanes, int max_lanes, float threshold){
    
    int position = (blockDim.x * blockIdx.x + threadIdx.x);
    int count = min((int)*lanes, max_lanes);
    if(position >= count)
        return;

    float* pcurrent = lanes + 1 + position * NUM_LANE_ELEMENT;
    if(pcurrent[5] == 0) return;

    for(int i = 0; i < count; ++i){
        float* pitem = lanes + 1 + i * NUM_LANE_ELEMENT;
        if(i == position)   continue;
        
        if(pitem[1] >= pcurrent[1]){
            if(pitem[1] == pcurrent[1] && i < position)
                continue;
            
            float iou = LaneIoU(pcurrent, pitem);
            if(iou < threshold){
                pcurrent[5] = 0;  // 1=keep, 0=ignore
                return;
            }
        }
    }
}

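The NMS launches multiple threads, each handling one lane proposal. For every other proposal whose confidence is at least as high as that of the thread's own lane, the thread computes the lane distance between the two and uses it to decide whether to keep its own lane. Compared with the CPU version of NMS there is one fewer loop level, because the other loop is replaced by the parallel CUDA threads.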
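One consequence worth noting: as the kernel above is written, a thread never reads the keep flag of the other proposal, so a lane can be suppressed by a higher-scoring lane that is itself suppressed. A sequential greedy NMS such as the Python _nms shown earlier does not do this, so the two can disagree in chained cases. The sketch below simulates both rules on a 1D toy problem; the item layout, the distance function, and the numbers are assumptions made purely for illustration.

```python
def parallel_nms(items, distance, threshold):
    """Rule of the kernel above: drop an item if ANY higher-scoring item is close."""
    keep = []
    for position, current in enumerate(items):  # one GPU thread per item
        suppressed = False
        for i, other in enumerate(items):
            if i == position or other["score"] < current["score"]:
                continue
            if other["score"] == current["score"] and i < position:
                continue  # tie-break: equal scores suppress in one direction only
            if distance(current, other) < threshold:
                suppressed = True
                break
        keep.append(not suppressed)
    return keep

def greedy_nms(items, distance, threshold):
    """Sequential rule: only items that were kept can suppress others."""
    order = sorted(range(len(items)), key=lambda k: items[k]["score"], reverse=True)
    removed = set()
    keep = [False] * len(items)
    for rank, i in enumerate(order):
        if i in removed:
            continue
        keep[i] = True
        for j in order[rank + 1:]:
            if j not in removed and distance(items[i], items[j]) < threshold:
                removed.add(j)
    return keep

def toy_distance(a, b):
    return abs(a["x"] - b["x"])

items = [{"score": 0.9, "x": 0.0}, {"score": 0.8, "x": 8.0}, {"score": 0.7, "x": 16.0}]
print(parallel_nms(items, toy_distance, 10.0))  # [True, False, False]
print(greedy_nms(items, toy_distance, 10.0))    # [True, False, True]
```

In the toy case the third item is only close to the second one, which is itself removed by the first; the parallel rule drops it anyway, while the greedy rule keeps it.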

In the decode step, the confidence filtering is done on the GPU. The code is as follows:

static __global__ void decode_kernel(float* predict, int num_lanes, float confidence_threshold, float* parray){

    int position = blockDim.x * blockIdx.x + threadIdx.x;
    if (position >= num_lanes)  return;

    float* pitem = predict + (NUM_LANE_ELEMENT - 1) * position;
    float conf   = pitem[1];
    if(conf < confidence_threshold)
        return;
    
    int index = atomicAdd(parray, 1);
    float conf1   = *pitem++;
    float conf2   = *pitem++;
    float start_y = *pitem++;
    float start_x = *pitem++;
    float length  = *pitem++;

    float* pout_item = parray + 1 + index * NUM_LANE_ELEMENT;
    *pout_item++ = conf1;
    *pout_item++ = conf2;
    *pout_item++ = start_y;
    *pout_item++ = start_x;
    *pout_item++ = length;
    *pout_item++ = 1;   // 1 = keep, 0 = ignore

    for(int i = 0; i < N_OFFSETS; ++i){
        float point  = *pitem++;
        *pout_item++ = point;
    }
}
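To make the buffer layout explicit, the sketch below replays decode_kernel sequentially in Python. It assumes NUM_LANE_ELEMENT is 78 (the 77 network values plus the keep flag), which is what the (NUM_LANE_ELEMENT - 1) input stride suggests, and it sizes the output buffer for every proposal so that no bounds check is needed. On the GPU the slot order depends on which thread reaches atomicAdd first, whereas this loop fills the slots in input order. The make_row helper and its values are invented.

```python
NUM_INPUT_ELEMENT = 77                    # values per proposal in the network output
NUM_LANE_ELEMENT = NUM_INPUT_ELEMENT + 1  # assumed: one extra keep flag per stored lane

def decode(predict, num_lanes, confidence_threshold):
    # parray[0] holds the lane count; each lane then occupies NUM_LANE_ELEMENT floats.
    parray = [0.0] * (1 + num_lanes * NUM_LANE_ELEMENT)
    for position in range(num_lanes):     # one GPU thread per proposal
        base_in = position * NUM_INPUT_ELEMENT
        item = predict[base_in:base_in + NUM_INPUT_ELEMENT]
        if item[1] < confidence_threshold:
            continue
        index = int(parray[0])            # atomicAdd(parray, 1) returns the old count
        parray[0] += 1
        base_out = 1 + index * NUM_LANE_ELEMENT
        parray[base_out:base_out + 5] = item[:5]  # conf1, conf2, start_y, start_x, length
        parray[base_out + 5] = 1.0                # keep flag: 1 = keep, 0 = ignore
        parray[base_out + 6:base_out + NUM_LANE_ELEMENT] = item[5:]  # 72 x values
    return parray

def make_row(lane_score, first_x):
    """Build one invented 77-value proposal row."""
    return [1.0 - lane_score, lane_score, 0.0, first_x, 10.0] + [first_x] * 72

predict = make_row(0.9, 100.0) + make_row(0.2, 200.0) + make_row(0.7, 300.0)
parray = decode(predict, num_lanes=3, confidence_threshold=0.5)
print(int(parray[0]))  # 2
```

The extra slot at index 5 is why LaneIoU in the NMS kernel reads the x values starting from offset 6 rather than 5.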

The decoding of the points in each proposal is done on the CPU. The code is as follows:

for(auto& lane : image_based_lanes){
    lane.points.reserve(N_OFFSETS / 2);
    int start = (int)(lane.start_y * N_STRIPS + 0.5f);
    int end   = start + (int)(lane.length + 0.5f) - 1;
    end       = min(end, N_STRIPS);
    for(int i = start; i <= end; ++i){
        lane.points.push_back(cv::Point2f(lane.lane_xs[i] / input_width_, anchor_ys_[i]));
    }
}

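4. LaneATT Inference

With the preprocessing and postprocessing analyzed, the whole inference process is clear. In C++, LaneATT preprocessing can reuse the CUDA resize from tensorRT_Pro directly, while the decode and NMS parts of postprocessing need minor modifications.

Run the following command in a terminal to perform inference (note: the full workflow is covered later in this article; this is only a quick demo):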

make laneatt -j64

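The build output is shown below:

[Image: build output]

The inference result is shown below:

[Image: inference result]

At this point we have completed the whole LaneATT inference process in C++. Next we walk through the complete workflow.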

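III. LaneATT Deployment

I created a new repository, tensorRT_Pro-YOLOv8, which is based on shouxieai/tensorRT_Pro and adapted to support the various YOLOv8 tasks. It currently supports classification, detection, segmentation, and pose estimation.

Let's look at how to use the tensorRT_Pro-YOLOv8 repo to run inference with the LaneATT model.

1. Source Download

The tensorRT_Pro-YOLOv8 code can be downloaded directly from GitHub at https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8. On Linux, clone it with: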

git clone https://github.com/Melody-Zhou/tensorRT_Pro-YOLOv8.git

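Alternatively, download it manually by clicking the Code button at the top right. The project is then ready. You can also click here to download the source code I prepared (note that this copy was downloaded on 2024/8/4; if the repo has changed, refer to the latest version).

2. Environment Setup

The required software is TensorRT, CUDA, cuDNN, OpenCV, and Protobuf. For installing all of them, see the Ubuntu 20.04 Software Installation Guide; the steps are not repeated here, so please set up the environment yourself 😄. Since access to overseas sites can be slow, here is a download link for the installation packages I used: Baidu Drive [pwd: yolo] 🚀🚀🚀

tensorRT_Pro-YOLOv8 can be built with either CMakeLists.txt or the Makefile; pick one of the two.

2.1 Configuring CMakeLists.txt

Five changes are needed.

1. Line 13: set the OpenCV path

set(OpenCV_DIR   "/usr/local/include/opencv4")

2. Line 15: set the CUDA path

set(CUDA_TOOLKIT_ROOT_DIR     "/usr/local/cuda-11.6")

3. Line 16: set the cuDNN path

set(CUDNN_DIR    "/usr/local/cudnn8.4.0.27-cuda11.6")

4. Line 17: set the TensorRT path

set(TENSORRT_DIR "/opt/TensorRT-8.4.1.5")

5. Line 20: set the protobuf path

set(PROTOBUF_DIR "/home/jarvis/protobuf")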

2.2 Configuring the Makefile

Five changes are needed.

1. Line 4: set the protobuf path

lean_protobuf  := /home/jarvis/protobuf

2. Line 5: set the TensorRT path

lean_tensor_rt := /opt/TensorRT-8.4.1.5

3. Line 6: set the cuDNN path

lean_cudnn     := /usr/local/cudnn8.4.0.27-cuda11.6

4. Line 7: set the OpenCV path

lean_opencv    := /usr/local

5. Line 8: set the CUDA path

lean_cuda      := /usr/local/cuda-11.6

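3. ONNX Export

For the export details, see LaneATT Inference Explained and Deployment (Part 1); they are not repeated here. Remember to place the exported ONNX model in the tensorRT_Pro-YOLOv8/workspace folder.

4. Engine Generation

Create laneatt_build.sh under workspace with the following content: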

#! /usr/bin/bash

TRTEXEC=/home/jarvis/lean/TensorRT-8.6.1.6/bin/trtexec

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/home/jarvis/lean/TensorRT-8.6.1.6/lib

${TRTEXEC} \
  --onnx=laneatt.sim.onnx \
  --minShapes=images:1x3x360x640 \
  --optShapes=images:1x3x360x640 \
  --maxShapes=images:8x3x360x640 \
  --memPoolSize=workspace:2048 \
  --saveEngine=laneatt.sim.FP16.trtmodel \
  --fp16 \
  > laneatt.log 2>&1

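Change TRTEXEC (and the LD_LIBRARY_PATH entry) to the paths on your own machine. Use the trtexec that ships with the same TensorRT version the project is built against, since a serialized engine generally only loads with the TensorRT version that built it. Then run the following in a terminal: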

cd tensorRT_Pro-YOLOv8/workspace
bash laneatt_build.sh

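After a while, laneatt.sim.FP16.trtmodel, the model engine file, is generated in the current folder. Note that the terminal shows no log output, because the TensorRT logs are redirected to laneatt.log; you can remove the redirection to see the logs in the terminal instead.

Note: I also provide the TRT::compile interface for generating the engine file, but the following problem may appear during deserialization:

[Image: error message]

This is mainly because the onnxparser built into tensorRT_Pro-YOLOv8 is too old and cannot parse the ScatterND node. We can replace the onnxparser manually; for details see: RT-DETR Inference Explained and Deployment

Alternatively we can support it with a plugin. Mr. Du provides a ScatterND plugin in tensorRT_Pro that can be used directly, although it requires some simple modifications to the ONNX model. If you are interested, see: Using and Explaining the LayerNorm Plugin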

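5. Source Modification

If you want to run inference with a model you trained yourself, the source code also needs a small change. The LaneATT inference code is mainly in app_laneatt.cpp, and that is the only file to modify. The change is simple:

In app_laneatt.cpp, line 226, change "laneatt.sim" to the name of your exported ONNX model

An example of the change: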

test(TRT::Mode::FP16, "laneatt.sim");	// change 1: line 226, replace "laneatt.sim" with e.g. "best"

6. Running

OK! With the source modified, the Makefile configured, and the engine ready, we can build and run. Run the following command in a terminal:

make laneatt -j64

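The inference output is shown below:

[Image: inference output]

After successful inference, a laneatt.sim_LaneATT_FP16_result folder is created, containing the inference result images.

The model's inference results look like this:

[Image: model inference results]

OK, that is the general workflow for running LaneATT inference with tensorRT_Pro-YOLOv8. If you spot any problems, corrections are welcome.

Conclusion

Here I gave a brief analysis of LaneATT's preprocessing and postprocessing and shared the C++ implementation workflow, with the aim of helping you get the ideas straight and complete the subsequent deployment work more smoothly 😄. Thanks for reading to the end; writing this took effort, so if you found it useful, please leave a 👍⭐️

In terms of accuracy, LaneATT may not match the current SOTA, and its practical engineering value may also fall short of bottom-up segmentation-based approaches, but it is still a good starting point for learning lane detection 🤗

Finally, if you find the tensorRT_Pro-YOLOv8 repo helpful, consider giving it a ⭐️. It means a lot to me. Thank you 🙏.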