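Facial Emotion Recognition (FER): From Expressions to Emotions, with a Full Code Implementation

Facial emotion recognition (FER) is the process of identifying and classifying human emotions from facial expressions. By analyzing facial features and patterns, a machine can make an informed guess about a person's emotional state. This subfield of facial recognition is highly interdisciplinary, drawing on insights from computer vision, machine learning, and psychology.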

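This article first looks at the principles behind facial emotion recognition, then builds a real-time facial emotion recognition system based on the VGG13 model architecture and the Face Expression Recognition Plus (FER+) dataset. The complete code is included at the end.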

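The Importance of Human Emotions

For centuries, philosophers have debated the nature of emotion. Some regard emotions as purely physiological responses; others see them as complex mental states involving judgments, desires, and beliefs; still others understand them as the product of social factors. Historically, emotion was often set in sharp contrast to reason: reason was seen as superior, emotion as volatile and disruptive. Modern views grounded in psychology and neuroscience, however, hold that emotions often play a key role in decision-making, moral judgment, and other cognitive processes.

Emotions are not just personal experiences; they are also shaped by culture and history. Different cultures may prioritize, interpret, and even name emotions differently. This cultural variation suggests that, although some emotional experiences may be universal, the way they are interpreted and expressed can differ widely.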

[Figure: an abstract illustration of multiple emotions]

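Emotions are also closely tied to health. Understanding and managing emotions is essential to mental well-being: unprocessed or suppressed emotions can lead to a range of mental health problems, and chronic stress or prolonged negative emotional states can take a toll on the body. Emotions can also be a powerful motivator. Anger, for example, can fuel social change, and fear can drive people to avoid danger.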

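Can AI-Powered Systems Detect Human Emotions?

Humans usually observe and understand the world intuitively and seemingly without effort. Our eyes capture light, our optic nerves carry that data to the brain, and within milliseconds the brain processes it into a coherent picture of our surroundings. This ability to perceive, recognize, and interpret is the culmination of millions of years of evolutionary refinement.

Computers have no such innate ability. They interpret the world through data, and for vision that data usually comes from a camera, which converts visual information into digital form for an image classification model to process.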

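Applications of Facial Emotion Recognition

Emotion recognition can be applied in many fields. For example:

Social and content creation platforms
- Personalizing the user experience based on emotional feedback.
- Online learning systems that adapt content to the learner's emotional state and offer more targeted suggestions.
Medical research
- Monitoring patients for signs of depression, anxiety, or other mood disorders.
- Helping therapists track a patient's progress or reactions during treatment.
- Monitoring patients' psychological stress in real time.
Driver safety mechanisms
- Monitoring drivers for signs of fatigue, distraction, or drowsiness.
Marketing and market research
- Analyzing audience reactions to advertisements in real time.
- Tailoring advertising content to the audience's emotional state.
- Product testing and feedback.
Security and surveillance
- Spotting suspicious or unusual behavior in crowded places.
- Analyzing crowd reactions during public events.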

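Building a Facial Emotion Recognition System

With that background on facial emotions in place, this section builds the facial emotion recognition system.

The FER+ Dataset

The FER+ dataset is an extension of the original Facial Expression Recognition (FER) dataset that provides more refined and nuanced annotations of facial expressions. The original FER dataset labels faces with the six basic emotions (happiness, sadness, anger, surprise, fear, and disgust) plus a neutral class; FER+ goes a step further and adds contempt, so the model in this article works with eight emotion categories: the six basic emotions, neutral, and contempt.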

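Including a neutral category reflects an awareness of how complex human emotions and expressions are. As it turns out, most of the time a person's face shows no particular expression, so "neutral" serves as a baseline for comparison and helps account for faces that do not strongly convey any of the primary emotions. Treating "neutral" as a category in its own right also acknowledges that not every facial expression can be neatly assigned to one primary emotion label.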

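RFB-320 Single Shot Multibox Detector (SSD) for Face Detection

Before emotions can be recognized, faces must first be detected in the input frame. For this we use the ultra-lightweight face detection model RFB-320.
It is a face detection model optimized for edge computing devices. It incorporates a modified Receptive Field Block (RFB) module that captures multi-scale contextual information without adding computational overhead. Trained on the WIDER FACE dataset, it takes 320×240 input, requires about 0.2106 GFLOPs of computation, and has only 0.3004 million parameters. The model was developed in PyTorch and reaches a mean average precision (mAP) of 84.78%, which makes it a strong option for efficient face detection in resource-constrained environments. In this implementation, the RFB-320 SSD model is used in Caffe format.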

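Custom VGG13 Model Architecture

The emotion classifier uses a custom VGG13 architecture designed for 64×64 grayscale images. It classifies each image into one of 8 emotion categories using convolutional layers with max pooling, plus dropout to prevent overfitting. The architecture starts with two convolutional layers of 64 kernels each, followed by max pooling and 25% dropout. Additional convolutional layers capture more complex features, and two dense layers of 1024 nodes each aggregate the information, followed by 50% dropout. A softmax output layer predicts the emotion category.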

[Figure: custom VGG13 model architecture]

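In the architecture figure, yellow, green, orange, blue, and gray denote convolutional, max-pooling, dropout, fully connected, and softmax output layers, respectively. Because the dataset is relatively small, dropout layers are placed strategically to curb overfitting, which helps the model generalize and recognize emotions across a variety of images. In this project, the model has already been pre-trained on the FER+ dataset described above.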
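To make the layer description concrete, the sketch below traces tensor shapes through the network in plain Python. The article only states the first block (two 64-kernel convolutions, max pooling, 25% dropout), the two 1024-node dense layers, and the 8-way softmax. Everything else here is an assumption taken from the usual description of the FER+ VGG13 variant: four convolutional blocks with 2, 2, 3, and 3 layers and 64, 128, 256, and 256 filters, 3×3 convolutions with "same" padding, 2×2 max pooling, and 25% dropout after every block.

```python
# Assumed block layout: (number of conv layers, number of filters).
# Only the first block (2 conv layers, 64 filters) is stated in the text.
CONV_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 256)]
DENSE_UNITS = [1024, 1024]
NUM_CLASSES = 8


def trace_shapes(height=64, width=64, channels=1):
    """Return a list of (layer name, output shape) pairs."""
    shapes = [("input", (channels, height, width))]
    for block_id, (num_convs, filters) in enumerate(CONV_BLOCKS, start=1):
        for conv_id in range(1, num_convs + 1):
            # A 3x3 convolution with "same" padding keeps height and width.
            channels = filters
            shapes.append(
                (f"conv{block_id}_{conv_id}", (channels, height, width))
            )
        # 2x2 max pooling halves height and width; dropout keeps the shape.
        height, width = height // 2, width // 2
        shapes.append(
            (f"pool{block_id}+dropout(0.25)", (channels, height, width))
        )
    # Flatten the last feature map before the dense layers.
    shapes.append(("flatten", (channels * height * width,)))
    for i, units in enumerate(DENSE_UNITS, start=1):
        shapes.append((f"dense{i}+dropout(0.5)", (units,)))
    shapes.append(("softmax", (NUM_CLASSES,)))
    return shapes


for name, shape in trace_shapes():
    print(f"{name:24s} {shape}")
```

Under these assumptions a 64×64 input shrinks to 4×4 after four pooling steps, so the first dense layer sees 256 × 4 × 4 = 4096 features.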

Code Implementation

Next, let's walk through the code.

Parameter Initialization

First, we configure the parameters used throughout the pipeline.

image_mean = np.array([127, 127, 127])
image_std = 128.0
iou_threshold = 0.3
center_variance = 0.1
size_variance = 0.2
min_boxes = [
    [10.0, 16.0, 24.0],
    [32.0, 48.0],
    [64.0, 96.0],
    [128.0, 192.0, 256.0]
]
strides = [8.0, 16.0, 32.0, 64.0]
threshold = 0.5

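Where:

image_mean: the per-channel (RGB) mean used to normalize the image.
image_std: the standard deviation (scale) used to normalize the image.
iou_threshold: the Intersection over Union (IoU) threshold used to decide when two bounding boxes overlap enough to count as the same detection.
center_variance: the scaling factor applied to predicted box center offsets.
size_variance: the scaling factor applied to predicted box sizes.
min_boxes: the prior box sizes, in pixels, for each feature map (small boxes on fine feature maps, large boxes on coarse ones).
strides: the downsampling factor of each feature map relative to the input image, which determines the feature map size.
threshold: the confidence threshold for detections.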
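As a small illustration of what image_mean and image_std do, the sketch below normalizes two hypothetical pixels by hand. The pixel values are made up for the example. The same arithmetic (subtract the mean, then scale) is what cv2.dnn.blobFromImage applies later in the pipeline when it is called with a scale factor of 1 / image_std and a mean of 127.

```python
import numpy as np

image_mean = np.array([127, 127, 127])
image_std = 128.0

# A hypothetical 1x2 RGB image: one black pixel and one white pixel.
pixels = np.array([[[0, 0, 0], [255, 255, 255]]], dtype=np.float32)

# Subtract the per-channel mean, then divide by the standard deviation.
normalized = (pixels - image_mean) / image_std
print(normalized)
```

Black maps to -127/128 and white to 128/128, so the network sees inputs in roughly the range [-1, 1].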

Generating Prior Boxes (Priors)

def define_img_size(image_size):
    shrinkage_list = []
    feature_map_w_h_list = []
    for size in image_size:
        feature_map = [int(ceil(size / stride)) for stride in strides]
        feature_map_w_h_list.append(feature_map)

    for i in range(0, len(image_size)):
        shrinkage_list.append(strides)
    priors = generate_priors(
        feature_map_w_h_list, shrinkage_list, image_size, min_boxes
    )
    return priors

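This function generates the prior bounding boxes (priors) for the detection task. It takes image_size as input and, from the image dimensions and a set of predefined strides, computes the feature map sizes. These sizes are the spatial dimensions that the convolutional neural network (CNN) layers are expected to output at each scale. In essence, priors give SSD an efficient way to predict many bounding boxes and their class scores at once, which is what makes real-time detection possible.

The code builds shrinkage_list by appending the strides list once for each element of image_size. It then calls generate_priors, passing the feature map dimensions, the shrinkage information, image_size, and the predefined minimum box sizes. generate_priors uses this information to create and return the priors, which define_img_size returns in turn.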
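The feature map arithmetic is easy to check by hand. The sketch below repeats it for the 320×240 input used later in this article and counts how many priors result; it reuses the strides and min_boxes values from the parameter section and is only a worked example, not part of the pipeline.

```python
from math import ceil

strides = [8.0, 16.0, 32.0, 64.0]
min_boxes = [
    [10.0, 16.0, 24.0],
    [32.0, 48.0],
    [64.0, 96.0],
    [128.0, 192.0, 256.0],
]
image_size = [320, 240]  # [width, height]

# One list of feature-map sizes per image dimension.
feature_map_w = [int(ceil(image_size[0] / s)) for s in strides]
feature_map_h = [int(ceil(image_size[1] / s)) for s in strides]
print(feature_map_w)  # [40, 20, 10, 5]
print(feature_map_h)  # [30, 15, 8, 4]

# Every grid cell gets one prior per entry in the matching min_boxes row.
num_priors = sum(
    w * h * len(boxes)
    for w, h, boxes in zip(feature_map_w, feature_map_h, min_boxes)
)
print(num_priors)  # 4420
```

The total, 3600 + 600 + 160 + 60 = 4420, is the number that generate_priors prints as "priors nums" for this input size.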

def generate_priors(feature_map_list, shrinkage_list, image_size, min_boxes):
    # Collect the generated priors here.
    priors = []
    # Loop over the feature maps (one per stride).
    for index in range(0, len(feature_map_list[0])):
        # Number of grid cells along each axis, kept as a float so the
        # cell centers below come out normalized to [0, 1].
        scale_w = image_size[0] / shrinkage_list[0][index]
        scale_h = image_size[1] / shrinkage_list[1][index]
        # Visit every cell of the feature map, row by row.
        for j in range(0, feature_map_list[1][index]):
            for i in range(0, feature_map_list[0][index]):
                # Normalized center of cell (i, j).
                x_center = (i + 0.5) / scale_w
                y_center = (j + 0.5) / scale_h

                # One prior per box size, normalized by the image size.
                for min_box in min_boxes[index]:
                    w = min_box / image_size[0]
                    h = min_box / image_size[1]

                    priors.append([
                        x_center,
                        y_center,
                        w,
                        h
                    ])
    # Report how many priors were generated.
    print("priors nums:{}".format(len(priors)))
    # Clip all values to [0.0, 1.0] and return the priors as an array.
    return np.clip(priors, 0.0, 1.0)

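generate_priors produces the prior bounding boxes used in detection. From the feature map sizes, the image size, and the minimum box sizes, it computes box centers and sizes normalized to the image. It accumulates these boxes and returns them as the reference against which the network's predictions are decoded.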
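To see what a single prior looks like, the sketch below computes the first one by hand: the top-left cell of the finest feature map (stride 8) paired with the smallest box size (10 pixels), for a 320×240 input. It follows the same formulas as generate_priors but is written out for one cell only.

```python
import numpy as np

image_size = [320, 240]   # [width, height]
stride = 8.0              # stride of the first feature map
min_box = 10.0            # smallest box size on that feature map, in pixels

# Number of grid cells along each axis, expressed as a float scale.
scale_w = image_size[0] / stride   # 40.0
scale_h = image_size[1] / stride   # 30.0

# Center of the top-left cell (i = 0, j = 0), normalized to [0, 1].
x_center = (0 + 0.5) / scale_w
y_center = (0 + 0.5) / scale_h

# Box size normalized by the image width and height.
w = min_box / image_size[0]
h = min_box / image_size[1]

prior = np.clip([x_center, y_center, w, h], 0.0, 1.0)
print(prior)
```

The same 10-pixel box has a different normalized width and height (10/320 versus 10/240) because the input is not square.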

Hard NMS Implementation

def hard_nms(box_scores, iou_threshold, top_k=-1, candidate_size=200):
    # Split scores from box coordinates and sort indices by ascending score.
    scores = box_scores[:, -1]
    boxes = box_scores[:, :-1]
    picked = []
    indexes = np.argsort(scores)
    # Keep only the candidate_size highest-scoring boxes.
    indexes = indexes[-candidate_size:]
    while len(indexes) > 0:
        # Pick the highest-scoring remaining box.
        current = indexes[-1]
        picked.append(current)
        # Stop once top_k boxes have been picked (if top_k is set) or
        # when this was the last remaining box.
        if 0 < top_k == len(picked) or len(indexes) == 1:
            break
        # Remove the picked box from the candidates.
        current_box = boxes[current, :]
        indexes = indexes[:-1]
        rest_boxes = boxes[indexes, :]
        # IoU between the picked box and every remaining box.
        iou = iou_of(
            rest_boxes,
            np.expand_dims(current_box, axis=0),
        )
        # Drop boxes that overlap the picked box by more than iou_threshold.
        indexes = indexes[iou <= iou_threshold]
    # Return the rows of the picked boxes.
    return box_scores[picked, :]

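hard_nms implements hard non-maximum suppression (NMS). It processes box_scores using the iou_threshold, top_k, and candidate_size parameters: it repeatedly picks the highest-scoring remaining box and discards any box whose IoU with it exceeds the threshold. The result is a subset of high-scoring boxes that do not overlap heavily, which removes duplicate detections of the same face.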
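The behavior is easiest to see on a toy input. The sketch below is a compact, self-contained greedy NMS written for illustration; simple_nms and iou_with are hypothetical helper names, not the article's functions, and the three detections are made-up values. Two of the boxes are near-duplicates and one is a separate face.

```python
import numpy as np


def iou_with(box, others, eps=1e-5):
    """IoU between one corner-form box and an array of boxes."""
    left_top = np.maximum(others[:, :2], box[:2])
    right_bottom = np.minimum(others[:, 2:], box[2:])
    wh = np.clip(right_bottom - left_top, 0.0, None)
    overlap = wh[:, 0] * wh[:, 1]
    area_box = (box[2] - box[0]) * (box[3] - box[1])
    area_others = (others[:, 2] - others[:, 0]) * (others[:, 3] - others[:, 1])
    return overlap / (area_box + area_others - overlap + eps)


def simple_nms(box_scores, iou_threshold):
    """Greedy NMS over rows of [x1, y1, x2, y2, score]."""
    order = np.argsort(box_scores[:, -1])[::-1]  # best score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        if len(rest) == 0:
            break
        # Discard every remaining box that overlaps the best one too much.
        ious = iou_with(box_scores[best, :4], box_scores[rest, :4])
        order = rest[ious <= iou_threshold]
    return box_scores[keep]


detections = np.array([
    [0.10, 0.10, 0.50, 0.50, 0.90],  # strongest detection
    [0.12, 0.12, 0.52, 0.52, 0.80],  # near-duplicate of the first
    [0.60, 0.60, 0.90, 0.90, 0.70],  # a separate face
])
kept = simple_nms(detections, iou_threshold=0.3)
print(kept)
```

With a threshold of 0.3, the near-duplicate (IoU of about 0.82 with the strongest box) is suppressed, and the two non-overlapping boxes survive.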

Computing IoU

def area_of(left_top, right_bottom):
    # Clip the coordinate differences at 0.0 so that boxes with no overlap
    # get zero width or height instead of a negative value.
    hw = np.clip(right_bottom - left_top, 0.0, None)
    # Area is width times height.
    return hw[..., 0] * hw[..., 1]

# Takes two sets of corner-form boxes; eps avoids division by zero.
def iou_of(boxes0, boxes1, eps=1e-5):
    # Left-top and right-bottom corners of the overlap region.
    overlap_left_top = np.maximum(boxes0[..., :2], boxes1[..., :2])
    overlap_right_bottom = np.minimum(boxes0[..., 2:], boxes1[..., 2:])
    # Area of the overlap and of each set of boxes.
    overlap_area = area_of(overlap_left_top, overlap_right_bottom)
    area0 = area_of(boxes0[..., :2], boxes0[..., 2:])
    area1 = area_of(boxes1[..., :2], boxes1[..., 2:])
    # IoU = overlap / union, where union = area0 + area1 - overlap.
    return overlap_area / (area0 + area1 - overlap_area + eps)

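area_of computes the area of a rectangle from its left-top and right-bottom corners. Clipping with np.clip keeps the width and height from going negative.

iou_of computes the IoU (Intersection over Union) between two sets of bounding boxes. It builds on area_of and measures how much two boxes overlap relative to the area they cover together.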
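A quick worked example helps confirm the formula. The sketch below uses a simplified single-pair version of the computation (the function name iou and the box values are chosen for this example only): two 2×2 squares that share a 1×1 corner, and two boxes that do not touch.

```python
import numpy as np


def iou(box_a, box_b, eps=1e-5):
    """IoU of two boxes given as [x1, y1, x2, y2]."""
    box_a, box_b = np.asarray(box_a, float), np.asarray(box_b, float)
    left_top = np.maximum(box_a[:2], box_b[:2])
    right_bottom = np.minimum(box_a[2:], box_b[2:])
    # Negative differences mean no overlap along that axis.
    wh = np.clip(right_bottom - left_top, 0.0, None)
    overlap = wh[0] * wh[1]
    area_a = np.prod(box_a[2:] - box_a[:2])
    area_b = np.prod(box_b[2:] - box_b[:2])
    return overlap / (area_a + area_b - overlap + eps)


# Two 2x2 squares that share a 1x1 corner: overlap 1, union 4 + 4 - 1 = 7.
partial = iou([0, 0, 2, 2], [1, 1, 3, 3])
# Disjoint boxes have no overlap, so the IoU is 0.
disjoint = iou([0, 0, 1, 1], [2, 2, 3, 3])
print(round(partial, 3), disjoint)
```

The first pair gives roughly 1/7 ≈ 0.143, well below the 0.3 threshold used for NMS, so both boxes would be kept as separate detections.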

def predict(
    width,
    height,
    confidences,
    boxes,
    prob_threshold,
    iou_threshold=0.3,
    top_k=-1
):
    # Drop the batch dimension by taking the first element of each array.
    boxes = boxes[0]
    confidences = confidences[0]
    # Accumulate the kept boxes and their labels here.
    picked_box_probs = []
    picked_labels = []
    # Loop over classes, starting at 1 to skip the background class.
    for class_index in range(1, confidences.shape[1]):
        # Confidence of every candidate box for this class.
        probs = confidences[:, class_index]
        # Keep only boxes above the confidence threshold.
        mask = probs > prob_threshold
        probs = probs[mask]
        # Nothing passed the threshold for this class: move on.
        if probs.shape[0] == 0:
            continue
        # Stack the kept boxes and their confidences into one array
        # with rows of [x1, y1, x2, y2, prob].
        subset_boxes = boxes[mask, :]
        box_probs = np.concatenate(
            [subset_boxes, probs.reshape(-1, 1)], axis=1
        )
        # Apply hard NMS to remove overlapping duplicates.
        box_probs = hard_nms(box_probs,
                             iou_threshold=iou_threshold,
                             top_k=top_k,
                             )
        # Record the surviving boxes and their class label.
        picked_box_probs.append(box_probs)
        picked_labels.extend([class_index] * box_probs.shape[0])
    # No detections at all: return empty arrays.
    if not picked_box_probs:
        return np.array([]), np.array([]), np.array([])
    # Scale normalized coordinates back to pixels in the original image.
    picked_box_probs = np.concatenate(picked_box_probs)
    picked_box_probs[:, 0] *= width
    picked_box_probs[:, 1] *= height
    picked_box_probs[:, 2] *= width
    picked_box_probs[:, 3] *= height
    # Return box coordinates (first four columns), labels, and
    # confidences (fifth column).
    return (
        picked_box_probs[:, :4].astype(np.int32),
        np.array(picked_labels),
        picked_box_probs[:, 4]
    )

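predict turns the detector's raw outputs into final bounding boxes, labels, and confidences. It filters candidates by the confidence threshold and applies NMS to remove redundant boxes. It takes width, height, confidences, and boxes as input and returns the filtered box coordinates in pixels, together with their labels and confidences.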
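The two steps that bracket NMS inside predict, thresholding and rescaling, can be tried in isolation. The sketch below uses three made-up candidate boxes and a hypothetical 640×480 frame; it skips NMS entirely and only shows the masking and the conversion from normalized to pixel coordinates.

```python
import numpy as np

# Hypothetical detector output for three candidate boxes (corner form,
# normalized to [0, 1]) and their face-class confidences.
boxes = np.array([
    [0.250, 0.500, 0.750, 1.000],
    [0.125, 0.125, 0.250, 0.250],
    [0.375, 0.375, 0.625, 0.625],
])
face_probs = np.array([0.95, 0.30, 0.70])
prob_threshold = 0.5
width, height = 640, 480  # size of the original frame in pixels

# Keep only boxes whose confidence exceeds the threshold.
mask = face_probs > prob_threshold
kept = boxes[mask].copy()

# Scale x coordinates by the frame width and y coordinates by the height.
kept[:, [0, 2]] *= width
kept[:, [1, 3]] *= height
pixel_boxes = kept.astype(np.int32)
print(pixel_boxes)
```

Note that the boxes are scaled by the size of the original frame, not by the 320×240 detector input, which is why predict receives the frame's width and height.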

Converting to Actual Bounding Boxes

def convert_locations_to_boxes(locations, priors, center_variance,
                               size_variance):
    # If priors has one dimension fewer than locations (no batch axis),
    # add a leading axis so the two arrays broadcast together.
    if len(priors.shape) + 1 == len(locations.shape):
        priors = np.expand_dims(priors, 0)
    # Decode centers and sizes separately, then join them on the last axis.
    return np.concatenate([
        # Center: offset * center_variance * prior size + prior center.
        locations[..., :2] * center_variance * priors[..., 2:] + priors[..., :2],
        # Size: exp(offset * size_variance) * prior size.
        np.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
    ], axis=len(locations.shape) - 1)
 
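convert_locations_to_boxes converts the predicted offsets and scales into actual bounding box coordinates. Using the priors and the scaling factors center_variance and size_variance, it decodes each prediction into a box center and size that reflect the object's position and extent.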
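The decoding formulas can be checked on a single prior. In the sketch below, the prior and the raw network output are made-up values; the arithmetic is the same as in convert_locations_to_boxes, written out for one box.

```python
import numpy as np

center_variance = 0.1
size_variance = 0.2

# One prior in center form: [cx, cy, w, h], normalized to [0, 1].
prior = np.array([0.5, 0.5, 0.2, 0.2])
# Hypothetical raw network output for that prior: [dx, dy, dw, dh].
location = np.array([1.0, -1.0, 0.0, 0.0])

# Centers: shift the prior center by an offset scaled by the prior size.
center = location[:2] * center_variance * prior[2:] + prior[:2]
# Sizes: scale the prior size by the exponential of the predicted log-ratio.
size = np.exp(location[2:] * size_variance) * prior[2:]

decoded = np.concatenate([center, size])
print(decoded)
```

An offset of 1.0 moves the center by only 0.1 × 0.2 = 0.02 in normalized units, and a size offset of 0 leaves the prior's width and height unchanged, since exp(0) = 1.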

def center_form_to_corner_form(locations):
    # locations holds [cx, cy, w, h] on its last axis.
    # Left-top corner: center minus half the size.
    # Right-bottom corner: center plus half the size.
    # Join both corners on the last axis to get [x1, y1, x2, y2].
    return np.concatenate(
        [locations[..., :2] - locations[..., 2:] / 2,
         locations[..., :2] + locations[..., 2:] / 2],
        len(locations.shape) - 1
    )

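center_form_to_corner_form converts bounding boxes from center form to corner form. With simple arithmetic on the location array, it turns each box's center coordinates and size into the coordinates of its left-top and right-bottom corners.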
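As a one-box illustration of the conversion, the sketch below applies the same arithmetic to a made-up box centered in the image.

```python
import numpy as np

# One box in center form: [cx, cy, w, h].
center_box = np.array([0.5, 0.5, 0.2, 0.4])

corner_box = np.concatenate([
    center_box[:2] - center_box[2:] / 2,  # left-top: center minus half size
    center_box[:2] + center_box[2:] / 2,  # right-bottom: center plus half size
])
print(corner_box)
```

The result is approximately [0.4, 0.3, 0.6, 0.7], which is the form that the IoU and NMS functions above expect.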

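Recognizing Facial Emotions on Live Video Frames

def FER_live_cam():
    # Map class indices to emotion labels (the eight FER+ classes).
    emotion_dict = {
        0: 'neutral',
        1: 'happiness',
        2: 'surprise',
        3: 'sadness',
        4: 'anger',
        5: 'disgust',
        6: 'fear',
        7: 'contempt'
    }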
    # Open the video capture device (the default camera).
    cap = cv2.VideoCapture(0)
    # Get the frame width and height.
    frame_width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    frame_height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    size = (frame_width, frame_height)
    # Create the video writer for the annotated output.
    result = cv2.VideoWriter('result.avi',
                         cv2.VideoWriter_fourcc(*'MJPG'),
                         10, size)

    # Load the ONNX emotion recognition model.
    model = cv2.dnn.readNetFromONNX('emotion-ferplus-8.onnx')

    # Load the Caffe face detection model.
    model_path = 'RFB-320/RFB-320.caffemodel'
    proto_path = 'RFB-320/RFB-320.prototxt'
    net = dnn.readNetFromCaffe(proto_path, model_path)
    # Detector input size as [width, height].
    input_size = [320, 240]
    width = input_size[0]
    height = input_size[1]
    priors = define_img_size(input_size)

    while cap.isOpened():
        # Read a frame.
        ret, frame = cap.read()
        if ret:
            img_ori = frame
            # Resize the frame to the detector's input size and convert
            # it from BGR to RGB.
            rect = cv2.resize(img_ori, (width, height))
            rect = cv2.cvtColor(rect, cv2.COLOR_BGR2RGB)
            # Build the input blob: subtract 127, then scale by 1 / image_std.
            net.setInput(dnn.blobFromImage(
                rect, 1 / image_std, (width, height), 127)
            )
            start_time = time.time()
            # Run face detection to get raw box offsets and scores.
            boxes, scores = net.forward(["boxes", "scores"])
            # Reshape to (1, num_priors, 4) and (1, num_priors, 2).
            boxes = np.expand_dims(np.reshape(boxes, (-1, 4)), axis=0)
            scores = np.expand_dims(np.reshape(scores, (-1, 2)), axis=0)
            # Decode offsets into center-form boxes, then into corner form.
            boxes = convert_locations_to_boxes(
                boxes, priors, center_variance, size_variance
            )
            boxes = center_form_to_corner_form(boxes)
            # Filter by confidence, apply NMS, and scale to pixel coordinates.
            boxes, labels, probs = predict(
                img_ori.shape[1],
                img_ori.shape[0],
                scores,
                boxes,
                threshold
            )
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for (x1, y1, x2, y2) in boxes:
                # Clip the box to the frame; skip it if nothing is left,
                # since an empty crop would make cv2.resize fail.
                x1, y1 = max(int(x1), 0), max(int(y1), 0)
                x2 = min(int(x2), frame.shape[1])
                y2 = min(int(y2), frame.shape[0])
                if x2 <= x1 or y2 <= y1:
                    continue
                # Crop the face from the grayscale frame and resize it
                # to the 64x64 input the emotion model expects.
                resize_frame = cv2.resize(gray[y1:y2, x1:x2], (64, 64))
                resize_frame = resize_frame.reshape(1, 1, 64, 64)
                # Feed the face to the emotion model and run inference.
                model.setInput(resize_frame)
                output = model.forward()
                end_time = time.time()
                fps = 1 / (end_time - start_time)
                print(f"FPS: {fps:.1f}")
                # Map the highest-scoring class to its label.
                pred = emotion_dict[int(np.argmax(output[0]))]
                # Draw the bounding box and the emotion label on the frame.
                cv2.rectangle(
                    frame,
                    (x1, y1),
                    (x2, y2),
                    (0, 255, 0),
                    2,
                    lineType=cv2.LINE_AA
                )
                cv2.putText(
                    frame,
                    pred,
                    (x1, y1),
                    cv2.FONT_HERSHEY_SIMPLEX,
                    0.8,
                    (0, 255, 0),
                    2,
                    lineType=cv2.LINE_AA
                )
            # Write the annotated frame to the output video file.
            result.write(frame)
            # Show the frame in a window.
            cv2.imshow('frame', frame)
            # Exit the loop when the 'q' key is pressed.
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        else:
            break
    # Release the video capture device and the video writer.
    cap.release()
    result.release()
    # Close all windows.
    cv2.destroyAllWindows()

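FER_live_cam() runs real-time facial emotion recognition on video frames. It first defines emotion_dict, a dictionary that maps numeric class indices to readable emotion labels. It then opens the video source; the listing above reads from the default camera, and a video file path can be passed to cv2.VideoCapture instead. It also creates a video writer that saves the processed, labeled frames. The emotion model is stored in ONNX format and loaded with OpenCV DNN's readNetFromONNX; the RFB-320 SSD face detector, stored in Caffe format, is loaded alongside it. As the video is processed frame by frame, the face detector locates faces and returns them as bounding boxes.

Each detected face is preprocessed before it goes to the emotion model: it is cropped from the grayscale frame and resized. The recognized emotion is the class with the highest score in the model output, mapped to a label through emotion_dict. A rectangle and the emotion label are then drawn around the detected face, the frame is written to the output video file, and it is displayed in real time. Pressing the "q" key stops the display. When processing finishes or is interrupted, the function releases the video capture and writer and closes any open windows.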
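The last step, turning the model output into a label, can be sketched on its own. The example below assumes the model returns one raw score per class in the FER+ label order listed in EMOTIONS, and that a softmax is wanted to express the winning score as a probability; the scores themselves are made up, and softmax and top_emotion are hypothetical helper names for this illustration.

```python
import numpy as np

# Label order assumed to match the eight FER+ classes.
EMOTIONS = [
    'neutral', 'happiness', 'surprise', 'sadness',
    'anger', 'disgust', 'fear', 'contempt',
]


def softmax(scores):
    """Numerically stable softmax over a 1-D array of raw scores."""
    shifted = scores - np.max(scores)
    exp = np.exp(shifted)
    return exp / exp.sum()


def top_emotion(raw_scores):
    """Return the most likely label and its probability."""
    probs = softmax(np.asarray(raw_scores, dtype=np.float64))
    index = int(np.argmax(probs))
    return EMOTIONS[index], float(probs[index])


# Hypothetical raw scores for one face, one value per class.
label, confidence = top_emotion([0.2, 3.1, 0.5, -1.0, -0.3, -2.0, -1.5, -0.8])
print(label, round(confidence, 2))
```

Softmax does not change which class wins, so taking the argmax of the raw scores gives the same label; the probability is only useful if a confidence value is to be shown or thresholded.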

Results

Below are some results from running the model on videos showing different facial emotions.

[Figure: sample inference results]

Appendix: Full Code

import cv2
import numpy as np
import time

from cv2 import dnn
from math import ceil

image_mean = np.array([127, 127, 127])
image_std = 128.0
iou_threshold = 0.3
center_variance = 0.1
size_variance = 0.2
min_boxes = [
    [10.0, 16.0, 24.0], 
    [32.0, 48.0], 
    [64.0, 96.0], 
    [128.0, 192.0, 256.0]
]
strides = [8.0, 16.0, 32.0, 64.0]
threshold = 0.5

def define_img_size(image_size):
    shrinkage_list = []
    feature_map_w_h_list = []
    for size in image_size:
        feature_map = [int(ceil(size / stride)) for stride in strides]
        feature_map_w_h_list.append(feature_map)

    for i in range(0, len(image_size)):
        shrinkage_list.append(strides)
    priors = generate_priors(
        feature_map_w_h_list, shrinkage_list, image_size, min_boxes
    )
    return priors


def generate_priors(
    feature_map_list, shrinkage_list, image_size, min_boxes
):
    priors = []
    for index in range(0, len(feature_map_list[0])):
        scale_w = image_size[0] / shrinkage_list[0][index]
        scale_h = image_size[1] / shrinkage_list[1][index]
        for j in range(0, feature_map_list[1][index]):
            for i in range(0, feature_map_list[0][index]):
                x_center = (i + 0.5) / scale_w
                y_center = (j + 0.5) / scale_h

                for min_box in min_boxes[index]:
                    w = min_box / image_size[0]
                    h = min_box / image_size[1]
                    priors.append([
                        x_center,
                        y_center,
                        w,
                        h
                    ])
    print("priors nums:{}".format(len(priors)))
    return np.clip(priors, 0.0, 1.0)


def hard_nms(box_scores, iou_threshold, top_k=-1, candidate_size=200):
    scores = box_scores[:, -1]
    boxes = box_scores[:, :-1]
    picked = []
    indexes = np.argsort(scores)
    indexes = indexes[-candidate_size:]
    while len(indexes) > 0:
        current = indexes[-1]
        picked.append(current)
        if 0 < top_k == len(picked) or len(indexes) == 1:
            break
        current_box = boxes[current, :]
        indexes = indexes[:-1]
        rest_boxes = boxes[indexes, :]
        iou = iou_of(
            rest_boxes,
            np.expand_dims(current_box, axis=0),
        )
        indexes = indexes[iou <= iou_threshold]
    return box_scores[picked, :]


def area_of(left_top, right_bottom):
    hw = np.clip(right_bottom - left_top, 0.0, None)
    return hw[..., 0] * hw[..., 1]


def iou_of(boxes0, boxes1, eps=1e-5):
    overlap_left_top = np.maximum(boxes0[..., :2], boxes1[..., :2])
    overlap_right_bottom = np.minimum(boxes0[..., 2:], boxes1[..., 2:]) 

    overlap_area = area_of(overlap_left_top, overlap_right_bottom)
    area0 = area_of(boxes0[..., :2], boxes0[..., 2:])
    area1 = area_of(boxes1[..., :2], boxes1[..., 2:])
    return overlap_area / (area0 + area1 - overlap_area + eps)


def predict(
    width, 
    height, 
    confidences, 
    boxes, 
    prob_threshold, 
    iou_threshold=0.3, 
    top_k=-1
):
    boxes = boxes[0]
    confidences = confidences[0]
    picked_box_probs = []
    picked_labels = []
    for class_index in range(1, confidences.shape[1]):
        probs = confidences[:, class_index]
        mask = probs > prob_threshold
        probs = probs[mask]
        if probs.shape[0] == 0:
            continue
        subset_boxes = boxes[mask, :]
        box_probs = np.concatenate(
            [subset_boxes, probs.reshape(-1, 1)], axis=1
        )
        box_probs = hard_nms(box_probs,
                             iou_threshold=iou_threshold,
                             top_k=top_k,
                             )
        picked_box_probs.append(box_probs)
        picked_labels.extend([class_index] * box_probs.shape[0])
    if not picked_box_probs:
        return np.array([]), np.array([]), np.array([])
    picked_box_probs = np.concatenate(picked_box_probs)
    picked_box_probs[:, 0] *= width
    picked_box_probs[:, 1] *= height
    picked_box_probs[:, 2] *= width
    picked_box_probs[:, 3] *= height
    return (
        picked_box_probs[:, :4].astype(np.int32), 
        np.array(picked_labels), 
        picked_box_probs[:, 4]
    )


def convert_locations_to_boxes(locations, priors, center_variance,
                               size_variance):
    if len(priors.shape) + 1 == len(locations.shape):
        priors = np.expand_dims(priors, 0)
    return np.concatenate([
        locations[..., :2] * center_variance * priors[..., 2:] + priors[..., :2],
        np.exp(locations[..., 2:] * size_variance) * priors[..., 2:]
    ], axis=len(locations.shape) - 1)


def center_form_to_corner_form(locations):
    return np.concatenate(
        [locations[..., :2] - locations[..., 2:] / 2,
         locations[..., :2] + locations[..., 2:] / 2], 
        len(locations.shape) - 1
    )


def FER_live_cam():
    # Map class indices to emotion labels (the eight FER+ classes).
    emotion_dict = {
        0: 'neutral', 
        1: 'happiness', 
        2: 'surprise', 
        3: 'sadness',
        4: 'anger', 
        5: 'disgust', 
        6: 'fear',
        7: 'contempt'
    }

    cap = cv2.VideoCapture('video3.mp4')
    # cap = cv2.VideoCapture(0)

    frame_width = int(cap.get(3))
    frame_height = int(cap.get(4))
    size = (frame_width, frame_height)
    result = cv2.VideoWriter('infer2-test.avi', 
                         cv2.VideoWriter_fourcc(*'MJPG'),
                         10, size)

    # Read the ONNX emotion recognition model.
    model = cv2.dnn.readNetFromONNX('emotion-ferplus-8.onnx')
    
    # Read the Caffe face detector.
    model_path = 'RFB-320/RFB-320.caffemodel'
    proto_path = 'RFB-320/RFB-320.prototxt'
    net = dnn.readNetFromCaffe(proto_path, model_path)
    input_size = [320, 240]
    width = input_size[0]
    height = input_size[1]
    priors = define_img_size(input_size)

    while cap.isOpened():
        ret, frame = cap.read()
        if ret:
            img_ori = frame
            #print("frame size: ", frame.shape)
            rect = cv2.resize(img_ori, (width, height))
            rect = cv2.cvtColor(rect, cv2.COLOR_BGR2RGB)
            net.setInput(dnn.blobFromImage(
                rect, 1 / image_std, (width, height), 127)
            )
            start_time = time.time()
            boxes, scores = net.forward(["boxes", "scores"])
            boxes = np.expand_dims(np.reshape(boxes, (-1, 4)), axis=0)
            scores = np.expand_dims(np.reshape(scores, (-1, 2)), axis=0)
            boxes = convert_locations_to_boxes(
                boxes, priors, center_variance, size_variance
            )
            boxes = center_form_to_corner_form(boxes)
            boxes, labels, probs = predict(
                img_ori.shape[1], 
                img_ori.shape[0], 
                scores, 
                boxes, 
                threshold
            )
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
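            for (x1, y1, x2, y2) in boxes:
                # Clip the box to the frame; skip it if nothing is left,
                # since an empty crop would make cv2.resize fail.
                x1, y1 = max(int(x1), 0), max(int(y1), 0)
                x2 = min(int(x2), frame.shape[1])
                y2 = min(int(y2), frame.shape[0])
                if x2 <= x1 or y2 <= y1:
                    continue
                cv2.rectangle(frame, (x1,y1), (x2, y2), (255,0,0), 2)
                resize_frame = cv2.resize(
                    gray[y1:y2, x1:x2], (64, 64)
                )
                resize_frame = resize_frame.reshape(1, 1, 64, 64)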
                model.setInput(resize_frame)
                output = model.forward()
                end_time = time.time()
                fps = 1 / (end_time - start_time)
                print(f"FPS: {fps:.1f}")
                pred = emotion_dict[list(output[0]).index(max(output[0]))]
                cv2.rectangle(
                    img_ori, 
                    (x1, y1), 
                    (x2, y2), 
                    (215, 5, 247), 
                    2,
                    lineType=cv2.LINE_AA
                )
                cv2.putText(
                    frame, 
                    pred, 
                    (x1, y1-10), 
                    cv2.FONT_HERSHEY_SIMPLEX, 
                    0.8, 
                    (215, 5, 247), 
                    2,
                    lineType=cv2.LINE_AA
                )

            result.write(frame)
        
            cv2.imshow('frame', frame)
            if cv2.waitKey(1) & 0xFF == ord('q'):
                break
        else:
            break

    cap.release()
    result.release()
    cv2.destroyAllWindows()

if __name__ == "__main__":
    FER_live_cam()