Abstract
-
SORT(Simple Online and Realtime Tracking)
-
We integrate appearance information to improve the performance of SORT.
-
We are able to track objects through longer periods of occlusion, effectively reducing the number of identity switches. (The number of identity switches, or ID switches, is one of the more important metrics in object tracking.)
-
In an offline training stage, we learn a deep association metric on a large-scale person re-identification dataset.
-
During online application, we establish measurement-to-track associations using nearest neighbor queries in visual appearance space.
-
Experimental evaluation shows that our extensions reduce the number of identity switches by 45%, achieving overall competitive performance at high frame rates.
1. Introduction
-
Due to recent progress in object detection, tracking-by-detection has become the leading paradigm in multiple object tracking. Within this paradigm, object trajectories are usually found in a global optimization problem that processes entire video batches at once.
-
However, due to batch processing, these methods are not applicable in online scenarios where a target identity must be available at each time step. More traditional methods are Multiple Hypothesis Tracking (MHT) [8] and the Joint Probabilistic Data Association Filter (JPDAF) [9]. These methods perform data association on a frame-by-frame basis.
-
In the JPDAF, a single state hypothesis is generated by weighting individual measurements by their association likelihoods. In MHT, all possible hypotheses are tracked, but pruning schemes must be applied for computational tractability. Both methods have recently been revisited in a tracking-by-detection scenario [10, 11] and shown promising results. However, the performance of these methods comes at increased computational and implementation complexity.
-
Simple online and realtime tracking (SORT) is a much simpler framework that performs Kalman filtering in image space and frame-by-frame data association using the Hungarian method with an association metric that measures bounding box overlap. (The Hungarian method is the assignment solver; the overlap measure is the association metric it optimizes.)
-
SORT returns a relatively high number of identity switches. This is because the employed association metric is only accurate when state estimation uncertainty is low. Therefore, SORT has a deficiency in tracking through occlusions as they typically appear in frontal-view camera scenes.
-
We overcome this issue by replacing the association metric with a more informed metric that combines motion and appearance information.
-
In particular, we apply a convolutional neural network (CNN) that has been trained to discriminate pedestrians on a large-scale person re-identification dataset.
2. SORT WITH DEEP ASSOCIATION METRIC
-
We adopt a conventional single hypothesis tracking methodology with recursive Kalman filtering and frame-by-frame data association.
2.1 Track Handling and State Estimation
-
Our tracking scenario is defined on the eight-dimensional state space $(u, v, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$ that contains the bounding box center position $(u, v)$, aspect ratio $\gamma$, height $h$, and their respective velocities in image coordinates. We use a standard Kalman filter with constant velocity motion and linear observation model, where we take the bounding coordinates $(u, v, \gamma, h)$ as direct observations of the object state.
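The constant velocity model described above can be sketched with NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the time step `dt = 1`, and the noise covariances `Q` and `R` (set to zero here) are assumptions, since these notes do not specify them.

```python
import numpy as np

def constant_velocity_model(dt=1.0):
    """Return (F, H) for the state (u, v, gamma, h) plus the four velocities."""
    ndim = 4
    F = np.eye(2 * ndim)            # 8x8 state transition matrix
    for i in range(ndim):
        F[i, ndim + i] = dt         # each position component += velocity * dt
    H = np.eye(ndim, 2 * ndim)      # 4x8: (u, v, gamma, h) are observed directly
    return F, H

def predict(mean, cov, F, Q):
    """Kalman prediction step; Q is an assumed process noise covariance."""
    return F @ mean, F @ cov @ F.T + Q

def project(mean, cov, H, R):
    """Project the state into measurement space; R is an assumed measurement noise."""
    return H @ mean, H @ cov @ H.T + R

F, H = constant_velocity_model()
mean = np.array([10.0, 20.0, 0.5, 100.0, 1.0, 2.0, 0.0, 0.0])
cov = np.eye(8)
mean, cov = predict(mean, cov, F, Q=np.zeros((8, 8)))
y, S = project(mean, cov, H, R=np.zeros((4, 4)))
```

The pair `(y, S)` is the predicted measurement and its covariance, which is what the motion metric in Section 2.2 compares against each detection.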
-
For each track $k$ we count the number of frames since the last successful measurement association $a_k$. This counter is incremented during Kalman filter prediction and reset to 0 when the track has been associated with a measurement.
-
Tracks that exceed a predefined maximum age $A_{max}$ are considered to have left the scene and are deleted from the track set.
-
New track hypotheses are initiated for each detection that cannot be associated to an existing track. These new tracks are classified as tentative during their first three frames. During this time, we expect a successful measurement association at each time step. Tracks that are not successfully associated to a measurement within their first three frames are deleted.
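The track life cycle described in this section (age counter, tentative phase, deletion) can be sketched as a small state machine. The class and method names are hypothetical, and `max_age = 30` is only a placeholder for $A_{max}$, whose value these notes do not give; the three-frame tentative period comes from the text, with the creating detection counted as the first hit (an assumption).

```python
from dataclasses import dataclass
from enum import Enum

class TrackState(Enum):
    TENTATIVE = 1
    CONFIRMED = 2
    DELETED = 3

@dataclass
class Track:
    track_id: int
    n_init: int = 3              # frames a track stays tentative (from the text)
    max_age: int = 30            # placeholder for A_max (assumed value)
    hits: int = 1                # the detection that created the track counts as a hit
    time_since_update: int = 0   # the counter a_k
    state: TrackState = TrackState.TENTATIVE

    def predict(self):
        """Called once per frame during Kalman prediction."""
        self.time_since_update += 1

    def update(self):
        """Called when a measurement has been associated with this track."""
        self.hits += 1
        self.time_since_update = 0
        if self.state is TrackState.TENTATIVE and self.hits >= self.n_init:
            self.state = TrackState.CONFIRMED

    def mark_missed(self):
        """Called when no measurement was associated in the current frame."""
        if self.state is TrackState.TENTATIVE:
            self.state = TrackState.DELETED      # tentative tracks must match every frame
        elif self.time_since_update > self.max_age:
            self.state = TrackState.DELETED      # considered to have left the scene
```

A tracker would call `predict()` on every track each frame, then `update()` on matched tracks and `mark_missed()` on the rest.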
2.2 Assignment Problem
-
A conventional way to solve the association between the predicted Kalman states and newly arrived measurements is to build an assignment problem that can be solved using the Hungarian algorithm. Into this problem formulation we integrate motion and appearance information through combination of two appropriate metrics.
-
To incorporate motion information we use the (squared) Mahalanobis distance between predicted Kalman states and newly arrived measurements. The Mahalanobis distance takes state estimation uncertainty into account by measuring how many standard deviations the detection is away from the mean track location:
$$d^{(1)}(i,j) = (d_j - y_i)^T S_i^{-1} (d_j - y_i)$$
where $i$ indexes tracks and $j$ indexes bounding box detections: $(y_i, S_i)$ is the projection of the $i$-th track distribution into measurement space and $d_j$ is the $j$-th detection.
-
Further, using this metric it is possible to exclude unlikely associations by thresholding the Mahalanobis distance at a 95% confidence interval computed from the inverse $\chi^2$ distribution. We denote this decision with an indicator
$$b_{i,j}^{(1)} = \mathbb{1}[d^{(1)}(i,j) \leqslant t^{(1)}]$$
that evaluates to 1 if the association between the $i$-th track and $j$-th detection is admissible. For the four-dimensional measurement space the threshold is $t^{(1)} = 9.4877$.
-
While the Mahalanobis distance is a suitable association metric when motion uncertainty is low, in our image-space problem formulation the predicted state distribution obtained from the Kalman filtering framework provides only a rough estimate of the object location. In particular, unaccounted camera motion can introduce rapid displacements in the image plane, making the Mahalanobis distance a rather uninformed metric for tracking through occlusions. Therefore, we integrate a second metric into the assignment problem.
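The squared Mahalanobis distance $d^{(1)}$ and its gate can be sketched as follows. The function names are hypothetical and the example covariance is made up for illustration; the threshold 9.4877 is the value quoted in these notes for the four-dimensional measurement space.

```python
import numpy as np

CHI2_95_4DOF = 9.4877  # t^(1): 95% quantile of chi-square with 4 degrees of freedom

def squared_mahalanobis(detection, track_mean, track_cov):
    """d1 = (d_j - y_i)^T S_i^{-1} (d_j - y_i), with (y_i, S_i) in measurement space."""
    diff = detection - track_mean
    # Solve S x = diff instead of forming the explicit inverse of S.
    return float(diff @ np.linalg.solve(track_cov, diff))

def motion_gate(d1, threshold=CHI2_95_4DOF):
    """Indicator b1: True if the association is admissible under the motion metric."""
    return d1 <= threshold

y_i = np.zeros(4)                          # projected track mean (made-up example)
S_i = np.diag([4.0, 4.0, 1.0, 1.0])        # projected track covariance (made-up example)
near = squared_mahalanobis(np.array([2.0, 0.0, 0.0, 0.0]), y_i, S_i)
far = squared_mahalanobis(np.array([10.0, 0.0, 0.0, 0.0]), y_i, S_i)
```

Here the first detection lies one standard deviation from the track mean and passes the gate, while the second lies five standard deviations away and is rejected.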
-
For each bounding box detection $d_j$ we compute an appearance descriptor $r_j$ with $\| r_j \| = 1$. Further, we keep a gallery $R_k = \{r_k^{(i)}\}_{k=1}^{L_k}$ of the last $L_k = 100$ associated appearance descriptors for each track $k$. Then, our second metric measures the smallest cosine distance between the $i$-th track and $j$-th detection in appearance space:
$$d^{(2)}(i,j) = \min\{ 1 - r_j^T r_k^{(i)} \mid r_k^{(i)} \in R_i \}$$
-
Again, we introduce a binary variable to indicate if an association is admissible according to this metric:
$$b^{(2)}_{i,j} = \mathbb{1}[d^{(2)}(i,j) \leq t^{(2)}]$$
The indicator is 1 if the condition holds and 0 otherwise. In practice, we apply a pre-trained CNN to compute bounding box appearance descriptors.
-
In combination, both metrics complement each other by serving different aspects of the assignment problem. On the one hand, the Mahalanobis distance provides information about possible object locations based on motion that is particularly useful for short-term predictions. On the other hand, the cosine distance considers appearance information that is particularly useful to recover identities after long-term occlusions, when motion is less discriminative. To build the association problem we combine both metrics using a weighted sum
$$c_{i,j} = \lambda\, d^{(1)}(i,j) + (1 - \lambda)\, d^{(2)}(i,j)$$
where we call an association admissible if it is within the gating region of both metrics:
$$b_{i,j} = \prod_{m=1}^{2} b_{i,j}^{(m)}$$
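The appearance term $d^{(2)}$ that enters this weighted sum can be sketched as a minimum over a bounded gallery. The helper names are hypothetical and the two-dimensional descriptors are toy values; descriptors are assumed to be unit-norm already, so the dot product equals the cosine similarity.

```python
from collections import deque
import numpy as np

GALLERY_SIZE = 100  # L_k: number of most recent descriptors kept per track

def new_gallery():
    """A per-track gallery that silently drops the oldest descriptor when full."""
    return deque(maxlen=GALLERY_SIZE)

def appearance_distance(gallery, descriptor):
    """Smallest cosine distance between one detection and a track's gallery."""
    stored = np.asarray(list(gallery), dtype=float)   # shape (n, dim), rows unit-norm
    return float(np.min(1.0 - stored @ descriptor))

gallery = new_gallery()
gallery.append(np.array([1.0, 0.0]))
gallery.append(np.array([0.0, 1.0]))
d2 = appearance_distance(gallery, np.array([0.6, 0.8]))
```

Taking the minimum means a detection only has to resemble one of the stored views of the track, which is what helps after a long occlusion.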
-
The influence of each metric on the combined association cost can be controlled through hyperparameter $\lambda$. During our experiments we found that setting $\lambda = 0$ is a reasonable choice when there is substantial camera motion. In this setting, only appearance information is used in the association cost term. However, the Mahalanobis gate is still used to disregard infeasible assignments based on possible object locations inferred by the Kalman filter.
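A minimal sketch of the combined cost and joint gate, operating on whole track-by-detection matrices. The function name is hypothetical, and the appearance threshold `t2 = 0.2` is a placeholder because these notes do not state the value of $t^{(2)}$.

```python
import numpy as np

def association_cost(d1, d2, lam=0.0, t1=9.4877, t2=0.2):
    """Return (cost, admissible) for distance matrices d1 (motion) and d2 (appearance).

    t2 is an assumed placeholder; lam = 0 is the setting the text suggests
    for sequences with substantial camera motion.
    """
    cost = lam * d1 + (1.0 - lam) * d2
    admissible = (d1 <= t1) & (d2 <= t2)   # product of the two binary gates
    return cost, admissible

# One track, two detections: the second looks similar but is far away in motion terms.
d1 = np.array([[1.0, 20.0]])
d2 = np.array([[0.10, 0.05]])
cost, admissible = association_cost(d1, d2, lam=0.0)
```

With $\lambda = 0$ the cost equals the appearance distance, yet the second detection is still excluded because it fails the Mahalanobis gate.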
2.3 Matching Cascade
-
When an object is occluded for a longer period of time, subsequent Kalman filter predictions increase the uncertainty associated with the object location. Consequently, probability mass spreads out in state space and the observation likelihood becomes less peaked. Intuitively, the association metric should account for this spread of probability mass by increasing the measurement-to-track distance.
-
Counterintuitively, when two tracks compete for the same detection, the Mahalanobis distance favors the track with larger uncertainty, because larger uncertainty effectively reduces the distance in standard deviations of any detection towards the projected track mean. This is an undesired behavior as it can lead to increased track fragmentation and unstable tracks.
-
Therefore, we introduce a matching cascade that gives priority to more frequently seen objects to encode our notion of probability spread in the association likelihood.
-
Listing 1 outlines our matching algorithm. As input we provide the set of track $T$ and detection $D$ indices as well as the maximum age $A_{max}$. In lines 1 and 2 we compute the association cost matrix and the matrix of admissible associations. We then iterate over track age $n$ to solve a linear assignment problem for tracks of increasing age. In line 6 we select the subset of tracks $T_n$ that have not been associated with a detection in the last $n$ frames. In line 7 we solve the linear assignment between tracks in $T_n$ and unmatched detections $U$. In lines 8 and 9 we update the set of matches and unmatched detections, which we return after completion in line 11. Note that this matching cascade gives priority to tracks of smaller age, i.e., tracks that have been seen more recently.
Note: $C = [c_{i,j}]$ is the weighted cost matrix and $B = [b_{i,j}]$ is the matrix of admissible associations, i.e., the product of the two gate indicators.
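Since Listing 1 itself is not reproduced in these notes, the following is a hedged reconstruction of the cascade from the description above, not the authors' code. Two assumptions are worth flagging: `solve_assignment` is a brute-force stand-in for the Hungarian algorithm that is only practical for toy inputs, and `track_ages[i]` is taken to be the number of frames since the last association, counted after prediction, so a track matched in the previous frame has age 1.

```python
from itertools import permutations

BIG = 1e5  # cost given to inadmissible pairs before solving

def solve_assignment(cost, rows, cols):
    """Exact min-cost assignment by brute force (toy stand-in for the Hungarian algorithm)."""
    if not rows or not cols:
        return []
    swap = len(rows) > len(cols)
    small, large = (cols, rows) if swap else (rows, cols)
    best, best_total = [], float("inf")
    for perm in permutations(large, len(small)):
        pairs = [(p, s) if swap else (s, p) for s, p in zip(small, perm)]
        total = sum(cost[i][j] for i, j in pairs)
        if total < best_total:
            best, best_total = pairs, total
    return best

def matching_cascade(cost, gate, track_ages, num_detections, max_age):
    """Match tracks to detections, youngest age first.

    cost[i][j] is c_ij, gate[i][j] is b_ij (bool), track_ages[i] is the age of track i.
    Returns (matches, unmatched_detections).
    """
    gated = [[c if g else BIG for c, g in zip(crow, grow)]
             for crow, grow in zip(cost, gate)]
    matches, unmatched = [], list(range(num_detections))
    for n in range(1, max_age + 1):
        if not unmatched:
            break
        tracks_n = [i for i, age in enumerate(track_ages) if age == n]
        pairs = solve_assignment(gated, tracks_n, unmatched)
        accepted = [(i, j) for i, j in pairs if gate[i][j]]   # drop gated-out pairs
        matches.extend(accepted)
        taken = {j for _, j in accepted}
        unmatched = [j for j in unmatched if j not in taken]
    return matches, unmatched

# Two tracks compete for one detection; the older track has the lower cost.
matches, unmatched = matching_cascade(
    cost=[[0.5], [0.1]], gate=[[True], [True]],
    track_ages=[1, 2], num_detections=1, max_age=30)
```

In the example the detection goes to track 0 (age 1) even though track 1 (age 2) has the lower cost, which is exactly the priority the cascade is meant to enforce.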
-
In a final matching stage, we run intersection over union association as proposed in the original SORT algorithm [12] on the set of unconfirmed and unmatched tracks of age $n = 1$. This helps to account for sudden appearance changes, e.g., due to partial occlusion with static scene geometry, and to increase robustness against erroneous initialization.
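The overlap measure used in this final stage can be sketched as below. The `(x1, y1, x2, y2)` corner format is an assumption made for the example; the tracker state itself stores center, aspect ratio, and height, so a conversion would be needed in practice.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))   # intersection 1, union 7
```

An assignment cost for this stage can then be taken as `1 - iou(...)`, so that larger overlap means lower cost.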
2.4 Deep Appearance Descriptor
-
To this end, we employ a CNN that has been trained on a large-scale person re-identification dataset [21] that contains over 1,100,000 images of 1,261 pedestrians, making it well suited for deep metric learning in a people tracking context.
-
The CNN architecture of our network is shown in Table 1. In summary, we employ a wide residual network [22] with two convolutional layers followed by six residual blocks. The global feature map of dimensionality 128 is computed in dense layer 10. A final batch and $\ell_2$ normalization projects features onto the unit hypersphere to be compatible with our cosine appearance metric.
3. EXPERIMENTS
-
Evaluation is carried out according to the following metrics: MOTA/MOTP/MT/ML/ID/FM
-
Multi-object tracking accuracy (MOTA): Summary of overall tracking accuracy in terms of false positives, false negatives and identity switches [23].
-
Multi-object tracking precision (MOTP): Summary of overall tracking precision in terms of bounding box overlap between ground-truth and reported location [23].
-
Mostly tracked (MT): Percentage of ground-truth tracks that have the same label for at least 80% of their life span.
-
Mostly lost (ML): Percentage of ground-truth tracks that are tracked for at most 20% of their life span.
-
Identity switches (ID): Number of times the reported identity of a ground-truth track changes.
-
Fragmentation (FM): Number of times a track is interrupted by a missing detection.
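As a small worked example of how the three error types enter MOTA, the standard CLEAR MOT definition referenced as [23] is one minus the total number of errors divided by the total number of ground-truth objects, with all counts summed over every frame. The function and argument names below are hypothetical, and the numbers are made up for illustration.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over the sequence."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

score = mota(false_negatives=10, false_positives=5, id_switches=5, num_gt=100)
```

Because false positives appear directly in the numerator, a detector that fires often on background can lower MOTA even when identities are tracked well.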
-
-
The reported tracking accuracy is mostly impaired by a larger number of false positives. Given their overall impact on the MOTA score, applying a larger confidence threshold to the detections can potentially increase the reported performance of our algorithm by a large margin.
-
However, visual inspection of the tracking output shows that these false positives are mostly generated from sporadic detector responses at static scene geometry.
4. CONCLUSION
-
We have presented an extension to SORT that incorporates appearance information through a pre-trained association metric. Due to this extension, we are able to track through longer periods of occlusion, making SORT a strong competitor to state-of-the-art online tracking algorithms. Yet, the algorithm remains simple to implement and runs in real time.