Audio/Video Synchronization

Audio System Overview

Audio delay: payload_delay_ms and jitter_delay (NetEq).
WebRTC has NetEq, so audio stalls and their durations are computed inside NetEq.

Timestamps

There are three main timestamp concepts:

NTP timestamp: an absolute timestamp
Local timestamp
RTP timestamp: a relative timestamp

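NTP timestamp

The RTP standard does not require the first packets of the audio and video streams to be captured and sent at the same moment, so for a short period at the start there may be only audio or only video. Network loss can also drop the first few packets of either stream. The receiver therefore cannot assume that the first audio packet and the first video packet it receives are aligned. It needs a common time base to align them, and that is the role of the NTP timestamp.

An NTP timestamp is the number of seconds elapsed since 00:00:00 on January 1, 1900. The sender periodically sends RTCP Sender Report (SR) packets, separately for the video stream and for the audio stream. Each SR carries an RTP timestamp together with the corresponding NTP timestamp, so the receiver can establish the mapping between RTP and NTP timestamps for each stream and bring the audio and video timestamps onto the same time base.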

[Figure: unaligned audio and video RTP streams with periodic SR packets]

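As the figure above shows, the sender's audio and video streams are not aligned, but SR packets are sent periodically. From the RTP and NTP timestamps in the audio and video SR packets, the receiver uses linear regression to obtain the mapping between the NTP timestamp Tntp and the RTP timestamp Trtp: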

Tntp_audio = f(Trtp_audio)

Tntp_video = f(Trtp_video)

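Here Tntp = f(Trtp) = k * Trtp + b is a linear function, with its own k and b for each stream. Whenever the receiver gets an RTP packet, it can convert the RTP timestamp to an NTP timestamp and synchronize audio and video on the same time base.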
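
To make the linear mapping concrete, here is a minimal sketch of fitting k and b from a handful of SR samples with ordinary least squares. The types SrSample and LinearMap and the function FitRtpToNtp are hypothetical names for this example, not WebRTC's API, and the sketch assumes the RTP timestamps have already been unwrapped to 64 bits.

```cpp
#include <cstdint>
#include <vector>

// One (RTP timestamp, NTP time in ms) pair taken from an RTCP Sender Report.
// Hypothetical type; RTP timestamps are assumed to be unwrapped to 64 bits.
struct SrSample {
  int64_t rtp_timestamp;
  double ntp_ms;
};

// The fitted line ntp_ms = k * rtp + b.
struct LinearMap {
  double k;
  double b;
  double RtpToNtpMs(int64_t rtp) const {
    return k * static_cast<double>(rtp) + b;
  }
};

// Ordinary least-squares fit over the SR samples. Returns false when the
// samples do not contain at least two distinct RTP timestamps.
bool FitRtpToNtp(const std::vector<SrSample>& samples, LinearMap* out) {
  if (samples.size() < 2) return false;
  // Work relative to the first sample to keep the sums small.
  const double x0 = static_cast<double>(samples.front().rtp_timestamp);
  const double y0 = samples.front().ntp_ms;
  double sx = 0, sy = 0, sxx = 0, sxy = 0;
  for (const SrSample& s : samples) {
    const double x = static_cast<double>(s.rtp_timestamp) - x0;
    const double y = s.ntp_ms - y0;
    sx += x;
    sy += y;
    sxx += x * x;
    sxy += x * y;
  }
  const double n = static_cast<double>(samples.size());
  const double denom = n * sxx - sx * sx;
  if (denom == 0.0) return false;  // All RTP timestamps are identical.
  out->k = (n * sxy - sx * sy) / denom;
  const double b_rel = (sy - out->k * sx) / n;
  // Undo the shift: y - y0 = k * (x - x0) + b_rel.
  out->b = y0 + b_rel - out->k * x0;
  return true;
}
```

For a 90 kHz video stream the fitted slope comes out close to 1/90 ms per tick, so any RTP timestamp between two SRs can be placed on the sender's NTP timeline.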

Local timestamp

Counted from system boot.

RTP timestamp

The RTP timestamp defines the sampling instant of the payload data and describes the ordering of payload frames.

“The timestamp reflects the sampling instant of the first octet in the RTP data packet. The sampling instant must be derived from a clock that increments monotonically and linearly in time to allow synchronization and jitter calculations. The resolution of the clock must be sufficient for the desired synchronization accuracy and for measuring packet arrival jitter (one tick per video frame is typically not sufficient). ”

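In other words, the RTP timestamp comes from a monotonically and linearly increasing clock whose resolution is determined by the sampling rate. Video typically uses a 90 kHz clock, so an increment of 1 in the timestamp corresponds to 1/90000 s of real time.

Equivalently, when real time advances by 1 s, the RTP timestamp advances by 90000.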
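
The tick arithmetic can be sketched as two small helpers. The names are made up for this example, and the 48000 Hz audio clock is an assumption added for contrast; only the 90 kHz video clock comes from the text above.

```cpp
#include <cstdint>

// Clock rates in Hz. 90000 is the video rate named above; 48000 for audio is
// an assumption made for this sketch.
constexpr int64_t kVideoClockHz = 90000;
constexpr int64_t kAudioClockHz = 48000;

// Elapsed media time in milliseconds for a difference of RTP ticks
// (truncates toward zero).
constexpr int64_t TicksToMs(int64_t ticks, int64_t clock_hz) {
  return ticks * 1000 / clock_hz;
}

// RTP ticks that elapse in `ms` milliseconds of media time.
constexpr int64_t MsToTicks(int64_t ms, int64_t clock_hz) {
  return ms * clock_hz / 1000;
}

static_assert(MsToTicks(1000, kVideoClockHz) == 90000,
              "1 s of video is 90000 ticks");
```

With a 90 kHz clock, 1 ms is exactly 90 ticks, which is why a factor of 90 is enough to turn a millisecond time into a video RTP timestamp.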

Inside WebRTC, the NTP time is computed as follows:

// Capture time may come from clock with an offset and drift from clock_.
int64_t capture_ntp_time_ms;
if (video_frame.ntp_time_ms() > 0) {  // The value is 0 here, so this branch is not taken.
  capture_ntp_time_ms = video_frame.ntp_time_ms();
} else if (video_frame.render_time_ms() != 0) {
  // render_time_ms is derived from timestamp_us_ (local time), set at capture.
  capture_ntp_time_ms = video_frame.render_time_ms() + delta_ntp_internal_ms_;
} else {
  capture_ntp_time_ms = current_time_ms + delta_ntp_internal_ms_;
}
incoming_frame.set_ntp_time_ms(capture_ntp_time_ms);
// delta_ntp_internal_ms_ is initialized in the constructor as:
//   delta_ntp_internal_ms_(clock_->CurrentNtpInMilliseconds() -
//                          clock_->TimeInMilliseconds())
// Convert NTP time, in ms, to RTP timestamp.
const int kMsToRtpTimestamp = 90;
incoming_frame.set_timestamp(
    kMsToRtpTimestamp * static_cast<uint32_t>(incoming_frame.ntp_time_ms()));

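From the above, the NTP time is a linear transform of the local timestamp, and the RTP timestamp is a linear transform of the NTP time, so the NTP time and the RTP timestamp are also linearly related. The NTP time and the local timestamp are essentially the same quantity expressed on different scales. The NTP time and the RTP timestamp represent the same instant and differ in resolution: the NTP time is absolute and is expressed in milliseconds here, while the RTP timestamp depends on the media sampling rate.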

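Synchronization Target

Audio and video do not need to be perfectly synchronized, but the offset must stay within a threshold to go unnoticed.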

A positive value (+) means audio leads video.

(-100 ms, +25 ms): imperceptible

(-125 ms, +45 ms): perceptible

(-∞, -185 ms) ∪ (+90 ms, +∞): degrades the experience

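The tolerance for audio lagging video is larger than for audio leading video. The main reason is that in everyday life we see an event before we hear it, and we have grown used to that order. We also tend to process visual information before auditory information. If audio leads video by much, we can no longer match sound to picture and readily notice that the two are out of sync.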
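
The asymmetry of these thresholds is easy to see in a small classifier. This is only a sketch: the enum and function names are hypothetical, and treating every offset that falls between the imperceptible band and the degraded limits as perceptible is an assumption, since the list above does not say how those gaps should be handled.

```cpp
enum class SkewPerception { kImperceptible, kPerceptible, kDegraded };

// skew_ms > 0 means audio leads video, matching the sign convention above.
SkewPerception ClassifySkew(int skew_ms) {
  if (skew_ms < -185 || skew_ms > 90) return SkewPerception::kDegraded;
  if (skew_ms > -100 && skew_ms < 25) return SkewPerception::kImperceptible;
  // Assumption: anything between the two documented bands counts as
  // perceptible.
  return SkewPerception::kPerceptible;
}
```

Under these thresholds, audio lagging by 90 ms is still classified as imperceptible, while audio leading by only 40 ms is already perceptible.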

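Modeling the Desynchronization

The root cause of desynchronization is that audio and video travel as separate streams that are transported and processed independently, each with its own timestamps. Without any correction, the difference in their processing delays also leaves them not fully synchronized.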

[Figure: image unavailable in the source]

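As the figure shows, synchronizing audio and video (in essence: once data is received, being able to compute from it the PTS at which it should be presented) requires handling the following:

The absolute NTP time difference between audio and video at the receiver.
The absolute NTP time difference between audio and video capture at the sender.
The render buffering delay of audio and video.

WebRTC measures the desynchronization with a statistic called syncdiff (how far out of sync audio and video would be if no synchronization were applied): playout time difference - capture time difference.

Desynchronization = desynchronization from capture until frame assembly at the receiver + jitter buffer delay + render time (the render time is already accounted for in the jitter buffer's jitter, so it can be merged with the jitter buffer delay)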

// Calculate the difference between the lowest possible video delay and the
  // current audio delay.
  // Current video jitter buffer delay - current audio jitter buffer delay +
  // relative delay.
  int current_diff_ms =
      current_video_delay_ms - current_audio_delay_ms + relative_delay_ms;

RelativeDelay, the relative delay, is illustrated in the figure below:

[Figure: image unavailable in the source]

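Relative delay = receive time difference - capture time difference.

An RTP packet carries only an RTP timestamp, while the receiver only has its own local timestamp in milliseconds. The RTP timestamp is in units of sampling ticks, so the two cannot be compared directly. The RTP timestamp must first be converted to sender-side time in milliseconds. This is where SR packets come in: the receiver maps each packet's RTP timestamp to an absolute NTP timestamp. An RTCP SR packet carries an RTP timestamp and the corresponding NTP time, and SRs are sent periodically, so the receiver can fit the relationship between RTP timestamp and NTP timestamp from them. This conversion is done by RtpToNtpEstimator. (The sending delay is not actually a concern here.)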

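$$ \text{relative delay} = (\text{video frame receive time} - \text{audio frame receive time}) - (\text{video frame capture time} - \text{audio frame capture time}) $$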

As can be seen, WebRTC converts the timestamps to the sender's NTP time before computing relativeDelay.
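
The relative delay formula can be sketched directly in code. StreamMeasurement and RelativeDelayMs are hypothetical names for this example. The sketch assumes the capture time has already been mapped from the RTP timestamp to sender NTP time in milliseconds.

```cpp
#include <cstdint>

// Latest observation for one stream. Hypothetical type for this sketch.
struct StreamMeasurement {
  int64_t latest_receive_time_ms;  // Receiver local clock.
  int64_t latest_capture_ntp_ms;   // Sender NTP time mapped from RTP.
};

// (video receive - audio receive) - (video capture - audio capture).
// A positive result means video spent longer in flight than audio.
int64_t RelativeDelayMs(const StreamMeasurement& audio,
                        const StreamMeasurement& video) {
  const int64_t receive_diff_ms =
      video.latest_receive_time_ms - audio.latest_receive_time_ms;
  const int64_t capture_diff_ms =
      video.latest_capture_ntp_ms - audio.latest_capture_ntp_ms;
  return receive_diff_ms - capture_diff_ms;
}
```

Each difference is taken within a single clock (receiver local time for arrivals, sender NTP time for captures), so the sender and receiver clocks do not need to agree with each other. For example, if video was captured 10 ms after audio but arrived 60 ms after it, the relative delay is 50 ms.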

[Figure: image unavailable in the source]

NTP to NTP

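How Audio/Video Synchronization Works

To synchronize audio and video, the following parameters are needed:

The relative delay between audio and video
The desired video target delay, which is the jitter computed in the jitter buffer
The desired audio target delay

With these parameters, what remains is the synchronization strategy:

• If video playout lags audio: if video already has extra delay, reduce the video playout delay so that video plays faster and catches up with audio. If video has no extra delay left to remove, increase the audio delay so that audio slows down and waits for video.

• If audio playout lags video: if audio already has extra delay, reduce the audio playout delay so that audio plays faster and catches up with video. If audio has no extra delay left to remove, increase the video delay so that video slows down and waits for audio.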

bool StreamSynchronization::ComputeDelays(int relative_delay_ms,
                                          int current_audio_delay_ms,
                                          int* total_audio_delay_target_ms,
                                          int* total_video_delay_target_ms) {
  int current_video_delay_ms = *total_video_delay_target_ms;

  RTC_LOG(LS_VERBOSE) << "Audio delay: " << current_audio_delay_ms
                      << " current diff: " << relative_delay_ms
                      << " for stream " << audio_stream_id_;

  // Calculate the difference between the lowest possible video delay and the
  // current audio delay.
  int current_diff_ms =
      current_video_delay_ms - current_audio_delay_ms + relative_delay_ms;
  // 1. Smooth the difference with a moving average.
  avg_diff_ms_ =
      ((kFilterLength - 1) * avg_diff_ms_ + current_diff_ms) / kFilterLength;
  if (abs(avg_diff_ms_) < kMinDeltaMs) {
    // Don't adjust if the diff is within our margin.
    return false;
  }

  // Make sure we don't move too fast.
  // 2. Limit the step size to at most 80 ms per adjustment.
  int diff_ms = avg_diff_ms_ / 2;
  diff_ms = std::min(diff_ms, kMaxChangeMs);
  diff_ms = std::max(diff_ms, -kMaxChangeMs);

  // Reset the average after a move to prevent overshooting reaction.
  avg_diff_ms_ = 0;
  // 3. diff_ms > 0: video lags audio, i.e. video is delayed relative to audio.
  if (diff_ms > 0) {
    // The minimum video delay is longer than the current audio delay.
    // We need to decrease extra video delay, or add extra audio delay.
    // Video already has extra delay: reduce it so that video plays out sooner.
    if (video_delay_.extra_ms > base_target_delay_ms_) {
      // We have extra delay added to ViE. Reduce this delay before adding
      // extra delay to VoE.
      video_delay_.extra_ms -= diff_ms;
      audio_delay_.extra_ms = base_target_delay_ms_;
    } else {  // video_delay_.extra_ms > 0
      // No extra video delay: increase the audio delay so audio waits for video.
      // We have no extra video delay to remove, increase the audio delay.
      audio_delay_.extra_ms += diff_ms;
      video_delay_.extra_ms = base_target_delay_ms_;
    }
  } else {  // if (diff_ms > 0)
    // 4. diff_ms < 0: video is ahead of audio, i.e. audio is delayed relative to video.
    // The video delay is lower than the current audio delay.
    // We need to decrease extra audio delay, or add extra video delay.
    // Audio has extra delay added for sync: reduce the audio delay (diff_ms < 0 here).
    if (audio_delay_.extra_ms > base_target_delay_ms_) {
      // We have extra delay in VoiceEngine.
      // Start with decreasing the voice delay.
      // Note: diff_ms is negative; add the negative difference.
      audio_delay_.extra_ms += diff_ms;
      video_delay_.extra_ms = base_target_delay_ms_;
    } else {  // audio_delay_.extra_ms > base_target_delay_ms_
      // No extra audio delay: increase the video delay so video waits for audio.
      // We have no extra delay in VoiceEngine, increase the video delay.
      // Note: diff_ms is negative; subtract the negative difference.
      video_delay_.extra_ms -= diff_ms;  // X - (-Y) = X + Y.
      audio_delay_.extra_ms = base_target_delay_ms_;
    }
  }

  // Make sure that video is never below our target.
  video_delay_.extra_ms =
      std::max(video_delay_.extra_ms, base_target_delay_ms_);

  int new_video_delay_ms;
  if (video_delay_.extra_ms > base_target_delay_ms_) {
    new_video_delay_ms = video_delay_.extra_ms;
  } else {
    // No change to the extra video delay. We are changing audio and we only
    // allow to change one at the time.
    new_video_delay_ms = video_delay_.last_ms;
  }

  // Make sure that we don't go below the extra video delay.
  new_video_delay_ms = std::max(new_video_delay_ms, video_delay_.extra_ms);

  // Verify we don't go above the maximum allowed video delay.
  new_video_delay_ms =
      std::min(new_video_delay_ms, base_target_delay_ms_ + kMaxDeltaDelayMs);

  int new_audio_delay_ms;
  if (audio_delay_.extra_ms > base_target_delay_ms_) {
    new_audio_delay_ms = audio_delay_.extra_ms;
  } else {
    // No change to the audio delay. We are changing video and we only allow to
    // change one at the time.
    new_audio_delay_ms = audio_delay_.last_ms;
  }

  // Make sure that we don't go below the extra audio delay.
  new_audio_delay_ms = std::max(new_audio_delay_ms, audio_delay_.extra_ms);

  // Verify we don't go above the maximum allowed audio delay.
  new_audio_delay_ms =
      std::min(new_audio_delay_ms, base_target_delay_ms_ + kMaxDeltaDelayMs);

  video_delay_.last_ms = new_video_delay_ms;
  audio_delay_.last_ms = new_audio_delay_ms;

  RTC_LOG(LS_VERBOSE) << "Sync video delay " << new_video_delay_ms
                      << " for video stream " << video_stream_id_
                      << " and audio delay " << audio_delay_.extra_ms
                      << " for audio stream " << audio_stream_id_;

  *total_video_delay_target_ms = new_video_delay_ms;
  *total_audio_delay_target_ms = new_audio_delay_ms;
  return true;
}
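
Steps 1 and 2 of the listing (smoothing, dead zone, and step clamping) can be isolated into a small standalone sketch. SyncFilter is a hypothetical name. The listing does not show the constant definitions, so the values for kFilterLength and kMinDeltaMs below are assumptions chosen for illustration, while kMaxChangeMs follows the 80 ms limit mentioned in the comments.

```cpp
#include <algorithm>
#include <cstdlib>

// Assumed values for illustration; the listing above uses these constants
// without showing their definitions.
constexpr int kFilterLength = 4;
constexpr int kMinDeltaMs = 30;
constexpr int kMaxChangeMs = 80;

struct SyncFilter {
  int avg_diff_ms = 0;

  // Feeds one measured difference into the filter. Returns the signed step
  // to apply in ms, or 0 when the smoothed difference is inside the margin.
  int Update(int current_diff_ms) {
    avg_diff_ms =
        ((kFilterLength - 1) * avg_diff_ms + current_diff_ms) / kFilterLength;
    if (std::abs(avg_diff_ms) < kMinDeltaMs) return 0;
    // Move only half of the smoothed difference, clamped to the max step.
    int step_ms = avg_diff_ms / 2;
    step_ms = std::min(step_ms, kMaxChangeMs);
    step_ms = std::max(step_ms, -kMaxChangeMs);
    // Reset after a move so the next decision starts from fresh data.
    avg_diff_ms = 0;
    return step_ms;
  }
};
```

With these assumed constants, a single 100 ms measurement is not enough to trigger an adjustment, because the smoothed value is only 25 ms. A second 100 ms measurement raises it to 43 ms and yields a 21 ms step. A sudden 1000 ms difference is clamped to an 80 ms step.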

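Here extra_ms starts at 0 and converges gradually as the process runs. It records how much delay synchronization has added. The value accumulates over time, and the "extra" delay introduced for A/V sync eventually has to be removed again. That is the purpose of extra_ms.

base_target_delay_ms_ is the baseline and defaults to 0.

Finally, the target delays are set on the jitter buffers: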

  if (!syncable_audio_->SetMinimumPlayoutDelay(target_audio_delay_ms)) {
    sync_->ReduceAudioDelay();
  }
  if (!syncable_video_->SetMinimumPlayoutDelay(target_video_delay_ms)) {
    sync_->ReduceVideoDelay();
  }