Paper link: https://www.microsoft.com/en-us/research/project/vasa-1/
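DiT condition signals
Five condition signals are introduced. Three of them control the output: the main gaze direction, the head-to-camera distance, and the emotion offset. To improve smoothness across frames, the audio features and the generated motion features of the previous K frames are also added. Details are as follows: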
condition signals:
- main gaze direction $g = (\theta, \phi)$ [70]: the focused direction of the generated talking face.
- head-to-camera distance $d$ [16]: a normalized scalar controlling the distance between the face and the camera, which affects the face scale in the generated video.
- emotion offset $e$ [41]: modulates the depicted emotion on the talking face.
- last $K$ frames of the audio features, $A^{pre}$
- last $K$ frames of the generated motions, $X^{pre}$
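
To make the list above concrete, here is a minimal sketch of how the five condition signals could be gathered for one generation window. It is not the VASA-1 implementation: the names `Conditions`, `last_k` and `build_conditions`, the feature dimensions, and the zero-padding used when fewer than K previous frames exist are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Conditions:
    """Hypothetical container for the five condition signals."""
    gaze: Tuple[float, float]        # g = (theta, phi), main gaze direction
    distance: float                  # d, normalized head-to-camera distance
    emotion: List[float]             # e, emotion offset vector
    audio_prev: List[List[float]]    # A^pre, audio features of the last K frames
    motion_prev: List[List[float]]   # X^pre, generated motions of the last K frames


def last_k(frames: Sequence[Sequence[float]], k: int, dim: int) -> List[List[float]]:
    """Return the last k frames, left-padded with zero frames if fewer exist.

    Zero-padding is an assumption of this sketch, used for the first window
    where no previous frames are available yet.
    """
    tail = [list(f) for f in frames[-k:]] if k > 0 else []
    pad = [[0.0] * dim for _ in range(k - len(tail))]
    return pad + tail


def build_conditions(gaze, distance, emotion, audio_history, motion_history,
                     k, audio_dim, motion_dim) -> Conditions:
    """Collect the control signals and the K-frame context into one object."""
    return Conditions(
        gaze=(float(gaze[0]), float(gaze[1])),
        distance=float(distance),
        emotion=[float(v) for v in emotion],
        audio_prev=last_k(audio_history, k, audio_dim),
        motion_prev=last_k(motion_history, k, motion_dim),
    )
```

For example, `build_conditions((0.1, -0.2), 0.5, [0.0, 1.0], audio_history, motion_history, k=2, audio_dim=2, motion_dim=1)` keeps only the two most recent frames of each history, so the model sees the context it needs to continue smoothly from the previous window.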