Ablative Study on Domain Adapter, Motion Module Design, and MotionLoRA Efficiency

Domain adapter. To investigate the impact of the domain adapter in AnimateDiff, we vary the scaler in the adapter layers at inference time, from 1 (full effect) to 0 (adapter removed). As Figure 6 shows, lowering the scaler improves overall visual quality and reduces the visual content distribution learned from the video dataset (the watermark, in the case of WebVid (Bain et al., 2021)). These results indicate that the domain adapter improves the visual quality of AnimateDiff by relieving the motion module of having to learn the visual distribution gap.
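The scaler sweep can be illustrated with a minimal sketch. It assumes the adapter is a low-rank residual added to a frozen projection, which this section does not spell out; the function names, toy weights, and dimensions are all hypothetical and are not taken from the AnimateDiff code.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_projection(x, base_weight, adapter_down, adapter_up, scale):
    """Frozen base projection plus a scaled low-rank adapter residual.

    Assumed form: out = W x + scale * (B (A x)).
    scale = 1.0 applies the adapter fully; scale = 0.0 removes it.
    """
    base_out = matvec(base_weight, x)
    residual = matvec(adapter_up, matvec(adapter_down, x))
    return [b + scale * r for b, r in zip(base_out, residual)]


# Hypothetical toy weights: 2-d features with a rank-1 adapter.
base_weight = [[1.0, 0.0], [0.0, 1.0]]  # frozen projection W (2 x 2)
adapter_down = [[0.5, 0.5]]             # A (1 x 2)
adapter_up = [[2.0], [-2.0]]            # B (2 x 1)
x = [1.0, 3.0]

# Sweep the scaler from full effect to complete removal.
for scale in (1.0, 0.5, 0.0):
    print(scale, adapted_projection(x, base_weight, adapter_down, adapter_up, scale))
```

At scale 0 the output equals the base projection alone, which is the "complete removal" end of the sweep in this toy setting.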

Motion module design. We compare our temporal Transformer motion module with a fully convolutional counterpart, since both designs are widely used in recent work on video generation. We replace the temporal attention with 1D temporal convolution and keep the parameter counts of the two models closely aligned. As shown in the supplementary materials, the convolutional motion module makes all frames identical and, unlike the Transformer architecture, produces no motion.
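To make the two designs concrete, here is a simplified sketch of the operations being swapped: self-attention along the frame axis versus a 1D convolution along the same axis. It is not the authors' implementation. It assumes a single attention head with identity Q/K/V projections and a depthwise kernel shared across channels, and it only contrasts how each operation mixes frames; it does not reproduce the ablation result.

```python
import math


def temporal_self_attention(frames):
    """Single-head self-attention along the frame axis.

    frames: list of per-frame feature vectors at one spatial location.
    Simplification: Q, K, and V are the inputs themselves (no learned projections).
    Every output frame is a softmax-weighted mix of all input frames.
    """
    dim = len(frames[0])
    outputs = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in frames]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w / total * value[c] for w, value in zip(weights, frames))
            for c in range(dim)
        ])
    return outputs


def temporal_conv1d(frames, kernel):
    """Depthwise 1D convolution (cross-correlation) along the frame axis.

    kernel: odd-length list of taps shared by all channels (a simplification).
    Zero padding keeps the number of frames unchanged.
    Every output frame only sees a local window of neighbouring frames.
    """
    half = len(kernel) // 2
    dim = len(frames[0])
    outputs = []
    for t in range(len(frames)):
        out = [0.0] * dim
        for offset, tap in enumerate(kernel):
            src = t + offset - half
            if 0 <= src < len(frames):
                for c in range(dim):
                    out[c] += tap * frames[src][c]
        outputs.append(out)
    return outputs


# Hypothetical toy input: 4 frames, 2 channels each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(temporal_self_attention(frames))
print(temporal_conv1d(frames, [0.25, 0.5, 0.25]))
```

Both functions return one feature vector per input frame, so either can sit in the same slot of a motion module; the difference is that attention weights depend on the content of every frame, while the convolution applies fixed taps over a local window.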

Figure 6: Ablation on domain adapter. We adjust the scaler of the adapter from 1 to 0 to gradually remove its effects. In this figure, we show the first frame of the generated animation.

Figure 7: Ablation on MotionLoRA’s efficiency. The two samples on the left use different network ranks; the three samples on the right use different numbers of reference videos.

Efficiency of MotionLoRA. We examine the efficiency of MotionLoRA in AnimateDiff in terms of parameter efficiency and data efficiency. Parameter efficiency matters for training cost and for sharing models among users, while data efficiency matters for real-world applications where collecting enough reference videos for a specific motion pattern can be difficult.

To investigate these aspects, we trained multiple MotionLoRA models with varying parameter scales and numbers of reference videos. In Figure 7, the first two samples show that MotionLoRA can learn new camera motions (e.g., zoom-in) at a small parameter scale while maintaining comparable motion quality. Furthermore, even with a modest number of reference videos (e.g., N = 50), the model successfully learns the desired motion patterns. However, when the number of reference videos is too small (e.g., N = 5), quality degrades significantly, suggesting that MotionLoRA struggles to learn the shared motion pattern and instead captures texture information from the reference videos.
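As a rough illustration of why the network rank controls the parameter scale, the sketch below counts the trainable parameters of one low-rank update and compares them with a full weight update of the same layer. The layer width of 320 and the ranks listed are hypothetical values chosen for illustration; they are not the settings used in this study.

```python
def full_update_params(d_in, d_out):
    """Trainable parameters when the whole d_out x d_in weight is updated."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable parameters of a low-rank update B @ A.

    A has shape (rank, d_in) and B has shape (d_out, rank).
    """
    return rank * (d_in + d_out)


# Hypothetical layer width and ranks, for illustration only.
d_in = d_out = 320
full = full_update_params(d_in, d_out)
for rank in (2, 8, 32, 128):
    low_rank = lora_params(d_in, d_out, rank)
    print(f"rank={rank:>3}: {low_rank:>6} params ({100 * low_rank / full:.2f}% of full)")
```

The count grows linearly with the rank, so a lower rank yields a smaller file to train and share, which is the parameter-efficiency trade-off examined above.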

Source: https://hackernoon.com/ablative-study-on-domain-adapter-motion-module-design-and-motionlora-efficiency?source=rss