Current diffusion-based video editing primarily focuses on local editing (object/background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue is the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage.
To tackle this issue, we introduce EVA, a zero-shot and multi-attribute video editing framework tailored for human-centric videos with complex motions. We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that leverages the intrinsic positive and negative correspondences of cross-frame diffusion features. To avoid attention leakage, we utilize these correspondences to boost the attention scores of tokens within the same attribute across all video frames while limiting interactions between tokens of different attributes in the self-attention layer. For precise text-to-attribute manipulation, we use discrete text embeddings focused on specific layout areas within the cross-attention layer. Benefiting from the precise attention weight distribution, EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping. Extensive experiments demonstrate EVA achieves state-of-the-art results in real-world scenarios.
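
The self-attention idea above can be illustrated with a small sketch. It is not the released EVA implementation: it assumes that tokens from all frames are flattened into one sequence, that each token carries an integer attribute label taken from the layout, and that a single additive bias (the hypothetical `strength` parameter) raises scores for same-attribute pairs and lowers them for different-attribute pairs. The exact form of the modulation used by the method may differ.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def layout_guided_self_attention(q, k, labels, strength=2.0):
    """Attention weights over tokens flattened across all frames.

    q, k: lists of equal-length float vectors, one per token.
    labels: attribute id per token (hypothetical, e.g. 0 = man, 1 = court).
    strength: hypothetical bias magnitude; not a value from the paper.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for i, qi in enumerate(q):
        row = []
        for j, kj in enumerate(k):
            score = scale * sum(a * b for a, b in zip(qi, kj))
            # Positive pair (same attribute): raise the score.
            # Negative pair (different attribute): lower it.
            score += strength if labels[i] == labels[j] else -strength
            row.append(score)
        weights.append(softmax(row))
    return weights
```

With two frames of two tokens each and labels `[0, 1, 0, 1]`, a query token of attribute 0 puts most of its attention mass on the attribute-0 tokens of both frames, which is the cross-frame behaviour the abstract describes.
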

We integrate ST-Layout Attn into the frozen SD model during the denoising process. In the self-attention layer, we compute the positive/negative value of each query token in different attributes from a spatial-temporal perspective. This allows us to augment the attention scores for tokens within the same attribute and reduce them for tokens in different attributes. In the cross-attention layer, we extract each attribute’s text embeddings from the edit prompt, ensuring they focus only on the corresponding layouts across frames.
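
The cross-attention restriction can be sketched in the same spirit. This is an illustrative approximation, not the authors' code: it assumes each spatial token has an attribute id from the layout (`query_attr`), each text token of the edit prompt is tagged with the attribute it describes (`token_attr`), and untagged tokens such as a start-of-text token are visible to every region. All names here are hypothetical.

```python
import math


def masked_cross_attention(q, text_k, query_attr, token_attr):
    """Cross-attention weights where each spatial token sees only its attribute's words.

    q: one float vector per spatial token (all frames flattened).
    text_k: one float vector per text token of the edit prompt.
    query_attr: attribute id per spatial token, from the layout (hypothetical input).
    token_attr: attribute id per text token, or None for shared tokens
        such as a start-of-text token (an assumption of this sketch).
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    weights = []
    for qi, attr in zip(q, query_attr):
        scores = []
        for kj, tok in zip(text_k, token_attr):
            allowed = tok is None or tok == attr
            s = scale * sum(a * b for a, b in zip(qi, kj))
            # Text tokens of other attributes are masked out entirely.
            scores.append(s if allowed else float("-inf"))
        m = max(scores)
        if m == float("-inf"):
            raise ValueError("spatial token has no text token to attend to")
        exps = [math.exp(s - m) for s in scores]  # exp(-inf) evaluates to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

For a prompt tagged as `[None, 0, 1, 1]` (shared token, "Batman", "snow", "court"), a spatial token inside the man's layout receives zero weight from the two court tokens, so the court description cannot leak into the man region.
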

man → Batman, clay court → snow covered court, stone wall → an iced wall | man → Ironman, ground → grassland
---|---

girl → Batwoman, ground → snow covered ground | woman → Scarlet Witch, sofa → a moonlit pond, background → starry dark night
---|---

left man → Iron Man, right man → Spider Man; trees, ground → frosty yellow leaves | back man → Iron Man, front man → Batman; bridge, ground → snow covered
---|---

left man → Iron Man, right man → Batman, trees → crimson maple trees, road → snow covered road | man → Batman, woman → Batwoman, ground → rain soaked ground, red wall → stormy lighting night
---|---

background → frosty yellow leaves, → asphalt road with building under sky | left man → Iron Man, right man → Spider-Man | left man → Spider-Man, right man → Iron Man | wall → charcoal grey wall | man → Spider-Man, woman → Wonder Woman | man → Wonder Woman, woman → Spider-Man
---|---|---|---|---|---

ground → grassland, blue sky → raining sky | left man → Iron Man, right man → Spider-Man | left man → Spider-Man, right man → Iron Man | store → cyberpunk cityspace | left man → Iron Man, right man → Hulk | left man → Hulk, right man → Iron Man
---|---|---|---|---|---

@misc{yang2024eva,
title={EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing},
author={Xiangpeng Yang and Linchao Zhu and Hehe Fan and Yi Yang},
year={2024},
eprint={2403.16111},
archivePrefix={arXiv},
primaryClass={cs.CV}
}