EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

¹ReLER Lab, University of Technology Sydney   ²Zhejiang University

📕 TL;DR: EVA achieves accurate multi-attribute editing for both single- and multi-object scenarios in human-centric videos with complex motions, without any training, by leveraging pre-trained text-to-image models.

Abstract

Current diffusion-based video editing primarily focuses on local editing (object/background editing) or global style editing by utilizing various dense correspondences. However, these methods often fail to accurately edit the foreground and background simultaneously while preserving the original layout. We find that the crux of the issue lies in the imprecise distribution of attention weights across designated regions, including inaccurate text-to-attribute control and attention leakage.

To tackle this issue, we introduce EVA, a zero-shot and multi-attribute video editing framework tailored for human-centric videos with complex motions. We incorporate a Spatial-Temporal Layout-Guided Attention mechanism that leverages the intrinsic positive and negative correspondences of cross-frame diffusion features. To avoid attention leakage, we utilize these correspondences to boost the attention scores of tokens within the same attribute across all video frames while limiting interactions between tokens of different attributes in the self-attention layer. For precise text-to-attribute manipulation, we use discrete text embeddings focused on specific layout areas within the cross-attention layer. Benefiting from the precise attention weight distribution, EVA can be easily generalized to multi-object editing scenarios and achieves accurate identity mapping. Extensive experiments demonstrate EVA achieves state-of-the-art results in real-world scenarios.

Method

Intrinsic cross-frame DIFT feature correspondence: We observe that intra- and inter-attribute tokens can be identified, without any supervision, via cross-frame DIFT correspondence. For each query token, we can (see the sketch after this list):
  1. Maximize the cosine similarity of cross-frame DIFT features to identify its positive pair in other frames (tokens sharing the same attribute).
  2. Minimize this similarity to find its negative pair (tokens across different attributes).
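A minimal sketch of this pair-mining step, assuming one DIFT feature map of shape (C, H, W) per frame and plain cosine similarity; the function and tensor names are illustrative assumptions, not the official implementation:

      # Hedged sketch: mining positive/negative cross-frame pairs from DIFT features.
      # Shapes and names are assumptions, not EVA's exact interface.
      import torch
      import torch.nn.functional as F

      def cross_frame_pairs(feat_src, feat_tgt):
          """feat_src, feat_tgt: DIFT feature maps of two frames, shape (C, H, W)."""
          C = feat_src.shape[0]
          src = F.normalize(feat_src.reshape(C, -1), dim=0)  # (C, HW), unit-norm tokens
          tgt = F.normalize(feat_tgt.reshape(C, -1), dim=0)
          sim = src.t() @ tgt                                # (HW, HW) cosine similarity
          pos = sim.argmax(dim=1)  # per query token: most similar target token (same attribute)
          neg = sim.argmin(dim=1)  # least similar target token (different attribute)
          return pos, neg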

EVA pipeline:

We integrate ST-Layout Attn into the frozen SD model during the denoising process. In the self-attention layer, we compute the positive/negative value of each query token across different attributes from a spatial-temporal perspective. This allows us to augment the attention scores for tokens within the same attribute and reduce them for tokens in different attributes. In the cross-attention layer, we extract each attribute's text embeddings from the edit prompt, ensuring they focus only on the corresponding layouts across frames; both mechanisms are sketched below.
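As a hedged illustration of these two modifications (not EVA's actual code: `layout_ids`, `pixel_attr`, `token_attr`, and the strength `lam` are assumed names, and the paper may derive these quantities from the layout masks differently):

      # Illustrative single-head attention sketches; all argument names are assumptions.
      import torch

      def st_layout_self_attention(q, k, v, layout_ids, lam=2.0):
          """q, k, v: (N, d) tokens gathered across all frames;
          layout_ids: (N,) attribute id of every token."""
          logits = q @ k.t() / q.shape[-1] ** 0.5                 # (N, N) raw scores
          same = layout_ids[:, None] == layout_ids[None, :]       # does key share query's attribute?
          logits = torch.where(same, logits + lam, logits - lam)  # boost intra-, suppress inter-attribute
          return logits.softmax(dim=-1) @ v

      def layout_masked_cross_attention(q, text_k, text_v, pixel_attr, token_attr):
          """q: (N, d) spatial tokens of one frame; text_k, text_v: (T, d)
          concatenated per-attribute edit-phrase embeddings;
          pixel_attr: (N,) / token_attr: (T,) attribute id per token."""
          logits = q @ text_k.t() / q.shape[-1] ** 0.5            # (N, T)
          allowed = pixel_attr[:, None] == token_attr[None, :]    # region sees only its own phrase
          logits = logits.masked_fill(~allowed, float("-inf"))
          return logits.softmax(dim=-1) @ text_v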

EVA Editing Results

- man → Batman, clay court → snow covered court, stone wall → an iced wall
- man → Ironman, ground → grassland
- girl → Batwoman, ground → snow covered ground
- woman → Scarlet Witch, sofa → a moonlit pond, background → starry dark night
- left man → Iron Man, right man → Spider-Man
- trees, ground → frosty yellow leaves
- back man → Iron Man, front man → Batman
- bridge, ground → snow covered
- left man → Iron Man, right man → Batman
- trees → crimson maple trees, road → snow covered road
- man → Batman, woman → Batwoman
- ground → rain soaked ground, red wall → stormy lighting night
- background → frosty yellow leaves, → asphalt road with building under sky
- left man → Iron Man, right man → Spider-Man
- left man → Spider-Man, right man → Iron Man
- wall → charcoal grey wall, man → Spider-Man, woman → Wonder Woman
- man → Wonder Woman, woman → Spider-Man
- ground → grassland, blue sky → raining sky
- left man → Iron Man, right man → Spider-Man
- left man → Spider-Man, right man → Iron Man
- store → cyberpunk cityscape, left man → Iron Man, right man → Hulk
- left man → Hulk, right man → Iron Man

Comparison

Previous video editing methods fail at (1) accurate text-to-attribute control and (2) avoiding attention leakage.

BibTeX


      @misc{yang2024eva,
        title={EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing}, 
        author={Xiangpeng Yang and Linchao Zhu and Hehe Fan and Yi Yang},
        year={2024},
        eprint={2403.16111},
        archivePrefix={arXiv},
        primaryClass={cs.CV}
      }