EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

Supplementary Material



We present the full results of EVA, including single-object and multi-object multi-attribute editing results, part of which are shown in Fig. 6 and Fig. 7 of the paper.


Qualitative Results of EVA

1. Single-Object

Single-Object Editing Results


Source Video man → Iron Man
clay court → snow covered court
man → Batman
clay court → snow covered court
stone wall → an iced wall
Source Video man → Iron Man
ground → grassland
man → Iron Man
ground → lake
Source Video girl → Batwoman
ground → snow covered ground
girl → Spider-Man
ground → trampoline
Source Video man → Iron Man
sea, sky → falling snow
man → Iron Man
sea, sky → falling snow
wave → pink wave
Source Video man → Iron Man
ground → grassland
slope → snow covered slope
Source Video woman → penguin
snow mountain → ice rink
Source Video girl → Wonder Woman
grass → withered grass
green trees → golden ginkgo trees
Source Video man → Batman
road → frozen lake
sky → night sky
Source Video man → The Flash
black gloves → red gloves
gym → stormy lightning night
Source Video woman → Scarlet Witch
sofa → a moonlit pond
background → starry dark night

2. Multi-Object

To view the identity-swapped video, hover your mouse over the video.

Multi-Object Editing Results


Source Video

left man → Iron Man
right man → Spider-Man
background → frosty yellow leaves
swap background → asphalt road with buildings under sky

Source Video

left man → Iron Man
right man → Batman
trees → crimson maple trees
road → snow covered road

Source Video

left man → Iron Man
right man → Spider-Man
ground → grassland
blue sky → raining sky

Source Video

man → Spider-Man
woman → Wonder Woman
wall → charcoal grey wall

Source Video

back man → Iron Man
front man → Batman
bridge, ground → snow covered

Source Video

left man → Iron Man
right man → Hulk
store → cyberpunk cityscape

Source Video man → Batman
woman → Batwoman
ground → rain soaked ground
red wall → stormy lightning night
Source Video left man → Iron Man
right man → Batman
trees → golden ginkgo trees
Source Video left man → Iron Man
right man → Batman
ground → yellow floor



Qualitative Comparison

We compare EVA with the following video-editing methods: FateZero [1], ControlVideo [2], TokenFlow [3], and Ground-A-Video [4].

For fairness, all compared methods are equipped with ControlNet [5] pose guidance.
Compared with previous methods, EVA achieves accurate text-to-attribute control while avoiding attention leakage.
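As a point of reference, the sketch below shows one common way to attach pose guidance to a Stable Diffusion pipeline with the diffusers library. The model IDs, the pose-map file name, and the single-frame usage are illustrative assumptions, not the exact configuration used in our comparisons.

```python
# Illustrative sketch (not our exact comparison setup): attaching
# OpenPose-conditioned ControlNet guidance to Stable Diffusion via diffusers.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

# Model IDs are common public checkpoints, assumed here for illustration.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# A pre-extracted pose map for one frame (hypothetical file name).
pose_map = load_image("pose_frame_000.png")

# The pose map constrains the human structure while the prompt edits attributes.
frame = pipe(
    "A Batman is playing tennis on snow covered court",
    image=pose_map,
    num_inference_steps=50,
).images[0]
```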


Please scroll right for more comparisons.


1. Comparison on Single-Object

Source Video EVA (Ours) FateZero ControlVideo TokenFlow Ground-A-Video
Edit Prompt: "A Batman is playing tennis on snow covered court before an iced wall"
Edit Prompt: "An Iron Man is surfing with kite rope on a pink wave over blue sea under falling snow sky"
Edit Prompt: "A Spider is jumping on trampoline before a graffiti wall"
Edit Prompt: "A Batman on a motorcycle does a burnout on a frozen lake under the night sky"


Please scroll right for more comparisons.


2. Comparison on Multi-Object

Source Video EVA (Ours) FateZero ControlVideo TokenFlow Ground-A-Video
Edit Prompt: "An Iron man and a Spider Man unning under frosty yellow trees with golden leaves on the ground"
Edit Prompt: "An Spider man and a Wonder Woman are playing badminton before charcoal grey wall"
Edit Prompt: "An Iron man pushes a Batman in a soap-box car on the snowy bridge over snow covered ground "
Edit Prompt: "An Iron man and Batman are jumping skateboard before golden ginkgo trees "



Ablation Studies

In this section, we ablate key components of our method.

  • Latent Blend
  • ControlNet Pose Guidance
  • Spatial-Temporal Layout-Guided Attention (ST-Layout Attn)
Source Video w/o Latent Blend w/o ControlNet w/o ST-Layout Attn Full Method
man → Batman, clay court → snow covered court, stone wall → an iced wall
man → Batman, road → frozen lake, sky → night sky
left man → Iron Man, right man → Hulk, store → cyberpunk cityscape
back man → Iron Man, front man → Batman, bridge, ground → snow covered
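For reference, here is a minimal sketch of the Latent Blend step ablated above, under the assumption that a binary foreground mask marks the edited objects: at each denoising step, edited latents are kept inside the mask and the DDIM-inversion (source) latents are restored outside it. The function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def latent_blend(z_edit: torch.Tensor, z_src: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """Blend edited and source latents at one denoising step (sketch).

    z_edit, z_src: (B, C, h, w) latents from the editing path and the
        DDIM-inversion path at the same timestep.
    mask: (B, 1, H, W) binary foreground mask at pixel resolution.
    """
    # Downsample the pixel-space mask to the latent resolution.
    m = F.interpolate(mask, size=z_edit.shape[-2:], mode="nearest")
    # Keep edited content in the foreground; restore the source background.
    return m * z_edit + (1.0 - m) * z_src
```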



Comparison with Different Layout Attention

In this section, we compare our Spatial-Temporal Layout-Guided Attention (ST-Layout Attn) with two alternative layout-guided attention mechanisms.

1. Comparison with Modulated Attention (Per-frame Layout Attention)

We compare with the original Modulated Attention from DenseDiffusion [6]. To ensure fairness, we integrate DDIM Inversion and ControlNet for video editing and apply latent blending for background preservation. Modulated Attention serves as per-frame Layout Attention, but applying T2I methods frame-by-frame leads to severe attention leakage.
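For illustration, below is a minimal per-frame sketch in the spirit of DenseDiffusion's attention modulation; the original additionally scales the modulation with the sampling timestep and the score range, which we omit here, and all names are hypothetical.

```python
import torch

def modulated_cross_attention(q, k, region_mask, lam=0.5):
    """Per-frame modulated attention in the spirit of DenseDiffusion (sketch).

    q: (B, N_img, d) image queries for a single frame.
    k: (B, N_txt, d) text-token keys.
    region_mask: (B, N_img, N_txt), 1 where a token's layout region covers a pixel.
    lam: modulation strength (a simplified constant; DenseDiffusion anneals it).
    """
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    # Boost scores inside each token's layout region (positive value) and
    # suppress them outside (negative value). This is done independently per
    # frame, which is why leakage can still accumulate across frames.
    scores = scores + lam * region_mask - lam * (1.0 - region_mask)
    return scores.softmax(dim=-1)
```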

Source Video Modulated Attention ST-Layout Attn
man → Spider-Man, woman → Wonder Woman, wall → charcoal grey wall
left man → Iron Man, right man → Spider-Man

2. Comparison with Sparse-Causal Layout-guided Attention (SC-Layout Attn)

We expand the receptive field for positive/negative values from individual frames to sparse frames, specifically the first frame and the preceding frame. This approach, named Sparse-Causal Layout-guided Attention (SC-Layout Attn), can be viewed as an extension of per-frame Layout Attention. However, it still suffers from attention leakage and appearance inconsistency due to (1) a limited receptive field for negative values and (2) reduced interaction across full frames.
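The key/value gathering for this baseline can be sketched as follows, assuming per-frame self-attention features are available; the function name is hypothetical.

```python
import torch

def sparse_causal_features(feats: torch.Tensor, t: int) -> torch.Tensor:
    """Gather features for SC-Layout Attn at frame t (sketch).

    feats: (T, N, d) per-frame self-attention features.
    Returns the features of the first and the preceding frame, i.e. the
    sparse-causal receptive field described above. ST-Layout Attn instead
    draws its positive/negative values from all T frames.
    """
    prev = max(t - 1, 0)
    return torch.cat([feats[0], feats[prev]], dim=0)  # (2N, d)
```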

Source Video SC-Layout Attn ST-Layout Attn
left man → Iron Man, right man → Batman, trees → golden ginkgo trees
left man → Iron Man, right man → Spider-Man, trees, ground → frosty yellow leaves


References

[1] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. ICCV, 2023.

[2] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free Controllable Text-to-Video Generation. ICLR, 2024.

[3] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. ICLR, 2024.

[4] Hyeonho Jeong and Jong Chul Ye. Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models. ICLR, 2024.

[5] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ICCV, 2023.

[6] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense Text-to-Image Generation with Attention Modulation. ICCV, 2023.