EVA: Zero-shot Accurate Attributes and Multi-Object Video Editing

Supplementary Material

We present the full results of EVA, including single-object and multi-object multi-attribute editing results, some of which are shown in Fig. 6 and Fig. 7 of the paper.

Qualitative Results of EVA

1. Single-Object

Single-Object Editing Results


Source Video	man → Iron Man, clay court → snow covered court	man → Batman, clay court → snow covered court, stone wall → an iced wall	Source Video	man → Iron Man, ground → grassland	man → Iron Man, ground → lake

Source Video	girl → Batwoman, ground → snow covered ground	girl → Spider-Man, ground → trampoline	Source Video	man → Iron Man, sea, sky → falling snow	man → Iron Man, sea, sky → falling snow, wave → pink wave

Source Video	man → Iron Man, ground → grassland, slope → snow covered slope	Source Video	woman → penguin, snow mountain → ice rink	Source Video	girl → Wonder Woman, grass → withered grass, green trees → golden ginkgo trees

Source Video	man → Batman, road → frozen lake, sky → night sky	Source Video	man → The Flash, black gloves → red gloves, gym → stormy lightning night	Source Video	woman → Scarlet Witch, sofa → a moonlit pond, background → starry dark night

2. Multi-Object

Multi-Object Editing Results

Source Video

left man → Iron Man
right man → Spider-Man
background → frosty yellow leaves
swap background → asphalt road with building under sky

Source Video

left man → Iron Man
right man → Batman
trees → crimson maple trees
road → snow covered road

Source Video

left man → Iron Man
right man → Spider-Man
ground → grassland
blue sky → raining sky

Source Video

man → Spider-Man
woman → Wonder Woman
wall → charcoal grey wall

Source Video

back man → Iron Man
front man → Batman
bridge, ground → snow covered

Source Video

left man → Iron Man
right man → Hulk
store → cyberpunk cityscape


Source Video	man → Batman, woman → Batwoman, ground → rain soaked ground, red wall → stormy lightning night	Source Video	left man → Iron Man, right man → Batman, trees → golden ginkgo trees	Source Video	left man → Iron Man, right man → Batman, ground → yellow floor

Qualitative Comparison

We compare EVA with the following video editing methods:

FateZero [1]: preserving layout information using source video attention maps.
ControlVideo [2]: achieving strict temporal consistency via optical flow and ControlNet [5].
TokenFlow [3]: enforcing a linear mix of nearest key-frame features to ensure consistency.
Ground-A-Video [4]: leveraging word-to-bounding-box control for multi-attribute editing.

For fairness, all compared methods are equipped with ControlNet pose guidance.
Compared with previous methods, EVA achieves accurate text-to-attribute control while avoiding attention leakage.

1. Comparison on Single-Object

Source Video	EVA (Ours)	FateZero	ControlVideo	TokenFlow	Ground-A-Video

Edit Prompt: "A Batman is playing tennis on snow covered court before an iced wall"

Edit Prompt: "An Iron Man is surfing with kite rope on a pink wave over blue sea under falling snow sky"

Edit Prompt: "A Spider is jumping on trampoline before a graffiti wall"

Edit Prompt: "A Batman on a motorcycle does a burnout on a frozen lake under the night sky"

2. Comparison on Multi-Object

Source Video	EVA (Ours)	FateZero	ControlVideo	TokenFlow	Ground-A-Video

Edit Prompt: "An Iron man and a Spider Man running under frosty yellow trees with golden leaves on the ground"

Edit Prompt: "A Spider man and a Wonder Woman are playing badminton before charcoal grey wall"

Edit Prompt: "An Iron man pushes a Batman in a soap-box car on the snowy bridge over snow covered ground"

Edit Prompt: "An Iron man and Batman are jumping skateboard before golden ginkgo trees"

Ablation Studies

In this section, we ablate the key components of our method:

Latent Blend
ControlNet Pose Guidance
Spatial-Temporal Layout-Guided Attention (ST-Layout Attn)
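
As a rough illustration of what the Latent Blend component does, the sketch below blends an edited latent with the source latent under a binary mask. It is a minimal sketch under stated assumptions, not the EVA implementation: the function name `latent_blend`, the convention that a mask value of 1 marks a region taken from the edited latent, and the use of nested Python lists in place of tensors are all hypothetical.

```python
def latent_blend(edited, source, mask):
    """Blend two 2D latents element-wise (hypothetical helper).

    Assumed convention: mask == 1 keeps the edited latent,
    mask == 0 keeps the source latent, so unmasked regions are preserved.
    """
    return [
        [m * e + (1.0 - m) * s for e, s, m in zip(e_row, s_row, m_row)]
        for e_row, s_row, m_row in zip(edited, source, mask)
    ]


# Toy 2x2 example: only the diagonal entries are replaced by the edit.
source = [[1.0, 2.0], [3.0, 4.0]]
edited = [[9.0, 9.0], [9.0, 9.0]]
mask = [[1.0, 0.0], [0.0, 1.0]]
blended = latent_blend(edited, source, mask)
print(blended)
```

In a diffusion pipeline this kind of blend would typically be applied to latent tensors at one or more denoising steps; which steps and which mask EVA uses is not specified here.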

Source Video	w/o Latent Blend	w/o ControlNet	w/o ST-Layout Attn	Full Method

man → Batman, clay court → snow covered court, stone wall → an iced wall

man → Batman, road → frozen lake, sky → night sky

left man → Iron Man, right man → Hulk, store → cyberpunk cityscape

back man → Iron Man, front man → Batman, bridge, ground → snow covered

Comparison with Different Layout Attention

In this section, we compare our Spatial-Temporal Layout-Guided Attention (ST-Layout Attn) with two alternative layout-guided attention mechanisms.

1. Comparison with Modulated Attention (Per-frame Layout Attention)

We compare with the original Modulated Attention from DenseDiffusion [6]. To ensure fairness, we integrate DDIM Inversion and ControlNet for video editing and apply latent blending for background preservation. Modulated Attention serves as a per-frame Layout Attention, but applying a text-to-image (T2I) method frame by frame leads to severe attention leakage.
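
To make the idea of layout-guided modulation more concrete, the sketch below adjusts the attention scores of a single query: keys in the same layout region are pushed toward the maximum score (a positive value), and keys in other regions are pushed toward the minimum (a negative value). This is a simplified illustration loosely inspired by the attention modulation idea in [6]; the function names, the `strength` parameter, and the exact formula are assumptions and may differ from both DenseDiffusion and EVA.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def modulate_scores(scores, same_region, strength=0.5):
    """Hypothetical per-query modulation of raw attention scores.

    same_region[i] is True when key i lies in the same layout region
    as the query. Same-region scores move toward the max score,
    other-region scores move toward the min score.
    """
    hi, lo = max(scores), min(scores)
    out = []
    for s, same in zip(scores, same_region):
        if same:
            out.append(s + strength * (hi - s))  # positive value
        else:
            out.append(s - strength * (s - lo))  # negative value
    return out


scores = [1.0, 0.8, 0.9, 0.2]
same_region = [True, True, False, False]
plain = softmax(scores)
guided = softmax(modulate_scores(scores, same_region))
print(sum(plain[:2]), sum(guided[:2]))
```

In the per-frame setting, the keys come only from the query's own frame, so nothing in this modulation ties one frame's result to another.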

Source Video	Modulated Attention	ST-Layout Attn

man → Spider-Man, woman → Wonder Woman, wall → charcoal grey wall

left man → Iron Man, right man → Spider-Man

2. Comparison with Sparse-Causal Layout-guided Attention (SC-Layout Attn)

We expand the receptive field for positive/negative values from individual frames to sparse frames, specifically the first frame and the preceding frame. This approach, named Sparse-Causal Layout-guided Attention (SC-Layout Attn), can be viewed as an extension of Per-frame Layout Attention. However, it still suffers from attention leakage and appearance inconsistency because of (1) a limited receptive field for negative values and (2) reduced interaction across the full set of frames.
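
The difference in receptive field between the three variants can be sketched by listing which frames a query in frame `t` may attend to. This is a minimal sketch under assumptions: the function name `attended_frames` and the mode strings are hypothetical, the sparse-causal set is assumed to contain the current frame together with the first and the preceding frame, and the spatial-temporal set is assumed to cover all frames.

```python
def attended_frames(t, num_frames, mode):
    """Return the sorted frame indices visible to a query in frame t.

    Hypothetical helper; modes:
      "per_frame"        -> only the current frame
      "sparse_causal"    -> first, preceding, and current frame (assumed)
      "spatial_temporal" -> every frame of the clip (assumed)
    """
    if not 0 <= t < num_frames:
        raise ValueError("frame index out of range")
    if mode == "per_frame":
        return [t]
    if mode == "sparse_causal":
        return sorted({0, max(t - 1, 0), t})
    if mode == "spatial_temporal":
        return list(range(num_frames))
    raise ValueError(f"unknown mode: {mode}")


for mode in ("per_frame", "sparse_causal", "spatial_temporal"):
    print(mode, attended_frames(3, 5, mode))
```

Under these assumptions, a sparse-causal query never sees later frames or most earlier ones, which is consistent with the limited receptive field described above.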

Source Video	SC-Layout Attn	ST-Layout Attn

left man → Iron Man, right man → Batman, trees → golden ginkgo trees

left man → Iron Man, right man → Spider-Man, trees, ground → frosty yellow leaves

References

[1] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. ICCV, 2023.

[2] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free Controllable Text-to-Video Generation. ICLR, 2024.

[3] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. ICLR, 2024.

[4] Hyeonho Jeong and Jong Chul Ye. Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models. ICLR, 2024.

[5] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ICCV, 2023.

[6] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense Text-to-Image Generation with Attention Modulation. ICCV, 2023.
