We present the full results of EVA, including single- and multi-object multi-attribute editing results, some of which appear in Fig. 6 and Fig. 7 of the paper.
Source video |
man → Iron Man clay court → snow covered court |
man → Batman clay court → snow covered court stone wall → an iced wall |
Source Video |
man → Iron Man ground → grassland |
man → Iron Man ground → lake |
---|---|---|---|---|---|
source video |
girl → Batwoman ground → snow covered ground |
girl → Spider man ground → trampoline |
Source Video |
man → Iron Man sea, sky → falling snow |
man → Iron Man sea, sky → falling snow wave → pink wave |
source video |
man → Iron Man ground → grassland slope → snow covered slope |
source video |
woman → penguin snow mountain→ ice rink |
Source Video |
girl → Wonder Woman grass → withered grass green trees → golden ginkgo trees |
source video |
man → Batman road → frozen lake sky → night sky |
source video |
man → The Flash black gloves→ red gloves gym→ stormy lightning night |
Source Video |
woman → Scarlet Witch sofa → a moonlit pond background → starry dark night |
To view the swap identity video, hover your mouse over the video.
source Video
left man → Iron Man
right man → Spider-Man
background → frosty yellow leaves
swap background → asphalt road with building under sky
source video
left man → Iron Man
right man → Batman
trees → crimson maple trees
road → snow covered road
source Video
left man→ Iron Man
right man → Spider-Man
ground →grassland
blue sky →raining sky
source video
man→ Spider-Man
woman → Wonder Woman
wall → charcoal grey wall
source Video
back man → Iron Man
front man → Batman
bridge, ground → snow covered
source video
left man → Iron Man
right man → Hulk
store → cyberpunk cityspace
Source Video |
man → Batman woman → Batwoman ground → rain soaked ground red wall → stormy lightning night |
Source Video |
left man→ Iron Man right man → Batman trees→ golden ginkgo trees |
Source Video |
left man→ Iron Man right man → Batman ground→ yellow floor |
---|
We compare EVA with the following video-editing methods: FateZero [1], ControlVideo [2], TokenFlow [3], and Ground-A-Video [4].
For fairness, all compared methods are equipped with ControlNet pose guidance.
Compared with previous methods, EVA achieves accurate text-to-attribute control while avoiding attention leakage.
Please scroll right for more comparisons.
Source Video | EVA (Ours) | FateZero | ControlVideo | TokenFlow | Ground-A-Video |
---|---|---|---|---|---|
Edit Prompt: "A Batman is playing tennis on snow covered court before an iced wall" | |||||
Edit Prompt: "An Iron Man is surfing with kite rope on a pink wave over blue sea under falling snow sky" | |||||
Edit Prompt: "A Spider Man is jumping on trampoline before a graffiti wall" | |||||
Edit Prompt: "A Batman on a motorcycle does a burnout on a frozen lake under the night sky" |
Please scroll right for more comparisons.
Source Video | EVA (Ours) | FateZero | ControlVideo | TokenFlow | Ground-A-Video |
---|---|---|---|---|---|
Edit Prompt: "An Iron man and a Spider Man running under frosty yellow trees with golden leaves on the ground" | |||||
Edit Prompt: "An Spider man and a Wonder Woman are playing badminton before charcoal grey wall" | |||||
Edit Prompt: "An Iron man pushes a Batman in a soap-box car on the snowy bridge over snow covered ground " | |||||
Edit Prompt: "An Iron man and Batman are jumping skateboard before golden ginkgo trees " |
In this section, we ablate key components of our method.
Source Video | w/o Latent Blend | w/o ControlNet | w/o ST-Layout Attn | Full Method |
---|---|---|---|---|
man → Batman, clay court → snow covered court, stone wall → an iced wall | ||||
man → Batman, road → frozen lake, sky → night sky | ||||
left man → Iron Man, right man → Hulk, store → cyberpunk cityspace | ||||
back man → Iron Man, front man → Batman, bridge, ground → snow covered |
In this section, we compare our Spatial-Temporal Layout-Guided Attention (ST-Layout Attn) with two alternative forms of layout-guided attention.
We first compare with the original Modulated Attention from DenseDiffusion [6]. To ensure fairness, we integrate DDIM Inversion and ControlNet for video editing and apply latent blending for background preservation. Modulated Attention serves as per-frame Layout Attention, but applying a text-to-image (T2I) method frame by frame leads to severe attention leakage.
Source Video | Modulated Attention | ST-Layout Attn | ||
---|---|---|---|---|
man→ Spider-Man, woman → Wonder Woman, wall → charcoal grey wall | ||||
left man → Iron Man, right man → Spider Man |
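The latent blending used for background preservation in this comparison can be illustrated with a minimal sketch. The page does not give the blending formula, so the mask-weighted mix below is an assumption based on the common practice of keeping edited latents inside the edit region and source latents outside it. The function name `blend_latents`, the array shapes, and the toy values are hypothetical.

```python
import numpy as np


def blend_latents(edited, source, edit_mask):
    """Mix two latents with a mask (assumed form, not EVA's confirmed code).

    edited, source: arrays of shape (C, H, W).
    edit_mask: array of shape (H, W); 1 inside the edited region, 0 in the
    background that should be preserved.
    """
    mask = np.asarray(edit_mask, dtype=edited.dtype)
    # The (H, W) mask broadcasts over the channel axis.
    return mask * edited + (1.0 - mask) * source


# Toy example: the edited latent is all ones, the source latent is all zeros.
edited = np.ones((4, 2, 2), dtype=np.float32)
source = np.zeros((4, 2, 2), dtype=np.float32)
edit_mask = np.array([[1, 0], [0, 1]])
blended = blend_latents(edited, source, edit_mask)
```

In the toy example, positions where the mask is 1 take the edited value and positions where it is 0 keep the source value, which is the behaviour one would expect from background preservation.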
We then expand the receptive field for positive/negative values from individual frames to a sparse set of frames, specifically the first and preceding frames. This variant, named Sparse-Causal Layout-guided Attention (SC-Layout Attn), can be viewed as an extension of per-frame Layout Attention. However, it still suffers from attention leakage and appearance inconsistency because of (1) the limited receptive field for negative values and (2) the reduced interaction across the full set of frames.
Source Video | SC-Layout Attn | ST-Layout Attn | ||
---|---|---|---|---|
left man→ Iron Man, right man → Batman, trees→ golden ginkgo trees | ||||
left man→ Iron Man, right man → Spider Man, trees, ground → frosty yellow leaves |
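The difference between the three attention variants compared above comes down to which frames a given frame can attend to. The sketch below lists those frame indices for each variant. It is an illustration under stated assumptions, not the authors' implementation: the function name `key_frames` and the mode strings are hypothetical, and the sparse-causal case is assumed to keep the current frame alongside the first and preceding frames.

```python
def key_frames(frame_idx: int, num_frames: int, mode: str) -> list[int]:
    """Return the frame indices that frame `frame_idx` may attend to."""
    if not 0 <= frame_idx < num_frames:
        raise ValueError("frame_idx out of range")
    if mode == "per_frame":
        # Per-frame layout attention: each frame only sees itself.
        return [frame_idx]
    if mode == "sparse_causal":
        # First frame, preceding frame, and (assumed) the current frame.
        return sorted({0, max(frame_idx - 1, 0), frame_idx})
    if mode == "spatial_temporal":
        # Full spatial-temporal attention: every frame in the clip.
        return list(range(num_frames))
    raise ValueError(f"unknown mode: {mode}")


per_frame = key_frames(5, 8, "per_frame")
sparse = key_frames(5, 8, "sparse_causal")
full = key_frames(5, 8, "spatial_temporal")
```

For frame 5 of an 8-frame clip, the sparse-causal variant sees only three frames while the spatial-temporal variant sees all eight, which matches the two limitations listed for SC-Layout Attn.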
[1] Chenyang Qi, Xiaodong Cun, Yong Zhang, Chenyang Lei, Xintao Wang, Ying Shan, and Qifeng Chen. FateZero: Fusing Attentions for Zero-shot Text-based Video Editing. ICCV, 2023.
[2] Yabo Zhang, Yuxiang Wei, Dongsheng Jiang, Xiaopeng Zhang, Wangmeng Zuo, and Qi Tian. ControlVideo: Training-free Controllable Text-to-Video Generation. ICLR, 2024.
[3] Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. TokenFlow: Consistent Diffusion Features for Consistent Video Editing. ICLR, 2024.
[4] Hyeonho Jeong and Jong Chul Ye. Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models. ICLR, 2024.
[5] Lvmin Zhang, Anyi Rao, and Maneesh Agrawala. Adding Conditional Control to Text-to-Image Diffusion Models. ICCV, 2023.
[6] Yunji Kim, Jiyoung Lee, Jin-Hwa Kim, Jung-Woo Ha, and Jun-Yan Zhu. Dense Text-to-Image Generation with Attention Modulation. ICCV, 2023.