VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

1ReLER Lab, University of Technology Sydney, 2CCAI, Zhejiang University
ICLR 2025

📕 TL;DR: A zero-shot method for class-level, instance-level, and part-level video editing

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios.

Method

(1) We integrate ST-Layout Attn into the frozen SD for multi-grained editing, modulating self- and cross-attention in a unified manner. (2) In cross-attention, we treat each local prompt and its location as a positive pair, and the prompt and outside-location areas as negative pairs, enabling text-to-region control. (3) In self-attention, we enhance positive awareness within intra-regions and restrict negative interactions between inter-regions across frames, so that each query attends only to its target region, keeping features separated.
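To make (2) and (3) concrete, here is a minimal PyTorch sketch of the modulation, assuming per-region segmentation labels are available (the function names, the `alpha`/`beta` values, and the mask construction are illustrative assumptions, not the released implementation): positive query-key pairs receive a positive bias on the attention logits and negative pairs a negative one, in both cross- and self-attention.

```python
import torch

def modulate_attention(sim: torch.Tensor, pos_mask: torch.Tensor,
                       alpha: float = 0.3, beta: float = -0.3) -> torch.Tensor:
    """sim:      (heads, n_queries, n_keys) attention logits (Q K^T / sqrt(d)).
    pos_mask: (n_queries, n_keys) bool, True for positive query-key pairs.
    Amplify positive pairs and suppress negative ones before softmax."""
    bias = pos_mask.float() * alpha + (~pos_mask).float() * beta
    return (sim + bias).softmax(dim=-1)

def cross_attn_pos_mask(pixel_region: torch.Tensor,
                        token_region: torch.Tensor) -> torch.Tensor:
    """pixel_region: (n_queries,) region id of each spatial token.
    token_region: (n_keys,) region id each local-prompt token is bound to.
    Positive iff a pixel lies in the region its prompt token describes."""
    return pixel_region[:, None] == token_region[None, :]

def self_attn_pos_mask(pixel_region: torch.Tensor) -> torch.Tensor:
    """pixel_region: (n_frames * h * w,) region ids over all frames.
    Positive iff two tokens (possibly in different frames) share a region,
    keeping intra-region awareness and blocking inter-region interference."""
    return pixel_region[:, None] == pixel_region[None, :]
```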

Definition of Multi-grained Video Editing

Class level: editing objects within the same class (e.g., human class → Spiderman).
Instance level: editing each individual instance to a distinct object (e.g., left man → Spiderman, right man → Polar Bear).
Part level: applying part-level edits to specific elements of individual instances (e.g., left man → Spiderman, right man → Polar Bear + sunglasses).
A simple comparison on instance-level editing, prompt: "A Spiderman and a Polar Bear are jogging on grassy meadow before cherry trees". Baselines: TokenFlow (T2I, class-level), Ground-A-Video (T2I, instance-level), DMT (T2V-based), Pika (T2V-based).

VideoGrain Editing Results

Each example below lists an input video and its edited variants:
left man → Iron Man, right man → Spiderman, trees → cherry blossoms; identity swap + bg → asphalt road with building under cherry blossoms
left monkey → teddy bear, right monkey → koala; left monkey → teddy bear, right monkey → golden retriever
man → Spiderman, woman → Wonder Woman, wall → charcoal grey wall
man behind → Iron Man, man in front → Stormtrooper, ramp → mossy stone bridge, ground → lake, bg → forest
left man → Iron Man, right man → Batman, ground → snowy ground, trees → cherry blossoms
left cat → Samoyed, right cat → Tiger, background → sunrise; left cat → Panda, right cat → Toy Poodle, ground → grassy meadow, bg → starry night
left car → Fire Truck, right car → School Bus
man → Superman; man → Superman + a cap
man → Superman, ball → moon, trees → cherry blossoms; man → Superman + sunglasses
man → Ironman; man → Batman, clay court → snow court, wall → iced wall
car → Porsche
wolf → cute pig, forest → autumn forest; wolf → husky, forest → green forest; wolf → bear, forest → autumn forest; wolf → tiger, forest → autumn forest

Comparison with other video editing methods

Each comparison shows: source video, Ours, FateZero, ControlVideo, TokenFlow, Ground-A-Video, DMT.
Part level: "Thor in sunglasses, punching red boxing gloves in starry night sky"
Human instances: "An Iron Man and a monkey are riding bikes on the snowy ground under cherry blossoms"
Animal instances: "A Panda and a toy poodle are playing toys in starry night on grassy meadow"

Solely edit specific subjects, keeping the background unchanged

left man → Iron Man; right man → Spider Man; left → Iron Man + right → Spiderman

Part-level Modification Examples

shirt color: gray → blue; half-sleeve gray shirt → a black suit
head color: black → ginger; body color: black → ginger

VideoP2P with SAM-Track masks

Source prompt: "red man and gray man are jogging under green trees"
Edit prompt: "Spider Man and Polar Bear are jogging under cherry blossoms"
VideoP2P setting: Attention Replace on 3 subject words + Attention Reweight (Spider Man: 4, Polar Bear: 4, cherry blossoms: 2); see the reweighting sketch below.
We feed SAM-Track instance masks for the 3 areas into VideoP2P and compare its joint edit result with ours, as well as three sequential single edits:
1st edit - Attention Replace + Reweight (4): red man → Spider Man
2nd edit - Attention Replace + Reweight (4): gray man → Polar Bear
3rd edit - Attention Replace + Reweight (2): green trees → cherry blossoms
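For reference, Prompt-to-Prompt-style Attention Reweight, as used in the VideoP2P baseline above, amounts to scaling the cross-attention columns of the chosen words. A minimal sketch (hypothetical function, not VideoP2P's actual code):

```python
import torch

def reweight_cross_attention(attn: torch.Tensor, token_ids: list[int],
                             scale: float) -> torch.Tensor:
    """attn: (heads, n_pixels, n_tokens) cross-attention maps after softmax.
    token_ids: prompt positions of the target word(s), e.g. "Spider Man".
    scale: re-weighting factor (4 for the subjects above, 2 for the trees)."""
    attn = attn.clone()
    attn[..., token_ids] *= scale   # boost how strongly pixels read the word
    return attn
```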

Difference between P2P and VideoGrain

Cross-attention weight maps for the "spiderman", "bear", and "cherry" tokens: VideoP2P result vs. our result.
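Per-token weight maps like those above can be produced by averaging a layer's cross-attention over heads and upsampling the column of the chosen token. A minimal sketch (shapes and the normalization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def token_heatmap(cross_attn: torch.Tensor, token_id: int, h: int, w: int,
                  out_size: int = 512) -> torch.Tensor:
    """cross_attn: (heads, h*w, n_tokens) cross-attention from one UNet layer.
    Returns an upsampled [0, 1] heat map for one prompt token."""
    m = cross_attn.mean(0)[:, token_id].reshape(1, 1, h, w)   # average heads
    m = F.interpolate(m, size=(out_size, out_size), mode="bilinear",
                      align_corners=False)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)            # normalize
    return m[0, 0]
```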

Temporal Focus of ST-Layout Attn

Edit prompt: "Iron Man and Spiderman are jogging under green trees"
Variants compared: source video; per-frame ST-Layout Attn; sparse-causal ST-Layout Attn (first frame + previous frame); full-frames ST-Layout Attn. A sketch of the three key/value gathering modes follows below.
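To make the temporal variants concrete, here is a minimal sketch of which frames supply keys/values in each mode (the function name and tensor layout are illustrative assumptions, not the released code):

```python
import torch

def gather_kv_context(feats: torch.Tensor, mode: str = "sparse_causal") -> torch.Tensor:
    """feats: (n_frames, n_tokens, dim) spatial features per frame.

    Returns a (n_frames, k, n_tokens, dim) context tensor; frame i attends to
    the k frames in context[i] (flattened into k * n_tokens keys/values):
      "per_frame"     -> k = 1: each frame attends to itself only
      "sparse_causal" -> k = 2: first frame + previous frame
      "full"          -> k = n_frames: all frames jointly
    """
    f = feats.shape[0]
    if mode == "per_frame":
        return feats[:, None]                              # (f, 1, n, d)
    if mode == "sparse_causal":
        first = feats[:1].expand(f, -1, -1)                # frame 0 for all
        prev = torch.cat([feats[:1], feats[:-1]], dim=0)   # frame i-1 (i=0 -> itself)
        return torch.stack([first, prev], dim=1)           # (f, 2, n, d)
    return feats[None].expand(f, -1, -1, -1)               # (f, f, n, d)
```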
Edit Prompt: "A Iron Man is playing tennis on snow covered court "
source video Without ControlNet pose condition With ControlNet pose condition
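The pose-conditioned variant can be illustrated per frame with the public diffusers ControlNet API (a simplified per-frame illustration using standard public checkpoints, not the full VideoGrain pipeline; the pose-image path is hypothetical):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# OpenPose-conditioned ControlNet plugged into frozen SD 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Pose skeleton rendered from one source frame (hypothetical path).
pose_image = Image.open("pose_frame.png")
frame = pipe("An Iron Man is playing tennis on a snow-covered court",
             image=pose_image, num_inference_steps=50).images[0]
```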
Edit Prompt: "A Batman is playing tennis on snow covered court before an iced wall"
source video Cluster Masks from DDIM Inversion self attention Cluter mask result SAM-Track mask result
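One common way to obtain such cluster masks is to run K-means over the self-attention maps collected during DDIM inversion, using each token's attention distribution as its feature vector. A sketch under assumed shapes (the choice of layer/timestep, `n_regions`, and K-means itself are assumptions):

```python
import torch
from sklearn.cluster import KMeans

def cluster_masks(self_attn: torch.Tensor, n_regions: int, h: int, w: int) -> torch.Tensor:
    """self_attn: (heads, n_tokens, n_tokens) self-attention collected during
    DDIM inversion at one UNet layer, with n_tokens = h * w. Tokens with
    similar attention distributions are grouped into the same region."""
    feats = self_attn.mean(0)                     # average heads: (n_tokens, n_tokens)
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(feats.cpu().numpy())
    return torch.from_numpy(labels).reshape(h, w) # integer region map
```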