VideoGrain: Modulating Space-Time Attention for Multi-Grained Video Editing

1ReLER Lab, University of Technology Sydney, 2CCAI, Zhejiang University
ICLR 2025

📕 TL;DR: A zero-shot method for class-level, instance-level, and part-level video editing

Abstract

Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios.

Method

(1) We integrate ST-Layout Attn into the frozen SD for multi-grained editing, modulating self- and cross-attention in a unified manner. (2) In cross-attention, we treat each local prompt and its location as a positive pair, and the prompt and outside-location areas as negative pairs, enabling text-to-region control. (3) In self-attention, we enhance positive awareness within intra-regions and restrict negative interactions between inter-regions across frames, so that each query attends only to its target region, keeping features separated.
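To make (2) and (3) concrete, here is a minimal PyTorch sketch of the modulation, assuming per-region segmentation labels are available (the function names, the `alpha`/`beta` values, and the mask construction are illustrative assumptions, not the released implementation): positive query-key pairs receive a positive bias on the attention logits and negative pairs a negative one, in both cross- and self-attention.

```python
import torch

def modulate_attention(sim: torch.Tensor, pos_mask: torch.Tensor,
                       alpha: float = 0.3, beta: float = -0.3) -> torch.Tensor:
    """sim:      (heads, n_queries, n_keys) attention logits (Q K^T / sqrt(d)).
    pos_mask: (n_queries, n_keys) bool, True for positive query-key pairs.
    Amplify positive pairs and suppress negative ones before softmax."""
    bias = pos_mask.float() * alpha + (~pos_mask).float() * beta
    return (sim + bias).softmax(dim=-1)

def cross_attn_pos_mask(pixel_region: torch.Tensor,
                        token_region: torch.Tensor) -> torch.Tensor:
    """pixel_region: (n_queries,) region id of each spatial token.
    token_region: (n_keys,) region id each local-prompt token is bound to.
    Positive iff a pixel lies in the region its prompt token describes."""
    return pixel_region[:, None] == token_region[None, :]

def self_attn_pos_mask(pixel_region: torch.Tensor) -> torch.Tensor:
    """pixel_region: (n_frames * h * w,) region ids over all frames.
    Positive iff two tokens (possibly in different frames) share a region,
    keeping intra-region awareness and blocking inter-region interference."""
    return pixel_region[:, None] == pixel_region[None, :]
```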

Definition of Multi-grained Video Editing

Class level: editing objects within the same class (e.g., human class → Spiderman).
Instance level: editing each individual instance to a distinct object (e.g., left man → Spiderman, right man → Polar Bear).
Part level: applying part-level edits to specific elements of individual instances (e.g., left man → Spiderman, right man → Polar Bear + sunglasses).
A simple comparison on instance-level editing, prompt: "A Spiderman and a Polar Bear are jogging on grassy meadow before cherry trees". Baselines: TokenFlow (T2I, class-level), Ground-A-Video (T2I, instance-level), DMT (T2V-based), Pika (T2V-based).

VideoGrain Editing Results

Each example below lists an input video and its edited variants:
left man → Iron Man, right man → Spiderman, trees → cherry blossoms; identity swap + bg → asphalt road with building under cherry blossoms
left monkey → teddy bear, right monkey → koala; left monkey → teddy bear, right monkey → golden retriever
man → Spiderman, woman → Wonder Woman, wall → charcoal grey wall
man behind → Iron Man, man in front → Stormtrooper, ramp → mossy stone bridge, ground → lake, bg → forest
left man → Iron Man, right man → Batman, ground → snowy ground, trees → cherry blossoms
left cat → Samoyed, right cat → Tiger, background → sunrise; left cat → Panda, right cat → Toy Poodle, ground → grassy meadow, bg → starry night
left car → Fire Truck, right car → School Bus
man → Superman; man → Superman + a cap
man → Superman, ball → moon, trees → cherry blossoms; man → Superman + sunglasses
man → Ironman; man → Batman, clay court → snow court, wall → iced wall
car → Porsche
wolf → cute pig, forest → autumn forest; wolf → husky, forest → green forest; wolf → bear, forest → autumn forest; wolf → tiger, forest → autumn forest

Comparison with other video editing methods

Each comparison shows: source video, Ours, FateZero, ControlVideo, TokenFlow, Ground-A-Video, DMT.
Part level: "Thor in sunglasses, punching red boxing gloves in starry night sky"
Human instances: "An Iron Man and a monkey are riding bikes on the snowy ground under cherry blossoms"
Animal instances: "A Panda and a toy poodle are playing toys in starry night on grassy meadow"

Solely edit specific subjects, keeping the background unchanged

left man → Iron Man; right man → Spider Man; left → Iron Man + right → Spiderman

Part-level Modification Examples

shirt color: gray → blue; half-sleeve gray shirt → a black suit
head color: black → ginger; body color: black → ginger

VideoP2P with SAM-Track masks

Source prompt: "red man and gray man are jogging under green trees"
Edit prompt: "Spider Man and Polar Bear are jogging under cherry blossoms"
VideoP2P setting: Attention Replace on 3 subject words + Attention Reweight (Spider Man: 4, Polar Bear: 4, cherry blossoms: 2); see the reweighting sketch below.
We feed SAM-Track instance masks for the 3 areas into VideoP2P and compare its joint edit result with ours, as well as three sequential single edits:
1st edit - Attention Replace + Reweight (4): red man → Spider Man
2nd edit - Attention Replace + Reweight (4): gray man → Polar Bear
3rd edit - Attention Replace + Reweight (2): green trees → cherry blossoms
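For reference, Prompt-to-Prompt-style Attention Reweight, as used in the VideoP2P baseline above, amounts to scaling the cross-attention columns of the chosen words. A minimal sketch (hypothetical function, not VideoP2P's actual code):

```python
import torch

def reweight_cross_attention(attn: torch.Tensor, token_ids: list[int],
                             scale: float) -> torch.Tensor:
    """attn: (heads, n_pixels, n_tokens) cross-attention maps after softmax.
    token_ids: prompt positions of the target word(s), e.g. "Spider Man".
    scale: re-weighting factor (4 for the subjects above, 2 for the trees)."""
    attn = attn.clone()
    attn[..., token_ids] *= scale   # boost how strongly pixels read the word
    return attn
```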

Difference between P2P and VideoGrain

Cross-attention weight maps for the "spiderman", "bear", and "cherry" tokens: VideoP2P result vs. our result.
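Per-token weight maps like those above can be produced by averaging a layer's cross-attention over heads and upsampling the column of the chosen token. A minimal sketch (shapes and the normalization are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def token_heatmap(cross_attn: torch.Tensor, token_id: int, h: int, w: int,
                  out_size: int = 512) -> torch.Tensor:
    """cross_attn: (heads, h*w, n_tokens) cross-attention from one UNet layer.
    Returns an upsampled [0, 1] heat map for one prompt token."""
    m = cross_attn.mean(0)[:, token_id].reshape(1, 1, h, w)   # average heads
    m = F.interpolate(m, size=(out_size, out_size), mode="bilinear",
                      align_corners=False)
    m = (m - m.min()) / (m.max() - m.min() + 1e-8)            # normalize
    return m[0, 0]
```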

Temporal Focus of ST-Layout Attn

Edit prompt: "Iron Man and Spiderman are jogging under green trees"
Variants compared: source video; per-frame ST-Layout Attn; sparse-causal ST-Layout Attn (first frame + previous frame); full-frames ST-Layout Attn. A sketch of the three key/value gathering modes follows below.
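To make the temporal variants concrete, here is a minimal sketch of which frames supply keys/values in each mode (the function name and tensor layout are illustrative assumptions, not the released code):

```python
import torch

def gather_kv_context(feats: torch.Tensor, mode: str = "sparse_causal") -> torch.Tensor:
    """feats: (n_frames, n_tokens, dim) spatial features per frame.

    Returns a (n_frames, k, n_tokens, dim) context tensor; frame i attends to
    the k frames in context[i] (flattened into k * n_tokens keys/values):
      "per_frame"     -> k = 1: each frame attends to itself only
      "sparse_causal" -> k = 2: first frame + previous frame
      "full"          -> k = n_frames: all frames jointly
    """
    f = feats.shape[0]
    if mode == "per_frame":
        return feats[:, None]                              # (f, 1, n, d)
    if mode == "sparse_causal":
        first = feats[:1].expand(f, -1, -1)                # frame 0 for all
        prev = torch.cat([feats[:1], feats[:-1]], dim=0)   # frame i-1 (i=0 -> itself)
        return torch.stack([first, prev], dim=1)           # (f, 2, n, d)
    return feats[None].expand(f, -1, -1, -1)               # (f, f, n, d)
```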
Edit Prompt: "A Iron Man is playing tennis on snow covered court "
source video Without ControlNet pose condition With ControlNet pose condition
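The pose-conditioned variant can be illustrated per frame with the public diffusers ControlNet API (a simplified per-frame illustration using standard public checkpoints, not the full VideoGrain pipeline; the pose-image path is hypothetical):

```python
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# OpenPose-conditioned ControlNet plugged into frozen SD 1.5.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# Pose skeleton rendered from one source frame (hypothetical path).
pose_image = Image.open("pose_frame.png")
frame = pipe("An Iron Man is playing tennis on a snow-covered court",
             image=pose_image, num_inference_steps=50).images[0]
```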
Edit Prompt: "A Batman is playing tennis on snow covered court before an iced wall"
source video Cluster Masks from DDIM Inversion self attention Cluter mask result SAM-Track mask result
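One common way to obtain such cluster masks is to run K-means over the self-attention maps collected during DDIM inversion, using each token's attention distribution as its feature vector. A sketch under assumed shapes (the choice of layer/timestep, `n_regions`, and K-means itself are assumptions):

```python
import torch
from sklearn.cluster import KMeans

def cluster_masks(self_attn: torch.Tensor, n_regions: int, h: int, w: int) -> torch.Tensor:
    """self_attn: (heads, n_tokens, n_tokens) self-attention collected during
    DDIM inversion at one UNet layer, with n_tokens = h * w. Tokens with
    similar attention distributions are grouped into the same region."""
    feats = self_attn.mean(0)                     # average heads: (n_tokens, n_tokens)
    labels = KMeans(n_clusters=n_regions, n_init=10).fit_predict(feats.cpu().numpy())
    return torch.from_numpy(labels).reshape(h, w) # integer region map
```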