Recent advancements in diffusion models have significantly improved video generation and editing capabilities. However, multi-grained video editing, which encompasses class-level, instance-level, and part-level modifications, remains a formidable challenge. The major difficulties in multi-grained editing include semantic misalignment of text-to-region control and feature coupling within the diffusion model. To address these difficulties, we present VideoGrain, a zero-shot approach that modulates space-time (cross- and self-) attention mechanisms to achieve fine-grained control over video content. We enhance text-to-region control by amplifying each local prompt's attention to its corresponding spatial-disentangled region while minimizing interactions with irrelevant areas in cross-attention. Additionally, we improve feature separation by increasing intra-region awareness and reducing inter-region interference in self-attention. Extensive experiments demonstrate our method achieves state-of-the-art performance in real-world scenarios.
(1) We integrate ST-Layout Attn into the frozen Stable Diffusion (SD) model for multi-grained editing, modulating self- and cross-attention in a unified manner.
(2) In cross-attention, we treat each local prompt and its location as a positive pair, and the prompt and areas outside that location as negative pairs, enabling text-to-region control.
(3) In self-attention, we enhance positive awareness within each region and restrict negative interactions between regions across frames, so that each query attends only to its target region and features stay separated.
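The positive/negative pairing described above can be illustrated with a small NumPy sketch. It is not the released implementation: the additive bias of `+strength` / `-strength` on the attention logits, the function names, and the integer region ids are all assumptions made for illustration. It only shows how a region layout can raise attention inside matching regions and suppress it elsewhere, for both cross-attention (pixel to prompt token) and self-attention (pixel to pixel, with pixels from all frames flattened together).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_pairs(region_of_pixel, region_of_token):
    # Hypothetical helper: pixel i and text token j form a positive pair
    # when the token's local prompt is assigned to the pixel's region.
    return region_of_pixel[:, None] == region_of_token[None, :]

def self_attention_pairs(region_of_pixel):
    # Hypothetical helper: two pixels (possibly from different frames,
    # flattened into one axis) are a positive pair when they share a region.
    return region_of_pixel[:, None] == region_of_pixel[None, :]

def modulated_attention(q, k, pair_mask, strength=5.0):
    # q: (Nq, d), k: (Nk, d), pair_mask: (Nq, Nk) boolean.
    # Assumed modulation: add +strength to positive pairs and
    # -strength to negative pairs before the softmax.
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)
    bias = np.where(pair_mask, strength, -strength)
    return softmax(logits + bias, axis=-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two frames of three pixels each, flattened; region ids 0, 1, 2.
    regions = np.array([0, 1, 2, 0, 1, 2])
    q = rng.normal(size=(6, 4))
    k = rng.normal(size=(6, 4))
    mask = self_attention_pairs(regions)
    plain = modulated_attention(q, k, mask, strength=0.0)
    layout = modulated_attention(q, k, mask, strength=5.0)
    # Share of each query's attention that stays inside its own region.
    print((plain * mask).sum(axis=1))
    print((layout * mask).sum(axis=1))
```

With `strength=0.0` the function reduces to ordinary scaled dot-product attention, so the two printed vectors compare the in-region attention mass before and after modulation.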
Class level: editing objects within the same class | Instance level: editing each individual instance into a distinct object | Part level: applying part-level edits to specific elements of individual instances |
---|---|---|
Human class → Spiderman | left man → Spiderman, right man → Polar Bear | left man → Spiderman, right man → Polar Bear + Sunglasses |
T2I class-level: TokenFlow | T2I instance-level: Ground-A-Video | T2V-based: DMT | T2V-based: Pika |
---|---|---|---|
A simple comparison on instance editing: "A Spiderman and a Polar Bear are jogging on grassy meadow before cherry trees" |
input video | left man→ Iron Man, right man→ Spiderman, trees→ cherry blossoms | Swap identity+bg→ asphalt road with building under cherry blossoms | input video | left monkey→ teddy bear, right monkey→ koala | left monkey→ teddy bear, right monkey→ golden retriever |
---|---|---|---|---|---|
input video | man→ Spiderman, woman→ Wonder Woman, wall→ charcoal grey wall | input video | man behind→ Iron Man, man in front→ Stormtrooper, ramp→ mossy stone bridge, ground→ lake, bg→ forest |
input video | left man→ Iron Man, right man→ Batman, ground→ snowy ground, trees→ cherry blossoms |
input video | left cat→ Samoyed, right cat→ Tiger, background→ sunrise | left cat→ Panda, right cat→ Toy Poodle, ground→ grassy meadow, bg→ starry night | input video | left car→ Fire Truck, right car→ School Bus |
---|
input video | man→ Superman | man→ Superman + a cap | input video | man→ Superman, ball→ moon, trees→ cherry blossoms | man→ Superman + sunglasses |
---|
input video | man→ Ironman | man→ Batman, clay court→ snow court, wall→ iced wall | input video | car→ Porsche |
---|---|---|---|---|
input video | wolf→ cute pig, forest→ autumn forest | wolf→ husky, forest→ green forest | wolf→ bear, forest→ autumn forest | wolf→ tiger, forest→ autumn forest |
Source video | Ours | FateZero | ControlVideo | TokenFlow | Ground-A-Video | DMT |
---|---|---|---|---|---|---|
left man → Iron Man | right man → Spider Man | left→ Iron Man+ right→ Spiderman |
---|
source video | shirt color: gray→ blue | half-sleeve gray shirt→ a black suit | source video | head color: black→ ginger | body color: black→ ginger |
---|
Source prompt: red man and gray man are jogging under green trees. Edit prompt: Spider Man and Polar Bear are jogging under cherry blossoms. VideoP2P setting: Attention Replace on 3 subject words + Attention Reweight (Spider Man: 4, Polar Bear: 4, cherry blossoms: 2) |
|||
---|---|---|---|
source video | SAM-Track instance masks (3 regions) input to VideoP2P | VideoP2P joint edit result | Our edit result |
1st edit input | VideoP2P 1st edit result, Attention Replace + Reweight(4): red man → Spider Man | 2nd edit input | VideoP2P 2nd edit result, Attention Replace + Reweight(4): gray man → Polar Bear | 3rd edit input | VideoP2P 3rd edit result, Attention Replace + Reweight(2): green trees → cherry blossoms |
---|---|---|---|---|---|
VideoP2P result | "spiderman" weight | "bear" weight | "cherry" weight |
---|---|---|---|
Our result | "spiderman" weight | "bear" weight | "cherry" weight |
source video | Per-frame ST-Layout Attn | Sparse-Causal ST-Layout Attn (first frame + previous frame) | Full-frame ST-Layout Attn |
---|---|---|---|
source video | Without ControlNet pose condition | With ControlNet pose condition |
---|
source video | Cluster masks from DDIM inversion self-attention | Cluster mask result | SAM-Track mask result |
---|