ICML 2026

MotiMotion: Motion-Controlled Video Generation with Visual Reasoning

Lee Hsin-Ying1*, Hanwen Jiang2, Yiqun Mei2, Jing Shi2, Ming-Hsuan Yang1, Zhixin Shu2

1University of California, Merced  2Adobe Research

*Work done as an intern at Adobe

MotiMotion reformulates motion control as a reasoning-then-generation problem. It refines sparse user trajectories, predicts plausible secondary motion, and uses confidence-aware control to balance faithfulness and realism.

Abstract

Reasoning over intent, causality, and motion plausibility

Current motion-controlled image-to-video models often treat user-provided trajectories as literal commands, even when those inputs are sparse, imprecise, or causally incomplete. MotiMotion introduces a visual-language reasoner that refines primary motion, hallucinates plausible secondary effects, and guides a confidence-aware video generator to produce more natural and physically grounded outcomes.

Method

Prompt and motion reasoning before generation

A training-free visual-language reasoner interprets the input image, trajectories, and optional text prompt to refine primary motion and hallucinate plausible secondary effects. These enriched plans guide a video generator to produce causally grounded videos.
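The reasoning-then-generation flow described above can be sketched as a small data structure plus a two-stage call. This is only an illustrative sketch: `MotionPlan`, `plan_then_generate`, and the `reasoner` and `generator` callables are hypothetical names introduced here, not the interface of the actual system.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]   # (x, y) in image coordinates
Trajectory = List[Point]      # one point per time step


@dataclass
class MotionPlan:
    """Hypothetical container for the enriched plan produced by the reasoner."""
    primary: List[Trajectory]                                   # refined user trajectories
    secondary: List[Trajectory] = field(default_factory=list)   # predicted secondary motion
    prompt: Optional[str] = None                                # optional text prompt


def plan_then_generate(
    image: object,
    user_trajectories: List[Trajectory],
    prompt: Optional[str],
    reasoner: Callable[[object, List[Trajectory], Optional[str]], MotionPlan],
    generator: Callable[[object, MotionPlan], object],
) -> object:
    """Stage 1: reason over image, trajectories, and prompt. Stage 2: generate from the plan."""
    plan = reasoner(image, user_trajectories, prompt)
    return generator(image, plan)
```

Keeping the reasoner and generator as separate callables mirrors the training-free design: under this assumption the reasoner can be swapped without touching the generator.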

Overview of the MotiMotion method
MotiMotion uses a training-free visual-language reasoner to refine user intent, infer missing motion, and improve controllability for complex interactions.

Method

Confidence-aware control that balances user intent and generative prior

We assign a confidence score to each trajectory and modulate the motion conditioning strength accordingly. High-confidence inputs are followed strictly; low-confidence inputs let the generative prior fill in natural dynamics.
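As a toy illustration of this trade-off, the sketch below blends a controlled trajectory with the path a generative prior would otherwise produce, weighted by confidence. It is an assumption-laden simplification: the actual method modulates conditioning strength inside the video generator, whereas `blend_trajectory` (a hypothetical helper) interpolates 2D points directly.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def blend_trajectory(control: List[Point], prior: List[Point], confidence: float) -> List[Point]:
    """Linearly interpolate between the prior path and the control path.

    confidence = 1.0 reproduces the control trajectory exactly;
    confidence = 0.0 leaves the prior's motion untouched.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if len(control) != len(prior):
        raise ValueError("trajectories must have the same length")
    return [
        (confidence * cx + (1.0 - confidence) * px,
         confidence * cy + (1.0 - confidence) * py)
        for (cx, cy), (px, py) in zip(control, prior)
    ]
```

For example, with `control = [(0.0, 0.0), (10.0, 0.0)]` and `prior = [(0.0, 0.0), (8.0, 2.0)]`, a confidence of 0.5 yields a path ending at `(9.0, 1.0)`, halfway between the two endpoints.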

Confidence-aware training and inference
Input image with user trajectories

Images from MoveBench and MotionEdit.

MotiBench

Interaction-centric benchmark scenes that motivate causal motion reasoning

Each sample pairs an image with a prompt that requires plausible causal effects and world-aware motion understanding.

Results

Generated samples with refined motion and causal interaction reasoning

Red lines indicate user trajectories. Blue lines indicate predicted trajectories.

Comparison

Baselines and Ablations

Confidence-Aware Control

High-confidence inputs follow plans strictly; low-confidence inputs stay elastic

We set user trajectories (red lines) to high confidence and compare the predicted trajectories (blue lines) under high and low confidence.

Citation

BibTeX

@inproceedings{hsinying2026motimotion,
  title={MotiMotion: Motion-Controlled Video Generation with Visual Reasoning},
  author={Hsin-Ying, Lee and Jiang, Hanwen and Mei, Yiqun and Shi, Jing and Yang, Ming-Hsuan and Shu, Zhixin},
  booktitle={Forty-third International Conference on Machine Learning},
  year={2026},
}