VideoMDM — 3D Human Motion Generation from 2D Supervision

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers — velocity consistency and over-parameterized representation alignment — to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on real video datasets Fit3D and NBA the method learns to generate motions consistently preferred by humans, with strong quantitative results.

Method

We are given monocular human-motion videos with accurate 2D joint trajectories but no 3D ground truth. A pretrained 2D-to-3D lifter produces approximate 3D pose sequences that serve as a noisy teacher. As in standard diffusion training, we diffuse these lifted 3D estimates and train a denoiser to recover the clean 3D motion — but supervision is applied entirely in 2D by reprojecting the prediction and comparing against the accurate keypoints.
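The training step described above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the released implementation: the cosine noise schedule, the unit-focal pinhole camera, the 22-joint skeleton, and all function names (`cosine_alpha_bar`, `q_sample`, `project`) are hypothetical choices made for the example, and the denoiser is replaced by a placeholder.

```python
import numpy as np

def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal level alpha_bar_t of a cosine noise schedule (assumed)."""
    t = np.linspace(0, num_steps, num_steps + 1) / num_steps
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return (f / f[0])[1:]

def q_sample(x0, t, alpha_bar, rng):
    """Diffuse a clean sample x0 to step t: sqrt(ab) * x0 + sqrt(1 - ab) * eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space points (..., 3) to the image plane (..., 2)."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]

rng = np.random.default_rng(0)
frames, joints = 8, 22

# Stand-ins for the lifter output and the accurate 2D keypoints of one clip.
lifted_3d = rng.normal(size=(frames, joints, 3)) * 0.3 + np.array([0.0, 0.0, 4.0])
keypoints_2d = project(lifted_3d)

# 1) Diffuse the lifted (noisy-teacher) 3D poses.
alpha_bar = cosine_alpha_bar(1000)
x_t = q_sample(lifted_3d, t=500, alpha_bar=alpha_bar, rng=rng)

# 2) A denoiser would map (x_t, t) to a clean 3D prediction; placeholder here.
pred_3d = lifted_3d

# 3) Supervise in 2D: reproject the prediction and compare with the keypoints.
loss_2d = np.mean((project(pred_3d) - keypoints_2d) ** 2)
```

The point of the sketch is the data flow: noise is added and removed in 3D, while the loss only ever sees 2D quantities.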

A depth-aware weighting of the reprojection loss is, under mild assumptions, provably equivalent in expectation to direct 3D MSE supervision. We further adapt two standard 3D motion regularizers: a depth-weighted 2D velocity loss for temporal coherence, and a representation alignment loss that supervises the over-parameterized motion channels — joint rotations, velocities, and foot contacts — via ray-projection pseudo-targets derived from the predicted motion and the observed 2D keypoints.
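The intuition behind depth weighting can be checked numerically. Under a pinhole camera with focal length f, a lateral 3D error of size d at depth Z appears as a 2D error of f·d/Z, so scaling the 2D residual by Z/f expresses it in 3D units again. The sketch below illustrates only that intuition; the exact weighting and the assumptions behind the equivalence result are not spelled out on this page, so the weighting by predicted depth, the unit focal length, and the function names are assumptions.

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space points (..., 3) to (..., 2)."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]

def depth_weighted_reproj_loss(pred_3d, kp_2d, focal=1.0):
    """Squared 2D residual, scaled by predicted depth / focal (assumed form)."""
    depth = pred_3d[..., 2:3]
    residual = (project(pred_3d, focal) - kp_2d) * depth / focal
    return np.mean(np.sum(residual ** 2, axis=-1))

def depth_weighted_velocity_loss(pred_3d, kp_2d, focal=1.0):
    """Same weighting applied to frame-to-frame 2D differences (axis 0 is time)."""
    depth = pred_3d[1:, ..., 2:3]
    pred_vel = np.diff(project(pred_3d, focal), axis=0)
    obs_vel = np.diff(kp_2d, axis=0)
    residual = (pred_vel - obs_vel) * depth / focal
    return np.mean(np.sum(residual ** 2, axis=-1))

frames, joints = 4, 3
offset = np.array([0.1, 0.0, 0.0])  # the same lateral 3D error at both depths

gt_near = np.zeros((frames, joints, 3)); gt_near[..., 2] = 2.0
gt_far = np.zeros((frames, joints, 3)); gt_far[..., 2] = 8.0

# Depth-weighted losses agree across depths (both equal the squared 3D error).
loss_near = depth_weighted_reproj_loss(gt_near + offset, project(gt_near))
loss_far = depth_weighted_reproj_loss(gt_far + offset, project(gt_far))

# Unweighted 2D losses shrink with distance for the same 3D error.
plain_near = np.mean(np.sum((project(gt_near + offset) - project(gt_near)) ** 2, axis=-1))
plain_far = np.mean(np.sum((project(gt_far + offset) - project(gt_far)) ** 2, axis=-1))

# A constant offset does not change 2D velocities, so the velocity loss vanishes.
vel_near = depth_weighted_velocity_loss(gt_near + offset, project(gt_near))
```

In this toy setup the weighted loss is identical at depths 2 and 8, while the unweighted loss differs by a factor of 16, which is why an unweighted 2D loss would under-penalize errors on distant subjects.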

Figure: VideoMDM training pipeline. **VideoMDM training.** From monocular video, we extract accurate 2D keypoints and approximate 3D poses. A motion diffusion model is trained to denoise the 3D poses under multi-source supervision: (i) 3D representation alignment, and (ii) 2D reprojection and velocity consistency with the accurate 2D pose.

Figure: Motion Representation Alignment illustration. **Motion Representation Alignment.** Motion generation models commonly adopt an over-parameterized representation that includes joint rotations, velocities, and foot-contact labels alongside joint positions. These redundant channels, which are derivable from positions, cannot be directly supervised from 2D observations. To supervise them, each predicted 3D joint is projected to the nearest point on the ray through its observed 2D location, yielding a 2D-consistent pseudo-target from which the redundant channels can be computed and used as supervision.
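The ray-projection step is plain geometry and can be sketched as follows. The sketch assumes a pinhole camera at the origin with unit focal length and keypoints in normalized image coordinates; the function names are hypothetical, and only velocities are derived here (rotations and foot contacts would need the inverse kinematics and contact rules of the specific motion representation, which are not described on this page).

```python
import numpy as np

def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space points (..., 3) to (..., 2)."""
    return focal * points_3d[..., :2] / points_3d[..., 2:3]

def ray_project(pred_3d, kp_2d, focal=1.0):
    """Nearest point to each predicted joint on the ray through its 2D keypoint."""
    # Ray direction through the camera centre and the pixel (u, v, focal).
    f_col = np.full(kp_2d.shape[:-1] + (1,), focal)
    dirs = np.concatenate([kp_2d, f_col], axis=-1)
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    # Orthogonal projection of the prediction onto the ray.
    along = np.sum(pred_3d * dirs, axis=-1, keepdims=True)
    along = np.maximum(along, 1e-6)  # keep the point in front of the camera
    return along * dirs

rng = np.random.default_rng(0)
pred_3d = rng.normal(size=(5, 4, 3)) * 0.3 + np.array([0.0, 0.0, 4.0])
kp_2d = project(pred_3d) + rng.normal(scale=0.02, size=(5, 4, 2))

# Pseudo-targets reproject exactly onto the observed keypoints.
pseudo_3d = ray_project(pred_3d, kp_2d)

# Example of a derived channel: per-joint velocities from the pseudo-targets.
pseudo_vel = np.diff(pseudo_3d, axis=0)
```

Because each pseudo-target lies on the viewing ray of its keypoint, it agrees with the 2D observation while staying as close as possible to the model's own 3D prediction.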

Text-to-3D Motion from 2D Poses — HumanML3D

To isolate the effect of 2D supervision from pose-estimation noise, we construct a 2D-only version of HumanML3D by projecting MoCap sequences to random cameras and lifting them back with a pretrained lifter. VideoMDM is trained using only these 2D projections and evaluated against the 3D MoCap ground truth.
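Projecting a MoCap clip to a random camera can be sketched as below. The camera model is an assumption made for illustration: a y-up world, a random yaw around the vertical axis, a fixed camera distance, and a unit-focal pinhole projection. The page does not specify how cameras are actually sampled, and the function names are hypothetical.

```python
import numpy as np

def yaw_rotation(angle):
    """Rotation about the vertical (y) axis by the given angle in radians."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])

def project_to_random_camera(motion_3d, rng, distance=4.0, focal=1.0):
    """Return 2D keypoints and camera-space 3D joints for one random view."""
    yaw = rng.uniform(0.0, 2.0 * np.pi)
    rot = yaw_rotation(yaw)
    # World -> camera: rotate around the subject, then push it along +z.
    cam_3d = motion_3d @ rot.T + np.array([0.0, 0.0, distance])
    kp_2d = focal * cam_3d[..., :2] / cam_3d[..., 2:3]
    return kp_2d, cam_3d

rng = np.random.default_rng(0)
# Stand-in for a MoCap clip: (frames, joints, xyz), roughly centred at the origin.
mocap = rng.uniform(-1.0, 1.0, size=(16, 22, 3))
kp_2d, cam_3d = project_to_random_camera(mocap, rng)
```

The resulting `kp_2d` would play the role of the 2D training signal, while the original MoCap sequence is held out for evaluation only.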

VideoMDM reaches an FID of 0.88, compared with 0.54 for fully 3D-supervised MDM.

Prompts compared across MDM on MVLift and VideoMDM (Ours):
"a person waves with his right hand."
"the person walks backwards in a straight line."

Generation from Real Fitness Videos — Fit3D

VideoMDM is fine-tuned on Fit3D — real monocular fitness videos with no 3D supervision — using WHAM as the noisy teacher lifter. The fitness exercises (deadlifts, mule kicks, push-ups) are far outside the distribution of any MoCap-based lifter, directly testing the framework's ability to learn motions the teacher has never seen.

Exercises compared across WHAM Lift, MDM on WHAM, and VideoMDM (Ours):
Deadlifts
Diamond push-ups
Barbell dead rows

Citation

@misc{mann2026videomdm,
  title         = {VideoMDM: Towards 3D Human Motion Generation From 2D Supervision},
  author        = {Amir Mann and Gal Harari and Merav Keidar and Or Litany},
  year          = {2026},
  eprint        = {2606.13364},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2606.13364}
}