Brief · PulseAugur

RESEARCH · arXiv cs.AI English(EN) · 1d · [2 sources]

Diffusion Transformer World-Action Model for AV Scene Prediction

Researchers have developed a Diffusion Transformer World-Action Model that predicts future scenes in autonomous vehicle (AV) environments. A compact latent world model forecasts scene latents up to 8 seconds ahead, and a decoder renders those latents into images. The approach significantly outperforms standard regression methods on prediction accuracy and realism, as measured by metrics such as Fréchet Inception Distance (FID) and Kernel Inception Distance (KID). The model also shows strong action controllability: planned steering inputs directly shift the displacements seen in the predicted scenes.
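
For readers unfamiliar with FID, the following is a minimal sketch of the Fréchet distance that underlies it, computed between two sets of feature vectors. It is not the paper's evaluation code. In a real FID computation the features come from a pretrained Inception network; here random arrays stand in for them, and the function name `frechet_distance` and the variables `real` and `shifted` are hypothetical.

```python
import numpy as np


def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input has shape (n_samples, n_features).
    """
    mu_a = feats_a.mean(axis=0)
    mu_b = feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs. Clipping guards against round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)


# Stand-in features (assumption): 500 samples with 8 dimensions each.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 8))
shifted = real + 0.5  # same covariance, mean moved by 0.5 in every dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

Lower values mean the two feature distributions are closer. In this toy case the covariances match, so the distance reduces to the squared distance between the means.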

IMPACT This model offers a more realistic and controllable approach to predicting future driving scenes, potentially improving AV planning and simulation capabilities.

Fréchet Inception Distance
Diffusion Transformer
nuScenes
V-JEPA2
Diffusion Transformer World-Action Model
AV Scene Prediction
Stable-Diffusion-VAE
Ruslan Sharifullin
Kernel Inception Distance