DecMem

Towards Minute-Long Consistent World Generation with Decoupled Memory

Zhenhao Yang¹, Xiaoshi Wu², Zhengyao Lv¹, Xiaoyu Shi²†, Xintao Wang², Pengfei Wan², Kun Gai², Kwan-Yee K. Wong¹†

¹ The University of Hong Kong ² Kling Team, Kuaishou Technology

† Corresponding Author


Abstract

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation.

We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.

Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.

Motivation

Overview of DecMem motivation.

(a) Quality and Consistency: Visual quality and spatio-temporal consistency of different long-horizon extrapolation methods (memory bank initialized with 221 frames). Prior methods fail to jointly preserve fidelity and consistency, while ours breaks this trade-off and sustains fine-grained memory under long rollouts.

(b) Computational Efficiency: Generation latency of our method and naïve Dense Attention (memory bank initialized with 221 frames). Our sparse block retrieval substantially reduces cost without sacrificing quality.

(c) Learnable Block Retrieval: Comparison of our learnable block retrieval against FOV-based frame retrieval (e.g., WorldMem).

Method

Overview of DecMem architecture.

(a) Sparse Global Memory (SGM): Combines a block-level sparse retrieval module with a context-aware attention module to retrieve fine-grained long-term memory in an end-to-end manner, enabling efficient access to global history with bounded cost.

(b) Anchored Local Memory (ALM): Keeps short-term transitions smooth and supports stable, high-quality extrapolation by anchoring generation on recent local context.

(c) Decoupled Pipeline: DecMem decouples memory into SGM and ALM, providing long-term consistency and extrapolation generalization while keeping computational cost low.

For clearer visualization, we display 3 frame latents as key & value and the last frame indexed by t as query; each frame contains 2 blocks with 2 tokens per block.
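To make the block-level retrieval idea concrete, the following is a minimal NumPy sketch on the same toy layout (3 memory frames, 2 blocks per frame, 2 tokens per block, with the last frame as query). It is an illustration under stated assumptions, not the DecMem implementation: the function name `sparse_block_attention`, the mean-pooled block descriptors, the dot-product block scoring, and the `top_k` parameter are all hypothetical stand-ins for the learnable retrieval module described above.

```python
import numpy as np

def sparse_block_attention(query, mem_keys, mem_values, top_k):
    """Attend from query tokens to the top_k most relevant memory blocks.

    query:      (num_query_tokens, dim)
    mem_keys:   (num_blocks, tokens_per_block, dim)
    mem_values: (num_blocks, tokens_per_block, dim)
    """
    num_blocks, tokens_per_block, dim = mem_keys.shape

    # Assumption: each block is summarized by the mean of its key tokens.
    block_desc = mem_keys.mean(axis=1)               # (num_blocks, dim)

    # Assumption: blocks are scored by a dot product with the mean query token.
    block_scores = block_desc @ query.mean(axis=0)   # (num_blocks,)

    # Keep the top_k highest-scoring blocks, in ascending index order.
    selected = np.sort(np.argsort(block_scores)[-top_k:])

    # Gather only the tokens of the selected blocks.
    k = mem_keys[selected].reshape(-1, dim)          # (top_k * tokens_per_block, dim)
    v = mem_values[selected].reshape(-1, dim)

    # Standard scaled dot-product attention over the selected tokens.
    logits = query @ k.T / np.sqrt(dim)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ v, selected

# Toy layout from the figure: 3 memory frames, 2 blocks per frame, 2 tokens per block.
rng = np.random.default_rng(0)
frames, blocks_per_frame, tokens_per_block, dim = 3, 2, 2, 8
mem_keys = rng.normal(size=(frames * blocks_per_frame, tokens_per_block, dim))
mem_values = rng.normal(size=mem_keys.shape)
query = rng.normal(size=(blocks_per_frame * tokens_per_block, dim))  # tokens of frame t

out, selected = sparse_block_attention(query, mem_keys, mem_values, top_k=2)
```

In this sketch the query attends to `top_k * tokens_per_block` memory tokens (4 here) rather than all 12, so the attention cost depends on `top_k` and not on the total memory length. This is one way to read the "bounded cost" property; the actual retrieval and attention modules may differ.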

Long World Simulation

Minute-long video generation results, with the memory bank initialized with 221 frames. The left side shows the ground truth (GT). On the right side, frames highlighted with red boxes are predicted frames, while frames without red boxes are memory frames.

BibTeX

@misc{yang2026decmemminutelongconsistentworld,
  title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory},
  author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
  year={2026},
  eprint={2605.31336},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2605.31336},
}