DecMem

Towards Minute-Long Consistent World Generation with Decoupled Memory

Zhenhao Yang1, Xiaoshi Wu2, Zhengyao Lv1, Xiaoyu Shi2†, Xintao Wang2, Pengfei Wan2, Kun Gai2, Kwan-Yee K. Wong1†

1 The University of Hong Kong   2 Kling Team, Kuaishou Technology

† Corresponding Author

Abstract

Recent advances in video generative models have driven rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation.

We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation.

Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.

Motivation

Figure 1: Motivation

Overview of the motivation for DecMem.

(a) Quality and Consistency: Visual quality and spatio-temporal consistency of different long-horizon extrapolation methods (memory bank initialized with 221 frames). Prior methods fail to jointly preserve fidelity and consistency, while ours breaks this trade-off and sustains fine-grained memory under long rollouts.

(b) Computational Efficiency: Generation latency of our method and naïve Dense Attention (memory bank initialized with 221 frames). Our sparse block retrieval substantially reduces cost without sacrificing quality.

(c) Learnable Block Retrieval: Comparison of our learnable block retrieval against FOV-based frame retrieval (e.g., WorldMem).
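The efficiency argument in panel (b) can be illustrated with a back-of-the-envelope count of query-key pairs. The sketch below is only an illustration under stated assumptions: the 221-frame memory bank comes from the captions above, while the tokens per frame, block size, number of retrieved blocks, and the block-summary scoring step are hypothetical values and choices, not figures from the paper. It counts attention pairs, not the measured latency shown in the figure.

```python
def dense_pairs(memory_frames, tokens_per_frame, query_tokens):
    """Query-key pairs when every query token attends to every memory token."""
    return query_tokens * memory_frames * tokens_per_frame


def sparse_pairs(memory_frames, tokens_per_frame, query_tokens,
                 tokens_per_block, top_k_blocks):
    """Query-key pairs under a hypothetical block-retrieval scheme.

    Assumed scheme: each query block is scored against one summary vector
    per memory block, then every query token attends only to the tokens of
    the top_k_blocks retrieved blocks.
    """
    memory_blocks = memory_frames * tokens_per_frame // tokens_per_block
    query_blocks = query_tokens // tokens_per_block
    scoring = query_blocks * memory_blocks          # grows with memory, but slowly
    attending = query_tokens * top_k_blocks * tokens_per_block  # fixed by top_k_blocks
    return scoring + attending


# 221 memory frames is from the captions; every other number is hypothetical.
dense = dense_pairs(memory_frames=221, tokens_per_frame=256, query_tokens=256)
sparse = sparse_pairs(memory_frames=221, tokens_per_frame=256, query_tokens=256,
                      tokens_per_block=16, top_k_blocks=32)
print(dense, sparse, round(dense / sparse, 1))
```

With these assumed sizes the dense count is 14,483,456 pairs and the sparse count is 187,648 pairs. The attending term depends only on the number of retrieved blocks, which is one way to read "bounded cost"; the scoring term still grows with the memory bank, but at one comparison per block rather than per token.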

Method

Figure 4: DecMem pipeline

Overview of the DecMem architecture.

(a) Sparse Global Memory (SGM): Combines a block-level sparse retrieval module with a context-aware attention module, trained end to end, for fine-grained retrieval from long-term memory. This enables efficient access to global history at bounded cost.

(b) Anchored Local Memory (ALM): Keeps short-term transitions smooth and supports stable, high-quality extrapolation by anchoring generation on recent local context.

(c) Decoupled Pipeline: DecMem decouples memory into the global and local components above, targeting long-term consistency and generalization under extrapolation while keeping computational cost low. For clearer visualization, we display 3 frame latents as key & value and the last frame indexed by t as query; each frame contains 2 blocks with 2 tokens per block.
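The toy layout in the figure (3 frame latents, 2 blocks per frame, 2 tokens per block) is small enough to walk through in code. The following is a minimal sketch of the general idea of combining a local window with retrieved global blocks, not the authors' implementation: the random features, the mean-pooled dot-product scoring rule, and the names `select_blocks`, `attend`, `top_k`, `local_frames`, and `DIM` are all assumptions made for illustration. In particular, the retrieval here is a fixed heuristic, whereas the caption describes a retrieval module learned end to end.

```python
import math
import random

NUM_FRAMES = 3        # from the figure's toy layout
BLOCKS_PER_FRAME = 2  # from the figure's toy layout
TOKENS_PER_BLOCK = 2  # from the figure's toy layout
DIM = 4               # hypothetical feature size


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def select_blocks(query_tokens, key_blocks, block_frames, top_k, local_frames):
    """Pick block indices: all recent (local) blocks plus top_k older blocks."""
    newest = max(block_frames)
    # Local part: every block from the most recent `local_frames` frames.
    local = [i for i, f in enumerate(block_frames) if f > newest - local_frames]
    # Global part (assumed scoring rule): rank older blocks by the dot product
    # between the mean-pooled query and each block's mean-pooled keys.
    q_summary = mean(query_tokens)
    older = [i for i in range(len(key_blocks)) if i not in local]
    ranked = sorted(older, key=lambda i: dot(q_summary, mean(key_blocks[i])),
                    reverse=True)
    return sorted(local + ranked[:top_k])


def attend(query, keys, values):
    """Scaled dot-product attention for a single query token."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [dot(query, k) * scale for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


rng = random.Random(0)


def rand_vec():
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


num_blocks = NUM_FRAMES * BLOCKS_PER_FRAME
key_blocks = [[rand_vec() for _ in range(TOKENS_PER_BLOCK)] for _ in range(num_blocks)]
value_blocks = [[rand_vec() for _ in range(TOKENS_PER_BLOCK)] for _ in range(num_blocks)]
block_frames = [i // BLOCKS_PER_FRAME for i in range(num_blocks)]  # [0, 0, 1, 1, 2, 2]

# Query tokens stand in for the last frame t (2 blocks x 2 tokens).
query_tokens = [rand_vec() for _ in range(BLOCKS_PER_FRAME * TOKENS_PER_BLOCK)]

chosen = select_blocks(query_tokens, key_blocks, block_frames, top_k=1, local_frames=1)
keys = [tok for i in chosen for tok in key_blocks[i]]
values = [tok for i in chosen for tok in value_blocks[i]]
outputs = [attend(q, keys, values) for q in query_tokens]
print("blocks attended:", chosen, "keys per query:", len(keys))
```

In this setup each query token attends to 6 of the 12 memory tokens: the 4 tokens of the most recent frame plus the 2 tokens of one retrieved older block. Raising `top_k` or `local_frames` widens the global or local part independently, which is the sense in which the two memories are decoupled in this sketch.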

Comparison

Qualitative comparison with state-of-the-art methods (memory bank initialized with 221 frames).

BibTeX

@misc{yang2026decmemminutelongconsistentworld,
      title={DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory}, 
      author={Zhenhao Yang and Xiaoshi Wu and Zhengyao Lv and Xiaoyu Shi and Xintao Wang and Pengfei Wan and Kun Gai and Kwan-Yee K. Wong},
      year={2026},
      eprint={2605.31336},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.31336}, 
}