Despite recent advances in diffusion transformers (DiTs) for text-to-video generation, scaling to long-duration content remains challenging due to the quadratic complexity of self-attention. While prior efforts---such as sparse attention and temporally autoregressive models---offer partial relief, they often compromise temporal coherence or scalability. We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos, designed to produce long, coherent videos through a segment-wise generation process. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. It supports variable-length inputs with linearly adjustable compression rates, enabled by a single query token design based on the Q-Former architecture. Additionally, by encoding temporal context through position-aware mechanisms, our model seamlessly supports prediction, retrodiction, interpolation, and multi-shot generation within a unified paradigm. Extensive experiments across diverse tasks validate the effectiveness and versatility of our approach.
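As a rough illustration of the segment-wise generation process described above, the sketch below shows one plausible loop: each new segment is denoised by the DiT while attending to context tokens obtained by compressing the previously generated segments. The interfaces `dit.sample` and `flexformer_enc` are hypothetical placeholders, not the released API.

```python
# Minimal sketch of segment-wise long-video generation (assumed interfaces,
# not the official LoViC implementation).
import torch

def generate_long_video(dit, flexformer_enc, text_emb, num_segments, seg_shape):
    segments, context_chunks = [], []
    for _ in range(num_segments):
        # Context tokens from all previously generated segments; the first
        # segment is generated without context.
        ctx = torch.cat(context_chunks, dim=1) if context_chunks else None
        # Denoise the current segment conditioned on the text embedding and
        # the compressed context (which is concatenated with the DiT's input
        # tokens inside self-attention).
        seg = dit.sample(text_emb, context=ctx, shape=seg_shape)
        segments.append(seg)
        # Compress the new segment together with its text into context tokens.
        context_chunks.append(flexformer_enc(seg, text_emb))
    return torch.cat(segments, dim=2)  # concatenate segments along time (B, C, T, H, W)
```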
Our model can flexibly continue a video in either temporal direction and generate multi-shot videos with high efficiency. It retains identity consistency over long temporal ranges and produces videos with large yet smooth motion.
The left part of the figure shows an autoencoder consisting of a FlexFormer encoder and a FlexFormer decoder. The encoder compresses multiple segments of video and text tokens separately: the number of query tokens is derived from the video token sequence length, and the query sequence is formed by replicating a single learnable token. The decoder reconstructs video and text features from the context tokens in a similar manner. Each context video-text pair is compressed into a set of context tokens; the resulting chunks of context tokens are concatenated and fed into the DiT by further concatenating them with the input tokens of each self-attention layer (see the sketch below).
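The single-query compression can be made concrete with a small, self-contained sketch. The class below is an assumption-based illustration in PyTorch (names and the fixed compression ratio are hypothetical): one learnable query token is replicated to a count derived from the video token length, then cross-attends to the concatenated video and text tokens to produce the context tokens.

```python
# Minimal sketch (assumptions, not the released code) of single-query-token
# compression: one learnable query is replicated to a length derived from the
# video token count, then attends to the video + text tokens.
import torch
import torch.nn as nn

class SingleQueryCompressor(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, ratio: int = 16):
        super().__init__()
        self.ratio = ratio                                   # tokens per query (compression rate)
        self.query = nn.Parameter(torch.randn(1, 1, dim))    # the single learnable token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, video_tokens, text_tokens):
        b, n_video, _ = video_tokens.shape
        n_query = max(1, n_video // self.ratio)      # query count scales with video length
        queries = self.query.expand(b, n_query, -1)  # replicate the single learnable token
        kv = torch.cat([video_tokens, text_tokens], dim=1)
        context, _ = self.attn(queries, kv, kv)      # cross-attend to video + text tokens
        return context                               # compressed context tokens

# Usage: 1,024 video tokens and 77 text tokens -> 64 context tokens at ratio 16.
comp = SingleQueryCompressor(dim=256)
ctx = comp(torch.randn(2, 1024, 256), torch.randn(2, 77, 256))
print(ctx.shape)  # torch.Size([2, 64, 256])
```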
Each block represents the positional index (t, h, w) of the corresponding token.
Blue and purple dots represent the positions of video tokens and query tokens, respectively. Text tokens are omitted.
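For reference, the snippet below illustrates how (t, h, w) indices could be laid out for the video tokens in such a grid. The placement of query-token positions (evenly spaced along the temporal axis at the spatial center) is an assumption made for illustration only, not the paper's exact position-aware scheme.

```python
# Illustrative sketch of 3D (t, h, w) positional indexing; query-token
# placement below is a hypothetical choice, not the paper's exact scheme.
import torch

def video_token_positions(T: int, H: int, W: int) -> torch.Tensor:
    # One (t, h, w) index per video token on the full spatiotemporal grid.
    t, h, w = torch.meshgrid(
        torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"
    )
    return torch.stack([t, h, w], dim=-1).reshape(-1, 3)  # (T*H*W, 3)

def query_token_positions(T: int, H: int, W: int, n_query: int) -> torch.Tensor:
    # Assumed placement: query tokens spread evenly over time at the spatial
    # center of the frame.
    t = torch.linspace(0, T - 1, n_query)
    h = torch.full((n_query,), (H - 1) / 2)
    w = torch.full((n_query,), (W - 1) / 2)
    return torch.stack([t, h, w], dim=-1)  # (n_query, 3)

print(video_token_positions(4, 2, 2).shape)  # torch.Size([16, 3])
print(query_token_positions(4, 2, 2, 2))     # two query positions along time
```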
@article{jiang2025lovic,
  title={LoViC: Efficient Long Video Generation with Context Compression},
  author={Jiang, Jiaxiu and Li, Wenbo and Ren, Jingjing and Qiu, Yuping and Guo, Yong and Xu, Xiaogang and Wu, Han and Zuo, Wangmeng},
  journal={arXiv preprint arXiv:2507.12952},
  year={2025}
}