TTOM: Test-Time Optimization and Memorization for Compositional Video Generation

1. NExT++ Lab, National University of Singapore
2. University of Science and Technology of China
3. Harbin Institute of Technology (Shenzhen)
*Equal Contribution       Correspondence

Abstract

Video Foundation Models (VFMs) exhibit remarkable visual generation performance, but struggle in compositional scenarios (e.g., motion, numeracy, and spatial relations).

In this work, we introduce Test-Time Optimization and Memorization (TTOM), a training-free framework that aligns VFM outputs with spatiotemporal layouts during inference for better text-image alignment.

Rather than directly intervening on latents or attention per sample, as existing work does, we integrate and optimize new parameters guided by a general layout-attention objective. Furthermore, we formulate video generation within a streaming setting and maintain historical optimization contexts with a parametric memory mechanism that supports flexible operations such as insert, read, update, and delete.
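
The four memory operations can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the class name `ParametricMemory`, string keys such as "moves_left", flat lists of floats standing in for the optimized parameters, and first-in-first-out eviction are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Optional


class ParametricMemory:
    """Toy key-value store for optimized parameters (hypothetical sketch)."""

    def __init__(self, capacity: int = 4) -> None:
        self.capacity = capacity
        # Python dicts keep insertion order, so the first key is the oldest.
        self._store: Dict[str, List[float]] = {}

    def insert(self, key: str, params: List[float]) -> None:
        # Assumption: evict the oldest entry when the memory is full (FIFO).
        if key not in self._store and len(self._store) >= self.capacity:
            self.delete(next(iter(self._store)))
        self._store[key] = list(params)

    def read(self, key: str) -> Optional[List[float]]:
        # Return a copy so callers cannot mutate the stored parameters.
        params = self._store.get(key)
        return None if params is None else list(params)

    def update(self, key: str, params: List[float]) -> None:
        # Overwrite an existing entry, e.g. after further optimization.
        if key not in self._store:
            raise KeyError(key)
        self._store[key] = list(params)

    def delete(self, key: str) -> None:
        self._store.pop(key, None)


# Streaming usage: store parameters for one prompt, reuse them for the next.
memory = ParametricMemory(capacity=2)
memory.insert("moves_left", [0.1, -0.3])
warm_start = memory.read("moves_left")      # initialization for a new prompt
memory.update("moves_left", [0.2, -0.25])   # write back the refined parameters
memory.delete("moves_left")
```

In a streaming setting, a later prompt that shares a pattern with an earlier one could start from the stored parameters instead of from scratch; how keys are derived from prompts is left open here.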

Notably, we find that TTOM disentangles compositional world knowledge, showing powerful transferability and generalization. Experimental results on the T2V-CompBench and VBench benchmarks establish TTOM as an effective, practical, scalable, and efficient framework for achieving cross-modal alignment for compositional video generation on the fly.

Video Results

Motion Pattern Transfer with Memory

INSERT: Store motion pattern into memory | RETRIEVE: Apply stored patterns to new prompts.

INSERT

A cat slinking to the left side of a cozy living room.

RETRIEVE

A fresh orange rolls left across the countertop.

RETRIEVE

A gentle swan glides left over the lake.

INSERT

A vibrant blue jay with striking plumage takes flight, ascending gracefully through a lush, sun-dappled garden. Its wings beat rhythmically, catching the golden morning light as it rises past blooming roses and towering sunflowers. The garden is alive with color, from the deep green of the foliage to the vivid reds and yellows of the flowers. As the bird soars higher, it casts a fleeting shadow over a tranquil pond, where koi fish swim lazily beneath the surface. The air is filled with the gentle rustle of leaves and the distant hum of bees, creating a serene, harmonious backdrop for the bird's elegant ascent.

RETRIEVE

A sleek silver drone lifts gently from the ground, its rotors whirring softly as it ascends into the crisp morning air. Rising through a lush, sun-dappled garden, it weaves past blooming roses and towering sunflowers, its metallic surface catching and reflecting the golden sunlight. The garden bursts with color—the deep green of foliage mingles with the vivid reds and yellows of the blossoms, creating a vibrant tapestry of life. As the drone climbs higher, its fleeting shadow glides across a tranquil pond where koi fish drift lazily beneath the surface. The gentle hum of the rotors blends with the rustling leaves and the distant buzz of bees, forming a serene, harmonious backdrop for its graceful ascent.

RETRIEVE

A sleek silver drone lifts gently from the edge of a Grand Canyon cliff, its rotors whirring softly as it rises into the crisp morning air. Skimming past striated sandstone walls and wind-carved ledges, it climbs through bands of sun-warmed rock where sagebrush and desert blooms cling to the rim. Its metallic surface catches and reflects the golden light, echoing the canyon's rust-reds and ochres into a vibrant tapestry of color. As it ascends, the drone's fleeting shadow drifts over a winding ribbon of the Colorado River far below, where eddies swirl beneath sheer faces. The gentle hum of the rotors blends with the sigh of updrafts, the whisper of dry grasses, and the distant cry of a condor, forming a serene, harmonious backdrop for its graceful ascent.

Methods Comparison

Wan2.1

LVD (on Wan2.1)

TTOM

Motion

A petal floats left to right by the flowing stream.

Action

A penguin slides down a snowy slope and a seal claps.

Numeracy

Five elephants splash water from a river.

Cons

Interaction

Spatial

More on motion category

Wan2.1

LVD (on Wan2.1)

TTOM

A robot vacuum is sweeping the floor from left to right.

A balloon drifts right to left above a statue in a city square.

A basketball is thrown from right to the left.

A child climbs down the slide.

A bright lantern floats left down the river, its warm glow reflecting off the gentle ripples.

Framework

Overview of the TTOM framework for compositional text-to-video generation. A stream of text prompts is first fed into LLMs for spatiotemporal layout planning. Meanwhile, the video foundation model runs its denoising sampling process, during which cross-attention maps are extracted and used for test-time optimization toward alignment with the planned layouts. Historical optimization context is maintained by the parametric memory.
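
To make the alignment step concrete, here is a minimal Python sketch of one possible layout-attention loss. This page does not give the exact objective, so the form used here (one minus the share of attention mass that falls inside the planned box, averaged over frames) is an assumption borrowed from common layout-guidance practice. The function names and the box convention are hypothetical. The sketch only evaluates the loss; in a real pipeline its gradient would drive the test-time parameter updates.

```python
from typing import List, Sequence, Tuple

# (top, left, bottom, right) in attention-grid cells, end-exclusive.
Box = Tuple[int, int, int, int]
AttnMap = List[List[float]]


def layout_attention_loss(attn: AttnMap, box: Box) -> float:
    """Return 1 - (attention mass inside box / total mass); 0 means aligned."""
    top, left, bottom, right = box
    total = sum(sum(row) for row in attn)
    if total <= 0.0:
        return 1.0  # no attention at all counts as fully misaligned
    inside = sum(sum(row[left:right]) for row in attn[top:bottom])
    return 1.0 - inside / total


def video_layout_loss(frames: Sequence[AttnMap], boxes: Sequence[Box]) -> float:
    """Average the per-frame loss over paired attention maps and boxes."""
    losses = [layout_attention_loss(a, b) for a, b in zip(frames, boxes)]
    return sum(losses) / len(losses)


# Two frames of a 2x4 attention grid for one object token.
frame0 = [[0.0, 0.0, 0.0, 1.0],
          [0.0, 0.0, 0.0, 1.0]]
frame1 = [[0.0, 0.0, 1.0, 1.0],
          [0.0, 0.0, 1.0, 1.0]]
# The planned layout moves the object one column to the left between frames.
planned = [(0, 3, 2, 4), (0, 2, 2, 3)]

loss = video_layout_loss([frame0, frame1], planned)
```

In this toy case the first frame already matches its box, while half of the attention in the second frame lies outside the planned box, so the loss is positive and optimization would push attention toward the layout.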

TTOM Framework

BibTeX

@article{qu2025ttom,
  title   = {TTOM: Test-Time Optimization and Memorization for Compositional Video Generation},
  author  = {Leigang Qu and Ziyang Wang and Na Zheng and Wenjie Wang and Liqiang Nie and Tat-Seng Chua},
  journal = {arXiv preprint arXiv:2510.07940},
  year    = {2025},
  url     = {https://arxiv.org/abs/2510.07940}
}