From Zero to Hero: Training-Free Custom Concept Spawning in World Models

TL;DR

Autoregressive world models generate worlds chunk by chunk as the user navigates. Once the camera moves past the reference frame, the unseen regions are populated by the model's priors, with no mechanism for the user to specify what should appear and where.

We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. In autoregressive video models, the first slot of the context memory is pinned to the reference frame and acts as an anchor for every generated chunk. SPAWN replaces this anchor with a concept latent for a short window of chunks, then restores the original anchor. After restoration, the concept persists in the rollout through the model's own memory. SPAWN works with either a concept image or a text prompt and supports concepts ranging from fine-grained entities to large landmarks.

Method

Method overview. At each rollout step, the context memory is reconstituted from the memory cache and passed through the encoder, autoregressive diffusion transformer, and decoder to produce the next chunk. At specified injection chunks, we replace the anchor slot of the context memory with a concept latent before encoding, which introduces the concept into the scene. Once the injection window ends, we restore the original anchor; even so, the concept persists in subsequent chunks. For clarity, the figure shows a reduced context size and a single replaced slot.
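The anchor swap described above can be sketched as a short rollout loop. This is a minimal illustration under stated assumptions, not the released implementation: `InjectionWindow`, `build_context`, `rollout`, and `generate_chunk` are hypothetical names, `generate_chunk` stands in for the whole encoder, diffusion transformer, and decoder pipeline, and the memory cache is assumed to hold one entry per previously generated chunk, with the most recent entries filling the non-anchor slots. As in the figure, only a single slot is replaced.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class InjectionWindow:
    """Hypothetical description of when the anchor is swapped."""
    start: int   # first chunk index that uses the concept latent as anchor
    length: int  # number of consecutive chunks in the injection window

    def active(self, chunk_idx: int) -> bool:
        return self.start <= chunk_idx < self.start + self.length


def build_context(anchor: Any, memory_cache: List[Any], context_size: int) -> List[Any]:
    """Slot 0 holds the anchor; the rest are the most recent cached entries.

    Assumption: the cache keeps one entry per generated chunk.
    """
    recent = memory_cache[-(context_size - 1):] if context_size > 1 else []
    return [anchor] + list(recent)


def rollout(
    reference: Any,
    concept: Any,
    window: InjectionWindow,
    num_chunks: int,
    context_size: int,
    generate_chunk: Callable[[List[Any]], Any],
) -> List[Any]:
    """Generate chunks one by one, swapping the anchor only inside the window."""
    memory_cache: List[Any] = []
    chunks: List[Any] = []
    for i in range(num_chunks):
        # Inside the window the concept latent takes the anchor slot;
        # outside it the original reference anchor is used again.
        anchor = concept if window.active(i) else reference
        context = build_context(anchor, memory_cache, context_size)
        chunk = generate_chunk(context)  # stand-in for the model forward pass
        memory_cache.append(chunk)       # later chunks can attend to this one
        chunks.append(chunk)
    return chunks
```

In this sketch the model weights are never touched: the only change is which latent occupies slot 0 for a few steps. Chunks generated during the window enter the cache, which is how the concept can remain visible to later steps after the reference anchor is restored.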

Qualitative Results

a) Image-based control

b) Text-based control

"A lone brown bear"

"A translucent glass sculpture shimmering"

"A bright orange camping tent with backpacks"

c) Multiple Concepts in Single Video

Add [concept] to the right

Add [concept] to the left

Add [concept] above

Camera actions:

d) Diversity

Baseline vs. Diverse Generation (Ours)

Panel labels: Buildings, Beach, Church, Ocean, Road

Qualitative Comparison

Methods compared for each prompt: Wan 2.2, HunyuanVideo, Worldplay, Ours.

"Camera pans right. Add a lily pad cluster with a single blooming flower."

"Camera pans right. Add a vintage lantern hanging from a tree."

"Camera pans left. Insert a bright neon desk lamp shining towards the twisted rock formation."

"Camera pans right. Place a blue rectangular sign on a metal post with white icons: pedestrian, cyclist, and airplane symbols."

"Camera pans left. Place a bright orange camping tent among the trees."

Additional Results

"a turtle resting on a rock near the water's edge"

"a vibrant red canoe tied to a tree"

"a wooden ladder leading from the shore into the water"

"a lone streetlight with a lit bulb"

"an antique pocket watch, open and glinting in the sunlight, on the rock"

"a park bench facing towards the horizon"

"a decorative throw pillow casually resting against a small boulder"

"a shiny steel chain coiled on a prominent rock"

"a picnic basket on a rock"

"a sleek silver motorbike parked"

"fully decorated Christmas tree with twinkling lights"

"three children holding hands in the snow"

"stacked colorful kayaks in the field"

"a pair of ice skates hanging from a tree branch"

"a vintage wooden sled leaning against a tree"

BibTeX

@misc{akdemir2026zeroherotrainingfreecustom,
      title={From Zero to Hero: Training-Free Custom Concept Spawning in World Models}, 
      author={Kiymet Akdemir and Pinar Yanardag},
      year={2026},
      eprint={2606.02575},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.02575}, 
}