From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Kiymet Akdemir, Pinar Yanardag

Virginia Tech


[Teaser figure: control via text prompt ("Add a lake on the right") and control via concept image (horse concept); panels labeled Baseline, Ours, Ours.]

TL;DR

Autoregressive world models generate worlds chunk by chunk as the user navigates. Once the camera moves past the reference frame, the unseen regions are populated by the model's priors, with no mechanism for the user to specify what should appear and where.

We introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a training-free method for concept spawning. In autoregressive video models, the first slot of the context memory is pinned to the reference frame and acts as an anchor for every generated chunk. SPAWN replaces this anchor with a concept latent over a short window, then restores it. The concept persists in the rollout via the model's own memory. SPAWN works with either a concept image or a text prompt, and supports concepts ranging from fine-grained entities to large landmarks.
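As a minimal sketch of the anchor swap described above (not the authors' implementation), the following Python function assembles the context memory for a single chunk. The names `context_for_chunk`, `inject_start`, and `inject_len` are hypothetical, and latents are represented by plain strings purely for illustration.

```python
from typing import List, Sequence


def context_for_chunk(
    anchor: str,
    concept: str,
    recent: Sequence[str],
    chunk_idx: int,
    inject_start: int,
    inject_len: int,
) -> List[str]:
    """Assemble the context memory for one chunk (illustrative only).

    Slot 0 normally holds the pinned reference-frame anchor. Inside the
    injection window it holds the concept latent instead; outside the
    window the original anchor is used again.
    """
    in_window = inject_start <= chunk_idx < inject_start + inject_len
    first_slot = concept if in_window else anchor
    # Remaining slots: recently generated chunks (assumed layout).
    return [first_slot, *recent]


if __name__ == "__main__":
    # Hypothetical schedule: inject for two chunks starting at chunk 2.
    for idx in range(5):
        ctx = context_for_chunk("reference", "horse", ["prev_a", "prev_b"], idx, 2, 2)
        print(idx, ctx[0])
```

Only the first slot changes; the rest of the context is passed through untouched, which is what makes the swap training-free in this sketch.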

Method

Method overview of SPAWN

Method overview. At each rollout step, the context memory is reconstituted from the memory cache and passed through the encoder, autoregressive diffusion transformer, and decoder to produce the next chunk. At specified injection chunks, we replace the anchor slot of the context memory with a concept latent before encoding, which introduces the concept into the scene. Once the injection window ends, we restore the original anchor; even so, the concept persists in subsequent chunks. For clarity, the figure shows a reduced context size and a single replaced slot.
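The rollout loop in the overview can be mimicked with a toy model. This is a sketch under strong assumptions, not the actual pipeline: latents are single floats, `toy_generate` is a hypothetical stand-in for the encoder, diffusion transformer, and decoder that simply averages the context, and `cache_size` and `inject_chunks` are illustrative parameters. It only shows the control flow: swap the anchor at injection chunks, restore it afterwards, and let previously generated chunks in the memory cache carry the concept forward.

```python
from collections import deque
from typing import List, Sequence, Set


def toy_generate(context: Sequence[float]) -> float:
    """Hypothetical stand-in for the real model: average the context slots."""
    return sum(context) / len(context)


def rollout(
    anchor: float,
    concept: float,
    num_chunks: int,
    inject_chunks: Set[int],
    cache_size: int = 3,
) -> List[float]:
    """Toy autoregressive rollout with a windowed anchor swap."""
    cache = deque(maxlen=cache_size)  # memory cache of recent chunks
    chunks = []
    for t in range(num_chunks):
        # Swap the anchor slot only at injection chunks, otherwise restore it.
        first_slot = concept if t in inject_chunks else anchor
        # Reconstitute the context memory: anchor slot + cached chunks.
        context = [first_slot, *cache]
        chunk = toy_generate(context)
        cache.append(chunk)  # the new chunk becomes part of the memory
        chunks.append(chunk)
    return chunks


if __name__ == "__main__":
    # anchor = 0.0 means "no concept", concept = 1.0 means "concept present".
    for t, value in enumerate(rollout(0.0, 1.0, num_chunks=6, inject_chunks={2, 3})):
        print(t, round(value, 3))
```

In this toy, chunks generated after the window remain nonzero only because earlier chunks in the cache were produced under the concept anchor. It makes no claim about how long a concept persists in a real model.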


Qualitative Results

a) Image-based control

Cowboy concept
Cauldron concept
Fire concept

b) Text-based control

"A lone brown bear"
"A translucent glass sculpture shimmering"
"A bright orange camping tent with backpacks"

c) Multiple Concepts in Single Video

Add lamp to the right
Add statue to the left
Add ufo above
Camera actions: 1, 2, 3, 4, 5

d) Diversity

Baseline: Buildings, Buildings, Buildings, Buildings
Diverse Generation (Ours): Beach, Church, Ocean, Road

Qualitative Comparison

"Camera pans right. Add a lily pad cluster with a single blooming flower."

Wan 2.2
HunyuanVideo
Worldplay
Ours

"Camera pans right. Add a vintage lantern hanging from a tree."

Wan 2.2
HunyuanVideo
Worldplay
Ours

"Camera pans left. Insert a bright neon desk lamp shining towards the twisted rock formation."

Wan 2.2
HunyuanVideo
Worldplay
Ours

"Camera pans right. Place a blue rectangular sign on a metal post with white icons: pedestrian, cyclist, and airplane symbols."

Wan 2.2
HunyuanVideo
Worldplay
Ours

"Camera pans left. Place a bright orange camping tent among the trees."

Wan 2.2
HunyuanVideo
Worldplay
Ours

Additional Results

"a turtle resting on a rock near the water's edge"
"a vibrant red canoe tied to a tree"
"a wooden ladder leading from the shore into the water"
"a lone streetlight with a lit bulb"
"an antique pocket watch, open and glinting in the sunlight, on the rock"
"a park bench facing towards the horizon"
"a decorative throw pillow casually resting against a small boulder"
"a shiny steel chain coiled on a prominent rock"
"a picnic basket on a rock"
"a sleek silver motorbike parked"
"fully decorated Christmas tree with twinkling lights"
"three children holding hands in the snow"
"stacked colorful kayaks in the field"
"a pair of ice skates hanging from a tree branch"
"a vintage wooden sled leaning against a tree"

BibTeX

@misc{akdemir2026zeroherotrainingfreecustom,
      title={From Zero to Hero: Training-Free Custom Concept Spawning in World Models}, 
      author={Kiymet Akdemir and Pinar Yanardag},
      year={2026},
      eprint={2606.02575},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.02575}, 
}