A generative video animation engine built on Python + FFmpeg. No external media — all frames and audio are synthesised from scratch.
Genzō (現像) is the Japanese word for film development — the process of making a latent image appear from nothing. That's the engine: pure code develops into video.
The first two productions explore what it feels like to be an LLM. They were MVPs: the goal now is to extract everything reusable into a proper engine so future videos are written as content, not as code.
```
genzo/
  core.py        ← frame emitter, audio queue, render pipeline
  audio.py       ← all synthesis primitives
  draw.py        ← all drawing & shape primitives
  animate.py     ← animated shape primitives + easing
  fonts.py       ← font loader, named size scale
  palette.py     ← named colour constants
scenes/
  llm_gatari/
    scenes.json  ← scene list: static cards, text, geometry
    templates.py ← animated card templates (expanding_rings, tri_spin, etc.)
    audio.py     ← per-scene audio descriptors
productions/
  llm_gatari.py  ← thin orchestrator: load scenes.json, call engine, render
  llm_ytp.py     ← kept as-is, reference for glitch aesthetic
```
- Static cards (bg, text, geometry, duration) → fully declarative JSON
- Animated cards (easing, motion, per-frame logic) → named Python templates called from JSON
- Audio → either inline numpy or referenced by name from a sound library
Proof-of-concept. Established the full pipeline: synthesise audio, draw frames, call FFmpeg. Visual language: matrix rain, scanlines, chromatic aberration, VHS noise, strobe cuts. Everything inline, no separation of concerns.
Second iteration. Introduced the design language used going forward: flat colour, geometric shapes, stark minimal typography, Japanese text, animated primitives (line_grow, tri_spin, circle_ring), photosensitivity rules. Still monolithic — scenes, engine, and content are all one file.
Everything below currently lives in `make_monogatari.py` and is ready to move into the engine without modification.
| Function | Description |
|---|---|
| `sine(freq, dur, vol)` | Pure sine wave |
| `chord(freqs, dur, vol)` | Sum of sine waves |
| `square_wave(freq, dur, vol)` | Harsh robot tone |
| `white_noise(dur, vol)` | Random noise |
| `drone(freq, dur, vol)` | Sine + harmonics, sustained |
| `sting(freqs, dur, vol)` | Short multi-tone stab with release |
| `click(vol)` | 3 ms noise burst — card cut sound |
| `thud(vol)` | Pitch-drop impact |
| `whoosh(dur, vol)` | Filtered noise sweep |
| `envelope(arr, attack, release)` | Apply fade-in/out to any array |
| `silence(dur)` | Zero array |
| Function | Description |
|---|---|
| `canvas(bg)` | New 854×480 RGB image |
| `put(img, text, x, y, font, fill, anchor, stroke)` | Text with lt/ct/rt anchor + optional stroke |
| `measure(d, text, font)` | Returns (width, height) of text |
| `hbar(img, y, h, fill)` | Filled horizontal bar |
| `vbar(img, x, w, fill)` | Filled vertical bar |
| `hrule(img, y, fill, width)` | 1px horizontal rule |
| `vrule(img, x, fill, width)` | 1px vertical rule |
| `rect(img, x1, y1, x2, y2, fill, outline)` | Rectangle |
| `triangle(img, pts, fill)` | Arbitrary polygon |
| `diagonal_split(bg_left, bg_right, split_x, angle_px)` | Two-tone diagonal split background |
| `rotated_text(img, text, angle, cx, cy, font, fill)` | Text rotated around a centre point |
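Most of this layer reduces to thin wrappers over Pillow's `ImageDraw`. A hedged sketch of `canvas`, `hbar`, and `vbar` at the engine's 854×480 resolution — signatures come from the table, internals are guesses:

```python
from PIL import Image, ImageDraw

W, H = 854, 480  # engine resolution from this README

def canvas(bg=(0, 0, 0)):
    """New W×H RGB image filled with bg."""
    return Image.new("RGB", (W, H), bg)

def hbar(img, y, h, fill):
    """Filled horizontal bar spanning the full width, h pixels tall."""
    ImageDraw.Draw(img).rectangle([0, y, W - 1, y + h - 1], fill=fill)

def vbar(img, x, w, fill):
    """Filled vertical bar spanning the full height, w pixels wide."""
    ImageDraw.Draw(img).rectangle([x, 0, x + w - 1, H - 1], fill=fill)

img = canvas((5, 10, 50))          # NAVY background
hbar(img, 100, 8, (200, 155, 0))   # GOLD horizontal bar
vbar(img, 0, 4, (200, 155, 0))     # GOLD left-edge accent bar
```

Keeping every helper as `f(img, ...)` mutating in place is what makes per-frame composition cheap: one `canvas()` allocation, then a chain of draw calls.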
| Function | Description |
|---|---|
| `ease_out(t)` | Quadratic ease-out (0→1) |
| `ease_in(t)` | Quadratic ease-in (0→1) |
| `tri_spin(img, cx, cy, r, angle, fill, n_sides)` | Regular polygon rotating around a point |
| `circle_ring(img, cx, cy, r, fill, width)` | Unfilled circle outline |
| `line_grow(img, x1, y1, x2, y2, t, fill, width)` | Line that draws itself as t→1 |
| `bar_wipe(img, y, h, fill, t, from_right)` | Horizontal bar slides in |
| `vbar_wipe(img, x, w, fill, t, from_bottom)` | Vertical bar slides in |
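The quadratic easings in the table are one-liners; a sketch, with `line_grow` reduced to the endpoint interpolation it implies (the actual draw call is omitted here):

```python
def ease_out(t):
    """Quadratic ease-out: fast start, slow finish; maps 0→0, 1→1."""
    return 1.0 - (1.0 - t) ** 2

def ease_in(t):
    """Quadratic ease-in: slow start, fast finish; maps 0→0, 1→1."""
    return t * t

def line_grow_endpoint(x1, y1, x2, y2, t):
    """Where the self-drawing line has reached at progress t.
    (Hypothetical helper name; the real line_grow also draws.)"""
    return (x1 + t * (x2 - x1), y1 + t * (y2 - y1))
```

Feeding `ease_out(frame / n_frames)` instead of raw `t` into any of the wipe/grow primitives is what gives the cards their snappy, decelerating motion.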
| Variable | Size | Font | Purpose |
|---|---|---|---|
| `F_HUGE` | 110 | Futura Condensed ExtraBold | Single impact moment per video |
| `F_BIG` | 72 | Futura Condensed ExtraBold | Title cards only |
| `F_MED` | 48 | Futura Bold | Rarely |
| `F_SM` | 32 | Futura Bold | Flash card words |
| `F_XS` | 20 | Futura Bold | Labels |
| `F_TINY` | 14 | Futura Bold | Micro labels |
| `M_MED` | 28 | Courier New Bold | Code / data values |
| `M_SM` | 18 | Courier New Bold | Smaller data |
| `T_MED` | 28 | Helvetica Neue Light | Sentences, captions |
| `T_SM` | 20 | Helvetica Neue Light | Sub-captions |
| `JP_BIG` | 90 | Hiragino (ヒラギノ角ゴシック) | JP impact |
| `JP_MED` | 54 | Hiragino | JP mid |
| `JP_SM` | 32 | Hiragino | JP watermarks |
Design rules: English is small and quiet by default. `F_HUGE` is used at most once per video.
Sentences always in `T_*`; data/code always in `M_*`.
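The "named size scale" in fonts.py suggests sizes keyed by name and scaled with output resolution. A hypothetical sketch — the linear scaling rule is an assumption, not confirmed engine behaviour:

```python
# Named point sizes from the table above, at the base 854×480 resolution.
BASE_SIZES = {
    "F_HUGE": 110, "F_BIG": 72, "F_MED": 48, "F_SM": 32, "F_XS": 20, "F_TINY": 14,
    "M_MED": 28, "M_SM": 18,
    "T_MED": 28, "T_SM": 20,
    "JP_BIG": 90, "JP_MED": 54, "JP_SM": 32,
}

def scaled_sizes(height=480):
    """Scale every named size linearly with output height (480 → 1080 → 2160).
    Hypothetical rule: keeps relative proportions identical at any resolution."""
    k = height / 480
    return {name: round(size * k) for name, size in BASE_SIZES.items()}
```

Centralising the scale like this is what would let the same scenes.json render at 480p for preview and 1080p for final output without touching any scene data.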
| Function | Description |
|---|---|
| `emit(img, n)` | Write n copies of frame to numbered PNG sequence |
| `aud(*arrays)` | Append float32 arrays to audio queue |
| `f(seconds)` | Convert seconds → frame count at current FPS |
| `flash_black(n)` | Emit n black frames (min 3) |
| `flash_white(n)` | Emit n white frames (min 6 — photosensitivity) |
| `flash_color(col, n)` | Emit n solid-colour frames (min 4) |
| `render(output_path)` | Concatenate audio, call FFmpeg, clean up |
```python
BLACK  = (  0,   0,   0)
WHITE  = (255, 255, 255)
RED    = (180,   0,   0)
RED2   = (220,  30,  30)
GOLD   = (200, 155,   0)
NAVY   = (  5,  10,  50)
INDIGO = ( 20,   0,  60)
PALE   = (240, 235, 230)  # use instead of WHITE for backgrounds
CREAM  = (255, 248, 230)
SLATE  = ( 30,  30,  45)
```

Must be preserved in engine and all future productions:
- `flash_white()` enforces minimum 6 frames (~4 Hz ceiling)
- Word flash cards: minimum 4 frames each
- No strobe loops — held cards only, not alternating
- No pure white backgrounds — use `PALE` instead
- Dark-to-dark cuts are safe at any speed
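One way the flash helpers could enforce these minimums is a simple clamp; a sketch (the constant names are hypothetical):

```python
# Frame minimums from the photosensitivity rules above, at 24 fps.
MIN_WHITE = 6   # ~4 Hz ceiling for white flashes
MIN_COLOR = 4   # solid-colour cards
MIN_BLACK = 3   # dark-to-dark is safe, so the floor is lower

def clamp_flash(n, minimum):
    """Photosensitivity guard: never emit fewer frames than the minimum."""
    return max(n, minimum)

def flash_white_frames(n):
    """Frame count flash_white() would actually emit for a requested n."""
    return clamp_flash(n, MIN_WHITE)
```

Putting the clamp inside the helpers (rather than trusting scene authors) is what makes the rule survive the move to declarative JSON: a scene can request 2 white frames and still get 6.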
Static cards are fully declarative. Animated cards reference a named template.
```json
{
  "palette": "llm_gatari",
  "fps": 24,
  "scenes": [
    {
      "id": "title",
      "template": "static",
      "bg": "NAVY",
      "frames": 5,
      "geo": [
        { "type": "vbar", "x": 0, "w": 4, "fill": "GOLD" }
      ],
      "texts": [
        { "text": "GENERATION", "font": "M_SM", "anchor": "ct", "y": "center", "fill": [180, 180, 200] }
      ],
      "audio": { "type": "click", "vol": 0.4 }
    },
    {
      "id": "conscious",
      "template": "expanding_rings",
      "bg": "RED",
      "duration": 1.8,
      "texts": [
        { "text": "am i", "font": "T_MED", "anchor": "ct", "y": 170, "fill": [220, 150, 150] },
        { "text": "CONSCIOUS?", "font": "F_HUGE", "anchor": "ct", "y": 210, "fill": [255, 255, 255] }
      ],
      "audio": { "type": "drone", "freq": 220, "vol": 0.15 }
    }
  ]
}
```

- Full render pipeline: PIL frames + numpy audio → FFmpeg → mp4
- Audio primitives: sine, chord, drone, sting, click, thud, whoosh, envelope, silence
- Drawing primitives: canvas, put, hbar, vbar, hrule, rect, diagonal_split, rotated_text
- Animated shape primitives: tri_spin, circle_ring, line_grow, bar_wipe, vbar_wipe, easing
- Font system: Futura + Helvetica Neue Light + Courier New + Hiragino JP
- Palette: 10 named colours, PALE rule for backgrounds
- Photosensitivity rules enforced in flash helpers
- Minimalist typography system (F_HUGE used once, sentences in T_*)
- Japanese text: 予測 / 次のトークン / 意識 / 忘れる / 終 / 言語モデル
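The split between declarative JSON and named Python templates can be wired up with a small registry. A hypothetical sketch — `TEMPLATES`, the decorator, and the stub renderers are illustrative, not engine code:

```python
import json

TEMPLATES = {}  # template name (as used in scenes.json) → render function

def template(name):
    """Decorator registering a render function under a scenes.json name."""
    def register(fn):
        TEMPLATES[name] = fn
        return fn
    return register

@template("static")
def render_static(scene):
    # Real engine: build canvas, draw geo + texts, emit scene["frames"] frames.
    return f"static card {scene['id']} for {scene['frames']} frames"

@template("expanding_rings")
def render_expanding_rings(scene):
    # Real engine: per-frame ring animation over scene["duration"] seconds.
    return f"animated card {scene['id']} for {scene['duration']}s"

def render_scenes(doc):
    """Dispatch every scene to its named template."""
    return [TEMPLATES[s["template"]](s) for s in doc["scenes"]]

doc = json.loads('''{"fps": 24, "scenes": [
  {"id": "title", "template": "static", "frames": 5},
  {"id": "conscious", "template": "expanding_rings", "duration": 1.8}
]}''')
```

The registry is what keeps future productions "written as content, not as code": adding an animation means registering one function, never touching the loader.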
- MVP 1: `make_ytp.py` — glitch/YTP aesthetic
- MVP 2: `make_monogatari.py` — Monogatari flash card aesthetic
- This README
- Extract engine — split `make_monogatari.py` into `genzo/core.py`, `audio.py`, `draw.py`, `animate.py`, `fonts.py`, `palette.py`
- JSON loader — write renderer that reads `scenes.json` for static cards
- Named animation templates — `expanding_rings`, `tri_spin_bg`, `line_reveal`, `bar_wipe_in`, callable from JSON by name
- CLI — `render.py --scenes scenes.json --output out.mp4 --fps 24`
- Music bed — continuous generative drone/chord under entire video, mixed with per-scene hits
- Colour themes per arc — each section gets its own palette (like Monogatari arcs), switchable in JSON
- Vertical JP text — render kanji top-to-bottom in margins using per-character placement
- Transition frames — short geometric wipe templates between scenes (not hard cuts)
- New production — first video built entirely on the engine, not inline code
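The planned CLI maps directly onto stdlib `argparse`; a sketch matching the flags named in the roadmap item (defaults are assumptions):

```python
import argparse

def build_parser():
    """CLI sketch for: render.py --scenes scenes.json --output out.mp4 --fps 24"""
    p = argparse.ArgumentParser(
        prog="render.py",
        description="Render a genzo production from a declarative scene file.",
    )
    p.add_argument("--scenes", required=True, help="path to scenes.json")
    p.add_argument("--output", default="out.mp4", help="output mp4 path")
    p.add_argument("--fps", type=int, default=24, help="frames per second")
    return p

args = build_parser().parse_args(["--scenes", "scenes.json", "--fps", "24"])
```

Keeping the parser in its own function makes it testable without touching `sys.argv`, and leaves room for later flags (resolution, palette override) without reshaping the entry point.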
Benchmarks are in `bench.py` — run `python3 bench.py` to check current numbers.
| Resolution | Current | With parallelism | Rust (estimate) |
|---|---|---|---|
| 854×480 | 223 fps ✅ | — | — |
| 1080p | 50 fps ✅ | — | — |
| 4K | 13 fps | ~90 fps (8 cores) | ~500+ fps |
- Parallel frame rendering — add `multiprocessing.Pool` to `core.py`. Each frame is a pure function of its index, so parallelism is trivial. Fonts must be loaded inside each worker (not picklable). ~5 line change, ~8× speedup on 8 cores. Do this before considering Rust.

  ```python
  def init_worker():
      global fonts
      fonts = load_fonts(scale)

  with Pool(initializer=init_worker) as pool:
      pool.map(render_frame, range(n_frames))
  ```

- Rust rewrite — only worth it at 4K for videos longer than ~10 minutes. The aesthetic is flat/geometric, so PIL handles it well at lower resolutions. Main challenges: text rendering (use the `ab_glyph` crate) and JP glyph support.