Built for the Gemini Live Agent Challenge.
ExplainFlow is an agent-coordinated AI Production Studio that transforms documents, PDFs, and media into grounded visual explainer streams where every claim links back to the original source.
It has two modes:
- Advanced Studio — a staged workflow for source ingestion, signal extraction, render-profile locking, script-pack planning, live scene generation with QA retries, proof-linked review, and export
- Quick — a faster path that produces a grounded artifact, Proof Reel, playlist, and cinematic MP4 from the same inputs
Unlike standard AI generators, ExplainFlow uses a checkpoint-driven agentic workflow. Users can pause, review, co-direct, and verify source-proof backing for generated claims before final rendering. It is built on Gemini via the Google GenAI SDK and runs on Google Cloud Run and Cloud Storage.
Most AI generators go directly from source to final output. ExplainFlow does not. It exposes the lifecycle in the middle:
- The extracted signal can be reviewed.
- The script pack is locked before rendering.
- The planner self-checks and repairs weak plans automatically.
- Each scene can be reviewed, retried, or selectively regenerated without rerunning the whole workflow.
- Proof remains attached to the output.
The workflow is optimized for controllability, recovery, and traceability rather than raw generation speed.
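The checkpoint gates described above can be sketched as a small state machine. This is a minimal illustration only; the `Checkpoint` and `Workflow` names are hypothetical, not ExplainFlow's actual classes:

```python
from enum import Enum, auto

class Checkpoint(Enum):
    SIGNAL = auto()    # extracted signal reviewed
    PROFILE = auto()   # render profile locked
    SCRIPT = auto()    # script pack locked
    STREAM = auto()    # scenes generated
    EXPORT = auto()    # final bundle produced

ORDER = list(Checkpoint)

class Workflow:
    def __init__(self):
        self.passed = set()

    def gate(self, cp: Checkpoint):
        """Refuse to advance until every earlier checkpoint has passed."""
        idx = ORDER.index(cp)
        missing = [c for c in ORDER[:idx] if c not in self.passed]
        if missing:
            raise RuntimeError(f"blocked at {cp.name}: {missing[0].name} not passed")
        self.passed.add(cp)
```

Because each gate checks all earlier gates, a user can rewind to any checkpoint and replay only the stages after it instead of rerunning the whole pipeline.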
ExplainFlow is deployed and available for testing. No login or credentials required.
- Studio: https://explainflow-web-nxgdm2zy3a-uc.a.run.app
- API: https://explainflow-api-nxgdm2zy3a-uc.a.run.app
- Open the Studio link above and select Quick.
- Paste a YouTube URL (try https://www.youtube.com/watch?v=yRV8fSw6HaE) or copy-paste a transcript or any source text.
- Set the topic to: Explain this video to non-technical students in a playful tone.
- Click Generate.
- The artifact appears in seconds — four grounded blocks with claim refs and evidence.
- Click Proof Reel to see a deterministic walkthrough of the artifact with source-backed evidence.
- Click Export MP4 to render a cinematic video from the artifact.
- Select Advanced from the landing page.
- Paste the following source text:
Explain Mandelbrot's core idea through the coastline paradox: the closer you measure, the longer the coastline becomes.
- Click Extract Signal and review the extracted signal (thesis, claims, evidence, narrative beats) in the Content Signal panel.
- Lock the render profile with the following settings:
- Artifact Type: Storyboard Grid
- Visual Mode: Hybrid
- Audience Level: Intermediate
- Audience Persona: Curious systems thinker
- Taste Bar: Highest
- Click Generate Script Pack — the planner builds a scene-by-scene production manifest with QA checks.
- Click Start Generation — scenes stream in real time with text, images, and audio. The Agent Session Notes panel shows checkpoint progress and QA status as they happen.
- After generation, click any claim badge on a scene card to open the Proof Viewer and inspect the backing evidence.
- Optionally regenerate any individual scene with a custom instruction.
- Click Export for a ZIP bundle (script + images + audio) or Export MP4 for a rendered video.
- Agentic Coordination: A persistent `AgentCoordinator` manages production gates (Signal, Profile, Script) using a state-aware `workflow_id`.
- Agent Harness: A staged harness manages checkpoints, self-checks the plan before generation, validates scenes during streaming, and preserves proof-aware recovery paths.
- Interleaved Multimodal Output: Streams scene-by-scene text, visuals, audio, and proof-linked media instead of waiting to reveal only a final artifact.
- Self-Healing Production: An Auto QA Gate scores every scene. If quality or alignment fails, the director automatically triggers a retry to fix the scene in real time.
- Visual Chaining: Passes visual anchor terms between scenes to prevent narrative drift.
- Proof-Linked Generation: Carries claim refs, evidence refs, and source-proof media from extraction through scene review and proof playback.
- Conversational Co-Direction: The `WorkflowChatAgent` lets users adjust styles, personas, or narrative focus through natural language at any checkpoint.
- Cinematic MP4 Export: Takes generated images, synthesized voiceover audio, and source proof clips, then bundles them with Ken Burns panning, crossfade transitions, and timed overlays into a single exportable video.
- Production-Grade Export: Package your validated explainer into a ZIP bundle containing the script, high-res images, and synchronized audio.
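The self-healing retry loop above can be sketched in a few lines. This is an illustrative shape, not ExplainFlow's actual implementation; `generate`, `score`, and the threshold are stand-ins:

```python
def generate_with_qa(generate, score, max_retries=2, threshold=0.7):
    """Generate a scene, score it, and retry on QA failure (sketch)."""
    scene = generate(retry_note=None)
    for _ in range(max_retries):
        quality = score(scene)
        if quality >= threshold:
            return scene, quality
        # Feed the QA verdict back so the retry can target the weakness.
        scene = generate(retry_note=f"previous attempt scored {quality:.2f}")
    return scene, score(scene)
```

The key design point is that the retry receives the QA verdict as context, so a regeneration is a repair attempt rather than a blind re-roll.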
```mermaid
flowchart TD
subgraph USER["User"]
QUICK["Quick Mode"]
ADVANCED["Advanced Studio"]
end
subgraph FRONTEND["Next.js Frontend — Google Cloud Run"]
UI["Studio UI\nSource Intake · Render Profile\nScene Review · Proof Viewer · Export"]
SSE_CLIENT["SSE Stream Client\nReal-Time Scene Rendering"]
end
subgraph BACKEND["FastAPI Backend — Google Cloud Run"]
ROUTES["REST API + SSE Endpoints"]
COORD["AgentCoordinator\nCP1 Signal → CP2 Artifacts → CP3 Render\n→ CP4 Script → CP5 Stream → CP6 Bundle"]
STORY["GeminiStoryAgent\nExtraction · Planning · Scene Streaming"]
CHAT["WorkflowChatAgent\nCheckpoint-Aware Co-Direction"]
QA["Planner QA + Scene QA\nSelf-Healing · Auto-Retry"]
end
subgraph GEMINI["Gemini Models — Google GenAI SDK"]
PRO["gemini-3.1-pro-preview\nSignal Extraction (structural + creative)\nScript Pack Planning · Constrained Replans\nWorkflow Chat"]
LITE["gemini-3.1-flash-lite-preview\nTranscript Normalization\nPlanner Precompute (salience, forward-pull)\nQuick Artifact Generation"]
IMG["gemini-3-pro-image-preview\nInterleaved Text + Image\nScene Generation · Scene Regen\nQuick Image Generation"]
FLASH["gemini-3-flash-preview\nAsset-Backed Source Recovery"]
end
GCS[("Google Cloud Storage\nSource Assets · Scene Media\nProof Media · Final Outputs")]
QUICK & ADVANCED --> UI
UI -->|"REST"| ROUTES
ROUTES --> COORD
ROUTES --> CHAT
CHAT --> COORD
CHAT --> STORY
COORD --> STORY
STORY --> QA
QA -->|"Auto-Retry on QA Fail"| STORY
STORY --> PRO
STORY --> LITE
STORY --> IMG
STORY --> FLASH
CHAT -->|"GenAI SDK"| PRO
STORY -->|"Assets + Media"| GCS
ROUTES -->|"SSE Stream"| SSE_CLIENT
SSE_CLIENT --> UI
style USER fill:#f8fafc,stroke:#94a3b8,color:#1e293b
style FRONTEND fill:#dbeafe,stroke:#3b82f6,color:#1e293b
style BACKEND fill:#fef3c7,stroke:#f59e0b,color:#1e293b
style GEMINI fill:#fce7f3,stroke:#ec4899,color:#1e293b
style GCS fill:#d1fae5,stroke:#10b981,color:#1e293b
```
```mermaid
sequenceDiagram
participant U as "User"
participant ST as "Studio UI"
participant C as "AgentCoordinator"
participant G as "Gemini Runtime"
U->>ST: Provide source material and choose output intent
ST->>C: Start or update workflow
C->>G: Extract grounded signal
G-->>C: content_signal
C-->>ST: Signal ready for review
ST->>C: Lock profile and request script pack
C->>G: Plan scenes and run planner QA
G-->>C: locked script pack
C-->>ST: Script pack ready
ST->>C: Start generation stream
loop Per scene
C->>G: Generate text, image, and audio
G-->>C: Scene candidate
C->>C: QA / retry if needed
C-->>ST: Scene output + proof links
end
C-->>ST: Final bundle ready
```
```mermaid
flowchart LR
A["Source Material<br/>text, PDF, image, audio, video"]
B["Extracted Signal<br/>thesis, claims, evidence"]
C["Locked Script Pack<br/>scene goals, refs, render strategy"]
D["Scene Outputs<br/>text, visuals, audio, proof links"]
E["Review + Export<br/>scene review, final bundle, MP4"]
A --> B
B --> C
C --> D
D --> E
B -.->|"claim refs / evidence refs"| D
A -.->|"source proof media"| D
```
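The proof-link edge in the data flow above amounts to a lookup from a scene's claim refs back to the evidence extracted from the source. A minimal sketch, assuming hypothetical `scene` and `signal` dict shapes (the real payloads are richer):

```python
def resolve_proof(scene: dict, signal: dict) -> dict:
    """Follow a scene's claim refs back to the source evidence they cite."""
    evidence_by_claim = {c["id"]: c["evidence"] for c in signal["claims"]}
    return {ref: evidence_by_claim[ref] for ref in scene["claim_refs"]}
```

Clicking a claim badge in the Proof Viewer is, conceptually, this lookup: the refs travel with the scene, so no re-extraction is needed at review time.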
For the exact route-level workflow and checkpoint ownership, see docs/architecture.md.
Extracts a style-agnostic `content_signal` (Thesis, Key Claims, Narrative Beats) using gemini-3.1-pro-preview.
A multi-stage intake process to define:
- Audience Persona: (e.g., Venture Capitalist, Student, Engineer)
- Art Direction: (Diagram, Illustration, Hybrid)
- Constraints: `must_include` and `must_avoid` rules for strict content alignment.
The planner compiles a production manifest before generation so every scene has:
- scene count and pacing budget
- claim coverage across the planned narrative
- acceptance checks per scene
- render strategy before expensive generation starts
- proof attachment opportunities (`claim_refs`, `evidence_refs`, `source_media`)
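A per-scene manifest entry might look roughly like the dataclass below. Field names are illustrative, not ExplainFlow's actual schema; the point is that coverage can be checked before any expensive generation starts:

```python
from dataclasses import dataclass

@dataclass
class ScenePlan:
    """One entry in the production manifest (illustrative field names)."""
    goal: str
    claim_refs: list[str]          # which source claims this scene covers
    evidence_refs: list[str]       # proof attachments carried through
    acceptance_checks: list[str]   # per-scene QA criteria
    render_strategy: str = "hybrid"

def uncovered_claims(plan: list[ScenePlan], all_claims: set[str]) -> set[str]:
    """Claims the planned narrative fails to cover."""
    covered = {ref for scene in plan for ref in scene.claim_refs}
    return all_claims - covered
```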
Before or around script-pack planning, ExplainFlow can also run:
- Salience analysis — identifies what is central, high-stakes, surprising, causally important, or genuinely transformative in the source
- Forward-pull analysis — models narrative momentum using bait, hook, threat, reward, and payload
- Planner QA / repair — checks the proposed plan before scene generation and either repairs it deterministically or triggers a constrained replan
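The repair-or-replan decision can be sketched as follows. This is a simplified stand-in: the function names and dict shapes are hypothetical, and the real planner's checks go beyond claim coverage and scene count:

```python
def qa_plan(plan, all_claims, max_scenes=8):
    """Planner QA sketch: repair deterministically where possible, else flag a replan."""
    covered = {ref for scene in plan for ref in scene["claim_refs"]}
    missing = all_claims - covered
    issues = [f"uncovered claim: {c}" for c in sorted(missing)]
    # Deterministic repair: drop filler scenes that cover no claims.
    if len(plan) > max_scenes:
        plan = [s for s in plan if s["claim_refs"]][:max_scenes]
    # Coverage gaps cannot be fixed by trimming; they need a constrained replan.
    return plan, issues, bool(missing)
```

The split matters for cost: trimming is free and deterministic, while a replan is a constrained model call reserved for problems trimming cannot fix.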
The orchestration loop delivers narration text interleaved with high-fidelity visuals from gemini-3-pro-image-preview. During streaming, ExplainFlow generates text, image, and optional audio scene by scene, runs scene-level QA, retries weak scenes automatically, and keeps proof links attached to the output.
Quick is intentionally layered: artifact first, Proof Reel second, MP4 third. Each layer reuses the previous one instead of re-planning from scratch. That keeps Quick latency low while preserving claim refs, evidence refs, source-proof selection, and exportability.
The workflow chat agent can explain the current stage, inspect workflow state, recommend the next safe action, trigger tool-backed actions (extraction, lock, script-pack, stream launch), and return the workflow to the right checkpoint instead of forcing a full restart.
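Recommending the next safe action and rewinding to a checkpoint both reduce to simple operations over an ordered pipeline. A sketch with hypothetical step names:

```python
PIPELINE = ["extract", "lock_profile", "script_pack", "stream", "export"]

def next_safe_action(completed: set[str]) -> str:
    """Recommend the earliest pipeline step that has not yet completed."""
    for step in PIPELINE:
        if step not in completed:
            return step
    return "done"

def rewind_to(step: str, completed: set[str]) -> set[str]:
    """Return the workflow to a checkpoint: keep only steps before `step`."""
    idx = PIPELINE.index(step)
    return {s for s in completed if PIPELINE.index(s) < idx}
```

Rewinding preserves everything upstream of the target checkpoint, which is why co-direction never forces a full restart.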
For the best experience, use the live hosted links above. Local setup requires only a Gemini API key — the backend falls back to local file storage automatically when Google Cloud Storage is not configured.
- Python 3.10+
- Node.js 20+ and `npm`
- A Google GenAI API key (with access to `gemini-3.1-pro-preview` and `gemini-3-pro-image-preview`)
```shell
cd api
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
echo "GEMINI_API_KEY=your_api_key_here" > .env
uvicorn app.main:app --reload --port 8000
```

```shell
cd web
npm install
npm run dev
```

Frontend runs at http://localhost:3000. Backend runs at http://localhost:8000.