feat: TTS humanization pipeline, rating system, and STT upgrade #414

Closed
Reuz93 wants to merge 2 commits into jamiepine:main from Reuz93:main

Reuz93 commented Apr 15, 2026

Summary

  • STT upgrade: Default Whisper model upgraded to large-v3-mlx (Apple Silicon) / large-v3 (PyTorch) for better Spanish transcription. Fixed missing preprocessor_config.json with fallback to turbo processor.
  • Paralinguistic tags: Added sniff, shush, whimper, scream, whisper to the tag router PARA_TAGS set for richer expressiveness in TTS output.
  • Generation rating system: Thumbs up/down on history rows. Rating + sampling params stored per generation. GET /profiles/{id}/suggested-params returns averaged best params after 3+ high-rated generations.
  • History params visibility: All 5 sampling params (temperature, top_k, top_p, repetition_penalty, speed) shown in badge popover per history row. "Reuse" button applies text + params back to the generation form.
  • Advanced panel: Added Top-K, Top-P, Rep. Penalty sliders to FloatingGenerateBox Advanced popover (they were missing, so those fields were never saved).
  • TTS humanization utilities: New modules — breath_injection, hybrid_generate, tag_router, text_preprocess — form the backbone of the humanization pipeline.
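
The STT bullet above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function names `default_whisper_size` and `resolve_processor_source`, and the use of the string `"turbo"` as the fallback identifier, are assumptions made for the example.

```python
import platform
from pathlib import Path
from typing import Optional


def default_whisper_size(system: Optional[str] = None,
                         machine: Optional[str] = None) -> str:
    """Pick the MLX variant on Apple Silicon and the PyTorch variant elsewhere."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        return "large-v3-mlx"
    return "large-v3"


def resolve_processor_source(model_dir: Path, fallback: str = "turbo") -> str:
    """Use the model's own processor if its config exists, else the fallback.

    The fallback name is hypothetical; the PR only says it falls back to the
    turbo processor when preprocessor_config.json is missing.
    """
    if (model_dir / "preprocessor_config.json").is_file():
        return str(model_dir)
    return fallback
```

Passing `system` and `machine` explicitly makes the selection testable on any host, which may be useful for the first two items of the test plan.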

Test plan

  • Record or upload a voice sample and confirm STT transcribes Spanish speech correctly using large-v3-mlx on Apple Silicon
  • Verify fallback to turbo processor when preprocessor_config.json is absent
  • Generate TTS with text containing paralinguistic tags ([sniff], [whimper], [scream], [whisper], [shush]) and confirm they route correctly
  • Rate several generations thumbs up; confirm GET /profiles/{id}/suggested-params returns averaged params after 3+ ratings
  • Open a history row popover and verify all 5 sampling params display correctly
  • Click "Reuse" on a history row and confirm text + params populate the generation form
  • Open the Advanced panel in FloatingGenerateBox and confirm Top-K, Top-P, and Rep. Penalty sliders are present and their values are saved on generation
  • Run a generation end-to-end and confirm no regressions in audio output quality
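
To make the tag-routing check above concrete, here is a small sketch of how bracketed paralinguistic tags could be split out of input text. It assumes tags are written as `[name]`; the helper `split_tags` is hypothetical, and `PARA_TAGS` here holds only the five tags named in this PR (the real set presumably contains more).

```python
import re
from typing import List, Tuple

# Only the tags added in this PR; the actual PARA_TAGS set is assumed to be larger.
PARA_TAGS = {"sniff", "shush", "whimper", "scream", "whisper"}
_TAG_RE = re.compile(r"\[(\w+)\]")


def split_tags(text: str) -> List[Tuple[str, str]]:
    """Split text into ("text", ...) and ("tag", ...) segments, in order."""
    segments: List[Tuple[str, str]] = []
    pos = 0
    for match in _TAG_RE.finditer(text):
        name = match.group(1).lower()
        if name not in PARA_TAGS:
            continue  # unknown bracketed words stay inside the surrounding text
        before = text[pos:match.start()].strip()
        if before:
            segments.append(("text", before))
        segments.append(("tag", name))
        pos = match.end()
    rest = text[pos:].strip()
    if rest:
        segments.append(("text", rest))
    return segments
```

A router could then send `"tag"` segments to an engine that supports them and `"text"` segments to the cloned-voice engine; that routing step is not shown here.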

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Advanced generation panel with sampling controls (temperature, top_k/top_p, repetition_penalty, speed, jitter, humanize options) and “Proven params” suggestions per voice
    • Rate generations (thumbs up/down) and reuse parameters from history
    • Guided 40s voice recording UI with auto-filled script and improved upload flow
    • Breath injection and micro-timing (jitter) for more natural audio
    • New audio effects and presets; expanded speech-recognition models
  • Performance

    • Automatic idle-model unloading to reduce memory usage
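
One way idle-model unloading might be tracked is sketched below. The class name `IdleModelTracker`, its methods, and the timeout handling are assumptions for illustration; the PR's actual tracking and background loop are not shown in this conversation.

```python
import time
from typing import Callable, Dict, List


class IdleModelTracker:
    """Remember when each model was last used and report the idle ones."""

    def __init__(self, idle_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.idle_seconds = idle_seconds
        self._clock = clock  # injectable so tests can control time
        self._last_used: Dict[str, float] = {}

    def touch(self, name: str) -> None:
        """Record that a model was just used."""
        self._last_used[name] = self._clock()

    def pop_idle(self) -> List[str]:
        """Return and forget models unused for at least idle_seconds."""
        now = self._clock()
        idle = [n for n, t in self._last_used.items()
                if now - t >= self.idle_seconds]
        for name in idle:
            del self._last_used[name]
        return idle
```

A background task would call `pop_idle()` periodically and unload each returned model, while every generation or transcription request calls `touch()`.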

- Upgrade default Whisper model to large-v3-mlx (Apple Silicon) / large-v3 (PyTorch) for better Spanish transcription; fix missing preprocessor_config.json with fallback to turbo processor
- Add paralinguistic tags (sniff, shush, whimper, scream, whisper) to tag router PARA_TAGS set
- Add thumbs up/down rating system on history rows; rating + sampling params stored per generation; GET /profiles/{id}/suggested-params returns averaged best params after 3+ high-rated generations
- Show all 5 sampling params (temperature, top_k, top_p, repetition_penalty, speed) in history row badge popover with Reuse button that restores text + params to generation form
- Add Top-K, Top-P, Rep. Penalty sliders to FloatingGenerateBox Advanced popover so those fields are properly saved
- Add breath_injection, hybrid_generate, tag_router, text_preprocess utility modules for TTS humanization pipeline

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai bot commented Apr 15, 2026

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: d5872451-045d-46ba-94b8-639f0af933df

📥 Commits

Reviewing files that changed from the base of the PR and between f728150 and 5c25994.

📒 Files selected for processing (9)
  • app/src/components/Generation/FloatingGenerateBox.tsx
  • app/src/components/History/HistoryTable.tsx
  • app/src/lib/api/types.ts
  • backend/database/migrations.py
  • backend/database/models.py
  • backend/models.py
  • backend/routes/generations.py
  • backend/routes/profiles.py
  • backend/services/history.py

📝 Walkthrough

Advanced generation controls, sampling and humanization options, breath/jitter audio shaping, history/profile-based parameter suggestions and reuse, 40s recording guide, new backend hybrid multi-engine flow, idle model unloading, new DB fields/endpoints for ratings and suggested params, and expanded effects/preset registry.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Frontend Generation UI: app/src/components/Generation/FloatingGenerateBox.tsx, app/src/components/Generation/GenerationForm.tsx | Added advanced settings panel (temperature, top_k, top_p, repetition_penalty, speed, inject_breaths, jitter_ms, humanize_text/intensity), suggested params fetch/apply, _reuse preset handling, and engine-specific input adjustments. |
| Frontend History & Store: app/src/components/History/HistoryTable.tsx, app/src/stores/generationStore.ts | Added ParamsBadge popover, rating actions, "reuse params" action that populates generationStore.reuseParams, and new ReuseParams store typing. |
| Frontend Voice Recording / Profiles: app/src/components/VoiceProfiles/AudioSampleRecording.tsx, app/src/components/VoiceProfiles/ProfileForm.tsx, app/src/components/VoiceProfiles/SampleUpload.tsx | Recording guide UI, auto-advance/scroll lines, increased recording guidance to 40s, auto-fill referenceText from SCRIPT_LINES, and removal of explicit transcription button/props in some flows. |
| Frontend Hooks & API types: app/src/lib/hooks/useGenerationForm.ts, app/src/lib/hooks/useAudioRecording.ts, app/src/lib/api/client.ts, app/src/lib/api/types.ts | Extended form schema and hook to accept sampling/humanization fields; increased recording maxDuration default; added rateGeneration and getSuggestedParams client methods; expanded request/response types and WhisperModelSize. |
| Backend Routes & Services: backend/routes/generations.py, backend/routes/profiles.py, backend/services/generation.py, backend/services/history.py | Persist and propagate sampling/jitter/humanize/inject_breaths fields through generation pipeline; added PATCH /generations/{id}/rating; added GET /profiles/{id}/suggested-params computing decayed averages; run_generation updated to accept new params and route hybrid generation. |
| Backend Models & DB: backend/models.py, backend/database/models.py, backend/database/migrations.py | Added persisted sampling and humanization fields plus rating and jitter_ms to DB/model layers; migrations to add new columns. |
| Backend Backends & Lifecycle: backend/backends/__init__.py, backend/backends/chatterbox_turbo_backend.py, backend/backends/mlx_backend.py, backend/backends/pytorch_backend.py, backend/app.py | TTS backend APIs now accept sampling_params; MLX/backends improved unload/cache clearing and processor fallback; default Whisper sizes updated; added idle model tracking and startup background unload loop. |
| Backend Audio/Effects Utilities: backend/utils/audio.py, backend/utils/chunked_tts.py, backend/utils/effects.py | Added warm-up trimming, increased reference audio max to 40s, jitter_ms support in concatenation, and new effects (distortion, clipping, noise_gate, limiter) plus new presets. |
| Backend Text & Audio Helpers: backend/utils/text_preprocess.py, backend/utils/breath_injection.py, backend/utils/tag_router.py, backend/utils/hybrid_generate.py | Added Ollama-based disfluency injector, breath-injection synthesis, paralinguistic tag parser, and hybrid multi-engine generation routing. |
| Backend Transcription: backend/routes/transcription.py | Improved temp-file extension derivation from MIME type with fallback and refined error messages. |
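
As an illustration of the transcription change in the last row, the sketch below derives a temp-file suffix from an upload's MIME type with a fallback. The function name `temp_suffix`, the explicit mapping, and the `.wav` default are assumptions; only `mimetypes.guess_extension` is a real standard-library call.

```python
import mimetypes
from typing import Optional

# Hypothetical explicit mapping for common audio uploads; checked first so the
# result does not depend on the host's MIME database.
_AUDIO_EXT = {
    "audio/webm": ".webm",
    "audio/wav": ".wav",
    "audio/x-wav": ".wav",
    "audio/mpeg": ".mp3",
    "audio/mp4": ".m4a",
    "audio/ogg": ".ogg",
}


def temp_suffix(content_type: Optional[str], default: str = ".wav") -> str:
    """Map a Content-Type header value to a file suffix, with a fallback."""
    if not content_type:
        return default
    # Drop parameters such as ";codecs=opus" before looking up the type.
    base = content_type.split(";", 1)[0].strip().lower()
    return _AUDIO_EXT.get(base) or mimetypes.guess_extension(base) or default
```

The suffix matters because many audio decoders pick a demuxer from the file extension, so a wrong or missing suffix on the temp file can make an otherwise valid upload fail to load.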

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant UI as Generation UI
    participant API as API Server
    participant DB as Database
    participant TTS as TTS Engine

    User->>UI: Submit text + sampling/humanize params
    UI->>API: POST /generate {text, engine, sampling_params, ...}
    API->>DB: create_generation(record with params)
    API->>API: build_sampling_params()
    API->>TTS: generate(text, voice_prompt, sampling_params, jitter_ms)
    TTS->>TTS: apply sampling overrides / generate audio
    TTS-->>API: audio + metadata
    API->>API: optionally inject_breaths(), trim_warmup(), apply_jitter()
    API->>DB: update generation record with outputs
    API-->>UI: return audio + generation id
    User->>UI: Click rating button
    UI->>API: PATCH /generations/{id}/rating {rating}
    API->>DB: update rating
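
The `apply_jitter()` step in the diagram above is not spelled out in this conversation. A minimal sketch of one plausible reading follows, assuming jitter means a random silent gap of up to `jitter_ms` inserted between consecutive chunks; the function name `concat_with_jitter` and the list-of-floats audio representation are hypothetical choices for the example.

```python
import random
from typing import List, Optional, Sequence


def concat_with_jitter(chunks: Sequence[Sequence[float]],
                       sample_rate: int,
                       jitter_ms: float,
                       rng: Optional[random.Random] = None) -> List[float]:
    """Join audio chunks, inserting 0..jitter_ms of silence between them."""
    rng = rng or random.Random()
    max_gap = int(sample_rate * jitter_ms / 1000)  # gap ceiling in samples
    out: List[float] = []
    for i, chunk in enumerate(chunks):
        if i > 0 and max_gap > 0:
            # Assumed behavior: uniform random pause length per boundary.
            out.extend([0.0] * rng.randint(0, max_gap))
        out.extend(chunk)
    return out
```

With `jitter_ms=0` this reduces to plain concatenation, so the existing behavior is unchanged when the option is off.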
sequenceDiagram
    actor User
    participant History as History UI
    participant Store as generationStore
    participant UI as Generation UI

    User->>History: Click "Reuse params" on a row
    History->>Store: setReuseParams(ReuseParams)
    Store-->>UI: reuseParams updated
    UI->>UI: populate form fields (engine, language, sampling, effects=_reuse)
    UI->>API: GET /profiles/{profileId}/suggested-params
    API->>DB: query high-rated generations, compute decayed averages
    API-->>UI: SuggestedParams
    UI->>UI: show banner "Proven params" -> Apply -> update sliders
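
The "compute decayed averages" step can be sketched as a weighted mean in which older ratings count less. The half-life weighting below is an assumption, as are the function name `decayed_average` and the `(age_days, params)` input shape; the PR only states that the average uses exponential decay.

```python
from typing import Dict, List, Tuple


def decayed_average(samples: List[Tuple[float, Dict[str, float]]],
                    half_life_days: float = 30.0) -> Dict[str, float]:
    """Average sampling params over rated generations, favoring recent ones.

    samples: one (age_days, params) pair per positively rated generation.
    All params dicts are assumed to carry the same keys.
    """
    totals: Dict[str, float] = {}
    weight_sum = 0.0
    for age_days, params in samples:
        weight = 0.5 ** (age_days / half_life_days)  # halves every half-life
        weight_sum += weight
        for key, value in params.items():
            totals[key] = totals.get(key, 0.0) + weight * value
    if weight_sum == 0.0:
        return {}
    return {key: total / weight_sum for key, total in totals.items()}
```

For example, a generation rated today with `temperature=1.0` and one rated 30 days ago with `temperature=0.4` get weights 1 and 0.5, giving (1.0 + 0.2) / 1.5 = 0.8.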

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested reviewers

  • rhmod09-dev

"🐰 I fiddled with sliders, nudged breaths in the dark,
Reused an old sample, sparked a fresh lark.
Forty seconds of cadence, presets that play nice—
I hopped through the pipeline and sprinkled some spice. 🥕🎶"

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 72.22% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the three main change categories: TTS humanization pipeline, rating system, and STT upgrade, which align with the substantial changes across the codebase. |

- Added humanize_text, humanize_intensity, jitter_ms fields to generation history (DB migration, models, API types, UI badge)
- Rating system: weighted average with exponential decay, no minimum threshold, displays "Based on N ratings"

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Reuz93 closed this Apr 15, 2026