Purpose: Define the normative runtime, auth, capture, paste, configuration, and release contract for Voxit in this repository.
Status: normative
Read this when: You need the authoritative contract for Voxit runtime behavior, state transitions, authentication, audio capture, paste flow, configuration keys, or release scope.
Not this document: Step-by-step operational guidance, design rationale, or workflow instructions.
Defines:
- macOS-first runtime scope and platform boundaries
- user-visible state machine and transcript lifecycle
- authentication, storage, audio capture, finalize, rewrite, and paste contracts
- onboarding, configuration, CI, release, observability, and known-gap expectations
- Build entrypoint is the SwiftPM native macOS host under `native/macos-host/`.
- Voxit uses Rust Core plus a platform host: shared runtime and platform-neutral model contracts stay in Rust crates, while Swift owns the macOS UI.
- Context-aware voice behavior is governed by `contextual-voice.md`. The runtime must treat raw transcription as one step in a broader contextual voice input pipeline.
- The staged app bundle is a menu bar utility (`LSUIElement = true`) with a SwiftUI `MenuBarExtra` and an on-demand Voxit window.
- The app supports English-first behavior and configuration defaults (`language = "en"`).
- No speech is injected into target apps while Pass1 is running; text is only pasted after Pass2 or Pass3 completion.
The runtime state is user-visible through the Rust-owned native-host snapshot rendered by Swift:
- `Ready to listen.`
- `Listening`
- `Stopped`
- `FinalizingPass2`
- `RewritingPass3`
- `Done`
State transitions:
- `Start Dictation` or menu shortcut start in toggle mode -> capture focused context, start recording, and enter `Listening`.
- `Stop Dictation` or hotkey release in hold mode -> stop capture, encode WAV, then `FinalizingPass2`.
- Pass2 completion:
  - if auto rewrite is enabled -> `RewritingPass3`
  - else -> set final output to raw transcript and `Done`
- Pass3 completion:
  - if guard passes -> set final output to rewritten result and `Done`
  - if skipped or rejected -> set final output to raw transcript and `Done`
- If the active output policy is `insert_text`, the runtime pastes final output into the captured target automatically. Preview and confirmation policies leave output in the HUD for explicit paste.
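The transitions above can be read as a small pure function. The following is a minimal sketch, not the actual Rust Core implementation: the `Event` and `FinalOutput` types and the `step` function are hypothetical names introduced for illustration, `Stopped` is treated as a transient state folded into the stop transition, and allowing a new run to start from `Done` is an assumption.

```rust
/// User-visible runtime states, named as in the state list.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Ready,
    Listening,
    /// Transient in this sketch: stop + WAV encode lead straight to FinalizingPass2.
    Stopped,
    FinalizingPass2,
    RewritingPass3,
    Done,
}

/// Hypothetical events; the real host FFI command surface may be shaped differently.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Event {
    Start,
    Stop,
    Pass2Completed { auto_rewrite: bool },
    Pass3Completed { guard_passed: bool },
}

/// Which text becomes the final output once a run reaches `Done`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FinalOutput {
    Raw,
    Rewritten,
}

/// Returns the next state and, when the run finishes, the chosen final output.
/// Events that do not apply to the current state leave it unchanged.
pub fn step(state: State, event: Event) -> (State, Option<FinalOutput>) {
    match (state, event) {
        // Assumption: a new run may start from either Ready or Done.
        (State::Ready, Event::Start) | (State::Done, Event::Start) => (State::Listening, None),
        (State::Listening, Event::Stop) => (State::FinalizingPass2, None),
        (State::FinalizingPass2, Event::Pass2Completed { auto_rewrite: true }) => {
            (State::RewritingPass3, None)
        }
        (State::FinalizingPass2, Event::Pass2Completed { auto_rewrite: false }) => {
            (State::Done, Some(FinalOutput::Raw))
        }
        (State::RewritingPass3, Event::Pass3Completed { guard_passed: true }) => {
            (State::Done, Some(FinalOutput::Rewritten))
        }
        // Skipped or rejected rewrites fall back to the raw transcript.
        (State::RewritingPass3, Event::Pass3Completed { guard_passed: false }) => {
            (State::Done, Some(FinalOutput::Raw))
        }
        (unchanged, _) => (unchanged, None),
    }
}
```

Keeping the transition logic free of side effects makes it straightforward to unit-test; capture, upload, and paste would be triggered by the caller based on the returned state.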
- Default login is ChatGPT OAuth via device-code authorization.
- Browser callback OAuth is not part of the active V1 login surface.
- Token acquisition flow:
  - show the device code and verification URL
  - poll until ChatGPT authorization completes or fails
  - exchange the authorized device session and persist auth locally
  - fallback path uses `OPENAI_API_KEY` only when no OAuth token exists
- Storage:
  - preferred: keyring
  - fallback: local `auth.json`
- On startup:
  - read status as "signed in" when unexpired token or session metadata exists
  - otherwise show "Not signed in."
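The startup status rule and the API-key fallback can be illustrated with two small functions. This is a sketch under stated assumptions: `OAuthToken`, `StoredAuth`, `AuthStatus`, `Credential`, `startup_status`, and `pick_credential` are hypothetical names, the expiry is modeled as a Unix timestamp, and the real persisted auth layout may differ.

```rust
/// Hypothetical shape of a persisted OAuth token.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct OAuthToken {
    pub value: String,
    pub expires_at_unix: u64,
}

/// Hypothetical view of what was loaded from the keyring or local auth file.
#[derive(Debug, Clone, PartialEq, Eq, Default)]
pub struct StoredAuth {
    pub token: Option<OAuthToken>,
    pub has_session_metadata: bool,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AuthStatus {
    SignedIn,
    NotSignedIn,
}

/// Signed in when an unexpired token or session metadata exists.
pub fn startup_status(auth: &StoredAuth, now_unix: u64) -> AuthStatus {
    let token_unexpired = auth
        .token
        .as_ref()
        .map_or(false, |token| token.expires_at_unix > now_unix);
    if token_unexpired || auth.has_session_metadata {
        AuthStatus::SignedIn
    } else {
        AuthStatus::NotSignedIn
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Credential {
    OAuth(String),
    ApiKey(String),
}

/// The API key is consulted only when no OAuth token exists at all.
pub fn pick_credential(auth: &StoredAuth, api_key: Option<&str>) -> Option<Credential> {
    match &auth.token {
        Some(token) => Some(Credential::OAuth(token.value.clone())),
        None => api_key.map(|key| Credential::ApiKey(key.to_string())),
    }
}
```

A caller would typically pass `std::env::var("OPENAI_API_KEY").ok().as_deref()` as the `api_key` argument.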
- Default recorder is macOS CoreAudio VoiceProcessingIO.
- The active recorder input is resolved at session start from `audio.input_device_id`.
  - `0` means system default.
  - non-zero uses the requested CoreAudio input device id from config.
  - if the requested device is missing or unusable, Voxit falls back to system default before capture starts.
- Capture should be continuous while in `Listening`, producing in-memory PCM sample buffers and metadata (`sample_rate`, `channels`, `frames`).
- Raw audio must not be persisted by default.
- The current Swift Settings audio picker exposes System default (`audio.input_device_id = 0`).
- Rust can resolve explicit `audio.input_device_id` and `audio.input_device_name` values supplied through config.
- If a configured device id is invalid or stale when starting recording, the runtime falls back to system default and reports fallback in status or logs.
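The input-device rules above reduce to a three-way decision. Below is a minimal sketch of that decision, assuming the caller supplies the list of currently usable CoreAudio device ids; `ResolvedInput`, `resolve_input`, and `fallback_note` are hypothetical names, and the wording of the status note is illustrative only.

```rust
/// Outcome of resolving `audio.input_device_id` at session start.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResolvedInput {
    /// The config asked for the system default (`0`).
    SystemDefault,
    /// The requested device is present and usable.
    Device(u32),
    /// The requested device is missing or unusable; the default is used instead.
    FallbackToDefault { requested: u32 },
}

/// `usable_ids` is assumed to be enumerated by the platform layer before capture starts.
pub fn resolve_input(configured_id: u32, usable_ids: &[u32]) -> ResolvedInput {
    if configured_id == 0 {
        ResolvedInput::SystemDefault
    } else if usable_ids.contains(&configured_id) {
        ResolvedInput::Device(configured_id)
    } else {
        ResolvedInput::FallbackToDefault { requested: configured_id }
    }
}

/// Produces a note for status or logs only when a fallback happened.
pub fn fallback_note(resolved: ResolvedInput) -> Option<String> {
    match resolved {
        ResolvedInput::FallbackToDefault { requested } => Some(format!(
            "Input device {} unavailable; using system default.",
            requested
        )),
        _ => None,
    }
}
```

Distinguishing `FallbackToDefault` from `SystemDefault` keeps the fallback observable, which is what allows it to be reported rather than silently absorbed.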
- For each chunk, send `input_audio_buffer.append` payload frames to OpenAI Realtime.
- Realtime session must be configured with:
  - `model`: `openai.realtime_model` (default `gpt-realtime-2`)
  - `reasoning.effort`: the Rust-selected contextual voice plan effort
  - `audio.input.format`: `audio/pcm` with sample rate from config (default `24000`)
  - `audio.input.noise_reduction`: configured profile (default `near_field`) or `null` when set to `off`
  - `audio.input.transcription.model`: `openai.realtime.transcription_model` (default `gpt-4o-mini-transcribe`)
  - `audio.input.transcription.language`: `openai.language` (default `en`)
  - `audio.input.turn_detection.type`: `server_vad`
  - `audio.input.turn_detection.create_response`: `false`
- Realtime events consumed by the UI:
  - `conversation.item.input_audio_transcription.delta` (draft)
  - `conversation.item.input_audio_transcription.completed` (committed)
- Draft and committed must be separated in UI:
  - committed = finalized turns from completed events
  - draft = latest in-flight text fragment
- Ordering for committed text is deterministic by `item_id` and `previous_item_id` chain; out-of-order completed events must still render in chain order.
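Chain ordering of committed turns can be sketched as a walk from the head of the chain, independent of arrival order. This is an illustrative sketch, not the actual implementation: `CommittedTurn`, `order_by_chain`, and `render_committed` are hypothetical names, the head is assumed to be the turn with no `previous_item_id`, and holding back turns whose predecessor has not yet arrived is an assumption about gap handling.

```rust
use std::collections::HashMap;

/// A finalized turn taken from a completed transcription event.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CommittedTurn {
    pub item_id: String,
    pub previous_item_id: Option<String>,
    pub text: String,
}

impl CommittedTurn {
    pub fn new(item_id: &str, previous_item_id: Option<&str>, text: &str) -> Self {
        Self {
            item_id: item_id.to_string(),
            previous_item_id: previous_item_id.map(String::from),
            text: text.to_string(),
        }
    }
}

/// Orders turns by following the `previous_item_id` chain from the head.
/// Turns whose predecessor is still missing are left out until it arrives.
pub fn order_by_chain(turns: &[CommittedTurn]) -> Vec<&CommittedTurn> {
    // Index each turn by the id of the turn that precedes it.
    let mut by_previous: HashMap<Option<&str>, &CommittedTurn> = HashMap::new();
    for turn in turns {
        by_previous.insert(turn.previous_item_id.as_deref(), turn);
    }

    let mut ordered = Vec::with_capacity(turns.len());
    let mut cursor: Option<&str> = None; // `None` selects the head of the chain.
    // Removing visited entries guarantees termination even on malformed chains.
    while let Some(turn) = by_previous.remove(&cursor) {
        ordered.push(turn);
        cursor = Some(turn.item_id.as_str());
    }
    ordered
}

/// Joins the ordered committed text with single spaces for display.
pub fn render_committed(turns: &[CommittedTurn]) -> String {
    order_by_chain(turns)
        .iter()
        .map(|turn| turn.text.as_str())
        .collect::<Vec<_>>()
        .join(" ")
}
```

Because the order is derived from the chain on every render, a late-arriving completed event simply extends the rendered text the next time the function runs.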
- On stop, stop capture and upload full WAV to `/v1/audio/transcriptions`.
- Use the configured finalize model.
- Final transcript (`Pass2`) becomes baseline output for:
  - paste when rewrite is disabled or skipped
  - rewrite input when enabled
  - final output display
- Auto-run rewrite only when:
  - raw Pass2 transcript exists
  - rewrite is enabled in runtime preference
  - rewrite auto flag is enabled
- If disabled for this run, skip and paste raw final transcript.
- Rewriter output contract:
  - keep meaning
  - preserve numeric, date, and currency tokens
  - reject rewrite when the protected token multiset changes
  - enforce `rewrite.max_output_chars`
  - apply `rewrite.style` and any user glossary terms to prompt construction
- Guarded outcomes:
  - `Applied`: paste rewritten text
  - `Rejected`: fallback to raw Pass2 and paste raw
  - `Skipped`: fallback to raw Pass2 and paste raw
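The protected-token guard can be sketched as a multiset comparison between the raw transcript and the rewrite. This is a simplified illustration, not the actual guard: `Outcome`, `protected_tokens`, and `guard_rewrite` are hypothetical names, and treating any whitespace-separated word that contains an ASCII digit as a protected token is an assumption that stands in for real numeric, date, and currency detection.

```rust
use std::collections::HashMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Applied,
    Rejected,
    Skipped,
}

/// Counts protected tokens. Assumption: a token is protected when it contains
/// an ASCII digit, after trimming surrounding punctuation.
fn protected_tokens(text: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        let token = word.trim_matches(|c: char| {
            matches!(c, '.' | ',' | ';' | ':' | '!' | '?' | '(' | ')' | '"')
        });
        if token.chars().any(|c| c.is_ascii_digit()) {
            *counts.entry(token.to_string()).or_insert(0) += 1;
        }
    }
    counts
}

/// Returns the outcome together with the text that should be pasted.
/// `rewritten` is `None` when the rewrite pass was skipped for this run.
pub fn guard_rewrite<'a>(
    raw: &'a str,
    rewritten: Option<&'a str>,
    max_output_chars: usize,
) -> (Outcome, &'a str) {
    let candidate = match rewritten {
        Some(text) => text,
        None => return (Outcome::Skipped, raw),
    };
    // Enforce the configured output length limit.
    if candidate.chars().count() > max_output_chars {
        return (Outcome::Rejected, raw);
    }
    // Reject when the multiset of protected tokens changed in any way.
    if protected_tokens(raw) != protected_tokens(candidate) {
        return (Outcome::Rejected, raw);
    }
    (Outcome::Applied, candidate)
}
```

Comparing counts rather than sets means a rewrite that drops one of two identical amounts is rejected as well.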
- Before starting recording, capture frontmost app metadata (pid, bundle id, name) if `lock_frontmost_app = true`.
- Focus context capture also records available window title, URL domain, focused element role, and selected-text presence for Rust-owned prompt routing.
- On paste:
  - attempt to reactivate captured target app with retries
  - copy to clipboard
  - dispatch `Cmd+V` (`Meta+V`) to simulate paste
- A dedicated test-paste action should validate the clipboard and paste injection path.
- Hotkey chord handling:
  - supported mode switch: toggle or hold
  - system-wide and app-local key monitors observe the configured `hotkey.chord`
  - pressing the chord presents the non-activating floating recording HUD and starts dictation without making Voxit the target-app context
  - toggle mode stops on the next chord press; hold mode stops on hotkey release
- Menu bar behavior:
  - `MenuBarExtra` exposes `Open Voxit` (`Cmd+O`), `Settings...` (`Cmd+,`), `Start Dictation`, `Stop Dictation`, `Refresh Status` (`Cmd+R`), and `Quit Voxit` (`Cmd+Q`).
  - `Start Dictation` and `Stop Dictation` call the Rust host FFI command surface.
  - `Settings...` opens a dedicated AppKit-hosted Settings window.
  - The Settings window handles `Cmd+W` to close and `Cmd+Q` to terminate.
- UI surfaces are split by responsibility:
  - menu bar: always-available status and control
  - recording HUD: live session state, transcript preview, active profile, and paste controls
  - Voxit control-center window: activity, app rules, profiles, glossary, prompt lab, and debug/evaluation surfaces
  - Settings window: app preferences, shortcuts, model choices, microphone, permissions, account defaults, privacy, logging, and notifications
- Settings provides shortcut actions for required macOS permission panes:
  - Microphone
  - Accessibility
  - Input Monitoring
- Grant each permission in macOS Privacy & Security settings, then re-check before continuing to a real dictation run.
- "Paste raw now" is always available when finalization or rewrite is active and should bypass Pass3.
- The Control Center exposes the current focused context, selected profile, profile override, glossary terms, and prompt lab sample state. Profile override and glossary terms are passed back through Rust FFI before model calls.
- The Swift native host must render platform-neutral Rust model snapshots from `packages/voxit-core/` through `packages/voxit-host-ffi/` instead of defining a separate UI state machine, contextual routing policy, or prompt profile registry.
Config file location:
- `${Application Support}/voxit/config.toml` via `ProjectDirs`

Supported sections and keys:
- `ui.start_hidden`, `ui.panel_width_px`, `ui.panel_height_px`
- `hotkey.chord`, `hotkey.mode` (`toggle` or `hold`)
- `audio.backend`, `audio.input_sample_rate_hz`, `audio.input_device_name`, `audio.input_device_id`, `audio.realtime_target_rate_hz`
- `openai.api_base_url`, `openai.realtime_model`, `openai.finalize_model`, `openai.rewrite_model`, `openai.language`
- `openai.realtime.noise_reduction`, `openai.realtime.transcription_model`
- `rewrite.enabled`, `rewrite.auto`, `rewrite.guard_numbers`, `rewrite.max_output_chars`, `rewrite.style`
- `paste.lock_frontmost_app`, `paste.method`
On load:
- parse file when present
- use defaults for missing or invalid entries
- `audio.input_device_id = 0` is treated as system default
- non-zero `audio.input_device_id` requests that device; if unavailable at startup, Voxit falls back to default input
Current Swift Settings window:
- persists shell preferences in macOS `UserDefaults`
- exposes editable OpenAI model IDs for realtime voice, realtime transcript, finalize, and rewrite passes
- writes supported shell and model preferences through the Rust host FFI into `config.toml`
- `language.yml` is macOS-only for lint, format, and test checks.
- Release packaging matrix is restricted to `aarch64-apple-darwin` and comments out Linux and Windows jobs.
- Packaging uses `scripts/build_and_run.sh stage` to build `packages/voxit-host-ffi`, build the SwiftPM host, stage `target/voxit-native-host/Voxit.app`, and zip it as `voxit-<target>.zip`.
- Runtime logs are written via rotating file appender under the data directory.
- User-facing state is mirrored by status strings for troubleshooting.
- Error states must avoid hard-crash behavior and should return to a user-actionable status.
- App-rule authoring is not implemented yet; users can refresh focus context and manually override the active built-in profile.
- The Swift Settings audio picker still exposes only System Default even though Rust can resolve configured CoreAudio input device ids.
- CPAL fallback capture is not implemented despite a configuration option; only the VoiceProcessingIO path is active.