diff --git a/README.md b/README.md index c224e74..1458f12 100644 --- a/README.md +++ b/README.md @@ -24,7 +24,7 @@ AI dictation App for macOS (MVP scaffold). - Pass-2 finalize pass using `gpt-4o-transcribe` for better punctuation and stability. - Optional Pass-3 rewrite for cleaner English output with numeric/proper noun protection. - Auto-paste into the app that was frontmost when recording began. -- Configurable behavior and models via `config.toml`. +- Configurable behavior and models via Settings-backed `config.toml`. For the normative product contract, constraints, and gaps, see the [Runtime Spec](docs/spec/runtime.md). @@ -37,8 +37,8 @@ V1 target is **macOS-first** and aligned to the English-only voice input design. - Scope: ✅ Native macOS mic capture + OpenAI model pipeline only. - Limitation: ✅ Linux/Windows build is intentionally disabled. - Limitation: ⚠️ Known gaps are documented in the - [Runtime Spec](docs/spec/runtime.md) (runtime action wiring, config write-through, - CPAL fallback robustness, and rollout cleanup items). + [Runtime Spec](docs/spec/runtime.md) (explicit microphone picker, CPAL fallback + robustness, app-rule authoring, and rollout cleanup items). ## Usage @@ -104,13 +104,14 @@ realtime_target_rate_hz = 24000 [openai] api_base_url = "https://api.openai.com/v1" -realtime_model = "gpt-4o-mini-transcribe" +realtime_model = "gpt-realtime-2" finalize_model = "gpt-4o-transcribe" rewrite_model = "gpt-5.2-mini" language = "en" [openai.realtime] noise_reduction = "near_field" # near_field | far_field | off +transcription_model = "gpt-4o-mini-transcribe" [rewrite] enabled = true @@ -130,14 +131,14 @@ First-run onboarding checklist: - Microphone permission in **System Settings → Privacy & Security → Microphone**. - Accessibility permission in **Privacy & Security → Accessibility** (for Cmd+V fallback). - Input Monitoring permission in **Privacy & Security → Input Monitoring** (for global hotkey hooks). 
-- Voxit uses request buttons to guide you through the permission prompts in sequence (Microphone → Accessibility → Input Monitoring); grant each permission and re-check when prompted. +- Voxit Settings includes shortcut buttons for the relevant macOS privacy panes; grant each permission and re-check before a real dictation run. - Verify paste flow after permission grant and restart the app if needed. For the full guided sequence, see [First Run](docs/runbook/first-run.md). Runtime configuration remains sourced from `config.toml`. The current Swift Settings -window persists shell preferences in macOS `UserDefaults`; writing those settings back -through the Rust config path is a tracked runtime gap. +window persists shell and model preferences in macOS `UserDefaults` and writes +supported preferences back through the Rust host FFI. ### Interaction @@ -147,8 +148,10 @@ through the Rust config path is a tracked runtime gap. - While listening: panel shows live draft text and committed segments. - Stop recording: toggle key again or release key in hold mode. - Finalize: Pass-2 runs automatically; rewrite runs by default unless disabled in settings. -- Microphone input selection is persisted in config as `audio.input_device_id` and `audio.input_device_name`. -- Refresh workflow: the picker list is refreshed at startup and via the **Refresh microphones** control before choosing from a list of input-capable devices. +- Model choice: Settings exposes editable OpenAI model IDs for realtime voice, + realtime transcript, finalize, and rewrite passes. +- The Swift Settings audio picker currently exposes the system default microphone; explicit + `audio.input_device_id` values can still be resolved by Rust config. - Runtime fallback: if a saved explicit device id is unavailable, Voxit falls back to the system default input device and continues recording. - Paste behavior: by default paste rewritten text after finalize, or paste raw transcript via available controls. 
- Output target: text is pasted into the app that was frontmost when dictation started. diff --git a/docs/decisions/contextual-voice-layer.md b/docs/decisions/contextual-voice-layer.md index 77ac40b..47081a8 100644 --- a/docs/decisions/contextual-voice-layer.md +++ b/docs/decisions/contextual-voice-layer.md @@ -21,8 +21,8 @@ Consequences: - The main Voxit window is a control center for activity, app rules, profiles, glossary, prompt experiments, and debug/evaluation surfaces. - The Settings window stays separate and limited to app preferences such as startup, - shortcuts, microphone, permissions, account defaults, privacy, logging, and - notifications. + shortcuts, model choices, microphone, permissions, account defaults, privacy, logging, + and notifications. - Swift owns the native macOS presentation layer and UI glue. Rust owns durable product logic, context classification, prompt profile selection, voice session planning, output policy, and provider orchestration. diff --git a/docs/reference/repository-layout.md b/docs/reference/repository-layout.md index 004ff28..936b66f 100644 --- a/docs/reference/repository-layout.md +++ b/docs/reference/repository-layout.md @@ -16,8 +16,9 @@ files. ## Top-level surfaces - `native/macos-host/` holds the SwiftPM native macOS host. It owns platform UI - composition, the menu bar extra, the Voxit control-center window, the Settings - window, and links Rust through the host FFI static library. + composition, the menu bar extra, global hotkey observation, the floating recording + HUD, the Voxit control-center window, the Settings window, and links Rust through the + host FFI static library. - `packages/voxit-core/` holds the shared runtime logic, auth, OpenAI integration, and dictation pipeline code. 
Platform-neutral UI model types and contextual voice planning contracts also live here so hosts do not invent divergent state names, diff --git a/docs/runbook/first-run.md b/docs/runbook/first-run.md index 7b2919f..7604dd5 100644 --- a/docs/runbook/first-run.md +++ b/docs/runbook/first-run.md @@ -49,17 +49,17 @@ Verification: ## 4. Confirm runtime configuration - Open **Settings...** from the menu bar menu or press `Cmd+,` to confirm shell - preferences and permission shortcuts are available. + preferences, model choices, and permission shortcuts are available. - Check the config file at: ```text $HOME/Library/Application Support/voxit/config.toml ``` -- Confirm the default runtime hotkey and audio device settings look reasonable for the - machine. -- If you need an explicit microphone, refresh the device list and select it before the - first real dictation run. +- Confirm the default runtime hotkey, OpenAI model IDs, and system-default audio route + look reasonable for the machine. +- If you need an explicit microphone before the Swift picker exposes one, set + `audio.input_device_id` and `audio.input_device_name` in `config.toml`. ## 5. Verify paste flow diff --git a/docs/spec/contextual-voice.md b/docs/spec/contextual-voice.md index 393dbeb..c16c9fe 100644 --- a/docs/spec/contextual-voice.md +++ b/docs/spec/contextual-voice.md @@ -167,7 +167,7 @@ Swift hosts own: - menu bar, HUD, main window, and Settings presentation - macOS-specific context capture -- permission prompts and native controls +- permission panes and native controls - rendering Rust-owned snapshots and session plans - user confirmation UX diff --git a/docs/spec/runtime.md b/docs/spec/runtime.md index 5a6f907..debc54b 100644 --- a/docs/spec/runtime.md +++ b/docs/spec/runtime.md @@ -93,14 +93,10 @@ State transitions: ### 4.2 Device picker lifecycle -- On startup, the app refreshes available input-capable devices and caches the result. 
-- A manual **Refresh microphones** action is available in the UI to repopulate the - picker. -- Picker values map to: - - **System default** (`audio.input_device_id = 0`) - - an explicit input device id and name pair from a discovered device list -- Selection changes persist `audio.input_device_name` and `audio.input_device_id` to - config. +- The current Swift Settings audio picker exposes **System default** + (`audio.input_device_id = 0`). +- Rust can resolve explicit `audio.input_device_id` and `audio.input_device_name` values + supplied through config. - If a configured device id is invalid or stale when starting recording, the runtime falls back to system default and reports fallback in status or logs. @@ -108,10 +104,16 @@ State transitions: - For each chunk, send `input_audio_buffer.append` payload frames to OpenAI Realtime. - Realtime session must be configured with: + - `model`: `openai.realtime_model` (default `gpt-realtime-2`) + - `reasoning.effort`: the Rust-selected contextual voice plan effort - `audio.input.format`: `audio/pcm` with sample rate from config (default `24000`) - - `audio.input.noise_reduction`: configured profile (default `near_field`) - - `audio.input.transcription.model`: Pass1 model + - `audio.input.noise_reduction`: configured profile (default `near_field`) or `null` + when set to `off` + - `audio.input.transcription.model`: `openai.realtime.transcription_model` (default + `gpt-4o-mini-transcribe`) + - `audio.input.transcription.language`: `openai.language` (default `en`) - `audio.input.turn_detection.type`: `server_vad` + - `audio.input.turn_detection.create_response`: `false` - Realtime events consumed by the UI: - `conversation.item.input_audio_transcription.delta` (draft) - `conversation.item.input_audio_transcription.completed` (committed) @@ -167,8 +169,10 @@ State transitions: - Hotkey chord handling: - supported mode switch: toggle or hold - - the menu command uses the configured `hotkey.chord` presentation - - system-wide 
hotkey capture is not active yet + - system-wide and app-local key monitors observe the configured `hotkey.chord` + - pressing the chord presents the non-activating floating recording HUD and starts + dictation without making Voxit the target-app context + - toggle mode stops on the next chord press; hold mode stops on hotkey release - Menu bar behavior: - `MenuBarExtra` exposes `Open Voxit` (`Cmd+O`), `Settings...` (`Cmd+,`), `Start Dictation`, `Stop Dictation`, `Refresh Status` (`Cmd+R`), and `Quit Voxit` @@ -185,15 +189,14 @@ State transitions: controls - Voxit control-center window: activity, app rules, profiles, glossary, prompt lab, and debug/evaluation surfaces - - Settings window: app preferences, shortcuts, microphone, permissions, account - defaults, privacy, logging, and notifications -- Onboarding checklist provides request actions for required macOS permissions. The UI - prompts permission requests in order: - - Microphone: probe-based request and retry loop when denied - - Accessibility: system prompt request plus re-check - - Input Monitoring: system prompt request plus re-check -- Grant each permission in macOS Privacy & Security settings when prompted, then - re-check in Voxit before continuing. + - Settings window: app preferences, shortcuts, model choices, microphone, + permissions, account defaults, privacy, logging, and notifications +- Settings provides shortcut actions for required macOS permission panes: + - Microphone + - Accessibility + - Input Monitoring +- Grant each permission in macOS Privacy & Security settings, then re-check before + continuing to a real dictation run. - "Paste raw now" is always available when finalization or rewrite is active and should bypass Pass3. 
- The Control Center exposes the current focused context, selected profile, profile @@ -217,7 +220,7 @@ Supported sections and keys: `audio.input_device_id`, `audio.realtime_target_rate_hz` - `openai.api_base_url`, `openai.realtime_model`, `openai.finalize_model`, `openai.rewrite_model`, `openai.language` -- `openai.realtime.noise_reduction` +- `openai.realtime.noise_reduction`, `openai.realtime.transcription_model` - `rewrite.enabled`, `rewrite.auto`, `rewrite.guard_numbers`, `rewrite.max_output_chars`, `rewrite.style` - `paste.lock_frontmost_app`, `paste.method` @@ -233,7 +236,10 @@ On load: Current Swift Settings window: - persists shell preferences in macOS `UserDefaults` -- writes supported preferences through the Rust host FFI into `config.toml` +- exposes editable OpenAI model IDs for realtime voice, realtime transcript, finalize, + and rewrite passes +- writes supported shell and model preferences through the Rust host FFI into + `config.toml` ## 11) CI and Release @@ -253,10 +259,6 @@ Current Swift Settings window: ## 13) Known Gaps -- System-wide global hotkey capture is not implemented yet; the configured shortcut is - currently a Swift menu command. -- The native HUD does not yet render Pass1 realtime draft/committed transcript events; - it shows active profile/state plus raw and final output after Pass2/Pass3. - App-rule authoring is not implemented yet; users can refresh focus context and manually override the active built-in profile. 
- The Swift Settings audio picker still exposes only System Default even though Rust can diff --git a/native/macos-host/Sources/VoxitHostBridge/HostFFI.swift b/native/macos-host/Sources/VoxitHostBridge/HostFFI.swift index 0c9ec68..a92c048 100644 --- a/native/macos-host/Sources/VoxitHostBridge/HostFFI.swift +++ b/native/macos-host/Sources/VoxitHostBridge/HostFFI.swift @@ -70,6 +70,8 @@ public struct HostSnapshot: Equatable, Sendable { public var hasFocusedContext: Bool public var selectedTextPresent: Bool public var hasRawTranscript: Bool + public var hasPass1CommittedTranscript: Bool + public var hasPass1DraftTranscript: Bool public var hasFinalOutput: Bool public var hasError: Bool public var recordingDurationMS: UInt64 @@ -80,6 +82,8 @@ public struct HostSnapshot: Equatable, Sendable { public var focusedElementRole: String? public var promptProfileID: String? public var promptDirective: String? + public var pass1CommittedTranscript: String? + public var pass1DraftTranscript: String? public var rawTranscript: String? public var finalOutput: String? public var lastError: String? 
@@ -100,6 +104,8 @@ public struct HostSnapshot: Equatable, Sendable { hasFocusedContext: Bool, selectedTextPresent: Bool, hasRawTranscript: Bool, + hasPass1CommittedTranscript: Bool, + hasPass1DraftTranscript: Bool, hasFinalOutput: Bool, hasError: Bool, recordingDurationMS: UInt64, @@ -110,6 +116,8 @@ public struct HostSnapshot: Equatable, Sendable { focusedElementRole: String?, promptProfileID: String?, promptDirective: String?, + pass1CommittedTranscript: String?, + pass1DraftTranscript: String?, rawTranscript: String?, finalOutput: String?, lastError: String?, @@ -129,6 +137,8 @@ public struct HostSnapshot: Equatable, Sendable { self.hasFocusedContext = hasFocusedContext self.selectedTextPresent = selectedTextPresent self.hasRawTranscript = hasRawTranscript + self.hasPass1CommittedTranscript = hasPass1CommittedTranscript + self.hasPass1DraftTranscript = hasPass1DraftTranscript self.hasFinalOutput = hasFinalOutput self.hasError = hasError self.recordingDurationMS = recordingDurationMS @@ -139,6 +149,8 @@ public struct HostSnapshot: Equatable, Sendable { self.focusedElementRole = focusedElementRole self.promptProfileID = promptProfileID self.promptDirective = promptDirective + self.pass1CommittedTranscript = pass1CommittedTranscript + self.pass1DraftTranscript = pass1DraftTranscript self.rawTranscript = rawTranscript self.finalOutput = finalOutput self.lastError = lastError @@ -277,6 +289,34 @@ public final class VoxitHostSession { return try currentSnapshot() } + public func saveModelPreferences( + realtimeModel: String, + realtimeTranscriptionModel: String, + finalizeModel: String, + rewriteModel: String + ) throws -> HostSnapshot { + try realtimeModel.withCString { realtime in + try realtimeTranscriptionModel.withCString { realtimeTranscription in + try finalizeModel.withCString { finalize in + try rewriteModel.withCString { rewrite in + try requireOk( + voxit_host_session_save_model_preferences( + handle, + realtime, + realtimeTranscription, + finalize, + 
rewrite + ), + context: "saving model preferences" + ) + } + } + } + } + + return try currentSnapshot() + } + public func setProfileOverride(_ profileKind: PromptProfileKind) throws -> HostSnapshot { try requireOk( voxit_host_session_set_profile_override(handle, encode(promptProfileKind: profileKind)), @@ -321,6 +361,8 @@ public final class VoxitHostSession { hasFocusedContext: snapshot.has_focused_context != 0, selectedTextPresent: snapshot.selected_text_present != 0, hasRawTranscript: snapshot.has_raw_transcript != 0, + hasPass1CommittedTranscript: snapshot.has_pass1_committed_transcript != 0, + hasPass1DraftTranscript: snapshot.has_pass1_draft_transcript != 0, hasFinalOutput: snapshot.has_final_output != 0, hasError: snapshot.has_error != 0, recordingDurationMS: snapshot.recording_duration_ms, @@ -331,6 +373,8 @@ public final class VoxitHostSession { focusedElementRole: try copyString(field: VOXIT_HOST_STRING_FOCUSED_ELEMENT_ROLE), promptProfileID: try copyString(field: VOXIT_HOST_STRING_PROMPT_PROFILE_ID), promptDirective: try copyString(field: VOXIT_HOST_STRING_PROMPT_DIRECTIVE), + pass1CommittedTranscript: try copyString(field: VOXIT_HOST_STRING_PASS1_COMMITTED_TRANSCRIPT), + pass1DraftTranscript: try copyString(field: VOXIT_HOST_STRING_PASS1_DRAFT_TRANSCRIPT), rawTranscript: try copyString(field: VOXIT_HOST_STRING_RAW_TRANSCRIPT), finalOutput: try copyString(field: VOXIT_HOST_STRING_FINAL_OUTPUT), lastError: try copyString(field: VOXIT_HOST_STRING_LAST_ERROR), diff --git a/native/macos-host/Sources/VoxitNativeHostKit/App/VoxitNativeHostApp.swift b/native/macos-host/Sources/VoxitNativeHostKit/App/VoxitNativeHostApp.swift index c50795f..002f236 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/App/VoxitNativeHostApp.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/App/VoxitNativeHostApp.swift @@ -5,7 +5,9 @@ public struct VoxitNativeHostApp: App { @Environment(\.openWindow) private var openWindow @StateObject private var store = HostStore() 
@StateObject private var settingsStore = VoxitSettingsStore() + @StateObject private var hotkeyMonitor = GlobalHotkeyMonitor() @State private var settingsWindowController: VoxitSettingsWindowController? + @State private var recordingHUDWindowController: RecordingHUDWindowController? public init() {} @@ -16,6 +18,7 @@ public struct VoxitNativeHostApp: App { .task { VoxitArtwork.applyApplicationIcon() configureSettingsSync() + configureHotkeyMonitor() await store.reload() await store.savePreferences(settingsStore.settings) await store.setGlossary(UserDefaults.standard.string(forKey: "glossaryTerms") ?? "") @@ -36,10 +39,6 @@ public struct VoxitNativeHostApp: App { Button("Start Dictation") { startDictation() } - .keyboardShortcut( - settingsStore.settings.dictationHotkeyPresentation.swiftUIKeyEquivalent, - modifiers: settingsStore.settings.dictationHotkeyPresentation.swiftUIModifiers - ) Button("Stop Dictation") { Task { @@ -60,15 +59,6 @@ public struct VoxitNativeHostApp: App { } } - Window("Voxit Recording", id: "recording-hud") { - RecordingHUDView(store: store) - .task { - await store.reload() - } - } - .windowResizability(.contentSize) - .defaultPosition(.topTrailing) - MenuBarExtra { Button("Open Voxit") { openWindow(id: "main") @@ -79,10 +69,6 @@ public struct VoxitNativeHostApp: App { Button("Start Dictation") { startDictation() } - .keyboardShortcut( - settingsStore.settings.dictationHotkeyPresentation.swiftUIKeyEquivalent, - modifiers: settingsStore.settings.dictationHotkeyPresentation.swiftUIModifiers - ) Button("Stop Dictation") { Task { @@ -121,20 +107,79 @@ public struct VoxitNativeHostApp: App { @MainActor private func configureSettingsSync() { settingsStore.setSyncHandler { settings in - Task { + Task { @MainActor in await store.savePreferences(settings) + configureHotkeyMonitor() } } } + @MainActor + private func configureHotkeyMonitor() { + hotkeyMonitor.configure( + settings: settingsStore.settings, + keyDown: { + handleHotkeyDown() + }, + keyUp: { 
+ handleHotkeyUp() + } + ) + } + @MainActor private func startDictation() { - openWindow(id: "recording-hud") + presentRecordingHUD() Task { await store.startDictation() } } + @MainActor + private func handleHotkeyDown() { + presentRecordingHUD() + + if settingsStore.settings.hotkeyMode == .hold { + guard store.snapshot?.dictationState != .listening else { + return + } + + Task { + await store.startDictation() + } + } else if store.snapshot?.dictationState == .listening { + Task { + await store.stopDictation() + } + } else { + Task { + await store.startDictation() + } + } + } + + @MainActor + private func handleHotkeyUp() { + guard settingsStore.settings.hotkeyMode == .hold, + store.snapshot?.dictationState == .listening + else { + return + } + + Task { + await store.stopDictation() + } + } + + @MainActor + private func presentRecordingHUD() { + if recordingHUDWindowController == nil { + recordingHUDWindowController = RecordingHUDWindowController(store: store) + } + + recordingHUDWindowController?.present() + } + @MainActor private func presentSettings() { if settingsWindowController == nil { diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Stores/HostStore.swift b/native/macos-host/Sources/VoxitNativeHostKit/Stores/HostStore.swift index 8bcac9d..f3eae04 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Stores/HostStore.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/Stores/HostStore.swift @@ -7,9 +7,14 @@ public final class HostStore: ObservableObject { @Published public private(set) var errorMessage: String? private var session: VoxitHostSession? + private var pollingTask: Task<Void, Never>?
public init() {} + deinit { + pollingTask?.cancel() + } + public func reload() async { do { let session = try currentSession() @@ -35,12 +40,14 @@ public final class HostStore: ObservableObject { let session = try currentSession() snapshot = try session.startDictation() errorMessage = snapshot?.lastError + startRealtimePolling() } catch { errorMessage = String(describing: error) } } public func stopDictation() async { + pollingTask?.cancel() do { let session = try currentSession() snapshot = try session.stopDictation() @@ -70,6 +77,12 @@ public final class HostStore: ObservableObject { pasteAfterTranscription: settings.pasteAfterTranscription, rewriteAfterTranscription: settings.rewriteAfterTranscription ) + snapshot = try session.saveModelPreferences( + realtimeModel: settings.realtimeModel, + realtimeTranscriptionModel: settings.realtimeTranscriptionModel, + finalizeModel: settings.finalizeModel, + rewriteModel: settings.rewriteModel + ) errorMessage = snapshot?.lastError } catch { errorMessage = String(describing: error) @@ -110,4 +123,19 @@ public final class HostStore: ObservableObject { return session } + + private func startRealtimePolling() { + pollingTask?.cancel() + pollingTask = Task { [weak self] in + while Task.isCancelled == false { + try? 
await Task.sleep(nanoseconds: 250_000_000) + await self?.reload() + + let state = self?.snapshot?.dictationState + if state != .listening { + break + } + } + } + } } diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Stores/VoxitSettingsStore.swift b/native/macos-host/Sources/VoxitNativeHostKit/Stores/VoxitSettingsStore.swift index 12b22d9..c41412f 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Stores/VoxitSettingsStore.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/Stores/VoxitSettingsStore.swift @@ -16,6 +16,10 @@ final class VoxitSettingsStore: ObservableObject { static let rewriteAfterTranscription = "rewriteAfterTranscription" static let authRoute = "authRoute" static let audioInput = "audioInput" + static let realtimeModel = "realtimeModel" + static let realtimeTranscriptionModel = "realtimeTranscriptionModel" + static let finalizeModel = "finalizeModel" + static let rewriteModel = "rewriteModel" } private let defaults: UserDefaults @@ -43,7 +47,16 @@ final class VoxitSettingsStore: ObservableObject { ?? baseSettings.authRoute, audioInput: VoxitAudioInputPreference( rawValue: defaults.string(forKey: DefaultsKey.audioInput) ?? "") - ?? baseSettings.audioInput + ?? baseSettings.audioInput, + realtimeModel: defaults.string(forKey: DefaultsKey.realtimeModel) + ?? baseSettings.realtimeModel, + realtimeTranscriptionModel: defaults.string( + forKey: DefaultsKey.realtimeTranscriptionModel) + ?? baseSettings.realtimeTranscriptionModel, + finalizeModel: defaults.string(forKey: DefaultsKey.finalizeModel) + ?? baseSettings.finalizeModel, + rewriteModel: defaults.string(forKey: DefaultsKey.rewriteModel) + ?? 
baseSettings.rewriteModel ) self.settings = settings.sanitized() Self.persist(self.settings, into: defaults) @@ -71,6 +84,13 @@ final class VoxitSettingsStore: ObservableObject { defaults.set(settings.rewriteAfterTranscription, forKey: DefaultsKey.rewriteAfterTranscription) defaults.set(settings.authRoute.rawValue, forKey: DefaultsKey.authRoute) defaults.set(settings.audioInput.rawValue, forKey: DefaultsKey.audioInput) + defaults.set(settings.realtimeModel, forKey: DefaultsKey.realtimeModel) + defaults.set( + settings.realtimeTranscriptionModel, + forKey: DefaultsKey.realtimeTranscriptionModel + ) + defaults.set(settings.finalizeModel, forKey: DefaultsKey.finalizeModel) + defaults.set(settings.rewriteModel, forKey: DefaultsKey.rewriteModel) } } @@ -82,6 +102,10 @@ struct VoxitSettings: Equatable { var rewriteAfterTranscription: Bool var authRoute: VoxitAuthRoutePreference var audioInput: VoxitAudioInputPreference + var realtimeModel: String + var realtimeTranscriptionModel: String + var finalizeModel: String + var rewriteModel: String static var defaults: Self { Self( @@ -91,7 +115,11 @@ struct VoxitSettings: Equatable { pasteAfterTranscription: true, rewriteAfterTranscription: true, authRoute: .chatGPTDeviceCode, - audioInput: .systemDefault + audioInput: .systemDefault, + realtimeModel: "gpt-realtime-2", + realtimeTranscriptionModel: "gpt-4o-mini-transcribe", + finalizeModel: "gpt-4o-transcribe", + rewriteModel: "gpt-5.2-mini" ) } @@ -104,6 +132,22 @@ struct VoxitSettings: Equatable { copy.dictationHotkey = Self.dictationHotkeyPresentation(for: copy.dictationHotkey) .displayTitle + copy.realtimeModel = Self.sanitizedModelID( + copy.realtimeModel, + fallback: Self.defaults.realtimeModel + ) + copy.realtimeTranscriptionModel = Self.sanitizedModelID( + copy.realtimeTranscriptionModel, + fallback: Self.defaults.realtimeTranscriptionModel + ) + copy.finalizeModel = Self.sanitizedModelID( + copy.finalizeModel, + fallback: Self.defaults.finalizeModel + ) + 
copy.rewriteModel = Self.sanitizedModelID( + copy.rewriteModel, + fallback: Self.defaults.rewriteModel + ) return copy } @@ -192,6 +236,12 @@ struct VoxitSettings: Equatable { .map { $0.trimmingCharacters(in: .whitespacesAndNewlines) } .filter { $0.isEmpty == false } } + + private static func sanitizedModelID(_ raw: String, fallback: String) -> String { + let modelID = raw.trimmingCharacters(in: .whitespacesAndNewlines) + + return modelID.isEmpty ? fallback : modelID + } } struct VoxitHotkeyPresentation: Equatable { diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Support/GlobalHotkeyMonitor.swift b/native/macos-host/Sources/VoxitNativeHostKit/Support/GlobalHotkeyMonitor.swift new file mode 100644 index 0000000..b304aad --- /dev/null +++ b/native/macos-host/Sources/VoxitNativeHostKit/Support/GlobalHotkeyMonitor.swift @@ -0,0 +1,118 @@ +import AppKit + +@MainActor +final class GlobalHotkeyMonitor: ObservableObject { + private enum Phase: Sendable { + case down + case up + } + + private struct EventPayload: Sendable { + let characters: String + let modifierRawValue: UInt + let phase: Phase + } + + private static let relevantModifiers: NSEvent.ModifierFlags = [ + .command, .control, .option, .shift, + ] + + private var globalKeyDownMonitor: Any? + private var globalKeyUpMonitor: Any? + private var localKeyDownMonitor: Any? + private var localKeyUpMonitor: Any? + private var presentation = VoxitSettings.defaults.dictationHotkeyPresentation + private var hotkeyMode = VoxitHotkeyModePreference.toggle + private var isPressed = false + private var keyDownHandler: (() -> Void)? + private var keyUpHandler: (() -> Void)? 
+ + init() { + installMonitors() + } + + func configure( + settings: VoxitSettings, + keyDown: @escaping () -> Void, + keyUp: @escaping () -> Void + ) { + presentation = settings.dictationHotkeyPresentation + hotkeyMode = settings.hotkeyMode + keyDownHandler = keyDown + keyUpHandler = keyUp + } + + private func installMonitors() { + globalKeyDownMonitor = NSEvent.addGlobalMonitorForEvents(matching: .keyDown) { + [weak self] event in + Self.enqueue(event: event, phase: .down, target: self) + } + globalKeyUpMonitor = NSEvent.addGlobalMonitorForEvents(matching: .keyUp) { [weak self] event in + Self.enqueue(event: event, phase: .up, target: self) + } + localKeyDownMonitor = NSEvent.addLocalMonitorForEvents(matching: .keyDown) { + [weak self] event in + Self.enqueue(event: event, phase: .down, target: self) + return event + } + localKeyUpMonitor = NSEvent.addLocalMonitorForEvents(matching: .keyUp) { [weak self] event in + Self.enqueue(event: event, phase: .up, target: self) + return event + } + } + + private func handle(_ payload: EventPayload) { + guard matchesHotkey(payload) else { + return + } + + switch payload.phase { + case .down: + guard isPressed == false else { + return + } + isPressed = true + keyDownHandler?() + case .up: + guard isPressed else { + return + } + isPressed = false + if hotkeyMode == .hold { + keyUpHandler?() + } + } + } + + private func matchesHotkey(_ payload: EventPayload) -> Bool { + let modifiers = NSEvent.ModifierFlags(rawValue: payload.modifierRawValue) + .intersection(Self.relevantModifiers) + let expectedModifiers = presentation.modifierMask.intersection(Self.relevantModifiers) + + guard modifiers == expectedModifiers else { + return false + } + + return normalizedKey(payload.characters) == normalizedKey(presentation.keyEquivalent) + } + + private func normalizedKey(_ value: String) -> String { + if value == " " { + return "space" + } + + return value.lowercased() + } + + private static func enqueue(event: NSEvent, phase: Phase, target: 
GlobalHotkeyMonitor?) { + let payload = EventPayload( + characters: event.charactersIgnoringModifiers ?? "", + modifierRawValue: event.modifierFlags.rawValue, + phase: phase + ) + + Task { @MainActor in + target?.handle(payload) + } + } +} diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Support/Labels.swift b/native/macos-host/Sources/VoxitNativeHostKit/Support/Labels.swift index 0b5d148..4f18542 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Support/Labels.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/Support/Labels.swift @@ -1,3 +1,4 @@ +import Foundation import VoxitHostBridge extension AuthMethod { @@ -135,4 +136,20 @@ extension HostSnapshot { } return "No Runs" } + + var pass1TranscriptPreview: String? { + let committed = pass1CommittedTranscript?.trimmingCharacters(in: .whitespacesAndNewlines) ?? "" + let draft = pass1DraftTranscript?.trimmingCharacters(in: .whitespacesAndNewlines) ?? "" + + switch (committed.isEmpty, draft.isEmpty) { + case (false, false): + return "\(committed) \(draft)" + case (false, true): + return committed + case (true, false): + return draft + case (true, true): + return nil + } + } } diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Support/RecordingHUDWindowController.swift b/native/macos-host/Sources/VoxitNativeHostKit/Support/RecordingHUDWindowController.swift new file mode 100644 index 0000000..6cf753c --- /dev/null +++ b/native/macos-host/Sources/VoxitNativeHostKit/Support/RecordingHUDWindowController.swift @@ -0,0 +1,57 @@ +import AppKit +import SwiftUI + +@MainActor +final class RecordingHUDWindowController: NSWindowController, NSWindowDelegate { + private let store: HostStore + + init(store: HostStore) { + self.store = store + + let contentRect = NSRect(x: 0, y: 0, width: 380, height: 220) + let panel = NSPanel( + contentRect: contentRect, + styleMask: [.titled, .closable, .hudWindow, .nonactivatingPanel, .fullSizeContentView], + backing: .buffered, + defer: false + ) + panel.title = 
"Voxit Recording" + panel.titleVisibility = .hidden + panel.titlebarAppearsTransparent = true + panel.isReleasedWhenClosed = false + panel.hidesOnDeactivate = false + panel.level = .floating + panel.collectionBehavior = [.canJoinAllSpaces, .moveToActiveSpace, .transient] + + super.init(window: panel) + + panel.delegate = self + panel.contentViewController = NSHostingController(rootView: RecordingHUDView(store: store)) + } + + @available(*, unavailable) + required init?(coder: NSCoder) { + fatalError("init(coder:) has not been implemented") + } + + func present() { + guard let window else { + return + } + + positionNearTopTrailing(window) + showWindow(nil) + window.orderFrontRegardless() + } + + private func positionNearTopTrailing(_ window: NSWindow) { + let visibleFrame = NSScreen.main?.visibleFrame ?? NSRect(x: 0, y: 0, width: 1_280, height: 720) + let frame = window.frame + let origin = NSPoint( + x: visibleFrame.maxX - frame.width - 24, + y: visibleFrame.maxY - frame.height - 24 + ) + + window.setFrameOrigin(origin) + } +} diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Views/DetailView.swift b/native/macos-host/Sources/VoxitNativeHostKit/Views/DetailView.swift index 750003d..22a1d9b 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Views/DetailView.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/Views/DetailView.swift @@ -136,6 +136,9 @@ private struct ActivityDetail: View { if let rawTranscript = snapshot?.rawTranscript { TranscriptPreview(title: "Raw Transcript", text: rawTranscript) } + if let pass1Transcript = snapshot?.pass1TranscriptPreview { + TranscriptPreview(title: "Realtime Draft", text: pass1Transcript) + } } } diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Views/RecordingHUDView.swift b/native/macos-host/Sources/VoxitNativeHostKit/Views/RecordingHUDView.swift index de41c80..2f8578c 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Views/RecordingHUDView.swift +++ 
b/native/macos-host/Sources/VoxitNativeHostKit/Views/RecordingHUDView.swift @@ -55,6 +55,9 @@ struct RecordingHUDView: View { if let rawTranscript = store.snapshot?.rawTranscript { return rawTranscript } + if let pass1Transcript = store.snapshot?.pass1TranscriptPreview { + return pass1Transcript + } if let error = store.snapshot?.lastError { return error } diff --git a/native/macos-host/Sources/VoxitNativeHostKit/Views/VoxitSettingsView.swift b/native/macos-host/Sources/VoxitNativeHostKit/Views/VoxitSettingsView.swift index 11b8bed..d4a9462 100644 --- a/native/macos-host/Sources/VoxitNativeHostKit/Views/VoxitSettingsView.swift +++ b/native/macos-host/Sources/VoxitNativeHostKit/Views/VoxitSettingsView.swift @@ -3,8 +3,8 @@ import SwiftUI enum VoxitSettingsWindowMetrics { static let width: CGFloat = 620 - static let minHeight: CGFloat = 336 - static let idealHeight: CGFloat = 396 + static let minHeight: CGFloat = 420 + static let idealHeight: CGFloat = 520 static let cornerRadius: CGFloat = 18 } @@ -40,6 +40,10 @@ final class VoxitSettingsViewModel: ObservableObject { openPrivacySettings(query: "Privacy_Accessibility") } + func openInputMonitoringSettings() { + openPrivacySettings(query: "Privacy_ListenEvent") + } + private func openPrivacySettings(query: String) { let modernURLString = "x-apple.systempreferences:com.apple.settings.PrivacySecurity.extension?\(query)" @@ -88,6 +92,7 @@ struct VoxitSettingsView: View { private enum VoxitSettingsSection: String, CaseIterable, Identifiable { case general case dictation + case models case audio case permissions case about @@ -100,6 +105,8 @@ private enum VoxitSettingsSection: String, CaseIterable, Identifiable { return "General" case .dictation: return "Dictation" + case .models: + return "Models" case .audio: return "Audio" case .permissions: @@ -115,6 +122,8 @@ private enum VoxitSettingsSection: String, CaseIterable, Identifiable { return "Startup" case .dictation: return "Shortcut" + case .models: + return "OpenAI" 
case .audio: return "Input" case .permissions: @@ -130,6 +139,8 @@ private enum VoxitSettingsSection: String, CaseIterable, Identifiable { return "switch.2" case .dictation: return "waveform" + case .models: + return "cpu" case .audio: return "mic" case .permissions: @@ -141,7 +152,7 @@ private enum VoxitSettingsSection: String, CaseIterable, Identifiable { var allowsRestoreDefaults: Bool { switch self { - case .general, .dictation, .audio: + case .general, .dictation, .models, .audio: return true case .permissions, .about: return false @@ -259,6 +270,8 @@ private struct SettingsDashboard: View { GeneralSettingsPane(model: model) case .dictation: DictationSettingsPane(model: model) + case .models: + ModelsSettingsPane(model: model) case .audio: AudioSettingsPane(model: model) case .permissions: @@ -360,6 +373,130 @@ private struct DictationSettingsPane: View { } } +private struct ModelsSettingsPane: View { + @ObservedObject var model: VoxitSettingsViewModel + + var body: some View { + SettingsPanel { + ModelSettingRow( + title: "Realtime voice", + presets: ["gpt-realtime-2"], + modelID: modelBinding(\.realtimeModel) + ) + ModelSettingRow( + title: "Realtime text", + presets: ["gpt-4o-mini-transcribe", "gpt-4o-transcribe"], + modelID: modelBinding(\.realtimeTranscriptionModel) + ) + ModelSettingRow( + title: "Finalize", + presets: ["gpt-4o-transcribe", "gpt-4o-mini-transcribe"], + modelID: modelBinding(\.finalizeModel) + ) + ModelSettingRow( + title: "Rewrite", + presets: ["gpt-5.2-mini", "gpt-5.5", "gpt-5.4", "gpt-5.4-mini"], + modelID: modelBinding(\.rewriteModel) + ) + } + } + + private func modelBinding(_ keyPath: WritableKeyPath) -> Binding { + Binding( + get: { model.settings[keyPath: keyPath] }, + set: { value in + model.update { $0[keyPath: keyPath] = value } + } + ) + } +} + +private struct ModelSettingRow: View { + private static let customPresetTag = "__voxit_custom_model__" + + let title: String + let presets: [String] + @Binding var modelID: String + 
@State private var draftModelID: String + + init(title: String, presets: [String], modelID: Binding<String>) { + self.title = title + self.presets = presets + self._modelID = modelID + self._draftModelID = State(initialValue: modelID.wrappedValue) + } + + var body: some View { + VStack(alignment: .leading, spacing: 6) { + HStack(alignment: .firstTextBaseline, spacing: 8) { + Text(title) + .frame(width: 116, alignment: .leading) + Picker("", selection: presetBinding) { + ForEach(presets, id: \.self) { preset in + Text(preset).tag(preset) + } + Text("Custom").tag(Self.customPresetTag) + } + .labelsHidden() + .pickerStyle(.menu) + .frame(width: 210, alignment: .leading) + } + + HStack(spacing: 6) { + TextField("Model ID", text: $draftModelID) + .textFieldStyle(.roundedBorder) + .onSubmit(commitDraft) + Button("Apply", action: commitDraft) + .disabled(canApplyDraft == false) + } + .padding(.leading, 124) + } + .onChange(of: modelID) { _, newValue in + if draftModelID != newValue { + draftModelID = newValue + } + } + } + + private var presetBinding: Binding<String> { + Binding( + get: { + presets.contains(modelID) ? 
modelID : Self.customPresetTag + }, + set: { value in + guard value != Self.customPresetTag else { + return + } + draftModelID = value + modelID = value + } + ) + } + + private var canApplyDraft: Bool { + let sanitized = sanitizedDraftModelID + + return sanitized.isEmpty == false && sanitized != modelID + } + + private var sanitizedDraftModelID: String { + draftModelID.trimmingCharacters(in: .whitespacesAndNewlines) + } + + private func commitDraft() { + let sanitized = sanitizedDraftModelID + + guard sanitized.isEmpty == false else { + draftModelID = modelID + + return + } + + draftModelID = sanitized + modelID = sanitized + } +} + private struct AudioSettingsPane: View { @ObservedObject var model: VoxitSettingsViewModel @@ -405,6 +542,13 @@ private struct PermissionsSettingsPane: View { model.openAccessibilitySettings() } } + + HStack { + LabeledContent("Input Monitoring", value: "Shortcut") + Button("Open") { + model.openInputMonitoringSettings() + } + } } } } diff --git a/packages/voxit-audio/src/lib.rs b/packages/voxit-audio/src/lib.rs index d084ee3..93f2c29 100644 --- a/packages/voxit-audio/src/lib.rs +++ b/packages/voxit-audio/src/lib.rs @@ -100,6 +100,8 @@ impl Recorder { pub fn start_with_stream( stream_tx: Option>, selection: &InputDeviceSelection, + target_sample_rate_hz: u32, + target_channels: u16, ) -> Result { let use_voice_processing = selection.requested_device_id.is_none(); let io_type = @@ -121,11 +123,15 @@ impl Recorder { let input_format = audio_unit.input_stream_format().map_err(|err: Error| err.to_string())?; let _ = audio_unit.uninitialize(); - let (sample_rate, channels) = configure_input_format( - &mut audio_unit, - input_format.sample_rate, - input_format.channels, - )?; + let requested_sample_rate = if target_sample_rate_hz == 0 { + input_format.sample_rate + } else { + target_sample_rate_hz as f64 + }; + let requested_channels = + if target_channels == 0 { input_format.channels } else { u32::from(target_channels) }; + let (sample_rate, 
channels) = + configure_input_format(&mut audio_unit, requested_sample_rate, requested_channels)?; let recording = Arc::new(Mutex::new(Vec::::new())); let recording_cb = Arc::clone(&recording); let callback_tx = stream_tx.clone(); @@ -265,10 +271,13 @@ pub struct InputDeviceSelection { pub fn start_recording_with_stream( chunk_capacity: usize, preferred_device_id: Option, + target_sample_rate_hz: u32, + target_channels: u16, ) -> Result<(Recorder, AudioChunkReceiver, InputDeviceSelection), String> { let (tx, rx) = mpsc::sync_channel(chunk_capacity); let selection = resolve_input_device(preferred_device_id)?; - let recorder = Recorder::start_with_stream(Some(tx), &selection)?; + let recorder = + Recorder::start_with_stream(Some(tx), &selection, target_sample_rate_hz, target_channels)?; Ok((recorder, rx, selection)) } @@ -313,9 +322,13 @@ pub fn list_input_devices() -> Result, String> { pub fn start_recording_with_stream( _chnk_capacity: usize, _preferred_device_id: Option, + _target_sample_rate_hz: u32, + _target_channels: u16, ) -> Result<(Recorder, AudioChunkReceiver, InputDeviceSelection), String> { let _ = _chnk_capacity; let _ = _preferred_device_id; + let _ = _target_sample_rate_hz; + let _ = _target_channels; Err("recording is only supported on macOS in this build".to_string()) } diff --git a/packages/voxit-core/src/config.rs b/packages/voxit-core/src/config.rs index a191b4e..ff551b2 100644 --- a/packages/voxit-core/src/config.rs +++ b/packages/voxit-core/src/config.rs @@ -84,7 +84,7 @@ impl Default for OpenAiConfig { fn default() -> Self { Self { api_base_url: "https://api.openai.com/v1".to_string(), - realtime_model: "gpt-4o-mini-transcribe".to_string(), + realtime_model: "gpt-realtime-2".to_string(), finalize_model: "gpt-4o-transcribe".to_string(), rewrite_model: "gpt-5.2-mini".to_string(), language: "en".to_string(), @@ -98,10 +98,15 @@ impl Default for OpenAiConfig { pub struct OpenAiRealtimeConfig { /// Optional noise reduction profile. 
pub noise_reduction: String, + /// Input-audio transcription model used for realtime Pass1 transcript events. + pub transcription_model: String, } impl Default for OpenAiRealtimeConfig { fn default() -> Self { - Self { noise_reduction: "near_field".to_string() } + Self { + noise_reduction: "near_field".to_string(), + transcription_model: "gpt-4o-mini-transcribe".to_string(), + } } } @@ -410,6 +415,13 @@ fn apply_openai_config( config.openai.realtime.noise_reduction = v.to_string(); } }, + ([openai_section, realtime_section], "transcription_model") + if openai_section == "openai" && realtime_section == "realtime" => + { + if let Some(v) = value.str.clone() { + config.openai.realtime.transcription_model = v; + } + }, _ => return false, } @@ -535,8 +547,11 @@ fn serialize_toml(config: &Config) -> String { output.push_str(&format!("rewrite_model = \"{}\"\n", config.openai.rewrite_model)); output.push_str(&format!("language = \"{}\"\n\n", config.openai.language)); output.push_str("[openai.realtime]\n"); - output - .push_str(&format!("noise_reduction = \"{}\"\n\n", config.openai.realtime.noise_reduction)); + output.push_str(&format!("noise_reduction = \"{}\"\n", config.openai.realtime.noise_reduction)); + output.push_str(&format!( + "transcription_model = \"{}\"\n\n", + config.openai.realtime.transcription_model + )); output.push_str("[rewrite]\n"); output.push_str(&format!("enabled = {}\n", config.rewrite.enabled)); output.push_str(&format!("auto = {}\n", config.rewrite.auto)); @@ -576,13 +591,14 @@ mode = "hold" [openai] api_base_url = "https://api.openai.com/v1" -realtime_model = "gpt-4o-mini-transcribe" +realtime_model = "gpt-realtime-2" finalize_model = "gpt-4o-transcribe" rewrite_model = "gpt-5.2-mini" language = "en" [openai.realtime] noise_reduction = "near_field" +transcription_model = "gpt-4o-mini-transcribe" [rewrite] enabled = false @@ -605,6 +621,7 @@ method = "clipboard_cmd_v" assert_eq!(parsed.audio.input_device_name, "USB Mic"); 
assert_eq!(parsed.audio.input_device_id, 123); assert_eq!(parsed.openai.realtime.noise_reduction, "near_field"); + assert_eq!(parsed.openai.realtime.transcription_model, "gpt-4o-mini-transcribe"); } #[test] @@ -617,5 +634,6 @@ method = "clipboard_cmd_v" assert_eq!(parsed.paste.method, "clipboard_cmd_v"); assert_eq!(parsed.audio.input_device_id, 0); assert!(parsed.audio.input_device_name.is_empty()); + assert_eq!(parsed.openai.realtime_model, "gpt-realtime-2"); } } diff --git a/packages/voxit-core/src/realtime.rs b/packages/voxit-core/src/realtime.rs index 01c0a11..bb53e3a 100644 --- a/packages/voxit-core/src/realtime.rs +++ b/packages/voxit-core/src/realtime.rs @@ -27,18 +27,30 @@ pub const REALTIME_ENDPOINT: &str = "wss://api.openai.com/v1/realtime"; pub struct RealtimeSessionConfig { /// API model id. pub model: String, + /// Input-audio transcription model id. + pub transcription_model: String, + /// Input language hint for realtime transcription. + pub language: String, /// Input sample rate expected by OpenAI (`24000` by plan). pub sample_rate_hz: u32, /// `near_field` | `far_field` | `off`. pub noise_reduction: String, + /// Session instructions for contextual voice behavior. + pub instructions: String, + /// Realtime reasoning effort for models that support it. + pub reasoning_effort: String, } impl Default for RealtimeSessionConfig { /// Default session configuration for English pass1 streaming. 
fn default() -> Self { Self { - model: "gpt-4o-mini-transcribe".to_string(), + model: "gpt-realtime-2".to_string(), + transcription_model: "gpt-4o-mini-transcribe".to_string(), + language: "en".to_string(), sample_rate_hz: 24_000, noise_reduction: "near_field".to_string(), + instructions: "Transcribe the user's dictation as text for the target app.".to_string(), + reasoning_effort: "minimal".to_string(), } } } @@ -139,8 +151,13 @@ fn start_realtime_session_impl( event_tx: Sender, ) -> Result { let (stop_tx, stop_rx) = mpsc::channel::<()>(); + let worker_event_tx = event_tx.clone(); let worker = thread::spawn(move || { - let _ = run_realtime_worker(api_key, account_id, config, chunk_rx, event_tx, stop_rx); + if let Err(err) = + run_realtime_worker(api_key, account_id, config, chunk_rx, event_tx, stop_rx) + { + let _ = worker_event_tx.send(RealtimeEvent::StreamError(err.to_string())); + } }); Ok(RealtimeSession { stop_tx: Some(stop_tx), worker: Some(worker) }) @@ -159,6 +176,7 @@ fn run_realtime_worker( reason: format!("failed to create tokio runtime: {err}"), })?; let endpoint = format!("{REALTIME_ENDPOINT}?model={}", config.model); + let session_update = realtime_session_update(&config); rt.block_on(async move { let mut builder = Request::builder() @@ -174,22 +192,6 @@ fn run_realtime_worker( let request = builder.body(()).map_err(|err| RealtimeError::RuntimeError { reason: format!("invalid realtime request: {err}"), })?; - let session_update = serde_json::json!({ - "type": "session.update", - "session": { - "audio": { - "input": { - "format": { - "type": "audio/pcm", - "rate": config.sample_rate_hz, - }, - "noise_reduction": { "type": config.noise_reduction }, - "transcription": { "model": config.model }, - "turn_detection": { "type": "server_vad" }, - }, - }, - } - }); let (mut ws, _) = tokio_tungstenite::connect_async(request).await.map_err(|err| { RealtimeError::RuntimeError { reason: format!("realtime websocket connect failed: {err}"), @@ -271,6 +273,37 @@ fn 
run_realtime_worker( Ok(()) } +fn realtime_session_update(config: &RealtimeSessionConfig) -> Value { + serde_json::json!({ + "type": "session.update", + "session": { + "type": "realtime", + "instructions": config.instructions, + "output_modalities": ["text"], + "reasoning": { + "effort": config.reasoning_effort, + }, + "audio": { + "input": { + "format": { + "type": "audio/pcm", + "rate": config.sample_rate_hz, + }, + "noise_reduction": noise_reduction_payload(&config.noise_reduction), + "transcription": { + "model": config.transcription_model, + "language": config.language, + }, + "turn_detection": { + "type": "server_vad", + "create_response": false, + }, + }, + }, + } + }) +} + fn chunk_to_base64(samples: &[i16]) -> String { let mut bytes = Vec::with_capacity(samples.len() * 2); @@ -281,6 +314,10 @@ fn chunk_to_base64(samples: &[i16]) -> String { STANDARD.encode(bytes) } +fn noise_reduction_payload(profile: &str) -> Value { + if profile == "off" { Value::Null } else { serde_json::json!({ "type": profile }) } +} + fn parse_realtime_frame(body: &str) -> Result, RealtimeError> { let value: Value = serde_json::from_str(body).map_err(|err| RealtimeError::RuntimeError { reason: format!("invalid realtime frame json: {err}"), @@ -368,6 +405,14 @@ mod tests { let config = RealtimeSessionConfig::default(); assert!(config.model.contains("gpt")); + assert_eq!(config.transcription_model, "gpt-4o-mini-transcribe"); + assert_eq!(config.language, "en"); assert_eq!(config.sample_rate_hz, 24_000); + assert_eq!(config.reasoning_effort, "minimal"); + } + + #[test] + fn noise_reduction_off_maps_to_null() { + assert!(realtime::noise_reduction_payload("off").is_null()); } } diff --git a/packages/voxit-host-ffi/include/voxit_host_ffi.h b/packages/voxit-host-ffi/include/voxit_host_ffi.h index de2b6b8..dced076 100644 --- a/packages/voxit-host-ffi/include/voxit_host_ffi.h +++ b/packages/voxit-host-ffi/include/voxit_host_ffi.h @@ -7,7 +7,7 @@ extern "C" { #endif -#define 
VOXIT_HOST_FFI_ABI_VERSION 4u +#define VOXIT_HOST_FFI_ABI_VERSION 6u typedef struct VoxitHostSessionHandle VoxitHostSessionHandle; @@ -86,6 +86,8 @@ typedef enum VoxitHostStringField { VOXIT_HOST_STRING_RAW_TRANSCRIPT = 7, VOXIT_HOST_STRING_FINAL_OUTPUT = 8, VOXIT_HOST_STRING_LAST_ERROR = 9, + VOXIT_HOST_STRING_PASS1_COMMITTED_TRANSCRIPT = 10, + VOXIT_HOST_STRING_PASS1_DRAFT_TRANSCRIPT = 11, } VoxitHostStringField; typedef struct VoxitHostConfig { @@ -111,6 +113,8 @@ typedef struct VoxitHostSnapshot { uint8_t has_focused_context; uint8_t selected_text_present; uint8_t has_raw_transcript; + uint8_t has_pass1_committed_transcript; + uint8_t has_pass1_draft_transcript; uint8_t has_final_output; uint8_t has_error; uint64_t recording_duration_ms; @@ -132,6 +136,13 @@ enum VoxitStatus voxit_host_session_save_preferences( struct VoxitHostPreferences preferences, const char *hotkey_chord ); +enum VoxitStatus voxit_host_session_save_model_preferences( + VoxitHostSessionHandle *handle, + const char *realtime_model, + const char *realtime_transcription_model, + const char *finalize_model, + const char *rewrite_model +); enum VoxitStatus voxit_host_session_set_profile_override( VoxitHostSessionHandle *handle, enum VoxitPromptProfileKind profile_kind diff --git a/packages/voxit-host-ffi/src/lib.rs b/packages/voxit-host-ffi/src/lib.rs index 3ee054d..1bcf8a0 100644 --- a/packages/voxit-host-ffi/src/lib.rs +++ b/packages/voxit-host-ffi/src/lib.rs @@ -4,16 +4,19 @@ //! This gives the Swift host a stable Rust-owned model without moving audio, auth, or //! inference orchestration across FFI before those boundaries are ready. 
+#[cfg(target_os = "macos")] use std::sync::mpsc; use std::{ ffi::{CStr, c_char}, ptr::{self, NonNull}, + sync::mpsc::{Receiver, TryRecvError}, }; #[cfg(target_os = "macos")] use voxit_audio::Recorder; +#[cfg(target_os = "macos")] use voxit_core::RealtimeSessionConfig; #[cfg(target_os = "macos")] use voxit_core::RewriteSettings; use voxit_core::{ self, Config, ContextualVoiceRouter, FocusedAppContext, NativeHostSnapshot, PlatformHost, - VoiceSessionPlan, + RealtimeEvent, RealtimeSession, TranscriptAssembler, VoiceSessionPlan, contextual::{ PromptProfileKind, VoiceInteractionTier, VoiceOutputPolicy, VoiceReasoningEffort, }, @@ -22,7 +25,7 @@ use voxit_core::{ #[cfg(target_os = "macos")] use voxit_macos::TargetApp; /// ABI version exported by the thin C host bridge. -pub const VOXIT_HOST_FFI_ABI_VERSION: u32 = 4; +pub const VOXIT_HOST_FFI_ABI_VERSION: u32 = 6; /// Opaque session handle owned by the native host through the C ABI. pub struct VoxitHostSessionHandle { @@ -32,10 +35,15 @@ pub struct VoxitHostSessionHandle { profile_override: Option, voice_plan: VoiceSessionPlan, glossary_terms: String, + transcript_assembler: TranscriptAssembler, + pass1_committed_transcript: String, + pass1_draft_transcript: String, last_raw_transcript: String, last_final_output: String, last_error: String, recording_duration_ms: u64, + realtime_session: Option, + realtime_event_rx: Option>, #[cfg(target_os = "macos")] recorder: Option, #[cfg(target_os = "macos")] @@ -194,6 +202,10 @@ pub enum VoxitHostStringField { FinalOutput = 8, /// Latest user-actionable error. LastError = 9, + /// Latest committed realtime Pass1 transcript. + Pass1CommittedTranscript = 10, + /// Latest in-flight realtime Pass1 draft transcript. + Pass1DraftTranscript = 11, } /// FFI-safe session configuration. @@ -244,6 +256,10 @@ pub struct VoxitHostSnapshot { pub selected_text_present: u8, /// Non-zero when a raw Pass2 transcript is available. 
pub has_raw_transcript: u8, + /// Non-zero when realtime Pass1 committed transcript text is available. + pub has_pass1_committed_transcript: u8, + /// Non-zero when realtime Pass1 draft transcript text is available. + pub has_pass1_draft_transcript: u8, /// Non-zero when a final output is available. pub has_final_output: u8, /// Non-zero when the last command failed or produced a warning. @@ -273,6 +289,8 @@ impl Default for VoxitHostSnapshot { has_focused_context: 0, selected_text_present: 0, has_raw_transcript: 0, + has_pass1_committed_transcript: 0, + has_pass1_draft_transcript: 0, has_final_output: 0, has_error: 0, recording_duration_ms: 0, @@ -311,10 +329,15 @@ pub extern "C" fn voxit_host_session_create( profile_override: None, voice_plan, glossary_terms: String::new(), + transcript_assembler: TranscriptAssembler::new(), + pass1_committed_transcript: String::new(), + pass1_draft_transcript: String::new(), last_raw_transcript: String::new(), last_final_output: String::new(), last_error: String::new(), recording_duration_ms: 0, + realtime_session: None, + realtime_event_rx: None, #[cfg(target_os = "macos")] recorder: None, #[cfg(target_os = "macos")] @@ -331,7 +354,9 @@ pub extern "C" fn voxit_host_session_create( #[unsafe(no_mangle)] pub unsafe extern "C" fn voxit_host_session_destroy(handle: *mut VoxitHostSessionHandle) { if let Some(handle) = NonNull::new(handle) { - unsafe { drop(Box::from_raw(handle.as_ptr())) }; + let mut handle = unsafe { Box::from_raw(handle.as_ptr()) }; + + stop_realtime_preview(&mut handle); } } @@ -346,13 +371,16 @@ pub unsafe extern "C" fn voxit_host_session_copy_snapshot( handle: *mut VoxitHostSessionHandle, out: *mut VoxitHostSnapshot, ) -> VoxitStatus { - let Some(handle) = NonNull::new(handle) else { + let Some(mut handle) = NonNull::new(handle) else { return VoxitStatus::NullHandle; }; let Some(out) = NonNull::new(out) else { return VoxitStatus::NullOutput; }; - let handle_ref = unsafe { handle.as_ref() }; + let handle_ref = 
unsafe { handle.as_mut() }; + + drain_realtime_events(handle_ref); + let snapshot = &handle_ref.snapshot; let focused_context = &handle_ref.focused_context; let voice_plan = &handle_ref.voice_plan; @@ -469,6 +497,54 @@ pub unsafe extern "C" fn voxit_host_session_save_preferences( save_preferences(handle, preferences, hotkey_chord) } +/// Saves OpenAI model preferences through the Rust-owned config file. +/// +/// # Safety +/// +/// `handle` must be a valid pointer returned by [`voxit_host_session_create`]. Model +/// pointers must point to null-terminated UTF-8 strings. +#[unsafe(no_mangle)] +pub unsafe extern "C" fn voxit_host_session_save_model_preferences( + handle: *mut VoxitHostSessionHandle, + realtime_model: *const c_char, + realtime_transcription_model: *const c_char, + finalize_model: *const c_char, + rewrite_model: *const c_char, +) -> VoxitStatus { + let Some(mut handle) = NonNull::new(handle) else { + return VoxitStatus::NullHandle; + }; + let handle = unsafe { handle.as_mut() }; + let realtime_model = match read_required_c_string(handle, realtime_model, "realtime model") { + Ok(value) => value, + Err(status) => return status, + }; + let realtime_transcription_model = match read_required_c_string( + handle, + realtime_transcription_model, + "realtime transcription model", + ) { + Ok(value) => value, + Err(status) => return status, + }; + let finalize_model = match read_required_c_string(handle, finalize_model, "finalize model") { + Ok(value) => value, + Err(status) => return status, + }; + let rewrite_model = match read_required_c_string(handle, rewrite_model, "rewrite model") { + Ok(value) => value, + Err(status) => return status, + }; + + save_model_preferences( + handle, + realtime_model, + realtime_transcription_model, + finalize_model, + rewrite_model, + ) +} + /// Sets a manual prompt-profile override for the current host session. 
/// /// # Safety @@ -555,7 +631,7 @@ pub unsafe extern "C" fn voxit_host_session_copy_string( out: *mut c_char, out_len: usize, ) -> VoxitStatus { - let Some(handle) = NonNull::new(handle) else { + let Some(mut handle) = NonNull::new(handle) else { return VoxitStatus::NullHandle; }; let Some(out) = NonNull::new(out) else { @@ -566,7 +642,10 @@ pub unsafe extern "C" fn voxit_host_session_copy_string( return VoxitStatus::InvalidInput; } - let handle = unsafe { handle.as_ref() }; + let handle = unsafe { handle.as_mut() }; + + drain_realtime_events(handle); + let value = string_field_value(handle, field); write_c_string(out, out_len, value) @@ -588,8 +667,31 @@ fn start_dictation(handle: &mut VoxitHostSessionHandle) -> VoxitStatus { let preferred_device_id = (handle.config.audio.input_device_id != 0) .then_some(handle.config.audio.input_device_id); - match voxit_audio::start_recording_with_stream(64, preferred_device_id) { - Ok((recorder, _chunk_rx, selection)) => { + match voxit_audio::start_recording_with_stream( + 64, + preferred_device_id, + handle.config.audio.realtime_target_rate_hz, + 1, + ) { + Ok((recorder, chunk_rx, selection)) => { + let (event_tx, event_rx) = mpsc::channel(); + + match voxit_core::start_realtime_session( + realtime_session_config(handle), + chunk_rx, + event_tx, + ) { + Ok(session) => { + handle.realtime_session = Some(session); + handle.realtime_event_rx = Some(event_rx); + }, + Err(err) => { + handle.realtime_session = None; + handle.realtime_event_rx = None; + handle.last_error = format!("realtime preview unavailable: {err}"); + }, + } + handle.recorder = Some(recorder); handle.snapshot.dictation_state = DictationSurfaceState::Listening; handle.recording_duration_ms = 0; @@ -636,6 +738,7 @@ fn stop_dictation(handle: &mut VoxitHostSessionHandle) -> VoxitStatus { Err(err) => { handle.snapshot.dictation_state = DictationSurfaceState::Done; + stop_realtime_preview(handle); set_error(handle, format!("failed to stop recording: {err}")); return 
VoxitStatus::Ok; @@ -644,16 +747,28 @@ fn stop_dictation(handle: &mut VoxitHostSessionHandle) -> VoxitStatus { handle.recording_duration_ms = recording.duration_ms; + stop_realtime_preview(handle); + drain_realtime_events(handle); + let (raw_transcript, _) = match voxit_core::transcribe_only(&recording.wav, &handle.config.openai.finalize_model) { Ok(result) => result, Err(err) => { - handle.snapshot.dictation_state = DictationSurfaceState::Done; + let fallback = realtime_transcript_text(handle); + + if fallback.is_empty() { + handle.snapshot.dictation_state = DictationSurfaceState::Done; - set_error(handle, format!("transcription failed: {err}")); + set_error(handle, format!("transcription failed: {err}")); - return VoxitStatus::Ok; + return VoxitStatus::Ok; + } + + handle.last_error = + format!("transcription failed; using realtime transcript: {err}"); + + (fallback, 0) }, }; @@ -767,7 +882,33 @@ fn save_preferences( VoxitStatus::Ok } +fn save_model_preferences( + handle: &mut VoxitHostSessionHandle, + realtime_model: String, + realtime_transcription_model: String, + finalize_model: String, + rewrite_model: String, +) -> VoxitStatus { + handle.config.openai.realtime_model = realtime_model; + handle.config.openai.realtime.transcription_model = realtime_transcription_model; + handle.config.openai.finalize_model = finalize_model; + handle.config.openai.rewrite_model = rewrite_model; + + if let Err(err) = handle.config.save() { + set_error(handle, format!("failed to save config: {err}")); + } else { + handle.last_error.clear(); + } + + VoxitStatus::Ok +} + fn clear_run_output(handle: &mut VoxitHostSessionHandle) { + stop_realtime_preview(handle); + + handle.transcript_assembler.reset(); + handle.pass1_committed_transcript.clear(); + handle.pass1_draft_transcript.clear(); handle.last_raw_transcript.clear(); handle.last_final_output.clear(); handle.last_error.clear(); @@ -775,10 +916,131 @@ fn clear_run_output(handle: &mut VoxitHostSessionHandle) { 
handle.recording_duration_ms = 0; } +#[cfg(target_os = "macos")] +fn realtime_session_config(handle: &VoxitHostSessionHandle) -> RealtimeSessionConfig { + RealtimeSessionConfig { + model: handle.config.openai.realtime_model.clone(), + transcription_model: handle.config.openai.realtime.transcription_model.clone(), + language: handle.config.openai.language.clone(), + sample_rate_hz: handle.config.audio.realtime_target_rate_hz, + noise_reduction: handle.config.openai.realtime.noise_reduction.clone(), + instructions: realtime_session_instructions(handle), + reasoning_effort: reasoning_effort_value(handle.voice_plan.reasoning_effort).to_string(), + } +} + +#[cfg(target_os = "macos")] +fn realtime_session_instructions(handle: &VoxitHostSessionHandle) -> String { + format!( + "You are Voxit, a contextual voice input layer. Listen to the user's dictation for the focused target app and keep any response text suitable for insertion or preview.\n\ + Active profile: {profile_title} ({profile_id}).\n\ + Profile direction: {prompt_directive}\n\ + Output policy: {output_policy}.\n\ + Do not claim that app actions or shell commands have already run.", + profile_title = handle.voice_plan.profile_title, + profile_id = handle.voice_plan.profile_id, + prompt_directive = handle.voice_plan.prompt_directive, + output_policy = output_policy_value(handle.voice_plan.output_policy), + ) +} + +#[cfg(target_os = "macos")] +fn reasoning_effort_value(effort: VoiceReasoningEffort) -> &'static str { + match effort { + VoiceReasoningEffort::Minimal => "minimal", + VoiceReasoningEffort::Low => "low", + VoiceReasoningEffort::Medium => "medium", + VoiceReasoningEffort::High => "high", + } +} + +#[cfg(target_os = "macos")] +fn output_policy_value(policy: VoiceOutputPolicy) -> &'static str { + match policy { + VoiceOutputPolicy::InsertText => "insert_text", + VoiceOutputPolicy::PreviewBeforeInsert => "preview_before_insert", + VoiceOutputPolicy::ConfirmBeforeAction => "confirm_before_action", + } +} + 
+fn stop_realtime_preview(handle: &mut VoxitHostSessionHandle) { + if let Some(session) = handle.realtime_session.take() { + session.stop(); + } +} + +fn drain_realtime_events(handle: &mut VoxitHostSessionHandle) { + loop { + let event = match handle.realtime_event_rx.as_ref().map(Receiver::try_recv) { + Some(Ok(event)) => event, + Some(Err(TryRecvError::Empty)) | None => break, + Some(Err(TryRecvError::Disconnected)) => { + handle.realtime_event_rx = None; + + break; + }, + }; + + match event { + RealtimeEvent::Draft(event) | RealtimeEvent::Committed(event) => { + handle.transcript_assembler.apply(event); + }, + RealtimeEvent::StreamError(reason) => { + handle.last_error = reason; + }, + } + } + + let transcript = handle.transcript_assembler.snapshot(); + + handle.pass1_committed_transcript = transcript.committed; + handle.pass1_draft_transcript = transcript.draft; +} + +#[cfg(target_os = "macos")] +fn realtime_transcript_text(handle: &VoxitHostSessionHandle) -> String { + let committed = handle.pass1_committed_transcript.trim(); + let draft = handle.pass1_draft_transcript.trim(); + + match (committed.is_empty(), draft.is_empty()) { + (false, false) => format!("{committed} {draft}"), + (false, true) => committed.to_string(), + (true, false) => draft.to_string(), + (true, true) => String::new(), + } +} + fn set_error(handle: &mut VoxitHostSessionHandle, message: impl Into<String>) { handle.last_error = message.into(); } +fn read_required_c_string( + handle: &mut VoxitHostSessionHandle, + value: *const c_char, + label: &str, +) -> Result<String, VoxitStatus> { + let Some(value) = NonNull::new(value.cast_mut()) else { + set_error(handle, format!("{label} is missing")); + + return Err(VoxitStatus::InvalidInput); + }; + let value = unsafe { CStr::from_ptr(value.as_ptr()) }; + let Ok(value) = value.to_str() else { + set_error(handle, format!("{label} is not valid UTF-8")); + + return Err(VoxitStatus::Ok); + }; + let value = value.trim(); + + if value.is_empty() { + set_error(handle, 
format!("{label} cannot be empty")); + + return Err(VoxitStatus::Ok); + } + + Ok(value.to_string()) +} + #[cfg(target_os = "macos")] fn rewrite_settings(handle: &VoxitHostSessionHandle) -> RewriteSettings { RewriteSettings { @@ -815,6 +1077,8 @@ fn encode_snapshot( has_focused_context: 0, selected_text_present: 0, has_raw_transcript: 0, + has_pass1_committed_transcript: 0, + has_pass1_draft_transcript: 0, has_final_output: 0, has_error: 0, recording_duration_ms: 0, @@ -836,6 +1100,9 @@ fn encode_snapshot_with_context( encoded.has_focused_context = u8::from(!focused_context.is_empty()); encoded.selected_text_present = u8::from(focused_context.selected_text_present); encoded.has_raw_transcript = u8::from(!handle.last_raw_transcript.is_empty()); + encoded.has_pass1_committed_transcript = + u8::from(!handle.pass1_committed_transcript.is_empty()); + encoded.has_pass1_draft_transcript = u8::from(!handle.pass1_draft_transcript.is_empty()); encoded.has_final_output = u8::from(!handle.last_final_output.is_empty()); encoded.has_error = u8::from(!handle.last_error.is_empty()); encoded.recording_duration_ms = handle.recording_duration_ms; @@ -946,6 +1213,8 @@ fn string_field_value(handle: &VoxitHostSessionHandle, field: VoxitHostStringFie VoxitHostStringField::RawTranscript => &handle.last_raw_transcript, VoxitHostStringField::FinalOutput => &handle.last_final_output, VoxitHostStringField::LastError => &handle.last_error, + VoxitHostStringField::Pass1CommittedTranscript => &handle.pass1_committed_transcript, + VoxitHostStringField::Pass1DraftTranscript => &handle.pass1_draft_transcript, } } @@ -1020,6 +1289,8 @@ mod tests { assert_eq!(snapshot.rewrite_enabled, 1); assert_eq!(snapshot.has_focused_context, 0); assert_eq!(snapshot.selected_text_present, 0); + assert_eq!(snapshot.has_pass1_committed_transcript, 0); + assert_eq!(snapshot.has_pass1_draft_transcript, 0); assert_eq!(snapshot.prompt_profile_kind, VoxitPromptProfileKind::FastDictation); assert_eq!(snapshot.voice_tier, 
VoxitVoiceInteractionTier::FastDictation); assert_eq!(snapshot.reasoning_effort, VoxitVoiceReasoningEffort::Minimal);