AI voice agent framework for dTelecom rooms. Build real-time voice agents that join WebRTC rooms and interact with participants using speech-to-text, LLMs, and text-to-speech.
The audio pipeline:

```
Participant mic -> SFU -> server-sdk-node (Opus decode) -> PCM16 16kHz
  -> STT plugin -> transcription
  -> LLM plugin -> streaming text response
  -> Sentence splitter -> TTS plugin -> PCM16 16kHz
  -> AudioSource (upsample 48kHz + Opus encode) -> SFU -> Participants
```
The pipeline uses a producer/consumer pattern: LLM tokens are split into sentences and queued, while a consumer synthesizes and plays audio concurrently. This minimizes time-to-first-audio.
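A minimal sketch of that pattern, with stand-in `synthesize` and `play` callbacks in place of the TTS plugin and `AudioSource` (illustrative, not SDK code):

```typescript
// Producer/consumer sketch: the producer splits streamed LLM tokens into
// sentences and queues them; the consumer synthesizes and plays each sentence
// while later sentences are still being generated.
async function speakResponse(
  tokens: AsyncIterable<string>,                            // streaming LLM output
  synthesize: (sentence: string) => AsyncIterable<Buffer>,  // TTS -> PCM16 chunks
  play: (pcm16: Buffer) => Promise<void>,                   // write to audio output
): Promise<void> {
  const queue: string[] = [];
  let producerDone = false;

  // Producer: accumulate tokens and enqueue complete sentences.
  const producer = (async () => {
    let buffer = '';
    for await (const token of tokens) {
      buffer += token;
      let match: RegExpMatchArray | null;
      while ((match = buffer.match(/^(.+?[.!?])\s+(.*)$/s)) !== null) {
        queue.push(match[1]);
        buffer = match[2];
      }
    }
    if (buffer.trim()) queue.push(buffer.trim());
    producerDone = true;
  })();

  // Consumer: synthesize and play queued sentences as soon as they arrive,
  // which minimizes time-to-first-audio.
  while (!producerDone || queue.length > 0) {
    const sentence = queue.shift();
    if (sentence === undefined) {
      await new Promise((resolve) => setTimeout(resolve, 10)); // wait for more
      continue;
    }
    for await (const chunk of synthesize(sentence)) {
      await play(chunk);
    }
  }
  await producer;
}
```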
Install the packages:

```bash
npm install @dtelecom/agents @dtelecom/server-sdk-js @dtelecom/server-sdk-node
```

Then create and start an agent:

```typescript
import { VoiceAgent, setLogLevel } from '@dtelecom/agents';
import { DeepgramSTT, OpenRouterLLM, CartesiaTTS } from '@dtelecom/agents/providers';
const agent = new VoiceAgent({
stt: new DeepgramSTT({ apiKey: process.env.DEEPGRAM_API_KEY! }),
llm: new OpenRouterLLM({
apiKey: process.env.OPENROUTER_API_KEY!,
model: 'openai/gpt-4o',
}),
tts: new CartesiaTTS({
apiKey: process.env.CARTESIA_API_KEY!,
voiceId: 'your-voice-id',
}),
instructions: 'You are a helpful voice assistant.',
});
await agent.start({
room: 'my-room',
apiKey: process.env.DTELECOM_API_KEY!,
apiSecret: process.env.DTELECOM_API_SECRET!,
});
```

Each provider implements a plugin interface:

```typescript
interface STTPlugin {
createStream(options?: STTStreamOptions): STTStream;
}
interface STTStream {
sendAudio(pcm16: Buffer): void;
on(event: 'transcription', cb: (result: TranscriptionResult) => void): this;
on(event: 'error', cb: (error: Error) => void): this;
close(): Promise<void>;
}
```

```typescript
interface LLMPlugin {
chat(messages: Message[], signal?: AbortSignal): AsyncGenerator<LLMChunk>;
warmup?(systemPrompt: string): Promise<void>;
}
```

```typescript
interface TTSPlugin {
synthesize(text: string, signal?: AbortSignal): AsyncGenerator<Buffer>;
warmup?(): Promise<void>;
}
```

`VoiceAgent` options:

| Option | Type | Default | Description |
|---|---|---|---|
| `stt` | `STTPlugin` | required | Speech-to-text provider |
| `llm` | `LLMPlugin` | required | Language model provider |
| `tts` | `TTSPlugin` | `undefined` | Text-to-speech provider (text-only if omitted) |
| `instructions` | `string` | required | System prompt for the LLM |
| `respondMode` | `'always' \| 'addressed'` | `'always'` | When to respond to speech |
| `agentName` | `string` | `'assistant'` | Name for addressed-mode detection |
| `nameVariants` | `string[]` | `[]` | Additional names to respond to |
| `onDataMessage` | `DataMessageHandler` | `undefined` | Callback for data channel messages |
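For example, a sketch of an agent that only responds when addressed by name (reusing the providers from the quick start; the name variants are placeholders):

```typescript
const agent = new VoiceAgent({
  stt: new DeepgramSTT({ apiKey: process.env.DEEPGRAM_API_KEY! }),
  llm: new OpenRouterLLM({ apiKey: process.env.OPENROUTER_API_KEY!, model: 'openai/gpt-4o' }),
  tts: new CartesiaTTS({ apiKey: process.env.CARTESIA_API_KEY!, voiceId: 'your-voice-id' }),
  instructions: 'You are a helpful voice assistant.',
  respondMode: 'addressed',                        // only reply when called by name
  agentName: 'assistant',
  nameVariants: ['hey assistant', 'ok assistant'], // placeholder variants
});
```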
`VoiceAgent` emits the following events:

| Event | Payload | Description |
|---|---|---|
| `transcription` | `{ text, isFinal, speaker }` | STT transcription result |
| `response` | `string` | Full agent response text |
| `speaking` | `boolean` | Agent started/stopped speaking |
| `error` | `Error` | Pipeline error |
| `connected` | — | Agent connected to room |
| `disconnected` | `string?` | Agent disconnected |
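For example, a sketch assuming `VoiceAgent` exposes a Node-style `on()` listener for these events:

```typescript
// Log final transcriptions with the speaker's identity.
agent.on('transcription', ({ text, isFinal, speaker }) => {
  if (isFinal) console.log(`${speaker}: ${text}`);
});

// Track when the agent starts and stops speaking.
agent.on('speaking', (isSpeaking: boolean) => {
  console.log(isSpeaking ? 'Agent started speaking' : 'Agent stopped speaking');
});

// Surface pipeline errors.
agent.on('error', (err: Error) => {
  console.error('Pipeline error:', err);
});
```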
Implement the plugin interface and pass it to `VoiceAgent`:

```typescript
import { BaseSTTStream, type STTPlugin, type STTStreamOptions } from '@dtelecom/agents';
class MySTTStream extends BaseSTTStream {
sendAudio(pcm16: Buffer): void {
    // Send PCM16 16kHz audio to your STT service and emit 'transcription'
    // events as results arrive (BaseSTTStream is assumed to provide the emitter)
}
async close(): Promise<void> {
// Clean up
}
}
class MySTT implements STTPlugin {
createStream(options?: STTStreamOptions) {
return new MySTTStream();
}
}
const agent = new VoiceAgent({
stt: new MySTT(),
// ...
});
```

Receive data channel messages from participants:

```typescript
const agent = new VoiceAgent({
// ...
onDataMessage: (payload, participantIdentity, topic) => {
const message = JSON.parse(new TextDecoder().decode(payload));
console.log(`${participantIdentity} sent:`, message);
},
});
```

Licensed under Apache-2.0.