Agentic AI Telephony Solution
Production telephony solution using Deepgram, ElevenLabs, OpenAI GPT Realtime, and Plivo with intelligent interruption detection.
About the Project
Built a production-ready telephony solution integrating Deepgram for real-time speech-to-text, ElevenLabs for natural text-to-speech, OpenAI GPT Realtime API for conversational AI, and Plivo for telephony infrastructure. Implemented intelligent interruption detection that filters acknowledgment words and uses intent severity levels to enable natural conversation flow with minimal latency.
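At runtime the four services are wired together as an event-driven pipeline. Below is a minimal sketch of that wiring, using illustrative class names; the event names ('transcription', 'gptreply', 'speech') and methods (sendMessage, generate, buffer) mirror the snippets in the sections that follow, but the actual module structure may differ:

// Illustrative event wiring - caller audio flows left to right
const transcription = new TranscriptionService(); // Deepgram wrapper
const gpt = new GptService();                     // GPT Realtime wrapper
const tts = new TextToSpeechService();            // ElevenLabs wrapper
const stream = new StreamService(ws);             // Plivo media stream wrapper

transcription.on('transcription', (text) => gpt.sendMessage(text));
gpt.on('gptreply', (sentence, itemId) => tts.generate(sentence));
tts.on('speech', (audio, duration) => stream.buffer(null, audio));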
Plivo Telephony Platform
Plivo serves as the telephony infrastructure, handling call initiation, WebSocket media streaming, and call management:
// Plivo Client Setup
const plivo = require('plivo');
const client = new plivo.Client(PLIVO_AUTH_ID, PLIVO_AUTH_TOKEN);
// Initiate Call
async initiateCall(to, answerUrl, hangupUrl, from) {
const response = await this.client.calls.create(
from,
to,
answerUrl,
{ answer_method: 'POST', hangup_url: hangupUrl }
);
return response.requestUuid;
}
// WebSocket Connection for Media Streaming
app.ws('/answer', (ws, req) => {
const handler = new PlivoStreamHandler(ws, req);
// Handles bidirectional audio streaming
});
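Plivo connects the call to this WebSocket endpoint via XML returned from the answer URL. A hedged sketch of such a webhook, assuming an Express route and a placeholder wss://your-host/answer stream URL; the exact Stream attributes should be checked against Plivo's XML reference:

// Answer webhook - instructs Plivo to open a bidirectional media stream
app.post('/answer-xml', (req, res) => {
  res.type('application/xml').send(
    `<Response>
       <Stream bidirectional="true" keepCallAlive="true"
               contentType="audio/x-mulaw;rate=8000">
         wss://your-host/answer
       </Stream>
     </Response>`
  );
});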
// Handle Media Stream Events
handleMessage(data) {
const msg = JSON.parse(data);
if (msg.event === 'media') {
// Send audio chunks to Deepgram
this.transcriptionService.send(
Buffer.from(msg.media.payload, 'base64')
);
}
}
Deepgram Speech-to-Text Integration
Deepgram provides real-time transcription with interim results for low-latency interruption detection:
// Deepgram Live Transcription Setup
const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
this.connection = this.deepgram.listen.live({
model: "nova-2",
encoding: "mulaw", // μ-law encoding
sample_rate: 8000, // 8kHz sample rate
interim_results: true, // Real-time partial results
language: "hi", // Hindi language support
punctuate: true,
endpointing: 300, // Endpoint detection
utterance_end_ms: 1500, // Utterance end timeout
filler_words: true // Detect filler words
});
// Handle Transcription Events
this.connection.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel?.alternatives[0]?.transcript;
  // Interim results for interruption detection
  if (!event.is_final) {
    this.emit('interim-transcription', text, event.start);
  }
  // Accumulate finalized segments until the utterance is complete
  if (event.is_final) {
    this.finalResult += ` ${text}`;
  }
  // Emit the full result once speech is complete
  if (event.is_final && event.speech_final) {
    this.emit('transcription', this.finalResult.trim(), event.start);
    this.finalResult = '';
  }
});
// Send Audio to Deepgram
send(audioData) {
this.connection.send(audioData);
}
Intelligent Interruption Detection
The system implements sophisticated interruption detection that filters out acknowledgment words and handles interruptions based on intent severity:
// Interruption Detection Logic
function isNotInterrupt(text, isTTSPlaying, intentSeverity) {
if (!text || typeof text !== 'string') return true;
// HIGH severity intents: Never interrupt during TTS
if (intentSeverity === "HIGH" && isTTSPlaying) return true;
// Non-interrupt keywords (acknowledgments)
const notInterruptKeywords = [
// English
'hmm', 'ok', 'okay', 'right', 'go on', 'uh-huh', 'hello',
// Hindi/Hinglish
'bataiye', 'kahiye', 'accha', 'achha', 'han', 'haan',
'haanji', 'ha', 'huh', 'अच्छा', 'बोलिए', 'हाँ', 'हां'
];
const lowerCaseText = text.trim().replace(/[.,!?]/g, '').toLowerCase();
const words = lowerCaseText.split(/\s+/).filter(Boolean);
// All words are non-interrupt keywords
const allNonInterrupt = words.every(word =>
notInterruptKeywords.includes(word)
);
if (allNonInterrupt && isTTSPlaying) return true;
// Single word or ≤2 words during MEDIUM severity
if (words.length <= 2 && isTTSPlaying && intentSeverity === "MEDIUM") {
return true;
}
// Single word during any TTS
if (words.length === 1 && isTTSPlaying) return true;
return false; // Valid interrupt
}
// Handle Interruption
handleInterimTranscription(text, utteranceStartTime) {
const isInterrupt = !isNotInterrupt(
text,
this.isTTSPlaying,
this.intentSeverity
);
if (isInterrupt) {
// Pause TTS and rollback audio buffer
this.evaluateTTSInterruptionAndRollback();
this.streamService.pauseAudio(); // Send clearAudio to Plivo
}
}
OpenAI GPT Realtime API
OpenAI GPT Realtime API provides conversational AI with streaming responses and function calling capabilities:
// GPT Realtime WebSocket Connection
initWebSocket() {
const url = `${this.endpoint}/openai/realtime?api-version=${this.apiVersion}&deployment=${this.deploymentId}`;
this.websocket = new WebSocket(url, {
headers: { "Authorization": `Bearer ${this.apiKey}` }
});
this.websocket.on('message', (data) => {
const event = JSON.parse(data);
// Handle streaming text responses
if (event.type === 'response.text.delta') {
this.sentenceBuffer += event.delta;
this.emit('gptreply', this.sentenceBuffer, event.item_id);
}
// Handle function calls
if (event.type === 'response.function_call') {
if (event.name === 'call_hangup') {
this.emit('hangup-call');
}
if (event.name === 'machine_detection') {
this.emit('machine-detected');
}
}
});
}
// Send User Message
sendMessage(text, interactionCount) {
const message = {
type: 'conversation.item.create',
item: {
type: 'message',
role: 'user',
content: [{ type: 'input_text', text }]
}
};
this.websocket.send(JSON.stringify(message));
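// Trigger a model response to the item just created
// (assumption: this wrapper requests responses explicitly via
// response.create, as the Realtime API expects for text input)
this.websocket.send(JSON.stringify({ type: 'response.create' }));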
}
ElevenLabs Text-to-Speech
ElevenLabs provides natural-sounding voice synthesis with WebSocket streaming for low-latency audio generation:
// ElevenLabs WebSocket TTS Setup
initWebSocket() {
const uri = `wss://api.elevenlabs.io/v1/text-to-speech/${this.voiceId}/stream-input?model_id=${this.model}&output_format=ulaw_8000&inactivity_timeout=180&auto_mode=true&enable_ssml_parsing=true`;
this.websocket = new WebSocket(uri, {
headers: { "xi-api-key": this.apiKey }
});
this.websocket.on('message', (event) => {
  const data = JSON.parse(event.toString());
  if (data.audio) {
    // data.audio is already a base64-encoded μ-law chunk,
    // so it can be forwarded to Plivo as-is
    // Calculate audio duration from character-level alignment data
    const startTimes = data.alignment?.charStartTimesMs ?? [];
    const durations = data.alignment?.charDurationsMs ?? [];
    const audioDuration = (startTimes[startTimes.length - 1] ?? 0) +
      (durations[durations.length - 1] ?? 0);
    // Emit audio chunk with duration
    this.emit("speech", data.audio, audioDuration);
  }
});
}
// Generate Speech
generate(text) {
const message = {
text: text,
flush: false // Stream mode
};
this.websocket.send(JSON.stringify(message));
}
Audio Streaming Pipeline
The system handles bidirectional audio streaming with buffering and checkpoint management:
// Stream Service - Audio Buffer Management
class StreamService {
sendAudio(audio) {
const json = {
"event": 'playAudio',
"media": {
"contentType": "audio/x-mulaw",
"sampleRate": 8000,
"payload": audio // Base64 encoded μ-law audio
}
};
this.ws.send(JSON.stringify(json));
// Create checkpoint for interruption rollback
const markLabel = uuid.v4();
const checkpoint = {
"streamId": this.streamSid,
"event": 'checkpoint',
"name": markLabel
};
this.ws.send(JSON.stringify(checkpoint));
this.emit('audiosent', markLabel);
}
pauseAudio() {
// Clear audio buffer on interruption
const json = {
"event": "clearAudio",
"streamId": this.streamSid
};
this.ws.send(JSON.stringify(json));
}
}
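The handler below calls streamService.buffer(), which isn't shown above. A minimal sketch of what such a method could look like, assuming out-of-order TTS chunks are re-sequenced by index and a null index means "play immediately"; this is an assumption, not the project's actual implementation:

// Hypothetical buffer(): re-sequence TTS chunks before sending to Plivo
buffer(index, audio) {
  if (index === null) {
    this.sendAudio(audio); // e.g., filler audio plays immediately
    return;
  }
  this.audioBuffer[index] = audio;
  // Flush consecutive chunks starting from the next expected index
  while (this.audioBuffer[this.expectedIndex] !== undefined) {
    this.sendAudio(this.audioBuffer[this.expectedIndex]);
    delete this.audioBuffer[this.expectedIndex];
    this.expectedIndex++;
  }
}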
// TTS Audio Handler
this.ttsService.on('speech', (audio, audioDuration) => {
this.streamService.buffer(null, audio);
this.isTTSPlaying = true;
this.ttsDuration += audioDuration;
});
Intent Severity-Based Interruption Handling
The system uses intent severity levels to determine interruption behavior:
// Intent Severity Levels
setIntentSeverity() {
if (this.interactionCount <= 2) {
this.intentSeverity = "MEDIUM"; // Early in conversation
} else {
this.intentSeverity = "LOW"; // Later in conversation
}
}
// Initial state: HIGH severity (welcome message)
this.intentSeverity = "HIGH";
// Interruption Rules by Severity:
// HIGH: Never interrupt - critical messages (welcome, important info)
// MEDIUM: Allow interrupts for ≤2 words only
// LOW: Allow all valid interrupts
// Usage in Interruption Detection
if (intentSeverity === "HIGH" && isTTSPlaying) {
return true; // Never interrupt
}
if (words.length <= 2 && isTTSPlaying && intentSeverity === "MEDIUM") {
return true; // Ignore short interruptions
}
Key Features
Real-Time Transcription
Deepgram provides interim results for instant interruption detection, enabling natural conversation flow with minimal latency.
Intelligent Interruption Detection
Filters out acknowledgment words (hmm, ok, haan) and uses intent severity to determine valid interruptions.
Streaming TTS
ElevenLabs WebSocket streaming provides low-latency audio generation with character-level alignment for precise duration tracking.
Audio Rollback
On interruption, the system rolls back audio buffers using checkpoint marks and sends filler audio (ok, hmm) for smooth transitions.
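The evaluateTTSInterruptionAndRollback() call referenced earlier isn't shown in the snippets. A minimal sketch of the idea, estimating playback progress from elapsed time against the accumulated ttsDuration; the actual implementation tracks checkpoint marks, and the helper names here are illustrative:

// Hypothetical rollback sketch - trim history to what the caller heard
evaluateTTSInterruptionAndRollback() {
  const playedMs = Date.now() - this.ttsStartTime; // audio actually played
  if (playedMs < this.ttsDuration) {
    // Trim the assistant transcript so the model doesn't assume the
    // caller heard sentences that were cut off (hypothetical helper)
    this.trimAssistantTranscript(playedMs / this.ttsDuration);
  }
  this.isTTSPlaying = false;
  this.ttsDuration = 0;
}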
Function Calling
GPT Realtime API function calling enables call hangup, machine detection, and call hold capabilities.
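The function-call handlers shown earlier assume these tools were registered with the model. A sketch of how that registration could look via a Realtime session.update event; the tool names come from the handler code, while the descriptions and empty parameter schemas are assumptions:

// Register callable tools on the Realtime session
this.websocket.send(JSON.stringify({
  type: 'session.update',
  session: {
    tools: [
      { type: 'function', name: 'call_hangup',
        description: 'End the call once the conversation is complete.',
        parameters: { type: 'object', properties: {} } },
      { type: 'function', name: 'machine_detection',
        description: 'Report that an answering machine picked up.',
        parameters: { type: 'object', properties: {} } }
    ],
    tool_choice: 'auto'
  }
}));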
Multi-Language Support
Supports Hindi and English with Hinglish detection for natural conversation in mixed-language scenarios.
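The snippets don't show how mixed-language input is classified. One simple, purely illustrative heuristic is to classify by script: Devanagari characters indicate Hindi, and a mix of Devanagari and Latin text suggests Hinglish:

// Hypothetical script-based language heuristic (illustrative only)
function detectLanguage(text) {
  const hasDevanagari = /[\u0900-\u097F]/.test(text);
  const hasLatin = /[A-Za-z]/.test(text);
  if (hasDevanagari && hasLatin) return 'hinglish';
  if (hasDevanagari) return 'hindi';
  return 'english';
}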
Technical Stack
- Plivo: call initiation and bidirectional WebSocket media streaming
- Deepgram Nova-2: real-time speech-to-text with interim results
- OpenAI GPT Realtime API: conversational AI with streaming responses and function calling
- ElevenLabs: low-latency streaming text-to-speech (μ-law, 8 kHz)
- Node.js: WebSocket-based orchestration of the audio pipeline
Architecture Benefits
- Low Latency: WebSocket streaming and interim transcription enable sub-second response times
- Natural Conversation: Intelligent interruption detection allows natural back-and-forth dialogue
- Scalability: Plivo handles call infrastructure, allowing focus on AI logic
- Reliability: Checkpoint-based audio rollback ensures smooth interruption handling
- Multi-Language: Deepgram and ElevenLabs support Hindi, English, and Hinglish
- Production Ready: Includes call recording, machine detection, and call hold capabilities