Agentic AI Telephony Solution
Production telephony solution using Deepgram, ElevenLabs, OpenAI GPT Realtime, and Plivo with intelligent interruption detection.
About the Project
Built a production-ready telephony solution integrating Deepgram for real-time speech-to-text, ElevenLabs for natural text-to-speech, OpenAI GPT Realtime API for conversational AI, and Plivo for telephony infrastructure. Implemented intelligent interruption detection that filters acknowledgment words and uses intent severity levels to enable natural conversation flow with minimal latency.
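At runtime the four services are wired together as an event-driven pipeline. Below is a minimal sketch of that wiring, using illustrative class names; the event names ('transcription', 'gptreply', 'speech') and methods (sendMessage, generate, buffer) mirror the snippets in the sections that follow, but the actual module structure may differ:

// Illustrative event wiring - caller audio flows left to right
const transcription = new TranscriptionService(); // Deepgram wrapper
const gpt = new GptService();                     // GPT Realtime wrapper
const tts = new TextToSpeechService();            // ElevenLabs wrapper
const stream = new StreamService(ws);             // Plivo media stream wrapper

transcription.on('transcription', (text) => gpt.sendMessage(text));
gpt.on('gptreply', (sentence, itemId) => tts.generate(sentence));
tts.on('speech', (audio, duration) => stream.buffer(null, audio));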
Plivo Telephony Platform
Plivo serves as the telephony infrastructure, handling call initiation, WebSocket media streaming, and call management:
// Plivo Client Setup
const plivo = require('plivo');
const client = new plivo.Client(PLIVO_AUTH_ID, PLIVO_AUTH_TOKEN);
// Initiate Call
async initiateCall(to, answerUrl, hangupUrl, from) {
const response = await this.client.calls.create(
from,
to,
answerUrl,
{ answer_method: 'POST', hangup_url: hangupUrl }
);
return response.requestUuid;
}
// WebSocket Connection for Media Streaming
app.ws('/answer', (ws, req) => {
const handler = new PlivoStreamHandler(ws, req);
// Handles bidirectional audio streaming
});
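Plivo connects the call to this WebSocket endpoint via XML returned from the answer URL. A hedged sketch of such a webhook, assuming an Express route and a placeholder wss://your-host/answer stream URL; the exact Stream attributes should be checked against Plivo's XML reference:

// Answer webhook - instructs Plivo to open a bidirectional media stream
app.post('/answer-xml', (req, res) => {
  res.type('application/xml').send(
    `<Response>
       <Stream bidirectional="true" keepCallAlive="true"
               contentType="audio/x-mulaw;rate=8000">
         wss://your-host/answer
       </Stream>
     </Response>`
  );
});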
// Handle Media Stream Events
handleMessage(data) {
const msg = JSON.parse(data);
if (msg.event === 'media') {
// Send audio chunks to Deepgram
this.transcriptionService.send(
Buffer.from(msg.media.payload, 'base64')
);
}
}
Deepgram Speech-to-Text Integration
Deepgram provides real-time transcription with interim results for low-latency interruption detection:
// Deepgram Live Transcription Setup
const { createClient, LiveTranscriptionEvents } = require("@deepgram/sdk");
this.connection = this.deepgram.listen.live({
model: "nova-2",
encoding: "mulaw", // μ-law encoding
sample_rate: 8000, // 8kHz sample rate
interim_results: true, // Real-time partial results
language: "hi", // Hindi language support
punctuate: true,
endpointing: 300, // Endpoint detection
utterance_end_ms: 1500, // Utterance end timeout
filler_words: true // Detect filler words
});
// Handle Transcription Events
this.connection.on(LiveTranscriptionEvents.Transcript, (event) => {
  const text = event.channel?.alternatives[0]?.transcript;
  // Interim results for interruption detection
  if (!event.is_final) {
    this.emit('interim-transcription', text, event.start);
  }
  // Accumulate finalized segments until the utterance is complete
  if (event.is_final) {
    this.finalResult += ` ${text}`;
  }
  // Emit the full result once speech is complete
  if (event.is_final && event.speech_final) {
    this.emit('transcription', this.finalResult.trim(), event.start);
    this.finalResult = '';
  }
});
// Send Audio to Deepgram
send(audioData) {
this.connection.send(audioData);
}
Intelligent Interruption Detection
The system implements sophisticated interruption detection that filters out acknowledgment words and handles interruptions based on intent severity:
// Interruption Detection Logic
function isNotInterrupt(text, isTTSPlaying, intentSeverity) {
if (!text || typeof text !== 'string') return true;
// HIGH severity intents: Never interrupt during TTS
if (intentSeverity === "HIGH" && isTTSPlaying) return true;
// Non-interrupt keywords (acknowledgments)
const notInterruptKeywords = [
// English
'hmm', 'ok', 'okay', 'right', 'go on', 'uh-huh', 'hello',
// Hindi/Hinglish
'bataiye', 'kahiye', 'accha', 'achha', 'han', 'haan',
'haanji', 'ha', 'huh', 'अच्छा', 'बोलिए', 'हाँ', 'हां'
];
const lowerCaseText = text.trim().replace(/[.,!?]/g, '').toLowerCase();
const words = lowerCaseText.split(/\s+/).filter(Boolean);
// All words are non-interrupt keywords
const allNonInterrupt = words.every(word =>
notInterruptKeywords.includes(word)
);
if (allNonInterrupt && isTTSPlaying) return true;
// Single word or ≤2 words during MEDIUM severity
if (words.length <= 2 && isTTSPlaying && intentSeverity === "MEDIUM") {
return true;
}
// Single word during any TTS
if (words.length === 1 && isTTSPlaying) return true;
return false; // Valid interrupt
}
// Handle Interruption
handleInterimTranscription(text, utteranceStartTime) {
const isInterrupt = !isNotInterrupt(
text,
this.isTTSPlaying,
this.intentSeverity
);
if (isInterrupt) {
// Pause TTS and rollback audio buffer
this.evaluateTTSInterruptionAndRollback();
this.streamService.pauseAudio(); // Send clearAudio to Plivo
}
}
OpenAI GPT Realtime API
OpenAI GPT Realtime API provides conversational AI with streaming responses and function calling capabilities:
// GPT Realtime WebSocket Connection
initWebSocket() {
const url = `${this.endpoint}/openai/realtime?api-version=${this.apiVersion}&deployment=${this.deploymentId}`;
this.websocket = new WebSocket(url, {
headers: { "Authorization": `Bearer ${this.apiKey}` }
});
this.websocket.on('message', (data) => {
const event = JSON.parse(data);
// Handle streaming text responses
if (event.type === 'response.text.delta') {
this.sentenceBuffer += event.delta;
this.emit('gptreply', this.sentenceBuffer, event.item_id);
}
// Handle function calls
if (event.type === 'response.function_call') {
if (event.name === 'call_hangup') {
this.emit('hangup-call');
}
if (event.name === 'machine_detection') {
this.emit('machine-detected');
}
}
});
}
// Send User Message
sendMessage(text, interactionCount) {
const message = {
type: 'conversation.item.create',
item: {
type: 'message',
role: 'user',
content: [{ type: 'input_text', text }]
}
};
this.websocket.send(JSON.stringify(message));
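// Trigger a model response to the item just created
// (assumption: this wrapper requests responses explicitly via
// response.create, as the Realtime API expects for text input)
this.websocket.send(JSON.stringify({ type: 'response.create' }));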
}
ElevenLabs Text-to-Speech
ElevenLabs provides natural-sounding voice synthesis with WebSocket streaming for low-latency audio generation:
// ElevenLabs WebSocket TTS Setup
initWebSocket() {
const uri = `wss://api.elevenlabs.io/v1/text-to-speech/${this.voiceId}/stream-input?model_id=${this.model}&output_format=ulaw_8000&inactivity_timeout=180&auto_mode=true&enable_ssml_parsing=true`;
this.websocket = new WebSocket(uri, {
headers: { "xi-api-key": this.apiKey }
});
this.websocket.on('message', (event) => {
  const data = JSON.parse(event.toString());
  if (data.audio) {
    // data.audio is already a base64-encoded μ-law chunk,
    // so it can be forwarded to Plivo as-is
    // Calculate audio duration from character-level alignment data
    const startTimes = data.alignment?.charStartTimesMs ?? [];
    const durations = data.alignment?.charDurationsMs ?? [];
    const audioDuration = (startTimes[startTimes.length - 1] ?? 0) +
      (durations[durations.length - 1] ?? 0);
    // Emit audio chunk with duration
    this.emit("speech", data.audio, audioDuration);
  }
});
}
// Generate Speech
generate(text) {
const message = {
text: text,
flush: false // Stream mode
};
this.websocket.send(JSON.stringify(message));
}
Audio Streaming Pipeline
The system handles bidirectional audio streaming with buffering and checkpoint management:
// Stream Service - Audio Buffer Management
class StreamService {
sendAudio(audio) {
const json = {
"event": 'playAudio',
"media": {
"contentType": "audio/x-mulaw",
"sampleRate": 8000,
"payload": audio // Base64 encoded μ-law audio
}
};
this.ws.send(JSON.stringify(json));
// Create checkpoint for interruption rollback
const markLabel = uuid.v4();
const checkpoint = {
"streamId": this.streamSid,
"event": 'checkpoint',
"name": markLabel
};
this.ws.send(JSON.stringify(checkpoint));
this.emit('audiosent', markLabel);
}
pauseAudio() {
// Clear audio buffer on interruption
const json = {
"event": "clearAudio",
"streamId": this.streamSid
};
this.ws.send(JSON.stringify(json));
}
}
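The handler below calls streamService.buffer(), which isn't shown above. A minimal sketch of what such a method could look like, assuming out-of-order TTS chunks are re-sequenced by index and a null index means "play immediately"; this is an assumption, not the project's actual implementation:

// Hypothetical buffer(): re-sequence TTS chunks before sending to Plivo
buffer(index, audio) {
  if (index === null) {
    this.sendAudio(audio); // e.g., filler audio plays immediately
    return;
  }
  this.audioBuffer[index] = audio;
  // Flush consecutive chunks starting from the next expected index
  while (this.audioBuffer[this.expectedIndex] !== undefined) {
    this.sendAudio(this.audioBuffer[this.expectedIndex]);
    delete this.audioBuffer[this.expectedIndex];
    this.expectedIndex++;
  }
}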
// TTS Audio Handler
this.ttsService.on('speech', (audio, audioDuration) => {
this.streamService.buffer(null, audio);
this.isTTSPlaying = true;
this.ttsDuration += audioDuration;
});
Intent Severity-Based Interruption Handling
The system uses intent severity levels to determine interruption behavior:
// Intent Severity Levels
setIntentSeverity() {
if (this.interactionCount <= 2) {
this.intentSeverity = "MEDIUM"; // Early in conversation
} else {
this.intentSeverity = "LOW"; // Later in conversation
}
}
// Initial state: HIGH severity (welcome message)
this.intentSeverity = "HIGH";
// Interruption Rules by Severity:
// HIGH: Never interrupt - critical messages (welcome, important info)
// MEDIUM: Allow interrupts for ≤2 words only
// LOW: Allow all valid interrupts
// Usage in Interruption Detection
if (intentSeverity === "HIGH" && isTTSPlaying) {
return true; // Never interrupt
}
if (words.length <= 2 && isTTSPlaying && intentSeverity === "MEDIUM") {
return true; // Ignore short interruptions
}
Key Features
Real-Time Transcription
Deepgram provides interim results for instant interruption detection, enabling natural conversation flow with minimal latency.
Intelligent Interruption Detection
Filters out acknowledgment words (hmm, ok, haan) and uses intent severity to determine valid interruptions.
Streaming TTS
ElevenLabs WebSocket streaming provides low-latency audio generation with character-level alignment for precise duration tracking.
Audio Rollback
On interruption, the system rolls back audio buffers using checkpoint marks and sends filler audio (ok, hmm) for smooth transitions.
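The evaluateTTSInterruptionAndRollback() call referenced earlier isn't shown in the snippets. A minimal sketch of the idea, estimating playback progress from elapsed time against the accumulated ttsDuration; the actual implementation tracks checkpoint marks, and the helper names here are illustrative:

// Hypothetical rollback sketch - trim history to what the caller heard
evaluateTTSInterruptionAndRollback() {
  const playedMs = Date.now() - this.ttsStartTime; // audio actually played
  if (playedMs < this.ttsDuration) {
    // Trim the assistant transcript so the model doesn't assume the
    // caller heard sentences that were cut off (hypothetical helper)
    this.trimAssistantTranscript(playedMs / this.ttsDuration);
  }
  this.isTTSPlaying = false;
  this.ttsDuration = 0;
}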
Function Calling
GPT Realtime API function calling enables call hangup, machine detection, and call hold capabilities.
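The function-call handlers shown earlier assume these tools were registered with the model. A sketch of how that registration could look via a Realtime session.update event; the tool names come from the handler code, while the descriptions and empty parameter schemas are assumptions:

// Register callable tools on the Realtime session
this.websocket.send(JSON.stringify({
  type: 'session.update',
  session: {
    tools: [
      { type: 'function', name: 'call_hangup',
        description: 'End the call once the conversation is complete.',
        parameters: { type: 'object', properties: {} } },
      { type: 'function', name: 'machine_detection',
        description: 'Report that an answering machine picked up.',
        parameters: { type: 'object', properties: {} } }
    ],
    tool_choice: 'auto'
  }
}));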
Multi-Language Support
Supports Hindi and English with Hinglish detection for natural conversation in mixed-language scenarios.
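The snippets don't show how mixed-language input is classified. One simple, purely illustrative heuristic is to classify by script: Devanagari characters indicate Hindi, and a mix of Devanagari and Latin text suggests Hinglish:

// Hypothetical script-based language heuristic (illustrative only)
function detectLanguage(text) {
  const hasDevanagari = /[\u0900-\u097F]/.test(text);
  const hasLatin = /[A-Za-z]/.test(text);
  if (hasDevanagari && hasLatin) return 'hinglish';
  if (hasDevanagari) return 'hindi';
  return 'english';
}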
Technical Stack
- Plivo: call initiation and bidirectional WebSocket media streaming
- Deepgram Nova-2: real-time speech-to-text with interim results
- OpenAI GPT Realtime API: conversational AI with streaming responses and function calling
- ElevenLabs: low-latency streaming text-to-speech (μ-law, 8 kHz)
- Node.js: WebSocket-based orchestration of the audio pipeline
Architecture Benefits
- Low Latency: WebSocket streaming and interim transcription enable sub-second response times
- Natural Conversation: Intelligent interruption detection allows natural back-and-forth dialogue
- Scalability: Plivo handles call infrastructure, allowing focus on AI logic
- Reliability: Checkpoint-based audio rollback ensures smooth interruption handling
- Multi-Language: Deepgram and ElevenLabs support Hindi, English, and Hinglish
- Production Ready: Includes call recording, machine detection, and call hold capabilities