
Voice Integration Specification

Version: 0.1 (Draft)
Date: March 2026
Status: Proposal
Depends on: Tutor Behavior Spec
Used by: Tutor


1. Overview

Voice is how the Tutor speaks and listens. It must feel natural — like talking to a real person, not a voice assistant.

The voice pipeline runs on the Tutor device — which can be a phone, laptop, or browser tab. The tablet (Edge) has no audio role.

1.1 Deployment Options

Platform | Audio API | Pros | Cons
---------|-----------|------|-----
Browser (Web) | Web Audio API, MediaDevices | No install, cross-platform | Some latency, permission prompts
Phone (Native) | iOS AVFoundation, Android AudioRecord | Best latency, background audio | App Store approval, separate builds
Desktop (Native) | Platform audio APIs | Best performance | Install friction
PWA | Web Audio API | Installable, offline capable | Same constraints as browser

Recommendation: Browser-first for v1, with PWA wrapper for "installed" feel.

1.2 Design Goals

Goal | Implication
-----|------------
Conversational latency | <800ms from the student stopping speaking to the Tutor starting (see Section 8)
Natural speech | Not robotic, not over-enunciated
Interruptible | Student can cut in anytime
Robust | Works with background noise, accents, kids' voices
Private | Audio processed locally when possible
Cross-platform | Same experience on phone, laptop, browser

1.3 Non-Goals


2. Architecture

2.1 Pipeline Overview

┌─────────────────────────────────────────────────────────────────┐
│                       VOICE PIPELINE                            │
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐     │
│  │  Audio  │    │   VAD   │    │   STT   │    │  Tutor  │     │
│  │ Capture │───►│         │───►│         │───►│  Core   │     │
│  │         │    │ (Voice  │    │ (Speech │    │         │     │
│  │         │    │  Detect)│    │ to Text)│    │         │     │
│  └─────────┘    └─────────┘    └─────────┘    └────┬────┘     │
│                                                     │          │
│                                                     ▼          │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐     │
│  │  Audio  │    │   Mix   │    │   TTS   │    │ Response│     │
│  │ Output  │◄───│         │◄───│         │◄───│  Gen    │     │
│  │         │    │         │    │ (Text   │    │         │     │
│  │         │    │         │    │ to Spch)│    │         │     │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘     │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2.2 Component Responsibilities

Component | Responsibility
----------|---------------
Audio Capture | Microphone input, noise suppression
VAD | Detect speech start/end
STT | Convert speech to text
Tutor Core | Decide what to say (see Tutor Behavior Spec)
Response Gen | Generate response text
TTS | Convert text to speech
Mix | Handle interruption, volume
Audio Output | Speaker output

3. Speech-to-Text (STT)

3.1 Requirements

Requirement | Target
------------|-------
Latency (end of speech → text) | <500ms
Word Error Rate | <10%
Streaming | Yes (partial results)
Languages | English (v1), Hebrew (v1.1)
Speaker profile | Children ages 10-15
Vocabulary | Math terms, numbers, variables

3.2 Provider Options

Provider | Latency | Accuracy | Cost | Offline
---------|---------|----------|------|--------
Whisper (local) | 300ms | High | Free | Yes
Deepgram | 200ms | High | $0.0043/min | No
Google STT | 250ms | High | $0.006/min | No
Azure STT | 300ms | High | $0.016/min | No
OpenAI Whisper API | 500ms | Highest | $0.006/min | No

Recommendation: Whisper (local) for privacy and cost. Fall back to Deepgram when local performance is insufficient.

3.3 Streaming STT

STT provides partial results while the student speaks:

Time    Audio                    Partial Results
────    ─────                    ───────────────
0.0s    "What..."               
0.3s                             "What"
0.5s    "...should..."          
0.8s                             "What should"
1.0s    "...I do..."            
1.3s                             "What should I do"
1.5s    "...first?"             
2.0s    (silence detected)       
2.2s                             "What should I do first?" [FINAL]
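The partial/final flow above can be sketched as a small event handler. The event shape here (`text` plus an `is_final` flag) is an assumed, provider-neutral format, not any specific STT API:

```python
# Hypothetical streaming-STT event consumer: partials drive live captions,
# and only the final transcript is handed onward (e.g. to the Tutor core).
def handle_stt_event(event, on_partial, on_final):
    """event: {"text": str, "is_final": bool} — an assumed event shape."""
    if event["is_final"]:
        on_final(event["text"])
    else:
        on_partial(event["text"])

# Replaying the timeline above:
captions, finals = [], []
for text, final in [("What", False), ("What should", False),
                    ("What should I do", False),
                    ("What should I do first?", True)]:
    handle_stt_event({"text": text, "is_final": final},
                     captions.append, finals.append)
```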

3.4 Math Vocabulary

Custom vocabulary boost for:

Numbers: zero through twenty, hundred, thousand
Variables: x, y, z, n, a, b, c
Operations: plus, minus, times, divided by, equals, squared, cubed
Terms: equation, expression, fraction, numerator, denominator
       exponent, coefficient, variable, constant, term
       positive, negative, both sides, isolate, solve

3.5 Handling Math Speech

Student says | Interpretation
-------------|---------------
"three x" | 3x
"x squared" | x²
"two over three" | 2/3
"negative five" | -5
"equals" | =
"open paren" | (
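A minimal sketch of this mapping follows. A production system would use a proper spoken-math grammar; the function name, token tables, and ASCII `^2` notation here are all illustrative:

```python
# Illustrative normalizer for spoken math, following the table above.
SPOKEN_MAP = {
    "plus": "+", "minus": "-", "times": "*", "equals": "=",
    "open paren": "(", "close paren": ")", "divided by": "/",
}
NUMBERS = {"zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
           "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9"}
VARIABLES = {"x", "y", "z", "n", "a", "b", "c"}

def normalize_math(phrase):
    words = phrase.lower().split()
    out, i = [], 0
    while i < len(words):
        two = " ".join(words[i:i + 2])
        w = words[i]
        if two in SPOKEN_MAP:                       # "open paren", "divided by"
            out.append(SPOKEN_MAP[two]); i += 2
        elif w == "negative" and i + 1 < len(words) and words[i + 1] in NUMBERS:
            out.append("-" + NUMBERS[words[i + 1]]); i += 2
        elif w == "over" and out and i + 1 < len(words) and words[i + 1] in NUMBERS:
            out[-1] += "/" + NUMBERS[words[i + 1]]; i += 2   # "two over three"
        elif w == "squared" and out:
            out[-1] += "^2"; i += 1                 # "x squared"
        elif w in NUMBERS:
            if i + 1 < len(words) and words[i + 1] in VARIABLES:
                out.append(NUMBERS[w] + words[i + 1]); i += 2  # "three x"
            else:
                out.append(NUMBERS[w]); i += 1
        elif w in SPOKEN_MAP:
            out.append(SPOKEN_MAP[w]); i += 1
        else:
            out.append(w); i += 1
    return " ".join(out)
```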

4. Text-to-Speech (TTS)

4.1 Requirements

Requirement | Target
------------|-------
Latency (text → first audio) | <300ms
Naturalness | Conversational, not robotic
Streaming | Yes (start before full synthesis)
Interruptible | Stop immediately on student speech
Emotion | Warm, encouraging, variable
Languages | English (v1), Hebrew (v1.1)

4.2 Provider Options

Provider | Latency | Quality | Cost | Emotion
---------|---------|---------|------|--------
ElevenLabs | 200ms | Excellent | $0.30/1K chars | Yes
OpenAI TTS | 300ms | Good | $0.015/1K chars | Limited
Google TTS | 150ms | Good | $0.016/1K chars | Limited
Azure TTS | 200ms | Good | $0.016/1K chars | Yes
Coqui (local) | 400ms | Medium | Free | Limited

Recommendation: ElevenLabs for quality and emotion. Voice cloning for consistent persona.

4.3 Voice Configuration

{
  "voice_id": "freaking_genius_tutor_v1",
  "model": "eleven_turbo_v2",
  "settings": {
    "stability": 0.65,
    "similarity_boost": 0.75,
    "style": 0.35,
    "use_speaker_boost": true
  },
  "generation_config": {
    "optimize_streaming_latency": 3,
    "output_format": "mp3_44100_128"
  }
}

4.4 Prosody Control

Tune speech based on context:

Context | Adjustment
--------|-----------
Question | Upward inflection at end
Encouragement | Warmer, slightly higher energy
Correction | Neutral, steady
Excitement (correct!) | Higher energy, faster
Thinking/Hinting | Slower, contemplative
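One way to wire this table into the TTS layer is a context-to-style lookup. The knob names below ("rate" as a speed multiplier, "energy" as a liveliness multiplier) are illustrative, not any provider's actual parameters:

```python
# Hypothetical context -> prosody settings, mirroring the table above.
PROSODY = {
    "question":      {"rate": 1.00, "energy": 1.00, "inflection": "rising"},
    "encouragement": {"rate": 1.00, "energy": 1.10, "inflection": "warm"},
    "correction":    {"rate": 1.00, "energy": 0.95, "inflection": "neutral"},
    "excitement":    {"rate": 1.10, "energy": 1.20, "inflection": "bright"},
    "hinting":       {"rate": 0.90, "energy": 0.90, "inflection": "contemplative"},
}

def prosody_for(context):
    """Fall back to a neutral style for unknown contexts."""
    return PROSODY.get(context, PROSODY["correction"])
```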

4.5 Speech Synthesis Markup (SSML)

For nuanced control:

<speak>
  <prosody rate="95%">
    What happens <break time="300ms"/> 
    when you move the five 
    <emphasis level="moderate">to the other side</emphasis>?
  </prosody>
</speak>

5. Voice Activity Detection (VAD)

5.1 VAD Behavior

                    SILENCE          SPEECH          SILENCE
Audio:     ─────────────────│███████████████│─────────────────
                            │               │
                            ▼               ▼
State:     NOT_SPEAKING → SPEAKING → TRAILING → NOT_SPEAKING
                            │               │
Events:         speech_start         speech_end (after 1s silence)

5.2 VAD Parameters

Parameter | Value | Description
----------|-------|------------
speech_threshold | 0.5 | Probability threshold for speech
silence_duration_ms | 1000 | Silence before speech_end
min_speech_duration_ms | 200 | Ignore very short sounds
padding_ms | 300 | Include audio before/after
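The state machine of 5.1 driven by these parameters can be sketched as below. It models only the threshold and duration logic (padding and the model producing per-frame speech probabilities are out of scope); the class and frame size are assumptions:

```python
class VadStateMachine:
    """Tracks NOT_SPEAKING / SPEAKING and emits speech_start / speech_end,
    using the speech_threshold, silence_duration_ms, and
    min_speech_duration_ms parameters from the table above."""

    def __init__(self, speech_threshold=0.5, silence_duration_ms=1000,
                 min_speech_duration_ms=200, frame_ms=20):
        self.threshold = speech_threshold
        self.silence_frames = silence_duration_ms // frame_ms
        self.min_speech_frames = min_speech_duration_ms // frame_ms
        self.speaking = False
        self.speech_run = 0    # consecutive speech frames while silent
        self.silence_run = 0   # consecutive silence frames while speaking

    def process_frame(self, speech_prob):
        """Feed one frame's speech probability; return an event or None."""
        is_speech = speech_prob >= self.threshold
        if not self.speaking:
            self.speech_run = self.speech_run + 1 if is_speech else 0
            if self.speech_run >= self.min_speech_frames:
                self.speaking = True
                self.silence_run = 0
                return "speech_start"
        else:
            self.silence_run = 0 if is_speech else self.silence_run + 1
            if self.silence_run >= self.silence_frames:
                self.speaking = False
                self.speech_run = 0
                return "speech_end"
        return None
```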

5.3 Noise Handling


6. Turn-Taking

6.1 Turn States

┌─────────────────────────────────────────────────────────────────┐
│                       TURN-TAKING                               │
│                                                                 │
│   ┌───────────┐     Student speaks      ┌───────────┐         │
│   │  TUTOR    │ ───────────────────────►│  STUDENT  │         │
│   │  TURN     │                          │  TURN     │         │
│   │           │◄─────────────────────────│           │         │
│   │ (speaking │     Tutor responds       │ (speaking │         │
│   │  or       │                          │  or       │         │
│   │  silent)  │◄─────────────────────────│  writing) │         │
│   └───────────┘     Student finishes,    └───────────┘         │
│                     Tutor decides to                            │
│                     speak                                       │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

6.2 Interruption Handling

When the student speaks while the Tutor is speaking:

Tutor speaking: "When you move the five to the other side—"
                                                    │
Student: "Wait, what?"                              │
                                                    │
1. TTS stops immediately ◄──────────────────────────┘
2. STT processes student speech
3. Tutor responds to interruption

Tutor: "Go ahead, what's your question?"

6.3 Barge-In Sensitivity

Student sound | Action
--------------|-------
Clear speech | Stop TTS, process
"mm-hmm", "uh-huh" | Continue TTS (acknowledgment)
Cough, noise | Continue TTS (ignore)
"Wait" / "Hold on" | Stop TTS, wait

Detect backchannel vs. interruption using:
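For illustration, a toy classifier might combine the recognized transcript with utterance duration; the word lists, duration cutoff, and action names here are all assumptions, not a specified algorithm:

```python
# Hypothetical barge-in classifier following the table above: short
# acknowledgments keep TTS playing, real speech stops it.
BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "okay", "ok"}
STOP_PHRASES = {"wait", "hold on", "stop"}

def barge_in_action(transcript, duration_ms):
    text = transcript.lower().strip()
    if not text:
        return "continue"          # cough / noise: nothing recognized
    if text in STOP_PHRASES:
        return "stop_and_wait"     # explicit "wait" / "hold on"
    if text in BACKCHANNELS and duration_ms < 800:
        return "continue"          # acknowledgment, not an interruption
    return "stop_and_process"      # clear speech: stop TTS, handle it
```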


7. Audio Processing

7.1 Input Processing

Microphone → Noise Suppression → AGC → VAD → STT
                    │
                    └── Echo Cancellation (if Tutor playing)

Stage | Purpose
------|--------
Noise Suppression | Remove background noise
AGC | Normalize volume levels
Echo Cancellation | Remove Tutor's voice from mic
VAD | Detect speech boundaries

7.2 Output Processing

TTS → Volume Normalization → Ducking (if student speaks) → Speaker

Stage | Purpose
------|--------
Volume Normalization | Consistent loudness
Ducking | Lower Tutor volume if student starts speaking
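Ducking can be as simple as ramping the output gain toward a lower target while the student speaks. The gain target and step size below are assumptions for illustration:

```python
def duck_gain(student_speaking, current_gain, target_duck=0.3, step=0.1):
    """Move the Tutor's output gain one step toward the duck target
    while the student speaks, or back toward full volume (1.0) after.
    Called once per audio processing tick."""
    target = target_duck if student_speaking else 1.0
    if current_gain < target:
        return min(current_gain + step, target)
    return max(current_gain - step, target)
```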

7.3 Audio Format

Parameter | Input | Output
----------|-------|-------
Sample Rate | 16kHz | 44.1kHz
Channels | Mono | Mono
Bit Depth | 16-bit | 16-bit
Format | PCM | MP3/AAC

8. Latency Budget

8.1 End-to-End Target

Student finishes speaking → Tutor starts responding: <800ms

8.2 Budget Breakdown

Stage | Budget | Notes
------|--------|------
VAD speech_end detection | 100ms | After 1s silence
STT final transcription | 200ms | Streaming helps
Tutor decision | 100ms | Usually fast
TTS first audio | 200ms | Streaming synthesis
Audio output start | 50ms | Buffer management
Total | 650ms | Within budget
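As a sanity check, the per-stage budgets sum to 650ms, inside the 800ms end-to-end target from 8.1 (stage key names below are illustrative):

```python
# Stage budgets from the table above, in milliseconds.
BUDGET_MS = {
    "vad_speech_end": 100,
    "stt_final": 200,
    "tutor_decision": 100,
    "tts_first_audio": 200,
    "audio_output_start": 50,
}
TARGET_MS = 800  # end-to-end target from 8.1

total = sum(BUDGET_MS.values())  # 650ms, leaving 150ms of headroom
```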

8.3 Perceived Latency

Even if processing takes time, the Tutor can:


9. Language Support

9.1 Initial Languages

Language | STT | TTS | Timeline
---------|-----|-----|---------
English (US) | v1 | v1 | Launch
English (UK) | v1 | v1 | Launch
Hebrew | v1.1 | v1.1 | +3 months

9.2 Language Detection

9.3 Accent Handling

Train/fine-tune STT for:


10. Offline Capability

10.1 Offline Mode

When the internet is unavailable:

Component | Offline Option
----------|---------------
STT | Whisper (on-device)
TTS | Coqui / System TTS
Quality | Degraded but functional

10.2 Fallback Behavior

  1. Detect connectivity loss
  2. Switch to offline models
  3. Notify student: "I'm having trouble connecting. I'll do my best."
  4. Resume cloud services when available
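The fallback steps above can be sketched as a small router. The provider objects, `is_online()` probe, and `notify()` callback are illustrative stand-ins, not real APIs:

```python
class VoiceBackend:
    """Routes STT/TTS to cloud or on-device providers, following the
    fallback steps above."""

    def __init__(self, cloud, local, is_online, notify):
        self.cloud, self.local = cloud, local
        self.is_online, self.notify = is_online, notify
        self.offline = False

    def current(self):
        """Return the provider to use for the next utterance."""
        online = self.is_online()                 # 1. detect connectivity
        if not self.offline and not online:
            self.offline = True                   # 2. switch to offline models
            # 3. notify the student once per outage
            self.notify("I'm having trouble connecting. I'll do my best.")
        elif self.offline and online:
            self.offline = False                  # 4. resume cloud services
        return self.local if self.offline else self.cloud
```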

11. Privacy & Security

11.1 Audio Data Handling

Data | Storage | Retention
-----|---------|----------
Raw audio | Not stored | —
Transcriptions | Session only | Cleared on session end
Anonymized samples | Research opt-in | 90 days

11.2 Privacy Modes

Mode | Behavior
-----|---------
Standard | Cloud STT/TTS, no audio stored
Privacy | On-device only, no cloud
Research | Opt-in audio sampling for improvement

11.3 Compliance


12. Platform Integration

12.1 Browser (Web)

Audio capture:

// Request microphone access with voice processing enabled (see Section 7.1)
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { echoCancellation: true, noiseSuppression: true, autoGainControl: true }
});
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(stream);

Constraints:

Constraint | Mitigation
-----------|-----------
Permission prompt on first use | Clear UX explaining why mic is needed
No background audio when tab hidden | Keep tab visible, or use PWA
Autoplay policies block audio | Require user gesture before TTS
Variable latency | Use AudioWorklet for lower latency

Browser support:

Browser | Support | Notes
--------|---------|------
Chrome | ✓ Full | Best WebRTC support
Firefox | ✓ Full | Good
Safari | ✓ Partial | Some AudioWorklet limitations
Edge | ✓ Full | Chromium-based

12.2 Phone (Native)

iOS:

let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playAndRecord, mode: .voiceChat)
try audioSession.setActive(true)

Android:

val audioManager = getSystemService(Context.AUDIO_SERVICE) as AudioManager
audioManager.mode = AudioManager.MODE_IN_COMMUNICATION

Interruption handling:

Interruption | Behavior
-------------|---------
Phone call | Pause session, resume after
Notification | Suppress during session
Alarm | Pause, show UI to resume
Other app audio | Tutor pauses or ducks

12.3 Desktop (Native/Electron)

Advantages:

Considerations:

12.4 Hardware Requirements

Platform | Microphone | Speaker | Bluetooth
---------|------------|---------|----------
Browser | Built-in or USB | Built-in or external | Via browser
Phone | Built-in | Built-in or earbuds | ✓ Native
Desktop | Built-in, USB, or headset | Built-in or external |

12.5 Audio Session Management

When Tutor session is active:

  1. Claim audio focus — pause other audio sources
  2. Configure for voice — optimize for speech, not music
  3. Handle interruptions — pause gracefully, resume cleanly
  4. Manage permissions — request once, remember grant

13. Testing

13.1 STT Testing

Test | Pass Criteria
-----|--------------
Clean speech recognition | >95% accuracy
Noisy environment | >85% accuracy
Math vocabulary | >90% accuracy
Child voice | >90% accuracy
Accented speech | >85% accuracy

13.2 TTS Testing

Test | Pass Criteria
-----|--------------
Latency (first byte) | <300ms
MOS (Mean Opinion Score) | >4.0/5.0
Emotion appropriate | Manual review
Math pronunciation | Correct

13.3 Integration Testing

Test | Pass Criteria
-----|--------------
End-to-end latency | <800ms
Interruption response | <200ms to stop
Session audio quality | No dropouts, echo
Offline fallback | Functional

14. Metrics

14.1 Quality Metrics

Metric | Description | Target
-------|-------------|-------
STT Word Error Rate | % words incorrect | <10%
TTS MOS | User rating of voice | >4.0
Latency P50 | 50th percentile response time | <600ms
Latency P95 | 95th percentile | <1000ms

14.2 Usage Metrics

Metric | Description
-------|------------
voice_interactions_per_session | Count of turn-takes
avg_student_utterance_length | Words per student turn
interruption_rate | % of Tutor speech interrupted
stt_failure_rate | % of utterances failed to parse

Appendix A: Provider Integration

ElevenLabs Setup

from elevenlabs import generate, stream

def speak(text):
    audio_stream = generate(
        text=text,
        voice="freaking_genius_tutor_v1",
        model="eleven_turbo_v2",
        stream=True
    )
    stream(audio_stream)

Whisper Local Setup

import whisper

model = whisper.load_model("base.en")  # or "small.en" for better quality

def transcribe(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]

Deepgram Streaming

import os

from deepgram import Deepgram

dg = Deepgram(os.environ["DEEPGRAM_API_KEY"])

async def transcribe_stream(audio_stream):
    socket = await dg.transcription.live({
        "model": "nova-2",
        "language": "en-US",
        "smart_format": True
    })
    
    socket.on("transcript", handle_transcript)
    # ... stream audio to socket

Appendix B: Troubleshooting

Issue | Diagnosis | Fix
------|-----------|----
High latency | Network or provider issue | Switch to local/faster provider
Poor recognition | Noise, accent, vocabulary | Boost vocab, improve preprocessing
Robotic voice | TTS settings | Adjust prosody, try a different voice
Echo | No AEC | Enable echo cancellation
Interruption not working | VAD sensitivity | Lower threshold

This completes Phase 2 specs. Ready for implementation.