Voice Integration Specification
Version: 0.1 (Draft)
Date: March 2026
Status: Proposal
Depends on: Tutor Behavior Spec
Used by: Tutor
1. Overview
Voice is how the Tutor speaks and listens. It must feel natural — like talking to a real person, not a voice assistant.
The voice pipeline runs on the Tutor device — which can be a phone, laptop, or browser tab. The tablet (Edge) has no audio role.
1.1 Deployment Options
| Platform | Audio API | Pros | Cons |
|---|---|---|---|
| Browser (Web) | Web Audio API, MediaDevices | No install, cross-platform | Some latency, permission prompts |
| Phone (Native) | iOS AVFoundation, Android AudioRecord | Best latency, background audio | App Store approval, separate builds |
| Desktop (Native) | Platform audio APIs | Best performance | Install friction |
| PWA | Web Audio API | Installable, offline capable | Same constraints as browser |
Recommendation: Browser-first for v1, with PWA wrapper for "installed" feel.
1.2 Design Goals
| Goal | Implication |
|---|---|
| Conversational latency | <500ms from student stops speaking → Tutor starts |
| Natural speech | Not robotic, not over-enunciated |
| Interruptible | Student can cut in anytime |
| Robust | Works with background noise, accents, kids' voices |
| Private | Audio processed locally when possible |
| Cross-platform | Same experience on phone, laptop, browser |
1.3 Non-Goals
- Wake word ("Hey Tutor") — the session starts explicitly and the Tutor listens throughout, so no wake word is needed
- Multi-speaker recognition — only student speaks to Tutor
- Background music/audio — educational context only
2. Architecture
2.1 Pipeline Overview
┌─────────────────────────────────────────────────────────────────┐
│ VOICE PIPELINE │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Audio │ │ VAD │ │ STT │ │ Tutor │ │
│ │ Capture │───►│ │───►│ │───►│ Core │ │
│ │ │ │ (Voice │ │ (Speech │ │ │ │
│ │ │ │ Detect)│ │ to Text)│ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Audio │ │ Mix │ │ TTS │ │ Response│ │
│ │ Output │◄───│ │◄───│ │◄───│ Gen │ │
│ │ │ │ │ │ (Text │ │ │ │
│ │ │ │ │ │ to Spch)│ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.2 Component Responsibilities
| Component | Responsibility |
|---|---|
| Audio Capture | Microphone input, noise suppression |
| VAD | Detect speech start/end |
| STT | Convert speech to text |
| Tutor Core | Decide what to say (see Tutor Behavior Spec) |
| Response Gen | Generate response text |
| TTS | Convert text to speech |
| Mix | Handle interruption, volume |
| Audio Output | Speaker output |
3. Speech-to-Text (STT)
3.1 Requirements
| Requirement | Target |
|---|---|
| Latency (end of speech → text) | <500ms |
| Word Error Rate | <10% |
| Streaming | Yes (partial results) |
| Languages | English (v1), Hebrew (v1.1) |
| Speaker profile | Children ages 10-15 |
| Vocabulary | Math terms, numbers, variables |
3.2 Provider Options
| Provider | Latency | Accuracy | Cost | Offline |
|---|---|---|---|---|
| Whisper (local) | 300ms | High | Free | Yes |
| Deepgram | 200ms | High | $0.0043/min | No |
| Google STT | 250ms | High | $0.006/min | No |
| Azure STT | 300ms | High | $0.016/min | No |
| OpenAI Whisper API | 500ms | Highest | $0.006/min | No |
Recommendation: Whisper (local) for privacy and cost. Fall back to Deepgram on devices where local inference is too slow.
3.3 Streaming STT
STT provides partial results while student speaks:
Time Audio Partial Results
──── ───── ───────────────
0.0s "What..."
0.3s "What"
0.5s "...should..."
0.8s "What should"
1.0s "...I do..."
1.3s "What should I do"
1.5s "...first?"
2.0s (silence detected)
2.2s "What should I do first?" [FINAL]
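A minimal consumer of these partial results keeps the latest hypothesis and latches the transcript once the provider marks it final. This is an illustrative sketch, not any provider's actual API:

```python
# Illustrative accumulator for streaming STT partials: keep the latest
# hypothesis, latch the final transcript when the provider flags it.

class PartialTranscript:
    def __init__(self):
        self.latest = ""   # most recent (possibly partial) hypothesis
        self.final = None  # set once a result is marked final

    def on_result(self, text: str, is_final: bool) -> bool:
        """Update state; return True once the utterance is complete."""
        self.latest = text
        if is_final:
            self.final = text
        return is_final

buf = PartialTranscript()
buf.on_result("What", False)
buf.on_result("What should I do", False)
done = buf.on_result("What should I do first?", True)
print(done, buf.final)  # → True What should I do first?
```

The partials are useful for showing a live caption while the student speaks; only the final transcript is handed to the Tutor Core.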
3.4 Math Vocabulary
Custom vocabulary boost for:
Numbers: zero through twenty, hundred, thousand
Variables: x, y, z, n, a, b, c
Operations: plus, minus, times, divided by, equals, squared, cubed
Terms: equation, expression, fraction, numerator, denominator
exponent, coefficient, variable, constant, term
positive, negative, both sides, isolate, solve
3.5 Handling Math Speech
| Student says | Interpretation |
|---|---|
| "three x" | 3x |
| "x squared" | x² |
| "two over three" | 2/3 |
| "negative five" | -5 |
| "equals" | = |
| "open paren" | ( |
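A toy normalizer for the mappings above; the word lists and the no-space joining rule are illustrative assumptions, not the production grammar:

```python
# Hypothetical spoken-math normalizer mirroring the table above.
# Word lists and joining behavior are simplified for illustration.

NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
SPOKEN_MATH = {
    "plus": "+", "minus": "-", "times": "*", "over": "/",
    "equals": "=", "squared": "²", "cubed": "³", "negative": "-",
}

def normalize_math_speech(utterance: str) -> str:
    """Map spoken math words to notation, then join tokens without spaces."""
    tokens = [
        NUMBER_WORDS.get(w) or SPOKEN_MATH.get(w) or w
        for w in utterance.lower().replace(",", "").split()
    ]
    return "".join(tokens)

print(normalize_math_speech("three x"))         # → 3x
print(normalize_math_speech("two over three"))  # → 2/3
print(normalize_math_speech("negative five"))   # → -5
```

A real implementation would also handle multi-word tokens ("open paren"), superscript placement, and ambiguity ("x minus five" vs. "x negative five").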
4. Text-to-Speech (TTS)
4.1 Requirements
| Requirement | Target |
|---|---|
| Latency (text → first audio) | <300ms |
| Naturalness | Conversational, not robotic |
| Streaming | Yes (start before full synthesis) |
| Interruptible | Stop immediately on student speech |
| Emotion | Warm, encouraging, variable |
| Languages | English (v1), Hebrew (v1.1) |
4.2 Provider Options
| Provider | Latency | Quality | Cost | Emotion |
|---|---|---|---|---|
| ElevenLabs | 200ms | Excellent | $0.30/1K chars | Yes |
| OpenAI TTS | 300ms | Good | $0.015/1K chars | Limited |
| Google TTS | 150ms | Good | $0.016/1K chars | Limited |
| Azure TTS | 200ms | Good | $0.016/1K chars | Yes |
| Coqui (local) | 400ms | Medium | Free | Limited |
Recommendation: ElevenLabs for quality and emotion. Voice cloning for consistent persona.
4.3 Voice Configuration
```json
{
  "voice_id": "freaking_genius_tutor_v1",
  "model": "eleven_turbo_v2",
  "settings": {
    "stability": 0.65,
    "similarity_boost": 0.75,
    "style": 0.35,
    "use_speaker_boost": true
  },
  "generation_config": {
    "optimize_streaming_latency": 3,
    "output_format": "mp3_44100_128"
  }
}
```
4.4 Prosody Control
Tune speech based on context:
| Context | Adjustment |
|---|---|
| Question | Upward inflection at end |
| Encouragement | Warmer, slightly higher energy |
| Correction | Neutral, steady |
| Excitement (correct!) | Higher energy, faster |
| Thinking/Hinting | Slower, contemplative |
4.5 Speech Synthesis Markup (SSML)
For nuanced control:
```xml
<speak>
  <prosody rate="95%">
    What happens <break time="300ms"/>
    when you move the five
    <emphasis level="moderate">to the other side</emphasis>?
  </prosody>
</speak>
```
5. Voice Activity Detection (VAD)
5.1 VAD Behavior
SILENCE SPEECH SILENCE
Audio: ─────────────────│███████████████│─────────────────
│ │
▼ ▼
State: NOT_SPEAKING → SPEAKING → TRAILING → NOT_SPEAKING
│ │
Events: speech_start speech_end (after 1s silence)
5.2 VAD Parameters
| Parameter | Value | Description |
|---|---|---|
| speech_threshold | 0.5 | Probability threshold for speech |
| silence_duration_ms | 1000 | Silence before speech_end |
| min_speech_duration_ms | 200 | Ignore very short sounds |
| padding_ms | 300 | Include audio before/after |
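These parameters can be sketched as a small frame-level state machine. The 100 ms frame hop is an assumption of this sketch, and padding_ms (pre/post roll for the STT buffer) is omitted for brevity:

```python
# Frame-level VAD state machine using the parameters above.
# FRAME_MS is an assumed hop size; padding is omitted for brevity.

SPEECH_THRESHOLD = 0.5
SILENCE_DURATION_MS = 1000
MIN_SPEECH_DURATION_MS = 200
FRAME_MS = 100

class VAD:
    def __init__(self):
        self.speaking = False
        self.speech_ms = 0
        self.silence_ms = 0

    def process(self, speech_prob: float):
        """Consume one frame's speech probability; emit an event or None."""
        if speech_prob >= SPEECH_THRESHOLD:
            self.silence_ms = 0
            self.speech_ms += FRAME_MS
            if not self.speaking and self.speech_ms >= MIN_SPEECH_DURATION_MS:
                self.speaking = True
                return "speech_start"
        else:
            self.speech_ms = 0  # a short blip never reaches min duration
            if self.speaking:
                self.silence_ms += FRAME_MS
                if self.silence_ms >= SILENCE_DURATION_MS:
                    self.speaking = False
                    self.silence_ms = 0
                    return "speech_end"
        return None

vad = VAD()
events = [vad.process(p) for p in [0.9, 0.9] + [0.1] * 10]
print(events[1], events[11])  # → speech_start speech_end
```

Note how min_speech_duration_ms suppresses one-frame noises: speech_start only fires after 200 ms of sustained speech probability above threshold.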
5.3 Noise Handling
- Apply noise suppression before VAD
- Calibrate to ambient noise level on session start
- Adapt threshold during session
6. Turn-Taking
6.1 Turn States
┌─────────────────────────────────────────────────────────────────┐
│ TURN-TAKING │
│ │
│ ┌───────────┐ Student speaks ┌───────────┐ │
│ │ TUTOR │ ───────────────────────►│ STUDENT │ │
│ │ TURN │ │ TURN │ │
│ │ │◄─────────────────────────│ │ │
│ │ (speaking │ Tutor responds │ (speaking │ │
│ │ or │ │ or │ │
│ │ silent) │◄─────────────────────────│ writing) │ │
│ └───────────┘ Student finishes, └───────────┘ │
│ Tutor decides to │
│ speak │
│ │
└─────────────────────────────────────────────────────────────────┘
6.2 Interruption Handling
When student speaks while Tutor is speaking:
Tutor speaking: "When you move the five to the other side—"
│
Student: "Wait, what?" │
│
1. TTS stops immediately ◄──────────────────────────┘
2. STT processes student speech
3. Tutor responds to interruption
Tutor: "Go ahead, what's your question?"
6.3 Barge-In Sensitivity
| Student sound | Action |
|---|---|
| Clear speech | Stop TTS, process |
| "mm-hmm", "uh-huh" | Continue TTS (acknowledgment) |
| Cough, noise | Continue TTS (ignore) |
| "Wait" / "Hold on" | Stop TTS, wait |
Detect backchannel vs. interruption using:
- Duration (backchannels are short)
- Prosody (backchannels have characteristic pattern)
- Keywords ("wait", "hold on", "but" = interrupt)
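A heuristic sketch of this classification; the word lists and the 600 ms duration cutoff are illustrative assumptions, and the prosody cue is omitted:

```python
# Heuristic barge-in classifier using the duration and keyword cues
# above. Word lists and the duration cutoff are illustrative only.

BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "ok", "okay", "right"}
INTERRUPT_KEYWORDS = ("wait", "hold on", "but")

def classify_barge_in(text: str, duration_ms: int) -> str:
    """Return 'interrupt' (stop TTS) or 'backchannel' (keep speaking)."""
    lowered = text.lower().strip("?!. ")
    if any(kw in lowered for kw in INTERRUPT_KEYWORDS):
        return "interrupt"
    if duration_ms < 600 and lowered in BACKCHANNELS:
        return "backchannel"
    # When unsure, yield the floor: stopping is the safer default.
    return "interrupt"

print(classify_barge_in("mm-hmm", 300))      # → backchannel
print(classify_barge_in("Wait, what?", 500)) # → interrupt
```

Defaulting to "interrupt" on ambiguous input is a deliberate choice here: a Tutor that occasionally stops unnecessarily feels more polite than one that talks over the student.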
7. Audio Processing
7.1 Input Processing
Microphone → Noise Suppression → AGC → VAD → STT
│
└── Echo Cancellation (if Tutor playing)
| Stage | Purpose |
|---|---|
| Noise Suppression | Remove background noise |
| AGC | Normalize volume levels |
| Echo Cancellation | Remove Tutor's voice from mic |
| VAD | Detect speech boundaries |
7.2 Output Processing
TTS → Volume Normalization → Ducking (if student speaks) → Speaker
| Stage | Purpose |
|---|---|
| Volume Normalization | Consistent loudness |
| Ducking | Lower Tutor volume if student starts speaking |
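Ducking can be sketched as a gain ramp toward a target level; the duck gain of 0.3 and ramp step of 0.1 are illustrative values, not spec requirements:

```python
# Ducking sketch for the output chain above: ramp the Tutor's gain
# down while the student speaks, back up when they stop.
# DUCK_GAIN and RAMP_STEP are assumed values for illustration.

DUCK_GAIN = 0.3   # Tutor volume while the student is speaking
RAMP_STEP = 0.1   # gain change per audio callback (avoids clicks)

def next_gain(current: float, student_speaking: bool) -> float:
    """Advance the output gain one ramp step toward its target."""
    target = DUCK_GAIN if student_speaking else 1.0
    if current < target:
        return min(current + RAMP_STEP, target)
    return max(current - RAMP_STEP, target)
```

Applying this once per audio callback gives a short fade rather than an abrupt volume drop, which sounds far less jarring mid-sentence.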
7.3 Audio Format
| Parameter | Input | Output |
|---|---|---|
| Sample Rate | 16kHz | 44.1kHz |
| Channels | Mono | Mono |
| Bit Depth | 16-bit | 16-bit |
| Format | PCM | MP3/AAC |
8. Latency Budget
8.1 End-to-End Target
Student finishes speaking → Tutor starts responding: <800ms
8.2 Budget Breakdown
| Stage | Budget | Notes |
|---|---|---|
| VAD speech_end detection | 100ms | After 1s silence |
| STT final transcription | 200ms | Streaming helps |
| Tutor decision | 100ms | Usually fast |
| TTS first audio | 200ms | Streaming synthesis |
| Audio output start | 50ms | Buffer management |
| Total | 650ms | Within budget |
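A quick arithmetic check that the stage budgets above stay inside the 800 ms end-to-end target:

```python
# Stage budgets from the table above; the sum leaves 150 ms of slack
# against the 800 ms end-to-end target.
budget_ms = {
    "vad_speech_end_detection": 100,
    "stt_final_transcription": 200,
    "tutor_decision": 100,
    "tts_first_audio": 200,
    "audio_output_start": 50,
}
total_ms = sum(budget_ms.values())
print(total_ms)  # → 650
```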
8.3 Perceived Latency
Even if processing takes time, Tutor can:
- Use filler: "Hmm..." "Let's see..." while thinking
- Start TTS with common phrase while generating rest
9. Language Support
9.1 Initial Languages
| Language | STT | TTS | Timeline |
|---|---|---|---|
| English (US) | v1 | v1 | Launch |
| English (UK) | v1 | v1 | Launch |
| Hebrew | v1.1 | v1.1 | +3 months |
9.2 Language Detection
- Set per-student profile (not auto-detected)
- Support code-switching within session (future)
9.3 Accent Handling
Train/fine-tune STT for:
- Children's voices (higher pitch, less clear articulation)
- Regional accents (Israeli English, British variations)
- Math-specific pronunciations
10. Offline Capability
10.1 Offline Mode
When internet unavailable:
| Component | Offline Option |
|---|---|
| STT | Whisper (on-device) |
| TTS | Coqui / System TTS |
| Quality | Degraded but functional |
10.2 Fallback Behavior
- Detect connectivity loss
- Switch to offline models
- Notify student: "I'm having trouble connecting. I'll do my best."
- Resume cloud services when available
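The fallback steps above can be sketched as a provider switcher; the class, the stand-in provider strings, and the connectivity probe are all hypothetical:

```python
# Sketch of the cloud → local fallback described above. Provider
# objects are stand-in strings; the connectivity probe is a
# placeholder callable supplied by the platform layer.

class VoicePipeline:
    def __init__(self, cloud_stt, local_stt, is_online):
        self.cloud_stt = cloud_stt
        self.local_stt = local_stt
        self.is_online = is_online  # callable returning bool
        self.offline = False

    def active_stt(self):
        """Pick the STT backend, announcing a switch to offline mode."""
        was_offline = self.offline
        self.offline = not self.is_online()
        if self.offline and not was_offline:
            print("I'm having trouble connecting. I'll do my best.")
        return self.local_stt if self.offline else self.cloud_stt

connectivity = iter([True, False, True])
pipeline = VoicePipeline("cloud-stt", "whisper-local",
                         lambda: next(connectivity))
print(pipeline.active_stt())  # → cloud-stt
pipeline.active_stt()         # prints the notice, returns whisper-local
```

The student-facing notice fires only on the transition into offline mode, and cloud service resumes automatically on the next successful connectivity check.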
11. Privacy & Security
11.1 Audio Data Handling
| Data | Storage | Retention |
|---|---|---|
| Raw audio | Not stored | — |
| Transcriptions | Session only | Cleared on session end |
| Anonymized samples | Research opt-in | 90 days |
11.2 Privacy Modes
| Mode | Behavior |
|---|---|
| Standard | Cloud STT/TTS, no audio stored |
| Privacy | On-device only, no cloud |
| Research | Opt-in audio sampling for improvement |
11.3 Compliance
- COPPA compliant (parental consent for minors)
- GDPR compliant (data minimization, deletion rights)
- Audio encrypted in transit (TLS)
12. Platform Integration
12.1 Browser (Web)
Audio capture:
```javascript
// Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(stream);
```
Constraints:
| Constraint | Mitigation |
|---|---|
| Permission prompt on first use | Clear UX explaining why mic is needed |
| No background audio when tab hidden | Keep tab visible, or use PWA |
| Autoplay policies block audio | Require user gesture before TTS |
| Variable latency | Use AudioWorklet for lower latency |
Browser support:
| Browser | Support | Notes |
|---|---|---|
| Chrome | ✓ Full | Best WebRTC support |
| Firefox | ✓ Full | Good |
| Safari | ✓ Partial | Some AudioWorklet limitations |
| Edge | ✓ Full | Chromium-based |
12.2 Phone (Native)
iOS:
```swift
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playAndRecord, mode: .voiceChat)
try audioSession.setActive(true)
```
Android:
```kotlin
val audioManager = getSystemService(Context.AUDIO_SERVICE) as AudioManager
audioManager.mode = AudioManager.MODE_IN_COMMUNICATION
```
Interruption handling:
| Interruption | Behavior |
|---|---|
| Phone call | Pause session, resume after |
| Notification | Suppress during session |
| Alarm | Pause, show UI to resume |
| Other app audio | Tutor pauses or ducks |
12.3 Desktop (Native/Electron)
Advantages:
- Best audio latency
- No permission prompts after initial grant
- Background operation
- System audio integration
Considerations:
- Electron/Tauri for cross-platform
- Native builds for best performance
12.4 Hardware Requirements
| Platform | Microphone | Speaker | Bluetooth |
|---|---|---|---|
| Browser | Built-in or USB | Built-in or external | Via browser |
| Phone | Built-in | Built-in or earbuds | ✓ Native |
| Desktop | Built-in, USB, or headset | Built-in or external | ✓ |
12.5 Audio Session Management
When Tutor session is active:
- Claim audio focus — pause other audio sources
- Configure for voice — optimize for speech, not music
- Handle interruptions — pause gracefully, resume cleanly
- Manage permissions — request once, remember grant
13. Testing
13.1 STT Testing
| Test | Pass Criteria |
|---|---|
| Clean speech recognition | >95% accuracy |
| Noisy environment | >85% accuracy |
| Math vocabulary | >90% accuracy |
| Child voice | >90% accuracy |
| Accented speech | >85% accuracy |
13.2 TTS Testing
| Test | Pass Criteria |
|---|---|
| Latency (first byte) | <300ms |
| MOS (Mean Opinion Score) | >4.0/5.0 |
| Emotion appropriate | Manual review |
| Math pronunciation | Correct |
13.3 Integration Testing
| Test | Pass Criteria |
|---|---|
| End-to-end latency | <800ms |
| Interruption response | <200ms to stop |
| Session audio quality | No dropouts, echo |
| Offline fallback | Functional |
14. Metrics
14.1 Quality Metrics
| Metric | Description | Target |
|---|---|---|
| STT Word Error Rate | % words incorrect | <10% |
| TTS MOS | User rating of voice | >4.0 |
| Latency P50 | 50th percentile response time | <600ms |
| Latency P95 | 95th percentile response time | <1000ms |
14.2 Usage Metrics
| Metric | Description |
|---|---|
| voice_interactions_per_session | Count of turn-takes |
| avg_student_utterance_length | Words per student turn |
| interruption_rate | % of Tutor speech interrupted |
| stt_failure_rate | % of utterances that failed to parse |
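As an example of how one of these might be computed from session logs (the per-turn log schema here is an assumption for illustration):

```python
# Example computation of interruption_rate from per-turn session logs.
# The log schema ({"speaker": ..., "interrupted": ...}) is assumed.

def interruption_rate(turns) -> float:
    """Percentage of Tutor turns that the student interrupted."""
    tutor_turns = [t for t in turns if t["speaker"] == "tutor"]
    if not tutor_turns:
        return 0.0
    interrupted = sum(1 for t in tutor_turns if t.get("interrupted"))
    return 100.0 * interrupted / len(tutor_turns)

turns = [
    {"speaker": "tutor", "interrupted": True},
    {"speaker": "student"},
    {"speaker": "tutor", "interrupted": False},
    {"speaker": "tutor", "interrupted": False},
]
```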
Appendix A: Provider Integration
ElevenLabs Setup
```python
from elevenlabs import generate, stream

def speak(text):
    audio_stream = generate(
        text=text,
        voice="freaking_genius_tutor_v1",
        model="eleven_turbo_v2",
        stream=True,
    )
    stream(audio_stream)
```
Whisper Local Setup
```python
import whisper

model = whisper.load_model("base.en")  # or "small.en" for better quality

def transcribe(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]
```
Deepgram Streaming
```python
from deepgram import Deepgram

dg = Deepgram(API_KEY)

async def transcribe_stream(audio_stream):
    socket = await dg.transcription.live({
        "model": "nova-2",
        "language": "en-US",
        "smart_format": True,
    })
    socket.on("transcript", handle_transcript)
    # ... stream audio to socket
```
Appendix B: Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| High latency | Network or provider issue | Switch to local/faster provider |
| Poor recognition | Noise, accent, vocabulary | Boost vocab, improve preprocessing |
| Robotic voice | TTS settings | Adjust prosody, try different voice |
| Echo | No AEC | Enable echo cancellation |
| Interruption not working | VAD sensitivity | Lower threshold |
This completes Phase 2 specs. Ready for implementation.