Voice Integration Specification
Version: 0.1 (Draft)
Date: March 2026
Status: Proposal
Depends on: Tutor Behavior Spec
Used by: Tutor
1. Overview
Voice is how the Tutor speaks and listens. It must feel natural — like talking to a real person, not a voice assistant.
The voice pipeline runs on the Tutor device — which can be a phone, laptop, or browser tab. The tablet (Edge) has no audio role.
1.1 Deployment Options
| Platform | Audio API | Pros | Cons |
|---|---|---|---|
| Browser (Web) | Web Audio API, MediaDevices | No install, cross-platform | Some latency, permission prompts |
| Phone (Native) | iOS AVFoundation, Android AudioRecord | Best latency, background audio | App Store approval, separate builds |
| Desktop (Native) | Platform audio APIs | Best performance | Install friction |
| PWA | Web Audio API | Installable, offline capable | Same constraints as browser |
Recommendation: Browser-first for v1, with PWA wrapper for "installed" feel.
1.2 Design Goals
| Goal | Implication |
|---|---|
| Conversational latency | <500ms from student stops speaking → Tutor starts |
| Natural speech | Not robotic, not over-enunciated |
| Interruptible | Student can cut in anytime |
| Robust | Works with background noise, accents, kids' voices |
| Private | Audio processed locally when possible |
| Cross-platform | Same experience on phone, laptop, browser |
1.3 Non-Goals
- Wake word ("Hey Tutor") — the session starts explicitly and the Tutor listens throughout, so no wake word is needed
- Multi-speaker recognition — only student speaks to Tutor
- Background music/audio — educational context only
2. Architecture
2.1 Pipeline Overview
┌─────────────────────────────────────────────────────────────────┐
│ VOICE PIPELINE │
│ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Audio │ │ VAD │ │ STT │ │ Tutor │ │
│ │ Capture │───►│ │───►│ │───►│ Core │ │
│ │ │ │ (Voice │ │ (Speech │ │ │ │
│ │ │ │ Detect)│ │ to Text)│ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └────┬────┘ │
│ │ │
│ ▼ │
│ ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ │ Audio │ │ Mix │ │ TTS │ │ Response│ │
│ │ Output │◄───│ │◄───│ │◄───│ Gen │ │
│ │ │ │ │ │ (Text │ │ │ │
│ │ │ │ │ │ to Spch)│ │ │ │
│ └─────────┘ └─────────┘ └─────────┘ └─────────┘ │
│ │
└─────────────────────────────────────────────────────────────────┘
2.2 Component Responsibilities
| Component | Responsibility |
|---|---|
| Audio Capture | Microphone input, noise suppression |
| VAD | Detect speech start/end |
| STT | Convert speech to text |
| Tutor Core | Decide what to say (see Tutor Behavior Spec) |
| Response Gen | Generate response text |
| TTS | Convert text to speech |
| Mix | Handle interruption, volume |
| Audio Output | Speaker output |
3. Speech-to-Text (STT)
3.1 Requirements
| Requirement | Target |
|---|---|
| Latency (end of speech → text) | <500ms |
| Word Error Rate | <10% |
| Streaming | Yes (partial results) |
| Languages | English (v1), Hebrew (v1.1) |
| Speaker profile | Children ages 10-15 |
| Vocabulary | Math terms, numbers, variables |
3.2 Provider Options
| Provider | Latency | Accuracy | Cost | Offline |
|---|---|---|---|---|
| Whisper (local) | 300ms | High | Free | Yes |
| Deepgram | 200ms | High | $0.0043/min | No |
| Google STT | 250ms | High | $0.006/min | No |
| Azure STT | 300ms | High | $0.016/min | No |
| OpenAI Whisper API | 500ms | Highest | $0.006/min | No |
Recommendation: Whisper (local) for privacy and cost. Fall back to Deepgram on devices where local inference is too slow.
3.3 Streaming STT
STT provides partial results while student speaks:
Time Audio Partial Results
──── ───── ───────────────
0.0s "What..."
0.3s "What"
0.5s "...should..."
0.8s "What should"
1.0s "...I do..."
1.3s "What should I do"
1.5s "...first?"
2.0s (silence detected)
2.2s "What should I do first?" [FINAL]
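A minimal consumer of these partial results keeps the latest hypothesis and latches the transcript once the provider marks it final. This is an illustrative sketch, not any provider's actual API:

```python
# Illustrative accumulator for streaming STT partials: keep the latest
# hypothesis, latch the final transcript when the provider flags it.

class PartialTranscript:
    def __init__(self):
        self.latest = ""   # most recent (possibly partial) hypothesis
        self.final = None  # set once a result is marked final

    def on_result(self, text: str, is_final: bool) -> bool:
        """Update state; return True once the utterance is complete."""
        self.latest = text
        if is_final:
            self.final = text
        return is_final

buf = PartialTranscript()
buf.on_result("What", False)
buf.on_result("What should I do", False)
done = buf.on_result("What should I do first?", True)
print(done, buf.final)  # → True What should I do first?
```

The partials are useful for showing a live caption while the student speaks; only the final transcript is handed to the Tutor Core.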
3.4 Math Vocabulary
Custom vocabulary boost for:
Numbers: zero through twenty, hundred, thousand
Variables: x, y, z, n, a, b, c
Operations: plus, minus, times, divided by, equals, squared, cubed
Terms: equation, expression, fraction, numerator, denominator
exponent, coefficient, variable, constant, term
positive, negative, both sides, isolate, solve
3.5 Handling Math Speech
| Student says | Interpretation |
|---|---|
| "three x" | 3x |
| "x squared" | x² |
| "two over three" | 2/3 |
| "negative five" | -5 |
| "equals" | = |
| "open paren" | ( |
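A toy normalizer for the mappings above; the word lists and the no-space joining rule are illustrative assumptions, not the production grammar:

```python
# Hypothetical spoken-math normalizer mirroring the table above.
# Word lists and joining behavior are simplified for illustration.

NUMBER_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}
SPOKEN_MATH = {
    "plus": "+", "minus": "-", "times": "*", "over": "/",
    "equals": "=", "squared": "²", "cubed": "³", "negative": "-",
}

def normalize_math_speech(utterance: str) -> str:
    """Map spoken math words to notation, then join tokens without spaces."""
    tokens = [
        NUMBER_WORDS.get(w) or SPOKEN_MATH.get(w) or w
        for w in utterance.lower().replace(",", "").split()
    ]
    return "".join(tokens)

print(normalize_math_speech("three x"))         # → 3x
print(normalize_math_speech("two over three"))  # → 2/3
print(normalize_math_speech("negative five"))   # → -5
```

A real implementation would also handle multi-word tokens ("open paren"), superscript placement, and ambiguity ("x minus five" vs. "x negative five").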
4. Text-to-Speech (TTS)
4.1 Requirements
| Requirement | Target |
|---|---|
| Latency (text → first audio) | <300ms |
| Naturalness | Conversational, not robotic |
| Streaming | Yes (start before full synthesis) |
| Interruptible | Stop immediately on student speech |
| Emotion | Warm, encouraging, variable |
| Languages | English (v1), Hebrew (v1.1) |
4.2 Provider Options
| Provider | Latency | Quality | Cost | Emotion |
|---|---|---|---|---|
| ElevenLabs | 200ms | Excellent | $0.30/1K chars | Yes |
| OpenAI TTS | 300ms | Good | $0.015/1K chars | Limited |
| Google TTS | 150ms | Good | $0.016/1K chars | Limited |
| Azure TTS | 200ms | Good | $0.016/1K chars | Yes |
| Coqui (local) | 400ms | Medium | Free | Limited |
Recommendation: ElevenLabs for quality and emotion. Voice cloning for consistent persona.
4.3 Voice Configuration
```json
{
  "voice_id": "freaking_genius_tutor_v1",
  "model": "eleven_turbo_v2",
  "settings": {
    "stability": 0.65,
    "similarity_boost": 0.75,
    "style": 0.35,
    "use_speaker_boost": true
  },
  "generation_config": {
    "optimize_streaming_latency": 3,
    "output_format": "mp3_44100_128"
  }
}
```
4.4 Prosody Control
Tune speech based on context:
| Context | Adjustment |
|---|---|
| Question | Upward inflection at end |
| Encouragement | Warmer, slightly higher energy |
| Correction | Neutral, steady |
| Excitement (correct!) | Higher energy, faster |
| Thinking/Hinting | Slower, contemplative |
4.5 Speech Synthesis Markup (SSML)
For nuanced control:
```xml
<speak>
  <prosody rate="95%">
    What happens <break time="300ms"/>
    when you move the five
    <emphasis level="moderate">to the other side</emphasis>?
  </prosody>
</speak>
```
5. Voice Activity Detection (VAD)
5.1 VAD Behavior
SILENCE SPEECH SILENCE
Audio: ─────────────────│███████████████│─────────────────
│ │
▼ ▼
State: NOT_SPEAKING → SPEAKING → TRAILING → NOT_SPEAKING
│ │
Events: speech_start speech_end (after 1s silence)
5.2 VAD Parameters
| Parameter | Value | Description |
|---|---|---|
| speech_threshold | 0.5 | Probability threshold for speech |
| silence_duration_ms | 1000 | Silence before speech_end |
| min_speech_duration_ms | 200 | Ignore very short sounds |
| padding_ms | 300 | Include audio before/after |
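These parameters can be sketched as a small frame-level state machine. The 100 ms frame hop is an assumption of this sketch, and padding_ms (pre/post roll for the STT buffer) is omitted for brevity:

```python
# Frame-level VAD state machine using the parameters above.
# FRAME_MS is an assumed hop size; padding is omitted for brevity.

SPEECH_THRESHOLD = 0.5
SILENCE_DURATION_MS = 1000
MIN_SPEECH_DURATION_MS = 200
FRAME_MS = 100

class VAD:
    def __init__(self):
        self.speaking = False
        self.speech_ms = 0
        self.silence_ms = 0

    def process(self, speech_prob: float):
        """Consume one frame's speech probability; emit an event or None."""
        if speech_prob >= SPEECH_THRESHOLD:
            self.silence_ms = 0
            self.speech_ms += FRAME_MS
            if not self.speaking and self.speech_ms >= MIN_SPEECH_DURATION_MS:
                self.speaking = True
                return "speech_start"
        else:
            self.speech_ms = 0  # a short blip never reaches min duration
            if self.speaking:
                self.silence_ms += FRAME_MS
                if self.silence_ms >= SILENCE_DURATION_MS:
                    self.speaking = False
                    self.silence_ms = 0
                    return "speech_end"
        return None

vad = VAD()
events = [vad.process(p) for p in [0.9, 0.9] + [0.1] * 10]
print(events[1], events[11])  # → speech_start speech_end
```

Note how min_speech_duration_ms suppresses one-frame noises: speech_start only fires after 200 ms of sustained speech probability above threshold.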
5.3 Noise Handling
- Apply noise suppression before VAD
- Calibrate to ambient noise level on session start
- Adapt threshold during session
6. Turn-Taking
6.1 Turn States
┌─────────────────────────────────────────────────────────────────┐
│ TURN-TAKING │
│ │
│ ┌───────────┐ Student speaks ┌───────────┐ │
│ │ TUTOR │ ───────────────────────►│ STUDENT │ │
│ │ TURN │ │ TURN │ │
│ │ │◄─────────────────────────│ │ │
│ │ (speaking │ Tutor responds │ (speaking │ │
│ │ or │ │ or │ │
│ │ silent) │◄─────────────────────────│ writing) │ │
│ └───────────┘ Student finishes, └───────────┘ │
│ Tutor decides to │
│ speak │
│ │
└─────────────────────────────────────────────────────────────────┘
6.2 Interruption Handling
When student speaks while Tutor is speaking:
Tutor speaking: "When you move the five to the other side—"
│
Student: "Wait, what?" │
│
1. TTS stops immediately ◄──────────────────────────┘
2. STT processes student speech
3. Tutor responds to interruption
Tutor: "Go ahead, what's your question?"
6.3 Barge-In Sensitivity
| Student sound | Action |
|---|---|
| Clear speech | Stop TTS, process |
| "mm-hmm", "uh-huh" | Continue TTS (acknowledgment) |
| Cough, noise | Continue TTS (ignore) |
| "Wait" / "Hold on" | Stop TTS, wait |
Detect backchannel vs. interruption using:
- Duration (backchannels are short)
- Prosody (backchannels have characteristic pattern)
- Keywords ("wait", "hold on", "but" = interrupt)
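A heuristic sketch of this classification; the word lists and the 600 ms duration cutoff are illustrative assumptions, and the prosody cue is omitted:

```python
# Heuristic barge-in classifier using the duration and keyword cues
# above. Word lists and the duration cutoff are illustrative only.

BACKCHANNELS = {"mm-hmm", "uh-huh", "yeah", "ok", "okay", "right"}
INTERRUPT_KEYWORDS = ("wait", "hold on", "but")

def classify_barge_in(text: str, duration_ms: int) -> str:
    """Return 'interrupt' (stop TTS) or 'backchannel' (keep speaking)."""
    lowered = text.lower().strip("?!. ")
    if any(kw in lowered for kw in INTERRUPT_KEYWORDS):
        return "interrupt"
    if duration_ms < 600 and lowered in BACKCHANNELS:
        return "backchannel"
    # When unsure, yield the floor: stopping is the safer default.
    return "interrupt"

print(classify_barge_in("mm-hmm", 300))      # → backchannel
print(classify_barge_in("Wait, what?", 500)) # → interrupt
```

Defaulting to "interrupt" on ambiguous input is a deliberate choice here: a Tutor that occasionally stops unnecessarily feels more polite than one that talks over the student.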
7. Audio Processing
7.1 Input Processing
Microphone → Noise Suppression → AGC → VAD → STT
│
└── Echo Cancellation (if Tutor playing)
| Stage | Purpose |
|---|---|
| Noise Suppression | Remove background noise |
| AGC | Normalize volume levels |
| Echo Cancellation | Remove Tutor's voice from mic |
| VAD | Detect speech boundaries |
7.2 Output Processing
TTS → Volume Normalization → Ducking (if student speaks) → Speaker
| Stage | Purpose |
|---|---|
| Volume Normalization | Consistent loudness |
| Ducking | Lower Tutor volume if student starts speaking |
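Ducking can be sketched as a gain ramp toward a target level; the duck gain of 0.3 and ramp step of 0.1 are illustrative values, not spec requirements:

```python
# Ducking sketch for the output chain above: ramp the Tutor's gain
# down while the student speaks, back up when they stop.
# DUCK_GAIN and RAMP_STEP are assumed values for illustration.

DUCK_GAIN = 0.3   # Tutor volume while the student is speaking
RAMP_STEP = 0.1   # gain change per audio callback (avoids clicks)

def next_gain(current: float, student_speaking: bool) -> float:
    """Advance the output gain one ramp step toward its target."""
    target = DUCK_GAIN if student_speaking else 1.0
    if current < target:
        return min(current + RAMP_STEP, target)
    return max(current - RAMP_STEP, target)
```

Applying this once per audio callback gives a short fade rather than an abrupt volume drop, which sounds far less jarring mid-sentence.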
7.3 Audio Format
| Parameter | Input | Output |
|---|---|---|
| Sample Rate | 16kHz | 44.1kHz |
| Channels | Mono | Mono |
| Bit Depth | 16-bit | 16-bit |
| Format | PCM | MP3/AAC |
8. Latency Budget
8.1 End-to-End Target
Student finishes speaking → Tutor starts responding: <800ms
8.2 Budget Breakdown
| Stage | Budget | Notes |
|---|---|---|
| VAD speech_end detection | 100ms | After 1s silence |
| STT final transcription | 200ms | Streaming helps |
| Tutor decision | 100ms | Usually fast |
| TTS first audio | 200ms | Streaming synthesis |
| Audio output start | 50ms | Buffer management |
| Total | 650ms | Within budget |
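A quick arithmetic check that the stage budgets above stay inside the 800 ms end-to-end target:

```python
# Stage budgets from the table above; the sum leaves 150 ms of slack
# against the 800 ms end-to-end target.
budget_ms = {
    "vad_speech_end_detection": 100,
    "stt_final_transcription": 200,
    "tutor_decision": 100,
    "tts_first_audio": 200,
    "audio_output_start": 50,
}
total_ms = sum(budget_ms.values())
print(total_ms)  # → 650
```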
8.3 Perceived Latency
Even if processing takes time, Tutor can:
- Use filler: "Hmm..." "Let's see..." while thinking
- Start TTS with common phrase while generating rest
9. Language Support
9.1 Initial Languages
| Language | STT | TTS | Timeline |
|---|---|---|---|
| English (US) | v1 | v1 | Launch |
| English (UK) | v1 | v1 | Launch |
| Hebrew | v1.1 | v1.1 | +3 months |
9.2 Language Detection
- Set per-student profile (not auto-detected)
- Support code-switching within session (future)
9.3 Accent Handling
Train/fine-tune STT for:
- Children's voices (higher pitch, less clear articulation)
- Regional accents (Israeli English, British variations)
- Math-specific pronunciations
10. Offline Capability
10.1 Offline Mode
When internet unavailable:
| Component | Offline Option |
|---|---|
| STT | Whisper (on-device) |
| TTS | Coqui / System TTS |
| Quality | Degraded but functional |
10.2 Fallback Behavior
- Detect connectivity loss
- Switch to offline models
- Notify student: "I'm having trouble connecting. I'll do my best."
- Resume cloud services when available
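The fallback steps above can be sketched as a provider switcher; the class, the stand-in provider strings, and the connectivity probe are all hypothetical:

```python
# Sketch of the cloud → local fallback described above. Provider
# objects are stand-in strings; the connectivity probe is a
# placeholder callable supplied by the platform layer.

class VoicePipeline:
    def __init__(self, cloud_stt, local_stt, is_online):
        self.cloud_stt = cloud_stt
        self.local_stt = local_stt
        self.is_online = is_online  # callable returning bool
        self.offline = False

    def active_stt(self):
        """Pick the STT backend, announcing a switch to offline mode."""
        was_offline = self.offline
        self.offline = not self.is_online()
        if self.offline and not was_offline:
            print("I'm having trouble connecting. I'll do my best.")
        return self.local_stt if self.offline else self.cloud_stt

connectivity = iter([True, False, True])
pipeline = VoicePipeline("cloud-stt", "whisper-local",
                         lambda: next(connectivity))
print(pipeline.active_stt())  # → cloud-stt
pipeline.active_stt()         # prints the notice, returns whisper-local
```

The student-facing notice fires only on the transition into offline mode, and cloud service resumes automatically on the next successful connectivity check.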
11. Privacy & Security
11.1 Audio Data Handling
| Data | Storage | Retention |
|---|---|---|
| Raw audio | Not stored | — |
| Transcriptions | Session only | Cleared on session end |
| Anonymized samples | Research opt-in | 90 days |
11.2 Privacy Modes
| Mode | Behavior |
|---|---|
| Standard | Cloud STT/TTS, no audio stored |
| Privacy | On-device only, no cloud |
| Research | Opt-in audio sampling for improvement |
11.3 Compliance
- COPPA compliant (parental consent for minors)
- GDPR compliant (data minimization, deletion rights)
- Audio encrypted in transit (TLS)
12. Platform Integration
12.1 Browser (Web)
Audio capture:
```javascript
// Request microphone access
const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
const audioContext = new AudioContext();
const source = audioContext.createMediaStreamSource(stream);
```
Constraints:
| Constraint | Mitigation |
|---|---|
| Permission prompt on first use | Clear UX explaining why mic is needed |
| No background audio when tab hidden | Keep tab visible, or use PWA |
| Autoplay policies block audio | Require user gesture before TTS |
| Variable latency | Use AudioWorklet for lower latency |
Browser support:
| Browser | Support | Notes |
|---|---|---|
| Chrome | ✓ Full | Best WebRTC support |
| Firefox | ✓ Full | Good |
| Safari | ✓ Partial | Some AudioWorklet limitations |
| Edge | ✓ Full | Chromium-based |
12.2 Phone (Native)
iOS:
```swift
let audioSession = AVAudioSession.sharedInstance()
try audioSession.setCategory(.playAndRecord, mode: .voiceChat)
try audioSession.setActive(true)
```
Android:
```kotlin
val audioManager = getSystemService(Context.AUDIO_SERVICE) as AudioManager
audioManager.mode = AudioManager.MODE_IN_COMMUNICATION
```
Interruption handling:
| Interruption | Behavior |
|---|---|
| Phone call | Pause session, resume after |
| Notification | Suppress during session |
| Alarm | Pause, show UI to resume |
| Other app audio | Tutor pauses or ducks |
12.3 Desktop (Native/Electron)
Advantages:
- Best audio latency
- No permission prompts after initial grant
- Background operation
- System audio integration
Considerations:
- Electron/Tauri for cross-platform
- Native builds for best performance
12.4 Hardware Requirements
| Platform | Microphone | Speaker | Bluetooth |
|---|---|---|---|
| Browser | Built-in or USB | Built-in or external | Via browser |
| Phone | Built-in | Built-in or earbuds | ✓ Native |
| Desktop | Built-in, USB, or headset | Built-in or external | ✓ |
12.5 Audio Session Management
When Tutor session is active:
- Claim audio focus — pause other audio sources
- Configure for voice — optimize for speech, not music
- Handle interruptions — pause gracefully, resume cleanly
- Manage permissions — request once, remember grant
13. Testing
13.1 STT Testing
| Test | Pass Criteria |
|---|---|
| Clean speech recognition | >95% accuracy |
| Noisy environment | >85% accuracy |
| Math vocabulary | >90% accuracy |
| Child voice | >90% accuracy |
| Accented speech | >85% accuracy |
13.2 TTS Testing
| Test | Pass Criteria |
|---|---|
| Latency (first byte) | <300ms |
| MOS (Mean Opinion Score) | >4.0/5.0 |
| Emotion appropriate | Manual review |
| Math pronunciation | Correct |
13.3 Integration Testing
| Test | Pass Criteria |
|---|---|
| End-to-end latency | <800ms |
| Interruption response | <200ms to stop |
| Session audio quality | No dropouts, echo |
| Offline fallback | Functional |
14. Metrics
14.1 Quality Metrics
| Metric | Description | Target |
|---|---|---|
| STT Word Error Rate | % words incorrect | <10% |
| TTS MOS | User rating of voice | >4.0 |
| Latency P50 | 50th percentile response time | <600ms |
| Latency P95 | 95th percentile response time | <1000ms |
14.2 Usage Metrics
| Metric | Description |
|---|---|
| voice_interactions_per_session | Count of turn-takes |
| avg_student_utterance_length | Words per student turn |
| interruption_rate | % of Tutor speech interrupted |
| stt_failure_rate | % of utterances that failed to parse |
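As an example of how one of these might be computed from session logs (the per-turn log schema here is an assumption for illustration):

```python
# Example computation of interruption_rate from per-turn session logs.
# The log schema ({"speaker": ..., "interrupted": ...}) is assumed.

def interruption_rate(turns) -> float:
    """Percentage of Tutor turns that the student interrupted."""
    tutor_turns = [t for t in turns if t["speaker"] == "tutor"]
    if not tutor_turns:
        return 0.0
    interrupted = sum(1 for t in tutor_turns if t.get("interrupted"))
    return 100.0 * interrupted / len(tutor_turns)

turns = [
    {"speaker": "tutor", "interrupted": True},
    {"speaker": "student"},
    {"speaker": "tutor", "interrupted": False},
    {"speaker": "tutor", "interrupted": False},
]
```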
Appendix A: Provider Integration
ElevenLabs Setup
```python
from elevenlabs import generate, stream

def speak(text):
    audio_stream = generate(
        text=text,
        voice="freaking_genius_tutor_v1",
        model="eleven_turbo_v2",
        stream=True,
    )
    stream(audio_stream)
```
Whisper Local Setup
```python
import whisper

model = whisper.load_model("base.en")  # or "small.en" for better quality

def transcribe(audio_path):
    result = model.transcribe(audio_path)
    return result["text"]
```
Deepgram Streaming
```python
from deepgram import Deepgram

dg = Deepgram(API_KEY)

async def transcribe_stream(audio_stream):
    socket = await dg.transcription.live({
        "model": "nova-2",
        "language": "en-US",
        "smart_format": True,
    })
    socket.on("transcript", handle_transcript)
    # ... stream audio to socket
```
Appendix B: Troubleshooting
| Issue | Diagnosis | Fix |
|---|---|---|
| High latency | Network or provider issue | Switch to local/faster provider |
| Poor recognition | Noise, accent, vocabulary | Boost vocab, improve preprocessing |
| Robotic voice | TTS settings | Adjust prosody, try different voice |
| Echo | No AEC | Enable echo cancellation |
| Interruption not working | VAD sensitivity | Lower threshold |
This completes Phase 2 specs. Ready for implementation.