
Vision API Specification

Version: 0.1 (Draft)
Date: March 2026
Status: Proposal
Depends on: Stroke Format Spec
Used by: Tutor


1. Overview

Vision is the component that transforms raw strokes into structured mathematical understanding. It answers: "What did the student write, and what does it mean?"

Vision is NOT:

1.1 Design Goals

Goal                       | Implication
Mathematical understanding | Parse structure (equations, steps), not just symbols
Work-in-progress parsing   | Understand incomplete work as student writes
Error localization         | Identify where in the work an error occurred
Low latency                | <500ms for incremental updates
Confidence scoring         | Know when interpretation is uncertain

1.2 Scope

Vision handles:

Vision does NOT handle (yet):


2. Architecture

2.1 Vision Pipeline

┌─────────────────────────────────────────────────────────────────┐
│                       VISION PIPELINE                           │
│                                                                 │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐     │
│  │ Stroke  │    │ Symbol  │    │  Math   │    │  Work   │     │
│  │ Groups  │───►│ Recog   │───►│ Parser  │───►│ Analyzer│     │
│  │         │    │         │    │         │    │         │     │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘     │
│                                                                 │
│  "These strokes   "That's a    "It's the     "They're         │
│   belong together" 3, x, +"     equation      isolating x"     │
│                                  3x + 5 = 14"                   │
└─────────────────────────────────────────────────────────────────┘

2.2 Processing Stages

Stage              | Input               | Output
Stroke Grouping    | Raw strokes         | Logical groups (symbols, expressions)
Symbol Recognition | Stroke groups       | Characters with confidence
Math Parsing       | Symbols + positions | Structured math (AST)
Work Analysis      | Math AST over time  | Steps, intent, errors

2.3 Deployment Options

Option                   | Latency | Accuracy | Cost
On-device (Tutor device) | ~100ms  | Medium   | Free
Cloud API                | ~300ms  | High     | Per-call
Hybrid                   | ~150ms  | High     | Reduced

Recommended: Hybrid — fast on-device for incremental updates, cloud for full analysis.

Note: On-device performance varies by platform. Native phone apps can use optimized ML frameworks (CoreML, TensorFlow Lite). Browser-based Tutor may need to rely more on cloud processing.
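
The hybrid split could be expressed as a small routing rule. The following is a minimal sketch, not part of the spec: the function name `choose_backend`, the platform labels, and the set of platforms with an on-device model are assumptions made for illustration. Only the method names `vision.update` and `vision.parse` come from section 4.

```python
# Platforms assumed (hypothetically) to ship an optimized on-device model.
ON_DEVICE_PLATFORMS = {"ios", "android"}


def choose_backend(method: str, platform: str) -> str:
    """Pick where a Vision call runs under the hybrid deployment option."""
    if method == "vision.update" and platform in ON_DEVICE_PLATFORMS:
        return "on_device"  # incremental updates favor low latency
    return "cloud"          # full analysis, or no on-device model available


print(choose_backend("vision.update", "ios"))      # on_device
print(choose_backend("vision.parse", "ios"))       # cloud
print(choose_backend("vision.update", "browser"))  # cloud
```

This sketch always sends browser traffic to the cloud, which is a simplification of "may need to rely more on cloud processing".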


3. Data Model

3.1 Symbol

A recognized character or mathematical symbol.

{
  "id": "sym_001",
  "value": "3",
  "type": "DIGIT",
  "confidence": 0.95,
  "strokeIds": ["str_001", "str_002"],
  "bounds": {"x": 0.12, "y": 0.35, "w": 0.03, "h": 0.05},
  "alternatives": [
    {"value": "8", "confidence": 0.03},
    {"value": "5", "confidence": 0.02}
  ]
}

Field        | Type   | Description
id           | string | Unique identifier
value        | string | Recognized value
type         | enum   | Symbol category (see 3.2)
confidence   | float  | 0.0 - 1.0
strokeIds    | array  | Source strokes
bounds       | object | Bounding box
alternatives | array  | Other possible interpretations
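
As a rough illustration of how a consumer might load this structure, the sketch below parses a Symbol JSON object into a Python dataclass. The names `Symbol` and `parse_symbol` are assumptions for illustration; the spec does not define a client library.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Symbol:
    """Mirror of the Symbol fields listed in section 3.1."""
    id: str
    value: str
    type: str
    confidence: float
    stroke_ids: List[str]
    bounds: Dict[str, float]
    alternatives: List[dict] = field(default_factory=list)


def parse_symbol(raw: str) -> Symbol:
    """Parse one Symbol JSON object and reject out-of-range confidence."""
    data = json.loads(raw)
    if not 0.0 <= data["confidence"] <= 1.0:
        raise ValueError("confidence must be within 0.0 - 1.0")
    return Symbol(
        id=data["id"],
        value=data["value"],
        type=data["type"],
        confidence=data["confidence"],
        stroke_ids=data["strokeIds"],
        bounds=data["bounds"],
        alternatives=data.get("alternatives", []),
    )


EXAMPLE = """
{
  "id": "sym_001", "value": "3", "type": "DIGIT", "confidence": 0.95,
  "strokeIds": ["str_001", "str_002"],
  "bounds": {"x": 0.12, "y": 0.35, "w": 0.03, "h": 0.05},
  "alternatives": [{"value": "8", "confidence": 0.03},
                   {"value": "5", "confidence": 0.02}]
}
"""

symbol = parse_symbol(EXAMPLE)
print(symbol.value, symbol.confidence, symbol.alternatives[0]["value"])
```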

3.2 Symbol Types

enum SymbolType {
  // Numerals
  DIGIT,           // 0-9
  DECIMAL_POINT,   // .
  
  // Variables
  VARIABLE,        // x, y, n, etc.
  
  // Operators
  PLUS,            // +
  MINUS,           // - (as operator)
  MULTIPLY,        // × or *
  DIVIDE,          // ÷ or /
  EQUALS,          // =
  NOT_EQUALS,      // ≠
  LESS_THAN,       // <
  GREATER_THAN,    // >
  
  // Grouping
  PAREN_OPEN,      // (
  PAREN_CLOSE,     // )
  BRACKET_OPEN,    // [
  BRACKET_CLOSE,   // ]
  
  // Special
  FRACTION_BAR,    // horizontal line in fraction
  EXPONENT,        // superscript indicator
  SQRT,            // √
  NEGATIVE_SIGN,   // - (as sign, not operator)
  
  // Other
  UNKNOWN,         // Unrecognized
  SCRATCH,         // Crossed out / scribble
}

3.3 Expression

A mathematical expression parsed from symbols.

{
  "id": "expr_001",
  "latex": "3x + 5",
  "tree": {
    "type": "add",
    "left": {
      "type": "multiply",
      "left": {"type": "number", "value": 3},
      "right": {"type": "variable", "name": "x"}
    },
    "right": {"type": "number", "value": 5}
  },
  "symbols": ["sym_001", "sym_002", "sym_003", "sym_004"],
  "bounds": {"x": 0.10, "y": 0.34, "w": 0.15, "h": 0.06},
  "confidence": 0.92
}
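
The `tree` field can be walked recursively. The sketch below evaluates such a tree for a given variable value, assuming only the four node types that appear in the example (`add`, `multiply`, `number`, `variable`). The function name `evaluate` is hypothetical, and other node types are rejected because the spec does not list them.

```python
from fractions import Fraction


def evaluate(node: dict, env: dict) -> Fraction:
    """Recursively evaluate an expression tree using the variable values in env."""
    kind = node["type"]
    if kind == "number":
        return Fraction(node["value"])
    if kind == "variable":
        return Fraction(env[node["name"]])
    if kind == "add":
        return evaluate(node["left"], env) + evaluate(node["right"], env)
    if kind == "multiply":
        return evaluate(node["left"], env) * evaluate(node["right"], env)
    raise ValueError(f"unsupported node type: {kind}")


# The tree for 3x + 5 from the example above.
tree = {
    "type": "add",
    "left": {
        "type": "multiply",
        "left": {"type": "number", "value": 3},
        "right": {"type": "variable", "name": "x"},
    },
    "right": {"type": "number", "value": 5},
}

value = evaluate(tree, {"x": 3})
print(value)  # 14
```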

3.4 Equation

An equation (expression = expression).

{
  "id": "eq_001",
  "type": "equation",
  "left": { /* expression */ },
  "right": { /* expression */ },
  "latex": "3x + 5 = 14",
  "confidence": 0.91
}

3.5 WorkStep

A single step in the student's work.

{
  "id": "step_001",
  "index": 0,
  "content": { /* equation or expression */ },
  "operation": null,
  "bounds": {"x": 0.05, "y": 0.30, "w": 0.40, "h": 0.08},
  "timestamp": 1711670500000
}

For subsequent steps:

{
  "id": "step_002",
  "index": 1,
  "content": { /* equation: 3x = 9 */ },
  "operation": {
    "type": "subtract_both_sides",
    "operand": {"type": "number", "value": 5}
  },
  "valid": true,
  "bounds": {"x": 0.05, "y": 0.40, "w": 0.35, "h": 0.08},
  "timestamp": 1711670510000
}

3.6 Work

Complete analysis of the student's work on one exercise.

{
  "exerciseId": "ex_4521",
  "problem": { /* original equation */ },
  "steps": [ /* array of WorkStep */ ],
  "answer": {
    "expression": {"type": "number", "value": 3},
    "latex": "x = 3",
    "zoneId": "answer_zone"
  },
  "status": "complete",
  "errors": [],
  "confidence": 0.89
}

4. API Interface

4.1 Incremental Update

Called on each stroke batch (every 200ms during active writing).

Request:

{
  "method": "vision.update",
  "params": {
    "sessionId": "sess_xyz",
    "exerciseId": "ex_4521",
    "strokes": [ /* new strokes only */ ],
    "context": {
      "problem": "3x + 5 = 14",
      "expectedAnswer": "x = 3",
      "previousParsed": { /* last Work object */ }
    }
  }
}

Response:

{
  "result": {
    "symbols": [ /* newly recognized symbols */ ],
    "expressions": [ /* updated expressions */ ],
    "work": { /* full Work object */ },
    "changes": [
      {"type": "symbol_added", "symbolId": "sym_042"},
      {"type": "step_updated", "stepId": "step_002"}
    ],
    "parseTime": 145
  }
}
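
A client-side sketch of this exchange, under stated assumptions: the helper names `build_update_request` and `changed_ids` are hypothetical, stroke objects are passed through untouched (their shape belongs to the Stroke Format Spec), and sending `null` for `previousParsed` on the first call is a guess rather than something the spec states.

```python
import json
from typing import List, Optional


def build_update_request(session_id: str, exercise_id: str, new_strokes: List[dict],
                         problem: str, expected_answer: str,
                         previous_work: Optional[dict] = None) -> str:
    """Serialize a vision.update call carrying only the new strokes."""
    request = {
        "method": "vision.update",
        "params": {
            "sessionId": session_id,
            "exerciseId": exercise_id,
            "strokes": new_strokes,
            "context": {
                "problem": problem,
                "expectedAnswer": expected_answer,
                "previousParsed": previous_work,
            },
        },
    }
    return json.dumps(request)


def changed_ids(response: dict) -> dict:
    """Group the ids listed in result.changes by change type."""
    grouped = {}
    for change in response["result"]["changes"]:
        ids = [value for key, value in change.items() if key != "type"]
        grouped.setdefault(change["type"], []).extend(ids)
    return grouped


payload = build_update_request("sess_xyz", "ex_4521", [], "3x + 5 = 14", "x = 3")
response = {"result": {"changes": [
    {"type": "symbol_added", "symbolId": "sym_042"},
    {"type": "step_updated", "stepId": "step_002"},
]}}
print(changed_ids(response))
```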

4.2 Full Parse

Called when a complete analysis is needed (e.g., on "submit").

Request:

{
  "method": "vision.parse",
  "params": {
    "sessionId": "sess_xyz",
    "exerciseId": "ex_4521",
    "allStrokes": [ /* all strokes */ ],
    "context": {
      "problem": "3x + 5 = 14",
      "expectedAnswer": "x = 3"
    }
  }
}

Response:

{
  "result": {
    "work": { /* complete Work object */ },
    "evaluation": {
      "answerCorrect": true,
      "workValid": true,
      "stepsAnalysis": [
        {"stepIndex": 0, "valid": true, "operation": "given"},
        {"stepIndex": 1, "valid": true, "operation": "subtract_both_sides"},
        {"stepIndex": 2, "valid": true, "operation": "divide_both_sides"}
      ]
    },
    "confidence": 0.93,
    "parseTime": 320
  }
}

4.3 Error Detection

Called periodically, or when an error is suspected.

Request:

{
  "method": "vision.checkErrors",
  "params": {
    "sessionId": "sess_xyz",
    "work": { /* current Work object */ }
  }
}

Response:

{
  "result": {
    "errors": [
      {
        "type": "SIGN_ERROR",
        "stepIndex": 1,
        "location": {"symbolIds": ["sym_025"]},
        "expected": "+",
        "actual": "-",
        "message": "Sign should flip when moving to other side",
        "confidence": 0.87
      }
    ]
  }
}

4.4 Symbol Clarification

When confidence is low, Vision can request clarification.

Response with clarification needed:

{
  "result": {
    "work": { /* partial */ },
    "clarificationNeeded": [
      {
        "symbolId": "sym_023",
        "bounds": {"x": 0.35, "y": 0.42, "w": 0.03, "h": 0.05},
        "candidates": [
          {"value": "6", "confidence": 0.45},
          {"value": "0", "confidence": 0.40},
          {"value": "9", "confidence": 0.10}
        ],
        "contextHint": "Expecting a single digit here"
      }
    ]
  }
}

Tutor can then ask: "Is that a 6 or a 0?"
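
One way the Tutor might turn a `clarificationNeeded` entry into that question is sketched below. The function name `clarification_question` and the choice to offer only the two most likely candidates are assumptions; wording rules such as "an 8" are not handled.

```python
def clarification_question(entry: dict, max_options: int = 2) -> str:
    """Build a spoken question from the highest-confidence candidates."""
    ranked = sorted(entry["candidates"], key=lambda c: c["confidence"], reverse=True)
    options = [candidate["value"] for candidate in ranked[:max_options]]
    return "Is that a " + " or a ".join(options) + "?"


entry = {
    "symbolId": "sym_023",
    "candidates": [
        {"value": "6", "confidence": 0.45},
        {"value": "0", "confidence": 0.40},
        {"value": "9", "confidence": 0.10},
    ],
}
question = clarification_question(entry)
print(question)  # Is that a 6 or a 0?
```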


5. Recognition Details

5.1 Stroke Grouping

Strokes are grouped based on:

Factor              | Weight | Description
Temporal proximity  | High   | Strokes within 500ms likely same symbol
Spatial proximity   | High   | Strokes that overlap or touch
Structural patterns | Medium | Known multi-stroke symbols (=, +, ×)
Context             | Low    | What makes sense mathematically
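
The two high-weight factors can be sketched as a simple greedy rule: a stroke joins the current group if it starts within 500ms of the previous stroke and its bounding box touches a stroke already in the group. This is an illustrative simplification, not the grouping algorithm itself. The `Stroke` fields below are hypothetical (the real shape is defined in the Stroke Format Spec), and structural patterns and context are ignored, so a symbol like "=" whose two lines do not touch would be split by this rule alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Stroke:
    id: str
    start_ms: int
    end_ms: int
    x: float
    y: float
    w: float
    h: float


def boxes_touch(a: Stroke, b: Stroke) -> bool:
    """True when the two bounding boxes overlap or share an edge."""
    return (a.x <= b.x + b.w and b.x <= a.x + a.w and
            a.y <= b.y + b.h and b.y <= a.y + a.h)


def group_strokes(strokes: List[Stroke], max_gap_ms: int = 500) -> List[List[Stroke]]:
    """Greedy grouping by temporal and spatial proximity."""
    groups: List[List[Stroke]] = []
    for stroke in sorted(strokes, key=lambda s: s.start_ms):
        if groups:
            current = groups[-1]
            close_in_time = stroke.start_ms - current[-1].end_ms <= max_gap_ms
            close_in_space = any(boxes_touch(stroke, other) for other in current)
            if close_in_time and close_in_space:
                current.append(stroke)
                continue
        groups.append([stroke])
    return groups


strokes = [
    Stroke("plus_h", 0, 100, 0.20, 0.37, 0.03, 0.0),      # horizontal bar of "+"
    Stroke("plus_v", 250, 350, 0.215, 0.355, 0.0, 0.03),  # vertical bar of "+"
    Stroke("five", 600, 900, 0.26, 0.35, 0.03, 0.05),     # a separate "5"
]
groups = group_strokes(strokes)
print([[s.id for s in g] for g in groups])
```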

5.2 Symbol Recognition

Multi-stage recognition:

  1. Shape classification — What general shape? (loop, line, curve)
  2. Character matching — What character fits?
  3. Context refinement — What makes sense here?

Example: A loop shape could be 0, O, or o. Context (math expression) strongly suggests 0.

5.3 Spatial Parsing

Position determines meaning:

Position         | Interpretation
Superscript      | Exponent
Subscript        | Index (x₁)
Inline           | Normal symbol
Above/below line | Fraction
Small leading    | Negative sign vs. minus

5.4 Common Confusions

Pair    | Disambiguation
1, l, I | Context: numbers vs. variables
0, O, o | Math context → 0
×, x    | Operator position → ×, variable position → x
-, −    | Position: between terms → operator, before term → sign
2, z    | Typical handwriting differences
5, S    | Context: digit expected vs. variable

6. Work Analysis

6.1 Step Detection

Steps are detected by:

6.2 Operation Detection

Vision infers what operation was performed:

Pattern                                        | Detected Operation
Same terms, one moved across =                 | add/subtract_both_sides
All terms multiplied/divided by the same value | multiply/divide_both_sides
Expression simplified                          | simplify
Terms combined                                 | combine_like_terms
Distribution applied                           | distribute
Factoring applied                              | factor
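
The first two patterns can be illustrated for single-variable linear equations. In this sketch each side is reduced to a pair (a, b) meaning a*x + b. That representation, the helper names `lin` and `detect_operation`, and the restriction to constant operands are all assumptions for illustration, not the spec's detection method.

```python
from fractions import Fraction
from typing import Optional, Tuple

Linear = Tuple[Fraction, Fraction]  # (a, b) stands for a*x + b
Equation = Tuple[Linear, Linear]    # (left side, right side)


def lin(a, b) -> Linear:
    """Build one side of an equation with exact rational coefficients."""
    return Fraction(a), Fraction(b)


def detect_operation(prev: Equation, curr: Equation) -> Optional[dict]:
    """Guess which both-sides operation turns prev into curr, if any."""
    before = prev[0] + prev[1]  # (a_left, b_left, a_right, b_right)
    after = curr[0] + curr[1]

    # Same constant added to both sides: x-coefficients unchanged,
    # both constants shifted by the same non-zero amount.
    shift = after[1] - before[1]
    if (after[0] == before[0] and after[2] == before[2]
            and shift != 0 and after[3] - before[3] == shift):
        kind = "add_both_sides" if shift > 0 else "subtract_both_sides"
        return {"type": kind, "operand": abs(shift)}

    # Every coefficient scaled by the same factor (other than 0 or 1).
    pivot = next((i for i, value in enumerate(before) if value != 0), None)
    if pivot is not None:
        factor = after[pivot] / before[pivot]
        scaled = all(a == factor * b for a, b in zip(after, before))
        if factor not in (0, 1) and scaled:
            if abs(factor) < 1:
                return {"type": "divide_both_sides", "operand": 1 / factor}
            return {"type": "multiply_both_sides", "operand": factor}
    return None


step1 = (lin(3, 5), lin(0, 14))  # 3x + 5 = 14
step2 = (lin(3, 0), lin(0, 9))   # 3x = 9
step3 = (lin(1, 0), lin(0, 3))   # x = 3
print(detect_operation(step1, step2)["type"])  # subtract_both_sides
print(detect_operation(step2, step3)["type"])  # divide_both_sides
```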

6.3 Validity Checking

For each step, Vision checks:

  1. Algebraic validity — Is the transformation mathematically correct?
  2. Derivation — Does it follow from the previous step?
  3. Completeness — Are all terms accounted for?

Example result for a step that fails these checks:

{
  "stepIndex": 2,
  "valid": false,
  "error": {
    "type": "ARITHMETIC_ERROR",
    "detail": "14 - 5 = 9, not 8",
    "expected": "9",
    "actual": "8"
  }
}
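
For single-variable linear equations, one possible proxy for the derivation check is to confirm that a step has the same solution as the step before it. The sketch below does this on trees shaped like the one in section 3.3. The `subtract` node type and the function names `linear`, `solve`, and `follows_from` are assumptions, and every variable is treated as the same unknown.

```python
from fractions import Fraction


def linear(node: dict):
    """Return (a, b) with node == a*x + b; raises on unsupported trees."""
    kind = node["type"]
    if kind == "number":
        return Fraction(0), Fraction(node["value"])
    if kind == "variable":
        return Fraction(1), Fraction(0)
    if kind not in ("add", "subtract", "multiply"):
        raise ValueError(f"unsupported node type: {kind}")
    la, lb = linear(node["left"])
    ra, rb = linear(node["right"])
    if kind == "add":
        return la + ra, lb + rb
    if kind == "subtract":
        return la - ra, lb - rb
    if la != 0 and ra != 0:
        raise ValueError("product is not linear")
    return la * rb + ra * lb, lb * rb


def solve(equation: dict) -> Fraction:
    """Solve left == right for the single unknown."""
    la, lb = linear(equation["left"])
    ra, rb = linear(equation["right"])
    if la == ra:
        raise ValueError("no unique solution")
    return (rb - lb) / (la - ra)


def follows_from(previous: dict, current: dict) -> bool:
    """A step is accepted when it keeps the same solution as the previous step."""
    return solve(previous) == solve(current)


def num(value):
    return {"type": "number", "value": value}


three_x = {"type": "multiply", "left": num(3), "right": {"type": "variable", "name": "x"}}
given = {"left": {"type": "add", "left": three_x, "right": num(5)}, "right": num(14)}
good_step = {"left": three_x, "right": num(9)}  # 3x = 9
bad_step = {"left": three_x, "right": num(8)}   # 3x = 8

print(follows_from(given, good_step), follows_from(given, bad_step))  # True False
```

Comparing solutions does not say which kind of error occurred; classifying it as arithmetic, sign, or missing term would need a separate comparison of the two steps.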

7. Confidence Model

7.1 Confidence Levels

Level    | Range          | Meaning            | Action
High     | 0.85 and above | Very confident     | Proceed normally
Medium   | 0.60 to <0.85  | Somewhat confident | Proceed with caution
Low      | 0.40 to <0.60  | Uncertain          | May need clarification
Very Low | <0.40          | Guessing           | Request clarification
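
A minimal sketch of how a consumer might map a score to these levels. The function name `classify_confidence` is hypothetical, and treating each lower bound as inclusive (so 0.85 is High and 0.60 is Medium) is an assumption about how boundary values should be read.

```python
# (inclusive lower bound, level, action), highest level first
LEVELS = [
    (0.85, "High", "Proceed normally"),
    (0.60, "Medium", "Proceed with caution"),
    (0.40, "Low", "May need clarification"),
    (0.00, "Very Low", "Request clarification"),
]


def classify_confidence(confidence: float):
    """Return (level, action) for a confidence score in [0.0, 1.0]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be within 0.0 - 1.0")
    for lower_bound, level, action in LEVELS:
        if confidence >= lower_bound:
            return level, action
    raise AssertionError("unreachable: the last bound is 0.0")


print(classify_confidence(0.72))  # ('Medium', 'Proceed with caution')
```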

7.2 Confidence Propagation

Overall confidence is a weighted geometric mean of the symbol, parse, and context confidences (the weights sum to 1.0):

overall = (symbol_conf^0.4) × (parse_conf^0.3) × (context_conf^0.3)
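
The formula translates directly into code. The function name `overall_confidence` is hypothetical; the exponents are the ones given above.

```python
def overall_confidence(symbol_conf: float, parse_conf: float, context_conf: float) -> float:
    """Combine the three confidences using the exponents 0.4, 0.3, and 0.3."""
    return (symbol_conf ** 0.4) * (parse_conf ** 0.3) * (context_conf ** 0.3)


print(overall_confidence(0.95, 0.92, 0.80))
```

Two properties follow from the multiplicative form: three equal inputs give that same value back (up to floating-point rounding), and a zero in any component drives the overall score to zero.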

7.3 Confidence Signals to Tutor

{
  "confidence": 0.72,
  "confidenceFlags": {
    "lowSymbolConfidence": ["sym_023", "sym_025"],
    "ambiguousParse": false,
    "unexpectedContent": false
  }
}

8. Error Types

8.1 Error Categories

enum ErrorType {
  // Arithmetic
  ARITHMETIC_ERROR,      // Wrong calculation
  
  // Algebraic
  SIGN_ERROR,            // Wrong sign when moving terms
  OPERATION_ERROR,       // Wrong operation applied
  MISSING_TERM,          // Dropped a term
  EXTRA_TERM,            // Added a term
  DISTRIBUTION_ERROR,    // Incorrect distribution
  COMBINING_ERROR,       // Wrong combination of like terms
  
  // Structural
  INCOMPLETE_STEP,       // Step not finished
  SKIPPED_STEP,          // Jumped too far
  WRONG_VARIABLE,        // Solving for wrong variable
  
  // Format
  NOTATION_ERROR,        // Math written incorrectly
  AMBIGUOUS,             // Can't determine intent
}

8.2 Error Localization

Errors include location info for highlighting:

{
  "type": "SIGN_ERROR",
  "location": {
    "stepIndex": 1,
    "symbolIds": ["sym_025"],
    "bounds": {"x": 0.30, "y": 0.41, "w": 0.02, "h": 0.04}
  },
  "context": {
    "operation": "subtract_both_sides",
    "expected": "3x = 14 - 5",
    "actual": "3x = 14 + 5"
  }
}

9. Integration with Tutor

9.1 Event Flow

Tutor                               Vision
  │                                    │
  │  Stroke batch received             │
  │                                    │
  │  vision.update(strokes) ──────────►│
  │                                    │  Process strokes
  │                                    │  Update Work model
  │  ◄────────────── Work + changes ───│
  │                                    │
  │  Evaluate: should I speak?         │
  │  • Error detected?                 │
  │  • Step completed?                 │
  │  • Student stuck?                  │
  │                                    │

9.2 Tutor Decision Points

Vision output triggers Tutor decisions:

Vision Output         | Tutor Consideration
Error detected        | Intervene now or let student self-correct?
Step completed        | Praise? Move on?
Confidence low        | Ask for clarification?
Answer in answer zone | Check correctness?
No change for 30s     | Student stuck? Offer hint?

9.3 Vision Queries

Tutor can ask Vision specific questions:

{"method": "vision.isAnswerCorrect", "params": {"work": { /* Work object */ }}}
{"method": "vision.getNextHint", "params": {"work": { /* Work object */ }, "errorType": "SIGN_ERROR"}}
{"method": "vision.compareToExpected", "params": {"work": { /* Work object */ }, "expected": "x = 3"}}

10. Performance

10.1 Latency Targets

Operation          | Target | Max
Incremental update | 150ms  | 300ms
Full parse         | 300ms  | 500ms
Error check        | 100ms  | 200ms

10.2 Caching

Vision maintains a cache per session:

10.3 Batching Strategy

Vision processes in cycles, not per-stroke:

Strokes arrive:  ─●──●●─●──●●●──●─●──────────●●──
Process cycles:  ──────X────────X────────────X──
                      150ms    150ms         (idle, process on activity)
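
One way to realize this cycle is sketched below with an explicit clock so it can be tested without timers. The class name `StrokeBatcher` and its methods are hypothetical. The sketch assumes that a cycle starts with the first stroke after an idle period and that the buffered strokes are released once 150ms have elapsed.

```python
from typing import List, Optional


class StrokeBatcher:
    """Buffers strokes and releases them once per processing cycle."""

    def __init__(self, cycle_ms: int = 150):
        self.cycle_ms = cycle_ms
        self.pending: List[str] = []
        self.cycle_start: Optional[int] = None  # None while idle

    def add(self, stroke_id: str, now_ms: int) -> None:
        """Record a stroke; the first stroke after idle starts a new cycle."""
        if self.cycle_start is None:
            self.cycle_start = now_ms
        self.pending.append(stroke_id)

    def tick(self, now_ms: int) -> Optional[List[str]]:
        """Return the batch to process if the cycle has elapsed, else None."""
        if self.cycle_start is None or now_ms - self.cycle_start < self.cycle_ms:
            return None
        batch, self.pending = self.pending, []
        self.cycle_start = None  # go idle until the next stroke arrives
        return batch


batcher = StrokeBatcher()
batcher.add("str_001", now_ms=0)
batcher.add("str_002", now_ms=40)
first = batcher.tick(now_ms=100)   # cycle not finished yet
second = batcher.tick(now_ms=150)  # both strokes released together
print(first, second)
```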

11. Future Capabilities

11.1 Planned

Capability           | Description                        | Timeline
Geometry notation    | Angles, parallel marks, congruence | v1.5
Graph interpretation | Identify plotted points, lines     | v2.0
Word problem parsing | Extract math from text             | v2.0
Multi-language       | Non-Latin numerals, RTL            | v2.0

11.2 Research

Capability                   | Challenge
Intent prediction            | What is student trying to do?
Learning style detection     | Visual vs. procedural approach?
Misconception identification | What conceptual error underlies this?

Appendix A: Example Parsing

Student writes: 3x + 5 = 14

Strokes received:

str_001: curves forming "3"
str_002: crossed lines forming "x"
str_003: crossed lines forming "+"
str_004: curves forming "5"
str_005: two horizontal lines forming "="
str_006: vertical line forming "1"
str_007: curves forming "4"

Symbol recognition:

[
  {"id": "sym_001", "value": "3", "type": "DIGIT", "confidence": 0.97},
  {"id": "sym_002", "value": "x", "type": "VARIABLE", "confidence": 0.94},
  {"id": "sym_003", "value": "+", "type": "PLUS", "confidence": 0.96},
  {"id": "sym_004", "value": "5", "type": "DIGIT", "confidence": 0.93},
  {"id": "sym_005", "value": "=", "type": "EQUALS", "confidence": 0.98},
  {"id": "sym_006", "value": "1", "type": "DIGIT", "confidence": 0.91},
  {"id": "sym_007", "value": "4", "type": "DIGIT", "confidence": 0.95}
]

Expression parsing:

{
  "type": "equation",
  "left": {
    "type": "add",
    "left": {"type": "multiply", "left": {"type": "number", "value": 3}, "right": {"type": "variable", "name": "x"}},
    "right": {"type": "number", "value": 5}
  },
  "right": {"type": "number", "value": 14},
  "latex": "3x + 5 = 14"
}

Appendix B: Error Examples

Sign Error

Student writes: 3x = 14 + 5 (should be 14 - 5)

{
  "type": "SIGN_ERROR",
  "stepIndex": 1,
  "location": {"symbolIds": ["sym_015"]},
  "expected": "-",
  "actual": "+",
  "message": "When moving +5 to the other side, it becomes -5"
}

Arithmetic Error

Student writes: 3x = 8 (should be 9)

{
  "type": "ARITHMETIC_ERROR",
  "stepIndex": 1,
  "location": {"symbolIds": ["sym_020"]},
  "calculation": "14 - 5",
  "expected": "9",
  "actual": "8"
}

Missing Term

Student writes: 3x = 14 (dropped the 5)

{
  "type": "MISSING_TERM",
  "stepIndex": 1,
  "missingFrom": "left_side",
  "missing": {"type": "number", "value": 5},
  "message": "The +5 term needs to be accounted for"
}

Next spec: Tutor Behavior (when to speak, what to say)