Protocol, instrumentation, and evidence.
The workspace is treated as a research instrument. Each study begins with a question, a defined state model, and a small set of signals that can be reviewed across sessions. Cohorts in the Free Compute Program run under these protocols.
A four-stage loop for evaluating change.
Instrument
Baseline mastery, misconception state, curriculum context, and learner permissions are captured before any intervention.
Intervene
Learners move through the wrapper while tutoring mode, routing, and support patterns stay legible on every turn.
Observe
The system records user-scoped histories, evaluation traces, and notebook signals tied to the hypothesis.
Revise
Protocol changes follow evidence review, not the impression that a flow merely feels smoother.
Decisions depend on pre-declared signals.
The same matrix is reviewed across programs so results stay interpretable from one cycle to the next.
Tutor behavior
- Signal: Correction quality, retention, transfer
- Cadence: Per turn, with weekly review
- Threshold: Sustained gain across cohorts
Routing quality
- Signal: Tier choice vs. task difficulty and energy
- Cadence: Per request
- Threshold: Smallest capable model selected
Curriculum stability
- Signal: Graph and lesson consistency across return sessions
- Cadence: On generation and replay
- Threshold: No drift for unchanged inputs
Pedagogical fit
- Signal: Clarity, correctness, and pedagogical-fit scores
- Cadence: Per evaluation trace
- Threshold: Above review threshold
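The routing-quality row can be read as a per-request check: did the request go to the smallest tier that could handle it? The sketch below is one possible reading, not the system's actual router. The `Tier` type, the tier names, the 1 to 10 difficulty scale, and the relative energy costs are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Tier:
    name: str            # hypothetical tier label
    capability: int      # hardest task difficulty the tier handles (assumed 1-10 scale)
    energy_cost: float   # assumed relative energy per request


# Illustrative tiers only; real tiers and costs are not given in the text.
TIERS = (
    Tier("small", capability=3, energy_cost=1.0),
    Tier("medium", capability=6, energy_cost=4.0),
    Tier("large", capability=10, energy_cost=20.0),
)


def smallest_capable(tiers: Sequence[Tier], difficulty: int) -> Optional[Tier]:
    """Return the lowest-energy tier whose capability covers the task, or None."""
    capable = [t for t in tiers if t.capability >= difficulty]
    return min(capable, key=lambda t: t.energy_cost, default=None)


def routing_met_threshold(chosen: Tier, tiers: Sequence[Tier], difficulty: int) -> bool:
    """Per-request check: was the smallest capable tier the one selected?"""
    return smallest_capable(tiers, difficulty) == chosen
```

Under these assumptions, a difficulty-5 task routed to `large` would fail the check, because `medium` already covers it at lower energy cost.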
Before a study runs, it declares:
Hypothesis
One question the study is built to answer, written before any data is collected.
State model
The mastery, misconceptions, and permissions captured for each learner up front.
Signals
The small set of measurements that will count as evidence, and nothing else.
Threshold
The result that would actually change the protocol.
Escalation
The point where a low-confidence turn leaves the model and goes to a human.
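The five declarations above can be pictured as a record that is frozen before data collection. This is a minimal sketch under stated assumptions: the `StudyDeclaration` class, its field types, and the helper functions are hypothetical, and the example values are invented for illustration rather than taken from any real study.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class StudyDeclaration:
    hypothesis: str                 # the one question the study answers
    state_model: Tuple[str, ...]    # learner fields captured up front
    signals: Tuple[str, ...]        # the only measurements that count as evidence
    threshold: Dict[str, float]     # signal -> minimum value that would change the protocol
    escalation_confidence: float    # below this confidence, a turn goes to a human

    def __post_init__(self) -> None:
        if not self.hypothesis.strip():
            raise ValueError("a hypothesis must be written before data is collected")
        if not self.signals or not self.threshold:
            raise ValueError("signals and threshold must both be declared")
        undeclared = set(self.threshold) - set(self.signals)
        if undeclared:
            raise ValueError(f"threshold refers to undeclared signals: {sorted(undeclared)}")
        if not 0.0 <= self.escalation_confidence <= 1.0:
            raise ValueError("escalation_confidence must be between 0 and 1")


def evidence(declaration: StudyDeclaration, observed: Dict[str, float]) -> Dict[str, float]:
    """Keep only the declared signals; anything else is not evidence."""
    return {name: value for name, value in observed.items() if name in declaration.signals}


def should_revise(declaration: StudyDeclaration, observed: Dict[str, float]) -> bool:
    """True only when every thresholded signal is present and meets its minimum."""
    kept = evidence(declaration, observed)
    return all(kept.get(name, float("-inf")) >= minimum
               for name, minimum in declaration.threshold.items())


# Illustrative values only.
decl = StudyDeclaration(
    hypothesis="Worked examples improve transfer on fraction tasks.",
    state_model=("mastery", "misconceptions", "permissions"),
    signals=("retention", "transfer"),
    threshold={"transfer": 0.10},
    escalation_confidence=0.6,
)
```

Freezing the record is a design choice in this sketch: it makes it harder to adjust signals or thresholds after results start arriving.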
A quiet system, so researchers can read what matters.
State is explicit, not incidental.
Every measurement is tied to a hypothesis.
Access is constrained until the surface is trustworthy.
Every public claim maps to an observable workflow.
Public demos are labelled simulated, preview, or experimental.
Low-confidence turns escalate to a human, and guardrails gate the sensitive ones.
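The last principle can be sketched as a small routing decision per turn. This is one hedged interpretation, not the system's implementation: the `Route` outcomes, the `route_turn` function, the 0.7 confidence floor, and the choice to withhold a sensitive turn that fails its guardrail check are all assumptions.

```python
from enum import Enum


class Route(Enum):
    MODEL = "model"      # the model answers the turn
    HUMAN = "human"      # the turn is escalated to a person
    BLOCKED = "blocked"  # assumed outcome when a guardrail rejects a sensitive turn


def route_turn(confidence: float, sensitive: bool, guardrail_passed: bool,
               min_confidence: float = 0.7) -> Route:
    """Decide where a single turn goes.

    Guardrails are checked first for sensitive turns; the confidence floor
    then decides between the model and a human. The 0.7 default is an
    illustrative value, not a documented setting.
    """
    if sensitive and not guardrail_passed:
        return Route.BLOCKED
    if confidence < min_confidence:
        return Route.HUMAN
    return Route.MODEL
```

Checking the guardrail before confidence means a sensitive turn is gated even when the model is confident.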