Eval-Driven Development.
Eval harness before the model.
Every Phase 02 build. There are no exceptions. If we cannot define evals, we cannot define the use case, and we should not be building.
The framework, what it covers, and the problem it addresses.
A development discipline borrowed from test-driven development and applied to LLM systems. The eval harness is built before the model is fine-tuned or the prompt is finalized. Evals are domain-specific: faithfulness (does the answer match the source), citation accuracy (does the cited source say what the answer claims it says), bilingual equivalence (do the AR and EN versions say the same thing), false-positive rate (do we cry wolf). Eval thresholds become release gates.
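As a concrete illustration, here is a minimal sketch of what "eval thresholds become release gates" can look like in code. The gate names mirror the evals above; the scoring functions, threshold values, and case format are illustrative assumptions, not Rubix defaults.

```python
from dataclasses import dataclass
from typing import Callable

Case = dict  # one eval case: inputs, expected behavior, and per-metric labels

@dataclass
class EvalGate:
    name: str
    score: Callable[[list[Case]], float]  # cases -> aggregate score in [0, 1]
    threshold: float  # agreed with the client; a release gate, not a dashboard

def run_gates(gates: list[EvalGate], cases: list[Case]) -> bool:
    """Return True only if every gate clears its threshold; any failure blocks release."""
    ok = True
    for g in gates:
        s = g.score(cases)
        verdict = "PASS" if s >= g.threshold else "FAIL"
        print(f"{g.name}: {s:.3f} (gate {g.threshold:.2f}) {verdict}")
        ok = ok and s >= g.threshold
    return ok

# Toy labeled cases; real cases carry source documents, answers, and citations.
cases = [
    {"faithful": 1, "citation_ok": 1, "false_alarm": 0},
    {"faithful": 1, "citation_ok": 0, "false_alarm": 0},
    {"faithful": 0, "citation_ok": 1, "false_alarm": 1},
]

gates = [
    EvalGate("faithfulness", lambda cs: sum(c["faithful"] for c in cs) / len(cs), 0.95),
    EvalGate("citation_accuracy", lambda cs: sum(c["citation_ok"] for c in cs) / len(cs), 0.95),
    EvalGate("false_positive_rate", lambda cs: 1.0 - sum(c["false_alarm"] for c in cs) / len(cs), 0.90),
]

if __name__ == "__main__":
    releasable = run_gates(gates, cases)
    print("release:", "GO" if releasable else "BLOCKED")
```

The point of the structure is that a release is a boolean, not a slide: any single gate below threshold blocks the ship.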
The reason this framework exists in the Rubix toolkit, and why omitting it is the wrong shortcut.
LLM systems demo well and ship badly. The team builds something that looks impressive in five test cases and fails on the sixteenth. Eval-Driven Development is the discipline that surfaces the sixteenth case before the customer does. It is also what produces operator trust: line staff act on alerts when the false-positive rate is below their threshold of frustration.
Regional context. PDPL, SDAIA, Vision 2030, Saudization, and the operating realities that shape how this framework lands here.
In bilingual AR/EN deployments, eval discipline is non-negotiable. A system that performs at 95% faithfulness in English and 78% in Arabic is not a 'mostly fine' system; it is unfit for production. Eval-Driven Development makes this disparity visible and forces it to be addressed.
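A hedged sketch of what "makes the disparity visible" can mean in practice: gate on each language separately and on the AR/EN gap itself, so an English-heavy average can never mask an Arabic failure. The floor and gap values below are illustrative, not contractual numbers.

```python
def bilingual_gate(scores: dict[str, float], floor: float = 0.95, max_gap: float = 0.03) -> bool:
    """Pass only if both languages clear the floor AND the gap between them is bounded."""
    ar, en = scores["ar"], scores["en"]
    per_language_ok = ar >= floor and en >= floor
    gap_ok = abs(ar - en) <= max_gap
    return per_language_ok and gap_ok

# The 95%/78% system from the paragraph above fails on both counts:
print(bilingual_gate({"ar": 0.78, "en": 0.95}))  # False: AR below floor, gap too wide
print(bilingual_gate({"ar": 0.95, "en": 0.96}))  # True
```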
The phases of the Rubix Way where this framework is operationalized, and what we do with it there.
Build. Sprint 0 of every Phase 02 build is the eval harness. Domain-specific evals defined with the client, thresholds agreed, regression suite live before any model code is written.
Scale. Evals run continuously in production. New use cases extend the eval harness rather than replacing it.
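One way to read "extend rather than replace" in code, under the assumption of a shared case registry (the names here are hypothetical): each new use case appends its cases to a suite that every release, and every scheduled production run, must pass in full.

```python
# Shared regression suite, keyed by use case. Cases accumulate; they are never
# silently retired when a new use case arrives.
REGRESSION_SUITE: dict[str, list[dict]] = {}

def register_use_case(name: str, cases: list[dict]) -> None:
    """New use cases append their cases; the existing suite stays intact."""
    REGRESSION_SUITE.setdefault(name, []).extend(cases)

def all_cases() -> list[dict]:
    """The full suite: what every release and every production run evaluates."""
    return [c for cases in REGRESSION_SUITE.values() for c in cases]

register_use_case("contract_qa", [{"faithful": 1, "citation_ok": 1, "false_alarm": 0}])
register_use_case("alerting", [{"faithful": 1, "citation_ok": 1, "false_alarm": 0}])
# Use case N+1 cannot ship while silently regressing use case N, because both
# sets of cases are in all_cases() for every run.
```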
The failure modes we have seen up close, written so the next engagement avoids them.
- 01: Treating evals as "we'll add them later." Later means never. Evals built after the system are never as honest as evals built before.
- 02: Eval thresholds set by the engineering team alone. Thresholds are set with the operations team that will trust (or distrust) the alerts.
- 03: Evaluating on the training set. Held-out test sets are non-negotiable; we rotate held-out sets across sprints (see the sketch after this list).
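A minimal sketch of held-out rotation, assuming a stable case pool and a sprint counter: a hash fixes each case's fold once, and the sprint number selects which fold is held out, so the team cannot tune against the same test set two sprints in a row. The fold count and names are illustrative.

```python
import hashlib

def held_out(case_id: str, sprint: int, num_folds: int = 5) -> bool:
    """Deterministically assign each case to a fold; one fold per sprint is held out."""
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    fold = int(digest, 16) % num_folds
    return fold == sprint % num_folds

pool = [f"case-{i:03d}" for i in range(10)]
sprint = 7
train = [c for c in pool if not held_out(c, sprint)]
test = [c for c in pool if held_out(c, sprint)]
print(f"sprint {sprint}: {len(train)} train cases, {len(test)} held out")
```

Because the assignment is a pure function of the case ID, the rotation is reproducible: any engineer can recompute which cases were held out in any past sprint.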