Eval-Driven Development.
Eval harness before the model.
Every Phase 02 build. There are no exceptions. If we cannot define evals, we cannot define the use case, and we should not be building.
The framework, what it covers, and the problem it addresses.
A development discipline borrowed from test-driven development and applied to LLM systems. The eval harness is built before the model is fine-tuned or the prompt is finalized. Evals are domain-specific: faithfulness (does the answer match the source), citation accuracy (does the cited source say what the answer claims it says), bilingual equivalence (do the AR and EN versions say the same thing), false-positive rate (do we cry wolf). Eval thresholds become release gates.
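As a concrete illustration, here is a minimal sketch of what "eval thresholds become release gates" can look like in code. The gate names mirror the evals above; the scoring functions, threshold values, and case format are illustrative assumptions, not Rubix defaults.

```python
from dataclasses import dataclass
from typing import Callable

Case = dict  # one eval case: inputs, expected behavior, and per-metric labels

@dataclass
class EvalGate:
    name: str
    score: Callable[[list[Case]], float]  # cases -> aggregate score in [0, 1]
    threshold: float  # agreed with the client; a release gate, not a dashboard

def run_gates(gates: list[EvalGate], cases: list[Case]) -> bool:
    """Return True only if every gate clears its threshold; any failure blocks release."""
    ok = True
    for g in gates:
        s = g.score(cases)
        verdict = "PASS" if s >= g.threshold else "FAIL"
        print(f"{g.name}: {s:.3f} (gate {g.threshold:.2f}) {verdict}")
        ok = ok and s >= g.threshold
    return ok

# Toy labeled cases; real cases carry source documents, answers, and citations.
cases = [
    {"faithful": 1, "citation_ok": 1, "false_alarm": 0},
    {"faithful": 1, "citation_ok": 0, "false_alarm": 0},
    {"faithful": 0, "citation_ok": 1, "false_alarm": 1},
]

gates = [
    EvalGate("faithfulness", lambda cs: sum(c["faithful"] for c in cs) / len(cs), 0.95),
    EvalGate("citation_accuracy", lambda cs: sum(c["citation_ok"] for c in cs) / len(cs), 0.95),
    EvalGate("false_positive_rate", lambda cs: 1.0 - sum(c["false_alarm"] for c in cs) / len(cs), 0.90),
]

if __name__ == "__main__":
    releasable = run_gates(gates, cases)
    print("release:", "GO" if releasable else "BLOCKED")
```

The point of the structure is that a release is a boolean, not a slide: any single gate below threshold blocks the ship.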
The reason this framework exists in the Rubix toolkit, and why omitting it is the wrong shortcut.
LLM systems demo well and ship badly. The team builds something that looks impressive in five test cases and fails on the sixteenth. Eval-Driven Development is the discipline that surfaces the sixteenth case before the customer does. It is also what produces operator trust: line staff act on alerts when the false-positive rate is below their threshold of frustration.
Regional context. PDPL, SDAIA, Vision 2030, Saudization, and the operating realities that shape how this framework lands here.
In bilingual AR/EN deployments, eval discipline is non-negotiable. A system that performs at 95% faithfulness in English and 78% in Arabic is not a 'mostly fine' system; it is unfit for production. Eval-Driven Development makes this disparity visible and forces it to be addressed.
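A hedged sketch of what "makes the disparity visible" can mean in practice: gate on each language separately and on the AR/EN gap itself, so an English-heavy average can never mask an Arabic failure. The floor and gap values below are illustrative, not contractual numbers.

```python
def bilingual_gate(scores: dict[str, float], floor: float = 0.95, max_gap: float = 0.03) -> bool:
    """Pass only if both languages clear the floor AND the gap between them is bounded."""
    ar, en = scores["ar"], scores["en"]
    per_language_ok = ar >= floor and en >= floor
    gap_ok = abs(ar - en) <= max_gap
    return per_language_ok and gap_ok

# The 95%/78% system from the paragraph above fails on both counts:
print(bilingual_gate({"ar": 0.78, "en": 0.95}))  # False: AR below floor, gap too wide
print(bilingual_gate({"ar": 0.95, "en": 0.96}))  # True
```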
The phases of the Rubix Way where this framework is operationalized, and what we do with it there.
Build. Sprint 0 of every Phase 02 build is the eval harness. Domain-specific evals defined with the client, thresholds agreed, regression suite live before any model code is written.
Scale. Evals run continuously in production. New use cases extend the eval harness rather than replacing it.
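One way to read "extend rather than replace" in code, under the assumption of a shared case registry (the names here are hypothetical): each new use case appends its cases to a suite that every release, and every scheduled production run, must pass in full.

```python
# Shared regression suite, keyed by use case. Cases accumulate; they are never
# silently retired when a new use case arrives.
REGRESSION_SUITE: dict[str, list[dict]] = {}

def register_use_case(name: str, cases: list[dict]) -> None:
    """New use cases append their cases; the existing suite stays intact."""
    REGRESSION_SUITE.setdefault(name, []).extend(cases)

def all_cases() -> list[dict]:
    """The full suite: what every release and every production run evaluates."""
    return [c for cases in REGRESSION_SUITE.values() for c in cases]

register_use_case("contract_qa", [{"faithful": 1, "citation_ok": 1, "false_alarm": 0}])
register_use_case("alerting", [{"faithful": 1, "citation_ok": 1, "false_alarm": 0}])
# Use case N+1 cannot ship while silently regressing use case N, because both
# sets of cases are in all_cases() for every run.
```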
The failure modes we have seen up close, written so the next engagement avoids them.
- 01: Treating evals as "we'll add them later." Later means never. Evals built after the system are never as honest as evals built before.
- 02: Eval thresholds set by the engineering team alone. Thresholds are set with the operations team that will trust (or distrust) the alerts.
- 03: Evaluating on the training set. Held-out test sets are non-negotiable; we rotate held-out sets across sprints (see the sketch after this list).
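A minimal sketch of held-out rotation, assuming a stable case pool and a sprint counter: a hash fixes each case's fold once, and the sprint number selects which fold is held out, so the team cannot tune against the same test set two sprints in a row. The fold count and names are illustrative.

```python
import hashlib

def held_out(case_id: str, sprint: int, num_folds: int = 5) -> bool:
    """Deterministically assign each case to a fold; one fold per sprint is held out."""
    digest = hashlib.sha256(case_id.encode()).hexdigest()
    fold = int(digest, 16) % num_folds
    return fold == sprint % num_folds

pool = [f"case-{i:03d}" for i in range(10)]
sprint = 7
train = [c for c in pool if not held_out(c, sprint)]
test = [c for c in pool if held_out(c, sprint)]
print(f"sprint {sprint}: {len(train)} train cases, {len(test)} held out")
```

Because the assignment is a pure function of the case ID, the rotation is reproducible: any engineer can recompute which cases were held out in any past sprint.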