Introducing Judge

Evaluating AI model performance is a necessity: it drives model selection, informs research, and allows us to reason about the frontier of machine intelligence. It’s also hard. Traditional approaches rely on human experts to grade outputs, but expert time is costly and scarce, and expert grading is often inconsistent. Increasingly, models themselves are emerging as compelling judges across diverse domains - from coding benchmarks to scientific peer review.

However, the majority of these judges depend on closed APIs like GPT-5, which are opaque, subject to silent updates, and impossible to reproduce with confidence. This makes them unreliable in high-stakes contexts where transparency and verifiability matter most. As highlighted in recent research (Zheng et al., 2023; Chan et al., 2024), LLM judges can be powerful, but without reproducibility and robustness, their value is limited.

Verde: Verifiable Evaluation

At the core of Judge is Verde, our system for cryptographically verifiable ML evaluation.

  • Verde ensures every judgment can be independently checked.
  • It uses refereed delegation: multiple untrusted providers run the same task, and if their results disagree, Verde pinpoints the exact operator in the computational graph where they diverge (sketched in code below).
  • This allows even a modest referee (the client) to efficiently confirm correctness.
  • Combined with deterministic execution, Verde provides trust without requiring blind reliance on heavyweight cryptographic proofs.

With Verde, every evaluation by Judge is open, transparent, and independently verifiable.
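
To make the dispute game concrete, here is a minimal Python sketch of refereed delegation with bisection. The Op, run_and_commit, and referee names are illustrative assumptions for this post, not Verde's actual API: the idea is that providers commit to hashes of intermediate outputs, and the referee compares O(log n) commitments before re-executing only the single diverging operator.

```python
import hashlib
from typing import Callable, List

# Illustrative stand-ins for Verde's interfaces; names and types here are
# assumptions for this sketch, not the real API.
Op = Callable[[bytes], bytes]  # one operator in the computational graph

def digest(value: bytes) -> str:
    return hashlib.sha256(value).hexdigest()

def run_and_commit(ops: List[Op], x: bytes) -> List[str]:
    """A provider executes the graph and commits to a hash of every
    intermediate output."""
    commitments = []
    for op in ops:
        x = op(x)
        commitments.append(digest(x))
    return commitments

def first_divergence(a: List[str], b: List[str]) -> int:
    """Bisect to an operator whose input commitments agree but whose output
    commitments differ (assumes the final commitments disagree)."""
    lo, hi = 0, len(a) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] == b[mid]:
            lo = mid + 1  # agreement at mid: the dispute point is later
        else:
            hi = mid      # disagreement at mid: the dispute point is here or earlier
    return lo

def referee(ops: List[Op], x0: bytes, a: List[str], b: List[str],
            revealed: bytes) -> str:
    """Resolve the dispute by re-executing only the diverging operator.
    `revealed` is the last agreed intermediate value, supplied by the
    providers and checked against their shared commitment."""
    i = first_divergence(a, b)
    if i > 0:
        assert digest(revealed) == a[i - 1] == b[i - 1]
    honest_hash = digest(ops[i](revealed if i > 0 else x0))
    return "provider A" if honest_hash == a[i] else "provider B"
```

Because the referee only compares hashes and re-runs one operator, even a modest client can adjudicate a computation far larger than anything it could execute end to end.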

Reproducible Runtime

Judge also runs on Gensyn’s Reproducible Runtime, which guarantees bitwise-exact reproducibility across devices. It contains two components:

  1. Deterministic Kernels: Custom-optimised CUDA kernels pin the exact order of operations (floating-point arithmetic is not associative, so execution order changes the resulting bits), eliminating floating-point nondeterminism; a summation sketch follows the list below.
  2. Proprietary Compiler: Our compiler lowers graphs from ONNX to PyTorch:
    • Correctness by construction: Each operator lowering must match a Concrete Function Specification (CFS) before hardware codegen, allowing small unit tests instead of relying solely on end-to-end checks.
    • Defense against non-determinism: We actively mitigate C++/MLIR pitfalls - undefined behavior, memory errors, race conditions, unordered iteration - using sanitizers, deterministic containers, and linting against unsafe APIs.
    • Runtime independence: Programs can be deterministically lowered into MLIR bytecode or device-specific binaries, executable under our own runtime or any interpreter we choose.
    • Optimization surface: Supports both hardware-agnostic and hardware-specific optimizations for speed, memory footprint, and utilization.
    • Provenance tracking: The compiler propagates metadata linking every operation back to its original ONNX node or initializer. Even as graphs are transformed (split, fused, or replaced), provenance is preserved and attached to the generated PyTorch model. This ensures every executed operation can be traced back to its source definition - a critical feature for Judge’s verification game. Example (sketched in code after this list):
      • 1→many: An ONNX Gemm node (node_index=[12]) with weight W (initializer_index=[3]) may be lowered into torch.transpose, torch.matmul, and torch.add. Each operation inherits the original indices.
      • Fusion back: These three can later be recomposed into a single torch.nn.Linear only if they share exactly the same provenance (node_index=[12], initializer_index=[3]).
      • Final model: Each PyTorch module stores these provenance attributes, allowing the interpreter to attach a full ONNX mapping for Judge/Verde verification.
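
To make the provenance example concrete, here is a minimal Python sketch. The dict-based metadata and the node_index / initializer_index attributes on the fused module are hypothetical representations for illustration; the real compiler propagates this metadata through MLIR rather than Python objects.

```python
import torch.nn as nn

# An ONNX Gemm node (node_index=[12]) with weight W (initializer_index=[3])
# lowered into three torch ops, each inheriting the original indices:
lowered = [
    {"op": "torch.transpose", "node_index": [12], "initializer_index": [3]},
    {"op": "torch.matmul",    "node_index": [12], "initializer_index": [3]},
    {"op": "torch.add",       "node_index": [12], "initializer_index": [3]},
]

def can_fuse(ops):
    """Fusing back into one module is legal only when every candidate op
    carries exactly the same provenance."""
    provenance = {(tuple(o["node_index"]), tuple(o["initializer_index"]))
                  for o in ops}
    return len(provenance) == 1

if can_fuse(lowered):
    # Recompose transpose + matmul + add into a single Linear, attaching the
    # shared provenance so the interpreter can map the module back to its
    # ONNX source for Judge/Verde verification. (Shapes are placeholders.)
    fused = nn.Linear(in_features=128, out_features=64)
    fused.node_index = [12]
    fused.initializer_index = [3]
```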
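
And to illustrate why the deterministic kernels in point 1 matter: floating-point addition is not associative, so summing the same values in a different order can change the final bits. The plain-Python sketch below shows the effect and a fixed-order tree reduction that restores bitwise reproducibility; our CUDA kernels enforce an analogous fixed schedule on the GPU.

```python
import random

# Summing the same values in two different orders usually differs in the
# low-order bits - exactly the nondeterminism an unordered GPU reduction
# (e.g. one built on atomics) exhibits.
values = [random.uniform(-1e10, 1e10) for _ in range(10_000)]

left_to_right = 0.0
for v in values:
    left_to_right += v

shuffled = values[:]
random.shuffle(shuffled)
other_order = 0.0
for v in shuffled:
    other_order += v

print(left_to_right == other_order)  # frequently False

def tree_sum(xs):
    """Pairwise reduction with a canonical schedule:
    same input -> same bits, every time."""
    while len(xs) > 1:
        if len(xs) % 2:
            xs = xs + [0.0]
        xs = [xs[i] + xs[i + 1] for i in range(0, len(xs), 2)]
    return xs[0]

assert tree_sum(values) == tree_sum(values)  # bitwise reproducible
```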

Together, these features allow Judge to scale reproducible evaluation across heterogeneous hardware while preserving both determinism and traceability down to the operator level.

Progressive Reveal Game

To showcase Judge, we start with a reasoning task framed as a prediction market:

  • Participants: RL Swarm models place bets on which answer to a reasoning problem is correct.
  • Progressive Information: Evidence is revealed step by step. Models update their beliefs and bets as information arrives.
  • Payoff Structure: Early correct bets pay more than late ones, rewarding fast, confident reasoning (see the payoff sketch after this list).
  • Resolution: Judge evaluates the final evidence and issues a decision, cryptographically verifiable with Verde.
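
As a sketch of the payoff structure, here is one possible schedule with a geometric decay per reveal step. The base_multiplier and decay values are illustrative placeholders, not the production parameters:

```python
def payoff(stake: float, bet_step: int,
           base_multiplier: float = 2.0, decay: float = 0.7) -> float:
    """Payout for a correct bet placed at `bet_step` (0 = before any
    evidence is revealed). Earlier correct bets carry more risk, so they
    earn a larger multiplier."""
    multiplier = 1.0 + (base_multiplier - 1.0) * (decay ** bet_step)
    return stake * multiplier

# A correct bet placed before any evidence pays more than one placed
# after most of the evidence is public:
for step in range(5):
    print(f"step {step}: a correct 10-unit bet pays {payoff(10.0, step):.2f}")
```

Any monotonically decreasing multiplier preserves the incentive; the essential property is that confidence ahead of the evidence is rewarded over hindsight.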

This design incentivizes reasoning under uncertainty and provides a measurable, fair proxy for model performance.

Beyond Reasoning Tasks

While we begin with reasoning games, the framework extends naturally to any domain where verifiable judgment is critical but costly or hard to scale - evaluation benchmarks, prediction markets, or even dispute resolution.

Get Involved:
- Discord
- X
- LinkedIn