DEI: Diversity in Evolutionary Inference for Quality-Diversity Search
Introducing DEI (Diversity in Evolutionary Inference), a distributed Quality-Diversity search framework that uses heterogeneous LLMs as complementary mutation operators. Nodes running different LLMs collaborate to reach solutions that outperform those found by an individual LLM.
Agentic workflows, such as auto-research, often have agents solve an iterative task while optimizing against some score function or benchmark. This iterative loop is a vanilla example of an evolutionary algorithm. Evolutionary algorithms are simple but effective optimization techniques that mimic biological evolution. Generally, at each iteration, a sample is mutated in some way and then scored to measure its performance. If its score increases, the mutated sample is kept; if not, it may be thrown out.
Over the last few years we have seen an abundance of work demonstrating the utility of using an LLM as the mutation operator in these algorithms. An LLM can be a particularly powerful mutation operator when the problem lies in the coding or natural language domain, because LLMs are good at generating programs, mutating existing solutions, and reasoning about search spaces in ways that hand-written mutation operators cannot. Choosing a single LLM as the mutation operator, however, restricts the search to that LLM's creative prior. Sampling from it many times may increase throughput, but every sample still comes from the same underlying distribution.
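The mutate, score, and keep-if-better loop described above can be sketched in a few lines of Python. This is a minimal illustration, not DEI itself: `evolve`, `toy_mutate`, and `toy_score` are hypothetical names, and the toy mutation is a random character edit standing in for what would be an LLM call in an LLM-driven loop.

```python
import random
from typing import Callable, Tuple

def evolve(
    seed: str,
    mutate: Callable[[str], str],
    score: Callable[[str], float],
    iterations: int,
) -> Tuple[str, float]:
    """Greedy evolutionary loop: keep a mutated sample only if it scores higher."""
    best, best_score = seed, score(seed)
    for _ in range(iterations):
        candidate = mutate(best)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins (hypothetical). With an LLM operator, `mutate` would wrap a model call.
_rng = random.Random(0)

def toy_mutate(sample: str) -> str:
    # Overwrite one random position with 'a' or 'b'.
    i = _rng.randrange(len(sample))
    return sample[:i] + _rng.choice("ab") + sample[i + 1:]

def toy_score(sample: str) -> float:
    # Score is simply the number of 'a' characters.
    return float(sample.count("a"))

best, best_score = evolve("bbbbbbbb", toy_mutate, toy_score, iterations=200)
```

Swapping `toy_mutate` for a different operator changes which candidates are ever proposed, which is the sense in which the choice of LLM shapes the search.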
In our paper, we introduce DEI (Diversity in Evolutionary Inference), a distributed Quality-Diversity search framework that uses heterogeneous LLMs as complementary mutation operators. Instead of asking many copies of one model to search faster, DEI asks different models to search differently. Each node runs its own LLM, evolves solutions locally, and asynchronously shares its best discoveries with peers, so the ensemble collaboratively reaches solutions that outperform those reached by an individual LLM.
Distributed Search with Homogeneous Mutation Operators
LLMs are powerful evolutionary operators because they can generate and mutate programs with semantic awareness. But every model also carries its own distribution over programs, shaped by its training data, architecture, and alignment procedure. One model may favor certain control-flow patterns, another may produce different styles of code, and an instruction-tuned model may explore differently from a code-completion model. These biases matter in Quality-Diversity search, a flavor of evolutionary algorithm that aims to cover a broad range of behaviors while keeping the solution in each behavioral niche high quality. If a model rarely proposes solutions in some behavioral niche, simply sampling from it more often may leave that part of the archive empty.
Most parallel LLM search frameworks treat scaling as a compute problem. More workers make more calls, but usually to the same model, so diversity comes only from stochastic sampling rather than from fundamentally different generative priors.
Recent work from Sakana AI establishes the Digital Red Queen (DRQ) framework as a testbed for studying Quality-Diversity search in adversarial domains. In the original Digital Red Queen setup, a single model drives the evolutionary loop; its blind spots can therefore become persistent gaps in the MAP-Elites archive.
We argue for parallel cognition, not just parallel computation. If different LLMs explore the same behavioral space, each model’s inductive bias can push it toward different regions of the archive. Champions discovered by one model may occupy niches another model would rarely generate on its own, making them useful seeds for cross-pollination. When shared as opponents, these champions also create cross-model adversarial pressure, producing a stronger Red Queen dynamic than intra-model self-play alone.
This introduces a systems challenge: heterogeneous LLMs run at different speeds. A local model may take much longer per call than a hosted frontier model, and a synchronization barrier would slow the whole ensemble down to the slowest node. DEI is built around this constraint using AXL, allowing nodes to search independently, share discoveries without blocking, and benefit from model diversity without requiring identical hardware or identical inference latency.
Core War Adversarial Domain
We evaluate DEI on Core War, a competitive programming game where Redcode “warrior” programs battle inside a circular memory simulator called MARS. Warriors survive by replicating, bombing memory, scanning opponents, or protecting themselves with defensive structures.
Following the Digital Red Queen setup, each candidate warrior is evaluated against an opponent pool, and fitness is based on survival across simulation timesteps. In line with DRQ, we use a two-dimensional behavioral space: Time-Space Product, which combines code length and lifespan, and Memory Coverage, which measures how much of the core a warrior touches during battle.
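To make the two-dimensional behavioral space concrete, here is a hedged sketch of how a pair of descriptor values could be mapped to a MAP-Elites grid cell. The uniform linear binning, the `bins` count, and the `tsp_max` upper bound are all assumptions made for illustration; the actual discretization used by DRQ or DEI may differ (for example, it could use logarithmic bins).

```python
from typing import Tuple

def to_cell(
    tsp: float,
    coverage: float,
    tsp_max: float,
    bins: int = 10,
) -> Tuple[int, int]:
    """Map (Time-Space Product, Memory Coverage) to a grid cell.

    Assumptions: `tsp` lies in [0, tsp_max], `coverage` is a fraction in
    [0, 1], and both axes use `bins` equal-width bins.
    """
    def bin_index(value: float, upper: float) -> int:
        frac = min(max(value / upper, 0.0), 1.0)  # clamp to [0, 1]
        return min(int(frac * bins), bins - 1)    # top edge falls in the last bin

    return bin_index(tsp, tsp_max), bin_index(coverage, 1.0)

# A warrior at 55% of the assumed TSP range that touches a quarter of the core.
cell = to_cell(tsp=55.0, coverage=0.25, tsp_max=100.0)
```

Two warriors that map to the same cell compete for that niche, while warriors in different cells coexist in the archive.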
We also evaluate final champions against a held-out set of human-authored warriors to measure generality.
Distributed Heterogeneous Red Queen Search
DEI extends the Digital Red Queen framework from one LLM node to a distributed ensemble of heterogeneous LLMs.
Each node runs a local DRQ loop. In each round, it uses its assigned LLM to generate a new warrior from scratch or mutate an existing elite from its archive. The candidate is evaluated, the local MAP-Elites archive is updated, and the node selects a round champion.
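The local round described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the DEI implementation: `run_round`, `try_insert`, and the `p_new` probability of generating from scratch are hypothetical, the archive is a plain dictionary, and the round champion is assumed to be the fittest elite currently in the archive (the post does not specify the selection rule). The toy callables stand in for the LLM and for a MARS evaluation.

```python
import random
from typing import Callable, Dict, Tuple

Cell = Tuple[int, int]
Elite = Tuple[float, str]            # (fitness, warrior source)
Archive = Dict[Cell, Elite]

def try_insert(archive: Archive, cell: Cell, fitness: float, warrior: str) -> bool:
    """Standard MAP-Elites rule: fill an empty cell or replace a weaker elite."""
    if cell not in archive or fitness > archive[cell][0]:
        archive[cell] = (fitness, warrior)
        return True
    return False

def run_round(
    archive: Archive,
    generate: Callable[[], str],
    mutate: Callable[[str], str],
    evaluate: Callable[[str], Tuple[float, Cell]],
    rng: random.Random,
    p_new: float = 0.1,              # assumed chance of generating from scratch
) -> Elite:
    if not archive or rng.random() < p_new:
        candidate = generate()
    else:
        _, parent = archive[rng.choice(sorted(archive))]
        candidate = mutate(parent)
    fitness, cell = evaluate(candidate)
    try_insert(archive, cell, fitness, candidate)
    # Assumption: the round champion is the fittest elite in the archive.
    return max(archive.values(), key=lambda elite: elite[0])

# Toy stand-ins (hypothetical) for the LLM calls and the simulator.
def toy_generate() -> str:
    return "MOV 0, 1"

def toy_mutate(parent: str) -> str:
    return parent + "\nMOV 0, 1"

def toy_evaluate(warrior: str) -> Tuple[float, Cell]:
    lines = warrior.count("\n") + 1
    return float(lines), (lines % 3, 0)

archive: Archive = {}
rng = random.Random(0)
for _ in range(20):
    champion = run_round(archive, toy_generate, toy_mutate, toy_evaluate, rng)
```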
At the end of a round, each node publishes its champion to its peers. Received champions are added to the local opponent pool, creating cross-model adversarial pressure, and are also seeded into the archive when they occupy previously empty cells.
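The two roles of a received champion, as an opponent and as a potential archive seed, can be sketched like this. `receive_champion` is a hypothetical helper, and it assumes the caller already knows the champion's fitness and behavior cell (for example, from re-evaluating it locally); the post does not say whether peers' reported values are trusted or recomputed.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]
Archive = Dict[Cell, Tuple[float, str]]  # cell -> (fitness, warrior source)

def receive_champion(
    archive: Archive,
    opponents: List[str],
    warrior: str,
    fitness: float,
    cell: Cell,
) -> bool:
    """Handle one peer champion. Returns True if it seeded an empty cell."""
    # Every received champion becomes an opponent, regardless of its niche.
    opponents.append(warrior)
    # It is seeded into the archive only when its cell was previously empty.
    if cell not in archive:
        archive[cell] = (fitness, warrior)
        return True
    return False

local_archive: Archive = {(0, 0): (1.0, "local elite")}
opponent_pool: List[str] = []
seeded_existing = receive_champion(local_archive, opponent_pool, "peer A", 5.0, (0, 0))
seeded_empty = receive_champion(local_archive, opponent_pool, "peer B", 0.5, (3, 7))
```

Note that in this sketch a peer champion never displaces a local elite, even a weaker one; it only fills gaps, which matches the "previously empty cells" wording above.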
This gives DEI two sources of improvement: peers provide useful evolutionary seeds from niches a model may not find alone, and each model must compete against strategies generated by other models rather than only against its own self-play lineage.
Asynchronous by Design
Heterogeneous LLMs do not run at the same speed. A local open-weight model can be much slower than a hosted frontier model. A synchronous barrier would force every node to wait for the slowest participant, wasting the benefit of faster nodes.
DEI avoids this with non-blocking champion sharing. Nodes publish champions when ready and consume incoming champions at their own pace. In our implementation, this uses a gossip-style communication layer over AXL on Yggdrasil, allowing heterogeneous peers to collaborate without global synchronization.
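As a rough illustration of non-blocking consumption, the sketch below drains whatever champions have arrived without ever waiting for more. It uses Python's standard `queue.Queue` as a stand-in for the inbox; it does not use or represent the AXL or Yggdrasil APIs, and `drain_inbox` and its `limit` parameter are hypothetical.

```python
import queue
from typing import List

def drain_inbox(inbox: "queue.Queue[str]", limit: int = 64) -> List[str]:
    """Collect champions that have already arrived, without blocking.

    Returns immediately with an empty list if nothing is waiting, so a fast
    node never stalls on a slow peer.
    """
    received: List[str] = []
    for _ in range(limit):
        try:
            received.append(inbox.get_nowait())
        except queue.Empty:
            break
    return received

inbox: "queue.Queue[str]" = queue.Queue()
nothing_yet = drain_inbox(inbox)          # no peers have published so far
inbox.put("champion from node B")
inbox.put("champion from node C")
arrived = drain_inbox(inbox)
```

A node would call something like this between rounds, so the cost of a slow peer is only that its champions show up later.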
Results
We compare three settings under a fixed total LLM-call budget: a single-node DRQ baseline, homogeneous four-node ensembles where every node uses the same model, and a diverse four-node ensemble using GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5.
At the individual-node level, the diverse ensemble achieves the highest peak generality for all four models. For example, Claude Sonnet 4.6 reaches 0.850 ± 0.087 peak generality in the diverse ensemble, compared with 0.825 ± 0.106 in the homogeneous ensemble and 0.775 ± 0.035 in solo DRQ. Claude Haiku 4.5 improves from 0.538 ± 0.063 in the homogeneous ensemble to 0.700 ± 0.050 in the diverse ensemble.
DEI also improves niche novelty. Since solo runs do not receive peer champions, this metric applies only to ensemble settings. Moving from homogeneous to diverse ensembles substantially increases the fraction of received champions that land in previously empty archive cells, suggesting that different models contribute genuinely different behavioral discoveries.
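Niche novelty, as described above, is a simple ratio, and a small counter is enough to track it. The class below is a hypothetical sketch: the name `NoveltyCounter` and the choice to report 0.0 when nothing has been received are assumptions, not details from the paper.

```python
from dataclasses import dataclass

@dataclass
class NoveltyCounter:
    """Tracks the fraction of received champions that landed in empty cells."""
    received: int = 0
    novel: int = 0

    def record(self, landed_in_empty_cell: bool) -> None:
        self.received += 1
        if landed_in_empty_cell:
            self.novel += 1

    @property
    def niche_novelty(self) -> float:
        # Assumption: report 0.0 when no champions have been received.
        return self.novel / self.received if self.received else 0.0

counter = NoveltyCounter()
for landed_in_empty_cell in (True, False, True, True):
    counter.record(landed_in_empty_cell)
```

In this toy run, three of the four received champions filled a previously empty cell, so `counter.niche_novelty` is 0.75.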
The merged-archive results give the clearest picture. At equal compute, the diverse merged archive reaches 80.6% coverage and 45.90 QD-Score, compared with 63.0% coverage and 20.46 QD-Score for solo DRQ. The homogeneous merged archive reaches 59.0% coverage and 29.85 QD-Score. The diverse ensemble therefore produces the broadest archive and the highest final QD-Score.
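For readers unfamiliar with these metrics, the sketch below shows one common way to compute them. It assumes that archives are merged by keeping the fittest elite per cell, that coverage is the fraction of filled cells, and that QD-Score is the sum of elite fitness over filled cells, which is the usual definition in the Quality-Diversity literature; the paper may normalize or merge differently. All function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Cell = Tuple[int, int]
Archive = Dict[Cell, Tuple[float, str]]  # cell -> (fitness, warrior source)

def merge_archives(archives: Iterable[Archive]) -> Archive:
    """Assumed merge rule: keep the fittest elite found for each cell."""
    merged: Archive = {}
    for archive in archives:
        for cell, (fitness, warrior) in archive.items():
            if cell not in merged or fitness > merged[cell][0]:
                merged[cell] = (fitness, warrior)
    return merged

def coverage(archive: Archive, total_cells: int) -> float:
    """Fraction of grid cells that contain an elite."""
    return len(archive) / total_cells

def qd_score(archive: Archive) -> float:
    """Sum of elite fitness over all filled cells."""
    return sum(fitness for fitness, _ in archive.values())

# Toy archives from two nodes on a 2x2 grid (values are illustrative only).
node_a: Archive = {(0, 0): (0.5, "a1"), (0, 1): (0.25, "a2")}
node_b: Archive = {(0, 0): (1.0, "b1")}
merged = merge_archives([node_a, node_b])
merged_coverage = coverage(merged, total_cells=4)
merged_qd = qd_score(merged)
```

Under these definitions, a merged archive gains coverage only when nodes fill different cells, which is why ensembles whose members explore different niches tend to score higher.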
Impact
DEI shows that distributed LLM search should be understood as parallel cognition, not just parallel computation. Homogeneous ensembles can make search faster, but heterogeneous ensembles make search broader.
This is especially important for decentralized search. Useful collaboration should not require every participant to run the same model, use the same hardware, or move at the same speed. DEI shows that heterogeneous nodes can contribute asynchronously, share compact evolutionary discoveries, and turn model diversity into a first-class source of exploration.
At a fixed LLM-call budget, DEI improves individual-node generality, increases cross-node niche novelty, and produces a merged archive with higher coverage and QD-Score than both solo DRQ and homogeneous ensembles.