SAPO, Efficient LM Post-Training with Collective RL

Our new paper introduces SAPO, a meta-algorithm that wraps around your preferred policy-gradient algorithm: each model generates rollouts on a local batch of data, shares them with a swarm, samples rollouts from other models in the swarm, updates, and repeats.
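To make the loop concrete, here is a minimal Python sketch of one SAPO round on a single node. The names (SwarmBus, sapo_round, generate_rollouts, policy_gradient_update) and the in-memory swarm are illustrative assumptions, not the implementation from the paper or RL Swarm; you would plug in your own rollout generator and your preferred policy-gradient update (e.g. GRPO or PPO).

    import random

    class SwarmBus:
        """Toy in-memory stand-in for the swarm's rollout exchange (assumed name)."""
        def __init__(self):
            self.pool = []

        def publish(self, node_id, rollouts):
            # Store decoded rollouts tagged with the node that produced them.
            self.pool.extend((node_id, r) for r in rollouts)

        def sample_external(self, node_id, k):
            # Return up to k rollouts generated by *other* nodes.
            others = [r for owner, r in self.pool if owner != node_id]
            return random.sample(others, min(k, len(others)))

    def sapo_round(node_id, model, local_batch, swarm,
                   generate_rollouts, policy_gradient_update,
                   n_local=4, n_external=4):
        """One SAPO round on a single node (sketch)."""
        # 1. Generate rollouts on the node's own batch of questions.
        local = generate_rollouts(model, local_batch)
        # 2. Share the decoded rollouts with the rest of the swarm.
        swarm.publish(node_id, local)
        # 3. Sample rollouts produced by other models in the swarm.
        external = swarm.sample_external(node_id, n_external)
        # 4. Update with your preferred policy-gradient method on the mixed set.
        train_set = random.sample(local, min(n_local, len(local))) + external
        return policy_gradient_update(model, train_set)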

Key Highlights:
- Models trained with SAPO show a ~94% improvement in cumulative reward over models trained in isolation.
- With SAPO, models can train faster while using less compute per node.

Post-training language models with reinforcement learning (RL) has emerged as one of the most effective ways to push their reasoning capabilities beyond what supervised learning alone can achieve. But there’s a problem: scaling RL is expensive and fragile, and it often requires carefully engineered infrastructure.


Introducing SAPO: Swarm sAmpling Policy Optimization

In our latest research, we introduce Swarm sAmpling Policy Optimization (SAPO) - a fully decentralised and asynchronous RL post-training algorithm.

Instead of relying on centralised GPU clusters, SAPO is designed for swarms of heterogeneous nodes. Each node trains its own model and “shares” its generated rollouts with the rest of the network. These rollouts are lightweight and model-agnostic: they are decoded text rather than gradients, so any device, from a high-end server to a consumer laptop, can participate with minimal overhead.
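For illustration, a shared rollout could be serialised as something like the snippet below: plain decoded text plus a little task metadata, with no gradients or weights attached. The field names and values are assumptions for this sketch, not the actual wire format used by RL Swarm.

    import json

    # Hypothetical shared-rollout payload: decoded text plus task metadata.
    shared_rollout = {
        "task": "reasoning_gym/propositional_logic",  # assumed task identifier
        "question": "If A implies B and A is true, what follows?",
        "completion": "B must be true.",              # decoded text from the generating model
        "node_id": "laptop-07",                       # any device can contribute
    }

    payload = json.dumps(shared_rollout)
    print(f"{len(payload)} bytes")  # a few hundred bytes -- cheap to broadcast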

This simple mechanism creates a powerful network effect: breakthroughs on one node (“Aha moments”) can propagate across the swarm, accelerating the collective learning process.


Why It Matters

  • Efficiency without fragility - SAPO avoids the bottlenecks of traditional distributed RL while still improving performance.
  • Open, collaborative training - anyone can contribute, no matter their hardware or model preference.
  • Collective intelligence - nodes learn not just from their own experience but from the shared experiences of thousands of others.

Results from Our Experiments

We tested SAPO in two settings:

1. Controlled Experiments

  • Eight Qwen2.5 models (0.5B parameters) were trained on ReasoningGYM, a benchmark suite of algebra, logic, and reasoning tasks.
  • The best configuration - 4 local rollouts / 4 external rollouts - achieved a 94% improvement in cumulative reward over the baseline with no sharing (see the sketch after this list).
  • Relying too heavily on external rollouts (e.g. 2 local / 6 external) caused instability, showing that balance is key.
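In terms of the sapo_round sketch earlier in this post, these settings correspond to different local/external splits. The baseline’s count of 8 local rollouts is an assumption made here to keep the total per round constant; only the splits named above come from our results.

    # Rollout mixes from the controlled runs, as arguments to the sapo_round sketch.
    baseline   = dict(n_local=8, n_external=0)   # no sharing (local count assumed)
    best       = dict(n_local=4, n_external=4)   # ~94% higher cumulative reward
    too_shared = dict(n_local=2, n_external=6)   # became unstable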

2. Open-Source Demo

  • Thousands of Gensyn community members contributed compute to a live swarm.
  • Mid-capacity models consistently performed better when participating in the swarm than when trained in isolation.
  • Stronger models saw only marginal gains, suggesting that improvements in rollout filtering and sampling could extend the benefits even further.

Looking Ahead

SAPO shows that experience sharing can be a core advantage in AI post-training. With it, models learn faster, communities contribute directly to progress, and the cost and fragility of scaling are reduced.

Our next steps include:

  • Testing swarms with more heterogeneous models and specialised tasks.
  • Exploring adaptive sampling strategies and reward-guided sharing.
  • Extending SAPO beyond text - into multimodal domains like images, where models could even share and absorb “aesthetic” preferences.

Why This Is Important for AI

By making reinforcement learning collective, SAPO represents a new paradigm: one where decentralised communities of models (and people) can teach each other. It’s a scalable, open, and practical path to more powerful reasoning capabilities.

And it’s only possible through the contributions of our community.


Read the full research paper here:
https://arxiv.org/abs/2509.08721

Contribute to RL Swarm and see collective learning in action: https://github.com/gensyn-ai/rl-swarm


Get Involved:
- Discord
- X
- LinkedIn