Orca-Agent-RL
- Trained a Qwen3-14B orchestrator model with RL to better coordinate explorer and coder subagents, achieving a 167% relative improvement (7% → 18.25%) on Stanford's TerminalBench, within striking distance of Qwen3-Coder-480B (19.7%); the orchestrator/subagent split is sketched below.
- Scaled RL training to 32x H100s across 4 bare-metal nodes, with 256 Docker environments rolling out in parallel (see the rollout sketch below), achieving stable training with smoothly decreasing entropy.
- Discovered that a simple reward design using only unit-test results outperformed every hand-crafted reward signal; the crafted rewards consistently led to policy collapse during training (minimal reward sketch below).
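
A minimal sketch of the orchestrator/subagent pattern from the first bullet. The routing rule, message format, and subagent names here are illustrative assumptions; in the project the routing decision comes from the RL-trained Qwen3-14B policy, not a hand-written rule.

```python
# Illustrative sketch only: the real orchestrator is an RL-trained LLM policy,
# not the hand-written routing rule shown here.
from dataclasses import dataclass, field


@dataclass
class Orchestrator:
    """Decides which subagent acts next and what instruction it receives."""
    history: list[str] = field(default_factory=list)  # transcript of prior subagent reports

    def route(self, observation: str) -> tuple[str, str]:
        self.history.append(observation)
        # Placeholder decision rule; the trained policy conditions on the full history.
        if "traceback" in observation.lower() or not self.history[:-1]:
            return "explorer", "Survey the repository and summarise the failing behaviour."
        return "coder", "Make the smallest edit that should fix the reported failure."
```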
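
A hedged sketch of the parallel-rollout setup from the second bullet: many isolated Docker containers are stepped concurrently so rollout collection keeps the GPUs busy. The image tag, `run_episode` stub, and thread-pool approach are assumptions for illustration, not the project's actual infrastructure.

```python
# Sketch, not the project's infrastructure: the image tag and run_episode stub
# are placeholders.
import subprocess
from concurrent.futures import ThreadPoolExecutor


def start_env(image: str = "terminal-task:latest") -> str:
    """Start one isolated task container and return its container id."""
    out = subprocess.run(
        ["docker", "run", "-d", "--rm", image, "sleep", "infinity"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.strip()


def run_episode(container_id: str) -> dict:
    """Placeholder rollout: the orchestrator policy would act inside this container."""
    return {"container": container_id, "reward": 0.0}


def collect_rollouts(num_envs: int = 256) -> list[dict]:
    """Roll out num_envs episodes concurrently, one Docker container each."""
    containers = [start_env() for _ in range(num_envs)]
    try:
        with ThreadPoolExecutor(max_workers=num_envs) as pool:
            return list(pool.map(run_episode, containers))
    finally:
        for cid in containers:
            subprocess.run(["docker", "kill", cid], capture_output=True)
```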
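
A minimal sketch of the unit-test-only reward from the last bullet: the episode's reward is simply whether the task's test suite passes inside its container, with no shaping terms. The test command and `docker exec` invocation are assumptions about the setup.

```python
# Assumed setup: each task container can run its test suite via `pytest`.
import subprocess


def unit_test_reward(container_id: str, test_cmd: str = "pytest -q") -> float:
    """Binary reward: 1.0 if the task's unit tests pass inside the container, else 0.0."""
    result = subprocess.run(
        ["docker", "exec", container_id, "bash", "-lc", test_cmd],
        capture_output=True,
        text=True,
        timeout=600,
    )
    # No shaping, partial credit, or auxiliary signals: just the test outcome.
    return 1.0 if result.returncode == 0 else 0.0
```

Keeping the reward this sparse is what avoided the policy-collapse failure mode that the hand-crafted signals repeatedly triggered.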