Nous Research Releases NousCoder-14B: A Competitive Olympiad Programming Model Post-Trained on Qwen3-14B via Reinforcement Learning


Nous Research has released NousCoder-14B, a competitive olympiad programming model post-trained on Qwen3-14B using reinforcement learning (RL) with verifiable rewards. On the LiveCodeBench v6 benchmark, which covers problems from 08/01/2024 to 05/01/2025, the model reaches a Pass@1 accuracy of 67.87 percent. That is 7.08 percentage points higher than the Qwen3-14B baseline of 60.79 percent on the same benchmark. The research team trained the model on 24k verifiable coding problems using 48 B200 GPUs over 4 days, and released the weights under the Apache 2.0 license on Hugging Face.

https://nousresearch.com/nouscoder-14b-a-competitive-olympiad-programming-model/

Benchmark focus and what Pass@1 means

LiveCodeBench v6 is designed for competitive programming evaluation. The test split used here contains 454 problems. The training set uses the same recipe as the DeepCoder-14B project from Agentica and Together AI. It combines problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and LiveCodeBench problems created before 07/31/2024.

The benchmark only includes competitive programming style tasks. For each problem, a solution must respect strict time and memory limits and must pass a large set of hidden input-output tests. Pass@1 is the fraction of problems where the first generated program passes all tests, including time and memory constraints.
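As a rough illustration, Pass@1 under this protocol reduces to a simple fraction. In the sketch below, `generate` and `run_all_tests` are hypothetical stand-ins for the model's sampler and a verifier that enforces the time and memory limits; neither is published code from the post.

```python
def pass_at_1(problems, generate, run_all_tests):
    """Fraction of problems whose first generated program passes every test.

    `generate` produces one program per problem, and `run_all_tests` is a
    hypothetical verifier returning True only if the program passes all
    hidden tests within the time and memory limits.
    """
    solved = sum(run_all_tests(p, generate(p)) for p in problems)
    return solved / len(problems)
```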

Dataset construction for execution-based RL

All datasets used for training are composed of verifiable code generation problems. Each problem has a reference implementation and many test cases. The training set contains 24k problems drawn from:

  • TACO Verified
  • PrimeIntellect SYNTHETIC-1
  • LiveCodeBench problems created before 07/31/2024

The test set is LiveCodeBench v6, which has 454 problems between 08/01/2024 and 05/01/2025.

Every problem is a complete competitive programming task with a description, input format, output format, and test cases. This setup is important for RL because it gives a binary reward signal that is cheap to compute once the code has run.

RL environment with Atropos and Modal

The RL environment is built using the Atropos framework. NousCoder-14B is prompted using the standard LiveCodeBench prompt format, and it generates Python code for each problem. Each rollout receives a scalar reward that depends on test case outcomes (a minimal sketch of this reward follows the list):

  • Reward 1 when the generated code passes all test cases for that problem
  • Reward −1 when the code outputs a wrong answer, exceeds a 15 second time limit, or exceeds a 4 GB memory limit on any test case
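A minimal sketch of that binary reward, assuming a hypothetical `run_case` helper that reports per-test outcomes from the sandbox:

```python
def rollout_reward(program, test_cases, run_case):
    """Binary verifiable reward as described in the post.

    `run_case` is a hypothetical sandbox call returning one of
    "pass", "wrong_answer", "timeout" (15 s limit), or "oom" (4 GB limit).
    """
    for case in test_cases:
        if run_case(program, case) != "pass":
            return -1.0  # any wrong answer or resource breach fails the rollout
    return 1.0  # passed every test case
```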

To execute untrusted code safely and at scale, the team uses Modal as an autoscaled sandbox. In the main design the research team describes, the system launches one Modal container per rollout, and each container runs all test cases for that rollout. This avoids mixing training compute with verification compute and keeps the RL loop stable.

The research team also pipelines inference and verification. When an inference worker finishes a generation, it sends the completion to a Modal verifier and immediately starts a new generation. With many inference workers and a fixed pool of Modal containers, this design keeps the training loop inference-compute bound instead of verification bound.
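The pipelining pattern could look roughly like the asyncio sketch below. `generate_one` and `verify_on_modal` are hypothetical stand-ins for the inference engine and the Modal verifier pool; the post does not publish this part of the code.

```python
import asyncio

async def inference_worker(problems: asyncio.Queue, results: asyncio.Queue,
                           generate_one, verify_on_modal):
    """Generate, hand off to verification without waiting, and immediately
    start the next generation so the loop stays inference-bound (sketch)."""
    pending = []
    while not problems.empty():
        problem = await problems.get()
        completion = await generate_one(problem)
        # Fire off verification as a background task, then move on.
        pending.append(asyncio.create_task(verify_on_modal(problem, completion)))
    for reward in await asyncio.gather(*pending):
        await results.put(reward)
```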

The team discusses 3 verification parallelization strategies: one container per problem, one per rollout, and one per test case. They ultimately avoid the per-test-case setting because of container launch overhead, and instead use an approach where each container evaluates many test cases and focuses on a small set of the hardest test cases first. If any of those fail, the system can stop verification early.
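A hedged sketch of that early-exit ordering, where `hardness` and `run_case` are hypothetical helpers introduced only for illustration:

```python
def verify_rollout(program, test_cases, hardness, run_case):
    """Run the hardest test cases first and stop on the first failure.

    `hardness` maps a test case to a difficulty score; `run_case` executes
    one case in the sandbox and returns "pass" on success.
    """
    for case in sorted(test_cases, key=hardness, reverse=True):
        if run_case(program, case) != "pass":
            return False  # early exit skips the remaining, easier cases
    return True
```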

GRPO objectives: DAPO, GSPO, and GSPO+

NousCoder-14B uses Group Relative Policy Optimization (GRPO), which does not require a separate value model. On top of GRPO, the research team tests 3 objectives: Dynamic sAmpling Policy Optimization (DAPO), Group Sequence Policy Optimization (GSPO), and a modified GSPO variant called GSPO+.

All 3 objectives share the same definition of advantage: the advantage for each rollout is its reward normalized by the mean and standard deviation of rewards inside the group (see the sketch after this list). DAPO applies importance weighting and clipping at the token level, and introduces three key modifications relative to GRPO:

  • A clip-higher rule that increases exploration for low-probability tokens
  • A token-level policy gradient loss that gives every token equal weight
  • Dynamic sampling, where groups that are all correct or all incorrect are dropped because they carry zero advantage
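The shared advantage definition and DAPO-style dynamic sampling might look like this sketch (the small epsilon for numerical stability is our assumption, not a published detail):

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each rollout's reward by the group
    mean and standard deviation."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def keep_group(rewards):
    """DAPO dynamic sampling: drop groups that are all correct or all
    incorrect, since every advantage in them would be zero."""
    return len(set(rewards)) > 1
```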

GSPO moves the importance weighting to the sequence level. It defines a sequence importance ratio that aggregates token ratios over the whole program. GSPO+ keeps the sequence-level correction, but rescales gradients so that tokens are weighted equally regardless of sequence length.
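In the published GSPO formulation, the sequence-level ratio is the length-normalized product of token-level ratios, i.e. their geometric mean; a minimal sketch under that reading:

```python
import numpy as np

def sequence_importance_ratio(logp_new, logp_old):
    """GSPO-style sequence ratio: geometric mean of token-level ratios,
    computed as exp(mean of per-token log-probability differences).

    `logp_new` and `logp_old` are per-token log-probs of the same sampled
    program under the current and behavior policies.
    """
    diffs = np.asarray(logp_new) - np.asarray(logp_old)
    return float(np.exp(diffs.mean()))
```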

On LiveCodeBench v6, the differences between these objectives are modest. At a context length of 81,920 tokens, DAPO reaches a Pass@1 of 67.87 percent, while GSPO and GSPO+ reach 66.26 percent and 66.52 percent. At 40,960 tokens, all 3 objectives cluster around 63 percent Pass@1.

Iterative context extension and overlong filtering

Qwen3-14B supports long context, and the training follows an iterative context extension schedule. The team first trains the model with a 32k context window and then continues training at the maximum Qwen3-14B context window of 40k. At each stage they select the checkpoint with the best LiveCodeBench score at 40k context, and then use YaRN context extension at evaluation time to reach 80k tokens, that is 81,920 tokens.
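On Qwen-family models, YaRN is usually enabled at evaluation time through the `rope_scaling` entry of the model config. The post does not publish the exact scaling setup, so the repo id and the factor of 2.0 below (chosen so 40,960 trained tokens scale to the 81,920-token evaluation window) are assumptions for illustration:

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Assumed Hugging Face repo id and YaRN parameters; not published in the post.
config = AutoConfig.from_pretrained("NousResearch/NousCoder-14B")
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 2.0,  # assumption: 40,960 * 2.0 = 81,920 evaluation tokens
    "original_max_position_embeddings": 40960,
}
model = AutoModelForCausalLM.from_pretrained(
    "NousResearch/NousCoder-14B", config=config, torch_dtype="auto"
)
```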

A key trick is overlong filtering. When a generated program exceeds the maximum context window, they reset its advantage to zero. This removes that rollout from the gradient signal rather than penalizing it. The research team reports that this approach avoids pushing the model toward shorter solutions for purely optimization reasons, and helps maintain quality when they scale context length at test time.
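Overlong filtering reduces to a mask on the advantages; a hedged sketch, assuming truncation is flagged per rollout:

```python
def filter_overlong(advantages, truncated):
    """Zero the advantage of any rollout that hit the context limit, so it
    drops out of the gradient instead of being penalized (sketch)."""
    return [0.0 if was_truncated else adv
            for adv, was_truncated in zip(advantages, truncated)]
```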

Key Takeaways

  • NousCoder-14B is a Qwen3-14B based competitive programming model trained with execution-based RL. It reaches 67.87 percent Pass@1 on LiveCodeBench v6, a 7.08 percentage point gain over the Qwen3-14B baseline of 60.79 percent on the same benchmark.
  • The model is trained on 24k verifiable coding problems from TACO Verified, PrimeIntellect SYNTHETIC-1, and pre-07/31/2024 LiveCodeBench tasks, and evaluated on a disjoint LiveCodeBench v6 test set of 454 problems from 08/01/2024 to 05/01/2025.
  • The RL setup uses Atropos, with Python solutions executed in sandboxed containers, a simple reward of 1 for solving all test cases and −1 for any failure or resource limit breach, and a pipelined design where inference and verification run asynchronously.
  • Group Relative Policy Optimization objectives DAPO, GSPO, and GSPO+ are used for long context code RL; all operate on group-normalized rewards and show comparable performance, with DAPO reaching the best Pass@1 at the longest 81,920 token context.
  • The training uses iterative context extension, first at 32k then at 40k tokens, together with YaRN-based extension at evaluation time to 81,920 tokens, includes overlong rollout filtering for stability, and ships as a fully reproducible open stack with Apache 2.0 weights and RL pipeline code.

Check out the Model Weights and Technical details.
