Learning from Failure to Tackle Extremely Hard Problems – Machine Learning Blog | ML@CMU

This blog post is based on the work BaNEL: Exploration Posteriors for Generative Modeling Using Only Negative Rewards.
Tackling Very Hard Problems
The ultimate objective of machine learning research is to push machines beyond human limits in important applications, including the next generation of theorem proving, algorithmic problem solving, and drug discovery. A typical recipe involves: (1) pre-training models on existing data to obtain base models, and then (2) post-training them using scalar reward signals that measure the quality or correctness of the generated samples.
However, for the hardest instances of these problems, we encounter two challenges:
- Sparsity: the base generative model attains a near-zero reward signal. The probability of producing a positive-reward sample can be so low that the model may go through most of training without ever encountering a positive reward.
- Expensive reward evaluation: calls to the reward oracle can be costly or risky, requiring expensive simulations, computations, or even physical experiments.

For example, when asked to design a cure for cancer, GPT-5 fails. If asked again, will it succeed? Probably not. How many attempts would it take? We expect the success probability to be nonzero (since GPT-5, being an autoregressive generative model, never assigns exactly zero probability to any finite sequence), but at best, it is vanishingly small. Worse still, evaluating the solution is expensive and risky, since it requires conducting actual clinical trials.
A more general example of hard problem-solving is designing molecules with specific properties (e.g., high activity against a specific protein target), which also suffers from the aforementioned two issues: (1) a base generative model is unlikely to generate highly potent molecules against a particular protein target, and (2) ground-truth verification of potency requires actual wet-lab experiments.
These examples illustrate a broader concern: the hardest and most important problems are those with near-zero success rates, and hence no positive examples available during learning. To handle these scenarios, we introduce BaNEL (Bayesian Negative Evidence Learning), an algorithm that post-trains the generative model using failed attempts only, while minimizing the number of reward evaluations (NREs).
Under such extreme reward sparsity, standard post-training methods like policy gradients (including GRPO) collapse into brute-force random search, since zero rewards produce zero gradients. Novelty-bonus methods, such as count-based exploration or random network distillation, can provide learning signals under sparsity, but they require large NREs and fall short in performance. The following table summarizes our analysis of these methods.

Learning from Negative Rewards
The zero-reward problem has historically been addressed using positive transfer from other tasks or domains, hand-designing curricula, and/or engineering more informative and dense reward functions. However, we argue that there will always be tasks and settings where the base model attains an extremely sparse reward. If we cannot address this fundamental obstacle, post-training will be limited to distribution sharpening rather than unlocking genuinely new capabilities beyond the training data.
To address the zero-reward problem, algorithms should be able to learn from failures alone, using only negative-reward samples, while minimizing the number of reward evaluations (NREs). There is a simple (if impractical) way to see that learning from negative samples alone is at least theoretically possible.
Don't make the same mistake twice! If our budget for evaluating \(r\) were unlimited, and assuming solutions have bounded length, we could trivially achieve a perfect success rate by collecting every possible mistake \(R := \lbrace \mathbf{x} \mid r(\mathbf{x}) = 0 \rbrace\) and avoiding all elements of \(R\):
$$p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin R],$$
where \(p_{\boldsymbol{\theta}}\) is the pre-trained generative model (e.g., GPT-5), and \(p_{\boldsymbol{\theta} \mid R^C}(\mathbf{x})\) denotes conditioning the model on the complement of \(R\) by multiplying by the indicator function. In plain terms, this formulation says: once you have seen every possible failure, you will never make a new one.
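As a sanity check of this idea, here is a minimal sketch (all names hypothetical; a toy discrete sampler stands in for \(p_{\boldsymbol{\theta}}\)) that samples from the posterior by explicitly rejecting every known failure:

```python
import random

def sample_avoiding_failures(sample_fn, failure_set, max_tries=10_000):
    """Rejection-sample from p(x | x not in R): draw from the base
    model and discard anything in the known-failure blocklist."""
    for _ in range(max_tries):
        x = sample_fn()
        if x not in failure_set:
            return x
    raise RuntimeError("no sample outside the failure set found")

# Toy base "model": uniform over four candidate strings.
candidates = ["aa", "ab", "ba", "bb"]
sample_fn = lambda: random.choice(candidates)

# Suppose every candidate except "bb" is a recorded failure.
R = {"aa", "ab", "ba"}
print(sample_avoiding_failures(sample_fn, R))  # always "bb"
```

With the full failure set in hand, the only mass left in the posterior is on successes, which is exactly why the success rate becomes perfect (and exactly why this is infeasible: enumerating \(R\) costs combinatorially many reward evaluations).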
Exploiting the structure underlying failures. Of course, this approach is infeasible because the space of failures is combinatorial, and we want to minimize NREs. But crucially, in most tasks where success requires intelligence, failures are not arbitrary. They contain patterns that distinguish the failed attempts from successes. If we can learn these patterns, we can approximate \(R\) using a small number of samples. This failure-based approach parallels how human scientists reason: they generalize from failures, avoiding past mistakes without discarding promising directions. To minimize NREs, the algorithm must extract as much information as possible from failures before making new attempts.

Learning a Generative Model of Failures
Our core idea is to model the regularities underlying failures using a separate generative model \(p_\phi\) trained solely on failed attempts. Generative modeling is a powerful unsupervised way to learn structure from data, and it scales extremely well! Specifically, we train a separate generative model \(p_\phi\) (parameterized by \(\phi\)) on \(m\) negative examples with the standard maximum likelihood objective:
$$\max_{\boldsymbol{\phi}} \frac{1}{m} \sum_{i=1}^m \log p_{\boldsymbol{\phi}}(\mathbf{x}_i).$$
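To make the objective concrete, here is a deliberately tiny sketch in which the failure model is a categorical distribution over a discrete space; in that special case the maximum-likelihood solution is exactly the empirical frequencies (in the paper's setting, \(p_\phi\) would instead be a neural autoregressive model trained with cross-entropy, and all names below are hypothetical):

```python
from collections import Counter

def fit_mle_categorical(negatives):
    """Maximum-likelihood fit of a categorical p_phi on negative
    examples: the MLE is just the empirical frequency of each item."""
    counts = Counter(negatives)
    m = len(negatives)
    return {x: c / m for x, c in counts.items()}

# Four observed failed attempts; "ab" failed twice.
negatives = ["ab", "ab", "ba", "aa"]
p_phi = fit_mle_categorical(negatives)
print(p_phi["ab"])  # 0.5
```

The same objective, scaled up, lets a neural \(p_\phi\) assign high likelihood to inputs that share structure with past failures, which is the signal the next step exploits.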
Once well-trained, \(p_\phi(\mathbf{x})\) can be used to assess whether a given input resembles previously observed failures; specifically, we use \(p_\phi\) to define a rejection region \(\tilde{R}\) approximating \(R\):
$$\tilde{R} := \left\lbrace \mathbf{x} : \frac{p_{\boldsymbol{\theta}}(\mathbf{x})}{p_{\boldsymbol{\phi}}(\mathbf{x})} < \tau \right\rbrace$$
where \(\tau\) is a threshold value. Note that this requires \(p_{\boldsymbol{\theta}}\) and \(p_\phi\) to be likelihood-based generative models under which we can compute likelihoods (e.g., autoregressive models). Using the rejection region \(\tilde{R}\), we form a Bayesian posterior \(\tilde{p}_{\boldsymbol{\theta}}\) to approximate \(p_{\boldsymbol{\theta} \mid R^C}\):
$$p_{\boldsymbol{\theta} \mid \tilde{R}^C}(\mathbf{x}) \propto p_{\boldsymbol{\theta}}(\mathbf{x}) \, \mathbf{1}[\mathbf{x} \notin \tilde{R}].$$
This posterior filters out data points that are similar to prior failures according to \(\tilde{R}\); equivalently, we direct the model to sample only from \(\tilde{R}^C\).
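In practice, membership in \(\tilde{R}\) is just a log-likelihood-ratio test between the base model and the failure model. A minimal sketch, with toy hand-specified likelihoods assumed purely for illustration:

```python
import math

def in_rejection_region(x, log_p_theta, log_p_phi, tau):
    """x is rejected when p_theta(x) / p_phi(x) < tau, i.e. when the
    failure model p_phi assigns x comparatively high likelihood."""
    log_ratio = log_p_theta(x) - log_p_phi(x)
    return log_ratio < math.log(tau)

# Toy likelihoods: the base model is uniform; the failure model has
# learned that "aa"-like inputs are typical failures.
log_p_theta = lambda x: math.log({"aa": 0.25, "ab": 0.25, "ba": 0.25, "bb": 0.25}[x])
log_p_phi   = lambda x: math.log({"aa": 0.40, "ab": 0.40, "ba": 0.15, "bb": 0.05}[x])

print(in_rejection_region("aa", log_p_theta, log_p_phi, tau=1.0))  # True: resembles past failures
print(in_rejection_region("bb", log_p_theta, log_p_phi, tau=1.0))  # False: unlike past failures
```

Working in log space avoids underflow for long sequences, where per-token log-probabilities simply sum under an autoregressive model.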
Online Recursive Update
Once we improve the generative model using the Bayesian update described above, we can use it to gather another batch of \(m\) samples. Here, rejection regions from previous rounds can be accumulated by taking their union (i.e., \(\tilde{R} \gets \tilde{R} \cup \tilde{R}_{\text{new}}\), where \(\tilde{R}_{\text{new}}\) is the new rejection region). This can be repeated multiple times, as illustrated in the figure below. We call this method BaNEL: Bayesian Negative Evidence Learning, an approach that uses Bayesian updates to learn from negative samples only.
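The outer loop can be sketched as follows. This is a simplified toy instantiation, not the paper's implementation: all names are hypothetical, the rejection region is a plain set union, and the "failure model" simply memorizes failures rather than generalizing from them.

```python
import random

def run_banel(base_sample, reward_fn, build_rejection, rounds=3, m=8):
    """Sketch of the BaNEL outer loop: each round samples from the
    posterior (base model restricted to the complement of the current
    rejection region), evaluates rewards, and unions in a rejection
    region built from that round's failures."""
    rejected = set()   # accumulated rejection region (union over rounds)
    successes = []
    for _ in range(rounds):
        batch = []
        while len(batch) < m:        # rejection-sample from the posterior
            x = base_sample()
            if x not in rejected:
                batch.append(x)
        failures  = [x for x in batch if reward_fn(x) == 0]
        successes += [x for x in batch if reward_fn(x) == 1]
        rejected |= build_rejection(failures)   # R~ <- R~ union R~_new
    return successes, rejected

# Toy run: four candidates, reward 1 only for "bb".
random.seed(0)
space = ["aa", "ab", "ba", "bb"]
succ, rej = run_banel(lambda: random.choice(space),
                      lambda x: int(x == "bb"),
                      lambda fails: set(fails))
print("bb" in rej)  # False: successes are never added to the rejection region
```

In the real algorithm the expensive step is refitting \(p_\phi\) each round; the reward oracle is only queried \(m\) times per round, which is what keeps NREs low.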

Experiment: Adversarial Attack on a Toy Language Model

We first evaluate BaNEL in a toy but informative setting where high-reward samples are rare and hand-engineering dense rewards is difficult. In this task, the goal is to attack the target model, an autoregressive transformer trained to answer digit-addition queries (e.g., it receives `10+23=` and must generate `33`). The goal of the attacker model, also an autoregressive transformer pre-trained on the same dataset to generate questions such as `10+23=`, is to propose syntactically valid addition queries on which the target model produces an incorrect sum.
That is, the reward is defined as:
- \(r(\mathbf{x}) = 1\) if \(\mathbf{x}\) is a syntactically valid arithmetic expression and the target's output is incorrect,
- \(r(\mathbf{x}) = 0\) otherwise.
Because the target is well trained, the pre-trained attacker's empirical success rate is roughly 0.0004. We set a hard limit on NREs: \(r\) can be evaluated at most 7,500 times. All reward-1 samples are filtered out during training, forcing the model to learn from failures alone.
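A reward of this shape is straightforward to implement. The sketch below is a hypothetical re-implementation for illustration; the stand-in "target model" deliberately mishandles leading zeros, mimicking the failure mode discussed next.

```python
import re

def addition_attack_reward(query, target_answer):
    """1 if `query` is a syntactically valid addition question and the
    target model answered it incorrectly, 0 otherwise."""
    m = re.fullmatch(r"(\d+)\+(\d+)=", query)
    if m is None:
        return 0  # syntactically invalid query: no reward
    correct = str(int(m.group(1)) + int(m.group(2)))
    return int(target_answer != correct)

# Stand-in target model that breaks on leading zeros (illustrative only).
def toy_target(query):
    a, b = query.rstrip("=").split("+")
    return "0" if a.startswith("0") or b.startswith("0") else str(int(a) + int(b))

print(addition_attack_reward("10+23=", toy_target("10+23=")))  # 0: target answers correctly
print(addition_attack_reward("07+23=", toy_target("07+23=")))  # 1: leading zero fools the target
```

Note that the attacker only ever observes these 0/1 values, and during training even the 1s are withheld.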

As shown in this table, BaNEL improves the success rate by 278x on average, outperforming baselines by several orders of magnitude.

BaNEL identifies two failure modes of the target:
- Leading zeros: when at least one of the input numbers begins with at least one zero, the output tends to be incorrect. This is likely because the training data (shared by both the target and the attacker) does not contain any examples with leading zeros.
- Carry-chain stressors: examples that require carrying a digit during summation.
Based on these identified patterns, we designed a rule-based attack and observed that it achieves a near-perfect success rate. This suggests that BaNEL can be used not only to increase a numeric success rate, but also to guide human intuition on hard problems and extract qualitative insights.

We also study compute scaling (here, we disallow leading-zero attacks to make the problem even more challenging). When the negative generative model \(p_\phi\) is under-trained (few epochs), BaNEL performs on par with simpler novelty-bonus baselines (RND and pseudo-count methods). However, as we spend more compute on \(p_\phi\) (without additional NREs), BaNEL outperforms these methods by a large margin.
This highlights a key property: BaNEL trades compute for reward efficiency. It is suboptimal under strict compute limits but excels when more offline computation is available.
Experiment: Language Model Reasoning
We further evaluate BaNEL on reasoning tasks using GSM8K subsets, where the pre-trained Qwen 2.5 0.5B model (further fine-tuned on GSM8K using PPO) performs poorly. Again, all reward-1 samples are filtered out during training.

For most problems, BaNEL significantly improves success rates over the pre-trained baseline, outperforming RND with fewer reward evaluations.
Closing Remarks
By modeling failures with a generative model, BaNEL turns negative evidence into a learning signal, enabling exploration in settings where reward-1 samples are nearly nonexistent. We view BaNEL as an important direction for the generative modeling field: to truly push the frontier of generative model capabilities, we must learn from failures!
Check out our paper for more results and details!