Scaling laws for reward model overoptimization

In reinforcement learning from human feedback, it is common to optimize against a reward model trained to predict human preferences. Because the reward model is an imperfect proxy, optimizing its value too much can hinder ground truth performance, in accordance with Goodhart’s law. This effect has been frequently observed, but not carefully measured because of the expense of collecting human preference data. In this work, we use a synthetic setup in which a fixed “gold-standard” reward model plays the role of humans, providing labels used to train a proxy reward model. We study how the gold reward model score changes as we optimize against the proxy reward model using either reinforcement learning or best-of-n sampling. We find that this relationship follows a different functional form depending on the method of optimization, and that in both cases its coefficients scale smoothly with the number of reward model parameters. We also study the effect on this relationship of the size of the reward model dataset, the number of reward model and policy parameters, and the coefficient of the KL penalty added to the reward in the reinforcement learning setup. We explore the implications of these empirical results for theoretical considerations in AI alignment.
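To make the best-of-n part of this setup concrete, here is a minimal toy sketch in Python. It is not the authors' code: the Gaussian "gold" scores, the additive-noise "proxy" scores, and the function names `best_of_n_kl` and `simulate` are all assumptions chosen for illustration. It draws n candidates, picks the one with the highest proxy score, and reports the mean gold and proxy scores of the pick, alongside the commonly used analytic expression log n - (n - 1)/n for the KL divergence between the best-of-n policy and the base policy.

```python
import math
import random


def best_of_n_kl(n: int) -> float:
    """Commonly used analytic KL (in nats) between best-of-n and the base policy."""
    return math.log(n) - (n - 1) / n


def simulate(n: int, trials: int = 2000, noise: float = 1.0, seed: int = 0):
    """Toy best-of-n: select by proxy score, then report mean gold and proxy scores.

    Assumption (illustrative only): gold scores are N(0, 1) and the proxy
    is the gold score plus independent N(0, noise^2) error.
    """
    rng = random.Random(seed)
    gold_total = 0.0
    proxy_total = 0.0
    for _ in range(trials):
        candidates = []
        for _ in range(n):
            gold = rng.gauss(0.0, 1.0)
            proxy = gold + rng.gauss(0.0, noise)
            candidates.append((gold, proxy))
        # Best-of-n keeps the candidate the proxy reward model likes most.
        chosen_gold, chosen_proxy = max(candidates, key=lambda c: c[1])
        gold_total += chosen_gold
        proxy_total += chosen_proxy
    return gold_total / trials, proxy_total / trials


if __name__ == "__main__":
    for n in (1, 4, 16, 64):
        gold, proxy = simulate(n)
        print(f"n={n:3d}  KL={best_of_n_kl(n):.3f}  gold={gold:.3f}  proxy={proxy:.3f}")
```

In this toy, the proxy score of the selected sample grows faster than its gold score as n increases, which illustrates the gap between proxy and ground truth. Because the noise here is independent and light-tailed, the sketch is not expected to reproduce the eventual decline in gold score or the functional forms measured in the work.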
