Our approach to alignment research

There’s currently no known indefinitely scalable solution to the alignment problem. As AI progress continues, we expect to encounter a number of new alignment problems that we don’t yet observe in current systems. Some of these problems we anticipate now, and some will be entirely new.

We believe that finding an indefinitely scalable solution is likely very difficult. Instead, we aim for a more pragmatic approach: building and aligning a system that can make faster and better alignment research progress than humans can.

As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.

We believe that evaluating alignment research is substantially easier than producing it, especially when provided with evaluation assistance. Therefore, human researchers will focus more and more of their effort on reviewing alignment research done by AI systems instead of generating this research themselves. Our goal is to train models to be so aligned that we can off-load almost all of the cognitive labor required for alignment research.

Importantly, we only need “narrower” AI systems that have human-level capabilities in the relevant domains to do as well as humans on alignment research. We expect these AI systems to be easier to align than general-purpose systems or systems much smarter than humans.

Language models are particularly well-suited for automating alignment research because they come “preloaded” with a lot of knowledge and information about human values from reading the internet. Out of the box, they aren’t independent agents and thus don’t pursue their own goals in the world. To do alignment research they don’t need unrestricted access to the internet. Yet a lot of alignment research tasks can be phrased as natural language or coding tasks.

Future versions of WebGPT, InstructGPT, and Codex can provide a foundation as alignment research assistants, but they aren’t sufficiently capable yet. While we don’t know when our models will be capable enough to meaningfully contribute to alignment research, we think it’s important to get started ahead of time. Once we train a model that could be useful, we plan to make it accessible to the external alignment research community.