Language models can explain neurons in language models

Although the vast majority of our explanations score poorly, we believe we can now use ML techniques to further improve our ability to produce explanations. For example, we found we were able to improve scores by:

Iterating on explanations. We can increase scores by asking GPT-4 to come up with possible counterexamples, then revising explanations in light of their activations.
Using larger models to give explanations. The average score goes up as the explainer model's capabilities increase. However, even GPT-4 gives worse explanations than humans, suggesting room for improvement.
Changing the architecture of the explained model. Training models with different activation functions improved explanation scores.
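
The first item above, iterating on explanations, can be pictured as a simple propose-and-revise loop. The following is a minimal sketch under stated assumptions, not the method as actually implemented: every callable (`propose_counterexamples`, `get_activations`, `revise`, `score`) is a hypothetical placeholder that a reader would back with an explainer model, the explained model, and a scoring procedure of their own.

```python
from typing import Callable, List, Sequence, Tuple

# One piece of evidence: a text excerpt and the neuron's activations on it.
Evidence = Tuple[str, Sequence[float]]


def refine_explanation(
    explanation: str,
    propose_counterexamples: Callable[[str], Sequence[str]],  # hypothetical: asks the explainer for texts that might break the explanation
    get_activations: Callable[[str], Sequence[float]],        # hypothetical: runs the explained model on a text
    revise: Callable[[str, List[Evidence]], str],             # hypothetical: asks the explainer to rewrite the explanation
    score: Callable[[str], float],                            # hypothetical: scores an explanation (higher is better)
    rounds: int = 3,
) -> Tuple[str, float]:
    """Keep the best-scoring explanation seen across a few revision rounds."""
    best, best_score = explanation, score(explanation)
    for _ in range(rounds):
        # Collect possible counterexamples and the neuron's real activations on them.
        texts = propose_counterexamples(best)
        evidence = [(text, get_activations(text)) for text in texts]
        # Revise in light of those activations; accept only if the score improves.
        candidate = revise(best, evidence)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score
```

Accepting a revision only when it scores higher is a design choice made for this sketch; it guarantees the loop never returns something worse than the starting explanation.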

We're open-sourcing our datasets and visualization tools for GPT-4-written explanations of all 307,200 neurons in GPT-2, as well as code for explanation and scoring using publicly available models on the OpenAI API. We hope the research community will develop new techniques for generating higher-scoring explanations and better tools for exploring GPT-2 using explanations.

We found over 1,000 neurons with explanations that scored at least 0.8, meaning that according to GPT-4 they account for most of the neuron's top-activating behavior. Most of these well-explained neurons aren't very interesting. However, we also found many interesting neurons that GPT-4 didn't understand. We hope that as explanations improve we may be able to rapidly uncover interesting qualitative understanding of model computations.
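
To make the 0.8 threshold concrete, here is a small sketch of how explanations might be scored and filtered. This section does not spell out the scoring formula, so the sketch assumes the score is the correlation between activations simulated from an explanation and the neuron's real activations. The function names and the neuron labels in the usage note are illustrative and are not taken from the released code.

```python
import math
from typing import Dict, List, Sequence


def correlation_score(simulated: Sequence[float], real: Sequence[float]) -> float:
    """Pearson correlation between simulated and real activations (assumed scoring rule)."""
    n = len(real)
    if n != len(simulated) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_s = sum(simulated) / n
    mean_r = sum(real) / n
    cov = sum((s - mean_s) * (r - mean_r) for s, r in zip(simulated, real))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_r = sum((r - mean_r) ** 2 for r in real)
    if var_s == 0 or var_r == 0:
        # A constant sequence carries no signal; treat it as unexplained.
        return 0.0
    return cov / math.sqrt(var_s * var_r)


def well_explained(scores: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the neurons whose explanation score is at least the threshold."""
    return [neuron for neuron, score in scores.items() if score >= threshold]
```

For example, `well_explained({"layer0/n12": 0.91, "layer3/n7": 0.42})` keeps only the first entry, since 0.42 falls below the 0.8 cutoff.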