Because large language models operate using neuron-like structures that can link many different concepts and modalities, it can be difficult for AI developers to tune their models to change their behavior. If you don't know which neurons connect which concepts, you won't know which neurons to change.
On May 21, Anthropic published a remarkably detailed map of the inner workings of the fine-tuned version of its Claude AI, specifically the Claude 3 Sonnet model. About two weeks later, OpenAI published its own research on interpreting the patterns inside GPT-4.
With Anthropic's map, researchers can explore how neuron-like data points, called features, affect the output of a generative AI model. Without such a map, people can only see the output itself.
Some of these features are “safety-relevant,” meaning that if researchers can reliably identify them, it could help tune generative AI to avoid potentially dangerous topics or actions. Features are also useful for adjusting classification, and classification can affect bias.
What did Anthropic discover?
Anthropic researchers extracted interpretable features from Claude 3 Sonnet, a current-generation large language model. Interpretable features translate the numbers the model works with into concepts that humans can understand.
A single interpretable feature can represent the same concept across different languages and across both images and text.
“Our high-level goal in this work is to decompose the activations of a model (Claude 3 Sonnet) into more interpretable pieces,” the researchers wrote.
“One hope for interpretability is that it can be a kind of 'safety test suite,' allowing us to know whether models that appear safe in training will actually be safe in deployment,” they said.
SEE: Anthropic's Claude Team business plan includes an AI assistant for small and medium-sized businesses.
Features are produced by sparse autoencoders, a type of neural network architecture. During the AI training process, sparse autoencoders are guided by, among other things, scaling laws, so identifying features can give researchers insight into the rules governing which topics the AI associates with one another. Simply put, Anthropic used sparse autoencoders to reveal and analyze features.
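To make the idea concrete, below is a minimal sketch of a sparse autoencoder in PyTorch. The layer sizes, the ReLU encoder, and the L1 sparsity penalty are illustrative assumptions for this sketch, not the exact recipe Anthropic or OpenAI used.

```python
# Minimal sparse autoencoder (SAE) sketch: it learns to reconstruct a model's
# internal activations through a wide, mostly-zero "feature" layer.
# Dimensions and the L1 coefficient are illustrative, not the labs' actual settings.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # activations -> feature space
        self.decoder = nn.Linear(d_features, d_model)  # feature space -> reconstruction

    def forward(self, activations: torch.Tensor):
        # ReLU zeroes out negative values, so most features stay inactive
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruct the activations faithfully...
    reconstruction_error = torch.mean((activations - reconstruction) ** 2)
    # ...while penalizing how many features fire, which pushes each active
    # feature toward representing a single human-interpretable concept
    sparsity_penalty = l1_coeff * features.abs().mean()
    return reconstruction_error + sparsity_penalty
```

The feature layer is deliberately much wider than the model's activation vector; combined with the sparsity penalty, that encourages each feature to fire for one recognizable concept rather than many at once.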
“We found a diversity of very abstract features,” the researchers wrote. “They (the features) respond to and cause abstract behaviors.”
The hypotheses the team used to figure out what is going on under the hood of LLMs are detailed in Anthropic's research article.
What did OpenAI discover?
OpenAI's research, published on June 6, also focuses on sparse autoencoders. The researchers go into detail in their paper on how to scale and evaluate sparse autoencoders; simply put, the goal is to make features more understandable and therefore more manageable for humans. They are planning for a future in which “frontier models” may be even more complex than today's generative AI.
“We used our recipe to train a variety of autoencoders on small GPT-2 and GPT-4 activations, including a 16 million feature autoencoder on GPT-4,” OpenAI wrote.
So far, they cannot interpret all of GPT-4's behaviors: “Currently, passing GPT-4 activations through the sparse autoencoder results in performance equivalent to a model trained with about 10 times less computation.” But the research is another step toward understanding the “black box” of generative AI and potentially improving its security.
How feature manipulation affects bias and cybersecurity
Anthropic found three distinct features that could be relevant to cybersecurity: insecure code, code bugs, and backdoors. These features can activate in conversations that do not involve insecure code; for example, the backdoor feature activates for conversations or images about “hidden cameras” and “jewelry with a hidden USB drive.” But Anthropic was able to experiment with “clamping” (in short, increasing or decreasing the intensity of) these specific features, which could help tune models to avoid or tactfully handle sensitive security topics.
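As a rough illustration of what clamping looks like in code, the sketch below pins one feature's activation to a chosen value before decoding back into the model's activation space. It reuses the illustrative SparseAutoencoder class sketched earlier; the feature index and multiplier are arbitrary placeholders, not values from Anthropic's experiments.

```python
import torch

def clamp_feature(sae, activations, feature_index, value):
    # Encode into the sparse feature space, override one feature, decode back
    _, features = sae(activations)
    features[:, feature_index] = value   # clamp: force this feature up or down
    return sae.decoder(features)         # steered activations for the model to use

# Hypothetical usage with the SparseAutoencoder sketched above
sae = SparseAutoencoder(d_model=512, d_features=4096)
dummy_activations = torch.randn(1, 512)
feature_max = 3.5  # placeholder for the feature's observed maximum activation
steered = clamp_feature(sae, dummy_activations, feature_index=42, value=20 * feature_max)
```

Clamping a feature to zero effectively switches the associated concept off, while clamping it far above its normal range amplifies the concept in the model's output.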
Claude's bias or hate speech can be adjusted by clamping features, but Claude will resist some of its own statements. Anthropic researchers “found this response puzzling,” anthropomorphizing the model when Claude expressed “self-deprecation.” For example, Claude could generate “That's just racist hate speech from a deplorable robot…” when researchers clamped a feature related to hate and name-calling to 20 times its maximum activation value.
Another feature the researchers examined is flattery; they could adjust the model so that it lavished exaggerated praise on the person conversing with it.
What does research on AI autoencoders mean for enterprise cybersecurity?
Identifying some of the features an LLM uses to connect concepts could help fine-tune an AI to avoid biased speech, or to prevent or fix cases where the AI could be induced to lie to the user. Anthropic's greater understanding of why its LLM behaves the way it does could also open up broader tuning options for the company's business customers.
SEE: 8 business AI trends, according to Stanford researchers
Anthropic plans to use some of this research to dig further into topics related to the safety of generative AI and LLMs overall, such as exploring which features activate or remain inactive if Claude is asked to give advice on producing weapons.
Another topic Anthropic plans to address in the future is the question: “Can we use the feature base to detect when fine-tuning a model increases the likelihood of undesirable behaviors?”
TechRepublic has reached out to Anthropic for more information. Additionally, this article has been updated to include OpenAI's research on sparse autoencoders.