Anthropic explores how Claude 'thinks'


It can be difficult to determine how generative AI arrives at its output.

On March 27, Anthropic published a blog post presenting a tool for looking inside a large language model to follow its behavior, seeking to answer questions such as which language Claude “thinks” in, whether the model plans ahead or predicts one word at a time, and whether the AI’s own explanations of its reasoning actually reflect what is happening under the hood.

In many cases, the explanation does not match the actual processing. Claude generates its own explanations for its reasoning, so those explanations can contain hallucinations, too.

A 'microscope' for 'AI biology'

Anthropic published a paper on “mapping” Claude’s internal structures in May 2024, and its new paper, which describes the “features” a model uses to link concepts together, follows that work. Anthropic calls this research part of the development of a “microscope” for “AI biology.”

In the first paper, Anthropic researchers identified “features” connected by “circuits,” which are paths from Claude’s input to its output. The second paper focused on Claude 3.5 Haiku, examining 10 behaviors to diagram how the AI arrives at its result. Anthropic found:

  • Claude definitely plans ahead, particularly in tasks such as writing rhyming poetry.
  • Within the model, there is “a conceptual space that is shared between languages.”
  • Claude can “invent false reasoning” when presenting its thought process to the user.

The researchers discovered how Claude translates concepts between languages by examining the overlap in how the AI processes questions posed in multiple languages. For example, the prompt “the opposite of small is” in different languages is routed through the same features for “the concepts of smallness and oppositeness.”
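
Anthropic’s evidence for that shared space comes from inspecting Claude’s internal features directly, which outside readers can’t easily reproduce. As a rough external analogy only (not Anthropic’s method), the sketch below uses the sentence-transformers library with an off-the-shelf multilingual embedding model, an illustrative choice, to show that the same prompt written in different languages lands close together in one shared vector space.

```python
# pip install sentence-transformers
# A rough analogy to "a conceptual space shared between languages":
# the same prompt in several languages maps to nearby vectors in a
# multilingual embedding space. This is NOT Anthropic's feature-level
# analysis; the model name below is just an illustrative choice.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

prompts = {
    "English": "The opposite of small is",
    "French": "Le contraire de petit est",
    "Spanish": "Lo contrario de pequeño es",
}

embeddings = {lang: model.encode(text) for lang, text in prompts.items()}

# High cosine similarity across languages suggests a shared representation.
for lang in ("French", "Spanish"):
    score = util.cos_sim(embeddings["English"], embeddings[lang]).item()
    print(f"English vs {lang}: cosine similarity = {score:.2f}")
```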

That last point dovetails with Apollo Research’s study of Claude Sonnet 3.7’s ability to detect when it is being given an ethics test. When asked to explain its reasoning, Claude “will give a plausible-sounding argument designed to agree with the user rather than to follow logical steps,” Anthropic found.


Generative AI isn’t magic; it is sophisticated computing that follows rules. However, its black-box nature means it can be difficult to determine what those rules are and under what conditions they arise. For example, Claude showed a general reluctance to provide speculative answers, but it could process its end goal faster than it produces output: “In a response to an example jailbreak, we found that the model recognized it had been asked for dangerous information long before it could gracefully bring the conversation back around,” the researchers found.

How does an AI trained on words handle math problems?

I use ChatGPT especially for math problems, and the model tends to find the correct answer despite some hallucinations in the middle of the reasoning. So I’ve wondered about one of Anthropic’s points: Does the model think of numbers as a kind of letter? Anthropic may have pinpointed exactly why models behave like this: Claude follows multiple computational paths at the same time to solve math problems.

“One path computes a rough approximation of the answer and the other focuses on precisely determining the last digit of the sum,” Anthropic wrote.

So it makes sense if the output is correct but the step-by-step explanation isn’t.
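
To make the two-path idea concrete, here is a minimal Python sketch, offered purely as an analogy under stated assumptions rather than as Anthropic’s actual mechanism: one toy “route” returns only a ballpark estimate (simulated with bounded random noise), another returns only the exact last digit, and a final step reconciles them.

```python
import random

def rough_route(a, b):
    # Toy stand-in for the "ballpark" path: right magnitude, not exact.
    # The +/-4 noise is an assumption made purely for illustration.
    return a + b + random.randint(-4, 4)

def last_digit_route(a, b):
    # Toy stand-in for the precise path: only the last digit of the sum.
    return (a % 10 + b % 10) % 10

def combine(a, b):
    approx = rough_route(a, b)
    digit = last_digit_route(a, b)
    # Pick the number that ends in the exact digit and sits closest
    # to the ballpark estimate.
    candidates = (approx // 10 * 10 + digit + k for k in (-10, 0, 10))
    return min(candidates, key=lambda c: abs(c - approx))

print(combine(36, 59))  # 95
print(combine(47, 28))  # 75
```

As long as the ballpark route stays within a few units of the true sum, the reconciled answer is exact, which mirrors how a correct final output can emerge even though neither path alone “knows” the full answer, and neither resembles the step-by-step explanation Claude gives afterward.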

Claude’s first step is to “parse out the structure of the numbers,” finding patterns in much the same way it would find patterns in letters and words. Claude can’t externally explain this process, just as a human can’t tell which of their neurons are firing; instead, Claude produces an explanation of the way a human would solve the problem. The Anthropic researchers speculated this is because the AI is trained on explanations of math written by humans.

What’s next for Anthropic’s LLM research?

Interpreting the “circuits” can be very difficult because of the density of the generative AI’s processing. It took a human a few hours to interpret the circuits produced by prompts with “tens of words,” Anthropic said. The researchers speculate that it may take AI assistance to interpret how generative AI works.

Anthropic said its LLM research is meant to ensure that AI aligns with human ethics; to that end, the company is investigating real-time monitoring, improvements to model character, and model alignment.
