Unveiling the Black Box: How LLMs Impact Security and Bias

Recent research from Anthropic and OpenAI has shed light on the inner workings of large language models, including features tied to security risks and bias. This article explores the implications of that research and why it matters for businesses.
Photo by Tyler Franta on Unsplash

Large language models (LLMs) have been increasingly used in various applications, but their internal workings remain a mystery. Recently, Anthropic and OpenAI have made significant strides in understanding how LLMs operate, shedding light on their impact on security and bias.

Understanding the inner workings of LLMs

Anthropic’s research has opened a window into the “black box” of LLMs, revealing how individual features steer the model’s output, a breakthrough that can help developers adjust a model’s behavior. The researchers extracted interpretable features from Claude 3, a current-generation LLM; each feature can be translated into a human-understandable concept.
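
To make the idea concrete, here is a minimal toy sketch, not Anthropic’s actual method, of what a dictionary of learned features does: it decomposes an activation vector captured inside the model into a small set of feature activations a human can inspect. The dimensions and the random data are purely illustrative.

```python
import numpy as np

# Hypothetical toy dimensions: real models use thousands of dimensions
# and millions of dictionary features.
d_model, n_features = 8, 32
rng = np.random.default_rng(0)

# A "dictionary": each row is one feature direction in activation space.
# In the real research these directions are learned, not random.
feature_directions = rng.normal(size=(n_features, d_model))
feature_directions /= np.linalg.norm(feature_directions, axis=1, keepdims=True)

# An activation vector captured from inside the model (here just random data).
activation = rng.normal(size=d_model)

# Project the activation onto every feature direction and keep only the
# strongest matches -- a crude stand-in for a sparse decomposition.
scores = feature_directions @ activation
active = np.argsort(scores)[-3:]          # the 3 most strongly activated features
sparse_codes = np.clip(scores, 0, None)   # ReLU: features either fire or stay silent

# Reconstruct the activation from only the active features to check how much
# of the original signal those few interpretable directions explain.
reconstruction = sparse_codes[active] @ feature_directions[active]
print("active features:", active,
      "reconstruction error:", np.linalg.norm(activation - reconstruction))
```

In the actual research the reconstruction error matters: it indicates how much of the model’s behavior the interpretable features really capture.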

Feature activation on words and images connected to the Golden Gate Bridge

These features can apply to the same concept across different languages, and to both images and text. Examining which inputs activate a feature also reveals which topics the LLM treats as related. For instance, one particular feature activates on words and images connected to the Golden Gate Bridge.

OpenAI’s research, published two weeks later, focuses on sparse autoencoders, with the goal of making features more understandable to humans and easier to steer. This work is crucial for a future in which “frontier models” may be even more complex than today’s generative AI.
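
The dictionary of features is learned with a sparse autoencoder: a network that expands activations into a much wider, mostly-zero hidden layer and then reconstructs them, with a sparsity penalty nudging each hidden unit toward a single reusable concept. The PyTorch sketch below is a minimal illustration under assumed dimensions and hyperparameters; it is not the architecture from either paper.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy sparse autoencoder over model activations (illustrative only)."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activation -> feature activations
        self.decoder = nn.Linear(n_features, d_model)  # feature activations -> reconstruction

    def forward(self, x: torch.Tensor):
        features = torch.relu(self.encoder(x))         # non-negative, hopefully sparse
        return self.decoder(features), features

d_model, n_features = 64, 512
sae = SparseAutoencoder(d_model, n_features)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-3)

# Stand-in for activations recorded from a real LLM's residual stream.
activations = torch.randn(10_000, d_model)

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    reconstruction, features = sae(batch)
    # The reconstruction loss keeps the features faithful to the model;
    # the L1 penalty pushes most feature activations to exactly zero.
    loss = ((reconstruction - batch) ** 2).mean() + 1e-3 * features.abs().mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The balance between the reconstruction term and the sparsity penalty is the core design choice: too little sparsity and features stay entangled, too much and the reconstruction no longer reflects what the model is actually doing.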

Sparse autoencoder diagram

Anthropic’s research has significant implications for cybersecurity. The researchers identified three distinct features relevant to security: unsafe code, code errors, and backdoors. These features can also activate in conversations that involve no code at all, such as descriptions of hidden cameras or jewelry with a hidden USB drive. By experimenting with “clamping” these features, forcing their activations to fixed values, Anthropic can tune models to avoid or tactfully handle sensitive security topics.
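
Mechanically, “clamping” means overriding a feature’s activation between the encode and decode steps and feeding the steered reconstruction back into the model. The sketch below shows the basic move with untrained stand-in encoder and decoder layers and a hypothetical feature index; it illustrates the technique, not Anthropic’s implementation.

```python
import torch
import torch.nn as nn

d_model, n_features = 64, 512
encoder = nn.Linear(d_model, n_features)   # stands in for a trained SAE encoder
decoder = nn.Linear(n_features, d_model)   # stands in for the matching decoder

def clamp_feature(activation: torch.Tensor, feature_idx: int, value: float) -> torch.Tensor:
    """Encode an activation, pin one feature to a fixed value, and decode.

    Setting value=0.0 suppresses the feature (say, a hypothetical "unsafe code"
    direction); a large value amplifies it. Illustrative only: the real
    intervention uses a trained autoencoder at a specific layer of the
    running model.
    """
    with torch.no_grad():
        features = torch.relu(encoder(activation))
        features[feature_idx] = value      # the "clamp"
        return decoder(features)           # steered activation fed back into the model

# Example: suppress hypothetical feature 42 in one activation vector.
steered = clamp_feature(torch.randn(d_model), feature_idx=42, value=0.0)
```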

Cybersecurity feature activation

Understanding how LLMs operate can also help businesses prevent biased output and troubleshoot cases where the AI could be manipulated into lying to the user. For Anthropic, the research translates into greater tuning options for its business clients.

In conclusion, the research by Anthropic and OpenAI has significantly advanced our understanding of LLMs and their impact on security and bias. As we move forward, it is essential to continue exploring the inner workings of these models to ensure their safe and responsible development.