ChatGPT Spills Secrets in Novel PoC Attack

A team of researchers from Google DeepMind, OpenAI, ETH Zurich, McGill University, and the University of Washington has developed a new attack for extracting key architectural information from proprietary large language models (LLMs) such as ChatGPT and Google PaLM-2.

The research shows how adversaries can extract supposedly hidden data from an LLM-enabled chatbot in order to duplicate or steal its functionality entirely. The attack, described in a technical report released this week, is one of several over the past year that have highlighted weaknesses that makers of AI tools still need to address in their technologies even as adoption of their products soars.

As the researchers behind the new attack note, little is known publicly about how large language models such as GPT-4, Gemini, and Claude 2 work. The developers of these technologies have deliberately chosen to withhold key details about the training data, training methods, and decision logic in their models for competitive and safety reasons.

“Nevertheless, while these models’ weights and internal details are not publicly accessible, the models themselves are exposed via APIs,” the researchers noted in their paper. Application programming interfaces allow developers to integrate AI-enabled tools such as ChatGPT into their own applications, products, and services. The APIs let developers harness AI models such as GPT-4, GPT-3, and PaLM-2 for a range of use cases, such as building virtual assistants and chatbots, automating business process workflows, generating content, and answering domain-specific queries.

The researchers from DeepMind, OpenAI, and the other institutions wanted to find out what information they could extract from AI models by making queries via their APIs. Unlike a previous attack from 2016, in which researchers showed how they could extract model data by running specific prompts against the first, or input, layer, the team opted for what they described as a “top-down” attack model. The goal was to see what they could extract by running targeted queries against the last, or final, layer of the neural network architecture, which is responsible for generating output predictions from input data.

A Top-Down Attack

The information in this layer can include important clues about how the model handles input data, transforms it, and runs it through a complex series of processes to generate a response. Attackers who are able to extract information from this so-called “embedding projection layer” can gain valuable insight into the internal workings of the model, allowing them to craft more effective attacks, reverse engineer the model, or try to subvert its behavior.

Successful attacks at this layer can reveal “the width of the transformer model, which is often correlated with its total parameter count,” the researchers said. “Second, it slightly reduces the degree to which the model is a complete ‘blackbox,’ and so might be useful for future attacks.”

The researchers found that by attacking the last layer of many production LLMs, they were able to extract substantial proprietary information about the models. “For under $20 USD, our attack extracts the entire projection matrix of OpenAI’s ada and babbage language models,” the researchers wrote. “We also recover the exact hidden dimension size of the gpt-3.5-turbo model and estimate it would cost under $2,000 in queries to recover the entire projection matrix.”
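The key observation behind the hidden-dimension recovery is that every logit vector a model returns is the product of its final projection matrix and some internal hidden state, so all API responses lie in a subspace whose dimension equals the model's hidden width. The toy sketch below illustrates that idea numerically with a simulated black box; the vocabulary size, hidden width, and random "model" are all hypothetical stand-ins, not the paper's actual implementation or any real API.

```python
import numpy as np

# Hypothetical stand-in for a black-box LLM's final layer:
# logits = W @ hidden_state, where W is the secret
# (vocab_size x hidden_dim) embedding projection matrix.
rng = np.random.default_rng(0)
VOCAB, HIDDEN = 1000, 64  # illustrative sizes only
W = rng.normal(size=(VOCAB, HIDDEN))

def query_logits() -> np.ndarray:
    """Simulated black-box query: one full logit vector per 'prompt'."""
    hidden_state = rng.normal(size=HIDDEN)  # unknown internal activation
    return W @ hidden_state

# Collect logit vectors for more prompts than the suspected hidden width.
responses = np.stack([query_logits() for _ in range(HIDDEN + 50)])

# Every response lies in the column space of W, so the response matrix
# has numerical rank equal to the hidden dimension. Count the singular
# values that stand clearly above floating-point noise.
singular_values = np.linalg.svd(responses, compute_uv=False)
est_hidden_dim = int((singular_values > 1e-6 * singular_values[0]).sum())
print(est_hidden_dim)
```

In this simulation the estimated rank matches the true hidden width of 64. The real attack is harder because production APIs return only partial logit information per query, which is why the paper reports per-model dollar costs for the extra queries needed.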

The researchers described their attack as successful in recovering a relatively small part of the targeted AI models. But “the fact that it is at all possible to steal any parameters of a production model is surprising and raises concerns that extensions of this attack might be able to recover more information.”

Over the past year there have been numerous other reports that have highlighted weaknesses in popular GenAI models. Earlier this month for instance, researchers at HiddenLayer released a report that described how they were able to get Google’s Gemini technology to misbehave in various ways by sending it carefully structured prompts. Others have found similar approaches to jailbreak ChatGPT and get it to generate content that it is not supposed to generate. And in December, researchers from Google DeepMind and elsewhere showed how they could extract ChatGPT’s hidden training data simply by prompting it to repeat certain words incessantly.

