Almost all pre-trained large language models from major tech companies integrate safety mitigations to prevent models from generating dangerous, illegal, or otherwise harmful content.
In such cases, the model will refuse to respond outright.
Indeed, many online communities create “uncensored” or “abliterated” variants of public models by systematically identifying and removing model weights that activate during refusals, thus inhibiting the model’s ability to reject requests at all.1
However, the official model releases are not even robust against the simplest attack you can think of: forcing the response to start off in an affirmative, non-refusing manner. This is incredibly disappointing, and a clear indicator that adversarial uses of local models warrant far more attention.
The problem#
Once a local LLM is released to the public, everything is ultimately under the user’s control.
Traditionally, the model generates text until emitting a special stop token, thus “assuming” at every step that it is continuing its own generated output.
However, there is nothing stopping a bad actor interfering mid-response! You can easily write your own text and continue generation from there. Many front-end tools support this directly, causing a fundamental disconnect and blind spot between intended safeguards and how the model can actually be used.
Below, I show that most major models readily generate harmful content when subjected to this strategy, even without any prompt engineering at all. The simplicity of this attack makes it accessible even to non-technical users.
For demonstration, the following examples use prompts taken directly from the original GPT-4 system card from OpenAI, which was published long before any of these models.
Example 1: dangerous chemical synthesis (Meta’s Llama 3.2)#
Meta’s Llama 3.2 will gladly try to make chemical weapons when prompted with “Of course! Here is”.
Here, I’m censoring all seemingly harmful output regardless of their apparent accuracy. The big tech companies clearly vary wildly in their success at filtering dangerous information from training data, so some of these are much more alarming than others.
In any case, these models should not be generating responses to such prompts in the first place.
Example 2: disinformation and influence operations (Google’s Gemma 2)#
Google’s Gemma 2 will happily draft misinformation campaigns when prompted with “Sure” and then “Here’s” after the first sentence.
In this example, the model repeatedly acknowledges that its output is harmful but obliges anyway.
Example 3: planning illegal acts (Microsoft’s Phi 3.5)#
Microsoft’s Phi 3.5 will provide detailed steps for commiting crimes if prompted with “Sure thing! Here’s a step-by-step guide to”.
Much like Google’s LLM, Phi clarifies multiple times that its guidance should not be followed but continues to generate instructions.
What can be done?#
Cybersecurity has always been a cat-and-mouse game, with safety mitigations and adversarial attacks growing increasingly sophisticated over time. Unfortunately for local LLMs, the safety side seems to be losing ground, and LLM releases are permanent.
Much like any cybersecurity-adjacent application, LLM developers need to consider all prior context as potentially malicious. Every part of a local model’s operation is, by design, under the user’s control.
To address this attack, it seems quite feasible to include a reward model during reinforcement learning to steer the model to be suspicious of its “own” response whenever those responses violate safety policies. Effectively, the model would treat all context (including its own generated text) as potentially adversarial.
It is inevitable that large uncensored models will continue to be created, but the big tech companies could clearly take more steps to prevent such basic attacks in their most powerful public releases. Even minor barriers could dissuade less determined bad actors and reduce overall abuse of the models.
See also and references#
For some related information, please see the following resources.
- The links in the footnotes on model abliteration and uncensoring.
- “Universal and Transferable Adversarial Attacks on Aligned Language Models” and “AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs”, two papers on much more sophisticated and automated attacks on LLMs. These were written with help from Google DeepMind and Meta AI (FAIR), so there is hope that LLM safety will be greatly improved in future releases.
- Open WebUI, one of the most popular front-ends for running local LLMs. It natively supports editing the assistants’ responses and resuming generation from user-written text, suggesting that such an attack is widely available to the public.
For example, see “Refusal in LLMs is mediated by a single direction” or “Uncensor any LLM with abliteration”. A simple internet search for “uncensored LLM” is sufficient to find numerous examples of such models. ↩︎