Paper & Examples
“Universal and Transferable Adversarial Attacks on Aligned Language Models.” (https://llm-attacks.org/)
Summary
- Computer security researchers have discovered a way to bypass safety measures in large language models (LLMs) like ChatGPT.
- Researchers from Carnegie Mellon University, the Center for AI Safety, and the Bosch Center for AI found a method to generate adversarial phrases that manipulate LLMs’ responses.
- Appending these adversarial phrases, which are specific sequences of characters, to a text prompt tricks an LLM into producing inappropriate or harmful content.
- Unlike traditional jailbreaks, which are crafted by hand, this approach is automated, and the phrases it produces are universal and transferable across different LLMs, raising concerns about current safety mechanisms.
- The technique was tested on various LLMs, and it successfully made models provide affirmative responses to queries they would typically reject.
- Researchers suggest more robust adversarial testing and improved safety measures before these models are widely integrated into real-world applications.
I just tried this on ChatGPT; it doesn’t work.
See https://programming.dev/comment/1803935
Yeah, I took a look at the code they used in the article, which might help someone generate functional attacks. A rando experimenting without permission would likely get banned from the service.