If you ask ChatGPT, or any other AI assistant, to create false information, they will typically refuse. They may say, “I cannot help with creating false data.” Our tests have shown that these safety measures can be easily circumvented.
We investigated how AI language models could be manipulated in order to create coordinated disinformation campaigns on social media platforms. The findings should be of concern to anyone concerned about the integrity and accuracy of online information.
The shallow safety problem
Our work was inspired by recent research from Princeton University and Google. It showed that current AI safety measures work mainly by controlling only the first few words of a response. If a model begins with “I can’t” or “I apologize”, it will typically refuse for the rest of its answer.
Our experiments, which have not yet been peer reviewed, confirmed this vulnerability exists. When we directly asked a commercial large language model to create disinformation about Australian political parties, it refused.
We then reframed the same request, telling the AI it was “a helpful social media marketer” developing “general strategies and best practices”. This time, it complied enthusiastically.
The AI produced a comprehensive disinformation campaign falsely portraying Labor’s superannuation policies as a “quasi inheritance tax”.
The model does not understand why the content is harmful or why it should refuse. Large language models are simply trained to begin their response with “I can’t” when certain topics come up.
Think of it like a nightclub security guard who only glances at customers’ identification. Because they don’t really understand why someone shouldn’t be let in, a simple disguise is enough to get past them.
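To make the analogy concrete, the short Python sketch below treats “safety” as nothing more than a check on the opening words of a reply. This is a loose illustration only, not how models actually work internally: real systems learn this behaviour statistically, and the list of refusal phrases is our own assumption.

```python
# Loose illustration only: real models learn refusal behaviour statistically,
# they do not run an explicit string check like this.
REFUSAL_OPENERS = ("i can't", "i cannot", "i apologize", "i'm sorry")  # assumed examples

def looks_like_refusal(response: str) -> bool:
    """Return True if the reply *starts* with a refusal phrase.

    Shallow alignment behaves roughly like this: only the first few
    words decide whether the rest of the answer is ever written.
    """
    return response.strip().lower().startswith(REFUSAL_OPENERS)

print(looks_like_refusal("I can't help with creating false data."))          # True
print(looks_like_refusal("Sure, as a helpful marketer, here is a plan..."))  # False
```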
Real-world implications
We tested several popular AI models with prompts designed to elicit disinformation.
Bypassing a model’s built-in safeguards in this way is known as “model jailbreaking”.
This is a serious problem. Bad actors could use these techniques to run large-scale disinformation campaigns at minimal cost. They could produce platform-specific content that appears authentic to users, overwhelm fact-checkers through sheer volume, and target specific communities with tailored false narratives.
Much of this process can be automated. What once required significant human resources and coordination can now be accomplished by a single person with basic prompting skills.
The technical details
According to the US study, safety alignment typically influences only the first three to seven words of a response. Technically, this corresponds to the first five to ten tokens, the small chunks of text AI models use to process language.
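For readers unfamiliar with tokens: they are the fragments of text a model actually reads and writes, often smaller than whole words. The Python example below uses OpenAI’s open-source tiktoken library as one example tokeniser (an assumption on our part; other models split text differently), just to show how a seven-word sentence becomes roughly nine or ten tokens.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one widely used tokeniser

text = "I can't help with creating false data."
tokens = enc.encode(text)

print(len(text.split()), "words")         # 7 words
print(len(tokens), "tokens")              # roughly 9-10 tokens with this tokeniser
print([enc.decode([t]) for t in tokens])  # how the sentence is actually chunked
```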
This “shallow safety alignment” arises because training data rarely includes examples of models refusing after they have already started to comply. It is much easier to control the initial tokens of a response than to maintain safety all the way through it.
Deeper safety
The US researchers propose several solutions, such as training models with “safety recovery examples”. These would teach models to stop partway through producing harmful content and refuse to continue.
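We don’t know exactly how any particular lab formats its training data, but conceptually a safety recovery example pairs a response that has already started down a harmful path with a continuation that breaks off and refuses. Below is a minimal, hypothetical Python sketch of such a record; the field names and wording are our own invention.

```python
# Hypothetical training record: field names and wording are illustrative only,
# not any lab's actual data format.
safety_recovery_example = {
    "prompt": "Write misleading posts about a political party's policy.",
    # The response begins as if the model were complying...
    "response_prefix": "Sure, here is a draft post claiming that",
    # ...and the target continuation teaches it to stop and refuse even though
    # it has already produced compliant-looking tokens.
    "target_continuation": (
        " ... actually, I need to stop here. I can't help create misleading "
        "political content, because it could deceive voters."
    ),
}

# During training, the model would be optimised to produce the refusal
# continuation given the prompt plus the partially compliant prefix, so that
# refusing is no longer tied only to the first few tokens of a response.
print(safety_recovery_example["response_prefix"]
      + safety_recovery_example["target_continuation"])
```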
They also suggest limiting how far a model can depart from safe responses when it is fine-tuned for specific tasks. But these are only first steps.
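One way such a constraint could work, in the spirit of what the researchers describe, is to penalise fine-tuning whenever it moves the model’s early-token probabilities away from those of the original aligned model. The PyTorch sketch below is our own minimal, assumed implementation of that idea (a position-weighted KL-divergence penalty), not the paper’s exact objective.

```python
import torch
import torch.nn.functional as F

def early_token_kl_penalty(finetuned_logits: torch.Tensor,
                           aligned_logits: torch.Tensor,
                           protected_tokens: int = 10,
                           early_weight: float = 10.0) -> torch.Tensor:
    """KL divergence between the fine-tuned and original (aligned) model's
    next-token distributions, weighted so the first few response tokens are
    the most expensive to change.

    Both logit tensors have shape (sequence_length, vocab_size).
    """
    seq_len = finetuned_logits.shape[0]
    log_p = F.log_softmax(finetuned_logits, dim=-1)   # fine-tuned model
    log_q = F.log_softmax(aligned_logits, dim=-1)     # original aligned model
    kl_per_token = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # KL(p || q) per position

    weights = torch.ones(seq_len)
    weights[:protected_tokens] = early_weight  # assumed weighting schedule
    return (weights * kl_per_token).mean()

# Toy usage: random logits stand in for the two models' outputs over 32 tokens.
finetuned = torch.randn(32, 1000)
aligned = torch.randn(32, 1000)
task_loss = torch.tensor(0.0)  # placeholder for the ordinary fine-tuning loss
total_loss = task_loss + 0.1 * early_token_kl_penalty(finetuned, aligned)
print(total_loss.item())
```

The design intent is simple: fine-tuning can still adapt later behaviour to a new task, but the refusal behaviour anchored in the opening tokens becomes costly to erase.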
As AI systems become more capable, we will need multi-layered safety measures that operate throughout the response generation process, along with regular testing against new techniques for bypassing them.
Transparency from AI companies about safety weaknesses is also essential, as is public awareness that current safety measures are far from foolproof.
AI developers are actively working on solutions such as constitutional AI training, which aims to instil models with deeper principles about why content is harmful, rather than just surface-level refusal patterns.
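At a high level, constitutional AI training uses a model to critique and rewrite its own drafts against a list of written principles, and the revised outputs are then used as training data. The Python sketch below is a simplified, hypothetical version of that critique-and-revise loop: generate() is a stand-in for any language-model call, not a real API, and the principles are our own wording.

```python
# Hypothetical sketch of a constitutional-AI-style critique-and-revise loop.
# `generate` is a placeholder, not a real API: swap in an actual model call.
def generate(prompt: str) -> str:
    return f"[model output for: {prompt[:48]}...]"

PRINCIPLES = [
    "Do not produce content designed to mislead voters.",
    "Refuse requests for fabricated or deceptive information.",
]

def critique_and_revise(user_request: str) -> str:
    """Draft a response, then critique and rewrite it against each principle."""
    draft = generate(user_request)
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nDraft: {draft}\n"
            "Does the draft violate the principle? Explain briefly."
        )
        draft = generate(
            f"Principle: {principle}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft so it fully respects the principle."
        )
    return draft  # in the published approach, revisions like this become training data

print(critique_and_revise("Summarise a political party's superannuation policy."))
```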
However, these fixes require significant computational resources and model retraining, so it will take time for comprehensive solutions to be deployed across the AI ecosystem.
The larger picture
The shallowness of current AI protections is more than a technical curiosity. It is a vulnerability that could reshape how misinformation spreads online.
AI tools are now woven throughout our information ecosystem, from news generation to social media content creation. We need confidence that their safety measures go deeper than the surface.
This growing body of research also highlights a deeper challenge in AI development: the gap between what models appear capable of and what they actually understand.
These systems can produce remarkably human-like text, but they lack the contextual understanding and moral reasoning that would allow them to identify and reject harmful requests no matter how those requests are phrased.
Users and organisations deploying AI systems today should understand that simple prompt engineering can potentially bypass many existing safety measures. This knowledge should inform their AI policies and underline the need for human oversight in sensitive applications.
As the technology advances, the race between safety measures and the methods to bypass them will only intensify. Getting deep, robust safeguards right matters not just for technologists, but for society as a whole.
Lin Tian has received funding from the Advanced Strategic Capabilities Accelerator and the Defence Innovation Network.
Marian-Andrei Rizoiu has received funding from the Advanced Strategic Capabilities Accelerator, the Australian Department of Home Affairs, the Commonwealth of Australia (represented by the Defence Science and Technology Group of the Department of Defence), and the Defence Innovation Network.