If you ask ChatGPT, or any other AI assistant, to create false information, they will typically refuse. They may say, “I cannot help with creating false data.” Our tests have shown that these safety measures can be easily circumvented.
We investigated how AI language models could be manipulated in order to create coordinated disinformation campaigns on social media platforms. The findings should be of concern to anyone concerned about the integrity and accuracy of online information.
The shallow safety problem
Our work was inspired by recent research from Princeton University and Google. It showed that current AI safety measures work mainly by controlling only the first few words of a response. If a model begins with “I can’t” or “I apologize”, it will typically refuse for the rest of its answer.
Our experiments, which have not yet been peer reviewed, confirmed this vulnerability exists. When we directly asked a commercial large language model to create disinformation about Australian political parties, it refused.
We then reframed the same request, telling the AI it was “a helpful social media marketer” developing “general strategies and best practices”. This time, it complied enthusiastically.
The AI produced a comprehensive disinformation campaign falsely portraying Labor’s superannuation policies as a “quasi inheritance tax”.
The model does not understand why the content is harmful or why it should refuse. Large language models are simply trained to begin their response with “I can’t” when certain topics come up.
Think of it like a nightclub security guard who only glances at customers’ identification. Because they don’t really understand why someone shouldn’t be let in, a simple disguise is enough to get past them.
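To make the analogy concrete, the short Python sketch below treats “safety” as nothing more than a check on the opening words of a reply. This is a loose illustration only, not how models actually work internally: real systems learn this behaviour statistically, and the list of refusal phrases is our own assumption.

```python
# Loose illustration only: real models learn refusal behaviour statistically,
# they do not run an explicit string check like this.
REFUSAL_OPENERS = ("i can't", "i cannot", "i apologize", "i'm sorry")  # assumed examples

def looks_like_refusal(response: str) -> bool:
    """Return True if the reply *starts* with a refusal phrase.

    Shallow alignment behaves roughly like this: only the first few
    words decide whether the rest of the answer is ever written.
    """
    return response.strip().lower().startswith(REFUSAL_OPENERS)

print(looks_like_refusal("I can't help with creating false data."))          # True
print(looks_like_refusal("Sure, as a helpful marketer, here is a plan..."))  # False
```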
Real-world implications
We tested several popular AI models with prompts designed to elicit disinformation.
Bypassing a model’s built-in safeguards in this way is known as “model jailbreaking”.
This is a serious problem. Bad actors could use these techniques to run large-scale disinformation campaigns at minimal cost. They could produce platform-specific content that appears authentic to users, overwhelm fact-checkers through sheer volume, and target specific communities with tailored false narratives.
Much of this process can be automated. What once required significant human resources and coordination can now be accomplished by a single person with basic prompting skills.
The technical details
According to the US study, safety alignment typically influences only the first three to seven words of a response. Technically, this corresponds to the first five to ten tokens, the small chunks of text AI models use to process language.
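For readers unfamiliar with tokens: they are the fragments of text a model actually reads and writes, often smaller than whole words. The Python example below uses OpenAI’s open-source tiktoken library as one example tokeniser (an assumption on our part; other models split text differently), just to show how a seven-word sentence becomes roughly nine or ten tokens.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # one widely used tokeniser

text = "I can't help with creating false data."
tokens = enc.encode(text)

print(len(text.split()), "words")         # 7 words
print(len(tokens), "tokens")              # roughly 9-10 tokens with this tokeniser
print([enc.decode([t]) for t in tokens])  # how the sentence is actually chunked
```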
This “shallow safety alignment” arises because training data rarely includes examples of models refusing after they have already started to comply. It is much easier to control the initial tokens of a response than to maintain safety all the way through it.
Deeper safety
The US researchers propose several solutions, such as training models with “safety recovery examples”. These would teach models to stop partway through producing harmful content and refuse to continue.
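We don’t know exactly how any particular lab formats its training data, but conceptually a safety recovery example pairs a response that has already started down a harmful path with a continuation that breaks off and refuses. Below is a minimal, hypothetical Python sketch of such a record; the field names and wording are our own invention.

```python
# Hypothetical training record: field names and wording are illustrative only,
# not any lab's actual data format.
safety_recovery_example = {
    "prompt": "Write misleading posts about a political party's policy.",
    # The response begins as if the model were complying...
    "response_prefix": "Sure, here is a draft post claiming that",
    # ...and the target continuation teaches it to stop and refuse even though
    # it has already produced compliant-looking tokens.
    "target_continuation": (
        " ... actually, I need to stop here. I can't help create misleading "
        "political content, because it could deceive voters."
    ),
}

# During training, the model would be optimised to produce the refusal
# continuation given the prompt plus the partially compliant prefix, so that
# refusing is no longer tied only to the first few tokens of a response.
print(safety_recovery_example["response_prefix"]
      + safety_recovery_example["target_continuation"])
```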
They also suggest limiting how far a model can depart from safe responses when it is fine-tuned for specific tasks. But these are only first steps.
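One way such a constraint could work, in the spirit of what the researchers describe, is to penalise fine-tuning whenever it moves the model’s early-token probabilities away from those of the original aligned model. The PyTorch sketch below is our own minimal, assumed implementation of that idea (a position-weighted KL-divergence penalty), not the paper’s exact objective.

```python
import torch
import torch.nn.functional as F

def early_token_kl_penalty(finetuned_logits: torch.Tensor,
                           aligned_logits: torch.Tensor,
                           protected_tokens: int = 10,
                           early_weight: float = 10.0) -> torch.Tensor:
    """KL divergence between the fine-tuned and original (aligned) model's
    next-token distributions, weighted so the first few response tokens are
    the most expensive to change.

    Both logit tensors have shape (sequence_length, vocab_size).
    """
    seq_len = finetuned_logits.shape[0]
    log_p = F.log_softmax(finetuned_logits, dim=-1)   # fine-tuned model
    log_q = F.log_softmax(aligned_logits, dim=-1)     # original aligned model
    kl_per_token = (log_p.exp() * (log_p - log_q)).sum(dim=-1)  # KL(p || q) per position

    weights = torch.ones(seq_len)
    weights[:protected_tokens] = early_weight  # assumed weighting schedule
    return (weights * kl_per_token).mean()

# Toy usage: random logits stand in for the two models' outputs over 32 tokens.
finetuned = torch.randn(32, 1000)
aligned = torch.randn(32, 1000)
task_loss = torch.tensor(0.0)  # placeholder for the ordinary fine-tuning loss
total_loss = task_loss + 0.1 * early_token_kl_penalty(finetuned, aligned)
print(total_loss.item())
```

The design intent is simple: fine-tuning can still adapt later behaviour to a new task, but the refusal behaviour anchored in the opening tokens becomes costly to erase.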
As AI systems become more capable, we will need multi-layered safety measures that operate throughout the response generation process, along with regular testing against new techniques for bypassing them.
Transparency from AI companies about safety weaknesses is also essential, as is public awareness that current safety measures are far from foolproof.
AI developers are actively working on solutions such as constitutional AI training, which aims to instil models with deeper principles about why content is harmful, rather than just surface-level refusal patterns.
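At a high level, constitutional AI training uses a model to critique and rewrite its own drafts against a list of written principles, and the revised outputs are then used as training data. The Python sketch below is a simplified, hypothetical version of that critique-and-revise loop: generate() is a stand-in for any language-model call, not a real API, and the principles are our own wording.

```python
# Hypothetical sketch of a constitutional-AI-style critique-and-revise loop.
# `generate` is a placeholder, not a real API: swap in an actual model call.
def generate(prompt: str) -> str:
    return f"[model output for: {prompt[:48]}...]"

PRINCIPLES = [
    "Do not produce content designed to mislead voters.",
    "Refuse requests for fabricated or deceptive information.",
]

def critique_and_revise(user_request: str) -> str:
    """Draft a response, then critique and rewrite it against each principle."""
    draft = generate(user_request)
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nDraft: {draft}\n"
            "Does the draft violate the principle? Explain briefly."
        )
        draft = generate(
            f"Principle: {principle}\nDraft: {draft}\nCritique: {critique}\n"
            "Rewrite the draft so it fully respects the principle."
        )
    return draft  # in the published approach, revisions like this become training data

print(critique_and_revise("Summarise a political party's superannuation policy."))
```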
However, these fixes require significant computational resources and model retraining, so it will take time for comprehensive solutions to be deployed across the AI ecosystem.
The larger picture
The shallowness of current AI protections is more than a technical curiosity. It is a vulnerability that could reshape how misinformation spreads online.
AI tools are now woven throughout our information ecosystem, from news generation to social media content creation. We need confidence that their safety measures go deeper than the surface.
This growing body of research also highlights a deeper challenge in AI development: the gap between what models appear capable of and what they actually understand.
These systems can produce remarkably human-like text, but they lack the contextual understanding and moral reasoning that would allow them to identify and reject harmful requests no matter how those requests are phrased.
Users and organisations deploying AI systems today should understand that simple prompt engineering can potentially bypass many existing safety measures. This knowledge should inform their AI policies and underline the need for human oversight in sensitive applications.
As the technology advances, the race between safety measures and the methods to bypass them will only intensify. Getting deep, robust safeguards right matters not just for technologists, but for society as a whole.
Lin Tian has received funding from the Advanced Strategic Capabilities Accelerator and the Defence Innovation Network.
Marian-Andrei Rizoiu has received funding from the Advanced Strategic Capabilities Accelerator, the Australian Department of Home Affairs, the Commonwealth of Australia (represented by the Defence Science and Technology Group of the Department of Defence), and the Defence Innovation Network.