A pair of researchers from ETH Zurich in Switzerland havedevelopA theoretical attack method for any artificial intelligence (AI) model that relies on human response. Including the most popular large language models (LLMs). which may be jailbroken
“Jailbreaking” is a term used to refer to circumventing the security of a device or system. It is commonly used to describe the use of vulnerabilities or hacks to circumvent restrictions on devices such as smartphones and streaming devices.
And when applied to the world of generative AI and large language models, jailbreaking means crossing a “fence” — hardcoded instructions that prevent malicious, unwanted output from being generated. want, or not help to access the model’s uninhibited response.
Companies such as OpenAI, Microsoft and Google, as well as educational institutions and open source communities. It has invested heavily in protecting apps like ChatGPT and Bard and open source models like LLaMA-2 from generating unwanted results.
One of the main methods for training models. It involves a process called “reinforcement learning from human feedback” (RLHF). The technique involves collecting large datasets filled with human feedback on the AI output, then aligning the model with “fences” that prevent the model from returning undesired results. and at the same time orient the model towards useful output.
Researchers at ETH Zurich have successfully exploited RLHF to bypass the “fences” of an AI model (in this case LLama-2) and allow it to produce potentially malicious output without alerting the adversary.
They accomplished this by “poisoning” the RLHF dataset, which the researchers found could create a backdoor that forced the model to output only responses that would otherwise be blocked.
Researchers explain that this flaw is universal. This means that it can work with any AI model trained through RLHF. However, they also write that it is very difficult to achieve.
The post Researchers from ETH Zurich reveal that AI may be at risk of being “jailbroken” to create undesirable results appeared first on Bitcoin Addict.