OpenAI implements Instructional Hierarchy to safeguard GPT-4o Mini

Instructional Hierarchy technique guides how models should behave when instructions of different priorities conflict

Tech Desk - Jul 22, 2024

An undated image showing ChatGPT interface. — Unsplash

OpenAI, the leading artificial intelligence (AI) firm, launched its new artificial intelligence (AI) model GPT-4o Mini, which has received new safety and security updates to protect it from harmful usage.

The large language model (LLM) is created with a technique called Instructional Hierarchy, which will stop malicious prompt engineers from jailbreaking the AI model.

OpenAI said that the technique will also display an increased resistance towards issues like prompt injections and system prompt extractions. According to the company, the new method has enhanced the robustness score of the AI model by 63%.

OpenAI creates a new safety measure

In a research paper, OpenAI explained the new technique and how it functions. Jailbreaking is a privilege escalation leverage that uses certain flaws in the software to make it do things it is not programmed to.

In the previous days of ChatGPT, many people tried to make the AI generate offensive or harmful text by tricking it into forgetting the original programming.

Such prompts often began with “Forget all previous instructions and do this…” While ChatGPT has come a long way from there and malicious prompt engineering is more difficult, bad actors have also become more strategic in the attempt.

To fight issues where the AI model generates not only offensive text or images but also harmful content like methods to make a chemical explosive or ways to hack a website, OpenAI is now using the Instructional Hierarchy technique.

Put simply, the technique guides how models should behave when instructions of different priorities conflict.

By creating a hierarchical structure, OpenAI can keep its instructions at the highest priority, which will make it very difficult for any prompt engineer to break, as the AI will always follow the order of priority when it is asked to create something it was not normally programmed to.

OpenAI claimed that it saw an enhancement of 63% in robustness scores, but there’s a risk that the AI might refuse to listen to the lowest-level instructions.

OpenAI's research paper has also outlined several refinements to enhance the technique in future. One of the key areas of focus is handling other modalities like pictures or audio which can hold injected instructions.