The TrojAI Approach to Securing AI Models

At TrojAI we consider ourselves realists when it comes to the problem of securing GenAI model behaviors in production. It's a diﬃcult, multifaceted, and sometimes ambiguous problem. So when we see a new vulnerability or jailbreak pop up, we like to test our systems against them to ensure we're always one step ahead.

Recently a jailbreak was found that seemed to circumvent many model defenses so we ran it against our own moderation model. We found it was robust to this attack:

‍

This returned:

‍

Our platform has proven itself robust against novel attacks. In this case, TrojAI easily detected the injection. TrojAI’s robustness is a direct result of how we have architected our AI security platform to protect AI models and applications.

Many of the current solutions out there today make the mistake of oversimplifying the nature of the problem into one that is more easily solved. This is very much a data science approach to solving the problem. TrojAI is different. It was built on a foundation of both classic cybersecurity and AI safety, which results in a distinctly unique philosophy when it comes to securing these systems. So, how do we actually think about securing AI systems?

The cybersecurity influence

Within traditional cybersecurity, it's generally agreed that one cannot create a perfectly secure system. There will always be potential exploits in any system complex enough to be useful. In this way, trying to remove all vulnerabilities is just not tractable. In many cases, the most eﬀective way to secure a system is to make exploiting the system so prohibitively expensive and time-consuming that almost no adversary has the resources to carry out an eﬀective attack. While this approach can be diﬃcult, this is usually what organizations try to do when designing secure systems.

The thing to realize about cybersecurity is that it's really a game of information. If you can eﬀectively limit the information that an adversary can gather about the system, you can stop them from being able to figure out how to find more harmful exploits you may not know about. In other words, when an attacker probes a system, the way the system responds can often give the attacker useful information to build up a conceptual model of the system being attacked. When a system responds in unexpected ways, this provides a lot of information that an attacker can potentially use to find paths to an unknown exploit. It is just as important to stop an attacker from being able to map out a new exploit in the system as it is to patch the vulnerabilities you know about.

When trying to secure a system in this way, there is a bit of a trade-oﬀ. This is particularly true in the case of LLMs. The most secure use of an LLM is one where it never produces meaningful information since a response of, "Sorry, I can't assist you with that" doesn’t allow you to gather information about the system. Clearly this sort of LLM is not useful, but as soon as it does become useful, any adversary immediately obtains knowledge about the system because they know what it is designed to do. There is a rule here that can basically be stated as, "The more stuﬀ the model is supposed to do, the easier it is to attack." Why is this? Well it gives more avenues for gathering information. To understand how this ties into our methods for securing AI, first we must talk about the more general field of AI safety.

The AI safety influence

The idea of AI security falls under the general blanket of AI safety. The main reason we distinguish between AI safety and AI security is that the majority of AI safety work is focused on finding methods to ensure that the model itself is inherently safe. While this line of work is very important to us, it doesn't solve the problem in which organizations are adopting the current crop of GenAI models that already exhibit harmful behaviors. These models may not exhibit extreme behaviors like self-exfiltrating, but they're more than capable of leaking private information, providing insecure code that gets shipped to production, or worse. We find it useful to delineate between solving fundamental research questions in AI safety versus more applied research and solutions designed for enterprise adoption. That is not to say AI safety is not useful for us. In fact, concepts from AI safety are essential to building a good framework for understanding how to solve the security problem.

Many of our techniques are grounded in results from the more technical side of AI safety. This includes mechanistic interpretability, developmental interpretability/singular learning theory, and generalization. When it comes to thinking about these systems at a high level, we tend to think about LLMs as goal-oriented agents. The idea of goal-oriented agents is simple: An agent's behaviors should correspond to achieving whatever goals they have.

So what do we mean by goals exactly? Roughly speaking, we can split goals into a few categories:

Intrinsic: This is a goal an agent has for no other reason than it has it. This may sound weird but an example of this sort of goal is present in all living things. For example, the intrinsic goal of life is to reproduce.
Extrinsic: Extrinsic goals arise from external influences. For example, if an animal is cold, it will generally seek a warmer part of the environment, since being cold gives it the goal of getting warm.
Auxiliary: Auxiliary goals are subgoals that comprise some larger goal. An example of this is something like earning money, since it's not really the money itself you care about, but instead that having money allows one to accomplish other goals. As a rule, extrinsic goals are always auxiliary to the intrinsic goal of the agent.

Now goal-oriented agents can exhibit tons of interesting behaviors, many of which are outside the scope of this article. Here we simply want to show intuitively why it is useful for thinking about securing LLMs. We start by noting that most modern LLMs are trained in a few different steps:

Pretraining: This is effectively like learning the structure of language.
Reinforcement learning from human feedback/instruction fine-tuning: This step is where models learn to generate text that humans like and that follows a particular format. Normally this is where the model is trained to follow a system prompt that specifies its behavior when following instructions from user queries. This specifies the intrinsic goal for the model as providing text that aligns with what the users want, which means the requests of the users are eﬀectively specifying extrinsic auxiliary goals.
Safety training: In this step, the LLM is trained to refuse to perform tasks that the model developer deems harmful.

These training steps give the model an intrinsic goal that aligns with what the model developer wants. At a high level, this goal tends to correspond to something like "Provide users with good responses that are helpful, harmless, and honest while following my system prompt as closely as possible," which (modulo some linguistic ambiguities) seems like a reasonable intrinsic goal. However, this isn't really what the model learns. Instead, it learns something more like "Provide users with responses they would like while trying to follow the system prompt, while trying to not do things I learned not to in my safety training."

It should be clear that the model didn't really learn the goal the developer wanted. Instead it learned a poor facsimile of it. This sort of misalignment between the goal the developer wants the model to have and what goal the model actually has is one of the core problems of AI safety, known as the alignment problem. The alignment problem sometimes gets treated as less important for current applications of LLMs. We think this is part of the oversimplification problem that we mentioned at the start. Misalignment is actually the main underlying problem that makes generative AI models vulnerable to attacks.

Understanding why misalignment is a security problem

To paint this picture, we need to start with our subjects. First is the organization that is using the model. This may not be the same organization that trained the model (in fact it rarely is) but for simplicity we will call them the developer, where their goal is to use the (already trained) model for a use-case specified to the model through a system prompt. The goal of the model is to try to follow the system prompt when responding to users, as well as its safety training. Finally, we have the attacker. We may not know exactly what the attacker wants from the system, but their goal is roughly to exploit the system for nefarious purposes. What cybersecurity tells us, though, is that the attacker actually has another auxiliary goal, which is to gather information about the system.

If we think about trying to exploit a model, we can imagine that if the model was perfectly aligned with the true goal of the developer then the attacker could never exploit it. (The idea of perfect alignment is a loaded one and may not be possible in reality, but this isn't pertinent to our scenario.) This leads us to the basic idea that generative AI models are vulnerable because of misalignment. Why is this useful? Well, it gives us a high-level understanding of the possible ways an attacker can exploit such a model. We split these into two main categories: alignment shifting and alignment abuse.

Alignment shifting

Alignment shifting exploits the fact that the model is trained to want to behave in accordance with the desires of the user. By eﬀectively making the model ignore whatever instructions are given to them in the system prompt, an attacker can take control of the model. In many cases, this also requires you to circumvent the safety training of the model. This is your classic jailbreak. We call this alignment shifting because the model's alignment is shifted, causing it to no longer act in accordance with the original extrinsic goal given by the developer. This type of exploit is the one most AI security platforms focus on because it is the easiest to detect.

Alignment abuse

The second and less common type of exploit is alignment abuse. Alignment abuse works not by changing the alignment of the model, but by finding small cracks or under-specifications in the alignment so that the model thinks that a malicious request falls within the guidelines of the system prompt and safety training. This type of exploit is far harder to detect and correct for and can also be used for more covert data gathering or as full-fledged exploits. In many cases, alignment abuse makes use of the ability of LLMs to solve problems by encoding the attack as the solution to some problem that can be triggered later on. An example of this can be seen in the following images:

The key takeaway from this is that any attacker trying to exploit a generative AI model is eﬀectively trying to find a way to make the model align with their goal instead of – or in addition to – that of the developer. Keep this in mind as it becomes relevant in the next section.

Conceptualizing securing AI behaviors

Our approach to securing AI systems is largely based on the concepts introduced in the last two sections. Since we can't patch exploits we don't know about, we need to make it too diﬃcult for an attacker to find one that actually works. We do this by devising a firewall that sits at the input and output of a generative AI model, which takes the inherent goal-based structure of these systems into account. What do we mean by this? When one is developing a generative AI application, they tend to have a specific use-case in mind that defines what the expected goal of a user interacting with the system should be. It's easy to see what this means by example. Suppose our system is meant to help users with task A, we should expect every user to have a goal like “get help with task A.” If they exhibit a different goal, we want to stop them from being able to pursue and achieve this goal.

Having an expected goal of the user, which should line up with the goal of the developer, gives us a relatively robust framework to help ensure the security of AI models by adding additional verification points that many other frameworks don't have. If you really want to secure a modern LLM based AI system, the best way to do this is by designing a monitoring system that is capable of answering the following questions about a user-model interaction:

What is the underlying intent of the user input? Does it align with what the developer expects is the user intent?
If one rephrases the user query, does the intent stay the same?
Does the user input fall within the range of expected prompt structures regardless of underlying intent?
Is the user input similar to known safe inputs? What about known exploits?
Does the information in the model output align with the expectations of the developer?
If the user request has passed all previous checks, does the model output align with the request of the user?
If you paraphrase the model output, does it still align with the request to remain true?
If you paraphrase the output, are there structures in the outputs that never change that seem unrelated to the actual content of the output?
How much new unintended information could the user extract about the system from the output?
Does the model output look like a safe output from a similar known safe request?

We design our monitoring system by constructing multiple different rules and subsystems whose outputs correspond to answering at least one of the questions. Any attack carried out by a malicious actor then must somehow cause the monitoring system to produce incorrect answers for all of the possible questions, potentially multiple times. Figuring out how to bypass this in any sort of rate-limited black box setting will almost certainly be intractable. This also makes it very unlikely that the model will exhibit failures not induced by a malicious actor, since the output validation ensures that the output is actually aligned with both the user and the developer.

There's a lot more to be said about how we actually do this in practice. Any monitoring system like this should have minimal impact on usability, be flexible enough for almost any use case, and be easy for developers to set up in their system. In a future post, we will go into more depth on the technical details surrounding how we approach building such a system, as well as some interesting discoveries made along the way.

How TrojAI protects AI models and applications

From the outset, our mission at TrojAI is to enable the secure rollout of AI in the enterprise. We are a comprehensive AI security platform that protects AI/ML applications and infrastructure. Our best-in-class platform empowers enterprises to safeguard AI applications and models both at build time and run time. TrojAI Detect automatically red teams AI models, safeguarding model behavior and delivering remediation guidance at build time. TrojAI Defend is our firewall that protects enterprises from real-time threats at run time.

By assessing the risk of AI model behaviors during the model development lifecycle as well as protecting model behavior at runtime, we deliver comprehensive security for your AI models and applications.

Want to learn more about how TrojAI secures the largest enterprises globally with a highly scalable, performant, and extensible solution?

Visit us at troj.ai now.

The TrojAI Approach to Securing AI Models

The cybersecurity influence

The AI safety influence

Understanding why misalignment is a security problem

Alignment shifting

Alignment abuse

Conceptualizing securing AI behaviors

How TrojAI protects AI models and applications

The latest from our blog

What Is Model Context Protocol (MCP)?

TrojAI and OpenAI: Extending AI Security and Compliance Through a Strategic Integration

What Is GenAI Runtime Defense (GARD)?

Secure your AI future today.