All posts

Why We Founded TrojAI: Behavioral Risk Is the Biggest Threat to AI Models

James Stewart
Co-Founder and CTO
Table of Contents

Stephen Goddard and I started TojAI because we saw early on that threats against AI models were going to increase. My first startup built AI/ML models that analyzed live video to detect the presence of violence in public spaces. During the development of these models, I realized that inputs could be manipulated to derive different, undesirable outputs, so I began to research the potential impact of this risk. 

AI model behavior and manipulation

Around this time, the infamous Panda Paper was released. In this influential work, Ian Goodfellow, Jonathon Shlens, and Christian Szegedy were able to add a specific noise pattern to an image of a panda that was undetectable to human eyes but caused an ML model to misclassify it as a gibbon. By altering just .004% of the pixels in the panda image, the researchers showed that small changes to an input can have a significant impact on the output. For me, this highlighted the importance of understanding how a model’s behavior can change when seemingly minor changes are made to the input. 

Another example that highlights this behavioral model risk, is when researchers from the Massachusetts Institute of Technology, University of California at Berkeley, and FAR AI found that KataGo, an AI model for the board game Go, could be trivially beaten with moves that convinced the machine that the game had completed. While the Go AI could easily defeat a professional or amateur Go player using a logical set of moves, the machine could be easily beaten by an adversarial attack making decisions that no rational player would normally make.

While I knew of such attacks, it’s natural to triage away threats as being unlikely. In fact, it wasn’t until April 2019 that I understood how the threat of adversarial AI could manifest into actual risk that I had no control of. In the Fooling Automated Surveillance Cameras Paper, the authors developed an adversarial patch that renders objects invisible to computer vision models. This threat was no longer theory, and it was completely outside of my control. Two months later, we founded TrojAI.

AI model behavior and enterprise risk

While these examples may seem trivial, they highlight a very important risk to enterprises. What if, instead of a panda picture, the AI was attempting to classify PII or intellectual property? What if the model was trained in such a way that IP was identified as non-IP and then exposed it to an end user through a series of prompts? What if the model was determining whether a person should be approved for credit, and through a series of inputs, the model was manipulated to grant a lot more credit than traditional risk models warranted? These examples have a real business impact, which is what motivates us at TrojAI. 

Adversarial attacks on AI/ML and GenAI

We started TrojAI in 2019 with a focus on reducing the risk of adversarial attacks on predictive ML and AI. Since then, GenAI has become more prominent in enterprises, and the risk of adversarial attacks has not decreased. If anything, adversarial attacks are more likely given the size of large language models and the number of parameters - not to mention the mission-critical applications being developed using GenAI. Additionally, with GenAI the learning and training approach lends itself to attacks of the adversarial nature. As we analyzed the cybersecurity problems with AI, we found that the inputs and the outputs are where the new threat surface is. We knew we needed to build technology to manage these risks. 

AI security standards: MITRE, OWASP, and NIST

As the industry evolves, a number of new industry frameworks acknowledge the risk of model behavior, specifically MITRE, OWASP, and NIST. These frameworks have been developed using real practitioner experience and extensive cybersecurity research. 

MITRE ATLAS

MITRE’s Adversarial Threat Landscape for Artificial-Intelligence Systems, or more simply MITRE ATLAS, is a framework designed to help prevent attacks and protect sensitive data in AI systems. It is based on the well-known MITRE ATT&CK framework, which focuses on how attackers behave in a real-world attack. However, ATLAS expands its scope to include defensive and resilience tactics to adversarial attacks. 

The goal of ATLAS is to help cybersecurity teams understand the full range of tactics used by an attacker, and not just the attack itself. The broader context of how attackers operate and how defenders can respond to them provides more actionable insights for security teams to effectively protect their organizations.

OWASP LLM Top 10

The OWASP LLM Top 10, which was recently updated for 2025, identifies the most common risks, vulnerabilities, and mitigations for developing and securing generative AI and large language model (LLM) applications during development and deployment. The list was created by the Open Web Application Security Project (OWASP) to raise awareness about the unique security concerns surrounding LLMs, especially as these models become increasingly integrated into applications and services.

Unlike MITRE ATLAS, which details how attackers might try to exploit AI systems, the OWASP LLM Top 10 acts as a checklist for what to protect against as developers are building AI systems. OWASP LLM Top 10 offers a high-level overview of key risks, while MITRE ATLAS provides detailed information on specific attack tactics and techniques.

NIST

The National Institute of Standards and Technology (NIST) has long been known for its Cybersecurity Framework, which was designed to help organizations to better understand and improve their management of cybersecurity risk. The NIST AI Risk Management Framework acts as a guideline for organizations that are developing and deploying AI by focusing on safety, security, explainability, fairness, privacy, and accountability of AI systems.

NIST aims to build trust in the design, development, use, and governance of AI to enhance both safety and security. It views the trustworthiness of AI technologies as critical and differs from MITRE and OWASP in that it considers the full lifecycle of AI models and applications.

Adversarial attacks in AI

Each of these frameworks cite specific risks in the AI and GenAI attack chain that need to be managed. As I talk to CSOs and AI security teams who are looking to enable AI and GenAI, their main concern is in how the model behaves and where the application meets the model at the inference layer. The concerns I hear most from these enterprises include the following:

  • Data Extraction and Data Loss: The model produces outputs that it shouldn’t based on intentional or unintentional behavior from the user or the application. 
  • Prompt Injection: An attacker attempts to manipulate an AI model’s behavior using a series of actions and prompts. This can result in data loss, data extraction, model theft, and data poisoning, and can be done using direct and indirect methods. 
  • Jailbreaking: An attacker attempts to override the safety measures and ethical safeguards built into AI models. Through the use of malicious inputs, a model performs actions it's not designed to do by an attacker tricking it into ignoring its guardrails.
  • Model Denial of Service: An attacker disrupts an AI model or application’s performance and responsiveness by sending a high volume of inputs or complex inputs meant to confuse or consume system resources. These inputs exhaust the model’s computational resources, making the model unavailable or increasing the cost of running the model in production. 
  • Inappropriate Use: A user intentionally tries to get the model to respond inappropriately with the intent of embarrassing or damaging the reputation of the AI model owner. 

Securing AI model behavior

Many enterprises are rolling out AI models and applications. Cybersecurity teams are looking for ways to say yes to these critical enterprise initiatives but they are concerned about protecting the integrity of model behavior. As these models become increasingly integrated into systems throughout the enterprise, the potential for unintended behavior and misuse grows. Striking a balance between innovation and risk management is critical to ensure AI deployments are secure and aligned with organizational goals.

How TrojAI protects AI models and applications

Our mission at TrojAI is to enable the secure rollout of AI in the enterprise. We are a comprehensive AI security platform that protects AI/ML applications and infrastructure. Our best-in-class platform empowers enterprises to safeguard AI applications and models both at build time and run time. TrojAI Detect automatically red teams AI models, safeguarding model behavior and delivering remediation guidance at build time. TrojAI Defend is a firewall that protects enterprises from real-time threats at run time. 

By assessing the risk of AI model behaviors during the model development process as well as protecting model behavior at runtime, we deliver comprehensive security for your AI models and applications. 

Want to learn more about how TrojAI secures the largest enterprises globally with a highly scalable, performant, and extensible solution?

Visit us at troj.ai now.