Adversarial Testing of AI is not Optional
Table of Contents
by:
Satu Korhonen and
Silvan Gebhardt
AI Systems Fail Differently #
Consider this: a large language model greenlights a malicious URL because it looks like a familiar domain. A coding assistant suggests a firewall rule that exposes the wrong port, not from a bug, but because it misunderstood your intent. Another model recommends uploading sensitive logs to Pastebin, while a fourth suggests hardcoding access credentials directly into a Git repository.
These aren’t edge cases; they are real events we’ve seen in the field. What makes these AI systems so dangerous is their ability to be confidently and convincingly wrong. This isn’t a simple usability flaw—it’s a security risk with consequences that scale with the AI’s role. A flawed coding assistant creates vulnerabilities. A flawed AI companion can have devastating, real-world impacts, including a documented case that contributed to a teenager’s suicide.
Yet many teams skip the step designed to catch these failures: adversarial testing.
What Adversarial Testing Is (and What It Isn’t) #
Adversarial testing is a unique discipline that probes how an AI system behaves under stress, misuse, or malicious intent. It must not be confused with quality assurance or red teaming, as it’s not a predictable, checklist-based task but a dynamic search for unexpected failures.
The term adversarial comes from security: it means thinking like an attacker. You step into the mindset of someone who does not use the system as intended, but who tries to bend, trick, or subvert it. This matters because generative models do not respond in any standardized way. They respond with language, and human language is messy.
A common failure mode is jailbreaking, which involves tricking a large language model into ignoring its safety rules. This can be done through creative language—using metaphors, paraphrasing, and repetition—or through a more technical method called prompt injection.
In an earlier write-up Jolly Trivedi introduced the concept of prompt injection, where an attacker hides malicious instructions within seemingly benign inputs like emails, logs, or even emojis. The goal is to subvert the model’s intended behavior and manipulate what it does or says.
Sometimes, breaking a model’s safety features can be absurdly simple; we once succeeded just by asking the model to repeat a single word – over and over.
But the vulnerabilities go deeper than clever tricks. Generative AI models are also prone to “hallucinate” or “confabulate”—a clinical-sounding term for simply making things up. They can assert falsehoods with unshakable confidence, even inventing data to support their claims, much like a child insisting their make-believe world is real.
These are not harmless hypotheticals. For example, in automated security or human rights contexts, such confident errors can be dangerous. In fact, finding these flaws has become a global sport. When development teams do not put in the work to secure their systems, users will eagerly find, and publicly share, every spectacular failure.
Why Traditional Security Testing Doesn’t Work #
Unlike traditional software, LLMs are not built on deterministic logic that can be easily tested. You can’t run a standard unit test and expect the same output every time, because their core interface is fluid natural language, and their behavior is probabilistic.
Think of it less like a calculator and more like a complex prediction engine. At its core, an LLM is a vast matrix of probabilities. Your input text is broken down into numerical “tokens,” which the model uses to calculate, one by one, the words most likely to come next. This word-by-word prediction is why even identical prompts can yield slightly different answers.
To be clear, an AI model does not exist in a vacuum. It’s surrounded by traditional software: the networks, data pipelines, databases, and APIs that feed it information and connect it to other systems. This entire infrastructure can, and must, be secured using traditional red teaming and security testing. That work is still absolutely required.
But the traditional infrastructure is only half the story. The true, expanded attack surface of a generative AI system is the probabilistic model at its core. The real vulnerability lies in the model’s behavior—a dynamic quality shaped by its training data and the subtleties of human language.
Unlike software bugs, these vulnerabilities aren’t coding errors. They are behavioral flaws that arise from:
- Misgeneralizations or biases in the training data
- A tendency to hallucinate or invent information
- The simple fact that LLMs are built to be creative, not truthful
You can’t patch a hallucination like a software bug, and you can’t write a predictable unit test for probabilistic behavior. Every new integration, prompt adjustment, or model update creates a new risk profile that demands fresh testing.
Put simply: If you are not adversarially testing your AI system, you cannot know how it will behave under pressure. And if you do not know its behavior, you cannot know the risks it poses to your company.
How We Ended Up Doing This #
When we started working with generative AI at Helheim Labs, we were not focused on adversarial testing. We were pulled into it by a fundamental problem: none of the traditional testing methods worked. Standard quality assurance—unit tests, API contracts, and functional checks—relies on predictable inputs and outputs. But the interface for generative AI isn’t code; it’s natural language, which is inherently fluid, contextual, and ever-changing. Our existing toolkit was simply not built for the job.
Interacting with these models is done through prompts, and a prompt is nothing like a stable function call. It’s an open-ended conversation that embraces ambiguity, contradiction, and nuance. Two users can ask for the exact same thing in slightly different ways and get wildly different behaviors from the model. This isn’t a bug; it’s a core feature of the technology. As a result, traditional testing based on “if input X, then output Y” is fundamentally useless.
This led us to begin researching and running targeted experiments. We started asking systematic questions:
- What happens when a user subtly rephrases a forbidden request?
- How easily can safety filters be bypassed using analogies or reverse psychology?
- Can seemingly innocuous inputs like log entries, markdown, or even invisible characters be used as injection vectors?
We quickly found we weren’t alone in asking these questions. Because jailbreaking LLMs has become a popular global pastime, examples of these very attacks were abundant and easy to find.
Our workshops with engineers, researchers, and red teamers led to two key realizations. First, everyone agreed that completely eliminating hallucinations and jailbreaks is incredibly difficult. Second, we discovered a more pressing problem: the teams building with LLMs didn’t know how to test for these vulnerabilities in the first place. The resources for learning adversarial testing simply didn’t exist in a central, accessible format.
So we built one.
Teaching Through Breaking: The hackAI CTF #
Our solution is hackAI, a capture-the-flag (CTF) style environment where you learn adversarial thinking by doing it. We built it as a live, evolving set of challenges that simulate the real-world weaknesses we’ve seen in LLMs—think of it as a training gym, not a product demo.
In it, participants learn to think like an attacker by practicing the core techniques: extracting hidden system prompts, manipulating context windows, bypassing filters with clever language, and triggering unpredictable behavior with unexpected formats.
The idea of it is to help participants learn to get the hidden instructions, manipulate context windows, bypass filters through language, and trigger unpredictable behavior through unexpected formats.
Every challenge in hackAI is based on a real-world failure we’ve seen in the wild—and our backlog of future challenges is substantial. The goal isn’t to win, but to build intuition. We want participants to understand what kinds of inputs cause which types of failures, and why that matters in a production system.
Ultimately, this is what adversarial testing is at its core: the social engineering of a probabilistic AI that communicates through language.
We’ll publish more about hackAI soon, including insights from early participants and lessons from the testbed itself. But the short version is this: we needed a safe space to fail, so we built one.
Security Starts by Assuming Failure #
If you take one idea from this post, let it be this: Deploying an AI system without adversarial testing isn’t just risky—it’s flying blind into a storm you can’t even imagine.
LLMs aren’t narrow tools with predictable boundaries; they are general-purpose engines that generate novel responses to any input. Their failures aren’t simple bugs in the code. They are emergent properties of a complex, open-ended system.
Because use cases evolve faster than best practices, adversarial testing isn’t just helpful; it’s foundational. It won’t catch every failure. But without it, you’re not just guessing about your system’s behavior in the real world.
You’re guessing with your security.
Credits #
- The hero image is a LLM distorted version of “Vancouver at Night” by Lari Huttunen.