Defending the prompt: How to secure AI against injection attacks

A travel chatbot sends users to the wrong destinations. An HR assistant endorses unqualified candidates. An email-writing AI leaks sensitive content it was never meant to access. None of these systems were hacked in the traditional sense. No one breached a firewall or planted malware.

The culprit? Carefully crafted words that are strategically placed to trick the AI into behaving badly.

[Read Part One of this article: When AI Goes Off-Script: Understanding the Rise of Prompt Injection Attacks]

These kinds of attacks, known as prompt injections, are becoming one of the most vexing problems in generative AI. As more organizations embed large language models (LLMs) into real-world applications, security teams are discovering just how easy it is for language itself to be weaponized.

While much of the industry is still waking up to the threat, OWASP has been tracking prompt injection closely. In its 2025 guidance, the organization laid out a practical set of defenses for security leaders and AI builders to follow.

This article walks through those defenses. From input filtering to adversarial testing, these strategies won’t eliminate the problem, but they can make it harder to exploit — and much easier to detect before real damage is done.

Constrain model behavior at the system prompt level

The first and most effective control point is the system prompt — the invisible set of base instructions that guides how a model responds to users. OWASP recommends defining the model’s role, scope, and behavioral limits within this prompt and instructing it to disregard any user input that tries to change its core instructions.

Models should be told explicitly what they can and cannot do, and developers should reinforce that context in every interaction. This limits the surface area for direct prompt overrides.

Define and validate expected output formats

When you constrain what a model is allowed to say — and in what format — it becomes easier to detect when something goes wrong.

OWASP encourages defining output structures clearly and using automated systems to validate that responses match expectations. For example, if an LLM is supposed to return a JSON object or cite sources, anything outside those parameters should raise a flag. This type of deterministic validation can reveal when a prompt injection has changed the model’s behavior behind the scenes.

Filter inputs and outputs for malicious content

Filtering isn’t just for spam emails anymore. OWASP stresses the importance of content validation both entering and exiting the LLM pipeline.

Input filters can identify known risky keywords, block specific patterns, or flag attempts to override instruction sets. Output filters can evaluate responses for tone, relevance, factual grounding, or even subtle manipulations. OWASP also recommends using the "RAG Triad" to assess output: check that it's grounded in the correct data, relevant to the input, and coherent in its response structure.

Apply least privilege to what the LLM can access

No LLM should have full access to your backend systems. OWASP’s 2025 guidance pushes for strict privilege separation.

This means giving the LLM limited capabilities through tightly scoped API tokens or wrappers, rather than granting it direct access to sensitive data or services. Function calls should be wrapped in application logic — not exposed to the model itself. This way, even if the model is tricked into asking for something it shouldn't, it can’t actually act on that request.

Require human approval for high-risk actions

AI should never act alone when the stakes are high. If an LLM is writing emails, accessing files, or initiating workflows, human-in-the-loop review is essential.

OWASP recommends placing checkpoints in front of privileged operations. Before the model can trigger something irreversible, require explicit user confirmation. This limits the risk of automatic escalation when a prompt injection slips past other defenses.

Segregate and label external content

Indirect prompt injections are often hidden in user-submitted files, scraped web content, or third-party data sources. OWASP says one of the best defenses is to isolate and label any untrusted content within the model prompt.

This might include using markup like “do not trust” tags, separating input from system commands with structure or metadata, or limiting how much unvetted content is passed to the model in the first place. Think of it as sandboxing the words the AI is allowed to trust.

Treat the LLM like an untrusted user during testing

Perhaps OWASP’s boldest recommendation is to change the way teams test their LLM integrations. Rather than assuming the model is just a helpful tool, organizations should treat it like an untrusted user who might lie, manipulate, or escalate access.

This approach means conducting adversarial testing, building red-team exercises around AI use cases, and checking for weaknesses in how the model is instructed or what it can access. Prompt injection isn’t always obvious — so your testing shouldn’t be either.

Prompt injection can’t be fully erased. But it can be contained

Unlike code injections or buffer overflows, prompt injection isn’t a programming bug. It’s a linguistic vulnerability built into how LLMs reason and respond.

OWASP makes clear that we’re not going to “patch” our way out of this one. But thoughtful design, strict boundaries, and layered safeguards can significantly reduce the risk.

These strategies give teams a playbook — not just for blocking known threats, but for identifying weak spots in the way AI is used throughout the organization. Security teams, developers, product managers, and compliance leaders all have a role to play.

This article is part of a 10-part SC Media series on the OWASP 10 for LLM Applications 2025. As part of an editorial collaboration with the OWASP Generative AI Security Project, this series aims to raise awareness around secure GenAI development, threat modeling, and mitigation best practices.

Next up: SC Media tackles Sensitive Information Disclosure, another high-priority GenAI risk with far-reaching consequences for privacy and compliance.