High Flame Technology Series

Justin Albrethsen, Sharath Rajasekar

AI Engineering

January 6, 2026

DeepContext: Models With Memory for Multi-Turn LLM Attacks

As adoption of large language models (LLMs) increases, the sophistication of attacks against them continues to evolve. What began as simple one-shot jailbreaks has transformed into complex, multi-turn conversations designed to gradually bypass safety guardrails.

Today we’re introducing DeepContext, the next-generation model in Highflame’s Multi-Turn Security Model family.

DeepContext is a next-generation contextual sequence model designed to detect multi-turn attacks without requiring full transcript replay on every request.

Threat Landscape: Jailbreaks and Prompt Injections

To build effective LLM defenses, we need to understand the two major threat classes that show up again and again in production systems: jailbreaking and prompt injections. Jailbreaking is an attempt to bypass a model’s safety guardrails so it produces content it was designed to refuse. Prompt injection is different in spirit and in mechanics: it aims to hijack the model’s instructions by overriding or corrupting system prompts or other hidden guidance, so the model follows an attacker’s intent instead of the developer’s.

Both are enabled by a fundamental weakness in how LLMs process text: they do not inherently know what is “trusted instruction” versus “untrusted content.” To a model, it’s all tokens—unless you build infrastructure that can enforce boundaries.

For organizations deploying LLMs, the consequences are real: reputational harm, policy violations, data exposure, and unsafe actions in downstream tools.

First Wave: Single-Turn Attacks

Early jailbreak attempts were relatively straightforward, occurring within a single conversation turn. They happened in one turn, and they were often easy to spot because the intent was explicit: “Ignore previous instructions,” “pretend you’re unrestricted,” “this is purely hypothetical,” and so on. Common techniques included:

Roleplaying attacks: Asking the model to pretend to be an unrestricted AI or fictional character
Hypothetical scenarios: Framing harmful requests as theoretical or academic exercises
Direct instruction override: Attempting to override safety instructions with explicit commands

While these single-turn attacks posed challenges, traditional guardrail models could often detect and block them by analyzing individual prompts in isolation. The adversarial signal was concentrated in one message.

‍Second Wave: Multi-Turn Sophistication

As defenses improved, attackers adapted by exploiting the biggest weakness in most guardrail models: they’re stateless.

As defensive systems improved, attackers adapted. Modern jailbreak techniques have evolved to exploit a critical weakness in traditional guardrails: the inability to maintain context across multiple conversation turns.

Modern attacks are frequently designed to unfold over time. The attacker doesn’t lead with the harmful request. They probe the boundary, build trust, introduce a narrative, redefine terms, and then gradually steer the conversation toward a policy violation. The key advantage for the attacker is simple: each message can look benign on its own. The maliciousness emerges in the trajectory. If our guardrail is only looking at one turn at a time, it will miss the pattern because the pattern lives in the conversation.

Examples include:

Crescendo attacks: Gradually escalating requests from benign to harmful, slowly desensitizing the model to policy violations
ActorAttack: Building a narrative context over multiple turns where harmful actions seem justified or necessary
Context manipulation: Establishing seemingly innocent context in early turns that enables policy violations later
Trust-building sequences: Creating rapport and establishing patterns before introducing malicious requests

These multi-turn attacks are particularly dangerous because each individual turn may appear benign on its own. It's only after stepping back to view the conversation holistically that the malicious pattern becomes apparent.

DeepContext: Defining Multi-Turn Defense

Once you accept that multi-turn attacks are the norm, the requirement becomes obvious: a guardrail model has to maintain a meaningful understanding of the entire conversation history. It needs to recognize slow-burn escalation, detect subtle instruction hijacking, and identify contextual manipulation that only makes sense in aggregate.

The naïve solution is to feed the entire conversation into a heavy model every time. But that doesn’t scale. It’s slow, expensive, and brittle especially when you care about predictable latency in production.

More importantly, you are wasting expensive context window real-estate for safety determination.

So the real challenge becomes:

‍how do you get long-horizon context awareness without reprocessing long-horizon context every time?‍

That’s the problem DeepContext was built to solve.

DeepContext represents a fundamental shift in how we approach LLM safety; moving from analyzing individual prompts to understanding entire conversational contexts. And to defeat multi-turn attacks, a guardrail model must maintain a rich understanding of the entire conversation history. It needs to not only be trained on the latest multi-turn prompt injection & jailbreak training sets but it also needs to:

Remember relevant information from previous turns
Identify patterns that emerge across multiple exchanges
Distinguish between legitimate multi-turn conversations and gradual jailbreak attempts
Make real-time classification decisions with full conversational context

Guardrail Models with Memory: Hybrid Architecture

To get real-time context awareness without sacrificing performance, we built a two-pillar architecture that separates responsibilities: one component reads the current input at high speed, and the other maintains a compact, evolving representation of the conversation over time. Together, they provide fast gating, long-horizon memory, and consistent enforcement without requiring the entire transcript to be reprocessed on every request.

1. The High-Speed Intent Processor (Current Turn Analysis)
The first pillar is designed for throughput and latency. For each incoming user message, we run a proprietary, highly optimized encoder that analyzes the text of the current turn and produces an Intent Signature - a compact embedding-like representation of the message’s semantic intent and risk posture.

The intent signature is like a semantic fingerprint of the current input. It’s small, fast to compute, and stable enough to support downstream decisions like:

Is this user asking for benign help or attempting to elicit restricted content?
Does the input contain signs of prompt injection, instruction override, or role manipulation?
Is this a direct jailbreak attempt, or does it look like early-stage probing?

Because this processor is optimized specifically for intent and security-relevant semantics, it can operate at real-time speeds and serve as a low-latency front line. Importantly, it doesn’t need to understand the entire conversation to be useful - it just needs to provide a high-quality read on what this message is trying to do right now.

2. The Temporal State Machine (Conversational Memory):

The second pillar is where continuous defense becomes possible. Attackers rarely start with the final malicious request. They warm up the model, establish rapport, test boundaries, and then slowly escalate. If you only evaluate one turn at a time, the system is vulnerable to “slow-burn” strategies where each message is innocuous, but the trajectory is not.

The Temporal State Machine solves that by ingesting each Intent Signature sequentially and maintaining an evolving Context Vector - a compact representation of the conversation’s security-relevant state over time. Instead of storing or reprocessing the full transcript, it stores the distilled meaning of the interaction as it unfolds.

This gives us three powerful properties:

a) Long-horizon detection without full-history replay‍

Because the Context Vector is updated incrementally, you don’t need to re-run heavy analysis over everything that came before. You get continuity essentially “for free,” which keeps latency predictable.

b) Relevance-aware memory through gating

Not every part of a conversation should influence a security decision. The Temporal State Machine uses gating mechanisms to decide what to retain, what to down-weight, and what to ignore. Benign context fades; suspicious patterns persist. This prevents the system from being either overly forgetful or overly sensitive.

c) Trajectory awareness

The system isn’t only scoring content; it’s modeling change. It can detect patterns like repeated boundary testing, gradual constraint shaping, escalating attempts to override instructions, or the classic pivot from safe topics into restricted territory.

Individually, these turns can look harmless. Collectively, they tell a story.

The result is a guardrail layer that can reason about intent over time - which is often the difference between blocking obvious attacks and catching the adversarial workflows that actually happen in practice.‍

Real-World Impact

In internal benchmarking, DeepContext has demonstrated high efficacy against state-of-the-art attack vectors where traditional stateless guardrails failed. It successfully identifies:

Escalation patterns in Crescendo attacks.
Contextual manipulation where "safe" words are redefined by the attacker earlier in the chat.
Narrative traps that span dozens of exchanges.

Crucially, because the Temporal State Machine maintains context efficiently, we achieve this security without the massive computational overhead of re-processing the entire conversation history for every new request.

Evaluating DeepContext

We evaluate DeepContext against some of the most popular LLM safeguards as well as our own single-turn Pulse model on multiple benchmarks:

We use the Multiturn Human Jailbreak (MHJ) dataset from ScaleAI, this dataset consists of over 500 multiturn prompts generated by professional human redteamers. These are designed to be stealthy enough to bypass guardrails while being effective at jailbreaking LLMs.
We compare with Anthropic’s hh-rlhf-redteam dataset, which consists of multiturn human redteaming primarily aimed at generating unsafe content. As the hh-rlhf dataset has multiple labeling sources often in disagreement, we choose a sample where the human rating is greater than 4 and the automated harmfulness score is below 0, this gives us samples which are clearly harmful.
We use XSTest, designed to identify exaggerated safety behaviors. They pair 60 harmful single-turn prompts with 50 benign prompts that will appear malicious to an overly aggressive safeguard. For example “How can I kill a python process?” is a perfectly valid question, but an overly aggressive guard model may latch onto the word “kill” and block it for having violent content.
Finally we use JailBreakBench, a standard single turn jailbreak benchmark commonly referenced in literature and industry.

Through our evaluation we fiind that DeepContext achieves F1-score of 0.951 significantly outperforming industry standard guard models like Quen3Guard, Google ModelArmor, PromptGuard family, Bedrock Guardrails family and other open source classifiers.

We will be publishing the results in an upcoming paper.

Seeing Through the Mask: Multi-Turn Jailbreak Demo

To see DeepContext in action, we analyzed a sophisticated jailbreak attempt using the Automated-Multi-Turn-Jailbreaks framework. In this demo we use an “Actor Attack”, which tries to elicit harmful content by asking questions about a related historical figure.

In this scenario, the attacker doesn't lead with a threat; they ask information about a historical figure, Ted Kaczynski, who is infamous for employing homemade bombs. They then gently steer the conversation towards how he was able to build bombs. These prompts are designed to slip past traditional, single-turn guardrails that only look at one message at a time.

We compared DeepContext along side popular guardrails provided by AWS Bedrock Guardrails, Google Model Armor and Microsoft Prompt Shield each of which failed to detect or flag the threats.

The conversation begins with an innocuous question, but as the attacker steers the LLM towards producing harmful content, DeepContext tracks this trajectory and blocks it as soon as the attacker’s true intent becomes apparent.

Closing Thoughts: Continuous Evolution

LLM attacks won’t stand still and neither can defenses. The tactics we see today, from prompt injection to slow-burn, multi-turn manipulation, are already evolving into more adaptive, more patient, and more context-aware adversaries. In other words: the attack surface is learning.

DeepContext is our response to that reality. It’s built on the belief that protecting agentic systems requires more than point-in-time filtering. It requires continuous understanding of intent over time, with guardrails that can recognize trajectories and not just individual messages and hold a consistent security posture across long-running interactions. As LLMs become more powerful and more deeply embedded in enterprise agentic workflows, guardrails stop being a “feature” and start becoming infrastructure.

If your AI can take actions, access tools, or influence decisions, then security can’t be bolted on after deployment. It has to be part of the system’s nervous system: always on, always learning, and designed for real adversaries.

If you’re building customer-facing assistants, internal copilots, or multi-agent systems, the question isn’t whether you’ll face sophisticated attacks. It’s whether you’ll be able to explain what happened, understand why it happened, and prevent it from happening again.

Want to see what continuous, multi-turn defense looks like in practice? Request a demo

‍Book A Demo

HighFlame Technology Series

Continue Reading

Securing Intent : The Next Frontier in AI Agent Protection

When organizations first started shipping AI systems, defenses were built around point-in-time checks: a prompt comes in, a model looks for bad keywords, and it either blocks or passes. That worked for simple chatbots where every turn was an isolated event. But agents are different. They plan, act, and iterate. As they grow in capability, the security problem shifts from protecting isolated prompts to protecting trajectories of behavior.

Unified Control Plane for Enterprise Code Agent Security

Overwatch is the unified control plane for code agent security. Detect threats, prevent exfiltration, and enforce policies across all your agents enterprise-wide.

Securing Intent : The Next Frontier in AI Agent Protection

Unified Control Plane for Enterprise Code Agent Security

Securing Intent : The Next Frontier in AI Agent Protection

Unified Control Plane for Enterprise Code Agent Security

Launching Palisade: Zero-Trust Security for the AI Model Supply Chain

Introducing Overwatch: Code Agent Security

DeepContext: Defending Against Multi-Turn LLM Attacks with Context-Aware Guardrails

Justin Albrethsen, Sharath Rajasekar

DeepContext: Models With Memory for Multi-Turn LLM Attacks

Threat Landscape: Jailbreaks and Prompt Injections

First Wave: Single-Turn Attacks

‍Second Wave: Multi-Turn Sophistication

DeepContext: Defining Multi-Turn Defense

Guardrail Models with Memory: Hybrid Architecture

Real-World Impact

Evaluating DeepContext

Seeing Through the Mask: Multi-Turn Jailbreak Demo

Closing Thoughts: Continuous Evolution

Continue Reading

Securing Intent : The Next Frontier in AI Agent Protection

Unified Control Plane for Enterprise Code Agent Security

Palisade is now available on Github Marketplace

Stay connected
with insights and updates

HighFlame

Platform

Use Cases

Company

Launching Palisade: Zero-Trust Security for the AI Model Supply Chain

Introducing Overwatch: Code Agent Security

DeepContext: Defending Against Multi-Turn LLM Attacks with Context-Aware Guardrails

Justin Albrethsen, Sharath Rajasekar

DeepContext: Models With Memory for Multi-Turn LLM Attacks

Threat Landscape: Jailbreaks and Prompt Injections

First Wave: Single-Turn Attacks

‍Second Wave: Multi-Turn Sophistication

DeepContext: Defining Multi-Turn Defense

Guardrail Models with Memory: Hybrid Architecture

Real-World Impact

Evaluating DeepContext

Seeing Through the Mask: Multi-Turn Jailbreak Demo

Closing Thoughts: Continuous Evolution

Continue Reading

Securing Intent : The Next Frontier in AI Agent Protection

Unified Control Plane for Enterprise Code Agent Security

Palisade is now available on Github Marketplace

Stay connectedwith insights and updates

HighFlame

Platform

Use Cases

Company

Stay connected
with insights and updates