
‘Mythos-level’ Fable model released to public: How Anthropic plans to prevent misuse

Anthropic released its Claude Fable 5 model on Tuesday, making “Mythos-level” AI capabilities available to the public for the first time.

Previously only available to about 200 organizations under the company’s Project Glasswing initiative, Claude Mythos demonstrates unprecedented capability in discovering and exploiting software vulnerabilities. This capability raises concerns about potential misuse by cyber threat actors to find and exploit zero-day vulnerabilities before defenders can fix them.

The company said Fable retains Mythos’ abilities but comes with robust safety layers that cause cybersecurity-related tasks to be blocked or rerouted to less powerful models, namely Claude Opus 4.8.

These safeguards include the use of new classifier models trained on “violative cyber exchanges” as well as attacks generated by “internal automated red-teamers,” according to the model’s system card.

The safeguards work in two stages: a probe first screens the model’s internal activations for signs of misuse, and a classifier then makes the final decision on any traffic the probe flags.
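To make the two-stage flow concrete, here is a minimal Python sketch of a generic screening cascade: a cheap first check flags traffic, and a second check makes the final call only on what was flagged. This is not Anthropic’s implementation. Every name in it (`two_stage_screen`, `toy_probe`, `toy_classifier`, the `0.5` threshold) is hypothetical, and the toy functions score plain text, whereas the system card describes a first stage that reads the model’s internal activations.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    action: str  # "allow" or "block"
    stage: str   # stage that made the call: "probe" or "classifier"


def two_stage_screen(
    request: str,
    probe_score: Callable[[str], float],
    classifier_is_violative: Callable[[str], bool],
    probe_threshold: float = 0.5,  # hypothetical cutoff, not a published value
) -> Decision:
    """Generic two-stage cascade: screen first, then decide on flagged traffic."""
    # Stage 1: anything scoring below the threshold passes without reaching stage 2.
    if probe_score(request) < probe_threshold:
        return Decision("allow", "probe")
    # Stage 2: the classifier has the final say on flagged requests.
    if classifier_is_violative(request):
        return Decision("block", "classifier")
    return Decision("allow", "classifier")


# Toy stand-ins (assumptions for illustration only). A real first stage would
# inspect model activations rather than keywords in the request text.
def toy_probe(text: str) -> float:
    return 0.9 if "exploit" in text.lower() else 0.1


def toy_classifier(text: str) -> bool:
    return "write" in text.lower()


for prompt in (
    "Summarize this recipe",
    "Explain what an exploit is",
    "Write an exploit for this bug",
):
    print(prompt, "->", two_stage_screen(prompt, toy_probe, toy_classifier))
```

In this sketch the second stage runs only on flagged requests, so a flagged request can still be allowed if the classifier clears it. How the production system sets thresholds, or when it reroutes to a less powerful model instead of blocking, is not specified in the reporting.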

“Overall the evidence suggests that breaking our cybersecurity safeguards is extremely difficult (though not impossible),” the system card stated.

Fable’s public release comes about a week after Anthropic announced 150 more organizations would be joining Project Glasswing and said it was working on releasing Mythos-level capabilities to the public once safeguards were in place.

The company also said in the announcement that it predicted other companies could release “Mythos-class” models with fewer protections within the next six to 12 months.

“Being first to put a model like this in the market matters, and they want that flag in the ground. But I’d give them credit for their perseverance, too. The hard part was never building the capability; it was figuring out how to get it into customers’ hands at scale without it being reckless,” Chris Boehm, field CTO at Zero Networks, told SC Media.

More than 1,000 red-teaming hours produce a handful of successful jailbreaks

Anthropic announced that Fable’s safeguards blocked 100% of offensive cyber tasks in tests that did not include specific attempts to evade the safeguards.

Internal automated red teaming, which involved attempts to perform offensive cybersecurity tasks across 400 turns with the ability to “rewind” when blocked, resulted in a 5.4% attack success rate compared with 56.6% for Claude Opus 4.8, Anthropic said.

“The tasks are mostly simple and not representative of real cyber usage — they are sometimes as simple as encrypting files on a remote server,” the company added. “On more complex and realistic tasks we have not yet seen successful jailbreaks on our production system.”

Fable’s system card provides more detail about internal and external red-teaming efforts. A public bug bounty, totaling about 100,000 attempts and an estimated 1,000 hours of effort, produced no universal jailbreaks and only two task-specific jailbreaks on “simpler, more dual-use tasks,” the company said.

A private bug bounty with 2,000 submissions yielded no jailbreaks, and a 20-hour attempt by 10a Labs to create ransomware using Opus 4.8 with Fable safeguards was unsuccessful.

An unnamed external partner that tested the final launch version of Fable found that it didn’t comply with any single-turn requests related to cyberattack planning, exploit development or defense evasion across 30 public jailbreak techniques.

A handful of successful jailbreaks were discovered by other external partners. Trajectory Labs, which also tested Opus 4.8 with Fable safeguards, managed to use the model to exploit a Firefox vulnerability using a custom harness and repeated iteration over five days of work. However, Anthropic said this involved an earlier version of the safeguards, and that no universal jailbreaks were found.

The UK AI Security Institute (UK AISI) tested the final launch version of Fable and, within a few hours, developed a jailbreak for single-turn queries related to vulnerability discovery and exploitation. After about two days of testing, AISI expanded its jailbreak approach to enable multi-step malicious agentic tool calls. However, it did not achieve complete long-form malicious agentic tasks.

“These are interim results from a compressed testing window, and the testing required substantial adaptation of AISI methods to a new long-form agentic setting; the time and effort involved are therefore not a measure of the relative robustness of [Claude Fable 5]’s safeguards, nor directly comparable to AISI’s testing of other model safeguards,” AISI stated.

Anthropic said it planned to continue working with UK AISI to test the robustness of Fable’s safeguards.

“When we’re thinking about safeguards, we have to remember that the capabilities are in the model and the protections are layered on top. Those protections are important, but they’re not the same thing as removing the capability itself,” noted Etay Maor, vice president of threat intelligence at Cato Networks, in comments to SC Media. “That’s why I describe them as speed bumps rather than barricades. They can slow attackers down, and that’s valuable, but they’re unlikely to stop the threat actors we worry about most — the ones with the time, resources, and motivation to keep testing until they find another way in.”

Public reports show Fable blocks benign cyber tasks

Anthropic’s announcement noted that Fable’s guardrails are designed to be especially cautious, are “still stricter than would be ideal,” and are likely to trigger false positives for some benign requests.

Public reports by cybersecurity researchers appear to confirm this behavior. Valentina Palmiotti, head of X-Force Offensive Research at IBM, said on X: “It rejects any request that could be tangentially cyber related. Even innocuous tasks like reading a blog post.”

While Anthropic intends to keep Mythos’ cyber capabilities behind Project Glasswing, with Fable available for other tasks such as coding, cybersecurity veteran Matt Suiche told TechCrunch that even asking the model to “write secure code” triggers the safeguards.

Alongside Fable 5, Anthropic announced Mythos 5, which is available only to Project Glasswing participants and shows slightly greater cyber capability than the initial Mythos Preview.

Anthropic also announced that business customer data for users of Fable 5, Mythos 5 and future models with comparable ability will now be retained for 30 days as a safety measure.

“Organizations in regulated industries need to understand exactly what data is being retained and whether that aligns with their compliance and legal requirements before they start using these models in sensitive environments,” noted Maor.
