
Rethinking Regex: Smarter detection for a modern threat landscape


Using regular expressions, or regex, was once a convenient and powerful way for web application firewalls (WAFs) to find malicious code in web requests.

Sadly, it doesn't work that well anymore. Regular expressions need far too much tuning, require overly complex matching rules and fail to understand context, resulting in a high rate of false positives and an unacceptable rate of letting actual malicious code get through.

Regex still has a place, yet its limitations compel modern WAFs to rely on other methods to filter out malicious content. One of the most reliable is context-aware parsing, which can analyze the environment around a text string to gauge its possible maliciousness. The same string of characters that's part of a news story in one context can be part of an attack in another.

Context-aware parsing also makes it harder for attackers to escape detection by nesting commands from different languages and protocols inside one another, a classic way to fool regex detection.

"Regex is useful for the copy-and-paste stuff beloved by novice attackers — when the attacker doesn't understand how the attack works because someone else built it for them," said Kelly Shortridge, VP of Security Products at Fastly, in a company blog post. "The people who craft the reusable bits of the attack know they could morph any of it to circumvent the simplistic pattern matching mechanism regex offers."

Fastly's SmartParse, part of the company's own web application firewall, uses several different parsing engines to determine the context of a text string, such as whether the text is used as part of a command or just as part of human speech.

"You have OS commands that overlap with the English language," explains Xavier Stevens, Staff Security Researcher at Fastly, "The command 'ID' and the word, just 'ID,' that might come up in technical text, they're the same."

"You have to start to look for patterns beyond that to determine what was the intent of this word," he adds, "maybe looking at the symbols that came prior to that, or maybe came after that, to try to make some sort of assertion as to whether this might be an attempted attack or not."

The virtues of regex, and why it's now out of date

A regular expression is just a way for a computer to find a piece of text in a document. If you've ever used the find function in a Microsoft Word document, for example, to mark all instances of "cat", then you've used a very basic regular expression.

But regexes can get complex very fast. Even when searching for "cat", you may need to specify whether you're looking for the word "cat" by itself or all instances of the letter sequence "c-a-t".

If you want to exclude words like "concatenate" or "catamaran," then you should add a space, e.g. "cat ". But if you do that, you'll accidentally exclude true instances of "cat" when they're followed by a punctuation mark, such as "cat," "cat?" and, well, "cat."

Then you'll need to take your first step into regex madness and specify that "cat" can be followed by a punctuation mark or a space, but NOT a letter or numeral. Your regex may then look like

cat[ ,.?/*<>"'!@#$%^&*()_-+={}[]]

where the "[" and "]" brackets indicate the beginning and end of the permitted characters.

But you're not done, because some of those punctuation marks themselves are used to modify regular expressions. You'll need to introduce "escape" characters such as "\", which specify that the following character is part of the text being searched for rather than part of the regex command.

That leaves you with something like

cat[ ,\.\?/\\*<>"'!@#\$%\^&*\(\)_\-\+=\{\}\[\]]

and I've probably missed a few characters.

If it gets this messy just looking for the word "cat," imagine what it looks like when you're using regex to search for potentially malicious strings of text in a web page.
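For completeness, modern regex engines offer a word-boundary anchor that sidesteps the character-class gymnastics above; a minimal Python sketch:

```python
import re

# \b matches the boundary between a word character (letter, digit,
# underscore) and a non-word character, so "cat" is matched as a
# standalone word no matter what punctuation follows it.
pattern = re.compile(r"\bcat\b")

texts = ["The cat sat.", "Is that a cat?", "concatenate", "catamaran"]
results = [bool(pattern.search(t)) for t in texts]
# The first two match; "concatenate" and "catamaran" do not.
```

Even so, the anchor only solves the toy problem of whole-word matching; it does nothing about the context problem the article turns to next.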

In her blog post, Shortridge gives an example, borrowed from a discussion on the developer forum Stack Overflow, of a regular expression designed to find standard email addresses.

It looks like this:

(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9]))\.){3}(?:(2(5[0-5]|[0-4][0-9])|1[0-9][0-9]|[1-9]?[0-9])|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])

That's a short one, relatively speaking.

The core of first-gen WAFs

When properly used, regexes are fast and powerful. Computers are very good at finding matching text and have been ever since the green-screen days. That's probably why the first web application firewalls that appeared around the turn of the millennium used regex.

The granddaddy of WAFs is ModSecurity, a regex-based WAF first released in 2002 and now maintained by the Open Worldwide (formerly Open Web) Application Security Project, or OWASP. ModSecurity and the OWASP Core Rule Set are the basis for many WAFs still in wide use today, including some of those offered by leading cloud service providers.

Regex was great for finding potentially malicious commands in JavaScript, the primary website coding language used when ModSecurity was first developed in 2002. But since then, many other coding and markup languages and protocols such as JSON and XML, and attack techniques such as SQL injection and cross-site scripting (or XSS), have become commonplace.

Regex isn't as good at catching those, because a command or text string that means one thing in one coding framework may mean something very different in another.

"SQL injection in particular was really painful from a regex standpoint," says Stevens. "There's too many patterns that can apply, and it's really hard to make that work within regex without matching other things, and it just becomes unwieldy."

For example, some firewalls might block the Wikipedia page on SQL injection because it contains samples of malicious code — a classic false positive. Because that's a relatively high-profile page, most firewalls will make an exception for it. (Our own SC Media site is occasionally blocked due to discussion or even mention of attack techniques.)

But what if someone were to tweak the Wikipedia page itself to insert actual SQL injection code? That might get past some regex-based WAFs, especially if they're not properly "tuned" to filter out truly malicious stuff while ignoring benign code.

"What typically will either end up happening is you are so broad that you're going to over-match and you're going to get a lot of false positives," Stevens explains. "Or you're going to tighten that pattern down so much that you are going to have your true negatives, the attacks that get through, because the pattern has sort of been de-scoped so much that it can't detect these things."

That tuning takes a lot of time and effort. It might be months after the initial implementation of a regex-based WAF before the humans who run it reach the right balance between false positives and false negatives. Even after they've achieved that equilibrium, the engineers will have to keep fine-tuning the regex rules indefinitely as new attack techniques emerge.

"Regex has its place," said Shortridge bluntly, "and attack detection isn't it."

Putting it all in context

OWASP's Core Rule Set has evolved since 2002 and now has tools for detecting SQL injection, XSS, file injection and other common web attacks.

One of the most powerful is libinjection, developed in 2012 by a coder who went on to build the core of Fastly's next-gen WAF. libinjection was designed to look for SQL injection attempts while filtering out benign text, and it now spots XSS attempts too.

Fastly's SmartParse detection engine is built on top of libinjection and uses several other parsing tools. Like libinjection, SmartParse looks at the environment of a text string to determine which protocol, language or framework is being used, if any. Only then does SmartParse evaluate whether the suspicious text is part of an attack.

"By not treating it as a dumb string, it helps isolate the context that you have around where the attack would be located, and not just looking for symbols in text," Stevens explains. "It's a lot of different parsers that understand what protocol or what the attack would be based on."

In addition to JavaScript, SmartParse also recognizes Google's gRPC framework, Meta's GraphQL and the full-duplex communication protocol WebSockets, among others.

"If we know that you're sending a GraphQL query, we can parse that and understand the components of that query rather than treating the GraphQL operators themselves as part of that string," he adds as an example. "We don't get confused with things like when we see a brace or when we see dollar signs that might be present in other types of attacks."

This context awareness leads to a much lower false-positive rate and a much-reduced need for client-side tuning. Some of the tuning happens on Fastly's end as its engineers add exceptions, Stevens tells us, either for individual clients or across the board.

"It is all pattern matching, ultimately," Stevens says of SmartParse. "It's just that we source that outside the regex box and we use other pattern-detection methods. And sometimes we'll use whatever the right detection method we think is for the job that we're trying to do."

Another feedback mechanism is Fastly's Network Learning Exchange, or NLX, a continuous feed of malicious IP addresses discovered by individual Fastly clients and then rebroadcast to all users of its WAF.

"NLX is primarily a way to feed attackers' IPs back through the system," Stevens says. "When we see attacks coming from certain IPs against any customer, that puts them on our radar. They do enough attacks, they get added to the threat feed list. And then that threat feed list gets distributed out to all of our customers."

However, Fastly's WAF does allow the use of regex.

"In our rule builder, we provide customers a way to add regex to check for something, some type of attack or something very specific to them. And they can use it to match on a field name, or they can use it to match on maybe their login pages," he says. "Whatever they need to match on, they could potentially use regex for that if they need to."

It's just one tool out of many, and as Stevens explains, regular expressions are still useful in certain instances.

"I don't like to hate on regex completely, because it does have a purpose in the world," he says. "Just don't use it for everything."

Paul Wagenseil

Paul Wagenseil is a custom content strategist for CyberRisk Alliance, leading creation of content developed from CRA research and aligned to the most critical topics of interest for the cybersecurity community. He previously held editor roles focused on the security market at Tom’s Guide, Laptop Magazine, TechNewsDaily.com and SecurityNewsDaily.com.
