
Phishing attacks will use powerful text generation, say machine-learning engineers

Armorblox engineers will discuss how machine learning will be used in phishing attacks at the RSA Conference on June 7. Pictured: Applications of artificial intelligence are seen on display at the Artificial Intelligence Pavilion of Zhangjiang Future Park during a media tour on June 18, 2021, in Shanghai, China. (Photo by Andrea Verdelli/Getty Images)

Would-be phishing victims have always had one big advantage: Criminals have had to do everything by hand. All of the text has to be written by someone. Every detail, from the email lure to each online breadcrumb planted to establish legitimacy, takes time to create. For anyone with the presence of mind to watch out for them, less sophisticated attacks are easy to see through.

What happens when that isn't true anymore?

OpenAI's GPT demonstrated that incredibly powerful machine-learning text generation can also be designed to be very simple for lay programmers to implement. More recently, its Dall-E has made creating a realistic fake image as simple as calling a function with a brief natural-language description of what you want.
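For a sense of how low that bar now sits, here is a minimal sketch of the kind of call being described, using OpenAI's Python client. The model names and prompts are illustrative assumptions, not examples from the speakers, and exact model identifiers change over time.

    # Minimal sketch: one-call text and image generation via OpenAI's Python client.
    # Model names and prompts are illustrative assumptions; identifiers change over time.
    from openai import OpenAI

    client = OpenAI()  # reads the OPENAI_API_KEY environment variable

    # Text generation: a single natural-language instruction.
    completion = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any available chat model works here
        messages=[{
            "role": "user",
            "content": "Write a friendly two-sentence bio for a fictional small business.",
        }],
    )
    print(completion.choices[0].message.content)

    # Image generation: likewise a single function call with a text description.
    image = client.images.generate(
        model="dall-e-3",
        prompt="A simple, clean logo for a fictional small business",
        n=1,
        size="1024x1024",
    )
    print(image.data[0].url)

Either call returns in seconds, which is the point: the skilled, time-consuming steps behind a convincing lure are collapsing into a couple of API calls.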


"Think about something as simple as an image of a grocery bag with a fake logo. If you wanted to get this kind of thing a few years ago, you would have needed to pay someone who knows how to do Photoshop to make the logo and create some fake image. Now this whole process has been boiled down to just one single line of English text," said Prashanth Arun, head of data science for Armorblox. Arun and colleague Ben Murdoch, a machine-learning engineer for Armorblox, will be giving a talk Tuesday on generative attacks at the RSA Conference.

"Imagine you make a fake Candle Company, with an entire range of candles with your little logo and product descriptions that say different things, it gives you a sense that, you know, these guys have been around for a long time," said Arun.

The most basic phishing attacks in the future will come from personas with detailed web presences, Arun and Murdoch will argue. A thousand new lures will be generated with the click of a button. Creating hundreds of fake identities bolstered by five-year-old Twitter accounts will be as easy as sitting back for five years while an ML system does the posting.

Machine-learning text generation gets smarter

The idea that generative machine learning creates security problems is not new. Researchers at Black Hat last year, for example, concluded that current iterations of GPT were getting to the point where they could pass for the level of internet discourse used in fake news. But the field is fast-moving. Pairing image generation with text generation was a relatively new wrinkle when the Black Hat research was being conducted. And text generation has gotten smarter.

One of the problems with GPT was that it was much better equipped for sprints than marathons. It was good at short blurbs of text, but the text had a tendency to wobble off track as it stretched into paragraphs or pages. But, said Arun, there has been substantial work on extending what machine-learning researchers call the context window, the amount of previous text the machine considers before choosing its next words. The more previous text it considers, the less likely small swerves in the train of thought are to multiply into outright derailments. A newer technique known as retrieval-augmented generation effectively extends that window to entire databases of text.
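The mechanics are easy to sketch. Rather than relying only on what fits in the model's fixed context window, a retrieval-augmented system first looks up relevant passages from a document store and prepends them to the prompt. The toy Python below uses a keyword-overlap retriever in place of a real vector index, and a hypothetical generate() function in place of an actual language model.

    # Toy sketch of retrieval-augmented generation (RAG).
    # A real system would use a vector index over embeddings and a hosted model;
    # a keyword-overlap retriever and a hypothetical generate() stand in here.

    def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
        """Return the k documents sharing the most words with the query."""
        query_words = set(query.lower().split())
        return sorted(
            documents,
            key=lambda d: len(query_words & set(d.lower().split())),
            reverse=True,
        )[:k]

    def generate(prompt: str) -> str:
        """Hypothetical stand-in for a call to a text-generation model."""
        raise NotImplementedError

    documents = [
        "Acme Candle Co. was founded in 2017 and sells hand-poured soy candles.",
        "Acme's flagship scent is lavender; a 12 oz jar retails for $24.",
        "Valhalla is the great hall of the slain in Norse mythology.",
    ]

    query = "Write a customer email announcing Acme's new lavender candle."
    context = "\n".join(retrieve(query, documents))

    # The retrieved passages become part of the prompt, effectively widening
    # the context window to whatever the database holds, including anything
    # false that happened to be scraped into it.
    prompt = f"Context:\n{context}\n\nTask: {query}"
    # answer = generate(prompt)

The structure is the whole trick: whatever lands in the document store, true or not, flows straight into the model's working memory.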

Increasing the amount of data machine learning can handle can create problems of its own, said Murdoch. AI doesn't understand certain nuances, or the contexts in which a true statement holds. It is true to say that Winnie the Pooh is a bear, but Winnie the Pooh is not a true bear.

"Consider whatever methods you're using to scrape the internet and then to fill that memory, right? Suddenly, you might think that Valhalla is a real place," he said.

That type of systemic problem could even be gamed, he said, by someone looking to poison data collection. If Planters gave Mr. Peanut an employee bio as chief snack officer, that could translate to a business email compromise campaign where Mr. Peanut requests invoices be paid.

Unfortunately, the problems handling facts may not be easy to operationalize for defense, Arun said. The same problems discerning facts that trouble the phishing ML would also plague the defensive ML.

The combination of ease of use and difficulty of defense could mean that generative attacks change the threat landscape substantially, and sooner than most defenders will be prepared for.

"For high-value targets, I think it's still going to be humans running the attacks, simply because the ROI on such scams are much higher," said Arun. "But for a lot of these spray and pray kinds of spammy stuff, I think the quality of that is going to be improved significantly."

Joe Uchill

Joe is a senior reporter at SC Media, focused on policy issues. He previously covered cybersecurity for Axios, The Hill and the Christian Science Monitor’s short-lived Passcode website.
