Artificial intelligence (AI) models that work across different types of media and domains, so-called "multimodal AI," can be used by attackers to create convincing scams. At the same time, defenders are finding multimodal AI equally useful at spotting fraudulent emails and not-safe-for-work (NSFW) materials.

A large language model (LLM) can classify previously unseen samples of emails impersonating different brands with better than 97% accuracy, as measured by a metric known as the F1 score, according to researchers at cybersecurity firm Sophos, who presented their findings at the Virus Bulletin Conference on Oct. 4. While existing email-security and content-filtering systems can spot messages using brands that have been encountered before, multimodal AI systems can identify the latest attacks, even if they have not been trained on samples of similar emails.

While the approach will likely not become a feature in email-security products, it could be used as a late-stage filter by security analysts, says Ben Gelman, a senior data scientist at Sophos. The company has joined other cybersecurity firms, such as Google, Microsoft, and Simbian, in exploring new ways of using LLMs and other generative AI models to augment and assist security analysts and to help speed up incident response.

"AI and cybersecurity are merging, and this whole AI-generated attack/AI-generated defense [approach] is going to become natural in the cybersecurity space," he says. "It's a force multiplier for our analysts. We have a number of projects where we support our SOC analysts with AI-based tools, and it's all about making them more efficient and giving them all this knowledge and confidence at their fingertips."

Understanding Attackers' Tactics

Attackers have also started using LLMs to improve their email lures and attack code. Microsoft, Google, and OpenAI have all warned that nation-state groups appear to be using these public LLMs for various tasks, such as creating spear-phishing lures and code snippets used to scrape websites.

As part of their research, the Sophos team created a platform for automating the launch of e-commerce scam campaigns, or "scampaigns," to understand what sort of attacks could be possible with multimodal generative AI. The platform consisted of five different AI agents: a data agent for generating information about the products and services, an image agent for creating images, an audio agent for any sound needs, a UI agent for creating the custom code, and an advertising agent to create marketing materials.

The customization potential of automated ChatGPT spear-phishing and scam campaigns could result in large-scale microtargeting campaigns, the Sophos researchers stated in their Oct. 2 analysis.

"[W]e can see that these techniques are particularly chilling because users may interpret the most effective microtargeting as serendipitous coincidences," the researchers stated. "Spear phishing previously required dedicated manual effort, but with this new automation, it is possible to achieve personalization at a scale that hasn't been seen before."

That said, Sophos has not yet encountered this level of AI usage in the wild.

Defenders should expect AI-assisted cyberattackers to have better-quality social-engineering techniques and faster cycles of innovation, says Anand Raghavan, vice president of AI engineering at Cisco Security.

"It is not just the quality of the emails, but the ability to automate this has gone up an order of magnitude since the arrival of GPT and other AI tools," he says. "The attackers have gotten not just incrementally better, but exponentially better."
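For context on the metric Sophos cites, the F1 score is the harmonic mean of precision and recall, so a high score means a classifier both catches most impersonation emails and rarely flags benign ones. The short sketch below shows how such a score is computed; the labels are made up for illustration and are not Sophos's data.

    from sklearn.metrics import f1_score, precision_score, recall_score

    # Hypothetical ground truth: 1 = brand-impersonation email, 0 = benign
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
    # Hypothetical predictions from an email classifier
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

    precision = precision_score(y_true, y_pred)  # share of flagged emails that really were impersonations
    recall = recall_score(y_true, y_pred)        # share of impersonation emails that were caught
    f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

    print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")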
Beyond Keyword Matching

Using LLMs to process emails and turn them into text descriptions leads to better accuracy and can help analysts catch emails that might otherwise have escaped notice, stated Younghoo Lee, a principal data scientist with Sophos's AI group, in research presented at the Virus Bulletin conference.

"[O]ur multimodal AI approach, which leverages both text and image inputs, offers a more robust solution for detecting phishing attempts, particularly when facing unseen threats," he stated in the paper accompanying his presentation. "The use of both text and image features proved to be more effective" when dealing with multiple brands.

The capability to process the context of the text in an email augments the multimodal capability to "understand" words and context from images, allowing a fuller understanding of the message, says Cisco's Raghavan. LLMs' ability to focus not just on pinpointing suspicious language but also on dangerous contexts, such as emails that urge a user to take a business-critical action, makes them very useful in assisting analysis, he says. Any attempt to compromise workflows that involve money, credentials, sensitive data, or confidential processes should be flagged.

"Language as a classifier also very strongly enables us to reduce false positives by identifying what we call critical business workflows," Raghavan says. "If an attacker is interested in compromising your organization, there are four kinds of critical business workflows, [and] language is the predominant indicator for us to determine [whether] an email is concerning or not."

So why not use LLMs everywhere? Cost, says Sophos's Gelman.

"Depending on LLMs to do anything at massive scale is usually way too expensive relative to the gains that you're getting," he says. "One of the challenges of multimodal AI is that every time you add a mode like images, you need way more data, you need way more training time, and, when the text and the image models conflict, you need a better model and potentially better training" to decide between the two.
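As an illustration of the text-plus-image approach Lee describes, the sketch below sends an email's body text plus a screenshot of the rendered message to a vision-capable model and asks for a brand-impersonation verdict. It assumes the OpenAI Python SDK and a multimodal chat model; the prompt, model choice, and file names are illustrative stand-ins, not Sophos's implementation.

    import base64
    from openai import OpenAI  # assumes the OpenAI Python SDK (v1.x) is installed

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def classify_email(body_text: str, screenshot_path: str) -> str:
        """Ask a multimodal model whether an email impersonates a known brand."""
        with open(screenshot_path, "rb") as f:
            screenshot_b64 = base64.b64encode(f.read()).decode()

        response = client.chat.completions.create(
            model="gpt-4o",  # any vision-capable chat model
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "You are an email-security assistant. Given the email text "
                             "and a screenshot of the rendered message, reply with the "
                             "impersonated brand name, or 'none' if the email looks benign.\n\n"
                             f"Email text:\n{body_text}"},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{screenshot_b64}"}},
                ],
            }],
        )
        return response.choices[0].message.content.strip()

    # Example with hypothetical files:
    # print(classify_email(open("email.txt").read(), "email.png"))

In practice, a per-message call like this would typically sit behind cheaper upstream filters rather than run on every email, which is consistent with Gelman's point about the cost of using LLMs at scale.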