
Do You Trust Your Ears?

Guy Howard
Speaking as someone who has received a phone call from their CEO, I can say that you typically do what the boss says and ask questions later. That’s exactly what the CEO of a UK-based energy firm did when the CEO of his parent company called. Directed to transfer $243,000 to one of their suppliers within the hour, he complied. Later that day the CEO called back and asked for a second payment. This time he hesitated because something seemed wrong. And he was right. The CEO of the parent company never called him that day. Instead he had been talking with an AI-generated impersonation of the CEO. Hackers used a tool that generates deepfakes of voices, and these fakes are very accurate. The UK CEO remarked that he recognized his boss’ “slight German accent” and the “melody of his voice.” This was not just a recording that glued words together. The firm’s insurance company declared that this was the first time they had seen a crime perpetrated using AI mimicry like this.
Voice impersonation has come a long way. With some historical voice data, hackers can do amazing things. A few months before the ill-fated UK money transfers, an AI company created a simulation of Joe Rogan’s voice (see reference below). While it clearly sounds faked at points, overall it is remarkable in its realism. Recently, Amazon released a new voice for its Alexa products, one of the most iconic voices of our time: Samuel L. Jackson. Neither Rogan nor Jackson recorded anything special for these efforts. It was all done by applying AI to publicly available data. Obviously, there are hundreds of hours of recorded speech from both Samuel L. Jackson and Joe Rogan on which to train an AI. But the perpetrators of the energy company theft proved that, for a narrowly scoped task, you can create a convincing replication with considerably less. While the names of the companies involved have not been released, the parent company’s CEO was probably not a celebrity. The attackers were able to craft a model good enough to fool one of his employees with presumably much less historical voice data.
The ramifications of this are just as scary as those involving video deepfakes. So much business is done via phone calls. There are likely to be many more cases like this unless companies develop countermeasures, such as code words to confirm that you’re talking to the real person. Imagine diplomatic calls. If a nation could spoof a phone call as coming from somewhere else, it could impersonate heads of state or other officials and sow discord. Even journalism is vulnerable, although reporters developed a solution to this problem decades ago: they often demand an in-person meeting to confirm the veracity of an explosive story before running with it. Now, they must have even stronger doubts about anyone contacting them via phone.
But there are also legitimate, or at least non-criminal, uses. A company called Descript is developing a product that allows users to “edit” their podcasts, adding words or sentences they didn’t say and fixing problems without needing to re-record. What does this technology mean for the future of voice actors? Will we ever need a human to voice characters such as Mickey Mouse or Bart Simpson again? Debate has raged about machines taking jobs from humans, but I don’t know if we saw this coming. An even more mundane use is as a gift. Imagine creating a card for a spouse, child, or friend that they could make say anything they wanted in your voice. So next time your boss calls… well, you should probably still just do what they say. But perhaps listen a little more carefully and follow up if it seems strange.
References:
Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case: https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402
This AI-Generated Joe Rogan Voice Sounds Eerily Like the Real Thing: https://gizmodo.com/this-ai-generated-joe-rogan-voice-sounds-eerily-like-th-1834842151