Published Date: 11/7/2025
A technique for generating a spoof of a person’s voice from only a single facial image, demonstrated at the USENIX Security 2024 conference, is among the more alarming deepfake creation methods uncovered so far. Worse, voice deepfake detection tools on the market tend to struggle with these audio deepfakes, according to a team of Australian researchers.
Fortunately, as the team from Data61, CSIRO's digital research network, shows in a recently published paper, it is possible to tune those tools to more accurately detect deepfakes created with face-to-voice synthesis, also known as “FOICE.”
In the paper “Can Current Detectors Catch Face-to-Voice Deepfakes?”, the researchers tested FOICE outputs against biometric voice authentication systems, including WeChat Voiceprint and Microsoft Azure. The spoof attempts frequently succeeded, and approached a 100 percent success rate when multiple attempts were allowed. The researchers point out that this is troubling because facial images are far more widely available than voice samples.
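The near-certainty under repeated attempts is simple probability: if a single spoofed sample passes verification with probability p, then k independent attempts pass with probability 1 − (1 − p)^k, which climbs toward 1 quickly. A toy illustration, noting that the per-attempt rate below is an assumption for demonstration, not a figure from the paper:

```python
# Illustrative only: the per-attempt success rate is hypothetical,
# not a number reported in the USENIX paper.
p = 0.6  # assumed probability a single spoofed attempt passes
for k in (1, 2, 3, 5):
    # Probability that at least one of k independent attempts passes.
    print(f"{k} attempt(s): {1 - (1 - p) ** k:.3f}")
# 1 attempt(s): 0.600 ... 5 attempt(s): 0.990
```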
Four deepfake detectors that the researchers characterize as state-of-the-art models “that span distinct architectural families and design goals” performed poorly when tested with deepfakes produced from four datasets. The best performer, AASIST, had an equal error rate (EER) of 0.163. All models improved when fine-tuned, with AASIST’s EER dropping to 0.003.
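For readers unfamiliar with the metric, the EER is the operating point at which a detector's false acceptance rate (spoofs accepted as genuine) equals its false rejection rate (genuine audio flagged as spoofed); lower is better. A minimal sketch of the standard computation, using synthetic scores rather than anything from the paper:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER: the threshold at which the false acceptance rate (FAR,
    spoofs accepted as genuine) equals the false rejection rate
    (FRR, genuine audio flagged as spoofed). Lower is better."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)  # 1 = genuine, 0 = spoof

    far, frr = [], []
    for t in np.sort(np.unique(scores)):  # sweep each score as a threshold
        accepted = scores >= t
        far.append(np.mean(accepted[labels == 0]))   # spoofs let through
        frr.append(np.mean(~accepted[labels == 1]))  # genuine rejected
    far, frr = np.array(far), np.array(frr)

    # The EER sits where the two error curves cross.
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

# Synthetic demo: genuine clips score high, spoofed clips score low.
rng = np.random.default_rng(0)
scores = np.concatenate([rng.normal(1.0, 0.5, 500),    # genuine
                         rng.normal(-1.0, 0.5, 500)])  # spoofed
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER: {equal_error_rate(scores, labels):.3f}")
```

On this scale, AASIST's fine-tuned EER of 0.003 means both error rates meet at roughly 0.3 percent.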
Three of these four fine-tuned voice deepfake detectors became less accurate at identifying other kinds of spoofs, however. The drop in AASIST’s accuracy was modest, and the Ren et al. model’s accuracy actually improved, but TCM’s dropped by 10 percent and the Sun et al. model was rendered almost completely ineffective. “Only domain-invariant approaches maintained relatively stable cross-vocoder behavior; noise robustness varied widely, and denoising can unintentionally remove forensic cues,” the researchers conclude. “Lasting defenses therefore require (i) larger, more diverse corpora (including FOICE variants and modern vocoders) and (ii) architectures and training regimes that target vocoder-independent, cross-modal representations.”
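The fine-tuning described here follows the usual transfer-learning recipe: start from a detector's pretrained weights and continue training on labeled samples of the new spoof family at a low learning rate. A minimal sketch under that assumption; the `fine_tune` helper, the stand-in detector, and the random tensors below are all hypothetical, not the paper's code or models:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def fine_tune(detector: nn.Module, train_set, epochs: int = 5,
              lr: float = 1e-5) -> nn.Module:
    """Continue training a pretrained spoof detector on new samples.
    A low learning rate nudges the learned representation toward the
    new spoof family instead of overwriting it wholesale."""
    loader = DataLoader(train_set, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(detector.parameters(), lr=lr)
    criterion = nn.BCEWithLogitsLoss()

    detector.train()
    for _ in range(epochs):
        for waveform, label in loader:  # label: 1 genuine, 0 spoof
            optimizer.zero_grad()
            logit = detector(waveform).squeeze(-1)
            loss = criterion(logit, label.float())
            loss.backward()
            optimizer.step()
    return detector

# Toy usage with random stand-in data, not real audio or a real model:
detector = nn.Sequential(nn.Flatten(), nn.Linear(16000, 1))
waveforms = torch.randn(64, 1, 16000)   # one-second clips at 16 kHz
labels = torch.randint(0, 2, (64,))     # 1 = genuine, 0 = spoofed
fine_tune(detector, TensorDataset(waveforms, labels))
```

The trade-off the researchers observed is inherent to this recipe: updating all weights on one spoof family can shift the decision boundary away from cues left by other vocoders, which is why larger, more diverse training corpora figure in their recommendations.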
Voice deepfake checks are forecast to surpass 4.8 billion and generate over $2.4 billion in revenue by 2027, according to the 2025 Deepfake Detection Market Report and Buyers Guide from Biometric Update and Goode Intelligence.
Q: What is the FOICE technique?
A: FOICE, or Face-to-Voice synthesis, is a method that generates a person’s voice from a single facial image. This technique is particularly concerning due to the widespread availability of facial images compared to voice samples.
Q: How effective are current voice deepfake detectors?
A: Current voice deepfake detectors, while state-of-the-art, initially perform poorly when tested with FOICE outputs. However, fine-tuning these models can significantly improve their accuracy, though it may come with a trade-off in detecting other types of spoofs.
Q: What are the implications of the FOICE technique?
A: The FOICE technique is troubling because it can create highly convincing voice deepfakes using only facial images, which are more widely available than voice samples. This increases the potential for misuse in various malicious activities.
Q: What improvements are suggested by the researchers?
A: The researchers suggest the need for larger and more diverse datasets, including FOICE variants and modern vocoders. They also recommend architectures and training regimes that focus on vocoder-independent, cross-modal representations to ensure lasting defenses.
Q: What is the forecast for the deepfake detection market?
A: The deepfake detection market is forecast to surpass 4.8 billion checks and generate over $2.4 billion in revenue by 2027, according to the 2025 Deepfake Detection Market Report and Buyers Guide from Biometric Update and Goode Intelligence.