Automatic Speaker-Identification System Performs Better Than Humans
by Michael Dean Thompson
Many court cases hinge on whether the voice or voices on a recording belong to a specific speaker, such as the defendant or a witness, who is unknown to the listener. Misidentification of a voice can influence juries and lead to disastrous decisions carrying lengthy sentences and even executions. A recent multidisciplinary paper published in Forensic Science International titled “Speaker identification in courtroom contexts - Part 1” found that a forensic voice-identification software package, E3FS3, outperformed every human listener in the study.
In a courtroom setting, judges and jurors may be confronted with recordings of speakers whom they have never met outside the court and whose accents and languages are unfamiliar to them. The recordings may also be of poor quality, captured from phone calls or under other challenging conditions. The authors of the paper (“Authors”) sought to determine whether a lay listener or “ad hoc expert” could perform as well as a modern deep-neural-network embedded system, i.e., an AI system using deep learning, when the recordings are of varying quality. In addition, they attempted to determine how the listeners rated their own ability to identify voices, both before hearing the recordings and afterwards.
The Authors indicate that previous studies of speaker identification systems were of little use for forensic purposes. Most of the studies were much older, and the systems they examined used algorithmic approaches, meaning they made decisions based on specific parameters (such as voice frequency, amplitude, and so on) within the recordings. Such approaches tend to be less flexible and accurate than modern deep-learning systems, which use neural networks modeled after the collections of neurons within brains. Likewise, the previous studies failed to include a variety of recording qualities or dissimilar languages and accents. A recent study of another deep-learning system competing against human listeners for speaker identification (Hughes et al., 2022), for example, used only “Standard Southern British English,” and all the human listeners were from the United Kingdom.
The Authors created a series of recordings of 61 different speakers. Various recordings of each speaker were taken about a week apart from each other and combined into 9,831 known speaker/unknown speaker pairs. While all the recordings were of Australian-English speakers, the 169 human listeners were either Australian English, North American English, or European Spanish language users. Although the Authors did include lower-quality recordings for the unknown speakers, those were designed to reflect landline calls from an office setting rather than the more probable scenario of a cell phone call in less-than-ideal conditions.
The human listeners performed worse than the automatic-speaker-identification system, even when the listeners were also Australian English users. Most of the listeners performed worse than a hypothetical speaker identification system that provided no useful information. Australian English listeners did best among human listeners while Spanish listeners were the poorest performers in identifying the voices, illustrating the challenge for all listeners in identifying voices speaking with accents or languages foreign to the listener.
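To make the idea of a system that “provided no useful information” concrete, here is a minimal sketch. The article does not name the metric the Authors used, so the sketch assumes the log-likelihood-ratio cost (Cllr), a metric commonly used in forensic voice comparison. The function name and all of the likelihood-ratio values are made up for illustration and are not taken from the study. Under Cllr, a system that always answers with a likelihood ratio of 1 (no support either way) scores exactly 1, so any listener scoring above 1 gave responses that were, on balance, misleading.

```python
import math

def cllr(same_speaker_lrs, different_speaker_lrs):
    """Log-likelihood-ratio cost: 0 is perfect, 1 equals an uninformative system."""
    # Penalty for giving low likelihood ratios to true same-speaker pairs.
    same_penalty = sum(math.log2(1 + 1 / lr) for lr in same_speaker_lrs) / len(same_speaker_lrs)
    # Penalty for giving high likelihood ratios to true different-speaker pairs.
    diff_penalty = sum(math.log2(1 + lr) for lr in different_speaker_lrs) / len(different_speaker_lrs)
    return 0.5 * (same_penalty + diff_penalty)

# Hypothetical likelihood ratios, invented purely for illustration.
uninformative = cllr([1.0, 1.0], [1.0, 1.0])      # always answers "no support either way"
helpful = cllr([20.0, 50.0], [0.05, 0.02])        # points the right way on every pair
misleading = cllr([0.05, 0.02], [20.0, 50.0])     # points the wrong way on every pair

print(uninformative, helpful, misleading)
```

In this sketch the uninformative system scores 1, the helpful one scores well below 1, and the misleading one scores well above 1, which is the sense in which a listener can do worse than a system that says nothing useful.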
Listeners were asked to indicate how well they would be able to identify Australian-English speakers both before and after the experiment. While a few of the listeners revised their estimation of their abilities upward after the experiment, about half revised downward. Some of the estimations were wildly off, either overestimating or underestimating their performance. In other words, people could not be expected to provide accurate assessments of their skills.
The implications of the study move beyond the capability of the speaker identification system. As the Authors note, judges should not perform their own speaker identification, whatever their assessment of their own abilities. Neither judges nor jurors could expect to perform well under the variety of conditions found in forensic contexts. The Authors are adamant on this in a press release from Aston University reprinted on Forensic Science International’s website.
Contributing author Kristy Martire, School of Psychology at the University of New South Wales, said, “This shows that whatever ability a listener may have in recognizing familiar speakers, their ability with unfamiliar speakers is unlikely to be better than a forensic-voice-comparison system.” Contributing author Gary Edmond, School of Law at the University of New South Wales, added, “Unequivocal scientific findings are that identification of unfamiliar speakers is unexpectedly difficult and much more error-prone than judges and others have appreciated. We should not encourage or enable non-experts, including judges and jurors, to engage in unduly error-prone speaker-identification. Instead, we should seek the service of real experts: specialist forensic scientists who employ empirically validated and demonstrably reliable forensic-voice-comparison systems.”
Any listener’s bias in the context of a courtroom setting would create additional challenges with regard to the correct assessment of the speaker’s identification. Indeed, the Authors note that another study will attempt to determine how a pool of 12 jurors and their collective biases would affect speaker identification.
Writer’s note: The high performance of E3FS3 is impressive, though it should be received with some caution. Several of the Authors, including the lead author, also authored reports on the design and implementation of E3FS3. As active participants in the production of the system, they were at greater risk of having biased the training data for the deep-learning system, even if unintentionally. Moreover, AI throughout the world is being reexamined for racist training data, from the ranting chat bot (an instant messaging tool that responds to a user’s messages as if it were human) to search engines that lead a user to remarkably disturbing content. In this case, did the training data include a sufficient range of ethnic backgrounds among the Australian English speakers?
The Authors calibrated the AI, at least in part, using the very recordings they would later have it validate. That is a bit like calibrating a police traffic radar using the same target vehicle you would later scan in order to test the radar. In other words, the calibration was somewhat circular. To be an effective test, the calibrating recordings should be distinct from examined recordings.
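The separation described above can be sketched as follows. This is a minimal illustration of keeping calibration and test material disjoint by speaker, not the Authors' actual procedure. The `split_speakers` function, the speaker labels, and the 50/50 split are all hypothetical; only the count of 61 speakers comes from the study as reported here.

```python
import random

def split_speakers(speaker_ids, calibration_fraction=0.5, seed=0):
    """Split speakers into disjoint calibration and test groups (hypothetical helper)."""
    ids = sorted(set(speaker_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    cut = int(len(ids) * calibration_fraction)
    return set(ids[:cut]), set(ids[cut:])

# 61 placeholder speaker labels, matching the number of speakers in the study.
speakers = [f"spk{i:02d}" for i in range(61)]
calibration_speakers, test_speakers = split_speakers(speakers)

# No speaker used for calibration may also appear in the test group.
assert calibration_speakers.isdisjoint(test_speakers)
print(len(calibration_speakers), len(test_speakers))
```

Splitting by speaker rather than by individual recording matters here: if different recordings of the same person landed on both sides, the test would still be partly checking the system against voices it had already been tuned on.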
It is also important to note that the system’s error rate was non-zero despite the biased calibration. In fact, its correct-classification rate was 87%, meaning it was wrong about 13% of the time. Even powerful AI tools are not infallible and, worse, may be prone to many of the same cognitive biases we find hidden in the juror’s psyche. As in Gary Edmond’s quote, calling something a science gives it an air of authority and ignores the significant error rate.
Source: Forensic Science International 34, (2022) 111499