Probabilistic Genotyping on Trial: Can We Trust the Secret Algorithms Deciding Guilt?
DNA evidence has long been hailed as the gold standard of forensic science—unassailable, precise, and definitive. But what happens when that gold standard is processed through proprietary algorithms that operate in secrecy, shielded from scrutiny by trade secrets and alleged technical complexity?
Probabilistic genotyping (“PG”) software promises to unravel the chaos of complex DNA mixtures, degraded samples, and trace amounts of genetic material. Courts increasingly treat its outputs as infallible, trusting black-box algorithms to deliver justice. But the most common PG tools today are often shrouded in secrecy, their manufacturers unwilling to participate in source code reviews that allow third parties to validate that their actual results match their claims. Meanwhile, their extraordinary complexity typically leaves judges, attorneys, and jurors so overwhelmed that they simply accept the tools’ findings. Something that complex and seemingly scientifically advanced must be correct, right?
Yet, behind the veneer of scientific certainty lies a troubling reality: these systems are often shielded from independent validation, prone to hidden biases, and capable of producing wildly divergent results depending on which software—or which version of that software—is used. From wrongful convictions upheld by flawed CPI calculations to exonerations secured only after relentless challenges to proprietary code, the stakes could not be higher. How do we reconcile the promise of PG with its many current shortcomings? What happens when a defendant’s fate hinges on software whose inner workings are deemed too (monetarily) valuable—or too impenetrable—to examine?
Despite the reluctance of most courts to challenge the claims of PG software and its industry supporters, some courts are starting to pay attention, giving us an unprecedented view into the software and its potential for failure. Rather than focusing on one PG tool, this article will provide an overview of the very complex subject of DNA “fingerprinting” using complex mixtures and the potential of PG tools to miscalculate and misrepresent the evidence.
Uses of Probabilistic Genotyping
In January 2020, San Diego police officers stopped a car driven by Jasmine Canchola, hoping to find her brother. Instead, they found her mother, her mother’s friend, and Francisco Ortiz. When they searched the car, they found a handgun and about 18 grams of methamphetamine. Ortiz was subsequently charged with illegal possession of the gun and the drugs.
A swab of the gun sent to the police lab revealed a “complex” DNA mixture—one containing DNA from multiple people. The analyst believed that five people contributed to the DNA profile, a parameter the analyst supplied when the sample was submitted for analysis using the STRmix software program. With its advanced statistical approach, STRmix then determined that it was 540 million times more likely than not that Ortiz was one of those five, as the U.S. District Court for the Southern District of California later recalled.
STRmix’s developers had produced an internal validation study based on the work of 31 laboratories that found its results to be reliable for up to five contributors, provided that the person in question had donated at least 20% of the DNA. But whenever the number of contributors exceeded two, the actual number became far harder to pinpoint; one study showed analysts missed the call for six or more contributors 100% of the time. The determination that there were five contributors to the sample in Ortiz’s case represented a “judgment call” by the analyst, who understood the difficulty the program had in determining the number of contributors. In fact, STRmix had never been validated for samples with six or more contributors. It was therefore entirely possible that the analyst’s judgment was constrained by STRmix’s validation limits.
Fortunately for Ortiz, defense expert Dr. Dan Krane was able to show that there was good reason to believe the number of contributors was six or more. The STRmix result was then invalidated, and the court excluded it from evidence in the case. United States v. Ortiz, 736 F. Supp. 3d 895 (S.D. Cal. 2024).
When a man was fatally stabbed outside a gay bar in Houston in 2010, police told the media that it may have been a “crime of passion.” They soon identified a suspect, based on a description of a car that witnesses said the killer might have driven. The suspect, Lydell Grant, had an alibi. But he also had a criminal record, and his mugshot was picked out of a photo lineup by all but one witness. He was convicted, in part, because analysis of DNA collected from the victim’s fingernails could not exclude him.
His case eventually gained the attention of the Texas Innocence Project, whose advocates asked Cybergenetics to pass the DNA through its TrueAllele software. The software excluded Grant as a contributor. However, even though he had already spent nearly a decade in prison, the results were not enough to free him. So, the team took the information to a lab that had access to the FBI’s Combined DNA Index System (“CODIS”) database and got a lucky hit. The real killer’s DNA was found there, and the police quickly managed to get a confession. As a result, Grant was freed on bail.
In a 2021 case in New Jersey, TrueAllele was used to tie defendant Corey Pickett to a ski mask and two handguns collected during a criminal investigation. Two of the samples from the mask were shown to be a mix of two and three contributors, respectively. When the DNA profile was passed on to Cybergenetics, TrueAllele identified Pickett as a contributor but did not implicate his codefendant in the mix.
Pickett’s defense team then asked the state superior court for permission to review the source code used to create the software, in order to understand how the tool came to its conclusions. After initial rebuffs—in which the courts accepted Cybergenetics’ argument that the source code was a trade secret developed with millions of dollars in investment—the Appellate Division of the Superior Court of New Jersey granted Pickett’s team access to the code. Cybergenetics initially attempted to limit the review to a device at the prosecutor’s office, from which the defense expert could take notes but not copies. The appellate court, however, recognized that this was an undue burden and directed the lower court to “compel the discovery of TrueAllele’s source code and related materials pursuant to an appropriate protective order.” It appears to be the first appellate court in the country to have done so. State v. Pickett, 466 N.J. Super. 270 (App. Div. 2021).
Genetic Identification
The genetic instructions contained within DNA establish the blueprint for each of us. Identical twins share a DNA structure and are often indistinguishable from one another to the casual observer, testifying to the power of that blueprint. Yet the double helix that many of us recognize on sight is made of just four simple nucleotides: adenine (a), thymine (t), guanine (g), and cytosine (c). Those four nucleotides always pair up (making a base pair or bp), with adenine pairing with thymine and guanine pairing with cytosine, so that a purine always bonds to a pyrimidine. The human genome contains roughly 3.2 billion (3.2 × 10^9) base pairs, and researchers continue to add to the list of DNA sequences that impact genetic expression.
However, only about 25% of bp are directly involved in protein creation and regulation—genes. The remaining 75% or so are considered “extragenic,” having little to do with direct genetic expression. For that reason, much of it is often called “junk DNA.” Of interest for genetic fingerprinting, though, are sections of extragenic DNA called short tandem repeats (“STR”). These STRs are identified by their location (a single position on the DNA is called a locus; the plural is loci) and by their type. For example, the STR marker THO1 is located on chromosome 11.
Each STR has a set of known possible variants. THO1 is typed according to its repeats of AATG. So, a THO1 STR that has been allele-typed as 5 would look like AATGAATGAATGAATGAATG, indicating five consecutive repeats of the AATG sequence. A THO1 STR that is allele-typed as 9.3 will have nine full repeats of AATG plus a partial repeat of three bases (ATG). Because half our DNA comes from each parent, every person carries two copies of each STR, one from each parent. If both parents donate the same allele type, then the offspring’s STR is homozygous. If each donates a different allele type, then the STR is heterozygous. The loci used for identification were chosen because they are stable (they do not change over time) and are polymorphic (there is more than one type). Again, using THO1 as an example, the FBI accepts allele types of 5, 6, 7, 8, 9, 9.3, 10, and 11 for the THO1 locus.
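To make the repeat arithmetic concrete, here is a minimal Python sketch that derives an allele designation by counting motif repeats. The sequences and the tho1_allele helper are invented for illustration only; they are not how any forensic instrument or typing kit actually assigns alleles (a real 9.3 allele, for instance, carries its partial repeat internally rather than at the end).

```python
# Hypothetical illustration of how STR allele designations encode repeat counts.
# Not the algorithm used by any forensic instrument or software.

def tho1_allele(sequence: str, motif: str = "AATG") -> str:
    """Count full repeats of the motif, plus any trailing partial repeat."""
    full = 0
    i = 0
    while sequence.startswith(motif, i):
        full += 1
        i += len(motif)
    partial = len(sequence) - i  # leftover bases after the last full repeat
    return f"{full}.{partial}" if partial else str(full)

print(tho1_allele("AATG" * 5))           # "5"   -> allele type 5
print(tho1_allele("AATG" * 9 + "ATG"))   # "9.3" -> nine full repeats plus 3 bases
```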
Stripping the DNA from the cell is no longer very difficult. There are several processes available, and the one chosen can vary based on the lab’s goals. The purified DNA is then quantified so that the lab can understand how much material is needed for the amplification process. The most common amplification techniques are polymerase chain reaction (“PCR”) and reverse transcription PCR, and they are capable of targeting specific regions of a DNA template using binding agents called primers. Just a single DNA molecule can be multiplied roughly a billion times with 30 cycles of the PCR process, which typically runs 28 to 32 cycles but can be increased to 34 for very low quantity DNA. A DNA strand copied correctly during each cycle would yield about 17 billion copies after 34 cycles.
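The arithmetic behind those figures is simple doubling. A short sketch, assuming the idealized case in which every template is copied perfectly in every cycle (real reactions fall short of that ideal):

```python
# Back-of-the-envelope PCR arithmetic: each cycle ideally doubles every template,
# so n cycles yield up to 2**n copies of a single starting molecule.
for cycles in (28, 30, 32, 34):
    print(f"{cycles} cycles -> up to {2**cycles:,} copies")
# 30 cycles -> up to 1,073,741,824 copies (roughly a billion)
# 34 cycles -> up to 17,179,869,184 copies (roughly 17 billion)
```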
There are four primary methods for analyzing the amplified DNA products: (1) autosomal short tandem repeat profiling; (2) Y chromosome analysis; (3) mt-DNA analysis; and (4) autosomal single nucleotide polymorphism (“SNP”) typing. For our discussion, the first three are the most interesting. Autosomal DNA refers to the first 22 pairs of chromosomes, which are not sex determinant. The 23rd pair is either two X chromosomes for a female or an X and a Y chromosome for a male. Therefore, Y chromosome analysis can be done only on samples that originate from genetic males. Because mt-DNA is almost always inherited from the mother (just as the Y chromosome is inherited only from the father, with both arriving in the offspring unchanged except for rare mutations), it can be used to provide matrilineal information. Interestingly, mt-DNA has a better chance of surviving than other (nuclear) DNA. Finally, SNPs can also be useful for identification with highly degraded DNA samples because they can work with much smaller templates (around 50 bp) than STRs (around 300 bp).
The industry leader in DNA detection, Applied Biosystems, uses “capillary electrophoresis” to take a sample, apply a primer with a fluorescent molecule attached, and then pass it through tiny glass tubes (called capillaries), where a laser strikes it, causing the fluorescently tagged DNA fragments to emit their own characteristic light. This fluorescence is then mapped onto a graph that is arranged according to the alleles at a given locus. The peaks on the graph look like inverted Vs, where the height of each peak is measured in relative fluorescent units (“rfu”). For very low template DNA, i.e., for DNA with very few copies, the rfus can drop below 50, where stochastic (random) effects can impact the validity of the peaks.
Whether a given locus is homozygous or heterozygous will have an impact on peak size. At a homozygous locus, the sample started with two copies of the same allele type, while a heterozygous locus has one copy each of two different types. A lab technician would therefore expect an ideal single-source sample to display peaks for homozygous alleles at roughly twice the height of heterozygous alleles. So, if a homozygous allele peaks at 2,000 rfu, then each of the heterozygous alleles would be expected to peak at 1,000 rfu. However, that does not always happen. If a sample shows only one allele at a specific STR locus, an analyst may assume that the donor is homozygous when, in fact, the second allele that would have shown the donor to be heterozygous was simply not detected due to stochastic or other effects, a so-called “drop-out.”
Forensic scientists attempt to understand allele frequency by counting the number of times that each allele appears in a database. From that, they can build a distribution chart. A heterozygous person adds two alleles to the distribution for each locus. If a person is homozygous at a locus for a given allele, then it is counted twice. We might in a hypothetical database find that THO1 type 5 appears 11 times out of 1,000 (0.011, or 1.1%), and type 9 appears 22 times out of 1,000 (0.022 or 2.2%). This is repeated until every STR locus is covered. In one 2005 chart, there were 20 known THO1 homozygous variants and 190 known THO1 heterozygous variants.
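As a rough illustration of that counting exercise, the following sketch tallies allele frequencies from a tiny, made-up set of THO1 genotypes; the genotypes and the database size are hypothetical and far smaller than any real reference database.

```python
from collections import Counter

# Hypothetical THO1 genotypes for a toy database of five people.
# A heterozygous person contributes two different alleles; a homozygous
# person's single allele type is counted twice, as described above.
genotypes = [("5", "9"), ("9.3", "9.3"), ("6", "9"), ("5", "7"), ("8", "9.3")]

counts = Counter(allele for pair in genotypes for allele in pair)
total = sum(counts.values())  # two alleles per person

for allele, n in sorted(counts.items()):
    print(f"THO1 type {allele}: {n}/{total} = {n / total:.3f}")
```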
It is possible, even likely, that a given database will have a distribution distinct from that of the population as a whole, reflecting a substructure. But the Random Match Probability within the “Gold Standard” approach assumes that there is no substructure. If the database were composed primarily of people of Italian descent, then the database would present the allele frequencies of Italians, which may be quite different from those of the population as a whole. As an example, THO1 type 5 may have a 1% frequency in an overall population database but a 20% frequency in the subgroup. That means a genotype homozygous for that allele would appear about 1 in every 10,000 times in the overall population but 1 in every 25 times in the subgroup (computed using (0.01)(0.01) = 0.0001 and (0.2)(0.2) = 0.04).
The previous example uses the multiplication rule, which assumes each probability is independent. In a pair of fair, six-sided dice, each die has a 1 in 6 chance of landing on a 6. Because each die is independent, the probability of both landing on 6 is calculated via the multiplication rule to be (1/6)(1/6) = 1/36. Thus, the frequency with which the dice will land on double-six over an infinite number of throws will be 1/36. The infinite number of throws is important because a finite number of throws can only approach 1 in 36 as the number of throws increases. That is, you might not throw a pair of sixes in the first 100 throws but then throw them twice in the following 10 throws. We can expect that as the number of throws increases, the actual frequency will come closer to the theoretical frequency.
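A quick simulation makes the convergence point visible. This sketch, with invented throw counts and a fixed random seed for reproducibility, shows the observed frequency of double sixes drifting toward the theoretical 1/36:

```python
import random

# Simulating the dice example: the observed frequency of double sixes
# approaches the theoretical 1/36 as the number of throws grows.
random.seed(42)  # fixed seed so the illustration is reproducible

for throws in (100, 10_000, 1_000_000):
    double_sixes = sum(
        random.randint(1, 6) == 6 and random.randint(1, 6) == 6
        for _ in range(throws)
    )
    print(f"{throws:>9} throws: observed {double_sixes / throws:.5f} "
          f"vs theoretical {1 / 36:.5f}")
```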
The same can be expected with any database. As the number of samples in the database increases, that database’s distribution will approach the distribution of the population. If a random sample of the population proportional to its size is acquired, then it is possible to come close to estimating the population distribution, which is how political pollsters work. Yet, as we have seen by the failure of the polls in recent elections, deriving a truly random sample is challenging when the samples are self-selecting (people who will answer the phone and respond to polls or people who will supply their DNA to a database).
Many databases attempt to segregate certain populations into groups, such as North American Caucasians, Blacks, Hispanics, Asians, and Native Americans. But these groups have been shown in some studies to have their own internal substructures. Consider the breadth of physical differences between people from the Indian subcontinent and the Chinese Han, yet both are considered “Asian.” Some researchers argue that each subgroup is a mixture of subgroups with their own allele frequencies. Because people tend to mate within these groups, the subgroups cannot be homogenized, nor can they be predicted. Instead, they must be empirically tested. But too few of these studies have been performed.
Mindful of these challenges, forensic scientists have identified a conservative approach. If I have four siblings and we can measure only one STR, then at least two of the five of us must match, because only four genotype combinations are possible when both parents are heterozygous. Yet, if we expand to two STRs, again where both parents are heterozygous, there are now 16 possible combinations, reducing the odds of a coincidental match—though it is still possible. Likewise, even if a substructure within a dataset can be identified, increasing the number of measured STRs can help to overcome the effects of the substructure. For that reason, the FBI’s CODIS compares DNA across a broad spectrum of STR loci.
CODIS also takes a conservative approach to calculating rare alleles. Should a single allele appear less often than 5 out of 100 times, the analyst will still calculate it as if its frequency is 5 in 100. This works in favor of the accused in a criminal context. Consider an allele that has a 1% frequency (1 in 100). If it is multiplied by another allele with a 1% frequency, then the chance of the two alleles appearing together is (0.01)(0.01) = 0.0001, or 1 in 10,000. On the other hand, a 5% frequency against another 5% frequency becomes (0.05)(0.05) = 0.0025 or 1 in 400. Since a 1 in 10,000 frequency is 25 times more selective than a 1 in 400, the latter tends to favor the accused.
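The effect of that 5% floor can be shown in a few lines. The 1% allele frequencies below are the hypothetical values from the text, and the combined_probability helper is simply an illustration of the multiplication rule with a floor applied, not any lab’s actual formula:

```python
# Illustration of a CODIS-style 5% floor: flooring a rare allele's frequency
# before applying the multiplication rule yields a far less selective (and
# therefore more defendant-friendly) combined probability.
MIN_FREQ = 0.05

def combined_probability(freqs, floor=True):
    result = 1.0
    for f in freqs:
        result *= max(f, MIN_FREQ) if floor else f
    return result

observed = [0.01, 0.01]  # two alleles, each seen in 1% of the database
print(f"{combined_probability(observed, floor=False):.6f}")  # 0.000100 -> 1 in 10,000
print(f"{combined_probability(observed, floor=True):.6f}")   # 0.002500 -> 1 in 400
```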
The system of multiplying the odds of given DNA sequences is called the Random Match Probability (“RMP”). At times, the number may be shown as the Exclusion Probability, which is equal to 1 – RMP. If the RMP is 0.025, then the Exclusion Probability is 0.975. Despite their names, neither presents a probability of guilt. The RMP simply says that given a population of unrelated people, the chance of finding a certain DNA signature is X. It is the assumed rarity of the event, but forensic scientists know rare events do occur. There is no reason why another unrelated person would not have the same DNA signature of the given STRs, however unlikely it might be.
Complex Mixtures and Other Challenges
Any DNA sample can pose challenges to an analyst. This is particularly true if there is very little DNA, if it is degraded, or if it has originated from multiple people. The latter may be especially challenging if the donors are related. For each of these challenges, the whole process is also subject to stochastic (random) events and poor lab practices that have remarkable yet unpredictable impacts on the output.
One common occurrence in DNA analysis is allelic drop-out. This can occur from both degradation in the sample and stochastic effects. In general, a longer allele is more likely to get cut into smaller fragments as it degrades. When this happens, it will often be seen as a negative slope among peak heights (peaks get shorter on the right side of the chart) within the electropherogram. There can also be problems with the process itself, such that some alleles get preferential amplification. For some reason, not every template gets used in every PCR cycle, so those missed alleles will not be amplified during that cycle. If that happens early enough, it can result in a dramatic influence on allele peak heights. Because this effect is essentially chance, there is no way to determine that it has occurred without performing multiple tests (which may not be possible in a criminal justice context). Likewise, some people may have rare mutations in their DNA, polymorphisms, that were not picked up by the primers, causing the sample to appear to be missing the allele altogether.
“Stutter peaks” are another common problem in DNA analysis. Recall that the THO1 STR is made up of repeats of the “word” AATG. In some cases, not every word will be copied for a specific STR allele. That means that a THO1 type 6 will have a corresponding peak for type 5. In a single source, this is often identifiable by the height of the stutter peak being 10 - 15% that of the true allele. Unfortunately, the preferential amplification process described above can at times make a stutter peak appear to be taller than the true allele. In that case, within the single source sample, the stutter peak may still be recognizable by being one word shorter than its neighbor and accompanied by a third allele type.
A very small quantity of source DNA must be amplified more often and will generally give off more stochastic effects, including effects from background “noise,” and it will have a higher likelihood to include additional DNA sources. Tiffany Roy, a DNA expert, described how smaller samples increase the chances of interference and noise from the environment during the test process, an effect sometimes described as “snowing from the ceiling.”
Touch DNA is an example of a low quantity sample. The mere presence of the DNA does not tell us how or when the DNA got there. An entire field of study has arisen on the transfer, persistence, prevalence, and recovery (“TPPR”) of DNA. In one study, a load of clothes was washed with a single pair of semen-stained underwear. Fifty percent of the items in the load were subsequently found to have at least one sperm cell on them after the wash.
It can be impossible to determine how many contributors have donated to a sample, yet knowing that number is critical to interpreting DNA samples. There are several edge cases with regard to calculating who is a contributor in a sample, including when two or more parties contribute equally and when any minor contributors are below the threshold for interpreting stutter peaks.
A 2005 study found that labs failed to correctly identify a four-person mixture 70% of the time. The standard for determining the number of contributors is to count alleles, but another study found that a four-person mixture counted in that manner could easily be mistaken for a two- or three-person mixture. Allele peak heights are also cumulative. In an unknown four-person mixture where everyone shares an allele type and contributes equally, that shared peak would be expected to be four times the height of a single person’s allele peak. But if three additional alleles show relatively equal heights, to whom should those alleles be attributed?
This challenge is potentially exacerbated when the mixture contains people from the same population subgroup. As University of Oregon data scientist Rori Rohfs pointed out to SciTech Daily, “The accuracy of DNA mixture analysis really varies by genetic ancestry. Groups with less diverse genetic variants are going to have higher false inclusion rates for DNA mixture analysis, and this gets worse when you have more contributors.” This is because alleles that are common within the subgroup’s substructure are harder to attribute to any one contributor. This is especially true when the contributors’ ancestry is itself a subgroup of a major subgroup. For example, an enclave of Norwegian Americans will likely have a greater proportion of blond-haired, blue-eyed members than the American Caucasian subgroup against which they would be compared, and that difference would have a corresponding genetic component that might be reflected in the STRs.
Bearing in mind that a stutter peak is usually 10 - 15% of the height of a true allele, we have an apparent limit to detectable contributions of about 1/6 the sample. Most labs try to set the threshold for peak heights between 50 and 150 rfu. But a low quantity contributor might easily slip below the threshold. Such a threshold would ignore the low quantity donors, changing the number of detectable contributors. But lowering the threshold would mean that stutter peaks and other phenomena become inseparable from the true alleles. Furthermore, the cumulative nature of alleles would mean that a stochastic effect might add to a true allele, making it seem a larger contributor relative to the rest of the sample than it actually is.
The DNA in a single cell averages about 6 picograms (“pg”), a picogram being one trillionth (1/1,000,000,000,000) of a gram. It is an unimaginably small quantity; one trillionth of the diameter of the Earth is about 10 micrometers. Yet, the number of investigated samples with less than 100 pg of DNA has been skyrocketing. A 2019 study in the Journal of Forensic Science looked at a criminal case where a lab examined just 92 pg of DNA, of which just 6.9 pg was the (suspect’s) male DNA. The company that made the system prosecutors used, Applied Biosystems, recommends at least 1 nanogram—144 times more DNA than the male sample in this case. The firm’s own validation studies found full profiles were obtained at 125 pg or greater. “At 15.6 pg, [less than] 3 alleles were returned.”
Problems in DNA Reporting
In 2012, Lukis Anderson, a 26-year-old alcoholic and homeless Black man in California’s Santa Clara County, was linked by his DNA to a murder 10 miles away. The police arrested him, and he spent five months behind bars despite a rock-solid alibi. He was in the hospital detoxing after consuming 21 beers in a single night. His prior criminal record helped buoy the cops’ suspicion of him, and they were even able to find a link between him and his suspected accomplices. Fortunately for Lukis, it was discovered that the same paramedics who took him to detox worked on the murder victim; presumably, his DNA was transferred to the victim via an oxygen monitoring device.
In a TPPR study covered in the March issue of CLN, originally reported in Forensic Science International: Genetics, indirect transfer of DNA showed up in some surprising ways. A female participant’s DNA appeared on a male participant’s glass. Video of the event showed they had no contact with each other. The last thing they both contacted was likely a bathroom towel after washing their hands. Likewise, the female visitor brought her own male partner’s DNA with her and deposited it on a mug that was her 48th contact during the event.
Understanding how the DNA landed on the studied items was often not easy. The researchers had to check the recordings of the event to determine what was touched, when, and how often or long. And despite attempts to be conscientious with regard to contamination, the researcher overseeing the event still managed to deposit an extraordinarily intimate sample of DNA from one of the male participants.
A 2023 study reported in the Journal of Forensic Science examined 732 wrongful convictions. Rather than laboratory errors, what stood out was forensic reporting or testimony miscommunication—reports or testimony that do not conform to established standards or that fail to provide appropriate limiting information. Tiffany Roy, who sits on the National Institute of Standards and Technology (“NIST”) Research Triangle Institute Working Group on Human Factors in Forensic DNA Interpretation, has listed four questions that DNA cannot answer. The first is that DNA cannot tell us how it came to be on the item from which it was recovered. Even in the case of the glass that ended up with a female participant’s DNA, the researchers can only provide a hypothesis of how it got there. The same goes for Lukis Anderson’s DNA on the murder victim.
Secondly, the profile DNA is not time stamped; it could have arrived at any time before, after, or during the crime. It could be an earlier accidental deposit or contamination after the fact. The amount of DNA present has no bearing on when the deposit happened either.
The third point is that DNA does not give the cell type from which it came. As an example, Roy noted, a common test for blood actually tests for peroxidase activity, to which substances other than blood will respond. So, an attentive analyst should instead report that peroxidase activity was detected. Even seeing a sperm cell in a sample does not mean that the DNA came from sperm. Those things cannot be known without actually watching the DNA being extracted from the cell.
The final question that DNA cannot answer is the identity of the single person from whom it originated. Even if every STR in the CODIS database is a match, it does not indicate that only one person has that match. Considering the pair of six-sided dice again, just because the chance of rolling a pair of sixes is 1/36 does not mean it could not be rolled several times in a row. Rare events do happen. The only way to know for certain that no two people share the signature would be to test every person on the planet.
Roy pointed out how analysts misrepresent important concepts. We can now achieve a profile from just a relatively few cells. But the complex profile that results is not the same as the profile of a couple of decades ago that required a significant sample size to generate a signal.
Problems in Probability
With RMP, given 20 STRs as required for a CODIS match, probabilities are calculated according to the multiplication rule. The rule assumes that each STR’s probability is independent. In other words, there is no substructure to the DNA. However, we have seen that this may not be accurate. In cases where there is low genetic diversity, such as a collection of Amish people, there will be a strong likelihood of genetic substructure, improving the chances that any number of STRs will match. That substructure changes the necessary odds. But even separating people into African American, Caucasian, and other groups will not be sufficient.
As an example, the defense in a 1991 California case showed that there was some linkage between the STRs in the database of a lab, a “linkage disequilibrium” that forced the company to recalculate its database in a manner that did not consider all the STRs to be independent. The result was a drop in the RMP from a highly selective 1 in 6 billion to a poorly selective 1 in 50. In testing the assumptions of the lab by investigating the database, the defense showed the assumptions to be incorrect, though the Court of Appeal of California, Second Appellate District, Division Six, upheld the defendant’s conviction on other grounds. People v. Axell, 235 Cal. App. 3d 836 (1991).
In addition, a prosecutor, judge, or juror may commit an error called a Base Rate Fallacy. Given a population that is 85% White and 15% Black, an eyewitness reports that he saw a Black man fleeing the scene. You test him under similar conditions and discover he correctly identifies the fleeing man 80% of the time and is wrong 20% of the time. What is the probability then that the suspect is actually Black?
The answer is not 80%, as most of us would assume. The percentage of Black people in the city, known as the base rate, has to be taken into account. When the witness’s accuracy, 80%, is multiplied by that base rate, 15%, the chance that the fleeing man was Black and correctly identified is just 12% (0.8 × 0.15). The chance that the fleeing person was White and incorrectly identified as Black is 17% (0.2 × 0.85). We can then find the relative probability by taking 12% and dividing it by 17% + 12% to get 41%. So, the chance that the fleeing man was actually Black is 41%, not 80%. This result is not intuitive for most of us, making it a very common error.
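The same calculation, written out as code with the hypothetical numbers from the example, is essentially an application of Bayes’ rule:

```python
# Working through the eyewitness example with Bayes' rule. The numbers are the
# hypothetical ones from the text, not data from any real case.
base_rate_black = 0.15   # proportion of the population that is Black
witness_accuracy = 0.80  # probability the witness identifies the race correctly

p_said_black_and_black = witness_accuracy * base_rate_black              # 0.12
p_said_black_and_white = (1 - witness_accuracy) * (1 - base_rate_black)  # 0.17

p_black_given_said_black = p_said_black_and_black / (
    p_said_black_and_black + p_said_black_and_white
)
print(f"{p_black_given_said_black:.2f}")  # about 0.41, not 0.80
```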
Related to the base rate fallacy is the Conjunction Fallacy, which happens when we treat a specific case as more likely than a general one. If Jane is a smart, outspoken, and single young adult who majored in philosophy, which is more likely: (1) Jane is a dishwasher at a restaurant, or (2) Jane is a dishwasher at a restaurant and is active in the feminist movement? The first is more likely, because “dishwasher at a restaurant” is the broader category, one that includes every dishwasher who is also an active feminist.
Then there’s the Prosecutor’s Fallacy, which occurs when we “transpose the conditional.” In terms of RMP, it changes the question from “the probability that a random innocent unrelated person matches the DNA profile” to “the probability that the defendant is the true source of the sample given a DNA match.” An RMP cannot declare the latter, called a source probability. It can only give the probability that another unrelated DNA sample would match. The two statements are not equivalent. To put it another way, the probability that there is a DNA match if this person is innocent is not the same as the probability that the person is innocent given the DNA match.
Problems in Probability—CPI
Given the room for misunderstanding and misstatement within relatively simple single-person samples, complex mixtures bring out entirely new challenges. The Combined Probability of Inclusion (“CPI”) looks at all the DNA and tries to consider all the possible profiles that would have created the complex mixture profile that was tested. Analysts examining a degraded mixture might find one or more of the suspect’s alleles missing and decide that their absence resulted from sample degradation. But an analyst finding a missing allele and using RMP would report it as an exclusion. That is not the case with CPI or a likelihood ratio (which we will soon discuss). An analyst may argue that assuming the missing allele dropped out is a conservative approach, but that would only be true if it is also assumed that the defendant is guilty.
James Curran, a statistician, and John Buckleton, a forensic scientist, decided to test CPI because they were concerned that it was skewing findings in favor of conviction. Their study found that 87% of the profiles of people known not to be in the mix would have been reported as possible contributors. In other words, CPI offered a hypothesis other than exclusion (innocence) under which most people would be included in the mix. This challenge appears again for likelihood ratios, just in a more sophisticated form.
Problems in Probability—Likelihood Ratios
Historically, American courts have not taken up likelihood ratios (“LR”), except in paternity cases. But the use of LRs has become more common as courts have been willing to consider complex mixtures in criminal trials. However, LRs effectively subvert the presumption of innocence—by beginning with the assumption that the defendant is guilty and comparing it to the hypothesis that he or she is innocent. As California attorney and DNA expert witness Bess Stiffelman put it, “The possibility of guilt and innocence (inclusion or exclusion) are treated equally, thus shifting the burden of proof to something more akin to a civil standard.” Stiffelman, Bess, No Longer the Gold Standard: Probabilistic Genotyping is Changing the Nature of DNA Evidence in Criminal Trials, 24 Berkeley J. Crim. L. 1 (2019).
An LR compares the probabilities of two different hypotheses offered to explain a given piece of evidence. It is always conditional, as in: “The DNA evidence is X times more likely if the suspect is a contributor than if the DNA came from some unknown contributor.” Because we are comparing two probabilities, the LR is not itself a probability. It is only a comparison of two hypotheses. To convert an LR into a probability, we still have to perform a difficult Bayesian analysis to determine a “prior probability of guilt” apart from the DNA evidence.
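For readers who want to see what that Bayesian step involves, here is a minimal sketch in odds form. The priors and the LR of 1,000 are invented purely to show how much the answer depends on the prior, which is exactly the quantity the DNA evidence cannot supply.

```python
# A likelihood ratio only updates odds you already hold; it is not a probability
# by itself. This applies Bayes' rule in odds form with illustrative, made-up
# numbers for the prior probability and the LR.
def posterior_probability(prior_probability: float, likelihood_ratio: float) -> float:
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# The same LR of 1,000 looks very different depending on the prior:
for prior in (0.5, 0.01, 0.0001):
    print(f"prior {prior:>7}: posterior {posterior_probability(prior, 1_000):.4f}")
```

Note how the same LR of 1,000 translates to a 99.9% posterior when one starts from a coin-flip prior but to only about 9% when the prior is one in ten thousand; the DNA evidence alone cannot tell a jury which prior is appropriate.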
Stiffelman gives a fantastic example of how meaningless an LR can be in the “case of the exploding pillow,” paraphrased here. Suppose you come home to find your dog sitting among a cloud of feathers and a destroyed pillowcase. Your house may have been burglarized while you were gone, or it is possible your IKEA pillow was designed to explode after six months of use. You compare the hypotheses and determine that since nothing was taken, the exploding pillow is 10,000 times more likely than a burglar. While that LR is damning for IKEA, it says nothing about the dog in this case.
An LR for probabilistic genotyping attempts to provide some way of weighing the possibilities of contribution to mixed DNA. In order to create an LR, the analyst must first assume that the defendant is guilty and contributed to the complex sample. What the jury hears, in turn, sounds like a probability of guilt rather than a comparison of hypotheses.
Both the software used to create an LR and the analyst running it must make some rather important assumptions. First, the analyst must decide how many contributors were in the sample. As we have already seen, that is often a nearly impossible task that boils down to a guess based on the number of allele types and peak heights. Labs using one of the probabilistic genotyping tools, like STRmix, may opt to apply a threshold so that peaks below a certain height are not considered, while another major tool, TrueAllele, refuses to ignore any peaks, even those below the stochastic threshold. Likewise, the tools must decide how to handle missing alleles, possibly by substituting allelic frequencies commonly found with the extant alleles (thereby converting the alleles from independent to dependent probabilities) or by simply assigning them some nonzero value.
Tools also make assumptions about the relevance of their findings. The Forensic Statistical Tool (“FST”), developed and once used by the Office of the Chief Medical Examiner (“OCME”) in New York, set out guidance for interpreting its conclusions. An LR of 1 indicated the two hypotheses were equally supported, so no conclusion could be drawn. An LR between 1 and 10 showed limited support, 10 to 100 moderate support, 100 to 1,000 strong support, and anything over 1,000 very strong support. The lab has since switched to using STRmix. It is a more sophisticated piece of software, but its guidelines state that any LR under 1,000 is uninformative. The difference has nothing to do with the sensitivity of the programs, just how their designers view the statistical significance, according to Stiffelman. To jurors, the LR phrasing and numbers look the same; the only difference they would see, if permitted, is that one program’s guidelines say to disregard findings the other would treat as meaningful.
Much like the exploding pillow, the LR also reflects the hypotheses proposed. In an assumed four-person mixture, is the hypothesis limited to just the individual defendant and three unknown individuals? Or does it attempt to run combinations of related individuals? How many subpopulations does it consider? In what combinations?
The problem of varying LR thresholds with regard to what is informative is important. Even if we assume that FST and STRmix (or some other software) would return the same numbers on the same evidence (they would not), the lab would have to consider much of what was previously considered conclusive to be inconclusive. That no two likelihood ratios are alike, even within the same software run with the same data under the same assumptions, illustrates the lack of a ground truth. A key measure of any scientific system is its testability and its error rate. But LRs are incompatible with error rates.
Probabilistic Genotyping
Probabilistic genotyping tools have arisen to address the binary (included or excluded) nature of RMP and its weaknesses when addressing complex DNA mixtures. These tools all take advantage of the massive improvements in processing power. There have been at least eight such programs in use. At least three programs are open source, meaning that individuals can download the source code (the set of computer language instructions) and peruse or even enhance it. One proprietary program, FST, is no longer being used, and its source code is now downloadable off the internet thanks to ProPublica. The remainder are commercial products that guard their source code to varying degrees from public view. STRmix was originally developed in New Zealand by John Buckleton for his own lab but is now marketed worldwide.
The software applications come in two varieties: (1) discrete and (2) fully continuous. Discrete systems, which include some of the open-source tools, consider only the allele probabilities along with drop-in and drop-out probabilities. Fully continuous systems like STRmix and TrueAllele, the market leaders, also consider peak height (rfu) data and other parameters to determine likelihood ratios.
The LRs produced by the various systems do not come to the same conclusions. One study compared STRmix and EuroforMix on over 400 two-, three-, and four-person mixtures from NIST’s PROVEDit dataset. The two produced LRs that differed by more than three orders of magnitude—meaning more than a thousand-fold difference—in more than 14% of the cases. Larger differences were obtained when the contributor proportions were roughly equal, a type of edge case that makes distinguishing donors more difficult. The study in the Journal of Forensic Science referenced earlier also looked at a different kind of edge case, where the DNA sample size was tiny (a total of about 92 pg) and the male donor provided just 6.9 pg. In that case, STRmix reported an LR of 24 in favor of the non-contributor hypothesis. TrueAllele took the same data and produced values of 1.2 million to 16.7 million in favor of the non-contributor hypothesis. The difference in these results was largely caused by the way the two programs model missing information and the assumptions they make. The fact that the data came from an edge case of extremely low template DNA simply helped to highlight those differences. The study’s author, William Thompson, pointed out that, judging from these results, TrueAllele tends to favor non-contributors in low template cases in an effort to be conservative.
The PG tools make use of a calculation technique called Markov Chain Monte Carlo (“MCMC”). It works by drawing a long series of random samples to build up a probability distribution. To understand that better, consider this simplified version of the process. An MCMC tool starts from the observed data and proposes a candidate explanation (for example, a particular combination of genotypes) that might account for it. The proposal is accepted or rejected based on how well it explains the data, and the accepted proposals accumulate into a distribution that weighs the contributor hypothesis against the non-contributor hypothesis. The tool repeats the process until a pattern begins to develop. Getting that pattern to settle on the correct target distribution (what statisticians call convergence) can be challenging, and improper convergence can lead to biased results. One way to think about it is like a target shooter. The target might show a very tight grouping of shots, but if the shooter’s sights were off, the group may well be far from the target’s center. In the case of the FST in New York, the program was found to have an undisclosed function that tended to support the prosecution’s hypothesis.
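For the curious, here is a bare-bones Metropolis-style MCMC sampler. It estimates something deliberately simple (the bias of a coin after 7 heads in 10 flips, where the right answer is known) and bears no resemblance to the proprietary models inside STRmix or TrueAllele; it is offered only to show the propose-accept-reject loop and why convergence and burn-in matter.

```python
import math
import random

# A toy Metropolis sampler: propose a new value, accept or reject it based on
# how well it explains the data, and let the accepted values accumulate into a
# distribution. The "model" here is just a coin observed to land heads 7 of 10 times.

def log_posterior(p: float, heads: int = 7, flips: int = 10) -> float:
    if not 0 < p < 1:
        return float("-inf")  # impossible coin biases are never accepted
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)

def metropolis(steps: int = 50_000, step_size: float = 0.1):
    random.seed(1)
    current = 0.5
    samples = []
    for _ in range(steps):
        proposal = current + random.uniform(-step_size, step_size)
        delta = log_posterior(proposal) - log_posterior(current)
        if delta >= 0 or random.random() < math.exp(delta):
            current = proposal  # accept the proposal
        samples.append(current)
    return samples

samples = metropolis()
burned_in = samples[5_000:]  # discard early steps taken before the chain settles
print(sum(burned_in) / len(burned_in))  # roughly 0.67 (true posterior mean is 8/12)
```

Even this toy chain needs a burn-in period and enough steps to settle near the right answer; choosing those values poorly is one simple way a far more elaborate sampler can converge improperly.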
The tools will potentially perform millions of passes to derive the final likelihood ratio. Markov chains themselves date to 1906, and MCMC methods have long been used in biology and other sciences; the technique has been shown to be valid independent of the current problem. While RMP uses the multiplication rule, which assumes that the variables (STR alleles) are independent, MCMC was created to deal with dependent variables. For example, a stutter peak is dependent on its true allele, but not everything that looks like a stutter is one. So, an MCMC algorithm might run a certain percentage of its calculations assuming otherwise.
MCMC solutions, therefore, look at STR frequencies, as well as drop-in and drop-out probabilities. Some, such as STRmix and TrueAllele, add additional variables (also called parameters) like peak height data. The problem is that with each new variable, the number of possible solutions grows by a factor of that variable’s range, which can itself adjust with other variables (hence its dependency). If you have a bag of marbles, you might want to categorize them by type, size, and color. You might find that you have four types, six colors, and three sizes. In that case, if the variables are independent, you would have 72 combinations. But maybe one type has only two colors and another type has just one size. Those variables are now dependent on the type, and the number of combinations is reduced. In the same sense, knowing an STR’s relative dependency on being found with another allele can help to determine the validity of the allele as a true allele or provide a probability of it being a stutter peak. Every new variable added to the calculation increases the number of passes the software must make by some percentage of the variable range.
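The marble example can be written out directly; the categories and the dependency rules below are invented for illustration and stand in for the far more elaborate dependencies among alleles, stutter, and peak heights.

```python
from itertools import product

# Independent attributes multiply; dependencies prune the space.
types = ["cat's eye", "aggie", "swirl", "clearie"]
colors = ["red", "blue", "green", "yellow", "white", "black"]
sizes = ["small", "medium", "large"]

independent = list(product(types, colors, sizes))
print(len(independent))  # 4 * 6 * 3 = 72 combinations if everything is independent

# Suppose cat's eyes come only in two colors and aggies in a single size:
def allowed(marble_type, color, size):
    if marble_type == "cat's eye" and color not in ("red", "blue"):
        return False
    if marble_type == "aggie" and size != "small":
        return False
    return True

dependent = [combo for combo in independent if allowed(*combo)]
print(len(dependent))  # 48: fewer combinations once the dependencies are applied
```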
Unlike STR population frequencies, many of the variables that are relevant for complex mixtures are not well known. Assumptions must be made, such as the number of people who might have contributed or the minimum peak height threshold. In cases where there is plenty of data and no edge conditions, some of these assumptions may have very little negative impact. Others, like mistakes in the number of contributors or the contribution proportions, will have dramatic results. An LR that is off by a factor of 1,000 when it is in the billions may not be a big issue. But the error becomes especially prominent in edge cases where noise from stochastic effects is prevalent. In addition, even within a lab, individual technicians can generate different results. Labs also vary in their standards and practices, so the PG tool may need to be calibrated to the particular lab, typically by running known samples the lab provides.
Nathaniel Adams, a systems engineer hired by the defense in New Jersey v. Pickett, testified to that effect, saying, “Complex systems such as TrueAllele involve a hierarchy of models with dozens or hundreds of parameters, each affecting the overall system’s behavior.” To deal with stochastic effects, probabilistic genotyping programs use biological modeling to predict their occurrence. These models vary by program, but the assumptions underpinning the models can make or break the reliability of the entire program.
PG Software Engineering Challenges
For a moment, consider Boeing’s tragic example with its 737 Max, two of which crashed in 2018 and 2019, killing a total of 346 people. The problem began when Boeing found itself needing to revamp its remarkably successful 737 in order to make it more efficient. But rather than going through the expense of creating and certifying a whole new aircraft, executives decided that upgrading the existing plane would be less expensive—especially because the “Organization Designation Authorization” used by the Federal Aviation Administration (“FAA”) allowed Boeing to certify itself for 96% of the process.
But new, more powerful engines on the 737 Max caused the plane to “pitch up,” increasing its instability. Necessary structural changes to compensate would have resulted in more FAA involvement, so Boeing added the Maneuvering Characteristics Augmentation System (“MCAS”) to the flight control systems. The MCAS would automatically pitch the plane down if it was pitching up too far and likely to stall. Boeing then sold—as an option—software that would automatically alert pilots if there was a problem.
It all centered around a single point of failure that Boeing classified as a “Major Failure Condition,” two classes below a “Catastrophic Failure Condition” that would result in a crash and lives lost. Despite evidence that pilots would take longer than necessary to address any problems, which would result in a Catastrophic Failure Condition, Boeing allegedly “concocted a false calculation to downgrade the likelihood of an MCAS malfunction rate.”
In an amicus brief submitted in Corey Pickett’s case in New Jersey, the nonprofit Upturn said that the FAA “did not thoroughly test Boeing’s new Maneuvering Characteristics Augmentation System [] software because Boeing stated the software was not ‘safety critical.’” In other words, the government abdicated its responsibility to inspect the vehicle, allowing Boeing to conceal safety issues that Boeing convinced itself were not a problem. Boeing then self-validated its own complex software, which led to catastrophic results. Sound familiar?
The makers of TrueAllele have fought hard to prevent the courts from permitting defense teams to examine the source code of their tool. They are not alone. The New York OCME lab fought hard against releasing the code of the FST to defense teams, even though the tool was not being sold in a competitive market. Mark Perlin, the creator of TrueAllele who leads Cybergenetics, has argued that its 170,000 lines of code render any attempts at independent evaluation useless; reading each and every line of the complex mathematical code, he claims, would require eight and a half years of effort. Somehow, this has been a highly successful argument against oversight.
The massive amount of code necessary to perform probabilistic genotyping speaks to the difficulty of the task. Developers must work out how to represent the models symbolically in programming languages that sit somewhere between human language and the computer’s native instructions. That is far from straightforward. Meanwhile, every new line of code increases the opportunities for error. Beyond the application programmer’s own mistakes, the underlying hardware, operating system, runtime environment, and the programming language itself can introduce new problems. The process is further complicated because the road from concept to deployment is neither smooth nor direct. Thus, an immense number of lines of code actually increases the likelihood of errors, yet remarkably, the manufacturers invert that reality, citing code size as an argument against independent evaluation.
One of the primary challenges in engineering PG tools, once the models are translated into effective algorithms, is the lack of determinism. Computer software is deterministic if a given input passed through the software always generates the same output. The MCMC process’s iterative approach constantly takes effectively random starting conditions and runs tests, the results of which accumulate with other results to create the probability distribution. While the Markov chain is built to eventually converge on the actual distribution, we have seen that the assumptions built into the models can alter that, especially in edge cases. It is for this reason that, even in the best of situations, not only will different tools produce different results, but the same version of the same tool will never display the same result twice. It is therefore incredibly difficult to confirm or refute the likelihood ratios generated by the tools.
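A toy demonstration of the determinism point, reusing the dice example from earlier; the sample sizes and seed are arbitrary, and real PG tools involve far more elaborate sampling, but the reproducibility question is the same.

```python
import random

# The same Monte Carlo estimate, run twice without a fixed seed, returns two
# different numbers; fixing the seed restores repeatability.
def estimate_double_six(throws: int = 100_000) -> float:
    return sum(
        random.randint(1, 6) == 6 and random.randint(1, 6) == 6
        for _ in range(throws)
    ) / throws

print(estimate_double_six(), estimate_double_six())  # two slightly different values

random.seed(1234)
first = estimate_double_six()
random.seed(1234)
second = estimate_double_six()
print(first == second)  # True: identical only because the seed was fixed
```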
According to Christopher D. Steele and David J. Balding, validating the measured concentration is “infeasible for software aimed at computing an LR because it has no underlying true value (no equivalent to a true concentration exists). The LR expresses our uncertainty about an unknown event and depends on modeling assumptions that cannot be precisely verified in the context of noisy [crime scene profile] data.” That is, the likelihood ratios have no ground truth and are indeterministic by the nature of the process being used to create them.
Of course, even if we assume that the models underlying the system are correct, we must still struggle to understand how coding errors might impact their instantiation and output. Coding errors can arise in some astoundingly mundane places, such as a window not receiving focus when it should. These mundane errors are common to every skill and attention level of the programmer. Because every aspect of software development relies on multiple layers of coding underneath it, the errors may often reflect the skills of the person who developed the underlying object as much as the skills and attentiveness of the application developer. These types of errors can only be found as part of an effective Quality Assurance (“QA”) process. Nathaniel Adams, the expert in Corey Pickett’s case, argues that “[p]roducts and processes will always fail under certain conditions. Rather than perfection, QA’s goal is knowledge of when and how a product or process will fail under which conditions” (emphasis supplied). For this reason, he says, “Defect assessment and mitigation efforts should be proportional to the costs of potential failures.”
It is entirely likely that some conditions will produce erroneous results in a PG system that the developer will deem too rare and too costly to repair. But those conditions should be well-documented for the public, much as Applied Biosystems publishes the limits of its DNA detection systems. One open-source PG software, LikeLTD, actually lists its bug and revision history via a “change log.” If there is no public change log like this, are we to assume that new versions of software are nothing more than feature upgrades? Considering that the liberty interests of innocent people are at stake, as well as the interest of the public that dangerous people are incarcerated, this would seem an absolute minimum requirement.
But most probabilistic genotyping systems are black boxes—that is, whatever is going on inside the system is not known outside the system. It is therefore impossible for a user, or a defendant, to know that the program has acted properly because the states are not saved. One of the primary arguments made for open-source software is that as it becomes more popular, more people have an opportunity to review its code, meaning it is more likely that errors and other unwanted behaviors will be found. Related to that concern is the auditability of the code. If we can see which states are tested and verify the displayed results, then we can see where a particular Markov Chain might have swerved off track.
Nevertheless, a PG tool may have sampled millions of potential solutions to a problem that might have had billions or even trillions of possibilities. As a result, not all the runtime data can be readily recorded because it might require a terabyte or more of raw storage. Even if recording everything is impractical for cost reasons, though, the issue has strong parallels to other aspects of the criminal justice process. A software system whose decisions cannot be traced is no different from a cop who simply decides a person is guilty: we need to see the evidence used to make that determination. Unlike a cop’s work product, each random sample that supports either inclusion or exclusion becomes evidence as it builds the probability distribution. Unfortunately, a defendant’s ability to examine PG software this closely is far from guaranteed.
One solution to understanding the effectiveness of probabilistic genotyping is to perform validation studies, which are distinct from the QA process. But validation studies can never address the infinite variety of circumstances and mixtures under which DNA samples are acquired, where age and environment can have unpredictable detrimental effects on DNA quality. Part of the process of validation is coming to a determination of the bounds of acceptable errors. The federal court in United States v. Anderson, 673 F. Supp. 3d 671 (M.D. Pa. 2023), mentioned that in one study of STRmix, 300,000 known non-contributors were correctly excluded 99.1% of the time. The court concluded that this was an acceptable error rate. But even if we ignore the known contributors who were improperly excluded, there were still 2,700 known non-contributors who would struggle to establish their innocence in a criminal prosecution based on the DNA findings. Where RMP will conclusively exclude a person whose DNA does not match, the likelihood ratio can never do so.
In New Jersey v. Pickett, the state submitted a substantive list of validation studies for TrueAllele. But a closer examination by the defense found that “[n]one of the peer-reviewed studies listed as part of the state’s appendix appear to be performed on the version of the [] client … used in this case,” according to Upturn’s amicus brief. New versions will always introduce new bugs. They will also reintroduce old bugs; these so-called “regression errors” are so common that an entire class of programmers writes scripts to test software for them well into the future. Regression testing for probabilistic genotyping would require a great many computationally intensive runs against a vast number of past successes and failures; if a single run of TrueAllele can take more than a day in extreme cases, time and hardware will limit any testing on those cases. These companies should make public the range of LR variance they will allow from version to version, as well as what was tested. They should also show how a version change with an adverse effect on past cases is reported. Do programmers even test against real data used to obtain convictions, especially in edge cases?
A report by the President’s Council of Advisors on Science and Technology (“PCAST”) suggested that additional validation studies be made that are independent of the software developers. In part, that was because of the situation involving Mark Perlin of Cybergenetics, who has coauthored most of the peer-reviewed validation studies performed on the company’s TrueAllele tool. [Writer’s note: Early in my software development career, I recall “thoroughly” testing my code only to watch amazed as the end users took an approach I never imagined. It was not that I was lazy, just unimaginative regarding the remarkable number of different things a user might do or directions the data might go. When developers are involved in testing their own software, they will always (usually inadvertently) funnel the application into the paths they have imagined. The real test is when the developer is not around.] Unsurprisingly, Perlin disagrees with this strategy and in 2016 sent an open letter accusing PCAST of attempting to unilaterally impose “a novel notion of ‘independent authorship’ for peer review.”
The parallels between Boeing’s problems with the 737 Max and the problems with probabilistic genotyping are striking. The biggest purveyors of complex mixture analysis have often fought against public review of their source code and have resisted large-scale independent validation of their software’s output, just as Boeing escaped federal testing and resisted review. And there appears to be no definitive list of the edge cases that can lead to unreliable results, much as we now know that, under certain conditions, pilot response times to the MCAS grew to unacceptable levels.
Unlike with Boeing, however, there are no terrifyingly public crashes to prove to the world, in no uncertain terms, that a problem exists. Absent the public horror of hundreds of mangled bodies, the federal government is unlikely to scrutinize the development of any given PG software. When a wrongly included defendant wants to understand how he was implicated, there is no cockpit recorder to consult. Nor is there any analog to the governor that prevents some high-performance machines from exceeding their engineered limits. And even if the software’s decisions were audited, additional runs of the software would produce different outcomes, a complication that even Boeing did not have to worry about.
Trade Secrets
Courts have broadly limited access to the source code of some probabilistic genotyping systems under the protection of trade secret privilege or under the belief that source code access is irrelevant—even after the 60 notable failures of STRmix in Australia and the remarkable code challenges of the FST in New York. Greg Hampikian, a Boise State University biologist who heads the Idaho Innocence Project and uses TrueAllele to aid in exonerations, supports the release of the source code, but he has echoed a sentiment heard from prosecutors. He told ProPublica, “Microsoft Excel doesn’t release its source code either, but we can test it and see that it works, and that’s what we care about.”
It is not difficult to sympathize with this position, but the analogy is flawed. Microsoft Excel is a phenomenal product, but it is far from perfect. (No product can be.) Its bugs are catalogued on numerous websites, depending on the user’s focus, and a massive team of developers and testers conducts code reviews that feed a constantly evolving public bug list, a level of scrutiny available only to products sold in enormous volumes because so many businesses rely on them. Even so, it seems improbable that Excel’s developers would stake anyone’s liberty on unpredictable, indeterministic algorithms run through their product. By contrast, Perlin has testified that only he and one other employee have intimate knowledge of TrueAllele’s source code.
Code reviews are an important part of the defense process. Allowing trade secret objections to stand places corporate profits above the defendant’s liberty interest and above the integrity of the criminal justice system itself. Boeing offers key lessons in this regard: FAA regulators abdicated their responsibility to evaluate Boeing’s systems and test results, and only after the tragic loss of life in two closely timed crashes did they find what they had missed.
There is no corresponding agency to validate the PG tools’ programming, and placing the burden on the defense team ignores the enormous financial barrier to an effective defense. Mark Perlin has argued that TrueAllele consists of 170,000 lines of dense mathematical code. He clearly has a remarkable mind: he holds degrees in computer science, biology, and mathematics; he has written articles on genetics and probabilistic genotyping; and he has effectively presented his case for trade secrecy to the courts. Experts who combine advanced statistical methods, software engineering, and genetic science are rare at best. Even assuming the defense can find someone with the skills to understand the models, algorithms, and code, that expert or team will not come cheap, putting such a review within reach of only the wealthiest defendants. And all of this assumes Perlin’s code is legible to anyone but him; projects built by just one or two people are rarely constrained by requirements that emphasize legibility as well as functionality. Illegible “spaghetti” code can block oversight as effectively as encryption within the time and budget constraints of a criminal defense.
Trade secret claims also ignore the depth of protection already available to software developers. Copyright law prevents anyone from simply copying another company’s code. And meaningfully reusing stolen code would require adopting the same programming language, the same data models, and the same architecture; because no two programmers taking completely independent paths will code the same way, code piracy is far harder than a simple cut and paste. In the case of TrueAllele, moreover, Mark Perlin holds a patent associated with his model’s algorithm.
Allowing trade secret claims also risks foreclosing any future mandate of auditing requirements. If a program were required to produce a list of the variables it tried, along with the relevant results, a developer could argue that the log itself violates trade secrecy by giving competitors enough access to the underlying models to reverse engineer them.
Trade secrets do not effectively protect the developer, but they do incalculable harm to the defendant. Despite Microsoft’s trade secret claims over products like Windows and Excel, a vibrant field of competitors has arisen. People running Linux distributions can execute many Windows applications through compatibility layers, even though their operating system was developed largely by volunteers, and open-source office suites have grown up to compete with Microsoft Office. Their developers readily share their code, giving everyone, including Microsoft, the opportunity to innovate.
There is a significant body of work by criminal law and intellectual property scholars arguing for abandoning trade secret protections and increasing transparency in the criminal context. Rebecca Wexler, writing in the Stanford Law Review, argues that the trade secret privilege should be removed from criminal proceedings; as with other kinds of sensitive information, courts could limit its use and distribution through protective orders. “Courts should refuse to extend the privilege wholesale from civil to criminal cases, and legislators should pass new laws that limit safeguards for trade secret evidence in criminal courts and nothing more,” according to Wexler.
Navigating the Black Box: Challenging Probabilistic Genotyping Evidence
The allure of probabilistic genotyping tools lies in their ability to untangle complex DNA mixtures, offering courts a seemingly scientific path to justice. However, as this article has shown, these tools are far from infallible. Their proprietary nature, reliance on untested assumptions, and susceptibility to variability present fertile ground for defense lawyers, defendants, and the wrongfully convicted to contest their findings. Below are practical strategies to pierce the black box of PG evidence.
Understanding the Flaws
To challenge PG evidence effectively, one must first understand its vulnerabilities:
Opacity: Proprietary tools like STRmix and TrueAllele hide their source code, thwarting independent validation.
Subjectivity: Analysts’ “judgment calls”—such as the number of contributors—can skew results, as seen in United States v. Ortiz.
Inconsistency: Different tools, or even different runs of the same tool, yield divergent likelihood ratios (“LRs”), undermining reliability.
Assumptions: Models depend on variables like allele frequencies or peak height thresholds, which may not hold in edge cases like low-template DNA.
No Ground Truth: LRs compare hypotheticals, not certainties, leaving room for doubt about their real-world accuracy.
Armed with this knowledge, defense teams can turn complexity into an advantage, exposing the cracks in PG’s scientific façade.
Strategies for Defense Lawyers
1. Demand Source Code Access
File motions to compel disclosure of the PG tool’s source code, arguing that trade secret claims cannot override a defendant’s Sixth Amendment right to confront evidence.
Reference State v. Pickett, 466 N.J. Super. 270 (App. Div. 2021), in which the court granted access to TrueAllele’s code under a protective order, rejecting restrictive review conditions as an undue burden.
Without the code, you cannot audit the “random” sampling that brands a defendant as a contributor.
Demand version-specific validation studies; generic corporate white papers are meaningless if the software version used in your case failed edge-case testing.
Engage forensic software experts to scrutinize the code for errors or biases, highlighting any lack of transparency as a due process violation.
2. Question the Number of Contributors
Challenge the analyst’s estimate of contributors, a critical input that shapes LR outcomes.
Reference studies showing analysts misjudge contributor numbers in mixtures of four or more, with error rates soaring as complexity increases.
In Ortiz, the defense invalidated STRmix results by proving the sample likely had six or more contributors—beyond the tool’s validated range.
3. Expose Variability in Results
Introduce evidence that LRs vary widely across tools (e.g., STRmix and EuroforMix differing by more than 1,000-fold in 14% of cases) or even between runs of the same tool; the short calculation below this strategy shows one way to express such a disagreement in plain terms.
Argue that such inconsistency fails the Daubert standard for scientific reliability, as no consistent error rate can be established.
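As a rough illustration, with invented numbers rather than results from any actual case, a defense expert might reduce two conflicting reports to a single figure a jury can grasp:

```python
# Illustrative arithmetic only: quantifying the disagreement between two tools
# (or two runs of the same tool) that report different likelihood ratios for
# the same sample. The LR values below are invented.
import math

lr_tool_a = 8.2e7  # one tool's reported LR
lr_tool_b = 4.1e4  # another tool's (or another run's) reported LR

fold_difference = max(lr_tool_a, lr_tool_b) / min(lr_tool_a, lr_tool_b)
orders_of_magnitude = math.log10(fold_difference)

print(f"the two results disagree by {fold_difference:,.0f}-fold "
      f"({orders_of_magnitude:.1f} orders of magnitude)")
```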
4. Scrutinize Assumptions
Probe the tool’s handling of allele drop-out, stutter peaks, or population substructure—assumptions that can distort results, especially in degraded samples.
Highlight how genetic ancestry affects accuracy; as researcher Rori Rohlfs has noted, groups with less diverse variants face higher false inclusion rates.
5. Undermine Likelihood Ratios
Emphasize that LRs are not probabilities of guilt but comparisons of hypotheticals, lacking a verifiable “ground truth.”
Jurors, and even judges, often misinterpret likelihood ratios as direct probabilities of guilt. Counter this prosecutor’s fallacy relentlessly; the worked example following this list shows how the same LR can support anything from near certainty to substantial doubt depending on the prior.
Argue that LRs subtly shift the burden of proof, undermining the presumption of innocence.
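A minimal worked example, using invented numbers, of why an LR is not a probability of guilt: under Bayes’ theorem, posterior odds equal the LR multiplied by the prior odds, so the same LR yields very different conclusions depending on how the defendant came to be a suspect in the first place.

```python
# Hypothetical numbers throughout. Posterior odds = LR x prior odds; the
# posterior probability therefore depends on the prior, not on the LR alone.
def posterior_probability(likelihood_ratio: float, prior_probability: float) -> float:
    prior_odds = prior_probability / (1 - prior_probability)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1 + posterior_odds)

lr = 1_000_000.0  # a seemingly overwhelming likelihood ratio

# A genuine suspect (say, a prior of 1 in 2) yields near certainty...
print(f"prior 1 in 2:          {posterior_probability(lr, 0.5):.6f}")   # ~0.999999

# ...but a bare database hit drawn from roughly 10 million people
# (a prior on the order of 1 in 10,000,000) leaves real doubt.
print(f"prior 1 in 10 million: {posterior_probability(lr, 1e-7):.3f}")  # ~0.091
```

The point for cross-examination is not these particular numbers but the structure: the software reports only the ratio, and everything beyond that is argument.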
6. Push for Independent Validation
Advocate for studies by neutral third parties, echoing the President’s Council of Advisors on Science and Technology’s call for independent testing.
Note that developer-led validations (e.g., Mark Perlin’s TrueAllele studies) risk bias, skewing reliability assessments.
7. Critical Questioning: Unmasking PG’s Illusions at Trial
When cross-examining analysts, weaponize their jargon:
“You claim 540 million-to-1 odds against my client, yet STRmix was never validated for mixtures of six or more contributors. Isn’t ‘five contributors’ just a guess made to fit the software’s limits?”
“This ‘black box’ reported an LR of 1.2 million, but without the source code, how can we audit its 170,000 lines of assumptions?”
“Does your company still forbid independent code review? If Boeing’s secrecy killed 346 people, what might yours do?”
8. The Core Argument
Probabilistic genotyping is not science; it is opinion rendered by algorithms. Courtroom victories show that when defense lawyers dissect its claimed certainty, jurors see the cracks and flaws. As one exoneree stated: “They called DNA ‘gold.’ But fools’ gold glitters brightest before it crumbles.”
Educating Judges and Juries
Simplify the Science: Use analogies—like comparing DNA mixtures to a blurred photograph—to demystify PG’s complexity and reveal its flaws.
Visual Tools: Present graphs showing how LRs fluctuate across tools or runs, making abstract variability tangible.
Expert Witnesses: Retain credible forensic experts to testify on PG limitations, countering the aura of DNA infallibility.
For Defendants and the Convicted
Post-Conviction Challenges: If new flaws in a PG tool emerge (e.g., software updates exposing prior errors or recalls), petition for relief based on newly discovered evidence.
Innocence Networks: Partner with groups like the Innocence Project of Texas, which used TrueAllele reanalysis to help exonerate Lydell Grant; a subsequent CODIS hit identified the true culprit.
Re-Analysis Requests: Seek independent re-testing with a different PG tool or lab, leveraging variability to question original findings.
Raise Ineffective Assistance of Counsel Claims if Trial Counsel Failed to:
—Consult experts on DNA transfer, persistence, prevalence, and recovery (“TPPR”) regarding transfer risks
—Challenge lab accreditation or error-rate disclosures
—Cross-examine analysts on parameter manipulation (e.g., “Did you adjust contributor numbers to fit the software’s limits?”)
Conclusion
There is no silver bullet for the problems this article has outlined with complex mixtures and probabilistic genotyping systems. Even if we suspend trade secrets, mandate detailed audits, open the source code to public comment and review, and more, the very nature of likelihood ratios is antithetical to the American criminal justice system. A likelihood ratio begins by presuming guilt in a system that professes a defendant is innocent until proven guilty. Meanwhile, the LR’s lack of a ground truth and PG’s indeterminism stand against the very reproducibility we demand of scientific evidence.
DNA evidence should be a beacon of truth in the pursuit of justice—not a black box of unchecked assumptions. Probabilistic genotyping has undeniably expanded forensic capabilities, but its rapid adoption has outpaced critical scrutiny. Courts, attorneys, and forensic professionals must confront an uncomfortable reality: when proprietary algorithms dictate guilt or innocence, justice demands more than blind faith in opaque, for-profit technology.
The cases examined in this article—from Ortiz to Pickett—reveal a troubling pattern. Flawed inputs, untested assumptions, and unchecked software updates can distort outcomes, leaving defendants at the mercy of systems they have no meaningful way to challenge. Trade secrecy claims compound the problem, placing corporate interests above due process and fundamental fairness. If we accept that DNA evidence must meet scientific and legal standards, then PG tools must be held to that same rigorous standard—open validation, peer-reviewed methodologies, and unfettered defense access to the mechanisms that decide fates.
Probabilistic genotyping systems are not likely to disappear. Legislators should force PG companies to implement the recommendations outlined above and to submit to open, unrestricted testing and validation, and should allow NIST to develop edge-case testing that helps determine the limits of each tool’s use. Likewise, NIST should be empowered to establish plain-language guidelines that at least attempt to put likelihood ratios into a rational context for criminal prosecutions. PG vendors should be required to make bug reports public and to notify the courts when version changes may have exclusionary effects on previously included defendants. The tools should also include “governors” that prevent laboratories from using them beyond established safe boundaries outside of research contexts.
Courts must demand independent validation and establish clear boundaries for PG’s use, particularly in edge cases where error rates soar. Forensic analysts must resist overstating results, acknowledging the limitations of complex mixtures and stochastic effects. And the legal community must equip itself to challenge—not just accept—the algorithms helping to dictate verdicts.
Probabilistic genotyping is here to stay. But its role in the criminal justice system must be measured by reliability, reproducibility, and fairness—not by convenience or conviction rates.
Sources: Berkeley Journal of Criminal Law, California Law Review, Champion, Drexel Law Review, Federal Judicial Center, Forensic Science International: Genetics, Forensic Science Magazine, Global Medical Genetics, Journal of Forensic Sciences, Litigator’s Guide to DNA, Mercer Law Review, National Academy of Sciences, National Institutes of Health, National Institute of Justice, NBC News, NIST Forensic Sciences, North Carolina Criminal Law Blog, ProPublica, Richmond Journal of Law and Technology, SciTech Daily, Stanford Law Review, The Markup, The Register, Upturn, Yahoo! News
Related legal cases
United States v. Ortiz, 736 F. Supp. 3d 895 (S.D. Cal. 2024) (federal district court).
State v. Pickett, 466 N.J. Super. 270 (App. Div. 2021) (state appellate court).