While reading the usually excellent Scott Galloway, I came across the familiar 98% DNA quote:
“We share 98% of our DNA with chimpanzees, including our capacity to grieve.”
I remember hearing Craig Venter say this was false, and he should know.
In fact, there is no single number; it all depends on which model you use.
**
“Richard Buggs performed a one-to-one analysis of human and chimp nucleotides. He reported that “the percentage of nucleotides in the human genome that had one-to-one exact matches in the chimpanzee genome was 84.38%.”
Other methodologies have yielded numbers ranging from the mid-80s to 90s. Why the different results? Well, because as Luskin explains, it is not entirely clear how we should compare human and chimp genomes:
Are you comparing the number of genes or copies of genes that the organism has? Are you comparing one-to-one nucleotide similarities? Are you including just the protein-coding DNA or also the non-coding DNA? Are you looking at certain segments of the genome that aren’t even necessarily places where the sequence matters, like the centromeres … ?
It turns out you’ll get a different answer depending on which method you choose. And there is an even deeper problem, says Luskin:
All of the chimp genomes we have today were effectively humanized. … The human genome was used as a scaffolding during the construction of these chimp genomes, which essentially makes the chimp genomes appear more similar to humans than they truly are.
None of this is new. Back in 2007, a paper in the journal Science admitted that the “1%” statistic was a myth and called for the truism to be retired. Yet 16 years later, this zombie idea shambles on, perpetuated by publications like Smithsonian Magazine, Nature, and the American Museum of Natural History website.
Why? Well, simply put, the “98-99%” myth holds great power in advancing the materialist worldview that dominates the hard sciences. Even if this pseudo-scientific meme were true, it is rarely stated or received as a mere fact. Instead, it is used to imply things about human beings: that we are the sum of our DNA; that we are, by that measure, almost 100% animal; and that therefore everything theologians once meant by “the image of God” is a self-flattering illusion. We’re not exceptional. We’re basically gussied-up monkeys.
But is that true? Well, the fact that humans share roughly 60% of our DNA with bananas should clue us in that there is much more behind what makes us us than just genes.”
from https://www.breakpoint.org/of-primates-and-percentages-no-humans-arent-99-chimp/
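A toy sketch of why the method matters (made-up sequences, nothing to do with real genomes): the same alignment gives a very different "percent identity" depending on whether gapped positions count as differences or are ignored.

```python
human = "ACGTACGTAC--GTACGTTTTT"   # '-' marks an insertion/deletion gap in the alignment
chimp = "ACGTACCTACGGGTACGT----"

cols = list(zip(human, chimp))

# Method 1: identity over every aligned column, counting gaps as mismatches
identity_with_gaps = sum(a == b for a, b in cols) / len(cols)

# Method 2: identity over only the columns where both sequences have a base
# (one-to-one nucleotide matches, ignoring indels entirely)
paired = [(a, b) for a, b in cols if a != "-" and b != "-"]
identity_no_gaps = sum(a == b for a, b in paired) / len(paired)

print(f"counting gaps as differences: {identity_with_gaps:.0%}")   # ~68%
print(f"ignoring gapped positions:    {identity_no_gaps:.0%}")     # ~94%
```

With whole genomes, the analogous choices (indels, copy number, non-coding DNA, unalignable regions, which assembly you scaffold against) move the headline figure by many percentage points.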
**
DNA in plants
“In 2008, it was discovered that the inactivation of only two genes in one species of annual plant leads to its conversion into a perennial plant.[16] Researchers deactivated the SOC1 and FUL genes (which control flowering time) of Arabidopsis thaliana. This switch established phenotypes common in perennial plants, such as wood formation.”
annuals, biennials, perennials
also see Venter and Watson biographies
https://www.theguardian.com/books/2007/oct/27/featuresreviews.guardianreview6
**
digression on models
“All models are wrong, but some are useful".
“On p. 440, the relevant sentence is this: "The most that can be expected from any model is that it can supply a useful approximation to reality: All models are wrong; some models are useful".
In 1947, the mathematician John von Neumann said that "truth ... is much too complicated to allow anything but approximations".[22]
In 1942, the French philosopher-poet Paul Valéry said the following:[23]
Ce qui est simple est toujours faux. Ce qui ne l'est pas est inutilisable.
What is simple is always wrong. What is not is unusable.[24]
And from the artist Pablo Picasso:
We all know that art is not truth. Art is a lie that makes us realize truth, at least the truth that is given us to understand. The artist must know the manner whereby to convince others of the truthfulness of his lies.
https://en.wikipedia.org/wiki/All_models_are_wrong
See also
Anscombe's quartet – Four data sets with the same descriptive statistics, yet very different distributions
Bonini's paradox – As a model of a complex system becomes more complete, it becomes less understandable
Lie-to-children – Teaching a complex subject via simpler models
Map–territory relation – Relationship between an object and a representation of that object
Pragmatism – Philosophical tradition
Reification (fallacy) – Fallacy of treating an abstraction as if it were a real thing
Scientific modelling – Scientific activity that produces models
Statistical model – Type of mathematical model
Statistical model validation – Evaluating whether a chosen statistical model is appropriate or not
Verisimilitude – Resemblance to reality
**
Posted on June 12, 2008, by Andrew Gelman:
J. Michael Steele explains why he doesn’t like the above saying (which, as he says, is attributed to statistician George Box). Steele writes, “Whenever you hear this phrase, there is a good chance that you are about to be sold a bill of goods.”
He considers a street map of Philadelphia as an example of a model:
If I say that a map is wrong, it means that a building is misnamed, or the direction of a one-way street is mislabeled. I never expected my map to recreate all of physical reality, and I only feel ripped off if my map does not correctly answer the questions that it claims to answer. My maps of Philadelphia are useful. Moreover, except for a few that are out-of-date, they are not wrong.
Actually, my guess is that his maps are wrong, in that there probably are a couple of streets that are mislabeled in some way. Street maps are updated occasionally (even every year), but streets get changed, and not every change is captured in an update. I expect there are a few places where Steele’s map has mistakes. (But I doubt it’s like those old tourist street maps of Soviet cities which, I’ve been told, had lots of intentional errors to make it harder for people to actually find their way around too well.) In any case, I take his general point, which is that a street map could be exactly correct, to the resolution of the map.
Statistical models of the sort that I typically use are different in being generative: that is, they are stochastic prescriptions for creating data. As such, they can typically never be proven wrong (except in special cases, for example a binary regression model can’t produce a data value of 0.6). The saying, “all models are wrong,” is helpful because it is not completely obvious, since it can’t always be proved in special cases.
from https://statmodeling.stat.columbia.edu/2008/06/12/all_models_are/
**
another digression
and of course a quick nod to Peter Donnelly for his joke about models and jeans in his excellent talk.
“She told this to one of her colleagues, who said, "Well, what does your boyfriend do?" Sarah thought quite hard about the things I'd explained -- and she concentrated, in those days, on listening. Don't tell her I said that. And she was thinking about the work I did developing mathematical models for understanding evolution and modern genetics. So when her colleague said, "What does he do?" she paused and said, "He models things." Well, her colleague suddenly got much more interested than I had any right to expect and went on and said, "What does he model?" Well, Sarah thought a little bit more about my work and said, "Genes. He models genes."”
**
more seriously
“Now suppose we've got a test for a disease which isn't infallible, but it's pretty good. It gets it right 99 percent of the time. And I take one of you, or I take someone off the street, and I test them for the disease in question.
Let's suppose there's a test for HIV -- the virus that causes AIDS -- and the test says the person has the disease. What's the chance that they do? The test gets it right 99 percent of the time.
So a natural answer is 99 percent.
That's what you might think. It's not the answer, and it's not because it's only part of the story. It actually depends on how common or how rare the disease is. So let me try and illustrate that. Here's a little caricature of a million individuals.
So let's think about a disease that affects -- it's pretty rare, it affects one person in 10,000. Amongst these million individuals, most of them are healthy and some of them will have the disease. And in fact, if this is the prevalence of the disease, about 100 will have the disease and the rest won't.
So now suppose we test them all. What happens? Well, amongst the 100 who do have the disease, the test will get it right 99 percent of the time, and 99 will test positive.
Amongst all these other people who don't have the disease, the test will get it right 99 percent of the time. It'll only get it wrong one percent of the time. But there are so many of them that there'll be an enormous number of false positives.
Put that another way -- of all of them who test positive -- so here they are, the individuals involved -- less than one in 100 actually have the disease.
So even though we think the test is accurate, the important part of the story is there's another bit of information we need.
Here's the key intuition. What we have to do, once we know the test is positive, is to weigh up the plausibility, or the likelihood, of two competing explanations. Each of those explanations has a likely bit and an unlikely bit.
One explanation is that the person doesn't have the disease -- that's overwhelmingly likely, if you pick someone at random -- but the test gets it wrong, which is unlikely.
The other explanation is that the person does have the disease -- that's unlikely -- but the test gets it right, which is likely. And the number we end up with -- that number which is a little bit less than one in 100 -- is to do with how likely one of those explanations is relative to the other. Each of them taken together is unlikely.
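Donnelly's arithmetic is easy to redo. A quick sketch, using only the numbers from the talk (prevalence of one in 10,000; a test that is right 99 percent of the time either way):

```python
# Reproduce the back-of-the-envelope calculation from the talk.
population = 1_000_000
prevalence = 1 / 10_000          # one person in 10,000 has the disease
accuracy   = 0.99                # test is right 99% of the time, either way

diseased = population * prevalence          # ~100 people
healthy  = population - diseased            # ~999,900 people

true_positives  = diseased * accuracy       # ~99 correctly flagged
false_positives = healthy * (1 - accuracy)  # ~9,999 wrongly flagged

p_disease_given_positive = true_positives / (true_positives + false_positives)
print(f"P(disease | positive test) ≈ {p_disease_given_positive:.3f}")   # ≈ 0.010
```

Which matches the "a little bit less than one in 100" in the transcript: the false positives from the huge healthy majority swamp the true positives.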
Here's a more topical example of exactly the same thing. Those of you in Britain will know about what's become rather a celebrated case of a woman called Sally Clark, who had two babies who died suddenly. And initially, it was thought that they died of what's known informally as "cot death," and more formally as "Sudden Infant Death Syndrome." For various reasons, she was later charged with murder. And at the trial, her trial, a very distinguished pediatrician gave evidence that the chance of two cot deaths, innocent deaths, in a family like hers -- which was professional and non-smoking -- was one in 73 million.
To cut a long story short, she was convicted at the time.
Later, and fairly recently, acquitted on appeal -- in fact, on the second appeal. And just to set it in context, you can imagine how awful it is for someone to have lost one child, and then two, if they're innocent, to be convicted of murdering them. To be put through the stress of the trial, convicted of murdering them -- and to spend time in a women's prison, where all the other prisoners think you killed your children -- is a really awful thing to happen to someone. And it happened in large part here because the expert got the statistics horribly wrong, in two different ways.
So where did he get the one in 73 million number?
He looked at some research, which said the chance of one cot death in a family like Sally Clark's is about one in 8,500.
So he said, "I'll assume that if you have one cot death in a family, the chance of a second child dying from cot death aren't changed." So that's what statisticians would call an assumption of independence. It's like saying, "If you toss a coin and get a head the first time, that won't affect the chance of getting a head the second time."
So if you toss a coin twice, the chance of getting a head twice are a half -- that's the chance the first time -- times a half -- the chance a second time. So he said, "Here, I'll assume that these events are independent. When you multiply 8,500 together twice, you get about 73 million." And none of this was stated to the court as an assumption or presented to the jury that way.
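The 73-million figure is just that independence assumption applied mechanically (one in 8,500, squared):

```python
p_one_cot_death = 1 / 8_500
# The pediatrician's (flawed) independence assumption: just square the single-event probability
p_two_cot_deaths = p_one_cot_death ** 2
print(f"1 in {1 / p_two_cot_deaths:,.0f}")   # 1 in 72,250,000, i.e. "about 73 million"
```

If the two deaths share genetic or environmental risk factors, as the talk goes on to argue, the squaring is invalid and the true joint probability is much higher.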
Unfortunately here -- and, really, regrettably -- first of all, in a situation like this you'd have to verify it empirically.
And secondly, it's palpably false.
There are lots and lots of things that we don't know about sudden infant deaths. It might well be that there are environmental factors that we're not aware of, and it's pretty likely to be the case that there are genetic factors we're not aware of.
So if a family suffers from one cot death, you'd put them in a high-risk group.
They've probably got these environmental risk factors and/or genetic risk factors we don't know about. And to argue, then, that the chance of a second death is as if you didn't know that information is really silly. It's worse than silly -- it's really bad science. Nonetheless, that's how it was presented, and at trial nobody even argued it. That's the first problem.
The second problem is, what does the number of one in 73 million mean? So after Sally Clark was convicted -- you can imagine, it made rather a splash in the press -- one of the journalists from one of Britain's more reputable newspapers wrote that what the expert had said was, "The chance that she was innocent was one in 73 million." Now, that's a logical error. It's exactly the same logical error as the logical error of thinking that after the disease test, which is 99 percent accurate, the chance of having the disease is 99 percent. In the disease example, we had to bear in mind two things, one of which was the possibility that the test got it right or not. And the other one was the chance, a priori, that the person had the disease or not.
It's exactly the same in this context. There are two things involved -- two parts to the explanation. We want to know how likely, or relatively how likely, two different explanations are. One of them is that Sally Clark was innocent -- which is, a priori, overwhelmingly likely -- most mothers don't kill their children. And the second part of the explanation is that she suffered an incredibly unlikely event. Not as unlikely as one in 73 million, but nonetheless rather unlikely. The other explanation is that she was guilty. Now, we probably think a priori that's unlikely. And we certainly should think in the context of a criminal trial that that's unlikely, because of the presumption of innocence. And then if she were trying to kill the children, she succeeded. So the chance that she's innocent isn't one in 73 million. We don't know what it is.
It has to do with weighing up the strength of the other evidence against her and the statistical evidence. We know the children died. What matters is how likely or unlikely, relative to each other, the two explanations are.
And they're both implausible. There's a situation where errors in statistics had really profound and really unfortunate consequences.
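To make the "weighing two explanations" point concrete, here is a sketch of the posterior-odds comparison with entirely made-up numbers (the real priors and likelihoods in the Clark case are unknown); the only point is that P(evidence | innocent) is not the same thing as P(innocent | evidence).

```python
# All numbers below are hypothetical, chosen only to show the structure of the argument.
prior_innocent = 0.999999                 # a priori, most mothers do not kill their children
prior_guilty   = 1 - prior_innocent

p_evidence_if_innocent = 1 / 1_000_000    # two natural cot deaths: rare, but not 1 in 73 million
p_evidence_if_guilty   = 1 / 2            # placeholder likelihood under the guilt hypothesis

odds_innocent = (prior_innocent * p_evidence_if_innocent) / (prior_guilty * p_evidence_if_guilty)
p_innocent_given_evidence = odds_innocent / (1 + odds_innocent)
print(f"P(innocent | evidence) ≈ {p_innocent_given_evidence:.2f}")
# Nowhere near 1 in 73 million -- and the answer moves with every one of the made-up inputs
# above, which is exactly why "we don't know what it is" without weighing the other evidence.
```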
In fact, there are two other women who were convicted on the basis of the evidence of this pediatrician, who have subsequently been released on appeal.
Many cases were reviewed. And it's particularly topical because he's currently facing a disrepute charge at Britain's General Medical Council.”