Julia Galef explaining how Bayesian reasoning works with a puzzle (to 0:45)
Then to 2:30
More Bayes here with Nate Silver
Julia Galef
She then detoured into urban design before becoming dissatisfied with that field’s subjectivity, too. Inspired by an essay by Y Combinator founder Paul Graham on the value of “holding your identity lightly” — so that defending it doesn’t get in the way of seeing the world clearly — she has stopped referring to herself as a Democrat. Ten years ago, an atheist blogger Galef followed published a list of 14 “Sexy Scientists,” which in more than 1,500 comments on a dozen blogs was alternately blasted (“fucking skeevdood”) and defended (“just silly fun”). The next week, the blogger, Luke Muehlhauser, posted an apology, declaring that publishing the list had been morally wrong. Galef was so impressed by his willingness to reconsider his position that she emailed him a fan note. The two are now engaged.
Bayes update your priors
Keynes
Once, during a high-profile government hearing, a critic accused him of being inconsistent, and Keynes reportedly answered with one of the following:
When events change, I change my mind. What do you do?
When the facts change, I change my mind. What do you do, sir?
When my information changes, I alter my conclusions. What do you do, sir?
When someone persuades me that I am wrong, I change my mind. What do you do?
Bayes
“A classic 1978 article in the New England Journal of Medicine reveals this problem. The researchers asked 60 Harvard physicians and medical students a seemingly simple question: If a test to detect a disease with a prevalence of 1/1,000 has a false positive rate of 5%, what is the chance that a person found to have a positive result actually has the disease?
Only 14% gave the correct answer of 2%, with most answering 95%.
Base rate fallacy/false positive paradox is derived from Bayes' theorem. When the incidence of a disease in a population is low, unless the test used has very high specificity, more false positives will be determined than true positives. The difference in the numbers can be quite striking and certainly not inherently understandable.
We have learned in the past from routine PSA testing and mammograms that a positive test in a screening situation needs to be taken in context. The incidence of a disease in the population that you are testing is extremely important for accuracy.
Purdue University made the decision in late spring to resume in-person classes for its fall session. Purdue is a major research university with a strong emphasis on STEM education. Many of these classes include practicums, laboratory sessions, and group projects that require some in-person attendance.
An elaborate plan was implemented, including a signed pledge from all students to behave properly, wear masks, maintain social distancing. A decision was made to perform random testing on 10% of the students and staff each week. Since staff and students combined are 50,000 at Purdue University, 5,000 tests are done every week. The purpose of the random testing was surveillance to encourage students and staff to maintain proper behavior.
The Indiana State Department of Health advised against a random testing program, as it felt overall data accuracy would be difficult. Commingling of data in our county from the people tested WITH symptoms together with the randomly tested Purdue students WITHOUT symptoms has occurred. Base rate fallacy/false positive paradox unfortunately becomes ignored when one does this.
Up to this point, Purdue has done random testing on about 1,000 students per weekday. Of those, about 35 are positive each day, according to the university's dashboard. Students who test positive have to isolate in an old dormitory or go home. Those who choose to go home will often have another test by their personal physician. When these tests return negative, significant confusion occurs.
So far, 90% of the students who test positive do not develop symptoms. Only one has been hospitalized and none have died. Had Purdue chosen to test all 50,000 students and staff every week, 10 times the number would have reported as testing positive weekly. Had this data been commingled with testing of symptomatic individuals, there certainly would have been an outcry by the casual observer to close everything down again. Yet those numbers would be only representative of the positivity of mass testing, not the prevalence of infective patients.
Those 35 students who test positive daily are added to our county totals (many of those county positive tests are done on people with COVID-19 symptoms). Thus, it makes it look like our county's number of positive tests has doubled since Purdue started in-person classes in August.
The numbers have caused our county health department to move cautiously. Restaurant occupancy, sporting events and other large gatherings are again limited at a greater level than state requirements.
Without knowing the specificity of the test, the number of these positives that are false positives is unknown.
By base rate fallacy/false positive paradox, if the specificity of a test is 95%, when used in a population with a 2% incidence of disease -- such as healthy college students and staff -- there will be 5 false positives for every 2 true positives. (The actual incidence of active COVID-19 in college-age students is not known but estimated to be less than 0.6% by Indiana University/Fairbanks data.) Even using a test with 99% specificity with a 1% population incidence generates 10 false positives for every 9 true positives.”
https://www.medpagetoday.com/infectiousdisease/covid19/89522
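A quick sketch (mine, not from the MedPage article) of the base-rate arithmetic it describes; the near-100% sensitivity is my assumption, since the article only gives prevalence and specificity/false-positive rate:

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """P(disease | positive test) by Bayes' theorem."""
    true_pos = prevalence * sensitivity
    false_pos = (1 - prevalence) * (1 - specificity)
    return true_pos / (true_pos + false_pos)

# 1978 NEJM question: prevalence 1/1000, 5% false-positive rate (sensitivity assumed ~100%)
print(positive_predictive_value(0.001, 1.0, 0.95))  # ~0.02 -- the "2%" answer

# Screening a low-prevalence campus population: 2% incidence, 95% specificity
print(positive_predictive_value(0.02, 1.0, 0.95))   # ~0.29 -- about 5 false positives per 2 true positives
```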
****
more Bayes here
The study looked at two questions. First was the one I mentioned above: “True or false: the p-value is the probability that the null hypothesis is correct”. The correct answer is “false” – the p-value is the chance of obtaining results at least as extreme as those actually obtained if the null hypothesis were true. 42% correctly said it was false, 46% said it was true, and 12% didn’t even want to hazard a guess.
The question seems sketchy to me. It is indeed technically false, but it seems pretty close to the truth. If I were asked to explain why the definition as given was false, the best I could do is say that your probability of the null hypothesis being true should take into account both something like your p-value, and your prior. But since no one ever receives Bayesian statistical education, I am not sure it is fair to expect a doctor to be able to generate that objection. What I would want a doctor to know is that the lower the p-value, the more conclusively the study has rejected the null hypothesis. The false definition as given accurately captures that key insight. So I’m not sure it proves anything other than doctors not being really nitpicky over definitions.
(which is also false, actually)
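A rough simulation (mine, not from the post) of why "p < 0.05" is not "a 5% chance the null hypothesis is true": the share of true nulls among significant results depends on the prior base rate of true nulls and on power, both of which are made-up numbers here:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_experiments, n_per_group = 10_000, 30
null_is_true = rng.random(n_experiments) < 0.5    # assume half of tested hypotheses are truly null
effect = np.where(null_is_true, 0.0, 0.5)         # assumed effect size when the null is false

significant, significant_and_null = 0, 0
for i in range(n_experiments):
    a = rng.normal(0.0, 1.0, n_per_group)
    b = rng.normal(effect[i], 1.0, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:
        significant += 1
        significant_and_null += null_is_true[i]

# Fraction of "significant" results where the null was actually true:
# noticeably higher than 0.05, and it moves with the 50% prior assumed above.
print(significant_and_null / significant)
```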
Next came very nearly the exact same question about mammogram results as Eliezer’s Short Explanation Of Bayes Theorem. It offered five multiple-choice answers, so we would expect 20% correct by chance. Instead, 26% of doctors got it correct. What shocks me about this one is that the question very nearly does all the work for you and throws the right answer in your face. Compare the way it was phrased in Eliezer’s example:
1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?
to the way it was phrased on the obstetrician study:
Ten out of every 1,000 women have breast cancer. Of these 10 women with breast cancer, 9 test positive. Of the 990 women without cancer, about 89 nevertheless test positive. A woman tests positive and wants to know whether she has breast cancer for sure, or at least what the chances are. What is the best answer?
The obstetrician study seems to be doing everything it can to guide people to the correct result, and 74% of people still got it wrong. And nitpicky definitions don’t provide much of an excuse here.
There were three other results of this study worth highlighting.
First, people who got the statistics questions wrong were more likely to say they had good training in statistical literacy than those who did not, giving a rare demonstration of the Dunning-Kruger effect in the wild. Doctors who didn’t know statistics were apparently so inadequate that they didn’t realize there was any more to know, whereas those who did know some statistics at least had a faint inkling that something was missing.
Second, women rated their statistical literacy significantly worse than men did (note that a large majority of Ob/Gyn residents are women) but did not actually do any worse on the questions. This highlights an important limitation of self-report (tendency to confuse incompetence with humility) and probably has some broader gender-related implications as well.
And third, even though 42% of people got Question 1 correct and 26% of people got Question 2 correct, only 12% of people got both questions correct. Just from eyeballing those numbers, it doesn’t look like getting one question right made you much more likely to do better on the other. This is very consistent with most people lucking into the correct answer.
I do not want to use this to attack doctors. Most doctors are technicians and not academics, and they cultivate, and should cultivate, only as much statistical knowledge as is useful for them. For a technician, “a p-value is that thing that gets lower when it means there’s really strong evidence” is probably enough. For a technician, “I can’t remember what exactly the positive predictive value of a mammogram is but it doesn’t matter because you should follow up all suspicious mammograms with further testing anyway” is probably enough.
But it really does seem relevant that only 12% of doctors can answer two simple statistics questions correctly when you’re trying to deny the entire non-doctor population access to certain information because only doctors are good enough at statistics to understand it.
from https://slatestarcodex.com/2013/12/17/statistical-literacy-among-doctors-now-lower-than-chance/
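The Bayes arithmetic for both phrasings, as a quick check (mine, not part of the post):

```python
def p_cancer_given_positive(prior, p_pos_given_cancer, p_pos_given_healthy):
    num = prior * p_pos_given_cancer
    return num / (num + (1 - prior) * p_pos_given_healthy)

# Eliezer's phrasing: 1% prior, 80% detection rate, 9.6% false-positive rate
print(p_cancer_given_positive(0.01, 0.80, 0.096))      # ~0.078, i.e. about 7.8%

# Obstetrician-study phrasing: 10/1000 prior, 9 of 10 detected, 89 of 990 false positives
print(p_cancer_given_positive(10/1000, 9/10, 89/990))  # ~0.092, i.e. 9 out of 98 -- roughly 1 in 10
```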
***
bullshitters
“British parents were asked to estimate their children’s IQs. They put their sons, on average, at 115 and their daughters at only 107, even though girls tend to develop earlier than boys, have a bigger vocabulary, and outperform boys academically from reception to PhD.
So boys grow up subliminally absorbing this mistaken notion that they are cleverer than girls, and girls grow up absorbing it too. No wonder that, when the same researchers asked adult men and women to estimate their own IQ, men on average said it was 110, and women, only 105. Yet the IQ distribution is identical between the genders, except at the extreme ends of the bell curve.
One academic paper, unusually entitled Bullshitters. Who are they and what do we know about their lives?, studied 40,000 15-year-olds in nine countries. They were given a list of 16 mathematical concepts and asked to rate their knowledge of them, from “never heard of it” to “know it well, understand the concept”. Unbeknown to the teenagers, the researchers had inserted three fake concepts – ‘subjunctive scaling’, ‘declarative fraction’ and ‘proper number’ – into the list. In all nine countries, boys were much more likely than girls to claim that they knew and understood the fake concepts.”
https://www.theguardian.com/commentisfree/2024/dec/06/why-do-some-men-behave-badly-i-think-i-have-the-answer
**
How complicated is it: Emperor’s new clothes or Dunning-Kruger?
***
“In contrast Bayesian statistics looks quite different, and this is because it is fundamentally all about modifying conditional probabilities – it uses prior distributions for unknown quantities which it then updates to posterior distributions using the laws of probability. In fact Bayesian statistics is all about probability calculations!
In essence the disagreement between Classical and Bayesian statisticians is about the answer to one simple question:
“Can a parameter (e.g. the mean of a distribution such as the mean life of a component) which is fixed but unknown be represented by a random variable?”
In other words can a quantity that has a fixed but unknown value be represented by a quantity that has a random value? In particular, if we initially have no information at all about the fixed parameter, is there a way of representing this state of knowledge (or lack of it) by assuming that the parameter instead has a “vague” (or flat) probability distribution over all of the values that it could possibly take?
What is your view on this question? Your answer determines whether you are a member of the Classical school or the Bayesian school of statistics.
However the Classical school points out that this subjectivism does not sit well with “the scientific method” which must be as objective as possible, and in particular must not depend on who does the experiment or who analyses the results.
If it works, why not be pragmatic and use the Bayesian approach anyway? Being realistic, some problems cannot begin to be tackled without making the sort of subjective judgements required for the Bayesian approach. Clearly the Bayesian approach is an appropriate choice in such cases. However, the greater power of the Bayesian approach comes at the high price of subjectivism.”
all from https://egertonconsulting.com/a-comparison-of-classical-and-bayesian-statistics/
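A minimal sketch (not from the Egerton piece) of the prior-to-posterior updating it describes, using a conjugate Beta-Binomial model; the flat prior and the failure counts are illustrative assumptions:

```python
from scipy import stats

# "Vague" flat prior over an unknown failure probability: Beta(1, 1) = Uniform(0, 1)
alpha_prior, beta_prior = 1, 1

# Hypothetical data: 3 failures observed in 50 components tested
failures, trials = 3, 50

# Conjugate update: posterior is Beta(alpha + failures, beta + non-failures)
posterior = stats.beta(alpha_prior + failures, beta_prior + (trials - failures))

print(posterior.mean())           # posterior mean failure probability, ~0.077
print(posterior.interval(0.95))   # 95% credible interval for the "fixed but unknown" parameter
```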
***
The truth has been fixed in the universe
Also Hannah Fry
and the trap
real life updating probability
also see pirates (point of view)
https://www.medpagetoday.com/infectiousdisease/covid19/89522
modelling mindsets
https://x.com/ChristophMolnar/status/1864285279541928190
see also
https://statmodeling.stat.columbia.edu/2013/11/07/nix-expression-false-positives/
baking soda and cocaine https://statmodeling.stat.columbia.edu/2016/11/28/30645/
https://www.ted.com/talks/peter_donnelly_how_juries_are_fooled_by_statistics
https://ed.ted.com/lessons/can-you-solve-the-false-positive-riddle-alex-gendler
see also
Ship of Theseus
https://en.wikipedia.org/wiki/Ship_of_Theseus
Bayes
Bayes' rule and computing conditional probabilities provide a solution method for a number of popular puzzles, such as the Three Prisoners problem, the Monty Hall problem, the Two Child problem, and the Two Envelopes problem.
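A quick Monte Carlo check (mine) of the Monty Hall answer that the conditional-probability argument gives: switching wins about 2/3 of the time.

```python
import random

def monty_hall_trial(switch: bool) -> bool:
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the car
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
print(sum(monty_hall_trial(True) for _ in range(trials)) / trials)   # ~0.667
print(sum(monty_hall_trial(False) for _ in range(trials)) / trials)  # ~0.333
```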
false positives
Why Most Published Research Findings Are False, a 2005 essay in metascience by John Ioannidis
Benford, Harford, Hawthorne, Overton window
Karl Popper- a meta idea?
A theory or hypothesis is falsifiable (or refutable) if it can be logically contradicted by an empirical test.
Popper contrasted falsifiability to the intuitively similar concept of verifiability that was then current in logical positivism. He argues that the only way to verify a claim such as "All swans are white" would be if one could theoretically observe all swans,[F] which is not possible. Instead, falsifiability searches for the anomalous instance, such that observing a single black swan is theoretically reasonable and sufficient to logically falsify the claim. On the other hand, the Duhem–Quine thesis says that definitive experimental falsifications are impossible[1] and that no scientific hypothesis is by itself capable of making predictions, because an empirical test of the hypothesis requires one or more background assumptions.[2]
The philosopher Michael Ruse defined the characteristics which constitute science as follows (see Pennock 2000, p. 5, and Ruse 2010):
It is guided by natural law;
It has to be explanatory by reference to natural law;
It is testable against the empirical world;
Its conclusions are tentative, i.e., are not necessarily the final word; and
It is falsifiable.
In his conclusion related to this criterion, Judge Overton stated that
While anybody is free to approach a scientific inquiry in any fashion they choose, they cannot properly describe the methodology as scientific, if they start with the conclusion and refuse to change it regardless of the evidence developed during the course of the investigation.
— William Overton, McLean v. Arkansas 1982, at the end of section IV. (C)
*
social norms, descriptive norms, injunctive norms,
decision making
core assumptions? core beliefs? cogn diss, war in a world with bad people, CIA, Lafarge, rape, sexual abuse, birth, genes
Thomas Bayes was an English minister who lived in the 18th century. Though Bayes was elected as a Fellow of the Royal Society and did publish during his lifetime, he did not achieve a good deal of influence until after his death; and today his influence is stronger than ever. Bayes’ influence comes mainly from a paper of his that was published after his death called ‘An Essay toward Solving a Problem in the Doctrine of Chances,’ which “concerned how we formulate probabilistic beliefs about the world when we encounter new data” (loc. 4123).
The paper was intended as a response to the famous philosopher and skeptic David Hume, who argued that we could not truly predict anything with any amount of certainty. This is the case, according to Hume, because all of our information about the world comes from past experience, and just because something happened in the past (even with great frequency) does not mean we can logically deduce that it will happen again in the future. For instance, our knowledge that the sun rises in the morning is derived from the fact that on all previous occasions the sun has risen in the morning. However, because our sample size is necessarily limited, we have no way of knowing whether this is a matter of necessity or simply chance (loc. 4135). This being the case, Hume “argued that since we could not be certain that the sun would rise again, a prediction that it would was inherently no more rational than one that it wouldn’t” (loc. 4135).
Bayes agreed with Hume that we can never predict anything with absolute certainty. However, he disagreed that this effectively made all prediction an irrational process. Instead, Bayes contended that prediction could be made rational by way of treating it as a matter of probability rather than certainty (loc. 4135). For instance, when it comes to the sun rising in the morning, we may never be able to predict with certainty that it will, but the more it happens, the more we are justified in raising the probability that it will: “Gradually, through this purely statistical form of inference, the probability [we] assign[] to [our] prediction that the sun will rise again tomorrow approaches (although never exactly reaches) 100 percent” (loc. 4128).
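The summary doesn't give the formula, but Laplace's rule of succession is the classical way of putting a number on the sunrise argument: after n sunrises and no failures, a flat prior gives (n + 1) / (n + 2) as the probability of another one. Quick sketch:

```python
def rule_of_succession(successes: int, trials: int) -> float:
    # Posterior predictive probability of another success under a flat Beta(1, 1) prior
    return (successes + 1) / (trials + 2)

for n in (1, 10, 1_000, 1_000_000):
    print(n, rule_of_succession(n, n))   # climbs toward, but never reaches, 1.0
```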
A famous example here is one involving breast cancer. About 1.4% of women develop breast cancer when they are in their 40’s (loc. 4196). One way to detect breast cancer is with a mammogram, but these tests are not foolproof. Specifically, if a woman has breast cancer, a mammogram will detect it about 75% of the time. If, on the other hand, she does not have breast cancer, a mammogram will still come up positive 10% of the time (loc. 4199). Let’s say a woman in her 40’s has a mammogram and it comes up positive. What are the chances that she has breast cancer? The answer is a lot less than you might think. It’s actually 10% (a number that Bayes’ theorem accurately comes up with [loc. 4201]).
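A quick check of the arithmetic with Silver's numbers (mine, not from the summary):

```python
prior, sensitivity, false_positive_rate = 0.014, 0.75, 0.10
posterior = prior * sensitivity / (prior * sensitivity + (1 - prior) * false_positive_rate)
print(posterior)   # ~0.096 -- roughly the 10% the summary quotes
```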
If you badly misjudged the probability here, you’re not alone. As Silver explains, “a recent study that polled the statistical literacy of Americans presented this breast cancer example to them—and found that just 3 percent of them came up with the right probability estimate” (loc. 4204). The reason why most of us tend to get problems like this wrong is that most of us just aren’t very good at intuitively recognizing how new information interacts with previously established probabilities to yield new probabilities. Our problem is that we tend to “focus on the newest or most immediately available information, and the bigger picture gets lost” (loc. 4210). Applying Bayes’ theorem prevents us from falling prey to this tendency, so this is one reason why approaching life through the lens of the theorem can be helpful. However, the benefits do not stop here.
In order to get at this little something extra, Silver points to the work of the psychology and political science professor Philip Tetlock. As Silver explains, “beginning in 1987, Tetlock started collecting predictions from a broad array of experts in academia and government on a variety of topics in domestic politics, economics, and international relations” (loc. 905). While Tetlock found that, on aggregate, the predictions of the experts were quite poor, he also found that some experts did better than others. When Tetlock looked into the cognitive styles and personality traits of the various experts he found that a clear pattern (we might call it a signal) emerged. Specifically, the more accurate predictors tended to have a particular set of cognitive strategies and personality traits that differed from the less accurate ones.
Tetlock organized his subjects along a spectrum with what he called ‘foxes’ on one end, and ‘hedgehogs’ on the other. The difference between foxes and hedgehogs can be summed up in the following way: “‘the fox knows many little things, but the hedgehog knows one big thing’” (loc. 949). More specifically, “hedgehogs are type A personalities who believe in Big Ideas—in governing principles about the world that behave as though they were physical laws and undergird virtually every interaction in society. Think Karl Marx and class struggle, or Sigmund Freud and the unconscious. Or Malcolm Gladwell and the ‘tipping point’” (loc. 955). Foxes, by contrast, “are scrappy creatures who believe in a plethora of little ideas and in taking a multitude of approaches toward a problem. They tend to be more tolerant of nuance, uncertainty, complexity, and dissenting opinion. If hedgehogs are hunters, always looking out for the big kill, then foxes are gatherers” (loc. 958).
While hedgehogs tend to be bold and brash, and express singular confidence in their predictions, foxes are much more cautious, as they consider numerous perspectives, carefully weighing their pros and cons. This being the case, foxes can often seem dithering and unsure of themselves (loc. 1006). As you might expect, then, hedgehogs make for much better television than foxes; and indeed, Tetlock found that the former garnered a lot more media attention than the latter (loc. 991-1006). However, when it came to the quality of their predictions, the hedgehogs were well outperformed by their foxier counterparts. As Silver notes, Tetlock found that “whereas the hedgehogs’ forecasts were barely any better than random chance, the foxes’ demonstrated predictive skill” (loc. 962).
Aside from being less susceptible to black and white thinking, and also overconfidence, the foxes had one other quality that allowed them to make better predictions than the hedgehogs. This was the fact that they tended to be less ideological in outlook, and to rely more on empirical evidence to help shape their opinions (loc. 980). Again, this harkens back to the fact that bias (here in the form of ideology) tends to interfere with the activity of formulating accurate predictions.
An algorithm brought him his first fame. His Pecota model, which stands for “Player Empirical Comparison and Optimisation Test Algorithm”, compared baseball players with other similar individuals in major-league history. Silver’s model looked past the commonly watched stats, such as batting average, and assigned greater weight to less-quoted ones, such as how often a batter gets on base, which correlated more to teams winning games.
Similarly, FiveThirtyEight’s model weighs up factors, from pollsters’ past accuracy to the religious and economic make-up of each state, then simulates the election 10,000 times to provide a probabilistic assessment of likely outcomes, based on polls going back to 1952. “We know that we’re going to get some of them wrong,” Silver cautions. The probability of any election victory is almost never 100 per cent. “You have a 70-30 bet, you’re supposed to get that wrong 30 per cent of the time.”
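This is not FiveThirtyEight's actual model, just a toy illustration of the "simulate the election 10,000 times" idea; the states, margins, and error sizes below are all made up:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical swing states: (electoral votes, polled margin for candidate A, assumed polling error sd)
swing_states = [(38, -2.0, 4.0), (29, 1.5, 4.0), (20, 3.0, 4.0), (16, -0.5, 4.0), (55, 8.0, 4.0)]
safe_a, safe_b = 150, 150                      # made-up "safe" electoral votes for each side
total_ev = safe_a + safe_b + sum(ev for ev, _, _ in swing_states)
n_sims, wins_a = 10_000, 0

for _ in range(n_sims):
    national_shift = rng.normal(0, 2.0)        # correlated polling error shared by all states
    ev_a = safe_a
    for ev, margin, sd in swing_states:
        if margin + national_shift + rng.normal(0, sd) > 0:
            ev_a += ev
    if ev_a > total_ev / 2:
        wins_a += 1

print(wins_a / n_sims)   # probability of an A win under these toy assumptions
```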
He is frustrated by people who prefer simple blue-or-red forecasts to such numerate nuance. “In baseball, it’s reached a healthy equilibrium where numbers-driven analysis is used in an appropriate way.” But politics is still far behind, he says: “I feel like we’re still fighting the Moneyball wars of 2003 and it might take another 10 years, if at all.”
The problem with political commentary is that it favours ideologues. CNN, stuck between Fox News on the right and MSNBC on the left, is struggling to revive its ratings because “the energy in politics is on the extremes,” Silver observes.
If most people struggle to interpret simple polls, will companies fare any better with the “big data” they are excitedly crunching? “I don’t think we’re on the edge of a singularity in terms of people becoming much more productive,” he says. In many cases, when he explores a new field and discovers how people misread the data about it, “it just kind of becomes depressing”.
Silver says he does not get on well with political reporters but is friends with media entrepreneurs such as Gawker’s Denton and Andrew Sullivan, the prominent blogger. His generation shares that entrepreneurial ambition, he says. “It used to be that you would idolise the guy who graduated at the top of his class from Harvard, and now you idolise the guy who drops out of Harvard to run a business,” he smiles. “I think these newspapers have a lot in common with Ivy League universities.” There is the prestige and the bright people “but there’s lots of internal politics. There are pockets of amazing things that are happening, but also pudgy bureaucratic cultures in other respects.”
Silver is not exactly dropping out; ESPN is a corporate giant with the resources to commercialise FiveThirtyEight more than ever. “I’m still pretty hungry,” he says, explaining that moving to ESPN was a decision to work “really, really hard for four more years” instead of coasting. “You could take a more relaxed route and kind of write now and then and travel a lot. That’s great, but I can do that in my fifties or sixties or seventies. Right now I want to build something while I’m still young.”