# Philosophy of statistics, or data science

I’m catching up on my reading list finally and read Stark (2022) this weekend. I don’t think I understood all the subtleties in the paper, but it nonetheless made me take a step back and think a bit about what we are doing when we are doing statistics (or data science).

In my opinion, there are four main things covered in Stark (2022):

- We like to be quantitative even when it is impossible to be quantitative.
- We answer the wrong question.
- What is probability?
- Some models are built but rarely falsified.

## Being quantitative

When faced with different choices, a reasonably quantitative approach is to compare the change in total utility. Even if a decision involves multiple metrics with different scales, units, or importance, all tradeoffs come down to a utility function.^{1} These different metrics correspond to the attributes in the sandwich example. Stark (2022) makes a good point that this assumption is deceptively innocuous. However, I don’t find the failure of the cancellation axiom a convincing argument that we cannot find a utility function. Perhaps we should not expect utility to be additive, in which case certain combinations (peanut butter with jelly) are simply a lot better than peanut butter or jelly alone. The second argument, that humans generally prefer receiving $1m for sure to receiving $20m with a 10% chance, also didn’t convince me that a utility function cannot exist.^{2}

^{1} It does not mean that we know this function, but it at least allows us to assume local linearity, and justify tradeoffs.

^{2} See this tweet and the discussion therein. For the record, I think the correct answer is to sell this button, or pool enough money from a large group of individuals so that clicking the green button makes sense.
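To illustrate why risk aversion alone does not rule out a utility function, here is a minimal sketch. It assumes a logarithmic utility of total wealth and a hypothetical starting wealth of $50,000; both are assumptions made for illustration and are not taken from Stark (2022). Under these assumptions the sure $1m has the higher expected utility, even though the gamble has twice the expected payout.

```python
import math


def log_utility(wealth: float) -> float:
    """Concave utility of total wealth (an illustrative assumption)."""
    return math.log(wealth)


def expected_utility(lottery, initial_wealth):
    """lottery: list of (probability, payout) pairs whose probabilities sum to 1."""
    return sum(p * log_utility(initial_wealth + payout) for p, payout in lottery)


INITIAL_WEALTH = 50_000  # hypothetical starting wealth in dollars

sure_thing = [(1.0, 1_000_000)]
gamble = [(0.1, 20_000_000), (0.9, 0)]

eu_sure = expected_utility(sure_thing, INITIAL_WEALTH)
eu_gamble = expected_utility(gamble, INITIAL_WEALTH)
expected_payout_gamble = sum(p * payout for p, payout in gamble)

print(f"expected payout of the gamble: ${expected_payout_gamble:,.0f}")
print(f"expected utility, sure $1m:    {eu_sure:.3f}")
print(f"expected utility, gamble:      {eu_gamble:.3f}")
```

The ranking depends on the assumed starting wealth: a sufficiently wealthy agent with the same utility would take the gamble, which is one way to read the "pool enough money" remark in the footnote.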

However, I think the point stands, based on an argument from Mark Tygert against a sensible utility function. Kloumann and Tygert (2020) give a good summary of Allais’s paradox. Consider two different decisions:

- choose between a guaranteed 1 billion dollars, or getting 1 billion with probability 89%, 2 billion with probability 10% and nothing with probability 1%;
- choose between getting 1 billion dollars with probability 11% (and nothing otherwise), or getting 2 billion with probability 10% (and nothing otherwise).

If your decision is to pick the first option for decision 1, but the second option for decision 2, you are “inconsistent” even if we allow for potentially non-linear utility: Using \(U(x)\) to denote the utility for getting \(x\) billion dollars, then picking the first option for decision 1 amounts to

\[\begin{eqnarray} U(1) & > & 0.89 \cdot U(1) + 0.1 \cdot U(2) + 0.01 \cdot U(0) \\ 0.11 \cdot U(1) & > & 0.1 \cdot U(2) + 0.01 \cdot U(0), \end{eqnarray}\]

but picking the second option for decision 2 amounts to

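\[\begin{eqnarray} 0.11 \cdot U(1) + 0.89 \cdot U(0) & < & 0.1 \cdot U(2) + 0.9 \cdot U(0) \\ 0.11 \cdot U(1) & < & 0.1 \cdot U(2) + 0.01 \cdot U(0). \end{eqnarray}\]

The final inequalities of the two decisions compare the same two quantities in opposite directions, so no utility function \(U\) can satisfy both.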

This points either to our inability to calculate well on the fly, or to the impossibility of a reasonable utility function here. Notice that we are not even looking at multiple metrics: only one metric (money) is involved, and so Stark’s point definitely stands.
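The inconsistency can also be checked by brute force. The sketch below draws random increasing utilities \(U(0) < U(1) < U(2)\) and counts how many of them rationalize both choices at once. The helper names, the number of trials, and the use of exact `Fraction` arithmetic (to avoid floating-point ties) are choices made for this illustration only.

```python
import random
from fractions import Fraction


def prefers_sure_billion(u0, u1, u2):
    """Decision 1: a guaranteed 1 billion beats the 89% / 10% / 1% lottery."""
    lottery = Fraction(89, 100) * u1 + Fraction(10, 100) * u2 + Fraction(1, 100) * u0
    return u1 > lottery


def prefers_two_billion_gamble(u0, u1, u2):
    """Decision 2: 2 billion at 10% beats 1 billion at 11%."""
    one_billion = Fraction(11, 100) * u1 + Fraction(89, 100) * u0
    two_billion = Fraction(10, 100) * u2 + Fraction(90, 100) * u0
    return one_billion < two_billion


rng = random.Random(0)
n_trials = 20_000
n_first = n_second = n_both = 0
for _ in range(n_trials):
    # Three distinct integers, sorted, give U(0) < U(1) < U(2).
    u0, u1, u2 = sorted(rng.sample(range(10**6), 3))
    first = prefers_sure_billion(u0, u1, u2)
    second = prefers_two_billion_gamble(u0, u1, u2)
    n_first += first
    n_second += second
    n_both += first and second

print(f"utilities rationalizing choice 1 only counted in: {n_first}")
print(f"utilities rationalizing choice 2 only counted in: {n_second}")
print(f"utilities rationalizing both choices:             {n_both}")
```

Each choice is rationalized by some of the sampled utilities, but none rationalizes both, in line with the algebra above. A random search is of course not a proof; the proof is the pair of opposite inequalities.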

## Answering the wrong question

Stark (2022) then moves on to a disconnect between the scientific null hypothesis and the statistical null hypothesis. An example is analyzing how many birds the installation of wind turbines kills. An analyst may set up a zero-inflated Poisson regression model and estimate some of its coefficients, which then become the quantity of interest. While the coefficients may have little to do with birds killed, I do not think this is necessarily a bad thing: anyone approaching this data has to start somewhere, and if we know why a coefficient carries no practical meaning, then we should have baked the missing complexity into the model.

I also have doubts about another example given: a standard randomized controlled trial. The scientific null hypothesis is that there is no average treatment effect, while the statistical null hypothesis is that the sample means of the two groups are independent and normally distributed, so that the *t*-test applies. Stark also points out in the paper that a permutation test converges to a *t*-test under mild conditions. I cannot speak for others, but I certainly have the scientific null hypothesis in mind when I run a *t*-test. The translation from a scientific null hypothesis to a statistical null hypothesis is always going to require extra modelling assumptions, and I’d say the assumptions that go into a *t*-test are pretty bare-bones already.^{3}

^{3} Except for cursed heavy-tailed data.
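The closeness of the two tests is easy to see on simulated data. The sketch below compares a permutation *p*-value with the *p*-value of the Welch statistic, using a normal approximation in place of the *t* distribution (reasonable for groups this large, and it keeps the example within the standard library). The simulated data, group sizes, and number of permutations are all assumptions made for illustration.

```python
import math
import random
import statistics


def welch_normal_p_value(a, b):
    """Two-sided p-value for the Welch statistic, with a normal approximation
    standing in for the t distribution."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.fmean(a) - statistics.fmean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2))


def permutation_p_value(a, b, n_permutations, rng):
    """Two-sided p-value obtained by randomly re-assigning group labels."""
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


rng = random.Random(0)
control = [rng.gauss(0.0, 1.0) for _ in range(200)]
treated = [rng.gauss(0.2, 1.0) for _ in range(200)]

p_normal = welch_normal_p_value(treated, control)
p_perm = permutation_p_value(treated, control, n_permutations=2000, rng=rng)

print(f"normal-approximation p-value: {p_normal:.4f}")
print(f"permutation p-value:          {p_perm:.4f}")
```

The permutation test only assumes that treatment labels were assigned at random, which is why it sits closer to the scientific null hypothesis; the agreement between the two numbers is what makes the *t*-test a convenient shortcut.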

Finally, the paper introduces the concept of *displacement*. It reminds me of *reification* in psychiatry (see e.g. Hyman 2010). Take schizophrenia as an example: we have some rough sense of what it entails, but the diagnosis is mostly done by ratings. The rating mostly matches our intuition of what the disorder is, but carries no practical meaning. Yet it has become the quantity of interest: treatment research attempts to lower this rating, and a decrease in rating is taken as treating schizophrenia. As I said earlier, these analyses have to start somewhere, but I often wonder whether it’s possible to treat the metric but not the disorder, analogous to Goodhart’s law.

## What is probability?

This was definitely a topic covered in grad school when we philosophized about statistics! Specifically, the terms *aleatory* and *epistemic* were used in one of the introductory grad-level statistics classes. Aleatory refers to randomness coming from a mechanism, whereas epistemic refers to subjective randomness stemming from our ignorance. The distinction between the two types of probability seems to be clear to most, but it is quite blurry to me. A typical instance of aleatory probability would be a die roll. But to me, the unpredictability of the outcome is still a limitation of our knowledge or computational ability. The weight distribution of the die, or the aerodynamics as it is cast, can all be modelled and thus predicted. Except perhaps in quantum mechanics,^{4} we have to take a stand on what information to include in a model and what to discard as randomness or “noise”.

^{4} where we believe there is never precise knowledge of quantities

Another interesting perspective is rates vs probability: “but the mere fact that something has a rate does not mean that it is the result of a random process”. Probability is definitely a more precise concept than rates: if an event occurs with 1% probability, we will observe roughly one occurrence every 100 trials. And perhaps the common argument for equating the two is ignorance. I really liked the question quoted from LeCam, “what is the probability that the \(10^{137} + 1\) digit of \(\pi\) is a 7?” We would probably guess 10%, but that is a nonsensical answer. No probability is involved in defining \(\pi\), and so the probability really only comes from our own ignorance and a firm belief in some form of ergodicity. Curiously, this ergodicity is what powers all modern applications of probability and statistics via random number generators (RNGs). We only think an RNG is uniformly random because it is supposed to be ergodic, or in the case of an RNG based on a hash function, we:

- refuse to compute the inverse function;
- feign ignorance of the seed;
- rely on histograms (read: rates) of the output, or histograms of short sequences of output, to claim uniformity.

But there is still nothing random about these outputs: given the seed, the output is fully determined, like the \(10^{137} + 1\) digit of \(\pi\).
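A toy hash-based generator makes this concrete. The sketch below is an illustration only, not how any particular library implements its RNG: it hashes a seed together with a counter using SHA-256 and maps the result to \([0, 1)\). The function name, the seed, and the sample size are assumptions made for the example.

```python
import hashlib
from collections import Counter


def hash_rng(seed: bytes, n: int):
    """Yield n floats in [0, 1) by hashing the seed together with a counter."""
    for counter in range(n):
        digest = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        # Keep the top 53 bits of the digest so the float is exactly representable.
        top_bits = int.from_bytes(digest[:8], "big") >> 11
        yield top_bits / 2**53


n_samples = 10_000
first_run = list(hash_rng(b"my seed", n_samples))
second_run = list(hash_rng(b"my seed", n_samples))

# Same seed, same output: nothing random happened.
print("identical runs:", first_run == second_run)

# Rates of the first decimal digit, which is all the evidence of uniformity we have.
digit_counts = Counter(int(u * 10) for u in first_run)
digit_rates = {digit: digit_counts[digit] / n_samples for digit in range(10)}
for digit, rate in digit_rates.items():
    print(digit, f"{rate:.3f}")
```

The histogram looks uniform, yet every value is a fixed function of the seed; the "probability" lives entirely in our decision not to look at it.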

## Some models are rarely falsified

I think the only way a model can be falsified is by comparing its predictions to real outcomes. For models that give a lot of predictions, e.g. hourly or daily weather forecasts, or stock price models, this should be easy. For models whose predictions are not that important, falsification is not a priority for humanity. So the models that Stark (2022) really wants to focus on must be ones that give few but consequential predictions. I think an example similar to climate change models would be election forecasts: certainly consequential, and each election really only happens once. Predictions that come with error bars help, but the outcome that everyone cares about (the winner of the presidential election) is a binary variable,^{5} where falsification only happens if the model assigned very high probability to the candidate who lost. So perhaps falsification is also tied to the outcome of interest. And if the outcome of interest is hard to falsify, why do we bother with predicting it?

^{5} at least in a two-party system
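One crude way to formalize this is to treat the forecast as a null hypothesis and reject it only if it gave the observed outcome a probability below some threshold. The 5% threshold, the function name, and the example forecasts below are all assumptions made for illustration.

```python
def is_falsified(prob_candidate_a: float, candidate_a_won: bool, alpha: float = 0.05) -> bool:
    """Reject the forecast if it gave the observed outcome a probability below alpha."""
    prob_of_outcome = prob_candidate_a if candidate_a_won else 1.0 - prob_candidate_a
    return prob_of_outcome < alpha


# Hypothetical forecasts for candidate A, who then goes on to lose.
forecasts = [0.55, 0.70, 0.90, 0.97]
verdicts = {p: is_falsified(p, candidate_a_won=False) for p in forecasts}

for p, rejected in verdicts.items():
    print(f"forecast {p:.2f} for the loser -> falsified: {rejected}")
```

Under this rule, a single election can only falsify the most confident of these forecasts; the 55%, 70%, and 90% models all survive being wrong.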

## References

Hyman, S. E. (2010). The diagnosis of mental disorders: The problem of reification. *Annual Review of Clinical Psychology*, 6, 155–179. https://doi.org/10.1146/annurev.clinpsy.3.022806.091532.