# Normal distribution in Python

Working on theoretical statistics, all of my work in my PhD was done in R. But for production code in industry, for the sake of speed and easier maintenance, I have taken up to implement some of my ideas in Python even when the prototyping is done in R. I did not expect to be learning this much by just looking at a normal distribution.

## scipy.stats vs math

Suppose we want to compute the CDF of a standard normal. An R user like me who is used to `pnorm(z)`

would probably write this

But another way is to use the `erf`

function, given by

\[ \text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \,dt. \]

We can then write

In fact this is the given example in the Python documentations for the `math`

library.

For my purpose, this function needs to be run many times, so speed is definitely important here. We run it for 10000 times:

```
from timeit import default_timer
from scipy.stats import norm
from math import erf
def phi(z):
return (1.0 + erf(z / 1.4142135623730951)) / 2.0
start = default_timer()
scipy_output = [norm.cdf(1) for i in range(10000)]
end = default_timer()
print("scipy took:", str(end - start))
```

`scipy took: 0.9739241569999999`

```
start = default_timer()
math_output = [phi(1) for i in range(10000)]
end = default_timer()
print("math took:", str(end - start))
```

`math took: 0.011614696000000091`

Computing using `erf`

was a lot faster than using `scipy`

, but that should not come as a surprise. Even in R, a loop without any vectorization is bound to be slow. After all, the strength of `scipy`

is that it works well with `numpy`

arrays. So let’s vectorize it:

```
array([0.84134475, 0.84134475, 0.84134475, ..., 0.84134475, 0.84134475,
0.84134475])
```

`scipy took: 0.013408950000000086`

Too bad that in my use case, the CDF of the normal distribution has to be computed in a loop, so I will go with the `erf`

route…

## erf vs erfc

…which brings us to another question. How accurate is computing the CDF based on erf? The part where we add `erf`

to `1.0`

means that anything that is close to or smaller than the machine epsilon will get erased. What if we do care about the magnitude of the CDF far in the tail of the normal? With the time constraint, we cannot really call `norm.logcdf`

here.

`[2.8665157186802404e-07, 9.865876449133282e-10, 1.2798095916366492e-12, 6.106226635438361e-16, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]`

We can instead use `erfc`

, defined as

\[ \text{erfc}(z) = 1 - \text{erf}(z). \]

Now we have

```
from math import erfc
def phi(z):
return erfc(-z / 1.4142135623730951) / 2.0
[phi(-z) for z in range(5, 15)]
```

`[2.866515718791945e-07, 9.865876450377014e-10, 1.2798125438858348e-12, 6.22096057427182e-16, 1.1285884059538425e-19, 7.619853024160593e-24, 1.9106595744986828e-28, 1.7764821120777016e-33, 6.117164399549921e-39, 7.793536819192798e-45]`

It preserved many more digits. It is curious how the Python documentations chose to use this example for `erf`

instead of `erfc`

. To check if this is correct, I run in R:

```
[1] 2.866516e-07 9.865876e-10 1.279813e-12 6.220961e-16 1.128588e-19
[6] 7.619853e-24 1.910660e-28 1.776482e-33 6.117164e-39 7.793537e-45
```

which gave the exactly same numbers. Did R perhaps implement `pnorm`

using `erfc`

as well?