Normal distribution in Python
I work on theoretical statistics, and all of the work in my PhD was done in R. But for production code in industry, for the sake of speed and easier maintenance, I have taken to implementing some of my ideas in Python even when the prototyping is done in R. I did not expect to learn this much just by looking at the normal distribution.
scipy.stats vs math
Suppose we want to compute the CDF of a standard normal. An R user like me who is used to pnorm(z) would probably write this:
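Something along these lines, using the same scipy.stats interface that the timing code below relies on:

from scipy.stats import norm

norm.cdf(1.0)  # standard normal CDF at z = 1, roughly 0.8413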
But another way is to use the erf function, given by
\[ \text{erf}(z) = \frac{2}{\sqrt{\pi}} \int_0^z e^{-t^2} \,dt. \]
Since the standard normal CDF satisfies
\[ \Phi(z) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{z}{\sqrt{2}}\right)\right), \]
we can then write a thin wrapper around erf.
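A sketch of such a wrapper (the same definition reappears in the timing code below):

from math import erf

def phi(z):
    # Standard normal CDF via erf; the hard-coded constant is sqrt(2)
    return (1.0 + erf(z / 1.4142135623730951)) / 2.0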
In fact, this is the example given in the Python documentation for the math library.
For my purpose, this function needs to be run many times, so speed is definitely important here. We run it 10000 times:
from timeit import default_timer
from scipy.stats import norm
from math import erf
def phi(z):
    # Standard normal CDF via erf; the hard-coded constant is sqrt(2)
    return (1.0 + erf(z / 1.4142135623730951)) / 2.0
start = default_timer()
scipy_output = [norm.cdf(1) for i in range(10000)]
end = default_timer()
print("scipy took:", str(end - start))
scipy took: 0.9739241569999999
start = default_timer()
math_output = [phi(1) for i in range(10000)]
end = default_timer()
print("math took:", str(end - start))
math took: 0.011614696000000091
Computing using erf was a lot faster than using scipy, but that should not come as a surprise. Even in R, a loop without any vectorization is bound to be slow. After all, the strength of scipy is that it works well with numpy arrays. So let’s vectorize it:
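A sketch of the vectorized version, assuming the 10000 inputs are passed to norm.cdf as a single numpy array of ones (norm and default_timer come from the imports above):

import numpy as np

start = default_timer()
scipy_output = norm.cdf(np.ones(10000))  # one vectorized call instead of 10000 scalar calls
end = default_timer()
scipy_output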
array([0.84134475, 0.84134475, 0.84134475, ..., 0.84134475, 0.84134475,
0.84134475])
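print("scipy took:", str(end - start))  # timing for the vectorized call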
scipy took: 0.013408950000000086
Too bad that in my use case the CDF of the normal distribution has to be computed inside a loop, so I will go with the erf route…
erf vs erfc
…which brings us to another question: how accurate is computing the CDF based on erf? The step where we add the value of erf to 1.0 means that anything close to or smaller than the machine epsilon gets rounded away. What if we do care about the magnitude of the CDF far out in the tail of the normal? With the time constraint, we cannot really call norm.logcdf here.
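With the erf-based phi defined above, evaluating the left tail shows the loss of precision:

[phi(-z) for z in range(5, 15)]  # Phi(-5), ..., Phi(-14) using the erf-based phi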
[2.8665157186802404e-07, 9.865876449133282e-10, 1.2798095916366492e-12, 6.106226635438361e-16, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
We can instead use erfc, defined as
\[ \text{erfc}(z) = 1 - \text{erf}(z). \]
Now we have
from math import erfc
def phi(z):
    # Standard normal CDF via the complementary error function
    return erfc(-z / 1.4142135623730951) / 2.0
[phi(-z) for z in range(5, 15)]
[2.866515718791945e-07, 9.865876450377014e-10, 1.2798125438858348e-12, 6.22096057427182e-16, 1.1285884059538425e-19, 7.619853024160593e-24, 1.9106595744986828e-28, 1.7764821120777016e-33, 6.117164399549921e-39, 7.793536819192798e-45]
This preserves many more digits. It is curious that the Python documentation chose to use erf rather than erfc for this example. To check if these numbers are correct, I run the same computation in R:
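Presumably with something like this one-liner:

pnorm(-(5:14))  # standard normal CDF at -5 through -14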
[1] 2.866516e-07 9.865876e-10 1.279813e-12 6.220961e-16 1.128588e-19
[6] 7.619853e-24 1.910660e-28 1.776482e-33 6.117164e-39 7.793537e-45
which gave exactly the same numbers. Did R perhaps implement pnorm using erfc as well?