A collection of many libraries to make numerical computation easier.
NumPy: $N$-dimensional arrays, fast linear algebra
SciPy: Optimization, interpolation, statistics, special functions, …
SymPy: Symbolic computation
MatPlotLib: Data visualization, plotting
Pandas: Data analysis
Various others: TensorFlow, PyTorch for Machine Learning, …
Python Sucks
Python is slow (interpreted, dynamically typed).
Parallelization limited by the Global Interpreter Lock
Long live Python
Use multiprocessing / memhive to sidestep the Global Interpreter Lock
Use fast optimized libraries to do the bulk of the work. (NumPy
, etc)
Python provides an easy to learn/use interface.
Python list
: One dimensional, can expand / grow.
NumPy array
(aka ndarray
): Fixed size; multiple dimensions.
>>> a = np.array( [ [1,2,3], [4,5,6] ] )
>>> a
array([[1, 2, 3],
[4, 5, 6]])
>>> a.shape
(2, 3)
>>> a[1,2]
np.int64(6)
Array operations: Operators act on arrays element-wise:
>>> a ** 2
array([[ 1, 4, 9],
[16, 25, 36]])
>>> a + 1
array([[2, 3, 4],
[5, 6, 7]])
Slicing works as expected
a[1, :] # [4, 5, 6] (second row)
a[:, 0] # [1, 4] (first column)
a[1] # Equivalent to a[1, :]
Use a[..., m:n:s, ...]
in any position
m
(inclusive) to n
(exclusive).Iterating is done over the first axis:
for r in a: # iterates over rows a[0], a[1], ...
for x in a[1]: # interates over column a[1, 0], a[1, 1], ...
math
functions don’t act on arrays:
math.sin(a) # Error
NumPy
functions act on arrays element wise:
np.sin(a) # gives [ sin(a[0]), sin(a[1]), ... ]
NumPy
functions also act on scalars
np.sin(0) # works
math.sin(0) # works
See NumPy
documentation on universal functions
Regular python
for i in range( N ):
x = 0.1*i/N
plot( x, x*math.sin(1/x) )
# Division by 0 exception
NumPy
:
xx = np.linspace( 0, 2*pi, N )
plot( xx, xx*np.sin(1/xx) )
# Runtime warning, but works
Option 1: %pylab inline
(Quick and dirty MatLab
type environment)
matplotlib.pylab
and numpy
sum
with numpy.sum
Option 2: Import the functions you want.
%matplotlib inline
# Another option is `%matplotlib widget`
import numpy as np, matplotlib as mpl, scipy as sp
from matplotlib import pylab, mlab, pyplot as plt
from matplotlib.pylab import plot, scatter, contour
from tqdm.notebook import tqdm, trange
from numpy import sqrt, pi, exp, log, floor, ceil, sin, cos
from numpy.linalg import norm
rng = np.random.default_rng()
$U \sim \Unif([0,1]^2)$
$\P( \abs{U} < 1 ) = \frac{\pi}{4}$
Compute $\displaystyle \frac{1}{N} \sum_1^N \one_{\set{\abs{U_i} < 1}}$.
Set N = 10**7
“First” implementation: (took 33.31s)
n_inside = 0
for i in trange(N):
(U1, U2) = (rng.uniform(), rng.uniform())
if sqrt(U1**2 + U2**2) < 1:
n_inside += 1
Implementation with NumPy
functions:
(took 0.195s)
U = rng.uniform( size=(2, N) )
n_inside = np.sum( norm( U, axis=0 ) < 1 )
Use library functions as much as you can.
Avoid Python loops as much as possible.
Easy to set $N = 10^9$, and run U = rng.uniform( size=(2, N) )
Takes 15Gb
memory to store. May crash your computer
Run code in chunks:
N = 10**9
M = 1000
n_inside = 0
for i in trange(M):
U = rng.uniform( size=(2, N//M) )
n_inside += np.sum( norm( U, axis=0 ) < 1 )
Can trivially parallelize chunks.
for n in trange( n_iters ):
U = rng.uniform( size=(2, M, n_trials) )
n_inside = np.sum( norm( U, axis=0 ) < 1, axis=0 )
hat_π[n] = 4*n_inside
hat_π = (np.cumsum( hat_π, axis=0 ).T / NN).T
# Use `polyfit` to fit $\log(σ)$ to $a + b N$.
plt.loglog( NN, σ, label='σ' )
a = np.polyfit( log(NN), log(σ), 1 )
plt.loglog( NN, exp( a[0]*log(NN) + a[1] ), '--',
alpha=.5, label=f'${exp(a[1]):.2f}\\, N^{{{a[0]:.3f}}}$' )
Suppose $U \sim \Unif( [0, 1]^d )$. Then $\displaystyle \P( \abs{U} < 1 ) = \frac{\vol( B(0, 1) )}{2^d}$
Exact formula: $\displaystyle \vol(B(0, 1)) = \frac{\pi^{d/2}}{\Gamma(\frac{d}{2}+1)}$
Easy to implement:
for d in trange( 1, D ):
vol[d] = pi**(d/2) / sp.special.gamma( d/2 + 1 )
U = rng.uniform( size=(d, N) )
n_inside = np.sum( norm(U, axis=0) < 1 )
hat_vol[d] = 2**d * n_inside/N
Simulations don’t find even one point inside $B(0,1)$ in high dimensions.
A bad implementation is to:
Sample one $X \sim p$, one $U \sim \Unif([0, 1])$.
Set color=red
if $Mp(X) > q(X)$.
Set color=blue
otherwise.
plot( X, M p(X), color )
Repeat 1,000 times.
Try to vectorize this and plot without looping.
Say X
has N
samples from some distribution.
Problem: Numerically approximate the CDF and PDF.
Algorithm for computing the CDF:
Say the points are $x_1, \dots, x_N$ with $x_1 < x_2 \cdots < x_N$.
Know $F(x_k) = k/N$. Find $F$ at other points by interpolating.
Implement this so you get acquainted with NumPy/SciPy:
s
to get good results.Data: 10,000 samples from $N(0, 1)$
Option 1: Differentiate the CDF (use UnivariateSpline.derivate)
Option 2: Create a histogram.
numpy.histogram divides the data into bins
Assume the PDF at bin midpoints is proportional to the counts
Interpolate and smooth.
Jupyter Notebooks make it easy to run simulations interactively.
Online documentation is excellent.
Shift+Tab
/ Ctrl+I
/ right click for contextual help
Use library functions. Don’t reinvent the wheel.
scipy.random
samples from most common distributionsscipy.interpolate
has curve fitting / interpolation functionsscipy.linalg
, numpy.linalg
various linear algebra functionsUse vectorized functions; avoid loops as much as possible.
Notebook with code from these slides is here.