I discovered numba a few months ago, and I haven’t looked back since. I wouldn’t consider myself a numba expert, as there are plenty of capabilities I’m still wrapping my head around (e.g. @overload decorators), but there are some use cases where numba can be really useful. If you find yourself doing a lot of numpy work, this is for you!
If you have:
- 30 seconds: @numba.jit(nopython=True) your functions if they’re purely numpy-driven. It can take a bit of effort, though (see the sketch just below).
- 5 minutes: read on.
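To make the 30-second version concrete, here’s a minimal sketch (sum_of_squares is a made-up function just for illustration, not something we’ll use later):

import numba
import numpy as np

@numba.jit(nopython=True)
def sum_of_squares(x):
    # an explicit loop over a numpy array: exactly the kind of code numba compiles well
    total = 0.0
    for value in x:
        total += value * value
    return total

sum_of_squares(np.arange(1000, dtype=np.float64))

The first call triggers compilation; after that, the loop runs as machine code rather than through the Python interpreter.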
Okay, what is numba, exactly?
Numba is a “just-in-time” (JIT) compiler, which essentially means that a function you create will be compiled and can run independently of the Python interpreter. Simply put, if your code involves lots of numpy and math, chances are you can speed it up even more.
That sounds great, but…
Anecdotally, I’ve seen roughly 100–200x performance gains for some of the functions I’ve written with numba. The beauty of it is that it involves almost no work, provided that the code is already mostly numpy and that numba is compatible with your implementation. That compatibility is the catch that I think the 5-minute numba docs don’t cover.
numba is heavily typed and has its own specific implementations and signatures of numpy functions. In other words, you might have found that you can “get away” with calling numpy in particular ways in “normal” Python, but upon compiling with numba, you will run into some very surprising errors. You could argue that this is helpful, though.
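Here’s a small sketch of the kind of mismatch I mean (column_means and the toy DataFrame are made up for illustration): numba’s nopython mode only understands a limited set of types, so it will happily accept a plain numpy array but not a pandas DataFrame, which is why you’ll see .values being passed around later in this post.

import numba
import numpy as np
import pandas as pd

@numba.jit(nopython=True)
def column_means(values):
    # plain numpy operations on a plain numpy array: fine in nopython mode
    return np.sum(values, axis=0) / values.shape[0]

df = pd.DataFrame(np.random.rand(10, 3))

column_means(df.values)  # works: a 2D float64 array
# column_means(df)       # TypingError: numba can't type a DataFrame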
Example
The example we’re going to cover is the RAPDF function from Samudrala and Moult. RAPDF represents a sort of log-odds score for two atoms of types $i$ and $j$ coming into contact in a specific distance bin $d$:
$$\textrm{RAPDF} = -\log\left(\dfrac{\dfrac{n(d_{ij})}{\sum_d n(d_{ij})}}{\dfrac{\sum_{ij} n(d_{ij})}{\sum_d \sum_{ij} n(d_{ij})}}\right)$$
This is the kind of data that RAPDF would typically be applied to; I’ve simplified the contacts to amino acids rather than atoms, but in practice it would involve the same computation:
import pandas as pd
# this is just a randomly generated contact map
# each cell shows number of contacts between 2 amino acids
# at a given distance bin (e.g. 0-3 Angstroms)
df = pd.read_csv("contact_map.tsv", sep='\t', index_col = 0)
df.head(5)
If you’re curious: the top fraction (the numerator) represents the number of contacts between types $i$ and $j$ in distance bin $d$, divided by the total number of contacts between types $i$ and $j$ across all distance bins. This is then divided by the bottom fraction (the denominator), where the number of contacts between all atom types at a specific distance bin $d$ is divided by all contacts across all distance bins.
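As a quick worked example with made-up counts: say types $i$ and $j$ have $n(d_{ij}) = 5$ contacts in bin $d$ and 50 contacts across all bins, while all pairs together have 200 contacts in bin $d$ out of 4000 in total. Then:
$$\textrm{RAPDF} = -\log\left(\dfrac{5/50}{200/4000}\right) = -\log\left(\dfrac{0.1}{0.05}\right) = -\log 2 \approx -0.69$$
The negative score reflects that this pair shows up in bin $d$ more often than the background distribution would suggest.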
You may have noticed some 0s in the data; since we’re going to compute a log later, we’ll add a pseudocount of 0.0001.
df = df+1e-4
df.head(5)
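To see why the pseudocount matters (just a quick illustration, not part of the contact map data):

import numpy as np

np.log(0)     # -inf (with a RuntimeWarning): this would wreck the scores
np.log(1e-4)  # roughly -9.21: a large penalty, but at least finite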
Now to the code
We can code RAPDF like so:
import numpy as np

def rapdf(contact_map: np.ndarray):
    all_contact_d = np.sum(contact_map, axis=0)
    all_contact_all_d = np.sum(contact_map)
    # this is constant per distance bin, so only calculate once!
    denominator = all_contact_d / all_contact_all_d
    scores = np.zeros(contact_map.shape)
    for i, row in enumerate(contact_map):
        # this just ensures we don't divide by 0
        rowsum = max(row.sum(), 1)
        for j, cell in enumerate(row):
            numerator = cell / rowsum
            scores[i][j] = -np.log(numerator / denominator[j]).round(2)
    return scores
Now we can see it in action:
from time import time
t1 = time()
scores = rapdf(df.values)
t2 = time()
print(scores, "This took {} microseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23 0.65 -0.82 0.2 10.55]
[-0.84 -0.19 0. 0.73 10.55 0. ]
[-0.52 0.64 1.3 10.55 -0.54 -0.65]
...
[10.73 0.48 -0.4 -0.62 -0.59 1.54]
[-0.14 1.71 -0.27 -0.49 -0.57 10.88]
[-0.14 -0.08 0.98 -0.24 0.24 -0.27]] This took 13474.225997924805 microseconds
Cool, we can also use the timeit magic function to do this with more runs:
%timeit rapdf(df.values)
7.64 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
~7.6 ms (about 7600 microseconds) isn’t bad, but this is where numba will shine. To start, we just need to apply a decorator, @numba.jit(nopython=True). This tells numba to compile the function in nopython mode, meaning maximum performance gains; without that flag, I’ve found that the performance gains weren’t really worth the effort.
import numba

# This is a decorator that tells numba to compile this function in "no python mode".
@numba.jit(nopython=True)
def rapdf(contact_map: np.ndarray):
    all_contact_d = np.sum(contact_map, axis=0)
    all_contact_all_d = np.sum(contact_map)
    # this is constant per distance bin, so only calculate once!
    denominator = all_contact_d / all_contact_all_d
    scores = np.zeros(contact_map.shape)
    for i, row in enumerate(contact_map):
        # this just ensures we don't divide by 0
        rowsum = max(row.sum(), 1)
        for j, cell in enumerate(row):
            numerator = cell / rowsum
            scores[i][j] = -np.log(numerator / denominator[j]).round(2)
    return scores
To start with, let’s use the exact same definition as above; then we can call rapdf again:
t1 = time()
scores = rapdf(df.values)
t2 = time()
print(scores, "This took {} microseconds".format((t2-t1)*(10**6)))
...
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'round' of type float64
File "<ipython-input-6-e7160609fabf>", line 22:
def rapdf(contact_map: np.ndarray):
<source elided>
numerator = (cell/rowsum)
scores[i][j] = -np.log(numerator/denominator[j]).round(2)
^
This was half-expected. From experience, there will always be something that numba isn’t compatible with; in this case, it’s the round method. We can “get rid of it” with a small trick:
@numba.jit(nopython=True)
def rapdf(contact_map: np.ndarray):
    # The rest of the function has remained identical
    ...
        for j, cell in enumerate(row):
            numerator = cell / rowsum
            # notice where the round operator has gone now!
            scores[i][j] = np.round(-np.log(numerator / denominator[j]), 2)
    return scores
And now, will it run?
t1 = time()
scores = rapdf(df.values)
t2 = time()
print(scores, "This took {} microseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23 0.65 -0.82 0.2 10.55]
[-0.84 -0.19 0. 0.73 10.55 0. ]
[-0.52 0.64 1.3 10.55 -0.54 -0.65]
...
[10.73 0.48 -0.4 -0.62 -0.59 1.54]
[-0.14 1.71 -0.27 -0.49 -0.57 10.88]
[-0.14 -0.08 0.98 -0.24 0.24 -0.27]] This took 1207432.9853057861 microseconds
Now it works! You might be thinking: wait a minute, this was actually slower than before! The reason is that numba had to compile your function the first time you called it, and that’s where most of the time went. In fact, if you now call rapdf again, you’ll see:
t1 = time()
scores = rapdf(df.values)
t2 = time()
print(scores, "This took {} microseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23 0.65 -0.82 0.2 10.55]
[-0.84 -0.19 0. 0.73 10.55 0. ]
[-0.52 0.64 1.3 10.55 -0.54 -0.65]
...
[10.73 0.48 -0.4 -0.62 -0.59 1.54]
[-0.14 1.71 -0.27 -0.49 -0.57 10.88]
[-0.14 -0.08 0.98 -0.24 0.24 -0.27]] This took 232.93495178222656 microseconds
And if we do this over several runs:
%timeit rapdf(df.values)
28.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Now look at that! It’s shrunk down to about 28 microseconds, compared to the >7 milliseconds we had before; that’s roughly a 270-fold speedup. You can imagine that if you had code you’d call over and over again, this is where numba could be so useful.
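One last note, as a sketch rather than something I benchmarked above: the one-off compilation cost doesn’t have to be paid at an awkward moment. numba can cache the compiled machine code to disk with cache=True, or you can simply “warm up” the function once right after defining it (the normalise function below is made up for illustration):

import numba
import numpy as np

# cache=True writes the compiled machine code to disk, so later Python
# sessions can skip the expensive first-call compilation.
@numba.jit(nopython=True, cache=True)
def normalise(contact_map):
    return contact_map / np.sum(contact_map)

# Alternatively (or additionally), call the function once on representative
# input straight away, so the compilation cost is paid up front.
normalise(np.random.rand(20, 6))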