I’ve discovered numba a few months ago, and haven’t looked back since. I wouldn’t consider myself a numba expert as there’s lots of capabilities I’m still wrapping my head around (e.g. @overload decorators), but there’s some use cases where Numba could be really useful. If you find you’re doing a lot of numpy for your work, this is for you!

If you have:

  • 30 seconds: @numba.jit(nopython=True) your functions if it’s purely numpy driven. Can be a bit of effort though.
  • 5 minutes: read on.

Okay, what is numba, exactly?

Numba is a “just-in-time” (JIT) compiler, which essentially means that a function you create will be compiled and can run independently of the Python interpreter. Simply, if your code involves lots of numpy and math, chances are, you can speed it up even more.

That sounds great, but…

Anecdotally speaking I’ve seen about 100-200x performance gains for some of the functions I’ve written with numba. The beauty of it is that it involves almost no work, provided that the code is mostly already in numpy, and that numba is compatible with your implementation. This is the catch that I think the 5-minute numba docs don’t cover.

numba is heavily typed and has specific implementations and signatures of numpy functions. In other words, you might have found that you can “get away” with calling numpy in specific ways in “normal” Python, but upon compiling with numba, you will come to some very surprising errors. You could argue that this is helpful though.

Example

The example we’re going to cover is the RAPDF function from Samudrala and Moult. RAPDF represents a sort of log odds ratio that two atoms of types $i$ and $j$ come into contact in a specific distance bin $d$:

$$\textrm{RAPDF} = -log\left(\dfrac{\dfrac{n(d_{ij})}{\sum_d n(d_{ij})}}{\dfrac{\sum_{ij} n(d_{ij})}{\sum_d \sum_{ij} n(d_{ij})}}\right) $$

This is typically the kind of data where RAPDF would be applied to; I’ve simplified the contacts to amino acids rather than atoms, but in practice, it would involve the same computation:

import pandas as pd

# this is just a randomly generated contact map
# each cell shows number of contacts between 2 amino acids
# at a given distance bin (e.g. 0-3 Angstroms)
df = pd.read_csv("contact_map.tsv", sep='\t', index_col = 0)
df.head(5)
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

If you’re curious:, the top fraction in the numerator represents the number of contacts between types $$i$$ and $$j$$ in distance bin $$d$$, divided by the total number of contacts between types $$i$$ and $$j$$ across all distance bins. This is then divided by the bottom fraction, where the number of contacts between all atom types at a specific distance bin $$d$$ is divided by all contacts across all distance bins.

You may have noticed some 0s in the data; since we’re going to compute a log later, we’ll add a pseudocount of 0.0001.

df = df+1e-4
df.head(5)
.dataframe tbody tr th {
    vertical-align: top;
}

.dataframe thead th {
    text-align: right;
}

Now to the code

We can code RAPDF as so:

import numpy as np

def rapdf(contact_map: np.ndarray):
    
    all_contact_d = np.sum(contact_map, axis=0)
    all_contact_all_d = np.sum(contact_map)
    
    # this is constant per distance bin, so only calculate once!
    denominator = all_contact_d / all_contact_all_d
    
    scores = np.zeros(contact_map.shape)
    
    for i, row in enumerate(contact_map):
        
        # this just ensures we don't divide by 0
        rowsum = max(row.sum(), 1)
        
        for j, cell in enumerate(row):
            
            numerator = (cell/rowsum)
            scores[i][j] = -np.log(numerator/denominator[j]).round(2)
    
    return scores

Now we can see it in action:

from time import time

t1 = time()
scores = rapdf(df.values)
t2 = time()

print(scores, "This took {} miroseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23  0.65 -0.82  0.2  10.55]
 [-0.84 -0.19  0.    0.73 10.55  0.  ]
 [-0.52  0.64  1.3  10.55 -0.54 -0.65]
 ...
 [10.73  0.48 -0.4  -0.62 -0.59  1.54]
 [-0.14  1.71 -0.27 -0.49 -0.57 10.88]
 [-0.14 -0.08  0.98 -0.24  0.24 -0.27]] This took 13474.225997924805 miroseconds

Cool, we can also use the timeit magic function to do this with more runs:

%timeit rapdf(df.values)
7.64 ms ± 186 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

~7.8ms (or about 7800 microseconds) isn’t bad, but this is where numba will shine. To start, we just need to apply a decorator, @numba.jit(nopython=True). This tells numba to compile the function in nopython mode, meaning maximum performance gains; otherwise, without the flag, I’ve found that the performance gains I had gotten wasn’t really worth the effort.

import numba 
# This is a decorator that tells numba to compile this function in "no python mode".
@numba.jit(nopython=True)
def rapdf(contact_map: np.ndarray):
    
    all_contact_d = np.sum(contact_map, axis=0)
    all_contact_all_d = np.sum(contact_map)
    
    # this is constant per distance bin, so only calculate once!
    denominator = all_contact_d / all_contact_all_d
    
    scores = np.zeros(contact_map.shape)
    
    for i, row in enumerate(contact_map):
        
        # this just ensures we don't divide by 0
        rowsum = max(row.sum(), 1)
        
        for j, cell in enumerate(row):
            
            numerator = (cell/rowsum)
            scores[i][j] = -np.log(numerator/denominator[j]).round(2)
    
    return scores

To start with, let’s use the exact same definition as above; then we can call rapdf again:

t1 = time()
scores = rapdf(df.values)
t2 = time()

print(scores, "This took {} miroseconds".format((t2-t1)*(10**6)))
...

TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Unknown attribute 'round' of type float64

File "<ipython-input-6-e7160609fabf>", line 22:
def rapdf(contact_map: np.ndarray):
    <source elided>
            numerator = (cell/rowsum)
            scores[i][j] = -np.log(numerator/denominator[j]).round(2)
            ^

This was half-expected. From experience, there will always be something that numba isn’t compatible with; in this case, the round operation. We can “get rid of it” by doing a trick:

@numba.jit(nopython=True)
def rapdf(contact_map: np.ndarray):
    # The rest of the function has remained identical
        ...     
        for j, cell in enumerate(row):
            
            numerator = (cell/rowsum)
            
            # notice where the round operator has gone now!
            scores[i][j] = np.round(-np.log(numerator/denominator[j]), 2)
    
    return scores

And now, will it run?

t1 = time()
scores = rapdf(df.values)
t2 = time()

print(scores, "This took {} miroseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23  0.65 -0.82  0.2  10.55]
 [-0.84 -0.19  0.    0.73 10.55  0.  ]
 [-0.52  0.64  1.3  10.55 -0.54 -0.65]
 ...
 [10.73  0.48 -0.4  -0.62 -0.59  1.54]
 [-0.14  1.71 -0.27 -0.49 -0.57 10.88]
 [-0.14 -0.08  0.98 -0.24  0.24 -0.27]] This took 1207432.9853057861 miroseconds

Now it works! You might now be thinking, wait a minute, this was actually slower than before! Hold on a minute. The reason for this is that numba had to compile your function the first time you call it. This is where most of the time has gone in. In fact, if you now call rapdf again, you’ll see:

t1 = time()
scores = rapdf(df.values)
t2 = time()

print(scores, "This took {} miroseconds".format((t2-t1)*(10**6)))
[[-0.07 -0.23  0.65 -0.82  0.2  10.55]
 [-0.84 -0.19  0.    0.73 10.55  0.  ]
 [-0.52  0.64  1.3  10.55 -0.54 -0.65]
 ...
 [10.73  0.48 -0.4  -0.62 -0.59  1.54]
 [-0.14  1.71 -0.27 -0.49 -0.57 10.88]
 [-0.14 -0.08  0.98 -0.24  0.24 -0.27]] This took 232.93495178222656 miroseconds

And if we do this over several runs:

%timeit rapdf(df.values)
28.1 µs ± 341 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Now look at that! It’s shrunk down to 28 microseconds in comparison to the >7 miliseconds we had before. That’s equivalent to a 250-fold speed up. You can imagine that if you had code that you’d call over and over again, this is where numba could be so useful.