When I’m starting a new code base for a project, I usually reach for Python first. Python is an expressive and approachable language with a rich package ecosystem. However, it’s no secret that, generally speaking, interpreted languages are slower than compiled ones. Does this matter? For many projects, no; but there are plenty of situations where speed is the number one concern! Machine learning, big data, and scientific computing are all largely bottlenecked by speed of execution, and that speed can be the difference between an analysis taking an hour or 10 seconds.
But that doesn’t mean we have to throw our little Python baby out with the slow bathwater! It is often the case that small portions of a code-base account for the majority of execution time. These might be things like expensive numerical computations inside of a nested loop, or linear algebra operations on large matrices, etc. This means we can focus our effort on the small snippet and yield huge returns on speed-up. Of course there are optimizations to be had with Python itself, but once those are exhausted we have two other solid options: calling compiled functions and parallelization.
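Before optimizing anything, it’s worth confirming where the time actually goes. Here is a minimal profiling sketch with Python’s built-in cProfile; the function names are illustrative stand-ins, not from the real code base:

```python
# Profile a toy driver and confirm which function dominates cumulative time.
# "expensive_kernel" is a hypothetical stand-in for the real numerical work.
import cProfile
import io
import pstats

def expensive_kernel(n):
    # stand-in for the expensive numerical computation in the hot loop
    total = 0.0
    for i in range(n):
        total += (i * 0.5) ** 0.5
    return total

def driver():
    for _ in range(5):
        expensive_kernel(50_000)

profiler = cProfile.Profile()
profiler.enable()
driver()
profiler.disable()

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

If one function accounts for nearly all of the cumulative time, as here, it’s a good candidate for a compiled replacement.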
As a case study consider the following snippet of code:
```python
import numpy as np
import my_F90
import my_F90_OMP

def my_func(nx, ny, nz, tgt):
    result = <big expensive function involving inputs>
    return result

<Various pre-processing and setup>

# Very large 3-D structure of data, dependent on setup
nx, ny, nz = np.meshgrid(thing1, thing2, thing3)
test_tgts = <input data>

for i, tgt in enumerate(test_tgts):
    kern1 = my_func(nx, ny, nz, tgt)          # I take 3480 seconds
    kern2 = my_F90.func(nx, ny, nz, tgt)      # I take 29 seconds
    kern3 = my_F90_OMP.func(nx, ny, nz, tgt)  # I take 8 seconds

<Post-processing code>
```
There are two important features of this code: first, we have a big expensive function to evaluate inside my_func(); second, we have a big chunk of data/memory that the function is operating over. The data is generated from a 3-D geometric structure, so increasing the linear size of the problem by 10x means a 1000x increase in the computations needed! The quoted runtimes in the example are for ~13 GB of data.
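That cubic growth is easy to demonstrate with np.meshgrid directly (the axis lengths below are made up for illustration):

```python
# Show that a 10x increase in linear problem size means a 1000x increase
# in the number of grid points a 3-D kernel must process.
import numpy as np

def grid_points(side):
    # build a cubic 3-D grid with `side` samples along each axis
    nx, ny, nz = np.meshgrid(np.arange(side), np.arange(side), np.arange(side))
    return nx.size

small = grid_points(10)
large = grid_points(100)   # 10x the linear size...
print(large // small)      # ...is 1000x the data and the work
```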
The three kernels in the test loop are, respectively: the native serial Python function, a compiled Fortran 90 serial implementation of my_func(), and an F90 implementation compiled with OpenMP and run on 4 CPU cores. The code was used in an application where many different “test targets” were being examined, so iterating through the various tests with the native Python kernel started out annoying and quickly became silly/impossible as the problem sizes increased.
As you can see, calling the compiled Fortran implementations is no more troublesome than calling the native Python code; they happily accept and return NumPy structures, and the compiled objects simply need to be imported. So how do we get there?
There are three main pieces needed to get callable, OpenMP-enabled F90 modules: F2PY, a Fortran compiler, and that compiler’s implementation of OpenMP. F2PY is included with NumPy by default, so many people already have it and don’t know it! I used gfortran from the GNU project, but you can also use your favorite flavor, including the LLVM and Intel implementations. gfortran’s OpenMP implementation is libgomp, which we will link against later.
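A quick sanity check (a hypothetical snippet, not from the original post) confirms that F2PY ships with your NumPy install and that a Fortran compiler is on your PATH:

```python
# Verify the two prerequisites: F2PY (bundled with NumPy) and gfortran.
import shutil
import numpy.f2py  # no separate install needed -- it comes with NumPy

print("f2py available via numpy.f2py")
print("gfortran found:", shutil.which("gfortran") is not None)
```

You can also invoke it from the command line as `f2py` or `python -m numpy.f2py`.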
The serial F90 module is compiled from an F90 source file that looks like this:
```fortran
subroutine func(x, y, z, tgt, result, n)
    implicit none
    integer, intent(in) :: n
    real(8), intent(in), dimension(n) :: x, y, z
    real(8), intent(in) :: tgt(3)
    real(8), intent(out) :: result(n)  ! output
    !f2py depend(n) result
    real(8), dimension(n) :: dummy, temp

    result = <big expensive function involving inputs>
end subroutine
```
```shell
f2py -m compiled_object --fcompiler=gfortran -c my_source.f90
```
I won’t attempt to cover all the nuances of F2PY here, but there are a couple of differences from plain F90 that are worth noting. First, variables are decorated with the intent keyword, indicating whether they are read from the Python program or returned as outputs. Additionally, you can signal data dependencies using the depend directive. In this case x, y, z, tgt are explicit inputs, and n is an implicit input that varies based on the shape of the explicit inputs. We can of course have private variables that exist only within the F90 module, in this case dummy and temp. Finally we have any outputs, like result, that use the depend directive to indicate their returned shape.
Moving from the native interpreted Python to the F90 compiled version already gets us a 120x speedup in this example, but we can also easily take advantage of shared-memory parallelism within the laptop/desktop/node using OpenMP. In this example case the calculations are all independent of each other, so the implementation is almost trivial:
```fortran
subroutine func(t, x, y, z, tgt, result, n)
    use omp_lib
    implicit none
    integer, intent(in) :: n, t
    real(8), intent(in), dimension(n) :: x, y, z
    real(8), intent(in) :: tgt(3)
    real(8), intent(out) :: result(n)  ! output
    !f2py depend(n) result
    real(8), dimension(n) :: dummy, temp
    integer :: i

    call omp_set_num_threads(t)

    !$OMP PARALLEL DO PRIVATE(dummy, temp) SHARED(result)
    DO i = 1, n
        result(i) = <same function, with inputs indexed into using i>
    END DO
    !$OMP END PARALLEL DO
end subroutine
```
```shell
f2py -m compiled_object --fcompiler=gfortran --f90flags='-fopenmp' -lgomp -c my_source.f90
```
We include omp_lib so we can call the OpenMP runtime (gfortran’s libgomp, which we link against with -lgomp), and add t as an input parameter so we can choose the number of threads to use. We then set the thread count with call omp_set_num_threads() and enter the parallel region of the code using the !$OMP PARALLEL DO directive, marking the scratch variables PRIVATE to each thread and indicating that result is memory shared by all threads. Each thread then takes a portion of the loop over the data and computes our big expensive function.
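For comparison, the same embarrassingly parallel pattern can be sketched in pure Python (a hypothetical stand-in kernel, nowhere near the Fortran speeds, but showing the same loop-splitting structure; NumPy releases the GIL inside its array calls, so threads can genuinely overlap here):

```python
# Split the data into one chunk per worker -- like the OMP DO loop split --
# map the kernel over the chunks in parallel, then stitch the results back.
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def kernel(chunk):
    # stand-in for the big expensive per-element function
    return np.sqrt(chunk ** 2 + 1.0)

data = np.linspace(0.0, 1.0, 1_000_000)
chunks = np.array_split(data, 4)          # one chunk per thread
with ThreadPoolExecutor(max_workers=4) as pool:
    result = np.concatenate(list(pool.map(kernel, chunks)))

print(result.shape)                        # same shape back, computed in parallel
```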
The result is a further 3.6x speedup using 4 cores in the example on my specific hardware. The less-than-ideal strong scaling here is primarily due to a lack of data reuse. Although we are doing a fairly expensive calculation on our data, the work we do is linearly proportional to the quantity of data: if we have N inputs, we are doing O(cN) work (where c is some constant; in this example c ≈ 300 FLOPs per element). Memory latency and bandwidth have not scaled at nearly the same rate as CPU throughput, so in cases of weak data reuse, memory tends to dominate the behavior. Compare this to, say, a naive n-body calculation, where computation scales as O(N²) despite still using just O(N) memory.
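To make the data-reuse argument concrete, here is a rough arithmetic-intensity estimate. The FLOP count c ≈ 300 comes from the example above; the byte count is an illustrative model (three float64 reads plus one float64 write per element), not a measurement:

```python
# Back-of-the-envelope arithmetic intensity (FLOPs per byte moved) for the
# O(N) kernel. For an O(N) workload this ratio is fixed; for an O(N^2)
# workload like naive n-body it grows with N, which is why n-body scales
# so much better across cores.
flops_per_element = 300       # c from the example
bytes_per_element = 8 * 4     # read x, y, z and write result, 8 bytes each
intensity = flops_per_element / bytes_per_element
print(f"~{intensity:.1f} FLOPs per byte moved")
```

A fixed, modest FLOPs-per-byte ratio like this means that once a few cores can issue enough loads to saturate memory bandwidth, adding more cores stops helping.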
All told, our two improvements mean the code runs 435x faster, with a typical run taking 8 seconds instead of almost an hour. Not bad!
If you’d like to play around with the original code that served as the inspiration for this post, it’s available on my GitHub:
–Python driving script
–Serial F90 module
–OpenMP F90 module
Keep in mind the linked code is early research code, so, a bit like peering into a sausage factory, I make no claims to its quality, readability, or sanity.