# Using Data Structures to Improve the Un-improvable, a Fenwick Tree Christmas Story

tl;dr: Use the right data structure for the job. I implemented a Fenwick tree for Fortran90 with a couple neat features, available here.

Our fictitious story begins with an eccentric billionaire and Christmas aficionado who is holding a contest for the whole USA: they are giving away $1 billion over the span of 1 hour, all people have to do to compete is try and put as many ornaments on a Christmas tree as possible and submit their total to the contest website! If they manage to fit even more on, or if some fall off they need to update their total. You are the intern in the team tasked by the billionaire to handle the drawing of winners. The rest of the team has already set up the website, database, etc. The team reserved this simplest of tasks for the intern, surely you can at least do what they ask! Your boss tells you they expect around 100 million contestants, they will award$250,000 once a second for a full hour. In order to reward contestants with more ornaments on their tree with a better chance of winning they choose winners in the following way: there is an array with an entry for each contestant that contains their total, perform a cumulative sum over this array and pick a random number in the range from 1 to the total number of ornaments, whichever entry in the cumulative sum is closest to the random number is a winner! All you need to do is pick a winner once a second and update any contestants entries if they changed since the second before.

You think a bit about how they are drawing winners, is that fair? After some thought you agree; people with bigger entries are proportionally more likely to be chosen, while those with smaller entries are proportionately less likely. “This implementation should be dirt simple!” you say, and type up a prototype in Fortran90:

use binary_search
use contest_stuff !provides contest related functions by team
integer, dimension(num_contestants) :: contest_entry, cumulative_sum
real :: rand_real
integer :: i,t,rand_int

do t=1,3600 !once a second, for an hour
call wait_and_update(contest_entry) !pulls in from database etc.

cumulative_sum(1) = contest_entry(1)
do i=2, num_contestants
cumulative_sum(i) = cumulative_sum(i-1) + contest_entry(i)
end do

call random_number(rand_real)
rand_int = int(rand_real * cumulative_sum(num_contestants+1))
winner = binary_search(cumulative_sum, rand_int) !returns winner index
call pick_winner(winner) !sends winner out to database etc.
end do

Slick! You even used a binary search over the cumulative sum (you noticed that it will be monotonically increasing since all entries are positive, so you can do this), so it’s cost will be $\mathcal{O}(log \, n)$. You set up a dummy example to test your prototype and run it as fast as you can to see how often it will need to run on one core and go get lunch. You come back after lunch and find it still running! It ends up taking over an hour to run, and that’s without any lag from the server/website etc.

You re-run and profile the code, and are surprised to see that the inner do loop takes 99% of running time! In retrospect this isn’t too surprising, after all the cost of the cumulative sum is $\mathcal{O}(n)$ and $n$ is quite large here (worse still, the overhead of the do loop itself is an appreciable fraction of the total loop cost!). But what room for improvement is there in practically the simplest loop possible?

This is a silly story, but it proves a point. It’s clear that part of the problem is you are reading/writing a single array entry at a time; you could loop unroll and process a vector stride at a time but you’ll find that compiling with -O3 already does this for you. The loop unrolling gave a 3x speedup on my system, but that still leaves the asymptotic complexity untouched.

The Fenwick tree to the rescue! Also known as a binary indexed tree, it gives a fast way to compute cumulative sums and update them based on changes in the original parent array. With our example code we needed to compute the cumulative sum each second, with a cost scaling of $\mathcal{O}(N)$ and a cost of updating of $\mathcal{O}(1)$ per entry. In comparison the Fenwick tree has $\mathcal{O}(log \, N)$ for computing a sum and $\mathcal{O}(log \, N)$ per update.

The Fenwick tree is also quite compact to implement as well, partly because it never actually implements a tree and instead stores it implicitly in a flat array. Here’s some example code to build a tree from a parent array:

pure function build_fen(arr) result(fen)
integer, dimension(:), intent(in) :: arr
integer, dimension(size(arr)) :: fen
integer :: i,j,n

fen = 0
n = size(arr)
do i=1,n
j = i
do while (j<=n)
fen(j) = fen(j) + arr(i)
j = j + iand(j,-j)
end do
end do
end function build_fen

Note: This function uses assumed-shape arrays, as such if you want to use it you either need to give it an explicit interface or stick it in a module. Check links at the top and bottom of this post to my F90 module that implements all this for you.

“But wait!” you say, aren’t there $N$ cumulative sums we need to compute for a total cost of $\mathcal{O}(N log \, N)$? Actually we may need any sum, but the binary search looks at at most $log_2 \, N$ entries. So we only need to compute the sums that the binary search touches each update, giving a total cost of $\mathcal{O}(log^2 N)$ compared to the original $\mathcal{O}(N)$. Neat, but we can do better actually.

It isn’t usually mentioned, but you can actually search the tree in $\mathcal{O}(log \, N)$ time. Rather than search the array by bisecting along it’s indices and computing a full sum each time, you can use the implicit tree structure to bisect on a per-bit basis, accumulate a running total, and compare it to the target you are searching for. Once you’ve fully traversed the tree you will have found the closest tree entry.

There’s another refinement to implementation performance to be had as well. It is often the case that contiguous chunks of updates are made to the parent array. Naive updates would then have a O(M log N) cost for tree size N and M updates. Instead we can break these updates into two regions: inside the contiguous chunk and everything after. As a result we have reduced the cost to O(log N + M log M). For “large” M and N>>M this can translate to a significant savings, but even for M=2 or 3 it is still faster than 2 or 3 naive sequential updates.

When I came across this problem I wasn’t able to find a Fortran90 implementation, so I made one (along with the improvements to chunk updates and binary search I’ve mentioned). You can find it here on my Github!

# Using Fortran and OpenMP to Make Python 500x Faster, a Case Study

When I’m first starting a new code base for some project, the first place I usually start is Python. Python is an expressive and approachable language with a rich package ecosystem. However, it’s no secret that generally speaking interpreted languages are slower than compiled ones. Does this matter? For many projects, no; but there are plenty of situations where speed is the number one concern! Machine learning, big data, and scientific computing are all largely bottle-necked by speed of execution. And that speed can be the difference between an analysis taking an hour or 10 seconds.

But that doesn’t mean we have to throw our little Python baby out with the slow bathwater! It is often the case that small portions of a code-base account for the majority of execution time. These might be things like expensive numerical computations inside of a nested loop, or linear algebra operations on large matrices, etc. This means we can focus our effort on the small snippet and yield huge returns on speed-up. Of course there are optimizations to be had with Python itself, but once those are exhausted we have two other solid options: calling compiled functions and parallelization.

As a case study consider the following snippet of code:

import numpy as np
import my_F90
import my_F90_OMP

def my_func(nx, ny, nz, tgt):
result = <big expensive function involving inputs>
return(result)

<Various pre-processing and setup>

# Very large 3-D structure of data, dependent on setup
nx, ny, nz = np.meshgrid(thing1, thing2, thing3)
test_tgts = <input data>

for i, tgt in enumerate(test_tgts):
kern1 = my_func(nx, ny, nz, tgt)         # I take 3480 seconds
kern2 = my_F90.func(nx, ny, nz, tgt)     # I take 29 seconds
kern3 = my_F90_OMP.func(nx, ny, nz, tgt) # I take 8 seconds

<Post-processing code>

There are two important features of this code: First we have a big expensive function to evaluate inside my_func(). Second we have a big chunk of data/memory that the function is operating over. The data is generated from a 3-D geometric structure, so increasing the size of the problem by 10x means a 1000x increase in the computations needed! The quoted runtimes in the example are for ~13 Gb of data.

The 3 kernels in the test loop are respectively: the native serial Python function, a compiled Fortran90 serial implementation of my_func(), and a F90 implementation compiled with OpenMP and run on 4 CPU cores. The code was used in an application where many different “test targets” were being examined, so iterating through the various tests on the native Python kernel started out being annoying and quickly moved to being silly/impossible as the problem sizes increased.

As you can see, calling the compiled Fortran implementations are no more troublesome than the native Python code; they happily accept and return Numpy structures, the compiled objects simply need to be imported. So how do we get there?

There are 3 main pieces needed to get callable OpenMP enabled F90 modules: F2PY, a Fortran compiler, and that compiler’s implementation of OpenMP. F2PY is included with Numpy by default, many people already have it and don’t know it! I used gfortran from the GNU project, but you can also use your favorite flavor including LLVM and Intel implementations. gfortran’s OpenMP implementation is libgomp.

The serial F90 module is compiled from a F90 source file that looks like this:

subroutine func(x, y, z, tgt, result, n) ! output
implicit none
integer, intent(in) :: n
real(8), intent(in), dimension(n) :: x, y, z
real(8), intent(in) :: tgt(3)
real(8), intent(out) :: result(n)
!f2py depend(n) result
real(8), dimension(n) :: dummy, temp

result = <big expensive function involving inputs>

end subroutine
f2py -m compiled_object --fcompiler=gfortran -c my_source.f90

I won’t attempt to cover all the nuances of F2PY here, but there are a couple differences from plain F90 that are worth noting. First, variables are decorated with the intent keyword, indicating whether they are read from the Python program or return outputs. Additionally you can signal data dependencies using the depend directive. In this case x, y, z, tgt are explicit inputs, and n is an implicit input that varies based on the shape of the explicit inputs. We can of course have private variables that exist only within the F90 module, in this case the examples dummy and temp. Finally we have any outputs, like result, that use the depend directive to indicate their returned shape.

Moving from the native interpreted Python to the F90 compiled version already gets us a 120x speedup in this example, but we can also easily take advantage of shared-memory parallelism within the laptop/desktop/node using OpenMP. In this example case the calculations are all independent of each other, so the implementation is almost trivial:

subroutine func(t, x, y, z, tgt, result, n) ! output
use omp_lib
implicit none
integer, intent(in) :: n, t
real(8), intent(in), dimension(n) :: x, y, z
real(8), intent(in) :: tgt(3)
real(8), intent(out) :: result(n)
!f2py depend(n) result
real(8), dimension(n) :: dummy, temp
integer :: i

call omp_set_num_threads(t)
!$OMP PARALLEL DO PRIVATE(dummy, temp) SHARED(result) DO i = 1, n result(i) = <same function, with inputs indexed into using i> END DO !$OMP END PARALLEL DO

end subroutine
f2py -m compiled_object --fcompiler=gfortran --f90flags='-fopenmp' '-lgomp' -c my_source.f90

We include omp_lib to link against libgomp, and add t as an input parameter so we can choose the number of threads to use. We then set the thread count with call omp_set_num_threads() and pass into the parallel region of execution in the code using the !\$OMP PARALLEL DO directive. We also indicate that result is memory to be shared by all threads. We then loop over the data in parallel in each of the threads and compute our big expensive function.

The result is a further 3.6x faster using 4 cores in the example and my specific hardware. Less than optimal strong scaling in this case is primarily due to lack of data-reuse. Although we are doing a fairly expensive calculation on our data, the work we do is linearly proportional to the quantity of data. If we have N inputs, we are doing $c_1N$ work (where $c_1$ is some constant, in this example c~=300 FLOPs). Memory latency and bandwidth has not scaled at nearly the same rate as CPU throughput, and so in cases of weak data-reuse it tends to dominate the behavior. Compare this to say a naive n-body calculation where computation scales as $c_2N^2$ despite still using just $N$ memory.

All told, our two improvements mean the code runs 435x faster, with a typical run taking 8 seconds instead of almost an hour. Not bad!

If you’d like to play around with the original code that served for the inspiration for this post, it’s available on my Github:
Python driving script
Serial F90 module
OpenMP F90 module
Keep in mind the linked code is early research code, so a bit like peering into a sausage factory I make no claims to it’s quality, readability, or sanity.

# Website Updated!

Just a quick note that the website has been updated and some new pages have been added, while old ones have been revamped. Also this makes it so the last blog post isn’t from 5 years ago any more!

I plan to add in some more pages covering my Intro to DG series, organize my talks I’ve given into a tab, and try to sort through my “DbX” notes, code, and files that I have scattered across several machines.

I’d also like to put up a couple more blog posts covering some interesting little side things I’ve worked on/been working on, including: a “fitting puzzle” solver, a neat little Matlab one-liner for Conway’s Game of Life, and a few other things that would be good to get from brain to page.

# Polynomial Approximations: Why monomials are bad

Define the concept of an inner product on some domain $[a,b]$ as: $(f,g) = \int_a^b \! f(x)g(x) \, \mathrm{d}x = 0$

If we consider vectors in $R^n$ space the inner product is essentially the dot product of the two vectors. Vectors that are orthogonal to each other have a dot product of 0. The concept of orthogonality can be abstracted and extended to vectors in any vector space, so that vectors that have a zero inner product are orthogonal.

Some arbitrary function $u(x,t)$ on the domain $I_k =[x_{k-^1\!/_2},x_{k+^1\!/_2}]$ may not permit an analytical representation and so most be approximated in some way. An easy choice is a polynomial approximation of order M. A natural choice for the polynomial basis is the monomials, $\psi_m(x) = x^{m-1}$. These are satisfactory for low orders, but we encounter a problem at higher orders.

If we normalize the monomial basis with respect to the $l^2$ norm $\|f\|=\sqrt{(f,f)}$ we arrive at $\bar{\psi}_m(x) = \sqrt{2m+1}x^m$. If the inner product of two basis functions is 0, they are orthogonal and therefore linearly independent. If many of our basis functions end up linearly dependent it is difficult to reconstruct the solution approximation.

Consider the inner product of two adjacent basis functions in our normalized monomial basis: $(\bar{\psi}_{m-1},\bar{\psi}_m) = \int_0^1 \! \sqrt{2m-1}x^{m-1}\sqrt{2m+1}x^{m} \, \mathrm{d}x = \sqrt{1-\frac{1}{4m^2}}$

It is easy to see that for even moderate values of $m$ the inner product is close to 1, indicating near linear dependence between basis functions.

It is therefore important to choose a polynomial basis of (ideally) orthogonal functions so that the solution approximation is well conditioned. One good choice is the Legendre polynomials $P_m$ which are orthogonal on the interval $[-1,1]$ (these are defined as the solutions to the Sturm–Liouville equation for a weighting function $w(x)=1$). They can be defined using a recurrence formula or the explicit Rodrigue’s formula. $P_m(x) = \frac{(2m-1)xP_{m-1}(x) - (m-1)P_{m-2}(x)}{m}$ $P_m(x) = \frac{1}{2^m m!}\frac{\mathrm{d}^2}{\mathrm{d}x^m}\left[(x^2-1)^m\right]$

Orthogonality means that the inner product of two Legendre polynomials is zero if they are not of the same order, this can be expressed using the Kronecker product as $(P_n,P_m) = \int_{-1}^1 \! P_n P_m \, \mathrm{d}x = \delta_{nm} \left(\frac{2}{2m+1}\right)$

Note that we could use a similar orthonormal version of the basis such that every inner product is simply the Kronecker product, however one nice property of the current form is that $P_m(1)=1$ and $P_m(-1)=(-1)^m$. In the orthonormal case we would need to explicitly evaluate $P_m(x)$ at the endpoints for every $m$ (it is important to be clear with what one means by normalization in this case: orthonormal with respect to the $l^2$ norm or normalized with respect to the maximal value of $P_m(x)$ on $[-1,1]$).

The Legendre polynomials are orthogonal on the domain $[-1,1]$ however we would like to use them on the domain of interest. We can define a mapping that will transform the domain of interest to that of the Legendre polynomials, $\tilde{x} = \frac{2(x-x_{k-^1\!/_2})}{x_{k+^1\!/_2}-x_{k-^1\!/_2}}-1$ to transform $I_k$ to $[-1,1]$. Using this we can define a variant of the shifted Legendre polynomials such that $\tilde{P}_m(x) = P_m(\tilde{x}(x))$. The newly defined shifted Legendre polynomials are now orthogonal over $I_k$ and the inner product becomes $(\tilde{P}_n,\tilde{P}_m) = \int_{x_{k-^1\!/_2}}^{x_{k+^1\!/_2}} \! P_n(\tilde{x}) P_m(\tilde{x}) \, \mathrm{d}\tilde{x} = \delta_{nm} \left(\frac{\Delta x}{2m+1}\right)$

Using our orthogonal Legendre polynomial basis we can now construct an approximating solution $u$ for arbitrary order M using basis functions $\overset{m}{\psi}(x) \in \tilde{P}_m$ and basis weights $\overset{m}{a}(t)$ $u = \sum_{m=0}^M \overset{m}{a}(t)\overset{m}{\psi}(x)$

This approximation is well conditioned for arbitrary orders and has the additional benefit that it is easy to evaluate integrals and derivatives of it.

Where the orthogonal basis really shines however is when one needs to find the integral of the product of the approximation (or it’s derivative) with some other function (or it’s derivative) using the same orthogonal basis for the approximation. Usually the integrand of the two approximations can be calculated analytically and presents a very succinct closed form.

For example to solve the 1D the scalar conservation law PDE $\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = 0$

could be solved using a $N^{th}$ order spatial modal Discontinuous Galerkin method, so the the weak form becomes $\sum_{m=0}^M \left[ \frac{\mathrm{d}\overset{m}{a}}{\mathrm{d} t} \int_{I_k}\! \overset{m}{\psi} \, \overset{n}{\phi} \,\mathrm{d}x \right] + [g(v^-\!,v^+) \; \overset{n}{\phi}] \Big\rvert_{x_{k-^1\!/_2}}^{x_{k+^1\!/_2}} \!-\! \int_{I_k}\! f(v) \,\frac{\mathrm{d} \overset{n}{\phi}}{\mathrm{d} x} \,\mathrm{d}x = 0\;\;for \; 0 \leq n \leq N$

Using an orthogonal Legendre basis approximation for both the solution and test function (a Galerkin approach) simplifies the expression greatly to $\frac{\mathrm{d}\overset{n}{a}}{\mathrm{d} t}\left(\frac{\Delta x}{2n+1}\right)= \!\! \sum_{\underset{by \: 2}{m=n-1}}^0 \overset{m}{a}_k-\sum_{m=0}^M \left[ \overset{m}{a}_k - \overset{m}{a}_{k-1}(-1)^m \right]\;\;\; for \; 0 \leq n \leq N$

Which is then readily solved using an explicit method of lines approach for time discretization. Without an orthogonal basis approximation the integrands would all need to be calculated using quadrature and the routine would be far more expensive to perform.