Using Data Structures to Improve the Un-improvable, a Fenwick Tree Christmas Story

tl;dr: Use the right data structure for the job. I implemented a Fenwick tree for Fortran90 with a couple neat features, available here.

Our fictitious story begins with an eccentric billionaire and Christmas aficionado who is holding a contest for the whole USA: they are giving away $1 billion over the span of 1 hour, all people have to do to compete is try and put as many ornaments on a Christmas tree as possible and submit their total to the contest website! If they manage to fit even more on, or if some fall off they need to update their total.

You are the intern in the team tasked by the billionaire to handle the drawing of winners. The rest of the team has already set up the website, database, etc. The team reserved this simplest of tasks for the intern, surely you can at least do what they ask!

Your boss tells you they expect around 100 million contestants, they will award $250,000 once a second for a full hour. In order to reward contestants with more ornaments on their tree with a better chance of winning they choose winners in the following way: there is an array with an entry for each contestant that contains their total, perform a cumulative sum over this array and pick a random number in the range from 1 to the total number of ornaments, whichever entry in the cumulative sum is closest to the random number is a winner! All you need to do is pick a winner once a second and update any contestants entries if they changed since the second before.

You think a bit about how they are drawing winners, is that fair? After some thought you agree; people with bigger entries are proportionally more likely to be chosen, while those with smaller entries are proportionately less likely. “This implementation should be dirt simple!” you say, and type up a prototype in Fortran90:

use binary_search
use contest_stuff !provides contest related functions by team
integer, dimension(num_contestants) :: contest_entry, cumulative_sum
real :: rand_real
integer :: i,t,rand_int

do t=1,3600 !once a second, for an hour
    call wait_and_update(contest_entry) !pulls in from database etc.

    cumulative_sum(1) = contest_entry(1)
    do i=2, num_contestants
        cumulative_sum(i) = cumulative_sum(i-1) + contest_entry(i)
    end do

    call random_number(rand_real)
    rand_int = int(rand_real * cumulative_sum(num_contestants+1))
    winner = binary_search(cumulative_sum, rand_int) !returns winner index
    call pick_winner(winner) !sends winner out to database etc.
end do

Slick! You even used a binary search over the cumulative sum (you noticed that it will be monotonically increasing since all entries are positive, so you can do this), so it’s cost will be \mathcal{O}(log \, n). You set up a dummy example to test your prototype and run it as fast as you can to see how often it will need to run on one core and go get lunch. You come back after lunch and find it still running! It ends up taking over an hour to run, and that’s without any lag from the server/website etc.

You re-run and profile the code, and are surprised to see that the inner do loop takes 99% of running time! In retrospect this isn’t too surprising, after all the cost of the cumulative sum is \mathcal{O}(n) and n is quite large here (worse still, the overhead of the do loop itself is an appreciable fraction of the total loop cost!). But what room for improvement is there in practically the simplest loop possible?

This is a silly story, but it proves a point. It’s clear that part of the problem is you are reading/writing a single array entry at a time; you could loop unroll and process a vector stride at a time but you’ll find that compiling with -O3 already does this for you. The loop unrolling gave a 3x speedup on my system, but that still leaves the asymptotic complexity untouched.

The Fenwick tree to the rescue! Also known as a binary indexed tree, it gives a fast way to compute cumulative sums and update them based on changes in the original parent array. With our example code we needed to compute the cumulative sum each second, with a cost scaling of \mathcal{O}(N) and a cost of updating of \mathcal{O}(1) per entry. In comparison the Fenwick tree has \mathcal{O}(log \, N) for computing a sum and \mathcal{O}(log \, N) per update.

The Fenwick tree is also quite compact to implement as well, partly because it never actually implements a tree and instead stores it implicitly in a flat array. Here’s some example code to build a tree from a parent array:

pure function build_fen(arr) result(fen)
    integer, dimension(:), intent(in) :: arr
    integer, dimension(size(arr)) :: fen
    integer :: i,j,n

    fen = 0
    n = size(arr)
    do i=1,n
        j = i
        do while (j<=n)
            fen(j) = fen(j) + arr(i)
            j = j + iand(j,-j)
        end do
    end do
end function build_fen

Note: This function uses assumed-shape arrays, as such if you want to use it you either need to give it an explicit interface or stick it in a module. Check links at the top and bottom of this post to my F90 module that implements all this for you.

“But wait!” you say, aren’t there N cumulative sums we need to compute for a total cost of \mathcal{O}(N log \, N)? Actually we may need any sum, but the binary search looks at at most log_2 \, N entries. So we only need to compute the sums that the binary search touches each update, giving a total cost of \mathcal{O}(log^2 N) compared to the original \mathcal{O}(N). Neat, but we can do better actually.

It isn’t usually mentioned, but you can actually search the tree in \mathcal{O}(log \, N) time. Rather than search the array by bisecting along it’s indices and computing a full sum each time, you can use the implicit tree structure to bisect on a per-bit basis, accumulate a running total, and compare it to the target you are searching for. Once you’ve fully traversed the tree you will have found the closest tree entry.

There’s another refinement to implementation performance to be had as well. It is often the case that contiguous chunks of updates are made to the parent array. Naive updates would then have a O(M log N) cost for tree size N and M updates. Instead we can break these updates into two regions: inside the contiguous chunk and everything after. As a result we have reduced the cost to O(log N + M log M). For “large” M and N>>M this can translate to a significant savings, but even for M=2 or 3 it is still faster than 2 or 3 naive sequential updates.

When I came across this problem I wasn’t able to find a Fortran90 implementation, so I made one (along with the improvements to chunk updates and binary search I’ve mentioned). You can find it here on my Github!

Using Fortran and OpenMP to Make Python 500x Faster, a Case Study

When I’m first starting a new code base for some project, the first place I usually start is Python. Python is an expressive and approachable language with a rich package ecosystem. However, it’s no secret that generally speaking interpreted languages are slower than compiled ones. Does this matter? For many projects, no; but there are plenty of situations where speed is the number one concern! Machine learning, big data, and scientific computing are all largely bottle-necked by speed of execution. And that speed can be the difference between an analysis taking an hour or 10 seconds.

But that doesn’t mean we have to throw our little Python baby out with the slow bathwater! It is often the case that small portions of a code-base account for the majority of execution time. These might be things like expensive numerical computations inside of a nested loop, or linear algebra operations on large matrices, etc. This means we can focus our effort on the small snippet and yield huge returns on speed-up. Of course there are optimizations to be had with Python itself, but once those are exhausted we have two other solid options: calling compiled functions and parallelization.

As a case study consider the following snippet of code:

import numpy as np
import my_F90
import my_F90_OMP

def my_func(nx, ny, nz, tgt):
        result = <big expensive function involving inputs>

<Various pre-processing and setup>

# Very large 3-D structure of data, dependent on setup
nx, ny, nz = np.meshgrid(thing1, thing2, thing3)
test_tgts = <input data>

for i, tgt in enumerate(test_tgts):
        kern1 = my_func(nx, ny, nz, tgt)         # I take 3480 seconds
        kern2 = my_F90.func(nx, ny, nz, tgt)     # I take 29 seconds
        kern3 = my_F90_OMP.func(nx, ny, nz, tgt) # I take 8 seconds

<Post-processing code>

There are two important features of this code: First we have a big expensive function to evaluate inside my_func(). Second we have a big chunk of data/memory that the function is operating over. The data is generated from a 3-D geometric structure, so increasing the size of the problem by 10x means a 1000x increase in the computations needed! The quoted runtimes in the example are for ~13 Gb of data.

The 3 kernels in the test loop are respectively: the native serial Python function, a compiled Fortran90 serial implementation of my_func(), and a F90 implementation compiled with OpenMP and run on 4 CPU cores. The code was used in an application where many different “test targets” were being examined, so iterating through the various tests on the native Python kernel started out being annoying and quickly moved to being silly/impossible as the problem sizes increased.

As you can see, calling the compiled Fortran implementations are no more troublesome than the native Python code; they happily accept and return Numpy structures, the compiled objects simply need to be imported. So how do we get there?

There are 3 main pieces needed to get callable OpenMP enabled F90 modules: F2PY, a Fortran compiler, and that compiler’s implementation of OpenMP. F2PY is included with Numpy by default, many people already have it and don’t know it! I used gfortran from the GNU project, but you can also use your favorite flavor including LLVM and Intel implementations. gfortran’s OpenMP implementation is libgomp.

The serial F90 module is compiled from a F90 source file that looks like this:

subroutine func(x, y, z, tgt, result, n) ! output
    implicit none
    integer, intent(in) :: n
    real(8), intent(in), dimension(n) :: x, y, z
    real(8), intent(in) :: tgt(3)
    real(8), intent(out) :: result(n)
    !f2py depend(n) result
    real(8), dimension(n) :: dummy, temp

    result = <big expensive function involving inputs>

end subroutine
f2py -m compiled_object --fcompiler=gfortran -c my_source.f90

I won’t attempt to cover all the nuances of F2PY here, but there are a couple differences from plain F90 that are worth noting. First, variables are decorated with the intent keyword, indicating whether they are read from the Python program or return outputs. Additionally you can signal data dependencies using the depend directive. In this case x, y, z, tgt are explicit inputs, and n is an implicit input that varies based on the shape of the explicit inputs. We can of course have private variables that exist only within the F90 module, in this case the examples dummy and temp. Finally we have any outputs, like result, that use the depend directive to indicate their returned shape.

Moving from the native interpreted Python to the F90 compiled version already gets us a 120x speedup in this example, but we can also easily take advantage of shared-memory parallelism within the laptop/desktop/node using OpenMP. In this example case the calculations are all independent of each other, so the implementation is almost trivial:

subroutine func(t, x, y, z, tgt, result, n) ! output
    use omp_lib
    implicit none
    integer, intent(in) :: n, t
    real(8), intent(in), dimension(n) :: x, y, z
    real(8), intent(in) :: tgt(3)
    real(8), intent(out) :: result(n)
    !f2py depend(n) result
    real(8), dimension(n) :: dummy, temp
    integer :: i

    call omp_set_num_threads(t)
    !$OMP PARALLEL DO PRIVATE(dummy, temp) SHARED(result)
    DO i = 1, n
        result(i) = <same function, with inputs indexed into using i>
    END DO

end subroutine
f2py -m compiled_object --fcompiler=gfortran --f90flags='-fopenmp' '-lgomp' -c my_source.f90

We include omp_lib to link against libgomp, and add t as an input parameter so we can choose the number of threads to use. We then set the thread count with call omp_set_num_threads() and pass into the parallel region of execution in the code using the !$OMP PARALLEL DO directive. We also indicate that result is memory to be shared by all threads. We then loop over the data in parallel in each of the threads and compute our big expensive function.

The result is a further 3.6x faster using 4 cores in the example and my specific hardware. Less than optimal strong scaling in this case is primarily due to lack of data-reuse. Although we are doing a fairly expensive calculation on our data, the work we do is linearly proportional to the quantity of data. If we have N inputs, we are doing c_1N work (where c_1 is some constant, in this example c~=300 FLOPs). Memory latency and bandwidth has not scaled at nearly the same rate as CPU throughput, and so in cases of weak data-reuse it tends to dominate the behavior. Compare this to say a naive n-body calculation where computation scales as c_2N^2 despite still using just N memory.

All told, our two improvements mean the code runs 435x faster, with a typical run taking 8 seconds instead of almost an hour. Not bad!

If you’d like to play around with the original code that served for the inspiration for this post, it’s available on my Github:
Python driving script
Serial F90 module
OpenMP F90 module
Keep in mind the linked code is early research code, so a bit like peering into a sausage factory I make no claims to it’s quality, readability, or sanity.

Website Updated!

Just a quick note that the website has been updated and some new pages have been added, while old ones have been revamped. Also this makes it so the last blog post isn’t from 5 years ago any more!

I plan to add in some more pages covering my Intro to DG series, organize my talks I’ve given into a tab, and try to sort through my “DbX” notes, code, and files that I have scattered across several machines.

I’d also like to put up a couple more blog posts covering some interesting little side things I’ve worked on/been working on, including: a “fitting puzzle” solver, a neat little Matlab one-liner for Conway’s Game of Life, and a few other things that would be good to get from brain to page.

Polynomial Approximations: Why monomials are bad

Define the concept of an inner product on some domain [a,b] as:

(f,g) = \int_a^b \! f(x)g(x) \, \mathrm{d}x = 0

If we consider vectors in R^n space the inner product is essentially the dot product of the two vectors. Vectors that are orthogonal to each other have a dot product of 0. The concept of orthogonality can be abstracted and extended to vectors in any vector space, so that vectors that have a zero inner product are orthogonal.

Some arbitrary function u(x,t) on the domain I_k =[x_{k-^1\!/_2},x_{k+^1\!/_2}] may not permit an analytical representation and so most be approximated in some way. An easy choice is a polynomial approximation of order M. A natural choice for the polynomial basis is the monomials, \psi_m(x) = x^{m-1}. These are satisfactory for low orders, but we encounter a problem at higher orders.

If we normalize the monomial basis with respect to the l^2 norm \|f\|=\sqrt{(f,f)} we arrive at \bar{\psi}_m(x) = \sqrt{2m+1}x^m . If the inner product of two basis functions is 0, they are orthogonal and therefore linearly independent. If many of our basis functions end up linearly dependent it is difficult to reconstruct the solution approximation.

Consider the inner product of two adjacent basis functions in our normalized monomial basis:

(\bar{\psi}_{m-1},\bar{\psi}_m) = \int_0^1 \! \sqrt{2m-1}x^{m-1}\sqrt{2m+1}x^{m} \, \mathrm{d}x = \sqrt{1-\frac{1}{4m^2}}

It is easy to see that for even moderate values of m the inner product is close to 1, indicating near linear dependence between basis functions.

It is therefore important to choose a polynomial basis of (ideally) orthogonal functions so that the solution approximation is well conditioned. One good choice is the Legendre polynomials P_m which are orthogonal on the interval [-1,1] (these are defined as the solutions to the Sturm–Liouville equation for a weighting function w(x)=1 ). They can be defined using a recurrence formula or the explicit Rodrigue’s formula.

P_m(x) = \frac{(2m-1)xP_{m-1}(x) - (m-1)P_{m-2}(x)}{m}

P_m(x) = \frac{1}{2^m m!}\frac{\mathrm{d}^2}{\mathrm{d}x^m}\left[(x^2-1)^m\right]

Orthogonality means that the inner product of two Legendre polynomials is zero if they are not of the same order, this can be expressed using the Kronecker product as

(P_n,P_m) = \int_{-1}^1 \! P_n P_m \, \mathrm{d}x = \delta_{nm} \left(\frac{2}{2m+1}\right)

Note that we could use a similar orthonormal version of the basis such that every inner product is simply the Kronecker product, however one nice property of the current form is that P_m(1)=1 and P_m(-1)=(-1)^m . In the orthonormal case we would need to explicitly evaluate P_m(x) at the endpoints for every m (it is important to be clear with what one means by normalization in this case: orthonormal with respect to the l^2 norm or normalized with respect to the maximal value of P_m(x) on [-1,1] ).

The Legendre polynomials are orthogonal on the domain [-1,1] however we would like to use them on the domain of interest. We can define a mapping that will transform the domain of interest to that of the Legendre polynomials, \tilde{x} = \frac{2(x-x_{k-^1\!/_2})}{x_{k+^1\!/_2}-x_{k-^1\!/_2}}-1 to transform I_k to [-1,1] . Using this we can define a variant of the shifted Legendre polynomials such that \tilde{P}_m(x) = P_m(\tilde{x}(x)) . The newly defined shifted Legendre polynomials are now orthogonal over I_k  and the inner product becomes

(\tilde{P}_n,\tilde{P}_m) = \int_{x_{k-^1\!/_2}}^{x_{k+^1\!/_2}} \! P_n(\tilde{x}) P_m(\tilde{x}) \, \mathrm{d}\tilde{x} = \delta_{nm} \left(\frac{\Delta x}{2m+1}\right)

Using our orthogonal Legendre polynomial basis we can now construct an approximating solution u for arbitrary order M using basis functions \overset{m}{\psi}(x) \in \tilde{P}_m and basis weights \overset{m}{a}(t)

u = \sum_{m=0}^M \overset{m}{a}(t)\overset{m}{\psi}(x)

This approximation is well conditioned for arbitrary orders and has the additional benefit that it is easy to evaluate integrals and derivatives of it.

Where the orthogonal basis really shines however is when one needs to find the integral of the product of the approximation (or it’s derivative) with some other function (or it’s derivative) using the same orthogonal basis for the approximation. Usually the integrand of the two approximations can be calculated analytically and presents a very succinct closed form.

For example to solve the 1D the scalar conservation law PDE

\frac{\partial u}{\partial t} + \frac{\partial f(u)}{\partial x} = 0

could be solved using a N^{th} order spatial modal Discontinuous Galerkin method, so the the weak form becomes

\sum_{m=0}^M \left[ \frac{\mathrm{d}\overset{m}{a}}{\mathrm{d} t} \int_{I_k}\! \overset{m}{\psi} \, \overset{n}{\phi} \,\mathrm{d}x \right] + [g(v^-\!,v^+) \; \overset{n}{\phi}] \Big\rvert_{x_{k-^1\!/_2}}^{x_{k+^1\!/_2}} \!-\! \int_{I_k}\! f(v) \,\frac{\mathrm{d} \overset{n}{\phi}}{\mathrm{d} x} \,\mathrm{d}x = 0\;\;for \; 0 \leq n \leq N

Using an orthogonal Legendre basis approximation for both the solution and test function (a Galerkin approach) simplifies the expression greatly to

\frac{\mathrm{d}\overset{n}{a}}{\mathrm{d} t}\left(\frac{\Delta x}{2n+1}\right)= \!\! \sum_{\underset{by \: 2}{m=n-1}}^0 \overset{m}{a}_k-\sum_{m=0}^M \left[ \overset{m}{a}_k - \overset{m}{a}_{k-1}(-1)^m \right]\;\;\; for \; 0 \leq n \leq N

Which is then readily solved using an explicit method of lines approach for time discretization. Without an orthogonal basis approximation the integrands would all need to be calculated using quadrature and the routine would be far more expensive to perform.

Project Euler 5: Smallest Multiple

2520 is the smallest number that can be divided by each of the numbers from 1 to 10 without any remainder. 
What is the smallest positive number that is evenly divisible by all of the numbers from 1 to 20?

A naive implementation of this in Matlab (especially with for() loops) exceeded the 1 minute time limit. After playing around some with the LCM of 1-10 to arrive at the provided answer of 2520, I came to the following algorithm (which has been previously posted in the Project Euler solution forum in similar but less clear forms).

All numbers have a prime factorization. Create a set from the prime factorization, but allow repeats of a prime factor to be distinct from each other. The LCM of a set of numbers has a prime factorization that is equal to the union of the prime factorization sets of each number. Pseudo-code implementation is as follows:

  1. Find all primes 1 to N inclusive, call this P. -This can be done most efficiently for large N using a sieve.
  2. Form a subset of P of primes < sqrt(N), call this S.
  3. For all primes in S find one less than the largest power that still occurs in 1-N inclusive. -This can be done easily by calculating floor( log(N)/log(S) )-1
  4. The LCM is the product of all primes in P times the product of all primes in S raised to their max_exponent-1 found in step 3.

N.b. We need only need to find the max exponents of the primes in S, not P, because any prime greater than the sqrt(N) will not have at least a square <N so we can skip it.

The Matlab code is quite straightforward. I use the Matlab function primes() to generate all primes to N. For N=20 the LCM fits within a 64-bit double or int. For N>42 the LCM is too large to fit in an int, and the accuracy starts to degrade as a double. To avoid these issues, I represent the result symbolically using the Symbolic Math Toolbox.

The code is very fast and able to calculate N=10^7 under the 1 minute time limit, the result is 4342311 digits long!

Check my Github repo for the code.

Vectorization isn’t *Always* the Best Approach in Matlab

Usually if you are using for() loops in Matlab for code “hotspots”, you are doing something wrong. The fastest way to solve large systems is usually to “vectorize”. Essentially you unroll your for() loop into a vector or matrix and use sneaky linear algebra tricks to solve the whole shebang in one go.

Things start to break down when you are trying to solve data in multidimensional arrays though. If a Matlab function specifically operates on 2D matrices (say for example backslash: ‘\’) and you pass it a stack of 2D arrays it will choke. You would expect it to elegantly flatten it and solve, but I haven’t found a built in way to do this.

So say you’d like to calculate the gradient of the equation of a plane prescribed by 3 points in R^3 for many such planes. I thought I would be clever and flatten these out into a large 2D sparse array and solve the whole thing in one go. I wanted to see how much faster (hah!) this would be then naively iterating through them one by one in a for() loop.

Well take a look for yourself:MVectTest

Turns out it is 81 times faster to use that for() loop, significantly less complex, and much more readable!

Exporting Video from Matlab

It’s rather useful to be able to export video from Matlab of say an animated plot for later sharing. I found this SO thread rather helpful.

I had the best luck with the VideoWriter method, generic code is as follows:

%# figure
Z = peaks; surf(Z);  axis tight
set(gca,'nextplot','replacechildren','Visible','off');%# preallocate
nFrames =20;
mov(1:nFrames)=struct('cdata',[],'colormap',[]);%# create movie
for k=1:nFrames
   surf(sin(2*pi*k/20)*Z, Z)
   mov(k)= getframe(gca);end
close(gcf)%# save as AVI file,and open it using system video player

Depending on what you actually want to capture, you may need to choose a different handle, in my case I used the handle of the figure itself so I could also get the axes and the text I used for output displays.

Using this technique, and this code, I was able to make a video of my prototype 1D DG code.

Non-linear ODEs

Came across this in my applied math grad course. The correct answer is incredibly simple, the equation of a circle. But in order to arrive at that solution a fair bit of geometry, algebra, and solution of a non-linear ODE is required.

A curve passing through (1,2) has the property that the length of a line drawn between the origin and intersecting perpendicular with the normal extending from any point along the curve is always numerically equal to the ordinate of that point on the curve. Find the equation of the curve.

Put another way: “Any curvature has a normal and a tangent. Draw the normal out and draw a line passing from the origin that is perpendicular to that (this line will be parallel to the tangent). The length of this line is equal to the ‘y’ of the point from which the normal emanates.”

Here’s a crude diagram:

Solution strategy is as follows:

We have some function y=f(x). We pick any point on f, and call that point (x,y) and draw a normal to the function at that point. We then draw a line from the origin such that it intersects with the normal perpendicularly at a point (a,b). The length of the vector (a,b) is equal to the ordinate of the point on the function from which the normal emanates, ‘y’. So that |(a,b)| = y.

What function satisfies this condition, and also passes through the point (1,2)?

Here’s the completed solution: BLAM!

Note the necessity of solving a non-linear (!) first order ODE to get the function! This is a special case ODE that has a ready analytic solution and is referred to as a Bernoulli ODE. It takes the general form dy/dx + P(x)y = Q(x)*(y^n).
Here’s a plot of the solution function:

Source: Schaum’s Advanced Mathematics for Engineers and Scientists, Ch2 Prob 83

Orthogonal Functions

After consulting numerous sources I finally found something that clearly and satisfactorily explains the details of orthogonal functions. The concept is the easy part: Two vectors are considered orthogonal if their dot product is zero.

The tough part is the details: This can be generalized to functions, with an inner product of a vector space taking the place of the dot product. How you define this inner product defines the orthogonality conditions and is dependent on the vector space. This is where the useful source comes in:

I have also attached the PowerPoint file should it be lost to the sands of time.
What made it click for me is him drawing direct analogues between the various pieces of the dot product, and the inner product. Take Legendre polynomials as an example: instead of Cartesian 3-space directions (i, j, k) your basis is the the monomial power (1, x, x²) in polynomial “space”. In general for real functions over the domain [a,b] the inner product is:
Restricting the domain to [-1,1] yields the Legendre polynomials.
These ruminations stem from attempting to solve problem 7.42 in Schaum’s Advanced Mathematics.
Given the functions  where a_0, a_1+a_2x, a_3+a_4x+a_5x^2 where a_0,\ldots,a_5 are constants.  Determine the constants so that these functions are mutually orthogonal in (-1,1) and thus obtain the functions.

Attached: Orthogonal