# Using Data Structures to Improve the Un-improvable, a Fenwick Tree Christmas Story

tl;dr: Use the right data structure for the job. I implemented a Fenwick tree for Fortran90 with a couple neat features, available here.

Our fictitious story begins with an eccentric billionaire and Christmas aficionado who is holding a contest for the whole USA: they are giving away $1 billion over the span of 1 hour, all people have to do to compete is try and put as many ornaments on a Christmas tree as possible and submit their total to the contest website! If they manage to fit even more on, or if some fall off they need to update their total. You are the intern in the team tasked by the billionaire to handle the drawing of winners. The rest of the team has already set up the website, database, etc. The team reserved this simplest of tasks for the intern, surely you can at least do what they ask! Your boss tells you they expect around 100 million contestants, they will award$250,000 once a second for a full hour. In order to reward contestants with more ornaments on their tree with a better chance of winning they choose winners in the following way: there is an array with an entry for each contestant that contains their total, perform a cumulative sum over this array and pick a random number in the range from 1 to the total number of ornaments, whichever entry in the cumulative sum is closest to the random number is a winner! All you need to do is pick a winner once a second and update any contestants entries if they changed since the second before.

You think a bit about how they are drawing winners, is that fair? After some thought you agree; people with bigger entries are proportionally more likely to be chosen, while those with smaller entries are proportionately less likely. “This implementation should be dirt simple!” you say, and type up a prototype in Fortran90:

use binary_search
use contest_stuff !provides contest related functions by team
integer, dimension(num_contestants) :: contest_entry, cumulative_sum
real :: rand_real
integer :: i,t,rand_int

do t=1,3600 !once a second, for an hour
call wait_and_update(contest_entry) !pulls in from database etc.

cumulative_sum(1) = contest_entry(1)
do i=2, num_contestants
cumulative_sum(i) = cumulative_sum(i-1) + contest_entry(i)
end do

call random_number(rand_real)
rand_int = int(rand_real * cumulative_sum(num_contestants+1))
winner = binary_search(cumulative_sum, rand_int) !returns winner index
call pick_winner(winner) !sends winner out to database etc.
end do

Slick! You even used a binary search over the cumulative sum (you noticed that it will be monotonically increasing since all entries are positive, so you can do this), so it’s cost will be $\mathcal{O}(log \, n)$. You set up a dummy example to test your prototype and run it as fast as you can to see how often it will need to run on one core and go get lunch. You come back after lunch and find it still running! It ends up taking over an hour to run, and that’s without any lag from the server/website etc.

You re-run and profile the code, and are surprised to see that the inner do loop takes 99% of running time! In retrospect this isn’t too surprising, after all the cost of the cumulative sum is $\mathcal{O}(n)$ and $n$ is quite large here (worse still, the overhead of the do loop itself is an appreciable fraction of the total loop cost!). But what room for improvement is there in practically the simplest loop possible?

This is a silly story, but it proves a point. It’s clear that part of the problem is you are reading/writing a single array entry at a time; you could loop unroll and process a vector stride at a time but you’ll find that compiling with -O3 already does this for you. The loop unrolling gave a 3x speedup on my system, but that still leaves the asymptotic complexity untouched.

The Fenwick tree to the rescue! Also known as a binary indexed tree, it gives a fast way to compute cumulative sums and update them based on changes in the original parent array. With our example code we needed to compute the cumulative sum each second, with a cost scaling of $\mathcal{O}(N)$ and a cost of updating of $\mathcal{O}(1)$ per entry. In comparison the Fenwick tree has $\mathcal{O}(log \, N)$ for computing a sum and $\mathcal{O}(log \, N)$ per update.

The Fenwick tree is also quite compact to implement as well, partly because it never actually implements a tree and instead stores it implicitly in a flat array. Here’s some example code to build a tree from a parent array:

pure function build_fen(arr) result(fen)
integer, dimension(:), intent(in) :: arr
integer, dimension(size(arr)) :: fen
integer :: i,j,n

fen = 0
n = size(arr)
do i=1,n
j = i
do while (j<=n)
fen(j) = fen(j) + arr(i)
j = j + iand(j,-j)
end do
end do
end function build_fen

Note: This function uses assumed-shape arrays, as such if you want to use it you either need to give it an explicit interface or stick it in a module. Check links at the top and bottom of this post to my F90 module that implements all this for you.

“But wait!” you say, aren’t there $N$ cumulative sums we need to compute for a total cost of $\mathcal{O}(N log \, N)$? Actually we may need any sum, but the binary search looks at at most $log_2 \, N$ entries. So we only need to compute the sums that the binary search touches each update, giving a total cost of $\mathcal{O}(log^2 N)$ compared to the original $\mathcal{O}(N)$. Neat, but we can do better actually.

It isn’t usually mentioned, but you can actually search the tree in $\mathcal{O}(log \, N)$ time. Rather than search the array by bisecting along it’s indices and computing a full sum each time, you can use the implicit tree structure to bisect on a per-bit basis, accumulate a running total, and compare it to the target you are searching for. Once you’ve fully traversed the tree you will have found the closest tree entry.

There’s another refinement to implementation performance to be had as well. It is often the case that contiguous chunks of updates are made to the parent array. Naive updates would then have a O(M log N) cost for tree size N and M updates. Instead we can break these updates into two regions: inside the contiguous chunk and everything after. As a result we have reduced the cost to O(log N + M log M). For “large” M and N>>M this can translate to a significant savings, but even for M=2 or 3 it is still faster than 2 or 3 naive sequential updates.

When I came across this problem I wasn’t able to find a Fortran90 implementation, so I made one (along with the improvements to chunk updates and binary search I’ve mentioned). You can find it here on my Github!