Premature optimisation might be the root of all evil, but overdue optimisation is the root of all frustration. No matter how fast hardware becomes, we find it easy to write programs which run too slowly. Often this is not immediately apparent: users can go for years without considering a program's performance to be an issue before it suddenly becomes one, often in the space of a single working day.
I have devoted more of my life to optimisation than I care to think about, and that experience has led me to make two observations:
- Human optimism leads us to believe that we can easily know where a program spends most of its time.
- Human optimism leads us to believe that we can easily know how to make the slow parts of a program run faster.
You will not be surprised to learn that I think both forms of optimism are misplaced. Partly this is because, as hardware and software have become more sophisticated, it has become harder to understand their effects on performance. But, perhaps more fundamentally, we tend to overestimate how much we know about the software we're working on. We overemphasise the parts of the system we've personally worked on, particularly those we've most recently worked on, and we downplay other parts of the system, including the impact of dependencies (e.g. libraries).
The solution to the first of these observations is fairly widely known: one should rigorously profile a program before assuming one knows where it is spending the majority of its time. I deliberately say "rigorously profile" because people often confuse "I have profiled a program once" with "I have built up a good model of a program's performance in a variety of situations". Sometimes, a quick profiling job is adequate, but it can also mislead. Often it is necessary to profile a program with different inputs, sometimes on different machines or network configurations, and to use a variety of sampling and non-sampling approaches [1].
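For instance, in Python a quick first pass might use the built-in cProfile module, a deterministic (non-sampling) profiler, before reaching for a sampling profiler. In this minimal sketch, work and its input are stand-ins for whatever program is really under investigation:

# A quick, deterministic (non-sampling) profile with Python's built-in
# cProfile, sorted by cumulative time to show where the call tree spends
# its time. "work" is a stand-in for the real program being investigated.
import cProfile
import pstats
import random

def work(l):
    return sorted(x * x for x in l)

l = [random.random() for _ in range(100_000)]
cProfile.run("work(l)", "profile.out")
pstats.Stats("profile.out").sort_stats("cumulative").print_stats(10)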
However, the multiple possible solutions to the second observation, and their inevitable trade-offs, are, I believe, underappreciated. I tend to think that there are four main solutions:
- Use a better algorithm.
- Use a better data-structure.
- Use a lower-level system.
- Accept a less precise solution.
In the rest of this post I'm going to go through each of these and give some suggestions for the trade-offs involved.
Use a better algorithm
Let's imagine (and I've genuinely seen this happen!) that after careful profiling of a Python program, I find that I'm spending most of my time in a function which looks like this:
def f1(l):
    while True:
        c = False
        for i in range(0, len(l) - 1):
            if l[i+1] < l[i]:
                t = l[i]
                l[i] = l[i+1]
                l[i+1] = t
                c = True
        if not c:
            return l
It's a bubble sort! At this point, many people will start guffawing, because it's an obviously slow way of sorting elements. However, bubble sort has an often-forgotten advantage over many "better" algorithms: it runs in constant memory [2]. I could gamble that my program doesn't need to use constant memory, but if I'm unsure, I can use an alternative algorithm which preserves this property. Let's try a selection sort:
def f2(l):
    for i in range(0, len(l) - 1):
        m = i
        for j in range(i + 1, len(l)):
            if l[j] < l[m]:
                m = j
        if m != i:
            t = l[i]
            l[i] = l[m]
            l[m] = t
    return l
If I use this quick testing code:
import random, time

l = [random.random() for _ in range(1000)]
before = time.time()
l1 = f1(l[:])
print(time.time() - before)
before = time.time()
l2 = f2(l[:])
print(time.time() - before)
and run it on CPython 3.11 on a Linux server I consistently get timings along the lines of:
0.0643463134765625
0.020025014877319336
In other words, selection sort is about three times faster than bubble sort in my test.
You don't need me to tell you that selection sort isn't the fastest possible sorting algorithm, but "fastest" is a more slippery concept than it first appears. For example, the selection sort algorithm above is faster than the bubble sort for random data, but the bubble sort is much faster for sorted data [3]. The relationship between inputs and algorithmic performance can be subtle. Famously, if you choose an unfortunate "pivot" when implementing quicksort, you'll find that it is very non-quick (e.g. you can make it as slow on already-sorted data as the selection sort above).
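As a sketch of one such unfortunate choice, consider a quicksort which always pivots on its first element (naive_quicksort is purely an illustrative name): on random data it is quick, but on already-sorted data every partition is maximally lopsided.

# A deliberately naive quicksort that always picks the first element as its
# pivot. On already-sorted input every partition puts all remaining elements
# on one side, so the running time degrades to O(n^2) comparisons (and, for
# large sorted inputs, the recursion depth exhausts Python's default limit).
def naive_quicksort(l):
    if len(l) <= 1:
        return l
    pivot = l[0]
    smaller = [x for x in l[1:] if x < pivot]
    larger = [x for x in l[1:] if x >= pivot]
    return naive_quicksort(smaller) + [pivot] + naive_quicksort(larger)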
We can generalise from this that "use a better algorithm" requires understanding the wider context of your system and the nature of the algorithm you're thinking of using. For example, I've often seen people conflate an algorithm's best-case, average-case, and worst-case performance, but the differences between those three pieces of information can be vital when I'm optimising a program. Sometimes I might know something about my program (e.g. the nature of its inputs) that makes me confident that the worst case can't happen, or I don't consider the worst case to be a problem (e.g. it's a batch job and no-one will notice occasional latency). But, generally, I care more about the worst case than the best case, and I select algorithms accordingly.
It's also not uncommon that algorithms that have good theoretical performance have poor real-world performance (big O notation can hide many sins). If in doubt, I try gradually more test data until I feel I have truly understood the practical consequences of different choices.
It's also easy to overlook complexity. Fundamentally, faster algorithms are faster because they observe that some steps in a calculation can be side-stepped. I can still remember the first time I read the description for timsort: the beauty of its algorithmic observations has stayed with me ever since. But verifying those observations is harder than we imagine; even timsort, created by one of the greatest programmers I have ever come across, had a subtle bug in it [4].
When we mortals implement faster algorithms, they are often slightly incorrect, particularly when newly implemented, either producing wrong results or not having the expected performance characteristics [5]. For example, parallelising an algorithm can often lead to huge speedups, particularly as CPUs gain more cores, but how many of us understand the C11 memory model well enough to feel confident of the consequences of parallelisation?
The combination of (in)correctness and the difficulty in understanding the context in which an algorithm is fast means that I frequently encourage people to start with a simple algorithm and only move to something "faster" if they really find they need to. Picking (and, if necessary, implementing) the right algorithm for the task at hand is a surprisingly difficult skill!
Use a better data-structure
Let's imagine that I profile another program and find that I spend most of my time in the following function:
def f3(l, e):
    for x in l:
        if x == e:
            return True
    return False
It's an existence check function! Optimising these can be quite interesting, because my choices will depend on how the lists passed to this function are used. I could change the list to a binary tree, for example. But if I can tell, as is not uncommon, that we repeatedly check for the existence of elements in a list that is never mutated after initial creation, I might be able to get away with a very simple data-structure: a sorted list. That might sound odd, because "sorted list" doesn't sound like much of a data-structure, but that then allows me to do a binary search. For anything but the smallest lists [6], binary search is much quicker than the linear search above.
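As a sketch of that approach, Python's standard bisect module does the binary search for me (f4 just continues the naming above, and it assumes its list really is sorted and never mutated):

# Existence check on a sorted, never-mutated list via binary search.
# bisect_left finds the leftmost position at which e could be inserted while
# keeping the list sorted; e is present iff that position already holds e.
from bisect import bisect_left

def f4(sorted_l, e):
    i = bisect_left(sorted_l, e)
    return i < len(sorted_l) and sorted_l[i] == e

Provided the sort happens once up front, each lookup then costs O(log n) comparisons rather than the O(n) of f3.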
Just as with "use a better algorithm", "use a better data-structure" requires careful thought and measurement [7]. In general, while I often find it necessary to implement my own "better algorithms", I rarely find it necessary to implement my own "better data-structures". Partly this is laziness on my part, but it's mostly because data-structures are more easily packaged in a library than better algorithms [8].
There is an important tactical variant on "better data-structures" that is perhaps best thought of as "put your structs/classes on a diet". If a program is allocating vast numbers of a given struct/class, the size of that struct/class in bytes can become a significant cost in its own right. When I was working on error recovery in grmtools, I found that simply reducing the most commonly allocated struct by 8 bytes in size improved total program performance by 5%, a trick that, from memory, I repeated twice!
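The details of that trick are language specific. As a rough Python analogue (grmtools is Rust, so this is only an analogue, not what happened there), giving a heavily allocated class __slots__ removes each instance's __dict__; the sketch below measures the difference with tracemalloc:

# A rough Python analogue of "put your classes on a diet": __slots__ removes
# the per-instance __dict__, shrinking every instance of a heavily allocated
# class. tracemalloc measures total allocation, which sys.getsizeof on its
# own would under-report.
import tracemalloc

class Fat:
    def __init__(self, x, y):
        self.x = x
        self.y = y

class Slim:
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y

for cls in (Fat, Slim):
    tracemalloc.start()
    objs = [cls(i, i) for i in range(100_000)]  # keep objects alive while measuring
    size, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(cls.__name__, size)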
There are many similar tactics to this, for example reducing "pointer chasing" (typically by folding multiple structs/classes into one), encouraging memory locality, and so on. However, while it's easy to measure the size of a struct/class and how often it's allocated, it's difficult to measure the indirect impact of things like memory locality: I have heard such factors blamed for poor performance much more often than I have seen such factors proven as responsible for poor performance. In general, I only look to such factors when I'm getting desperate.
Use a lower-level system
A time-honoured tradition is to rewrite parts of a program in a lower-level programming language. Let's rewrite our Python bubble sort into Rust:
use std::cmp::PartialOrd;

fn f1<T: Copy + PartialOrd>(l: &mut Vec<T>) {
    loop {
        let mut c = false;
        for i in 0..l.len() - 1 {
            if l[i + 1] < l[i] {
                let t = l[i];
                l[i] = l[i + 1];
                l[i + 1] = t;
                c = true;
            }
        }
        if !c {
            return;
        }
    }
}
I mildly adapted my Python program from earlier to save out 1000 random floating point numbers, and added this testing code in Rust:
use std::{env::args, fs::read_to_string, time::Instant};

fn main() {
    let mut l = read_to_string(args().nth(1).unwrap())
        .unwrap()
        .lines()
        .map(|x| x.parse::<f64>().unwrap())
        .collect::<Vec<_>>();
    let before = Instant::now();
    f1(&mut l);
    println!("{}", (Instant::now() - before).as_secs_f64());
}
My Rust bubble sort runs in 0.001s, about 60x faster than the Python version. This looks like a great success for "rewrite in a lower-level programming language", but you may have noticed that I titled this section "Use a lower-level system".
Instead of spending 15 minutes writing the Rust code, it would have been smarter of me to recognise that my Python bubble sort is likely to emphasise the weaknesses of CPython (the most common implementation of Python). In particular, CPython will represent what I conceptually thought of as a list of floating point numbers as an array of pointers to individually heap-allocated Python objects. That representation has the virtue of generality, but not efficiency.
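One way to see the cost of that representation is to compare a list of Python floats with the standard library's array module, which stores unboxed C doubles contiguously (a rough sketch; the exact numbers vary by CPython version and platform):

# Compare the memory used by a list of boxed Python floats with an array of
# unboxed C doubles. sys.getsizeof(l) only counts the list's pointer array,
# so the size of each pointed-to float object is added on separately.
import random
import sys
from array import array

l = [random.random() for _ in range(1000)]
list_bytes = sys.getsizeof(l) + sum(sys.getsizeof(x) for x in l)
array_bytes = sys.getsizeof(array("d", l))
print(list_bytes, array_bytes)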
Although it's often forgotten, CPython isn't the only implementation of Python. Amongst the alternatives is PyPy, which just so happens to represent lists of floats as efficiently as Rust. Simply typing pypy instead of python speeds my bubble sort up by 4x! There are few changes I can make that give me such a big performance improvement for such little effort. That's not to say that PyPy runs my program as fast as Rust (PyPy is still about 15x slower) but it may well be fast enough, which is what really matters.
I have seen multiple organisations make the mistake of trying to solve performance problems by rewriting their software in lower-level programming languages, when they would have got sufficient benefit from working out how to run their existing software a little faster. There are often multiple things one can do here, from using different language implementations, to checking that you've got compiler optimisations turned on [9], to using faster libraries or databases, and so on. Sometimes rewriting in a lower-level programming language really is the right thing to do, but it is rarely a quick job, and it inevitably introduces a period of instability while bugs are shaken out of the new version.
Accept a less precise solution
A common problem we face is that we have n elements of something and we want to understand the best subset or ordering of those for our situation. Let's imagine that I've implemented a compiler with 30 separate optimisation passes. I know that some optimisation passes are more effective if they run after other optimisation passes, but I don't know what the most effective ordering of all the passes is.
I could write a program to enumerate all the permutations of those 30 passes, run them against a benchmark suite I possess, and then select the fastest permutation. But there are 30! possible orderings, so if my benchmark suite takes 1 second to run it will take roughly 8 x 10^24 years to evaluate all the possibilities, which is rather longer than the current age of the universe. Clearly I can't wait that long for an answer: I can only run a subset of all the permutations. In situations such as this, I have to accept that I'll never be able to know for sure what the best possible answer is; but, that said, I can at least make sure I end up with a better answer than not trying anything at all.
There are various ways of tackling this but most boil down to local search. In essence, we define a metric (in our running example, how fast our benchmark suite runs) that allows us to compare two solutions (in our case, faster is better) and discard the worse. We then need a way of generating a neighbour of the solution we already have, at which point we recalculate the metric and discard the worse of the old and new solutions. After either a fixed time limit, or when we can no longer find solutions which improve our metric, we return the best solution we've found. The effectiveness of this simple technique (the core algorithm is a few lines of code) tends to stun newcomers, since the obvious problem of local optima seems like it should undermine the whole idea.
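A minimal sketch of that loop for the pass-ordering example might look as follows. Here benchmark is a hypothetical stand-in for "compile and run the benchmark suite with this ordering and report its wall-clock time", and a real implementation would probably add random restarts or an annealing-style acceptance rule to help escape local optima:

# A minimal local search over optimisation-pass orderings. A neighbour is
# generated by swapping two passes at random; lower benchmark scores are
# better, and the best ordering seen within the time budget is returned.
import random
import time

def local_search(passes, benchmark, budget_secs=60.0):
    best = list(passes)
    best_score = benchmark(best)
    deadline = time.time() + budget_secs
    while time.time() < deadline:
        cand = list(best)
        i, j = random.sample(range(len(cand)), 2)
        cand[i], cand[j] = cand[j], cand[i]
        score = benchmark(cand)
        if score < best_score:
            best, best_score = cand, score
    return best, best_score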
As typically implemented, local search as I've outlined it above produces correct but possibly non-optimal solutions. Sometimes, however, we're prepared to accept an answer which is less precise in the sense that it is possibly "incorrect". By this I don't mean that the program is buggy, but that the program may deliberately produce outputs that do not fully match what we would consider the "full and proper" answer.
Exactly what constitutes "correct" varies from one situation to another. For example, fast inverse square root approximates the reciprocal of a square root: for situations such as games, its fast nearly-correct answer is a better trade-off than a slow definitely-correct answer. A Bloom filter can give false positives: accepting that possibility allows it to be exceptionally frugal with memory. JPEG image compression deliberately throws away some of an image's fine details in order to make the image more compressible. Unlike other image compression approaches, I cannot recover the original image perfectly from a JPEG, but by foregoing a little bit of image quality, I end up with much smaller files to transmit.
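To make the Bloom filter trade-off concrete, here is a small sketch (the sizes are arbitrary; real implementations derive the bit-array size and hash count from the expected number of elements and the acceptable false-positive rate):

# A small Bloom filter sketch: each item sets/tests a handful of positions in
# a fixed bit array. might_contain can return a false positive but never a
# false negative: that imprecision is what buys the tiny memory footprint.
import hashlib

class BloomFilter:
    def __init__(self, num_bits=8192, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.num_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

A "no" answer from might_contain is always right; a "yes" answer is only probably right, and that is precisely the imprecision being traded for memory.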
I think that, in general, most programmers struggle to accept that correctness can sometimes be traded off: personally, it offends a deep internal conviction of mine that programs should be correct. Probably because of that, I think the technique is used less often than it should be.
Recently, though, we've become much more willing to accept incorrect answers thanks to the explosion of ML (Machine Learning). Whereas local search requires us to explicitly state how to create new solutions, ML is trained on previous data, and then generates new solutions from that data. This can be a very powerful technique, but ML's inevitable "hallucinations" are really just a form of incorrectness.
We can thus see that there are two different ways of accepting imprecise solutions: possibly non-optimal and possibly incorrect. I've come to realise that many people think they're the same thing, but possible incorrectness more often causes problems. I might be happy trading off a bit of image quality for better compression, but if an ML system rewrites my code and leaves off a "not" I'm unhappy. My rule of thumb is that unless you are convinced you can tolerate incorrectness, you're best off assuming that you can't.
Summary
I've listed the four optimisation approaches above in order of the frequency with which I've seen them used (from most to least).
It will probably not surprise you that my least favourite approach is "rewrite in a lower-level programming language", in the sense that it tends to offer the poorest ratio of improvement/cost. That doesn't mean that it's always the wrong approach, but we tend to reach for it before we've adequately considered cheaper alternatives. In contrast, I think that until recently we have too rarely reached for "accept a less precise solution", though the ML explosion has rapidly changed that.
Personally, when I'm trying to optimise a program I tend to reach for the simplest tricks first. One thing that I've found surprises people is how often my first attempt at optimisation will be to hunt for places to use hashmaps; only rarely do I go hunting for exotic data-structures to use. I less often turn to clever algorithms. Of those clever algorithms I tend to implement myself, I suspect that binary search is the one I use the most often, and I probably do so at most once or twice a year: each time I implement it, I have to look up the correct way to do so [10]!
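As a trivial sketch of that hashmap hunt (the function names are purely illustrative), the commonest pattern is replacing repeated membership tests on a list, each O(n), with tests on a set built once up front:

# The most common "hashmap hunt": repeated membership tests against a list
# cost O(n) each, while tests against a set built once cost O(1) on average.
def count_hits_slow(haystack, queries):
    return sum(1 for q in queries if q in haystack)  # haystack is a list

def count_hits_fast(haystack, queries):
    s = set(haystack)  # built once
    return sum(1 for q in queries if q in s)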
Ultimately, having written this post, I've come to realise that there are three lessons that cut across all of the approaches.
First, when correctness can be sacrificed for performance, it's a powerful technique, but we often sacrifice correctness for performance unintentionally. When we need to optimise a program, it's best to use the least complex optimisation that will give us the performance we want, because that's likely to introduce the fewest bugs.
Second, human time matters. Because we programmers enjoy complexity so much, it's tempting for us to reach for the complex optimisations too soon. Even if they succeed in improving performance (which they often don't!), they tend to consume much more time than is necessary for the performance improvement we needed.
Third, I think that breadth of optimisation knowledge is more important than depth of optimisation knowledge. Within each of the approaches I've listed in this post I have a couple of tricks that I regularly deploy. That has helped give me a reasonable intuition about what the most appropriate overall approach to my current performance woes might be, even if I don't know the specifics.
Acknowledgements: Thanks to Carl Friedrich Bolz-Tereick and Jake Hughes for comments.
Update (2023-11-14): My original phrasing of a Bloom filter could be read in a way that seemed to be a contradiction. I've tweaked the phrasing to avoid this.
Footnotes
[1] Often, when I'm first looking at performance, I only need a rough sense of what's going on, as I narrow down what parts I need to investigate in detail. I often start with normal Unix time, move to something like multitime, then perhaps a sampling profiler like Linux's perf, before finally using a more detailed profiler like Valgrind. Each of these tools takes a bit more time to run and understand than the previous, which is why I don't start with (say) Valgrind. A harder challenge is what inputs to use. It is always tempting to benchmark a program with a single input, and I fall into that trap often, despite knowing that it is a trap.
[2] In other words: the amount of memory the algorithm uses is not affected by the size or nature of the input.
[3] Formally speaking, the bubble sort is O(n) for sorted data whereas the selection sort is O(n^2). That's a rather big difference!
[4] The initial paper analysing timsort isn't even the end of this story: see this subsequent Python issue for even more!
[5] Carl-Friedrich Bolz pointed out something which I have seen many times, but never thought to classify specifically: it's really easy to implement algorithms in a way that's accidentally quadratic. A classic case is string concatenation in languages where strings are immutable: each concatenation looks harmless from a performance perspective, but if you generate a large string by gradually concatenating a small starting string, you can end up in trouble. This blog has many more examples!
[6] In my experience, binary search becomes faster than linear search after about 8-12 elements.
[7] Indeed, "better data-structures" generally implies "better algorithm" somewhere.
[8] For example, I have implemented some fiercely complex algorithms, but I've never written a tree rebalancer.
[9] I'm continually surprised by how often I see people run software that hasn't been compiled with optimisations.
[10] Rust's binary_search_by method has largely stopped my annual confusion over how to pick the next midpoint correctly.