When you can, build a list rather than pick it out of a larger list

Photo by Glenn Carstens-Peters on Unsplash

Computers are so fast these days that mildly inefficient algorithms generally don’t cause any noticeable problems. It takes a really awful, inefficient algorithm to cause any readily noticeable performance degradation.

One kind of mildly inefficient algorithm is one that picks out from a list elements to create a smaller list in situations when it would make more sense to simply build the list from scratch.

Sometimes you can build a list from scratch, sometimes you can’t. Basic arithmetic provides plenty of concrete examples of both situations.

Suppose we’ve been tasked with writing a function that makes a list of positive integers of the form 4n + 1 up to some threshold specified at runtime.

For example, given a threshold of 50, our function should give a list consisting of these numbers: 1, 5, 9, 13, 17, 21, 25, 29, 33, 37, 41, 45, 49 (see A016813 in the OEIS).

We’re going to write precisely such a function in Java using the principles of test-driven development. We’ll also look at Scala a little bit.

Test-driven development might seem like overkill for a simple exercise like this one. Here, though, and for other problems, it gives us the benefit of being able to explore different solutions very quickly.

A given solution might be correct, or it might be subtly wrong (e.g., 1, 5, 9, …, 49, 53 for a threshold of 50), or it might be way off (e.g., −50, 50, −50, …). We can find out quickly enough by running the tests.

So, for this exercise, I would start out by writing a stub that returns an empty list regardless of the specified threshold. Then I would write a test that gives a threshold of 50 and expects 1, 5, 9, …, 49.

You may prefer to write the test first and then the stub. Either way, the first time we run this test, it should fail (if you try to run it but it doesn’t run due to a compilation error, it doesn’t count, in my opinion).

The easiest way to get our first test to pass would be to have the function always return the 1, 5, 9, …, 49 list regardless of the threshold.

Next, we might write a test that comes up with a pseudorandom threshold between, say, 501 and 1000, then checks the returned list to make sure it starts with 1, ends with the appropriate number, and in between, any two consecutive numbers have a difference of 4.

Make sure that now the test with the threshold of 50 passes but the test with the random threshold fails (from this point on I’ll just write “random” to mean “pseudorandom”).

Now it’s time to rewrite the function so that it actually does something with the threshold parameter. In Java, our first attempt might look something like this:

Do choose a better identifier than a bunch of underscores. Even with a better identifier, this function still looks painfully clumsy. It looks like it’s doing a lot more work than it needs to.

Actually, though, in a real world scenario my first attempt would be a little better than that, I think yours would be also. With that one I was going out of my way to write something that would be very inefficient.

For instance, it makes little sense to add 0 to the list of numbers if we’re just going to remove it later. Worse, roughly 75% of the numbers added to the list will be removed later. That’s unnecessary.

And yet, on my obsolete computer, this clumsy algorithm passed the threshold of 50 test in just 17 milliseconds and the random threshold test in just 7 milliseconds (the random threshold was 614).

Streamlining the algorithm, we might come up with something like this:

In a run on my computer, the threshold of 50 test passed in 14 milliseconds and the random threshold test passed in 11 milliseconds (the random number for the threshold was 687).

With a somewhat better algorithm, how could it be that one test was slightly faster but one test was slightly slower? These timings might say more about what else was going on in my computer at the time than about this one function’s performance.

Not even half a second for both tests with either algorithm. In the Scala REPL, it would take a lot longer just to get the Java Virtual Machine and the Scala Build Tool started than to execute a couple of “for comprehensions”:

The limitation here is how fast you can type. And remember: you can use the Up arrow keyboard shortcut to repeat a command you’ve given before in the current session.

Likewise, in Scastie, it would probably take a lot more time to send the commands from your Web browser to the server and for the server to send back the results than to actually execute the commands.

Despite these speeds, it should be clear by now that it is still inefficient to iterate through the integers from 1 to the threshold and pick out the numbers of the form 4n + 1, even though we’re not adding to the list numbers that we’ll just remove later.

It is better to iterate through only the integers we will actually add to the list, thereby eliminating the need to check each integer modulo 4.

Aside from the practically meaningless identifier, this is a huge improvement. It is cleaner and more concise. In Scala it gets even more concise:

You can put comments in the Scala REPL. It’s a feature that’s of use only to people writing tutorials.

So these are all pretty fast. But unless you look at the timings reported by JUnit, you might not actually notice any difference whatsoever.

In one run, the threshold of 50 test took 13 milliseconds and the random threshold test took 2 milliseconds (with a threshold of 501). And in another run, the threshold of 50 test took 17 milliseconds and the random threshold test took 3 milliseconds (with a threshold of 520).

It takes our brains longer to read the timings and compare them to previous timings than it takes the Java Virtual Machine to run through even the clumsy algorithm with all the removes.

We should regard these timings as imperfect measures. Improving our algorithm definitely improved performance, but the performance gain is noticeable only when we analyze test timings.

Performance is not the only reason to optimize an algorithm, however. Maintainability is also an important reason. When your program does more work than necessary, that means more lines of code.

And more lines of code written today could mean more lines of code to review later on. And then your teammates, or consultants, or even yourself, might become confused as to whether there was a legitimate reason to do something a certain way when apparently there was an easier way to do it.

For the sake of performance and maintainability, in situations where you can build a list from scratch, you should probably build that list from scratch.

You can’t always do that, however. For example, consider a function that gives prime numbers of the form 4n + 1. There are a few different ways we could go about this, though I’m pretty sure in this case there is no way to build this list from scratch.

The worst way we could go about this is to create a list of all the integers from 0 to the threshold and then remove the ones that are not prime or not of the form 4n + 1.

First though, a test and a stub. The stub can just return an empty list regardless of the threshold, just like in the previous exercise.

The threshold of 50 test should be easy to adapt with just a few deletions (it expects the list 5, 13, 17, 29, 37, 41, see A002144 in the OEIS) and one change (call our primes of the form 4n + 1 function instead of the general numbers of the form 4n + 1 function).

The random threshold test will require a little more thought and much more careful changes, and maybe a digression to test and write primality checking functions.

Suppose that someone else has already created for us the functions sieve() and isPrime() for us. The former creates a list of primes up to a specified threshold, the latter is a Boolean function that determines whether a given number is prime or not.

Further suppose that the author of sieve() and isPrime() has already tested those for correctness and efficiency, and the source for both of those is closed off from our tinkering.

We might use both sieve() and isPrime() in our random threshold test, but our clumsy first attempt will only use the latter. Once you have a first fail for the stub that just returns an empty list, we can move on to our clumsy algorithm.

The threshold of 50 test passed in 5 milliseconds, while the random threshold test passed in 95 milliseconds, with 5050 for a random threshold. Overall, the whole test suite took half a second.

We now come to a fork in the road. Two different ways to improve our clumsy algorithm suggest themselves: we could use our numbers of the form 4n + 1 function to make a list of numbers of that form, and then pick out the primes from that, or we could use sieve() to come up with a list of primes and then pick out the numbers of the form 4n + 1 from that.

My gut feeling is that the former is better than the latter, but I can’t say for sure yet. Since we have the tests in place, we can just try both ways and see if there’s any significant difference in performance. First, then:

Hmm… threshold of 50 test, 31 milliseconds; random threshold (6502) test, 109 milliseconds. It seems to have actually done worse than the clumsy algorithm.

Alright then, let’s try the other way.

Whoa, it seems to perform even worse than the previous two: still 31 milliseconds for the threshold of 50 test, but the random threshold (7096) test took 325 milliseconds — almost a third of a second.

We shouldn’t put too much stock into these timings, since they can vary even when we don’t make any changes to the source or the tests.

It could have been the case that there was a lot more going on in my computer when I ran the test for the improved algorithm than when I ran the test for the clumsy algorithm. We need another way to reason about efficiency.

It is a well known mathematical fact that the prime numbers thin out as the numbers get larger, even though there are infinitely many primes.

Obviously the density of numbers of the form 4n + 1 remains constant. But as n gets larger, 4n + 1 is less likely to be prime. It is also less likely for 4n − 1 to be prime.

Even if sieve() is an implementation of the sieve of Eratosthenes, it is still more efficient than using isPrime() on numbers of the form 4n + 1, because sieve() removes more numbers from consideration earlier, and before handing the list over to our program.

For example, given that 461 is prime, we see that:

  • 465 is obviously divisible by 5,
  • 469 looks like it could be prime but it’s divisible by 7,
  • 473 is divisible by 11,
  • 477 is divisible by 3,
  • 481 is divisible by 13,
  • 485 is obviously divisible by 5,
  • 489 is divisible by 3,
  • 493 is divisible by 17,
  • 497 is divisible by 7,
  • 501 is divisible by 3, and
  • 505 is obviously divisible by 5.

With a threshold of 505 or greater, using isPrime() would mean our function has to spend a short but nonzero span of time checking that 465, 485 and 505 are not prime, and likely more time than the others checking that 493 is not prime either.

Using sieve() instead, our function would instead have to look at the primes 463, 467, 479, 487, 491, 499 and 503. All the obvious multiples of 5, as well as the less obvious multiples of small primes, would have been discarded already.

But a single division operation for each of the seven primes between 462 and 504 would demonstrate they’re of the form 4n − 1 rather than 4n + 1, whereas it might take seven divisions just to prove 493 is composite.

So now I think that on a closed system devoted exclusively to bench-marking this task, we would find that it’s better to first come up with a list of primes and then pick out the ones of the form 4n + 1.

Although I don’t have timings to back it up, I assert that on an “open” system, it is still better to pick the numbers of the form 4n + 1 out of a list of primes than to pick out the primes from a list of numbers of the form 4n + 1.

Still, the timings help make my overall point here. If there was a way to build the list of primes of the form 4n + 1 from scratch rather than carve it out of another list, it would probably be faster and more efficient.

And if not, we might have to assess trade-offs between different lists to pick out of for our new list.

is a composer and photographer from Detroit, Michigan. He has been working on a Java program to display certain mathematical diagrams.