Sorting short sequences with Enumerable#sort_by

A deep-dive into an unexpected ordering

Posted by Owen Stephens on May 30, 2023

Sorting Ranges

Here are some seemingly-redundant uses of sort_by:

>> RUBY_PLATFORM
=> "arm64-darwin22"
>> RUBY_VERSION
=> "3.2.2"
>> (1..5).sort_by { nil }
=> [1, 2, 3, 4, 5]
>> (1..6).sort_by { nil }
=> [1, 2, 3, 4, 5, 6]
>> (1..7).sort_by { nil }
=> [1, 2, 3, 4, 5, 6, 7]

given the above, it seems reasonable to expect the next expression in the sequence to continue the pattern and equal [1, 2, 3, 4, 5, 6, 7, 8].

Let's check:

>> (1..8).sort_by { nil }
=> [8, 2, 3, 4, 5, 6, 7, 1]

...huh! I wonder where that particular behaviour comes from?

The short story of finding out the answer is the subject of this blog post.

Why are we looking at this?

While testing some code that enqueued asynchronous jobs in a certain order, I was surprised to find that a test that passed on my machine was failing in our CI runner.

The RSpec test in question looked something[1] like this:

it "enqueues jobs in the expected run_at order" do
  jobs = enqueue_jobs(...)

  job_args_ordered_by_run_at = jobs.sort_by(&:run_at).map(&:arg)

  expect(job_args_ordered_by_run_at).to eq([1, 2, 3, 4, 5, 6, 7, 8])
end

where each job instance has a run_at time and a single arg. Our test is asserting that the job args will be processed in order of (ascending) run_at.

Surprisingly, the test failed in CI[2] with:

expected: [1, 2, 3, 4, 5, 6, 7, 8]
     got: [8, 2, 3, 4, 5, 6, 7, 1]

What could be going on? I reran the test locally and it passed. Hmm... the main difference between my machine and the CI runner was host platform; specifically, macOS vs Linux.

Discovering the key documentation line

After some local debugging using byebug I realised that the run_at values were (unexpectedly) all the same nil value. This meant that sort_by was attempting to sort based on values that were all the same. A quick read of the sort_by docs leads us to this important line:

The ordering of equal elements is indeterminate and may be unstable.

Aha!

  • "unstable" here refers to the sort not being a stable sort. That is, any pre-existing order of equal-valued objects is not guaranteed to be preserved,
  • "indeterminate" here allows the ordering of equal-valued objects to not be guaranteed to be equal across platforms (or perhaps other context).

In short, by the documentation, if we sort_by the same (equal) value, then Ruby is allowed to return the elements in any particular order, and it may differ when run on macOS vs on Linux.

We can demonstrate the difference in behaviour on Linux using the code snippet from the first section and a podman container; First, launch an irb session inside the container with:

$ podman run --rm -it ruby:3.2.2-alpine irb --prompt simple

and then, evaluate the same expressions as before...

>> RUBY_PLATFORM
=> "aarch64-linux-musl"
>> RUBY_VERSION
=> "3.2.2"
>> (1..5).sort_by { nil }
=> [1, 2, 3, 4, 5]
>> (1..6).sort_by { nil }
=> [1, 2, 3, 4, 5, 6]
>> (1..7).sort_by { nil }
=> [1, 2, 3, 4, 5, 6, 7]
>> (1..8).sort_by { nil }
=> [1, 2, 3, 4, 5, 6, 7, 8]

so no unexpected order of 8 elements on this particular Linux platform.

The next question is therefore: why are we seeing the strange-but-definitely-allowed-by-the-documentation sort_by output on macOS, but not Linux?

Explaining the sort behaviour on macOS

We can follow a chain of functions through the MRI source code to reach a call to the qsort_r[3] library function, which as the name suggests, performs a quicksort.

We can then dig through Apple's Open Source webpage to find the source code of a recent version of qsort_r. As hinted by a comment in that file, for the full detail of this implementation of quicksort, the very readable 1993 paper "Engineering a Sort Function" by Bentley and McIlroy is an excellent guide. However, for this post, we can copy some short excerpts to explain why we see the ordering we do.

First, we can see why ranges with less than 8 elements are not ordered unexpectedly - internally, qsort falls back to insertion sort for short sequences, and insertion sort is stable:

if (n <= 7) {
	/*
	 * For sufficiently small inputs, we'll just insertion sort.
	 *
	 * Pass 0 as swap limit, since this must complete.
	 */
	_isort(a, n, es, thunk, cmp, 0, swaptype_long, swaptype_int);
	return;
}

Next, we see that the first step of quicksort is to find an approximation of the median element of the sequence, by taking a small (3 element) sample and finding the median of that sample.

pm = med3(pl, pm, pn, cmp, thunk);

here pl, pm, pn are pointers to the first, middle and last elements, respectively. cmp is the user-supplied comparison callback and thunk is unimportant for our discussion.

The med3 function is slightly obscured by preprocessor macros:

/*
 * Find the median of 3 elements
 */
static inline char *
med3(char *a, char *b, char *c, cmp_t *cmp, void *thunk
#ifndef I_AM_QSORT_R
__unused
#endif
)
{
	return CMP(thunk, a, b) < 0 ?
	       (CMP(thunk, b, c) < 0 ? b : (CMP(thunk, a, c) < 0 ? c : a ))
	      :(CMP(thunk, b, c) > 0 ? b : (CMP(thunk, a, c) < 0 ? a : c ));
}

but we can simplify by removing some preprocessor macros and ignoring the (unimportant to us) thunk argument, and replacing the generalised[4] CMP with a simple comparison:

return a < b ?
       (b < c ? b : (a < c ? c : a))
      :(b > c ? b : (a < c ? a : c));

then, for example, if a, b, c is 3, 5, 1, then

  1. a < b
  2. not b < c
  3. not a < c

so the median element is 3.

In our surprising example, a, b, c are all the same value, and thus cmp returns 0 for all comparisons. If we follow the ternary operators, this means

  1. not a < b
  2. not b > c
  3. not a < c

and so we use c as the median. Since we passed in (a pointer to) the last element of the array as c, we use this last element as our chosen median.

The next step is to (insert drum roll here) swap the median with the first element of the array!

/*
 * Pull the median to the front, starting us with:
 *
 * +-+-------------+
 * |=|      ?      |
 * +-+-------------+
 * a pa,pb         pc,pd
 */
swap(a, pm);

After that, there are some (albeit relatively involved and slightly obscured) steps that implement the remainder of the quicksort algorithm, but in our case, all fall out as no-ops. Hence we return from qsort_r having only modified the array by swapping the first and last elements, which is exactly the behaviour we observed.

Why did we not see this with Linux?

Our Linux demonstration used the platform "aarch64-linux-musl". If we look at the musl source of qsort we see that it is not actually a quicksort and instead a smoothsort implemented to the same qsort interface. Smoothsort is also not stable, and given that a different sorting algorithm is used, we wouldn't expect to necessarily see the same resultant ordering. For an excellent deep-dive into all the details of implementing smoothsort see this article.

Hot off the press

A very recent PR to MRI has added special handling to sort_by so that if the block returns Integer or Float values then an optimised introsort implementation can be used, rather than calling the platform's qsort_r. In particular, the implementation returns early without making any modifications if the input sequence is already sorted.

In 3.2.2 sort_by with a value of nil or 0 behaves the same:

>> RUBY_VERSION
=> "3.2.2"
>> (1..8).sort_by { nil }
=> [8, 2, 3, 4, 5, 6, 7, 1]
>> (1..8).sort_by { 0 }
=> [8, 2, 3, 4, 5, 6, 7, 1]

but if we install the development version of Ruby 3.3 with rbenv install 3.3.0-dev, then we can infer that introsort is triggered for 0, which exhibits the return-early behaviour:

>> RUBY_VERSION
=> "3.3.0"
>> (1..8).sort_by { nil }
=> [8, 2, 3, 4, 5, 6, 7, 1]
>> (1..8).sort_by { 0 }
=> [1, 2, 3, 4, 5, 6, 7, 8]

It strikes me that the same return-early optimisation might be possible when all elements are exactly the same object (so e.g. nil, true or a frozen String), but I've not investigated further.

Summing up

In this post we took a deep-dive into explaining an unexpected ordering, which was originally encountered due to (accidentally) leaving an attribute set to nil. There is no Ruby bug, the documentation clearly explains that the output we observed is expected, but it was nonetheless a fun task to hunt down why we saw the ordering we did.

Blog post photo by Paul Bergmeir on Unsplash


[1]: In reality the arg values for the job in question are strings containing filenames and not as easy to read as integers from 1 to 8.

[2]: You may ask why the RSpec test had the "wrong" order and the CI test exposed this. Well, the actual arguments were somewhat complex filenames, so I'd generated the expected ordering by running the test locally and copying the ordered values from the error message. I'd roughly confirmed that the ordering was as I expected, but I certainly didn't expect this failure!

[3]: The _r suffix here signifies that an extra pointer is passed through to the comparison callback, allowing the qsort_r function to be reentrant.

[4]: The cmp callback allows qsort_r to sort arbitrary user data, instead of just (say) arrays of integers. As per the qsort_r docs: "The comparison function must return an integer less than, equal to, or greater than zero if the first argument is considered to be respectively less than, equal to, or greater than the second."