Algorithmic complexity and explaining Timsort

Rosuav · December 26, 2022, 9:06am

Timsort is complicated and hard to explain, but a very VERY good algorithm. Whenever I’m asked to explain it, I end up explaining something that’s kinda similar, but isn’t really it. And it’s not easy to explain the true benefits of Timsort when not actually going into all of its many details! So instead, I’m trying to devise a much simpler version of the same sort of ideas (pun intended), and want to check that my analysis of it is correct.

Enter Fracture Sort, a hybrid of merge-sort and selection-sort.

Fracture the array into runs of monotonically-increasing value.
Repeatedly find the run with the smallest head value and remove it to the start of the result array.
Once there’s only one run remaining, it must contain all the largest data values in order, so append it to the result array and return.

Simple enough to explain, particularly with a deck of cards or other physical objects to be sorted. Its worst case is pretty easy to prove: if the data is perfectly backwards, it devolves to a naive selection sort for O(n²). Its best case is also quite easy: for already-sorted data, one O(n) pass in the first step, no work at all in the second, and another O(n) pass at the end, giving an O(n) overall best case.

Pinning down the average case is a little trickier. My analysis is: with random data, there’s equal probability that the value will increase or decrease, meaning that runs will average just 2 values in length; that means there’ll be O(n) runs and a naive selection sort will take O(n²) time. Does that seem correct?

As a small refinement, I replaced the naive selection sort with a heap-based priority selection. I’m not sure how to prove that heapify takes O(n) time, but that’s what the docs say, so I’ll trust it; and heappop and heapreplace are both O(log n), so that should mean that random data can be sorted in O(n log n) time, same as with any other good algorithm.

Does this simplified algorithm capture the broad essence of Timsort? Does the loss of simplicity in going to a heapq destroy the value of this as a teaching tool? Thoughts welcomed!

Here’s the test code I’ve been using:

import heapq

def fracture(arr):
	# 1. Find all runs of nondecreasing values
	runs = []
	last = arr[0]; size = 0
	for i, v in enumerate(arr):
		if v < last:
			# New run. Record the end point and length.
			# For convenience, also record the initial (and lowest) value,
			# and the run index.
			runs.append((arr[i - size], len(runs), i, size))
			size = 0
		size += 1
		last = v
	# The last run has its endpoint just beyond the array.
	runs.append((arr[-size], len(runs), len(arr), size))
	return runs

def fracture_sort_simplistic(arr):
	if not arr: return []
	runs = fracture(arr)
	# 2. Grab the smallest head and move it to the result.
	ret = []
	while len(runs) > 1:
		# The run tuples begin with their lead element, then their
		# position, so the minimum run tuple is the correct head.
		# Note that in a language like C where reconstructing a list
		# is inefficient, this could be replaced with a minindex()
		# search, and the index values inside the runs would merely
		# be tie-breakers. The empty-run removal could then move the
		# last run into the slot that was used by the empty one, and
		# would not need to move any others or renumber them.
		head, idx, end, size = min(runs)
		ret.append(head); size -= 1
		if not size: runs = [(h, i - (i > idx), e, s) for h, i, e, s in runs if i != idx]
		else: runs[idx] = (arr[end - size], idx, end, size)
	# 3. There's only one left. It must be the largest entries.
	_, _, end, size = runs[0]
	ret.extend(arr[end - size : end])
	return ret

def fracture_sort_heap(arr):
	"""The same basic algorithm, but using a heap to improve the search"""
	if not arr: return []
	runs = fracture(arr)
	# 2. Grab the smallest head and move it to the result.
	heapq.heapify(runs)
	ret = []
	while len(runs) > 1:
		# The run tuples begin with their lead element, then their
		# position, so the minimum run tuple is the correct head.
		# By the heap rule, this is at the start of the list.
		head, idx, end, size = runs[0]
		ret.append(head); size -= 1
		if not size: heapq.heappop(runs)
		else: heapq.heapreplace(runs, (arr[end - size], idx, end, size))
	# 3. There's only one left. It must be the largest entries.
	_, _, end, size = runs[0]
	ret.extend(arr[end - size : end])
	return ret

orig = list("3141592653589793")
# positions: 0123456789012345
print(fracture_sort_simplistic(orig))
print(fracture_sort_heap(orig))
print(sorted(orig))