Mapping list elements to ordinals

This is more of a general computer-science question than a Python one, but here goes:

Given a list of numbers, I want to create a new list containing ordinals 0 through (len(list) - 1), such that 0 maps to the smallest element in the original list, 1 maps to the second smallest, etc. For example:

[6, 3, 0, 1, 4] → [4, 2, 0, 1, 3]

[0.56, 80, 114, 7, -1, 9e6] → [1, 3, 4, 2, 0, 5]

If two elements in the original list are equal, the order of their mapped ordinals does not matter:

[9, 8, 11, 8, 20] → [2, 0, 3, 1, 4] or [2, 1, 3, 0, 4] are both acceptable.

The solution I have come up with is to:

  1. Assign an index to each of the elements in the original list.
  2. Sort the values and their indices, using the values as keys.
  3. Replace the values in the sorted list with ordinals.
  4. Sort the list again, this time using the indices as keys.

def mapords(L):
    d = {k: v for k, v in zip(range(len(L)), L)} # Assign an index to each value.
    d = {k: v for k, v in sorted(d.items(), key=lambda i: i[1])} # Sort by value.
    d = {k: v for k, v in zip(d.keys(), range(len(L)))} # Replace values with ordinals.
    return [v[1] for v in sorted(d.items(), key=lambda i: i[0])] # Sort ordinals by indices.
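
For example, with the first list from above:

>>> mapords([6, 3, 0, 1, 4])
[4, 2, 0, 1, 3]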

The bottleneck in this algorithm is obviously the two sort calls, which mean the whole thing runs in O(n log n). That isn't terrible, but I feel like there should be a solution in linear time. After all, once the first sort is done everything is already in order; I “just” need to map the ordinals back to their original positions.

Can anyone think of a faster way to do this?

Here’s one possible way of doing it:

def mapords(L):
    s = sorted(range(len(L)), key=lambda i: L[i])  # indices of L, sorted by value ("argsort")
    o = [-1] * len(L)
    for i, j in enumerate(s):  # the i-th smallest element sits at index j,
        o[j] = i               # so its ordinal is i
    return o
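
For example, with the second list from the original post:

>>> mapords([0.56, 80, 114, 7, -1, 9e6])
[1, 3, 4, 2, 0, 5]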

If the lists can be very long, you may want to use numpy:

import numpy as np

def mapords_np(L):
    s = np.argsort(L)                  # indices that would sort L
    o = -np.ones(s.size, dtype=np.int32)
    o[s] = np.arange(s.size)           # invert the permutation: o[s[i]] = i
    return o

Nice! Those are still sort-based and so run in O(n log n), but by eliminating one of the two sorts they should still be at least twice as fast as my solution. (The ordinal array is just the inverse of the argsort permutation, which the o[j] = i loop computes in a single linear pass.)

I wonder if there is a way to prove the problem can or cannot be solved in linear time? I must admit my compsci-fu is lacking here.

There’s no way this could ever run in linear time, at least not for arbitrary comparable elements: if it could, you could use it to sort the original list in linear time, simply by iterating over the ordinal list and placing each element at the position its ordinal names. That would contradict the Ω(n log n) lower bound for comparison sorting 🙂
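
Concretely, the reduction would look something like this (a minimal sketch; sort_via_ordinals is a made-up name, and mapords stands in for the hypothetical linear-time mapping):

def sort_via_ordinals(L):
    ranks = mapords(L)          # hypothetical O(n) ordinal mapping
    out = [None] * len(L)
    for x, r in zip(L, ranks):  # place each element at the position its ordinal names
        out[r] = x
    return out                  # equals sorted(L), at only O(n) extra cost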

The key-based sort you’re looking at is the best you’ll get, I think.


If the elements are numbers (integers?) and not too large, you could use radix sort or Kirkpatrick-Reisch sort.
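
For instance, when the keys are small non-negative integers, a single counting pass already yields the ordinals in O(n + k) time. A rough sketch of that idea (rank_counting and the max_val bound are my own assumptions):

def rank_counting(L, max_val=100):
    counts = [0] * (max_val + 1)
    for x in L:                  # histogram of the values
        counts[x] += 1
    starts, total = [], 0
    for c in counts:             # prefix sums: starts[v] = number of elements < v
        starts.append(total)
        total += c
    out = []
    for x in L:                  # hand out ordinals in first-come order within ties
        out.append(starts[x])
        starts[x] += 1
    return out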


Thanks for the suggestion! I hadn’t heard of either of these sort methods before, but they look very suitable for my use case. The list elements will indeed be integers between 0 and 100.

And how long are your lists?

Benchmark with a few more solutions, with lists of 10,000 random ints from 0 to 100:

  1.11 ± 0.00 ms  Stefan_linear2
  1.12 ± 0.01 ms  Stefan_linear1
  1.15 ± 0.01 ms  Benedict_np
  1.37 ± 0.02 ms  Stefan_np
  2.47 ± 0.05 ms  Benedict
  2.75 ± 0.01 ms  Stefan2
  3.10 ± 0.02 ms  Stefan1
  8.30 ± 0.06 ms  Alexander

Python: 3.11.4 (main, Jun 24 2023, 10:18:04) [GCC 13.1.1 20230429]

(Note I made the numpy solutions return lists of ints as well.)

Code (Attempt This Online!):

def Stefan_np(L):
    return np.argsort(np.argsort(L)).tolist()  # rank = argsort of the argsort

def Stefan1(L):
    def argsort(L):
        return sorted(range(len(L)), key=L.__getitem__)
    return argsort(argsort(L))

def Stefan2(L):
    r = [*range(len(L))]
    a = sorted(r, key=L.__getitem__)  # argsort of L
    r.sort(key=a.__getitem__)         # argsort of the argsort, done in place
    return r

def Stefan_linear1(L):
    Is = [[] for _ in range(101)]       # one bucket of indices per possible value
    for i, x in enumerate(L):
        Is[x].append(i)
    r = [None] * len(L)
    j = -1
    for I in Is:                        # walk the buckets in value order,
        for j, i in enumerate(I, j+1):  # continuing the ordinal count j across buckets
            r[i] = j
    return r

def Stefan_linear2(L):
    Is = [[] for _ in range(101)]       # same buckets as above
    for i, x in enumerate(L):
        Is[x].append(i)
    r = [None] * len(L)
    c = count()                         # one global ordinal counter
    for I in Is:
        for i, r[i] in zip(I, c):       # the loop target assigns r[i] directly
            pass
    return r

def Alexander(L):
    d = {k: v for k, v in zip(range(len(L)), L)} # Assign an index to each value.
    d = {k: v for k, v in sorted(d.items(), key=lambda i: i[1])} # Sort by value.
    d = {k: v for k, v in zip(d.keys(), range(len(L)))} # Replace values with ordinals.
    return [v[1] for v in sorted(d.items(), key=lambda i: i[0])] # Sort ordinals by indices.

def Benedict(L):
    s = sorted(range(len(L)), key=lambda i: L[i])
    o = [-1] * len(L)
    for i, j in enumerate(s):
        o[j] = i
    return o

def Benedict_np(L):
    s = np.argsort(L)
    o = -np.ones(s.size, dtype=np.int32)
    o[s] = np.arange(s.size)
    return o.tolist()


funcs = Alexander, Benedict, Benedict_np, Stefan1, Stefan2, Stefan_linear1, Stefan_linear2, Stefan_np

import random
import numpy as np
from timeit import timeit
from statistics import mean, stdev
from itertools import count
import sys

n = 10_000
L = random.choices(range(101), k=n)
for f in funcs:
    result = f(L*1)  # pass a fresh copy so no function can mutate the shared list
    # result must be a permutation of 0..n-1, and sorting L's elements by their
    # assigned ordinals must reproduce sorted(L)
    print(sorted(result) == list(range(n)) and [x for _, x in sorted(zip(result, L))] == sorted(L), f.__name__)

times = {f: [] for f in funcs}
def stats(f):
    # mean ± stdev of each function's 10 fastest runs, in milliseconds
    ts = [t * 1e3 for t in sorted(times[f])[:10]]
    return f'{mean(ts):6.2f} ± {stdev(ts):4.2f} ms '

for _ in range(100):
    L = random.choices(range(101), k=n)
    for f in funcs:
        t = timeit(lambda: f(L*1), number=1)
        times[f].append(t)

for f in sorted(funcs, key=stats):
    print(stats(f), f.__name__)

print('\nPython:', sys.version)

Interesting benchmarks! The lists I’m currently working with are typically a few hundred elements long. I expect that to grow by a factor of ten as the technology scales, but probably not beyond 10,000.

I don’t expect I’ll actually need this optimization, but it’s a fun learning exercise 🤓

OK, I’ve updated the benchmark above to use list length 10,000.