Making `combinations`, `combinations_with_replacement` and `permutations` indexable

Currently, itertools provides three different combinatorial iterators, iterating over subsets, sub-multisets and permutations of an iterable in lexicographic order. I propose extending them to fully-fledged classes with __getitem__ and index methods (compatible with slicing).

Motivation

  • uniform random sampling from sets of combinations/permutations too large to store directly
  • One could index a combination/permutation into an integer to allow it to be stored in a 1D array instead of 2D list
  • multithreaded/distributed programs could iterate over a search space by slicing it into shards

My implementation

In my pet library dronery (excuse the mess and the idiosyncrasy), I have a permutations class in perms.py, and combinations/combinations_with_replacement in comb.py.

Considerations

In both cases, I included an additional keyword argument to use a different ordering, which I will argue for here.

Reading a simplex by slices

There are three relevant sets whose cardinalities are counted by comb(n+k-1,k); their equivalence is straightforward but they still warrant spelling out here.

  • the set of vectors in k-dimensional space with nonnegative integer coordinates summing to < n
  • the set of nondecreasing vectors in k-dimensional space with elements from range(n)
  • the set of vectors in n-dimensional space with nonnegative integer coordinates summing to exactly k

The second is formed by taking the partial sum of the first, and the third by considering the second as a Young tableau, transposing and taking the differences.

On the OEIS, infinite rectangular arrays are encoded to sequences by reading them along successive antidiagonal lines, and infinite cubic arrays along antitriagonal planes, etc.

Doing likewise for the combinations functions would permit them to accept infinite iterables as their first parameter (which currently neither works nor fails loudly! running >>> islice(combinations_with_replacement(count(),2),0,8) in the shell causes it to become unresponsive and begin rapidly consuming RAM).

This ordering (enabled by the rev argument) is equivalent to reversing the order in which combinations are outputted then replacing each element i inside a combination with n+~i.

>>> list(combinations_with_replacement(range(4),2,rev=True))
[(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
>>> list(combinations_with_replacement(range(4),2,rev=False))
[(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2), (0, 3), (1, 3), (2, 3), (3, 3)]

Could a rev-equivalent argument be added to the official implementation (or automatically switched to whenever encountering a type which throws a has no len() TypeError)?

Lexicographically ordering permutations is not asymptotically optimal

Given a permutation p of length n, the value of the 0st element splits the lexicographic position within Sā‚™ into one of n contiguous bins of length (n-1)!, within which the 1nd element splits it into one of n-1 bins of length (n-2)!, etc.

The kth element adds (range(n) \ p[:k]).index(p[k])*(n-1-k)! to the lexicographic position, but just storing a list that starts as l=range(n) and removing its p[k]th element (l.pop(p[k])) after adding the kth contribution to the sum would take linear time per removal, so quadratic time in total. However, this problem can be rephrased as finding the p[k]th prefix sum of a list that starts as l=[1]*n and has l[p[k]]=0 after each partial sum; this alone merely shifts the linear part each iteration from the pop to the partial sum, but by using a Fenwick tree instead of a list, we make both reassigning values and taking partial sums take logarithmic time, reducing the total to only O(n*log(n)) (linearithmic)! A fenwick class is included in dronery.common and used to make both index and __getitem__ in permutations objects linearithmic.

Myrvold & Ruskey, 2001 explains that Dietz, 1989 gives a ā€˜complicated’ structure which can be used in place of a Fenwick tree to shave off an additional log(log(n)) factor, which I think probably is safe not to implement (since a C Fenwick implementation would be faster than a Python Dietz implementation for all sizes worth considering anyway).

However, the primary purpose of the Myrvold & Ruskey paper is to point out that this is redundant anyway, because if you build the permutation by doing appends-and-swaps instead of insertions (seeding a Fisher-Yates shuffle with a sequence of indexes) you get a linear-time algorithm, at the expense of lexicographicity. (You do gain some nice properties, as mentioned in this page; most notably, the number of 0s in the index (when written factoradically) is the number of cycles in the resulting permutation.)

The __next__ method is fast enough in both cases that this isn’t a consideration (both take asymptotically constant average time; use of a Fenwick tree (with linear-time initialisation and logarithmic-time usage) is overkill, due to Knuth’s algorithm L, see A056542), but the log(n) factor seems undesirable for getting/indexing very large permutations.

1 Like

You mean index right? __index__ is for lossless integer conversion. Also, if you’re going to do this, you should inherit from Sequence, and provide:

  • __getitem__
  • __len__
  • __contains__
  • __index__, and
  • __count__

The latter three being for efficiency.

To me it seems that for these, going through a ā€œflattenedā€ 1-D index is a detour. Sampling is better done on the original sequences; sharding with something like batched. For storing combinations/permutations in a big list, I don’t understand the use case.

infinite rectangular arrays are encoded to sequences by reading them along successive antidiagonal lines, and infinite cubic arrays along antitriagonal planes, etc.

AFAIK, that’s also better with generating the individual indices.

Doing likewise for the combinations functions would permit them to accept infinite iterables as their first parameter (which currently neither works nor fails loudly! running >>> islice(combinations_with_replacement(count(),2),0,8) in the shell causes it to become unresponsive and begin rapidly consuming RAM).

That is a bug.