Currently, itertools provides three different combinatorial iterators, iterating over subsets, sub-multisets and permutations of an iterable in lexicographic order. I propose extending them to fully-fledged classes with __getitem__ and index methods (compatible with slicing).
Motivation
- uniform random sampling from sets of combinations/permutations too large to store directly
- One could
indexa combination/permutation into an integer to allow it to be stored in a 1D array instead of 2D list - multithreaded/distributed programs could iterate over a search space by slicing it into shards
My implementation
In my pet library dronery (excuse the mess and the idiosyncrasy), I have a permutations class in perms.py, and combinations/combinations_with_replacement in comb.py.
Considerations
In both cases, I included an additional keyword argument to use a different ordering, which I will argue for here.
Reading a simplex by slices
There are three relevant sets whose cardinalities are counted by comb(n+k-1,k); their equivalence is straightforward but they still warrant spelling out here.
- the set of vectors in
k-dimensional space with nonnegative integer coordinates summing to< n - the set of nondecreasing vectors in k-dimensional space with elements from
range(n) - the set of vectors in
n-dimensional space with nonnegative integer coordinates summing to exactlyk
The second is formed by taking the partial sum of the first, and the third by considering the second as a Young tableau, transposing and taking the differences.
On the OEIS, infinite rectangular arrays are encoded to sequences by reading them along successive antidiagonal lines, and infinite cubic arrays along antitriagonal planes, etc.
Doing likewise for the combinations functions would permit them to accept infinite iterables as their first parameter (which currently neither works nor fails loudly! running >>> islice(combinations_with_replacement(count(),2),0,8) in the shell causes it to become unresponsive and begin rapidly consuming RAM).
This ordering (enabled by the rev argument) is equivalent to reversing the order in which combinations are outputted then replacing each element i inside a combination with n+~i.
>>> list(combinations_with_replacement(range(4),2,rev=True))
[(0, 0), (0, 1), (0, 2), (0, 3), (1, 1), (1, 2), (1, 3), (2, 2), (2, 3), (3, 3)]
>>> list(combinations_with_replacement(range(4),2,rev=False))
[(0, 0), (0, 1), (1, 1), (0, 2), (1, 2), (2, 2), (0, 3), (1, 3), (2, 3), (3, 3)]
Could a rev-equivalent argument be added to the official implementation (or automatically switched to whenever encountering a type which throws a has no len() TypeError)?
Lexicographically ordering permutations is not asymptotically optimal
Given a permutation p of length n, the value of the 0st element splits the lexicographic position within Sā into one of n contiguous bins of length (n-1)!, within which the 1nd element splits it into one of n-1 bins of length (n-2)!, etc.
The kth element adds (range(n) \ p[:k]).index(p[k])*(n-1-k)! to the lexicographic position, but just storing a list that starts as l=range(n) and removing its p[k]th element (l.pop(p[k])) after adding the kth contribution to the sum would take linear time per removal, so quadratic time in total. However, this problem can be rephrased as finding the p[k]th prefix sum of a list that starts as l=[1]*n and has l[p[k]]=0 after each partial sum; this alone merely shifts the linear part each iteration from the pop to the partial sum, but by using a Fenwick tree instead of a list, we make both reassigning values and taking partial sums take logarithmic time, reducing the total to only O(n*log(n)) (linearithmic)! A fenwick class is included in dronery.common and used to make both index and __getitem__ in permutations objects linearithmic.
Myrvold & Ruskey, 2001 explains that Dietz, 1989 gives a ācomplicatedā structure which can be used in place of a Fenwick tree to shave off an additional log(log(n)) factor, which I think probably is safe not to implement (since a C Fenwick implementation would be faster than a Python Dietz implementation for all sizes worth considering anyway).
However, the primary purpose of the Myrvold & Ruskey paper is to point out that this is redundant anyway, because if you build the permutation by doing appends-and-swaps instead of insertions (seeding a Fisher-Yates shuffle with a sequence of indexes) you get a linear-time algorithm, at the expense of lexicographicity. (You do gain some nice properties, as mentioned in this page; most notably, the number of 0s in the index (when written factoradically) is the number of cycles in the resulting permutation.)
The __next__ method is fast enough in both cases that this isnāt a consideration (both take asymptotically constant average time; use of a Fenwick tree (with linear-time initialisation and logarithmic-time usage) is overkill, due to Knuthās algorithm L, see A056542), but the log(n) factor seems undesirable for getting/indexing very large permutations.