Enhancements to `graphlib`

dam1784 · October 12, 2021, 1:01pm

Hi Steven,

Almost. You’re just missing a few other pieces of data: the “mom” and “dad” and any other attributes/methods that may have been attached to the Person class. My point is that programs are rarely so simple that their data can be stored in plain old dictionaries. Users store their data in big complicated structures, which the std-lib can’t look inside of because it can’t understand the users business logic.

Hot Take: Users do not intentionally create graphs. Instead they write programs with spaghetti-like data structures.

You are correct that dict’s are good for explicit graphs, but I think its much more common to have graphs implicitly embedded inside of other more complex data structures.

pf_moore · October 12, 2021, 1:28pm

In fact, the addition was simply a “topological sort” function. The fact that it was added to a new library wasn’t originally the intention, and has caused some controversy. It definitely did not signal any particular intention to start adding further graph algorithms to the stdlib (in fact, I think one of the arguments against the name was that it might encourage people to think we planned on doing that).

Incremental, small additions, like DFS/BFS may be appropriate, but I suspect it’ll very much be on an algorithm-by-algorithm basis.

Disclaimer: I wasn’t involved in the original discussion, it’s on bugs.python.org somewhere if you want the details.

Also, I’ll note that I like graph algorithms, and I’m not at all against having them be more accessible. I’m not sure I like your design particularly, but I don’t have a strong argument either way. But I definitely don’t want to end up in a situation where we have a set of graph algorithms in the standard library and yet people still consider an external library to be “best of breed” for general use (like urllib and requests, for example…)

mattip · October 12, 2021, 1:45pm

I would hope that the modest transfer of topological_sort from functools to a graphlib module will not be the start of something bigger. If that is the intent, then perhaps the discussion should be around “should we expand graphlib in the stdlib” without diving into library design choices.

mattip · October 12, 2021, 1:46pm

Ahh, duplicate of Paul’s post, sorry.

dam1784 · October 12, 2021, 3:01pm

On the topic of the existing graphlib:

Issue tracker where it was added: Issue 17005: Add a topological sort algorithm - Python tracker

Documentation: graphlib — Functionality to operate with graph-like structures — Python 3.10.0 documentation

I don’t want to throw too much shade or make a scene, but TopologicalSorter has its faults which are now enshrined in the std-lib. Part of why I started this discussion is because I don’t want TopologicalSorter's mistakes being repeated.

It seems like they had a very specific (and frankly niche) use case (parallel processing) and they built the whole tool around it.
The documentation is painful to read because it is confused about the direction of the edges. (This can be somewhat fixed.) To elaborate on this point:

A topological sort makes all of the edges point left to right, within the output list. Sometimes you want to reverse the output of the topo-sort so that the edges point the other direction. You can either reverse the topo-sorted list, or you can either reverse the direction of all the edges. The docs tell the users to do the latter.

So normally/intuitively I think of using a dict-based graph like this:
edges[start-point] = [list-end-points]

But the documentation for the topological sort tells the users to store their edges in the opposite direction:
edges[end-point] = [list-of-start-points]

But then the TopologicalSorter can’t represent nodes with no edges, so they have a method to add more nodes to the graph: TopologicalSorter.add(start_point, *end_points)

So I agree with the sentiment: we should be slow, careful, and deliberative about adding stuff to the standard library. I expect it to take years for any changes to actually happen.

dam1784 · October 15, 2021, 12:50pm

Agreed. I think I made a mistake saying that, BC I checked my prototype and it also agrees with you (traversal accepts a single starting vertex).

When I get the time I will: re-read all of the comments in this thread, do some thinking, fix up my prototype, and return with more.

Sincerely,
David

steven.daprano · October 16, 2021, 3:21am

That’s odd. I think that’s the ideal situation.

The stdlib is the place for batteries. If you want a nuclear reactor,
look at third-party libraries that aren’t constrained by the stdlib’s
release schedule, backwards compatibility requirements, or licencing.

(In principle at least, the best of breed could be a closed-source
proprietary library that charges money.)

pf_moore · October 16, 2021, 10:01am

I specifically said “best of breed for general use” - meaning “the consensus is that you should always use external library X if you can, and avoid using the stdlib module”. Sorry I wasn’t clear.

You do have a point - it’s not a black and white rule, though. For example, it’s definitely fine to have a stdlib version that’s good for general use, and more capable but more specialist 3rd party libraries (statistics vs numpy/scipy for example). The batteries/nuclear reactor analogy is fair, but what I’m saying is that we should put good batteries in the stdlib, not that we should put nuclear reactors in there.

I’ve said this before, but I like the fact that we added graphlib. I feel like having topological sort available “out of the box” will be useful. I don’t have any particular reason to think that the “dict mapping nodes to predecessors” API is a bad one for practical use cases. I do think whatever algorithms end up in graphlib should work well together, so there’s a common notion of how to represent a graph. And of course, now that TopologicalSorter is in the stdlib, for better or worse we need to maintain its API for backward compatibility. I’d be happy to add algorithms on a case by case basis, but they should be demonstrated to be useful “batteries” in their own right, and not be added just because we now “have graph algorithms in the stdlib”.

I’m not an advocate of a minimalist stdlib. But I’m also not in favour of adding too freely - growth that can be perceived as uncontrolled actually adds weight to the arguments for stripping down the stdlib. As with most things, there’s a balance to be had, and nuanced decisions are needed.

sweeneyde · October 16, 2021, 10:37am

I don’t know if I’m representative, but I think I come across
implicit/lazy/infinite/functional graphs (where not all of the nodes
are in memory ahead of time) about as often as eager finite graphs,
so if this happened, I would like those use cases to be considered.
I’m thinking of problems like

* The following-links-on-Wikipedia problem I mentioned:
   I may not want to download all of Wikipedia ahead of time
   and store all connections in memory.

* If I want to traverse a connected components of 1s of a bit-matrix like
    [[1, 0, 0, 1],
     [1, 1, 0, 1],
     [1, 0, 0, 1],]
it would be nice to be able to write a function like
```
def ones_neighbors(i_j):
    i, j = i_j
    assert matix[i][j] == 1
    for i1, j1 in [(i+1, j), (i-1, j), (i, j+1), (i, j-1)]:
        if matrix[i1][j1] == 1:
            yield (i1, j1)
```
And pass that function, rather than re-formatting everything as
a dict like {(0, 0): ..., (0, 1): ...}.
Re-formatting would be especially wasteful if the matrix is huge
or if the component of 1s to traverse is small relative to it.

* "Can I solve this Rubik's cube in N steps?" It's easy enough
to write a function that, given a cube state, outputs
the adjacent cube states, but it's impossible to pre-compute
all of these and put them in a dictionary.

* Generally: given the world state now, what is the shortest number
  of actions an actor can take to get to the desired state?
  There may be infinitely many world states.

To me, having to transform each of these functions into a mapping object
feels as unnatural as having to sort strings by length by doing

lengths = {x:len(x) for x in my_list}
my_list.sort(keys_mapping=lengths)

Of course, it’s possible to use either API for either use-case:

#### pass-a-function API ###

some_graph_function(..., produce_neighbors)
some_graph_function(..., my_graph.get)

##### pass-a-mapping API ####

class VirtualMapping:
    def __getitem__(self, node):
        return produce_neighbors(node)
some_graph_function(..., VirtualMapping())
some_graph_function(my_graph)

So it’s a matter of deciding between making the eager/finite case a
hair less convenient because you have to put the .get
versus making the functional case a moderate amount less convenient
by having to implement something with a getitem.
In my estimation, functional/lazy graphs are common enough that I would favor the former.

pf_moore · October 16, 2021, 12:32pm

Well, to an extent that ship has sailed, as graphlib.TopologicalSorter chose “dict in constructor, plus add method” as the API. Changing that API would now require a deprecation process. And any further algorithms added to graphlib should be designed with consistency with existing functions in mind.

That’s not to say that we couldn’t add an algorithm that handles “generated on demand” graphs, but the trade-offs would need to be discussed. And if you want to make a blanket principle that “graphs are represented as functions that map nodes to iterables of neighbours” then you’d have to explain why topological sort is an exception.

This is why I imagine a 3rd-party library based entirely round the “graphs as functions mapping nodes to neighbours” design would be a better approach. In a 3rd party library you can apply that design principle consistently, and if as you say it is a good abstraction for practical use cases, your library will become popular and act as its own advertisement for the design.

Personally, I do like the idea in the abstract - but I’m very much aware that having a pure mathematics background, I have a tendency to over-generalise (I find it hard to adhere to the “practicality beats purity” zen ). So I compensate by pushing back on generalised solutions if it’s not clear what the practical benefit is.

Your examples, by the way, are good. They make the case well that there are benefits to having access to a “lazily computed” graph library¹. And I’ll add one of my own - pip needs to work with package dependencies, which are a graph. And computing a package’s dependencies is hard (in a lot of cases, it can involve building the package from source code) and is an ideal candidate for computing lazily. But it’s also important that we have some control over the order that nodes get visited (it’s unlikely that the result we want will involve a 10-year old version of a package) and not all algorithms work like that (SAT solving for dependency resolution is an example of one that doesn’t). So a library that uses lazily-computed neighbours needs to be clear on how it explores the graph - it’s at least as important as big-O performance in real-world use cases. That’s going to be an important factor in the usability of your proposed library.

¹ What they don’t do, is demonstrate that all (or even the majority) of uses need lazy computation. With graph algorithms having high big-O complexity, “ordinary” problems tend to become too slow to be usable long before they become too big to fit in memory, in my experience. This comes back to the “battery” vs “nuclear reactor” analogy - are lazy/infinite graphs the common case here? Maybe, for BFS where there’s no real downside to calculating edges lazily. Maybe less so for a connected components algorithm…

dam1784 · October 16, 2021, 10:50pm

Let’s read through some of the highlights from the thread where theTopologicalSorter was designed:

Issue 17005: Add a topological sort algorithm - Python tracker
bpo-17005: Add a topological sort algorithm by pablogsal · Pull Request #11583 · python/cpython · GitHub

[my edits are in square brackets]

In chronological order:

Gareth Rees
2019-01-18.18:19:15

I approve in general with the principle of including a topological sort algorithm in the standard library. However […]

“Topological sort” is a terrible name: the analogy with topological graph theory is (i) unlikely to be helpful to anyone; and (ii) not quite right. I know that the name is widely used in computing, but a name incorporating “linearize” or “linear order” or “total order” would be much clearer.
The proposed interface is not suitable for all cases! The function topsort takes a list of directed edges and returns a linear order on the vertices in those edges (if any linear order exists). But this means that if there are any isolated vertices (that is, vertices with no edges) in the dependency graph, then there is no way of passing those vertices to the function.

Eric V. Smith
2019-01-18 21:39

This is why I prefer the API exposed by toposort · PyPI

[Note: Eric Smith is and always had been the sole author and maintainer of this package.]
[This package uses a dictionary to store the graph. Eric describes the edges as directed in the conventionally forward direction, and in the resulting list the edges point from right to left. In a literal way, this is the same as graphlib except graphlib describes both the edges and the result as reversed and the two reversals cancel each other out.]

Raymond Hettinger
Date: 2019-06-01 03:49

Unless Łukasz gives us a nod to work out this API for the second beta, we’ll need to defer this to 3.9. IMO, the API in the patch is not user friendly. It started on the right path but became contorted (for both inputs and outputs) to serve unusual cases.

Tim Peters
Date: 2020-01-10 02:24

Let’s stir this up a bit I recently had occasion to exchange ideas with Larry Hastings about topsorts for use in a package manager he’s writing. I thought his API for adding edges was … perfect:

add(node, *dependson)

So, e.g., add(A, B, C) says A depends on B, and on C, but says nothing else about B and C. This is almost always the way topsorts show up in real life: you know what a thing depends on directly, but have scant idea how things may be in the opposite direction. For example, you know that baking a cake requires (among other things) flour, but have no real idea of the universe of other things that require flour. Likewise Larry knows which packages each package requires, but not the reverse. Etc.

Nodes with no edges are trivial to add then: add(A).

[Yay, we get a second API to compensate for the first one!]

[…]

The other big thing that came up is that most topsort programs were useless for his goal: downloading and installing packages takes significant wall clock time, and there’s huge opportunity for exploiting parallelism. But a flat sequence in topsort order gives no clue about what can be done in parallel. Instead [… At this point he proposes what went on to become the final API …]

[Topological Sort != Parallel Processing. Why can’t the standard library’s implementation of topological sort do just that?]

Raymond Hettinger
Date: 2020-04-02 18:48

At some point in the next two or three weeks, I’ll have a chance to work on this more and to offer a competing patch. IMO, the current checkin is over-engineered, both in its API and implementation. This could have been a simple, fast tool written as one or two short Python functions.

Also, I would like to try to out the API alternatives on some groups of engineers to get some user feedback.

For me, as the API currently stands, I would have to write a wrapper to make it usable for my applications.

Tim Peters
Date: 2020-04-02 19:32

Raymond, what application do you have that wouldn’t be completely addressed by sticking to just .add() (to record dependencies) and .static_order() (to retrieve a linear order)?

Larry Hastings and I originally worked out the fancier bits of the interface to deal with problems he actually had, and for which no existing Python topsort implementation we could find was of any use: extract maximal parallelism. If you don’t want that, fine, stick to the two simple bits.

The bits to support parallelism are very easy to use to write correct parallelized code, but of course can seem baffling if you don’t give a rip about parallelism. But in that case you have no need to learn about them either.

If your alternative isn’t equally easy to use in a parallelized context, I’ll be at best +0.

Pablo Galindo Salgado
Date: 2020-04-02 20:20

I just want to echo what Tim mentioned with the extra data point that some of the maintainers of some popular and wide-used open-source libraries that indeed have to deal with this problem or the parallel version of the problem (like gaborbernat in this thread, the maintainer of “tox” and “virtualenv”) do indeed find the current API desirable.

Ben Mares
Date: 2020-05-31 16:01

It’s great to have this feature in the standard library, but it really seems to clutter the functools documentation. Everything else in functools applies directly to functions and methods. Suddenly reading graph theory terminology was disorienting for me. (Due to context, I expected “node” to mean some new type of Python function.) It makes the documentation significantly longer, and IMHO abstruse. Would it be sensible to move this to a separate module?

[At this point the discussion moves onto what to name the new module, and they decide on graphlib.]

In Conclusion:

As you can clearly see: they did not set out to create a topological sort. They set out to solve their particular problem involving parallel processing. Along the way the just happened to need to create a topological sort and they decided that because their tool was useful (in the domain of parallel processing) that it should live somewhere in the standard library.

Did they come up with an unambiguously good API for graphs? No. Instead they started with a truly awful API (list of edges) and then tried to move to a better one (dictionaries) but got confused about which direction the edges should point and then tacked on the add method to fix it for good.