The robotparser module was based on the old document The Web Robots Pages (published in 1994?). It goes further, trying to support more modern conventions. For example, it supports not only “Disallow” but also “Allow” rules, and many additional fields. It even contains a trace of support for “*” as a path, although that has been broken since 2001. But the parsing rules still follow the old document.
The modern standard was published in 2022 as RFC 9309 - Robots Exclusion Protocol. For several years before that it existed as a draft. It incorporates conventions used by modern web sites and crawlers.
The robotparser module does not support the modern standard. It can parse robots.txt incorrectly and misinterpret its rules. For example (a small sketch of the correct RFC 9309 matching rules follows this list):
- empty lines are not ignored, but instead separate groups,
- groups matching the same user agent are not merged, only the first group has effect,
- the first matching rule is used instead of the longest match,
- “Allow” does not win over “Disallow” if both match the same path,
- special characters “$” and “*” are not supported.
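To make the difference concrete, here is a minimal sketch of the matching RFC 9309 describes: the longest matching pattern wins, “Allow” beats “Disallow” on a tie, and “*”/“$” act as wildcards. This is not the code in my PR, and it glosses over ambiguous details such as percent-encoding and how “$” counts toward pattern length:

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    # "*" matches any sequence of characters, "$" anchors the end of the path;
    # everything else is matched literally from the start of the path.
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")
        elif ch == "$":
            parts.append("$")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))

def is_allowed(rules, path):
    # rules: list of ("allow" | "disallow", pattern) from the matched group.
    # RFC 9309: the longest matching pattern wins; on a tie "allow" wins;
    # if nothing matches, the path is allowed.
    best_len, best_allow = -1, True
    for kind, pattern in rules:
        if pattern_to_regex(pattern).match(path):
            allow = kind == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow

rules = [("disallow", "/private/"), ("allow", "/private/public-*.html$")]
print(is_allowed(rules, "/private/secret.html"))       # False
print(is_allowed(rules, "/private/public-page.html"))  # True
```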
Crawlers written in Python using robotparser are therefore not well behaved. They can scan disallowed parts of a site or skip allowed parts. They can unintentionally create excessive load and be banned as a result. So even if support for the new standard is considered a new feature, in this case I consider it a bug fix, because it affects not only Python users, but everyone around them.
Unfortunately, the internal structure of the module does not fit the standard very well. Typically, a crawler has a single user agent, so it is easy to filter out matching rules in a single pass. But the RobotFileParser class contains the rules of the entire robots.txt file, and the user agent string is passed as part of each request. I also suspect that RobotFileParser can be used to create and write a robots.txt file, although this part is not documented. So I am going to implement code that minimally affects the internal structure of the classes, to simplify backporting. For the algorithm to be efficient, it must use a shadow cache, which can break code that modifies the RobotFileParser object on the fly between requests. I hope this is a hypothetical situation. In future versions, the internal structure and even the public API may change to more suitable ones.
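As an illustration of the shadow-cache idea (all names here are hypothetical, this is not the code in the PR): the merged, pre-sorted rule list for a user agent is built once and then reused for subsequent checks, which is exactly why mutating the parser between requests could give stale results:

```python
from collections import namedtuple

Rule = namedtuple("Rule", "kind pattern")   # kind: "allow" or "disallow"

class AgentRuleCache:
    # Hypothetical illustration.  entries is a list of
    # (user_agent_tokens, rules) groups parsed from robots.txt.
    def __init__(self, entries):
        self.entries = entries
        self._cache = {}    # lowercased user agent -> merged, pre-sorted rules

    def rules_for(self, useragent):
        key = useragent.lower()
        if key not in self._cache:
            # Merge every group whose user-agent token matches (simplified,
            # case-insensitive substring test), else fall back to "*" groups.
            groups = [rules for agents, rules in self.entries
                      if any(agent.lower() in key for agent in agents)]
            if not groups:
                groups = [rules for agents, rules in self.entries if "*" in agents]
            merged = [rule for rules in groups for rule in rules]
            # Pre-sort by descending pattern length, "allow" before "disallow"
            # on ties, so the first matching rule is the RFC 9309 winner.
            merged.sort(key=lambda r: (-len(r.pattern), r.kind != "allow"))
            self._cache[key] = merged
        return self._cache[key]

entries = [(["ExampleBot"], [Rule("disallow", "/tmp/"), Rule("allow", "/tmp/ok")]),
           (["*"], [Rule("disallow", "/private/")])]
cache = AgentRuleCache(entries)
print([r.pattern for r in cache.rules_for("ExampleBot/1.0")])  # ['/tmp/ok', '/tmp/']
```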
But why? It is not that big or complicated. It is no more complex than the CSV parser or the JSON parser. With a written standard, it can be implemented exactly, once we clarify some ambiguous details.
I have found only one working third-party implementation. robotexclusionrulesparser by Philip Semanchuk was last updated 9 years ago, and all links are now dead. robots-txt-parser looks like its clone; it was last updated 7 years ago. It supports two formats: http://www.robotstxt.org/norobots-rfc.txt, an update of the original specification from 1996, and “How Google Interprets the robots.txt Specification” (Google Search Central), the Google-Yahoo-Microsoft extensions announced in 2008 that were the foundation of RFC 9309. It contains a bug recently fixed in robotparser and other bugs, including a security vulnerability (a specially prepared robots.txt file can make the crawler hang indefinitely). And even if the rest is implemented correctly according to the documents mentioned above, it differs from RFC 9309 in many ways.
Why this module? There are many other specialized modules in the stdlib, and some of them are more broken. If urllib.robotparser is removed, there is no good third-party alternative that we can suggest as a replacement. Users will be forced to implement RFC 9309 themselves, and there is more than one pitfall here (which I have already avoided).
One point I’m trying to make applies to many other unmaintained or bit-rotted stdlib modules as well. If there were no robots parser in the stdlib, you could never convince me that it deserves to be in the library.
We already removed a bunch of modules that were totally unused. I guess robotparser is still used or it would also have been removed (or maybe it was hiding under urllib and never considered).
Another point I’m trying to make is that if you are proposing to do the work of writing a standards-compliant robots.txt parser for the stdlib, you would be more effective if you didn’t put it in the stdlib – you can release immediately on PyPI, you can release as many times as you want, you can change the API if needed without a years-long deprecation period, and your work can also benefit users who are stuck on an earlier Python version.
Remember, the stdlib’s slogan is that it is where code goes to die. <0.5 wink>
I believe that even if we are going to remove a module, it would be good form to first fix existing bugs (if solutions already exist) and make the latest available version as bug-free as possible.
The problems in the stdlib module plus the low downloads for these outdated packages (54,210/month and 79/month respectively) suggest there’s not much demand for this sort of thing; and if we remove it, there’s not much point putting in effort to fix it first.
I don’t think that’s a valid claim. Every single (decent) web crawler application will need to parse the robots.txt file, and doing so with a not fully standards-compliant robotparser module in the stdlib is still better than not doing it at all.
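For context, this is roughly what such a crawler does today with the stdlib module (the URL and user agent are placeholders):

```python
import urllib.robotparser

# The usual pattern: fetch robots.txt once per host, then check every URL
# before requesting it.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("ExampleBot", "https://example.com/some/page"):
    ...  # fetch the page
```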
Also note that such applications are more likely to exist in closed source company internal tools than as packages on PyPI.
If possible, it’s better to simply document the shortcomings of the robotparser module than to remove it. Perhaps someone will provide patches to make it more standards compliant as a result.
We shouldn’t have broken behavior in the standard library when we have someone able and willing to fix it.
The standard library isn’t the right place for this long term.
We don’t have to break things by removal if we can leave them safely unmaintained permanently.
Maybe this is a good case for making it standards compliant and marking it as being in a final state (deprecation optional IMO, but I understand the arguments for both removal and not), and then anyone who needs further development, beyond fixing the known mismatches with the standard at the time of finalizing it (due to the order of events here, the module was written before standardization), can continue that development externally.
I’m a big proponent of the idea that software can be both unmaintained and complete. For a standards-compliant version of a text rule parser, this seems to be the case. It only has to be fixed once, and then it is forever compliant with that version of the standard, no further changes ever required.
A few details are still not completely clear to me. I wrote a letter to the RFC authors asking for clarification and am waiting for a response. However, even if I’m wrong, this will only affect very specific cases that may not happen in the real world (and were not foreseen by the authors). But the internal structure may need to be different, and different methods are needed to avoid the worst cases. So I would like to do everything right the first time.
Nobody is likely to block you making improvements to robotparser. If the module is in better shape if/when we ever actually do deprecate it, nice. It’s just that we don’t see much value in bothering to do so.
If I were going to spend time improving it, I know it would see more value released on PyPI than if done in the stdlib. No Python upgrade required for this ever again. So I’d personally start by just doing that and iterate on improvements there. That would also be a good time to start the stdlib deprecation and reference the external package from the docs while doing so.
If doing a PyPI module for robots.txt parsing, consider wrapping another maintained library such as robotxt (a Rust crate), for example – I have no informed opinion on it.
And that’s just to support different platforms, as uv doesn’t provide a Python API, and there are still platforms they don’t support.
For libraries that provide Python APIs that want to support a wide range of Python versions and a wide range of platforms, you quickly get to over 100 wheels. And just like platforms you won’t be supporting as many Python versions as if you’d just written pure Python.
All that is to say, for simple projects, that aren’t often a performance bottleneck, perhaps consider keeping it in pure Python?
Which doesn’t cover the (RISC-V / musl) combo, because there’s no tag for that on PyPI. Nor can I actually test a built wheel on a bunch of architectures, like s390.
+1 from me on functionality being coded in pure Python.
+1 from me on deprecating and removing this from the standard library.
I think it’s very fair to drop the existing code into a v1.0.0 package hosted on GitHub, create issues for the specific items in the OP, build and push it to PyPI, and look for contributors and maintainers.
The contribution bar to a small package is much lower than CPython, and the fixes are available to folks stuck on older Pythons.
This appears to support RFC 9309 and can be used as a substitute for urllib.robotparser.RobotFileParser. It also has a test suite run against the Google robots.txt parser and ships with a command-line utility. Looks like a great addition to the Python ecosystem.
I would advocate for deprecating urllib.robotparser.
Thank you, it does indeed look like it supports RFC 9309. For some reason I couldn’t find it on PyPI when I searched for the most obvious terms – “robotparser”, “robots.txt”, “RFC 9309”.
But looking at its code I have found some bugs and deviations from RFC 9309, and I am not sure about its interpretation of the ambiguous parts of RFC 9309. In some aspects it is closer to Google’s pre-standard implementation, but not always. I do not think that it is generally better than my PR. It also lacks some features of urllib.robotparser – the string representation, crawl-delay, and request-rate.
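For reference, these are the extra features I mean, as exposed by the current stdlib module (a small self-contained example using parse() so no network access is needed):

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Crawl-delay: 10",
    "Request-rate: 1/5",
    "Disallow: /private/",
])

print(rp.crawl_delay("ExampleBot"))   # 10
print(rp.request_rate("ExampleBot"))  # RequestRate(requests=1, seconds=5)
print(rp)                             # prints the parsed rules back out
```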