Rewrite CSV Sniffer

The CSV Sniffer object has many bugs (see the csv project board). In trying to solve some of the issues at the sprints (see gh-119123), my research keeps leading to CleverCSV and its table uniformity algorithms for sniffing CSV formats (see the csv sniffer codebase).

This traditionally seems to get a “go use a library” response, but csv reading and writing is a very common use case in Python. I think we would be better served by completely rewriting the csv sniffer object using this table uniformity algorithm, or by porting the CSVSniffer to the standard library.

Thoughts?

3 Likes

If I am reading it correctly, the readme you linked says that the stdlib implementation is faster than all the other implementations by a factor of 5-10.

This could be a real benefit, even with a loss of accuracy. So if it gets rewritten I think the performance should be kept in mind.

2 Likes

Good callout. I guess the tradeoff between accuracy and speed should be sorted out. Personally, I lean more towards accuracy over speed, but I can understand that is not always what the community desires.

1 Like

PRs to rewrite the Sniffer get rejected because they potentially break code that currently sniffs correctly. The docs don’t make explicit promises beyond the caveat on header detection: “this method is a rough heuristic and may produce both false positives and negatives”. However, Hyrum’s law implies that users do rely on the current implementation.
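For reference, that heuristic is easy to exercise. A minimal sketch (on messier files, the documented false positives and negatives kick in):

```python
import csv

sample = "name,age\nAlice,30\nBob,25\n"

# has_header() compares the first row against the rest, looking for
# per-column type or length differences.
print(csv.Sniffer().has_header(sample))  # True for this easy case
```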

That said, it would be reasonable to offer an alternative sniffer that is smarter and handles more cases correctly.

1 Like

Well, considering there are only 15 samples in the tests, it seems we should build a test suite around it first.

My general feeling on the csv module is that it is really a dead battery, but a very important one: many, many users start off reading data with it before moving to something more powerful like pandas. We could certainly work on a new battery and move users to it over time.

1 Like

Pandas uses the csv module internally.

Or at least, it has the option to.

1 Like

Sure enough. It looks like you essentially get the csv sniffer from pandas if you don’t set things explicitly. Of course, it only sniffs the first line, which is … well … a way to do things. My PRs at least use the first 5 lines =P.
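For anyone who wants to see that fallback in action, a minimal sketch: passing sep=None with the python engine is the documented way to make pandas defer to csv.Sniffer (how much of the input gets sampled varies by version):

```python
import io

import pandas as pd

data = io.StringIO("a;b;c\n1;2;3\n4;5;6\n")

# sep=None is only supported by the python engine, which detects the
# delimiter via csv.Sniffer under the hood.
df = pd.read_csv(data, sep=None, engine="python")
print(df)
```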

Is it a dead battery, or is it simple, stable code which works very well for most users?
I think I’m detecting a hidden assumption in this thread that “can do more” is the same as “should do more”.

I’m strongly in favor of improving things, no matter how old they are. But whether the last changes to the module were last year or last week, the obligation to users is the same: to make sure we’re actually making an improvement.

I assume the new code will be somewhat more complex. As long as it’s long-term maintainable, that’s fine.

This will change behaviors. Is it possible that a user’s application relies on current behavior? If so, do they have a migration path which you can describe?

I have very limited knowledge of the csv module’s implementation, but I’ve used it quite a lot as a user for many years. I’ve never had any issue with it, but I’ve also only ever used it on well formed data. I just want to make sure we aren’t too brash in assuming that nobody expects it to keep its current behavior (even if that behavior is technically undefined). I’d like to at least make sure that core devs can understand the change without reading the implementation.

3 Likes

Yeah. Glossing over some of the issues in the OP, it seems like they fall into a few buckets:

  1. The user doesn’t like the heuristics used by Sniffer to guess something about the CSV because the file is ill-formed.
  2. The user wants some random feature that another program or library implements but that isn’t defined as a requirement of the csv format.

Right, so currently the sniffer uses some regexes or a count of common delimiters to guess the dialect. Even though the dialect object has other factors like quoting policies, line terminators, and escape chars, it doesn’t try to detect those things.
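To make that concrete, a minimal sketch of what a sniffed dialect actually contains (exact results can vary by Python version):

```python
import csv

sample = 'a;"b, c";d\r\n"x";y;z\r\n'
dialect = csv.Sniffer().sniff(sample)

# The delimiter and quotechar are genuinely detected from the data...
print(dialect.delimiter, dialect.quotechar)

# ...but escapechar just stays at its default, and lineterminator
# comes back as '\r\n' regardless of what the sample used.
print(dialect.escapechar, repr(dialect.lineterminator))
```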

In working through the bug list, I just thought I would raise the idea, otherwise we might as well mark a lot of these as “won’t fix” and clean up the tracker.

1 Like

I hope my comment didn’t come off too dismissive.

I don’t know what should be done.

I did notice that in a few of the issues, it was suggested that the problematic csv file be added as a test case.

According to here, the csv module expert is inactive.

Maybe we can get a core dev to respond to you.

Not at all.

As noted on the linked issue, modifying the sniffer is risky because users could easily be relying on its existing behaviour - quite possibly without even knowing.

Adding a new sniffer might be possible. But equally, if there is a more powerful sniffer available on PyPI, why not just use that? I’d never heard of clevercsv before, but if there is community consensus that it’s the “best of breed” solution, a link to it could be added to the csv module documentation.

1 Like

I’m not convinced by the argument, “We can’t change anything because it breaks code.” If that were the case, the language and standard libs would have been frozen 20 years ago. Open source software changes, and that’s why we have versions. In practice this is even the standard practice: Python 3.10 already changed csv parsing rules, as seen in csv.writer stopped to quote values with escapechar with csv.QUOTE_MINIMAL in Python 3.10 · Issue #89024 · python/cpython · GitHub, a far larger change than what I’m suggesting here.

I can respect that not having a core developer assigned to a module means changing it is even riskier, which is why I brought this up here as an idea. I think it would be better that we mark such issues as won't fix, or give some indication to Python users who are submitting issues that there is no intention of updating the code unless there is some security risk. Hence the comment about a dead battery: we keep it around to keep code stable, but we know there are other options.

People prefer the standard libs because there is more support and less mess with packaging. I’ve had interns work for me who came out of school being instructed to never use anything but the standard library. I’ve also worked in environments where it was faster to just reimplement a library than to go through the process for adopting a new 3rd party package into production. Being a part of the standard library gives a perception that it is code that the community is actually supporting at the same standard as the rest of the ecosystem.

While CleverCSV seems to be a winner in many ways, its sniffing ability is still behind the state of the art that the Table Uniformity Algorithm is showing.

1 Like

I am in favor of you doing this work.

But the cited change in 3.10 is exactly the sort of thing I’m concerned about. The change was made as an improvement, but it broke people. So let’s try to avoid doing that again.

Can or should the new feature be added with an opt-out, to get the old behavior?
Assuming it’s a big improvement, I favor opt-out over opt-in. Others may invert that, favoring stronger stability.

We know the behavior is different. Can you share some cases in which the current behavior is wrong and the new behavior is right?
This will help us understand what the old code does in these cases, and how likely we believe it is that many users are relying on it.

1 Like

But that’s not the argument.

The argument is “if we change something that has the potential to break code, we need to have a transition plan that ensures that users have time to adapt to the change”. Simply changing the behaviour of the sniffer doesn’t provide such a transition.

Adding a new sniffer, which users can select if they choose to do so, does provide a transition.

Adding a new sniffer which is on by default, but with an opt-out to get back the old behaviour, also offers a transition; however, users will only find out that they need to opt out when they get silent failures because the behaviour changed without warning. Adding an explicit warning when the old and new sniffers give different answers would mitigate this.
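To illustrate, a rough sketch of what that mitigation could look like, with new_sniff standing in for a hypothetical improved sniffer:

```python
import csv
import warnings

def sniff_with_transition(sample, new_sniff):
    """Run both sniffers; keep the old answer during the transition,
    but warn whenever the (hypothetical) new sniffer disagrees."""
    old = csv.Sniffer().sniff(sample)
    new = new_sniff(sample)
    if (old.delimiter, old.quotechar) != (new.delimiter, new.quotechar):
        warnings.warn(
            "The legacy and new CSV sniffers disagree on this sample; "
            "using the legacy result. Pass a dialect explicitly, or "
            "opt in to the new sniffer, to silence this warning.",
            FutureWarning,
            stacklevel=2,
        )
    return old
```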

Whether it’s worth putting that much effort into improving the sniffer hasn’t been established yet. Has anyone checked how often, in real world code, the existing sniffer gives the wrong result? In the cases where it does give the wrong answer, how often is manually configuring the options a problem?

If we’d had a more accurate sniffing algorithm when the csv module was written, we would have used it. So in principle, improving the algorithm is a good thing to do. But it has a cost, and unless someone can quantify the cost, and demonstrate that the benefits justify that cost, leaving things as they are is the default choice. That may be frustrating, but it’s a reality of working on a codebase that’s used by millions of users, often in mission critical projects.

1 Like

Okay, I’ll get an implementation together. The paper on the algorithm has many datasets that demonstrate the issue. I’ll prep some reports on those as well.

1 Like

It was in my plans, although I haven’t worked on it for about a year. There are many, many issues with the current sniffer; it needs a complete rewrite.

The current sniffer uses two algorithms. The first tries to gather statistics about character distribution to determine the delimiter. It does not work with quoted fields at all and often produces nonsense results; for example, it could detect a dot or a digit as the delimiter in a file full of decimal numbers (this may already be fixed). The other algorithm tries to recognize some patterns with quoted fields. It does not work with newlines in quoted fields, with doubled quotes, with escaped characters, or with initial spaces, and can work incorrectly even in “normal” cases.
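For the curious, the decimal-number case is easy to try; the exact behaviour depends on the Python version (as noted, it may already be fixed):

```python
import csv

# Nothing but decimal numbers, one per line.
sample = "3.14\n2.71\n1.41\n"

try:
    dialect = csv.Sniffer().sniff(sample)
    # The character-frequency heuristic can settle on '.' (or a digit)
    # here, since '.' occurs with perfect per-line consistency.
    print("detected delimiter:", repr(dialect.delimiter))
except csv.Error as exc:
    print("sniffing failed:", exc)
```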

My idea is that we should just try to parse the file with many different parameters simultaneously, and choose the variant that does not fail and produces the most credible-looking data. This may be slower than the current code, but that is the cost of reliability. The trick is that we should try all variants simultaneously, feeding them input line by line; otherwise, a file with a single quote at the beginning would make the parser swallow the whole file in an attempt to find the closing quote.
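A much-simplified sketch of that idea, scoring whole parses rather than feeding the candidates line by line as described above, and varying only the delimiter and quotechar:

```python
import csv
import io
import itertools

def guess_dialect(sample, delimiters=(",", ";", "\t", "|")):
    """Parse a bounded sample with every candidate combination and keep
    the one producing the most uniform, widest table."""
    best, best_score = None, (-1, -1)
    for delim, quote in itertools.product(delimiters, ('"', "'")):
        try:
            reader = csv.reader(io.StringIO(sample),
                                delimiter=delim, quotechar=quote)
            rows = list(itertools.islice(reader, 100))  # bound the work
        except csv.Error:
            continue  # this combination fails outright
        if not rows:
            continue
        widths = {len(row) for row in rows}
        # Prefer a uniform row width first, then more columns.
        score = (len(widths) == 1, max(widths))
        if score > best_score:
            best, best_score = (delim, quote), score
    return best

print(guess_dialect('a;"b;c";d\n1;2;3\n'))  # (';', '"')
```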

If your code is fast and reliable enough, we can compare different approaches, and maybe include several implementations, with the possibility to choose.

3 Likes

That’s basically what the table uniformity method does.

I think the code can be greatly simplified and there is probably a way to prune various dialects early.

The trove of csv files that show up is quite nice.

2 Likes

This is fundamental to our development philosophy. That is why there are deprecation periods.

Really, it is a simple matter to just add a new sniffer. There is no need to break the current one.

1 Like