Rewrite CSV Sniffer

storchaka · May 31, 2025, 6:30pm

It was in my plans, although I haven’t worked on it for about a year. There are many issues with the current sniffer; it needs a complete rewrite.

The current sniffer uses two algorithms. The first one tries to gather some statistics about character distribution to determine the delimiter. It does not work with quoted fields at all and often produces nonsense results; for example, it could detect a dot or a digit as the delimiter in a file full of decimal numbers (this may already be fixed). The other algorithm tries to recognize some patterns with quoted fields. It does not work with newlines in quoted fields, with double quotes, with escaped characters, or with initial spaces, and it can work incorrectly even in “normal” cases.
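To illustrate why pure character statistics are fragile, here is a toy frequency heuristic. It is not the actual `csv.Sniffer` code, only a simplified illustration of the idea; the name `consistent_chars` is made up for this example.

```python
from collections import Counter


def consistent_chars(sample):
    """Return {char: count} for characters that occur the same nonzero
    number of times on every non-empty line of the sample.

    Toy illustration only; the real sniffer is more elaborate.
    """
    lines = [line for line in sample.split("\n") if line]
    if not lines:
        return {}
    per_line = [Counter(line) for line in lines]
    result = {}
    # Only characters present on the first line can be present on all lines.
    for ch in per_line[0]:
        counts = {counter[ch] for counter in per_line}
        if len(counts) == 1:
            result[ch] = per_line[0][ch]
    return result


# A semicolon-separated file full of decimal numbers.
sample = "1.5;2.5\n3.5;4.5\n"
print(consistent_chars(sample))
```

On this sample the dot, the digit `5` and the semicolon all look equally regular, so statistics alone cannot tell which one is the delimiter.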

My idea is that we should just try to parse the file with many different parameters simultaneously, and choose the variant that does not fail and produces the most credible-looking data. This may be slower than the current code, but that is the cost of reliability. The trick is that we should try all variants simultaneously, feeding them input line by line; otherwise a file with a single quote at the beginning would make the parser swallow the whole file in an attempt to find the closing quote.
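A minimal sketch of this idea, under several assumptions of mine: the names `Candidate`, `sniff` and `max_pending` are hypothetical, the scoring (share of rows with the most common field count) is only a placeholder for "most credible", and instead of a truly incremental parser each candidate simply re-parses its small buffer of pending lines with `csv.reader(..., strict=True)`.

```python
import csv
from collections import Counter
from itertools import product


class Candidate:
    """One hypothetical (delimiter, quotechar) guess with its own state."""

    def __init__(self, delimiter, quotechar, max_pending=5):
        self.delimiter = delimiter
        self.quotechar = quotechar
        self.max_pending = max_pending  # assumed limit on lines per record
        self.pending = []               # lines of a not yet complete record
        self.widths = Counter()         # field count -> number of rows
        self.alive = True

    def feed(self, line):
        if not self.alive:
            return
        self.pending.append(line)
        try:
            rows = list(csv.reader(self.pending, delimiter=self.delimiter,
                                   quotechar=self.quotechar, strict=True))
        except csv.Error:
            # Maybe an open quoted field: wait for more lines, but give up
            # on this candidate once the record grows suspiciously long.
            if len(self.pending) >= self.max_pending:
                self.alive = False
            return
        self.pending.clear()
        self.widths.update(len(row) for row in rows if row)

    def score(self):
        # An unfinished record at the end counts against the candidate.
        if not self.alive or self.pending or not self.widths:
            return (0.0, 0)
        width, hits = self.widths.most_common(1)[0]
        if width < 2:
            return (0.0, 0)
        return (hits / sum(self.widths.values()), width)


def sniff(sample, delimiters=",;\t|", quotechars="\"'"):
    candidates = [Candidate(d, q) for d, q in product(delimiters, quotechars)]
    # Feed every candidate the same line before moving on to the next one.
    for line in sample.splitlines(keepends=True):
        for cand in candidates:
            cand.feed(line)
    best = max(candidates, key=Candidate.score)
    if best.score() == (0.0, 0):
        raise csv.Error("could not determine the dialect")
    return best.delimiter, best.quotechar


sample = 'name;comment\n"Smith; J.";"line one\nline two"\n"Doe";"x,y"\n'
print(sniff(sample))
```

Because no candidate ever holds more than `max_pending` lines, a stray apostrophe at the start of the file only disqualifies the variants that treat it as a quote character; the other variants keep producing rows.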

If your code is fast and reliable enough, we can compare different approaches, and maybe include several implementations, with the possibility to choose between them.