I found a missing feature in Python's csv module: it is only possible to define one quotechar. There seems to be no way to define different opening and closing quote characters. This is necessary if you bracket the fields in your CSV files.
Example use case
An example use case is files prepared for pgfplots diagrams in LaTeX documents. This format uses whitespace as the delimiter and curly brackets as quote characters. The example file paths refer to this repository (to keep no more than two links in the post):
An example of such a file can be found at latex/thesis/results/plain/metric/manual.dat, which is processed with the LaTeX commands in latex/thesis/results/plotter.tex.
To process such files with Python you have to use a workaround. Replacing all opening brackets with closing ones (or the other way round) helps, but may lead to errors in the processing (e.g., if there is an opening bracket inside a field). An example of this workaround is implemented in a helper function in latex/thesis/results/scores_utils.py, which is used for example in latex/thesis/results/groupInvestigation.py.
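For illustration, here is a minimal sketch of that kind of workaround (the function name `read_braced` and the sample values are my own, not copies of scores_utils.py): rewrite every closing brace as an opening brace so that a single quotechar suffices, then hand the lines to csv.reader. As noted above, this breaks if a field itself contains a brace.

```python
import csv
from io import StringIO

def read_braced(f):
    # Workaround sketch: csv accepts only ONE quotechar, so make the
    # closing brace identical to the opening one before parsing.
    # Fails if a '{' or '}' appears inside a field's text.
    unified = (line.replace('}', '{') for line in f)
    return csv.reader(unified, delimiter=' ', quotechar='{')

sample = StringIO(
    'commit file before after\n'
    '{abc123} {SearchAction.java} 0.60 0.55\n'
)
rows = list(read_braced(sample))
print(rows[1])  # ['abc123', 'SearchAction.java', '0.60', '0.55']
```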
I'm afraid that this format is much more complex than any format that can be processed by the csv module. If curly braces are used to quote string values, how are curly braces themselves represented? If some other escape character is used for this, how is it represented? How are whitespace and other "special" characters represented? Are repeated whitespace characters collapsed or completely ignored? How are non-ASCII characters represented? I would not be surprised if it uses \lbrace, \rbrace, \backslash, \space, etc., with a special mapping between names and all special characters and complex rules to determine the end of the command.
To process the sample manual.dat file, all you really need is an alternation regex pattern that matches either runs of non-bracket characters enclosed in brackets or runs of non-space characters. Additional lookaround patterns ensure that the fields are separated by whitespace:
import re
from io import StringIO

field_pattern = re.compile(
    r'(?<=(?<!\S)\{)[^}]*(?=\}(?!\S))|(?<!\S)(?!\{)\S+(?!\S)')

file = StringIO('''\
commit file before after
{./SonarSource-sonarqube/592397657f44ebb8869159e86087fa62f2c64dd0} {QGChangeEventListenersImplTest.java} 0.2554725331856924 0.25835876057253165
{./SonarSource-sonarqube/51ae2098d531a72c7a7136a4da1063fe05a2bc0e} {SearchAction.java} 0.6027389261871576 0.5584944983323415
{./SonarSource-sonarqube/22600d84f370f18b3050e2e06eec9d9975117487} {IssueQueryTest.java} 0.2875728372794886 0.2879662721728285
''')

next(file)  # skip the header line
for commit, file_name, before, after in map(field_pattern.findall, file):
    print(commit, file_name, before, after)
How to represent whitespaces and other “special” characters?
Inside the curly braces, whitespace is allowed.
Are repeating whitespaces collapsed or completely ignored?
Yes, they are ignored by pgfplots: "Columns are usually separated by white spaces (at least one tab or space)", from the documentation. Python's csv module wants only a single delimiter. Is there a way to also ignore multiple delimiters in Python?
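As far as I know the csv module itself cannot treat a run of tabs and spaces as one delimiter (skipinitialspace only drops spaces directly after a delimiter). A small sketch with made-up sample data: str.split() without arguments collapses any whitespace run, but it would also split inside the braces, so for braced lines you would have to collapse the runs outside the braces first. The regex below is my own suggestion, not a csv-module feature, and it assumes braces are never nested:

```python
import re

line = '{a b}   {c.java}  0.25\t0.26'  # runs of tabs/spaces between fields

# str.split() collapses whitespace runs, but also splits inside braces:
print(line.split())  # ['{a', 'b}', '{c.java}', '0.25', '0.26']

# Collapse only the whitespace runs *outside* braced groups, i.e. runs
# that are not followed by a closing brace before any opening brace:
collapsed = re.sub(r'\s+(?![^{]*\})', ' ', line.strip())
print(collapsed)  # '{a b} {c.java} 0.25 0.26'
```

The collapsed line could then be fed to a parser that expects a single-character delimiter.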
Maybe this format really is too complex for the Python csv module. At first it looks very similar to a classical CSV file, but in detail there are many differences.
Nice solution, and this complex regex even works for other numbers of columns (example at regex101.com: regex101: build, test, and debug regex). I will consider this in my next project. I'm curious to see whether your regex pattern or my workaround with the csv module is faster at runtime.
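That question could be answered with a quick micro-benchmark. A sketch (the shortened row data is made up, and the measured numbers will of course depend on the machine, so I make no claim about which approach wins):

```python
import csv
import re
import timeit

row = ('{./SonarSource-sonarqube/592397657f44e} '
       '{QGChangeEventListenersImplTest.java} 0.2554 0.2583')
lines = [row] * 1000  # synthetic input of 1000 identical rows

pattern = re.compile(
    r'(?<=(?<!\S)\{)[^}]*(?=\}(?!\S))|(?<!\S)(?!\{)\S+(?!\S)')

def with_regex():
    # the regex answer: one findall per line
    return [pattern.findall(line) for line in lines]

def with_csv():
    # the brace-rewriting workaround: unify the quote character,
    # then let csv.reader do the parsing
    unified = (line.replace('}', '{') for line in lines)
    return list(csv.reader(unified, delimiter=' ', quotechar='{'))

assert with_regex() == with_csv()  # both parsers agree on this input
print('regex:', timeit.timeit(with_regex, number=100))
print('csv:  ', timeit.timeit(with_csv, number=100))
```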