I found a missing feature in Python's csv module: it is only possible to define one quotechar. There seems to be no way to define different opening and closing quote characters. This is necessary if you bracket the fields in your CSV files.
Example use case
An example use case is files prepared for pgfplots diagrams in LaTeX documents. This format uses whitespace as the delimiter and curly brackets as quote characters. The example file paths refer to this repository (to keep no more than two links in the post):
An example of such a file can be found at latex/thesis/results/plain/metric/manual.dat, which is processed with the LaTeX commands in latex/thesis/results/plotter.tex.
To process such files with Python you have to use a workaround. Replacing all opening brackets with closing ones (or the other way round) helps, but may lead to errors in the processing (e.g., if there is an opening bracket inside a field). An example of this workaround is implemented in a helper function in latex/thesis/results/scores_utils.py, which is used for example in latex/thesis/results/groupInvestigation.py.
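For illustration, here is a minimal sketch of that kind of workaround (the function name `read_braced` and the sample values are my own, not copies of scores_utils.py): rewrite every closing brace as an opening brace so that a single quotechar suffices, then hand the lines to csv.reader. As noted above, this breaks if a field itself contains a brace.

```python
import csv
from io import StringIO

def read_braced(f):
    # Workaround sketch: csv accepts only ONE quotechar, so make the
    # closing brace identical to the opening one before parsing.
    # Fails if a '{' or '}' appears inside a field's text.
    unified = (line.replace('}', '{') for line in f)
    return csv.reader(unified, delimiter=' ', quotechar='{')

sample = StringIO(
    'commit file before after\n'
    '{abc123} {SearchAction.java} 0.60 0.55\n'
)
rows = list(read_braced(sample))
print(rows[1])  # ['abc123', 'SearchAction.java', '0.60', '0.55']
```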
I'm afraid that this format is much more complex than any format that can be processed by the csv module. If curly braces are used to quote string values, how are curly braces themselves represented? If some other escape character is used for this, how is it represented? How are whitespace and other "special" characters represented? Are repeated whitespace characters collapsed or completely ignored? How are non-ASCII characters represented? I would not be surprised if it uses \lbrace, \rbrace, \backslash, \space, etc., with a special mapping between names and all special characters and complex rules to determine the end of the command.
To process the sample manual.dat file, all you really need is an alternation regex pattern that matches either runs of non-bracket characters enclosed in brackets or runs of non-space characters. Additional lookaround patterns ensure that the fields are separated by whitespace:
import re
from io import StringIO

field_pattern = re.compile(
    r'(?<=(?<!\S)\{)[^}]*(?=\}(?!\S))|(?<!\S)(?!\{)\S+(?!\S)')

file = StringIO('''\
commit file before after
{./SonarSource-sonarqube/592397657f44ebb8869159e86087fa62f2c64dd0} {QGChangeEventListenersImplTest.java} 0.2554725331856924 0.25835876057253165
{./SonarSource-sonarqube/51ae2098d531a72c7a7136a4da1063fe05a2bc0e} {SearchAction.java} 0.6027389261871576 0.5584944983323415
{./SonarSource-sonarqube/22600d84f370f18b3050e2e06eec9d9975117487} {IssueQueryTest.java} 0.2875728372794886 0.2879662721728285
''')

next(file)  # skip the header line
for commit, file_name, before, after in map(field_pattern.findall, file):
    print(commit, file_name, before, after)
How to represent whitespaces and other “special” characters?
Inside the curly braces, whitespace is allowed.
Are repeating whitespaces collapsed or completely ignored?
Yes, they are ignored by pgfplots: "Columns are usually separated by white spaces (at least one tab or space)", from the documentation. Python's csv module wants only a single delimiter. Is there a way to also ignore multiple delimiters in Python?
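As far as I know the csv module itself cannot treat a run of tabs and spaces as one delimiter (skipinitialspace only drops spaces directly after a delimiter). A small sketch with made-up sample data: str.split() without arguments collapses any whitespace run, but it would also split inside the braces, so for braced lines you would have to collapse the runs outside the braces first. The regex below is my own suggestion, not a csv-module feature, and it assumes braces are never nested:

```python
import re

line = '{a b}   {c.java}  0.25\t0.26'  # runs of tabs/spaces between fields

# str.split() collapses whitespace runs, but also splits inside braces:
print(line.split())  # ['{a', 'b}', '{c.java}', '0.25', '0.26']

# Collapse only the whitespace runs *outside* braced groups, i.e. runs
# that are not followed by a closing brace before any opening brace:
collapsed = re.sub(r'\s+(?![^{]*\})', ' ', line.strip())
print(collapsed)  # '{a b} {c.java} 0.25 0.26'
```

The collapsed line could then be fed to a parser that expects a single-character delimiter.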
Maybe this format really is too complex for the Python csv module. At first it looks very similar to a classical CSV file, but in detail there are many differences.
Nice solution, and this complex regex even works for other numbers of columns (example at regex101.com: regex101: build, test, and debug regex). I will consider this in my next project. I'm curious to see whether your regex pattern or my workaround with the csv module is faster at runtime.
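That question could be answered with a quick micro-benchmark. A sketch (the shortened row data is made up, and the measured numbers will of course depend on the machine, so I make no claim about which approach wins):

```python
import csv
import re
import timeit

row = ('{./SonarSource-sonarqube/592397657f44e} '
       '{QGChangeEventListenersImplTest.java} 0.2554 0.2583')
lines = [row] * 1000  # synthetic input of 1000 identical rows

pattern = re.compile(
    r'(?<=(?<!\S)\{)[^}]*(?=\}(?!\S))|(?<!\S)(?!\{)\S+(?!\S)')

def with_regex():
    # the regex answer: one findall per line
    return [pattern.findall(line) for line in lines]

def with_csv():
    # the brace-rewriting workaround: unify the quote character,
    # then let csv.reader do the parsing
    unified = (line.replace('}', '{') for line in lines)
    return list(csv.reader(unified, delimiter=' ', quotechar='{'))

assert with_regex() == with_csv()  # both parsers agree on this input
print('regex:', timeit.timeit(with_regex, number=100))
print('csv:  ', timeit.timeit(with_csv, number=100))
```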