I am working on this PDF file, for year 2017.
I have extracted from it all text, and put it on a .txt
file. But when to transform this .txt
file into a .csv
one, I struggle with some issues.
Basically, the last two column (cv
) and (VALOR
) from my original PDF file are problematic:
For example, take a look at this example:
MARCA | MODELO-TIPO | PERIODO COMERCIAL | CC | Nªdecilind. | G/D | P kW | cvf | CO2 gr/km | cv | Valor Euros |
---|---|---|---|---|---|---|---|---|---|---|
A | 2008-2011 | 2008-2011 | 2776 | 4 D | 120 | 16,22 | 247 | 16 | 3 |
The numbers should be 163 in cv
, an not 16. In the column valor
, the number should be 41700
I have many cases as this one.
Sometimes, it is the reverse that happens: Valor is separated in the .txt
file into two or more separate values, but the result is the same: I end up with the wrong numbers on these two columns.
The code below is inspired on @MRAB:
#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import splitext
import re
import csv
text_file = r"prices_cars_2017.txt"
csv_file = splitext(text_file)[0] + ".csv"
with open(text_file, encoding="UTF-8") as input_file:
text = input_file.read()
fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]
# The columns are separated by 1 space.
pattern = (
# MARCA
r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
# (separator)
" "
# MODELO-TIPO
r"(?P<modelo>.+?)"
# (separator)
" "
# PERIODO COMERCIAL (end year is optional)
r"(?P<periodo>\d{4}-(?:\d{4}|\d{3} \d\b)?)"
# (separator)
" "
# C.C.
r"(?P<cc>\d+(?: \d+\b)?)"
# (separator)
" "
# Nº de cilind.
r"(?P<cilind>\d\b|1\d\b)"
# (separator)
" "
# G/D
r"(?P<gd>[GDMS]|GyE|DyE|Elc)"
# (separator)
" "
# P kW (single value or range)
r"(?P<pkw>\d+(?:-\d+)?)"
# (separator)
" "
# cvf
r"(?P<cvf>\d+ ?,\d+)"
# (separator)
" "
# CO2 gr/km (can be empty)
r"(?P<co2>(?:\d*)?)"
# (separator)
" "
# cv
r"(?P<cv>\d+)"
# (separator)
" "
# VALOR EUROS
r"(?P<valor>\d+)"
)
with open(csv_file, "w", newline="", encoding="UTF-8") as output_file:
writer = csv.DictWriter(output_file, fieldnames=fieldnames)
writer.writeheader()
for m in re.finditer(pattern, text):
row = [
m["marca"],
" ".join(m["modelo"].split()),
m["periodo"].replace(" ", ""),
m["cc"].replace(" ", ""),
m["cilind"],
m["gd"],
m["pkw"],
m["cvf"].replace(" ", ""),
m["co2"],
m["cv"],
m["valor"],
]
writer.writerow(dict(zip(fieldnames, row)))
Thank you so much for your help.
P.S.: Feel free to change the code provided above if needed please, as many rows are still missing. Again, many thanks to people involved in the code provided above.