Hi @kknechtel,
First of all, I apologize for my poor explanations.
Here's the thing: I extracted the data from the PDF using the following code, kindly suggested by some people on this blog:
>
> #!/usr/bin/python3
>
> from PyPDF2 import PdfReader
>
> pdf_document = input("file: ")
>
> text_file = 'prices_cars_2015_data.txt'
>
> with open(pdf_document, "rb") as pdf_read, open(text_file, mode='w', encoding='UTF-8') as output:
>     pdf = PdfReader(pdf_read)
>     num_of_pages = len(pdf.pages)
>     print(f"There are {num_of_pages} pages to process.")
>     print()
>     print("Working...")
>     for page_number in range(num_of_pages):
>         page = pdf.pages[page_number]
>         # Write a "Page: N" marker, then the extracted text of that page.
>         print(f"Page: {page_number+1}", file=output)
>         print('', file=output)
>         print(page.extract_text(), file=output)
>         print('', file=output)
> print("Done.")
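Because the script above writes a "Page: N" marker before each page's text, the resulting .txt file can be cut down to a page range afterwards (for example pages 4 to 605, where the data is). Here is a minimal sketch of that idea; the name `select_pages` is hypothetical, and it assumes that no line inside the extracted page text itself looks like "Page: N":

```python
import re

# Matches the "Page: N" marker lines written by the extraction script.
PAGE_MARKER = re.compile(r"^Page: (\d+)[ \t]*$", re.MULTILINE)


def select_pages(text, first, last):
    """Return {page number: page text} for pages first..last (inclusive)."""
    # re.split with one capture group yields:
    # [text before the first marker, "1", page 1 text, "2", page 2 text, ...]
    parts = PAGE_MARKER.split(text)
    pages = {}
    for number, body in zip(parts[1::2], parts[2::2]):
        n = int(number)
        if first <= n <= last:
            pages[n] = body.strip("\n")
    return pages


if __name__ == "__main__":
    # Tiny made-up sample in the same layout as the generated .txt file.
    sample = "Page: 1\n\ncover\n\nPage: 2\n\nfirst table\n\nPage: 3\n\nsecond table\n\n"
    print(select_pages(sample, 2, 3))
```

With the real file, the text would be read from prices_cars_2015_data.txt and the call would be `select_pages(text, 4, 605)`.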
I then used this TXT file to produce a CSV file for further use in other data analysis software, such as R or Stata. The script below also creates an "unmatched" .txt file. To extract more data from it, I have to change the pattern (see the code below) or edit the new "unmatched" .txt file directly, and then run the script again so that the changes are added to the final CSV file that I want:
#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import exists, getmtime, splitext
import re
import csv
text_path = "prices_cars_2015_data.txt"
# The script puts the unmatched text into another file. If that file exists and is newer
# than the original text file, it will be parsed instead and the matched output will be
# appended to the CSV file.
unmatched_path = "%s unmatched%s" % splitext(text_path)
csv_path = splitext(text_path)[0] + ".csv"
if exists(unmatched_path) and getmtime(unmatched_path) > getmtime(text_path):
    # Not first time. Work from the unmatched file.
    input_path = unmatched_path
    csv_mode = "a"
else:
    # First time. Work from the text file.
    input_path = text_path
    csv_mode = "w"
with open(input_path, encoding="UTF-8") as input_file:
    text = input_file.read()
fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]
# The columns are separated by 1 space.
pattern = (
    # MARCA
    r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
    # (separator)
    r"\s+"
    # MODELO-TIPO
    r"(?P<modelo>.+?)"
    # (separator)
    " "
    # PERIODO COMERCIAL (end year is optional)
    r"(?P<periodo>(?:\d{4}-)?(?:\d{4})?)"
    # (separator)
    " "
    # C.C.
    r"(?P<cc>\d+)"
    # (separator)
    " "
    # Nº de cilind.
    r"(?P<cilind>\d+)"
    # (separator)
    " "
    # G/D
    r"(?P<gd>[GDMS]|GyE|DyE|Elc)"
    # (separator)
    " "
    # P kW (single value or range)
    r"(?P<pkw>\d+(?:-)?(?:\d+)?)"
    # (separator)
    " "
    # cvf (decimal comma, optionally preceded by a stray space)
    r"(?P<cvf>(?:\d+ ?,\d+)?)"
    # (separator)
    " "
    # CO2 gr/km (can be empty)
    r"(?P<co2>(?:\d*)?)"
    # (separator)
    r"\s+"
    # cv
    r"(?P<cv>(?:\d+)?)"
    # (separator)
    " "
    # VALOR EUROS
    r"(?P<valor>\d+)"
)
unmatched = []
with open(csv_path, csv_mode, newline="", encoding="UTF-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    if csv_mode == "w":
        # Write the header row only the first time.
        writer.writeheader()
    cur_pos = 0
    for m in re.finditer(pattern, text):
        # Copy the unmatched text.
        unmatched.append(text[cur_pos : m.start()])
        cur_pos = m.end()
        row = [
            m["marca"],
            " ".join(m["modelo"].split()),
            m["periodo"].replace(" ", ""),
            m["cc"].replace(" ", ""),
            m["cilind"],
            m["gd"],
            m["pkw"],
            m["cvf"].replace(" ", ""),
            m["co2"],
            m["cv"],
            m["valor"],
        ]
        writer.writerow(dict(zip(fieldnames, row)))
    unmatched.append(text[cur_pos : ])
unmatched = "\n".join(unmatched)
unmatched = re.sub(r"\n{3,}", r"\n\n", unmatched)
with open(unmatched_path, "w", encoding="UTF-8") as unmatched_file:
    unmatched_file.write(unmatched)
The code provided is fantastic, but some lines of my PDF are not captured, and I absolutely need all the data from pages 4 to 605. Capturing by hand the lines that the pattern misses is difficult and, above all, time-consuming. The structure of the resulting .txt file is not uniform: some values are written as, say, "1,500", others as "1 50 0", and still others as "150 0", which makes the process difficult. Some columns are empty, etc.
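To illustrate the spacing problem, here is a minimal sketch of how a single value could be cleaned up once it has been isolated. The name `collapse_digit_spaces` is hypothetical. It assumes that any whitespace between two digits inside one value is extraction noise, so it must not be applied to a whole line: the columns are themselves separated by a single space, and neighbouring numeric columns would be merged. It leaves commas alone, because I cannot say in general whether a comma is a decimal or a thousands separator in a given column.

```python
import re


def collapse_digit_spaces(value):
    """Remove whitespace that sits between two digits, e.g. '1 50 0' -> '1500'."""
    # The lookbehind and lookahead keep the digits; only the whitespace
    # between them is dropped.
    return re.sub(r"(?<=\d)\s+(?=\d)", "", value)


if __name__ == "__main__":
    for raw in ["1,500", "1 50 0", "150 0"]:
        print(repr(raw), "->", repr(collapse_digit_spaces(raw)))
```

This only repairs stray spaces inside one field. Deciding where one column ends and the next begins is the harder part, and this sketch does not address it.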
As a result, some people on Stack Overflow suggested I use the proposal in post #1.
However, I only started learning Python a week ago, so my knowledge of it is very limited.
I hope someone can help me with this. Again, I apologize for my poor explanations.
Many thanks in advance.