PDF Scraping to CSV: Some rows not appearing

I am working with a PDF file for the year 2017.

I have extracted all of the text from it into a .txt file, but when I try to transform that .txt file into a .csv file, I run into some issues.

Basically, the last two columns (cv and VALOR) from my original PDF file are problematic.

For example, take a look at this row:

MARCA MODELO-TIPO PERIODO COMERCIAL CC Nªdecilind. G/D P kW cvf CO2 gr/km cv Valor Euros
A 2008-2011 2008-2011 2776 4 D 120 16,22 247 16 3

The number in the cv column should be 163, not 16, and the number in the VALOR column should be 41700.

I have many cases like this one.

Sometimes the reverse happens: VALOR is split in the .txt file into two or more separate values. Either way, the result is the same: I end up with the wrong numbers in these two columns.
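A small post-processing pass can repair some of these splits before parsing. This is only a sketch of the idea, assuming the split always leaves a lone trailing digit as in the row above; other split patterns would need their own rules:

```python
import re

line = "A 2008-2011 2008-2011 2776 4 D 120 16,22 247 16 3"

# Heuristic sketch: if a line ends with "<digits> <single digit>", assume the
# extractor split one number across a space and rejoin the two pieces.
fixed = re.sub(r"(\d+) (\d)$", r"\1\2", line)
print(fixed)  # → "A 2008-2011 2008-2011 2776 4 D 120 16,22 247 163"
```

Of course, this only handles one specific kind of split, so confirming where the mangling actually happens first (PDF extraction vs. parsing) matters more than any single heuristic.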

The code below is inspired by @MRAB's:


#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import splitext
import re
import csv

text_file = r"prices_cars_2017.txt"
csv_file = splitext(text_file)[0] + ".csv"

with open(text_file, encoding="UTF-8") as input_file:
    text = input_file.read()

fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]

# The columns are separated by 1 space.
pattern = (
    # MARCA
    r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
    # (separator)
    " "
    # MODELO-TIPO
    r"(?P<modelo>.+?)"
    # (separator)
    " "
    # PERIODO COMERCIAL (end year is optional)
    r"(?P<periodo>\d{4}-(?:\d{4}|\d{3} \d\b)?)"
    # (separator)
    " "
    # C.C.
    r"(?P<cc>\d+(?: \d+\b)?)"
    # (separator)
    " "
    # Nº de cilind.
    r"(?P<cilind>\d\b|1\d\b)"
    # (separator)
    " "
    # G/D
    r"(?P<gd>[GDMS]|GyE|DyE|Elc)"
    # (separator)
    " "
    # P kW (single value or range)
    r"(?P<pkw>\d+(?:-\d+)?)"
    # (separator)
    " "
    # cvf
    r"(?P<cvf>\d+ ?,\d+)"
    # (separator)
    " "
    # CO2 gr/km (can be empty)
    r"(?P<co2>(?:\d*)?)"
    # (separator)
    " "
    # cv
    r"(?P<cv>\d+)"
    # (separator)
    " "
    # VALOR EUROS
    r"(?P<valor>\d+)"
)

with open(csv_file, "w", newline="", encoding="UTF-8") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()

    for m in re.finditer(pattern, text):
        row = [
            m["marca"],
            " ".join(m["modelo"].split()),
            m["periodo"].replace(" ", ""),
            m["cc"].replace(" ", ""),
            m["cilind"],
            m["gd"],
            m["pkw"],
            m["cvf"].replace(" ", ""),
            m["co2"],
            m["cv"],
            m["valor"],
        ]

        writer.writerow(dict(zip(fieldnames, row)))

Thank you so much for your help.

P.S.: Please feel free to change the code above if needed, as many rows are still missing. Again, many thanks to the people involved in writing it.

Erm, are you sure you’re having problems with this stage?

From the way you describe it elsewhere and the intermittent nature, it sounds much more likely that the problem is in creating a correct .txt file from the PDF. You should be able to determine this by trying to convert a few copies of the text file where you've deleted and retyped the last few columns (so you have some known-good text files to test).

My suspicion is you’re having conversion issues from the PDF and if you can confirm that, you’d need to share the PDF code instead.


Hi @nmstoker:

Thank you for your response. You are probably right. Here is the code used to convert from PDF to TXT. It simply extracts all the text from my PDF.


#!/usr/bin/python3

from PyPDF2 import PdfReader

pdf_document = input("file: ")

file_name = pdf_document.split('.')

text_file = f'{file_name[0]}.txt'

with open(pdf_document, "rb") as pdf_read, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(pdf_read)
    num_of_pages = len(pdf.pages)
    print(f"There are {num_of_pages} pages to process.")
    print()
    print("Working...")
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)
print("Done.")

Thank you again.

P.S.: Code shared by @rob42. Thanks to him.

I suggest you have a go at confirming this for sure before others spend effort digging further while the actual problem is still unclear.

People here are usually glad to help but they’re volunteers and it is unreasonable to expect others to do things for you when you haven’t made a relatively modest effort. I don’t want this to sound abrasive, and I wish you luck getting this solved :+1:


@nmstoker, I completely understand that.
My knowledge of Python is still limited. I apologize for it.

I confirm that the TXT file does indeed contain numbers that don’t correspond to their places in the PDF file (at least in tables I am interested in).

Sorry again.
Thank you for your time and patience.


No problem.

I’ll try to take a look later (probably Sunday) unless someone else beats me to it!


Here’s another version. It writes the unmatched text to another file.

If you find any unmatched rows in that file, fix them in that file and/or modify the pattern.

The next time the program is run, it will read from that file instead of the original text file and any additional entries it finds will be added to the existing CSV file.

#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import exists, getmtime, splitext
import re
import csv

text_path = "BOE-A-2014-13181.txt"

# The script puts the unmatched text into another file. If that file exists and is newer
# than the original text file, it will be parsed instead and the matched output will be
# appended to the CSV file.

unmatched_path = "%s unmatched%s" % splitext(text_path)
csv_path = splitext(text_path)[0] + ".csv"

if exists(unmatched_path) and getmtime(unmatched_path) > getmtime(text_path):
    # Not first time. Work from the unmatched file.
    input_path = unmatched_path
    csv_mode = "a"
else:
    # First time. Work from the text file.
    input_path = text_path
    csv_mode = "w"

with open(input_path, encoding="UTF-8") as input_file:
    text = input_file.read()

fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]

# The columns are separated by 1 space.
pattern = (
    # MARCA
    r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
    # (separator)
    " "
    # MODELO-TIPO
    r"(?P<modelo>.+?)"
    # (separator)
    " "
    # PERIODO COMERCIAL (end year is optional)
    r"(?P<periodo>\d{4}-(?:\d{4})?)"
    # (separator)
    " "
    # C.C.
    r"(?P<cc>\d+)"
    # (separator)
    " "
    # Nº de cilind.
    r"(?P<cilind>\d+)"
    # (separator)
    " "
    # G/D
    r"(?P<gd>[GDM])"
    # (separator)
    " "
    # P kW (single value or range)
    r"(?P<pkw>\d+(?:-\d+)?)"
    # (separator)
    " "
    # cvf
    r"(?P<cvf>\d+ ?,\d+)"
    # (separator)
    " "
    # CO2 gr/km (can be empty)
    r"(?P<co2>\d*)"
    # (separator)
    " "
    # cv
    r"(?P<cv>\d+)"
    # (separator)
    " "
    # VALOR EUROS
    r"(?P<valor>\d+)"
)

unmatched = []

with open(csv_path, csv_mode, newline="", encoding="UTF-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    if csv_mode == "w":
        # Write the header row only the first time.
        writer.writeheader()

    cur_pos = 0

    for m in re.finditer(pattern, text):
        # Copy the unmatched text.
        unmatched.append(text[cur_pos : m.start()])
        cur_pos = m.end()

        row = [
            m["marca"],
            " ".join(m["modelo"].split()),
            m["periodo"].replace(" ", ""),
            m["cc"].replace(" ", ""),
            m["cilind"],
            m["gd"],
            m["pkw"],
            m["cvf"].replace(" ", ""),
            m["co2"],
            m["cv"],
            m["valor"],
        ]

        writer.writerow(dict(zip(fieldnames, row)))

    unmatched.append(text[cur_pos : ])

unmatched = "\n".join(unmatched)
unmatched = re.sub(r"\n{3,}", r"\n\n", unmatched)

with open(unmatched_path, "w", encoding="UTF-8") as unmatched_file:
    unmatched_file.write(unmatched)

PDF is fundamentally not designed for storing text with precise structure. It’s designed for making things look right when they get printed onto paper, and as far as it’s concerned, text is just another graphical element. The same text, with the same overall styling, could nevertheless be represented in countless different ways inside the PDF - it could be broken up into any combination of substrings that are separately rendered at whatever page position, and the representations of those substrings don’t even need to be near each other in the file.

PDF scraping, from everything I have seen, is about as reliable as OCR: i.e., not.


First of all, thank you all for your willingness to help. I am really happy to have become part of python.org! It’s an amazing community!

@MRAB: Thank you for the new version. I will try it and give you feedback about it soon. Again, thanks for your time, code and above all patience.

@kknechtel: Thank you so much for your clarifications!

All the best,

Hi @MRAB, Hi everyone:

I have a question about the code posted in post #8, please:

  • How can I append the unmatched rows? When I run the code, the .csv file is unmodified and I don’t know why. Could anyone give me more insight into this, please? I can see that getmtime(unmatched_path) > getmtime(text_path), yet the CSV is unchanged:
print(getmtime(text_path))
print(getmtime(unmatched_path))
1701689782.5346878
1701690697.3710618

Also, what does this mean in post #8:

“If you find any unmatched rows in that file, fix them in that file and/or modify the pattern”.

In fact, there are many unmatched models in my “unmatched” file, but I don’t know how to fix that by myself.

Thank you all again for your amazing help. I am so thankful for that!

Look in the unmatched file for rows that are still there and compare them with what’s in the PDF.

If the row wasn’t matched because, say, a number was split, for example “163” became “16 3”, then correct what’s in the unmatched file.

It might be that a row wasn’t recognised because an entry is blank in the PDF, in which case you’ll need to modify the pattern to make the entry optional, or it might be that an entry has an unexpected value, in which case you’ll need to modify the pattern to also expect that.

When you’ve corrected the errors in the unmatched file or modified the pattern, run the script again.

Then look in the unmatched file for any rows that still weren’t matched. Repeat this procedure as necessary.

Perfect! Everything is clear now.
Thank you so much for your help and time. Really appreciated!