PDF scraping with tabula: columns do not correspond with values from the PDF files

I’m stuck on PDF scraping. Basically, I want to obtain a CSV file, but first I want to extract the data into DataFrames in Python. Here are the First and Second PDF files from which I want to extract data. These files have a similar structure, but differ slightly (at least visually).
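
For context, the tabula-py call I have been trying is roughly along these lines (the file name and the lattice option are just what I experimented with, not a working solution):

#!/usr/bin/python3
# Rough sketch of the tabula-py attempt (assumes tabula-py and Java are installed).
import tabula

# "BOE-A-2014-13181.pdf" is one of the linked files.
tables = tabula.read_pdf("BOE-A-2014-13181.pdf", pages="all", lattice=True)

# Each element is a pandas DataFrame, but the columns do not line up
# with the values shown in the PDF.
for df in tables:
    print(df.head())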

Could anyone help me please?
Thank you.

I’ve written a very simple script that I use for text extraction from .pdf files, which writes the extracted text to a .txt file.

I’ve tested it with one of the .pdf files to which you have linked and it does work, but processing the .txt to get the required data into .csv format is not going to be easy to automate, not least because there are commas within the data that you’re trying to extract.

You’re welcome to have this and see if it’s of any use. There’s no sanity checking, so just use some common sense: put this script in the same location as your .pdf file, enter the file name, and then go and make yourself a coffee while it does its job.

#!/usr/bin/python3

from PyPDF2 import PdfReader

pdf_document = input("file: ")

# Name the output .txt file after the input file.
file_name = pdf_document.split('.')

text_file = f'{file_name[0]}.txt'

with open(pdf_document, "rb") as pdf_read, open(text_file, mode='w', encoding='UTF-8') as output:
    pdf = PdfReader(pdf_read)
    num_of_pages = len(pdf.pages)
    print(f"There are {num_of_pages} pages to process.")
    print()
    print("Working...")
    # Write each page's text, preceded by a page header and separated by blank lines.
    for page_number in range(num_of_pages):
        page = pdf.pages[page_number]
        print(f"Page: {page_number+1}", file=output)
        print('', file=output)
        print(page.extract_text(), file=output)
        print('', file=output)
print("Done.")

Hi Rob,

Your script works beautifully well!

Perhaps one little question, please:

  1. Is there a way to separate the values in the tables by commas or tabs, please?
     The idea is to extract only the data that are in the tables.

Thank you again for your beautiful script!
Lovely day.

You are very welcome, and I do fully appreciate the shortcomings of my simple script, but the fact is that extracting the data in the way that you would like (that is, extracting only the data from the tables and having it written to a .csv file) is hard, if not impossible.

Have a read of this link to see why I say that.

There are no markers (so far as I can see) within the .pdf file that would indicate where any table either begins or ends. Aside from that (as I said above), there are commas within the table data, so absent any table markers, there’s no way to distinguish between a comma that is a value separator and a comma that is part of the data itself. I’ve also looked for a particular “pattern” in the hope that someone who knows about regex could help with this, but again, I see no opportunity for such an extraction method.
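
To make the comma problem concrete, here is a made-up line shaped like one of those table rows (the values are illustrative only); the cvf column uses a decimal comma, so once the row is written out with comma separators you can no longer tell which commas are separators:

#!/usr/bin/python3
# Illustrative only: a fabricated row in the shape of the table data.
csv_line = "SEAT,LEON 1.6 TDI,2010-2013,1598,4,D,77,11,16,119,11,15000"

# The table has 11 columns, but a plain comma split sees 12 fields,
# because the decimal comma in the cvf value "11,16" looks like a separator.
print(len(csv_line.split(",")))   # 12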

PDFs were never designed to be reverse engineered (so far as I’m aware); rather, they were designed as a presentation format.


Hi again Rob,

Thank you for your time.

I now fully understand the issues surrounding what I want!

We’ll have to make do…
Thanks for your time, Rob.

This will parse the text file and produce the CSV file. I haven’t checked whether all the rows are correct or whether any rows are missing - there are over 33_000 of them in that first PDF!

#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import splitext
import re
import csv

text_file = r"BOE-A-2014-13181.txt"
csv_file = splitext(text_file)[0] + ".csv"

with open(text_file, encoding="UTF-8") as input_file:
    text = input_file.read()

fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]

# Build the pattern here for clarity.
fields = []

# MARCA
fields.append(r"[A-Z]+(?: [A-Z]+)?")
# MODELO-TIPO
fields.append(r".+")
# PERIODO COMERCIAL (end year is optional)
fields.append(r"\d{4}-(?:\d{4})?")
# C.C.
fields.append(r"\d+")
# Nº de cilind.
fields.append(r"\d+")
# G/D
fields.append(r"[GD]")
# P kW
fields.append(r"\d+(?:-\d+)?")
# cvf
fields.append(r"\d+,\d+")
# CO2 gr/km (can be empty)
fields.append(r"\d*")
# cv
fields.append(r"\d+")
# VALOR EUROS
fields.append(r"\d+")

pattern = " ".join(f"({field})" for field in fields)

with open(csv_file, "w", newline="", encoding="UTF-8") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()

    for groups in re.findall(pattern, text):
        row = dict(zip(fieldnames, groups))
        writer.writerow(row)
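
A quick way to see roughly how many rows made it across is to count the data rows in the CSV; something like this (the file name assumes the text file used above):

#!/usr/bin/python3
# Quick sanity check: count the data rows in the CSV produced above.
import csv

csv_path = "BOE-A-2014-13181.csv"   # the file written by the script above

with open(csv_path, encoding="UTF-8", newline="") as f:
    row_count = sum(1 for _ in csv.reader(f)) - 1  # minus the header row

print(f"{row_count} data rows in {csv_path}")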

Hi Matthew Barnett,

Thank you very much for the code provided.
I will give it a try and get back to you as soon as I know something!

Michael

Hi again Matthew,

I tried your code and it works beautifully well!

Michael


P.S.: Many thanks to you, Matthew, and to Rob for your kindness and for sharing the code!

If you only knew how much time I’ve wasted on this…
Everything works superbly well. The two codes you’ve provided work perfectly and complement each other.

Hello Everyone,
Hello @MRAB,

I have a quick question please:
I have looked carefully at the .csv file that you were able to produce from the .pdf files provided. Thank you so much for that. Beautiful work, indeed!

However, I have noticed that some lines are not present in the .csv file, even though they do appear in the .txt file produced above.

Could you perhaps give me more insight into why this is happening, and whether there is a way to get all the missing values into the .csv file, please?

Thank you very much for your help and time, again. Sorry for the inconvenience.
If it is too cumbersome to change that, no worries. I’ll have to make do.

If a line in the text file doesn’t match the pattern, it’s not going to be found. The pattern needs to be tweaked to get a better result, but it’ll take a while.
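
One way to find the lines that are being missed is to run the pattern over the text file line by line and print the ones that look like table rows but don’t match. A rough sketch (the “looks like a table row” heuristic is only a guess, and the abbreviated pattern should be replaced with the full one from the script above):

#!/usr/bin/python3
# Rough sketch: report lines that look like table rows but don't match the pattern.
import re

text_file = "BOE-A-2014-13181.txt"   # whichever text file you converted

# Abbreviated version of the row pattern -- substitute the full pattern here.
row_pattern = re.compile(r"[A-Z]+(?: [A-Z]+)? .+ \d{4}-(?:\d{4})? .* \d+")

with open(text_file, encoding="UTF-8") as f:
    for number, line in enumerate(f, start=1):
        line = line.strip()
        # Guess: table rows start with an upper-case make and end in digits.
        looks_like_row = re.match(r"[A-Z]{2,} ", line) and re.search(r"\d$", line)
        if looks_like_row and not row_pattern.fullmatch(line):
            print(number, line)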


The problem is that the extracted text has the entries of each row separated by a single space, but sometimes an extra space is added, splitting an entry, or two rows end up on one line.

For example, for “HONDA CIVIC 1.4 Sport”, C.C. of “1339” became “1 339” and for “HONDA CIVIC 1.4 Trend”, C.C. of “1339” became “13 39”.

Try this:

#!/usr/bin/python3
# -*- encoding: utf-8 -*-
from os.path import splitext
import re
import csv

text_file = r"BOE-A-2015-13918.txt"
csv_file = splitext(text_file)[0] + ".csv"

with open(text_file, encoding="UTF-8") as input_file:
    text = input_file.read()

fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "Nº de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]

# The columns are separated by 1 space.
pattern = (
    # MARCA
    r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
    # (separator)
    " "
    # MODELO-TIPO
    r"(?P<modelo>.+?)"
    # (separator)
    " "
    # PERIODO COMERCIAL (end year is optional)
    r"(?P<periodo>\d{4}-(?:\d{4}|\d{3} \d\b)?)"
    # (separator)
    " "
    # C.C.
    r"(?P<cc>\d+(?: \d+\b)?)"
    # (separator)
    " "
    # Nº de cilind.
    r"(?P<cilind>\d\b|1\d\b)"
    # (separator)
    " "
    # G/D
    r"(?P<gd>[GD])"
    # (separator)
    " "
    # P kW (single value or range)
    r"(?P<pkw>\d+(?:-\d+)?)"
    # (separator)
    " "
    # cvf
    r"(?P<cvf>\d+ ?,\d+)"
    # (separator)
    " "
    # CO2 gr/km (can be empty)
    r"(?P<co2>\d*)"
    # (separator)
    " "
    # cv
    r"(?P<cv>\d+)"
    # (separator)
    " "
    # VALOR EUROS
    r"(?P<valor>\d+)"
)

with open(csv_file, "w", newline="", encoding="UTF-8") as output_file:
    writer = csv.DictWriter(output_file, fieldnames=fieldnames)
    writer.writeheader()

    for m in re.finditer(pattern, text):
        row = [
            m["marca"],
            " ".join(m["modelo"].split()),
            m["periodo"].replace(" ", ""),
            m["cc"].replace(" ", ""),
            m["cilind"],
            m["gd"],
            m["pkw"],
            m["cvf"].replace(" ", ""),
            m["co2"],
            m["cv"],
            m["valor"],
        ]

        writer.writerow(dict(zip(fieldnames, row)))

Hi @MRAB,

It works nicely!

Thank you for your time and your beautiful code. Your help is really appreciated!
I apologize; this pdf file is really difficult to work with, as when we extract the .txt file, there is no consistency in the values (some are written with extra spaces, others with “,”, etc.).

Thanks again for your time and patience.