PDF extraction with Python wrappers

Hi everyone,

I need some advice. Some people recommended that I use Python wrappers (Poppler’s pdftotext) to extract data from this PDF file, from page 4 to the end or to a known limit (here: page 605). But I have never used pdftotext before and need some help, please.

In this PDF the lines are simple, so the easiest extraction is to set a box that fits the data only (excluding the top and the right-hand side, from page 4 to the end or a known limit). If the output is monospaced characters, and column 1 is, say, 12 characters wide while column 2 spans characters 13–20, then it’s only a case of putting a comma between characters 12 and 13, and so on.

Apparently, most of the Python wrappers can now use Poppler’s pdftotext, which accepts options such as -x 50 -y 100 -W 500 -H 700. Combined with -nopgbrk, I should get all the lines as well-spaced text that can be sliced at character 12, character 21, etc.
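For reference, here is a minimal sketch of what I have in mind, based on what people described to me (the crop values and file names are placeholders, not tuned to the actual PDF):

#!/usr/bin/python3
# Sketch: drive Poppler's pdftotext from Python via subprocess.
# The crop box values and file names are placeholders.
import subprocess

subprocess.run(
    [
        "pdftotext",
        "-f", "4",      # first page to convert
        "-l", "605",    # last page to convert
        "-x", "50", "-y", "100", "-W", "500", "-H", "700",  # crop box
        "-layout",      # preserve the original physical layout
        "-nopgbrk",     # do not insert page-break characters
        "input.pdf", "output.txt",
    ],
    check=True,
)

# With a monospaced layout, each line can then be sliced at fixed columns:
with open("output.txt", encoding="utf-8") as f:
    for line in f:
        col1, col2 = line[:12].rstrip(), line[12:21].rstrip()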

  • Could anyone help me, please?
  • Does anyone have any ideas on how to use this?

Thank you very much.
Best,

I don’t understand what help you are looking for. Did you get stuck somewhere? What have you tried doing so far, and what problem did you encounter?

Hi @kknechtel,

First of all, I apologize for my poor explanations.
Here’s the thing: I extracted the data from the PDF using the following code, kindly suggested by some people on this forum:

> #!/usr/bin/python3
> 
> from PyPDF2 import PdfReader
> 
> pdf_document = input("file: ")
> text_file = "prices_cars_2015_data.txt"
> 
> # Read the PDF and write all the extracted text to one .txt file.
> with open(pdf_document, "rb") as pdf_read, open(text_file, mode="w", encoding="UTF-8") as output:
>     pdf = PdfReader(pdf_read)
>     num_of_pages = len(pdf.pages)
>     print(f"There are {num_of_pages} pages to process.")
>     print()
>     print("Working...")
>     for page_number in range(num_of_pages):
>         page = pdf.pages[page_number]
>         # Label each page in the output, then dump its text.
>         print(f"Page: {page_number + 1}", file=output)
>         print("", file=output)
>         print(page.extract_text(), file=output)
>         print("", file=output)
> print("Done.")

I then used this TXT file to obtain a CSV file for further use in other data analysis software, like R or Stata. The code below also creates an “unmatched” .txt file; to extract more data, I have to either change the pattern (see code below) or directly modify the “unmatched” .txt file and run the script again, so that the changes are integrated into the “true” CSV file that I want:

#!/usr/bin/python3
# -*- coding: utf-8 -*-
from os.path import exists, getmtime, splitext
import re
import csv

text_path = "prices_cars_2015_data.txt"

# The script puts the unmatched text into another file. If that file exists and is newer
# than the original text file, it will be parsed instead and the matched output will be
# appended to the CSV file.

unmatched_path = "%s unmatched%s" % splitext(text_path)
csv_path = splitext(text_path)[0] + ".csv"

if exists(unmatched_path) and getmtime(unmatched_path) > getmtime(text_path):
    # Not first time. Work from the unmatched file.
    input_path = unmatched_path
    csv_mode = "a"
else:
    # First time. Work from the text file.
    input_path = text_path
    csv_mode = "w"

with open(input_path, encoding="UTF-8") as input_file:
    text = input_file.read()

fieldnames = ["MARCA", "MODELO-TIPO", "PERIODO COMERCIAL", "C.C.", "NÂș de cilind.", "G/D", "P kW", "cvf", "CO2 gr/km", "cv", "VALOR EUROS"]

# The columns are separated by 1 space.
pattern = (
    # MARCA
    r"(?P<marca>[A-Z]+(?: [A-Z]+)?)"
    # (separator)
    "\s+"
    # MODELO-TIPO
    r"(?P<modelo>.+?)"
    # (separator)
    " "
    # PERIODO COMERCIAL (end year is optional)
    r"(?P<periodo>(?:\d{4}-)?(?:\d{4})?)"
    # (separator)
    " "
    # C.C.
    r"(?P<cc>\d+)"
    # (separator)
    " "
    # Nº de cilind.
    r"(?P<cilind>\d+)"
    # (separator)
    " "
    # G/D
    r"(?P<gd>[GDMS]|GyE|DyE|Elc)"
    # (separator)
    " "
    # P kW (single value or range)
    r"(?P<pkw>\d+(?:-)?(?:\d+)?)"
    # (separator)
    " "
    # cvf
    r"(?P<cvf>\d+ ?,\d+)?)"
    # (separator)
    " "
    # CO2 gr/km (can be empty)
    r"(?P<co2>(?:\d*)?)"
    # (separator)
    "\s+"
    # cv
    r"(?P<cv>(?:\d+)?)"
    # (separator)
    " "
    # VALOR EUROS
    r"(?P<valor>\d+)"
)

unmatched = []

with open(csv_path, csv_mode, newline="", encoding="UTF-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)

    if csv_mode == "w":
        # Write the header row only the first time.
        writer.writeheader()

    cur_pos = 0

    for m in re.finditer(pattern, text):
        # Copy the unmatched text.
        unmatched.append(text[cur_pos : m.start()])
        cur_pos = m.end()

        row = [
            m["marca"],
            " ".join(m["modelo"].split()),
            m["periodo"].replace(" ", ""),
            m["cc"].replace(" ", ""),
            m["cilind"],
            m["gd"],
            m["pkw"],
            m["cvf"].replace(" ", ""),
            m["co2"],
            m["cv"],
            m["valor"],
        ]

        writer.writerow(dict(zip(fieldnames, row)))

    unmatched.append(text[cur_pos : ])

unmatched = "\n".join(unmatched)
unmatched = re.sub(r"\n{3,}", r"\n\n", unmatched)

with open(unmatched_path, "w", encoding="UTF-8") as unmatched_file:
    unmatched_file.write(unmatched)

The code provided is fantastic. But some lines are not captured from my PDF, and I absolutely need to get all the data from page 4 to 605. Capturing the missed lines manually is a bit difficult and, above all, a time-consuming process. Indeed, the structure of the resulting .txt file is not uniform: some values are written as, say, “1,500”, others as “1 50 0”, and still others as “150 0”, which makes the process difficult. Some columns are empty, etc.
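I did wonder whether a small clean-up pass could at least collapse the stray spaces inside the numbers before matching. A sketch (my own guess, not tested on the full file):

import re

# Remove a single space sandwiched between two digits, so that
# "1 50 0" and "150 0" both become "1500", while "1,500" keeps its comma.
def normalize_number(value: str) -> str:
    return re.sub(r"(?<=\d) (?=\d)", "", value)

print(normalize_number("1 50 0"))  # -> 1500
print(normalize_number("150 0"))   # -> 1500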

As a result, some people on Stack Overflow suggested I use the proposal in post #1.
The thing is, I only started with Python a week ago, so my knowledge of it is very limited.

I hope someone can help me with this. Again, I apologize for my poor explanations.

Many thanks in advance.

Given that the data for the PDF build would have had a “source”, would it not be far less trouble to contact the provider and ask for the source data, rather than trying to reverse-engineer the PDF file, which (as you’ve learned) is far from easy and less than reliable?

Hi @rob42:

I managed to get the source data. However, it is provided in XML format.

I don’t know if we can extract something from that; I have never worked with .xml data before.

The source code in .xml format is available here.
Unfortunately, no CSV or TXT files are available.

Thank you again for your help (and time) Rob! Really appreciated!

Yes, it will be much easier to get data out of an XML file. PDF is not designed for storing information that another computer program can use; it’s designed to look nice when printed.

I do believe that we’ll have a better chance with that. I’ve not done much with XML DOMs, but I do know that there’s a library for that, so I’ll have a look into it. In the meantime, I’d be a little surprised if there isn’t someone here who can do this in very short order.

I really don’t know how to thank you!
It’s really very nice of you!

Thank you so much for your time! I apologize; lately I’ve been a bit of a pain with these various PDFs. In any case, your help is really appreciated. This python.org platform is great!

From what I can see, it seems that the XML source is pointing to .png files, which is not what we’re after here; we need the ‘raw data’.

Like I say, I’ve not done so much with this side of Python & XML DOMs, so I could be wrong here.

I’ll step back and see what other forum members can make of this.

Hi everyone,

I found a document that explains how to download the XML file from their PDFs here.
Unfortunately, I don’t understand XML very well, but I imagine the file is standard XML.
It seems to me that data can be retrieved from the previous hyperlink.

They also provide an example of how to download some of the BOE and related PDF data using PHP. Hopefully, the data can be found here.

Thanks again for your help!

Hi everyone,

After discussing this with some people, they suggested the following:

  • Extract the XML from the PDFs and later convert it into .csv files.

Apparently, all the XML files start with:

<</Type/Metadata/Subtype/XML/Length 7168>>stream

It includes a Length entry, so we can possibly tell how many bytes to read.
Do you think pypdf or fpdf2 would do the trick?

Many thanks in advance for your help. I am stuck and don’t know how to continue. 🙁

The site already seems to give you the XML files, right? So you do not need to create those XMLs from the PDF files and there is no need to use pypdf or fpdf2. The XML files can be converted into .csv files, after which processing and merging in pandas is pretty straightforward.

A full tutorial for parsing XML can be found at: xml.etree.ElementTree — The ElementTree XML API — Python 3.12.1 documentation
Short sample code for converting XML into a .csv: Convert XML to CSV in Python - GeeksforGeeks
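In outline, that conversion can be as simple as this (the element and tag names here are made-up placeholders; inspect the real BOE XML and substitute the actual names):

#!/usr/bin/python3
# Skeleton of an XML-to-CSV conversion with ElementTree. The tag names
# ("registro", "marca", ...) are placeholders for the real BOE structure.
import csv
import xml.etree.ElementTree as ET

tree = ET.parse("source.xml")
root = tree.getroot()

with open("output.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.writer(csv_file)
    writer.writerow(["MARCA", "MODELO-TIPO", "VALOR EUROS"])
    for record in root.iter("registro"):      # placeholder tag name
        writer.writerow([
            record.findtext("marca", ""),     # placeholder child tags
            record.findtext("modelo", ""),
            record.findtext("valor", ""),
        ])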

I would like to thank you all for your help!

I will have a look at what you provided @hansgeunsmeyer, and see if I understand:

> The site already seems to give you the XML files, right? So you do not need to create those XMLs from the PDF files and there is no need to use pypdf or fpdf2. The XML files can be converted into .csv files, after which processing and merging in pandas is pretty straightforward.
> 
> A full tutorial for parsing XML can be found at: xml.etree.ElementTree — The ElementTree XML API — Python 3.12.1 documentation
> Short sample code for converting XML into a .csv: Convert XML to CSV in Python - GeeksforGeeks

Again, thank you so much.

I have just one question. They provide us with the .xml file corresponding to the PDF file, but unfortunately I am unable to retrieve the full data in XML format.

The XML they provide, and in particular the table of interest, contains only .png images of the table, i.e. pictures of the “true” table from the PDF.

  • Is it possible to extract data even from .png files?

Thank you for your help.

Yes, I mentioned this very issue in an earlier post.

It may be possible to do an OCR read, but the same could be said of the PDF file, so you’re no further forward, tbh.

What you (or we) need is the raw data; the data that was used to build the PDF & PNG files. Does the provider not have an API that we could use?

Hi @rob42,

Indeed, it seems that they provide an API, but I don’t know how to use it. They give us an entire PDF document explaining it (in Spanish, unfortunately).

And I am stuck; I don’t understand what to do in this case. Sorry about that.

You’re welcome. I’m working on a data extraction script, but at this point, I don’t know if it’s going to work; one or two problems to solve. If I get it to work, I’ll post back.

Thank you so much. I am so sorry about that. It is so difficult to extract!

Well, this is certainly a “challenge”. To give you an idea of how bad the data is, have a look at text lines 293 and 295 (this is assuming that you’re using the code that I posted to extract all the text from the PDF file).

You’ll see that the data for the next MARCA is tacked onto the end of the data from the previous MARCA, with not even a space between the two. This is far from the only data exception, and it makes the task all the more complex.

One way to deal with this is to manually grab those data lines and process them as “exceptions”, although that’s a bit of a ‘hack’ from a coding viewpoint. I’ll have a think and see if I can come up with a cleaner solution.

I’ve coded the script in a way that simply reads the data lines in blocks (i.e. process(151, 197), process(199, 209), etc., where those numbers are the text line numbers, zero-indexed).
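In outline, the shape of it is something like this (illustrative only; the real parsing is more involved and specific to this file):

# Outline of the block-reading approach; the real script does much more.
def process(lines, start, stop):
    """Parse the data lines from `start` to `stop` inclusive (zero-indexed)."""
    return [line.split() for line in lines[start:stop + 1]]

with open("prices_cars_2015_data.txt", encoding="utf-8") as f:
    lines = f.read().splitlines()

rows = process(lines, 151, 197) + process(lines, 199, 209)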

As this script is already 100+ lines long, maybe it would be better to PM you with it, rather than add hundreds of lines of code to this thread, as the code is specific to your needs rather than general-purpose code that others could use.

I find that most rows are picked up with:

(?P<marca>[A-Z]+) (?P<modelo>.*) (?P<periodo>\d{4}-(?:\d{4})?)? (?P<cc>\d*) (?P<cilind>\d{0,2}) (?P<gd>\S*) (?P<pkw>\S*) (?P<cvf>\S*) (?P<co2>\d*) (?P<cv>\d+(?:,\d)?) (?P<valor>\d+)

In most cases, MARCA is one word in capitals, but sometimes it’s two; and sometimes MARCA is one word and the first word of MODELO is also in capitals, so when there are two words in capitals you can’t tell whether MARCA is one word or two.

What you need to do is look through the tables for two-word MARCAs and then post-process to fix all such rows.
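For example (a sketch; the set of two-word brands is illustrative and would need to be compiled from the tables themselves):

# Post-processing sketch: if MARCA plus the first word of MODELO forms a
# known two-word brand, move that word from MODELO to MARCA.
TWO_WORD_MARCAS = {"ALFA ROMEO", "ASTON MARTIN", "LAND ROVER"}

def fix_marca(marca, modelo):
    first, _, rest = modelo.partition(" ")
    if f"{marca} {first}" in TWO_WORD_MARCAS:
        return f"{marca} {first}", rest
    return marca, modelo

print(fix_marca("LAND", "ROVER DEFENDER"))  # -> ('LAND ROVER', 'DEFENDER')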

There are also some instances where MARCA appears out of place in the text, so there is more to double-check.

I find it easier to output to TSV because the editor I use (EditPad) can display such files with the columns neatly aligned and with no quotes in the way, so it’s quicker to scan by eye and spot any rows that don’t look right and need to be checked. Viewing the CSV in a spreadsheet would be another way.
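With the csv module, switching to TSV is just a matter of passing delimiter="\t" to the writer, e.g.:

import csv

# Same as the CSV output, but tab-separated (fieldnames shortened here).
fieldnames = ["MARCA", "MODELO-TIPO", "VALOR EUROS"]

with open("output.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, delimiter="\t")
    writer.writeheader()
    writer.writerow({"MARCA": "SEAT", "MODELO-TIPO": "IBIZA", "VALOR EUROS": "9000"})  # dummy row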
