PDF Extraction with python wrappers

rob42 · December 13, 2023, 4:32am

Thank you.

I’ve side-stepped that issue by using the index positions of the text in the lines of data. The only re I’m using is to find the index of the “PERIODO COMERCIAL”, which is always a four number sequence:
year = re.search('[0-9][0-9][0-9][0-9]', line)
yearFound = (year.span())

All the processed data is being stored in a dictionary object, which I intend to then process into a csv file.

It’s maybe not optimal, but this means I can keep an eye on the data as it’s being processed, rather than having to chase down ‘bugs’ due to unclean data.
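For anyone following along, here is a minimal sketch of the dictionary-to-CSV step using the standard `csv` module. It assumes each record is a dict keyed by column name; the field names and the sample record are hypothetical and are not taken from the actual script.

```python
import csv
import io

# Hypothetical column names; the real script may use different keys.
FIELDNAMES = ["marca", "modelo", "periodo", "valor"]

def rows_to_csv(rows, out_file):
    """Write a list of dict records to an open text file as CSV."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDNAMES, restval="")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)

# Made-up record, only to show the shape of the data.
records = [
    {"marca": "EXAMPLE", "modelo": "MODEL 1.0", "periodo": "1990-1995", "valor": "1000"},
]
buffer = io.StringIO()
rows_to_csv(records, buffer)
csv_text = buffer.getvalue()
```

When writing to a real file rather than a `StringIO`, open it with `newline=""` so the `csv` module controls the line endings.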

To add: I can post up the script that I have, so far, if you’d like to see it and maybe collaborate, but I’m unsure if this is too long and too ‘custom to this thread’, to be acceptable content.

(where the above falls over is at text line 329, due to (1993), but I’ll stick with it, as it’s a ‘coding challenge’ and far more interesting than the likes of “Advent of code”, as good as that is. Is there a way to find the four digit year number, while at the same time, exclude the parentheses, so that the likes of (1993) is ignored?)

michaeldg_94 · December 13, 2023, 8:02am

Many thanks to All for your help and collaboration on this project! I’m really sorry about the structure of this dataset… I didn’t mean for it to take up so much of your time.

Would OCR be a better option for this major challenge?
In relation to your question, I’m afraid my skills don’t allow me to do that:

[…] Is there a way to find the four digit year number, while at the same time, exclude the parentheses, so that the likes of (1993) is ignored?)

However, I’ve worked out some code using tabula that does the job fairly well, but isn’t perfect either. Can I share it (it’s a bit long), in the hope that it won’t be hidden?

rob42 · December 13, 2023, 8:21am

It’s not a bother, tbh; it’s a coding challenge, and as a coder, well, it’s now in my head and I will have a working script. In doing so, I will also keep my coding skills sharp.

OCR could work, but it’s not a thing that I use. Also, I’m not too sure that it would be 100% accurate, which is my goal.

I’ve solved the ‘date issue’ with a custom exception routine that uses re.search('\([0-9][0-9][0-9][0-9]', line) to snag its prey.
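An alternative to a separate exception routine, sketched here as one possible approach rather than what the script actually does, is to let the regex itself skip a parenthesised year by using negative lookarounds. The function name `find_year` and the sample line are hypothetical.

```python
import re

# Four digits that are not preceded by "(" or another digit,
# and not followed by ")" or another digit.
YEAR_PATTERN = re.compile(r"(?<![(\d])\d{4}(?![)\d])")

def find_year(line):
    """Return the first match for a bare four-digit number, or None."""
    return YEAR_PATTERN.search(line)

# Made-up line containing a parenthesised year before the real one.
sample = "MODEL X (1993) 1987-1990"
match = find_year(sample)
year_text = match.group() if match else None
year_span = match.span() if match else None
```

Note that this still matches any bare four-digit number, such as an engine capacity, so it only helps where the year is the first such number on the line.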

If you have a working solution, good for you; I’ll continue to develop this, for the reasons already stated, so if you do need a solution that works, 100%, then contact me here and I’ll let you have the script.

I suspect that it’ll be ready in a day or so, but I do see a pending issue, which will kick in at page 607: ANEXO II (possibly), but maybe what I have here is robust enough to have at it.

michaeldg_94 · December 13, 2023, 8:28am

Hi @rob42,

Thank you so much for your time! Your code will be much more efficient than mine, and of course I would like to get your script, please.

I may have some good news for you! We’re not interested in Annex II (607 onwards) and everything that comes after!

We are interested in Page 4 up to page 605 (If you manage to include page 606, so much the better! But that page isn’t essential, we can just copy and paste it).

Thank you again!

MRAB · December 13, 2023, 5:13pm

“… find the index of the “PERIODO COMERCIAL”, which is always a four number sequence”

Are you sure?

Look in BOE-A-2014-13181.pdf for “AUSTIN MG 2.0 i”. There’s no PERIODO COMERCIAL.

MRAB · December 13, 2023, 6:46pm

For what it’s worth, here’s a script with a GUI.

Click “Open…”, select a PDF, let it convert the PDF to a subfolder of text pages, and then step through the pages.

The button “Extract page” will extract the highlighted lines on the page, but it’s quicker to step through multiple pages, fixing any errors you find and/or modifying the pattern, and then click “Extract to this page” to extract all the highlighted rows on the pages up to and including the current page.

#!python3
# -*- encoding: utf-8 -*-
# M R A Barnett
# December 2023
from contextlib import suppress
from os import mkdir, scandir
from os.path import dirname, isdir, isfile, join, normpath, splitext
from PyPDF2 import PdfReader
from queue import Empty, Queue
from threading import Thread
from time import sleep
from tkinter.filedialog import askopenfilename
import json
import re
import tkinter as tk
import tkinter.ttk as ttk
import tkinter.font as tkfont

LATEST_PATTERN = r'''(?P<marca>[A-Z]+) (?P<modelo>.*) (?P<periodo>\d{4}-(?:\d{4})?)? (?P<cc>\d*) (?P<cilind>\d{0,2}) (?P<gd>\S*) (?P<pkw>\S*) (?P<cvf>\S*) (?P<co2>\S*) (?P<cv>\d+(?:,\d)?) (?P<valor>\d+)'''

# Tooltip class from https://pythonexamples.org/python-tkinter-label-tooltip/
# Modified to keep the tooltip fully on-screen.
class Tooltip:
    def __init__(self, widget, text):
        self.widget = widget
        self.text = text
        self.tooltip = None
        self.widget.bind("<Enter>", self.show)
        self.widget.bind("<Leave>", self.hide)

    def show(self, event=None):
        x, y, _, _ = self.widget.bbox("insert")
        x += self.widget.winfo_rootx() + 25
        y += self.widget.winfo_rooty() + 25

        # Keep the tooltip fully on-screen.
        font = tkfont.nametofont(self.widget['font'])
        text_width = font.measure(self.text)
        screen_width = self.widget.winfo_screenwidth()
        x = min(x, screen_width - text_width - 8)

        self.tooltip = tk.Toplevel(self.widget)
        self.tooltip.wm_overrideredirect(True)
        self.tooltip.wm_geometry(f"+{x}+{y}")

        label = ttk.Label(self.tooltip, text=self.text, background="#ffffe0", relief="solid", borderwidth=1)
        label.pack()

    def hide(self, event=None):
        if self.tooltip:
            self.tooltip.destroy()
            self.tooltip = None

class ProgressDialog(tk.Toplevel):
    """Displays the progress when converting a PDF to text pages."""

    def __init__(self, parent, pdf_path):
        tk.Toplevel.__init__(self, parent)
        self.title("Converting PDF")
        self.pdf_path = pdf_path

        self.grab_set()
        self.focus()

        self.progress = tk.Label(self)
        self.progress.pack()

        self.queue = Queue()
        self.thread = Thread(target=self.worker_func, daemon=True)
        self.thread.start()
        self.on_tick()

    def on_tick(self):
        try:
            while True:
                progress = self.queue.get_nowait()

                if progress is None:
                    self.destroy()
                else:
                    self.progress["text"] = progress
        except Empty:
            self.after(250, self.on_tick)

    def worker_func(self):
        page_folder = splitext(self.pdf_path)[0]

        with suppress(FileExistsError):
            mkdir(page_folder)

        with open(self.pdf_path, "rb") as pdf_read:
            pdf = PdfReader(pdf_read)
            num_pages = len(pdf.pages)

            for page_number in range(num_pages):
                self.queue.put(f"Reading page {page_number + 1} of {num_pages}")
                text_path = join(page_folder, f"Page {page_number + 1}.txt")

                with open(text_path, mode="w", encoding="UTF-8") as text_file:
                    page = pdf.pages[page_number]
                    text_file.write(page.extract_text())

        self.queue.put(None)

class App(tk.Tk):
    def __init__(self):
        tk.Tk.__init__(self)
        self.title("Extract data")
        self.state("zoomed")

        self.grid_columnconfigure(0, weight=1)
        self.grid_columnconfigure(1, weight=0)
        self.grid_rowconfigure(0, weight=0)
        self.grid_rowconfigure(1, weight=0)
        self.grid_rowconfigure(2, weight=0)
        self.grid_rowconfigure(3, weight=1)
        self.grid_rowconfigure(4, weight=0)

        # Page number and buttons.
        frame = tk.Frame(self)
        frame.grid(row=0, column=0, columnspan=2, sticky="we")

        # Displays the page number.
        self.location = tk.Label(frame)
        self.location.pack(side="left")

        # The buttons.
        self.extract_page_button = tk.Button(frame, text="Extract page", command=self.on_extract_page)
        self.extract_page_button.pack(side="right")
        Tooltip(self.extract_page_button, "Extract rows from this page (F12)")

        self.extract_upto_page_button = tk.Button(frame, text="Extract to this page", command=self.on_extract_upto_page)
        self.extract_upto_page_button.pack(side="right")
        Tooltip(self.extract_upto_page_button, "Extract rows from pages up to and including this page (Shift+F12)")

        # Wire the button to the same handler as Shift+PageDown.
        next_page_button = tk.Button(frame, text="Next page", command=self.on_shift_page_down)
        next_page_button.pack(side="right")
        Tooltip(next_page_button, "Next page (Shift+PageDown)")

        # Wire the button to the same handler as Shift+PageUp.
        prev_page_button = tk.Button(frame, text="Previous page", command=self.on_shift_page_up)
        prev_page_button.pack(side="right")
        Tooltip(prev_page_button, "Previous page (Shift+PageUp)")

        open_button = tk.Button(frame, text="Open...", command=self.on_open)
        open_button.pack(side="right")
        Tooltip(open_button, "Open PDF (or stored pages)")

        # The pattern.
        frame = tk.Frame(self)
        frame.grid(row=1, column=0, columnspan=2, sticky="we")

        tk.Label(frame, text="Pattern:").pack(side="left")

        self.pattern_var = tk.StringVar()
        self.pattern_var.trace_add("write", self.on_pattern_change)
        tk.Entry(frame, textvariable=self.pattern_var).pack(side="left", fill="x", expand=True)

        # The match, if any.
        frame = tk.Frame(self)
        frame.grid(row=2, column=0, columnspan=2, sticky="we")
        tk.Label(frame, text="Match:").pack(side="left")

        self.match_var = tk.StringVar()
        self.match_entry = tk.Entry(frame, state="readonly", textvariable=self.match_var)
        self.match_entry.pack(side="left", fill="x", expand=True)

        # The text box that contains the page.
        yscrollbar = tk.Scrollbar(self, orient="vertical")
        yscrollbar.grid(row=3, column=1, sticky="ns")

        xscrollbar = tk.Scrollbar(self, orient="horizontal")
        xscrollbar.grid(row=4, column=0, sticky="we")

        self.textbox = tk.Text(self, undo=True, maxundo=-1, yscrollcommand=yscrollbar.set, xscrollcommand=xscrollbar.set)
        self.textbox.grid(row=3, column=0, sticky="nswe")
        self.textbox.tag_configure("highlight", background="yellow")
        self.textbox.bind("<<Modified>>", self.text_changed)
        self.textbox.focus()

        yscrollbar.config(command=self.textbox.yview)
        xscrollbar.config(command=self.textbox.xview)

        self.bind("<Shift-Prior>", self.on_shift_page_up)
        self.bind("<Shift-Next>", self.on_shift_page_down)
        self.bind("<F12>", self.on_extract_page)
        self.bind("<Shift-F12>", self.on_extract_upto_page)

        self.config_path = splitext(__file__)[0] + ".json"
        self.load_config()

        self.pdf_path = None
        self.tsv_path = None

        # Initialise the pattern.
        self.pattern_var.set(LATEST_PATTERN)

        self.protocol("WM_DELETE_WINDOW", self.on_quit)

        self.cursor_pos = (0, 0)
        self.matches = []
        self.on_tick()

    def on_open(self):
        path = askopenfilename(initialdir=dirname(__file__), filetypes=[("PDF file", "*.pdf")])

        if not path:
            return

        path = normpath(path)
        pages_folder = splitext(path)[0]

        if not isdir(pages_folder):
            dialog = ProgressDialog(self, path)
            dialog.wait_window()

        self.pdf_path = path
        self.tsv_path = splitext(path)[0] + ".tsv"
        self.load_pages(pages_folder)

        self.cur_page = 0
        self.location["text"] = f"Page {self.cur_page + 1} of {len(self.pages)}"

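        # Clear any previously opened document before showing the first page.
        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", self.pages[self.cur_page])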
        self.textbox.mark_set("insert", "1.0")

    def load_pages(self, pages_folder):
        self.pages = []
        pages = []

        for entry in scandir(pages_folder):
            if not entry.name.startswith("Page "):
                continue

            page_number = int(splitext(entry.name)[0].partition(" ")[2])

            with open(entry.path, encoding="utf-8") as file:
                page = file.read()

            pages.append((page_number, page))

        self.pages = [page for page_number, page in sorted(pages)]

    def save_pages(self):
        pages_folder = splitext(self.pdf_path)[0]

        for page_number, page in enumerate(self.pages, start=1):
            page_path = join(pages_folder, f"Page {page_number}.txt")

            with open(page_path, "w", encoding="utf-8") as file:
                file.write(page)

    def on_shift_page_up(self, event=None):
        self.go_to_page(self.cur_page - 1)
        return "break"

    def on_shift_page_down(self, event=None):
        self.go_to_page(self.cur_page + 1)
        return "break"

    def go_to_page(self, new_page):
        if not 0 <= new_page < len(self.pages):
            return

        self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

        self.cur_page = new_page
        self.location["text"] = f"Page {self.cur_page + 1} of {len(self.pages)}"

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", self.pages[self.cur_page])
        self.textbox.mark_set("insert", "1.0")

    def text_changed(self, event=None):
        self.textbox.edit_modified()
        self.highlight_text()
        self.textbox.edit_modified(False)

    def highlight_text(self):
        can_extract = False

        cur_pos = 0
        line_start = 0
        cur_line = 1
        self.matches = []

        if not self.pattern_var.get().strip():
            return

        self.textbox.tag_remove("highlight", "1.0", "end")
        page = self.textbox.get("1.0", "end-1c")

        try:
            pattern = re.compile(self.pattern_var.get())
        except re.error:
            self.show_match()
            return

        for m in pattern.finditer(page):
            line_start = max(page.rfind("\n", cur_pos, m.start()) + 1, line_start)
            extra_newlines = page.count("\n", cur_pos, line_start)
            cur_line += page.count("\n", cur_pos, line_start)
            from_line_col = "%d.%d" % (cur_line, m.start() - line_start)
            cur_pos = line_start

            line_start = max(page.rfind("\n", cur_pos, m.end()) + 1, line_start)
            extra_newlines = page.count("\n", cur_pos, line_start)
            cur_line += page.count("\n", cur_pos, line_start)
            to_line_col = "%d.%d" % (cur_line, m.end() - line_start)
            cur_pos = line_start

            self.matches.append((tuple(map(int, from_line_col.split("."))), tuple(map(int, to_line_col.split("."))), m))
            self.textbox.tag_add("highlight", from_line_col, to_line_col)

            if m.groupdict():
                can_extract = True

        self.show_match()

    def on_extract_page(self, event=None):
        page = self.textbox.get("1.0", "end-1c")
        page = self.extract_from_page(page)

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", page)
        self.textbox.mark_set("insert", "1.0")

        self.cursor_pos = (0, 0)
        self.show_match()
        return "break"

    def on_extract_upto_page(self, event=None):
        self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

        for page_number in range(self.cur_page + 1):
            self.pages[page_number] = self.extract_from_page(self.pages[page_number])

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", self.pages[self.cur_page])
        self.textbox.mark_set("insert", "1.0")

        self.cursor_pos = (0, 0)
        self.show_match()
        return "break"

    def extract_from_page(self, page):
        try:
            pattern = re.compile(self.pattern_var.get())
        except re.error:
            # Invalid pattern: leave the page unchanged instead of returning None.
            return page

        extracted = []
        remainder = []
        cur_pos = 0

        for m in pattern.finditer(page):
            remainder.append(page[cur_pos : m.start()])
            extracted.append(m.groups(default=""))
            cur_pos = m.end()

        remainder.append(page[cur_pos : ])

        with open(self.tsv_path, "a", newline="", encoding="utf-8") as tsv_file:
            for row in extracted:
                print("\t".join(row), file=tsv_file)

        page = "\n".join(remainder)
        page = re.sub(r"(?m)^ +$", "", page)
        page = re.sub(r"\n{2,}", r"\n\n", page)

        return page

    def show_match(self):
        first = 0
        last = len(self.matches)
        display = ""

        while first < last:
            mid = (first + last) // 2

            match = self.matches[mid]

            if match[0] <= self.cursor_pos <= match[1]:
                display = str(match[2].groupdict())
                break

            if self.cursor_pos < match[0]:
                last = mid
            else:
                first = mid + 1

        self.match_var.set(display)

    def on_pattern_change(self, *args):
        self.highlight_text()

    def on_tick(self):
        cur_pos = tuple(map(int, self.textbox.index("insert").split(".")))

        if cur_pos != self.cursor_pos:
            self.cursor_pos = cur_pos
            self.show_match()

        self.after(250, self.on_tick)

    def on_quit(self):
        if self.pdf_path:
            self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

            self.save_pages()

        if self.tsv_path and isfile(self.tsv_path):
            self.tidy_rows(self.tsv_path)

        try:
            self.save_config()
        finally:
            self.destroy()

    def load_config(self):
        try:
            with open(self.config_path, encoding="utf-8") as file:
                config = json.load(file)
        except FileNotFoundError:
            config = {}

        self.pattern_var.set(config.get("pattern", ""))

    def save_config(self):
        config = {"pattern": self.pattern_var.get()}

        with open(self.config_path, "w", encoding="utf-8") as file:
            json.dump(config, file)

    def tidy_rows(self, tsv_path):
        with open(tsv_path, encoding="utf-8") as file:
            rows = file.readlines()

        rows = sorted(set(rows), key=str.casefold)

        with open(tsv_path, "w", encoding="utf-8") as file:
            file.writelines(rows)

App().mainloop()

michaeldg_94 · December 14, 2023, 7:48am

Hi @MRAB,

Thank you for your amazing code!
I get the following error when trying to run your code:

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[16], line 443
    440         with open(tsv_path, "w", encoding="utf-8") as file:
    441             file.writelines(rows)
--> 443 App().mainloop()

Cell In[16], line 188, in App.__init__(self)
    185 self.bind("<F12>", self.on_extract_page)
    186 self.bind("<Shift-F12>", self.on_extract_upto_page)
--> 188 self.config_path = splitext(__file__)[0] + ".json"
    189 self.load_config()
    191 self.pdf_path = None

NameError: name '__file__' is not defined

Could you give me any insight about that, please?
Thank you in advance for your help!

kknechtel · December 14, 2023, 10:23am

The code shown in the stack trace, is an idiom for finding a file inside the same folder as the source code. (You cannot use a relative path for this, because it will be relative to the current working directory of the program). See for example:

In order to use this, therefore, the code must be in a file - it can’t be tested from the REPL.
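If the code does have to run somewhere `__file__` is not defined, such as a notebook cell, one possible workaround is to fall back to the current working directory. This is only a sketch: the helper name `script_folder` and the file name `extract_data.json` are hypothetical, and the script above builds its config path with `splitext(__file__)[0] + ".json"` instead.

```python
from pathlib import Path

def script_folder():
    """Return the folder of the running script, or the CWD if __file__ is undefined."""
    try:
        return Path(__file__).resolve().parent
    except NameError:
        # There is no __file__ in a notebook cell or an interactive session.
        return Path.cwd()

# Hypothetical config file stored next to the script.
config_path = script_folder() / "extract_data.json"
```

The fallback changes where the config file ends up, so saving the script to a file and running it from there remains the simpler fix.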

michaeldg_94 · December 14, 2023, 2:52pm

“The code shown in the stack trace, is an idiom for finding a file inside the same folder as the source code. (You cannot use a relative path for this, because it will be relative to the current working directory of the program).”

Thank you very much for your suggestion @kknechtel. Everything is fine now! Thanks!

@MRAB: Thank you for your code!
I succeeded in opening the GUI. But I have a small problem:

When I click on a button, nothing happens… Except for the “Open” button.
How can I extract and merge all pages (well, only those in “table” format) saved in .txt format in the subfolder, please? Without taking into account the repetitive things that appear (e.g. MARCA MODELO -TIPO COMERCIAL C.C. cilind. G/D P kW cvf gr/km cv EurosDATOS FICHA TÉCNICA or BOLETÍN OFICIAL DEL ESTADO, etc.).

Do you have any suggestions to resolve that please?
Thank you very much, and I’m sorry about that.
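A rough sketch of how the saved pages could be merged outside the GUI, assuming the files are named "Page N.txt" as the script writes them. The marker strings used to drop repeated header lines are assumptions based on the examples above, and the function names are hypothetical.

```python
import re
import tempfile
from pathlib import Path

# Assumed markers of repeated header/footer lines; extend as needed.
SKIP_MARKERS = ("BOLETÍN OFICIAL DEL ESTADO", "MARCA MODELO")

def page_number(path):
    """Extract N from a file named 'Page N.txt'."""
    return int(re.search(r"\d+", path.stem).group())

def merge_pages(folder, first=1, last=None):
    """Join the text of pages first..last, dropping blank and marker lines."""
    paths = sorted(Path(folder).glob("Page *.txt"), key=page_number)
    kept = []
    for path in paths:
        number = page_number(path)
        if number < first or (last is not None and number > last):
            continue
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.strip() and not any(marker in line for marker in SKIP_MARKERS):
                kept.append(line)
    return "\n".join(kept)

# Small self-check with made-up pages in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "Page 2.txt").write_text("ROW B\n", encoding="utf-8")
    Path(tmp, "Page 10.txt").write_text("BOLETÍN OFICIAL DEL ESTADO\nROW C\n", encoding="utf-8")
    Path(tmp, "Page 1.txt").write_text("MARCA MODELO -TIPO\nROW A\n", encoding="utf-8")
    merged = merge_pages(tmp)
```

For the range mentioned earlier in the thread, the call would be something like `merge_pages(folder, first=4, last=605)`. Sorting by the numeric page number matters, because a plain alphabetical sort would put "Page 10" before "Page 2".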

I think the code is repeating the same problems we’ve had before, which isn’t your fault of course! They’re due to the bad data I’ve got, as well as the fact that we don’t have a proper data set to work with… For example, the spacing between the columns is not consistent for page 53 of the .txt extracted.

Nevertheless, I’d like to thank you all for the fantastic code. I’ve learned a lot from you, and I’m very grateful. You are very competent and amazing. Thanks for sharing your amazing knowledge with me.

MRAB · December 14, 2023, 8:28pm

A new version.

Click on “Open PDF”, select the PDF, click OK. It shows only one page at a time, which forces you to review the rows.

Click “Next page” or press Shift+PageDown for the next page.

Click “Extract page” or press F12 to extract the rows from the page.

Click “Extract to this page” or press Shift+F12 to extract the rows from the pages up to and including the current page.

As I said before, it’s quicker to step through multiple pages, fixing any errors you find and/or modifying the pattern, and then click “Extract to this page”.

The rows are written to a CSV file.

Once I got into my stride, I managed to extract the rows of the 200 pages (>14_000 rows) in a little over 10 minutes.

#!python3
# -*- encoding: utf-8 -*-
# M R A Barnett
# December 2023
from contextlib import suppress
from csv import DictWriter
from os import mkdir, scandir
from os.path import dirname, getsize, isdir, isfile, join, normpath, splitext
from PyPDF2 import PdfReader
from queue import Empty, Queue
from threading import Thread
from time import sleep
from tkinter.filedialog import askopenfilename
import json
import re
import tkinter as tk
import tkinter.font as tkfont
import tkinter.ttk as ttk

INITIAL_PATTERN = r"""(?P<marca>[A-Z]+(?:\+\S+)?) (?P<modelo>.*) (?P<periodo>\d{4}-(?:\d{4})?)? (?P<cc>\d*) (?P<cilind>\d{0,2}) (?P<gd>\S*) (?P<pkw>\S*) (?P<cvf>\S*) (?P<co2>\S*) (?P<cv>\d+(?:,\d)?) (?P<valor>\d+)"""

# Tooltip class from https://pythonexamples.org/python-tkinter-label-tooltip/
# Modified to keep the tooltip fully on-screen.
class Tooltip:
    def __init__(self, widget, text):
        self.widget = widget
        self.text = text
        self.tooltip = None
        self.widget.bind("<Enter>", self.show)
        self.widget.bind("<Leave>", self.hide)

    def show(self, event=None):
        x, y, _, _ = self.widget.bbox("insert")
        x += self.widget.winfo_rootx() + 25
        y += self.widget.winfo_rooty() + 25

        # Keep the tooltip fully on-screen.
        font = tkfont.nametofont(self.widget["font"])
        text_width = font.measure(self.text)
        screen_width = self.widget.winfo_screenwidth()
        x = min(x, screen_width - text_width - 8)

        self.tooltip = tk.Toplevel(self.widget)
        self.tooltip.wm_overrideredirect(True)
        self.tooltip.wm_geometry(f"+{x}+{y}")

        label = ttk.Label(self.tooltip, text=self.text, background="#ffffe0", relief="solid", borderwidth=1)
        label.pack()

    def hide(self, event=None):
        if self.tooltip:
            self.tooltip.destroy()
            self.tooltip = None

class App(tk.Tk):
    def __init__(self):
        tk.Tk.__init__(self)
        self.title("Extract data")
        self.state("zoomed")

        self.grid_columnconfigure(0, weight=1)
        self.grid_columnconfigure(1, weight=0)
        self.grid_rowconfigure(0, weight=0)
        self.grid_rowconfigure(1, weight=0)
        self.grid_rowconfigure(2, weight=0)
        self.grid_rowconfigure(3, weight=1)
        self.grid_rowconfigure(4, weight=0)

        # Page number and buttons.
        frame = tk.Frame(self)
        frame.grid(row=0, column=0, columnspan=2, sticky="we")

        # Displays the page number.
        self.location = tk.Label(frame)
        self.location.pack(side="left")

        # The buttons.
        self.extract_page_button = tk.Button(frame, text="Extract page", command=self.on_extract_page)
        self.extract_page_button.pack(side="right")
        Tooltip(self.extract_page_button, "Extract rows from this page (F12)")

        self.extract_upto_page_button = tk.Button(frame, text="Extract to this page", command=self.on_extract_upto_page)
        self.extract_upto_page_button.pack(side="right")
        Tooltip(self.extract_upto_page_button, "Extract rows from pages up to and including this page (Shift+F12)")

        # Wire the button to the same handler as Shift+PageDown.
        next_page_button = tk.Button(frame, text="Next page", command=self.on_shift_page_down)
        next_page_button.pack(side="right")
        Tooltip(next_page_button, "Next page (Shift+PageDown)")

        # Wire the button to the same handler as Shift+PageUp.
        prev_page_button = tk.Button(frame, text="Previous page", command=self.on_shift_page_up)
        prev_page_button.pack(side="right")
        Tooltip(prev_page_button, "Previous page (Shift+PageUp)")

        next_match_button = tk.Button(frame, text="Next match", command=self.on_next_match)
        next_match_button.pack(side="right")
        Tooltip(next_match_button, "Find page with next match (F3)")

        open_button = tk.Button(frame, text="Open PDF...", command=self.on_open_pdf)
        open_button.pack(side="right")
        Tooltip(open_button, "Open PDF")

        # The pattern.
        frame = tk.Frame(self)
        frame.grid(row=1, column=0, columnspan=2, sticky="we")

        tk.Label(frame, text="Pattern:").pack(side="left")

        self.pattern_var = tk.StringVar()
        self.pattern_var.trace_add("write", self.on_pattern_change)
        tk.Entry(frame, textvariable=self.pattern_var).pack(side="left", fill="x", expand=True)

        # The match, if any.
        frame = tk.Frame(self)
        frame.grid(row=2, column=0, columnspan=2, sticky="we")
        tk.Label(frame, text="Match:").pack(side="left")

        self.match_var = tk.StringVar()
        self.match_entry = tk.Entry(frame, state="readonly", textvariable=self.match_var)
        self.match_entry.pack(side="left", fill="x", expand=True)

        # The text box that contains the page.
        yscrollbar = tk.Scrollbar(self, orient="vertical")
        yscrollbar.grid(row=3, column=1, sticky="ns")

        xscrollbar = tk.Scrollbar(self, orient="horizontal")
        xscrollbar.grid(row=4, column=0, sticky="we")

        self.textbox = tk.Text(self, undo=True, maxundo=-1, yscrollcommand=yscrollbar.set, xscrollcommand=xscrollbar.set)
        self.textbox.grid(row=3, column=0, sticky="nswe")
        self.textbox.tag_configure("highlight", background="yellow")
        self.textbox.bind("<<Modified>>", self.text_changed)
        self.textbox.focus()

        yscrollbar.config(command=self.textbox.yview)
        xscrollbar.config(command=self.textbox.xview)

        self.bind("<F3>", self.on_next_match)
        self.bind("<Shift-Prior>", self.on_shift_page_up)
        self.bind("<Shift-Next>", self.on_shift_page_down)
        self.bind("<F12>", self.on_extract_page)
        self.bind("<Shift-F12>", self.on_extract_upto_page)

        self.config_path = splitext(__file__)[0] + ".json"
        self.load_config()

        self.pdf_path = None
        self.csv_path = None

        # Initialise the pattern.
        self.pattern_var.set(INITIAL_PATTERN)

        self.protocol("WM_DELETE_WINDOW", self.on_quit)

        self.pages = []
        self.pages_queue = Queue()
        self.cur_page = 0
        self.cursor_pos = (0, 0)
        self.matches = []
        self.on_tick()

    def on_open_pdf(self):
        path = askopenfilename(initialdir=dirname(__file__), filetypes=[("PDF file", "*.pdf")])

        if not path:
            return

        self.pdf_path = normpath(path)
        self.csv_path = splitext(path)[0] + ".csv"
        page_folder = splitext(self.pdf_path)[0]

        with suppress(FileExistsError):
            mkdir(page_folder)

        self.pages = []
        self.load_pages(page_folder)

        with open(self.pdf_path, "rb") as pdf_read:
            pdf = PdfReader(pdf_read)
            num_pages = len(pdf.pages)

        # Reset the page index before it is used to index self.pages.
        self.cur_page = 0

        if len(self.pages) < num_pages:
            self.thread = Thread(target=self.reader_func, args=(len(self.pages), num_pages), daemon=True)
            self.thread.start()
        else:
            self.textbox.delete("1.0", "end")
            self.textbox.insert("1.0", self.pages[self.cur_page])
            self.textbox.mark_set("insert", "1.0")

        self.location["text"] = f"Page {self.cur_page + 1} of {len(self.pages)}"

    def load_pages(self, page_folder):
        self.pages = []
        pages = []

        for entry in scandir(page_folder):
            if not entry.name.startswith("Page "):
                continue

            page_number = int(splitext(entry.name)[0].partition(" ")[2])

            with open(entry.path, encoding="utf-8") as file:
                page = file.read()

            pages.append((page_number, page))

        self.pages = [page for page_number, page in sorted(pages)]

    def save_pages(self):
        page_folder = splitext(self.pdf_path)[0]

        for page_number, page in enumerate(self.pages, start=1):
            page_path = join(page_folder, f"Page {page_number}.txt")

            with open(page_path, "w", encoding="utf-8") as file:
                file.write(page)

    def on_shift_page_up(self, event=None):
        self.go_to_page(self.cur_page - 1)
        return "break"

    def on_shift_page_down(self, event=None):
        self.go_to_page(self.cur_page + 1)
        return "break"

    def go_to_page(self, new_page):
        if not 0 <= new_page < len(self.pages):
            return

        self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

        self.cur_page = new_page
        self.location["text"] = f"Page {self.cur_page + 1} of {len(self.pages)}"

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", self.pages[self.cur_page])
        self.textbox.mark_set("insert", "1.0")

    def text_changed(self, event=None):
        self.textbox.edit_modified()
        self.highlight_text()
        self.textbox.edit_modified(False)

    def highlight_text(self):
        can_extract = False

        cur_pos = 0
        line_start = 0
        cur_line = 1
        self.matches = []

        if not self.pattern_var.get().strip():
            return

        self.textbox.tag_remove("highlight", "1.0", "end")
        page = self.textbox.get("1.0", "end-1c")

        try:
            pattern = re.compile(self.pattern_var.get())
        except re.error:
            self.show_match()
            return

        for m in pattern.finditer(page):
            line_start = max(page.rfind("\n", cur_pos, m.start()) + 1, line_start)
            extra_newlines = page.count("\n", cur_pos, line_start)
            cur_line += page.count("\n", cur_pos, line_start)
            from_line_col = "%d.%d" % (cur_line, m.start() - line_start)
            cur_pos = line_start

            line_start = max(page.rfind("\n", cur_pos, m.end()) + 1, line_start)
            extra_newlines = page.count("\n", cur_pos, line_start)
            cur_line += page.count("\n", cur_pos, line_start)
            to_line_col = "%d.%d" % (cur_line, m.end() - line_start)
            cur_pos = line_start

            self.matches.append((tuple(map(int, from_line_col.split("."))), tuple(map(int, to_line_col.split("."))), m))
            self.textbox.tag_add("highlight", from_line_col, to_line_col)

            if m.groupdict():
                can_extract = True

        self.show_match()

    def on_extract_page(self, event=None):
        page = self.textbox.get("1.0", "end-1c")
        page = self.extract_from_page(page)
        if page is None:
            return "break"

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", page)
        self.textbox.mark_set("insert", "1.0")

        self.cursor_pos = (0, 0)
        self.show_match()
        return "break"

    def on_extract_upto_page(self, event=None):
        self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

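        for page_number in range(self.cur_page + 1):
            extracted = self.extract_from_page(self.pages[page_number])
            if extracted is None:  # invalid pattern: leave the pages untouched
                return "break"
            self.pages[page_number] = extracted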

        self.textbox.delete("1.0", "end")
        self.textbox.insert("1.0", self.pages[self.cur_page])
        self.textbox.mark_set("insert", "1.0")

        self.cursor_pos = (0, 0)
        self.show_match()
        return "break"

    def extract_from_page(self, page):
        try:
            pattern = re.compile(self.pattern_var.get())
        except re.error:
            return

        extracted = []
        remainder = []
        cur_pos = 0
        fieldnames = None

        for m in pattern.finditer(page):
            if fieldnames is None:
                fieldnames = tuple(m.groupdict().keys())

            remainder.append(page[cur_pos : m.start()])
            extracted.append(m.groupdict(default=""))
            cur_pos = m.end()

        remainder.append(page[cur_pos : ])

        if not extracted:
            return page

        csv_mode = "a" if isfile(self.csv_path) and getsize(self.csv_path) > 0 else "w"

        with open(self.csv_path, csv_mode, newline="", encoding="UTF-8") as csv_file:
            writer = DictWriter(csv_file, fieldnames=fieldnames)

            if csv_mode == "w":
                writer.writeheader()

            for row in extracted:
                writer.writerow(row)

        page = "\n".join(remainder)
        page = re.sub(r"(?m)^ +$", "", page)
        page = re.sub(r"\n{2,}", r"\n\n", page)

        return page

    def show_match(self):
        first = 0
        last = len(self.matches)
        display = ""

        while first < last:
            mid = (first + last) // 2

            match = self.matches[mid]

            if match[0] <= self.cursor_pos <= match[1]:
                display = str(match[2].groupdict())
                break

            if self.cursor_pos < match[0]:
                last = mid
            else:
                first = mid + 1

        self.match_var.set(display)

    def on_pattern_change(self, *args):
        self.highlight_text()

    def on_tick(self):
        cur_pos = tuple(map(int, self.textbox.index("insert").split(".")))

        if cur_pos != self.cursor_pos:
            self.cursor_pos = cur_pos
            self.show_match()

        added_page = False
        old_num_pages = len(self.pages)

        try:
            while True:
                page = self.pages_queue.get_nowait()
                self.pages.append(page)
                added_page = True
        except Empty:
            pass

        if added_page:
            self.location["text"] = f"Page {self.cur_page + 1} of {len(self.pages)}"

        if old_num_pages == 0 and self.pages:
            self.textbox.delete("1.0", "end")
            self.textbox.insert("1.0", self.pages[self.cur_page])
            self.textbox.mark_set("insert", "1.0")

        self.after(250, self.on_tick)

    def on_quit(self):
        if self.pdf_path:
            self.pages[self.cur_page] = self.textbox.get("1.0", "end-1c")

            self.save_pages()

        if self.csv_path and isfile(self.csv_path):
            self.tidy_rows(self.csv_path)

        try:
            self.save_config()
        finally:
            self.destroy()

    def load_config(self):
        try:
            with open(self.config_path, encoding="utf-8") as file:
                config = json.load(file)
        except FileNotFoundError:
            config = {}

        self.pattern_var.set(config.get("pattern", ""))

    def save_config(self):
        config = {"pattern": self.pattern_var.get()}

        with open(self.config_path, "w", encoding="utf-8") as file:
            json.dump(config, file)

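    def tidy_rows(self, csv_path):
        # Works line by line, so it assumes no field contains a newline.
        with open(csv_path, encoding="utf-8") as file:
            lines = file.readlines()

        if not lines:
            return

        # Keep the header row first; sort and de-duplicate only the data rows.
        header, *rows = lines
        rows = sorted(set(rows), key=str.casefold)

        with open(csv_path, "w", encoding="utf-8") as file:
            file.write(header)
            file.writelines(rows)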

    def reader_func(self, from_page, to_page):
        page_folder = splitext(self.pdf_path)[0]

        with open(self.pdf_path, "rb") as pdf_read:
            pdf = PdfReader(pdf_read)

            for page_number in range(from_page, to_page):
                text_path = join(page_folder, f"Page {page_number + 1}.txt")

                with open(text_path, mode="w", encoding="UTF-8") as text_file:
                    page = pdf.pages[page_number].extract_text()
                    text_file.write(page)

                self.pages_queue.put(page)

    def on_next_match(self, event=None):
        try:
            pattern = re.compile(self.pattern_var.get())
        except re.error:
            return "break"

        cur_page = self.cur_page + 1

        while cur_page < len(self.pages) and not pattern.search(self.pages[cur_page]):
            cur_page += 1

        self.go_to_page(min(cur_page, len(self.pages)))
        return "break"

App().mainloop()
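
The line and column arithmetic in highlight_text turns a regex match offset into a Tk text index of the form "line.column". As a rough, standalone sketch of the same idea (this is not MRAB's code, and the helper names line_starts and offset_to_index are made up for illustration), the offset at which each line begins can be computed once and then searched with bisect:

```python
from bisect import bisect_right


def line_starts(text):
    """Return the offset at which each line of `text` begins."""
    starts = [0]
    for pos, char in enumerate(text):
        if char == "\n":
            starts.append(pos + 1)
    return starts


def offset_to_index(starts, offset):
    """Convert a character offset into a Tk-style "line.column" index."""
    line = bisect_right(starts, offset)  # Tk lines are 1-based
    column = offset - starts[line - 1]   # Tk columns are 0-based
    return f"{line}.{column}"


page = "abc\ndefg\nhi"
starts = line_starts(page)
print(offset_to_index(starts, 5))  # offset of the "e" in "defg"
```

The trade-off is one extra pass over the page to build the list of line starts, in exchange for not having to carry cur_pos, line_start and cur_line from one match to the next.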

michaeldg_94 · December 15, 2023, 8:14am

Hi @MRAB,
Nice! Thank you so much for this new version. I confirm that it works very well, thanks!

Amazing job! Perhaps one more question:

How can I include the option of one or more occurrences of spaces after the following variable, please:

(?P<gd>\S*)

Because from page 514 onwards, it becomes complicated and time-consuming to “manually” change each line. For example, look at this line, please:

YYYY Ranger 2.5TDCi Cb. Senc.4x2 2010- 2499 4 D 105    15,23 244 143 16300

Sorry, I changed the real name, as it is a brand, otherwise this post will be hidden.
Thank you again for your beautiful work, and help! It is really appreciated!
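
For readers with the same question: in a regular expression, a literal space matches exactly one space, while a space followed by + matches one or more. The sketch below is only an illustration. It uses a trimmed pattern covering the last six columns (not the full pattern from the GUI), and it assumes that cvf always contains a decimal comma and that the other numeric columns are plain digits.

```python
import re

# Trimmed, illustrative pattern for the last six columns only.
# " +" accepts one or more spaces where a single " " accepts exactly one.
TAIL = re.compile(
    r"(?P<gd>\S+) +(?P<pkw>\d+) +(?P<cvf>\d+,\d+) +"
    r"(?P<co2>\d+) +(?P<cv>\d+) +(?P<valor>\d+)$"
)

line = "YYYY Ranger 2.5TDCi Cb. Senc.4x2 2010- 2499 4 D 105    15,23 244 143 16300"
match = TAIL.search(line)
fields = match.groupdict() if match else {}
print(fields)
```

In the full pattern, the equivalent change would be to write " +" in place of the single space that follows (?P<gd>\S*).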

michaeldg_94 · December 15, 2023, 4:07pm

Hi everyone again,

I would like to thank you all.

You’ve been a great help and without you this wouldn’t have been possible! Thank you all for taking the time to help me. I’m very pleased and eternally grateful to you.

Especially in this discussion: @MRAB, @rob42 , @kknechtel and finally @hansgeunsmeyer.
I hope I haven’t forgotten anyone. Many thanks for your contributions, be they large or small!

MRAB · December 15, 2023, 6:47pm

In BOE-A-2014-13181.pdf, I see that row on page 418, and the pattern in the latest version does match it. Perhaps it was still using an older pattern from the previous version.

I notice that it doesn’t match a previous “Maverick” row because there’s a range without a hyphen, but this pattern does match it:

(?P<marca>[A-Z]+(?:\+\S+)?) (?P<modelo>.*) (?P<periodo>\d{4}-(?:\d{4})?)? (?P<cc>\d*) (?P<cilind>\d{0,2}) (?P<gd>\S*) (?P<pkw>\S*|\d+ \d+) (?P<cvf>\S*) (?P<co2>\S*) (?P<cv>\d+(?:,\d)?) (?P<valor>\d+)
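
To see what the named groups capture, here is a minimal sketch that runs this pattern over a single row. The pattern is copied from the line above and only split across string literals for readability. The row is the sample quoted earlier in the thread, with its run of spaces reduced to one so that it fits the single-space separators.

```python
import re

# The pattern quoted above, split across lines only for readability.
ROW = re.compile(
    r"(?P<marca>[A-Z]+(?:\+\S+)?) (?P<modelo>.*) "
    r"(?P<periodo>\d{4}-(?:\d{4})?)? (?P<cc>\d*) (?P<cilind>\d{0,2}) "
    r"(?P<gd>\S*) (?P<pkw>\S*|\d+ \d+) (?P<cvf>\S*) (?P<co2>\S*) "
    r"(?P<cv>\d+(?:,\d)?) (?P<valor>\d+)"
)

row = "YYYY Ranger 2.5TDCi Cb. Senc.4x2 2010- 2499 4 D 105 15,23 244 143 16300"

# groupdict(default="") turns groups that did not take part into empty strings.
records = [m.groupdict(default="") for m in ROW.finditer(row)]
for record in records:
    print(record)
```

Because modelo is a greedy .*, it takes as much of the row as it can while still leaving a valid periodo and the remaining columns to its right.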

michaeldg_94 · December 18, 2023, 8:05am

Hi @MRAB,

Beautiful! Everything works nicely!
Thank you so much for your help and work. Your created GUI is awesome!

Again, thank you all! I think we (you!) solved the problem of my PDF extraction. I am really grateful.

rob42 · December 18, 2023, 8:57pm

@michaeldg_94

It’s good to see that you have a working solution, as that takes the pressure off of me, which has facilitated the re-coding of my solution, the aim being: a fully automatic transition from .pdf to .tsv

I will post up the script just as soon as it’s ready, as much for others to be able to learn from, as it is for yourself. So, keep checking in if you’re interested; I’ll be done with this ASAP.

michaeldg_94 · December 19, 2023, 8:08am

Hi everyone,

@rob42 : Many thanks for your perseverance, help and time!
Of course I’m interested in your code. It’s a great way for me to learn how to use Python. I’ve already learned a lot thanks to your kindness in posting your code and your willingness to help, and time!

Many thanks. Have a nice day @rob42.
Many thanks to all!

Have a nice day everyone.

michaeldg_94 · January 12, 2024, 10:04am

Hi @rob42,

I hope you are doing well! I am very curious about the code that you are producing (or have already produced). I don’t need it immediately, so no stress, but I’m always happy to learn a bit more about Python, as I’m clearly a novice. And what could be better than learning from the best of the python.org community?

Have you been able to do the fully automatic transition from .pdf to .tsv with your code?

Thank you in advance for your help. Really appreciated!
Lovely day.

Michael

rob42 · January 13, 2024, 1:06am

Hey Michael,

Thank you for reaching out, and yes, thank you; I’m very well right now and I trust that you are likewise.

I have a WIP code base, which I’ll post here. It’s not 100% and may never be, but it’s as close as I’m willing to get it right now, because the data errors are many and varied. Some are easy to work around, which my code does to some degree, but others are simply too random to create logic for: it becomes a case of writing code just to deal with the errors, when it could be far simpler to have the code halt at one of these errors and correct the data by hand.

The issues to which I’m referring occur in the five rightmost columns, P kW…VALOR. As an example or two:

Here we see the cv value split as 16 3.
120 16,22 247 16 3 41700

Here, it’s the VALOR number that’s split.
120 16,22 247 163 36 600

With this one, it’s the cvf number that’s split.
100 1 3,31 155 136 26000

And again, here, but in a different way.
100 13,3 1 155 136 28700

These are just a few of many such errors, each of which needs a different code logic in order to catch and correct the data error. This could run into another 100+ extra code lines, all of which would need to be tested to make sure that, in catching a known data error, you don’t get a false positive and thus break what’s gone before.
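
One way to take the "halt and correct by hand" route is a check that only recognises a clean tail and flags everything else. This is a sketch under stated assumptions, not the logic used in the script below: it assumes the tokens after the fuel type are already isolated, that cvf always contains a decimal comma, and that the other values are plain integers. The first sample row is an assumed clean form of the rows above, and tail_is_clean is a made-up name.

```python
import re

INTEGER = re.compile(r"\d+")
DECIMAL = re.compile(r"\d+,\d+")


def tail_is_clean(tokens):
    """Return True if tokens look like: P kW, cvf, [CO2], cv, VALOR."""
    if len(tokens) not in (4, 5):  # CO2 is sometimes absent
        return False
    pkw, cvf, *rest = tokens
    return bool(
        INTEGER.fullmatch(pkw)
        and DECIMAL.fullmatch(cvf)
        and all(INTEGER.fullmatch(token) for token in rest)
    )


samples = [
    "120 16,22 247 163 36600",   # assumed clean form of the rows above
    "120 16,22 247 16 3 41700",  # cv split
    "120 16,22 247 163 36 600",  # VALOR split
    "100 1 3,31 155 136 26000",  # cvf split
    "100 13,3 1 155 136 28700",  # cvf split in a different way
]
results = [tail_is_clean(sample.split()) for sample in samples]
print(results)
```

A check like this cannot catch everything: a row with no CO2 value and one split number also has five tokens, so it would pass unnoticed.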

The way that I’ve coded this, is to have all eleven fields set to False at the outset, then fill in the data values, using N/A for any missing data (such as many of the CO2 data values, and some of the dates, to name two of the most common). Then, when you come to examine the tsv file, you’ll know that if a field is still set to False, it’s because the data could not be reliably interpreted, and a field that’s set to N/A means (for the most part, but there may be exceptions) that the data is not present in the data set.
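
As a small illustration of that convention (a sketch, not part of the script; new_row and unresolved_fields are made-up helper names), each row can start with every field set to False, and the fields that are still False afterwards can be listed for review:

```python
FIELDS = ["MARCA", "MODELO TIPO", "PERIODO COMERCIAL", "C.C", "No de cilind",
          "G/D", "P kW", "cvf", "CO2", "cv", "VALOR"]


def new_row():
    """Every field starts as False, meaning 'not yet interpreted'."""
    return dict.fromkeys(FIELDS, False)


def unresolved_fields(row):
    """List the fields that still hold the False sentinel."""
    # `is False` keeps real values such as "0" or "N/A" from being flagged.
    return [name for name, value in row.items() if value is False]


row = new_row()
row.update({"MARCA": "YYYY", "CO2": "N/A", "VALOR": "16300"})
pending = unresolved_fields(row)
print(pending)
```

Once a row is written out through an f-string, the sentinel appears in the file as the text False, so that is the string to search for in the tsv.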

The code:

#!/usr/bin/python3

import sys
import re
from time import sleep


def file_read(file_name):
    with open(file_name, mode="r", encoding="UTF-8") as file:
        return file.readlines()


def brand_check(brand):
    """
    return the correct brand name
    """
    if brand in MARC_EXTEND:
        brand = MARC_EXTEND[brand]
    return brand


def extract_data(data_lst, data_dic):
    def sanity_check(date_str):
        """
        check that a date range spans no more than 25 years; the limit is
        arbitrary, so this may need to be refactored
        """
        exceptions = [12065, 23106, 23110]
        sane = True
        if LINE_IND in exceptions:
            if LINE_IND == 12065:
                date_str = "1994-1998" # correction for assumed typo: 1994-1968
            elif LINE_IND == 23106:
                date_str = "2011-2015" # correction for assumed typo: 2011-2005
            elif LINE_IND == 23110:
                date_str = "2011-2015" # correction for assumed typo: 2011-2005
        if len(date_str) == 9:
            date_1, date_2 = date_str.split("-")
            sane = int(date_2) - int(date_1) in range(26)
        if not sane:
            sys.exit(f"sanity_check({date_str}) failed at {LINE_IND}")
        return bool(sane)

    patterns = {
        "dd": r"^[0-9]{4}-[0-9]{4}$", # double_date
        "sdrh": r"^[0-9]{4}-$", # single_date_right_hyphen
        "sdlh": r"^-[0-9]{4}$", # single_date_left_hyphen
        "date_bug_1": r"^[0-9]{4}-[0-9]{2}$",
        "date_bug_2": r"^[0-9]{4}-[0-9]$",
        }

    found = {}
    cc_ind = False
    for pattern in patterns:
        for date_str in data_lst:
            if re.search(patterns[pattern], date_str):
                found.setdefault(pattern, []).append(date_str)
    if "dd" in found:
        date_str = found["dd"][0]
    elif "sdrh" in found:
        date_str = found["sdrh"][0]
    elif "sdlh" in found:
        date_str = found["sdlh"][0]
    elif "date_bug_1" in found:
        date_str = found["date_bug_1"][0]
        date_ind = data_lst.index(date_str)
        date_str = date_str + data_lst[date_ind+1]
        data_lst[date_ind] = date_str
        data_lst.pop(date_ind+1)
    elif "date_bug_2" in found:
        date_str = found["date_bug_2"][0]
        date_ind = data_lst.index(date_str)
        date_str = date_str + data_lst[date_ind+1]
        data_lst[date_ind] = date_str
        data_lst.pop(date_ind+1)
    if found:
        sanity_check(date_str)
    else:
        date_str = "N/A"
        ft_found = []
        for item in data_lst:
            if item in FUEL_TYPES:
                ft_found.append(item)
        fuel_ind = data_lst.index(ft_found[-1])
        date_ind = fuel_ind - 2
        data_lst.insert(date_ind, date_str)
        fuel_ind += 1
    data_dic["PERIODO COMERCIAL"] = date_str
    date_ind = data_lst.index(date_str)
    if data_lst[date_ind + 2] in CYLINDERS and data_lst[date_ind + 3] in FUEL_TYPES:
        cc_ind = date_ind + 1
        cynd_ind = date_ind + 2
        fuel_ind = date_ind + 3
    else:
        # most likely an error in the cc data
        if data_lst[date_ind + 3] in CYLINDERS and data_lst[date_ind + 4] in FUEL_TYPES:
            cc = data_lst[date_ind + 1] + data_lst[date_ind + 2]
            cc_ind = date_ind + 1
            data_lst[cc_ind] = cc
            data_lst.pop(cc_ind + 1)
            cynd_ind = date_ind + 2
            fuel_ind = date_ind + 3
        else:
            # most likely an EV
            if data_lst[date_ind+1] == "Elc":
                cc_ind = date_ind + 1
                cynd_ind = date_ind + 2
                fuel_ind = date_ind + 3
                data_lst.insert(cc_ind, "N/A")
                data_lst.insert(cynd_ind, "N/A")
            else:
                # most likely a data error on the number of cylinders
                if data_lst[fuel_ind - 1] == "9": # this should be '8'
                    cc_ind = date_ind + 1
                    cynd_ind = date_ind + 2
                    data_lst[cynd_ind] = "8"
                else:
                    sys.exit(f"Something went wrong here: {LINE_IND}")
    mm_lst = data_lst[:date_ind]
    marca = brand_check(mm_lst[0])
    start = len(marca.split())
    model = " ".join(mm_lst[start:])
    data_dic["MARCA"] = marca
    data_dic["MODELO TIPO"] = model
    date = data_lst[date_ind]
    cc = data_lst[cc_ind]
    data_dic["C.C"] = cc
    cynd = data_lst[cynd_ind]
    data_dic["No de cilind"] = cynd
    fuel = data_lst[fuel_ind]
    data_dic["G/D"] = fuel
    data_lst.remove(cc)
    data_lst.remove(cynd)
    data_lst.remove(fuel)
    data_lst.remove(date)
    for item in mm_lst:
        data_lst.remove(item)
    if len(data_lst) == 5:
        data_dic["P kW"] = data_lst[0]
        data_dic["cvf"] = data_lst[1]
        data_dic["CO2"] = data_lst[2]
        data_dic["cv"] = data_lst[3]
        data_dic["VALOR"] = data_lst[4]
    elif len(data_lst) == 4:
        # there's likely no CO2 data
        data_dic["P kW"] = data_lst[0]
        data_dic["cvf"] = data_lst[1]
        data_dic["CO2"] = "N/A"
        data_dic["cv"] = data_lst[2]
        data_dic["VALOR"] = data_lst[3]
    else:
        STOP_HERE = True
    return data_lst, data_dic


def dr_bug(dr_ind, data_lst):
    data_lst_2 = False
    eol = data_lst.index(DR_BUGS[dr_ind]) # end of list
    eol_str = data_lst[eol]
    start = re.search(r"[A-Z]", eol_str).start()
    # split the run-on token, e.g. "18400ALFA" -> "18400" and "ALFA"
    value, next_word = eol_str[:start], eol_str[start:]
    data_lst[eol] = value
    data_lst_1 = data_lst[:eol+1]
    if next_word in MARC_EXTEND:
        data_lst_2 = data_lst[eol+1:]
        data_lst_2.insert(0, next_word)
    return data_lst_1, data_lst_2


WRITE_DATA = False # set this to True to enable the tsv file output

# this line number value can be set so that the process begins at the top of a 'Page'
START_AT = 0

LINE_IND = 0 # DO NOT ALTER THIS VALUE
DATA_STOP = "cve: BOE-A-2016-11948"
DATA_START = ["C.C. G/D P kW cvf", "cilind. g/km cv Euros"]

FIELDS = ["MARCA", "MODELO TIPO", "PERIODO COMERCIAL", "C.C", "No de cilind",
          "G/D", "P kW", "cvf", "CO2", "cv", "VALOR"
          ]

MARC_EXTEND = {
"ALFA": "ALFA ROMEO",
"ASTON": "ASTON MARTIN",
"COBRA": "COBRA CARS",
"DE": "DE TOMASO",
"MK": "MK SPORTSCAR",
"ROLLS": "ROLLS ROYCE",
"ASIA": "ASIA MOTORS",
"LAND": "LAND ROVER"
}

FUEL_TYPES = ["G", "D", "S", "M", "DyE", "GyE", "Elc"]
CYLINDERS = ["2", "3", "4", "5", "6", "8", "10", "12"]

IGNORE_LST = [
    "BMW (Ver también marca MINI)",
    "E81 = 3p      E87 = 5p"
    ]

DR_BUGS = [ # data run-on  bugs
    "8900DATOS",
    "18400ALFA",
    "19600ALFA"
]

DATA_FLAG = False
DATA = file_read("boe.txt")
while LINE_IND < START_AT:
    DATA.pop(0)
    LINE_IND += 1

for line in DATA: # ep for the main data loop
    line = line.strip()
    if LINE_IND == 29264:
        line_0 = line
        LINE_IND += 1
        continue
    if LINE_IND == 14923:
        line = line_0 + line
    LINE_IND += 1
    if LINE_IND in range(210, 227):
        DATA_FLAG = False
        continue
    if line == DATA_STOP:
        DATA_FLAG = False
        continue
    elif line in DATA_START:
        DATA_FLAG = True
        continue

    if line and line not in IGNORE_LST and LINE_IND >= START_AT:
        if DATA_FLAG:
            data_dic = dict.fromkeys(FIELDS, False)
            print(LINE_IND)
            print("  ", " ".join(line.split()))
            process_lst = [line.split()]
            for data_bug in DR_BUGS:
                if data_bug in process_lst[0]:
                    data_lst_1, data_lst_2 = dr_bug(DR_BUGS.index(data_bug), process_lst[0])
                    process_lst = [data_lst_1]
                    if data_lst_2:
                        process_lst.append(data_lst_2)
            for data_lst in process_lst:
                data_lst, data_dic = extract_data(data_lst, data_dic)
                tsv_str = ""
                for item in data_dic:
                    print(f"{item}: {data_dic[item]}")
                    tsv_str += f"{data_dic[item]}\t"
                print()
                if WRITE_DATA:
                    with open(file="boe.tsv", mode="a", encoding="UTF-8") as output:
                        print(tsv_str, file=output)
                sleep(0.1)

I’ve not done a full run, but only about the first 14K data lines, so I don’t know how it will perform, but it’s as robust as I’m able to make it, so I hope that it’s of some use.

I’ve not coded this to set up a fresh tsv file with a header row; it simply appends to boe.tsv (append mode creates the file if it doesn’t exist yet). Remember: the file will be added to each time this app is run, so you’ll need to manage that.
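
If a header row is wanted, one possible approach (a sketch only, borrowing the "write the header when the file is new or empty" idea from the GUI code earlier in the thread; append_row is a made-up name) is to let the csv module write tab-separated rows:

```python
import csv
import os
import tempfile


def append_row(tsv_path, fieldnames, row):
    """Append one row, writing the header first if the file is new or empty."""
    is_new = not os.path.isfile(tsv_path) or os.path.getsize(tsv_path) == 0
    with open(tsv_path, mode="a", newline="", encoding="UTF-8") as output:
        writer = csv.DictWriter(output, fieldnames=fieldnames, delimiter="\t")
        if is_new:
            writer.writeheader()
        writer.writerow(row)


# Demonstration in a temporary directory so no real file is touched.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "demo.tsv")
    append_row(path, ["MARCA", "VALOR"], {"MARCA": "YYYY", "VALOR": "16300"})
    append_row(path, ["MARCA", "VALOR"], {"MARCA": "ZZZZ", "VALOR": "8900"})
    with open(path, newline="", encoding="UTF-8") as check:
        lines = check.read().splitlines()

print(lines)
```

Repeated runs would still append to the same file, so duplicates from earlier runs remain something to manage by hand.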

Note that the WRITE_DATA flag is set to False, which is for ‘dry runs’ that only display the data; set it to True if you want the data written to the tsv file. The START_AT value can be set to any data line number, but if you set it to a line number that is ‘mid page’, the data processing will not begin until the top of the following page. This is because the data start and stop signals are triggered by text that is found at the beginning and end of data sets. You’ll see what I mean by studying the code, which I’ve tried to keep as easy as possible to follow, but if you’re unsure about what I’ve done and why, then I’ll pop back and explain as best I can.
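
The start and stop signals can be pictured as a small on/off switch. The sketch below is a simplified stand-in for the main loop, not a replacement for it. The marker strings are copied from the script, while data_lines and the sample lines are made up for illustration.

```python
DATA_START = ["C.C. G/D P kW cvf", "cilind. g/km cv Euros"]
DATA_STOP = "cve: BOE-A-2016-11948"


def data_lines(lines):
    """Yield (line_number, text) for lines between a start and a stop marker."""
    in_data = False
    for number, raw in enumerate(lines):
        text = raw.strip()
        if text == DATA_STOP:
            in_data = False
        elif text in DATA_START:
            in_data = True
        elif in_data and text:
            yield number, text


sample = [
    "row from the middle of a page",  # skipped: no start marker seen yet
    "cve: BOE-A-2016-11948",
    "C.C. G/D P kW cvf",
    "cilind. g/km cv Euros",
    "first data row",
    "",
    "second data row",
    "cve: BOE-A-2016-11948",
    "page footer",
]
rows = list(data_lines(sample))
print(rows)
```

The first sample line shows why a ‘mid page’ starting point yields nothing until the next page: the switch is still off until a start marker has been seen.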

Enjoy your weekend.

Rob.

michaeldg_94 · January 15, 2024, 8:06am

Hi @rob42:

Thank you for all these wonderful explanations. Yes, the data is not optimal unfortunately, and I apologise for the difficulty caused.

Thanks also for the code provided, which looks great! I don’t really need it right now, but I’m very happy to be able to use it for similar applications in the future! Thanks very much.

All the best, and I look forward to “hearing” from you on any other Python difficulties I experience; there will most certainly be some, and I’ll need your invaluable help and that of other python.org members!

Have a great week.

Michael
