Opening multiple PDFs in Python and extracting the XML Files

Bensan96 · October 18, 2021, 1:36pm

Hello,

So my problem here is that I can’t open multiple PDFs and have the code extract the XML files from them.

This is what I have at the moment. It works when I run it on a single PDF file, but as soon as I select more than one I get this error:

expected str, bytes or os.PathLike object, not tuple

from tkinter.filedialog import askopenfilename
import tkinter.filedialog as fd
import sys
from datetime import datetime
import PyPDF2

file = fd.askopenfilenames()
today = datetime.now()

sys.stdout = open(r".\\factur-x.xml",'a')
print("\n"*2, today,"\n"*2)

def getAttachments(reader):
      catalog = reader.trailer["/Root"]
      fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
      attachments = {}
      for f in fileNames:
          if isinstance(f, str):
              name = f
              dataIndex = fileNames.index(f) + 1
              fDict = fileNames[dataIndex].getObject()
              fData = fDict['/EF']['/F'].getData()
              attachments[name] = fData

      return attachments

handler = open(file, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)

for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)

steven.daprano · October 18, 2021, 1:59pm

Hi Benjamin,

Thank you for posting your code, and the error message, but can you
please post the full traceback of the exception, starting with the line
“Traceback…” and ending with the error message?

The information in the traceback is just as important as the error
message, possibly even more so.

Please copy and paste it as text, not a screen shot or photo.

Thanks.

Bensan96 · October 18, 2021, 2:13pm

I hope you mean this

Message=expected str, bytes or os.PathLike object, not tuple
Source=C:\Users\Benjamin\source\repos\multiple_files\multiple_files\multiple_files.py
StackTrace:
File "C:\Users\Benjamin\source\repos\multiple_files\multiple_files\multiple_files.py", line 27, in (Current frame)
handler = open(file, 'rb')

steven.daprano · October 18, 2021, 3:07pm

Thanks Benjamin, that’s exactly what I meant, except that it looks
nothing like I expected! I’ve never seen an exception printed in that
format before. Normally they look like this:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not callable

(obviously the error message at the end will vary).

How are you running Python? Are you using an IDE?

In any case, the problem occurs when you try to open a file:

handler = open(file, 'rb')

except that the file argument is not the name of a file, but the names
of many files, in a tuple.

You need to open each file one at a time. Something like:

for filename in list_of_files:
    handler = open(filename, 'rb')
    # ... and then read the PDF and do something with it

You might be able to use the fileinput library instead:

https://docs.python.org/3/library/fileinput.html

but that will depend greatly on how the PDF library works, so it might
be better/easier/safer to just open each file one at a time.
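
A minimal sketch of the one-at-a-time approach, using only the standard library. The function name read_each is made up for illustration, and the paths argument is assumed to be the tuple that fd.askopenfilenames() returned. Reading the first five bytes only stands in for whatever the PDF library would do with the open file.

```python
def read_each(paths):
    """Open every path in turn and return {path: first five bytes}.

    `paths` is assumed to be a tuple of file names, such as the value
    returned by tkinter.filedialog.askopenfilenames().
    """
    headers = {}
    for path in paths:
        # open() gets one name per iteration, never the whole tuple
        with open(path, "rb") as handler:
            # a PDF file begins with the bytes b"%PDF-"
            headers[path] = handler.read(5)
    return headers
```

In the original script the body of the loop would be the place to create the PDF reader and call getAttachments(reader), so each selected file is handled separately.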

Bensan96 · October 18, 2021, 3:27pm

This made me think that I might have a general problem with my idea; I might be missing something in my code. Maybe it helps if I describe what I want to achieve.

So my goal is to extract the attached XML files from PDFs. This works with my current code, although only with a single file. The ideal outcome would be to extract from multiple PDFs and give each extracted XML file a unique name (for example test1.xml, test2.xml, etc.).
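
One possible way to get unique names, sketched with the standard library only. The helper save_attachments and the naming scheme (PDF name plus a counter) are assumptions for illustration, not something from the code above. The attachments argument is assumed to be a dict of {name: bytes} like the one getAttachments() returns.

```python
from pathlib import Path


def save_attachments(pdf_path, attachments, out_dir="."):
    """Write each attachment under a name derived from the PDF's name.

    Hypothetical helper: `attachments` is assumed to map attachment
    names to bytes, as returned by getAttachments() in the first post.
    """
    stem = Path(pdf_path).stem  # "invoice.pdf" -> "invoice"
    out_dir = Path(out_dir)
    written = []
    for number, data in enumerate(attachments.values(), start=1):
        # e.g. invoice_1.xml, invoice_2.xml, ...
        target = out_dir / f"{stem}_{number}.xml"
        target.write_bytes(data)
        written.append(target)
    return written
```

With this scheme the names only stay unique as long as the selected PDFs have different file names; two files called invoice.pdf in different folders would overwrite each other's output.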

Bensan96 · October 21, 2021, 8:59am

Sorry, I forgot to thank you for your reply. Thank you for your help. It helped me get a bit further.