Help with a citation map script

I am a researcher working on a review article and need a citation map for illustration. This script promises to do just that: GitHub - jaks6/citation_map: Create a Gephi Citation Graph based on Text Analysis of PDFs from Zotero

However, when I run it, it ends abruptly with an unexpected end of file exception. Can anyone help me? (I am in the social sciences. I have some computer-nerd tendencies, but am a complete amateur at coding, so help from the learned community would be much appreciated!) This, specifically, is the error message:


multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pdfminer/pdfdocument.py", line 721, in __init__
    pos = self.find_xref(parser)
  File "/usr/lib/python3/dist-packages/pdfminer/pdfdocument.py", line 978, in find_xref
    raise PDFNoValidXRef("Unexpected EOF")
pdfminer.pdfdocument.PDFNoValidXRef: Unexpected EOF

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/analyze_papers.py", line 156, in article_worker
    pdf_result, text, pdf_log = process_pdf(metadata)
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/analyze_papers.py", line 115, in process_pdf
    original_page_count, pages = pdf_to_text_list(first_pdf)
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/analyze_papers.py", line 35, in pdf_to_text_list
    pages = layout_scanner.get_pages(file_loc, images_folder=None)  # you can try os.path.abspath("output/imgs")
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/layout_scanner.py", line 214, in get_pages
    return with_pdf(pdf_doc, _parse_pages, pdf_pwd, *tuple([images_folder]))
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/layout_scanner.py", line 28, in with_pdf
    doc = PDFDocument(parser)
  File "/usr/lib/python3/dist-packages/pdfminer/pdfdocument.py", line 727, in __init__
    newxref.load(parser)
  File "/usr/lib/python3/dist-packages/pdfminer/pdfdocument.py", line 241, in load
    (_, obj) = parser.nextobject()
  File "/usr/lib/python3/dist-packages/pdfminer/psparser.py", line 607, in nextobject
    (pos, token) = self.nexttoken()
  File "/usr/lib/python3/dist-packages/pdfminer/psparser.py", line 524, in nexttoken
    self.fillbuf()
  File "/usr/lib/python3/dist-packages/pdfminer/psparser.py", line 239, in fillbuf
    raise PSEOF("Unexpected EOF")
pdfminer.psparser.PSEOF: Unexpected EOF
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/kenstad/Dokumenter/pythonscripts/citation_map-master/analyze_papers.py", line 242, in <module>
    result = pool.map(list_worker, list(titles_dict.items()), chunksize=5)
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
pdfminer.psparser.PSEOF: Unexpected EOF

The error message says “Unexpected EOF”. This is a pretty lousy error message, since it doesn’t indicate which file is being processed. If the error is correct, it suggests either that the pdf being processed is corrupt or perhaps that the file being processed is not a pdf.
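One way to narrow it down: "Unexpected EOF" usually means the parser ran off the end of the file, i.e. the file is truncated. A stdlib-only sanity check (no pdfminer needed; the function name is mine) can flag such files quickly:

```python
def looks_like_pdf(path):
    """Cheap sanity check: a well-formed PDF starts with b"%PDF-" and has
    a b"%%EOF" marker near the end. A file failing this is almost
    certainly truncated or not a PDF at all."""
    with open(path, "rb") as fh:
        head = fh.read(5)
        fh.seek(0, 2)                    # seek to the end to get the size
        size = fh.tell()
        fh.seek(max(0, size - 1024))     # inspect only the last kilobyte
        tail = fh.read()
    return head == b"%PDF-" and b"%%EOF" in tail
```

Running this over each attachment path in your export should point at any truncated or non-PDF file, though a file can pass this check and still be corrupt internally.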

How did you call the script? Did you give it a .csv file as input?

The script is rather old (~5 years) but still seems to be working ok (pdfminer may no longer be maintained - but still seems to be ok too). Error handling in the script is pretty minimal and rough.

First thing to do would be to verify that the zotero .csv file format has not changed in the last few years. The format assumed by the script is that it has (among other columns) these columns:

# Zotero CSV Column indices
YEAR_I = 2   
AUTHOR_I = 3
TITLE_I = 4
FILE_I = 37
# used in `read_titles`
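To check those indices against your actual export, you could print the header row with its positions (a small sketch; `show_columns` is my name for it):

```python
import csv

def show_columns(csv_path):
    """Print each header column of a Zotero CSV export with its
    zero-based index, so the hard-coded values (YEAR_I, AUTHOR_I, ...)
    can be verified against the real file."""
    with open(csv_path, newline="", encoding="utf-8-sig") as fh:
        header = next(csv.reader(fh))
    for i, name in enumerate(header):
        print(i, name)
    return header
```

If the printout does not show the year, author, title and file columns at indices 2, 3, 4 and 37, the constants need updating.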

If the format has changed, you may need to modify those values. But it would be better to use pandas instead of a hard-coded list of column indices. For instance, like this:

def read_titles(path):
    import pandas as pd
    # select columns by name rather than by position
    df = pd.read_csv(path, usecols=["title", "author", "file", "year"],
                     dtype={"year": str})
    # pre_process is the helper already defined in analyze_papers.py
    return dict((pre_process(row.title), dict(row))
                for _, row in df.iterrows())

This merely assumes that the zotero.csv has those columns, but makes no assumptions about the ordering. (You do need to verify that the column names are as listed here. They need to match exactly, so you may need to modify them a bit.)

Next thing might be to improve the error handling of the script to find out which file is causing trouble.
For instance, I would change this line:

    original_page_count, pages = pdf_to_text_list(first_pdf)

to

    try:
        original_page_count, pages = pdf_to_text_list(first_pdf)
    except Exception:  # report the offending file, then re-raise
        print(f"ERROR: Failed to parse {first_pdf}")
        raise

This will tell you which local pdf (if any) is the culprit. It’s possible that file is actually fine, in which case you’ll need to dig deeper.


Thank you, Hans!
I believe the column indices are correct. The script runs for a while before the error occurs.
I do get other error messages, such as “Error opening file in with_pdf()” and “Issue parsing PDF file”, and then the name of the file that created it.

It seems the error handling you suggest did not catch the error. The script ends on the same error message as before.

I call the script with “python3 analyze_papers.py zoteroexports.csv”, so, yes, with a .csv file as argument.

Thanks again for responding so quickly, and I would appreciate any other pointers towards a solution! Should I send you the .csv file? (If you have the time to look at it, that is).

Kind regards,
Kjetil

Sorry,
Just a quick note: I have found that some of the pdf files are corrupt, but I am not sure whether they are the problem, or at least not all of it… I will keep digging, but as I said, if you (or someone else) have any pointers, it would be much appreciated.

K.

To verify whether the pdf files you found are causing the problem, I think you could remove those entries from the .csv file and see if it then runs ok.
If that doesn’t solve the mystery, you could also consider contacting the author of the python script by sending him a message through GitHub (or opening an issue on the repo).
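If there are many such entries, removing them by hand is tedious. A stdlib-only sketch along these lines could write a cleaned copy of the export, assuming a "File Attachments" column with a single path per row (Zotero can put several paths in that field, so adjust as needed):

```python
import csv
import os

def pdf_ok(path):
    # minimal check: the file exists and starts with the PDF magic bytes
    if not os.path.isfile(path):
        return False
    with open(path, "rb") as fh:
        return fh.read(5) == b"%PDF-"

def clean_export(src, dst, file_col="File Attachments"):
    """Copy the CSV from src to dst, skipping rows whose attachment path
    is missing or does not look like a PDF; return the dropped paths."""
    dropped = []
    with open(src, newline="", encoding="utf-8-sig") as fin, \
         open(dst, "w", newline="", encoding="utf-8") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            path = (row.get(file_col) or "").strip()
            if path and not pdf_ok(path):
                dropped.append(path)
                continue
            writer.writerow(row)
    return dropped
```

Rows without any attachment are kept; only rows pointing at a missing or non-PDF file are dropped, and the returned list tells you which ones.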

A question for people on this forum: does anyone know of other Python tools that can generate a citation graph (a Gephi citation graph or otherwise) from a set of research papers in .pdf format?

I noticed there is a commercial tool (Litmaps, with limited free usage) and a set of open-source tools (inciteful.xyz, which also has a Zotero plugin). I’ve never used those, so I don’t know their quality/ease of use.

I suggest you run only 1-3 PDF files through this program first, to make sure those are not corrupt PDF files. Verify that the program works with at least one PDF file.

Yes, Chuck, I have meticulously gone through all the pdfs over the weekend and weeded out all the corrupt ones. The rather old script still had other issues, so I asked ChatGPT to write me a new one, which I have been polishing, and I think I have got it to work. If anyone is interested in using it, improving on it, or looking at it to see if it does things it shouldn’t, here it is:

import sys
import os
import pandas as pd
from PyPDF2 import PdfReader
import re

def extract_citations(pdf_path, df):
    citations = []
    with open(pdf_path, 'rb') as file:
        pdf_reader = PdfReader(file)
        doc_length = len(pdf_reader.pages)
        for page_num in range(doc_length):
            page_text = (pdf_reader.pages[page_num].extract_text() or "").lower()  # guard against pages with no extractable text
            for index, row in df.iterrows():
                title = row['Title'].split(':')[0].split('-')[0].strip().lower()  # Extracting title up to colon or dash
                try:
                    if isinstance(row['Author'], str) and row['Author']:
                        author = row['Author'].split()[0].lower() 
                        author = author.rstrip(' ,')  # strip trailing spaces and commas
                    else:
                        if isinstance(row['Editor'], str) and row['Editor']:
                            author = row['Editor'].split()[0].lower()
                            author = author.rstrip(' ,')  # strip trailing spaces and commas
                        else:
                            raise ValueError("Both author and editor columns are empty")
                except ValueError as e:
                    print(f"Error: {e}. Title: {title}")
                    continue
                if title in page_text and author in page_text:
                    citations.append(row['Id'])
    return citations

def main(csv_file):
    # Read CSV file
    df = pd.read_csv(csv_file)
    
    # Create Nodes.csv
    df['Label'] = df['Author'] + ', ' + df['Title'].apply(lambda x: ' '.join(x.split()[:3]))
    df['Id'] = range(1, len(df) + 1)
    nodes_df = df[['Id', 'Label', 'Author', 'Publication Year', 'Title']]
    nodes_df.to_csv('Nodes.csv', index=False)
    
    # Create Edges.csv
    edges = []
    for index, row in df.iterrows():
        if pd.notna(row['File Attachments']):
            pdf_path = row['File Attachments']
            if os.path.exists(pdf_path):
                citations = extract_citations(pdf_path, df)
                for citation in citations:
                    if row['Id'] != citation:  # prevent self-references from being registered
                        edges.append({'Source': row['Id'], 'Target': citation})
    
    edges_df = pd.DataFrame(edges)
    edges_df.to_csv('Edges.csv', index=False)

if __name__ == "__main__":
    if len(sys.argv) != 2:
        print("Usage: python3 script.py input.csv")
        sys.exit(1)
    
    csv_file = sys.argv[1]
    main(csv_file)
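One small addition that might be worth making: a sanity check of the two output files before importing them into Gephi. A sketch (assuming pandas, the column names the script writes, and a non-empty Edges.csv):

```python
import pandas as pd

def check_graph(nodes_path="Nodes.csv", edges_path="Edges.csv"):
    """Report basic stats and flag edges that point at unknown node Ids."""
    nodes = pd.read_csv(nodes_path)
    edges = pd.read_csv(edges_path)
    ids = set(nodes["Id"])
    # an edge is "dangling" if either endpoint is not a known node Id
    dangling = edges[~(edges["Source"].isin(ids) & edges["Target"].isin(ids))]
    print(f"{len(nodes)} nodes, {len(edges)} edges, "
          f"{len(dangling)} dangling edges")
    return dangling.empty
```

If it reports dangling edges, something went wrong in the Id assignment and Gephi would silently mis-draw the graph.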