Error in finding text in a PDF

I was using PyPDF2 to extract text from PDFs and work with it. I ran into a problem with a for loop, where it didn’t print the desired output:

from PyPDF2 import PdfReader
from PyPDF2 import PdfFileWriter
from PyPDF2 import PdfWriter

pdf = PdfReader('Test.pdf')

print(len(pdf.pages))

page = pdf.pages[1]

Text = str(page.extract_text())

No = int(Text.count("Ans:"))

for No in range(1,No+1):

    if "Ans: (a)" in Text or "Ans:  (a)" in Text:
        print("Found a ")
        Text.replace("Ans: (a)", "No Answer")
        
    elif "Ans: (b)" in Text or "Ans:  (b)" in Text:
        print("Found b")
        Text.replace("Ans: (b)","No Answer")
        
    elif "Ans: (c)" in Text or "Ans:  (c)" in Text:
        print("Found c")
        Text.replace("Ans: (c)","No Answer")
        
    elif "Ans: (d)" in Text or "Ans:  (d)" in Text:
        print("Found d")
        Text.replace("Ans: (d)","No Answer")
        
    else:
        print("Error")

print(Text)

Basically, what I want this program to do is find Ans: (a), Ans: (b), Ans: (c), Ans: (d) and replace them, so that an answer key document for a test is converted into a question paper.

The text extracted using the module:
Multiple Choice Questions with one correct answer. A correct answer carries 1 mark. No negative
mark. 60 x 1 = 60

  1. Which of the following is an equivalence relation?
    (a)
    ab (b)
    ab
    (c)
    ab− is divisible by
    5 (d)
    a divides
    b
    Sol:
    ab− is divisible by
    5 is the only option satisfying reflexive, symmetric, transitive.
    Hence it is an equivalence relation.
    Ans: (c)
  2. Let
     , A x y z= then an equivalence relation on
    A is
    (a)
    ()()()()   1 , , , , , , , R x y y z x z x x= (b)
    ()()()()   2 , , , , , , , R z y z x z z y y=
    (c)
    ()()()()   3 , , , , , , , R x x y y z z x y= (d) None of these
    Sol:
    1R is not reflexive since
    (),y y R
    2R
    is not reflexive since
    (),x x R
    3R
    is not symmetric since
    (),y x R
    Ans: (d)
  3. In the set
     6,7,8,9,10 A= a relation
    R is defined by
    ()  , : , and a , R a b a b A b=   then
    R is
    (a) Reflexive (b) Symmetric (c)Transitive (d) None of these
    Sol:
    ab is not possible hence not reflexive.
    a b b a  
    hence not symmetric relation.
    , a b b c a c   
    is transitive relation.
    Ans: (c)
  4. The relation
    ()()()   4,4 , 5,5 , 6,6 R= on the set
     4,5,6 is
    (a) transitive only (b) an equivalence relation
    (c) reflexive only (d) symmetric only
    Sol:
    ()()()   4,4 , 5,5 , 6,6 R=
    It is an equivalence relation
    Ans: (b)
  5. The range of function
    ()3
    3xfxx−=− is
    (a)
    R (b)
    1R− (c)
    1− (d)
    1 R−−
    Sol: Let
    ()33133xxyxx−−= = =−− − −
    Ans: (c)
  6. The domain of the function
    ()()2log 2 4 f x x x= − + − is
    (a)
    2, 2− (b)
    ()2, (c)
    ()0, 2 (d)
    (,2− −

As you can see, the "Ans: (…)" lines are successfully extracted from the PDF.

But when I try to identify these in the PDF using code, it is only able to identify Ans (b), hence the output:

Found b
Found b
Found b
Found b
Found b

Could anyone please tell me what is going wrong? Thank you!

I have not used this package, but I suspect that Text is just a string. In Python, strings are immutable objects, and you cannot change them in place. The replace() method returns a new string and leaves the original untouched, so if you want to replace the content you need to assign the result back, such as

Text = Text.replace(...)
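As a minimal sketch of the difference (the sample string is made up for illustration, it is not the real PDF text):

```python
# Made-up sample, standing in for the extracted page text.
text = "Hence it is an equivalence relation. Ans: (c)"

# replace() builds and returns a new string; the result is discarded here,
# so `text` still holds the original content afterwards.
text.replace("Ans: (c)", "No Answer")
unchanged = text

# Assigning the result back is what actually updates the variable.
text = text.replace("Ans: (c)", "No Answer")

print(unchanged)
print(text)
```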

That did work, but only for some of the answers. It did not work with Ans: (d). I think I understand why: for some reason the package is extracting the text as "Ans:  (d)" (two spaces after the colon) instead of "Ans: (d)".

If you look at the output, it replaced every ‘Ans’ except for the ones that have a double space between the : and the (. I did include that in the if condition as well, but it doesn’t seem to work:

elif "Ans: (c)" in Text or "Ans:  (c)" in Text:

Thank you by the way!

Your loop is very fragile.

There are a couple of easy improvements, e.g.:

No = ...

for No in range(1,No+1):
  ...
  [you don't use the value of No here anywhere]

It would be nicer if you didn’t reuse No that way, because No here means 2 completely different things: first the number of answers, then the loop counter.

But since you’re not actually using the value of No, I think the tidiest solution would be

for _ in range(Text.count("Ans:")):
  ...

Using _ as a variable name signals that it is unused, and this means you run the body of the loop N times, where N = Text.count("Ans:").

The inside of the loop is broken. How are you developing your code? Jupyter, Spyder, IDLE? Personally I’d run

index = Text.find("Ans:")
print(Text[index: index+10])

to see what symbols are actually in the string, and write a procedure from there, then stick the procedure in a function, and finally call the function from the loop.
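A small variation on that idea, as a sketch: wrapping the slice in repr() makes extra spaces, newlines and non-breaking spaces visible, and looping with find() shows every occurrence instead of only the first. The helper name show_context and the sample string are hypothetical, made up for illustration.

```python
def show_context(text, marker="Ans:", width=10):
    """Return the repr() of `width` characters starting at each `marker`."""
    snippets = []
    start = text.find(marker)
    while start != -1:  # find() returns -1 once there are no more matches
        snippets.append(repr(text[start:start + width]))
        start = text.find(marker, start + 1)
    return snippets


# Made-up sample with one single-space and one double-space answer line.
sample = "Sol: ...\nAns: (c)\n2. ...\nAns:  (d)\n"

for snippet in show_context(sample):
    print(snippet)
```

On the real data you would call show_context(Text) and compare the printed snippets with the literals used in the if conditions.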

It finds "Ans:  (c)" (with two spaces), but then runs Text.replace("Ans: (c)","No Answer") (with one space, and with or without the Text = in front), so it won’t actually replace "Ans:  (c)". If you want to, you can handle this by having 8 clauses in your if/elif sequence instead of 4.
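An alternative to writing 8 clauses, offered only as a sketch: a regular expression can accept any amount of whitespace between the colon and the bracket, so one pattern covers all the variants. This assumes the answers always look like "Ans:", optional whitespace, then a single letter a to d in parentheses; the names ANSWER_PATTERN and strip_answers are made up for this example.

```python
import re

# "Ans:", any run of whitespace (including none), then (a), (b), (c) or (d).
# Note that \s also matches newlines and non-breaking spaces.
ANSWER_PATTERN = re.compile(r"Ans:\s*\(([a-d])\)")


def strip_answers(text, placeholder="No Answer"):
    """Return (text with answers replaced, list of the letters found)."""
    found = ANSWER_PATTERN.findall(text)
    cleaned = ANSWER_PATTERN.sub(placeholder, text)
    return cleaned, found


# Made-up sample with both spacing variants.
sample = "Q1 ... Ans: (c)\nQ2 ... Ans:  (d)\n"
cleaned, found = strip_answers(sample)
print(found)
print(cleaned)
```

Because sub() replaces every match in one pass, no loop over Text.count("Ans:") is needed with this approach.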

Ohh yea, I am stupid for doing this thing. I tried this:

elif "Ans: (d)" in Text or "Ans:  (d)" in Text:
        
        Text = Text.replace("Ans: (d)" or "Ans:  (d)","No Answer")

But it still only finds 3 Ans instead of 5

I am really sorry to ask this, but can you please explain what the program below does?

index = Text.find("Ans:")
print(Text[index: index+10])

The string .find() method returns the index of the first occurrence of the specified substring, or -1 if it is not found.
Then Text[index: index+10] uses that index to slice out a 10-character substring that starts with "Ans:".
So, for example,

text = "12XX34567891234XX56789"
index = text.find("XX")
print(index)
print(text[index : index + 5])
print(text[index : index + 10])

prints

2
XX345
XX34567891

"Ans: (d)" or "Ans:  (d)" == "Ans: (d)"

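(Seriously, play around with interactive Python.)
An or expression evaluates to its first truthy operand, and any non-empty string is truthy, so replace() only ever receives "Ans: (d)" and never sees the two-space variant. That’s why that part of your code isn’t working.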
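To make that concrete, here is a short sketch using a made-up one-line string. It shows what the or expression evaluates to, why the replace() call then misses the two-space variant, and one possible fix (chaining two replace() calls).

```python
# `or` returns its first truthy operand; a non-empty string is truthy.
pattern = "Ans: (d)" or "Ans:  (d)"
print(repr(pattern))

# Made-up sample containing only the two-space variant.
text = "Ans:  (d)"

# Same as text.replace("Ans: (d)", "No Answer"): nothing matches.
attempt = text.replace("Ans: (d)" or "Ans:  (d)", "No Answer")
print(repr(attempt))

# One way to cover both spellings: one replace() call per variant.
fixed = text.replace("Ans:  (d)", "No Answer").replace("Ans: (d)", "No Answer")
print(repr(fixed))
```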