I need help with my Python OCR project

I am having a problem with my Python OCR project. I am getting inconsistent results using pytesseract, and I would like a second opinion on whether these results are a coding failure on my part or an issue with pytesseract. I have included some examples and sample code. I could use some suggestions.


Screenshot 1: [It returned nothing for some reason]


Screenshot 2:

がんばります!

中 N 大神一族、粉骨大身の覚悟で


Screenshot 3:




|
RI


名此懐人的集中力,


Screenshot 4:

「よう、 ジュード」
「どうも、 ローエンさん。

どうですか、 新しい浄水装置の方は」
「おかげさんで、何の問題ちないよ」
「そいつは何より」

Sample Code:

import cv2
import pytesseract
from pytesseract import Output

# Point pytesseract at the Tesseract executable (Windows install path)
pytesseract.pytesseract.tesseract_cmd = r"C:\Program Files\Tesseract-OCR\tesseract.exe"

# Load image
img = cv2.imread(r"C:\Users\tnu20\Downloads\New folder\swstella_intro_cg2.jpg")
if img is None:
    raise SystemExit("Image not found at path")

# Invert the image
invert = cv2.bitwise_not(img)

# Increase contrast
contrast = cv2.convertScaleAbs(invert, alpha=1.5, beta=1)

# Convert to greyscale
gray = cv2.cvtColor(contrast, cv2.COLOR_BGR2GRAY)

# Apply threshold to convert to binary image
threshold_img = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

# Pass the image through pytesseract
text = pytesseract.image_to_string(threshold_img, lang='jpn')
print(text)
#text = pytesseract.image_to_data(threshold_img, lang='jpn', output_type=Output.DICT)
        
#cv2.imshow('img', img)
#cv2.imshow('invert', invert)
#cv2.imshow('contrast', contrast)
#cv2.imshow('gray', gray)
cv2.imshow('threshold_img', threshold_img)
cv2.waitKey(0)
cv2.destroyAllWindows()

What about running ocrmypdf from the console? That way you can see any errors that come up.
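For example, something like this in a terminal (a rough sketch; the file names are just placeholders) should surface Tesseract's warnings and errors as it runs:

ocrmypdf -l jpn input.pdf output.pdf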

My OCR project does not involve scanning documents, only image files.

I just finished a two-month side hobby project that used OCR to read game boards from a WebView window, using every OCR library that wasn't dead or near-death (basically PyTesseract, EasyOCR, keras-ocr, and PaddleOCR). I went through them all while trying to understand why my results were so inconsistent, despite all the work I learned to do with OpenCV, PIL, and image processing/filtering before attempting OCR, and even training EasyOCR on a new font. I learned a lot, and it was also my first GUI (CustomTkinter) project. Concerning OCR, here is what I can tell you with absolute certainty before you sink a few months into this thing thinking the OCR library and its capability is the answer. (First, know up front that OCR is still not a precise thing, even in 2024 and even with machine learning. Giving vision to Python code turns out to be kinda hard…)

Image pre-processing with OpenCV and PIL is required if you want to OCR anything besides large, monochrome shapes on top of a high-contrast background. Even if you aren't dealing with text or digits, you are going to want to run the absolute bare minimum filters on your target image: smoothing operations, dilation, erosion, and of course greyscale conversion along with the morphology steps. This is 100% required and absolutely makes all the difference on images where you are getting maybe 50% accuracy. Basic processing that seems imperceptible to us can raise accuracy 20-40% or more; it's shocking, depending on the situation.
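To make that concrete, here is a minimal sketch of that kind of bare-minimum pass with OpenCV; the file name and kernel/parameter values are placeholders to experiment with, not settings from my project:

import cv2

# Placeholder path; substitute your own screenshot
img = cv2.imread("screenshot.png")

# Greyscale first, then smooth to knock down noise
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
blur = cv2.GaussianBlur(gray, (3, 3), 0)

# Otsu threshold to get a clean black-and-white image
_, binary = cv2.threshold(blur, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Light morphological opening (erosion then dilation) to clean up speckles
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (2, 2))
cleaned = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)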

EasyOCR was the best OCR library I used, but Tesseract is much easier to get a grip on and has far, far more help, documentation, and stability than any of the others. The machine-learning-based libraries like keras-ocr and PaddleOCR are not much better for most use cases, but they will excel if you want to train them yourself on a custom dataset. I absolutely did not give enough fucks to go that far, but YMMV. EasyOCR (and a metric ass-ton of the entire ecosystem) is Chinese-based, and while it does have a basic How-To and such in English, it does not have much official help beyond that for English users.
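For reference, getting a first result out of EasyOCR only takes a few lines; the image path below is a placeholder and the language codes are just an example for Japanese plus English:

import easyocr

# Loads (and on first run downloads) the Japanese + English models
reader = easyocr.Reader(['ja', 'en'])

# readtext returns a list of (bounding box, text, confidence) results
for bbox, text, confidence in reader.readtext("screenshot.png"):
    print(f"{confidence:.2f}  {text}")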

Tweaking parameters in any OCR library can make a huge difference. Even a change from .05 to .06 on a single parameter can go from red to black on the next test. The same goes for settings that control how the OCR engine reads the image (left to right, top to bottom, etc.) and how it decides where a line ends and where a letter sitting on top of it begins. So experiment.
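With Tesseract specifically, the page segmentation mode (--psm) is the knob that controls how the engine decides to read the layout, and it is worth cycling through a few values on the same image; the modes below are only examples to try, not a recommendation:

import cv2
import pytesseract

img = cv2.imread("screenshot.png")  # placeholder path

# --psm 3 is full automatic page segmentation (the default),
# --psm 6 assumes a single uniform block of text,
# --psm 11 looks for sparse text scattered around the image
for psm in (3, 6, 11):
    text = pytesseract.image_to_string(img, lang='jpn', config=f'--psm {psm}')
    print(f'--- psm {psm} ---')
    print(text)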

Mostly, you need to work on image pre-processing in your app, and it should be done every time, in every scenario. OpenCV is easy peasy after you run the same BGR-to-RGB conversion and dilate/open-then-close process a few times; there is a sketch of that step below. I didn't know a goddamn thing about image work, OpenCV, or even most of Python, honestly, when I started. But having written the dog-shit code, broken everything countless times, and read the OpenCV docs enough times to burn my retinas out, I can say I know a little bit more now than I did before. Good luck, godspeed.
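If it helps, the dilate/open-then-close sequence mentioned above looks roughly like this in OpenCV; the kernel size and iteration count are placeholders you will need to tune per image:

import cv2
import numpy as np

# Placeholder: start from an already-thresholded (binary) image
binary = cv2.imread("thresholded.png", cv2.IMREAD_GRAYSCALE)

kernel = np.ones((2, 2), np.uint8)

# Dilation thickens strokes that thresholding thinned out
dilated = cv2.dilate(binary, kernel, iterations=1)

# Opening removes small speckles; closing fills small gaps inside characters
opened = cv2.morphologyEx(dilated, cv2.MORPH_OPEN, kernel)
closed = cv2.morphologyEx(opened, cv2.MORPH_CLOSE, kernel)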
