Bulk extract text from same location in many images

I have a folder of 1,000 images, all formatted the same way. I need to create a text file that contains the name of each file (e.g. sampleimage.jpg) followed by the text that Python “reads” from a specific region of the image. It is the same region for every image.

For example:

File001.jpg “Series 401 - Slice 655”

File002.jpg “Series 411 - Slice 428”

File003.jpg “Series 403 - Slice 286”

File004.jpg “Series 401 - Slice 689”

etc.

The attached jpg is one of the images in the folder. Personal information has been removed from the file for privacy.

The specific “region” that I want Python to “read” is in the same place for every image. In the sample image shown below, the text to be read is “Series 401 - Slice 655”. That text is in the lower-right quadrant of the image.

I have downloaded Python and now I will start learning to use it for the first time.

According to google, one option is this: “Using Python with OpenCV (for image processing) and Tesseract (for OCR) is the most efficient method for large batches, as it allows you to define a specific crop area. Tools: Python, cv2 (OpenCV), pytesseract. Method: Define the ROI coordinates (x, y, width, height) of the text area. Loop through the folder of images. Crop each image to that area. Apply pytesseract to extract the text. Save the output to a CSV or text file.”

It suggests this code:

```python
import cv2
import pytesseract
import os

# Define the region of interest as (x, y, width, height)
# YOU MUST CHANGE THESE COORDINATES
ROI = (100, 200, 300, 50)

def process_images(folder_path):
    results = []
    for filename in os.listdir(folder_path):
        if filename.endswith(".jpg") or filename.endswith(".png"):
            img = cv2.imread(os.path.join(folder_path, filename))
            # Crop image: img[y:y+h, x:x+w]
            crop_img = img[ROI[1]:ROI[1]+ROI[3], ROI[0]:ROI[0]+ROI[2]]
            # OCR the cropped region
            text = pytesseract.image_to_string(crop_img)
            results.append(f"{filename}: {text.strip()}")
    return results

# Run the function on your image folder
extract_results = process_images('path/to/your/images')
```
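One gap worth noting: the question asks for a text file, but the suggested code only builds a list in memory. A minimal sketch of writing it out (the entries here are placeholders in the filename-plus-text format the question describes):

```python
# Placeholder results in the desired "filename: text" format
extract_results = [
    "File001.jpg: Series 401 - Slice 655",
    "File002.jpg: Series 411 - Slice 428",
]

# Write one line per image to a text file
with open("results.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(extract_results) + "\n")
```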

My questions are:

  1. As a new person, any tips so I can avoid common pitfalls?
  2. How do I determine the coordinates of the region to be read?

Tesseract will sometimes get the text wrong. You probably need to train it on samples of your images, or on the font you’re using, to get reliable results. You need to consider what to do for quality control.
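One cheap quality-control check, since every result should follow the same pattern: validate each OCR string against a regular expression and flag anything that doesn’t match for manual review. This is just a sketch; the exact pattern is an assumption you’d tighten to fit your data (e.g. fixed digit counts):

```python
import re

# Expected shape of every OCR result; adjust to your data
PATTERN = re.compile(r"Series \d+ - Slice \d+")

def passes_qc(filename, text):
    """Return True if the OCR text matches the expected pattern;
    otherwise flag the file for manual review."""
    if PATTERN.fullmatch(text.strip()):
        return True
    print(f"QC: {filename} needs manual review: {text!r}")
    return False
```

A check like this won’t catch a “3” misread as an “8” (both still match `\d`), but it does catch garbled output, empty reads, and letters misread as digits.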

When you select a rectangular region in an image editor, it will typically show you the pixel coordinates. Be mindful that not all software agrees on whether (0,0) is the top left corner or the bottom left corner.
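For OpenCV specifically, (0, 0) is the top-left corner, x grows rightward, y grows downward, and NumPy slicing puts rows (y) first. A tiny sanity check on a dummy image (the coordinates are made up, standing in for values read off an image editor):

```python
import numpy as np

x, y, w, h = 100, 200, 300, 50  # hypothetical coordinates from an image editor

# Dummy 800x600 image; NumPy shape is (rows, cols, channels)
img = np.zeros((600, 800, 3), dtype=np.uint8)

# Rows come from y, columns from x
crop = img[y:y+h, x:x+w]

print(crop.shape)  # (50, 300, 3): height, width, channels
```

If your crop comes out sideways or empty, the x/y slices are almost certainly swapped.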

This isn’t what you asked for, but I just want to check if there’s an easy option here. Often, an image file will include additional metadata; in the case of a JPEG file, that includes EXIF and possibly other things. Check to see if the relevant information is embedded within the file. For this, you’ll need to use the original (unedited/uncensored) file, as you may lose that during any edits.

You’re expecting series 401 and slice 655 for that image, so I would start by looking for those strings of digits within the file. Most of it is going to be binary and unreadable, but see if the file contains “401” and/or “655” anywhere in it. If not, try getting a list of all the readable strings in the file (on Linux systems, there’s a “strings” command that does that; or you could craft a regex to find them all).

Maybe this is a dead end, but if it works, it would be 100% reliable and a lot faster than tesseract, so it should be worth a little bit of time trying it out!
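If you don’t have the `strings` command handy (e.g. on Windows), a rough stdlib-only equivalent is easy; this sketch scans a file’s raw bytes for runs of printable ASCII:

```python
import re

def readable_strings(path, min_len=4):
    """Yield runs of at least min_len printable ASCII bytes,
    roughly like the Unix `strings` tool."""
    with open(path, "rb") as f:
        data = f.read()
    for match in re.finditer(rb"[\x20-\x7e]{%d,}" % min_len, data):
        yield match.group().decode("ascii")
```

You could then search the output for “401” or “655”.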


Thanks! No, there is no metadata in any of the files.

Happy to report that I was able to accomplish this task; a huge feat as a 100% brand-new user. I installed Visual Studio Code, along with various modules that were required for tesseract.

Some things I did to help it be successful:

  1. I tested the code using a dummy folder of only 12 images, instead of the many hundreds. That made iteration faster, because I needed to tweak my process a few times: in the first rounds of text reading, Tesseract got some digits wrong, interpreting “3” as “8”, for example.

  2. I created a backup folder of all my images and used XNConvert to crop those backup images to the specific region I need to scan (since a cropped image takes less time to read) and I also used XNConvert to increase the brightness and contrast (making the text clearer). Doing this gave 100% accurate text reading and was faster.

  3. Running the code gave me a list of the filenames with their corresponding OCR-read text. I used that list to build a batch of rename commands that renamed the files to Slice_655, Slice_656, etc.

Now the images all flow in the order of their slices.
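For anyone following along, step 3 above can also be done in Python rather than by hand-building rename commands. This sketch assumes result lines in a `File001.jpg: Series 401 - Slice 655` format, which may differ from the poster’s actual output:

```python
import os
import re

def rename_from_results(results, folder):
    """Rename each file to Slice_<n>.jpg based on its OCR result line."""
    for line in results:
        m = re.fullmatch(r"(.+?): Series \d+ - Slice (\d+)", line.strip())
        if not m:
            continue  # skip lines that don't match the expected pattern
        old_name, slice_no = m.groups()
        os.rename(os.path.join(folder, old_name),
                  os.path.join(folder, f"Slice_{slice_no}.jpg"))
```

Note that if two files share a slice number (from different series), the second rename would overwrite the first, so in practice you’d want to check for an existing target name before renaming.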

Unfortunately, the flow is not exactly what I was expecting. I wanted to see the images flowing as if the camera were moving slowly through the digestive tract, and I had assumed that the Slice_655, Slice_656, Slice_657, etc., sequence was tied to that. For many of the consecutively numbered slices, the camera does flow in a logical manner through the colon; but there are still many other consecutively numbered slices that “jump” around from one section of the colon to another.

So, I am now working on a new approach. Stay tuned.