Optimization and code improvement suggestions

Hello everyone, I have a script below to extract ZIP codes… the demo was to showcase 100 records, which works nicely so far, but I'm required to process over 10,000 records. I know it's going to hit errors (maybe HTTP 429), so I'm asking for any suggestions for a better approach.

'''
Script to automate the process of retrieving ZIP codes from a dataset containing various addresses.
'''

import time
import openpyxl
from geopy.geocoders import Nominatim 

# Initialize Nominatim API 
geolocator = Nominatim(user_agent="geoapiExercises") 

addresses_not_found = []  # to collect all addresses that have no zip code

# ==================opening and loading the excel work book========================
try:
    address_workbook = openpyxl.load_workbook('m.xlsx')
    address_sheetName = address_workbook['Sheet1']  # specifying the sheet containing the addresses
except FileNotFoundError as error:
    print(f'{error} File not found, please check the name/path of the file')
    exit()

# Iterate over the rows in the worksheet and extract the addresses
for row_idx, row in enumerate(address_sheetName.iter_rows(min_row=2, values_only=True), start=2):
    addresses_column = row[0]  # column containing the addresses, by index (e.g. 0 is column A)
    if addresses_column is not None:
        try:
            zipcode = geolocator.geocode(addresses_column)
            if zipcode:
                data = zipcode.raw 
                loc_data = data['display_name'].split()
                print("Full Location") # for testing purposes
                print(loc_data) 
                # print(zip)
                # print("Zip code : ",loc_data[-3])
                code = loc_data[-3].rstrip(',')  # index position of the zipcode in the location list
                print(f'{addresses_column} Zip code is {code}')
            else:
                print(f'{addresses_column} Zip code not found')
                addresses_not_found.append(addresses_column)
        except Exception as e:
            print(f'Error geocoding {addresses_column}: {e}')

        # writing the zip code back to the Excel sheet
        if code:
            address_sheetName.cell(row=row_idx, column=2, value=code)


num_addresses_not_found = len(addresses_not_found)
print('gathering all missing zipcodes... please wait')
time.sleep(2)
print('check done...')
time.sleep(1)
print(f'{num_addresses_not_found} not found')
print('printing all the addresses not found....')
print(addresses_not_found)
time.sleep(2)
for number, address in enumerate(addresses_not_found, start=1):
    print(f'{number}: {address}')


# saving the workbook after processing
# address_workbook.save('m.xlsx')
# address_workbook.close()

You have:

if code:

However, code is assigned a value only if there's a zipcode; if there's no zipcode, or there's an exception for a row, then it'll still have the previous value, or be undefined if it's the first iteration of the for loop.

I suggest that you either set it to, say, None at the start of each iteration, or move the cell write to just after where you assign to code.
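
For example (untested), resetting it at the top of each iteration:

for row_idx, row in enumerate(address_sheetName.iter_rows(min_row=2, values_only=True), start=2):
    code = None  # reset so a stale value from a previous row can't be written
    addresses_column = row[0]
    if addresses_column is not None:
        try:
            zipcode = geolocator.geocode(addresses_column)
            if zipcode:
                loc_data = zipcode.raw['display_name'].split()
                code = loc_data[-3].rstrip(',')  # same positional extraction as before
        except Exception as e:
            print(f'Error geocoding {addresses_column}: {e}')

        if code:  # now this can only be True for the current row
            address_sheetName.cell(row=row_idx, column=2, value=code)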

And another suggestion: instead of printing a whole load of messages to the console, print to a log file that you can look at afterwards. You need only print to the console messages that report its progress.
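
For example, a minimal sketch with the standard logging module (the log file name is just a placeholder):

import logging

logging.basicConfig(
    filename='geocoding.log',  # everything ends up in this file for later review
    level=logging.INFO,
    format='%(asctime)s %(levelname)s %(message)s',
)

# inside the loop, instead of print():
logging.info('%s Zip code is %s', addresses_column, code)
logging.warning('%s Zip code not found', addresses_column)

and keep an occasional print (say, every 100 rows) as the progress report.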

And what’s with those sleeps?

print('gathering all missing zipcodes... please wait')
time.sleep(2)
print('check done...')

It’s not gathering zipcodes at that point; it’s just sleeping.

                code = loc_data[-3].rstrip(',')  # index position of the zipcode in the location list
                print(f'{addresses_column} Zip code is {code}')
                address_sheetName.cell(row=row_idx, column=2, value=code) # if zipcode is found
            else:
                print(f'{addresses_column} Zip code not found')
                address_sheetName.cell(row=row_idx, column=2, value='NOT FOUND')  #if no zip code is found
                addresses_not_found.append(addresses_column)

Nice… totally makes sense. I just corrected it and tested on a small sample… I actually didn't realize I was getting wrong results.
Aha, for the logs, I usually implement them, but for quick testing I just print to the terminal; I don't know why, I'm just used to it.
The sleeps are just to make it friendly and provide a good user experience.
I was thinking, since the script doesn't save until it completes execution… why not save on each iteration, in case an interruption occurs? Above 50k rows is a lot, what do you think?

Saving on each iteration seems a bit excessive to me. Also, if it was interrupted, you'd restart it, wouldn't you? And then it would be re-doing all the ones it has already done, wouldn't it? So you wouldn't gain much!

If you do want it to be restartable like that, you could save it periodically (every n rows, for a value of n to be determined) and also write out where it has got to. Then, if it’s run again (restarted), it could check to see where it has got to and just continue from there.
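
A rough sketch of that, untested, with placeholder file names:

import os

CHECKPOINT = 'progress.txt'
SAVE_EVERY = 100  # the value of n, to be determined

start_row = 2
if os.path.exists(CHECKPOINT):
    with open(CHECKPOINT) as f:
        start_row = int(f.read()) + 1  # resume from the row after the last one saved

for row_idx, row in enumerate(address_sheetName.iter_rows(min_row=start_row, values_only=True), start=start_row):
    ...  # geocode and write the zipcode back, as before
    if row_idx % SAVE_EVERY == 0:
        address_workbook.save('m.xlsx')
        with open(CHECKPOINT, 'w') as f:
            f.write(str(row_idx))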

But, first, see whether it’s a problem. If processing that many rows doesn’t take too long and/or the chance of interruption is low, it might not be worth it.

Aha, yeah, it can knock out a good number of records if connectivity is fast and reliable… I also have measures for the last row in place: the last row to be processed is stored in a config file, so that the next time the script runs it will check for the presence of the config file, with open config ..., and resume from there.
The challenge is the number of requests being made; I think it's too many requests from a single IP… yeah, definitely I'll pause every n records, save, then continue.

Hi @kyle!

I’m not very familiar with geopy, but I noticed you are using the Nominatim geocoder. Nominatim has a very detailed usage policy: Nominatim Usage Policy (aka Geocoding Policy)

I see you are setting a user agent, which is good. I would also recommend rate limiting your requests per second and checking if caching would be helpful for your data.

Rate Limiting

Rate limiting your requests will make your total script runtime longer, but should greatly reduce the likelihood of your requests being blocked. Using this together with saving your script's progress, like you and @MRAB talked about, will make the slowness much more bearable, since the script won't have to redo work when it fails.

It would be very easy to add simple rate limiting to your script since everything is in a single loop. I would add a time.sleep(1) immediately after any geolocator.geocode(addresses_column) call, since the Nominatim API call happens there. Since the call itself will take some amount of time, adding 1 second to that time should make your requests per second < 1.
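
Alternatively, geopy itself ships a helper for exactly this, geopy.extra.rate_limiter.RateLimiter, which wraps the geocode call and enforces a minimum delay between calls (check the geopy docs for the details, since I'm not very familiar with the library):

from geopy.extra.rate_limiter import RateLimiter

# wrap the geocoder once, near where geolocator is created
geocode = RateLimiter(geolocator.geocode, min_delay_seconds=1)

# then call geocode(addresses_column) wherever the script currently
# calls geolocator.geocode(addresses_column)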

If you really need to go faster than this, especially if this script is something you will need to run many times on different data sets, I would recommend looking at another geocoding service that offers paid usage. By paying you support the service and can make many more requests.

Caching / Deduplication

If there is any duplication in the addresses_column of your Excel sheet, then you will be making the same request more than once, which wastes your free API requests. You can avoid this by either deduplicating your records or caching responses. Neither will help if you do not have any duplicate addresses.

Deduplication

You can deduplicate your data as a separate step before doing your geocoding, even as an entirely separate script if that helps keep things simple. It is easy to deduplicate strings by using a set. I did not test this example:

addresses_seen = set()
rows_to_keep = list()
for row in address_sheetName.iter_rows(min_row=2, values_only=True):
    addresses_column = row[0] 
    if addresses_column is None:
        # this row doesn't have an address column, so we can't deduplicate it; always keep
        rows_to_keep.append(row)
    elif addresses_column not in addresses_seen:
        # we haven't seen this address before, so keep this row and drop any later rows with the same address
        rows_to_keep.append(row)
        addresses_seen.add(addresses_column)
    else:
        # we've seen this address previously, skip it
        pass

# TODO: write rows_to_keep to a sheet

Note that this is not checking whether other columns in a row are different, just the addresses, so this example might not be best for your situation if there is other data in each row that you really need to keep.

Caching

This is just storing each response you get so that you don't make the same API request twice. In some ways this is simpler than ahead-of-time deduplication, especially if you need to preserve things from each row other than the address.

# make a data structure to store previous results
geocode_results = dict()

# ...

# before making a request, see if you already have a result for the same address:
if addresses_column in geocode_results:
    response = geocode_results[addresses_column]
else:
    # rate limiting here since we're actually making a request
    time.sleep(1) 
    response = geolocator.geocode(addresses_column)
    geocode_results[addresses_column] = response

# Now use "response" ...
loc_data = response.raw['display_name'].split()

# ...

You can also cache responses for multiple runs of the script by saving and loading to/from a file. That is easier if you store address: zipcode pairs instead of the whole response object, and I can make an example of that if that would be helpful.
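
A minimal sketch of that, untested, assuming a JSON file named zip_cache.json and that you only keep the zipcode string per address:

import json
import os

CACHE_FILE = 'zip_cache.json'

# load the cache left by a previous run, if any
if os.path.exists(CACHE_FILE):
    with open(CACHE_FILE) as f:
        zip_cache = json.load(f)  # {address: zipcode}
else:
    zip_cache = {}

# inside the loop: check zip_cache[addresses_column] before geocoding,
# and do zip_cache[addresses_column] = code after a successful lookup

# at the end (or every n rows), persist the cache
with open(CACHE_FILE, 'w') as f:
    json.dump(zip_cache, f)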

Hi Matt… this is nice… it gives me a different perspective; this will actually take care of a bunch. Funny, I was processing all the rows; I think I was in a rush to create a workable solution. I just kept track of the addresses not found, appending them to not_found_list… I'll definitely initialize the addresses_found and store the processed rows… yeah, with rate limiting added, the code will be optimized.
Umm, the last part: you're saying the script should first check if the zip code of an address has been stored locally and use it before making a request… so that only newly seen addresses trigger requests… this will actually do the trick, since all rows need to be processed, duplicates or not, and their zipcodes written… yeah, no need to check other columns, all the needed data is in the specified column… we'll write to a txt or Excel file… a txt might be better since we are only storing the code part.
PS: sometimes you don't get the time to read the whole documentation and delve deeper, just the time to get the hang of it and do the work at hand… if I did the former I'd miss out on gigs lol… so forums help me see things from a different perspective, and most of the time I get informative insights and know where to look and delve deeper, which is much more efficient for me… cheers