Assistance with loop only hitting 'else' statement

Hey all,

I am working on a script that will:

1) take comma-separated user input of one or more URLs and store it in an array;
2a) iterate through a .CSV file containing masked IP addresses and associated system owners, try to find each array value in it, and, if found, print that row, which includes an IP address;
2b) if the value is not found in the .CSV, use the socket library to query the array value (again, a URL) for its IP address and print that instead;
3) repeat the steps above with the next value from the array until the whole array has been processed.

I am running into an issue with my if/else statement: even though the value in the array exists in the .CSV file, the script always hits my else statement and attempts 2b. If I remove the else, however, the script finds the same information and performs 2a, so long as the data is there.

This may have an easy fix that I am missing, since I am still new to the language, but I cannot figure it out and thought I would ask for help. Below are the details:

sample input data:

thisurl-doesnot-exist.domain.com, thisurl-2-doesnot-exist.domain.com, url3.thatdoesnotexist.domain.com, url4.domain.com

sample .CSV file configuration (actual file has 30,000+ lines)

ConfigRef,HostNameRef,MaskedURL,UnmaskedIPAddr
customer101-prod,imnotreal.domain.com,imnotreal-prod.domain.com,100.100.100.100
customer101-dev,imnotreal.domain.com,imnotreal-dev.domain.com,100.100.100.200
customer102-prod,thisurl-doesnot-exist.domain.com,thisurl-doesnot-exist.domain.com,100.100.200.100
customer102-dev,thisurl-2-doesnot-exist.domain.com,thisurl-2-doesnot-exist.domain.com,100.100.200.200

My script is below:

import re
import socket
from urllib.parse import urlparse
from datetime import datetime
defined_url = input("What is the URL being queried for the mapping?  DO NOT include the scheme (e.g., http://), DO NOT include a port, DO NOT include a path (e.g., /somepath/file.php), DO NOT include parameters (e.g., ?param1=foo&param2=bar), and DO NOT include an anchor (e.g., #foobarinthisdoc).  Enter each URL separated by a comma and do not use any spaces:  ")
print("***")
defined_url_array = []
start_time = datetime.now()
defined_url_array.append(defined_url)
for defined_urls in defined_url_array:
    url_array = defined_urls.split(",")
properties_file = r"C:\\Files\\ReferenceFiles\\Properties.csv"
index = 0
while index < len(url_array):
    with open(properties_file, "r") as file_obj:
        file_content = file_obj.read()
        url_regex = re.compile(r"prp\_[0-9]+\,(?P<config_ref>.+)\,(?P<hostname_ref>.+)\,(?P<hostname>.+),(?P<ipaddr>.+)\n")
        for url_match in url_regex.finditer(file_content):
            url_result = url_match.groupdict()
            for url in url_array:
                if url_array[index] in url_result["hostname_ref"]:
                    print(url_result)
                    index += 1
                else:
                    url_parse = urlparse(defined_url)
                    try:
                        ipaddr = socket.gethostbyname(url_parse.path)
                        print("The address was not masked behind Akamai.  The public address of", defined_url, "is", ipaddr)
                    except socket.gaierror:
                        print("The URL", url_array[index], "could not be mapped.  See if the host is active and online.")
                        continue
    index += 1
end_time = datetime.now() - start_time
print("The script took", end_time, "to run and reviewed", len(url_array), "URL(s).") 

[…]

Your if-statement is used like this, yes?

 for url in url_array:
     if url in url_result["hostname_ref"]:

Isn’t url_result["hostname_ref"] a string? Then you’re using the "in" operator with two strings, which is a substring test, e.g. "def" in "abcdefghi".

The basic approach here is to print out url and
url_result["hostname_ref"]. Maybe they are not what you expected.
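
To make that concrete, here is a small sketch of the substring behaviour, and of printing both values with repr() so that stray whitespace shows up. The values are made up for illustration (they are not from the real file). The leading space is an assumption about what splitting input like "a, b" on "," would leave behind.

```python
# `in` between two strings is a substring test, not an equality test.
print("def" in "abcdefghi")                        # True
print("url4.domain.com" in "my-url4.domain.com")   # True, although the hosts differ

# Hypothetical values: note the leading space left over from splitting "a, b" on ","
url = " url4.domain.com"
hostname_ref = "url4.domain.com"

# repr() makes otherwise invisible characters visible while debugging
print(repr(url), repr(hostname_ref))
print(url in hostname_ref)           # False because of the leading space
print(url.strip() == hostname_ref)   # True once the whitespace is removed
```

If the two printed values differ in a way like this, the comparison itself is doing what it was asked to do and the inputs need cleaning first.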

I edited the script briefly to modify the try/except and the if/else. The if statement now looks at url_array[index], which is a string, and checks whether that string occurs in the value under the key url_result["hostname_ref"]; if it does, the matching logic should run.

Printing out the two and comparing them unfortunately is not going to work as I need the script to have that if/else portion working so it prints out one or the other in a sequence. Here is an example of what I want the output to look like (length and information will vary based on input, this is just some fake info to populate the data):

* * * *
{'config_ref': 'subdomain.domain.com_prod', 'hostname_ref': 'primary.domain-2.net', 'hostname': 'subdomain.subdomain-domain.com.provider.net', 'ipaddr': '100.10.10.1'}
{'config_ref': 'subdomain.domain.com_dev', 'hostname_ref': 'www.domain-2.net', 'hostname': 'subdomain.subdomain-domain.com.provider.net', 'ipaddr': '100.10.10.1'}
{'config_ref': 'customer101-PROD', 'hostname_ref': 'property.subdomain.domain.com', 'hostname': 'path.property-information.com.provider.net', 'ipaddr': '100.20.20.1'}
The URL www.proprety3-function.domain.com could not be mapped.  See if the host is active and online.
{'config_ref': 'customer201_DEV', 'hostname_ref': 'property2.domain.com', 'hostname': 'property2.subdomain-domain.provider.net', 'ipaddr': '100.30.30.1'}
{'config_ref': 'customer102_Prod', 'hostname_ref': 'primary.domain-2.net', 'hostname': 'property.subdomain-domain.com.provider.net', 'ipaddr': '100.10.10.1'}
{'config_ref': 'customer102_Dev', 'hostname_ref': 'www.domain-2.net', 'hostname': 'property2.subdomain-domain.com.provider.net', 'ipaddr': '100.10.10.1'}
The address was not masked.  The public address of www.thisisafakedomain is 200.20.20.3
* * * *

One thing I am thinking about: with the for loop going through a file, the script is going to go line-by-line and go through that if/else and try/except for each row at a time before moving on to the next row.
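
That observation can be checked with a short sketch. The rows below are hypothetical stand-ins for the hostname_ref column, loosely based on the sample data earlier in the thread. The counters only show how often each branch runs when the if/else sits inside the row loop.

```python
# Hypothetical rows standing in for the hostname_ref column of the CSV
rows = [
    "imnotreal.domain.com",
    "imnotreal.domain.com",
    "thisurl-doesnot-exist.domain.com",
    "thisurl-2-doesnot-exist.domain.com",
]
url = "thisurl-doesnot-exist.domain.com"

if_count = 0
else_count = 0
for hostname_ref in rows:
    if url in hostname_ref:
        if_count += 1      # this row matches the URL
    else:
        else_count += 1    # runs for every row that does NOT match

print(if_count, else_count)  # 1 3
```

So even when the URL is present in the data, the else branch still runs once for every other row. With a file of 30,000+ lines that would be nearly every row.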

Is there a better way to go about doing this to get my desired result? Again, I need the user to input a list of comma-separated URLs, and that list needs to be iterated through, checked against a .CSV file to see if those URLs exist in that file and if so to pull the IP address, and if they do not exist in the .CSV file then I need the script to use urllib and socket to get the IP address.

The .CSV file in question is a daily export of a configuration that contains a list of URLs that map to their masked and unmasked IP addresses, which is why I need to reference that URL in the first place.

Would it be better to spin up a SQLite3 database and use the sqlite library to just query that instead of parsing through the entire .CSV file?

You should really be parsing the CSV file using the csv module instead of a regex.
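
A minimal sketch of what that could look like, using rows copied from the sample layout earlier in the thread. io.StringIO stands in for the real file here so the example is self-contained; with a real file you would pass the object returned by open(path, newline="") instead.

```python
import csv
import io

# Sample data in the layout shown earlier in the thread;
# io.StringIO is only a stand-in for the real file.
sample = io.StringIO(
    "ConfigRef,HostNameRef,MaskedURL,UnmaskedIPAddr\n"
    "customer101-prod,imnotreal.domain.com,imnotreal-prod.domain.com,100.100.100.100\n"
    "customer102-prod,thisurl-doesnot-exist.domain.com,thisurl-doesnot-exist.domain.com,100.100.200.100\n"
)

# DictReader uses the header row as the keys of each row dictionary
rows = list(csv.DictReader(sample))

for row in rows:
    print(row["HostNameRef"], row["UnmaskedIPAddr"])
```

Each row comes back as a dictionary keyed by column name, so no regex or manual splitting is needed.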

I tried it with the csv library and still had the same issues. Even when I created the reader object from the file with csv.DictReader, the if/else statement did not work correctly.

What would be the best way to implement csv to get this to work correctly? I had tried the following:

import csv

[...]

with open... as file_obj:
	file_obj_reader = csv.DictReader
	for row_obj in file_obj_reader:
		if url_array[index] in row_obj:
			print(row_obj)
			index += 1
		else:
			...

But I got the same results, that being that it will go through MOST of the URLs I enter as the input when the else statement is removed, and if I leave it, I run into the issue explained.

Am I missing something obvious?
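
Two things in that snippet may be worth checking, sketched below with a hypothetical one-row file that uses the header names from the sample above. csv.DictReader has to be called with the file object, and "in" applied to a row dictionary tests the keys (the column names), not the values.

```python
import csv
import io

# Hypothetical one-row file, using the header names from the sample above
sample = io.StringIO(
    "ConfigRef,HostNameRef,MaskedURL,UnmaskedIPAddr\n"
    "customer102-prod,thisurl-doesnot-exist.domain.com,thisurl-doesnot-exist.domain.com,100.100.200.100\n"
)

# csv.DictReader must be called with the file object; the bare name is just the class
reader = csv.DictReader(sample)
row = next(reader)

url = "thisurl-doesnot-exist.domain.com"
print(url in row)                  # False: `in` on a dict checks the keys
print("HostNameRef" in row)        # True: this is a key
print(url in row.values())         # True: checks the values instead
print(url == row["HostNameRef"])   # True: compares against one column
```

Comparing against a specific column, as in the last line, is usually the least surprising option.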

Try this:

import csv
import socket
from urllib.parse import urlparse
from datetime import datetime

defined_url = input("What is the URL being queried for the mapping?  DO NOT include the scheme (e.g., http://), DO NOT include a port, DO NOT include a path (e.g., /somepath/file.php), DO NOT include parameters (e.g., ?param1=foo&param2=bar), and DO NOT include an anchor (e.g., #foobarinthisdoc).  Enter each URL separated by a comma and do not use any spaces:  ")
print("***")

start_time = datetime.now()
defined_url_list = defined_url.split(",")
properties_file = r"C:\\Files\\ReferenceFiles\\Properties.csv"

with open(properties_file, "r") as file_obj:
    file_obj_reader = csv.DictReader(file_obj)
    properties_list = list(file_obj_reader)

for defined_url in defined_url_list:
    for row_obj in properties_list:
        if defined_url in row_obj["HostNameRef"]:
            print(row_obj)
        else:
            url_parse = urlparse(defined_url)

            try:
                ipaddr = socket.gethostbyname(url_parse.path)
                print("The address was not masked behind Akamai.  The public address of", defined_url, "is", ipaddr)
            except socket.gaierror:
                print("The URL", defined_url, "could not be mapped.  See if the host is active and online.")

end_time = datetime.now() - start_time
print("The script took", end_time, "to run and reviewed", len(defined_url_list), "URL(s).")

I will give this a try over the weekend or on Monday, sorry for the super delayed response, and thank you for the help! Will report back as soon as I can.

I was able to get close with that recommendation. The else is still triggering even though it should not, but I just made the script without the else statement and left it with only the if for the time being. I am going to stop using the .CSV in the near future and instead call the API that contains this information as well.
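
For anyone following along, one likely reason the else keeps firing is that it is attached to an if inside the row loop, so it runs for every row that does not match. The sketch below shows one way around that: collect the matches for a URL first, and fall back only if there were none. The names lookup, resolve and fake_resolve, and the data, are hypothetical. The resolver is passed in so the example runs without network access.

```python
def lookup(urls, rows, resolve):
    """Return the matching rows for each URL, or one fallback entry if none match."""
    results = []
    for url in urls:
        # First pass: collect every row that matches this URL
        matches = [row for row in rows if url in row["HostNameRef"]]
        if matches:
            results.extend(matches)
        else:
            # Only reached when NO row matched, so the fallback runs once per URL
            results.append({"url": url, "ipaddr": resolve(url)})
    return results


# Hypothetical data and a stand-in resolver (no network access needed here)
rows = [
    {"HostNameRef": "imnotreal.domain.com", "UnmaskedIPAddr": "100.100.100.100"},
    {"HostNameRef": "thisurl-doesnot-exist.domain.com", "UnmaskedIPAddr": "100.100.200.100"},
]
fake_resolve = lambda host: "203.0.113.7"

found = lookup(["thisurl-doesnot-exist.domain.com", "url4.domain.com"], rows, fake_resolve)
for item in found:
    print(item)
```

In a real script, resolve could be a small function that wraps socket.gethostbyname and catches socket.gaierror.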

Thanks for the help!

Final update for anyone who finds this thread: two colleagues of mine found a solution to this, below is the logic:

# start of script
# import our libraries needed to run this script
import re # imports re to use regular expressions in the script
import socket # imports socket to be able to do things like nslookup
from urllib.parse import urlparse # imports urllib to parse urls for their components
from datetime import datetime # imports datetime to be used for script run time
# 
defined_url = input("What is the URL being queried for the mapping?  DO NOT include the scheme (e.g., http://), DO NOT include a port, DO NOT include a path (e.g., /somepath/file.php), DO NOT include parameters (e.g., ?param1=foo&param2=bar), and DO NOT include an anchor (e.g., #foobarinthisdoc).  Enter each URL separated by a comma and do not use any spaces:  ")
print("***")
defined_url_array = []
start_time = datetime.now()
defined_url_array.append(defined_url)
for defined_urls in defined_url_array:
    url_array = defined_urls.split(",")
properties_file = "ENTER YOUR FILE PATH HERE"
#
with open(properties_file, "r") as file_obj:
    file_content = file_obj.read()
    url_regex = re.compile(r"prp\_[0-9]+\,(?P<config_ref>.+)\,(?P<hostname_ref>.+)\,(?P<hostname>.+),(?P<ipaddr>.+)\n")
    url_found = []
    for url_match in url_regex.finditer(file_content):
        url_result = url_match.groupdict()
        for url in url_array:
            if url in url_result["hostname_ref"]:
                print(url_result)
                if url not in url_found:
                    url_found.append(url)
    for url in url_array:
        if url not in url_found:
            hostname_parse = urlparse(url)
            try:
                ipaddr = socket.gethostbyname(hostname_parse.path)
                mapping_dictionary = {"url":url,"hostname":hostname_parse.path,"ipaddr":ipaddr}
                print(mapping_dictionary)
            except socket.gaierror:
                print("Unable to map", hostname_parse.path, "moving on...")
end_time = datetime.now() - start_time
print("The script took", end_time, "to run and reviewed", len(url_array), "URL(s).")