Monitor and notify when servers are down

Hoping someone can help me with this. I have a PowerShell script I use to ping servers and email me if they are down. The script runs every 15 minutes, and if it finds a server that is down it records it in a text file. If the server is still down at the next 15-minute check, it won’t email me. It starts all over again after one hour. I want to replace the PS script with a Python script. Here is what I created using Python, which works fine, but how do I tell it not to notify me about a server if that server was down the last time it checked, but to notify me if it’s been down for an hour?

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import os

server_account = 'MyEmail@outlook.com'
server_password = 'MyPassword'

servers = ('Server1', 'Server2', 'Server3', 'Server4')

for server in servers:
    respond = os.system("ping -n 1 " + server + " > nul")
    if respond == 1:
        msg = MIMEMultipart()
        msg["Subject"] = server + " is down"
        msg["From"] = "MyEmail@outlook.com"
        msg["To"] = "email1@outlook.com"
        body = MIMEText("Server " + server + " is down")
        msg.attach(body)
        smtp = smtplib.SMTP('smtp.office365.com',587)
        smtp.ehlo()
        smtp.starttls()
        smtp.login(server_account, server_password)
        smtp.sendmail(msg["From"], msg["To"].split(","), msg.as_string())
        smtp.quit()

You could have a JSON file that stores when each server was last checked and what its status was then.

Use the json module to read the file, do the checks and update the statuses, and then write the file back out.
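A minimal sketch of that idea, assuming a state file called `server_state.json` and helper names (`load_state`, `save_state`, `update`) that are made up for illustration. It treats "first seen down" as worth an immediate email and then repeats once per hour; adjust if your rule differs.

```python
import json
import os
import time

STATE_FILE = "server_state.json"  # hypothetical file name
RENOTIFY_SECONDS = 3600           # repeat the alert once per hour


def load_state(path=STATE_FILE):
    """Return the saved state, or an empty dict if no file exists yet."""
    if not os.path.isfile(path):
        return {}
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def save_state(state, path=STATE_FILE):
    """Write the state back out for the next scheduled run."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(state, f, indent=2)


def update(state, server, is_down, now=None):
    """Update one server's entry; return True if an email is due."""
    now = time.time() if now is None else now
    if not is_down:
        state.pop(server, None)  # server is back up: forget it
        return False
    entry = state.get(server)
    if entry is None:
        # First time seen down: record it and notify immediately.
        state[server] = {"down_since": now, "last_notified": now}
        return True
    if now - entry["last_notified"] >= RENOTIFY_SECONDS:
        entry["last_notified"] = now
        return True
    return False
```

Each scheduled run would call `load_state()`, call `update()` once per server with the ping result, send mail where it returns `True`, and finish with `save_state()`. Because the scheduler may start a few seconds early or late, a slightly smaller threshold than 3600 may be needed in practice.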


Hi,

can you please elaborate on these two requirements:

  1. Notify me about a server if that server was down the last time it checked.

    • This implies two check confirmations in a row that the server was down.
    • if so, send owner an email notification
    • if server is checked and it is not down, reset server fault counter
  2. Notify me if it’s been down for an hour.

    • This implies four check confirmations in a row that the server was down.
    • if so, send owner an email notification again
    • reset server fault counter

Summary:
Only send an email notification if fault counter is either 2 or 4 and not 3.

Is this about right? If not, please clarify.

Sorry @onePythonUser , I didn’t make that very clear. Let’s say Server1 is down, the Windows Scheduler runs the py script every 15 minutes so at the next check it emails me that Server1 is down. Now after another 15 minutes the Windows Scheduler runs the py script again and Server1 is still down. I don’t want it to email me. I want it to wait until Server1 has been down for one hour before it notifies me again. After 1hr 15min it checks again, I don’t want it to notify me, but after 2hrs I do want it to notify me if Server1 is still down.
So basically, if Server1 is down I want the initial email notification, but after that I only want to get notified every hour.

Is this about right? Send the first email 15 minutes after detecting a server fault, the second email at the one-hour mark, and thereafter one email every hour. See sketch:

[Sketch: server_fault_emails]

The 45 minute mark is not shown but is implied since it checks every fifteen minutes.

Yes, that is correct. Well done on the timeline!
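For anyone following along, the agreed timeline can be written as a small pure function. This is only a sketch: the name `should_notify` is made up, and it assumes the script runs exactly every 15 minutes, with check 1 being the run that first finds the server down.

```python
CHECK_INTERVAL_MIN = 15  # assumed scheduler interval in minutes


def should_notify(down_checks):
    """Decide whether to email on the Nth consecutive failed check (N >= 1).

    Check 1 is the first detection (minute 0), check 2 is minute 15, and so on.
    """
    if down_checks == 2:
        return True  # 15 minutes after first detection
    minutes_down = (down_checks - 1) * CHECK_INTERVAL_MIN
    # Whole hours after first detection: 60, 120, 180, ...
    return minutes_down > 0 and minutes_down % 60 == 0
```

With these assumptions, emails go out on checks 2, 5, 9, 13 and so on, which corresponds to 15 minutes, 1 hour, 2 hours and 3 hours after the first failed ping.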

Hi,

ok, here is a potential solution to your reported issue. Note that I chose a class because the variables (attributes) are persistent throughout the life of the program. They keep track of each server fault counter. There are three methods in the class to note here:

  1. server_fault_update
    - This calls the function server_fault_action. Since the same action is executed for every server, no need to repeat the same code multiple times. We can write the function once, and call it as many times as needed.

  2. reset_server_counter
    - This function resets both the server fault variable and the first email flag every time there is a server check and no fault is detected.

  3. server_fault_action
    - This function increments the server fault variable and checks if an email should be sent.
    Per the requirements, it only sends an email according to the specifications listed in the previous post and highlighted in the sketch.

I have included test code at the bottom of the script - this is NOT to be used in your application, however. I also added verification print statements throughout the test script. You can remove them if you decide to incorporate the script in your application.

Your code has the following test condition - incorporate this code as shown here

if respond == 1:
    server_faults.server_fault_update(server)  # Increments server fault counters
                                               # and checks if email should be sent
else:
    server_faults.reset_server_counter(server)  # Resets the server fault counters as well as the first email flag

Let us know if you have any questions or if it is not clear. Tailor it to your preferences if you’d like.


class ServerFaultCounter:

    # Can use a class for server fault checking since variables have persistence
    # between method calls

    def __init__(self):

        # Server fault counters - initialize counts
        self.counter1 = 0
        self.counter2 = 0
        self.counter3 = 0
        self.counter4 = 0

        # Used for email at the 15 minute mark
        self.flag1 = 0
        self.flag2 = 0
        self.flag3 = 0
        self.flag4 = 0

        self.server_email_flags = {'Server1': self.flag1, 'Server2': self.flag2,
                                   'Server3': self.flag3, 'Server4': self.flag4}

        # Associate counters to servers
        self.server_status = {'Server1': self.counter1, 'Server2': self.counter2,
                              'Server3': self.counter3, 'Server4': self.counter4}

    # Call method when 'respond' in your code is a '1' (this implies a valid downed server)
    # It will then update the corresponding server counter and check if it was down
    # the previous time it checked status.
    # method called if 'respond' = 1
    def server_fault_update(self, server):  # Pass current server being checked as argument

        # Unknown server names are ignored
        if server in self.server_status:
            self.server_fault_action(server)

    # reset counter for corresponding server
    # method called if 'respond' != 1; implies the server is online and working
    def reset_server_counter(self, server):

        if server in self.server_status:
            self.server_status[server] = 0
            self.server_email_flags[server] = 0


    def server_fault_action(self, server):

        self.server_status[server] += 1

        # Send email if 2nd consecutive server down confirmation
        if self.server_status[server] == 2 and self.server_email_flags[server] == 0:
            self.server_email_flags[server] = 1  # Set to '1' so that it doesn't enter here again
            print('Send email at the 15 minute mark.')
            """
                Add code for sending email notification here
            """

        # Thereafter send email on every other hour starting from detection of first server down
        if self.server_status[server] % 5 == 0:
            self.server_status[server] = 0  # reset server fault counter to begin new count
            print('Send an email at the hour mark.')
            """
               Add code for sending email notification here
            """

if __name__ == '__main__':

    # Test code for class / script verification
    # Not part of final application

    server_faults = ServerFaultCounter()  # Create instance
    server_list = ['Server1', 'Server2', 'Server3', 'Server4']

    for server in server_list:
        print('\nTesting server: ' + server)
        print('-'*23)
        for server_down_count in range(18):
            server_faults.server_fault_update(server)

    # Simulate email flags and counters being reset during a server power up
    for server in server_list:
        server_faults.reset_server_counter(server)

    for server in server_list:
        print('\nTesting server: ' + server)
        print('-'*23)
        for server_down_count in range(18):
            server_faults.server_fault_update(server)

I recommend first testing it as a standalone script so that you understand its behavior.


Much appreciated Paul. I’m new to Python so I’ll have to play with it a bit.

Hi,

I just realized that there is a mistake/bug in the code due to the 15 minute mark offset.

Per the email notification requirements of:

  1. Send email 15 minutes after first detecting server X going offline.
  2. Send email in multiples of hours after first detecting server going offline.

The following sketch summarizes the requirements:

[Sketch: email_confirmation_1]

From the sketch, emails are only sent on the blue tick marks.

For the correction, replace the code for the method server_fault_action with the following.

    def server_fault_action(self, server):

        self.server_status[server] += 1

        # Send email 15 minutes after first detecting server going offline.
        if self.server_status[server] == 2 and self.server_email_flags[server] == 0:
            self.server_email_flags[server] = 1  # Set to '1' so that it doesn't enter here again
            self.server_status[server] = 1
            print('Send email at the 15 minute mark.')
            """
                Add code for sending email notification here
            """

        # Send email notification in hour intervals after first detecting server going offline.
        if self.server_status[server] % 4 == 0 and self.server_email_flags[server] == 1:
            self.server_status[server] = 0
            
            print('Send an email at the hour mark.')
            """
               Add code for sending email notification here
            """

        # Option: If you don't want to send email from inside these methods, you can
        # use the 'return' statement to return a value as a flag and send it from your own code.

Note that you can add extra counter variables to keep track of how many hours have passed since first detecting the servers going offline and include these values as part of the email notification.
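As a rough illustration of both ideas (returning a value instead of sending mail inside the method, and reporting the hours down), here is a compact sketch. The class name `DownTracker` and its methods are hypothetical, and the "15 minutes" wording assumes checks run every 15 minutes, four per hour.

```python
from collections import defaultdict


class DownTracker:
    """Track consecutive failed checks per server (illustrative name)."""

    def __init__(self, checks_per_hour=4):
        self.checks_per_hour = checks_per_hour
        self.down_checks = defaultdict(int)  # server -> consecutive failures

    def record_up(self, server):
        """Server answered: forget any previous failures."""
        self.down_checks.pop(server, None)

    def record_down(self, server):
        """Return an email subject if one is due, else None."""
        self.down_checks[server] += 1
        n = self.down_checks[server]
        if n == 2:
            return f"{server} has been down 15 minutes"
        # n - 1 intervals have passed since the first failed check
        hours, remainder = divmod(n - 1, self.checks_per_hour)
        if hours >= 1 and remainder == 0:
            return f"{server} hours down: {hours}"
        return None
```

The caller decides what to do with the returned subject, for example passing it to the existing smtplib code only when it is not `None`.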

Thanks Paul. I’ll test it out.

You could just pickle a dictionary of down times for servers.

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import os
import pickle
import time

down_file = './servers_down.pkl'
server_account = 'MyEmail@outlook.com'
server_password = 'MyPassword'

servers = ('Server1', 'Server2', 'Server3', 'Server4')
down = {}
if os.path.isfile(down_file):
    with open(down_file, 'rb') as f:
        down = pickle.load(f)
else:
    # make a new down_file
    with open(down_file, 'wb') as f:
        pickle.dump(down, f)

for server in servers:
    respond = os.system("ping -n 1 " + server + " > nul")
    if respond == 1:
        if server in down:  # Already seen as down
            down_time = down[server]
            if time.time() - down_time > 3600:  # Down an hour?
                print(f"{server} has been down over an hour")  # do whatever
                msg = MIMEMultipart()
                msg["Subject"] = server + " is down"
                msg["From"] = "MyEmail@outlook.com"
                msg["To"] = "email1@outlook.com"
                body = MIMEText("Server " + server + " is down")
                msg.attach(body)
                smtp = smtplib.SMTP('smtp.office365.com', 587)
                smtp.ehlo()
                smtp.starttls()
                smtp.login(server_account, server_password)
                smtp.sendmail(msg["From"], msg["To"].split(","), msg.as_string())
                smtp.quit()
        else:
            down[server] = time.time()  # notice it is down
    else:
        if server in down:
            del down[server]  # we are up now

# save the updated dictionary for the next run
with open(down_file, 'wb') as f:
    pickle.dump(down, f)

Thanks @MensaMoron I’ll try it out!

I ran the script in VS Code, but it doesn’t return anything or email me even though I have server names that are down. (I did have to add “import time”.)
Any ideas?

  1. What if the server OS is running but the web server on that machine is down? How do you handle that?
  2. What if the server was rebooted and the firewall was not set up properly after reboot? How do you handle that? (I think this happened to me once.)
  3. Are you doing this for a tutorial or a real-world case?

Because I thought ping was responded to by the OS itself, not the web server (like apache or a python web server) software.
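If the service matters more than the machine, one option is to try a TCP connection to the port the service listens on. This is a sketch only; the function name `port_open` and the choice of port are assumptions, and a successful connection still does not prove the application behind the port is healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures
        return False
```

For a web server this might be called as `port_open('Server1', 443)`, alongside or instead of the ping.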

For log files, Python has log file modules for that. And some have a way to trim the log files to be below a certain size. Use that with some type of scheduler that fits your needs.
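For example, the standard library's `logging.handlers.RotatingFileHandler` caps the file size and keeps a few old copies. A small sketch, where the logger name, file name and size limits are arbitrary choices:

```python
import logging
from logging.handlers import RotatingFileHandler


def make_logger(path="server_monitor.log", max_bytes=1_000_000, backups=3):
    """Return a logger that rotates its file once it reaches max_bytes."""
    logger = logging.getLogger("server_monitor")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeat calls
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(message)s")
        )
        logger.addHandler(handler)
    return logger
```

A monitoring script could then call something like `make_logger().warning('Server1 is down')` instead of writing to a text file by hand.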

Also I found that the items I created in Windows scheduler on my laptop would get turned off somehow, so I no longer use it. (I know it won’t run if the laptop is sleeping or turned off. But my scheduler items were disabled but still in the scheduler.)

I’m currently running the Powershell script from a Windows 2022 server with it scheduled to run every 15 minutes from the Scheduler. It’s not checking for a web server being up or down, just pinging the server.
This script is running on a production server.

What do you mean by which works fine? Are you able to communicate with all four servers?
Are you able to send yourself an email with the code from your original post?

How can you tell if the server is down/offline? Is it by the result of this expression:

respond = os.system("ping -n 1 " + server + " > nul")

What are the potential outcomes/values from this expression?
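For reference, `os.system` returns whatever the shell reports for the command: on Windows that is the command's exit code, while on Unix it is an encoded wait status. One way to look at the exit code directly is `subprocess.run`. The helper names below are made up for this sketch, and it is probably safer to treat any non-zero value as a failure than to test for exactly 1.

```python
import platform
import subprocess


def ping_command(host, count=1):
    """Build the ping argument list for the current platform."""
    # Windows ping uses -n for the packet count; most other systems use -c
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    return ["ping", count_flag, str(count), host]


def ping_returncode(host):
    """Run ping quietly and return its exit code (0 normally means a reply)."""
    result = subprocess.run(
        ping_command(host),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode
```

Printing `ping_returncode('Server1')` for a server that is up and for one that is down shows which values your system actually produces.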

In your original email scripts, for the following line:

msg.attach(body)

body is not defined. Is this defined or added at your end? If not, you have to assign it a value otherwise an exception will be generated - for missing an assigned value.

Please clarify.

@onePythonUser I haven’t tried your solution yet, but here are the answers to your questions.

“Here is what I created using Python which works fine” was from my original post. That script only sends out a test email it doesn’t verify if a server is up/down.
“It’s not checking for a web server being up or down, just pinging the server.” Yes, it’s using the ping command.
“msg.attach(body)” is defined in previous line with:
body = MIMEText("Server " + server + " is down")

Yup, I missed it.

Here is a potential solution to your issue. Review and become familiar with it.

This script is an extension of the prior submission. I have incorporated your code for the email notification as a class method and the ping checking code as part of the main loop. I have also added an extra set of variables that keep track of the number of hours that the servers have been down and send those values as part of the Subject heading for easy reference.

I have added a second class, ServerPing. It inherits from the Server class, where the bulk of the work is performed. The ServerPing class serves as the keeper of the main loop.

I have incorporated the time module to add a delay between server pings. The value in seconds is 900. There are 900 seconds in 15 minutes hence the value.

Test it and see if it addresses your issue or modify it to your needs:

import smtplib
from email.mime.text import MIMEText
from email.mime.multipart import MIMEMultipart
import os
import time

server_account = 'MyEmail@outlook.com'
server_password = 'MyPassword'

servers = ('Server1', 'Server2', 'Server3', 'Server4')


class Server:

    # Can use a class for server fault checking since variables have persistence
    # between method calls

    def __init__(self):

        # Server fault counters - initialize counts
        self.counter1 = 0
        self.counter2 = 0
        self.counter3 = 0
        self.counter4 = 0

        # email notification type flag (15 min or hourly)
        self.flag1 = 0
        self.flag2 = 0
        self.flag3 = 0
        self.flag4 = 0

        # Counts the hours that server 'X' has been down
        self.hours_down1 = 0
        self.hours_down2 = 0
        self.hours_down3 = 0
        self.hours_down4 = 0

        # Associate flags to servers
        self.server_email_flags = {'Server1': self.flag1, 'Server2': self.flag2,
                                   'Server3': self.flag3, 'Server4': self.flag4}

        # Associate counters to servers
        self.server_status = {'Server1': self.counter1, 'Server2': self.counter2,
                              'Server3': self.counter3, 'Server4': self.counter4}

        # Associate hourly fault counters to servers
        self.server_hours_down = {'Server1': self.hours_down1, 'Server2': self.hours_down2,
                                  'Server3': self.hours_down3, 'Server4': self.hours_down4}

    # Call method if respond = 1 # This implies a server is down

    def server_fault_update(self, server):  # Pass current server being checked as argument

        self.server_fault_counter(server)

    # reset counter for corresponding server
    # method called if 'respond' != 1; implies the server is back online and working
    def reset_server_counter(self, server):

        self.server_status[server] = 0
        self.server_email_flags[server] = 0
        self.server_hours_down[server] = 0

    def server_fault_counter(self, server):

        self.server_status[server] += 1

        # Send email 15 minutes after first detecting server going offline.
        if self.server_status[server] == 2 and self.server_email_flags[server] == 0:
            
            self.server_email_flags[server] = 1  # Set to '1' so that it doesn't enter here again
            self.server_status[server] = 1

            self.email_notification(server, 0, False)

        # Send email notification in hour intervals after first detecting server going offline.
        if self.server_status[server] % 4 == 0 and self.server_email_flags[server] == 1:

            self.server_status[server] = 0
            self.server_hours_down[server] += 1

            self.email_notification(server, 1, self.server_hours_down[server])

    def email_notification(self, server, flag, hours_down):

        msg = MIMEMultipart()

        if flag == 0:
            msg["Subject"] = server + " has been down 15 minutes."

        else:
            msg["Subject"] = server + " hours down: " + str(hours_down)

        msg["From"] = "MyEmail@outlook.com"
        msg["To"] = "email1@outlook.com"
        body = MIMEText("Server " + server + " is down")
        msg.attach(body)
        smtp = smtplib.SMTP('smtp.office365.com', 587)
        smtp.ehlo()
        smtp.starttls()
        smtp.login(server_account, server_password)
        smtp.sendmail(msg["From"], msg["To"].split(","), msg.as_string())
        smtp.quit()


class ServerPing(Server):

    def __init__(self):

        super().__init__()

    def main(self):

        try:

            while True:

                for server in servers:

                    respond = os.system("ping -n 1 " + server + " > nul")

                    if respond == 1:

                        # Increment server offline counter and send notification email if necessary
                        self.server_fault_counter(server)

                    else:

                        # Reset counters and flags
                        self.reset_server_counter(server)

                time.sleep(900)  # Wait 15 minutes (900 seconds)

        except KeyboardInterrupt:

            print('\nExiting server pinging process..')


if __name__ == '__main__':
    server_pinging = ServerPing()
    server_pinging.main()

Thanks Paul. I’ll test it out.