Recently, during a community event, an engineer asked me why SCOM didn’t notify him when SCOM itself was down.
My first response was very similar to the response of my favorite captain below:
But this got me thinking, because the engineer made a good point: to have full monitoring coverage you need another mechanism in place to monitor the monitoring system. Most companies still have a legacy monitoring system that can be leveraged to monitor the SCOM servers, but let’s face it: keeping another monitoring system alive just to monitor the SCOM servers adds complexity to your environment for only a small benefit.
That’s why I started building a small independent check with PowerShell. In part 1 of this series I’ll go over how to monitor whether your management servers are still up and running.
To do this we need a watcher node that is able to ping the management servers. This watcher node can be any machine capable of running PowerShell and does not need to have the OperationsManager PowerShell module available. This makes sure we are operating completely independently of SCOM.
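Before scheduling anything, it is worth checking by hand that the watcher node can actually reach a management server. A minimal sketch, assuming a management server called SCOMMS1 (a placeholder name, replace it with one of your own servers):

```powershell
# Placeholder name: replace with one of your own management servers
$server = "SCOMMS1"

# -Quiet returns $true or $false instead of the individual ping replies
$reachable = Test-Connection -ComputerName $server -Count 2 -Quiet

"{0} reachable from watcher node: {1}" -f $server, $reachable
```

If this reports False for a server you know is running, check name resolution and whether ICMP is allowed through the firewalls between the watcher node and the management server.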
The graph below shows the process used:
In my environment I have 2 management servers which are reachable from the watcher node. The first step is to dynamically determine which management servers are in my environment. To do this I create an input file which is generated by PowerShell on a management server and updated once a day. This is an automated process because, let’s face it: if we have to remember to change the infile.txt whenever we add or remove a management server, we will forget.
This file will be available on the watcher node to do the ping commands even when the management servers are down.
(this is action 1 in the graph above)
To generate the infile containing all the management servers which are currently in our environment we need to execute the following PowerShell script on a management server:
[xml]
#=====================================================================================================
# AUTHOR: Dieter Wijckmans
# DATE: 03/12/2014
# Name: Readms.PS1
# Version: 1.0
# COMMENT: This script will read out all the Management servers in a management group and saves it
# into a txt file which is used to ping the servers from an external watcher node.
# This script is scheduled on a management server via scheduled tasks.
# Make sure to fill in your destination (which is your watcher node) in the variable
#
# Usage: readms.PS1
# Example:
#=====================================================================================================
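# Load the SCOM cmdlets; a scheduled task does not start in the Operations Manager Shell
Import-Module OperationsManager
$destination = "fill in the destination on the watchernode here"
# Write one management server name per line, overwriting the previous infile
$ms = Get-SCOMManagementServer
$ms | ForEach-Object { $_.DisplayName } | Out-File $destination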
[/xml]
Schedule this script on the management server via scheduled tasks and run it once a day.
The program to run is: powershell.exe c:\scripts\readms.ps1
This will generate the infile for the ping command to check the management servers and will place it on the watcher node.
(this is action 2 in the graph above)
Next up is to configure the watcher node to monitor our management servers and alert when they are unreachable. This is done by running the following PowerShell script on a regular basis through scheduled tasks. I schedule this task every 5 minutes, which means that you get a mail every 5 minutes until the issue is resolved. Better to annoy a little bit more than to send just 1 mail which drowns in the mail volume.
[xml]
#=====================================================================================================
# AUTHOR: Dieter Wijckmans
# DATE: 03/12/2014
# Name: Pingtest.PS1
# Version: 1.0
# COMMENT: This script will ping all the Management servers in a management group according to the
# input file and escalate when a server is not reachable.
# Make sure to fill in all the parameters in the parameter section.
# This script is scheduled on the watcher node via a scheduled task.
#
# Usage: pingtest.PS1
# Example:
#=====================================================================================================
#parameter section: Fill in all the parameters below
$infile = "Location of file with management servers listed"
$outfile = "Location of file which will keep historical data on the pings"
$smtp = "fill in your smtp config to send mail"
$to = "The destination email address"
$from = "The from email address"
#reading the date when the test is executed for logging in the historical file
$testexecuted = Get-Date
#reading in all the objects listed in the infile
$objects = get-content $infile
#running through all the objects and taking action accordingly
foreach ($object in $objects)
{
$pingresult = Test-Connection $object -quiet
if ($pingresult -eq $True)
{
$pingresult = "Online"
}
else
{
$pingresult = "Offline"
$subject = "SCOM: Management Server " + $object + " is down!"
$body = "<b><font color=red>ATTENTION SCOM support staff:</font></b> <br>"
$body += "Management Server: " + $object + " is down! Please check the server!"
send-MailMessage -SmtpServer $smtp -To $to -From $from -Subject $subject -Body $body -BodyAsHtml -Priority high
}
#append the result for this server to the historical file
$object + " :ping result: " + $pingresult + " :" + $testexecuted | Out-File $outfile -Append
}
#read the number of lines in the infile and check the same number of most recent lines in the outfile
#to determine whether all management servers are down (Get-Content -Tail requires PowerShell 3.0 or later).
$filelength= Get-content $infile | measure-object -Line
$numberoflines = $filelength.Lines
$file = Get-Content $outfile -Tail $numberoflines
$wordToFind = "Online"
$containsWord = $file | %{$_ -match $wordToFind}
If($containsWord -notcontains $True)
{
$subject = "SCOM: ALL Management Servers are down!"
$body = "<b><font color=red>ATTENTION SCOM support staff:</font></b> <br>"
$body += "All Management servers are down. Please take immediate action"
send-MailMessage -SmtpServer $smtp -To $to -From $from -Subject $subject -Body $body -BodyAsHtml -Priority high
}
[/xml]
Note: Make sure that you change all the parameters in the parameter section.
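Both scheduled tasks described in this post can also be registered from an elevated prompt instead of through the Task Scheduler UI. The following is only a sketch: the task names, the start time, the service account and the path c:\scripts\pingtest.ps1 are assumptions, so adjust them to your environment.

```powershell
# On the management server: regenerate the infile once a day.
# Task name, start time and account are placeholders; /RP * prompts for the password.
schtasks.exe /Create /TN "SCOM-ReadMS" /SC DAILY /ST 02:00 `
    /TR "powershell.exe -NoProfile -File c:\scripts\readms.ps1" `
    /RU "CONTOSO\svc-scomcheck" /RP *

# On the watcher node: run the ping test every 5 minutes.
# The script path is an assumption; point it at wherever you saved Pingtest.PS1.
schtasks.exe /Create /TN "SCOM-PingTest" /SC MINUTE /MO 5 `
    /TR "powershell.exe -NoProfile -File c:\scripts\pingtest.ps1" `
    /RU "CONTOSO\svc-scomcheck" /RP *
```

The account used on the management server needs to be able to read the management group and to write to the destination on the watcher node.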
This script pings all the machines listed in the infile we created earlier and writes the results to the outfile. A mail is automatically sent for every management server that does not respond. The most recent entries in the outfile are then evaluated, and if ALL management servers are down a separate mail is sent to notify that SCOM is completely down.
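The "all down" check at the end of the script comes down to one question: does any of the entries from the latest run contain the word Online? A small sketch of that logic in isolation, using a hypothetical helper function Test-AllServersOffline and made-up server names (neither is part of the scripts above):

```powershell
# Hypothetical helper: returns $true when none of the given log lines reports "Online".
function Test-AllServersOffline {
    param(
        [string[]]$LogLines   # the most recent entries, one per management server
    )
    # -match against an array returns only the elements that match
    $online = @($LogLines -match "Online")
    return ($online.Count -eq 0)
}

# Made-up entries in the format written by Pingtest.PS1 (timestamps left out)
$oneDown = @(
    "SCOMMS1 :ping result: Online",
    "SCOMMS2 :ping result: Offline"
)
$allDown = @(
    "SCOMMS1 :ping result: Offline",
    "SCOMMS2 :ping result: Offline"
)

Test-AllServersOffline -LogLines $oneDown   # one server still answers, so no "ALL down" mail
Test-AllServersOffline -LogLines $allDown   # nobody answers, so the "ALL down" mail is due
```

This is why the script reads exactly as many lines from the end of the outfile as there are servers in the infile: only the entries of the latest run should take part in the decision.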
You can change the mail appearance in the $body fields in the PowerShell.
The outfile will have the following entries:
My servers were offline last night at 21:13:38, so the mailing was triggered. The mail looks like this when SCOMMS2 is down:
When all management servers are down it will look like this:
So now we get mails, completely independent of SCOM, telling us there’s an issue with the SCOM management servers.
You can find the PowerShell scripts and the files here on Technet Gallery:
In Part 2 I’ll go over how to monitor the SQL connection of your management servers.