SCOM: Disk space monitoring extension pack
In the constant quest to keep your environment running, disk space is one of the resources that must stay available to satisfy your organization’s continuously growing hunger for storage.
The price of storage has dropped significantly over the last few years, but unfortunately the demand for storage has grown just as fast: files are getting bigger and more and more data is kept.
SCOM has used different approaches over the years to make sure you are properly alerted when disk space is running low. In this post I will show you my method of keeping an eye on the available disk space. This is, however, my point of view and, as usual, open for discussion.
I started this blog post because of a case I received from one of my customers:
- Disks should be monitored on both free MB left AND % free space left.
- SCOM only needs to react when BOTH thresholds are breached.
- Different thresholds apply to critical and non-critical servers.
- A different kind of ticket needs to be created for critical and non-critical servers.
- Both a warning and a critical alert should be sent out: one to warn up front and another when things get serious.
- A new ticket should be created every day for as long as the condition has not been resolved.
My initial response was: great, let’s bring in Orchestrator to handle a good part of the logic. The answer was, as predicted, no.
OK, so let’s break this up into the different categories.
Note: I have already created a management pack for this scenario, but I explain the scenario thoroughly so you can use this guide for other monitoring scenarios as well.
Download the mp from the gallery:
We are in luck, because SCOM already has the ability to monitor on both conditions mentioned above (free MB left AND % free space). This was the case in the logical disk monitor and it is still present today, BUT (yep, there will be a lot of BUTs in this post) this is not the case for the Cluster and Cluster Shared Volumes (CSV) monitors. They use the newer kind of disk space monitoring, where the former single monitor with double thresholds is split into 2 separate monitors with a rollup monitor on top. In my opinion a good decision.
So at this point we can use the same method for all kinds of disks: 2 monitors with 1 rollup monitor on top. GREAT.
So let’s start configuring them! Fill in all the different thresholds and you are good to go, right?
In theory yes… but in this case not quite. One of the big hurdles is the fact that a monitor can only fire off one notification as long as it has not been reset to healthy. As we need a notification on both warning and error, we have an issue here. By design, the notification process only sends you an alert once per monitor, for either warning or error.
Because we need to have a warning AND error we need to create additional monitors to cope with this requirement.
This is in fact how I tackled this issue.
Creating the necessary monitors.
To be able to act on both thresholds we need to create 3 monitors: a rollup monitor, a Free Space Monitor (%) and a Free Space Monitor (MB), just like the set that ships out of the box.
So let’s get at it:
Note: I’m using the console to quickly create the management pack, to show that you can solve this issue with a minimum of authoring knowledge. I do, however, advise you to dig deeper into the different authoring solutions for SCOM.
Note: All the necessary monitors are already in the management pack which I included in this post. I only describe the process here so you can use the same method for another scenario.
Create the Rollup monitor
A rollup monitor does not check a condition itself but reacts to the state of the monitors beneath it. Therefore we have to create it first. To make sure it shows up right under the other monitors we keep the same naming but add the word WARNING at the end.
Open the monitor tab and choose to create a monitor => Aggregate Rollup Monitor…
Fill in the name of the monitor
In this case we want the best state of any member to roll up, because we want both MB free AND % free to be breached, and thus both monitors in a warning state, before we are alerted:
We would like to have an alert when there’s a warning on both monitors underneath this monitor, so we change the severity to Warning.
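To illustrate why "best state of any member" behaves like an AND over the two conditions, here is a minimal Python sketch. It is not SCOM code: the names `HealthState`, `threshold_state`, `best_state_rollup` and `warning_rollup`, as well as the thresholds of 2000 MB and 10 %, are made up for the example.

```python
from enum import IntEnum


class HealthState(IntEnum):
    """Health states ordered from best (lowest) to worst (highest)."""
    HEALTHY = 0
    WARNING = 1
    CRITICAL = 2


def threshold_state(value: float, threshold: float, breached: HealthState) -> HealthState:
    """Unit monitor: report `breached` when the value drops below the threshold."""
    return breached if value < threshold else HealthState.HEALTHY


def best_state_rollup(member_states: list[HealthState]) -> HealthState:
    """Rollup monitor: take the best (healthiest) state of any member."""
    return min(member_states)


def warning_rollup(free_mb: float, free_pct: float,
                   mb_threshold: float = 2000, pct_threshold: float = 10) -> HealthState:
    """Hypothetical thresholds; the rollup only turns WARNING when BOTH are breached."""
    members = [
        threshold_state(free_mb, mb_threshold, HealthState.WARNING),
        threshold_state(free_pct, pct_threshold, HealthState.WARNING),
    ]
    return best_state_rollup(members)


# Only the MB threshold is breached: one member is still healthy, so is the rollup.
print(warning_rollup(free_mb=1500, free_pct=25).name)
# Both thresholds are breached: the best state of any member is now WARNING.
print(warning_rollup(free_mb=1500, free_pct=5).name)
```

As long as one of the two monitors is still healthy, the best state is healthy and no alert is raised. A "worst state" rollup would instead alert as soon as either threshold is breached, which is not what the customer asked for.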
Create the monitors underneath this rollup monitor
To make sure our new rollup monitor is correctly influenced by the monitors underneath it, we now need to create the monitors with the MB free and % free conditions.
These are included in the management pack as well. Keep in mind that when you create each monitor you need to select the appropriate rollup monitor as its parent, as shown below:
For the performance counter in this case I used these parameters:
Counter: % Free Space
NOTE: Make sure to turn off alerting on these monitors, as we do not want to receive individual alerts but just the alert of the rollup monitor.
If you have created the monitors correctly it should look like this:
As you can see, the new monitors are now shown right beneath the out-of-the-box monitors.
You can use this approach in basically any scenario where you need to create two tickets for the same issue when both are caused by the same three-state monitor.
Last important step in configuring the monitors
Now that the warning condition is covered by the new monitors with the appropriate thresholds, we need to do the same for the out-of-the-box monitors, so that they only raise an alert when both critical conditions are met.
Therefore we need to override them with the proper thresholds and configuration:
For the rollup monitor we want to make sure it generates an alert when both critical conditions are met, so we set the following override to true:
- Generates alert
For the alerting part we only want to be alerted on the critical state, because otherwise the 2 sets of monitors will interfere with each other. Therefore we set Alert on State to “critical health state”. Last but not least, the rollup algorithm needs to be best health state of any member because, again, we only want to be notified when both conditions are met.
The 2 monitors under the aggregate rollup monitor also need to be overridden with the correct thresholds and set to not generate alerts. Otherwise we would get useless alerts, because we only want to be alerted when both conditions are met.
Creating the necessary groups.
After we have created the monitors we need a clear distinction between the critical servers and the non-critical servers. Groups give us the opportunity to apply different thresholds and create different levels of tickets per category of server.
You can create a group of servers with explicit members and go from there. From a manageability standpoint this is not a good idea, as it requires the discipline to add a server to the group whenever it is installed or changes category. This leaves way too much room for error.
Therefore we are going to create groups based on an attribute which can be discovered on the servers. In this case I set a registry key on the servers identifying whether it’s a critical server or not. This can easily be done by running a script through SCCM or during the build of the server.
Note: Do this in a separate management pack from the one you use for your monitors, because once sealed this management pack can be reused throughout your entire environment.
To create the attribute, go to the Authoring pane and select Attributes under Management Pack Objects.
Create a new attribute.
In this case I name it Critical server.
In the discovery method we need to tell SCOM how the attribute will be detected. In this case I choose to use a registry key.
For the target, select Windows Server; the target will automatically be filled in as Windows Server_Extended.
The management pack should be the same one your groups will reside in, because we need to operate within the same unsealed management pack.
So after we filled in all the parameters it should look like this:
Last thing to do is to identify the key which is monitored by SCOM.
In my case it’s HKEY_LOCAL_Machine\Category\critical
Next up is creating both our groups: critical and non-critical servers.
Create a new group for the critical servers:
Check out the Dynamic Members rules.
Select the Windows Server_Extended class and check whether the property Critical server equals True.
The group will now be populated with all servers where this key has the value “true”.
The only thing left to do is the opposite: create a group that contains only the servers that do not have this key set to true.
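The effect of the two dynamic membership rules can be sketched in a few lines of Python. This is only an illustration of the logic, not how SCOM evaluates group membership. The `Server` class, the server names and the case-insensitive comparison are assumptions for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    # Value discovered from the registry key; None when the key is absent.
    critical_attribute: Optional[str]


def is_critical(server: Server) -> bool:
    """Assumption: the attribute counts as set when it equals 'true', ignoring case."""
    return (server.critical_attribute or "").strip().lower() == "true"


def split_groups(servers: list[Server]) -> tuple[list[str], list[str]]:
    """Return (critical, non_critical) server names; every server lands in exactly one group."""
    critical = [s.name for s in servers if is_critical(s)]
    non_critical = [s.name for s in servers if not is_critical(s)]
    return critical, non_critical


# Hypothetical inventory: one flagged server, one flagged 'false', one without the key.
inventory = [
    Server("SQL01", "true"),
    Server("FILE01", "false"),
    Server("TEST01", None),
]
critical_group, non_critical_group = split_groups(inventory)
print(critical_group)
print(non_critical_group)
```

Note that a server without the key ends up in the non-critical group. That is the safe default here: a forgotten key results in a normal ticket, not in no monitoring at all.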
We now have all the building blocks to separate warning and error for both groups of servers, so the only thing left to do is create the notification channels with the desired actions configured.
I ended up with 3 scenarios with their notifications to match the requirements:
- I want to be alerted for a critical alert on the critical servers and create a high-priority ticket through my notification channels.
- I want to be alerted for a critical alert on the non-critical servers and create a normal-priority ticket through my notification channels.
- I want to be alerted for a warning alert on both the critical servers and the non-critical servers and send out a mail through my notification channels.
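The three scenarios boil down to a small routing table on alert severity and server group. The Python sketch below makes that table explicit. The function name `route_alert` and the string labels are made up for the example; in SCOM this logic lives in the notification subscriptions.

```python
def route_alert(severity: str, server_group: str) -> str:
    """Map an alert to the action from the three scenarios above."""
    if server_group not in ("critical", "non-critical"):
        raise ValueError(f"unknown server group: {server_group}")
    if severity == "critical":
        # Critical alerts create a ticket; the priority depends on the server group.
        if server_group == "critical":
            return "high-priority ticket"
        return "normal-priority ticket"
    if severity == "warning":
        # Warnings only send a mail, regardless of the server group.
        return "mail"
    raise ValueError(f"unknown severity: {severity}")


for severity in ("warning", "critical"):
    for group in ("critical", "non-critical"):
        print(f"{severity} alert on {group} server -> {route_alert(severity, group)}")
```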
How the tickets actually get out of SCOM and into your organization’s ticketing system needs to be configured specifically for your environment, but at this point the different scenarios are covered.
The last thing on the list was to reset the monitors on a daily basis, so we keep getting alerts for as long as the condition is not resolved. This is accomplished with my resetmonitorsofspecifictype script, which I documented in this blog post: http://scug.be/dieter/2013/10/23/scom-batch-reset-monitors-through-powershell/
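To see why the daily reset produces a new alert every day, here is a deliberately simplified Python model of the behaviour described earlier: a monitor only raises an alert when it goes from healthy to unhealthy. This is not the PowerShell script linked above and not SCOM's real implementation; `AlertingMonitor` and `simulate` are hypothetical names.

```python
class AlertingMonitor:
    """Toy model: an alert is raised only on the transition from healthy to unhealthy."""

    def __init__(self) -> None:
        self.unhealthy = False
        self.alerts_raised = 0

    def evaluate(self, condition_breached: bool) -> None:
        """Re-check the condition; alert only if the monitor was healthy before."""
        if condition_breached and not self.unhealthy:
            self.alerts_raised += 1
        self.unhealthy = condition_breached

    def reset(self) -> None:
        """Force the monitor back to healthy, as a scheduled reset would."""
        self.unhealthy = False


def simulate(days: int, daily_reset: bool) -> int:
    """Count the alerts raised while the disk stays full for `days` days."""
    monitor = AlertingMonitor()
    for _ in range(days):
        if daily_reset:
            monitor.reset()
        monitor.evaluate(condition_breached=True)
    return monitor.alerts_raised


print(simulate(days=3, daily_reset=False))  # 1: only the first breach alerts
print(simulate(days=3, daily_reset=True))   # 3: one new alert per day
```

Without the reset the monitor stays unhealthy and stays silent after the first alert. With the reset it is re-evaluated from a healthy state each day, so an unresolved condition raises a fresh alert and therefore a fresh ticket.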
This blog post covers all the requirements of this scenario, and we did not have to build anything complex outside of SCOM: everything was accomplished with technology within SCOM.
The last thing I would recommend is to seal the management pack used for the group creation. That way you can reuse it in other unsealed management packs as well to distinguish between critical and non-critical servers.
Again, you can use this approach for other monitors as well.