Sunday, November 28, 2010

Why is monitoring necessary?

This week a colleague of mine asked me to be a co-presenter at a Microsoft event on Lync 2010 (Office Communications Server 14, find the event here), where he wants me to talk about the SCOM implementation for Lync. So I thought about what might be a good opening to loosen up the audience.
I asked myself (once again):
Why are you doing system management?
What are the benefits of monitoring?
What is the business value of being proactive?
And how do you measure ROI?
From time to time, most of us go for a preventive medical checkup (even those who live an active, healthy lifestyle).

We do that to know the health state of our own body, and to learn how we can prevent illnesses like hypertension, circulatory disturbances, blood glucose disorders, and so on.
I compare that with proactive technical monitoring, because things can go wrong without anybody being aware of it.
We also have other kinds of health checkups for more serious conditions, such as cancer screening, heart insufficiency, osteoporosis, and so on.
In the cases above, the system (yes, your body's system too) is in an unhealthy state, but all services are still working as expected. To prevent an unwanted outage, you have to know about the problem as soon as possible, so you can repair it with less impact and less subsequent damage.
Being proactive also means that if we do have acute health problems, we go to the doctor or even to the hospital to determine the cause and get the correct medical treatment immediately.
In my opinion, servers and applications should do that too, to give us the chance for correct diagnosis, analysis, and recovery, and so to minimize downtime.
Does this make sense to you?
I guess so, because it is necessary to know your body's health state. And I think it's also necessary to know the health state of your datacenter environment - at any time!
This is from the technical perspective.
On the other hand, there is always the business perspective. Unfortunately, it is not that easy to determine the ROI for this kind of software.
How should you account for the costs saved by service downtime that never happens? Or for the costs saved by much faster response and service recovery when an issue does occur?
What will you consider in your ROI calculation? Do you observe file system thresholds too (because no more space available = no more service available)? Or do you only observe real service downtime? Is this the whole truth?
What about performance issues? Do you consider the costs saved because users can work faster (or even more smoothly) after you start up an additional server in your farm/cloud?
And on the other hand, what about the costs saved on power, cooling, and hardware wear (MTBF) when you shut down a server as the workload in your farm/cloud decreases?
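To make the questions above a bit more tangible, here is a minimal sketch of how such a calculation could be put together. It is only an illustration: the function names, the cost categories, and every figure are my own made-up assumptions, not measurements, and which savings you include is exactly the judgment call discussed above.

```python
def avoided_downtime_cost(hours_avoided: float, cost_per_hour: float) -> float:
    """Estimated cost of downtime that never happened (both inputs are estimates)."""
    return hours_avoided * cost_per_hour


def monitoring_roi(savings: float, monitoring_cost: float) -> float:
    """Simple ROI ratio: (savings - cost) / cost."""
    if monitoring_cost <= 0:
        raise ValueError("monitoring_cost must be positive")
    return (savings - monitoring_cost) / monitoring_cost


if __name__ == "__main__":
    # All figures are hypothetical placeholders for one year.
    savings = (
        avoided_downtime_cost(hours_avoided=10, cost_per_hour=5_000)  # outages prevented
        + 7_000  # assumed saving from faster response and recovery
        + 3_000  # assumed saving on power and cooling
    )
    roi = monitoring_roi(savings, monitoring_cost=40_000)
    print(f"Estimated ROI: {roi:.0%}")
```

The arithmetic is trivial. The hard part is agreeing on the inputs, because most of them describe events that never took place.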
You see, calculating the costs is partly a philosophical question.
But hopefully you will keep in mind that proactive monitoring is essential. So: call the doctor you trust to get an appointment for your medical health check. And call the consultant you trust to implement useful (!) monitoring.
All information is provided "as is" without any warranty! Try it in a lab first. Handle with care in production.

