Tuesday, December 7, 2010

This time I will DO IT!

Humans (= admins) are lazy. Sorry, I do not want to offend anyone! I will come back to that point later, so please don’t rail at me just yet!

But to be honest: did you ever say, “next time… next time I will do it better”? I guess so.

Did you ever say, “we will do everything to make sure this failure does not happen again”? Of course you did.

And did it ever turn out that nothing (or at least not enough) happened afterwards? Probably yes (be honest, nobody will find out, and I will not tell anyone ;-) ).

Believe me: in System Management it is just the same. You can’t build a monitoring solution with 100% coverage from scratch. You can implement a best practice solution (or have somebody do it for you), but that solution will not cover 100% either.

But you can learn from every unexpected situation that reduces your service availability. When that happens, adapt your monitoring so that it alerts as early as possible before the crash happens again, or at least as soon as possible after it has happened again, and with the best possible information about the problem. In the worst case the users are still affected, yes. But you can start the right recovery immediately, and your helpdesk can tell your users what the problem is and that you are already working on a solution (and that is also very (!) important).

In my experience (from both inside and outside of datacenters), the motivation to extend monitoring for a specific service or problem can be visualized as in the chart below.

Normally, immediately after an unwanted service downtime, the motivation to do everything to prevent that issue in the future is at nearly 100%. Take advantage of this timeframe to extend your monitoring!



vertical axis: % motivation and % monitoring integrated
horizontal axis: time before/after the crash (take any unit)
blue line: motivation to improve monitoring
red line: level of monitoring already implemented
orange area: when the crash happens
green area: time of highest motivation to become better

Nobody (!) can monitor all services for all failures that could ever happen. But it is mandatory to learn from every problem that occurs, and most of the time there is not much time to improve your monitoring configuration, because everybody has a heavy workload and the motivation to create monitoring for this particular problem decreases again as time passes.

The good thing to know is that this does not have to depend on whether the system management admins are lazy or not. In the real world, it is very often the case that exactly these people are not part of the review process, and the admins or operators who are involved do not care about monitoring (I will write a separate blog post about that, because it is a very common problem).

Read this blog for some other thoughts: Why monitoring is necessary?

Credits to my friend Alexander Edelmann („Das Regenschirm Prinzip“ ISBN-13: 978-3639098624) for inspiring me.

All information is provided "as is" without any warranty! Try it in a lab first. Handle with care in production.

