“That which is not monitored is not managed.” – A wise System Administrator
Failure is a part of life. This is especially true in the world of IT. It is not a question of if, but when. The key to successful enterprise management is to know when things fail. This can only be accomplished through monitoring. The name of this art – Network Management. Well, actually it is more than network management, but I guess this is the title it gets because of its roots; kind of like the way we still say we are dialing the phone. A better name would be – Enterprise Management, which is starting to catch on, but the old IT folks won’t understand what you are talking about.
Monitoring the Enterprise
If a tree falls in the forest and no one hears it, did it really fall? Well, I don’t know the answer to that, but I do know that if a system fails, someone is going to hear it. The goal of the IT staff is to be the first one to hear it (or better yet, know that it is going to fall/fail). Nothing is more painful than having your customer point to a fallen tree and ask you if you heard it. You must listen to your forest (Enterprise).
Referring back to the Enterprise Architecture post, I view systems as collections of services. Going a bit further, services are composed of components. These components are computers, switches, storage (SAN, NAS), and a bunch of other stuff. Therefore; systems are composed of services, and services are composed of components. And, the collection of our systems, services, and components is our Enterprise.
So, what do we monitor? Simple, as much as we can – systems, services, and components. To simplify this discussion, let’s look at this in terms of levels; Systems, Services, and Components. At the top level, Systems, we are checking for functionality. For example, if the system was a website, we could perform an HTTP get to check the functionality of the site. To accomplish more detailed monitoring we might craft a special HTTP request that would exercise the services that make up the site. The data returned from this HTTP request could then be analyzed to determine if the site was operating normally. In the case of Services, well, basically we are doing the same thing, keeping in mind that services are systems. So looking at the website again, we connect to its database and run some queries to gather status information. For the Components we can use SNMP to collect a whole bunch of data. In the case of a computer we would collect CPU data, disk information, memory usage, process data, and more. For switches and routers; system performance (CPU and memory), port information (usage, up/down status, error counts, etc.), and routing data (updates, errors, etc.). The more data we collect, the more likely we are to identify issues.
At each level we are collecting data that will be use to determine the operational status of all of the parts that make up a system; as well monitoring the system itself. Why not just monitor the system? Monitoring all of the supporting services and components allows us to quickly address the actual cause of a problem. If we know what caused the problem, we can fix it. The website is down is not enough information. Is the actual problem the server, switch, load balancer, router, database, firewall, or the user? Not knowing leaves a lot of things to check. If we are monitoring all of the supporting services and components we will know what is wrong and our efforts to fix the problem can be focused on what is broken.
When we monitor the Enterprise in terms of its architecture, we do not really need the “system” level monitoring, because we will be monitoring all of the services that comprise the system. Well, ok this is not really completely true. The point is that in most cases, if we are monitoring components and services, problems that affect systems will be identified at a lower level. And if the relationship of services to systems is know, and maybe even incorporated into or monitoring tool(s), we will understand why a system is down based on lower level issues. This is the goal – service and component level monitoring of the systems, for the whole Enterprise.
All right, here is the truth, you can never get there. The problem you run into is Zeno’s paradox of Achilles and the tortoise. Once you monitor half of the stuff, there is still half to go. You can get close, but close is all you can do. That being said, it is well worth the effort. Most basic stuff can be addressed quickly and easily with open source tools like Zenoss and Nagios. And as your monitoring solution matures it will become more effective. Just keep in mind that the task is never done. The more you monitor, the less you will miss.
Navstar has successfully implemented Enterprise level monitoring and capacity planning solutions for our customers. At the Department of Treasury we implemented a complete solution based on open source tools. At the core of the solution are Zenoss and Cacti. We can help you get there too.
So what are you waiting on? Start monitoring.