Last night in Washington, DC, I was driving home in a severe thunderstorm, with blinding lightning flashes and thunder booming like cannon fire. Two weeks earlier my home had lost power for 5 days during the sweltering summer heat, and it was not pleasant (understatement). I was worried about losing power at the house again, and after checking in by mobile phone, my mind wandered back to the office and work. If I was losing power at home, how would this storm impact my company’s network? Would I return to the office without insight into which devices or applications were up or down, and for how long? Then I smiled.
At Visual Network Systems (VNS) our Visual Performance Manager (VPM) system has a unique and powerful feature that helps network managers deal with alarms that can be the result of severe weather or a variety of other events that may cause power loss. Therefore, I knew that if a problem occurred, I would get notified and have insight into what was happening on my network, so I could proactively work to fix the problem.
Let’s dive into this a little more, with some background first. Within the network infrastructure, all sorts of SNMP (Simple Network Management Protocol) polling is happening against the MIBs (Management Information Bases) of servers, routers, and switches. The most basic poll is for “availability”: is the device up or down? Most SNMP polling is done in 3- or 5-minute increments, and a lot can happen to a network between polling cycles.
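To make the polling-gap point concrete, here is a rough sketch in Python. It is not how VPM or any SNMP poller is implemented; the function name and the timings are made up for illustration. It simply lists which poll times would land inside an outage window.

```python
def polls_that_see_outage(outage_start, outage_end, interval, horizon):
    """Return the poll times (in seconds) that fall inside the outage window.

    Hypothetical helper: polls are assumed to fire at t = 0, interval,
    2 * interval, ... up to (but not including) horizon.
    """
    return [
        t for t in range(0, horizon, interval)
        if outage_start <= t < outage_end
    ]


# An outage lasting 280 seconds, starting just after the 5-minute poll.
five_minute = polls_that_see_outage(310, 590, interval=300, horizon=3600)
three_minute = polls_that_see_outage(310, 590, interval=180, horizon=3600)

print("5-minute polling saw the outage at:", five_minute)
print("3-minute polling saw the outage at:", three_minute)
```

With these made-up numbers, the 5-minute poller never lands inside the outage at all, while the 3-minute poller catches it but still cannot say why the device was unreachable.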
For example, severe weather (and the resulting power outage) can create alarm storms in monitoring tools based on SNMP polling, and when there are hundreds of locations or customers, Network Operations Center (NOC) personnel need to understand the situation immediately. This is called “situational awareness.” Power outages and surges cause a large number of automatic trouble tickets to be created within enterprises that manage their own networks, and within service providers managing routers or remote locations for customers. These tickets take time and energy for IT personnel to close, distracting them from other issues. Often, the ticket is closed as NTF, for No Trouble Found.
Power outages affect the Service Level Agreement (SLA) metrics for Availability of a remote location. Normal remote SNMP polling would just show “unreachable” if the polling cycle occurred during the power outage, with no knowledge of whether a backhoe in the parking lot cut through the local loop copper to the building, the customer edge router is down, someone unplugs the router nightly to plug in the vacuum (this is actually a true story), or there is a power outage.
How does a network manager overcome this lack of insight? The VPM system has a remote monitoring probe called an ASE (Analysis Service Element). It can be installed inline in front of the customer edge router, and it provides a “last cry” as the power in the building surges or goes out.
This last cry, a real-time SNMP trap (ASE lost power or experienced a brownout condition), is very difficult to recreate in the real world (and the test world) because the trap is only created when an ASE probe unit that is plugged into a power receptacle actually loses power. As the unit loses power, there is just enough electricity to send a last trap, and because there is power provided on the wire from the Central Office (CO) to the remote branch via large backup batteries and generators, the ASE is able to send this “last cry” trap back to the VPM server. This event can be forwarded to any third-party event consolidation tool. It identifies the issue immediately to the Network Operations Center personnel monitoring the remote location.
Some scenarios for isolating a remote site availability problem include:
- If you unplug the power to the ASE, or reboot the device, then the trap created would be “The ASE has rebooted”. This is more of an administrative trap, but it is related to a power outage if the ASE probe is not on an Uninterruptible Power Supply (UPS).
- If the Customer Edge Router (CER) is down, then the trap would be “Customer Provided Equipment Failure”. (The ASE hardware can keep the local loop “alive” without the CER installed or running by using LMI or BGP updates.)
- If the local loop is down, then the trap would be “Access Line is down”, and it is a hard down.
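The scenarios above amount to a lookup from trap text to probable cause. Here is a minimal sketch in Python of how a NOC script might do that triage. The message fragments are taken from this post, not from the actual VPM trap definitions, and `TRAP_CAUSES` and `classify_trap` are hypothetical names, so treat the matching as an assumption.

```python
# Hypothetical mapping from trap text (as quoted in this post) to a probable
# cause. The real VPM trap strings and OIDs may differ.
TRAP_CAUSES = {
    "ASE lost power": "site power outage or brownout",
    "The ASE has rebooted": "probe restarted (administrative, or power loss without a UPS)",
    "Customer Provided Equipment Failure": "customer edge router down",
    "Access Line is down": "local loop down (hard down)",
}


def classify_trap(message):
    """Return a probable cause for a trap message, matching case-insensitively."""
    lowered = message.lower()
    for fragment, cause in TRAP_CAUSES.items():
        if fragment.lower() in lowered:
            return cause
    return "unknown; investigate manually"


print(classify_trap("Customer Provided Equipment Failure"))
print(classify_trap("Access Line is down"))
```

Anything the table does not recognize falls through to manual investigation rather than being guessed at.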
Now, why is this important?
- Availability metrics on a remote location can be affected for your daily, weekly and monthly reporting.
- There would also be an alarm storm from all the servers and infrastructure switching over to the UPS or rebooting. Understanding that the weather is the cause provides situational awareness to the NOC. This supports Mean Time to Know or Mean Time to Identify metrics.
- If a Managed Service Provider (MSP) offers an SLA for Availability, the measurements can be easily correlated to the weather that occurred last night.
- If the enterprise NOC is 24/7, then monitoring of the remote locations can be correlated to a weather map, and the polling tool’s events for the affected sites can be dismissed immediately.
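As a rough illustration of that correlation, the sketch below separates “unreachable” polling alarms that are already explained by a recent power-loss trap from the ones that still deserve a trouble ticket. This is plain Python with hypothetical names and data shapes, and the 10-minute window is an arbitrary assumption, not a VPM setting.

```python
def split_alarms(unreachable, power_traps, window=600):
    """Separate explained alarms from those that still need a ticket.

    unreachable: list of (site, timestamp) pairs from the polling tool.
    power_traps: dict mapping site -> timestamp of its last power-loss trap.
    window: how many seconds after the trap an alarm still counts as explained.
    All three are hypothetical inputs for illustration.
    """
    explained, needs_ticket = [], []
    for site, alarm_ts in unreachable:
        trap_ts = power_traps.get(site)
        if trap_ts is not None and 0 <= alarm_ts - trap_ts <= window:
            explained.append(site)
        else:
            needs_ticket.append(site)
    return explained, needs_ticket


alarms = [("branch-01", 1300), ("branch-02", 1300), ("branch-03", 1300)]
traps = {"branch-01": 1200, "branch-03": 100}

explained, needs_ticket = split_alarms(alarms, traps)
print("Explained by power loss:", explained)
print("Still needs a ticket:", needs_ticket)
```

In this made-up example, only branch-01 has a power-loss trap close enough to its alarm to be dismissed; branch-02 sent no trap, and branch-03's trap is too old to explain the current alarm.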
Even with a rough drive home in the middle of a thunderous rainstorm, I was secure in the knowledge that VPM and the Analysis Service Element were monitoring my office network and that I would be notified of a problem before our team showed up for work the next day. Just another example of how VPM helps network managers gain insight into network operations.