I watched the Super Bowl last night and, like you, I was amazed at the blackout that occurred after half-time. Was it due to Beyoncé’s incendiary performance drawing all the current out of the power supplies? It might be nice to think so…but I suppose not. Where were the backup systems? Did they also fail? Most of the time these sorts of failures do not materialize out of nowhere, no matter who is singing and dancing…
Behind the scenes you can be sure monitoring software was running to detect and alert on a power failure. Why didn’t it provide sufficient notification of an impending problem so that the issue could have been averted? Was this a cascading failure that started small and then grew until it knocked all the power out? Maybe the monitoring systems themselves were out. Over the next few days, I am sure there will be many articles explaining this.
Could it be complexity? Many have written about the Gulf oil spill and how the complexity of its monitoring systems left it to engineers to make sense of the data and correlate it in order to understand what it implied (see the blog by JP Garbani at Forrester). That approach to monitoring required a human expert to determine whether there was a problem, and unfortunately, by the time they understood the issue, it was too late. Perhaps this is what happened here, too.
In any case, simple, proactive monitoring that can analyze data in real time, recognize patterns that show a trend toward an impending problem, and suppress false alarms can help prevent failures like this. Monitoring systems that depend on “eyes on screen” are reactive and can’t keep up with the complexity of today’s world. Ultimately, they fail. While it’s too late for the 49ers, learn more about how Nastel can help with a proactive approach to monitoring applications, and take a look at “Unraveling the Mystery: How to Predict Application Performance Problems”, a free whitepaper.
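To make the idea concrete, here is a minimal, purely illustrative sketch (not Nastel’s actual implementation) of the difference between reacting to a single spike and detecting a sustained trend: an alert fires only after several consecutive threshold breaches, so one-off spikes are suppressed as noise while a real drift toward failure is still caught early.

```python
def detect_trend(readings, threshold, consecutive=3):
    """Return the index at which a sustained breach is first confirmed,
    or None if the stream never shows `consecutive` breaches in a row."""
    streak = 0
    for i, value in enumerate(readings):
        if value > threshold:
            streak += 1
            if streak >= consecutive:
                return i  # alert: trend confirmed, not a one-off spike
        else:
            streak = 0  # reset: an isolated spike is treated as noise
    return None

# Hypothetical data: % load on a power supply sampled over time.
load = [40, 42, 95, 41, 60, 82, 88, 93, 97]
print(detect_trend(load, threshold=80))  # -> 7: third breach in a row
```

The single spike at index 2 never raises an alert, but the sustained rise starting at index 5 is flagged at index 7, well before the readings peak. Real monitoring products use far richer statistics, but the principle of confirming a pattern before alerting is the same.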