The earlier you find an IT problem, the less expensive it is to resolve. Most IT practitioners would agree with that. What they are likely to disagree about is how to do it. Some talk about predictive monitoring and say they can forecast IT problems before they occur. Can that really be done? Is predicting failure events even the right approach?
With the coming onslaught of what the media refers to as “big data” and the inevitable event storm that will accompany it, organizations need to find the best method for detecting business-impacting problems early and resolving them before they affect users.
Although useful, predicting IT events one by one isn’t sufficient. We need to go beyond merely predicting events and look at what kind of real business problems we can uncover. For example, predicting from a moving baseline that congestion between a connection pool and a database will become a bottleneck in an hour is nice to know and important to the DB admin, but it is not necessarily a business-impacting issue. Problems like this are the easy, low-hanging fruit anyone can detect. But then there are the problems that are really hard to uncover, such as whether interbank payments made using a complex, home-grown SOA application are completing on time and, if not, why. The payoff is that solving these can produce significant business value.
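To make the "moving baseline" idea concrete, here is a minimal sketch of that kind of single-metric prediction. It assumes evenly spaced samples of some connection-pool metric (wait time, say) and extrapolates the baseline's average slope in a straight line. The function names, the window size and the threshold are all illustrative assumptions, not anything a particular monitoring product prescribes.

```python
from statistics import mean


def moving_baseline(samples, window):
    """Return the mean of each trailing window of `window` samples."""
    return [
        mean(samples[i - window + 1:i + 1])
        for i in range(window - 1, len(samples))
    ]


def minutes_until_threshold(samples, window, threshold, interval_minutes=1.0):
    """Estimate minutes until the baseline crosses `threshold`.

    Uses a naive linear extrapolation of the baseline's average slope
    (an assumption made for illustration). Returns 0.0 if the baseline is
    already at or above the threshold, and None if there is too little
    data or the baseline is not rising.
    """
    baseline = moving_baseline(samples, window)
    if len(baseline) < 2:
        return None
    if baseline[-1] >= threshold:
        return 0.0
    slope = (baseline[-1] - baseline[0]) / (len(baseline) - 1)
    if slope <= 0:
        return None
    return (threshold - baseline[-1]) / slope * interval_minutes


# Hypothetical pool wait times in milliseconds, sampled once a minute.
wait_ms = [10, 12, 14, 16, 18, 20]
print(minutes_until_threshold(wait_ms, window=3, threshold=30))
```

Even when a forecast like this is accurate, it only says that one metric on one component is drifting. It says nothing about whether a payment is late.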
To solve the hard problems, we need to move in the direction of a pattern-based, problem-recognition strategy. IT events can be captured from our devices, applications and users. However, resolving IT events one by one as they arrive is often ineffective, because the service desk may be unable to tell whether they are symptoms or causes, or to understand their priority.
Again, with the rise of big data as cloud computing adoption matures, this one-by-one approach is doomed. Today’s brute-force approach of handling a growing set of events by writing an ever-increasing number of rules can never catch up and be effective.
Instead, a pattern-based strategy can look at these events in totality as a “situation”: a composite of IT events and business events that, taken together, describe a real problem with business impact. This can be done today by using technologies such as complex event processing to consider situations top-down, rather than just individual events bound together by an ever-growing set of programmer rules.
In our example, a business problem such as an interbank funds transfer taking too long can be considered as a situation. Under the covers, there may be messages with payment instructions sent over WebSphere MQ to a DataPower appliance. At the appliance, the message is transformed and then routed to an application at the appropriate bank via the Fed and sent out via a TIBCO message. Think of this as a “chain of custody” for a payment as the transaction traverses multiple companies, a government agency and different messaging topologies. Pretty complex. In a situation like this, do we really want to focus on predicting a WebSphere or TIBCO problem? What will that do for us? Or will it be much more effective to look at the situation of a late payment and then find the cause? Certainly, the latter will be better, as it is focused on what matters to the business.
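The top-down view of the late-payment situation can be sketched in a few lines. This is a simplified illustration, not a complex event processing engine. It assumes every hop emits an event carrying a shared payment identifier and a timestamp, and the `Event` type, the hop labels and the deadline are hypothetical names chosen for the example. It correlates events per payment, flags payments whose end-to-end time exceeds the deadline, and only then looks for the slowest hand-off in the chain of custody.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    payment_id: str   # hypothetical correlation key carried by every hop
    hop: str          # illustrative labels: "mq", "datapower", "fed", "tibco"
    timestamp: float  # seconds since an arbitrary starting point


def late_payment_situations(events, deadline_seconds):
    """Group events by payment and describe each payment that ran late."""
    by_payment = defaultdict(list)
    for event in events:
        by_payment[event.payment_id].append(event)

    situations = []
    for payment_id, trail in by_payment.items():
        trail.sort(key=lambda e: e.timestamp)
        elapsed = trail[-1].timestamp - trail[0].timestamp
        if len(trail) < 2 or elapsed <= deadline_seconds:
            continue  # on time, or not enough events to measure
        # Start from the business symptom, then find the widest gap
        # between consecutive hops as the first place to investigate.
        gaps = [
            (later.timestamp - earlier.timestamp, earlier.hop, later.hop)
            for earlier, later in zip(trail, trail[1:])
        ]
        delay, src, dst = max(gaps)
        situations.append({
            "payment_id": payment_id,
            "elapsed_seconds": elapsed,
            "slowest_handoff": (src, dst),
            "handoff_seconds": delay,
        })
    return situations


events = [
    Event("P1", "mq", 0.0), Event("P1", "datapower", 1.0),
    Event("P1", "fed", 2.0), Event("P1", "tibco", 3.0),
    Event("P2", "mq", 0.0), Event("P2", "datapower", 1.0),
    Event("P2", "fed", 50.0), Event("P2", "tibco", 51.0),
]
for situation in late_payment_situations(events, deadline_seconds=30.0):
    print(situation)
```

The design choice that matters here is the order of the questions: "is the payment late?" comes first, and "which component is responsible?" is asked only for the payments that are. A real deployment would also need to handle payments that never reach the final hop, which this sketch does not attempt.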
The pattern-based strategy is really one of scale and time. As complexity increases, it can help us rapidly make sense of the data we are buried under and swiftly take the appropriate action to restore service.