Root cause analysis is often referred to as the “Holy Grail” of monitoring.

However, if I have my Arthurian legends straight, it was never found. We won’t count the Indiana Jones re-write of the legend… In fact, the Holy Grail is more of an ideal than something you ever fully accomplish.

Nevertheless, it was considered the most worthwhile pursuit, even if it is always a bit out of reach. The same holds true for application performance monitoring.

The explanation for this is simple. There is a timeline that starts with problem occurrence, continues through the first arrival of a symptom and the impact of the problem, and ends (hopefully) with problem resolution. Compressing this timeline is the obvious way to reduce impact, save money and preserve productivity. Of course, there is no way to put a time stamp on problem occurrence that is different from the first arrival of a symptom; it is a bit like the tree falling in the forest that no one hears. However, monitoring is all about capturing that first symptom and then getting the resolution process in play as soon as possible. But…it is unknown on which event stream the first symptom will appear. It could be network latency, a server slowdown, application timeouts, or a database call that never returns…the list is long.

A way to resolve the monitoring dilemma is to correlate symptoms across multiple event streams and attempt to separate symptom from cause. The relationship between the two is called “causality”. Wikipedia defines causality as the relation between an event (the cause) and a second event (the effect), where the second event is understood as a consequence of the first. In the world we live in (other than in Star Trek), causality cannot be violated. While attempting to construct a causality chain of events, application performance monitoring must overcome two hurdles: some symptoms will be lost, and those that do arrive will appear in a seemingly random order.
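To make the ordering hurdle concrete, here is a minimal sketch in Python. It assumes every event carries a timestamp recorded at its source, so arrival order can be ignored. The `Event` type, the stream names and the 30-second window are all hypothetical choices for illustration, not part of any particular APM product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    stream: str       # hypothetical stream name, e.g. "network"
    timestamp: float  # seconds, as recorded at the source (assumption)
    symptom: str


def candidate_chain(events, window=30.0):
    """Order events by source timestamp and keep those that fall within
    `window` seconds of the earliest one.

    The earliest event is only a *candidate* cause: coming first in time
    is necessary for causality, but it does not prove it.
    """
    ordered = sorted(events, key=lambda e: e.timestamp)
    if not ordered:
        return []
    start = ordered[0].timestamp
    return [e for e in ordered if e.timestamp - start <= window]


# Events listed in the (seemingly random) order in which they arrived.
arrived = [
    Event("application", 12.0, "request timeout"),
    Event("database", 10.5, "query never returned"),
    Event("network", 11.0, "latency spike"),
]
chain = candidate_chain(arrived)
```

Here the database event ends up at the head of the chain even though it arrived second. A lost symptom would simply be missing from the list, which is why a chain like this is a starting point for investigation rather than a verdict.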

One solution to this is the use of Complex Event Processing (CEP). It is a tool that can rapidly recognize patterns across multiple event streams that, evaluated together, describe a problem. CEP is very good at handling the signal-to-noise problem inherent in monitoring while still producing viable results.
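As a rough illustration of the idea, and not of how any real CEP engine works, the sketch below implements a single hand-written rule in Python. The rule fires when every stream it cares about has reported a symptom within a short time window, and it ignores everything else as noise. The `WindowRule` class, the stream names and the window length are hypothetical, and events are assumed to be fed in timestamp order.

```python
from collections import deque


class WindowRule:
    """Fires when all required streams report within `window` seconds."""

    def __init__(self, name, required_streams, window):
        self.name = name
        self.required = frozenset(required_streams)
        self.window = window
        self.recent = deque()  # (timestamp, stream) pairs still in the window

    def feed(self, timestamp, stream):
        """Feed one event (in timestamp order, by assumption).

        Returns the rule name when the pattern is matched, else None.
        """
        if stream not in self.required:
            return None  # noise as far as this rule is concerned
        self.recent.append((timestamp, stream))
        # Drop events that have aged out of the window.
        while self.recent and timestamp - self.recent[0][0] > self.window:
            self.recent.popleft()
        seen = {s for _, s in self.recent}
        if seen == self.required:
            self.recent.clear()  # start fresh after a match
            return self.name
        return None


rule = WindowRule("db-induced timeouts", {"database", "application"}, window=5.0)
events = [
    (1.0, "network"),
    (2.0, "database"),
    (3.0, "disk"),
    (4.5, "application"),
    (20.0, "database"),
    (40.0, "application"),
]
fired = [t for t, stream in events if rule.feed(t, stream)]
```

The rule fires once, at the application event 2.5 seconds after the database event. The later pair is 20 seconds apart, so it falls outside the window and is not treated as the same problem. Real CEP engines express such patterns declaratively and evaluate many of them at once; this sketch only shows the windowed correlation at the core of the approach.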

For more information on this approach, take a look at the article in APMdigest, “Root-Cause Analysis of Application Performance Problems”. In addition, check out the whitepaper written by JP Garbani of Forrester Research, “Application Performance Management and Complex Event Processing”.