Interview with Charles Rich of Nastel Technologies
By Charley Rich
January 3, 2013
VSM: This fall a widespread Amazon Web Services outage was blamed on a memory leak and a failure of its monitoring alarm. What is a memory leak?
CR: A memory leak occurs when running software acquires memory but fails to release it back to the operating system. Memory leaks can cause performance degradation by reducing the pool of available memory that applications require to run effectively. Eventually, if the leak continues, there may be insufficient memory available for the system to run, resulting in system failure. This is a challenging problem in that it is quite difficult to determine whether there is a leaky condition and where the leaks are. This is especially tough in Java environments, where the JVM and OS memory footprints fluctuate independently of each other.
Java developers often have a false sense of security, assuming that Garbage Collection will take care of this for them. However, there are many situations where Garbage Collection is insufficient. Hence, there is a need for tooling to help with this important task.
VSM: How many different causes are there for leaks and how do leaks happen?
CR: There are many ways a leak may occur. The typical causes of these leaks in Java environments include programming errors, poorly selected JVM command line options, and bugs in the JVM itself or native libraries.
Programming errors or bugs within the JVM have multiple causes. These errors can be unchecked growth of arrays, lists or hash maps. They could also involve inadvertently forgetting to close JDBC prepared statements, sockets or file handles. Or they could be issues with threads, handles and the class loader. Unfortunate choices when selecting JVM command line options could also be the cause, such as choosing “-Xnoclassgc” in JEE environments. So, choose wisely.
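Two of the programming errors mentioned above, unchecked collection growth and forgotten close calls, can be sketched in a few lines of Java. This is only an illustration: the class names BoundedCache and TrackedHandle are hypothetical and do not come from the interview. The first class caps a map so that it cannot grow without limit, and the second uses try-with-resources, which works for any AutoCloseable (JDBC statements, sockets and file streams included), to guarantee release.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical example: a map that evicts old entries instead of growing forever.
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access order, so the eldest entry is the least recently used
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the eldest entry.
        return size() > maxEntries;
    }
}

// Hypothetical stand-in for a statement, socket or file handle.
class TrackedHandle implements AutoCloseable {
    static int openCount = 0; // number of handles not yet closed

    TrackedHandle() {
        openCount++;
    }

    @Override
    public void close() {
        openCount--;
    }

    static void useAndRelease() {
        // try-with-resources closes the handle even if the body throws.
        try (TrackedHandle handle = new TrackedHandle()) {
            handle.toString(); // placeholder for real work with the handle
        }
    }
}
```

An unbounded static HashMap used as a cache, or a handle opened without a matching close, would show the opposite behaviour: the entry count or open count only ever goes up.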
The typical symptoms of these leaks include OutOfMemory exceptions and increasing Garbage Collection (GC) activity and heap usage. The more difficult and less obvious symptoms are the ones associated with thread, handle, JDBC statement and ClassLoader leaks.
VSM: Just how common are memory leaks in enterprise environments?
CR: Very! It is essentially an issue of complexity. As applications increase in scope and complexity, the likelihood that flaws will be introduced increases in direct proportion.
VSM: What can IT leaders do to remediate memory leaks and make sure they don’t cause bigger problems?
CR: There are best practices or remedies for treating memory leaks. One of the most important is to find the root cause of the problem which, of course, will invariably require that you fix the code. However, this is not so easy, especially with third-party libraries where you do not have access to the code. Another approach is to increase JVM heap sizes. But this is really just over-provisioning in hopes of avoiding the problem. However, hoping and wishing only get you so far when it comes to leak issues. You can tweak parameters such as the PermGen size and heap size. This is just postponing the inevitable because these changes do not address the cause of the leak. Alternatively, one could restart the JVMs in order to reset resource usage. This might include failing over to a secondary instance. But this approach is only viable if the leaks are of the slow-growing variety.
A better approach would be early detection and warning about possible leaks. This path enables sufficient time to establish a proper remedy by using diagnostics and avoids the pending crisis of resource exhaustion.
VSM: How can you detect a leak?
CR: You can’t. Well actually, you can, but detecting leaks is challenging. You need to use inference. It requires that one look for patterns that point to “leaky behavior” rather than fruitlessly attempting to look for the leak itself. One way to do this is to monitor workload and look for a trend in higher highs for things like Garbage Collection activity. Think of this as measuring acceleration in a specific activity, e.g. Heap utilization.
A momentum oscillator is used to measure supply and demand for a specific resource such as Heap. If this sounds like “supply chain management” for Heap utilization, it’s sort of like that. Momentum oscillators are designed to measure the speed and change of movements in an underlying metric.
Now, how do you create a momentum oscillator?
Start by monitoring parameters that include the following:
- OS memory – memory footprint, handles and threads
- JVM resources – heap allocated, max and used, plus GC activity, including its frequency and duration
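As a rough sketch of how the JVM-side metrics in this list might be sampled from inside the process, the standard java.lang.management MXBeans expose heap usage, GC counts and times, and the live thread count. The JvmSample class and its field names are hypothetical. OS-level figures such as process footprint and handle counts are not covered here, since they depend on platform-specific tooling.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical holder for one point-in-time reading of JVM resources.
class JvmSample {
    final long heapUsed;      // bytes currently in use
    final long heapCommitted; // bytes allocated to the heap by the JVM
    final long heapMax;       // configured maximum, or -1 if undefined
    final long gcCount;       // total collections so far, across all collectors
    final long gcTimeMillis;  // approximate accumulated collection time
    final int threadCount;    // live threads, daemon and non-daemon

    private JvmSample(long used, long committed, long max,
                      long gcCount, long gcTimeMillis, int threadCount) {
        this.heapUsed = used;
        this.heapCommitted = committed;
        this.heapMax = max;
        this.gcCount = gcCount;
        this.gcTimeMillis = gcTimeMillis;
        this.threadCount = threadCount;
    }

    static JvmSample take() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long count = 0;
        long time = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            // Both getters return -1 when the collector does not report the value.
            if (gc.getCollectionCount() > 0) count += gc.getCollectionCount();
            if (gc.getCollectionTime() > 0) time += gc.getCollectionTime();
        }
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        return new JvmSample(heap.getUsed(), heap.getCommitted(), heap.getMax(),
                count, time, threads);
    }
}
```

Calling JvmSample.take() at a fixed interval yields the time series that the momentum calculation needs; GC frequency follows from the difference in gcCount between two consecutive samples.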
For each resource metric, do the following:
- Measure momentum (rate and size of advances vs. declines) – Momentum Oscillator
- Do not rely on a fixed limit alone: resource leaks cannot reliably be determined by watching to see if the resource exceeds some predefined threshold
The momentum oscillator illustrates a ratio of gains vs. losses in Heap utilization. We can set this with a range of 0-100, so a value of 50 indicates a net difference of 0. Detecting rate of change increases enables the user to be proactive and become aware of the trend towards Heap exhaustion long before it actually occurs.
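The interview does not give a formula for the oscillator. One plausible reading, assumed here, is the ratio familiar from the relative strength index: sum the increases and the decreases between consecutive samples, then report gains as a percentage of total movement. The class and method names below are hypothetical.

```java
// A minimal sketch of a 0-100 momentum oscillator over a series of samples,
// assuming the score is 100 * gains / (gains + losses).
class MomentumOscillator {
    static double compute(double[] samples) {
        double gains = 0;  // sum of upward moves between consecutive samples
        double losses = 0; // sum of downward moves, stored as a positive number
        for (int i = 1; i < samples.length; i++) {
            double delta = samples[i] - samples[i - 1];
            if (delta > 0) {
                gains += delta;
            } else {
                losses -= delta;
            }
        }
        double total = gains + losses;
        if (total == 0) {
            return 50.0; // no movement at all: treat as a net difference of 0
        }
        return 100.0 * gains / total;
    }
}
```

With heap readings of 100, 130, 120, 150, 140 and 170 MB, gains total 90 and losses total 20, so the score is about 81.8. A heap that rises and falls by equal amounts scores exactly 50, whatever its absolute size.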
Going further, you can use the metrics acquired to create a combined resource index for VM farms in order to track changes in momentum across multiple JVMs.
VSM: Ok, I have detected a leaky condition… now what do I do?
CR: The next step after detecting a trend toward a leak is to invoke memory diagnostics. Leak detection alerts that there is an impending problem. Memory diagnostics finds out where the problem resides. This sort of tool looks in all the obvious and not so obvious places where memory is utilized. It takes a snapshot of memory utilization showing every object in Heap, who is using it and how much is allocated. Multiple snapshots can be taken and compared over a time interval. Memory diagnostics can also provide useful reports such as the top 10 offenders in memory usage.
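The snapshot comparison described above can be approximated with a simple diff. The sketch below assumes each snapshot has already been reduced to a map from class name to bytes held, for instance by parsing a class histogram such as the one the JDK's jmap -histo prints. The SnapshotDiff class and its method are hypothetical, and a real diagnostics tool would also track who references each object.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper: ranks classes by how much their footprint grew between snapshots.
class SnapshotDiff {
    static List<String> topGrowers(Map<String, Long> before, Map<String, Long> after, int n) {
        List<Map.Entry<String, Long>> growth = new ArrayList<>();
        for (Map.Entry<String, Long> entry : after.entrySet()) {
            // A class missing from the earlier snapshot counts as starting from 0 bytes.
            long delta = entry.getValue() - before.getOrDefault(entry.getKey(), 0L);
            if (delta > 0) {
                growth.add(new AbstractMap.SimpleEntry<String, Long>(entry.getKey(), delta));
            }
        }
        // Largest growth first.
        growth.sort((a, b) -> Long.compare(b.getValue(), a.getValue()));
        List<String> names = new ArrayList<>();
        for (int i = 0; i < Math.min(n, growth.size()); i++) {
            names.add(growth.get(i).getKey());
        }
        return names;
    }
}
```

Calling topGrowers(before, after, 10) gives a "top 10 offenders" list by growth rather than by absolute size, which is usually the more useful view when hunting a leak.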
VSM: Can you summarize what I need to do to stay on top of this important issue?
CR: There are a number of regular activities that can be undertaken to track memory leakage. They include the following:
1 - Monitor resources such as memory usage, GC activity, handles and threads
2 - For each of these, set up a momentum oscillator. Using the suggested scale of 0-100, anything above 60 would indicate advances outpacing declines by a significant margin. A higher number indicates how aggressive the resource leak is -- 80 would indicate very aggressive resource growth, regardless of the actual usage numbers for each metric
3 - Alert, notify or act when a momentum oscillator breaches a threshold in order to garner the time necessary to diagnose the root cause
4 - Use memory diagnostics to zoom into the offending JVM and determine root cause in order to prevent user impact
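Steps 1 to 3 above can be tied together in a small rolling-window watcher. This is a sketch under stated assumptions: the LeakWatch class, its Level names and the window size are hypothetical, the oscillator is taken to be gains as a percentage of total movement, and the 60 and 80 cut-offs are the ones suggested in the list. The watcher stays at NORMAL until the window is full, so that one or two early increases do not trigger an alert.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical watcher for a single metric, such as heap used.
class LeakWatch {
    enum Level { NORMAL, WARNING, CRITICAL }

    private final Deque<Double> window = new ArrayDeque<>();
    private final int size;

    LeakWatch(int size) {
        if (size < 2) {
            throw new IllegalArgumentException("window needs at least 2 samples");
        }
        this.size = size;
    }

    // Adds one sample and returns the alert level for the current window.
    Level offer(double sample) {
        window.addLast(sample);
        if (window.size() > size) {
            window.removeFirst(); // keep only the most recent samples
        }
        if (window.size() < size) {
            return Level.NORMAL; // not enough history yet
        }
        return classify(oscillator());
    }

    // Assumed formula: 100 * gains / (gains + losses), or 50 when nothing moved.
    double oscillator() {
        double gains = 0;
        double losses = 0;
        Double previous = null;
        for (double current : window) {
            if (previous != null) {
                double delta = current - previous;
                if (delta > 0) gains += delta; else losses -= delta;
            }
            previous = current;
        }
        double total = gains + losses;
        return total == 0 ? 50.0 : 100.0 * gains / total;
    }

    static Level classify(double score) {
        if (score >= 80) return Level.CRITICAL; // very aggressive growth
        if (score > 60) return Level.WARNING;   // advances clearly outpacing declines
        return Level.NORMAL;
    }
}
```

One watcher per metric and per JVM keeps the alerting independent of absolute usage numbers. A WARNING or CRITICAL result is the cue to move on to step 4 and run memory diagnostics.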
source: Virtual Strategy Magazine