Obamacare website cannot handle the load – a classic use case for AutoPilot

Several days ago I looked at the HealthCare.gov website, just to understand what Obamacare means. The website was up and working. Today one of our employees brought to my attention an article by Sharon Begley of Reuters stating that the Obamacare website was locked down a few days after the launch. The story caught my interest because it describes a typical use case faced by many enterprises.

Several IT experts interviewed by Sharon Begley offered different theories about what caused the outage. Some stated that the flaw is in the architecture of the application and that adding capacity may not help. One was quoted as saying that there is a coding bug in the system. An IT company stated that the problem is in the database access – the more you ask, “the more it gets overwhelmed”. Government officials blamed the persistent glitches on the fact that they had not anticipated 8.5 million users within a few days. An independent contractor offered several hypotheses, using the term “overwhelming” to describe the potentially high number of JavaScript files being loaded into web browsers. One interesting suggestion compared the situation to a DDoS attack on the website. The internal technicians tried to resolve the issue by adding new servers to increase capacity and by tuning the configuration, but it did not help. Each group was trying to come up with explanations and possible root causes.

The truth is that they may all be right. But hearing all these assumptions and probable causes is indeed overwhelming for the people in charge. These are exactly the issues faced by many companies that use various siloed monitoring solutions for their mission-critical applications. Dozens of IT personnel gather in war room meetings and go through finger-pointing and blame-storming sessions to identify the probable causes affecting the performance of their business services. While this is going on, the application is not performing. In fact, I tried the Obamacare website today and it still does not work.

I am sure that the people who designed the system are qualified professionals. They most likely went through a thorough design of the system, but, as everywhere else, important topics such as high availability, reliability, and scalability at all tiers of composite applications are an afterthought. Usually, under the pressure of meeting target dates, and with teams relying on the existing infrastructure monitoring, not much attention is paid to scalability and root cause analysis. It looks like they did not anticipate the viral spread of the healthcare message, or the desire and curiosity of people.

This application clearly needs a solution that not only monitors web browsers, databases, and servers, but can also diagnose the probable causes and predict potential failures. It should provide visibility into the different tiers of transaction flows, anticipate performance bottlenecks at every tier, end-to-end, and point to potential root causes before failures occur. These are not new problems; we deal with them daily with our customers. The number of users and transactions in Obamacare is not overwhelming at this stage. I am sure that, if these issues are properly addressed, they can achieve their goals and provide services to people who need medical care.