A lesson on failure: “Going Cloud”

About a year ago, at a time when I was barely aware of Cloud Computing or Amazon Web Services, I was assigned, along with a couple of colleagues from the consulting company I was working for, to bring an existing codebase from “alpha” to “production” and then ensure its smooth deployment to the Amazon cloud.

The customer wanted to go “live” in less than three months; they also wanted to be able to handle tens of thousands of visitors who would obviously click on banners and make them money. What is actually more probable is that they were hoping for a good exit, that is, passing the hot potato to somebody else while walking out with a profit. On a side note, there is a term that could be used for these people, but this is not a meme-text, so I won’t go further down that route.

Starting on a new project

With this project, things initially moved in the right direction: we dealt incrementally with quite a few functionality issues and, in the end, we managed to fix more than 100 bugs and glitches. In truth, this was all we could do, along with the long hours required to get things done.

We could not concern ourselves with any setup issues in the “Cloud” configuration: we knew next to nothing about the topic and the customer fiercely guarded the “keys to the kingdom”; they would only agree to instance and resource set-ups on a case-by-case basis anyway. They probably thought they were paying way too much for those pesky Eastern European contractors (us), so I kind of get the “why” behind keeping a close eye on the Amazon bill. It was fine by us; at the end of the day it was their home, with their needs and their rules.

All things considered, things went quite well from our perspective, but only until performance testing started. From then on, things fell apart: in about a month’s time the project was taken out of our hands, with the customer moving on to different contractors. I don’t think the product ever got launched; as of now, that particular domain is one of the countless spam domains out there, having been abandoned and possibly having changed ownership.

Firefighting stage

What did we, the programmers and QAs, see when things started to go in the wrong direction?

As mentioned before, the performance tests were returning poor results. We had a so-called “performance environment” that closely mirrored “production” in terms of configuration, so the expectation was that whatever performance we achieved during testing would closely match the “real thing”.

The metric that fell short of its target was Queries Per Second (QPS), a figure that had been previously agreed with the customer in our engagement contract. From a qualitative point of view, website pages were slow to load and timeouts were frequent. Simply navigating the site was painful: sluggish and unpredictable.

We were better software engineers than lawyers (oh, the irony!), so all we could do was figure out improvements to the software stack:

  • The backend framework was Python Tornado, so we thought a good solution could be to transform all “sync” operations into “async” (a minimal sketch of this kind of conversion follows the list). We were chasing possible, but never properly identified, outliers that may have been responsible for blocking the main server thread. The improvement itself was inconclusive: some tasks behaved better and some actually did worse.

  • The database was MongoDB, so we also spent a lot of time improving queries and removing unneeded indexes (a sketch of that kind of check also follows the list). All this, again, brought no meaningful improvement.
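
For what it’s worth, here is a minimal sketch of the kind of “sync to async” conversion we attempted, not the actual project code: the handler, route and endpoint names are hypothetical, and it assumes a recent Tornado where handlers can be declared as coroutines.

    # A sketch of turning a blocking Tornado handler into a non-blocking one.
    # Handler, route and endpoint names are hypothetical placeholders.
    import tornado.ioloop
    import tornado.web
    from tornado.httpclient import AsyncHTTPClient


    class BannerHandler(tornado.web.RequestHandler):
        async def get(self):
            # The blocking version used tornado.httpclient.HTTPClient().fetch(),
            # which stalls the single IOLoop thread for the whole request.
            # The async client yields control back to the IOLoop while waiting.
            response = await AsyncHTTPClient().fetch("http://ads.internal/banners")
            self.write(response.body)


    def make_app():
        return tornado.web.Application([(r"/banners", BannerHandler)])


    if __name__ == "__main__":
        make_app().listen(8888)
        tornado.ioloop.IOLoop.current().start()

On the MongoDB side, the checks looked roughly like the sketch below, again with hypothetical database, collection and field names: explain() tells you whether a query walks an index or scans the whole collection, and unused indexes only cost RAM and write throughput.

    # A sketch of checking whether a query actually uses an index.
    # Database, collection and field names are hypothetical placeholders.
    from pymongo import MongoClient, ASCENDING

    client = MongoClient("mongodb://localhost:27017")
    db = client["adplatform"]

    # "COLLSCAN" in the winning plan means a full collection scan,
    # "IXSCAN" means an index was used.
    plan = db.impressions.find({"campaign_id": 42}).explain()
    print(plan["queryPlanner"]["winningPlan"])

    # Add the index the hot query needs; drop the ones nothing uses.
    db.impressions.create_index([("campaign_id", ASCENDING)])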

By the time the customer took the project away from us, we were all tired and frustrated.

What went wrong?

I have to say that it took me around a year to realise what went wrong; no, it wasn’t the software stack; yes, it all had to do with their “Cloud” (AWS) setup. I reached this conclusion after starting to work with AWS in my daily role, supplemented by the formal preparation for the AWS certification I am currently pursuing.

Long story short, the customer made full use of the AWS free tier, using t2.micro instances for development and testing and (most likely) t2.small or t2.large for the MongoDB installation. I forget what the production web instance setup was – most likely we did not have a proper load balancer, so the load may have been “automagically distributed” with DNS round robin, i.e. every instance got its own “elastic IP” and Route 53 dealt with the balancing. Regardless of how this load balancing was configured at the time, all frontend instances were using a single database node as their backend.

Let me put down a detailed explanation:

  • “T” instances are marketed as general purpose; their baseline performance is limited, with bursts above it allowed by a system of credit accrual: credits are earned while the instance is idle and spent during load spikes. This works well for many non-production scenarios.

  • The block storage (EBS) attached to such instances: when configuring new instances, if the storage settings are left at their defaults (general purpose SSD, i.e. gp2), the volumes use a similar credit system for I/O. This means that, when the storage is accessed from an EC2 instance as part of normal operation, the baseline IOPS (input/output operations per second) is limited by Amazon (detailed explanation) to a value well below the expected performance of the underlying hardware.

  • The performance tests routinely exhausted all previously accrued EBS credits within the first few minutes; this meant that storage performance got capped at a low baseline. It was visible on the database instance as high I/O wait (the wa field in the output of the top command), and both credit balances could have been read straight from CloudWatch (see the sketch after this list). Our limited knowledge made us believe the root cause was poor coding within the software itself, rather than a limitation enforced by AWS.

  • MongoDB has an unpredictable storage access pattern by design, so it’s very likely there was no “burst capacity” (“accrued credits”, earned through idle time) available to begin with. We frequently experienced sluggish performance during our daily work, without a discernible cause.
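
Had we known where to look, the draining credits were plainly visible in CloudWatch. Below is a minimal sketch of how they could have been checked, assuming boto3 credentials with read access to the account; the instance and volume IDs are hypothetical placeholders.

    # A sketch of reading burstable-instance and EBS credit balances from
    # CloudWatch. Instance and volume IDs are hypothetical placeholders.
    from datetime import datetime, timedelta

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


    def latest_average(namespace, metric, dimensions):
        """Return the most recent 5-minute average for a CloudWatch metric."""
        stats = cloudwatch.get_metric_statistics(
            Namespace=namespace,
            MetricName=metric,
            Dimensions=dimensions,
            StartTime=datetime.utcnow() - timedelta(hours=1),
            EndTime=datetime.utcnow(),
            Period=300,
            Statistics=["Average"],
        )
        points = sorted(stats["Datapoints"], key=lambda p: p["Timestamp"])
        return points[-1]["Average"] if points else None


    # CPU credits left on a t2 instance: near zero means it is pinned to baseline.
    print(latest_average("AWS/EC2", "CPUCreditBalance",
                         [{"Name": "InstanceId", "Value": "i-0123456789abcdef0"}]))

    # Burst balance (%) of a gp2 volume: near zero means IOPS are capped at baseline.
    print(latest_average("AWS/EBS", "BurstBalance",
                         [{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}]))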

What could have saved the day?

The thing is, there were two things that might have helped:

  1. Setting up MongoDB shards on every frontend instance: this may have helped distribute I/O operations throughout the entire cluster and would very likely have mitigated the “single point of failure” database design (see the first sketch after this list). On the flip side, it might have caused memory swapping on the t2.micro instances, nullifying any potential benefit.

  2. Using Provisioned IOPS for the EBS storage: our non-existent DevOps person could have kept an eye on the “average queue length” metric and re-created the volume with a higher (or lower) IOPS provisioning in order to better balance the performance and cost associated with it (see the second sketch after this list).
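
For the first option, here is a minimal sketch of what enabling sharding looks like from the application side, assuming a mongos query router and config servers are already provisioned; the host, database and collection names are hypothetical.

    # A sketch of sharding the hottest collection through a mongos router.
    # Host, database and collection names are hypothetical placeholders;
    # the config servers and shard members must already exist.
    from pymongo import MongoClient

    # Connect to the mongos query router, not to an individual shard.
    client = MongoClient("mongodb://mongos.internal:27017")
    db = client["adplatform"]

    # A hashed shard key spreads writes evenly across the shards.
    db.impressions.create_index([("campaign_id", "hashed")])

    client.admin.command("enableSharding", "adplatform")
    client.admin.command("shardCollection", "adplatform.impressions",
                         key={"campaign_id": "hashed"})

For the second option, a sketch of provisioning an io1 volume with guaranteed IOPS instead of relying on gp2 burst credits; the size, IOPS and availability zone values are hypothetical and would need tuning against the observed queue length.

    # A sketch of creating a Provisioned IOPS (io1) EBS volume with boto3.
    # Size, IOPS and availability zone are hypothetical placeholders.
    import boto3

    ec2 = boto3.client("ec2", region_name="us-east-1")

    volume = ec2.create_volume(
        AvailabilityZone="us-east-1a",
        Size=200,          # GiB
        VolumeType="io1",  # provisioned-IOPS SSD, no burst bucket
        Iops=3000,         # sustained rate the volume must deliver
    )
    print(volume["VolumeId"])

    # If the "VolumeQueueLength" CloudWatch metric keeps climbing, provision
    # more IOPS; if it sits near zero, a cheaper setting will do.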

Neither of these things ever happened. Sharding was only mentioned when we were already racing to the bottom, at a point when it was too late to do anything meaningful.

Conclusion

When going Cloud, get yourself a Cloud Architect on the team. At the end of the day, this may be the single biggest difference between success and failure. And also – the “free tier” of any Cloud provider is free for a reason, and that reason may be less obvious than simply offering a free trial of their services.


Later Edit: two days after writing this text, I passed the certification exam for AWS CSA(A).

