A lesson on failure: “Going Cloud”

More than a year ago, at a time when I barely knew anything about Cloud Computing or AWS, I was assigned, along with a couple of colleagues, to bring an existing code base from “alpha” to “production” and ensure a smooth deployment to the Amazon Cloud. The customer wanted to “go live” in less than 3 months and be able to handle tens of thousands of visitors who would click on banners and fill their bank accounts; most likely they were just hoping for a good exit. On a side note, one of the photoplasty sections of the cracked.com website has an image about this type of business.

Starting the project

Things initially went in the right direction – we dealt with many functionality issues, fixing and testing more than 100 bugs and glitches; after all, this was what we knew best how to do, and we also put in the long hours required to get things done. We weren’t bothered by the cloud setup issues – the customer fiercely guarded the “keys to the kingdom” and agreed to instance and resource set-ups only on a case-by-case basis, all with the desire of keeping the Amazon bill as low as possible. We thought this was fine – it was their home, and they knew best what they needed.

Everything went fine from our perspective up to the point of performance testing; then everything fell apart, and one month later the project was taken out of our hands. The customer chose different contractors to continue the project, and I’m not sure of the particular details of how things went from that moment on; as of now the site is live but not serving any interactive content (link if you’re curious).

Firefighting

What did we, the programmers and QAs, see when things started to go in the wrong direction?

As mentioned before, the performance tests were returning poor results; the so-called performance environment, which closely emulated production, consistently fell short of the QPS target agreed with the customer. Pages loaded slowly and sometimes timed out. The overall site navigation was sluggish and unpredictable. Knowing the software stack best, we turned to looking for improvements in the software itself:

  • The framework being Python Tornado, we spent a lot of time transforming all the “sync” operations into “async” ones, thinking that some outliers might be blocking the main server thread (see the sketch after this list). The improvement was inconclusive, some tasks behaving better, some worse.

  • The database being MongoDB, we also spent a lot of time improving queries and removing unneeded indexes. Again, the improvement was inconclusive.
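
As an illustration of the kind of change we were making, here is a minimal sketch of moving a blocking call off Tornado’s IO loop; the handler, the helper function and the thread pool size are hypothetical, not taken from the actual code base:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import tornado.ioloop
import tornado.web

# Thread pool used to keep blocking work off Tornado's single IO loop thread.
# The pool size is an arbitrary illustrative choice.
executor = ThreadPoolExecutor(max_workers=4)


def fetch_report_blocking(user_id):
    # Stand-in for a slow synchronous operation (blocking DB query, HTTP call, ...).
    time.sleep(0.5)
    return {"user": user_id, "status": "ok"}


class ReportHandler(tornado.web.RequestHandler):
    async def get(self, user_id):
        # Run the blocking helper on the executor so the IO loop keeps serving
        # other requests while this one waits.
        report = await tornado.ioloop.IOLoop.current().run_in_executor(
            executor, fetch_report_blocking, user_id
        )
        self.write(report)


if __name__ == "__main__":
    app = tornado.web.Application([(r"/report/(\w+)", ReportHandler)])
    app.listen(8888)
    tornado.ioloop.IOLoop.current().start()
```

In hindsight, no amount of this kind of refactoring could have compensated for a storage layer that had run out of I/O credits.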

By the time the customer took the project away from us, we were all tired and frustrated.

What went wrong?

It took me about a year to actually figure out what went wrong: it was the AWS setup. This conclusion came as a consequence of the AWS experience I kept gathering during this time, supplemented by the formal preparation for the certification I am currently pursuing.

The customer made full use of the free tier, using t2.micro instances for development and testing and (most likely) t2.small or t2.large for the MongoDB installation. I forget what the production setup of the web instances was – whether a load balancer was used or whether every instance got its own Elastic IP with Route 53 handling the balancing – it matters little at this point; the terrible architectural detail was that all these frontend instances were using a single database node as their backend.

Let me put down a detailed explanation:

  • “T” instances are marketed as general purpose; their baseline CPU performance is limited, with bursts above that baseline allowed by a system of credit accrual – credits are earned while the instance is idle and spent during spikes. This works well for most non-production scenarios.

  • The instance storage (EBS): when the default General Purpose (gp2) volumes are chosen, a similar credit system applies – the baseline IOPS (input/output operations per second) is proportional to the volume size, and bursts above it draw from an I/O credit balance (detailed explanation).

  • The performance tests routinely exhausted all the previously accrued EBS credits within the first few minutes, and storage performance was immediately capped at the low baseline. This was actually visible on the database instance as high I/O wait (the wa field in top), but with our limited knowledge at the time we attributed it to poor coding within the software itself (see the monitoring sketch after this list).

  • MongoDB has, by design, an unpredictable storage access pattern, and in our setup it caused sluggish performance throughout the day without a cause we could discern at the time.
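
A few minutes spent in CloudWatch would have made this visible. Here is a minimal sketch, assuming boto3 with configured credentials and using placeholder region, instance and volume IDs, of how the CPU credit balance of a t2 instance and the burst balance of a gp2 volume could be checked during a load test:

```python
import datetime

import boto3  # assumes AWS credentials are already configured

# The region and the resource IDs below are placeholders for illustration.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")


def latest_average(namespace, metric_name, dimension_name, dimension_value):
    """Return the most recent 5-minute average of a single CloudWatch metric."""
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=1)
    response = cloudwatch.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        Dimensions=[{"Name": dimension_name, "Value": dimension_value}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Average"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return datapoints[-1]["Average"] if datapoints else None


print("CPU credits left:",
      latest_average("AWS/EC2", "CPUCreditBalance", "InstanceId", "i-0123456789abcdef0"))
print("EBS burst balance (%):",
      latest_average("AWS/EBS", "BurstBalance", "VolumeId", "vol-0123456789abcdef0"))
```

A burst balance pinned at zero on the database volume during the test would have pointed straight at the storage layer instead of at the application code.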

What could have saved the day?

There were actually two things that might have helped:

  1. MongoDB shards on every frontend instance – this would have helped distribute the I/O operations across the entire cluster and most likely avoided the “single point of failure” design we were stuck with (see the sketch after this list).

  2. Using Provisioned IOPS for the EBS storage. One could then monitor the “average queue length” metric (VolumeQueueLength in CloudWatch) and re-create the volume with a higher (or lower) IOPS value in order to better manage both the performance and the costs associated with the volume.
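
For illustration only, here is a minimal sketch of how sharding the hot collection could have been enabled with pymongo once a mongos router and the shard replica sets were in place; the host name, database name, collection name and shard key are all assumptions, not details of the actual project:

```python
from pymongo import MongoClient

# Connect to the mongos query router, not to an individual shard.
# The host name below is a placeholder.
client = MongoClient("mongodb://mongos.internal:27017")

# Enable sharding for the (hypothetical) application database.
client.admin.command("enableSharding", "adserver")

# Shard the busiest collection on a hashed key so reads and writes are
# spread across all shards instead of hammering a single database node.
client.admin.command(
    "shardCollection",
    "adserver.impressions",
    key={"visitor_id": "hashed"},
)
```

Even with sharding in place, each shard would still have needed enough IOPS headroom on its own volume, which is where the second option – Provisioned IOPS – would have come in.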

None of these things happened. Sharding was mentioned when we were already racing to the bottom, when it was already too late to save face.

Conclusion

When going Cloud, get yourself a Cloud Architect on the team. At the end of the day, this may be the single biggest difference between success and failure.


Later Edit: two days after writing this text, I passed the certification exam for AWS CSA(A).

