When there's no route forward (or so you fear)

Posted on November 14, 2016, 8:08 pm by Dan

Note: This post is about project work at one of my previous employers.

Introduction

These days everybody talks about Agile, Automation, DevOps and Continuous (whatever), without truly understanding why things have gone in this direction. After all, for many years a software project had a few well-known steps that needed to be followed, like:

Full, thorough planning at the very beginning and from time to time, before significant milestones;
Development, lots of development behind closed doors;
Lots of manual QA work, little automation with some custom-written testing framework written from scratch by one of the developers;
Infrequent releases (e.g. every year or even every other year or so); releases were thoroughly prepared and tested, with code freezes for (sometimes) months before the Day.

Even if everybody knows these days that such an approach may have been a bad way of doing things, it actually worked for many years because that was the way the world expected things to work. There were many constraints, e.g.:

hardware limits (e.g. multi-core CPUs first reached the consumer market about 10 years ago);
Internet connectivity was not always something one could rely on – I personally was one of the early adopters of broadband in my country, 14 years ago, when broadband meant about 3x the speed of dial-up: around 16 kilobytes/s transfer rate;
getting various testing environments up & running meant lots of manual provisioning work (yes, installing from CDs), but only if the hardware expenses got through the management approvals;
virtualization came around more than 15 years ago, but until recently setting virtual machines up was not much easier than configuring physical machines (also the performance was not something to write home about).

That was the way things were done for years, not only on our particular project, but on all or most projects in the company. Nobody even cared about changing anything in our process until after we went live; that was the point at which things started to go downhill.

Disaster?

For about 2 years things went from bad to worse, escalating to an almost full-fledged war between the development and the operations teams, with the management trying various solutions that alienated people on both sides. By the end, most members of both teams had left for other companies. Sounds bad? It was.

Looking back from a couple of years away, things seem clearer: yes, it was unavoidable; yes, it was salvageable.

So what happened?

We spent long days, 3-4 in a row, discussing features for our product before every major milestone. We, the devs, implemented many, if not most, of them. Roughly one year later, when the updated product got released, the stripes on the zebra became visible. Some features were not actually needed: the ops had requested them because they had no idea how the whole product would work. Others were buggy, but not because of careless implementation: the product behaved the way the developer had thought about it, not the way the operations person expected.
We thought the lack of QA time was to blame for the buggy releases and tried to roll in some automation. This took the form of a full-fledged high-level testing engine (e.g. running a command line and expecting a response). Unfortunately there was no way to fully cover the complexity of the product, and by the time the engine was roughly complete, the product had already moved forward. At some point most of the QA time was being spent maintaining this automation engine instead of testing newer features.
We had multiple performance issues that we initially took the blame for, even though later root cause analysis pointed to things such as reporting scripts that dumped entire databases at once, or poorly configured monitoring that amounted to something like a denial of service, all installed by different people in the operations team without properly informing anybody.
At the most heated point of the conflict, e-mails were exchanged daily claiming that we were providing buggy software and could not be relied on; at the same time we were helping fight fires and were growing increasingly frustrated, as we thought we were really trying to do the best job we could, but to no avail. A couple of years down the road I see that everybody was right, but at that time there were no fast or easy ways to make peace.
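For readers who have not seen this style of testing, the "run a command line and expect a response" idea can be sketched roughly as below. This is only an illustration and not our actual engine: the `check_command` helper and the "status: ok" output are hypothetical, and the Python interpreter stands in for the product's command-line tool.

```python
import subprocess
import sys


def check_command(argv, expected_substring, timeout=10):
    """Run a command and report whether its output contains the expected text.

    Returns a (passed, output) tuple so the caller can log the output on failure.
    """
    result = subprocess.run(
        argv,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    output = result.stdout + result.stderr
    # The check passes only if the command exited cleanly AND said what we expected.
    passed = result.returncode == 0 and expected_substring in output
    return passed, output


if __name__ == "__main__":
    # Stand-in for the product's CLI: the Python interpreter printing a status line.
    passed, output = check_command(
        [sys.executable, "-c", "print('status: ok')"],
        "status: ok",
    )
    print("PASS" if passed else "FAIL", output.strip())
```

Every expected string in a check like this is tied to the current behaviour of the product, which is one reason such a suite needs constant maintenance as the product moves on.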

Management response

At some point the management stepped in and ended up making the situation even worse. They thought that the main cause of the slow and buggy releases lay in the project management process, so we had to switch from “Waterfall” to Agile, with Sprints and regular Scrum meetings.

This truly was a step in the right direction, but the actual implementation was terrible. We, a team in firefighting mode, constantly reviewing, fixing things or providing workarounds, ended up spending 1 hour or more every day in stand-ups that brought little value. As the Development team got smaller (people started to leave), the “new normal” began to settle in.

So what did Agile do to us?

We had Sprints, but the project workload did not actually change. Maybe it helped a bit with setting priorities for tasks, but the tasks still had to wait until a developer became available. The backlog kept on growing, though.
We did not have more feature releases than before as the Operations team “could not handle them”. The idea was to have a deliverable at the end of each Sprint, but this did not work out. We did provide frequent emergency fixes, not related to any Sprint timing.
We still had those week-long planning sessions with discussions on tasks to be implemented during the following year, practically defeating the entire Agile philosophy.

Yes: it did not do much for us, given the way we did it and how we continued to interact with the Operations team. At that time the Development team received the switch to Agile as a not-so-well-deserved punishment, and most people handed in their resignations within the following year.

In an alternate reality

What could have been done? For me, right now, the answer is clear: we should have switched to DevOps. There are many “should”-s and “could”-s that were achievable at that time with the right mindset:

Integrate the Development and the Operations teams in order to have common ownership of the Production installation; yes, the “stay in the same office” and “have the same manager” type of integration;
Have a Continuous Integration pipeline in place for the project in order to free up QA time;
Have a fast and frequent deployment process, obviously employing automation;
Move from manually configured servers with some automation to full-fledged configuration management.
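To make the last point a bit more concrete: the core idea behind configuration management is to describe the desired state and apply it idempotently, so that running the same step twice changes nothing the second time. Below is a minimal sketch of that idea only. The `ensure_line` helper, the `app.conf` file and the `max_connections` setting are all hypothetical, and a real configuration management tool does far more than this.

```python
import os
import tempfile


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed, False if it already matched.
    """
    existing = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            existing = handle.read().splitlines()
    if line in existing:
        return False  # desired state already reached, nothing to do
    existing.append(line)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        config = os.path.join(workdir, "app.conf")  # hypothetical config file
        print(ensure_line(config, "max_connections = 100"))  # first run changes the file
        print(ensure_line(config, "max_connections = 100"))  # second run is a no-op
```

The returned "changed" flag is what lets such a step be re-run safely on every server, instead of relying on somebody remembering what was configured by hand.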

Yes, we could have saved the day, if only the management had made the right decisions for the project. Unfortunately everybody got into “saving face” and “shifting blame” modes; it’s also very likely they simply weren’t aware of any better ways of doing things.

We, the Developers, were convinced to the very last day of our employment that writing the best possible code was the key to getting a good product out of the door. We also thought that even if we somehow did our jobs properly, at the end of the day we might be sabotaged by the Management and the Operations team, so there really was no way of winning this fight. Unfortunately, as I learned a few years down the road, good code can only get you so far.

That’s it for today, thank you for reading!

