Baking a cake while juggling chainsaws

Our value proposition is simple: Work efficiently, sleep peacefully. It’s our mission to free our customers of all things web operations so that they can focus on running their business. But last year I had to wake up to the nasty realisation that we were not delivering on that promise. In this post, I’m going to tell you how we failed at managing our work, what the consequences were and how we now handle heaps of work in a small team.

Delivery failed

Building a Platform-as-a-Service is like baking a cake — or for some critical components, more like baking a soufflé. It requires focus, skills and the right ingredients. But apply them in the wrong way and the result is an ugly mess.

One of the great things about my job is that skills and ingredients are pretty much non-issues for us. Our tiny web operations team has a combined total of more than 30,000 hours of experience in running business-critical IT systems, and the amazing Open Source community provides us with all the sophisticated technology we need. Our key weakness was the focus part. Last year, we started all kinds of projects and initiatives. But we didn’t get to ship half of them.

This was in part caused by the difference between web development and web operations — the amount of unplanned work. Smaller and bigger outages, support requests from customers and urgent security updates keep popping up day and night. Handle them poorly and it will hurt; it’s a bit like juggling chainsaws. This is exactly the reason why using freistilbox as their hosting platform makes so much sense to our customers. Isn’t it ironic that what’s at the core of our business model almost broke us?

Late last year, the fact that our workload had gone out of control became obvious from its effects. Team members started showing signs of burnout. In my 1:1 talks, discussion of frustration, fatigue and the feeling of letting everyone down became more frequent. And the results from the NPS surveys we had recently added to our user dashboard confirmed that our customers weren’t particularly happy either.

Not until team members started to take time off because they couldn’t cope with the pressure anymore did I realise that I had made a grave leadership mistake: I had delegated all of IT operations to our engineering team without taking into account that managing work (and workers) requires a completely different skill set than just doing the work.

The success of a CEO is measured by the results delivered. We were clearly not delivering, so it was time for me to do something about it. I spent a weekend re-reading The Phoenix Project and found uncanny similarities between its story and our situation:

  • We had team members actively taking on too much work.
  • When work had to be done on a certain piece of infrastructure, the team member who was most familiar with it instantly became a bottleneck.
  • Unplanned work kept disrupting the little flow we struggled to achieve.
  • We had not nearly enough visibility of what work was currently in progress.
  • We also did not have a clear way of deciding what tasks we would tackle next.

The solution to these problems needed to have three parts:

  1. a way to easily prioritise our business projects and internal initiatives,
  2. a method for visualising and limiting the full amount of work in progress (WIP), and
  3. an approach to limit the effects of unplanned work on the flow of our work.

Yes we Kanban

Turns out we already knew the solution to item 2! Someone had even given a talk about it (in German) at OSDC a few years ago.

Until last year, we used just a simple segmented list in our task management software. While we had used the Kanban Method for a while, we abandoned it because we felt it didn’t yield the effects we had expected. I know now that this was because we had only applied it to planned work. We now have a real Kanban board again and this time, we also put production issues and high-maintenance support requests on it. Now all our WIP is visible and easy to follow.

We’ve put in place a simple but significant rule: Cards can only move in one direction — to the right. If a project gets stuck, it becomes an issue for the whole team. A sign that I spotted in a coworking space in Dublin says it perfectly:

We’d like some RICE with that

Prioritising projects had never been a particular strength of our team, myself included. In order to make sure that the cards on our new Kanban board would always reflect our most important projects, I looked for a solid and repeatable process to identify them.

As Ben Finn explained in his recent presentation at DrupalCamp London, the importance of a project is proportional to its business value. His talk confirmed what I had learned from the article RICE: Simple prioritization for product managers late last year. The geniuses at Intercom built a simple formula that allows them to quickly assess the relative importance of a new feature, and I’m forever grateful they’ve shared it with the world. Based on their approach, I customised their RICE factors for our internal initiatives and our business projects. This is how we now assign weight to behind-the-scenes changes:

Relevance x Improvement x Confidence / Effort

Relevance is a value for the class of issue that needs to be solved, in order of decreasing weight:

  • eliminates high risk
  • eliminates bottleneck in the team
  • eliminates medium risk
  • increases efficiency

Improvement represents the impact a change will have:

  • fundamental change
  • major change
  • incremental change

Effort is the estimated number of person-days it will take us to implement the change. Finally, the purpose of the Confidence factor is, as in the original Intercom formula, to reduce the weight of projects that haven’t been properly thought through yet; its value is 100%, 80% or 50%.
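To make the arithmetic concrete, here is a minimal Python sketch of this scoring. Please read it as an illustration only: the post gives the order of the Relevance and Improvement categories but not their numbers, so the weights below are assumptions, and the example initiative is made up.

```python
from dataclasses import dataclass

# Assumed weights: only the ordering of the categories comes from the post,
# the numbers themselves are illustrative.
RELEVANCE = {
    "eliminates high risk": 4,
    "eliminates bottleneck in the team": 3,
    "eliminates medium risk": 2,
    "increases efficiency": 1,
}
IMPROVEMENT = {
    "fundamental change": 3,
    "major change": 2,
    "incremental change": 1,
}
CONFIDENCE_LEVELS = (1.0, 0.8, 0.5)  # 100%, 80% or 50%


@dataclass
class Initiative:
    name: str
    relevance: str       # a key of RELEVANCE
    improvement: str     # a key of IMPROVEMENT
    confidence: float    # one of CONFIDENCE_LEVELS
    effort_days: float   # estimated person-days


def rice_score(item: Initiative) -> float:
    """Relevance x Improvement x Confidence / Effort."""
    if item.confidence not in CONFIDENCE_LEVELS:
        raise ValueError("confidence must be 1.0, 0.8 or 0.5")
    if item.effort_days <= 0:
        raise ValueError("effort must be a positive number of person-days")
    return (
        RELEVANCE[item.relevance]
        * IMPROVEMENT[item.improvement]
        * item.confidence
        / item.effort_days
    )


# Hypothetical initiative: 4 * 2 * 0.8 / 4 = 1.6
example = Initiative(
    "Automate database failover", "eliminates high risk", "major change", 0.8, 4
)
print(f"{example.name}: {rice_score(example):.2f}")
```

The absolute number means little on its own; what matters is how the scores of different initiatives compare under the same weights.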

For business projects, new feature projects in particular, we use a slight variation:

Reach x Impact x Confidence / Effort

In this variant, Reach represents the percentage of customers we’re going to affect with a new feature. Impact stands for its significance for these customers:

  • massive
  • high
  • medium
  • low

Now, every new feature idea and proposed infrastructure change gets added to a spreadsheet where it is ranked according to its RICE values. Each week, we move the top projects of this list into the Backlog column of our Kanban board. It amazes me to no end how little effort is really necessary to ensure that our team always works on the most effective projects!
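The weekly ranking step can be sketched in a few lines of Python. This is not our actual spreadsheet: the Impact weights are an assumption (the post only names the four levels), the features are placeholders, and the number of projects moved per week is a made-up parameter.

```python
from typing import List, NamedTuple

# Assumed Impact weights; only the level names come from the post.
IMPACT = {"massive": 3.0, "high": 2.0, "medium": 1.0, "low": 0.5}


class Feature(NamedTuple):
    name: str
    reach_percent: float  # share of customers affected, 0 to 100
    impact: str           # a key of IMPACT
    confidence: float     # 1.0, 0.8 or 0.5
    effort_days: float    # estimated person-days


def feature_score(feature: Feature) -> float:
    """Reach x Impact x Confidence / Effort."""
    return (
        feature.reach_percent
        * IMPACT[feature.impact]
        * feature.confidence
        / feature.effort_days
    )


def top_projects(features: List[Feature], count: int) -> List[Feature]:
    """Return the `count` highest-scoring features, best first."""
    return sorted(features, key=feature_score, reverse=True)[:count]


# Placeholder candidates, not real projects.
candidates = [
    Feature("Feature A", 80, "medium", 1.0, 10),  # 80 * 1 * 1.0 / 10
    Feature("Feature B", 20, "massive", 0.8, 4),  # 20 * 3 * 0.8 / 4
    Feature("Feature C", 50, "high", 0.5, 20),    # 50 * 2 * 0.5 / 20
]

# Hypothetical weekly cut-off: move the two best candidates to the Backlog.
for feature in top_projects(candidates, 2):
    print(feature.name)
```

Note how Feature B ranks first although it reaches the fewest customers: high impact and low effort outweigh its small reach.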

Enter the Libero

There are tasks that don’t require a decision about their priority:

  • Production issues, outages
  • Urgent support requests
  • Customer orders

These types of tasks get added to the Kanban board automatically via a Zapier integration. For example, as soon as we assign a support request the priority level high in our ticket system, a card linked to the ticket appears on the board and gets fast-tracked. The same happens with the service provisioning tasks we add to a customer order project. This makes unplanned work visible, including its impact on cycle time (the time it takes us to finish a project).
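The routing rule itself is simple enough to write down. The Python sketch below only models the decision described above; it is not the Zapier integration, and the names (Kind, Task, route and the destination labels) are hypothetical. Where non-urgent support requests and new ideas end up is my reading of the process, so treat those two branches as assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PRODUCTION_ISSUE = "production issue"
    SUPPORT_REQUEST = "support request"
    CUSTOMER_ORDER = "customer order"
    PROJECT_IDEA = "project idea"


# Hypothetical destination labels.
FAST_TRACK = "kanban board (fast track)"
TICKET_SYSTEM = "ticket system"
RICE_SPREADSHEET = "RICE spreadsheet"


@dataclass
class Task:
    title: str
    kind: Kind
    priority: str = "normal"  # as set in the ticket system


def route(task: Task) -> str:
    """Decide where a new task goes without a prioritisation discussion."""
    if task.kind in (Kind.PRODUCTION_ISSUE, Kind.CUSTOMER_ORDER):
        return FAST_TRACK
    if task.kind is Kind.SUPPORT_REQUEST:
        # Only requests marked "high" get a card on the board.
        return FAST_TRACK if task.priority == "high" else TICKET_SYSTEM
    # Everything else competes for a slot via its RICE score.
    return RICE_SPREADSHEET


print(route(Task("Website unreachable", Kind.PRODUCTION_ISSUE)))
print(route(Task("Question about invoices", Kind.SUPPORT_REQUEST)))
print(route(Task("Customer cannot deploy", Kind.SUPPORT_REQUEST, "high")))
```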

That was definitely an improvement already. But how were we to deal with the fact that both outages and high-touch customer care cause delays for our planned projects? Our approach at the time, trying to distribute interruptions evenly in the team, prevented everyone from experiencing flow, from being in the zone.

“In a state of flow, however, maintaining focused attention on these absorbing activities requires no exertion of self-control, thereby freeing resources to be directed to the task at hand.” (Daniel Kahneman, Thinking, Fast and Slow)

We found a way to focus on our cake without getting maimed by dropping chainsaws. Taking inspiration from the Spotify engineering team, we created a new role in our web operations team: The Libero or sweeper. We already had a special role for on-call; 24/7 pager duty rotates on a weekly basis. With the Libero, we added another rotation in which a member of the web ops team deals exclusively with all the unplanned work that comes up during business hours, mainly monitoring alerts and support requests. This new role has many beneficial effects:

  • Having a dedicated person focus on catching unplanned work as it comes up reduced our support response times significantly.
  • Since everyone else gets to work on projects without being distracted or interrupted, our project velocity improved immediately.
  • And so did everyone’s mood — suddenly we’re getting things done in a consistent manner!
  • During the times when the hosting platform runs smoothly (i.e., most of the time 😉) and support requests are few and far between, the Libero enjoys complete freedom in how they spend their time. Finally, there’s time to fix the spice rack, or to read this blog series about Docker, or to watch another episode of Black Mirror!

Stop starting. Start finishing.

That’s our new motto, and it works. Boy, does it work! I was overwhelmed by the praise I got from the whole web operations team almost immediately after we made the changes I described above. Even now, a few months later, there’s not a single weekly retrospective without a mention of how much more focused and effective our work is than last year.

Our new process isn’t perfect (probably far from that) and we know it. That’s why we’ve adopted the lean approach of experiments and continuous improvement (kaizen). For example, we found that running the Libero rotation in lockstep with our on-call rotation (from Thursday over the weekend to the next Thursday) felt a bit long. We’ve changed it to a business week rotation from Monday to Friday.
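For illustration, the two rotations can be expressed as a small Python sketch. The names, the rotation orders and the reference date are all made up, and the sketch ignores the time of day at which a shift is handed over.

```python
from datetime import date, timedelta
from typing import Optional

# Hypothetical team and rotation orders.
LIBERO_ORDER = ["Alice", "Bob", "Carol"]
ON_CALL_ORDER = ["Carol", "Alice", "Bob"]

# Hypothetical reference point: Monday, 1 January 2024.
ROTATION_START = date(2024, 1, 1)
FIRST_ON_CALL_THURSDAY = ROTATION_START + timedelta(days=3)


def libero_on(day: date) -> Optional[str]:
    """Libero shifts cover the business week, Monday to Friday."""
    if day.weekday() >= 5:  # Saturday or Sunday: no Libero
        return None
    week_index = (day - ROTATION_START).days // 7
    return LIBERO_ORDER[week_index % len(LIBERO_ORDER)]


def on_call_on(day: date) -> str:
    """On-call shifts run from Thursday to the following Thursday."""
    week_index = (day - FIRST_ON_CALL_THURSDAY).days // 7
    return ON_CALL_ORDER[week_index % len(ON_CALL_ORDER)]


friday = date(2024, 1, 5)
print("Libero:", libero_on(friday))
print("On call:", on_call_on(friday))
```

Keeping the two rotations in separate lists makes it easy to arrange them so that nobody holds both roles in the same week.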

When I think back, it still dismays me that it took me far too long to introduce these changes, but gosh, am I happy that we made them! Both the feedback we get from our customers and the general level of happiness in the team are clear proof that we’re heading in the right direction again.

Did you like this post? Would you like to know more about our lean web operations process? Give us a shout on Twitter!