Chaos Engineering

Delivering value to end users means we need systems that are useful, secure, and resistant to the shark-filled acid bath of the cloud environment. Engineering for chaos is crucial.

There are many resources that discuss the architectural concerns of resiliency engineering: circuit breakers to gracefully handle the instability of our dependencies, incrementally degrading features when dependencies are offline, switching load seamlessly between regions if one goes down or SLOs are breached, redundant data persistence, idempotent messages, and so on. While we may be able to predict many failures and make them part of the initial design, we won’t predict every failure. It takes a resilient product team to deliver resilient solutions.

What is a resilient team? It is a team that can continue to deliver value while dealing with the chaos of the real world. One of the principal practices of DevOps is that the team contains all of the capabilities required to deliver and operate their product. However, it’s possible to assemble a cross-functional team that’s brittle and will fall apart as soon as anything unexpected happens. Here are a few team anti-patterns that will put your ability to meet business goals at risk.

Individual Ownership

This anti-pattern is where a developer is the primary owner of some part of the system. This could be caused by a manager looking for ways to highlight someone for a promotion, someone joining the team who brings a new component with them, someone having expertise in a particular framework, or many other reasons. It’s tempting to do this. If we have a person who is an expert at something and we have deadlines, the obvious solution is to assign work based on expertise. This creates several large problems.

  • What is the quality process? If only a small subset of the team understands something, how will the rest of the team review for quality?
  • Delivery slows because no one else really understands the code or the business problem it solves. No one on the team can help without being onboarded first.
  • When the “owner” goes on vacation, has a family emergency, or wins the lottery, how will development continue in their absence?

Tactical decisions based on best-case scenarios and hope can succeed as long as nothing goes wrong. Business strategy requires embracing reality. Terminate knowledge silos with extreme prejudice.

The Change Agent

Improvement rarely starts as an overall team effort. However, improvement of daily work is a business deliverable. If we are not continuously improving, we are continuously degrading. There is no steady-state position. If we have a person on the team who is passionate about improving things, we cannot depend on that passion long-term as the driver of improvement.

  • Passion burns out if there is no motion or active support.
  • As in the “single owner” anti-pattern, the improvement will depend on the presence of that contributor. What happens when they seek greener pastures?

Leverage the passion for improvement to infect the team, but embed it into the team’s culture by incentivizing it as we would any other value delivery. Grow more change agents.

Assigned Roles

Assigned roles and titles are awesome for HR, but they do not add value to the business. Product teams have one role: deliver business value and continuously improve how we do that.

  • Assigning functional roles on the team leads to “not my job” and extended wait times while hand-offs occur within the team.
  • If a link in the chain is broken due to the chaos of the real world, then value stops.
  • If the flow of work overwhelms the capacity of one of the assigned roles, then lead times extend, and value is lost due to delay.

It’s fine to assemble a group of specialists to develop an MVP, but the complexity debt in the application and in the team will steadily increase if things remain specialized. Over the long term, that structure will drive you into a ditch. Optimize for sustainable value delivery and impact on the bottom line, not HR concerns.

These are all common issues on teams that are not deliberately architected to deliver resilient applications. When short-term deliverables get more focus than long-term strategy, we’ll miss our goals. Plan for chaos and architect teams for reality.


Written on January 25, 2021 by Bryan Finster.

Originally published on Medium