Bench Engineering

Breaking Good — Having a blameless engineering culture

In engineering, we often create simple philosophies that seem to have little to do with “real engineering”, but turn out to help us collectively become better engineers. I usually measure “better engineers” by the rate of innovation, happiness and agility of the team. At Bench, I think we are getting to a sweet spot in our engineering culture, and I think one philosophy that got us here and will take us to the next level is having a blameless engineering culture.

Looking back

When I worked in Japan, I worked for a company that had grown to a billion-dollar valuation and was the fourth most visited website in the country. One memorable day, I made a deployment that didn’t have any immediate negative impact. An hour or so later, it was reported that the site was progressively getting slower. Since I was the last to deploy, it was deemed that I had caused the issue, so I began to investigate. Immediately the entire Board of Directors came out of their offices and huddled behind me and my 11-inch Windows 95 laptop. As I tried to understand the cause of the slowness, they repeatedly asked me long, complex questions in Japanese, and each time this was translated for me by a nervous-looking engineer simply as, “is it fixed yet?” or “did you fix it?”. While I was doing my best to multitask between politeness and a deep dive into what might be causing the outage, another engineer fixed the issue. Once we were in the clear, there was no further discussion on what actually broke or how it was fixed. I was just relieved the ordeal was over. In hindsight, I see how we handled this system failure as a missed opportunity to improve our systems, increase team knowledge and generally make people happier about coming to work every day.

Fear of breaking

In software engineering (although I’m sure this is true of other professions), it is important that engineers can move forward with confidence and without fear of breaking things. Fear of making mistakes kills velocity, innovation and sometimes any progress at all. For instance, as soon as an area of code is deemed “brittle”, “critical”, or “unknown”, engineers will generally think twice before touching it. The more we avoid it, the older it gets and the more team knowledge atrophies. This not only poses non-obvious risk to the company, but accepting this mindset spills into other areas of code and soon becomes part of the culture.

Fear is generally less about bugs in code than it is about the repercussions that one might have to face from peers and management. Will I lose my job? Will I be deemed a bad engineer? No longer trusted? Or worse? If we want software engineers to move forward with velocity and grow, they need a safe place to work, make mistakes and break things. Only then will they start to take risks and ultimately embrace more innovative ideas or say important things like, “we can’t support this brittle code, let’s try to fix it or rewrite it”.

Unsafe environment

When I first joined Bench, I witnessed a Slack exchange that was disappointing, to say the least. It was a great example of what an unsafe engineering environment looks like, which is why I’m sharing it here.

Engineer1: FYI mailchimp and mandrill have been turned off.
Engineer2: WTF!!??????????????????????????????????!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! (edited)
WHY!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!?????????????????????????????????!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Engineer2: Put it back on.
Engineer2: @Engineer1 now!
Engineer2: Right now!!!!
Engineer1: What.
Engineer1: Ffs
Engineer2: We need mandrill to send emails
Engineer1: Dude.
Engineer1: Im afk.
Engineer1: login is in 1password
Engineer1: Call me.
Engineer1: I can be back in 10.
Engineer2: Alright, we can send emails now.
Engineer1: Thanks man. Thats obviously my fault.
Engineer3: are we back to our regularly scheduled program?

In this exchange, Engineer1 is doing their job and communicating their actions to others. Top marks! Unfortunately, somewhere along the line there was a previous communication breakdown, and that engineer didn’t realize that the components being turned off were critical and, if not available, would cause system-wide issues. More unfortunate is that the response from Engineer1’s peer likely exacerbated the issue and muted the team’s ability to build in future resilience to similar issues. The only lesson here is something like: “it’s never safe to turn things off unless you like to be verbally assaulted online”.

Safe environment

I’m happy to say with confidence that you would never see this type of conversation today at Bench. If I was not confident in that, then I would not be sharing this story or working at Bench. This second version is far more likely.

Engineer1: FYI mailchimp and mandrill have been turned off.
Engineer2: :scream_cat:
Engineer2: We need mandrill to send emails.
Engineer2: @Engineer1 Can you revert that change?
Engineer1: Im afk.
Engineer1: login is in 1password
Engineer2: Thanks. I’ve re-enabled it.
Engineer2: Alright, we can send emails now.
Engineer1: Thanks man. I’ll schedule a 5-Whys when I get back.
Engineer3: are we back to our regularly scheduled program?

From here we would have a blameless “5-Whys” meeting and come out with positive changes to ensure everyone on the team is always on the same page with regard to what we use our services for.

One of the reasons I wanted to write this post is to once again remind all Bench engineers that they are in a safe blameless environment. Living “blame free” doesn’t come for free. It takes constantly reminding ourselves, especially when entering a “5-Whys” meeting and digging for the underlying reasons. At best, blame just wastes time. At worst, it causes cultural rot.

It took time to get people in the routine, but for a while now we have had a 5-Whys meeting after every Production-impacting issue and for anything else where we feel like the investment in meeting time will yield ways to improve our systems. This meeting is designed to get to the root of the issue. For instance, the last “why?” might be “we don’t have enough tests in a certain area”, “we need to hire more people for X”, “we don’t have a process for doing Y”, or “no team owns Z”.

Sometimes the answer seems obvious: “I forgot to press the button”. This is the wrong statement. “The button wasn’t pressed” focuses on the issue, not the person. Then what follows is “it’s too easy to forget to press the button”, “we don’t have a foolproof process for this”, “pressing the button is missing from the process” or “we should get alerted earlier if the button isn’t pressed”.

Getting to blameless

First, it helps if you don’t have any jerks on the team. Even in the absence of personal attacks, our experience is that people only feel safe raising issues if we consistently and explicitly reinforce that we seek to have a blameless culture. People will often blame themselves and assume others blame them, even if nobody does. Or worse, they may deflect the blame from themselves, blaming others or “the system”. Instead, we should act professionally and look at why the issue occurred, then improve our tooling and processes to prevent someone else making the same mistake. After all, it’s just a bug in the system.

Small quick steps

We want to allow engineers to move fast, break things (in small ways), learn, course correct and keep going. Call it “DevOps” or whatever you like, but this industry has learnt a lot about moving forward quickly in safe ways. One of the most important learnings is moving forward in many small steps, rather than a few big lumbering steps. This means each step is likely to cause only a small problem, which can be quickly rectified. At Bench, we’ve had a strong focus on this area in the past twelve months. In this time, we’ve gone from fewer than 70 deployments per quarter to more than 600. That number is still climbing. Our time from code to production is also quickly declining, as is the aggregate outage time.

Our culture

Bench’s current engineering culture is one of the strongest I’ve experienced. I think we now do pretty well at having a “blameless culture”, which has not always been the case. But I think we need to constantly remind ourselves of this fact in order to keep it true. It isn’t just a nice-to-have or way to avoid conflict. It’s an important tool for focusing on what matters and getting rid of the fear of making changes.