In 2012, I wrote a post for the Code As Craft blog about how we approach learning from accidents and mistakes at Etsy. I wrote about the perspectives and concepts behind what is known (in the world of Systems Safety and Human Factors) as the New View on “human error.” I also wrote about what it means for an organization to take a different approach, philosophically, to learn from accidents, and that Etsy was such an organization.
That post’s purpose was to conceptually point in a new direction, and was, necessarily, void of pragmatic guidance, advice, or suggestions on how to operationalize this perspective. Since then, we at Etsy have continued to explore and evolve our understanding of what taking this approach means, practically. For many organizations engaged in software engineering, the group “post-mortem” debriefing meeting (and accompanying documentation) is where the rubber meets the road.
Many responded to that original post with a question:
“Ok, you’ve convinced me that we have to take a different view on mistakes and accidents. Now: how do you do that?”
As a first step to answer that question, we’ve developed a new Debriefing Facilitation Guide which we are open-sourcing and publishing.
We wrote this document for two reasons.
The first is to state emphatically that we believe a “post-mortem” debriefing should be considered first and foremost a learning opportunity, not a fixing one. All too often, when teams get together to discuss an event, they walk into the room with a story they’ve already rationalized about what happened. This urge to point to a “root” cause is seductive — it allows us to believe that we’ve understood the event well enough, and can move on towards fixing things.
We believe this view is insufficient at best, and harmful at worst, and that gathering multiple diverse perspectives provides valuable context, without which you are only creating an illusion of understanding. Systems safety researcher Nancy Leveson, in her excellent book Engineering A Safer World, has this to say on the topic:
“An accident model should encourage a broad view of accident mechanisms that expands the investigation beyond the proximate events: A narrow focus on operator actions, physical component failures, and technology may lead to ignoring some of the most important factors in terms of preventing future accidents. The whole concept of ‘root cause’ needs to be reconsidered.”
In other words, if we don’t pay attention to where and how we “look” to understand an event by considering a debriefing a true exploration with open minds, we can easily miss out on truly valuable understanding. How and where we pay attention to this learning opportunity begins with the debriefing facilitator.
The second reason is to help develop debriefing facilitation skills in our field. We wanted to provide practical guidance for debriefing facilitators as they think about preparing for, conducting, and navigating a post-event debriefing. We believe that organizational learning can only happen when objective data about the event (the type of data that you might put into a template or form) is placed into context with the subjective data that can only be gleaned by the skillful constructing of dialogue with the multiple, diverse perspectives in the room.
The Questions Are As Important As The Answers
In his book “Pre-Accident Investigations: Better Questions,” Todd Conklin sheds light on the idea that the focus on understanding complex work is not in the answers we assert, but in the questions we ask:
“The skill is not in knowing the right answer. Right answers are pretty easy. The skill is in asking the right question. The question is everything.”
What we learn from an event depends on the questions we ask as facilitators, not just the objective data you gather and put into a document. It is very easy to assume that the narrative of an accident can be drawn up from one person’s singular perspective, and that the challenge is what to do about it moving forward.
We do not believe that to be true. Here’s a narrative taken from real debriefing notes, generalized for this post:
“I don’t know,” the engineer said, when asked what happened. “I just wasn’t paying attention, I guess. This is on me. I’m sorry, everyone.” The outage had lasted only 9 minutes, but to the engineer it felt like a lifetime. The group felt a strange but comforting relief, ready to fill in the incident report with ‘engineer error’ and continue on with their morning work.
The facilitator was not ready to call it a ‘closed case.’
“Take us back to before you deployed, Phil…what do you remember? This looks like the change you prepped to deploy…” The facilitator displayed the code diff on the big monitor in the conference room.
Phil looked closely to the red and green lines on the screen and replied “Yep, that’s it. I asked Steve and Lisa for a code review, and they both said it looked good.” Steve and Lisa nod their heads, sheepishly.
“So after you got the thumbs-up from Steve and Lisa…what happened next?” the facilitator continued.
“Well, I checked it in, like I always do,” Phil replied. “The tests automatically run, so I waited for them to finish.” He paused for a moment. “I looked on the page that shows the test results, like this…” Phil brought up the page in his browser, on the large screen.
“Is that a new dashboard?” Lisa asked from the back of the room.
“Yeah, after we upgraded the Jenkins install, we redesigned the default test results page to the previous colors because the new one was hard to read,” replied Sam, the team lead for the automated testing team.
“The page says eight tests failed.” Lisa replied. Everyone in the room squinted.
“No, it says zero tests failed, see…?” Phil said, moving closer to the monitor.
Phil hit control-+ on his laptop, increasing the size of the text on the screen. “Oh my. I swear that it said zero tests failed when I deployed.”
The facilitator looked at the rest of the group in the conference room. “How many folks in the room saw this number as a zero when Phil first put it up on the screen?” Most of the group’s hands went up. Lisa smiled.
“It looked like a zero to me too,” the facilitator said.
“Huh. I think because this small font puts slashes in its zeros and it’s in italics, an eight looks a lot like a zero,” Sam said, taking notes. “We should change that.”
As a facilitator, it would be easy to stop asking questions at the mea culpa given by Phil. Without asking him to describe how he normally does his work, by bringing us back to what he was doing at the time, what he was focused on, what led him to believe that deploying was going to happen without issue, we might never have considered that the automated test results page could use some design changes to make the number of failed tests clearer, visually.
In another case, an outage involved a very complicated set of non-obvious interactions between multiple components. During the debriefing, the facilitator asked each of the engineers who designed the systems and were familiar with the architecture to draw on a whiteboard the picture that they had in their minds when they think about how it all is expected to work.
When seen together, each of the diagrams from the different engineers painted a fuller picture of how the components worked together than if there was only one engineer attempting to draw out a “comprehensive” and “canonical” diagram.
The process of drawing this diagram together also brought the engineers to say aloud what they were and were not sure about, and that enabled others in the room to repair those uncertainties and misunderstandings.
Both of these cases support the perspective we take at Etsy, which is that debriefings can (and should!) serve multiple purposes, not just a simplistic hunt for remediation items.
By placing the focus explicitly on learning first, a well-run debriefing has the potential to educate the organization on not just what didn’t work well on the day of the event in question, but a whole lot more.
The key, then, to having well-run debriefings is to start treating facilitation as a skill in which to invest time and effort.
Influencing the Industry
We believe that publishing this guide will help other organizations think about the way that they train and nurture the skill of debriefing facilitation. Moreover, we also hope that it can continue the greater dialogue in the industry about learning from accidents in software-rich environments.
In 2014, knowing how critical I believe this topic to be, someone alerted me to the United States Digital Service Play Books repo on Github, and an issue opened about incident review and post-mortems. I commented there that this was relevant to our interests at Etsy, and that at some point I would reach back out with work we’ve done in this area.
It took two years, but now we’re following up and hoping to reopen the dialogue.
Our Debriefing Facilitation Guide repo is here, and we hope that it is useful for other organizations.