Continuous improvement is a core value of our culture here at Medium. Recently, we introduced a new RCA process to help us better understand why things go wrong and how we can improve our service and processes.
RCA stands for Root Cause Analysis. The goal of the analysis is to identify the root causes of an incident so that we can learn from it and prevent it from happening again. An incident can be unexpected downtime, an out-of-band release, or just about any other unplanned event with negative consequences. The goal is not to point fingers or assign blame!
One method for determining the root cause is the 5 Whys. This method was developed by Taiichi Ohno and used by the Toyota Motor Corporation to uncover the cause of manufacturing defects. 5 Whys simply means repeating the question “Why?” until you can’t go any further and one or more root causes are identified.
The exact number of Whys is not important, but 5 is usually enough! The real root cause should point toward a process that is not working correctly or does not exist. You might have heard the phrase “people do not fail, processes do”. Again, we are not looking to find who was at fault. It’s important that there is trust in the team and that everyone feels they can contribute freely to the analysis. The root cause should be framed so that it can be corrected by completing actions. Actions often include the creation of a new process. Ideally the actions would prevent the incident from happening again, but that’s not always possible or practical. You can also consider actions that help identify the issue earlier so you can react more quickly.
Incident: Users were experiencing slow page load times.
1. Why? — Requests were hanging in the back end
2. Why were the requests hanging? — The database was under increased load
3. Why was the database under load? — We released a new feature that required data from a new table and that query was slow
4. Why is the query slow? — The new table is missing an index
- Add an index to the table (Owner: Alice)
You shouldn’t stop when you find the first cause and action. You can branch off at any level and keep going.
5a. Why is the index missing from the table? — The new feature wasn’t tested under load before release
- Add a performance test for this feature (Owner: Bob)
- Consider updating launch checklist to require performance tests for big features (Owner: Xin)
There can be more than one cause when answering a particular ‘Why’.
5b. Why is the index missing from the table? — The author (and the reviewer) didn’t have experience in database performance.
6. Why was the index missed in a code review? — The caretaker list for this part of the code no longer includes someone experienced in database performance.
- Update caretaker list for schema code (Owner: Upeka)
- Schedule engineering brown bag talk for schema performance (Owner: JZ)
Notice that these actions are a combination of process, code, and communication changes. It’s critical that every action can be completed and has exactly one owner. Having one owner makes it very clear who has the ball. Actions like “more testing” or “increase documentation” are cop-outs: they are not measurable, so it’s hard to mark them done. Don’t fall into that trap; there are usually real insights if you keep going. Sometimes asking “Why did the process fail?” helps keep you on track. A good facilitator can really help here. They are good at asking slightly different questions to generate new lines of thinking. A good facilitator will also encourage diverse participation, which uncovers weaknesses seen from other perspectives.
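If you want to keep the analysis in a structured form, the branching Whys and their actions map naturally onto a small tree. The following is only an illustrative sketch, not a tool we describe in this post: the `Action` and `Why` classes and the `collect_actions` helper are hypothetical names. It encodes the tail of the example above, with each action carrying exactly one owner.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Action:
    """A concrete follow-up with exactly one owner (hypothetical model)."""
    description: str
    owner: str
    done: bool = False


@dataclass
class Why:
    """One question/answer pair; children are the follow-up Whys."""
    question: str
    answer: str
    actions: list[Action] = field(default_factory=list)
    children: list[Why] = field(default_factory=list)


def collect_actions(node: Why) -> list[Action]:
    """Walk the tree depth-first and gather every action."""
    found = list(node.actions)
    for child in node.children:
        found.extend(collect_actions(child))
    return found


# The last Why of the example, with its two branches (5a and 5b).
missing_index = Why(
    "Why is the query slow?",
    "The new table is missing an index",
    actions=[Action("Add an index to the table", "Alice")],
    children=[
        Why(
            "Why is the index missing from the table?",
            "The new feature wasn't tested under load before release",
            actions=[
                Action("Add a performance test for this feature", "Bob"),
                Action("Consider updating launch checklist", "Xin"),
            ],
        ),
        Why(
            "Why is the index missing from the table?",
            "The author (and the reviewer) didn't have experience "
            "in database performance",
            children=[
                Why(
                    "Why was the index missed in a code review?",
                    "The caretaker list no longer includes someone "
                    "experienced in database performance",
                    actions=[
                        Action("Update caretaker list for schema code", "Upeka"),
                        Action("Schedule engineering brown bag talk", "JZ"),
                    ],
                )
            ],
        ),
    ],
)

owners = [action.owner for action in collect_actions(missing_index)]
print(owners)  # ['Alice', 'Bob', 'Xin', 'Upeka', 'JZ']
```

Making `owner` a single required string, rather than a list, is one way to enforce the one-owner rule in the data itself.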
The RCA Meeting
Our RCAs are conducted by the team responsible for the feature or area involved in the incident. We also invite anyone else who was affected or who participated in the diagnosis or resolution. The meetings are run as follows.
- Pick a facilitator and note taker
- State the incident
- Ask the Whys
- Identify and assign actions
- Communicate learnings
The final step, communicating learnings, is critical. The actions will prevent this specific problem from happening again, but the real leverage comes from spreading the learnings from this one failure so that similar failures can be prevented in the future. We publish a summary along with the actions on Hatch, our internal version of Medium.
The simple template we use looks like this.
- Incident (What happened, when, how long)
- Actions w/ owners
RCA Keys to Success
- Don’t play the blame game.
- Focus on continuous improvement.
- Make the exercise inclusive across departments.
- Keep digging and see if you can uncover more causes for each why.
- Share the learnings with the broader organization.
- Remember that things go wrong. Accept, embrace, and learn from it.