Understanding failures, and learning from them, is a core competency for any engineer. So, in order to learn a bit more about failure, let’s examine a few large and deadly engineering disasters.
The Space Shuttle Challenger disaster occurred on January 28, 1986, when the orbiter broke apart shortly after launch. A subsequent investigation produced the 200-page Rogers Commission Report, detailing the circumstances that led to the explosion. The proximate cause was determined to be a failed O-ring that was compromised when the temperature dropped below 40 degrees Fahrenheit the night before the launch. Engineers familiar with the O-ring had recommended that NASA not launch, but their concerns were overruled in order to meet the ambitious launch schedule. Shuttle missions were put on a 32-month hiatus following the disaster.
OK, Enough With the Sad Stuff
At this point you may be wondering, “What is the point of reading about these depressing disasters?” My point is to illustrate that even huge, well-funded, highly structured, mature organizations can fail. Despite years of planning and attempts to enumerate and mitigate every possible failure mode, they can still fail.
Fortunately, Jana is not in the business of launching space shuttles or delivering radiation therapy. When we fail, nobody dies. We don’t know all the ways in which we might fail, and we cannot spend all of our time trying to identify every possible failure mode. We know we are going to fail, and that is OK! As a result, we do not optimize for preventing failures. Instead, we optimize for recovery, or, in system reliability lingo, for a low MTTR.
Mean Time to Repair (MTTR) measures how long it takes, on average, to discover a problem and release a fix. The idea is that instead of investing time in never making a mistake, you invest in your ability to react to problems and fix them quickly. In practice, this is a much more reasonable way to spend your time (at least it is for a software startup).
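To make the metric concrete, here is a minimal sketch in Python. It assumes each incident is recorded as a pair of timestamps: when the problem started and when the fix was released. The sample incidents and the function name mean_time_to_repair are made up for illustration; this is not a description of Jana's actual tooling.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (problem started, fix released).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 40)),      # 40 minutes
    (datetime(2024, 1, 10, 14, 15), datetime(2024, 1, 10, 15, 5)),  # 50 minutes
    (datetime(2024, 1, 21, 22, 30), datetime(2024, 1, 21, 23, 0)),  # 30 minutes
]


def mean_time_to_repair(records):
    """Average the time between a problem starting and its fix being released."""
    if not records:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the per-incident repair durations, starting from a zero timedelta.
    total = sum((fixed - started for started, fixed in records), timedelta())
    return total / len(records)


mttr = mean_time_to_repair(incidents)
print(f"MTTR: {mttr.total_seconds() / 60:.0f} minutes")  # MTTR: 40 minutes
```

Tracking this number over time is one way to tell whether investments in testing and monitoring are actually shortening recovery.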
At Jana, we have developed many layers of testing and monitoring in order to reduce MTTR. Our goal is to discover any issues ourselves, before one of our members or customers has a bad experience. In most cases, we detect bugs or aberrations before code is deployed. In cases when that does not work, we are usually able to detect an issue based on crash reports and release a repaired version within an hour.
While optimizing for MTTR is not going to be optimal for all organizations, consider your current situation: Could you be spending too much time trying to avoid making mistakes, and not enough time investing in a quick way to fix them?
If you’re interested in learning more about system reliability and other achievements at Jana, you should consider joining us! Check out our careers page for more information about joining.