A history of engineering disasters

Understanding failures, and learning from them, is a core competency for any engineer. So, in order to learn a bit more about failure, let’s examine a few large and deadly engineering disasters.

Challenger Disaster

The Space Shuttle Challenger disaster occurred on January 28, 1986, when the orbiter broke apart shortly after launch. A subsequent investigation produced the 200-page Rogers Commission Report, detailing the circumstances that led to the explosion. The proximate cause was determined to be a failed O-ring that was compromised when the temperature dropped below 40 degrees Fahrenheit the night before the launch. Engineers familiar with the O-ring had recommended to NASA that there should be no launch, but their concerns were overridden in order to meet the ambitious launch schedule. Shuttle missions were put on a 32-month hiatus following the disaster.

Columbia Disaster

Space Shuttle Columbia disintegrated as it reentered Earth’s atmosphere on February 1, 2003. During the launch, a piece of foam broke off the external fuel tank, striking and damaging the thermal protection system (TPS) on the underside of the shuttle. “Foam shedding” had also occurred in four previous launches, and was being monitored by NASA. However, none of the previous incidents resulted in damage to the TPS. In the subsequent disaster investigation, NASA officials contended that even if the damage had been known, nothing could have been done to repair the TPS. NASA suspended launches for 29 months after the disaster.

Therac-25 Machine

The Therac-25 was a radiation therapy machine designed to deliver radiation to a specific part of the body, with the aim of killing a cancerous tumor. The machine was involved in at least six incidents in which massive overdoses of radiation were given to patients. In three cases, the patients died as a result of the overdose. The overdose was possible because of a software defect that allowed the machine to enter high-power mode undetected. An investigatory committee later determined that the software for the Therac-25 was designed in such a way that it was realistically impossible to test in an automated way.

OK, Enough With the Sad Stuff

At this point you may be wondering, “What is the point of reading about these depressing disasters?” My point is to illustrate that even huge, well-funded, highly structured, mature organizations can fail. Despite years of planning, and despite attempting to enumerate every possible failure mode and mitigate each one, they can still fail.

Fortunately, Jana is not in the business of launching space shuttles or delivering radiation therapy. When we fail, nobody dies. We don’t know all the ways in which we might fail, and we cannot spend all of our time trying to identify every possible failure mode. We know we are going to fail, and that is OK! As a result, we do not optimize for preventing failures. Instead, we optimize for recovery, or in system reliability lingo: MTTR.

Mean Time to Repair (MTTR) is the measurement of how long it takes to discover a problem and release a fix. The idea is that instead of investing time to never make a mistake, you invest in your ability to react to problems and fix them quickly. In practice, this is a much more reasonable way to spend your time (at least it is for a software startup).
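To make the measurement concrete, here is a minimal sketch of how MTTR could be computed from a list of incidents. The `Incident` record, its field names, and the sample timestamps are all made up for illustration; they are not taken from Jana's actual tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical fields, chosen for this example only.
    started_at: datetime  # when the problem began affecting users
    fixed_at: datetime    # when the fix was released


def mean_time_to_repair(incidents):
    """Average time from the start of a problem to the release of its fix."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.fixed_at - i.started_at for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data: one incident fixed in 40 minutes, one in 80 minutes.
incidents = [
    Incident(datetime(2016, 1, 4, 9, 0), datetime(2016, 1, 4, 9, 40)),
    Incident(datetime(2016, 1, 12, 14, 0), datetime(2016, 1, 12, 15, 20)),
]
print(mean_time_to_repair(incidents))  # 1:00:00
```

Measuring from the start of the problem rather than from the moment it was detected means that slow detection counts against you, which matches the definition above: discovery time is part of the repair time.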

At Jana, we have developed many layers of testing and monitoring in order to reduce MTTR. Our goal is to discover any issues ourselves, before one of our members or customers has a bad experience. In most cases, we detect bugs or aberrations before code is deployed. In cases when that does not work, we are usually able to detect an issue based on crash reports and release a repaired version within an hour.

While optimizing for MTTR is not the right choice for every organization, consider your current situation: Could you be spending too much time trying to avoid making mistakes, and not enough time investing in a quick way to fix them?

If you’re interested in learning more about system reliability and other achievements at Jana, you should consider joining us! Check out our careers page for more information.
