Mission critical products need to work perfectly every single time, in any environment, no exceptions.
Today’s world demands software now, and in the age of agile, SaaS, and the cloud, product managers, software engineers, and most companies duly oblige — often releasing with known bugs, issues, or documented missing features, on the assumption that future iterations will fix them. This is a great approach for most products, but there are genuine cases where product failure has major consequences — such as in aviation, medicine, or finance. In those domains it’s imperative, and often critical to life, that we release software that has zero bugs. Like, none.
Before I begin, I want to acknowledge that despite best efforts, there are still cases today where “mission critical” is taken lightly, with lethal consequences — such as the 346 people who lost their lives when MCAS failed on the Boeing 737 MAX 8. Every accident furthers our understanding, and I forever hope that incidents like that never repeat, through better understanding and appreciation of the systems that we rely on.
Development discipline
The team is the key to success. To focus on mission critical, we must first focus on developing a high performing, safety conscious, and cautious team.
First — make sure the team knows it’s a “no fault, fix it” culture. Making mistakes is human nature, and here we are trying to ensure that us humans never make a mistake — it’s not going to happen. The best mitigation is to keep egos and pride out of it: don’t point fingers, and keep the team clearly focused on identifying and resolving faults, not identifying who is at fault.
Second — the entire company needs to share the mission critical mindset, from the executive team through to even your accounts department. This shared purpose unites the company and will keep politics at bay. This isn’t about securing funding, or reminding the company of your team’s importance; it’s about ensuring that everyone shares the same common purpose.
Third — the team building the product needs discipline. Shortcuts, secrets, and general shoddiness must be culturally unacceptable, and every single person on the team should know that their work will be subject to both scrutiny and praise.
Human context
The next biggest factor is also human — on the end user side. It’s imperative that your team understands who the user ultimately will be for your solution, what kind of task they will be doing, and the conditions they will be doing it under. This cannot be achieved with a whiteboard and desk, and the best way to learn this is to be in the user’s shoes quite literally and do the task with them on-site. This may seem like a major exercise, but I guarantee the time and cost to do so is completely worth it.
To provide a personal example, I recently worked on a product involving cargo scanners for airlines in Alaska. The work we did seemed perfect in theory, but when I actually stood outside in the Alaskan winter, I immediately realised that our solution package (user interface + hardware product) was impractical for someone outdoors who is wearing three very thick layers of gloves, operating in twilight, snow, and breaking ice — not to mention a very cold place where they don’t want to linger trying to get an app to work.
From that experience, I developed the following framework I always mentally take note of when I’m with the end user:
- Where will they be when they use this product?
- How will they use this?
- What actions and tasks do they do before and after using this feature?
- What kind of environment are they working in? (noise, temperature, light, movement, etc)
- What other actions are taking place while they’re using this feature? (Forklifts, aircraft movements, meteors…)
This will help you build an understanding of the kind of messages, alerts, warnings and graphics you provide in context, as the last thing you want to do is create a product that always works, but causes a chain reaction that ruins something else (e.g. imagine if your car suddenly beeped a shrill noise while driving for something minor like losing Bluetooth).
Cognitive Load
Cognitive load is a field in its own right, but for our purposes, it’s the effort a user needs to take to use your product. I want to demonstrate this visually with three options for a basic on/off switch. Which one of these is the simplest, and which one is the best for reducing cognitive load?

The one on the far left is certainly the simplest — the one button works for on and off, and is just like your TV remote or flashlight at home. The middle one has dedicated buttons for both on and off, while the one on the right has two buttons and a tactile feel to it.
Now imagine you’re on an oil pipeline and need to shut off the supply line. Although the left switch is simple, it doesn’t actually tell you whether you’re about to turn the line on or off. The middle one is better, as you know you’re definitely turning it off — but what if there were smoke or limited visibility? The right one works best: you know your current state, can easily toggle, and can definitively confirm the end state, by tactile feel or visually, immediately.
Although a hardware switch was used as an example, the same applies in software. You want to remove as much cognitive load from your user as possible, and present them with the right information and choices at the right time. Remember: this may not necessarily be the simplest user experience, as demonstrated above; it’s about being the most practical.
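The same trade-off shows up in software interfaces. A minimal sketch (names are my own, not from any real control system): a single toggle forces the user to remember the current state, while dedicated actions plus a readable state remove that burden — exactly the right-hand switch above.

```python
from enum import Enum


class PumpState(Enum):
    ON = "on"
    OFF = "off"


class AmbiguousSwitch:
    """Single toggle: the caller cannot know what state they will end in."""

    def __init__(self) -> None:
        self._on = False

    def toggle(self) -> None:
        self._on = not self._on


class ExplicitSwitch:
    """Dedicated actions plus a readable state: nothing left to infer."""

    def __init__(self) -> None:
        self.state = PumpState.OFF

    def turn_on(self) -> None:
        self.state = PumpState.ON

    def turn_off(self) -> None:
        # Idempotent: calling this twice is safe, the end state is certain.
        self.state = PumpState.OFF


switch = ExplicitSwitch()
switch.turn_off()
```

Note that `ExplicitSwitch` has more methods than `AmbiguousSwitch` — the simpler interface for the machine is the harder one for the human.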
Single points of failure and redundancy
Aircraft systems almost never have a single point of failure, thanks to redundant systems. For example, the Boeing 777 has three redundant flight computers, just in case the pilots have the unfortunate luck of seeing the first two fail. This applies to product development too: think strategically about the different dependencies and components required for your product to function, and what would happen if any one of them were removed from the process.
A good example is cloud technology. Although very reliable these days, with 99.99% uptime promised by most vendors, some organisations run multi-cloud environments where everything is duplicated identically across three cloud providers, so that if two went down, the third would still function. Of course, this assumes the physical network cable still works — I had backup satellite communications bandwidth should that occur.
By creating redundant systems you reduce the risk of single points of failure. Remember: find the weakest link and spread the risk.
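In code, the multi-cloud idea reduces to trying each redundant backend in turn and failing only when all of them fail. A minimal sketch, with made-up provider functions standing in for real cloud SDK calls:

```python
def fetch_with_failover(providers, request):
    """Try each redundant provider in order; raise only if every one fails."""
    errors = []
    for provider in providers:
        try:
            return provider(request)
        except Exception as exc:  # in production, catch narrower exception types
            errors.append(exc)
    raise RuntimeError(f"all {len(providers)} providers failed: {errors}")


# Hypothetical providers: the first two are down, the third still functions.
def cloud_a(request):
    raise ConnectionError("cloud A is down")


def cloud_b(request):
    raise ConnectionError("cloud B is down")


def cloud_c(request):
    return f"handled {request} via cloud C"
```

Calling `fetch_with_failover([cloud_a, cloud_b, cloud_c], "req-1")` succeeds via the third provider; only when every provider in the list fails does the caller see an error.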
Inform, Warn, Escalate
Inform, warn, escalate is the “crawl, walk, run” of helping users understand mission critical products. It is a sequence of important system actions with graduated responses.
1. Inform
The system informs the user that something could go wrong based on current system status. This is essentially an advisory message providing the user situational awareness. For example, the fuel gauge in your car shows how much fuel you have and lets you decide when to refuel.
2. Warn
At this point, the user has done something that could compromise your system, and you are warning them that their actions could have consequences. This is the point the low fuel warning light comes on in your car: it is warning you that fuel is low, and helping you decide to pull over and fill up. The difference between informing and warning is the action required — information doesn’t require action; a warning does.
3. Escalate
The escalation step occurs when the user has not taken the appropriate action in response to the warnings, and the system needs to step in to prevent harm. Some modern cars, for instance, will limit engine performance as the tank runs critically low to prevent damage from running completely dry. When building mission critical products, you should have a clear decision point at which you take control away from the user. There’s no easy answer to when this should occur.
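The three steps above can be sketched as a simple ladder. The thresholds here are illustrative, not taken from any real vehicle specification:

```python
def fuel_response(litres_remaining: float) -> str:
    """Map a raw fuel reading onto the inform/warn/escalate ladder."""
    if litres_remaining > 10:
        return "inform"    # gauge only: the user decides when to act
    if litres_remaining > 2:
        return "warn"      # low-fuel light: action is now expected
    return "escalate"      # system steps in before damage occurs
```

The value of the pattern is that each level has a distinct contract: inform requires nothing, warn requires user action, and escalate means the system acts on its own.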
Communicate failure
Communicating failure is the last step in a mission critical system’s journey, once it has failed to be mission critical: letting the user know that they can no longer rely on the system to be whole (it may still work, just without the guarantee the user wanted). This is the point at which the user takes cognitive control back from the system. In most aircraft, this is where the autopilot gives control back to the pilots, announced by an audible chime.
When building products, this point ties in with the inform/warn/escalate sequence above: control is handed back to the user, who is then tasked with building their own situational awareness to handle tasks unassisted or with minimal assistance.
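The key design point is that the handover is explicit and loud, never silent. A toy sketch — this is my own illustration, not any real avionics API:

```python
class Autopilot:
    """A system that hands control back and says so loudly."""

    def __init__(self, announce):
        self.engaged = True
        self._announce = announce  # e.g. drives the audible disconnect chime

    def disconnect(self, reason: str) -> None:
        # Mark the system as no longer trustworthy first, then tell the
        # user why: silent failure is the one thing we must never allow.
        self.engaged = False
        self._announce(f"AUTOPILOT DISCONNECT: {reason}")


messages = []
ap = Autopilot(messages.append)
ap.disconnect("air data sensors disagree")
```

After the call, `ap.engaged` is false and the user has an explicit, stated reason to start rebuilding their own situational awareness.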
Failure recovery
This is an area I never like discussing, as it implies we have failed at our mission. Unfortunately it’s also one of the most important things we can do to learn about our product. In aviation, this is commonly seen in the form of a black box, and in our context, this is a very detailed log of every single parameter and event that occurred leading up to the failure of the system.
This is even more important in the context of scalable cloud software: when something critical goes down, it won’t take long for the failure to propagate across the network. Detailed logs and events help us quickly identify the problem and solve it before it spreads en masse.
That’s not to say we can’t make a last-ditch attempt at self-recovery. One advantage at this point is that if we can safely determine that our system is compromised, we can take recovery steps such as a full reboot, or a fail-over to another system that can at least help the user recover. We need to be very cautious with this, however, as falsely presuming failure and taking action is essentially killing our own system — the last thing we want to do.
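The software analogue of a black box can be as simple as a bounded in-memory log of recent events that is dumped in full the moment failure is declared. A minimal sketch, assuming timestamped events and a fixed capacity are enough for your system:

```python
from collections import deque
from datetime import datetime, timezone


class BlackBox:
    """Keep the last `capacity` events; dump them all when failure is declared."""

    def __init__(self, capacity: int = 1000) -> None:
        # A bounded deque silently drops the oldest events, so recording
        # never fails and memory use stays fixed.
        self._events = deque(maxlen=capacity)

    def record(self, name: str, **params) -> None:
        # Timestamp every event so the failure sequence can be reconstructed.
        self._events.append(
            {"at": datetime.now(timezone.utc).isoformat(), "event": name, **params}
        )

    def dump(self) -> list:
        # Called on failure: everything leading up to this moment, oldest first.
        return list(self._events)


box = BlackBox(capacity=500)
box.record("valve_open", valve_id=3)
box.record("pressure_spike", kpa=910)
```

In a real product the dump would be shipped to durable storage rather than inspected in memory, but the principle is the same: every parameter, every event, right up to the failure.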
Documentation
Finally, before, during, and after the product is developed or improved, every single decision should be documented with the date, decision owner, reasoning, and outcome. This is crucial to foster the development culture within the team, and will help the entire team understand why decisions are made across the entire system. This will also help the team spot systemic or architectural defects that can be easily corrected now, rather than in production.
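A decision log doesn’t need heavyweight tooling; what matters is that every entry carries the same four fields. A sketch of one possible record shape — the field names and example values are my own convention, not a standard:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DecisionRecord:
    """One entry in the team's decision log: date, owner, reasoning, outcome."""

    decided_on: date
    owner: str
    decision: str
    reasoning: str
    outcome: str = "pending"  # filled in once the result is known


log = [
    DecisionRecord(
        decided_on=date(2021, 3, 4),
        owner="platform lead",
        decision="Duplicate the scanning service across three cloud providers",
        reasoning="Removes the single point of failure in request routing",
    )
]
```

Making the record immutable (`frozen=True`) mirrors the point of the log: decisions are appended and later superseded by new entries, never silently rewritten.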
Further Reading
When I started in this role, these resources gave me comprehensive depth on building mission critical systems, and I recommend them as further reading if this article interested you.
- MISRA C:2012, Motor Industry Software Reliability Association, 2012.
- The Power of 10: Rules for Developing Safety-Critical Code, Gerard J. Holzmann, NASA/JPL Laboratory for Reliable Software, 2006.