
The Broken System: Crisis Resolution pattern

"The System is completely unusable. We need you to find an effective, efficient and low risk resolution—quickly."

Why This Pattern?

When computer systems fail they are likely to impact users. This impact is often immediate (the current transaction fails) and long term (the user chooses not to use the system again). The purpose of this pattern is to describe an effective method of minimising the impact of a major system failure. In addition to effective resolution, the pattern also helps us to minimise the effort needed and exposure to further risk from unproven "fixes".

ITIL Incident Management provides detailed advice for managing incidents. This pattern does not challenge ITIL or make any attempt to replace its use. ITIL principles are widely adopted and add considerable value. The "Broken System: Crisis Resolution" pattern focuses on the human process of resolving critical incidents rather than the management of them. This pattern adds value whether or not there is a formal ITIL-based process in place.

What is the “Broken System: Crisis Resolution” Pattern?

The "Broken System: Crisis Resolution" (BSCR) pattern is only used, as its name suggests, when something "bad" has happened. It is a reactive approach to solving significant failures. It is one of those patterns that you should "keep on a shelf until you need it". If you find you are using it regularly, then your system would probably benefit from some proactive work to improve its resiliency.

This pattern is best used in conjunction with the "Learning from Unintended Failures" (LUF) pattern. LUF describes an approach that improves the resiliency of a system following a failure. So, when a major failure occurs, you first use BSCR to resolve the issue, and then LUF to improve the system using what you have learnt from the failure.

Learning from Unintended Failures pattern

How to Use the Pattern

Effective Communication and Collaboration skills are central to the guidance provided throughout the BSCR pattern.

When a system failure occurs, the success of the resolution will be significantly affected by the communication skills applied. Communication failings increase the time taken to reach a resolution through, for example, extended wait periods and reliance on misinformation. When users are frustrated by the quality of the communications they receive during a failure, the impact of the failure also increases.

Many failures require the combined skills and knowledge of multiple people to be resolved effectively and efficiently. The level of collaboration between the team members will, therefore, also drive the success of the resolution.

There are four distinct steps to the BSCR pattern:

Step 1: Triage

You use the BSCR pattern when there is a suspected system failure that appears to be causing significant immediate impact. You will notice that there are a number of subjective words in that sentence: 'suspected', 'appears' and 'significant'. The purpose of the first step is to remove this subjectivity by performing a Triage process on the suspected failure. The Triage process will identify if there is a failure that should be treated as a major incident. We will define major incident as a failure that requires immediate resolution using all the people and resources that are deemed necessary.

The BSCR Triage process is based on three questions: "Is there a failure?", "Is it urgent and important?" and "Is it a known issue?" Let's take a look at those questions in turn as they will help us with Prioritisation.

Is there a failure?

Fire! Shark! Wolf! OK, not really, sorry! Humans are programmed to react quickly to danger and this instinct can result in false alarms. The dangers of system failures are a little less scary, but false alarms are still common. Users can lack confidence in, or understanding of, technology and this can lead to over-eager assumptions of failure. Make sure there is a real problem before setting out to fix it.

Is the failure urgent and important?

Urgency and importance are not synonymous. Urgency means immediate attention is needed. Importance tells us something is significant, valued or necessary. The following quote is attributed to former U.S. President Dwight D. Eisenhower and Eisenhower's Principle provides more detail on the distinction between urgency and importance.

The remainder of this pattern is only appropriate for major incidents. To be considered major, a failure needs to be both urgent and important. Scales of urgency and importance will be specific to the organisation. Given the high opportunity cost of major incidents (they take immediate priority over all other activity), only a small proportion of failures should be treated as major. Non-major incidents should be treated as part of the normal work process.

Is the failure a known issue?

If the cause and resolution of the issue are already known, there is no need to continue with the pattern. It is worth noting that if a known issue has caused a major incident, it is probably time to prioritise a fix for it.
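The three Triage questions can be read as a short decision procedure. The following is a minimal sketch of that reading in Python, not part of the pattern itself: the names `TriageAnswers`, `TriageOutcome` and `triage` are hypothetical, and treating urgency and importance as simple yes/no answers is an assumption, since the pattern leaves those scales to each organisation.

```python
from dataclasses import dataclass
from enum import Enum


class TriageOutcome(Enum):
    """Possible results of the Triage step (names are illustrative)."""
    FALSE_ALARM = "no real failure; nothing to fix"
    NORMAL_WORK = "treat as part of the normal work process"
    KNOWN_ISSUE = "cause and resolution already known; prioritise a fix"
    MAJOR_INCIDENT = "continue to Step 2 and gather the Resolution Team"


@dataclass
class TriageAnswers:
    is_failure: bool      # "Is there a failure?"
    is_urgent: bool       # "Is it urgent ...
    is_important: bool    # ... and important?"
    is_known_issue: bool  # "Is it a known issue?"


def triage(answers: TriageAnswers) -> TriageOutcome:
    """Apply the three Triage questions in the order the pattern lists them."""
    if not answers.is_failure:
        return TriageOutcome.FALSE_ALARM
    # A failure is only major when it is BOTH urgent and important.
    if not (answers.is_urgent and answers.is_important):
        return TriageOutcome.NORMAL_WORK
    if answers.is_known_issue:
        return TriageOutcome.KNOWN_ISSUE
    return TriageOutcome.MAJOR_INCIDENT


if __name__ == "__main__":
    outcome = triage(TriageAnswers(True, True, True, False))
    print(outcome.name, "->", outcome.value)
```

Only the last outcome leads into the remaining steps of the pattern, which reflects the point that a small proportion of failures should be treated as major.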

Step 2: Gather the Resolution Team

So, you have a major incident which needs to be resolved as soon as possible. By definition, this incident takes priority over all other activities. The first resolution task is to select the best people to entrust with resolving the failure. You are looking for a combined group of business experts and technical experts. This mix gives you understanding of what the system does and how it does it; both may be needed to generate the resolution. Don't be tempted to create a team that is too big or too small. Somewhere in the range of three to six people is ideal. Given the urgency of the issue, it should only take a few minutes to put the team together, so don't overthink it.

The BSCR team

Major incidents attract numerous stakeholders, such as managers and customers, who all want constant progress updates. This distraction can be a huge cost for the Resolution Team. Any time they spend communicating progress is time they are not spending on the resolution. Effective communication, as we saw earlier, is critical to the perceived success of the resolution. We need a way, therefore, to keep the stakeholders informed without impacting the team. The BSCR pattern employs a Liaison for this purpose. The Liaison works closely with the team, but is not focussed on resolving the failure. Her primary responsibility is to proactively communicate progress to stakeholders without disturbing the team. She is also the channel for feeding any critical new information into the team. The best Liaisons look for every opportunity to maintain the focus of the Resolution Team. This might include, for example, fetching "coffee and pizza" for the team or sending apologies for meetings the team members can no longer attend.

Step 3: Identify the Resolution

The Resolution Team is now formed. Let's reiterate their role: to generate a solution that will resolve the failure as soon as possible. Identifying the root cause is not a necessity for this task. Efficient teams will focus on the resolution and not the precise cause. Considering the most common causes of system failures can, however, help direct the initial analysis. The table below shows a list of common causes of failure. Answering the questions listed will indicate whether any of the common causes are likely to be responsible.

Common Cause: A change to the system
Identifying Question: Have any changes been made to the system since the last time it was known to be working?
Resolution Actions:
  • Review change with change owners
  • Rollback change
  • Fix change

Common Cause: A change to a service the system depends on
Identifying Question: Have any changes been made to a service the system depends on since the last time it was known to be working?
Resolution Actions:
  • Review change with change owners
  • Rollback change
  • Fix change
  • Adapt system to updated dependency

Common Cause: A service the system depends on is not available
Identifying Question: Are there any failures when independently connecting to each of the system dependencies (e.g. Databases, Web Services, etc)?
Resolution Actions:
  • Fix dependency
  • Switch off or replace dependency

Common Cause: Bad or unexpected user or data input
Identifying Question: Does the system work if we use only normal system inputs?
Resolution Actions:
  • Stop flow of bad input

Common Cause: Performance degradation
Identifying Question: Is there anything to suggest performance issues (e.g. intermittent failure, previous slow-down)?
Resolution Actions:
  • Fix or restrict affected resource (memory, CPU, disk-space, etc.)

Common Cause: Hardware failure
Identifying Question: Are there any other systems hosted on the same hardware? If so, are they affected?
Resolution Actions:
  • Switch to alternative hardware

Write answers to the common cause questions on a whiteboard (or electronic alternative for distributed teams). This is your Fact Board where you will record known facts that will drive your analysis of the failure. Making the facts that you identify highly visible means you can easily refer back to them.

If the answer to any of the common cause questions is "Yes", then a potential cause and associated Resolution Action(s) have been identified. The first two common causes are changes to either the system itself or a service it depends on. The existence of a recent change is not sufficient to reach a conclusion—you need further evidence. Changes are, however, a very common cause of failures so treat them with a high level of suspicion. Engage the owners of the change and investigate whether this is the likely cause.

If you answer "Yes" to any of the other common cause questions, the cause is now likely to be clear. Your system will be directly affected by broken services it depends on, bad input, performance issues, or hardware failures. The table provides the appropriate Resolution Actions for each of these common causes.
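For distributed teams using an electronic Fact Board, the common cause questions and their answers could be held in a small data structure. The sketch below is one possible shape, assuming Python and hypothetical names (`CommonCause`, `FactBoard`, `record`, `candidates`); the causes, questions and Resolution Actions are copied from the table, and answers are free text so that entries such as "Assumed not" can be recorded.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class CommonCause:
    name: str
    question: str
    actions: tuple


# Causes, questions and actions as listed in the pattern's table.
COMMON_CAUSES = (
    CommonCause(
        "A change to the system",
        "Have any changes been made to the system since the last time "
        "it was known to be working?",
        ("Review change with change owners", "Rollback change", "Fix change"),
    ),
    CommonCause(
        "A change to a service the system depends on",
        "Have any changes been made to a service the system depends on "
        "since the last time it was known to be working?",
        ("Review change with change owners", "Rollback change", "Fix change",
         "Adapt system to updated dependency"),
    ),
    CommonCause(
        "A service the system depends on is not available",
        "Are there any failures when independently connecting to each of "
        "the system dependencies?",
        ("Fix dependency", "Switch off or replace dependency"),
    ),
    CommonCause(
        "Bad or unexpected user or data input",
        "Does the system work if we use only normal system inputs?",
        ("Stop flow of bad input",),
    ),
    CommonCause(
        "Performance degradation",
        "Is there anything to suggest performance issues?",
        ("Fix or restrict affected resource",),
    ),
    CommonCause(
        "Hardware failure",
        "Are there any other systems hosted on the same hardware? "
        "If so, are they affected?",
        ("Switch to alternative hardware",),
    ),
)


@dataclass
class FactBoard:
    answers: dict = field(default_factory=dict)  # cause name -> answer text
    facts: list = field(default_factory=list)    # supporting evidence

    def record(self, cause_name, answer, evidence=""):
        """Record an answer (e.g. "Yes", "No", "Assumed not") for a cause."""
        if cause_name not in {c.name for c in COMMON_CAUSES}:
            raise ValueError(f"Unknown common cause: {cause_name}")
        self.answers[cause_name] = answer
        if evidence:
            self.facts.append(evidence)

    def candidates(self):
        """Causes answered "Yes": potential causes with Resolution Actions."""
        return [c for c in COMMON_CAUSES if self.answers.get(c.name) == "Yes"]

    def unanswered(self):
        """Causes the team has not yet recorded an answer for."""
        return [c.name for c in COMMON_CAUSES if c.name not in self.answers]


if __name__ == "__main__":
    board = FactBoard()
    board.record("A change to the system", "No")
    board.record("A service the system depends on is not available", "Yes",
                 "Query run as the system account returns Access Denied")
    for cause in board.candidates():
        print(cause.name, "->", ", ".join(cause.actions))
```

A "Yes" only marks a cause as a candidate. For the two change-related causes in particular, the team still needs further evidence before drawing a conclusion.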

So, what should we do if we answered "No" to all the common cause questions? Firstly: check again. The common causes are still the most likely candidates for causing the failure. Answer each question again in turn, this time letting different team members take the lead. This may sound like a waste of time, but relate it to looking for something you have mislaid at home. I am prone to losing my keys. After desperately looking in twenty places, it is quite surprising how frequently I find them when revisiting the first place I checked.

Throughout the analysis of the failure, the Resolution Team should record any impediments to the analysis they encounter. Examples include ineffective logging, lack of access to the system, unclear flow of data through the system and an absence of system performance monitoring.

"I still can't find the keys!" OK, so you have checked for the common causes twice and you still haven't identified what is causing the failure. It is time to closely analyse your system's flow. How you do this depends entirely on the nature of your system. Using whatever tools are at your disposal, trace the logic flow from the start of the system's process to the point of failure. As the analysis progresses you will identify facts that you will add to your Fact Board so they are shared. Now is when you will really benefit from having the experts in your Resolution Team. You are trying to systematically isolate the location of the problem. When you find it, you can then generate a way to resolve it.

This step of the BSCR pattern uses many of the Resolution Team's soft skills, including Problem Solving, Collaboration, Logical Thinking, and Cool-Headedness.

Step 4: Apply the Resolution

In the previous step you identified the actions needed to resolve the issue. It is now time to apply the resolution and get the system back online. No changes should have been made to the system until this step. Any changes made, even to a failing system, have the potential to either obfuscate the original failure or increase the impact. Use your Risk Awareness and do not be tempted to allow changes without having strong confidence that they will effectively resolve the issue and not introduce new issues.

If you normally follow a change control process, follow an expedited version of the same process when applying fixes. Specifically: record and follow the change steps, understand how to roll back the change, know how to test the effectiveness of the change, and evaluate the risk to the system and other connected systems.
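The four items of the expedited change process can be treated as a checklist that must be complete before anything is applied. The following is a minimal sketch assuming Python; the `ExpeditedChange` class, its field names and the example values are hypothetical and are not taken from any particular change control tool.

```python
from dataclasses import dataclass, field


@dataclass
class ExpeditedChange:
    """A lightweight record of a fix, checked before it is applied."""
    description: str
    steps: list = field(default_factory=list)  # recorded change steps
    rollback_plan: str = ""                    # how to roll back the change
    effectiveness_test: str = ""               # how to test that it worked
    risk_assessment: str = ""                  # risk to this and connected systems

    def missing_items(self):
        """Return the checklist items that have not been filled in yet."""
        missing = []
        if not self.steps:
            missing.append("change steps")
        if not self.rollback_plan.strip():
            missing.append("rollback plan")
        if not self.effectiveness_test.strip():
            missing.append("effectiveness test")
        if not self.risk_assessment.strip():
            missing.append("risk assessment")
        return missing

    def ready_to_apply(self):
        """The change may be applied only when nothing is missing."""
        return not self.missing_items()


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    change = ExpeditedChange(
        description="Restore a configuration value",
        steps=["Back up current configuration", "Apply restored value"],
        rollback_plan="Re-apply the backed-up configuration",
    )
    print("Ready:", change.ready_to_apply())
    print("Still missing:", change.missing_items())
```

The check is deliberately simple: it confirms that each item has been written down, while judging whether the rollback plan or risk assessment is adequate remains the team's responsibility.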

Outcomes from the Pattern

Outcome 1: The system is rectified to normal state quickly and efficiently

This is the primary purpose of the BSCR pattern. The pattern's steps specifically target common inefficiencies which occur when analysing and rectifying system failures.

Outcome 2: No risk of making the situation worse

The pattern raises awareness of the risk of increasing the impact of a failure by applying unsafe "fixes" to the system.

Outcome 3: Ideas to improve system supportability

Good craftsmen choose their own tools. Analysis of the failure will have given the system experts an opportunity to use the existing support tools. The pattern advises the Resolution Team to record any improvements to tooling they identify. These improvements could reduce the time taken to resolve future failures.

The Pattern in Action

This is a fictional story with fictional characters.

Ian is the support manager for the "I-Trade" system at a financial institution. It is Monday morning and Ian's arrival in the office is greeted by a number of emails from I-Trade users saying that they are unable to access the system. He takes off his coat and logs into I-Trade. All he sees is the message "Loading...please wait...". Usually that message disappears in about 2 seconds, but not today. There is clearly an important and urgent issue as without I-Trade the users cannot perform their basic duties.

Ian calls an immediate meeting to gather the Resolution Team. The technical experts assigned are Jane and Khaled, and Lara is the business expert. Mohammed is asked to be the Liaison. All the team members work in the same office, so they gather in an area near Jane's desk. Mohammed brings a mobile whiteboard to the area.

With the team standing by the whiteboard, Khaled is first to speak. "Do you think we should re-boot the application server I-Trade runs on? So many issues are fixed by turning things off and then back on again." Lara nods and says, "That sounds good. It might not work, but what harm can it do as the system is already broken?" Jane replies: "No, we can't do that. There may be transactions cached which will be lost if we reboot the server. SystemY also runs on that server, so that would be taken offline too. Let's go through the list of common causes and see if we can see what's happening."

Mohammed has already written the common cause questions from the BSCR pattern on the board. He has acted as the Liaison for a Resolution Team before. After 10 minutes of discussion, the team answer "No" to all the questions. At this point, Neil, the department manager, comes over to the group. "Morning guys, I hear we have problems today. What progress are we making? How can I help?" Mohammed replies quickly, "Morning Neil. Let's pop into your room and I'll give you an update whilst the guys carry on."

The Fact Board

"Right", says Lara. "Answering those questions didn't get us anywhere. Shall we reboot the server now?" Before Jane gets the chance to repeat her concerns, Khaled replies, "Actually Lara, I think Jane is right. Especially now that we know SystemY is working fine. I think we should take another look at these questions." Mohammed returns from Neil's room and writes the following email, sending it to everyone he thinks has an interest in the failure.

Jane, Khaled and Lara check the version numbers of I-Trade together. None of the files have been changed for over a week. They find the same with the Market Data Feed. "How do we check whether the Database has been updated?" asks Lara. "Good question Lara. I have no idea. It would be easy if we had a table which logs all the changes made. But we don't", replies Khaled. Mohammed records this impediment so it can be considered as a future improvement. They decide to change the "No" for the second question to "Assumed not" as they can't be sure.

"How did you check the Database access, Khaled?" asks Jane. Khaled answers: "I just logged on and ran a few queries. It looks fine. I'll show you." Khaled logs in using his own account details. Lara says, "Shouldn't you use the account the system uses rather than your own?" And that is the Eureka! moment. Khaled logs in using the I-Trade system account. He runs a simple query, but it fails, returning an "Access Denied" error message.

After a few minutes of further checks and a call to the Database Administration Team, Mohammed sends the second update:

The resolution worked as planned and the system was restored. Ian, Jane, Khaled, Lara and Mohammed were all thanked for their excellent "super fast" resolution of the failure.

Anti-Patterns: Pitfalls to Avoid

In the example, the pattern worked out well for the team, but there are some important things to be aware of when using this pattern.

Test environment analysis

It is common in some organisations to attempt to recreate, analyse and resolve an issue by only using a non-production environment. The only reason for considering this option would be to protect the production system from further impact. The BSCR pattern addresses this concern by strictly separating resolution analysis and corrective action. No changes can be made until there is confidence that they will correct the failure and not cause further impact.

Many failures only appear on specific environments due to data, versioning, hardware, etc. There is no guarantee that a failure in production can be reproduced in a non-production environment. Analysing system failures can often feel “like looking for a needle in a haystack”. To give yourself a chance of finding the needle, make sure you are at least looking in the correct haystack.

It's broken already, what harm can I do?

In short: a lot of harm. Even the most impactful system failures can often be corrected by a very simple resolution. Making optimistic and unproven changes should be considered reckless. Rather than resolve the failure, such changes are more likely to make finding the real cause more difficult, and increase the impact.

Fake truths

Problem solving depends on the things you know, or rather those you treat as true. If what you “know” is actually incorrect, your solution is very likely to be incorrect. Take this very simple example: what number am I thinking of? The number is an even integer. Dividing the number by 3 gives another integer. The number is a single digit. You of course say “Six” and you are surprised when I reply “No, it’s eighteen: the last clue I gave you wasn’t true”.

Take steps to gain confidence in the accuracy of a fact before relying on it.
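The number puzzle above can be worked through mechanically to show how a single false "fact" narrows the search onto the wrong answer. This is a small sketch in Python; the helper names are hypothetical, and restricting the search to the positive integers 1 to 99 is an assumption made for the illustration.

```python
def is_even(n):
    return n % 2 == 0


def is_divisible_by_three(n):
    return n % 3 == 0


def is_single_digit(n):
    return 0 <= n <= 9


def candidates(clues, search_space=range(1, 100)):
    """Return every number in the search space that satisfies all clues."""
    return [n for n in search_space if all(clue(n) for clue in clues)]


if __name__ == "__main__":
    # Trusting all three "facts" leaves exactly one candidate.
    all_clues = [is_even, is_divisible_by_three, is_single_digit]
    print(candidates(all_clues))  # [6]

    # Dropping the untrue clue widens the candidates to include the real answer.
    true_clues = [is_even, is_divisible_by_three]
    print(18 in candidates(true_clues))  # True
```

With the false clue included, the correct answer of eighteen is excluded before the analysis even starts, which is why each fact on the Fact Board deserves a confidence check.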

Treat all failures as critical

The “Broken System: Crisis Resolution” pattern is only appropriate for major incidents. It requires the best people to be re-assigned to the resolution of the failure. If this is done for minor incidents, then there will be a significant impact on “normal” work.