For want of a nail the shoe was lost. For want of a shoe the horse was lost. For want of a horse the rider was lost. For want of a rider the message was lost. For want of a message the battle was lost. For want of a battle the kingdom was lost. And all for the want of a horseshoe nail.
This proverb, for me, was the initial antecedent to Criticality Analysis. The proverb was intended to stress that an outcome might hang in the balance for lack of attention to the trivial. Even the smallest detail might put the ‘kingdom’ in jeopardy. Because a nail was missing, a horse’s shoe was not fitted correctly, and the horse threw the shoe in combat, causing the rider to fall. Because the rider was no longer in combat (one version has it as the King himself), the battle was missing its field general. And, because of the missing King, the battle and kingdom were ultimately lost.
What does any of this have to do with criticality analysis? I wonder if anyone in this old proverb would have classified that nail, mind you, just one nail, as critical? Or, more to the point, would anyone have seen the potential loss of the King’s presence in the battle as the decisive event in losing the kingdom?
A few years ago, Reliable Plant published an exceptional article titled “Criticality Analysis: What It Is and Why It’s Important.” If you have not read this piece, I would strongly encourage you to do so; this article is meant to build on it. I hope to put the idea and practice of Criticality Analysis into more of a contextual discussion and ultimately compel the reader to “get one.” I'll also be discussing this topic at the 2022 Reliable Plant Conference and Exhibition, held July 25-28 in Orlando, Florida.
Before we begin, we must address the nasty business of paradigm-shifting. For this article, I ask you to put aside your norms, biases, and beliefs and consider these philosophical necessities:
Keeping these philosophical necessities in mind, approach this article with the confidence that in the end, you will be able to make a convincing argument on:
Quoting an old ISO auditor, Criticality Analysis is “whatever we say it is.” That seems simple enough. However, he further instructed me that we must have a process in which we all agree that “this is how we arrived at our conclusion.” And you must follow your process as it is written.
Our Criticality Analysis process, as it turns out, should start with what the industry calls an “Asset Criticality Analysis.” This particular process has the added benefit of being the initial step to almost every single maintenance and reliability continuous improvement effort. The Asset Criticality Analysis is quite simply a listing of all the plant or facility’s assets in order of their criticality to the function of the plant.
This practice is a necessary step for one dramatic reason: if the leadership of a facility can’t agree on what assets the facility has and how important each one is, the leadership isn’t likely to agree on anything. I have personally been in meetings where the argument was over whether the plant had 56 conveyors or 57. Someone suggested that they go out and count.
Organizations are counseled to form a team of stakeholders who will assemble this list of plant assets and subsequently determine each asset’s importance. This cross-functional grouping of interested decision-makers is crucial to the task, and membership criteria should be carefully considered; the bar should be very high for admittance. Why? This may be the very same group that executes the criticality analysis (risk analysis), and they will hold much sway over the maintenance (small “m”) responsibilities that enable an asset to deliver on its value promise.
ISO 55000, Asset Management, helps us understand that a “stakeholder” is anyone who has an interest in the success of the asset performing its intended function. A list of stakeholders would certainly include the plant manager’s staff and key lieutenants and, depending on your involvement with ISO 55000, may even include vendors and customers. Caution is warranted: those tasked with this very important detailing of plant assets should be the decision-makers. I would counsel against seating deputies at the big table; these need to be top representatives of their associated departments.
A typical first use of an Asset Criticality Analysis is the construction of a Ranking Index of Maintenance Expenditures (RIME) chart; this is a common tool used by organizations to determine the priority of a maintenance work order based on the criticality of the equipment and the corresponding importance of the maintenance activity.
The plant assets are listed, in order of importance (criticality), down the left-hand side of the spreadsheet; the work order type is listed across the top, also in order of importance (criticality). Companies determine priorities based on the intersection of asset and work order type. This chart is, of course, customized to the client and the context of operation. An airport’s snowplow, for example, has a high level of criticality in the winter months but not so much in the spring and summer. A piece of critical production equipment reflects as much in its criticality (unless it is accompanied by a spare unit with the same capacity). As it turns out, context is critical in an Asset Criticality Analysis.
Here is an example of a RIME chart from a client site:
Note the weighted (ranked) value associated with the order of criticality. One key factor in almost every Asset Criticality Analysis is that the weight of “10” for assets is universally reserved for plant utilities. In practice, few assets in a facility have more priority than the first company transformer that is tied to the public utility’s pole-mounted breakers. Losing the first transformer will shut the entire operation down; everyone will go home. Juxtapose that with the level of maintenance (little “m” and capital “M”) that is performed on your main transformer, arguably the most critical asset in your facility. That transformer is the ‘nail’ in the proverb that started this article. That was meant to be a scary thought.
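To make the idea of “the intersection of asset and work order type” concrete, here is a minimal sketch of a RIME-style lookup. It assumes the common convention of multiplying the asset weight by the work order weight to get a priority index; the client chart may combine them differently. Only the weights of 10 (main transformer) and 8 (Cutting Table 8) come from this article. The third asset, the work order types, and every other weight are hypothetical placeholders.

```python
# Asset weights. Only the 10 and the 8 come from the article;
# "Office HVAC unit" and its weight are hypothetical.
ASSET_WEIGHT = {
    "Main transformer": 10,
    "Cutting Table 8": 8,
    "Office HVAC unit": 3,
}

# Hypothetical work order types and weights, most important first.
WORK_ORDER_WEIGHT = {
    "Emergency": 10,
    "Preventive maintenance": 7,
    "Routine repair": 5,
    "Cosmetic": 2,
}


def rime_index(asset, work_order_type):
    """Priority index: asset weight times work order weight (assumed convention)."""
    return ASSET_WEIGHT[asset] * WORK_ORDER_WEIGHT[work_order_type]


def prioritize(work_orders):
    """Sort (asset, work_order_type) pairs from highest to lowest index."""
    return sorted(work_orders, key=lambda wo: rime_index(*wo), reverse=True)


backlog = [
    ("Office HVAC unit", "Emergency"),
    ("Cutting Table 8", "Routine repair"),
    ("Main transformer", "Preventive maintenance"),
]
for asset, kind in prioritize(backlog):
    print(rime_index(asset, kind), asset, kind)
```

Under these assumed weights, a preventive task on the main transformer outranks an emergency call on a low-criticality asset, which is the kind of result the team would want to debate and then agree on.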
From the basic, in-order list of assets, as exemplified in the RIME chart, we are better poised to initiate a criticality analysis. The original criticality analysis article informed us that criticality analysis is “the process of assigning assets a criticality rating based on their potential risk.” To understand risk, you must understand that it is algebraic: risk = probability x consequence.
Since this is an algebraic formula, risk can be reduced by decreasing probability, decreasing the consequences, or doing both. Consider how we reduce the risk of being injured in an automobile accident. We:
If you follow Reliability Centered Maintenance (RCM) theory, you’ll recall that the objective is not so much to keep equipment from failing, but rather to reduce the consequences of that failure when/if it does occur. RCM is a type of risk management.
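The arithmetic behind risk = probability x consequence can be sketched in a few lines. The probabilities and dollar figures below are invented purely for illustration; they are not taken from any plant or from the original article.

```python
def risk(probability, consequence):
    """Risk as probability (0 to 1) multiplied by consequence (any cost unit)."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * consequence


# Hypothetical numbers: a 10% yearly chance of a failure costing 100,000.
baseline = risk(0.10, 100_000)

# Lower the probability (for example, through better PM/PdM).
lower_probability = risk(0.05, 100_000)

# Lower the consequence (for example, a spare on the shelf shortens downtime).
lower_consequence = risk(0.10, 40_000)

# Do both.
both = risk(0.05, 40_000)

print(baseline, lower_probability, lower_consequence, both)
```

Either lever reduces the product, and pulling both reduces it the most, which is the RCM point about managing consequences rather than only chasing failures.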
What have we learned so far?
Criticality Analysis is tantamount to Risk Assessment and Risk Management. No plant, facility, or functional organization can assess and manage risk if the leadership can’t even agree on the assets and their contributing importance. That is step one.
I think the original criticality analysis article said it best in concluding the bottom line for criticality analysis:
“Criticality analysis is a great tool for identifying the priority of maintenance tasks. A good way to look at it is that maintenance task priority should be established by the risk level that comes with not performing that task. Coincidently, this level of risk associated with not doing a particular task is determined by the consequences of the potential failure that could happen if the task isn’t completed and the likelihood of that failure occurring if the task isn’t done at a predetermined time.”
This is exactly right, and the outcome proposed is precisely the result we need. In fact, I’d use this line of reasoning to eliminate non-value-added preventive maintenance tasks, and I’d suggest using a very effective tool, such as Preventive Maintenance Optimization (PMO), to identify those non-value-added tasks. We want, and we only want, tasks that are meant to lower the probability of an event, lessen the consequence when/if that event happens, or both. Recall that risk = probability x consequence.
I use the example of the low fuel indicator on a dashboard to demonstrate an example of a risk-reducing, value-added PM task. If you do not stop within a reasonable distance to put fuel in your vehicle (this is a servicing type of PM) after this light comes on, there is a high probability that you’ll run out of gas on the road, and the consequence is that you are greatly inconvenienced. You have put yourself unnecessarily at risk.
I implore you not to take my word for it, or even that of my colleague. Rather, consider how ISO 55002 directs us to understand what puts us at risk and how we need to determine the level of risk that is right for our organization, in proper context:
“The overall purpose is to understand the cause, effect and likelihood of adverse events occurring, to manage such risks to an acceptable level, and to provide an audit trail for the management of risks. The intent is for the organization to ensure that the asset management system achieves its objectives, prevents or reduces undesired effects, identifies opportunities, and achieves continual improvement.”
“To manage such risks to an acceptable level.” This is a clear-cut, unambiguous mandate to understand what adversely affects the company (criticality analysis) and take control of the circumstances that put the organization at risk (risk management). I sometimes ask my classes to imagine a scenario where they are telling the company CEO that, “We are going to start managing risks at an acceptable level,” and, the company CEO responds, “Wait. You weren’t already doing that?”
Some measure of common sense must be applied here. We certainly can’t spend the resources assessing every asset against every risk potential. Reason has to prevail, if for no other reason than a fiscal one. Most facilities are guided by and evaluated against budgetary discipline. The budget is central to both for-profit and nonprofit organizations.
The criticality analysis, quite frankly, tells us:
There are certain assets in our operating facilities that are more important than others. Their loss, or a reduction in their function, affects our mission more than the loss of other assets. Our organization is funded for maintenance and repair; it makes sense to apply those funds against critical equipment first, and only in the execution of maintenance (small “m”) activities that are going to manage risk to an acceptable level. How could we possibly justify doing differently?
In “Criticality Analysis: What It Is and Why It’s Important,” we learn that “criticality analysis is also important because it can be used across a variety of scenarios within an organization.” Some of these scenarios might look like this:
What have we learned so far?
The good news is that we’ve already completed much of the work necessary to perform a criticality analysis. By this point, we have already established a complete listing of assets and ranked them in order of their importance (criticality). That was done as part of the Asset Criticality Analysis. The portion of this article addressing the purpose of a criticality analysis has aligned everyone to the need for this process, both according to the standard and the budget. And we have assembled an engaged team of decision-makers.
We have no other task but to get started.
Recall two points made earlier:
There are many methods for conducting a criticality analysis, just as there are many methods for conducting a root cause analysis. What I am going to demonstrate is a basic risk matrix approach that takes into consideration the probability (frequency) and the consequence (severity). Keep in mind that the descriptions shown are just examples. Any risk matrix can be tailored to the end user, using “their” words and “their” circumstances.
The first action is for the assembled team to determine which assets we are going to process through the analysis. It is a common practice to make the cut at the top 20% of your assets. If you recall, we don’t have the resources (money) or the luxury to focus on the lower-priority assets. Not initially. This is also in keeping with the Pareto principle, or 80/20 rule, that so many of us know by heart.
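The 20% cut is simple enough to sketch. The function below assumes the team has already given each asset a criticality weight (higher means more critical). Apart from the main transformer and Cutting Table 8, every asset name and weight here is a hypothetical placeholder.

```python
def pareto_cut(asset_weights, percent=20):
    """Return the names of the top `percent` of assets, most critical first."""
    ranked = sorted(asset_weights, key=asset_weights.get, reverse=True)
    if not ranked:
        return []
    # Round the cut up with integer arithmetic; max() keeps at least one asset.
    cut = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[:cut]


# Hypothetical plant list of ten assets and their criticality weights.
plant = {
    "Main transformer": 10,
    "Cutting Table 8": 8,
    "Air compressor": 7,
    "Conveyor 12": 6,
    "Paint booth": 5,
    "Forklift charger": 4,
    "Dock leveler": 3,
    "Office HVAC unit": 3,
    "Break room cooler": 2,
    "Parking lot lights": 1,
}

print(pareto_cut(plant))
```

With ten assets, the cut keeps the top two. On a real list, ties at the cut line are a judgment call for the team rather than something the arithmetic should decide.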
Here is a typical Risk Matrix. They can get more involved, but for this article, I want to keep it simple. This type of matrix is deployable in the field and has an everyday utility about it:
Look back at the RIME chart shown at the beginning of this article and notice that Cutting Table 8 has a weight or rank of 8. In an 80/20 cut, I’d suspect that Cutting Table 8 would “make the cut” and be considered a highly critical asset, certainly useful for our example.
With our assembled group, let’s start the criticality analysis. We will begin by having someone describe the function of Cutting Table 8. Since those reading this article don’t have any knowledge of this table, we can use the following as a scenario:
Mining giant Rio Tinto uses the term Maximum Reasonable Outcome (MRO) in their injury risk assessment. Although the application is different, the term MRO is common in risk mitigation and runs loosely along the lines of a worst-case scenario, which, of course, could also mean the best possible result.
For our use, let’s take the risk matrix and consider the worst-case scenario for the critical asset of Cutting Table 8. In the instance of affecting production, Cutting Table 8 could simply fail in one of its functions. Let’s say, for example, it fails to cut to length. Cutting Table 8 uses a common two-axis cutting torch that follows a pre-programmed sequence. The primary movers for transitioning the cutting torch are three electric motors.
In our scenario, we have:
Examine the risk matrix and note that there are dark green cells labeled as “Accept.” These are the areas that the team, and indeed the organization, have determined to be the acceptable level of risk in which this plant will operate (ref: ISO 55002).
Let’s further consider the failure mode of “Failure to cut to length due to loss of motor operation.” This could result from:
In an actual analysis, we would have performance and failure information, and we might conclude or measure that total loss of power to the table has never happened. In fact, it would be labeled as “Occasional” or “Remote.” The frequency or probability of this failure happening is rare (never recorded), and if it were to happen, the consequence or severity would result in downtime for sure, but it would not likely be catastrophic; it would be critical or marginal at worst. I would suspect that the nature of the consequence would be the result of the downtime and not injury or damage. If you consider the scenario I just painted, a total loss of power to Cutting Table 8 would put us in the risk matrix about here:
Now we have some choices. Clearly, the loss of Cutting Table 8 would not be acceptable, as it is an identified critical asset. Our criticality analysis has shown us that there is a range of outcomes for power failure to the table. As rare as that occurrence might be, we can’t rule out an “act of God.” We can manage the risk and drive the outcome to the cell labeled “14 Accept” by routine PM/PdM activities and some good storeroom practices. Since the loss of power is more of a power distribution issue than a motor performance issue, we might adopt the following mitigation strategy:
We would then repeat the criticality analysis using the Risk Matrix for the next failure situation around the loss of a motor. I suspect this would land us on the Risk Matrix around the “7” cell, and we could mitigate this with a spare motor (utilizing the concept of standardization) to minimize severity and move from Marginal to Minor. A good PM/PdM program would move the frequency from Frequent to Probable. Our objective is to measure risk, determine where the failure mode lands us on the matrix, and then take action to move to a level of acceptable risk.
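The measure, locate, and mitigate loop described above can be sketched as a small lookup. This is not the matrix from the client site: the scoring (frequency rank times severity rank) and the acceptance threshold are assumptions made for illustration, so the scores will not match cell numbers such as “7” or “14 Accept.” Only the frequency and severity labels are taken from the text.

```python
# Labels from the article, ordered from lowest to highest rank.
FREQUENCY = ["Remote", "Occasional", "Probable", "Frequent"]
SEVERITY = ["Minor", "Marginal", "Critical", "Catastrophic"]

# Hypothetical threshold: scores at or below this are treated as acceptable.
ACCEPT_AT_OR_BELOW = 4


def score(frequency, severity):
    """Assumed scoring: frequency rank (1-4) times severity rank (1-4)."""
    return (FREQUENCY.index(frequency) + 1) * (SEVERITY.index(severity) + 1)


def assess(frequency, severity):
    """Return the score and whether the team accepts or must mitigate."""
    s = score(frequency, severity)
    decision = "Accept" if s <= ACCEPT_AT_OR_BELOW else "Mitigate"
    return s, decision


# Loss of a motor before mitigation, as described in the text.
before = assess("Frequent", "Marginal")

# After a spare motor (Marginal to Minor) and PM/PdM (Frequent to Probable).
after = assess("Probable", "Minor")

print(before, after)
```

The value of a sketch like this is less in the numbers than in forcing the team to write down its scales and its acceptance line, so that everyone can agree on how the conclusion was reached.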
What have we learned so far?
If this looks like a lot of work, that’s because it is. Can anyone reading this article honestly answer “No” to any of the following questions?
Final question. Does any of this take any less time to accomplish the longer you wait to do it?