Troubleshooting and RCA Come Together
Back in 1997, I had reliability engineers tell me that root cause analysis didn’t work. They tried it, but they didn’t find fixes that would stop repeat equipment failures. I didn’t understand their observation. It seemed to me that they should be able to get to the root causes of these equipment issues, just like root cause analysis of any other problem.
I decided to work with an equipment reliability expert to see if I could understand their problems and come up with a solution. That led me to Heinz Bloch.
Heinz had been an expert equipment troubleshooter for Exxon before he retired and started his consulting practice. Once he retired, he wrote dozens of books about equipment reliability, equipment troubleshooting and machinery design and lubrication, including his book “Machinery Failure Analysis and Troubleshooting: Practical Machinery Management for Process Plants.”
What did I learn in my discussions with Heinz? Reliability engineers were having problems with root cause analysis because they weren’t completing thorough troubleshooting of the equipment failure before they tried to identify the equipment problem’s root causes.
What did Heinz learn in his discussions with me? There are advanced root cause analysis methods that are more effective than the common 5-Whys, Fishbone Diagrams and Cause and Effect methods.
Heinz calculated that failing to troubleshoot equipment problems and fix the root cause could cost thousands (if not millions) of dollars in needless repairs and equipment downtime. Multiply these costs across a corporation’s facilities worldwide, and they could amount to hundreds of millions of dollars per year.
That motivated us to take action. We decided to work together to create a system for thorough troubleshooting of equipment problems and advanced root cause analysis of the reliability issues that would lead to effective corrective actions. This article explains the result of our work.
Thorough Equipment Troubleshooting
To develop an effective troubleshooting system, Heinz and I started with the troubleshooting tables that he successfully used and that he had included in his book. This provided a basis for developing a computerized troubleshooting technique that we called the Equifactor® Troubleshooting Tables.
The troubleshooting tables were divided into four topics:
- Equipment: Pumps, compressors, fans, blowers, engines, electric motors, refrigeration and conveyor belts.
- Manual Valves: Ball valves, butterfly valves, diaphragm valves, pinch valves, globe valves, gate valves, plug valves and general valve troubleshooting.
- Components: Bearings, gears, gear couplings and mechanical seals.
- Electrical: Resistors, cable insulation, switches, fuses, breakers, capacitors, terminals/joints, transformers, diodes, semiconductors and integrated circuits.
Each of these was broken down into an exhaustive list of failure symptoms and, for each symptom, a list of potential causes.
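As a rough illustration of that structure, the tables can be pictured as a nested lookup from topic to equipment to symptom to potential causes. The sketch below is only an assumption about how such a table might be represented; it is not the Equifactor® software. The names `TROUBLESHOOTING_TABLES` and `potential_causes` are hypothetical, and the only real entry is the cause found in the pump example later in this article. Everything else is a placeholder.

```python
# Hypothetical sketch: topic -> equipment -> symptom -> potential causes.
# Entries are placeholders, not the contents of the actual troubleshooting tables.
TROUBLESHOOTING_TABLES = {
    "Equipment": {
        "centrifugal pump": {
            "insufficient capacity": [
                "impeller installed backward",  # cause found in the article's example
                "<other potential cause>",      # placeholder entry
            ],
        },
    },
    "Components": {
        "bearings": {},  # symptoms omitted in this sketch
    },
}


def potential_causes(topic, equipment, symptom):
    """Return the potential causes listed for a symptom, or [] if none are tabulated."""
    symptoms = TROUBLESHOOTING_TABLES.get(topic, {}).get(equipment, {})
    return list(symptoms.get(symptom, []))


causes = potential_causes("Equipment", "centrifugal pump", "insufficient capacity")
for cause in causes:
    print(cause)
```

An empty result is the signal that the tables do not cover the symptom and that another troubleshooting method is needed.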
If these troubleshooting tables didn’t provide an answer, one could use Heinz’s two other troubleshooting methods, Failure Modes and Failure Agents, to develop a better understanding of the equipment issue.
Below is a graphic of the Equifactor® Troubleshooting Tables.
The techniques are explained in more detail in the example that follows.
From Troubleshooting to Root Cause Analysis
Thorough troubleshooting provided the information needed to start a root cause analysis. Without thorough troubleshooting, the reliability expert was working blindly. That’s why they thought that root cause analysis didn’t work. With the knowledge gained from identifying the potential cause that led to the failure (and eliminating the other potential causes that did not), the reliability professional could now identify the failure’s root causes using an advanced root cause analysis system.
The advanced root cause analysis system we chose is the TapRooT® Root Cause Analysis System, which includes the Root Cause Tree® Diagram. The TapRooT® System is described in more detail in the example below.
Example: Pump Fails to Pump Rated Flow
In this example, a pump that was vibrating excessively was removed, rebuilt and reinstalled in the system. When the pump was tested, the vibration was gone, but the pump provided only 70% of its previously rated flow. The questions for the reliability engineer were: what was wrong, and what was the root cause of the problem?
This example uses the Equifactor® Six Step Process for troubleshooting and finding the root causes of the pump’s inability to provide the rated flow. The process is shown below.
Normally, the process starts with the analyst drawing a SnapCharT® Diagram of what they know. For this example, the SnapCharT® Diagram they initially drew is provided below.
Next, they started the troubleshooting process by opening the centrifugal pump troubleshooting table and selecting the insufficient capacity symptom. That symptom, shown below in the computerized Equifactor® Troubleshooting Table, provided a list of possible causes.
For this example, those 25 potential causes are what the analyst needed to either verify or eliminate.
At this point, we recommend developing a troubleshooting checklist that starts with the easiest potential causes to eliminate, such as tasks that don’t require pump removal or disassembly, followed by the more in-depth tasks. An example of this type of checklist is provided below.
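The ordering principle behind such a checklist (easiest checks first) can also be sketched in code. This is a minimal illustration, not part of the Equifactor® tools. The `Check` class, the effort ranking and the placeholder causes are all hypothetical, and the sketch simply assumes that confirming the impeller orientation requires disassembly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    cause: str                       # potential cause to verify or eliminate
    needs_removal: bool = False      # pump must be removed from the system
    needs_disassembly: bool = False  # pump must be taken apart


def effort_rank(check: Check) -> int:
    """Assumed ranking: 0 = check in place, 1 = needs removal, 2 = needs disassembly."""
    if check.needs_disassembly:
        return 2
    if check.needs_removal:
        return 1
    return 0


def build_checklist(checks):
    """Order checks from least to most intrusive (sorted() keeps ties in input order)."""
    return sorted(checks, key=effort_rank)


# Placeholder causes; only the impeller entry comes from the article's example.
example_checks = [
    Check("impeller installed backward", needs_removal=True, needs_disassembly=True),
    Check("<cause checkable after removing the pump>", needs_removal=True),
    Check("<cause checkable with the pump in place>"),
]

checklist = build_checklist(example_checks)
for step, check in enumerate(checklist, start=1):
    print(step, check.cause)
```

Sorting by intrusiveness means the cheapest eliminations happen before anyone commits to pulling or opening the pump.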
Answering these questions should lead the troubleshooter to the cause of the problem. In this case, they found the impeller (a double suction/double volute impeller) was installed backward. The information gained was added to the SnapCharT® Diagram shown below.
The information in the SnapCharT® Diagram is used to identify the problem’s Causal Factors. In this case, the Causal Factor identified was “Mechanic installed impeller backward.”
Notice that what was originally perceived as an equipment problem (pump not pumping rated flow) was actually a human performance problem (mechanic installed impeller backward). Without thorough troubleshooting, there was little chance of correctly identifying the problem’s root cause.
Next, the Causal Factor was analyzed using the TapRooT® Root Cause Tree® Diagram.
The Root Cause Tree® Diagram and the associated Root Cause Tree® Dictionary provide a comprehensive set of questions that help the analyst identify the fixable root causes of human performance issues. In this case, the analyst would be guided to look at procedure use, quality control, human engineering, management system and work direction.
An example of the Procedure Basic Cause Category from the back side of the Root Cause Tree® Diagram is shown below. This is one of the seven Basic Cause Categories that could be indicated for analysis.
Depending on the answers to the questions and the root causes selected, the analyst might decide to require:
- A written procedure to be used that includes a warning about installing the impeller backward.
- A quality control inspection after an impeller is installed on the shaft.
- The manufacturer to develop a keyway that only allows the impeller to be installed correctly.
- The supervisor to verify the correct installation of an impeller before the pump is reassembled.
Cost of Failures
The cost of a failure is never negligible. We must consider not only the cost of unplanned downtime, lost production and spare parts, but also the cost of pulling plant workers away from their scheduled activities to perform an emergency repair. When the expenses are added together, one failure has the potential to cost a company hundreds of thousands of dollars.
For example, an offshore gas production platform was experiencing the recurring failure of a downhole pump that supplied cooling water to the platform’s processes. Each time the pump failed, it cost the company over $50,000 to fix, and on average, this one pump at this one site was experiencing over six failures a year. That’s over $300,000 a year for this one failure alone. Multiply this by other pumps and platforms, and the costs have the potential to bleed a company dry, all because the maintenance personnel were not effectively troubleshooting the problem and fixing the root cause.
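The arithmetic behind that figure is simple enough to write down. The sketch below uses the two numbers quoted above as lower bounds. The function name `annual_failure_cost` and the `n_pumps` parameter are hypothetical, and the ten-pump case is purely illustrative, not a figure from the platform.

```python
COST_PER_FAILURE_USD = 50_000  # article: "over $50,000" per failure (lower bound)
FAILURES_PER_YEAR = 6          # article: "over six failures a year" (lower bound)


def annual_failure_cost(cost_per_failure, failures_per_year, n_pumps=1):
    """Yearly cost of repeat failures, assuming every pump fails at the same rate."""
    return cost_per_failure * failures_per_year * n_pumps


one_pump = annual_failure_cost(COST_PER_FAILURE_USD, FAILURES_PER_YEAR)
print(f"One pump: at least ${one_pump:,} per year")  # One pump: at least $300,000 per year

# Hypothetical scale-up: ten similar pumps across a company's sites.
ten_pumps = annual_failure_cost(COST_PER_FAILURE_USD, FAILURES_PER_YEAR, n_pumps=10)
print(f"Ten pumps: at least ${ten_pumps:,} per year")
```

Because both inputs are stated as "over" a value, the results are floors on the real cost, not estimates of it.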
Surely the potential cost savings that could be achieved are worth the cost of training reliability engineers, maintenance managers and maintenance technicians to effectively troubleshoot and find the root causes of equipment reliability issues?
But this isn’t the only reason to consider implementing effective troubleshooting and advanced root cause analysis. Equipment failures and unplanned maintenance can lead to even more significant issues. These issues include injuries while performing unplanned maintenance and explosions or fires due to a reduction in the containment of flammable or hazardous materials.
In the end, effective troubleshooting, paired with advanced root cause analysis, is essential to saving your company money and creating a safer environment for all.
*We are proud to have worked with Heinz Bloch to develop and improve the Equifactor® Troubleshooting Tools and Techniques. Heinz passed away in 2022, but we will continue his legacy by teaching equipment professionals the tools he developed.
*The figures in this article are copyrighted by System Improvements, Inc. and are used here with permission. Duplication or use in any other form is prohibited. For more information about the Equifactor® Troubleshooting Techniques and TapRooT® Root Cause Analysis, visit taproot.com.