Beyond the Root Cause: 11 Lessons Learned from a Lifetime of Troubleshooting

R. Keith Mobley

The Reality of Being a Good Troubleshooter

In the mid-1980s, United Air Lines produced a commercial that clearly defined the life of an effective troubleshooter. This corporate problem-solver could never return home, bouncing from one emergency to the next and living out of a suitcase on his never-ending journey. I knew then, and certainly know now, just how he felt.

For 60 years, I lived out of a suitcase, traveling to different countries and towns big and small, solving a myriad of problems along the way. It has been a good life, and reflecting on my career, I’m happy to say that it was a fulfilling one. The high points were when I made people’s lives better. Helping major corporations increase profitability was great, but when I directly improved people's lives through problem solving, I felt the absolute joy of troubleshooting.

One such instance involved a food manufacturing plant with about 700 employees that was facing a shutdown due to enormous scrap rates. Lacking a solution, a friend in their corporate office asked for my help. Even though I was already involved in a major effort at a sizeable high-speed manufacturing plant, I agreed to try.

My first visit was worse than I imagined. As the employees finished their shifts, they were obviously physically and mentally exhausted beyond anything I had ever seen. Over the next few weeks, we methodically resolved all their issues—a combination of design deficiencies and operating practices—and returned the plant to normal operations. It was not easy, but with effort we could (and did) find the solution.

I returned a year later, and the change was remarkable. The factory floor was pristine, and all the production lines ran well. Because of one troubleshooting effort, 700 families had a stable income and a bright future. Without it, the plant would have shut down. That is what being a good troubleshooter is about.

Becoming a World-Class Troubleshooter

My primary goal in life was not troubleshooting; from the age of 10, my goal was to become the best mechanical engineer. But after university, it seemed like every step in my career reinforced the need for finding cost-effective solutions to problems. My first job as a maintenance engineer in a mid-size manufacturing plant certainly stretched my troubleshooting skills and taught me my first lesson.

The company, a division of Rubbermaid, manufactured various plastic products over open gas flames. This was in the early 1960s when plastics were in their infancy. Our plant was evaluating the use of Dowtherm oil in lieu of open flame to mold plastic, which was a new and innovative technique. However, chronic vapor-locking was creating non-stop downtime.

I had a bright idea: I couldn’t stop the vapor from forming, but I could remove the vapor from the hot oil system, which would prevent downtime. I asked the maintenance and production managers for some planned downtime to strategically install standpipes that would allow the vapor to bleed off. The maintenance manager, a self-made engineer, laughed and said, "There is no reason to shut down; turn off the pump and that will relieve the pressure."

That day I learned the hard way that 600°F oil does not make for a good shower. My first troubleshooting lesson: Never blindly trust what others tell you.

This was the first of many lessons I learned over the next 60 years. Some are the result of the perspective I’ve gained climbing the corporate ladder, and others are lessons learned out of pure necessity. All were beneficial personally and professionally and improved my troubleshooting skills. Understanding how valuable these lessons have proved to be in my career, my goal is to pass these lessons on to help others’ careers grow and improve.

11 Lessons Learned

I have limited troubleshooting to asset-related failures and/or problems for this article, but these same lessons apply to almost all problems or issues that warrant resolution. The lessons below are organized as steps in a typical troubleshooting sequence, not in chronological order.

Lesson 1: Identify the Real Problem

It sounds simple, but identifying the problem is often the most elusive part of troubleshooting. When you ask what the problem is, the number of different answers you get is often directly proportional to the number of people asked. Descriptions tend to be vague and can point to symptoms that may or may not be real. The only thing you can be sure of is that most responses will point to maintenance deficiencies as the source of the problem.

Early in my career I learned there are not enough hours in the day to spend them chasing the wrong problem (or one that doesn’t even exist). That’s why this first step is so crucial and in no way optional. Some may question why so much time is spent clarifying the problem, while others may become impatient and insist that we get on with it and quit wasting time.

Trust me, this is not a waste of time. Overall, it will save time and ensure positive results.

Repeatedly asking "why" works. Many people laugh at the “5-Why” approach to troubleshooting, but it does work. I’ve often found that the real problem was something other than the perception that triggered the investigation. It is tedious and time-consuming, but if you cannot accurately lock down the problem at the beginning, the odds are great that it will not be resolved in the end.

Lesson 2: Forget Your Bias

Because of our upbringing and the conditioning of our work environment, we all have biases of one form or another. Bias is rarely a good thing and can be an insurmountable obstacle when troubleshooting. One common bias I encountered regularly is that all asset problems (especially failures) are a result of poor maintenance or operator error. When trying to identify the root cause of a problem, these preconceptions can easily cloud the outcome.

A standing joke in the troubleshooting business is, "You don't know what you don't know." Put another way, you do not know as much as you think you do, especially about how assets work. For example, misunderstanding the difference between the design and operation of a fan or blower in a positive vs. a negative system, or the inherent reliability of a centerline vs. end suction centrifugal pump, can and often does lead to wrong conclusions. Because of our natural bias against admitting we do not know, we charge ahead without the fundamental knowledge needed to solve the problem.

It is difficult to cleanse the brain of bias, but it is necessary for successful troubleshooting. Let the facts take you to the answer without the “Kentucky windage” your bias may introduce.

Lesson 3: Identify Normal

Before you can identify an abnormal condition or behavior in an asset, production line, or manufacturing system, you must first understand what normal conditions look like. Every asset, process, or system is designed to perform specific work.

They are bound by incoming parameters, the amount of work that can be performed, and a limited range of finished goods. Within these boundary conditions are specific modes of operation, sustaining maintenance requirements, and limitations on incoming materials or flow. The asset, process, or system will perform normally when these conditions are met.

This normal condition can be described using process parameters such as flow, pressure, temperature, vibration signatures, or other measurable variables. This description of “normal” is your reference point for effective troubleshooting. Without this clear, measurable benchmark, your troubleshooting will quickly devolve, and the probability of success is limited.
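
As a rough illustration of using "normal" as a reference point, the sketch below compares a set of readings against a documented baseline. Every parameter name, unit, and limit here is a hypothetical placeholder; real limits would come from the asset's design documents and from measurements taken while it runs well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NormalRange:
    """Acceptable band for one process parameter (hypothetical values)."""
    low: float
    high: float
    unit: str


# Hypothetical baseline for an illustrative pump, not taken from any real asset.
BASELINE = {
    "flow": NormalRange(180.0, 220.0, "gpm"),
    "discharge_pressure": NormalRange(55.0, 65.0, "psi"),
    "bearing_temperature": NormalRange(100.0, 160.0, "degF"),
    "vibration": NormalRange(0.0, 0.15, "in/s"),
}


def find_deviations(readings, baseline=BASELINE):
    """Return {parameter: message} for every reading outside its normal range."""
    deviations = {}
    for name, value in readings.items():
        limits = baseline.get(name)
        if limits is None:
            # Without a documented normal, the reading cannot be judged.
            deviations[name] = "no baseline defined"
        elif not (limits.low <= value <= limits.high):
            deviations[name] = (
                f"{value} {limits.unit} outside "
                f"{limits.low}-{limits.high} {limits.unit}"
            )
    return deviations


readings = {"flow": 150.0, "discharge_pressure": 60.0, "vibration": 0.31}
for name, message in find_deviations(readings).items():
    print(f"{name}: {message}")
```

With these made-up readings, flow and vibration are flagged while discharge pressure is not. A parameter with no baseline is reported too, because a reading that cannot be compared to "normal" tells you nothing.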

Lesson 4: Evaluate the Application

Most problems can be resolved by determining the difference between the asset's design and its actual application. Statistically, 27% of asset failures result from using assets in applications outside their design boundaries. From my experience, at least 50% of your troubleshooting tasks can be resolved this way.

When comparing asset applications against their intended use, consider everything. Do not let bias impede important comparisons. Looking at the mode of operation, ask these questions:

How is the asset started and shut down, and how are speed changes made?
Are the incoming materials or flow within the original design constraints?
Is the output consistent with the asset’s designed work capabilities?

I once consulted on a first-of-a-kind chemical process line that from startup had chronic, catastrophic failures with a wet grinder. This created a bottleneck and single-point failure in a continuous system. After the third failure in nine months, the general manager was convinced the culprit was improper maintenance and asked me to find a solution.

Following the application evaluation, I consulted the system’s original design in detail. With this knowledge, an application review quickly isolated the reason behind the chronic wet grinder failures. The in-feed to the grinder was rated for 12,000 pounds per hour of long-chain polymer, but the wet grinder it fed was designed to process only 3,500 pounds per hour.

But that was the trigger, not the root cause. The root cause was that the company had deliberately downsized the wet grinder during the building process to save money.

Consequently, the solution was expensive. The piping had to be modified to install parallel wet grinders with enough capacity to process the 12,000 pounds per hour coming from the in-feed. Once this was complete, the line began operating normally. This is typical of what you will find when following the design and application review sequence.
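
The arithmetic behind that application review can be sketched in a few lines. The two flow figures come from the example above; the function names are illustrative. The parallel-unit count assumes every grinder has the same 3,500 pounds-per-hour rating, which is a simplification, since the actual number and size of the grinders installed is not stated here.

```python
import math

# Figures from the wet grinder example above.
INFEED_LB_PER_HR = 12_000
GRINDER_RATED_LB_PER_HR = 3_500


def overload_ratio(demand, rated_capacity):
    """How many times its rated capacity the asset is being asked to handle."""
    if rated_capacity <= 0:
        raise ValueError("rated capacity must be positive")
    return demand / rated_capacity


def parallel_units_needed(demand, unit_capacity):
    """Smallest count of identical parallel units whose combined rating covers demand."""
    if unit_capacity <= 0:
        raise ValueError("unit capacity must be positive")
    return math.ceil(demand / unit_capacity)


ratio = overload_ratio(INFEED_LB_PER_HR, GRINDER_RATED_LB_PER_HR)
print(f"In-feed is {ratio:.1f}x the grinder's design rating")

# Assumption for illustration only: all parallel grinders share one rating.
units = parallel_units_needed(INFEED_LB_PER_HR, GRINDER_RATED_LB_PER_HR)
print(f"Identical 3,500 lb/hr grinders needed in parallel: {units}")
```

Under those assumptions the in-feed is roughly 3.4 times the grinder's rating. A ratio well above 1.0 on any design boundary is a strong hint that the application, not the maintenance, is the place to look.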

Lesson 5: Choose the Best Approach

There are endless troubleshooting approaches to resolving problems, but which is the most effective? Throughout my career, I’ve come to understand that none are perfect; all can face serious limitations and be misused.

However, one common factor in every approach is the need for a qualified team. Teams typically include operators, maintenance technicians, engineers, and subject matter experts. These teams often possess the combined knowledge and unique experience needed to resolve the problem. Unfortunately, teams can devolve due to biased opinions and contradicting ideas, often resulting in the most dominant member becoming the root cause “decider.”

The keys to team effectiveness are:

Including a qualified, unbiased facilitator to keep individual agendas at bay and the team focused on the problem being investigated.
Setting a time limit for the effort to keep the team focused (with facilitation) on the true issues. I have found that if a team cannot solve a problem within ten calendar days, it’s unlikely they will ever solve it.

Lesson 6: Never Assume Anything

I cannot overstate the importance of utilizing documented facts when troubleshooting. Every time that an assumption is made, the effort risks devolving into chasing squirrels. Never assume anything. Incorrect assumptions mean efforts are wasted chasing false problems. For example, you should not assume that operators follow the best practices or that maintenance technicians always perform sustaining maintenance.

It's not always possible to verify everything, but avoid using assumptions, opinions, or any other form of undocumented or unproven information in the troubleshooting process. Troubleshooting must be based on facts, not suppositions.

Lesson 7: Look Beyond the Obvious

One of the more serious potential mistakes in the troubleshooting process is to stop short of the true root cause(s). The tendency is to assign the formal reason for a problem or failure as human error, such as operator error or failure to perform the required maintenance tasks. There are two problems with this conclusion.

First, human error is the symptom or trigger of the event, not the root cause. The event will recur unless the underlying reason behind the error is identified and corrected. Such causes might include an incorrect procedure, faulty instructions from a supervisor, or even homelife problems.

The second problem is there are no benefits to finding a scapegoat. The focus of too many troubleshooting efforts is finding someone to blame, thinking this will solve the problem. This is very rarely the case.

Lesson 8: Evaluate Your Hypothesis

Once you have determined what you believe to be the root cause(s), evaluate your hypothesis and confirm your conclusions before finalizing a report. Remember, a solution is just a hypothesis until it’s proven correct.

This might require setting up controlled tests to determine the cause and effect of inputs that might have contributed to the problem. It might require using predictive or non-destructive testing methodology to quantify the impact of controlled inputs. Regardless of the approach, all potential root causes and possible corrective actions must be evaluated before the troubleshooting effort can be considered complete.

Lesson 9: Prove Your Hypothesis Wrong

If, like me, you hate to be wrong, this advice is for you. Before signing your name to a troubleshooting report, try to prove your conclusions wrong. If you don’t, someone else will.

One of the many things I have learned over my career is that when problems occur, especially those involving serious downtime and/or incurred costs, everyone tries to become invisible and ensure they aren’t at fault. Therefore, any report you generate will be examined closely. Those implicated, either directly or indirectly, will do their best to discount the conclusions.

The best way to counter this is to do it first. Evaluate your conclusions from every contrary viewpoint, including the facts, observations, and arguments that could detract from them. If you cannot prove your conclusions wrong, then the probability that they are correct is high.

Lesson 10: Find a Cost-Effective Solution

While I was teaching a vibration analysis course, a young engineer asked an interesting question that caught me off guard. He asked, "If you can’t tell them how to fix the problem, have you done your job?" He had a valid point. Sometimes we focus on finding the root cause and lose sight of this essential step.

Here, bias once again enters the troubleshooting process. Depending on your background, the viable solutions may range from simple to overly complex. If you are a maintenance technician, your solution may not be technically correct or aesthetically pleasing, but it might at least solve the symptoms of the problem. If you are an engineer, your solution may be almost perfect from a technical perspective, but it’s also likely to be expensive. Somewhere in the middle is often the most cost-effective solution.

Another item to address is ensuring that the solution doesn’t create other problems. Solutions must be evaluated following management of change processes that ensure any change in form, fit, or function will not create additional problems.

Lesson 11: Verify the Solution

The troubleshooting effort is not over until the recommended corrective action(s) have been proven and have not introduced other issues that degrade reliability, performance, or the useful life of the asset, process, or system. This involves a tracking methodology that accurately quantifies key parameters that measure both the corrective actions and the overall performance of the extended system to which the asset belongs.
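
One way to picture such tracking is a before-and-after comparison of every key parameter, not only the one the fix was aimed at. The sketch below is a minimal illustration. The parameter names, sample values, and the 10% threshold are all hypothetical, and it assumes each baseline average is non-zero.

```python
from statistics import mean


def verify_corrective_action(before, after, lower_is_better, min_improvement=0.10):
    """Classify each tracked parameter as improved, degraded, or unchanged.

    Compares the mean before and after the fix. Fractional changes smaller
    than min_improvement count as unchanged. Assumes non-zero baseline means.
    """
    results = {}
    for name in before:
        base = mean(before[name])
        new = mean(after[name])
        change = (new - base) / base  # fractional change relative to baseline
        if not lower_is_better[name]:
            change = -change  # flip sign so negative always means "better"
        if change <= -min_improvement:
            results[name] = "improved"
        elif change >= min_improvement:
            results[name] = "degraded"
        else:
            results[name] = "unchanged"
    return results


# Hypothetical samples gathered before and after a corrective action.
before = {
    "scrap_rate_pct": [12.0, 11.5, 12.5],
    "throughput_units_per_hr": [400, 410, 390],
    "motor_current_amps": [40, 41, 39],
}
after = {
    "scrap_rate_pct": [3.0, 2.5, 3.5],
    "throughput_units_per_hr": [405, 400, 395],
    "motor_current_amps": [48, 50, 49],
}
lower_is_better = {
    "scrap_rate_pct": True,
    "throughput_units_per_hr": False,
    "motor_current_amps": True,
}

for name, verdict in verify_corrective_action(before, after, lower_is_better).items():
    print(f"{name}: {verdict}")
```

In this made-up data the scrap rate improves and throughput holds steady, but motor current rises by more than the threshold. That is exactly the kind of side effect that should keep the troubleshooting effort open.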

This paper was created as supplemental content for R. Keith Mobley’s speaking session at the 2023 Reliable Plant Conference.
