Many companies have recently implemented reliability initiatives geared toward optimizing the maintenance function at their plants. Some are successful; however, most will admit they did not realize the expected benefits. While there are many approaches to successfully implementing a reliability program, I will discuss a proven model for improving a company’s Reliability-Based Maintenance program through maintenance task optimization focused on failure elimination.

Let’s begin by assuming we are dealing with a large plant that has already implemented many programs in its attempt to move toward Reliability-Based Maintenance. A computerized maintenance management system (CMMS) is used to manage the operation, a large inspection-based preventive maintenance (PM) program has been built, and a relatively large predictive maintenance (PdM) program is in place to monitor asset condition.

Many of the pieces of the reliability puzzle exist, but improved cost and reliability results have not been realized because integration of the separate systems has not been considered, leaving each system sub-optimized.

Often, programs like those listed above are viewed by organizations as “stand-alone” programs. Yet if there is a concerted effort to refine and integrate all of the programs already in place, we will typically see increases in overall equipment effectiveness (OEE) with a significant reduction in maintenance spending.

The Starting Point
Success is typically measured by the improvement to the company’s bottom line. To achieve the financial success of any project, the key cost drivers addressed by the project have to be understood. For example, a plant may measure types of work (preventive, predictive, failure and modifications), labor and materials.

Let us assume we are looking at a plant where approximately 15% of the work is predictive, 35% is preventive, 25% is unexpected failure, and approximately 15% of the PM is delinquent each month. In addition, the organization may have a gross overlap between preventive and predictive maintenance activities. Overlap costs money, and it occurs for a specific reason. It is important to understand that reason before developing a maintenance strategy.

Plants can spend many years building PM programs, and they are encouraged to create PMs because they are rewarded for reduced failure when a PM process is implemented. Over time, these PM systems grow to include inspections for all manner of failures. Typically, a painful failure prompts the addition of a PM: the frequency is set and the PM is applied to every piece of equipment similar to the one that failed. The consequences and the nature of the failure are usually ignored because they have no bearing on the meeting with superiors to explain the failure. The machine failed, the boss is unhappy, and a PM makes the boss less unhappy. Eventually the number of PMs increases to the point that many aren’t being completed, and even with an extensive PM program there are still failures that can’t be eliminated.

A plant would begin a PdM program by monitoring a few pieces of highly critical equipment with vibration analysis (where there usually is some success). Of course, success is a positive reward, and to increase that success, the program would grow. If the organization has strong corporate support for implementing PdM, it would typically apply the technology to 50% of its known assets and use all available technologies. To decide how many assets to monitor, the plant would determine how many technologists it could support and then buy the equipment needed to perform the work.

In neither case, PM nor PdM implementation, were the failure modes, effects or consequences of failure evaluated to determine the cost-effectiveness or even the feasibility of the maintenance task to truly predict or eliminate failures.

As an example, a plant may use a predictive technology to monitor bearings, yet frequently send a mechanic to tear down the asset and inspect the same bearings. The redundancy may seem obvious, but it is common in industry. Because of this, we must first discuss the methodology before describing the implementation steps.

The principles used to correct such inefficiencies are:

  • All maintenance tasks must address a specific failure mode
  • Use the least expensive and most effective task to maintain the asset
  • The maintenance task interval will be such that it addresses the failure at the optimal point in that asset’s failure cycle
  • The total cost of the failure must exceed the cost of the tasks to maintain the asset
  • PM should ultimately be a time-based refurbishment, not an inspection
  • Failures created by operating an asset outside of capability cannot be maintained. The asset must be redesigned

To illustrate this approach, let’s take a quick look at the P-F Curve shown in Figure 1. Author John Moubray uses the P-F curve in his book “Reliability-Centered Maintenance II” to demonstrate the timeliness and effectiveness of PdM tasks. Points have been placed along the curve to represent a period of time (P-F) from defect detection point P to functional failure point F. Logic tells us that the longer the warning period, the easier it is to support the planning and scheduling effort necessary for an efficient, Reliability-Based Maintenance organization.

What we can readily see by studying this curve is that PdM tasks have the ability to identify failure-creating conditions at a longer P-F interval than PM tasks. In addition, the PdM task may be more suited to identifying the failure mode.
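The value of a longer warning period can be put in numbers. The following is a minimal Python sketch, not a method taken from Moubray's book: the function name and the day values are illustrative assumptions. It computes the worst-case warning time that remains once a defect is found, given how often the condition is checked.

```python
def worst_case_warning(pf_interval_days: float, inspection_interval_days: float) -> float:
    """Return the minimum warning time (days) between detection and functional failure.

    A defect that becomes detectable just after one check is not found until
    the next check, so the guaranteed warning is the P-F interval minus the
    inspection interval. Zero means the task can miss the failure entirely.
    """
    return max(pf_interval_days - inspection_interval_days, 0.0)


# Hypothetical values: a PdM technology that detects the defect 90 days before
# failure, and a PM inspection that only detects it 14 days before failure.
# Both are assumed to be performed every 30 days.
pdm_warning = worst_case_warning(90.0, 30.0)
pm_warning = worst_case_warning(14.0, 30.0)
print(pdm_warning, pm_warning)
```

Under these assumed numbers the PdM task always leaves time to plan and schedule the repair, while the PM inspection can miss the failure altogether.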

Further analysis of the labor required to perform the work shows us that from a financial standpoint, PdM tasks, on average, are one-fourth of the cost of a PM task used to detect the same failure mode. In addition, PM is proven to introduce failure that otherwise would not happen. This early failure is often referred to as infant mortality.

An additional, and often the greatest, financial impact is production downtime. PdM tasks are usually performed while the equipment is running and the corrective work identified by the PdM technology is scheduled concurrently with other high-value corrective tasks. PM inspections normally require the equipment to be shut down.
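A rough way to compare the two task types is to add labor and downtime cost per execution and multiply by the number of executions per year. The sketch below does this in Python; all cost figures, task names and the equal task frequency are hypothetical, chosen only so that the PdM labor cost is one-fourth of the PM labor cost as described above.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    labor_cost: float       # labor cost per execution
    downtime_hours: float   # production downtime required per execution
    runs_per_year: int


def annual_cost(task: Task, downtime_cost_per_hour: float) -> float:
    """Annual cost of a task: (labor + downtime) per run, times runs per year."""
    per_run = task.labor_cost + task.downtime_hours * downtime_cost_per_hour
    return per_run * task.runs_per_year


# Hypothetical figures for two tasks that address the same failure mode.
pm = Task("bearing teardown inspection", labor_cost=400.0, downtime_hours=4.0, runs_per_year=4)
pdm = Task("vibration route", labor_cost=100.0, downtime_hours=0.0, runs_per_year=4)

pm_cost = annual_cost(pm, downtime_cost_per_hour=1000.0)
pdm_cost = annual_cost(pdm, downtime_cost_per_hour=1000.0)
print(pm_cost, pdm_cost)
```

With these assumed inputs, downtime dominates the PM cost, which is the point made in the paragraph above.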

Figure 1: The P-F Curve, from John Moubray’s book “Reliability-Centered Maintenance II”

As you can see, the most economical decision, and the one that makes the most technical sense, is to maintain the asset by using the following resources, in order, as they apply:

  • Process monitoring
  • PdM technologies
  • Time/meter based directed tasks (PM)
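The order of preference above can be expressed as a simple lookup. This is a minimal sketch under the assumption that, for each failure mode, the plant already knows which resources are technically able to address it; the function name and string labels are illustrative.

```python
from typing import Iterable, Optional

# Order of preference taken from the list above; the labels are illustrative.
PREFERENCE_ORDER = ("process monitoring", "PdM technology", "time/meter-based PM")


def select_resource(applicable: Iterable[str]) -> Optional[str]:
    """Return the most preferred resource that can address the failure mode.

    Returns None when nothing applies, in which case another response
    (such as redesign of the asset) has to be considered.
    """
    available = set(applicable)
    for resource in PREFERENCE_ORDER:
        if resource in available:
            return resource
    return None


choice = select_resource(["time/meter-based PM", "PdM technology"])
no_choice = select_resource([])
print(choice, no_choice)
```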

Aligning maintenance tasks to failures
Failures can be grouped into the following three categories. Understanding these categories is critical when assigning maintenance tasks.

  • Induced
  • Intermittent
  • Wear out

Induced failures are the result of an outside force causing the failure mode. For instance, a plant may run the production process in such a way that the assets are prematurely forced into a potential failure situation, or a soft foot condition on an equipment train may cause coupling misalignment that eventually leads to an inboard bearing failure. While process and PdM monitoring may help detect these potential failures (thereby eliminating an unscheduled stoppage), it is important to understand that an induced failure must be recognized as such and analyzed to determine its root cause. Only then are we acting proactively and making the transition into a Reliability-Based Maintenance organization.

Intermittent failures can happen at any time. Some may actually use the term “random”; however, the implication is that the mean time between failure (MTBF) cannot be determined. These differ from induced failure because they typically happen far enough up the P-F Curve that the repair can be effectively planned and scheduled. A plant can best detect these failure modes through process and PdM monitoring when possible.

Many plants also find that PMs are not effective in determining the onset of failure in either induced or intermittent failures and are, therefore, a waste of capital. Too often, a plant may then choose to increase PM frequencies or, worse, write and schedule new procedures to attempt to mitigate these failures. This is what ultimately leads to an ineffective, costly and out-of-control maintenance program.

Wear-out failures have a known MTBF and they occur when the useful life of a component is expended. These types of failure modes are often detectable through process and PdM monitoring. However, time-based refurbishment usually proves to be the most effective maintenance strategy.
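The three categories and the responses described above can be summarized in a small lookup table. The following Python sketch only restates the text; the enum and function names are assumptions made for illustration.

```python
from enum import Enum


class FailureCategory(Enum):
    INDUCED = "induced"
    INTERMITTENT = "intermittent"
    WEAR_OUT = "wear out"


# Primary response for each category, as described in this section.
PRIMARY_STRATEGY = {
    FailureCategory.INDUCED: "root cause analysis, supported by process and PdM monitoring",
    FailureCategory.INTERMITTENT: "process and PdM monitoring",
    FailureCategory.WEAR_OUT: "time-based refurbishment (PM)",
}


def pm_is_candidate(category: FailureCategory) -> bool:
    """A time-based PM is only a candidate when the MTBF is known (wear out)."""
    return category is FailureCategory.WEAR_OUT


for category in FailureCategory:
    print(category.value, "->", PRIMARY_STRATEGY[category], "| PM candidate:", pm_is_candidate(category))
```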

The definition of PM
A PM, by definition, is a repair/replace activity that will restore the functionality or useful life of an asset back to its original state. Other types of PM are failure-finding or condition evaluation tasks. A plant would deploy a failure-finding task when the consequences of failure or the risks associated with the failure are intolerable; these tasks are also helpful in finding hidden failures. One method of failure finding is to test-run standby plant equipment on some frequency to ensure it hasn’t failed while sitting idle.

Condition evaluation tasks are performed to determine a component’s failure rate. When organizations choose to perform condition evaluation tasks, it is with the understanding that condition evaluation is used to try to determine the MTBF. Correctly applied, it should be quantitative in nature. In other words, a precision measurement is taken and compared to established criteria that define when replacement is necessary. There are two principal reasons a plant would establish quantitative measures.

  • Craft skill differences are minimized.
  • Wear rate trending. Where possible, warning or alert levels (yellow condition) and critical or action levels (red condition) should be defined.
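A quantitative condition evaluation of this kind reduces to comparing a measurement against two limits. The sketch below assumes a wear measurement that grows as the component degrades; the function name, the "green" label for readings below the warning level, and the millimetre values are illustrative assumptions.

```python
def classify_condition(measurement: float, warning_level: float, action_level: float) -> str:
    """Classify a wear measurement as green, yellow (warning) or red (action).

    Assumes larger readings mean more wear, so the warning level must be
    lower than the action level.
    """
    if warning_level >= action_level:
        raise ValueError("warning_level must be below action_level")
    if measurement >= action_level:
        return "red"
    if measurement >= warning_level:
        return "yellow"
    return "green"


# Hypothetical wear readings in millimetres, trended over three inspections.
readings = [0.8, 1.6, 2.4]
statuses = [classify_condition(r, warning_level=1.5, action_level=2.0) for r in readings]
print(statuses)
```

Because the result depends only on the measurement and the defined limits, two craftspeople taking the same reading reach the same conclusion.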

The implementation
A thorough understanding of the potential failures of each piece of equipment can be developed by applying failure modes and effects analysis (FMEA) to each equipment type in the plant. FMEA templates can be developed at a class/subclass/qualifier level (e.g. Pump/Centrifugal/Coupled or Pump/Centrifugal/Belt Driven), which saves significant time. For each equipment type, a plant should be able to answer the seven basic RCM questions.

  • What is its function?
  • What are the functional failures?
  • What are the failure modes?
  • What are the effects of those failures?
  • What are the consequences?
  • How can the failure be mitigated?
  • What if a suitable task cannot be found?

When answering Question 6, consider a logical path to utilize the three resources – process monitoring, PdM monitoring, and PM, in that order – as previously described.
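One possible way to hold such templates is a table keyed by class, subclass and qualifier, with one row per failure mode and one field per RCM question. The Python sketch below is an assumption about structure only; the field names and the example row are illustrative and do not come from a real analysis.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Template key: (class, subclass, qualifier), as in Pump/Centrifugal/Coupled.
TypeKey = Tuple[str, str, str]


@dataclass
class FmeaRow:
    function: str            # Question 1
    functional_failure: str  # Question 2
    failure_mode: str        # Question 3
    effect: str              # Question 4
    consequence: str         # Question 5
    mitigation: str          # Question 6
    fallback: str            # Question 7


# Illustrative content only; a real template comes from the plant's own analysis.
TEMPLATES: Dict[TypeKey, List[FmeaRow]] = {
    ("Pump", "Centrifugal", "Coupled"): [
        FmeaRow(
            function="Transfer process fluid at the required flow",
            functional_failure="Delivers no flow",
            failure_mode="Inboard bearing failure from coupling misalignment",
            effect="Pump trips and the line stops",
            consequence="Lost production",
            mitigation="PdM technology: vibration analysis",
            fallback="Redesign the asset",
        ),
    ],
}


def rows_for(asset_type: str) -> List[FmeaRow]:
    """Look up the template for a type written as 'Class/Subclass/Qualifier'."""
    equipment_class, subclass, qualifier = asset_type.split("/")
    return TEMPLATES.get((equipment_class, subclass, qualifier), [])


coupled_pump_rows = rows_for("Pump/Centrifugal/Coupled")
belt_pump_rows = rows_for("Pump/Centrifugal/Belt Driven")
print(len(coupled_pump_rows), len(belt_pump_rows))
```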

Once the FMEAs are completed, they can be applied at the asset level. This more granular review ties in the criticality ranking criteria to determine if the consequences of failure are great enough to perform the task. This is really an economic decision rule, “Is the cost of failure greater than the cost to mitigate?” This is extremely important to note since the goal of these programs is to reduce the cost of maintenance while maintaining high asset utilization.
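The decision rule can be written as a one-line comparison. The sketch below puts both sides on an annual basis, which is an assumption made here for illustration; the function name and the cost figures are hypothetical.

```python
def task_is_justified(
    cost_per_failure: float,
    expected_failures_per_year: float,
    annual_task_cost: float,
) -> bool:
    """Return True when the expected annual cost of failure exceeds the task cost."""
    annual_failure_cost = cost_per_failure * expected_failures_per_year
    return annual_failure_cost > annual_task_cost


# Hypothetical assets of the same type but different criticality.
critical_pump = task_is_justified(50_000.0, 0.5, 2_000.0)
small_fan = task_is_justified(300.0, 0.5, 2_000.0)
print(critical_pump, small_fan)
```

The same task from the same template can therefore be applied to one asset and withheld from another, depending on the consequences of failure.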

Now a plant can define and communicate process parameters and rebuild and implement PdM routes. For example, a plant may employ: slow and high-speed vibration monitoring, electrical and mechanical thermography, motor circuit analysis, oil analysis, and NDT thickness testing. Existing PM tasks that cover the same failure modes that are now being defined with the PdM tasks can then be removed from the system.
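Finding the PM tasks that the new PdM routes make redundant is a set comparison on failure modes. The following is a minimal sketch assuming each PM task has been tagged with the failure modes it addresses; the task names and failure modes are hypothetical.

```python
from typing import Dict, List, Set


def redundant_pm_tasks(pm_tasks: Dict[str, Set[str]], pdm_failure_modes: Set[str]) -> List[str]:
    """Return PM tasks whose failure modes are all covered by PdM routes.

    A PM that also addresses a failure mode not covered by PdM is kept.
    """
    return sorted(
        name
        for name, modes in pm_tasks.items()
        if modes and modes <= pdm_failure_modes
    )


# Hypothetical PM tasks and the failure modes each one addresses.
pm_tasks = {
    "PM-001 bearing teardown inspection": {"bearing wear"},
    "PM-002 gearbox inspection": {"bearing wear", "seal leak"},
    "PM-003 hanger bearing replacement": {"hanger bearing wear-out"},
}
pdm_failure_modes = {"bearing wear", "coupling misalignment"}

to_remove = redundant_pm_tasks(pm_tasks, pdm_failure_modes)
print(to_remove)
```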

If a plant determines that a PM is the most effective way to mitigate failure, the worn component is replaced. For example, if a screw conveyor is shut down for a PM that addresses hanger bearings, the bearings would be replaced rather than inspected to determine if replacement is needed. This approach is often taken because the cost to shut down the line and the labor required to tear down the equipment for inspection is greater than the cost of a few hanger bearings. Once repairs are completed, the removed bearings could be inspected “at the bench” to help further define the MTBF and thereby “tweak” the task frequencies if warranted. This eliminates almost all condition evaluation type tasks.

PM frequency is determined by work order history and craft knowledge. If there is a question about the MTBF, any given plant will choose the longer duration to set the PM frequency. Why should they choose the longer duration for failure rate? One might think this will cause some failure, but think of it this way: If every PM is entered conservatively and performed at a short and safe interval, it will take a long time to know if we right-sized the PM system. If each PM is set at an interval that to the best of our knowledge is the true interval, there will be a few mistakes made, but this will be evident relatively quickly. This may be an enormous leap for some plants. However, in order to make great strides in most reliability efforts, this will prove to be the correct method. If some frequencies are missed, they will be able to temporarily accept failure and improve over time. Success will hinge upon whether the frequencies appear to have been set appropriately and if unexpected failure does not increase.
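As a rough illustration of how work order history can feed the PM frequency, the sketch below estimates the MTBF as the mean gap between recorded failure dates and, where estimates disagree, picks the longer one as described above. The dates, the craft estimate and the function names are hypothetical.

```python
from datetime import date
from typing import List


def mean_time_between_failures(failure_dates: List[date]) -> float:
    """Estimate MTBF in days as the mean gap between consecutive failures."""
    if len(failure_dates) < 2:
        raise ValueError("at least two failure dates are needed")
    ordered = sorted(failure_dates)
    gaps = [(later - earlier).days for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps) / len(gaps)


def choose_pm_interval(estimates_days: List[float]) -> float:
    """When MTBF estimates disagree, choose the longer duration."""
    return max(estimates_days)


# Hypothetical failure dates taken from work orders for one component.
history = [date(2023, 1, 10), date(2023, 7, 9), date(2024, 1, 5)]
mtbf_from_history = mean_time_between_failures(history)

# Hypothetical estimate from craft knowledge, in days.
craft_estimate = 150.0
pm_interval = choose_pm_interval([mtbf_from_history, craft_estimate])
print(mtbf_from_history, pm_interval)
```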

The results
This approach typically results in the following:

  • During the initial stages, maintenance costs will drop, and will continue to do so.
  • Total maintenance staffing will decrease significantly compared with pre-project levels and will continue to decrease. (This is usually realized through the elimination of contractors.)
  • Significant return on the project investment (e.g. the first three months’ performance paid back over half of the total project costs).
  • Large shutdowns will be possible to enable the installation of new capital equipment while the OEE for the facility does not decline.
  • Equipment is taken out of service far less frequently due to PMs.
  • The number of predictive technologists increases as does the percentage of condition-monitored assets. Because of the depth of the condition monitoring coverage, continued monitoring ensures that the reliability of the plant is not compromised because of the project.

Many plants and managers identify tools and systems that claim to remedy the reliability ills of a facility. Implemented independently, the tools and systems are just added modules that increase costs without increasing plant reliability. True reliability is achieved when the most cost-effective methods are applied to the assets in the plant, thereby maximizing the maintenance effort with the minimum total cost to the business.

“Economy of force” is a military term used to describe the technique of using only the force necessary to defeat the enemy. In the reliability world, the enemy is downtime, labor, rework and materials costs. To compete globally, we must use the “economy of force” principle to ensure our plants run reliably at maximum output for minimum total costs. An integrated maintenance and reliability strategy is a key part of accomplishing this objective.

Timothy White presented this article at Noria Corporation’s 2008 conference in Nashville, Tenn.