At its core, reliability engineering is about predicting and preventing failure. By listing out all the ways a system can fail, it becomes possible to work out all the ways failure can be detected, delayed, or stopped dead before it can even truly begin.
Failure reporting, analysis and corrective action system (FRACAS) is an organized method of sorting through the possible means of asset failure and working back to all the possible root causes. The final product acts as a map for all the ways by which things can fail, allowing for a plan to be laid out to correct issues as they arise, and for eliminating chronic problems at the point of inception.
The FRACAS process should encompass an understanding of component criticality across the system and should make use of experienced planners, maintainers, and operators in listing all the myriad ways in which a machine fails.
Failure is best defined as any operating state other than ideal. Most would think of failure as a state of total inoperability, but operating in a degraded state should be considered a failure just the same. I was recently at a plant where a machine had been retooled to operate at a speed higher than originally designed. The operators and maintainers in that specific area of the plant considered it to be a great success to have gotten their system to work so well, but just down the line the overfeed hadn’t been accounted for and product was spilling out at the next pumping point, creating a large pile of waste. So, failure can be something working better than it’s supposed to as well as working worse.
The best way to proceed is to list out all the ways a machine can fail by breaking it down to smaller subcomponents that would either be worked or replaced as part of repair. The subcomponent level could go all the way down to nuts & bolts or to “off the shelf” equipment that can easily be replaced, such as a small motor or sensor. It’s very useful to have access to the material inventory program that your site uses.
This may already be tied to the in-use CMMS or may be a distinct system. Using the material inventory program will be useful again later during the corrective action section when determining how many spare parts and consumables should be kept on hand.
Another word for failure reporting would be recording. My first maintenance manager’s catch phrase was “you gotta write it down.” It’s important to log everything in a structured, organized way. There are many software options that can be purchased to help in this area, but for a smaller plant or system, consider using Projects or a spreadsheet. Having a structured cascade of subcomponents and failure types is important because patterns and repeated modes of failure will occur. Theoretically, there are infinite ways something can fail, but in practice you’ll find a lot of the same ones showing up over and over.
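For a plant that starts with a spreadsheet-style log, the structured cascade described above can be sketched in a few lines of Python. This is only an illustration: the field names, the sample entries, and the `repeat_failures` helper are assumptions made for the example, not part of any particular CMMS or FRACAS package.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureRecord:
    """One logged failure event. All field names are illustrative assumptions."""
    system: str
    component: str
    subcomponent: str
    failure_mode: str
    date: str  # ISO date string, e.g. "2024-01-08"


def repeat_failures(records, min_count=2):
    """Return {(system, component, subcomponent, failure_mode): count}
    for every combination logged at least `min_count` times."""
    counts = Counter(
        (r.system, r.component, r.subcomponent, r.failure_mode) for r in records
    )
    return {key: n for key, n in counts.items() if n >= min_count}


# Hypothetical sample log, invented for this sketch.
log = [
    FailureRecord("Line 1", "Feed pump", "Mechanical seal", "leak", "2024-01-08"),
    FailureRecord("Line 1", "Feed pump", "Mechanical seal", "leak", "2024-02-19"),
    FailureRecord("Line 1", "Conveyor", "Drive motor", "overheat", "2024-02-21"),
    FailureRecord("Line 1", "Feed pump", "Mechanical seal", "leak", "2024-03-30"),
]

repeats = repeat_failures(log)
for key, n in repeats.items():
    print(" > ".join(key), f"({n} occurrences)")
```

Keeping the system, component, subcomponent, and failure mode as separate fields is what makes the repeats easy to count; a free-text description column would hide the same pattern.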
Another method of reporting is to use a cause-map for each system. This is a more graphical approach and can be harder to capture electronically without some experience in that process, but in many cases it allows for an easier understanding of what’s happening by others outside the FRACAS process. A simple version would be a basic drawing or picture of the system or machine in question with text boxes treeing out from the points of failure. These boxes can subsequently branch into further causes or effects creating a sort of cloud of failure modes around the equipment in the middle.
Having all the methods of failure listed out in an organized way now allows for a team of subject matter experts to review and notice trends. Having an experienced group reviewing the data allows for a fair amount of history to come into play as well. Every plant has at least one machine with an established record of breaking down, and having a team of the people most familiar with those sorts of incidents is vital to a successful analysis phase.
With that, it’s also important to distinguish between chronic failures and one-off events. There can be a tendency to focus on notable or particularly catastrophic events and try to prevent that same specific set of circumstances from happening again. While it is important to consider all possibilities, sometimes material failure is such a freak occurrence that there is no practical lesson to be learned.
A criticality assessment should be performed in parallel with failure analysis. When combined with the structured cascade of equipment, subcomponents, and failure modes, a criticality assessment will help determine how much detail the analysis team should go into when assigning corrective actions. Criticality should be defined by a combination of price to repair or replace, cost of labor to repair or replace, and effect of downtime on the system or plant as a whole.
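One possible way to combine those three factors is to express each in dollars and add them up, giving a rough cost per failure event. The sketch below assumes that approach; the additive formula, the field names, and the sample figures are all hypothetical, and a real assessment may weight the factors differently or add others such as safety.

```python
from dataclasses import dataclass


@dataclass
class CriticalityInput:
    """Inputs for one component. Field names and units are assumptions."""
    name: str
    parts_cost: float              # dollars to repair or replace
    labor_hours: float             # hours of maintenance labor
    labor_rate: float              # dollars per labor hour
    downtime_hours: float          # hours the system is down
    downtime_cost_per_hour: float  # dollars of lost production per hour


def criticality_score(c):
    """Estimated dollars lost per failure event (assumed additive model)."""
    return (
        c.parts_cost
        + c.labor_hours * c.labor_rate
        + c.downtime_hours * c.downtime_cost_per_hour
    )


def rank_by_criticality(items):
    """Most critical component first."""
    return sorted(items, key=criticality_score, reverse=True)


# Hypothetical figures, invented for this sketch.
components = [
    CriticalityInput("Panel indicator lamp", 20, 0.5, 75, 0, 2000),
    CriticalityInput("Feed pump seal", 400, 4, 75, 6, 2000),
    CriticalityInput("Conveyor drive motor", 1500, 3, 75, 2, 2000),
]

ranked = rank_by_criticality(components)
for c in ranked:
    print(f"{c.name}: ${criticality_score(c):,.2f} per failure")
```

In this made-up data the cheapest part to buy, the pump seal, ranks highest because of the downtime it causes, which is the kind of result a criticality assessment is meant to surface.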
In the lean/Six Sigma world, there’s been a strong focus on reducing inventory and having just-in-time deliveries of products or material to save on storage costs. Applying a criticality analysis to your system will let you see which machines or components need ready-spare replacements on the shelf and which can be ordered only when a failure happens. This also helps decide how long a failing machine can be allowed to operate in that state.
Consider a machine whose repair would require the whole system to be shut down but could be completed in two days with a part already on hand. The alternative is to operate that same system at 90 percent through the end of the month while waiting on a replacement machine that can be swapped in instantaneously. Ninety percent efficiency seems like the way to go, but having the full use of the system for 28 of 30 days is more than 93 percent, a not insignificant increase in plant output for that timeframe. A thorough criticality assessment will help list out these sorts of options to be weighed against the eventual corrective actions list.
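The arithmetic behind that comparison can be checked with a short calculation. The sketch below assumes a 30-day month with the fault appearing on the first day; the `average_output` helper is hypothetical and simply computes a time-weighted average of output.

```python
def average_output(segments):
    """Time-weighted average output over a period.

    `segments` is a list of (days, output_fraction) pairs, where
    output_fraction is 1.0 for full output and 0.0 for a shutdown.
    """
    total_days = sum(days for days, _ in segments)
    return sum(days * fraction for days, fraction in segments) / total_days


# Option A: shut down for 2 days to repair, then run at full output for 28 days.
repair_now = average_output([(2, 0.0), (28, 1.0)])

# Option B: run degraded at 90 percent for all 30 days (instant swap at month end).
run_degraded = average_output([(30, 0.9)])

print(f"Repair now:   {repair_now:.1%}")    # 93.3%
print(f"Run degraded: {run_degraded:.1%}")  # 90.0%
```

The result shifts quickly with the inputs: under the same assumptions, a three-day repair gives 27 of 30 days, which is exactly the 90 percent break-even point.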
The simplest corrective action to list for each instance of failure is to replace the failed component. Slightly more complex is to repair it. Freely list all possible corrective actions for each failure but also add a dollar amount that considers both the part and the labor needed to repair or replace. Work orders can be pre-written for either case and stored in the CMMS for use in the event failure occurs.
Another corrective action that can be listed is “run to failure.” There isn’t a true one-size-fits-all solution to prevent or correct every failure mode, but applying the Pareto principle will produce an easily managed list of activities and actions that can be swiftly applied. In other words, it will likely be found that most of the chronic issues affecting a system are from a smaller, manageable list of faults and can be corrected by a similarly manageable list of follow-up actions.
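A Pareto review of the failure log can be sketched as follows. The 80 percent threshold is the conventional rule of thumb rather than a figure from this article, and the `pareto_vital_few` helper and the sample events are assumptions made for illustration.

```python
from collections import Counter


def pareto_vital_few(failure_modes, threshold=0.8):
    """Return the most frequent failure modes that together account for
    at least `threshold` of all logged events, most frequent first."""
    counts = Counter(failure_modes)
    total = sum(counts.values())
    vital, cumulative = [], 0
    for mode, n in counts.most_common():
        if cumulative >= threshold * total:
            break  # the modes collected so far already cover the threshold
        vital.append(mode)
        cumulative += n
    return vital


# Hypothetical event log: one entry per recorded failure.
events = (
    ["seal leak"] * 9
    + ["belt misalignment"] * 5
    + ["sensor drift"] * 3
    + ["motor overheat"] * 2
    + ["loose fastener"] * 1
)

vital_few = pareto_vital_few(events)
print(vital_few)
```

In this invented log, three of the five failure modes cover 17 of the 20 events, so corrective actions for those three would address most of the recorded trouble.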
The most important part of FRACAS is that it’s maintained as a system. Repeatability of this process is vital for bringing additional systems, new equipment, emerging technologies, and even new personnel into the plant. Have a plan to review the effectiveness of corrective actions, and bring in new members to the analysis group to help shine a light on any area of the plant that doesn’t seem to show marked improvement in uptime.
Another systematic part of this process lies in educating other members of the plant. This increases ownership of the process for personnel not involved in the initial analysis and can be a culture-building event. Training others early in the process helps them see that the process is unbiased and is open to incorporating their ideas as well.
Failure is random, but in most cases it can be detected, delayed, or prevented altogether with a systematic approach. Data collection and review by experienced plant personnel are key, as is figuring a dollar amount for each possible action or outcome to help guide the eventual corrective process. Applying these methods consistently over time, as well as teaching the basic principles of the process to other team members, will help improve culture and increase overall asset reliability. Remember, you want to have a FRACAS, not a fracas.