Repairable Systems Analysis

The previous chapters presented analysis methods for data obtained during developmental testing. However, data from systems in the field can also be analyzed in the RGA software. This type of data is called fielded systems data and is analogous to warranty data. Fielded systems can be categorized into two basic types: one-time or nonrepairable systems and reusable or repairable systems. In the latter case, under continuous operation, the system is repaired, but not replaced after each failure. For example, if a water pump in a vehicle fails, the water pump is replaced and the vehicle is repaired. Two types of analysis are presented in this chapter. The first is repairable systems analysis where the reliability of a system can be tracked and quantified based on data from multiple systems in the field. The second is fleet analysis where data from multiple systems in the field can be collected and analyzed so that reliability metrics for the fleet as a whole can be quantified.

=Background=

Most complex systems, such as automobiles, communication systems, aircraft, printers, medical diagnostics systems, helicopters, etc., are repaired and not replaced when they fail. When these systems are fielded or subjected to a customer use environment, it is often of considerable interest to determine the reliability and other performance characteristics under these conditions. Areas of interest may include assessing the expected number of failures during the warranty period, maintaining a minimum mission reliability, evaluating the rate of wearout, determining when to replace or overhaul a system and minimizing life cycle costs. In general, a lifetime distribution, such as the Weibull distribution, cannot be used to address these issues. In order to address the reliability characteristics of complex repairable systems, a process is often used instead of a distribution. The most popular process model is the Power Law model. This model is popular for several reasons. One is that it has a very practical foundation in terms of minimal repair. This is the situation when the repair of a failed system is just enough to get the system operational again. Second, if the time to first failure follows the Weibull distribution, then each succeeding failure is governed by the Power Law model in the case of minimal repair. From this point of view, the Power Law model is an extension of the Weibull distribution.

Sometimes, the Crow Extended model, which was introduced in a previous chapter for analyzing developmental data, is also applied to fielded repairable systems. Applying the Crow Extended model to repairable system data allows analysts to project the system MTBF after reliability-related issues are addressed during field operation. Projections are calculated based on the mode classifications (A, BC and BD). The calculation procedure is the same as the one for developmental data and is not repeated in this chapter.

Distribution Example
Visualize a socket into which a component is inserted at time $$0$$. When the component fails, it is replaced immediately with a new one of the same kind. After each failure, the socket is put back into an as good as new condition. Each component has a time-to-failure that is determined by the underlying distribution. It is important to note that a distribution relates to a single failure. The sequence of failures for the socket constitutes a random process called a renewal process. In the illustration below, the component life is $${{X}_{j}}$$  and  $${{t}_{j}}$$  is the system time to the  $${{j}^{th}}$$  failure. Each component life $${{X}_{j}}$$  in the socket is governed by the same distribution  $$F(x)$$. A distribution, such as the Weibull, governs a single lifetime. There is only one event associated with a distribution. The distribution $$F(x)$$  is the probability that the life of the component in the socket is less than  $$x$$. In the illustration above, $${{X}_{1}}$$  is the life of the first component in the socket. $$F(x)$$ is the probability that the first component in the socket fails in time  $$x$$. When the first component fails, it is replaced in the socket with a new component of the same type. The probability that the life of the second component is less than $$x$$  is given by the same distribution function,  $$F(x)$$. For the Weibull distribution:


 * $$F(x)=1-{{e}^{-\lambda {{x}^{\beta }}}}$$

A distribution is also characterized by its density function, such that:


 * $$f(x)=\frac{d}{dx}F(x)$$

The density function for the Weibull distribution is:


 * $$f(x)=\lambda \beta {{x}^{\beta -1}}\cdot {{e}^{-\lambda {{x}^{\beta }}}}$$

In addition, an important reliability property of a distribution function is the failure rate, which is given by:


 * $$r(x)=\frac{f(x)}{1-F(x)}$$

The interpretation of the failure rate is that for a small interval of time $$\Delta x$$,  $$r(x)\Delta x$$  is approximately the probability that a component in the socket will fail between time  $$x$$  and time  $$x+\Delta x$$ , given that the component has not failed by time  $$x$$. For the Weibull distribution, the failure rate is given by:


 * $$r(x)=\lambda \beta {{x}^{\beta -1}}$$

It is important to note the condition that the component has not failed by time $$x$$. Again, a distribution deals with one lifetime of a component and does not allow for more than one failure. The socket has many failures and each failure time is individually governed by the same distribution. In other words, the failure times are independent of each other. If the failure rate is increasing, then this is indicative of component wearout. If the failure rate is decreasing, then this is indicative of infant mortality. If the failure rate is constant, then the component failures follow an exponential distribution. For the Weibull distribution, the failure rate is increasing for $$\beta >1$$, decreasing for $$\beta <1$$ and constant for $$\beta =1$$. Each time a component in the socket is replaced, the failure rate of the new component reverts to the value at time $$0$$. This means that the socket is as good as new after each failure and the subsequent replacement by a new component. This process is continued for the operation of the socket.
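The relationships among the Weibull distribution function, density and failure rate can be checked numerically. The following is a minimal sketch in Python; the function names and the parameter values are illustrative assumptions and do not come from the RGA software. The density is written as the derivative of $$F(x)$$, that is $$\lambda \beta {{x}^{\beta -1}}{{e}^{-\lambda {{x}^{\beta }}}}$$.

```python
import math

def weibull_cdf(x, lam, beta):
    """F(x) = 1 - exp(-lam * x**beta): probability that a component life is less than x."""
    return 1.0 - math.exp(-lam * x ** beta)

def weibull_pdf(x, lam, beta):
    """f(x) = dF/dx = lam * beta * x**(beta - 1) * exp(-lam * x**beta)."""
    return lam * beta * x ** (beta - 1) * math.exp(-lam * x ** beta)

def weibull_failure_rate(x, lam, beta):
    """r(x) = f(x) / (1 - F(x)), which simplifies to lam * beta * x**(beta - 1)."""
    return lam * beta * x ** (beta - 1)

# Hypothetical scale parameter, chosen only for illustration.
lam = 0.001
for beta in (0.5, 1.0, 2.0):
    # The rate decreases with age for beta < 1, is constant for beta = 1
    # and increases for beta > 1.
    rates = [weibull_failure_rate(x, lam, beta) for x in (10.0, 100.0, 1000.0)]
    print(beta, rates)
```

Printing the rates at several ages for each value of $$\beta $$ shows the three regimes described above: decreasing, constant and increasing failure rate.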

Process Example
Now suppose that a system consists of many components with each component in a socket. A failure in any socket constitutes a failure of the system. Each component in a socket is a renewal process governed by its respective distribution function. When the system fails due to a failure in a socket, the component is replaced and the socket is again as good as new. The system has been repaired. Because there are many other components still operating with various ages, the system is not typically put back into a like new condition after the replacement of a single component. For example, a car is not as good as new after the replacement of a failed water pump. Therefore, distribution theory does not apply to the failures of a complex system, such as a car. In general, the intervals between failures for a complex repairable system do not follow the same distribution. Distributions apply to the components that are replaced in the sockets but not at the system level. At the system level, a distribution applies to the very first failure. There is one failure associated with a distribution. For example, the very first system failure may follow a Weibull distribution.

For many systems in a real world environment, a repair is only enough to get the system operational again. If the water pump fails on the car, the repair consists only of installing a new water pump. If a seal leaks, the seal is replaced but no additional maintenance is done, etc. This is the concept of minimal repair. For a system with many failure modes, the repair of a single failure mode does not greatly improve the system reliability from what it was just before the failure. Under minimal repair for a complex system with many failure modes, the system reliability after a repair is the same as it was just before the failure. In this case, the sequence of failures at the system level follows a non-homogeneous Poisson process (NHPP). The system age when the system is first put into service is time $$0$$. Under the NHPP, the first failure is governed by a distribution $$F(x)$$  with failure rate  $$r(x)$$. Each succeeding failure is governed by the intensity function $$u(t)$$  of the process. Let $$t$$  be the age of the system and let  $$\Delta t$$  be a very small interval. The probability that a system of age $$t$$  fails between  $$t$$  and  $$t+\Delta t$$  is approximately  $$u(t)\Delta t$$, where  $$u(t)$$  is the intensity function. Notice that this probability is not conditioned on not having any system failures up to time $$t$$, as is the case for a failure rate. The failure intensity $$u(t)$$  for the NHPP has the same functional form as the failure rate governing the first system failure. Therefore, $$u(t)=r(t)$$, where  $$r(t)$$  is the failure rate for the distribution function of the first system failure. If the first system failure follows the Weibull distribution, the failure rate is:


 * $$r(x)=\lambda \beta {{x}^{\beta -1}}$$

Under minimal repair, the system intensity function is:


 * $$u(t)=\lambda \beta {{t}^{\beta -1}}$$

This is the Power Law model. It can be viewed as an extension of the Weibull distribution. The Weibull distribution governs the first system failure and the Power Law model governs each succeeding system failure. If the system has a constant failure intensity $$u(t)=\lambda $$, then the intervals between system failures follow an exponential distribution with failure rate  $$\lambda $$. If the system operates for time $$T$$, then the number of failures  $$N(T)$$  over  $$0$$  to  $$T$$  is a random variable whose expected value is given by the Power Law mean value function:


 * $$E[N(T)]=\lambda {{T}^{\beta }}$$

Therefore, the probability that $$N(T)=n$$  is given by the Poisson probability:


 * $$\Pr [N(T)=n]=\frac{{{\left( \lambda {{T}^{\beta }} \right)}^{n}}{{e}^{-\lambda {{T}^{\beta }}}}}{n!};\text{ }n=0,1,2\ldots $$

When $$\beta =1$$, the intensity function is constant, $$u(t)=\lambda $$, and the process is referred to as a homogeneous Poisson process because there is no change in the intensity function. This is a special case of the Power Law model. The Power Law model is a generalization of the homogeneous Poisson process and allows for change in the intensity function as the repairable system ages. For the Power Law model, the failure intensity is increasing for $$\beta >1$$  (wearout), decreasing for  $$\beta <1$$  (infant mortality) and constant for  $$\beta =1$$  (useful life).
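As an illustration, the mean value function and the Poisson probability of observing exactly $$n$$ failures by time $$T$$, which for the Power Law model has mean $$\lambda {{T}^{\beta }}$$, can be evaluated with a few lines of Python. This is only a sketch; the function names and the parameter values are assumptions made for the example.

```python
import math

def expected_failures(T, lam, beta):
    """Power Law mean value function: E[N(T)] = lam * T**beta."""
    return lam * T ** beta

def prob_n_failures(n, T, lam, beta):
    """Poisson probability of exactly n failures in [0, T] with mean lam * T**beta."""
    m = expected_failures(T, lam, beta)
    return m ** n * math.exp(-m) / math.factorial(n)

# Hypothetical parameters, chosen only for illustration.
lam, beta, T = 0.002, 1.3, 1000.0
print("expected failures:", expected_failures(T, lam, beta))
print("P(no failures):", prob_n_failures(0, T, lam, beta))
```

Setting `beta = 1.0` in this sketch gives the homogeneous Poisson process, where the expected number of failures grows linearly with $$T$$.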

Using the Power Law Model to Analyze Complex Repairable Systems
The Power Law model is often used to analyze the reliability of complex repairable systems in the field. A system of interest may be the total system, such as a helicopter, or it may be subsystems, such as the helicopter transmission or rotor blades. When these systems are new and first put into operation, the start time is $$0$$. As these systems are operated, they accumulate age, e.g. miles on automobiles, number of pages on copiers, hours on helicopters. When these systems fail, they are repaired and put back into service.

Some system types may be overhauled and some may not, depending on the maintenance policy. For example, an automobile may not be overhauled but helicopter transmissions may be overhauled after a period of time. In practice, an overhaul may not convert the system reliability back to where it was when the system was new. However, an overhaul will generally make the system more reliable. Appropriate data for the Power Law model is collected over cycles. If a system is not overhauled, then there is only one cycle and the zero time is when the system is first put into operation. If a system is overhauled, then the same serial number system may generate many cycles. Each cycle will start a new zero time, the beginning of the cycle. The age of the system is measured from the beginning of the cycle. For systems that are not overhauled, there is only one cycle and the reliability characteristics of a system as it ages during its life are of interest. For systems that are overhauled, you are interested in the reliability characteristics of the system as it ages during its cycle.

For the Power Law model, a data set for a system will consist of a starting time $$S$$, an ending time  $$T$$  and the accumulated ages of the system during the cycle when it had failures. Assume the data exists from the beginning of a cycle (i.e. the starting time is 0), although non-zero starting times are possible with the Power Law model. For example, suppose data has been collected for a system with 2000 hours of operation during a cycle. The starting time is $$S=0$$  and the ending time is  $$T=2000$$. Over this period, failures occurred at system ages of 50.6, 840.7, 1060.5, 1186.5, 1613.6 and 1843.4 hours. These are the accumulated operating times within the cycle and there were no failures between 1843.4 and 2000 hours. It may be of interest to determine how the systems perform as part of a fleet. For a fleet, it must be verified that the systems have the same configuration, same maintenance policy and same operational environment. In this case, a random sample must be gathered from the fleet. Each item in the sample will have a cycle starting time $$S=0$$, an ending time  $$T$$  for the data period and the accumulated operating times during this period when the system failed.

There are many ways to generate a random sample of $$K$$  systems. One way is to generate $$K$$  random serial numbers from the fleet. Then go to the records corresponding to the randomly selected systems. If the systems are not overhauled, then record when each system was first put into service. For example, the system may have been put into service when the odometer mileage equaled zero. Each system may have a different amount of total usage, so the ending times, $$T$$, may be different. If the systems are overhauled, then the records for the last completed cycle will be needed. The starting and ending times and the accumulated times to failure for the $$K$$  systems constitute the random sample from the fleet. There is a useful and efficient method for generating a random sample for systems that are overhauled. If the overhauled systems have been in service for a considerable period of time, then each serial number system in the fleet would go through many overhaul cycles. In this case, the systems coming in for overhaul actually represent a random sample from the fleet. As $$K$$  systems come in for overhaul, the data for the current completed cycles would be a random sample of size  $$K$$.

In addition, the warranty period may be of particular interest. In this case, randomly choose $$K$$  serial numbers for systems that have been in customer use for a period longer than the warranty period. Then check the warranty records. For each of the $$K$$  systems that had warranty work, the ages corresponding to this service are the failure times. If a system did not have warranty work, then the number of failures recorded for that system is zero. The starting times are all equal to zero and the ending time for each of the $$K$$  systems is equal to the warranty operating usage time, e.g. hours, copies, miles.

In addition to the intensity function $$u(t)$$  given by Eqn. (intensity) and the mean value function given by Eqn. (expected failures), other relationships based on the Power Law are often of practical interest. For example, the probability that a system of age $$t$$  will survive to age  $$t+d$$  without failure is given by:


 * $$R(t)={{e}^{-\left[ \lambda {{\left( t+d \right)}^{\beta }}-\lambda {{t}^{\beta }} \right]}}$$

This is the mission reliability for a system of age $$t$$  and mission length  $$d$$.
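The mission reliability expression can be evaluated directly. The short Python sketch below uses an assumed function name and hypothetical parameter values; it is meant only to show how the system age $$t$$ and the mission length $$d$$ enter the calculation.

```python
import math

def mission_reliability(t, d, lam, beta):
    """Probability that a system of age t completes a mission of length d without failure.

    R = exp(-[lam * (t + d)**beta - lam * t**beta])
    """
    return math.exp(-(lam * (t + d) ** beta - lam * t ** beta))

# Hypothetical parameters with beta > 1 (wearout): older systems are less
# likely to complete the same 100-hour mission.
for age in (0.0, 500.0, 2000.0):
    print(age, mission_reliability(age, 100.0, 1e-4, 1.5))
```

For $$\beta =1$$ the result depends only on the mission length $$d$$, which reflects the constant intensity of the homogeneous Poisson process.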

=Parameter Estimation=

Suppose that the number of systems under study is $$K$$  and the  $${{q}^{th}}$$  system is observed continuously from time  $${{S}_{q}}$$  to time  $${{T}_{q}}$$,  $$q=1,2,\ldots ,K$$. During the period $$[{{S}_{q}},{{T}_{q}}]$$, let  $${{N}_{q}}$$  be the number of failures experienced by the  $${{q}^{th}}$$  system and let  $${{X}_{i,q}}$$  be the age of this system at the  $${{i}^{th}}$$  occurrence of failure,  $$i=1,2,\ldots ,{{N}_{q}}$$. It is also possible that the times $${{S}_{q}}$$  and  $${{T}_{q}}$$  may be observed failure times for the  $${{q}^{th}}$$  system. If $${{X}_{{{N}_{q}},q}}={{T}_{q}}$$  then the data on the  $${{q}^{th}}$$  system is said to be failure terminated and  $${{T}_{q}}$$  is a random variable with  $${{N}_{q}}$$  fixed. If $${{X}_{{{N}_{q}},q}}<{{T}_{q}}$$  then the data on the  $${{q}^{th}}$$  system is said to be time terminated with  $${{N}_{q}}$$  a random variable. The maximum likelihood estimates of $$\lambda $$  and  $$\beta $$  are values satisfying the Eqns. (lambdaPowerLaw) and (BetaPowerLaw).


 * $$\begin{align}
& \widehat{\lambda }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,\left( T_{q}^{\widehat{\beta }}-S_{q}^{\widehat{\beta }} \right)} \\ & \widehat{\beta }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{\widehat{\lambda }\underset{q=1}{\overset{K}{\mathop{\sum }}}\,\left[ T_{q}^{\widehat{\beta }}\ln ({{T}_{q}})-S_{q}^{\widehat{\beta }}\ln ({{S}_{q}}) \right]-\underset{q=1}{\overset{K}{\mathop{\sum }}}\,\underset{i=1}{\overset{{{N}_{q}}}{\mathop{\sum }}}\,\ln ({{X}_{i,q}})} \end{align}$$

where $$0\ln 0$$  is defined to be 0. In general, these equations cannot be solved explicitly for $$\widehat{\lambda }$$  and  $$\widehat{\beta },$$  but must be solved by iterative procedures. Once $$\widehat{\lambda }$$  and  $$\widehat{\beta }$$  have been estimated, the maximum likelihood estimate of the intensity function is given by:


 * $$\widehat{u}(t)=\widehat{\lambda }\widehat{\beta }{{t}^{\widehat{\beta }-1}}$$

If $${{S}_{1}}={{S}_{2}}=\ldots ={{S}_{K}}=0$$  and  $${{T}_{1}}={{T}_{2}}=\ldots ={{T}_{K}}=T$$, then the maximum likelihood estimates  $$\widehat{\lambda }$$  and  $$\widehat{\beta }$$  are in closed form.


 * $$\begin{align}
& \widehat{\lambda }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{K{{T}^{\widehat{\beta }}}} \\ & \widehat{\beta }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,\underset{i=1}{\overset{{{N}_{q}}}{\mathop{\sum }}}\,\ln \left( \tfrac{T}{{{X}_{i,q}}} \right)} \end{align}$$
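The closed-form estimates are simple to compute. The Python sketch below applies them to the single-system data quoted earlier in this chapter (six failures over a cycle with $$S=0$$ and $$T=2000$$ hours), treated here as a sample of size $$K=1$$ purely for illustration. The function name is an assumption.

```python
import math

def power_law_mle_closed_form(failure_ages_per_system, T):
    """Closed-form estimates when every system starts at 0 and ends at the same time T.

    beta = (total failures) / sum over all failures of ln(T / X_iq)
    lam  = (total failures) / (K * T**beta)
    """
    k = len(failure_ages_per_system)
    n_total = sum(len(x) for x in failure_ages_per_system)
    log_sum = sum(math.log(T / xi) for x in failure_ages_per_system for xi in x)
    beta = n_total / log_sum
    lam = n_total / (k * T ** beta)
    return lam, beta

# Failure ages (hours) from the single-system illustration in the text.
ages = [[50.6, 840.7, 1060.5, 1186.5, 1613.6, 1843.4]]
lam_hat, beta_hat = power_law_mle_closed_form(ages, 2000.0)
print(lam_hat, beta_hat)
```

For this small data set the estimate of $$\beta $$ comes out close to 1, which suggests a roughly constant failure intensity over the cycle. With only six failures, such a conclusion should be treated with caution.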

The following examples illustrate these estimation procedures.

Example 1
For the data in Table 13.1, the starting time for each system is equal to $$0$$  and the ending time for each system is 2000 hours. Calculate the maximum likelihood estimates $$\widehat{\lambda }$$  and  $$\widehat{\beta }$$.

Solution Since the starting time for each system is equal to zero and each system has an equivalent ending time, the general Eqns. (lambdaPowerLaw) and (BetaPowerLaw) reduce to the closed form Eqns. (sample1) and (sample2). The maximum likelihood estimates of $$\hat{\beta }$$  and  $$\hat{\lambda }$$  are then calculated as follows:


 * $$\begin{align}
& \widehat{\beta }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,\underset{i=1}{\overset{{{N}_{q}}}{\mathop{\sum }}}\,\ln \left( \tfrac{T}{{{X}_{i,q}}} \right)} \\ & = & 0.45300 \end{align}$$


 * $$\begin{align}
& \widehat{\lambda }= & \frac{\underset{q=1}{\overset{K}{\mathop{\sum }}}\,{{N}_{q}}}{K{{T}^{\widehat{\beta }}}} \\ & = & 0.36224 \end{align}$$



The system failure intensity function is then estimated by:


 * $$\widehat{u}(t)=\widehat{\lambda }\widehat{\beta }{{t}^{\widehat{\beta }-1}},\text{ }t>0$$

Figure wpp intensity is a plot of $$\widehat{u}(t)$$  over the period (0, 3000). Clearly, the estimated failure intensity function is most representative over the range of the data and any extrapolation should be viewed with the usual caution.

Goodness-of-Fit Tests for Repairable System Analysis
It is generally desirable to test the compatibility of a model and data by a statistical goodness-of-fit test. A parametric Cramér-von Mises goodness-of-fit test is used for the multiple system and repairable system Power Law model, as proposed by Crow in [17]. This goodness-of-fit test is appropriate whenever the start time for each system is 0 and the failure data is complete over the continuous interval $$[0,{{T}_{q}}]$$  with no gaps in the data. The Chi-Squared test is a goodness-of-fit test that can be applied under more general circumstances. In addition, the Common Beta Hypothesis test also can be used to compare the intensity functions of the individual systems by comparing the $${{\beta }_{q}}$$  values of each system. Lastly, the Laplace Trend test checks for trends within the data. Due to their general application, the Common Beta Hypothesis test and the Laplace Trend test are both presented in Appendix B. The Cramér-von Mises and Chi-Squared goodness-of-fit tests are illustrated next.

Economical Life Model
One consideration in reducing the cost to maintain repairable systems is to establish an overhaul policy that will minimize the total life cost of the system. However, an overhaul policy makes sense only if $$\beta >1$$. It does not make sense to implement an overhaul policy if $$\beta <1$$  since wearout is not present. If you assume that there is a point at which it is cheaper to overhaul a system than to continue repairs, what is the overhaul time that will minimize the total life cycle cost while considering repair cost and the cost of overhaul? Denote $${{C}_{1}}$$  as the average repair cost (unscheduled),  $${{C}_{2}}$$  as the replacement or overhaul cost and  $${{C}_{3}}$$  as the average cost of scheduled maintenance. Scheduled maintenance is performed for every $$S$$  miles or time interval. In addition, let $${{N}_{1}}$$  be the number of failures in  $$[0,t]$$  and let  $${{N}_{2}}$$  be the number of replacements in  $$[0,t]$$. Suppose that replacement or overhaul occurs at times $$T$$,  $$2T$$ ,  $$3T$$. The problem is to select the optimum overhaul time $$T={{T}_{0}}$$  so as to minimize the long term average system cost (unscheduled maintenance, replacement cost and scheduled maintenance). Since $$\beta >1$$, the average system cost is minimized when the system is overhauled (or replaced) at time  $${{T}_{0}}$$  such that the instantaneous maintenance cost equals the average system cost. The total system cost between overhaul or replacement is:


 * $$TSC(T)={{C}_{1}}E(N(T))+{{C}_{2}}+{{C}_{3}}\frac{T}{S}$$

So the average system cost is:


 * $$C(T)=\frac{{{C}_{1}}E(N(T))+{{C}_{2}}+{{C}_{3}}\tfrac{T}{S}}{T}$$

The instantaneous maintenance cost at time $$T$$  is equal to:


 * $$IMC(T)={{C}_{1}}\lambda \beta {{T}^{\beta -1}}+\frac{{{C}_{3}}}{S}$$

The following equation holds at optimum overhaul time $${{T}_{0}}$$ :


 * $$\begin{align}
& {{C}_{1}}\lambda \beta T_{0}^{\beta -1}+\frac{{{C}_{3}}}{S}= & \frac{{{C}_{1}}E(N({{T}_{0}}))+{{C}_{2}}+{{C}_{3}}\tfrac{{{T}_{0}}}{S}}{{{T}_{0}}} \\ & = & \frac{{{C}_{1}}\lambda T_{0}^{\beta }+{{C}_{2}}+{{C}_{3}}\tfrac{{{T}_{0}}}{S}}{{{T}_{0}}} \end{align}$$


 * Therefore:


 * $${{T}_{0}}={{\left[ \frac{{{C}_{2}}}{\lambda (\beta -1){{C}_{1}}} \right]}^{1/\beta }}$$

When there is no scheduled maintenance, Eqn. (ecolm) becomes:


 * $${{C}_{1}}\lambda \beta T_{0}^{\beta -1}=\frac{{{C}_{1}}\lambda T_{0}^{\beta }+{{C}_{2}}}{{{T}_{0}}}$$

The optimum overhaul time, $${{T}_{0}}$$, is the same as Eqn. (optimt), so for periodic maintenance scheduled every $$S$$  miles, the replacement or overhaul time is the same as for the unscheduled and replacement or overhaul cost model.
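The economical life calculation can be sketched in Python as follows. The function names and the parameter values are hypothetical. The cost ratio of 4 between an overhaul and a repair is borrowed from Example 7 purely as an illustration. The sketch evaluates the optimum overhaul time and checks that the instantaneous maintenance cost equals the average system cost at that time, with and without scheduled maintenance.

```python
def optimum_overhaul_time(lam, beta, c1, c2):
    """T0 = [c2 / (lam * (beta - 1) * c1)]**(1 / beta); requires beta > 1."""
    if beta <= 1:
        raise ValueError("an overhaul policy only makes sense for beta > 1")
    return (c2 / (lam * (beta - 1) * c1)) ** (1.0 / beta)

def average_cost(T, lam, beta, c1, c2, c3=0.0, S=1.0):
    """C(T) = [c1 * lam * T**beta + c2 + c3 * T / S] / T."""
    return (c1 * lam * T ** beta + c2 + c3 * T / S) / T

def instantaneous_cost(T, lam, beta, c1, c3=0.0, S=1.0):
    """IMC(T) = c1 * lam * beta * T**(beta - 1) + c3 / S."""
    return c1 * lam * beta * T ** (beta - 1) + c3 / S

# Hypothetical Power Law parameters; overhaul cost is four times the repair cost.
lam, beta = 1e-5, 1.8
c1, c2 = 1.0, 4.0
t0 = optimum_overhaul_time(lam, beta, c1, c2)
print("optimum overhaul time:", t0)
print("average cost at T0:", average_cost(t0, lam, beta, c1, c2))
# Adding scheduled maintenance (c3 every S units) shifts both costs by c3 / S,
# so the optimum overhaul time is unchanged.
print(instantaneous_cost(t0, lam, beta, c1, c3=0.5, S=1000.0),
      average_cost(t0, lam, beta, c1, c2, c3=0.5, S=1000.0))
```

Because the scheduled maintenance term $${{C}_{3}}/S$$ appears on both sides of the optimality condition, it cancels, which is why the same $${{T}_{0}}$$ applies with or without scheduled maintenance.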

=Examples=

Example 6 (repairable system data)
This case study is based on the data given in the article Graphical Analysis of Repair Data by Dr. Wayne Nelson [23]. The data in Table 13.10 represents repair data on an automatic transmission from a sample of 34 cars. For each car, the data set shows mileage at the time of each transmission repair, along with the latest mileage. The + indicates the latest mileage observed without failure. Car 1, for example, had a repair at 7068 miles and was observed until 26,744 miles. Do the following:


 * 1)	Estimate the parameters of the Power Law model.
 * 2)	Estimate the number of warranty claims for a 36,000 mile warranty policy for an estimated fleet of 35,000 vehicles.

Solution to Example 6

 * 1)	The estimated Power Law parameters are shown in Figure Repair3.
 * 2)	The expected number of failures at 36,000 miles can be estimated using the QCP as shown in Figure Repair4. The model predicts that 0.3559 failures per system will occur by 36,000 miles. This means that for a fleet of 35,000 vehicles, the expected number of warranty claims is approximately 0.3559 * 35,000 = 12,456.


Example 7 (repairable system data)
Field data have been collected for a system that begins its wearout phase at time zero. The start time for each system is equal to zero and the end time for each system is 10,000 miles. Each system is scheduled to undergo an overhaul after a certain number of miles. It has been determined that the cost of an overhaul is four times more expensive than a repair. Table 13.11 presents the data. Do the following:
 * 1)	Estimate the parameters of the Power Law model.
 * 2)	Determine the optimum overhaul interval.
 * 3)	If $$\beta <1$$, would it be cost-effective to implement an overhaul policy?

Solution to Example 7

 * 1)	Figure Repair5 shows the estimated Power Law parameters.
 * 2)	The QCP can be used to calculate the optimum overhaul interval as shown in Figure Repair6.
 * 3)	If $$\beta <1$$, the systems would not be wearing out and it would not be cost-effective to implement an overhaul policy. An overhaul policy makes sense only if the systems are wearing out. Otherwise, an overhauled unit would have the same probability of failing as a unit that was not overhauled.


Example 8 (repairable system data)
Failures and fixes of two repairable systems in the field are recorded. Both systems start from time 0. System 1 ends at time = 504 and system 2 ends at time = 541. All the BD modes are fixed at the end of the test. A fixed effectiveness factor equal to 0.6 is used. Answer the following questions:
 * 1)	Estimate the parameters of the Crow Extended model.
 * 2)	Calculate the projected MTBF after the delayed fixes.
 * 3)	What is the expected number of failures at time 1,000, if no fixes were performed for the future failures?

Solution to Example 8

 * 1)	Figure CrowExtendedRepair shows the estimated Crow Extended parameters.
 * 2)	Figure CrowExtendedMTBF shows the projected MTBF at time = 541 (i.e. the age of the oldest system).
 * 3)	Figure CrowExtendedNumofFailure shows the expected number of failures at time = 1,000.
