Repairable Systems Analysis Through Simulation

Having introduced some of the basic theory and terminology for repairable systems in Introduction to Repairable Systems, we will now examine the steps involved in the analysis of such complex systems. We will begin by examining system behavior through a sequence of discrete deterministic events and expand the analysis using discrete event simulation.

=Simple Repairs=

Deterministic View, Simple Series
To first understand how component failures and simple repairs affect the system and to visualize the steps involved, let's begin with a very simple deterministic example with two components, $$A$$  and  $$B$$, in series.



Component $$A$$ fails every 100 hours and component $$B$$ fails every 120 hours. Both require 10 hours to get repaired. Furthermore, assume that the surviving component stops operating when the system fails (and thus does not age). NOTE: In certain systems, some or all of the components may continue to accumulate operating time while the system is down. For example, consider a transmitter-satellite-receiver system. This is a series system, and the probability of failure for this system is the probability that any of the subsystems fail. If the receiver fails, the satellite continues to operate even though the receiver is down. In this case, the continued aging of the components while the system is down must be taken into consideration, since it affects their failure characteristics and has an impact on the overall system downtime and availability.

The system behavior during an operation from 0 to 300 hours would be as shown in the figure below.



Specifically, component $$A$$ would fail at 100 hours, causing the system to fail. After 10 hours, component $$A$$ would be restored and so would the system. The next event would be the failure of component $$B$$. We know that component $$B$$ fails every 120 hours (or after an age of 120 hours). Since a component does not age while the system is down, component $$B$$ would have reached an age of 120 when the clock reaches 130 hours. Thus, component $$B$$ would fail at 130 hours and be repaired by 140, and so forth. Overall in this scenario, the system would be failed for a total of 40 hours due to four downing events (two due to $$A$$ and two due to $$B$$). The overall system availability (average or mean availability) would be $$260/300=0.86667$$. Point availability is the availability at a specific point in time. In this deterministic case, the point availability would always be equal to 1 if the system is up at that time and equal to 0 if the system is down at that time.
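The bookkeeping in this example can be sketched in a few lines of Python. This is only an illustrative sketch, not BlockSim's algorithm: the function name `simulate_series` and its arguments are hypothetical, and it assumes that every repair restores the component to as-good-as-new and that no component ages while the system is down.

```python
def simulate_series(fail_ages, repair_time, end_time):
    """Deterministic series system; components do not age while the system is down."""
    ages = {name: 0.0 for name in fail_ages}    # accumulated operating age
    clock, downtime, events = 0.0, 0.0, []
    while True:
        # The next failure comes from the component with the least remaining life.
        name = min(fail_ages, key=lambda n: fail_ages[n] - ages[n])
        run = fail_ages[name] - ages[name]
        if clock + run >= end_time:
            break
        clock += run
        for n in ages:                           # all components age while the system runs
            ages[n] += run
        down = min(repair_time, end_time - clock)
        events.append((name, clock, clock + down))
        downtime += down
        clock += down                            # nothing ages during the repair
        ages[name] = 0.0                         # assumption: as good as new after repair
    return events, downtime, (end_time - downtime) / end_time


events, downtime, availability = simulate_series({"A": 100, "B": 120}, 10, 300)
for name, failed_at, restored_at in events:
    print(name, failed_at, restored_at)
print(downtime, round(availability, 5))
```

With the values of this example, the sketch yields failures of $$A$$ at 100 and 220 and of $$B$$ at 130 and 270, for 40 hours of downtime and a mean availability of 260/300.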

Operating Through System Failure
In the prior section we made the assumption that components do not age when the system is down. This assumption applies to most systems. However, under special circumstances, a unit may age even while the system is down. In such cases, the operating profile will be different from the one presented in the prior section. The figure below illustrates the case where the components operate continuously, regardless of the system status.



Effects of Operating Through Failure
Consider a component with an increasing failure rate, as shown in the figure below. If the component continues to operate through system failure, then when the system fails at $${{t}_{1}}$$ the surviving component's failure rate will be $${{\lambda }_{1}}$$, as illustrated in the figure below. When the system is restored at $${{t}_{2}}$$, the component will have aged by $${{t}_{2}}-{{t}_{1}}$$ and its failure rate will now be $${{\lambda }_{2}}$$.

In the case of a component that does not operate through failure, then the surviving component would be at the same failure rate, $${{\lambda }_{1}},$$  when the system resumes operation.



Deterministic View, Simple Parallel
Consider the following system where $$A$$  fails every 100,  $$B$$  every 120,  $$C$$  every 140 and  $$D$$  every 160 time units. Each takes 10 time units to restore. Furthermore, assume that components do not age when the system is down.



A deterministic system view is shown in the figure below. The sequence of events is as follows:


 * 1) At 100, $$A$$  fails and is repaired by 110.  The system is failed.
 * 2) At 130, $$B$$  fails and is repaired by 140.  The system continues to operate.
 * 3) At 150, $$C$$  fails and is repaired by 160.  The system continues to operate.
 * 4) At 170, $$D$$  fails and is repaired by 180.  The system is failed.
 * 5) At 220, $$A$$  fails and is repaired by 230.  The system is failed.
 * 6) At 280, $$B$$  fails and is repaired by 290.  The system continues to operate.
 * 7) End at 300.
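The sequence above can be reproduced with a small event loop. The sketch below is an assumption-laden illustration rather than BlockSim's implementation: the function `simulate` and the `system_up` structure function are hypothetical names, the diagram is taken to be $$A$$ and $$D$$ in series with the parallel pair $$B$$, $$C$$ (the figure is not reproduced here, so this layout is inferred from which failures bring the system down), and each repair is assumed to restore the component to as-good-as-new.

```python
def simulate(fail_age, repair_time, system_up, end_time):
    """Deterministic run; a component ages only while it and the system are up."""
    age = {c: 0.0 for c in fail_age}       # accumulated operating age
    restore_at = {}                         # component -> clock time its repair ends
    clock, log = 0.0, []
    while True:
        up = {c: c not in restore_at for c in fail_age}
        running = system_up(up)
        # Candidate next events: pending restorations, plus failures if running.
        candidates = [(t, "restore", c) for c, t in restore_at.items()]
        if running:
            candidates += [(clock + fail_age[c] - age[c], "fail", c)
                           for c in fail_age if up[c]]
        t, kind, c = min(candidates)
        if t >= end_time:
            break
        if running:                         # advance the age of operating components
            for k in fail_age:
                if up[k]:
                    age[k] += t - clock
        clock = t
        if kind == "fail":
            restore_at[c] = clock + repair_time
            up[c] = False
            log.append((clock, c, system_up(up)))   # True if the system keeps running
        else:
            del restore_at[c]
            age[c] = 0.0                    # assumption: as good as new after repair
    return log


# Assumed structure: A and D in series with the parallel pair B, C.
def structure(up):
    return up["A"] and (up["B"] or up["C"]) and up["D"]


log = simulate({"A": 100, "B": 120, "C": 140, "D": 160}, 10, structure, 300)
for failed_at, name, still_up in log:
    print(failed_at, name, "system continues" if still_up else "system is failed")
```

Passing the structure as a function keeps the event loop independent of the diagram, so the same loop can be reused for other configurations.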

Additional Notes
It should be noted that we are dealing with these events deterministically in order to better illustrate the methodology. When dealing with deterministic events, it is possible to create a sequence of events that one would not expect to encounter probabilistically. One such example consists of two units in series that do not operate through failure but both fail at exactly 100, which is highly unlikely in a real-world scenario. In this case, the assumption is that one of the events must occur at least an infinitesimal amount of time ($$dt$$) before the other. Probabilistically, this event is extremely rare, since both randomly generated times would have to be exactly equal to each other, to 15 decimal places. In the rare event that this happens, BlockSim would pick the unit with the lowest ID value as the first failure. BlockSim assigns a unique numerical ID when each component is created. These can be viewed by selecting the Show Block ID option in the Diagram Options window.

Deterministic Views of More Complex Systems
Even though the examples presented are fairly simplistic, the same approach can be repeated for larger and more complex systems. The reader can easily observe/visualize the behavior of more complex systems in BlockSim using the Up/Down plots. These are the same plots used in this chapter. It should be noted that BlockSim makes these plots available only when a single simulation run has been performed for the analysis (i.e., Number of Simulations = 1). These plots are meaningless when doing multiple simulations because each run will yield a different plot.

Probabilistic View, Simple Series
In a probabilistic case, the failures and repairs do not happen at a fixed time and for a fixed duration, but rather occur randomly and based on an underlying distribution, as shown in the following figures.

We use discrete event simulation in order to analyze (understand) the system behavior. Discrete event simulation looks at each system/component event very similarly to the way we looked at these events in the deterministic example. However, instead of using deterministic (fixed) times for each event occurrence or duration, random times are used. These random times are obtained from the underlying distribution for each event. As an example, consider an event following a 2-parameter Weibull distribution. The $$cdf$$  of the 2-parameter Weibull distribution is given by:


 * $$F(T)=1-{{e}^{-{{\left( \tfrac{T}{\eta } \right)}^{\beta }}}}$$

The Weibull reliability function is given by:


 * $$\begin{align}
R(T)= & 1-F(T) \\ = & {{e}^{-{{\left( \tfrac{T}{\eta } \right)}^{\beta }}}} \end{align}$$

Then, to generate a random time from a Weibull distribution with a given $$\eta $$  and  $$\beta $$, a uniform random number from 0 to 1,  $${{U}_{R}}[0,1]$$ , is first obtained. The random time from a Weibull distribution is then obtained from:


 * $${{T}_{R}}=\eta \cdot {{\left\{ -\ln \left[ {{U}_{R}}[0,1] \right] \right\}}^{\tfrac{1}{\beta }}}$$

To obtain a conditional time, the Weibull conditional reliability function is given by:


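 * $$R(T,t)=\frac{R(T+t)}{R(T)}=\frac{{{e}^{-{{\left( \tfrac{T+t}{\eta } \right)}^{\beta }}}}}{{{e}^{-{{\left( \tfrac{T}{\eta } \right)}^{\beta }}}}}$$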

Or:


 * $$R(T,t)={{e}^{-\left[ {{\left( \tfrac{T+t}{\eta } \right)}^{\beta }}-{{\left( \tfrac{T}{\eta } \right)}^{\beta }} \right]}}$$

The random time would be the solution for $$t$$  for  $$R(T,t)={{U}_{R}}[0,1]$$.
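Both random draws have closed forms for the Weibull distribution. Setting $$R(T,t)={{U}_{R}}[0,1]$$ and solving for $$t$$ gives $$t=\eta {{\left[ {{\left( \tfrac{T}{\eta } \right)}^{\beta }}-\ln {{U}_{R}}[0,1] \right]}^{\tfrac{1}{\beta }}}-T$$. The following is a minimal sketch of both draws; the function names are hypothetical and the parameter values are arbitrary illustrations.

```python
import math
import random


def weibull_time(eta, beta, u):
    """Unconditional random time: the solution of R(T) = u."""
    return eta * (-math.log(u)) ** (1.0 / beta)


def weibull_conditional_time(eta, beta, age, u):
    """Additional time t for a unit that has already reached `age`: solves R(age, t) = u."""
    return eta * ((age / eta) ** beta - math.log(u)) ** (1.0 / beta) - age


def conditional_reliability(eta, beta, age, t):
    """R(T, t) for T = age, as given by the conditional reliability equation."""
    return math.exp(-(((age + t) / eta) ** beta - (age / eta) ** beta))


# random.random() returns values in [0, 1); 1 - value lies in (0, 1], which avoids log(0).
u = 1.0 - random.random()
print(weibull_time(100.0, 1.5, u))
print(weibull_conditional_time(100.0, 1.5, 40.0, u))
```

For a new unit (age zero) the conditional draw reduces to the unconditional one, which is a convenient sanity check.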

To illustrate the sequence of events, assume a single block with a failure and a repair distribution. The first event, $${{E}_{{{F}_{1}}}}$$, would be the failure of the component. Its first time-to-failure would be a random number drawn from its failure distribution, $${{T}_{{{F}_{1}}}}$$. Thus, the first failure event, $${{E}_{{{F}_{1}}}}$$, would be at $${{T}_{{{F}_{1}}}}$$. Once failed, the next event would be the repair of the component, $${{E}_{{{R}_{1}}}}$$. The time to repair the component would now be drawn from its repair distribution, $${{T}_{{{R}_{1}}}}$$. The component would be restored by time $${{T}_{{{F}_{1}}}}+{{T}_{{{R}_{1}}}}$$. The next event would now be the second failure of the component after the repair, $${{E}_{{{F}_{2}}}}$$. This event would occur after a component operating time of $${{T}_{{{F}_{2}}}}$$ after the item is restored (again drawn from the failure distribution), or at $${{T}_{{{F}_{1}}}}+{{T}_{{{R}_{1}}}}+{{T}_{{{F}_{2}}}}$$. This process is repeated until the end time.

It is important to note that each run will yield a different sequence of events due to the probabilistic nature of the times. To arrive at the desired result, this process is repeated many times and the results from each run (simulation) are recorded. In other words, if we were to repeat this 1,000 times, we would obtain 1,000 different values for $${{E}_{{{F}_{1}}}}$$, or $$\left[ {{E}_{{{F}_{1,1}}}},{{E}_{{{F}_{1,2}}}},...,{{E}_{{{F}_{1,1000}}}} \right]$$. The average of these values, $$\left( \tfrac{1}{1000}\underset{i=1}{\overset{1,000}{\mathop{\sum }}}\,{{E}_{{{F}_{1,i}}}} \right)$$, would then be the average time to the first event, $${{E}_{{{F}_{1}}}}$$, or the mean time to first failure (MTTFF) for the component. Obviously, if the component were to be 100% renewed after each repair, then this value would also be the same for the second failure, etc.
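The single-block procedure can be sketched as follows. This is an illustrative sketch only: the function names, the Weibull failure parameters, the normal repair distribution (truncated at zero), the end time and the number of runs are all assumptions chosen for the illustration, and the component is assumed to be fully renewed after each repair.

```python
import math
import random


def simulate_block(rng, end_time, draw_failure, draw_repair):
    """One run for a single block; returns clock times of failures and restorations."""
    clock, failures, restorations = 0.0, [], []
    while True:
        clock += draw_failure(rng)       # operating time until the next failure
        if clock >= end_time:
            break
        failures.append(clock)
        clock += draw_repair(rng)        # repair duration
        if clock >= end_time:
            break
        restorations.append(clock)
    return failures, restorations


ETA, BETA = 100.0, 2.0                   # hypothetical Weibull failure parameters


def draw_failure(rng):
    return rng.weibullvariate(ETA, BETA)


def draw_repair(rng):
    return max(0.0, rng.normalvariate(10.0, 1.0))   # hypothetical repair distribution


rng = random.Random(0)                   # fixed seed so the sketch is repeatable
first_failures = []
for _ in range(5000):
    failures, _restorations = simulate_block(rng, 1000.0, draw_failure, draw_repair)
    if failures:
        first_failures.append(failures[0])

mttff = sum(first_failures) / len(first_failures)
weibull_mean = ETA * math.gamma(1.0 + 1.0 / BETA)   # analytical mean for comparison
print(mttff, weibull_mean)
```

Because the end time here is far beyond the mean life, essentially every run observes a first failure, and the simulated average should land close to the analytical Weibull mean.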

=General Simulation Results=

To further illustrate this, assume that components A and B in the prior example had normal failure and repair distributions, with means equal to the deterministic values used in the prior example and standard deviations of 10 and 1, respectively. That is, $${{F}_{A}}\sim N(100,10),$$ $${{F}_{B}}\sim N(120,10),$$ $${{R}_{A}}={{R}_{B}}\sim N(10,1)$$. The settings for components C and D are not changed. Obviously, given the probabilistic nature of the example, the times to each event will vary. If one were to repeat this $$X$$ number of times, one would arrive at the results of interest for the system and its components. Some of the results for this system and this example, over 1,000 simulations, are provided in the figure below and explained in the next sections.

The simulation settings are shown in the figure below.

Mean Availability (All Events), $${{\overline{A}}_{ALL}}$$
This is the mean availability due to all downing events, which can be thought of as the operational availability. It is the ratio of the system uptime divided by the total simulation time (total time). For this example:


 * $$\begin{align}

{{\overline{A}}_{ALL}}= & \frac{Uptime}{TotalTime} \\ = & \frac{269.137}{300} \\ = & 0.8971 \end{align}$$

Std Deviation (Mean Availability)
This is the standard deviation of the mean availability of all downing events for the system during the simulation.

Mean Availability (w/o PM, OC & Inspection), $${{\overline{A}}_{CM}}$$
This is the mean availability due to failure events only and it is 0.8971 for this example. Note that for this case, the mean availability without preventive maintenance, on condition maintenance and inspection is identical to the mean availability for all events. This is because no preventive maintenance actions or inspections were defined for this system. We will discuss the inclusion of these actions in later sections.

Downtimes caused by PM and inspections are not included. However, if the PM or inspection action results in the discovery of a failure, then these times are included. As an example, consider a component that has failed but its failure is not discovered until the component is inspected. Then the downtime from the time failed to the time restored after the inspection is counted as failure downtime, since the original event that caused this was the component's failure.

Point Availability (All Events), $$A\left( t \right)$$
This is the probability that the system is up at time $$t$$. As an example, to obtain this value at $$t$$ = 300, a special counter would need to be used during the simulation. This counter is increased by one every time the system is up at 300 hours. Thus, the point availability at 300 would be the number of times the system was up at 300 divided by the number of simulations. For this example, this is 0.930; that is, the system was up at 300 hours in 930 of the 1,000 simulations.

Reliability (Fail Events), $$R(t)$$
This is the probability that the system has not failed by time $$t$$. This is similar to point availability with the major exception that it only looks at the probability that the system did not have a single failure. Other (non-failure) downing events are ignored. During the simulation, a special counter again must be used. This counter is increased by one (once in each simulation) if the system has had at least one failure up to 300 hours. Thus, the reliability at 300 would be the number of simulations in which the system did not fail up to 300 (the number of simulations minus this counter) divided by the number of simulations. For this example, this is 0 because the system failed prior to 300 hours in all 1,000 simulations.

It is very important to note that this value is not always the same as the reliability computed using the analytical methods, depending on the redundancy present. The reason that it may differ is best explained by the following scenario:

Assume two units in parallel. The analytical system reliability, which does not account for repairs, is the probability that at least one of the two units survives (i.e., one minus the probability that both units fail). In this case, when one unit goes down, it does not get repaired and the system fails after the second unit fails. In the case of repairs, however, it is possible for one of the two units to fail and get repaired before the second unit fails. Thus, when the second unit fails, the system will still be up due to the fact that the first unit was repaired.
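The two counters described for point availability and reliability can be sketched as below. The data layout is an assumption made for this illustration: each simulation run is represented by a list of (start, end) intervals during which the system was down due to failures, and the four sample runs are invented, not results from the example.

```python
def point_availability(runs, t):
    """Fraction of runs in which the system is up at time t."""
    up_count = sum(1 for downs in runs
                   if not any(start <= t < end for start, end in downs))
    return up_count / len(runs)


def reliability(runs, t):
    """Fraction of runs with no system failure at or before time t."""
    failed_count = sum(1 for downs in runs
                       if any(start <= t for start, _end in downs))
    return (len(runs) - failed_count) / len(runs)


# Four invented runs: failure down intervals observed in each run.
runs = [
    [(100.0, 110.0)],
    [(290.0, 310.0)],
    [],
    [(50.0, 60.0), (295.0, 305.0)],
]
print(point_availability(runs, 300.0))   # 2 of 4 runs are up at t = 300
print(reliability(runs, 300.0))          # only the third run never failed
```

The sketch makes the difference between the two metrics visible: a run that failed and was restored before $$t$$ counts toward point availability but not toward reliability.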

Expected Number of Failures, $${{N}_{F}}$$
This is the average number of system failures. The system failures (not downing events) for all simulations are counted and then averaged. For this case, this is 3.188, which implies that a total of 3,188 system failure events occurred over 1000 simulations. Thus, the expected number of system failures for one run is 3.188. This number includes all failures, even those that may have a duration of zero.

Std Deviation (Number of Failures)
This is the standard deviation of the number of failures for the system during the simulation.

MTTFF
MTTFF is the mean time to first failure for the system. This is computed by keeping track of the time at which the first system failure occurred for each simulation. MTTFF is then the average of these times. This may or may not be identical to the MTTF obtained in the analytical solution, for the same reasons as those discussed in the Reliability (Fail Events) section. For this case, this is 100.2511. This is to be expected, since the mean time to failure of one of the components in series was 100 hours.

It is important to note that for each simulation run, if a first failure time is observed, then this is recorded as the system time to first failure. If no failure is observed in the system, then the simulation end time is used as a right censored (suspended) data point. MTTFF is then computed using the total operating time until the first failure divided by the number of observed failures (constant failure rate assumption). Furthermore, if the simulation end time is much less than the time to first failure for the system, it is also possible that all data points are right censored (i.e., no system failures were observed). In this case, the MTTFF is again computed using a constant failure rate assumption, or:


 * $$MTTFF=\frac{2\cdot ({{T}_{S}})\cdot N}{\chi _{0.50;2}^{2}}$$

where $${{T}_{S}}$$  is the simulation end time and  $$N$$  is the number of simulations. One should be aware that this formulation may yield unrealistic (or erroneous) results if the system does not have a constant failure rate. If you are trying to obtain an accurate (realistic) estimate of this value, then your simulation end time should be set to a value that is well beyond the MTTF of the system (as computed analytically). As a general rule, the simulation end time should be at least three times larger than the MTTF of the system.
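The two estimation cases can be sketched as follows. This is a minimal sketch under stated assumptions: the function name and the use of `None` to mark a run without a failure are hypothetical, and $$\chi _{0.50;2}^{2}$$ is taken to be the 50th percentile of the chi-squared distribution with 2 degrees of freedom, which has the closed form $$-2\ln (0.5)$$.

```python
import math


def mttff_estimate(first_failure_times, end_time):
    """first_failure_times holds one entry per run; None marks a right censored run."""
    n = len(first_failure_times)
    observed = [t for t in first_failure_times if t is not None]
    if observed:
        # Total time on test divided by the number of observed first failures.
        censored = n - len(observed)
        return (sum(observed) + end_time * censored) / len(observed)
    # No failures at all: constant failure rate bound from the equation above.
    chi2_median_2dof = -2.0 * math.log(0.5)   # median of chi-squared with 2 dof
    return 2.0 * end_time * n / chi2_median_2dof


print(mttff_estimate([100.0, 200.0, None], 300.0))   # (100 + 200 + 300) / 2
print(mttff_estimate([None] * 10, 300.0))            # all runs censored
```

In the all-censored case the estimate reduces to $${{T}_{S}}\cdot N/\ln 2$$, which grows with the number of simulations rather than with anything observed about the system. This is one way to see why a short end time can give unrealistic values.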

Uptime, $${{T}_{UP}}$$
This is the average time the system was up and operating. This is obtained by taking the sum of the uptimes for each simulation and dividing it by the number of simulations. For this example, the uptime is 269.137. To compute the Operational Availability, $${{A}_{o}},$$  for this system, then:


 * $${{A}_{o}}=\frac{{{T}_{UP}}}{{{T}_{S}}}$$ (eqn 3)

CM Downtime, $${{T}_{C{{M}_{Down}}}}$$
This is the average time the system was down for corrective maintenance actions (CM) only. This is obtained by taking the sum of the CM downtimes for each simulation and dividing it by the number of simulations. For this example, this is 30.863. To compute the Inherent Availability, $${{A}_{I}},$$  for this system over the observed time (which may or may not be steady state, depending on the length of the simulation), then:


 * $${{A}_{I}}=\frac{{{T}_{S}}-{{T}_{C{{M}_{Down}}}}}{{{T}_{S}}}$$ (eqn 4)

Inspection Downtime
This is the average time the system was down due to inspections. This is obtained by taking the sum of the inspection downtimes for each simulation and dividing it by the number of simulations. For this example, this is zero because no inspections were defined.

PM Downtime, $${{T}_{P{{M}_{Down}}}}$$
This is the average time the system was down due to preventive maintenance (PM) actions. This is obtained by taking the sum of the PM downtimes for each simulation and dividing it by the number of simulations. For this example, this is zero because no PM actions were defined.

OC Downtime, $${{T}_{O{{C}_{Down}}}}$$
This is the average time the system was down due to on-condition maintenance (OC) actions. This is obtained by taking the sum of the OC downtimes for each simulation and dividing it by the number of simulations. For this example, this is zero because no OC actions were defined.

Total Downtime, $${{T}_{Down}}$$
This is the downtime due to all events. In general, one may look at this as the sum of the above downtimes. However, this is not always the case. It is possible to have actions that overlap each other, depending on the options and settings for the simulation. Furthermore, there are other events that can cause the system to go down that do not get counted in any of the above categories. As an example, in the case of standby redundancy with a switch delay, if the settings are to reactivate the failed component after repair, the system may be down during the switch-back action. This downtime does not fall into any of the above categories but it is counted in the total downtime.

For this example, this is identical to $${{T}_{C{{M}_{Down}}}}.$$

System Downing Events
System downing events are events associated with downtime. Note that events with zero duration will appear in this section only if the task properties specify that the task brings the system down or if the task properties specify that the task brings the item down and the item’s failure brings the system down.

Number of Failures, $${{N}_{{{F}_{Down}}}}$$
This is the average number of system downing failures. Unlike the Expected Number of Failures, $${{N}_{F}},$$ this number does not include failures with zero duration. For this example, this is 3.188.

Number of CMs, $${{N}_{C{{M}_{Down}}}}$$
This is the number of corrective maintenance actions that caused the system to fail. It is obtained by taking the sum of all CM actions that caused the system to fail divided by the number of simulations. It does not include CM events of zero duration. For this example, this is 3.188. Note that this may differ from the Number of Failures, $${{N}_{{{F}_{Down}}}}$$. An example would be a case where the system has failed, but due to other settings for the simulation, a CM is not initiated (e.g., an inspection is needed to initiate a CM).

Number of Inspections, $${{N}_{{{I}_{Down}}}}$$
This is the number of inspection actions that caused the system to fail. It is obtained by taking the sum of all inspection actions that caused the system to fail divided by the number of simulations. It does not include inspection events of zero duration. For this example, this is zero.

Number of PMs, $${{N}_{P{{M}_{Down}}}}$$
This is the number of PM actions that caused the system to fail. It is obtained by taking the sum of all PM actions that caused the system to fail divided by the number of simulations. It does not include PM events of zero duration. For this example, this is zero.

Number of OCs, $${{N}_{O{{C}_{Down}}}}$$
This is the number of OC actions that caused the system to fail. It is obtained by taking the sum of all OC actions that caused the system to fail divided by the number of simulations. It does not include OC events of zero duration. For this example, this is zero.

Total Events, $${{N}_{AL{{L}_{Down}}}}$$
This is the total number of system downing events. It also does not include events of zero duration. It is possible that this number may differ from the sum of the other listed events. As an example, consider the case where a failure does not get repaired until an inspection, but the inspection occurs after the simulation end time. In this case, the number of inspections, CMs and PMs will be zero while the number of total events will be one.

Costs and Throughput
Cost and throughput results are discussed in later sections.

Note About Overlapping Downing Events
It is important to note that two identical system downing events (that are continuous or overlapping) may be counted and viewed differently. As shown in Case 1 of the following figure, two overlapping failure events are counted as only one event from the system perspective because the system was never restored and remained in the same down state, even though that state was caused by two different components. Thus, the number of downing events in this case is one and the duration is as shown in CM system. In the case that the events are different, as shown in Case 2 of the figure below, two events are counted, the CM and the PM. However, the downtime attributed to each event is different from the actual time of each event. In this case, the system was first down due to a CM and remained in a down state due to the CM until that action was over. However, immediately upon completion of that action, the system remained down but now due to a PM action. In this case, only the PM action portion that kept the system down is counted.
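The counting rule described above can be sketched as a pass over the time-ordered events. The function name, the tuple layout and the interval values are hypothetical (the figure's actual times are not reproduced here); the sketch only illustrates the two cases discussed.

```python
def attribute_downtime(events):
    """events: (start, end, kind) tuples. Returns (kind, start, end) as seen by the system."""
    counted = []
    covered_until = float("-inf")       # latest time up to which the system is already down
    for start, end, kind in sorted(events):
        if end <= covered_until:
            continue                    # entirely hidden behind an earlier downing event
        if counted and counted[-1][0] == kind and start <= covered_until:
            # Same kind and the system was never restored: extend the existing event.
            counted[-1] = (kind, counted[-1][1], end)
        else:
            # Different kind: count only the portion that actually kept the system down.
            counted.append((kind, max(start, covered_until), end))
        covered_until = end
    return counted


case_1 = attribute_downtime([(100, 120, "CM"), (110, 130, "CM")])
case_2 = attribute_downtime([(100, 120, "CM"), (110, 130, "PM")])
print(case_1)   # one CM event spanning 100 to 130
print(case_2)   # a CM event from 100 to 120, then a PM event from 120 to 130
```

In the second case the PM action lasted 20 time units, but only the 10 units after the CM ended are attributed to it.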



System Point Result
The system point results, as shown in the figure below, show the Point Availability (All Events), $$A\left( t \right)$$, and Point Reliability, $$R(t)$$, as defined in the previous section. These are computed and returned at different points in time, based on the number of intervals selected by the user. Additionally, this window shows $$(1-A(t))$$, $$(1-R(t))$$, $$\text{Labor Cost(t)}$$, $$\text{Part Cost(t)}$$, $$Cost(t)$$, $$\text{Mean }A(t)$$, $$\text{Mean }A({{t}_{i}}-{{t}_{i-1}})$$, $$\text{System Failures(t)}$$, $$\text{System Off Events by Trigger(t)}$$ and $$Throughput(t)$$.



=Results by Component=

Simulation results for each component can also be viewed. The figure below shows the results for component A. These results are explained in the sections that follow.



Number of Block Downing Events, $$Componen{{t}_{NDE}}$$
This is the number of times the component went down (failed). It includes all downing events.

Number of System Downing Events, $$Componen{{t}_{NSDE}}$$
This is the number of times that this component's downing caused the system to be down. For component $$A$$, this is 2.038. Note that in this case this value is the same as the number of component failures, since component A is reliability-wise in series with component D and with the parallel pair of components B and C. If this were not the case (e.g., if the component were in a parallel configuration, like B and C), this value would be different.

Number of Failures, $$Componen{{t}_{NF}}$$
This is the number of times the component failed and does not include other downing events. Note that this could also be interpreted as the number of spare parts required for CM actions for this component. For component $$A$$, this is 2.038.

Number of System Downing Failures, $$Componen{{t}_{NSDF}}$$
This is the number of times that this component's failure caused the system to be down. Note that this may be different from the Number of System Downing Events. It only counts the failure events that downed the system and does not include zero duration system failures.

Number of OFF events by Trigger, $$Componen{{t}_{OFF}}$$
The total number of events where the block is turned off by state change triggers. An OFF event is not a failure but it may be included in system reliability calculations.

Mean Availability (All Events), $${{\overline{A}}_{AL{{L}_{Component}}}}$$
This has the same definition as for the system with the exception that this accounts only for the component.

Mean Availability (w/o PM, OC & Inspection), $${{\overline{A}}_{C{{M}_{Component}}}}$$
The mean availability of all downing events for the block, not including preventive, on condition or inspection tasks, during the simulation.

Block Uptime, $${{T}_{Componen{{t}_{UP}}}}$$
This is the total amount of time that the block was up (i.e., operational) during the simulation. For component $$A$$, this is 279.8212.

Block Downtime, $${{T}_{Componen{{t}_{Down}}}}$$
This is the average amount of time that the block was down (i.e., not operational) for any reason during the simulation. For component $$A$$, this is 20.1788.

RS DECI
The ReliaSoft Downing Event Criticality Index for the block. This is a relative index showing the percentage of times that a downing event of the block caused the system to go down (i.e., the number of system downing events caused by the block divided by the total number of system downing events). For component $$A$$, this is 63.93%. This implies that 63.93% of the times that the system went down, the system failure was due to the fact that component $$A$$  went down. This is obtained from:


 * $$RSDECI=\frac{Componen{{t}_{NSDE}}}{{{N}_{AL{{L}_{Down}}}}}$$

Mean Time Between Downing Events
This is the mean time between downing events of the component, which is computed from:


 * $$MTBDE=\frac{{{T}_{Componen{{t}_{UP}}}}}{Componen{{t}_{NDE}}}$$

For component $$A$$, this is 137.3019.

RS FCI
ReliaSoft's Failure Criticality Index (RS FCI) is a relative index showing the percentage of times that a failure of this component caused a system failure. For component $$A$$, this is 63.93%. This implies that 63.93% of the times that the system failed, it was due to the fact that component $$A$$  failed. This is obtained from:


 * $$RSFCI=\frac{Componen{{t}_{NSDF}}+{{F}_{ZD}}}{{{N}_{F}}}$$

$${{F}_{ZD}}$$ is a special counter of system failures not included in  $$Componen{{t}_{NSDF}}$$. This counter is not explicitly shown in the results but is maintained by the software. The reason for this counter is the fact that zero duration failures are not counted in $$Componen{{t}_{NSDF}}$$  since they really did not down the system. However, these zero duration failures need to be included when computing RS FCI.

It is important to note that for both RS DECI and RS FCI, if overlapping events are present, the component that caused the system event gets credited with the system event. Subsequent component events that do not bring the system down (since the system is already down) do not get counted in this metric.

MTBF, $$MTB{{F}_{C}}$$
Mean time between failures is the mean (average) time between failures of this component, in real clock time. This is computed from:


 * $$MTB{{F}_{C}}=\frac{{{T}_{S}}-CFDowntime}{Componen{{t}_{NF}}}$$

$$CFDowntime$$ is the downtime of the component due to failures only (without PM, OC and inspection). The discussion regarding what is a failure downtime that was presented in the section explaining Mean Availability (w/o PM, OC & Inspection) also applies here. For component $$A$$, this is 137.3019. Note that this value could fluctuate for the same component depending on the simulation end time. As an example, consider the deterministic scenario for this component with an end time of 300. It fails every 100 hours and takes 10 hours to repair. Thus, it would be failed at 100, repaired by 110, failed at 210 and repaired by 220. Therefore, its uptime is 280 with two failure events, or MTBF = 280/2 = 140. Repeating the same scenario with an end time of 330 would yield failures at 100, 210 and 320. Thus, the uptime would be 300 with three failures, or MTBF = 300/3 = 100. Note that this is not the same as the MTTF (mean time to failure), commonly referred to as MTBF by many practitioners.
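The dependence of this metric on the simulation end time can be checked with a short sketch of the deterministic scenario. The function name is hypothetical, and the sketch assumes that the component is renewed by each repair and that failures are the only source of downtime.

```python
def deterministic_mtbf(fail_age, repair_time, end_time):
    """MTBF = (end time - failure downtime) / number of failures, for one component."""
    clock, downtime, failures = 0.0, 0.0, 0
    while clock + fail_age < end_time:
        clock += fail_age                          # operate until the next failure
        failures += 1
        down = min(repair_time, end_time - clock)  # the repair may be cut off by the end time
        downtime += down
        clock += down
    return (end_time - downtime) / failures if failures else None


print(deterministic_mtbf(100, 10, 300))   # two failures, uptime 280
print(deterministic_mtbf(100, 10, 330))   # three failures, uptime 300
```

The two calls reproduce the values of 140 and 100 worked out above.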

Mean Downtime per Event, $$MDPE$$
Mean downtime per event is the average downtime for a component event. This is computed from:


 * $$MDPE=\frac{{{T}_{Componen{{t}_{Down}}}}}{Componen{{t}_{NDE}}}$$
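As a numerical check, the values reported for component $$A$$ can be combined directly. The sketch below rests on assumptions that should be read with care: the variable names are hypothetical, the number of block downing events for $$A$$ is taken to equal its 2.038 failures, the total number of system downing events is taken to be 3.188, RS DECI is taken as the block's system downing events over the total system downing events, MTBDE as block uptime per downing event, and MDPE as block downtime per downing event.

```python
# Values reported in this section for component A and for the system.
block_uptime = 279.8212
block_downtime = 20.1788
block_downing_events = 2.038        # assumption: equal to the number of failures of A
system_downing_by_block = 2.038
system_downing_total = 3.188        # assumption: all system downing events are failures

rs_deci = system_downing_by_block / system_downing_total
mtbde = block_uptime / block_downing_events
mdpe = block_downtime / block_downing_events

print(f"RS DECI = {100.0 * rs_deci:.2f}%")
print(f"MTBDE   = {mtbde:.4f}")
print(f"MDPE    = {mdpe:.4f}")
```

Under these assumptions the first two ratios agree with the reported 63.93% and 137.3019, which supports reading MTBDE as uptime per downing event.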

RS DTCI
The ReliaSoft Downtime Criticality Index for the block. This is a relative index showing the contribution of the block to the system’s downtime (i.e., the system downtime caused by the block divided by the total system downtime).

Other Results of Interest
The remaining component (block) results are similar to those defined for the system with the exception that now they apply only to the component.

=Imperfect Repairs=

=Using Resources: Pools and Crews=

In order to make the analysis more realistic, one may wish to consider additional sources of delay times in the analysis or study the effect of limited resources. In the prior examples, we used a repair distribution to identify how long it takes to restore a component. The factors that one chooses to consider in this time may include the time it takes to do the repair and/or the time it takes to get a crew, a spare part, etc. While all of these factors may be included in the repair duration, optimized usage of these resources can only be achieved if the resources are studied individually and their dependencies are identified.

As an example, consider the situation where two components in parallel fail at the same time and only a single repair person is available. Because this person would not be able to execute the repair on both components simultaneously, an additional delay will be encountered that also needs to be included in the modeling. One way to accomplish this is to assign a specific repair crew to each component.

Including Crews
BlockSim allows you to assign one or more maintenance crews to each component from the Maintenance Task Properties window. Note that there may be different crews for each action (i.e., corrective, preventive, on condition and inspection).

A crew record needs to be defined for each named crew, as shown in the picture below. The basic properties for each crew include factors such as:
 * Logistic delays. How long does it take for the crew to arrive?
 * Is there a limit to the number of tasks this crew can perform at the same time? If yes, how many simultaneous tasks can the crew perform?
 * What is the cost per hour for the crew?
 * What is the cost per incident for the crew?



Illustrating Crew Use
To illustrate the use of crews in BlockSim, consider the deterministic scenario described by the following RBD and properties.





As shown in the figure above, the System Up/Down plot illustrates the sequence of events, which are:


 * At 100, $$A$$  fails.  It takes 20 to get the crew and 10 to repair, thus the component is repaired by 130.  The system is failed/down during this time.
 * At 150, $$B$$  fails since it would have accumulated an operating age of 120 by this time.  It again has to wait for the crew and is repaired by 190.
 * At 170, $$C$$  fails.  Upon this failure,  $$C$$  requests the only available crew.  However, this crew is currently engaged by  $$B$$  and, since the crew can only perform one task at a time, it cannot respond immediately to the request by  $$C$$ .  Thus,  $$C$$  will remain failed until the crew becomes available.  The crew will finish with unit  $$B$$  at 190 and will then be dispatched to  $$C$$ .  Upon dispatch, the logistic delay will again be considered and  $$C$$  will be repaired by 230.  The system continues to operate until the failures of  $$B$$  and  $$C$$  overlap (i.e., the system is down from 170 to 190).
 * At 210, $$D$$  fails.  The crew is still engaged by  $$C$$  until 230, so  $$D$$  has to wait for the crew to become available and then for the logistic delay and the repair.
 * $$D$$ is up at 260.
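
The crew sequence above can be reproduced with a short Python sketch. This is only an illustration under stated assumptions, not BlockSim's algorithm: the failure times are taken as given from the list above rather than derived from the RBD, the repair durations for $$B$$, $$C$$ and $$D$$ (20, 20 and 10) are inferred from the completion times quoted above, and the function name `dispatch_single_crew` is hypothetical.

```python
def dispatch_single_crew(failures, logistic_delay):
    """Serve failures first come, first served with one crew doing one task at a time.

    failures: list of (name, failure_time, repair_duration), sorted by failure_time.
    Returns {name: (dispatch_time, restored_time)}.
    """
    crew_free_at = 0
    timeline = {}
    for name, failed_at, repair in failures:
        # The crew leaves for the block when the block fails or when the crew
        # finishes its previous task, whichever happens later.
        dispatched = max(failed_at, crew_free_at)
        # The logistic delay is paid on every dispatch, then the repair is done.
        restored = dispatched + logistic_delay + repair
        timeline[name] = (dispatched, restored)
        crew_free_at = restored
    return timeline


# Failure times and repair durations as read from the event list (assumed inputs).
events = [("A", 100, 10), ("B", 150, 20), ("C", 170, 20), ("D", 210, 10)]
timeline = dispatch_single_crew(events, logistic_delay=20)
for name, (dispatched, restored) in timeline.items():
    print(f"{name}: crew dispatched at {dispatched}, block restored at {restored}")
```

With these inputs the sketch gives restoration times of 130, 190, 230 and 260 for $$A$$, $$B$$, $$C$$ and $$D$$, which agrees with the sequence of events listed above.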

The following figure shows an example of some of the possible crew results (details), which are presented next.



Explanation of the Crew Details

 * Each request made to a crew is logged.
 * If a request is successful (i.e., the crew is available), the call is logged once in the Calls Received counter and once in the Accepted Calls counter.
 * If a request is not accepted (i.e., the crew is busy), the call is logged once in the Calls Received counter and once in the Rejected Calls counter. When the crew is free and can be called upon again, the call is logged once in the Calls Received counter and once in the Accepted Calls counter.
 * In this scenario, there were two instances when the crew was not available (Rejected Calls = 2) and four instances when the crew performed an action (Accepted Calls = 4), for a total of six calls (Calls Received = 6).
 * Percent Accepted and Percent Rejected are the ratios of calls accepted and calls rejected with respect to the total calls received.
 * Total Utilization is the total time that the crew was used. It includes both the time required to complete the repair action and the logistic time.  In this case, this is 140, or:


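 * $$\begin{align} {{T}_{{{R}_{A}}}}= & 10,{{T}_{{{L}_{A}}}}=20 \\ {{T}_{{{R}_{B}}}}= & 20,{{T}_{{{L}_{B}}}}=20 \\ {{T}_{{{R}_{C}}}}= & 20,{{T}_{{{L}_{C}}}}=20 \\ {{T}_{{{R}_{D}}}}= & 10,{{T}_{{{L}_{D}}}}=20 \\ {{T}_{U}}= & \left( {{T}_{{{R}_{A}}}}+{{T}_{{{L}_{A}}}} \right)+\left( {{T}_{{{R}_{B}}}}+{{T}_{{{L}_{B}}}} \right) \\ & +\left( {{T}_{{{R}_{C}}}}+{{T}_{{{L}_{C}}}} \right)+\left( {{T}_{{{R}_{D}}}}+{{T}_{{{L}_{D}}}} \right) \\ {{T}_{U}}= & 140 \end{align}$$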


 * Average Call Duration is the average duration of each crew usage, and it also includes both logistic and repair time.  It is the total usage divided by the number of accepted calls.  In this case, this is 35.
 * Total Wait Time is the time that blocks in need of a repair waited for this crew.  In this case, it is 40 ( $$C$$  and  $$D$$  both waited 20 each).
 * Total Crew Costs are the total costs for this crew.  It includes the per incident charge as well as the per unit time costs.  In this case, this is 180.  There were four incidents at 10 each for a total of 40, as well as 140 time units of usage at 1 cost unit per time unit.
 * Average Cost per Call is the total cost divided by the number of accepted calls.  In this case, this is 45.
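
The arithmetic behind these crew details can be checked with a few lines of Python. The sketch below is an illustration only: the tuples are the failure time, crew dispatch time and repair duration for $$A$$, $$B$$, $$C$$ and $$D$$ as read from this example, and the variable names are hypothetical rather than BlockSim fields.

```python
# (failure_time, crew_dispatch_time, repair_duration) for A, B, C and D.
calls = [(100, 100, 10), (150, 150, 20), (170, 190, 20), (210, 230, 10)]
LOGISTIC_DELAY = 20      # crew logistic time per dispatch
COST_PER_INCIDENT = 10   # charge per accepted call
COST_PER_TIME_UNIT = 1   # charge per unit of crew usage time

accepted_calls = len(calls)
# A call is rejected once when the crew was busy at the time of failure.
rejected_calls = sum(1 for failed, dispatched, _ in calls if dispatched > failed)
calls_received = accepted_calls + rejected_calls

# Utilization counts both the logistic time and the repair time of each call.
total_utilization = sum(LOGISTIC_DELAY + repair for _, _, repair in calls)
average_call_duration = total_utilization / accepted_calls
total_wait_time = sum(dispatched - failed for failed, dispatched, _ in calls)

total_crew_cost = (accepted_calls * COST_PER_INCIDENT
                   + total_utilization * COST_PER_TIME_UNIT)
average_cost_per_call = total_crew_cost / accepted_calls

print(calls_received, accepted_calls, rejected_calls)
print(total_utilization, average_call_duration, total_wait_time)
print(total_crew_cost, average_cost_per_call)
```

The computed values (6 calls received, 4 accepted, 2 rejected, utilization 140, average call duration 35, wait time 40, total cost 180, average cost per call 45) agree with the figures quoted in the list.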

Note that crew costs that are attributed to individual blocks can be obtained from the Blocks reports, as shown in the figure below.



How BlockSim Handles Crews

 * Crew logistic time is added to each repair time.
 * The logistic time is always present, and the same, regardless of where the crew was called from (i.e., whether the crew was at another job or idle at the time of the request).
 * For any given simulation, each crew's logistic time is constant (taken from the distribution) across that single simulation run regardless of the task (CM, PM or inspection).
 * A crew can perform either a finite number of simultaneous tasks or an infinite number.
 * If the finite limit of tasks is reached, the crew will not respond to any additional request until the number of tasks the crew is performing is less than its finite limit.
 * If a crew is not available to respond, the component will "wait" until a crew becomes available.
 * BlockSim maintains the queue of rejected calls and will dispatch the crew to the next repair on a "first come, first served" basis.
 * Multiple crews can be assigned to a single block (see overview in the next section).
 * If no crew has been assigned for a block, it is assumed that no crew restrictions exist and a default crew is used. The default crew can perform an infinite number of simultaneous tasks and has no delays or costs.

Looking at Multiple Crews
Multiple crews may be available to perform maintenance for a particular component. When multiple crews have been assigned to a block in BlockSim, the crews are assigned to perform maintenance based on their order in the crew list, as shown in the figure below.



In the case where more than one crew is assigned to a block, and if the first crew is unavailable, then the next crew is called upon and so forth. As an example, consider the prior case but with the following modifications (i.e., Crews  $$A$$  and  $$B$$  are assigned to all blocks):



The system would behave as shown in the figure below.

In this case, Crew $$B$$  was used for the  $$C$$  repair since Crew  $$A$$  was busy. On all others, Crew $$A$$  was used. It is very important to note that once a crew has been assigned to a task it will complete the task. For example, if we were to change the delay time for Crew $$B$$  to 100, the system behavior would be as shown in the figure below.



In other words, even though Crew $$A$$  would have finished the repair on  $$C$$  more quickly if it had been available when originally called, Crew  $$B$$  was assigned the task because Crew  $$A$$  was not available at the instant that the crew was needed.

Additional Rules on Crews

 * 1. If all assigned crews are engaged, the next crew that will be chosen is the crew that can get there first.
 * a)	This accounts for the time it would take a particular crew to complete its current task (or all tasks in its queue) and its logistic time.
 * 2. If a crew is available, it gets used regardless of what its logistic delay time is.
 * a)	In other words, if a crew with a shorter logistic time is busy, but almost done, and another crew with a much higher logistic time is currently free, the free one will get assigned to the task.
 * 3. For each simulation each crew's logistic time is computed (taken randomly from its distribution or its fixed time) at the beginning of the simulation and remains constant across that one simulation for all actions (CM, PM and inspection).
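
The crew selection rules can be sketched as a small Python function. This is a simplified illustration and not BlockSim's implementation: it assumes each crew performs only one task at a time, represents a crew's whole queue by a single `busy_until` time, and uses hypothetical names (`Crew`, `choose_crew`).

```python
from dataclasses import dataclass


@dataclass
class Crew:
    name: str
    logistic_delay: float
    busy_until: float = 0.0  # time at which the crew finishes all queued work


def choose_crew(crews, now):
    """Pick a crew for a request made at time `now`.

    `crews` must be in the order of the block's crew list.
    """
    # A free crew is used regardless of its logistic delay; when several
    # crews are free, the first one in the crew list is taken.
    for crew in crews:
        if crew.busy_until <= now:
            return crew
    # All crews are engaged: take the one that can get there first, counting
    # the time to finish its current work plus its logistic time.
    return min(crews, key=lambda c: c.busy_until + c.logistic_delay)


# Crew A (delay 20) is busy until 190; Crew B (delay 100) is idle at 170.
crew_a = Crew("A", logistic_delay=20, busy_until=190)
crew_b = Crew("B", logistic_delay=100)
print(choose_crew([crew_a, crew_b], now=170).name)  # B
```

In this usage example the idle crew is chosen even though its logistic delay is much longer, which is the behavior described for the modified Crew $$B$$ case above.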

Using Spare Part Pools
BlockSim also allows you to specify spare part pools (or depots). Spare part pools allow you to model and manage spare part inventory and study the effects associated with limited inventories. Each component can have a spare part pool associated with it. If a spare part pool has not been defined for a block, BlockSim's analysis assumes a default pool of infinite spare parts. To speed up the simulation, no details on pool actions are kept during the simulation if the default pool is used.

Pools allow you to define multiple aspects of the spare part process, including stock levels, logistic delays and restock options. Every time a part is repaired under a CM or scheduled action (PM, OC and Inspection), a spare part is obtained from the pool. If a part is available in the pool, it is then used for the repair. The figure below shows the pages in BlockSim's Spare Part Pool window. Spare part pools perform their actions based on the simulation clock time.



Spare Properties
A spare part pool is identified by a name. The general properties of the pool are its stock level (must be greater than zero), cost properties and logistic delay time. If a part is available (in stock), the pool will dispense that part to the requesting block after the specified logistic time has elapsed. One needs to think of a pool as an independent entity. It accepts requests for parts from blocks and dispenses them to the requesting blocks after a given logistic time. Requests for spares are handled on a first come, first served basis. In other words, if two blocks request a part and only one part is in stock, the first block that made the request will receive the part. Blocks request parts from the pool immediately upon the initiation of a CM or scheduled event (PM, OC and Inspection).

Restocking the Pool
If the pool has a finite number of spares, restock actions may be incorporated. The figure below shows the restock properties. Specifically, a pool can restock itself either through a scheduled restock action or based on specified conditions.

A scheduled restock action adds a set number of parts to the pool on a predefined scheduled part arrival time. For the settings in the figure above, one spare part would be added to the pool every 100 time units, based on the system (simulation) time. In other words, for a simulation of 1000 time units, a spare part would arrive at 100 $$tu$$, 200  $$tu$$ , etc. The part is available to the pool immediately after the restock action and without any logistic delays.

In an on-condition restock, a restock action is initiated when the stock level reaches (or is below) a specified value. In the figure above, five parts are ordered when the stock level reaches 0. Note that unlike the scheduled restock, parts added through on-condition restock become available after a specified logistic delay time. In other words, in a scheduled restock the parts are pre-ordered and arrive when needed, whereas in an on-condition restock the parts are ordered when the condition occurs and thus arrive after a specified time. For on-condition restocks, the condition is triggered if and only if the stock level drops to or below the specified stock level, regardless of how the spares arrived at the pool or were distributed by the pool. In addition, the restock trigger value must be less than the initial stock.

Lastly, a maximum capacity can be assigned to the pool. This maximum capacity must be equal to or greater than the initial stock. Once the maximum capacity is reached, no more items are added to the pool. For example, if the pool has a maximum capacity of ten and a current stock level of eight and if a restock action is set to add five items to the pool, then only two will be accepted.
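
The capacity rule amounts to clamping each delivery to the remaining room in the pool. A minimal Python sketch follows; the function name `restock` and its signature are hypothetical and chosen only for illustration.

```python
def restock(current_stock, quantity, max_capacity=None):
    """Add `quantity` parts to the pool, respecting an optional maximum capacity.

    Returns (new_stock_level, number_of_parts_accepted).
    """
    if max_capacity is None:
        accepted = quantity  # no capacity limit defined for the pool
    else:
        # Only as many parts as still fit in the pool are accepted.
        accepted = max(0, min(quantity, max_capacity - current_stock))
    return current_stock + accepted, accepted


print(restock(8, 5, max_capacity=10))  # (10, 2)
```

The usage line reproduces the example in the text: with a capacity of ten and eight parts in stock, a delivery of five results in two accepted parts.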

Obtaining Emergency Spares
Emergency restock actions can also be defined. The figure below illustrates BlockSim's Emergency Spare Provisions options. An emergency action is triggered only when a block requests a spare and the part is not currently in stock. This is the only trigger condition. It does not account for whether a part has been ordered or if one is scheduled to arrive. Emergency spares are ordered when the condition is triggered and arrive after a time equal to the required time to obtain emergency spare(s).



Summary of Rules for Spare Part Pools
The following rules summarize some of the logic when dealing with spare part pools.

Basic Logic Rules

 * 1. Queue Based: Requests for spare parts from blocks are queued and executed on a "first come, first served" basis.
 * 2. Emergency: Emergency restock actions are performed only when a part is not available.
 * 3. Scheduled Restocks: Scheduled restocks are added instantaneously to the pool at the scheduled time.
 * 4. On-Condition Restock: On-condition restock happens when the specified condition is reached (e.g., when the stock drops to two or if a request is received for a part and the stock is below the restock level).
 * a) For example, if a pool with a restock level of two has three items in stock and it dispenses one, an on-condition restock is initiated the instant that the request is received (without regard to the logistic delay time). The restocked items will be available after the required time for stock arrival has elapsed.
 * b)	The way that this is defined allows for the possibility of multiple restocks. Specifically, every time a part needs to be dispensed and the stock is lower than the specified quantity, parts are ordered.  In the case of a long logistic delay time, it is possible to have multiple re-orders in the queue.
 * 5. Parts Become Available after Spare Acquisition Logistic Delay:  If there is a spare acquisition logistic time delay,  the requesting block will get the part after that delay.
 * a)	For example, if a block with a repair duration of 10 fails at 100 and requests a part from a pool with a logistic delay time of 10, that block will not be up until 120.
 * 6. Compound Delays: If a part is not available and an emergency part (or another part) can be obtained, then the total wait time for the part is the sum of both the logistic time and the required time to obtain a spare.
 * 7. First Available Part is Dispensed to the First Block in the Queue: The pool will dispense a requested part if it has one in stock or when it becomes available, regardless of what action (i.e., as needed restock or emergency restock) that request may have initiated.
 * a)	For example, if Block A requests a part from a pool and that triggers an emergency restock action, but a part arrives before the emergency restock through another action (e.g., scheduled restock), then the pool will dispense the newly arrived part to Block A (if Block A is next in the queue to receive a part).
 * 8. Blocks that Trigger an Action Get Charged with the Action: A block that triggers an emergency restock is charged for the additional cost to obtain the emergency part, even if it does not use an emergency part (i.e., even if another part becomes available first).
 * 9. Triggered Action Cannot be Canceled: If a block triggers a restock action but then receives a part from another source, the action that the block triggered is not canceled.
 * a)	For example, if Block A initiates an emergency restock action but was then able to use a part that became available through other actions, the emergency request is not canceled and an emergency spare part will be added to the pool's stock level.
 * b)	Another way to explain this is by looking at the part acquisition logistic times as transit times. Because an ordered part is en-route to you after you order it, you will receive it regardless of whether the conditions have changed and you no longer need it.
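
Rules 1, 2, 3, 7 and 9 can be illustrated with a small event-driven Python sketch. It is a simplified model under stated assumptions and not BlockSim's algorithm: the pool's own dispensing logistic delay is taken as zero, on-condition restocks and costs are omitted, and all names and numbers (`simulate_pool`, the request times, the emergency lead time of 50) are hypothetical.

```python
import heapq
from collections import deque


def simulate_pool(initial_stock, requests, scheduled_arrivals, emergency_lead_time):
    """requests: [(time, block)]; scheduled_arrivals: [(time, quantity)].

    Returns (dispense_times, blocks_that_triggered_emergency, final_stock).
    """
    stock = initial_stock
    waiting = deque()      # first come, first served queue of blocks
    dispensed = {}         # block -> time at which it received its part
    emergency_by = []      # blocks that triggered an emergency order
    events, seq = [], 0    # heap of (time, tie_breaker, kind, payload)
    for t, block in requests:
        heapq.heappush(events, (t, seq, "request", block))
        seq += 1
    for t, quantity in scheduled_arrivals:
        heapq.heappush(events, (t, seq, "arrival", quantity))
        seq += 1
    while events:
        t, _, kind, payload = heapq.heappop(events)
        if kind == "request":
            waiting.append(payload)
            if stock == 0:
                # The only emergency trigger: a request finds nothing in stock.
                emergency_by.append(payload)
                heapq.heappush(events, (t + emergency_lead_time, seq, "arrival", 1))
                seq += 1
        else:
            stock += payload  # scheduled or emergency part arrives
        # The first available part goes to the first block in the queue.
        while stock > 0 and waiting:
            stock -= 1
            dispensed[waiting.popleft()] = t
    return dispensed, emergency_by, stock


result = simulate_pool(initial_stock=1,
                       requests=[(90, "X"), (100, "A")],
                       scheduled_arrivals=[(120, 1)],
                       emergency_lead_time=50)
print(result)
```

In this run Block A triggers an emergency order at 100 but is served at 120 by the scheduled restock. The emergency part still arrives at 150 and is added to the stock, so the pool ends the run with one part.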

Simultaneous Dispatch of Crews and Parts Logic
Some special rules apply when a block is subject to logistic delays both in acquiring parts from a pool and in waiting for crews. BlockSim dispatches requests for crews and spare parts simultaneously. The repair action does not start until both crew and part arrive, as shown next.

If a crew arrives and it has to wait for a part, then this time (and cost) is added to the crew usage time.
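
The simultaneous dispatch logic reduces to taking the later of the two arrival times. The Python sketch below illustrates this under simplifying assumptions: the crew is free and the part is in stock at the time of the request, and the function name, parameter names and numbers are hypothetical.

```python
def repair_timeline(request_time, crew_delay, part_delay, repair_duration):
    """Crew and part are requested at the same instant; repair needs both."""
    crew_arrives = request_time + crew_delay
    part_arrives = request_time + part_delay
    # The repair cannot start until both the crew and the part are on site.
    repair_starts = max(crew_arrives, part_arrives)
    restored = repair_starts + repair_duration
    # Time the crew spends waiting for the part counts as crew usage.
    crew_waiting = max(0, part_arrives - crew_arrives)
    crew_usage = crew_delay + crew_waiting + repair_duration
    return {"repair_starts": repair_starts,
            "restored": restored,
            "crew_waiting": crew_waiting,
            "crew_usage": crew_usage}


print(repair_timeline(request_time=100, crew_delay=20, part_delay=35,
                      repair_duration=10))
```

For these hypothetical values the crew arrives at 120, waits 15 time units for the part, starts the repair at 135 and finishes at 145, for a total crew usage of 45.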

=Using Maintenance Tasks=

One of the most important benefits of simulation is the ability to define how and when actions are performed. In our case, the actions of interest are part repairs/replacements. This is accomplished in BlockSim through the use of maintenance tasks. Specifically, four different types of tasks can be defined for maintenance actions: corrective maintenance, preventive maintenance, on condition maintenance and inspection.

Corrective Maintenance Tasks
A corrective maintenance task defines when a corrective maintenance (CM) action is performed. The figure below shows a corrective maintenance task assigned to a block in BlockSim.



Corrective actions will be performed either immediately upon failure of the item or upon finding that the item has failed (for hidden failures that are not detected until an inspection). BlockSim allows the selection of either category. If Upon item failure is selected, the CM action is initiated immediately upon failure. If the user does not specify a choice for a CM, this is the default option. All prior examples were based on the instruction to perform a CM upon failure. If the When found failed during an Inspection option is selected, then the CM action will only be initiated after an inspection is done on the failed component. How and when the inspections are performed is defined by the block's inspection properties. This has the effect of defining a dependency between the corrective maintenance task and the inspection task.


Scheduled Tasks
Scheduled tasks can be performed on a known schedule, which can be based on any of the following:
 * A time interval, either fixed or dynamic, based on the item's age (item clock) or on calendar time (system clock). See Item and System Ages.
 * The occurrence of certain events, including:
 * The system goes down.
 * Certain events happen in a maintenance group. The events and groups are user-specified, and the item that the task is assigned to does not need to be part of the selected maintenance group(s).

The types of scheduled tasks include:
 * Inspection tasks
 * Preventive maintenance tasks
 * On condition tasks

Item and System Ages
It is important to keep in mind that the system and each component of the system maintain separate clocks within the simulation. When setting intervals to perform a scheduled task, the intervals can be based on either type of clock. Specifically:
 * Item age refers to the accumulated age of the block, which gets adjusted each time the block is repaired (i.e., restored). If the block is repaired at least once during the simulation, this will be different from the elapsed simulation time. For example, if the restoration factor is 1 (i.e., “as good as new”) and the assigned interval is 100 days based on item age, then the task will be scheduled to be performed for the first time at 100 days of elapsed simulation time. However, if the block fails at 85 days and it takes 5 days to complete the repair, then the block will be fully restored at 90 days and its accumulated age will be reset to 0 at that point. Therefore, if another failure does not occur in the meantime, the task will be performed for the first time 100 days later at 190 days of elapsed simulation time.




 * Calendar time refers to the elapsed simulation time. If the assigned interval is 100 days based on calendar time, then the task will be performed for the first time at 100 days of elapsed simulation time, for the second time at 200 days of elapsed simulation time and so on, regardless of whether the block fails and gets repaired correctively between those times.
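
The difference between the two clocks can be sketched in Python. The functions below are illustrative only and make simplifying assumptions: every corrective repair has a restoration factor of 1, the block accumulates age whenever it is not under repair, and the function names are hypothetical.

```python
def first_task_time_item_age(interval, failures):
    """Clock time of the first task scheduled on item age.

    failures: list of (failure_time, repair_duration) on the simulation clock.
    """
    age_zero_at = 0  # clock time at which the item's age was last reset to 0
    for failed_at, repair in sorted(failures):
        if failed_at - age_zero_at >= interval:
            break  # the item reached the task interval before this failure
        # As-good-as-new repair: age restarts once the repair is complete.
        age_zero_at = failed_at + repair
    return age_zero_at + interval


def calendar_task_times(interval, horizon):
    """Clock times of tasks scheduled on calendar time, unaffected by repairs."""
    return list(range(interval, horizon + 1, interval))


print(first_task_time_item_age(100, [(85, 5)]))  # 190
print(calendar_task_times(100, 300))             # [100, 200, 300]
```

The first call reproduces the item age example above (failure at 85 days, 5 days of repair, first task at 190 days), while the second shows that calendar-based tasks keep their 100-day spacing regardless of repairs.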



Inspection Tasks
Like all scheduled tasks, inspections can be performed based on a time interval or upon certain events. Inspections can be specified to bring the item or system down or not.

Preventive Maintenance Tasks
The figure below shows the options available in a preventive maintenance (PM) task within BlockSim. PMs can be performed based on a time interval or upon certain events. Because PM tasks always bring the item down, one can also specify whether preventive maintenance will be performed if the task brings the system down.



On Condition Tasks
On condition maintenance relies on the capability to detect failures before they happen so that preventive maintenance can be initiated. If, during an inspection, maintenance personnel can find evidence that the equipment is approaching the end of its life, then it may be possible to delay the failure, prevent it from happening or replace the equipment at the earliest convenience rather than allowing the failure to occur and possibly cause severe consequences. In BlockSim, on condition tasks consist of an inspection task that triggers a preventive task when an impending failure is detected during inspection.

Failure Detection
Inspection tasks can be used to check for indications of an approaching failure. BlockSim models the point at which an approaching failure becomes detectable upon inspection using the Failure Detection Threshold and the P-F Interval. The failure detection threshold allows the user to enter a number between 0 and 1 indicating the fraction of an item's life that must elapse before an approaching failure can be detected. For instance, if the failure detection threshold is set to 0.8, the failure of a component can be detected only during the last 20% of its life. If an inspection occurs during this time, an approaching failure is detected and the inspection triggers a preventive maintenance task to take the necessary precautions to delay the failure by either repairing or replacing the component.

The P-F interval allows the user to enter the amount of time before the failure of a component when the approaching failure can be detected by an inspection. The P-F interval represents the warning period that spans from P (when a potential failure can be detected) to F (when the failure occurs). If the P-F interval is set to 200, then the approaching failure of the component can be detected only during the last 200 time units ($$tu$$) before the failure. Thus, if a component has a fixed life of 1,000 $$tu$$ and the P-F interval is set to 200 $$tu$$, then an inspection that occurs at or beyond 800 $$tu$$ detects the approaching failure that is to occur at 1,000 $$tu$$ and triggers a preventive maintenance task to take action against this failure.
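
Both detection criteria can be written as simple comparisons on the item's age at the time of inspection. The Python sketch below is an illustration for an item with a known (fixed) life; the function names are hypothetical and are not BlockSim settings.

```python
def detectable_by_threshold(age, life, threshold):
    """True if the fraction of life consumed is at or beyond the threshold (0 to 1)."""
    return age >= threshold * life


def detectable_by_pf_interval(age, life, pf_interval):
    """True if the inspection falls within the P-F interval before the failure."""
    return age >= life - pf_interval


# Fixed life of 1,000 tu, as in the example above.
print(detectable_by_threshold(age=850, life=1000, threshold=0.8))
print(detectable_by_pf_interval(age=800, life=1000, pf_interval=200))
print(detectable_by_pf_interval(age=799, life=1000, pf_interval=200))
```

For a fixed life the two settings are interchangeable: a threshold of 0.8 on a life of 1,000 $$tu$$ marks the same point as a P-F interval of 200 $$tu$$.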

Rules for On Condition Tasks

 * An inspection that finds a block at or beyond the failure detection threshold or within the range of the P-F interval will trigger the associated preventive task as long as preventive maintenance can be performed on that block.


 * If a non-downing inspection triggers a preventive maintenance action because the failure detection threshold or P-F interval range was reached, no other maintenance task will be performed between the inspection and the triggered preventive task; tasks that would otherwise have happened at that time due to system age, system down or group maintenance will be ignored.


 * A preventive task that would have been triggered by a non-downing inspection will not happen if the block fails during the inspection, as corrective maintenance will take place instead.


 * If a failure will occur within the failure detection threshold or P-F interval set for the inspection, but the preventive task is only supposed to be performed when the system is down, the simulation waits until the requirements of the preventive task are met to perform the preventive maintenance.


 * If the on condition inspection triggers the preventive maintenance part of the task, the simulation assumes that the maintenance crew will forego any routine servicing associated with the inspection part of the task. In other words, the restoration will come from the preventive maintenance, so any restoration factor defined for the inspection will be ignored in these circumstances.

Example Using P-F Interval
To illustrate the use of the P-F interval in BlockSim, consider a component $$A$$  that fails every 700 $$tu$$. The corrective maintenance on this equipment takes 100 $$tu$$ to complete, while the preventive maintenance takes 50 $$tu$$ to complete. Both the corrective and preventive maintenance actions have a type II restoration factor of 1. Inspection tasks of 10 $$tu$$ duration are performed on the component every 300 $$tu$$. There is no restoration of the component during the inspections. The P-F interval for this component is 100 $$tu$$.

The component behavior from 0 to 2000 $$tu$$ is shown in the figure below and described next.


 * At 300 $$tu$$ the first scheduled inspection of 10 $$tu$$ duration occurs. At this time the age of the component is 300 $$tu$$.  This inspection does not lie in the P-F interval of 100 $$tu$$ (which begins at the age of 600 $$tu$$ and ends at the age of 700 $$tu$$).  Thus, no approaching failure is detected during this inspection.
 * At 600 $$tu$$ the second scheduled inspection of 10 $$tu$$ duration occurs. At this time the age of the component is 590 $$tu$$ (no age is accumulated during the first inspection from 300 $$tu$$ to 310 $$tu$$ as the component does not operate during this inspection).  Again this inspection does not lie in the P-F interval.  Thus, no approaching failure is detected during this inspection.
 * At 720 $$tu$$ the component fails after having accumulated an age of 700 $$tu$$. A corrective maintenance task of 100 $$tu$$ duration occurs to restore the component to as-good-as-new condition.
 * At 900 $$tu$$ the third scheduled inspection occurs. At this time the age of the component is 80 $$tu$$.  This inspection does not lie in the P-F interval (from age 600 $$tu$$ to 700 $$tu$$).  Thus, no approaching failure is detected during this inspection.
 * At 1200 $$tu$$ the fourth scheduled inspection occurs. At this time the age of the component is 370 $$tu$$.  Again, this inspection does not lie in the P-F interval and no approaching failure is detected.
 * At 1500 $$tu$$ the fifth scheduled inspection occurs. At this time the age of the component is 660 $$tu$$, which lies in the P-F interval.  As a result, an approaching failure is detected and the inspection triggers a preventive maintenance task.  A preventive maintenance task of 50 $$tu$$ duration occurs at 1510 $$tu$$ to restore the component to as-good-as-new condition.
 * At 1800 $$tu$$ the sixth scheduled inspection occurs. At this time the age of the component is 240 $$tu$$.  This inspection does not lie in the P-F interval (from age 600 $$tu$$ to 700 $$tu$$) and no approaching failure is detected.
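
The timeline above can be reproduced with a short deterministic simulation in Python. This is a minimal sketch of this specific example, not BlockSim's engine. It assumes that inspections are scheduled on calendar time, that the component does not age during inspections or maintenance, that a failure takes precedence if it coincides with an inspection, and that inspections falling inside a downtime are skipped; the function name is hypothetical.

```python
def simulate_pf_example(horizon=2000, life=700, pf_interval=100,
                        cm_duration=100, pm_duration=50,
                        inspect_every=300, inspect_duration=10):
    """Return a list of (clock_time, event, item_age) tuples."""
    log = []
    clock, age = 0, 0               # current time and accumulated item age
    next_inspection = inspect_every
    while True:
        failure_time = clock + (life - age)   # when the remaining life runs out
        if min(failure_time, next_inspection) > horizon:
            break
        if failure_time <= next_inspection:
            # Failure comes first: corrective maintenance restores to as new.
            log.append((failure_time, "failure", life))
            clock, age = failure_time + cm_duration, 0
        else:
            # Item ages until the inspection, then is down for its duration.
            age += next_inspection - clock
            log.append((next_inspection, "inspection", age))
            clock = next_inspection + inspect_duration
            if age >= life - pf_interval:
                # Inspection falls in the P-F interval: trigger the PM.
                log.append((clock, "PM", age))
                clock, age = clock + pm_duration, 0
            next_inspection += inspect_every
        # Skip any inspection that would fall inside a downtime (assumption).
        while next_inspection < clock:
            next_inspection += inspect_every
    return log


for entry in simulate_pf_example():
    print(entry)
```

The log contains inspections at ages 300, 590, 80, 370, 660 and 240, a failure at 720 $$tu$$ and a preventive maintenance action at 1510 $$tu$$, matching the sequence described above.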



Rules for PMs and Inspections
All the options available in the Maintenance task window were designed to maximize the modeling flexibility within BlockSim. However, maximizing the modeling flexibility introduces issues that you need to be aware of and requires you to carefully select options in order to assure that the selections do not contradict one another. One obvious case would be to define a PM action on a component in series (which will always bring the system down) and then assign a PM policy to the block that has the Do not perform maintenance if the action brings the system down option set. With these settings, no PMs will ever be performed on the component during the BlockSim simulation. The following sections summarize some issues and special cases to consider when defining maintenance properties in BlockSim.


 * Inspections do not consume spare parts. However, an inspection can have a renewal effect on the component if the restoration factor is set to a number other than the default of 0.
 * On the inspection tab, if Inspection brings system down is selected, this also implies that the inspection brings the item down.
 * If a PM or an inspection is scheduled based on the item's age, then it will occur exactly when the item reaches that age. However, it is important to note that failed items do not age.  Thus, if an item fails before it reaches that age, the action will not be performed.  This means that if the item fails before the scheduled inspection (based on item age) and the CM is set to be performed upon inspection, the CM will never take place.  The reason that this option is allowed in BlockSim is for the flexibility of specifying renewing inspections.
 * Downtime due to a failure discovered during a non-downing inspection is included when computing results "w/o PM, OC & Inspections."
 * If a PM upon item age is scheduled and is not performed because it brings the system down (based on the option in the PM task) the PM will not happen unless the item reaches that age again (after restoration by CM, inspection or another type of PM).
 * If the CM task is upon inspection and a failed component is scheduled for PM prior to the inspection, the PM action will restore the component and the CM will not take place.
 * In the case of simultaneous events, only one event is executed (except within a maintenance phase, where all simultaneous events are executed in order). The following precedence order is used: 1) Tasks based on intervals or upon start of a maintenance phase; 2) Tasks based on events in a maintenance group, where the triggering event applies to a block; 3) Tasks based on system down; 4) Tasks based on events in a maintenance group, where the triggering event applies to a subdiagram. Within these categories, order is determined according to the priorities specified in the URD (i.e., the higher the task is on the list, the higher the priority).
 * The PM option of Do not perform if it brings the system down is only considered at the time that the PM needs to be initiated. If the system is down at that time, due to another item, then the PM will be performed regardless of any future consequences to the system up state.  In other words, when the other item is fixed, it is possible that the system will remain down due to this PM action.  In this case, the PM time difference is added to the system PM downtime.
 * Downing events cannot overlap. If a component is down due to a PM and another PM is suggested based on another trigger, the second call is ignored.
 * A non-downing inspection with a restoration factor restores the block based on the age of the block at the beginning of the inspection (i.e., duration is not restored).
 * Non-downing events can overlap with downing events. If a non-downing inspection and a downing event happen concurrently, the non-downing event will be managed in parallel with the downing event.
 * If a failure or PM occurs during a non-downing inspection and the CM or PM has a restoration factor and the inspection action has a restoration factor, then both restoration factors are used (compounded).
 * A PM or inspection on system down is triggered only if the system was up at the time that the event brought the system down.
 * A non-downing inspection with restoration factor of 0 does not affect the block.
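
The precedence order for simultaneous events (outside a maintenance phase) can be expressed as a two-level sort key. The Python sketch below is illustrative only: the category labels and the `pick_task` function are hypothetical names, and the list position stands in for the task's priority in the URD.

```python
# Lower rank wins; the labels are hypothetical names for the four categories.
CATEGORY_RANK = {
    "interval_or_phase_start": 1,   # interval based, or start of a maintenance phase
    "group_event_block": 2,         # maintenance group event applying to a block
    "system_down": 3,               # triggered by the system going down
    "group_event_subdiagram": 4,    # maintenance group event applying to a subdiagram
}


def pick_task(candidates):
    """Return the single task executed among simultaneous candidates.

    candidates: list of (task_name, category, list_position), where a smaller
    list_position means the task is higher on the priority list.
    """
    return min(candidates, key=lambda c: (CATEGORY_RANK[c[1]], c[2]))


simultaneous = [
    ("PM on system down", "system_down", 1),
    ("Inspection every 300", "interval_or_phase_start", 2),
    ("PM every 300", "interval_or_phase_start", 1),
]
print(pick_task(simultaneous)[0])  # PM every 300
```

Here both interval-based tasks outrank the system-down task, and the tie between them is broken by their position on the list.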

=Subdiagrams and Multi Blocks in Simulation=

Any subdiagrams and multi blocks that may be present in the BlockSim RBD are expanded and/or merged into a single diagram before the system is simulated. As an example, consider the system shown in the figure below.



BlockSim will internally merge the system into a single diagram before the simulation, as shown in the figure below. This means that all the failure and repair properties of the items in the subdiagrams are also considered.

In the case of multi blocks, the blocks are also fully expanded before simulation. This means that unlike the analytical solution, the execution speed (and memory requirements) for a multi block representing ten blocks in series is identical to the representation of ten individual blocks in series.

=Containers in Simulation=

Standby Containers
In the case of a standby container, the container acts as the switch mechanism (as shown below) in addition to defining the standby relationships and the number of active units that are required. The container's failure and repair properties are really those of the switch itself. The switch can fail with a distribution, while waiting to switch or during the switch action. Repair properties restore the switch regardless of how the switch failed. Failure of the switch itself does not bring the container down because the switch is not really needed unless called upon to switch. The container will go down if the units within the container fail or the switch is failed when a switch action is needed. The restoration time for this is based on the repair distributions of the contained units and the switch. Furthermore, the container is down during a switch process that has a delay.





To better illustrate this, consider the following deterministic case.


 * Units $$A$$  and  $$B$$  are contained in a standby container.
 * The standby container is the only item in the diagram, thus failure of the container is the same as failure of the system.
 * $$A$$ is the active unit and  $$B$$  is the standby unit.
 * Unit $$A$$  fails every 100  $$tu$$  (active) and takes 10  $$tu$$  to repair.
 * $$B$$ fails every 3  $$tu$$  (active) and also takes 10  $$tu$$  to repair.
 * The units cannot fail while in quiescent (standby) mode.
 * Furthermore, assume that the container (acting as the switch) fails every 30 $$tu$$  while waiting to switch and takes 4  $$tu$$  to repair. If not failed, the container switches with 100% probability.
 * The switch action takes 7 $$tu$$  to complete.
 * After repair, unit $$A$$  is always reactivated.
 * The container does not operate through system failure and thus the components do not either.

Keep in mind that we are tracking two kinds of events on the container: the container being down and the container switch being down.

The system event log is shown in the figure below and is as follows:




 * At 30, the switch fails and gets repaired by 34. The container switch is failed and being repaired; however, the container is up during this time.
 * At 64, the switch fails and gets repaired by 68. The container is up during this time.
 * At 98, the switch fails. It will be repaired by 102.
 * At 100, unit $$A$$  fails.  Unit  $$A$$  attempts to activate the switch to go to  $$B$$ ; however, the switch is failed.
 * At 102, the switch is operational.
 * From 102 to 109, the switch is in the process of switching from unit $$A$$  to unit  $$B$$ .  The container and system are down from 100 to 109.
 * By 110, unit $$A$$  is fixed and the system is switched back to  $$A$$  from  $$B$$ .  The return switch action brings the container down for 7  $$tu$$, from 110 to 117.  During this time, note that unit  $$B$$  has only functioned for 1  $$tu$$ , 109 to 110.
 * At 146, the switch fails and gets repaired by 150. The container is up during this time.
 * At 180, the switch fails and gets repaired by 184. The container is up during this time.
 * At 214, the switch fails and gets repaired by 218.
 * At 217, unit $$A$$  fails.  The switch is failed at this time.
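 * At 218, the switch is operational and the system is switched to unit $$B$$  within 7  $$tu$$ .  The container is down from 217 to 225: from 217 to 218 while waiting for the switch to be repaired, and from 218 to 225 during the switch action.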
 * At 225, unit $$B$$  takes over.  After 2  $$tu$$  of operation at 227, unit  $$B$$  fails.  It will be restored by 237.
 * At 227, unit $$A$$  is repaired and the switchback action to unit  $$A$$  is initiated.  By 234, the system is up.
 * At 262, the switch fails and gets repaired by 266. The container is up during this time.
 * At 296, the switch fails and gets repaired by 300. The container is up during this time.
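
The switch failure times in the log can be reproduced with a short sketch. It assumes, as an inference from the log rather than a stated rule, that the switch accumulates age only while it is waiting (not while under repair and not during a switch action). The function name `switch_failure_times` and the 1 $$tu$$ time step are illustrative choices, and the switch-action intervals are copied from the log instead of being simulated.

```python
def switch_failure_times(fail_age, repair_time, busy_intervals, horizon):
    """Return the clock times at which the switch fails while waiting.

    Assumption: the switch ages only while waiting, i.e. not under repair
    and not in the middle of a switch action. Time advances in 1 tu steps.
    """
    failures = []
    age = 0
    repaired_at = 0  # the switch is under repair while t < repaired_at
    for t in range(horizon):
        if t < repaired_at:
            continue  # under repair, no aging
        if any(start <= t < end for start, end in busy_intervals):
            continue  # performing a switch action, no aging
        age += 1
        if age == fail_age:
            failures.append(t + 1)
            repaired_at = t + 1 + repair_time
            age = 0
    return failures


# Switch actions taken from the log: two switches to B and two switchbacks to A.
switch_actions = [(102, 109), (110, 117), (218, 225), (227, 234)]
print(switch_failure_times(30, 4, switch_actions, 300))
# [30, 64, 98, 146, 180, 214, 262, 296]
```

The 14 $$tu$$ spent in switch actions between 102 and 117 explain why the failure after the one at 98 occurs at 146 and not at 132.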

The system results are shown in the figure below and discussed next.


 * 1.	System CM Downtime is 24.
 * a)	CM downtime includes all downtime due to failures as well as the delay in switching from a failed active unit to a standby unit. It does not include the switchback time from the standby to the restored active unit.  Thus, the times from 100 to 109, 217 to 225 and 227 to 234 are included.  The time to switchback, 110 to 117, is not included.
 * 2.	System Total Downtime is 31.
 * a)	It includes the CM downtime and the switchback downtime.
 * 3.	Number of System Failures is 3.
 * a)	It includes the failures at 100, 217 and 227.
 * b)	This is the same as the number of CM downing events.
 * 4.	The Total Downing Events are 4.
 * a)	This includes the switchback downing event at 110.
 * 5.	The Mean Availability (w/o PM and Inspection) does not include the downtime due to the switchback event.
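
These totals can be checked directly from the downtime intervals in the event log. The sketch below assumes an end time of 300 $$tu$$ (the last time in the log); the variable names are illustrative only.

```python
def total_length(intervals):
    """Sum the lengths of (start, end) intervals."""
    return sum(end - start for start, end in intervals)


end_time = 300  # assumed end of the observation window (the log ends at 300)

cm_down = [(100, 109), (217, 225), (227, 234)]  # downtime following failures
switchback_down = [(110, 117)]                  # return switch to unit A

cm_downtime = total_length(cm_down)
total_downtime = cm_downtime + total_length(switchback_down)
system_failures = len(cm_down)
total_downing_events = len(cm_down) + len(switchback_down)

# Mean availability counting all downing events, and counting CM downtime only.
mean_availability_all = (end_time - total_downtime) / end_time
mean_availability_cm_only = (end_time - cm_downtime) / end_time

print(cm_downtime, total_downtime, system_failures, total_downing_events)
# 24 31 3 4
print(round(mean_availability_all, 4), round(mean_availability_cm_only, 4))
# 0.8967 0.92
```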

Additional Rules and Assumptions for Standby Containers

 * 1)	A container will only attempt to switch if there is an available non-failed item to switch to. If there is no such item, it will switch if and when an item becomes available. The switch action is canceled if the failed active unit is restored before a standby item becomes available.
 * a)	As an example, consider the case of unit $$A$$  failing active while unit  $$B$$  failed in a quiescent mode.  If unit  $$B$$  gets restored before unit  $$A$$, then the switch will be initiated.  If unit  $$A$$  is restored before unit  $$B$$ , the switch action will not occur.
 * 2)	In cases where not all active units are required, a switch will only occur if the failed combination causes the container to fail.
 * a)	For example, if $$A$$,  $$B$$  and  $$C$$  are in a container for which one unit is required to be operating and  $$A$$  and  $$B$$  are active with  $$C$$  on standby, then the failure of either  $$A$$  or  $$B$$  will not cause a switching action.  The container will switch to  $$C$$  only if both  $$A$$  and  $$B$$  are failed.
 * 3)	If the container switch is failed and a switching action is required, the switching action will occur after the switch has been restored if it is still required (i.e., if the active unit is still failed).
 * 4)	If a switch fails during the delay time of the switching action based on the reliability distribution (quiescent failure mode), the action is still carried out unless a failure based on the switch probability/restarts occurs when attempting to switch.
 * 5)	During switching events, the change from the operating to quiescent distribution (and vice versa) occurs at the end of the delay time.
 * 6)	Whether components operate while the system is down is now defined at the component level. (This differs from BlockSim 7, in which the contained items inherited this option from the container.) Two rules apply:
 * a)	If a path inside the container is down, blocks inside the container that are in that path do not continue to operate.
 * b)	Blocks that are up do not continue to operate while the container is down.
 * 7)	A switch can have a repair distribution and maintenance properties without having a reliability distribution.
 * a)	This is because maintenance actions are performed regardless of whether the switch failed while waiting to switch (reliability distribution) or during the actual switching process (fixed probability).
 * 8)	A switch fails during switching when the restarts are exhausted.
 * 9)	A restart is executed every time the switch fails to switch (based on its fixed probability of switching).
 * 10)	If a delay is specified, restarts happen after the delay.
 * 11)	If a container brings the system down, the container is responsible for the system going down (not the blocks inside the container).
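
Rules 8 and 9 can be illustrated with a minimal sketch. It assumes that the number of restarts counts the extra attempts allowed after the first failed attempt and that each attempt succeeds independently with the switch's fixed probability; the function `attempt_switch` and its parameters are hypothetical, and delay times (rule 10) are not modeled.

```python
import random


def attempt_switch(p_switch, max_restarts, rng):
    """Return (switched, attempts) for one switching action.

    Assumptions: `max_restarts` is the number of extra attempts allowed after
    the first failed attempt, and each attempt succeeds independently with
    probability `p_switch`.
    """
    attempts = 0
    while attempts <= max_restarts:
        attempts += 1
        if rng.random() < p_switch:
            return True, attempts
    # Restarts exhausted: the switch is considered failed during switching.
    return False, attempts


rng = random.Random(1)
print(attempt_switch(1.0, 3, rng))  # (True, 1): a perfect switch never needs a restart
print(attempt_switch(0.0, 3, rng))  # (False, 4): first attempt plus three restarts
```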

Load Sharing Containers
In the case of a load sharing container, the container defines the load that is shared. A load sharing container has no failure or repair distributions. The container itself is considered failed if all the blocks inside the container have failed (or, in a $$k$$ -out-of- $$n$$  configuration, if fewer than  $$k$$  blocks remain operating).

To illustrate this, consider the following container with items $$A$$  and  $$B$$  in a load sharing redundancy.

Assume that $$A$$  fails every 100  $$tu$$  and  $$B$$  every 120  $$tu$$  if both items are operating and they fail in half that time if either is operating alone (i.e., the items age twice as fast when operating alone). They both get repaired in 5 $$tu$$.



The system event log is shown in the figure above and is as follows:


 * 1.	At 100, $$A$$  fails.  It takes 5  $$tu$$  to restore  $$A$$.
 * 2.	From 100 to 105, $$B$$  is operating alone and is experiencing a higher load.
 * 3.	At 115, $$B$$  fails.   $$B$$  would normally be expected to fail at 120; however:
 * a)	From 0 to 100, it accumulated the equivalent of 100 $$tu$$  of damage.
 * b)	From 100 to 105, it accumulated 10 $$tu$$  of damage, which is twice the damage since it was operating alone.  Put another way,  $$B$$  aged by 10  $$tu$$  over a period of 5  $$tu$$.
 * c)	At 105, $$A$$  is restored but  $$B$$  has only 10  $$tu$$  of life remaining at this point.
 * d)	 $$B$$ fails at 115.
 * 4.	At 120, $$B$$  is repaired.
 * 5.	At 200, $$A$$  fails again.   $$A$$  would normally be expected to fail at 205; however, the downtime of  $$B$$  from 115 to 120 added additional damage to  $$A$$ .  In other words, the age of  $$A$$  at 115 was 10; by 120 it was 20.  It therefore needed 80 more  $$tu$$  to reach an age of 100, which happened at 200 (95  $$tu$$  after its restoration at 105).
 * 6.	 $$A$$ is restored by 205.
 * 7.	At 235, $$B$$  fails.   $$B$$  would normally be expected to fail at 240; however, the failure of  $$A$$  at 200 caused the reduction.
 * a)	At 200, $$B$$  had an age of 80.
 * b)	By 205, $$B$$  had an age of 90.
 * c)	 $$B$$ fails 30  $$tu$$  later at 235.
 * 8.	The system itself never failed.
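
The event log above can be reproduced with a short deterministic sketch. The function `simulate_load_sharing`, the 1 $$tu$$ time step and the rule "age at the faster rate whenever not all units are up" are illustrative assumptions that match this two-unit example; they are not a description of BlockSim's implementation.

```python
def simulate_load_sharing(life, repair_time, horizon, alone_factor=2):
    """Deterministic load sharing sketch with a 1 tu time step.

    life: dict mapping unit name to its life (in tu) under shared load.
    Assumption: a unit ages `alone_factor` times faster whenever at least
    one other unit is down.
    Returns (list of (failure_time, unit), system downtime in tu).
    """
    age = {name: 0 for name in life}
    up_again_at = {name: 0 for name in life}  # time at which each repair completes
    failures = []
    system_downtime = 0
    for t in range(horizon):
        up = [name for name in life if up_again_at[name] <= t]
        if not up:
            system_downtime += 1  # container is down only when every unit is down
            continue
        rate = 1 if len(up) == len(life) else alone_factor
        for name in up:
            age[name] += rate
            if age[name] >= life[name]:
                failures.append((t + 1, name))
                age[name] = 0
                up_again_at[name] = t + 1 + repair_time
    return failures, system_downtime


failures, system_downtime = simulate_load_sharing({"A": 100, "B": 120}, 5, 250)
print(failures)         # [(100, 'A'), (115, 'B'), (200, 'A'), (235, 'B')]
print(system_downtime)  # 0
```

Because time advances in whole units, a failure that would fall between two steps is reported at the end of the step; this does not affect the example, where all events fall on integer times.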

Additional Rules and Assumptions for Load Sharing Containers

 * 1.	Whether components operate while the system is down is now defined at the component level. (This differs from BlockSim 7, in which the contained items inherited this option from the container.) Two rules apply:
 * a)	If a path inside the container is down, blocks inside the container that are in that path do not continue to operate.
 * b)	Blocks that are up do not continue to operate while the container is down.
 * 2.	If a container brings the system down, the block that brought the container down is responsible for the system going down. (This is the opposite of standby containers.)

=State Change Triggers=

=Discussion=

Even though the examples and explanations presented here are deterministic, the sequence of events and logic used to view the system is the same as the one that would be used during simulation. The difference is that the process would be repeated multiple times during simulation and the results presented would be the average results over the multiple runs.
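
The idea of repeating the same event logic and averaging over runs can be sketched for a single repairable block. The exponential failure and repair times, the function names `one_run` and `mean_availability`, and the parameter values are assumptions made for illustration only.

```python
import random


def one_run(mean_ttf, mean_ttr, horizon, rng):
    """Simulate one history of a single repairable block and return its uptime."""
    t, uptime = 0.0, 0.0
    while t < horizon:
        ttf = rng.expovariate(1.0 / mean_ttf)  # time to next failure (assumed exponential)
        uptime += min(ttf, horizon - t)        # count only uptime inside the window
        t += ttf
        if t >= horizon:
            break
        t += rng.expovariate(1.0 / mean_ttr)   # repair duration (assumed exponential)
    return uptime


def mean_availability(runs, mean_ttf, mean_ttr, horizon, seed=0):
    """Average the uptime fraction over `runs` independent simulated histories."""
    rng = random.Random(seed)
    total_uptime = sum(one_run(mean_ttf, mean_ttr, horizon, rng) for _ in range(runs))
    return total_uptime / (runs * horizon)


print(mean_availability(500, 100.0, 10.0, 1000.0))
```

With a mean time to failure of 100 and a mean repair time of 10, the average should settle near 100/110, and it fluctuates less from one seed to another as the number of runs increases.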

Additionally, multiple metrics and results are presented and defined in this chapter. Many of these results can also be used to obtain additional metrics not explicitly given in BlockSim's Simulation Results Explorer. As an example, to compute mean availability with inspections but without PMs, the explicit downtimes given for each event could be used. Furthermore, all of the results given are for operating times starting at zero to a specified end time (although the components themselves could have been defined with a non-zero starting age). Results for a starting time other than zero could be obtained by running two simulations and looking at the difference in the detailed results where applicable. As an example, the difference in uptimes and downtimes can be used to determine availabilities for a specific time window.
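
As a sketch of the time-window calculation, the uptimes from two end times can be differenced. The function `window_availability` is hypothetical; the numbers come from the deterministic standby container example above, where the total downtime (including the switchback) is 16 $$tu$$ by 200 and 31 $$tu$$ by 300.

```python
def window_availability(uptime_to_start, uptime_to_end, start, end):
    """Mean availability over [start, end] from two cumulative uptimes."""
    return (uptime_to_end - uptime_to_start) / (end - start)


uptime_to_200 = 200 - 16  # downtime by 200: 100 to 109 and 110 to 117
uptime_to_300 = 300 - 31  # total downtime over the full 300 tu

print(window_availability(uptime_to_200, uptime_to_300, 200, 300))  # 0.85
```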