Markov Diagrams

{{Template:bsbook|11}}
The term "Markov Chain," invented by Russian mathematician Andrey Markov, is used across many applications to represent a stochastic process made up of a sequence of random variables representing the evolution of a system. Events are "chained" or "linked" serially together through memoryless transitions from one state to another. The term "memoryless" is used because past events are forgotten, as they are irrelevant; an event or state is dependent only on the state or event that immediately preceded it.


=Concept and Methodology=
The concept behind the Markov chain method is that given a system of states with transitions between them, the analysis will give the probability of being in a particular state at a particular time. If some of the states are considered to be unavailable states for the system, then availability/reliability analysis can be performed for the system as a whole.


Depending on the transitions between the states, the Markov chain can be considered to be a discrete Markov chain, which has a constant probability of transition per unit step, or a continuous Markov chain, which has a constant rate of transition per unit time.


=Discrete Markov Chains=
A discrete Markov chain can be viewed as a Markov chain where at the end of a step, the system will transition to another state (or remain in the current state), based on fixed probabilities. It is common to use discrete Markov chains when analyzing problems involving general probabilities, genetics, physics, etc.
To represent all the states that the system can occupy, we can use a vector <math>\underline{X}\,\!</math>:


::<math>\underline{X}(0)= (X_1  X_2 \cdot\cdot\cdot  X_i) \,\!</math>
 
where the term <math>X_i\,\!</math> represents the probability of the system being in state <math>i\,\!</math> and with:
 
::<math>\sum_{i=1}^{n}X_{i}=1\,\!</math>
 
The transitions between the states can be represented by a matrix <math>\underline{P}\,\!</math>:
 
::<math> \underline{P} =
\begin{pmatrix}
  P_{11} & P_{12} & \cdots & P_{1i} \\
  P_{21} & P_{22} & \cdots & P_{2i} \\
  \vdots  & \vdots  & \ddots & \vdots  \\
  P_{i1} & P_{i2} & \cdots & P_{ii}
\end{pmatrix}\,\!</math>
 
where, for example, the term <math>P_{12}\,\!</math> is the transition probability from state 1 to state 2, and for any row <math>m\,\!</math> with <math>i\,\!</math> states:
 
::<math>\sum_{j=1}^{i}P_{mj}=1\,\!</math>
 
To determine the probability of the system being in a particular state after the first step, we have to multiply the initial state probability vector <math>\underline{X}(0)\,\!</math> with the transition matrix <math>\underline{P}\,\!</math>:
 
::<math>\underline{X}(1)=\underline{X}(0) \cdot \underline{P}\,\!</math>
 
This will give us the state probability vector after the first step, <math>\underline{X}(1)\,\!</math>.
 
If one wants to determine the probabilities of the system being in a particular state after <math>n\,\!</math> steps, the Chapman-Kolmogorov equation can be used. This equation states that the probabilities of being in a state after <math>n\,\!</math> steps can be calculated by taking the initial state vector and multiplying by the transition matrix to the <math>n\,\!</math>th power, or:
 
::<math>\underline{X}(n)=\underline{X}(0) \cdot \underline{P}^n\,\!</math>
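
These matrix operations translate directly into code. The following minimal Python sketch (using numpy; the function name is ours, for illustration only) computes the state vector after <math>n\,\!</math> steps:

<syntaxhighlight lang="python">
import numpy as np

def state_probabilities(X0, P, n):
    """State probability vector after n steps: X(n) = X(0) . P^n."""
    X0 = np.asarray(X0, dtype=float)
    P = np.asarray(P, dtype=float)
    return X0 @ np.linalg.matrix_power(P, n)
</syntaxhighlight>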
 
==Example==
Take a system that can be in any one of three states — operational, standby or offline — at a given time, and that starts in the standby state.
 
After each step:
 
*If the system is in the operational state, there is a 20% chance that it moves to the standby state, and a 5% chance that it goes offline.
*If it is in the standby state, there is a 40% chance that it becomes operational, and a 1% chance that it goes offline.
*If it is in the offline state, there is a 15% chance that it becomes operational, and a 50% chance that it moves to the standby state.
 
We want to know the probability that it is offline after 10 steps.
 
First, we must create the state probability vector at time equal to zero, which in this case is:
 
::<math>\underline{X}(0)=\begin{pmatrix} 0 & 1 & 0 \end{pmatrix}\,\!</math>
 
Then, we can create the transition matrix to represent all the various transition probabilities between states:
 
::<math>\underline{P}=\begin{pmatrix}
0.75 & 0.20 & 0.05 \\
0.40 & 0.59 & 0.01 \\
0.15 & 0.50 & 0.35
\end{pmatrix}\,\!</math>
 
Lastly, we can calculate the state probabilities after 10 steps using the Chapman-Kolmogorov equation:
 
::<math>\underline{X}(10)=\underline{X}(0) \cdot \underline{P}^{10}\,\!</math>
 
or:
 
::<math>\underline{X}(10)=\begin{pmatrix} 0 & 1 & 0 \end{pmatrix} \cdot \begin{pmatrix}
0.75 & 0.20 & 0.05 \\
0.40 & 0.59 & 0.01 \\
0.15 & 0.50 & 0.35
\end{pmatrix}^{10}\,\!</math>
 
 
resulting in:
 
::<math>\underline{X}(10)=\begin{pmatrix} 0.596 & 0.353 & 0.051 \end{pmatrix}\,\!</math>
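
This result can be reproduced in a few lines of numpy (a cross-check sketch, independent of any particular software tool):

<syntaxhighlight lang="python">
import numpy as np

X0 = np.array([0.0, 1.0, 0.0])            # starts in the standby state
P = np.array([[0.75, 0.20, 0.05],         # operational -> (op, standby, offline)
              [0.40, 0.59, 0.01],         # standby     -> (op, standby, offline)
              [0.15, 0.50, 0.35]])        # offline     -> (op, standby, offline)

X10 = X0 @ np.linalg.matrix_power(P, 10)
print(X10.round(3))                       # [0.596 0.353 0.051], as above
</syntaxhighlight>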
 
We can plot the point probabilities of each state at each step if we calculate the state probabilities after each step:
   
   
[[Image:discrete_markov_state_point_probability_plot.png|center|700px|link=]]


From the plot, we can also determine that the probabilities of being in a state reach steady-state after about 6 steps.
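
The steady-state probabilities themselves can be computed without stepping at all, as the left eigenvector of <math>\underline{P}\,\!</math> associated with eigenvalue 1. A short numpy sketch of this cross-check (our own illustration, not from the original analysis):

<syntaxhighlight lang="python">
import numpy as np

P = np.array([[0.75, 0.20, 0.05],
              [0.40, 0.59, 0.01],
              [0.15, 0.50, 0.35]])

# left eigenvector of P for eigenvalue 1, normalized to sum to 1
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmax(np.real(vals))])
pi /= pi.sum()
print(pi.round(3))   # close to X(10) above, since the chain has nearly converged
</syntaxhighlight>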


=Continuous Markov Chains=
A continuous Markov chain can be viewed as a Markov chain where the transitions between states are defined by (constant) transition rates, as opposed to transition probabilities at fixed steps. It is common to use continuous Markov chains when analyzing system reliability/availability problems.


Because we are no longer performing analysis using fixed probabilities and a fixed step, we are no longer able to simply multiply a state probability vector with a transition matrix in order to obtain new state probabilities after a given step.


Instead, our problem can be written as a system of ordinary differential equations, where each differential equation represents the change in the probability of being in a particular state:


::<math>\frac{dP_j}{dt}= \displaystyle\sum_{l=1}^{n} \lambda_{lj}P_{l} - \displaystyle\sum_{l=1}^{n} \lambda_{jl}P_{j}\,\!</math>


where:


*<math>n\,\!</math> is the total number of states
*<math>P_j\,\!</math> is the probability of being in state <math>j\,\!</math>
*<math>P_l\,\!</math> is the probability of being in state <math>l\,\!</math>
*<math>\lambda_{lj}\,\!</math> is the transition rate from state <math>l\,\!</math> to state <math>j\,\!</math>
*<math>\lambda_{jl}\,\!</math> is the transition rate from state <math>j\,\!</math> to state <math>l\,\!</math>


Our initial conditions to solve the differential equations are our initial probabilities for each state.
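
If the rates are collected into a generator matrix <math>Q\,\!</math>, with <math>\lambda_{lj}\,\!</math> off the diagonal and minus the total outflow rate on the diagonal, the whole system of equations collapses to <math>d\underline{P}/dt=\underline{P}\cdot Q\,\!</math>. A minimal Python sketch of this construction (the helper name is ours, and states are assumed to be indexed 0 to n-1):

<syntaxhighlight lang="python">
import numpy as np

def generator_matrix(rates, n):
    """Build Q from rates[(l, j)] = transition rate from state l to state j.

    Off-diagonal entry Q[l, j] holds the inflow rate into j from l; the
    diagonal entry Q[l, l] accumulates minus the total outflow rate from l,
    so that dP/dt = P @ Q reproduces the equations above.
    """
    Q = np.zeros((n, n))
    for (l, j), lam in rates.items():
        Q[l, j] += lam
        Q[l, l] -= lam
    return Q
</syntaxhighlight>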


As a quick example, let us take a system that can be in one of two states, either working or under repair, and is initially in the working state. The transition rate from working to repair is 0.0001/hour (MTTF of 10,000 hours) and the transition rate from repair to working is 0.01/hour (MTTR of 100 hours). In this case, the two differential equations would look like this:


::<math>\frac{dP_{working}}{dt}=0.01\cdot P_{repair} - 0.0001 \cdot P_{working} \,\!</math>


::<math>\frac{dP_{repair}}{dt}=0.0001\cdot P_{working} - 0.01 \cdot P_{repair} \,\!</math>


with initial conditions of:


::<math>P_{working}(0)=1 \,\!</math>


::<math>P_{repair}(0)=0 \,\!</math>


Solving this system of equations yields the following results:


::<math>P_{working}=\frac{0.01}{0.0101}+\frac{0.0001}{0.0101}e^{-(0.0101)t} \,\!</math>


::<math>P_{repair}=\frac{0.0001}{0.0101}-\frac{0.0001}{0.0101}e^{-(0.0101)t} \,\!</math>


For more complex analyses, numerical methodologies are preferred. There are many methods, both analytical and numerical, to solve systems of ordinary differential equations. One of these is the Runge-Kutta-Fehlberg method, also known as the RKF45 method, which is a numerical algorithm. This algorithm is practical because one extra calculation allows for the error to be estimated and controlled with the use of an automatic adaptive step size methodology.


The methodology uses a 4th order Runge-Kutta and a 5th order Runge-Kutta, where the former is used for the calculation and the latter is used for the error estimation. The formulas for the method are as follows:


::<math>f_j=\frac{dP_{j}}{dt}= \displaystyle\sum_{l=1}^{n} \lambda_{lj}P_{l} - \displaystyle\sum_{l=1}^{n} \lambda_{jl}P_{j} \,\!</math>


::<math>k_1=hf(t_i,P_{l,i})\,\!</math>
 
::<math>k_2=hf\Big( t_i+\frac{h}{4},P_{l,i}+\frac{k_1}{4}\Big)\,\!</math>


::<math>k_3=hf\Big( t_i+\frac{3h}{8},P_{l,i}+\frac{3}{32}k_1+\frac{9}{32}k_2\Big)\,\!</math>


::<math>k_4=hf\Big( t_i+\frac{12h}{13},P_{l,i}+\frac{1932}{2197}k_1-\frac{7200}{2197}k_2+\frac{7296}{2197}k_3\Big)\,\!</math>


::<math>k_5=hf\Big( t_i+h,P_{l,i}+\frac{439}{216}k_1-8k_2+\frac{3680}{513}k_3-\frac{845}{4104}k_4\Big)\,\!</math>


::<math>k_6=hf\Big( t_i+\frac{h}{2},P_{l,i}-\frac{8}{27}k_1+2k_2-\frac{3544}{2565}k_3+\frac{1859}{4104}k_4-\frac{11}{40}k_5\Big)\,\!</math>


::<math>P_{j,i+1}=P_{j,i}+\frac{25}{216}k_1+\frac{1408}{2565}k_3+\frac{2197}{4104}k_4-\frac{1}{5}k_5\,\!</math>


::<math>\tilde{P}_{j,i+1}=P_{j,i}+\frac{16}{135}k_1+\frac{6656}{12825}k_3+\frac{28561}{56430}k_4-\frac{9}{50}k_5+\frac{2}{55}k_6\,\!</math>


::<math>R=\frac{1}{h}|\tilde{P}_{j,i+1}-P_{j,i+1}|\,\!</math>


::<math>\delta=0.84\Big(\frac{\epsilon}{R}\Big)^{\frac{1}{4}}\,\!</math>


If <math>R\leq\epsilon\,\!</math>, then keep <math>P_{j,i+1}\,\!</math> as the current solution and move to the next step with step size <math>\delta{h}\,\!</math>; if <math>R>\epsilon\,\!</math>, then recalculate the current step with step size <math>\delta{h}\,\!</math>.


where:


*<math>P_j\,\!</math> is the probability of being in state <math>j\,\!</math>
*<math>P_{l,i}\,\!</math> is the probability of being in state <math>l\,\!</math> at time <math>i\,\!</math>
*<math>\lambda_{lj}\,\!</math> is the transition rate from state <math>l\,\!</math> to state <math>j\,\!</math>
*<math>\lambda_{jl}\,\!</math> is the transition rate from state <math>j\,\!</math> to state <math>l\,\!</math>
*<math>f_j\,\!</math> is the change in the probability of being in state <math>j\,\!</math> (note that <math>f_j\,\!</math> is not a function of time for constant transition rate Markov chains)
*<math>h\,\!</math> is the time step size
*<math>t_i\,\!</math> is the time at step <math>i\,\!</math>
*<math>\varepsilon\,\!</math> is the chosen acceptable error


This methodology can then be used for each state at each time step, where a time step is accepted only if the time step size is acceptable for each state in the system.
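
Putting the formulas together, one accepted or rejected step of the scheme looks like the following Python sketch (standard RKF45 coefficients; it assumes the right-hand side returns a numpy array, and uses the worst-case error over all states for the accept/reject decision, as described above):

<syntaxhighlight lang="python">
import numpy as np

def rkf45_step(f, t, P, h, eps=1e-5):
    """One adaptive RKF45 step for dP/dt = f(t, P), vectorized over states.

    Returns (t_new, P_new, h_next, accepted). The 4th-order value is kept as
    the solution; the 5th-order value is used only to estimate the error R.
    """
    k1 = h * f(t, P)
    k2 = h * f(t + h/4,     P + k1/4)
    k3 = h * f(t + 3*h/8,   P + 3*k1/32 + 9*k2/32)
    k4 = h * f(t + 12*h/13, P + 1932*k1/2197 - 7200*k2/2197 + 7296*k3/2197)
    k5 = h * f(t + h,       P + 439*k1/216 - 8*k2 + 3680*k3/513 - 845*k4/4104)
    k6 = h * f(t + h/2,     P - 8*k1/27 + 2*k2 - 3544*k3/2565
                              + 1859*k4/4104 - 11*k5/40)
    P4 = P + 25*k1/216 + 1408*k3/2565 + 2197*k4/4104 - k5/5
    P5 = P + 16*k1/135 + 6656*k3/12825 + 28561*k4/56430 - 9*k5/50 + 2*k6/55
    R = np.max(np.abs(P5 - P4)) / h          # worst error over all states
    delta = 0.84 * (eps / R) ** 0.25 if R > 0 else 2.0
    if R <= eps:
        return t + h, P4, delta * h, True    # accept; continue with new step size
    return t, P, delta * h, False            # reject; retry with smaller step
</syntaxhighlight>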


Since continuous Markov chains are often used for system availability/reliability analyses, the continuous Markov chain diagram in BlockSim allows the user to designate one or more states as unavailable states. This allows for the calculation of both availability and reliability of the system.


Availability is calculated as the mean probability that the system is in a state that is not an unavailable state.


Reliability is calculated in the same manner as availability, with the additional restriction that all transitions leaving any unavailable state are considered to have a transition rate of zero.
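
In generator-matrix terms, this amounts to zeroing every outgoing rate from the unavailable states, which makes them absorbing. A one-function sketch (the helper name is ours):

<syntaxhighlight lang="python">
import numpy as np

def reliability_generator(Q, unavailable_states):
    """Make the unavailable states absorbing by zeroing their outgoing rates."""
    Qr = np.array(Q, dtype=float)
    for s in unavailable_states:
        Qr[s, :] = 0.0   # no transitions leave state s, including the diagonal
    return Qr
</syntaxhighlight>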


==Example==
Assume you have a system composed of two generators. The system can be in one of three states:


*Both generators are operational
*One generator is operational and the other is under repair
*Both generators are under repair. This is an unavailable state.


The system starts in the state in which both generators are operational.


We know that the failure rate of a generator is 1 per 2,000 hours, and the repair rate is 1 per 200 hours.


Therefore:


*The transition rate from the state in which both generators are operational to the state where only one is operational is 1 per 1,000 hours.
*The transition rate from the state in which one generator is operational to the state where both generators are operational is 1 per 200 hours.
*The transition rate from the state in which one generator is operational to the state where both generators are under repair is 1 per 2,000 hours.
*The transition rate from the state in which both generators are under repair to the state where one generator is operational is 1 per 100 hours.


We would like to know the mean availability of our system after 20,000 hours for all three states so that we can estimate our output based on time spent at full, half and zero generator capacity.


Solving this system by hand using the RKF45 method would be very time consuming, so we will turn to BlockSim to help us solve this problem using a continuous Markov diagram. After creating the diagram, adding the states and the transition rates, the final diagram looks like this:


[[Image:continuous_markov_diagram.png|center|628px|link=]]


Once the diagram is complete, the analysis is set for 20,000 hours. The comparison between the mean probabilities of the states can be done using a bar graph, or by looking at the diagram result summary.


[[Image:continuous_markov_state_mean_probability_plot.png|center|700px|link=]]




[[Image:continuous_markov_results.png|center|798px|link=]]


From the Mean Probability column, we can see that the system is expected to be fully operational 82.8% of the time, half operational 16.4% of the time, and non-operational 0.8% of the time.


From the Point Probability (Av) column, we can get the point probability of being in a state when all transitions are considered. From the Point Probability (Rel) column, we can get the point probability of being in a state if we assume that there is no return from unavailable states; in other words, we are assuming no repair once the system has entered an unavailable (failed) state. Using the "non-repair" assumption, there is only an 18.0% chance that the system would still be fully operational, a 3.3% chance that it would be half operational and a 78.7% chance that it would be non-operational.
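
These figures can be cross-checked outside BlockSim by integrating the corresponding generator matrices directly. The sketch below uses scipy (the state ordering and helper code are ours; the printed values should land near the table values above):

<syntaxhighlight lang="python">
import numpy as np
from scipy.integrate import solve_ivp

# states: 0 = both operational, 1 = one operational, 2 = both under repair
Q = np.array([[-0.001,  0.001,   0.0   ],
              [ 0.005, -0.0055,  0.0005],
              [ 0.0,    0.01,   -0.01  ]])
Qrel = Q.copy()
Qrel[2, :] = 0.0                          # "no repair" once both are down

P0, T = np.array([1.0, 0.0, 0.0]), 20000.0
ts = np.linspace(0.0, T, 4001)
av  = solve_ivp(lambda t, P: P @ Q,    (0, T), P0, t_eval=ts, rtol=1e-9, atol=1e-12)
rel = solve_ivp(lambda t, P: P @ Qrel, (0, T), P0, t_eval=ts, rtol=1e-9, atol=1e-12)

# average over the uniform grid approximates the time-averaged probabilities
print("mean probabilities:", av.y.mean(axis=1).round(3))   # ~ (0.828, 0.164, 0.008)
print("point prob (Av):   ", av.y[:, -1].round(3))
print("point prob (Rel):  ", rel.y[:, -1].round(3))        # ~ (0.180, 0.033, 0.787)
</syntaxhighlight>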


=Phases=
It is possible to do analyses with multiple phases for both discrete and continuous Markov chain calculations. When using phases, the software keeps track of state correspondence across phases using the name of the state (e.g., "State B" in one phase is considered to correspond to "State B" in all other phases). The starting state probability comes from the ending state probability of the previous phase. If a state is absent from a phase, then its state probability does not change during that phase.


Also, it is possible to perform an analysis where the number of steps (for discrete analyses) or the operational time (for continuous analyses) is greater than the total length of all the phases. In this case, the analysis will return to the first phase, without resetting the state probabilities, and continue from there.
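
For the discrete case, this phase bookkeeping is straightforward to sketch. The hypothetical helper below (names and structure are ours, not BlockSim's API) assumes that every phase shares the same state space:

<syntaxhighlight lang="python">
import numpy as np

def run_phases(X0, phases, total_steps):
    """phases: list of (P, steps) pairs, one transition matrix per phase.

    The vector at the end of each phase seeds the next one; if total_steps
    exceeds the combined phase lengths, we wrap back to the first phase
    without resetting the state probabilities.
    """
    X, done = np.asarray(X0, dtype=float), 0
    while done < total_steps:
        for P, steps in phases:
            n = min(steps, total_steps - done)
            X = X @ np.linalg.matrix_power(np.asarray(P), n)
            done += n
            if done == total_steps:
                break
    return X
</syntaxhighlight>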
