Event Log Data

From ReliaWiki
Revision as of 00:16, 24 August 2012 by Kate Racaza (talk | contribs)

Event logs, or maintenance logs, store information about a piece of equipment's failures and repairs. They help companies achieve their productivity goals by giving insight into the failure modes, frequency of outages, repair duration, uptime/downtime and availability of the equipment. Some event logs contain more information than others, but essentially every event log captures the type of event, the date/time when the event occurred and the date/time when the system was restored to operation.

The data from event logs can be used to extract failure times and repair times. For n failures and repair actions that took place during the event logging period, the time-to-failure for each occurrence of an event is the time between the last repair and the new failure, or:


[math]\displaystyle{ \text{Time-to-Failure}_{i}=t_{i}-r_{i-1}\,\! }[/math]
where:
  • [math]\displaystyle{ i=1,...n\,\! }[/math]
  • [math]\displaystyle{ t_{i}\,\! }[/math] is the date/time of occurrence [math]\displaystyle{ i\,\! }[/math].
  • [math]\displaystyle{ r_{i-1}\,\! }[/math] is the date/time of restoration of the previous occurrence [math]\displaystyle{ (i-1)\,\! }[/math].


For systems that were new when the collection of the event log data started, the time to the first occurrence of each unique event is the date/time of that occurrence minus the time the system monitoring started. That is:


[math]\displaystyle{ \text{Time-to-Failure}_{1}=t_{1}-\text{System Start Time}\,\! }[/math]


For systems that were not new when the collection of event log data started, the times to first occurrence of every unique event are considered to be suspensions (right censored) because the system is assumed to have accumulated more hours before the data collection period started (i.e., the time between the start date/time and the first occurrence of an event is not the entire operating time). In this case:


[math]\displaystyle{ \text{Suspension}_{1}=t_{1}-\text{System Start Time}\,\! }[/math]


When monitoring of the system is stopped or when the system is no longer being used, all events that have not occurred by this time are considered to be suspensions. For each event, the suspension time runs from its last restoration to the end of monitoring:


[math]\displaystyle{ \text{Last Suspension}=\text{System End Time}-r_{n}\,\! }[/math]
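
As a minimal sketch of the relations above, assuming a component that runs around the clock and operates through the failures of other components, the extraction could be written in Python as follows. The names `hours_between` and `extract_times`, the `new_at_start` flag and the sample dates are hypothetical; they are not taken from the event log discussed in this article.

```python
from datetime import datetime

def hours_between(earlier, later):
    """Elapsed clock time between two datetimes, in hours."""
    return (later - earlier).total_seconds() / 3600.0

def extract_times(events, start, end, new_at_start=True):
    """Split one event type's log into times-to-failure and suspensions.

    events: list of (occurred, restored) datetimes, sorted by occurrence.
    start, end: beginning and end of the monitoring period.
    new_at_start: True if the system was new when monitoring began.
    """
    failures, suspensions = [], []
    previous = start  # stands in for r_0 on the first pass
    for index, (occurred, restored) in enumerate(events):
        interval = hours_between(previous, occurred)
        if index == 0 and not new_at_start:
            # Age before monitoring is unknown, so treat as a suspension.
            suspensions.append(interval)
        else:
            failures.append(interval)
        previous = restored
    # Time from the last restoration to the end of monitoring.
    suspensions.append(hours_between(previous, end))
    return failures, suspensions

# Hypothetical two-event log, for illustration only.
log = [
    (datetime(2020, 1, 1, 10, 0), datetime(2020, 1, 1, 12, 0)),
    (datetime(2020, 1, 2, 12, 0), datetime(2020, 1, 2, 13, 0)),
]
start = datetime(2020, 1, 1, 0, 0)
end = datetime(2020, 1, 3, 13, 0)

failures_new, suspensions_new = extract_times(log, start, end, new_at_start=True)
failures_old, suspensions_old = extract_times(log, start, end, new_at_start=False)
```

With the system treated as new, the first interval counts as a failure time; otherwise the same interval moves to the suspension list.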


The four equations given above are valid for cases in which the component operates through the failure of other components. When the component does not operate through those failures, the system downtime caused by the other failures must be subtracted. In other words, the first four equations become:


[math]\displaystyle{ \text{Time-to-Failure}_{i}=t_{i}-r_{i-1}-(\text{System Downtime since}\,r_{i-1})\,\! }[/math]
[math]\displaystyle{ \text{Time-to-Failure}_{1}=t_{1}-\text{System Start Time}-(\text{System Downtime since System Start Time})\,\! }[/math]
[math]\displaystyle{ \text{Suspension}_{1}=t_{1}-\text{System Start Time}-(\text{System Downtime since System Start Time})\,\! }[/math]
[math]\displaystyle{ \text{Last Suspension}=\text{System End Time}-r_{n}-(\text{System Downtime since}\,r_{n})\,\! }[/math]
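
The downtime correction can be sketched as an interval-overlap calculation. This is only an illustration, assuming the outages caused by other components are available as (down, up) pairs; the helper names and the sample dates are hypothetical.

```python
from datetime import datetime

def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600.0

def downtime_between(window_start, window_end, outages):
    """Hours of system downtime that fall inside [window_start, window_end]."""
    total = 0.0
    for down, up in outages:
        overlap_start = max(down, window_start)
        overlap_end = min(up, window_end)
        if overlap_end > overlap_start:
            total += hours(overlap_end - overlap_start)
    return total

def adjusted_interval(window_start, window_end, outages):
    """Clock time in the window minus downtime caused by other components."""
    elapsed = hours(window_end - window_start)
    return elapsed - downtime_between(window_start, window_end, outages)

# Hypothetical data: last restoration, next failure, and two outages
# caused by other components (the second extends past the window).
restored_prev = datetime(2020, 1, 1, 0, 0)
failed_next = datetime(2020, 1, 2, 0, 0)
other_outages = [
    (datetime(2020, 1, 1, 6, 0), datetime(2020, 1, 1, 9, 0)),
    (datetime(2020, 1, 1, 22, 0), datetime(2020, 1, 2, 2, 0)),
]

lost = downtime_between(restored_prev, failed_next, other_outages)
ttf = adjusted_interval(restored_prev, failed_next, other_outages)
```

Clipping each outage to the window keeps downtime that started before the last restoration, or ended after the failure, from being counted twice.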


Repair times are obtained by calculating the difference between the date/time of event occurrence and the date/time of restoration, or:


[math]\displaystyle{ \text{Time-to-repair}_{i}=r_{i}-t_{i}\,\! }[/math]


All these equations should also take into consideration the periods when the system is not operating or not in use, as in the case of operations that do not run on a 24/7 basis. The failure/repair data of every component in the event log can then be used to derive failure distributions and repair distributions using life data analysis methods. The process of data extraction and model fitting can be automated using the Weibull++ event log folio.


Example

Consider a very simple system composed of only two components, A and B. The system runs from 8 AM to 5 PM, Monday through Friday. When a failure is observed, the system undergoes repair and the failed component is replaced. The date and time of each failure is recorded in an equipment downtime log, along with an indication of the component that caused the failure. The date and time when the system was restored is also recorded. The downtime log for this simple system is given next.

Note that:

  • The date and time of each failure is recorded.
  • The date and time of repair completion for each failure is recorded.
  • The repair involves replacement of the responsible component.
  • The responsible component for each failure is recorded.

For this example, we will assume that an engineer began recording these events on January 1, 1997 at 12 PM and stopped recording on March 18, 1997 at 1 PM, at which time the analysis was performed. Information for events prior to January 1 is unknown.

The objective of the analysis is to obtain the failure and repair distributions for each component. To do this, the times-to-failure and the times-to-repair for each component need to be computed from the data in the table. Once the times-to-failure data and times-to-repair data have been obtained, a life distribution will be fitted to each data set. The principles and theory for fitting a life distribution are presented in detail in Life Distributions.


Solution

Obtaining Failure and Repair Times for Component A


We begin the analysis by looking at component A. The first time that component A is known to have failed is recorded in row 1 of the data sheet; thus, the first age (or time-to-failure) for A is the difference between the time we began recording the data and the time when this failure event happened. Also, the component does not age when the system is down due to the failure of another component, so any such downtime must be excluded from the component's age.


1. The First Time-To-Failure for Component A, TTFA[1]

The first time-to-failure of component A, TTFA[1], is the sum of the hours of operation for each day, starting on the start date (and time) and ending with the failure date (and time). This is shown graphically next. The operating periods are indicated with a green background. Thus, TTFA[1] = 5 + 8 = 13 hours.


[Figure: operating periods of component A from the start of data recording to its first failure]


2. The First Time-To-Repair for Component A, TTRA[1]

The time-to-repair for component A for this failure, TTRA[1], is [Date/Time Restored - Date/Time Occurred] or:


TTRA[1] = (Jan 02 1997/7:49 PM) - (Jan 02 1997/4:00 PM) = 3:49 = 3.8167 hours


(Note that in the case of repair actions, shifts are not taken into account since it is assumed that repair actions will be performed as needed to bring the system up.)
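
The same subtraction can be checked with Python's standard `datetime` module. This is a small sketch using the two timestamps quoted above; the variable names are illustrative.

```python
from datetime import datetime

# Timestamps for the first failure of component A, as quoted above.
occurred = datetime(1997, 1, 2, 16, 0)   # Jan 02 1997, 4:00 PM
restored = datetime(1997, 1, 2, 19, 49)  # Jan 02 1997, 7:49 PM

# Repair time is plain clock time; shifts are not taken into account.
ttr_a1 = restored - occurred
ttr_a1_hours = ttr_a1.total_seconds() / 3600.0
```

Here `ttr_a1` is a `timedelta` of 3 hours 49 minutes, and `ttr_a1_hours` is the same duration in decimal hours.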

3. The Second Time-To-Failure for Component A, TTFA[2]

Continuing with component A, the second system failure due to component A is found in row 4, on January 12, 1997 at 3:26 PM. Thus, to compute TTFA[2], you must look at the age the component accumulated from the last repair time, taking shifts into account as before, but with the added complexity of accounting for the times that the system was down due to failures of other components (i.e., component A was not aging when the system was down for repair due to a component B failure).

This is shown graphically next using green to show the operating times of A and orange to show the downtimes of the system for reasons other than the failure of A (to the closest hour).


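[Figure: operating times of component A (green) and system downtimes for reasons other than the failure of A (orange), between the first repair and the second failure of A]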


To illustrate this mathematically, we will use a function, [math]\displaystyle{ \tau }[/math], which, given a range of times, returns the shift hours worked during that period. For example, [math]\displaystyle{ \tau }[/math](1/1/97 3:00 AM - 1/1/97 6:00 PM) = 9 hours given an 8 AM to 5 PM shift. Furthermore, we will show the date and time a failure occurred as DTO and the date and time a repair was completed as DTR, with a numerical subscript indicating the row that this entry is in (e.g., DTO4 for the date and time a failure occurred in row 4).
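
One possible way to implement such a function is sketched below. It is an assumption-laden illustration, not the method used by Weibull++: the name `shift_hours`, its parameters and the default Monday through Friday working week are choices made here for the example.

```python
from datetime import datetime, time, timedelta

def shift_hours(start, end, shift_start=time(8, 0), shift_end=time(17, 0),
                workdays=(0, 1, 2, 3, 4)):
    """Return the hours between start and end that fall inside the shift.

    workdays uses datetime.weekday() numbering (Monday is 0).
    """
    total = timedelta(0)
    day = start.date()
    while day <= end.date():
        if day.weekday() in workdays:
            # Clip the requested range to this day's shift window.
            window_open = max(start, datetime.combine(day, shift_start))
            window_close = min(end, datetime.combine(day, shift_end))
            if window_close > window_open:
                total += window_close - window_open
        day += timedelta(days=1)
    return total.total_seconds() / 3600.0

# The range quoted above: 1/1/97 3:00 AM to 1/1/97 6:00 PM.
tau_example = shift_hours(datetime(1997, 1, 1, 3, 0),
                          datetime(1997, 1, 1, 18, 0))

# From the start of recording (Jan 1, 12 PM) to the first failure of A
# (Jan 2, 4:00 PM), which the text gives as 5 + 8 hours.
ttf_a1 = shift_hours(datetime(1997, 1, 1, 12, 0),
                     datetime(1997, 1, 2, 16, 0))
```

The `workdays` argument is left configurable so that the same function can cover schedules other than a five-day week.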

Then the total possible hours (TPH) that component A could have operated from the time it was repaired to the time it failed the second time is:

TPH = [math]\displaystyle{ \tau }[/math](DTR1 – DTO4) = 9 days × 9 hours + 7:26 hours = 88:26 hours = 88.433 hours


The time that component A was not operating (NOP) during normal hours of operation is the time that the system was down due to failure of component B, or:

NOP = [math]\displaystyle{ \tau }[/math](DTO2 – DTR2) + [math]\displaystyle{ \tau }[/math](DTO3 – DTR3) = 2:13 hours + 7:47 hours = 10:00 hours


Thus, the second time-to-failure for component A, TTFA[2], is:

TTFA[2] = TPH – NOP = 88:26 hours – 10:00 hours = 78:26 hours = 78.433 hours
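
The hours-and-minutes arithmetic for TTFA[2] can be verified with `timedelta` values. This sketch only re-adds the durations quoted in the text; the helper `hm` is hypothetical, and the individual timestamps for rows 2 and 3 are not reproduced here.

```python
from datetime import timedelta

def hm(hours, minutes):
    """Build a duration from an H:MM figure."""
    return timedelta(hours=hours, minutes=minutes)

# Total possible hours: 9 full shift days of 9 hours, plus 7:26 on the
# day of the failure (8:00 AM to 3:26 PM).
tph = 9 * timedelta(hours=9) + hm(7, 26)

# Shift hours lost to the two component B outages, as given in the text.
nop = hm(2, 13) + hm(7, 47)

# Second time-to-failure of component A.
ttf_a2 = tph - nop
ttf_a2_hours = ttf_a2.total_seconds() / 3600.0
```

Note that the nine full days in the TPH term run from January 3 through January 11, so every calendar day in that span is counted as a shift day.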