CS302 Lecture Notes - Event-based simulation of failures and repair in a production environment

Day #2


Our next step is to add Maintenance. To do that, we'll generate a DU_Maint_Begin event at the appropriate time (the maintenance interval) on the first distillation unit. Keeping with our incremental style of programming, we won't implement it. This is in patron_7.cpp -- here's the relevant change to the constructor:

Simulation_System::Simulation_System(int argc, char **argv)
{
 ....
  e = new Event;
  e->type = DU_Maint_Begin;
  e->time = mi;
  e->du = dus[0];
  eventq.insert(make_pair(e->time, e));
}

When we run it on the last set of arguments, we see the event, but it doesn't get processed since it's after the end of the simulation:

UNIX> patron_7 4 10 1.3 3650 3 5 1 1 3651
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 1012.6471 DU_Repaired      0
 3021.4310 DU_Failed        0
 3028.8040 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 3651.0000 DU_Maint_Begin   0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
Uptime: 14580.6
Downtime: 19.4109
--------------------------------
UNIX> 
We run it again with a Maintenance interval of a year, and now we exit because the event is not implemented:
UNIX> patron_7 4 10 1.3 3650 3 5 1 1 365
  365.0000 DU_Maint_Begin   0
Event not implemented 2
UNIX> 

Patron_8.cpp - Maintenance when a DU is under repair

We'll implement maintenance in steps. First, what should we do if we try to do maintenance on a unit that is under repair? We should simply skip to the next unit. We'll implement that first. To skip to the next unit, we generate a new DU_Maint_Begin event at the current time. We could do that differently, but this fits best with our framework. This is done in patron_8.cpp -- note, we implement skipping to the next unit in a separate procedure because we're going to need to do the same thing when we process a DU_Maint_End event:

void Simulation_System::schedule_next_du_maintenance(Event *e)
{
  e->type = DU_Maint_Begin;
  if (e->du->id == n-1) {
    e->du = dus[0];
    e->time += mi;
  } else {
    e->du = dus[e->du->id+1];
  }
  eventq.insert(make_pair(e->time, e));
}

void Simulation_System::Run()
{
 ...
        du = e->du;
        if (du->state == Failed) {
          schedule_next_du_maintenance(e);
        } else {
          cerr << "Maintenance not implemented yet\n";
          exit(1);
        }
        break;
...
}

Of course, when we run it, we get failure, because the first DU is functional when maintenance starts:

UNIX> patron_8 4 10 1.3 3650 3 5 1 1 365
  365.0000 DU_Maint_Begin   0
Maintenance not implemented yet
UNIX> 
However, let's tweak the parameters so that failures happen roughly every 50 days, and repair takes 100 years. That will mean that when we do maintenance, the DU's will be under repair. We'll also have the simulation last 2.5 years rather than ten. This means that we should have exactly two sets of maintenance on our system:
UNIX> patron_8 4 2.5 3 50 3 1000000 1 1 365
   23.3109 DU_Failed        2
   28.6091 DU_Failed        0
   55.7461 DU_Failed        1
   63.4527 DU_Failed        3
  365.0000 DU_Maint_Begin   0
  365.0000 DU_Maint_Begin   1
  365.0000 DU_Maint_Begin   2
  365.0000 DU_Maint_Begin   3
  730.0000 DU_Maint_Begin   0
  730.0000 DU_Maint_Begin   1
  730.0000 DU_Maint_Begin   2
  730.0000 DU_Maint_Begin   3
  912.5000 Simulation_Over
--------------------------------
Simulation Time: 912.5
DU 0's state is 1
DU 1's state is 1
DU 2's state is 1
DU 3's state is 1
 1095.0000 DU_Maint_Begin   0
772052.9448 DU_Repaired      3
951398.4507 DU_Repaired      2
1056266.6779 DU_Repaired      1
1155040.2999 DU_Repaired      0
Uptime: 171.119
Downtime: 3478.88
--------------------------------
UNIX> 
Excellent.

Patron_9.cpp - Lazy Maintenance

In our next step, let's implement maintenance that doesn't do anything. This is equivalent to the maintenance man going to the DU and falling asleep for the required period of time. When we do maintenance on a failed unit, we simply generate a DU_Maint_End event and quit. Then the DU_Maint_End event calls schedule_next_du_maintenance(e) either to move on to the next DU or to schedule the next maintenance on the whole system (patron_9.cpp):

void Simulation_System::Run()
...
      case DU_Maint_Begin: 
        du = e->du;
        if (du->state == Failed) {
          schedule_next_du_maintenance(e);
        } else {
          e->type = DU_Maint_End;
          e->time += mt;
          eventq.insert(make_pair(e->time, e));
        }
        break;
      case DU_Maint_End: 
        schedule_next_du_maintenance(e);
        break;
}

We run it on the original set of parameters and see that everything works as it should -- the maintenance events each take a day, but do nothing to failure or uptime or downtime:

UNIX> patron_9 4 10 1.3 3650 3 5 1 1 365
  365.0000 DU_Maint_Begin   0
  366.0000 DU_Maint_End     0
  366.0000 DU_Maint_Begin   1
  367.0000 DU_Maint_End     1
  367.0000 DU_Maint_Begin   2
  368.0000 DU_Maint_End     2
  368.0000 DU_Maint_Begin   3
  369.0000 DU_Maint_End     3
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
  734.0000 DU_Maint_Begin   0
  735.0000 DU_Maint_End     0
.....
 3320.0000 DU_Maint_End     2
 3320.0000 DU_Maint_Begin   3
 3321.0000 DU_Maint_End     3
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 3686.0000 DU_Maint_Begin   0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
Uptime: 14580.6
Downtime: 19.4109
--------------------------------
UNIX> 

Patron_A.cpp - Actually Doing Maintenance

When you think about it, maintenance needs to get rid of the current failure event and generate a new one when maintenance is over. To do that, we need to store a pointer to the event on the event queue so that we can delete it when we do maintenance. This is done in patron_A.cpp. Here are the changes to the data structures -- not a Dist_Unit holds an iterator that points to the failure event of that Dist_Unit on the event queue:

typedef multimap <double, class Event *> EQ;

class Dist_Unit {
  public:
    double lasteventtime;
    double uptime;
    double downtime;
    DU_State state;
    int id;
    EQ::iterator failure_ptr;
};

When we insert a failure event on the , we set failure_ptr to point to the event. For example, here's a code snippet from the constructor:

...
  for (i = 0; i < n; i++) {
    e = new Event;
    e->type = DU_Failed;
    e->time = gen_weibull(beta_f, eta_f, 0.0);
    e->du = dus[i];
    e->du->failure_ptr = eventq.insert(make_pair(e->time, e));
  }
...

And now, at the start of maintenance, we delete the failure event. At the end, we generate a new failure event. To test these, we print out some messages. Finally, note that we call delete on the event, because we don't reuse it like we do the other events. If we didn't delete it, our program would have a memory leak. Here's the code from Run():

      case DU_Maint_Begin: 
        du = e->du;
        if (du->state == Failed) {
          schedule_next_du_maintenance(e);
        } else {
          ne = e->du->failure_ptr->second;
          cout << "Deleting Event: ";
          Print_Event(ne);
          delete ne;
          eventq.erase(e->du->failure_ptr);
          e->type = DU_Maint_End;
          e->time += mt;
          eventq.insert(make_pair(e->time, e));
        }
        break;
      case DU_Maint_End: 
        ne = new Event;
        ne->du = e->du;
        ne->type = DU_Failed;
        ne->time = simtime + gen_weibull(beta_f, eta_f, 0.0);
        e->du->failure_ptr = eventq.insert(make_pair(ne->time, ne));
        cout << "Generating New Event: ";
        Print_Event(ne);
        schedule_next_du_maintenance(e);
        break;

When we run it, we see that failure events do get deleted and regenerated properly. Interestingly, maintenance didn't eradicate all failures, as there is still one in the system:

UNIX> patron_A 4 10 1.3 3650 3 5 1 1 365
  365.0000 DU_Maint_Begin   0
Deleting Event:  1006.3660 DU_Failed        0
  366.0000 DU_Maint_End     0
Generating New Event:  3619.3772 DU_Failed        0
  366.0000 DU_Maint_Begin   1
Deleting Event:  4691.5675 DU_Failed        1
  367.0000 DU_Maint_End     1
Generating New Event:  5457.0507 DU_Failed        1
  367.0000 DU_Maint_Begin   2
Deleting Event:   627.3349 DU_Failed        2
  368.0000 DU_Maint_End     2
Generating New Event:  4508.9609 DU_Failed        2
  368.0000 DU_Maint_Begin   3
Deleting Event:  6325.4670 DU_Failed        3
  369.0000 DU_Maint_End     3
Generating New Event:  2377.7840 DU_Failed        3
  734.0000 DU_Maint_Begin   0
Deleting Event:  3619.3772 DU_Failed        0
  735.0000 DU_Maint_End     0
...
 3318.0000 DU_Maint_End     0
Generating New Event:  3537.7115 DU_Failed        0
 3318.0000 DU_Maint_Begin   1
Deleting Event:  3738.9977 DU_Failed        1
 3319.0000 DU_Maint_End     1
Generating New Event:  4242.3191 DU_Failed        1
 3319.0000 DU_Maint_Begin   2
Deleting Event:  3502.8592 DU_Failed        2
 3320.0000 DU_Maint_End     2
Generating New Event:  5400.5551 DU_Failed        2
 3320.0000 DU_Maint_Begin   3
Deleting Event: 11064.0722 DU_Failed        3
 3321.0000 DU_Maint_End     3
Generating New Event:  4246.9549 DU_Failed        3
 3537.7115 DU_Failed        0
 3543.2616 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 3686.0000 DU_Maint_Begin   0
 4242.3191 DU_Failed        1
 4246.9549 DU_Failed        3
 5400.5551 DU_Failed        2
 9988.6965 DU_Failed        0
Uptime: 14594.4
Downtime: 5.55007
--------------------------------
UNIX> 

Patron_B.cpp - Uptime and Downtime

We haven't updated the uptime/downtime parameters during maintenance. The following code (patron_B.cpp) performs this:

      case DU_Maint_Begin: 
        du = e->du;
        if (du->state == Failed) {
          schedule_next_du_maintenance(e);
        } else {
          du->state = Under_M;
          du->uptime += (simtime - du->lasteventtime);
          du->lasteventtime = simtime;
          ne = e->du->failure_ptr->second;
          delete ne;
          eventq.erase(e->du->failure_ptr);
          e->type = DU_Maint_End;
          e->time += mt;
          eventq.insert(make_pair(e->time, e));
        }
        break;
      case DU_Maint_End: 
        du = e->du;
        du->state = Up;
        du->downtime += (simtime - du->lasteventtime);
        du->lasteventtime = simtime;
        ne = new Event;
        ne->du = e->du;
        ne->type = DU_Failed;
        ne->time = simtime + gen_weibull(beta_f, eta_f, 0.0);
        e->du->failure_ptr = eventq.insert(make_pair(ne->time, ne));
        schedule_next_du_maintenance(e);
        break;

Now we have a nice simulator that works on all fronts:

UNIX> patron_B 4 10 1.3 3650 3 5 1 1 365
  365.0000 DU_Maint_Begin   0
  366.0000 DU_Maint_End     0
  366.0000 DU_Maint_Begin   1
...
 3543.2616 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 3686.0000 DU_Maint_Begin   0
 4242.3191 DU_Failed        1
 4246.9549 DU_Failed        3
 5400.5551 DU_Failed        2
 9988.6965 DU_Failed        0
Uptime: 14558.4
Downtime: 41.5501
--------------------------------
UNIX> 
If you look at the output with no maintenance, you'll see that maintenance doesn't seem to help in terms of uptime/downtime:
UNIX> patron_B 4 10 1.3 3650 3 5 1 1 3651
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 1012.6471 DU_Repaired      0
 3021.4310 DU_Failed        0
 3028.8040 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 3651.0000 DU_Maint_Begin   0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
Uptime: 14580.6
Downtime: 19.4109
--------------------------------
UNIX> 
But, is one run really enough, especially when it generates just three failures? No.

Patron_C.cpp - Running many simulations

What we need to do is run many, many simulations so we can try to draw some conclusions from our simulation. So Patron_C.cpp modifies the simulator to do so. We implement a method called get_downtime() that calculates and returns the downtime in a simulation. Then we can have the simulator do multiple runs, and we can calculate stats from the data. The following main() does this - it puts the results of each run into a vector, sorts the vector, and then prints out maximum/minimum/average values, plus a final line that has everything including quartile values:

main(int argc, char **argv)
{
  int i;
  long seed;
  Simulation_System *ss;
  double downtime, dt;
  double iterations;
  vector <double> simvals;

  if (argc != 12) {
    cerr << "usage: patron n duration(years) beta_f eta_f beta_r eta_r gamma_r maintenance_time maintenance_interval iterations seed\n";
    cerr << "       all other units are days\n";
    exit(1);
  }
  if (sscanf(argv[10], "%lf", &iterations) != 1 || iterations < 0) {
    cerr << "Bad iterations\n";
    exit(1);
  }

  if (sscanf(argv[11], "%ld", &seed) != 1) {
    cerr << "Bad seed\n";
    exit(1);
  }

  srand48(seed);
  downtime = 0;
  ss = new Simulation_System(argc, argv);
  for (i = 0; i < iterations; i++) {
    ss->Run();
    dt = ss->get_downtime();
    simvals.push_back(dt);
    downtime += dt;
  }
  sort(simvals.begin(), simvals.end());

  printf("Iterations: %10.0lf\n", iterations);
  printf("Min:        %10.4lf\n", simvals[0]);
  printf("Max:        %10.4lf\n", simvals[simvals.size()-1]);
  printf("Mean        %10.4lf\n", downtime / iterations);
  printf("\n");

  cout << simvals[0] << " " <<
          simvals[simvals.size()-1] << " " <<
          downtime / iterations << " " <<
          simvals[simvals.size()/2] << " " <<
          simvals[simvals.size()/4] << " " <<
          simvals[simvals.size()*3/4] << endl;
}

Unfortunately, this also requires some changing of our Simulation_System. In particular, we need to be able to call Run() many times for different runs. This means that Run() needs to re-initialize the simulator's state. We do that by implementing a new method called Initialize_Simulation(), which resets all of the simulator's state, including deleting the event queue:

void Simulation_System::Initialize_Simulation()
{
  int i;
  Event *e;
  Dist_Unit *du;
  EQ::iterator eqit;
  
  while (!eventq.empty()) {
    eqit = eventq.begin();
    e = eqit->second;
    delete e;
    eventq.erase(eqit); 
  }

  for (i = 0; i < n; i++) {
    du = dus[i];
    du->lasteventtime = 0;
    du->uptime = 0;
    du->downtime = 0;
    du->state = Up;
  }

  e = new Event;
  e->type = Simulation_Over;
  e->time = duration;
  e->du = NULL;

  eventq.insert(make_pair(e->time, e));

  for (i = 0; i < n; i++) {
    e = new Event;
    e->type = DU_Failed;
    e->time = gen_weibull(beta_f, eta_f, 0.0);
    e->du = dus[i];
    e->du->failure_ptr = eventq.insert(make_pair(e->time, e));
  }

  e = new Event;
  e->type = DU_Maint_Begin;
  e->time = mi;
  e->du = dus[0];
  eventq.insert(make_pair(e->time, e));
}

We remove the relevant code from the constructor, and call Initialize_Simulation() at the beginning of Run(), and that allows us to run the simulation a bunch of times:

Let's first run the simulation one iteration with a few different seeds:

UNIX> patron_C 4 10 1.3 3650 3 5 1 1 3650 1 0
Iterations:          1
Min:           19.4109
Max:           19.4109
Mean           19.4109

19.4109 19.4109 19.4109 19.4109 19.4109 19.4109
UNIX> patron_C 4 10 1.3 3650 3 5 1 1 3650 1 1
Iterations:          1
Min:           21.0263
Max:           21.0263
Mean           21.0263

21.0263 21.0263 21.0263 21.0263 21.0263 21.0263
UNIX> patron_C 4 10 1.3 3650 3 5 1 1 3650 1 2
Iterations:          1
Min:           17.8684
Max:           17.8684
Mean           17.8684

17.8684 17.8684 17.8684 17.8684 17.8684 17.8684
UNIX> 
Those value differ by quite a bit, so if we are going to try to draw conclusions, we'll need to make multiple runs. To assess this, the following graph shows the results of ten runs for each of the given number of iterations:

Clearly, using fewer than 1000 iterations does not give you a good average value. Were I trying to publish this data, I'd be using at least 100,000 iterations per data point. You should also be wary about just presenting average numbers. The following graph compares the two strategies of doing no maintenance and doing yearly maintenance:

The plots are called Tukey plots. They convey much more information than a simple average value. Instead, they show the spread of values from the minimum to maximum. Additionally, there is a yellow box that starts at the end of the first quartile of data, and goes to the end of the third quartile. Also shown are the mean (depicted by a dot) and the median (depicted by hash marks inside the box. The Tukey plots are effective at showing that no maintenance is much better on average than maintenance. It is true that an individual run with no maintenance may show more downtime than an individual run with maintenance. But the average behavior of using no maintenance is clearly better.

Exploring further, perhaps we'd like to see how different maintenance intervals impact the average downtime. The following graph plots this information:

There are a few problems with this graph. First, I'm plotting days on the X axis. Is that what I want? No -- I tend to think about maintenance in terms of yearly intervals, so I should change the X axis to plot the interval in terms of years. Second, where is the point for no maintenance? It's not there, and it should be. I fix both problems below (I include this because many people will write papers and give me graphs like the one above, because they are easy to produce, rather than take the time to modify them so that they are easy to read):

Much better. Now, why the steps in the curve? The reason is that each step downward represents one fewer maintenance operation. For example, when the interval is between five and ten years, there is just one maintenance operation. When the interval is between 3.4 and five years, there are two. Each maintenance interval adds significant downtime to the simulation. However, if you look just at the curve when there is only one maintenance interval, you'll see that there is less downtime on the five year side than there is on the ten year side. Why? Because both perform the same amount of maintenance, but when maintenance is performed just after five years, that helps to prevent failures from five to ten years. When maintenance is performed just before ten years, you don't prevent any of the failures from five to ten years, but you still absorb the maintenance overhead. Regardless, performing no maintenance gives you the smallest amount of downtime.

Our next exploration is to see what happens when the maintenance time is reduced. The following graph plots the downtime of performing yearly maintenance when the maintenance time ranges from a little more than an hour to a day:

The graph shows that when the maintenance time is sufficiently small (less than 5.5 hours or so), that yearly maintenance results in less average downtime than no maintenance.

How about messing with βf? Does this impact performance? We plot the results below:

βf most definitely impacts the effectiveness of maintenance. Increasing βf makes maintenance more effective. Why? Because increasing βf shifts the random numbers so that they are closer to the mean -- there are fewer small random numbers, so there are fewer failures generated before scheduled maintenance.

One implication of this is that you need to have confidence that the parameters of your simulation match what you're trying to simulate. If you use βf equal to 1.3, but in reality it is 3, then you will draw incorrect conclusions from your simulation. How do you assure that you have the right parameters? We'll talk about it in class (It's certainly outside the scope of this class, but something that you should think about. You'd be amazed how much academic research gets published that draws conclusions based on simulation parameters that have no justifiable basis in fact....)