CS302 Lecture Notes - Event-based simulation of failures and repair in a production environment

Day 1


We're going to perform an event-based simulation in this lecture. This is a technique that you can use to evaluate a real system if you can parameterize it decently. I asked in class what you would like your simulation to simulate, and y'all suggested a tequila-producing plant, so we'll go with that.

Our tequila plant is composed of distillation units (DU's), each of which produces tequila. These units fail according to a Weibull distribution whose rate is parameterized by βF, ηF and γF (see the lecture notes on Random Number Generation for an explanation of Weibulls). For example, suppose our units have a lifetime of 10 years with a mildly increasing failure rate and no minimum. Then we'll have ηF = 10 years, βF = 1.3 and γF = 0 years.

When a unit fails, its repair is also governed by a Weibull parameterized by βR, ηR and γR. For example, let's use ηR = 5 days, βR = 3 and γR = 1 day. This says it always takes at least a day, and after that day, the characteristic repair time is 5 days, and repairs are distributed in roughly a normal distribution (βR = 3).

Suppose also that you can enroll in a maintenance plan where a maintenance specialist comes every mi days, and spends mt days on each unit performing maintenance. When the specialist is maintaining a unit, it does not produce tequila. When the specialist is finished with a unit, it is started afresh.

We want to evaluate what maintenance schedule would be best for our system, and to do that, we will perform an event-based simulation. We will choose failure and repair times using drand48() and simulate the system in action for a duration of time. Moreover, we will keep track of how long each DU is usable and unusable, so that we can report how successful the maintenance schedule is.

We can represent each distillation unit as going though the following states:

Note that events take us from state to state. Our simulation will represent the states that the DU's go through, and will generate the events that take us from state to state. Thus, our simulation is event-based, where we will have the following events:

We will have a data structure for each DU, which records its identity and its state (working, failed, under maintenance). We will start with each DU in the working state. Then, we will generate when in time each DU will fail, turn that into a failure event, and put it on an event queue. We then go into a loop where we simulate time going forward -- we grab the event in the event queue that is going to happen next, take it off, and simulate it. This may involve updating a DU data structure, and will also usually involve adding one or more new events to the event queue.


Patron_1.cpp - the simulation framework

We're going to build our simulation from the ground up. We start with the program patron_1.cpp. First, let's look at the data structures. We have three states for a DU -- Up, Failed and Under Maintenance. and five event types: DU Failed, DU Repaired, DU Maintenance Begin, DU Maintenance End, and Simulation Over. You should be able to map these states and events to the picture above.

Our DU's have two pieces of data -- their name their state. They will all start in the Up state. We also have an event data structure, which contains the time of the event, its type, and the id of the DU on which the event is acting. Note, the Simulation_Over event acts on the system and not on an individual DU..

Finally, we have a class definition of our simulation system, which consists of all of the input data, a vector of DU's, an Event Queue (which is a multimap since it is sorted), and a simulation time. Our basic process will be to put events on the event queue, then to process the event queue by simulating time going forward. We select the event on the queue with the lowest time, and then we "process" that event, which moves time forward to the time of the event. In processing that event, we may generate other events, etc.

Our Simulation system implements three methods -- a constructor, which initializes the system from argc and argv, a Run() method which runs the simulation, processing the Event Queue, and a Print_Event() method which prints out an event:

#include <iostream>
#include <vector>
#include <map>
using namespace std;

enum DU_State { Up, Failed, Under_M };
enum Event_Type { DU_Failed, DU_Repaired, DU_Maint_Begin, DU_Maint_End, Simulation_Over };

class Dist_Unit {
  public:
    DU_State state;
    int id;
};

class Event {
  public:
     Event_Type type;
     Dist_Unit *du;
     double time;
};

typedef multimap <double, Event *> EQ;

class Simulation_System {
  public:
    Simulation_System(int argc, char **argv);
    void Run();
    void Print_Event(Event *e);
  protected:
    int n;
    double duration;
    double beta_f;
    double eta_f;
    double beta_r;
    double eta_r;
    double gamma_r;
    double mi;
    double mt;
    vector <Dist_Unit *> dus;
    EQ eventq;
    double simtime;
};

void Simulation_System::Print_Event(Event *e)
{

}

Simulation_System::Simulation_System(int argc, char **argv)
{
  int i;
  Dist_Unit *du;
  Event *e;

  if (argc != 10) {
    cerr << "usage: patron n duration(years) beta_f eta_f beta_r eta_r gamma_r maintenance_time maintenance_interval\n";
    cerr << "       all other units are days\n";
    exit(1);
  }
   
  if (sscanf(argv[1], "%d", &n) != 1 || n <= 0) { cerr << "Bad n" << endl; exit(0); }
  if (sscanf(argv[2], "%lf", &duration) != 1 || duration <= 0) { cerr << "Bad duration" << endl; exit(0); }
  duration *= 365.0;
  if (sscanf(argv[3], "%lf", &beta_f) != 1 || beta_f <= 0) { cerr << "Bad beta_f" << endl; exit(0); }
  if (sscanf(argv[4], "%lf", &eta_f) != 1 || eta_f <= 0) { cerr << "Bad eta_f" << endl; exit(0); }
  if (sscanf(argv[5], "%lf", &beta_r) != 1 || beta_r <= 0) { cerr << "Bad beta_r" << endl; exit(0); }
  if (sscanf(argv[6], "%lf", &eta_r) != 1 || eta_r <= 0) { cerr << "Bad eta_r" << endl; exit(0); }
  if (sscanf(argv[7], "%lf", &gamma_r) != 1 || gamma_r < 0) { cerr << "Bad gamma_r" << endl; exit(0); }
  if (sscanf(argv[8], "%lf", &mt) != 1 || mt <= 0) { cerr << "Bad mt" << endl; exit(0); }
  if (sscanf(argv[9], "%lf", &mi) != 1 || mi <= 0) { cerr << "Bad mi" << endl; exit(0); }

  for (i = 0; i < n; i++) {
    du = new Dist_Unit;
    du->id = i;
    du->state = Up;
    dus.push_back(du);
  }

  simtime = 0;

  e = new Event;
  e->type = Simulation_Over;
  e->time = duration;
  e->du = NULL;

  eventq.insert(make_pair(e->time, e));
}

void Simulation_System::Run()
{
  EQ::iterator eqit;
  Event *e;

  while(1) {
    eqit = eventq.begin();
    if (eqit == eventq.end()) {
      cerr << "Event Queue is empty\n";
      exit(1);
    }
    e = eqit->second;
    eventq.erase(eqit);
    switch(e->type) {
      default:
        cerr << "Event not implemented " << e->type << endl;
        exit(1);
    }
  }
}

main(int argc, char **argv)
{
  int ndu;
  int i;
  Simulation_System *ss;
  double simtime, mttf, mttr;

  ss = new Simulation_System(argc, argv);
  ss->Run();
}

Note, we haven't implemented Print_Event() yet, and our Run() procedure does nothing. However, the initialization is all performed by the constructor, and a Simulation_Over event is put on the Event Queue. To run the program, we need to enter the number of DU's, the duration of the simulation, and then all the event generation parameters:

UNIX> patron_1 4 10 1.3 3650 3 5 1 1 3651
Event not implemented 4
UNIX> 
As you can see, in the Run() method, the event is pulled off the queue, but it is not implemented, so you get an error.

Patron_2.cpp: Implementing Simulation_Over and Print_Event

Here are the relevant parts of patron_2.cpp

void Simulation_System::Print_Event(Event *e)
{

  printf("%10.4lf ", e->time);
  switch(e->type) {
    case DU_Failed: printf("%-15s", "DU_Failed"); break;
    case DU_Repaired: printf("%-15s", "DU_Repaired"); break;
    case DU_Maint_Begin: printf("%-15s", "DU_Maint_Begin"); break;
    case DU_Maint_End: printf("%-15s", "DU_Maint_End"); break;
    case Simulation_Over: printf("%-15s", "Simulation_Over"); break;
    default: cerr << "This shouldn't happen\n"; exit(1); break;
  }
  if (e->type != Simulation_Over) {
    printf("%3d\n", e->du->id);
  } else {
    printf("\n");
  }
}

...

void Simulation_System::Run()
{
  EQ::iterator eqit;
  Event *e;

  while(1) {
    eqit = eventq.begin();
    if (eqit == eventq.end()) {
      cerr << "Event Queue is empty\n";
      exit(1);
    }
    e = eqit->second;
    eventq.erase(eqit);
    simtime = e->time;
    Print_Event(e);
    switch(e->type) {
      case Simulation_Over: 
        return;
        break;
      default:
        cerr << "Event not implemented " << e->type << endl;
        exit(1);
    }
  }
}

As you can see, we've now implemented printing Events, and we print each event when we remove it from the Event Queue. We've also implemented the Simulation_Over event by returning from the procedure. When we run it, our output shows the processing of the one event:

UNIX> patron_2 4 10 1.3 3650 3 5 1 1 3651
 3650.0000 Simulation_Over
UNIX> 

Patron_3.cpp: Generating Failure and Repair Events

In patron_3.cpp, we make several changes to generate failures and process them in the Run() method. Moreover, we implement a method Print_State() which prints the states of all DU's and the Event Queue. This is very helpful for debugging. We call it at the top of the while() loop in the Run method.

Next, we have the following code to generate a Weibull random number. You can cut and paste this for your Lab 5:

double gen_weibull(double beta, double eta, double gamma)
{
  double tmp1, tmp2;

  tmp1 = -log(1.0 - drand48());
  tmp2 = pow(tmp1, 1.0/beta);
  return eta * tmp2 + gamma;
}

Next, we have the following code to generate failure events for each DU. This is in the constructor:

Simulation_System::Simulation_System(int argc, char **argv)
{
  ...

  for (i = 0; i < n; i++) {
    e = new Event;
    e->type = DU_Failed;
    e->time = gen_weibull(beta_f, eta_f, 0.0);
    e->du = dus[i];
    eventq.insert(make_pair(e->time, e));
  }
}

And we have the following code in Run() to process DU_Failed events. What they do is set the state of the DU to failed, then generate a DU_Repaired event according to the Repair Weibull:

void Simulation_System::Run()
{
   ...
      case DU_Failed: 
        du = e->du;
        du->state = Failed;
        e->type = DU_Repaired;
        e->time = simtime + gen_weibull(beta_r, eta_r, gamma_r);
        eventq.insert(make_pair(e->time, e));
        break;
  ...
}

When we run it, we see the state of the Event Queue at the startup time. All disks are in state 0 (Up), and there are failure events for all four DU's. The first is for #2 at time 627.3349. The last is for DU 3 at time 6325.4670, which is after the simulation is over:

UNIX> patron_3 4 10 1.3 3650 3 5 1 1 3651
--------------------------------
Simulation Time: 0
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
  627.3349 DU_Failed        2
 1006.3660 DU_Failed        0
 3650.0000 Simulation_Over
 4691.5675 DU_Failed        1
 6325.4670 DU_Failed        3
--------------------------------
  627.3349 DU_Failed        2
--------------------------------
Simulation Time: 627.335
DU 0's state is 0
DU 1's state is 0
DU 2's state is 1
DU 3's state is 0
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 3650.0000 Simulation_Over
 4691.5675 DU_Failed        1
 6325.4670 DU_Failed        3
--------------------------------
  633.0918 DU_Repaired      2
Event not implemented 1
UNIX> 
When the first failure event is processed, the state of DU #2 changed to 1 (Failed), and a repair event is generated for time 633.0918. When that event is processed by Run(), the event is not implemented and the simulation exits.


Patron_4.cpp: Processing Repair Events

In patron_4.cpp, we implement repair, which is very similar to failure -- we reset the state of the DU and generate a new failure event. Also, the program takes the Print_State() call out of Run() and instead calls it from the main() routine when the simulation is over. Here's the processing code from Run(): patron_4.cpp

void Simulation_System::Run()

...
      case DU_Repaired: 
        du = e->du;
        du->state = Up;
        e->type = DU_Failed;
        e->time = simtime + gen_weibull(beta_f, eta_f, 0.0);
        eventq.insert(make_pair(e->time, e));
        break;
...
}

And here is an example where three failure and repair events are processed (two of them on DU #0):

UNIX> patron_4 4 10 1.3 3650 3 5 1 1 3651
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 1012.6471 DU_Repaired      0
 3021.4310 DU_Failed        0
 3028.8040 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
--------------------------------
UNIX> 

Patron_5.cpp: Monitoring Up and Downtime Incorrectly

In patron_5.cpp, we add uptime, downtime, and last-event-time variables to the DU's:

class Dist_Unit {
  public:
    double lasteventtime;
    double uptime;
    double downtime;
    DU_State state;
    int id;
};

And we modify them when we change state in Run(): We add to the uptime when we see a failure, and we add to the downtime when we see a repair:

void Simulation_System::Run()
{
...
      case DU_Failed: 
        du = e->du;
        du->state = Failed;
        du->uptime += (simtime - du->lasteventtime);
        e->type = DU_Repaired;
        e->time = simtime + gen_weibull(beta_r, eta_r, gamma_r);
        du->lasteventtime = simtime;
        eventq.insert(make_pair(e->time, e));
        break;
      case DU_Repaired: 
        du = e->du;
        du->state = Up;
        du->downtime += (simtime - du->lasteventtime);
        e->type = DU_Failed;
        e->time = simtime + gen_weibull(beta_f, eta_f, 0.0);
        du->lasteventtime = simtime;
        eventq.insert(make_pair(e->time, e));
        break;
...
}

In Print_State() we add them up and print them out:

void Simulation_System::Print_State()
{
  int i;
  Event *e;
  EQ::iterator eqit;
  double uptime, downtime;

  ...

  uptime = 0;
  downtime = 0;
  for (i = 0; i < n; i++) {
    uptime += dus[i]->uptime;
    downtime += dus[i]->downtime;
  }
  cout << "Uptime: " << uptime << endl;
  cout << "Downtime: " << downtime << endl;
  cout << "--------------------------------\n";
}

When we run this on our example, like dutiful programmers, we double-check the output:

UNIX> patron_5 4 10 1.3 3650 3 5 1 1 3651
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 1012.6471 DU_Repaired      0
 3021.4310 DU_Failed        0
 3028.8040 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
Uptime: 3642.48
Downtime: 19.4109
--------------------------------
UNIX> 
Hmmm -- the up and downtimes should add to 3650*4 = 14600, and they're not even close. What's going on?

Patron_6.cpp: Monitoring Up and Downtime Correctly

Well, we calculate up and downtimes when a failure or repair event occurs, but when we call Print_State(), time has passed since the last failure and repair event for each DU, and we haven't reflected that. So, we need to modify the code in Print_State() to account for the time from the last event to the present for each DU. This is done in patron_6.cpp:

void Simulation_System::Print_State()
{
  ...
  uptime = 0;
  downtime = 0;
  for (i = 0; i < n; i++) {
    uptime += dus[i]->uptime;
    downtime += dus[i]->downtime;
    if (dus[i]->state == Up) {
      uptime += (simtime - dus[i]->lasteventtime);
    } else {
      downtime += (simtime - dus[i]->lasteventtime);
    }
  }
  cout << "Uptime: " << uptime << endl;
  cout << "Downtime: " << downtime << endl;
  cout << "--------------------------------\n";
}

Now, when we test it all looks as it should:

UNIX> patron_6 4 10 1.3 3650 3 5 1 1 3651
  627.3349 DU_Failed        2
  633.0918 DU_Repaired      2
 1006.3660 DU_Failed        0
 1012.6471 DU_Repaired      0
 3021.4310 DU_Failed        0
 3028.8040 DU_Repaired      0
 3650.0000 Simulation_Over
--------------------------------
Simulation Time: 3650
DU 0's state is 0
DU 1's state is 0
DU 2's state is 0
DU 3's state is 0
 4691.5675 DU_Failed        1
 5723.1425 DU_Failed        2
 6325.4670 DU_Failed        3
 7670.7184 DU_Failed        0
Uptime: 14580.6
Downtime: 19.4109
--------------------------------
UNIX> echo "" | awk '{ print 14580.6+19.4109 }'
14600
UNIX> 

This is where we stopped in class on thursday -- we'll pick it up again on Tuesday: