- September 22, 2009
- James S. Plank
- Directory:
**/home/plank/cs302/Notes/Random**

When we generate a sequence of random numbers, there need to be parameters that say what properties the sequence has. We usually assume a "uniform" distribution between two numbers, meaning every value in that range is equally likely.

A **probability distribution**
defines the characteristics of generating random numbers. One
way to characterize this is
with a **cumulative
distribution function** (CDF), defined as: *F(x)* = the probability that a random number is
less than or equal to *x* for all possible *x*.

Think about a uniform distribution whose minimum value is *a* and whose maximum
value is *b*. What is *F(b)*? It is one, since all values are less than or
equal to *b*. The mean of *a* and *b* is *(a+b)/2*. So what about *F((b+a)/2)*?
It is 0.5, since half of the numbers will be below the mean and half will be above the mean.

We plot CDFs with the *x* values on the x-axis, and the *F(x)* values on the y-axis.
Note, the y-axis therefore only goes from zero to one, and the CDF has to be a monotonically
increasing (more precisely, non-decreasing) function.
Here is the CDF for the uniform distribution
whose minimum value is *a* and whose maximum value is *b*:

Mathematically, this is:

*F(x) = (x - a) / (b - a)*, for *a ≤ x ≤ b*.

We can use a CDF to generate random numbers with **drand48()**, which returns a uniformly
distributed number between zero and one. Let *y* be a number generated with **drand48()**.
What we do to generate the number according to a CDF is to find the value of *x* for
which *F(x) = y*, and *x* is our random number.

Try it for the uniform distribution above. Suppose we generate *y* = **drand48()**.
We need to find the value of *x* for which:

*y = (x - a) / (b - a)*

Using algebra, we can massage that into:

*x = y(b - a) + a*

So when we want to emit random numbers, we do:

```
number = drand48() * (b - a) + a;
```

If you think about it, that is how you would generate uniformly distributed random numbers between *a* and *b*.

The CDF for an exponential distribution is defined for *x ≥ 0*, and is:

*F(x) = 1 - e^{-λx}*

Here are CDFs for three values of λ:

Note a few things -- the function is asymptotic, meaning that any value of *x*, from one
to a gazillion, can occur. It's just that the gazillion has a *really* low probability
of occurring.

The mean value is *1/λ*.
However, in looking at the graph, you will see that well over half of the values generated will be below the mean: the fraction is *1 - 1/e*, roughly 63 percent.
This means that when you buy a light bulb that is rated to last for 1000 hours, chances are about 63
percent that it will fail before 1000 hours. However, there's also a chance that the light
bulb will last for 1000 years. It's just a really small chance.

We can generate values from an exponential distribution by using the CDF, just as we did above. Setting *y = F(x)* and solving:

*y = 1 - e^{-λx}*, so *e^{-λx} = 1 - y*, so *x = -ln(1 - y)/λ*

So, to generate random numbers according to an exponential, we do:

```
number = -1.0 * log(1.0 - drand48()) / lambda;
```

How about we try this out? Look at the following program:

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#include <math.h>
#include <string>
#include <set>
#include <map>
using namespace std;

int main(int argc, char **argv)
{
  double lambda;
  int iterations;
  int i;

  if (argc != 3 ||
      sscanf(argv[1], "%lf", &lambda) != 1 || lambda <= 0 ||
      sscanf(argv[2], "%d", &iterations) != 1) {
    cerr << "usage: expon lambda iterations\n";
    exit(1);
  }

  srand48(time(0));
  for (i = 0; i < iterations; i++) {
    cout << log(1 - drand48()) / lambda * -1 << endl;
  }
  return 0;
}
```

Let's try it out on a simulation scenario. Suppose we buy a crate of 1000 light bulbs, each of which has an average lifetime of 1000 hours. We can simulate the failure time of each bulb by choosing 1000 numbers from an exponential distribution where λ = 1/1000 = 0.001:

```
UNIX> expon 0.001 1000 > bulb.txt
UNIX> head bulb.txt
286.928
120.251
1234.99
185.829
4899.49
2493.85
3980.23
1912.57
697.718
687.042
UNIX> sort -n bulb.txt | head -n 5
1.06986
1.20739
2.45384
2.62195
2.73788
UNIX> sort -n bulb.txt | tail -n 5
5608.55
5671.58
5792.11
7189.54
8034.5
UNIX>
```

Suppose we sort **bulb.txt**, and plot the numbers as a graph: the y-axis will
be the line number of the file, and the x-axis will have the value on that line:

Think to yourself -- how is that related to the CDF for an exponential? Well, divide the values
on the y-axis by 1000 and what do you get? You should get an approximation of the CDF for
λ = 0.001. To show that, the graph below plots two lines: the previous graph with the
y-axis divided by 1000, and the CDF function *y = 1 - e^{-0.001x}*:

Awesome, no?

The **Weibull distribution** generalizes the exponential, and has three parameters:

- η is the *lifetime* -- sometimes called a *scale*. Wikipedia calls it λ, but I find that confusing because it is not the same λ as an exponential's, so I'm using an alternate notation that I've seen in other papers.
- β is a *shape*, which describes whether failures/events increase or decrease over time:
  - If β equals one, then the Weibull is the same as an exponential with *λ = 1/η*. This makes η the expected value, which makes sense.
  - If β is greater than one, then failures/events increase over time.
  - If β is less than one, then failures/events decrease over time.
- γ is a minimum value.

The CDF for a Weibull is defined for *x ≥ γ*:

*F(x) = 1 - e^{-((x - γ)/η)^β}*

So, to generate random numbers, you set *y = F(x)* and solve for *x*:

*x = η(-ln(1 - y))^{1/β} + γ*

Yuck. How do you compute that power of *1/β*? Use the **pow()**
math routine (**man pow**).
That allows us to compute a Weibull:

```
tmp1 = -log(1.0 - drand48());
tmp2 = pow(tmp1, 1.0/beta);
return eta * tmp2 + gamma;
```

Use this exact generator in your labs.

You can test it in
**weibull.cpp**

```cpp
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <iostream>
#include <math.h>
#include <string>
#include <set>
#include <map>
using namespace std;

int main(int argc, char **argv)
{
  double eta, beta, tmp1, tmp2, gamma;
  int iterations;
  int i;

  if (argc != 5 ||
      sscanf(argv[1], "%lf", &beta) != 1 || beta <= 0 ||
      sscanf(argv[2], "%lf", &eta) != 1 || eta <= 0 ||
      sscanf(argv[3], "%lf", &gamma) != 1 || gamma < 0 ||
      sscanf(argv[4], "%d", &iterations) != 1) {
    cerr << "usage: weibull beta eta gamma iterations\n";
    exit(1);
  }

  srand48(time(0));
  for (i = 0; i < iterations; i++) {
    tmp1 = -log(1.0 - drand48());
    tmp2 = pow(tmp1, 1.0/beta);
    cout << tmp2 * eta + gamma << endl;
  }
  return 0;
}
```

Rather than generate these histograms mathematically, as we would do in a course
on probability, we're going to generate them empirically. The program
**histogram.cpp** takes numbers on standard input
and turns them into histograms of a given granularity. We use this program
to take a look at a random number generator that generates numbers uniformly
between zero and 2000. The command used to generate the graph below is:

```
UNIX> uniform 0 2000 1000000 | histogram 100
```

The histogram breaks up the random numbers into "bins" of size 100. Thus, the numbers 0-100 are in the first bin, 100-200 are in the second bin, etc. The Y-axis plots the fraction of all numbers generated in each bin. Since there are 20 bins in the above graph, and numbers are generated uniformly, we would expect each bin to hold roughly 1/20 = 0.05 of the numbers, and indeed, the histogram shows that.

Let's now turn our attention to the exponential, which is equivalent to the Weibull with β equal to one. We use the following command to generate the following histogram:

```
UNIX> weibull 1 1000 0 1000000 | histogram 100
```

See how fundamentally different the exponential is from the uniform number generator? The bin with the most values is the smallest -- nearly 1/10th of the values are between zero and 100, even though the mean is 1000.

Let's increase β to 1.3 and see what it does to the histogram:

It shifts the values toward the mean. In other words, more of the values generated are closer to the mean. There are fewer values in the 0-100 bin, and more in the bins closer to 1000. If we increase β to 3, we see something approximating a bell curve:

Increasing β to 50, we see that the values become completely concentrated around the mean:

If we view Weibulls as generating failure data, we say that increasing β "increases the failure rate": as time gets closer to the mean, if you have not had a failure yet, your probability of failing soon increases, and it does so more sharply as β increases.

What about decreasing β? We plot three histograms below. First, β = .8:

Then β = .3:

And finally β = .05:

Each one concentrates values further towards zero. However, remember that the mean must still be 1000. This means that there are some really big values generated (or more precisely that the really big bins probably have higher frequencies than their counterparts for larger β). This is sometimes called a "decreasing" failure rate.

Does that give you a better intuitive feel than the histograms? I don't think so, although all of the conclusions that we drew from the histograms are present in the CDFs. Most are just harder to see. One that is easier to see is when β equals 0.05. Just above, we said that the really big values have higher frequencies than their counterparts for larger β. That is clear from the CDF, and less clear from the histogram.

I didn't expect CDF(1000) to be independent of β, but if you look at the math, you'll see that it is: with γ = 0 and η = 1000, *F(1000) = 1 - e^{-(1000/1000)^β} = 1 - e^{-1} ≈ 0.632*, no matter what β is. Interesting.