Key Words: covariance function, aggregate data, spatial-temporal process
The value of such functions can be predicted by first estimating the dependence structure of the underlying stochastic process. Our proposed method estimates the covariance function from the integrals of a stationary isotropic stochastic process by posing the problem as a set of integral equations. To test this proposal we applied it to epidemiological data on the incidence rates of three diseases in the United States between 1980 and 1994. The spatial correlations obtained in this way reasonably described the mechanisms by which those diseases spread. We therefore conclude that covariance functions can be reliably estimated from aggregate observations. Such estimates provide valuable insight into the nature of the space-time process; for the epidemiological data they suggested a possible mechanism by which each disease spreads.
In many instances, observations collected in an experiment or study represent averages of measurements taken over a period of time and/or over a region in space. Examples include the number of cases of a disease, population size, income, and other epidemiological, socio-economic, or physical quantities. Such a quantity can be approximated by modeling it as a stochastic process over continuous space and time, and there are several ways to create such a model. We consider ways to estimate the covariance function of such a process and to predict its values.
The modeling has three objectives: to estimate the covariance function of the process, to predict the values of the process, and to implement the results in efficient software suitable for interactive exploration.
The last objective is the most demanding as it requires efficient software implementation of the results of the first two problems. An attempt to merge exploratory analysis of the considered type of data with appropriate modeling tools has led to the creation of a software program called MIASMA (Multivariate Interactive Animation System for Map Analysis, see Eddy and Mockus 1995). The software implementation of covariance function estimation described in this paper is being integrated into the MIASMA system.
A problem of spatial estimation from aggregate data was considered by Tobler (1979) in the context of estimating population density. That paper contains a bibliography from the field of applied geography. The interpolation proposed by Tobler is a numerical solution of a Dirichlet problem with boundary constraints. The method is designed to produce an interpolant under such geographic constraints as lakes, mountains, and deserts. Dyn and Wahba (1982) describe a thin plate histospline approach for a similar problem. Clayton and Kaldor (1987) describe an empirical Bayes estimator for disease incidence rates. Available modeling methods include a kernel type smoothing method described in Eddy and Mockus (1994). Knowledge of the covariance function provides important insights into the nature of the space-time process, but none of the above methods considers ways to estimate the dependence structure.
After describing our datasets, we will briefly consider prediction, then discuss estimation of the dependence structure, and conclude by analyzing the spread of three diseases.
The data were obtained from the National Notifiable Disease Surveillance System. The dataset contains weekly reports, by state, on 57 diseases from 1980 to 1994. There are 783 report weeks in this period, and the reports are provided for the 50 states, the District of Columbia, 3 territories, and New York City. We analyzed 3 common diseases: Hepatitis A, Hepatitis B, and Tuberculosis.
Table 1 lists the 3 diseases and the total number of reported cases from 1980 to 1994.
Disease Name     Total Number of Cases
Hepatitis A      328964
Hepatitis B      280586
Tuberculosis     327441
The dataset can be shown as multiple time series for each state and
as multiple disease incidence maps for each week. The MIASMA tool is
specially designed to explore such data interactively. Here we can
only illustrate the overall time trends. Figure 1 shows
the weekly average incidence rates (new cases per week per 100,000
people averaged over states) for the 3 diseases. All of the
diseases are seasonal and show a nontrivial trend.
[Figure 1: weekly average incidence rates for the three diseases.]
The disease incidence rates can be modeled as a smooth function over the territory of the United States and over the time period considered. Of interest is the number of cases of each disease each week in each state. Those quantities represent integrals of the incidence rate function with respect to population density over the territories of individual states and over weeks.
More formally, we are interested in determining a function $f$ from its integrals
$$z_i = \int_{A_i} f(s)\,dP(s),$$
where the $A_i$'s are subsets of a finite set $A \subset R^d$ and $P$ represents population density. We can model the function as a stochastic process. Let $f(s)$ be a stochastic process with the index $s \in A$. The $z_i$ are observed integrals of a single realization of $f$. The simplest predictor for the process $f$ could then be the piecewise constant function
$$\hat{f}(s) = \frac{z_i}{P(A_i)} \quad \text{for } s \in A_i,$$
where $P(A_i) = \int_{A_i} dP(s)$.
Since we intend to use the predictor in animations, any
discontinuity in such a function would be distracting and must be
avoided (see Eddy and Mockus 1994). Thus we must consider methods
that produce a continuous predictor.
In cases when the joint distribution of $(f(s), z_1, \ldots, z_N)$ is Gaussian, the quantity $E[f(s) \mid z_1, \ldots, z_N]$ is a well known linear function of the $z_i$'s given by the formula for the conditional normal distribution. Let $\mu(s)$ be the mean of the process, $m_i = E[z_i]$ be the means of the observed integrals, $c_i(s) = \mathrm{Cov}(f(s), z_i)$, and $c_{ij} = \mathrm{Cov}(z_i, z_j)$. Then
$$E[f(s) \mid z] = \mu(s) + c(s)^{T} (c_{ij})^{-1} (z - m), \qquad (1)$$
where $z$ denotes the vector of all $z_i$'s, $c(s)$ denotes the vector of all $c_i(s)$'s, $m$ denotes the vector of all $m_i$'s, and $(c_{ij})$ is the covariance matrix. In non-Gaussian cases when we know only the first two moments of the vector $(f(s), z_1, \ldots, z_N)$, Equation (1) defines the best linear unbiased predictor for $f(s)$.
Equation (1) is a simple form of equations used in spatial
prediction and is referred to as the kriging equation (see, for
example, Cressie 1991).
Equation (1) involves the quantities $c_i(s)$ and $c_{ij}$ that are unknown. In space-time prediction we assume the trend $\mu(s)$ to be constant in the spatial dimensions (a constant which varies over time), and we can estimate $c_i(s)$ and $c_{ij}$ as described in Section 4.
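As a concrete sketch, Equation (1) is a few lines of linear algebra once the moments are available. The function below is our illustration (the names and toy numbers are ours, not from the paper), assuming the means and covariances have already been estimated:

```python
import numpy as np

def blup(mu_s, m, c_s, C, z):
    """Best linear unbiased predictor of f(s) from aggregate
    observations z (Equation (1)): mu(s) + c(s)' C^{-1} (z - m).

    mu_s : prior mean of the process at location s
    m    : vector of means of the observed integrals z_i
    c_s  : vector of covariances c_i(s) = Cov(f(s), z_i)
    C    : covariance matrix (c_ij) = Cov(z_i, z_j)
    z    : vector of observed integrals
    """
    # Solve C w = c(s) instead of forming C^{-1} explicitly.
    w = np.linalg.solve(C, c_s)
    return mu_s + w @ (np.asarray(z) - np.asarray(m))

# Toy usage with a single observation:
C = np.array([[2.0]])
c_s = np.array([2.0])          # Cov(f(s), z_1)
print(blup(0.0, [1.0], c_s, C, [3.0]))  # 0 + (2/2)*(3 - 1) = 2.0
```

Solving the linear system rather than inverting $(c_{ij})$ is the standard numerically stable choice.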
The next step is to estimate the geographic and time dependence structure of the stochastic process of disease incidence rates. There are a number of techniques for estimating mean and covariance functions from observations representing point values of a process (see, for example, Delfiner 1976, Kitanidis 1983, Cressie 1985, Marshall and Mardia 1985).
In our case the observations are in the form of integrals of the process $f$ described in Section 3. Surprisingly, the estimation of the covariance function of a continuous process from integral observations has not previously been investigated. To illustrate our method we assume that the process $f$ is zero-mean and stationary. We are interested in the covariance function $\gamma$ of the process $f$. It should be noted that the symbol $\gamma$ is sometimes used to denote the semi-variogram, e.g. Cressie (1993), but here we use it to denote a covariance function. We obtain the estimate of $\gamma$ by solving appropriate integral equations (a general statistical perspective on solving integral equations can be found in O'Sullivan 1986).
We will describe an efficient algorithm to estimate $\gamma$ when $\gamma$ is an isotropic covariance function (isotropic means that $\gamma(s_1, s_2)$ is a function of only $\|s_1 - s_2\|$, where $\|\cdot\|$ is Euclidean distance) and each $A_i$ is a polygon, or a prism with a polygon base.
In Subsection 4.1 integral equations are obtained for a covariance function, and the method used to estimate the covariance function from those equations is then presented. The two main obstacles to an efficient numeric solution are the large number of equations and the requirement that the solution lie in the space of positive definite functions; the second requirement arises because we want our solution to be a valid covariance function. We address the first problem by designing a fast algorithm to compute the integrals of interest and the second by re-expressing the isotropic covariance function in terms of a nondecreasing function.
The products $z_i z_j$ approximate the integrals
$$E[z_i z_j] = \int_{A_i}\int_{A_j} \gamma(s_1, s_2)\,dP(s_1)\,dP(s_2).$$
(For a stationary process, $\gamma(s_1, s_2)$ is a function of only $s_1 - s_2$.) We can estimate $\gamma$ by solving an inverse problem for $\gamma$:
$$z_i z_j = \int_{A_i}\int_{A_j} \gamma(s_1 - s_2)\,dP(s_1)\,dP(s_2). \qquad (2)$$
In the special case when the $A_i$'s are regularly spaced, the appropriate $z_i z_j$ could be averaged, reducing the number of equations. However, in the general case, it is not clear how to reduce the total of $N(N+1)/2$ equations.
For an isotropic and stationary process, $\gamma(s_1 - s_2) = \gamma(\|s_1 - s_2\|)$, where $\|\cdot\|$ is Euclidean distance, and thus Equation (2) can be rewritten as
$$z_i z_j = \int_0^\infty \gamma(l)\, W_{A_iA_j}(l)\,dl, \qquad (3)$$
where $l$ denotes distance and the kernel $W_{A_iA_j}(l)$ is the density over $l$ of the product measure $dP(s_1)\,dP(s_2)$ restricted to pairs $s_1 \in A_i$, $s_2 \in A_j$ with $\|s_1 - s_2\| = l$.
There are two important difficulties when trying to solve Equation (3) with respect to $\gamma$: the kernels $W_{A_iA_j}(l)$ must be computed for a large number of region pairs, and the solution must be a positive definite function, i.e., a valid covariance function.
To address the first problem we use a fast algorithm to compute the kernel function $W_{A_iA_j}(l)$. An efficient algorithm to calculate the kernels $W_{A_iA_j}$ is described by Mockus (1994) for cases when the regions $A_i$ are polygons in $R^2$ or prisms with a polygon base in $R^3$.
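The exact polygon algorithm of Mockus (1994) is beyond the scope of a short sketch, but the kernel can be approximated by Monte Carlo: draw paired points from the population measure on each region and histogram their distances. Everything below (the function names, the unit-square stand-in regions, uniform density) is illustrative only:

```python
import numpy as np

def mc_kernel(sample_i, sample_j, bins, n=2000, seed=0):
    """Monte Carlo approximation of the kernel W_{A_i A_j}(l):
    the density, over distance l, of the product measure
    dP(s1) dP(s2) on A_i x A_j.  `sample_i`/`sample_j` draw n
    points from P restricted (and normalized) to each region.
    Returns the distance-bin midpoints and the kernel values.
    (A sketch only; Mockus (1994) gives an exact polygon method.)
    """
    rng = np.random.default_rng(seed)
    s1 = sample_i(rng, n)                    # n x 2 points in A_i
    s2 = sample_j(rng, n)                    # n x 2 points in A_j
    d = np.linalg.norm(s1 - s2, axis=1)      # n paired distances
    hist, edges = np.histogram(d, bins=bins, density=True)
    mid = 0.5 * (edges[:-1] + edges[1:])
    return mid, hist                         # integrates to 1 over l

# Two unit squares with uniform density, one unit apart.
sq = lambda x0: (lambda rng, n: rng.random((n, 2)) + [x0, 0.0])
mid, w = mc_kernel(sq(0.0), sq(2.0), bins=np.linspace(0.5, 3.5, 31))
```

With unit total mass on each region the histogram integrates to 1; for general $P$ the kernel would be rescaled by $P(A_i)P(A_j)$.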
To address the second problem we may use a parametric family of covariance functions (e.g., $\gamma(l) = b\,e^{-al}$). Another approach is to re-express the covariance function in terms of some simpler function. A form of an isotropic covariance function for dimension $d$ is
$$\gamma(l) = \int_0^\infty \frac{J_{(d-2)/2}(sl)}{(sl)^{(d-2)/2}}\,dG(s), \qquad (4)$$
where $G$ is a distribution function, $J_{(d-2)/2}$ is the Bessel function of the first kind of order $(d-2)/2$, $l$ represents distance, and $s$ is a scale factor; in dimension $d = 2$ the integrand reduces to $J_0(sl)$. The isotropic covariance function is a scale mixture of Bessel functions, and the estimation of such a function could be done by estimating the function $G$ appearing in Equation (4).
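For instance, with a step-function $G$ in dimension $d = 2$, Equation (4) becomes a finite sum of $J_0$ terms. The sketch below (our notation, not the paper's code) evaluates such a discrete mixture:

```python
import numpy as np
from scipy.special import j0

def gamma_mixture(l, jump_locations, jump_sizes):
    """Isotropic covariance in d = 2 as a discrete scale mixture
    of Bessel functions (Equation (4) with a step-function G):
    gamma(l) = sum_k G_k * J0(S_k * l).  Any nonnegative jump
    sizes G_k yield a valid (positive definite) covariance.
    """
    l = np.atleast_1d(np.asarray(l, dtype=float))
    S = np.asarray(jump_locations, dtype=float)
    G = np.asarray(jump_sizes, dtype=float)
    return (G[None, :] * j0(S[None, :] * l[:, None])).sum(axis=1)

# gamma(0) equals the total mass of G (the process variance).
print(gamma_mixture(0.0, [10.0, 20.0], [0.7, 0.3]))
```

Since $J_0(0) = 1$, the value at distance zero is simply the total mass of $G$.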
Reformulating Equation (3) in terms of the function $G$ we get
$$z_i z_j = \int_0^\infty \gamma(l)\, W_{A_iA_j}(l)\,dl
= \int_0^\infty \int_0^\infty \frac{J_{(d-2)/2}(sl)}{(sl)^{(d-2)/2}}\,dG(s)\, W_{A_iA_j}(l)\,dl
= \int_0^\infty V_{A_iA_j}(s)\,dG(s),$$
where
$$V_{A_iA_j}(s) = \int_0^\infty \frac{J_{(d-2)/2}(sl)}{(sl)^{(d-2)/2}}\, W_{A_iA_j}(l)\,dl.$$
To simplify the notation the subscript $A_iA_j$ will become simply $ij$ in the remainder of the paper.
We can look for solutions that minimize the criteria in the following Equations:
$$\min_{\gamma} \sum_{i \le j} C_{ij} \left| z_i z_j - \int_0^\infty \gamma(l)\, W_{ij}(l)\,dl \right|^p \qquad (5)$$
$$\min_{G} \sum_{i \le j} C_{ij} \left| z_i z_j - \int_0^\infty V_{ij}(s)\,dG(s) \right|^p \qquad (6)$$
where the $C_{ij}$ are weights and $p > 0$. When $p=2$ the above criteria give a weighted least squares (WLS) solution. Alternative criteria, such as maximization of the likelihood function or of the posterior probability, would lead to a more complicated set of equations. We chose to solve the simplest Equations (5) and (6).
In this section we discuss ways to construct an estimate of the covariance function. When $p=2$ the optimality criteria in the equations above are quadratic in $\gamma$ and $dG$. Hence, the unconstrained solutions are linear functions (functionals) of the data $(z_i z_j)$. Unfortunately, the constraints require that $\gamma$ be a covariance function and $G$ be a nondecreasing function. To simplify the optimization problem we can perform estimation in two steps: first we do an unconstrained minimization (a simple least squares problem), and then we project the solution into the desired space. However, the final estimate after the projection may be suboptimal. Another approach (used in the examples of the next section) is to perform the optimization entirely within the restricted domain. In that case we have to use nonlinear minimization techniques (see Appendix).
In the following examples we solve Equation (5) for $\gamma$ of a parametric form, i.e., $\gamma(l) = \gamma(l \mid \theta)$. In solving Equation (6) it is convenient to represent $G$ as the sum of an absolutely continuous function $g$ and a step function. Then the integral in (6) can be rewritten as
$$\int_0^\infty V_{ij}(s)\,dG(s) = \int_0^\infty V_{ij}(s)\,g(s)\,ds + \sum_k V_{ij}(S_k)\,G_k,$$
where $G_k = G(S_k) - G(S_k^-)$ are jump sizes and $S_k$ are jump locations. In this form we have the simplest constraints on the parameters ($g(s) \ge 0$ and $G_k \ge 0$).
Since the number of observations $z_i$ is finite, the solutions $\hat\gamma$ and $\hat G$ are not well defined unless we estimate parameterized versions of those functions or add regularization/penalty terms to achieve a unique solution.
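One way to keep the solution well defined is to fix a finite grid of candidate jump locations $S_k$ and estimate only the nonnegative jump sizes $G_k$, which for $p = 2$ reduces Equation (6) to a nonnegative least squares problem. The sketch below assumes $d = 2$ and precomputed kernels; it is our illustration, not the paper's code:

```python
import numpy as np
from scipy.optimize import nnls
from scipy.special import j0

def fit_jumps(y, W, lgrid, S):
    """Estimate jump sizes G_k >= 0 of a step-function G at fixed
    candidate locations S_k, for p = 2 in Equation (6) with d = 2.
    V[m, k] = int J0(S_k * l) W_m(l) dl is computed by quadrature;
    nonnegativity makes the fitted gamma a valid covariance.
    """
    dl = lgrid[1] - lgrid[0]
    V = np.array([[np.sum(j0(s * lgrid) * W_m) * dl for s in S]
                  for W_m in W])
    G, _ = nnls(V, y)      # least squares subject to G >= 0
    return G
```

The grid of $S_k$ plays the role of the parameterization discussed above; a finer grid would typically require a penalty term.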
To illustrate the behavior of the covariances between regions, we first consider an example with simulated data and the covariance function $\gamma(l) = J_0(10\,l)$, where $l$ is distance and $J_0$ is a Bessel function of the first kind. The constant 10 gives approximately two periods for the covariance function (see Figure 2); a smaller value of the constant would lead to fewer oscillations. The 48 regions partition the map of the United States into the 48 contiguous states. The map, obtained from statlib (http://lib.stat.cmu.edu/maps/), was converted to Albers equal area projection coordinates (see Deetz and Adams 1921). These coordinates were rescaled so that the area of the map would be equal to 1. All distances in this paper were obtained in that way. Thus, for example, on our scale a distance of 3000 miles is approximately 1.
Figure 2 shows the exact covariances ($E[z_i z_j]$) between all possible pairs of regions. The covariances are plotted on the Y axis, while the average distances between regions are plotted on the X axis; the average distance $d_{ij}$ between regions $A_i$ and $A_j$ is given by
$$d_{ij} = \frac{1}{P(A_i)P(A_j)} \int_{A_i}\int_{A_j} \|s_1 - s_2\|\,dP(s_1)\,dP(s_2).$$
There are $48(48+1)/2 = 1176$ points in the plot. The points approximately follow the line of $\gamma(l) = J_0(10\,l)$.
[Figure 2: exact covariances between all pairs of regions plotted against average distance.]
The display and assessment of the variability of the quantities $z_i z_j$ used to estimate the covariance function are given in Figure 3, which shows the estimated conditional density $d(x, y \mid x)$ of observed covariances between all possible pairs of regions, where $x$ is the average distance between regions and $y$ is the covariance between regions. The density was estimated empirically by generating 1000 realizations of the process $f$. To reduce the variance we calculated one hundred 10-point averages $\frac{1}{10}\sum_{k=1}^{10} z_i^{(k)} z_j^{(k)}$, where the $z_i^{(k)}$ are integrals of realization $k$ of the process $f$ over region $A_i$. The two dimensional density $d(x, y)$ and the marginal density $d(x)$ were estimated using David Scott's ash package available from statlib (http://lib.stat.cmu.edu). The contour level lines of $d(x, y \mid x)$ displayed in Figure 3 approximate highest density regions having probabilities 0.75, 0.50, and 0.25.
[Figure 3: estimated conditional density of observed covariances, with contours approximating highest density regions of probability 0.75, 0.50, and 0.25.]
We then estimated the covariance function from the same data, obtaining 100 estimates of $\hat a$ and $\hat b$ by solving Equation (6) with $\gamma(l) = b\,J_0(a\,l)$ for both p=1 and p=2. The results are shown in Table 2. The least squares solution (p=2) overestimates the scale parameter and is much less stable than the solution for p=1; therefore we used p=1 in the next section. The scale parameter overestimation would have no effect on the mean of the predictor (Equation 1) but would entail larger error bounds. The two covariance functions shown in Figure 3 are based on the extreme estimated parameters for p=1 ($\hat a = 9.1$ and $\hat a = 10.9$). The covariance function corresponding to the median of the estimated parameters is not shown since it obscures the exact covariances.
                         Quantiles
     Parameter       0    0.25   0.50   0.75      1
p=2  $\hat a$      5.4     9.7     11   22.8   1056
p=2  $\hat b$     0.58    1.87   2.54    5.5    615
p=1  $\hat a$      9.1     9.8     10   10.2   10.9
p=1  $\hat b$     0.41    0.76    0.9   1.63   1.63
The incidence rate process of disease data often has seasonal and longer time trends (see Figure 1) that need to be removed before estimation of the covariance function. The data are based on reports by state epidemiologists, who themselves receive reports from a variety of sources, such as individual practitioners, hospitals, laboratories, and health departments. Hence each state might have a slightly different disease reporting mechanism.
Let zij be the reported incidence rate in state i for time interval j. Very good temporal resolution (weeks) and poor spatial resolution (states) lead to strong temporal and weaker spatial correlations. To increase spatial correlations we aggregated the data into periods of four weeks, so j corresponds to approximately one month.
Trends were removed by fitting state effects $s_i$ and time effects $t_j$ via median polish (Mosteller and Tukey 1983), implemented as the twoway function in the S language (Becker et al. 1988). The residuals $r_{ij} = z_{ij} - s_i - t_j$ for any particular state did not have an obvious trend. The residuals were normalized since we were comparing the shapes of the covariance functions, not the scales (the minimization criteria are scale invariant, hence the actual covariance function can be obtained by multiplying the estimated covariance function by the square of the normalization factor).
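Median polish itself is a short iterative algorithm that alternately sweeps out row and column medians. A minimal sketch, written in Python rather than S, might look like:

```python
import numpy as np

def median_polish(z, n_iter=10):
    """Remove state (row) and time (column) effects by median
    polish, in the spirit of the S `twoway` function: repeatedly
    sweep out row and column medians.  Returns (overall, row, col,
    residuals) with z == overall + row[:, None] + col[None, :] + residuals.
    """
    r = np.array(z, dtype=float)
    overall = 0.0
    row = np.zeros(r.shape[0])
    col = np.zeros(r.shape[1])
    for _ in range(n_iter):
        rm = np.median(r, axis=1)      # sweep out row medians
        row += rm
        r -= rm[:, None]
        m = np.median(row)             # keep row effects centered
        overall += m
        row -= m
        cm = np.median(r, axis=0)      # sweep out column medians
        col += cm
        r -= cm[None, :]
        m = np.median(col)             # keep column effects centered
        overall += m
        col -= m
    return overall, row, col, r
```

Each sweep preserves the decomposition exactly, so the residuals returned are the detrended values used in the covariance estimation.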
To illustrate the behavior of the observed lagged covariances $r_{ij}\,r_{i'(j+l)}$ for temporal lags $l = 0, 3, 6$, Figure 4 shows three quantiles (25, 50, and 75) of the average lagged covariance computed over equally spaced distance intervals (0-0.1, 0.1-0.2, ...). For each interval $[x_i, x_{i+1})$ the average observed lagged covariances were computed between regions having distance in the range $[x_i, x_{i+1})$, and the three quantiles were calculated.
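The binning of observed lagged covariances by distance can be sketched as follows (the array names are ours; the residuals r and average distances d are assumed precomputed):

```python
import numpy as np

def lagged_cov_quantiles(r, d, lag, edges):
    """Quantiles (25, 50, 75) of observed lagged covariances,
    binned by inter-region distance (as in Figure 4).

    r    : n_regions x n_times matrix of detrended residuals
    d    : n_regions x n_regions matrix of average distances d_ij
    lag  : temporal lag l (in time steps)
    edges: distance bin edges (e.g. 0, 0.1, 0.2, ...)
    """
    n, T = r.shape
    # Average lagged covariance between each pair of regions:
    # cov[i, j] = (1/(T-l)) * sum_t r[i, t] * r[j, t+l].
    cov = r[:, :T - lag] @ r[:, lag:].T / (T - lag)
    i, j = np.triu_indices(n)
    dist, c = d[i, j], cov[i, j]
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (dist >= lo) & (dist < hi)
        out.append(np.percentile(c[in_bin], [25, 50, 75])
                   if in_bin.any() else [np.nan] * 3)
    return np.array(out)
```

For a nonzero lag the pairwise lagged covariance is not symmetric in the two regions; the sketch keeps one direction per pair, which suffices for display purposes.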
The estimated covariance functions are also shown in Figure 4 and Table 3. We used the observed lagged covariances to estimate parameters for the covariance functions $b\,e^{-al}$ and $b\,J_0(al)$ by solving Equations (5, 6) with $C_{ij} = 1$ and p=1. The population density $dP$ in Equation (3) was assumed constant over the territory of each state. A more precise definition of the $W_{ij}$'s would be obtained using smaller regions such as counties or census tracts.
Judging by the estimated parameters (Table 3) of the exponential covariance function, the Tuberculosis data has short range correlations that decrease rapidly with geographic distance, unlike the other two diseases. The quantiles in Figure 4 indicate that for Hepatitis B, which spreads by a different mechanism than the other diseases (contaminated blood products), covariances appear to increase at the largest distances, possibly reflecting a relationship between large cities on the east and west coasts.
[Figure 4: quantiles of observed lagged covariances and estimated covariance functions for the three diseases.]
The exponential covariance model did not fit the observed covariances very well because at geographic distances around 2000 miles the covariances for all three diseases appear to be negative. Table 3 shows that the cyclic covariance function outperformed the exponential covariance function in 6 out of 9 cases. The large peak at zero (no temporal lag) in the Tuberculosis data caused excessive oscillations of the cyclic covariance function. This example suggests that a weighted mixture of exponential and Bessel functions might best describe the Tuberculosis data.
function        Disease        Lag 0    Lag 3    Lag 6
$b\,e^{-al}$    Hepatitis A     3.18       45     8.9+
$b\,J_0(al)$    Hepatitis A    2.52+    .002+     2.58
$b\,e^{-al}$    Hepatitis B      6.7      3.0     3.3+
$b\,J_0(al)$    Hepatitis B    3.93+     2.9+      .02
$b\,e^{-al}$    Tuberculosis     157     271+      672
$b\,J_0(al)$    Tuberculosis   11.5+     6.11     5.9+

(A "+" marks the better fitting of the two functions for that disease and lag.)
We designed and implemented a set of methods to estimate the dependence structure of a spatial-temporal stochastic process when the available data are represented in the form of integrals of the considered process. We illustrated our techniques on simulated data and applied them to real data concerning the spread of three diseases. We concentrated on spatial (two-dimensional) isotropic covariance functions and estimated two parametric forms of the covariance function. We also described a way to re-express a covariance function in terms of a nondecreasing function for the purpose of numeric non-parametric estimation.
Our estimation method was based on a solution to a system of integral equations given in Equation (2) via numeric optimization. On simulated data we were able to recover the original covariance function. For the real data concerning infectious diseases we obtained spatial correlations that reasonably describe possible mechanisms by which each disease spreads. We therefore achieved our purpose of obtaining a quantitative estimate of variability and smoothness for the predictor of a stochastic process. This makes possible the prediction of spatial and space-time processes from observations representing averages of such a process.
The shapes and locations of the regions affect the quality of the estimate at various spatial distances. If the goal of the estimation is to obtain a reliable covariance function estimate at a particular distance $l$, it is best to have many pairs of small regions whose average distance is close to $l$. To obtain the estimate at all distances it is better to have regions of different sizes and shapes (for more discussion see Mockus 1994).
There still are a number of issues with the described approach in general and with the treatment of the disease data in particular. The approach assumes an isotropic covariance function (in two spatial dimensions). The numeric optimization required to obtain the estimates reliably is nontrivial (see Appendix). Simple covariance functions were used in the examples; use of covariance functions with a larger number of parameters would make the optimization task even harder. The temporal trends of the diseases were removed using simple median polish, and the spatial covariance function was fitted independently for different time lags. A more sophisticated approach would be to fit a time series model (such as ARMA) together with the spatial covariances.