Key Words: covariance function, aggregate data, spatial-temporal process
The value of such functions can be predicted by first estimating the dependence structure of the underlying stochastic process. Our proposed method for estimating the covariance function from the integrals of a stationary isotropic stochastic process poses the problem as a set of integral equations. To test this proposal we applied it to epidemiological data on the incidence rates of three diseases in the United States between 1980 and 1994. Spatial correlations obtained in this way reasonably described the mechanism by which those diseases spread. We therefore conclude that it is possible to reliably estimate covariance functions from aggregate observations. The estimate of the covariance functions provides valuable insights into the nature of the space-time process - in the epidemiological data it described a possible mechanism by which the diseases spread.
In many instances, observations collected in an experiment or in a study represent averages of measurements taken over a period in time and/or over a region in space. Examples include the number of cases of a disease, population size, income, or other epidemiological, socio-economic, or physical quantity. This quantity can be approximated by modeling it as a stochastic process of continuous space and time, and there are several ways to create such a model. We consider ways to estimate the covariance function of such a process and ways to predict the values of the processes.
The modeling has three objectives:
The last objective is the most demanding as it requires efficient software implementation of the results of the first two problems. An attempt to merge exploratory analysis of the considered type of data with appropriate modeling tools has led to the creation of a software program called MIASMA (Multivariate Interactive Animation System for Map Analysis, see Eddy and Mockus 1995). The software implementation of covariance function estimation described in this paper is being integrated into the MIASMA system.
A problem of spatial estimation from aggregate data was considered by Tobler (1979) in the context of estimating population density. That paper contains a bibliography from the field of applied geography. The interpolation proposed by Tobler is a numeric solution of a Dirichlet's equation with boundary constraints. The method is designed to produce an interpolant under such geographic constraints as lakes, mountains, and deserts. Dyn and Wahba (1982) describe a thin plate histospline approach for a similar problem. Clayton and Kaldor (1987) describe an empirical Bayes estimator for disease incidence rates. Available modeling methods include a kernel type smoothing method described in Eddy and Mockus (1994). Knowledge of the covariance function provides important insights into the nature of the space-time process, but none of the above methods considers ways to estimate the dependence structure.
After describing our datasets, we will briefly consider prediction, then discuss estimation of the dependence structure, and conclude by analyzing the spread of three diseases.
The data were obtained from the National Notifiable Disease Surveillance System. The dataset contains weekly reports, by state, on 57 diseases, from 1980 to 1994. There are 783 report weeks in this period and the reports are provided for 50 states, District of Columbia, 3 territories and New York City. We analyzed 3 common diseases: Hepatitis A, Hepatitis B, and Tuberculosis.
Table 1 lists the 3 diseases and the total number of reported cases from 1980 to 1994.
Disease Name | Total Number of Cases |
Hepatitis A | 328964 |
Hepatitis B | 280586 |
Tuberculosis | 327441 |
The dataset can be shown as multiple time series for each state and
as multiple disease incidence maps for each week. The MIASMA tool is
specially designed to explore such data interactively. Here we can
only illustrate the overall time trends. Figure 1 shows
the weekly average incidence rates (new cases per week per 100,000
people averaged over states) for the 3 diseases. All of the
diseases are seasonal and show a nontrivial trend.
The disease incidence rates can be modeled as a smooth function over the territory of the United States and over the time period considered. Of interest is the number of cases of each disease each week in each state. Those quantities represent integrals of the incidence rate function with respect to population density over the territories of individual states and over weeks.
More formally, we are interested in determining a function
from its integrals
,
where the Ai's
are subsets of a finite set
and
represents
population density. We can model the function as a
stochastic process. Let
be a stochastic process with the
index
.
The zi are observed integrals of the
single realization of f. The simplest predictor for the process fcould then be a piecewise constant function
where
Since we intend to use the predictor in animations, any
discontinuity in such a function would be distracting and must be
avoided (see Eddy and Mockus 1994). Thus we must consider methods
that produce a continuous predictor.
In cases when the joint distribution of
is
Gaussian, the quantity
is a well known linear
function of
given by the formula for the conditional
normal distribution. Let
be the mean of
the process,
be the means of the observed integrals,
,
and
.
Then
where denotes the vector of all 's, and denotes the vector of all ci's, and (cij) is the covariance matrix. In non-Gaussian cases when we know only the first two moments of the vector , Equation (1) defines the best linear unbiased predictor for . Equation (1) is a simple form of equations used in spatial prediction and is referred to as the kriging equation (see, for example, Cressie 1991).
Equation (1) involves quantities and cij that are unknown. In space-time prediction we assume the trend to be a constant in spatial dimensions (a constant which varies over time), and we can estimate ci(s), cij as described in Section 4.
The next step is to estimate geographic and time dependence structure for the stochastic process of disease incidence rates. There are a number of techniques for estimating mean and covariance functions from observations representing point values of a process (see, for example, Delfner 1976, Kitanidis 1983, Cressie 1985, Marshall and Mardia 1985).
In our case the observations are in the form of integrals of the process f described in Section 3. Surprisingly, the estimation of the covariance function of a continuous process from integral observations has not previously been investigated. To illustrate our method we assume that the process f is zero-mean and stationary. We are interested in the covariance function of the process f. It should be noted that the symbol sometimes is used to denote the semi-variogram, e.g. Cressie (1993), but here we use the symbol to denote a covariance function. We obtain the estimate of by solving appropriate integral equations (a general statistical perspective on solving integral equations can be found in O'Sullivan 1986).
We will describe an efficient algorithm to estimate , when is an isotropic covariance function, (isotropic means that is a function of only , where is Euclidean distance), and each Ai is a polygon, or a prism with a polygon base.
In Subsection 4.1 integral equations are obtained for a covariance function. The method used to estimate the covariance function from those equations is then presented. Two main obstacles to an efficient numeric solution are the large number of equations and that the solution must be in the space of positive definite functions. The second problem arises because we want our solution to be a valid covariance function. We address the first problem by designing a fast algorithm to compute the integral of interest and the second problem by re-expressing the isotropic covariance function in terms of a nondecreasing function.
The products zi zj approximate the integrals
.
(For a
stationary process,
is a function of only
). We can estimate
by solving an
inverse problem for :
In the special case when the Ai's are regularly spaced, the appropriate zizj could be averaged, reducing the number of equations. However, in the general case, it is not clear how to reduce the total number of N(N+1)/2 equations.
For an isotropic and stationary process,
,
where
is Euclidean distance,
and thus Equation (2) can be rewritten as
where
and
There are two important difficulties when trying to solve Equation (3) with respect to :
To address the first problem we use a fast algorithm to compute the kernel function WAiAj(l). An efficient algorithm to calculate the kernels WAiAj is described by Mockus (1994) for cases when the regions Ai are piecewise polygons in R2 or prisms with a piecewise polygon base in R3.
To address the second problem we may use a parametric family of
covariance functions (e.g.,
).
Another approach is to re-express the covariance function in terms
of some simpler function. A form of an isotropic covariance
function for dimension
is
where G is a distribution function, is the Bessel function of the first kind, , lrepresents distance, and s is a scale factor. The isotropic covariance function is a scale mixture of Bessel functions, and the estimation of such a function could be done by estimating the function G appearing in Equation (4).
Reformulating Equation (3) in terms of the
function G we get
= | |||
= |
where
and . To simplify the notation the subscript AiAj will become simply ij in the remainder of the paper.
We can look for solutions that minimize the criteria in the
following Equations:
where Cij are weights and p > 0. When p=2 the above criteria give a weighted least squares (WLS) solution. Alternative criteria such as maximization of the likelihood function or of the posterior probability would lead to a more complicated set of equations. We chose to solve the simplest Equations (5) and (6).
In this section we discuss the ways to construct an estimate of the covariance function. When p=2 the optimality criteria in the equations above are quadratic in and dG. Hence, the unconstrained solutions are linear functions (functionals) of the data (zizj). Unfortunately, the constraints require that be a covariance function and G be a nondecreasing function. To simplify the optimization problem we can perform estimation in two steps: first we do an unconstrained minimization (simple least squares problem), and then we project the solution into the desired space. However, the final estimate after the projection may be suboptimal. Another approach (used in the examples of the next section) is to perform optimization entirely within the restricted domain. In that case we have to use nonlinear minimization techniques (see Appendix).
In the following examples we solve Equation (5) for of a parametric form, i.e., . In solving Equation (6) it is convenient to represent G as the sum of an absolutely continuous function g and a step function. Then the integral in (6) can be rewritten as , where Gk=G(Sk)-G(Sk-) are jump sizes and Sk are jump locations. In this form we have the simplest constraints on the parameters ( ). Since the number of observations zi is finite, the solutions and are not well defined unless we estimate parameterized versions of those functions or add regularization/penalty terms to achieve a unique solution.
To illustrate the behavior of the covariances between regions, we first consider an example with simulated data and the covariance function , where and J0 is a Bessel function of the first kind. The constant 10 gives approximately two periods for the covariance function (see Figure 2). A smaller value of the constant would lead to fewer oscillations. The 48 regions partition the map of United States into the 48 contiguous states. The map obtained from statlib (http://lib.stat.cmu.edu/maps/) was converted to Albers equal area projection coordinates, (see Deetz and Adams 1921.) These coordinates were rescaled so that the area of the map would be equal to 1. All distances in this paper were obtained in that way. Thus, for example, on our scale a distance of 3000 miles is approximately 1.
Figure 2 shows exact covariances (
)
between all possible pairs
of regions. The covariances are plotted on the Y axis, while the
average distances between regions are plotted on the X axis;
average distance dij between regions Ai, Aj is given by
There are 48 (48+1)/2= 1176 points in the plot. The points approximately follow the line of .
The display and assessment of the variability of quantities (zizj) used to estimate the covariance function are given in Figure 3. Figure 3 shows the estimated conditional density d(x, y | x) of observed covariances between all possible pairs of regions, where x is the average distance between regions and y is the covariance between regions. The density was estimated empirically by generating 1000 realizations of the process f. To reduce the variance we calculated one hundred 10 point averages , where are integrals of the realization k of the process f over region Ai. The two dimensional density d(x, y)and the marginal density d(x) were estimated using David Scott's ash package available from statlib ( http://lib.stat.cmu.edu). The contour level lines of displayed in Figure 3 approximate highest density regions having probabilities 0.75, 0.50, and 0.25.
We then estimated the covariance function from the same data obtaining 100 estimates of and by solving Equation (6) with and p=2. The results are shown in Table 2. The least squares solution (p=2) overestimates the scale parameter and is much less stable than the solution for p=1. Therefore we used p=1 in the next section. The scale parameter overestimation would have no effect on the mean of the predictor (Equation 1) but would entail larger error bounds. Two covariance functions shown in Figure 3 are based on the extreme estimated parameters for p=1 ( and ). The covariance function corresponding to the median of estimated parameters is not shown since it obscures the exact covariances.
The incidence rate process of disease data often has seasonal and longer time trends (see Figure 1) that need to be removed before estimation of the covariance function. The data are based on reports by state epidemiologists, who themselves receive reports from a variety of sources, such as individual practitioners, hospitals, laboratories, and health departments. Hence each state might have a slightly different disease reporting mechanism.
Let zij be the reported incidence rate in state i for time interval j. Very good temporal resolution (weeks) and poor spatial resolution (states) lead to strong temporal and weaker spatial correlations. To increase spatial correlations we aggregated the data into periods of four weeks, so j corresponds to approximately one month.
Trends were removed by fitting state effects si and time effects
tj via median polish (Mosteller and Tukey (1983)) implemented as
twoway function in S language (Becker, et al (1988)). The
residuals
for any particular state did not have an obvious trend. The residuals were normalized since we were comparing the shapes of the covariance functions, not the scales (the minimization criteria are scale invariant, hence the actual covariance function can be obtained by multiplying the estimated covariance function by the square of the normalization factor).
To illustrate the behavior of observed covariances for temporal lags l=0, 3, 6, Figure 4 shows three quantiles (25, 50, and 75) for the average lagged covariance computed over equally spaced intervals (0-0.1, 0.1-0.2, ...). For each interval [xi,xi+1) the average observed lagged covariances were computed between regions having distance in the range of [xi,xi+1), and the three quantiles were calculated.
Estimated covariance functions are also shown in Figure 4 and Table 3. We used to estimate parameters for the covariance functions and by solving Equations (5, 6) with Cij = 1and p=1. The population density dP in Equation (3) was assumed constant over the territory of each state. More precise definition of Wij's would be obtained using smaller regions like counties or census tracts.
Judging by the estimated parameters (Table 3) of the exponential covariance function, the Tuberculosis data has short range correlations rapidly decreasing with geographic distance, unlike the other two diseases. The quantiles in Figure 4 indicate that for Hepatitis B, which spreads by a different mechanism than other diseases (contaminated blood products), covariances appear to increase for the largest distances, possibly showing relationship between large cities on east and west coasts.
The exponential covariance model did not fit the observed covariances very well because at geographic distances equal to 2000 miles, covariances for all three diseases appear to be negative. Table 3 shows the cyclic covariance function outperformed the exponential covariance function in 6 out of 9 cases. The large peak at zero (no temporal lag) for the Tuberculosis data caused excessive oscillations for the cyclic covariance function. This example suggests that a weighted mixture of exponential and Bessel functions might best describe Tuberculosis data.
function | Disease | Lag 0 | Lag 3 | Lag 6 |
Hepatitis A | 3.18 | 45 | 8.9+ | |
Hepatitis A | 2.52+ | .002+ | 2.58 | |
Hepatitis B | 6.7 | 3.0 | 3.3+ | |
Hepatitis B | 3.93+ | 2.9+ | .02 | |
Tuberculosis | 157 | 271+ | 672 | |
Tuberculosis | 11.5+ | 6.11 | 5.9+ |
We designed and implemented a set of methods to estimate the dependence structure of a spatial-temporal stochastic process when the available data are represented in the form of integrals of the considered process. We illustrated our techniques on simulated data and applied them to real data concerning the spread of three diseases. We concentrated on spatial (two-dimensional) isotropic covariance functions and estimated two parametric forms of the covariance function. We described a way to re-express a covariance function as a nonnegative function for the purpose of numeric non-parametric estimation.
Our estimation method was based on a solution to a system of integral equations given in Equation (2) via numeric optimization. In simulated data we were able to recover the original covariance function. In real data concerning infectious diseases we obtained spatial correlations that reasonably describe possible mechanisms by which each disease spreads. We, therefore, achieved our purpose of obtaining a quantitative estimate of variability and smoothness for the predictor of a stochastic process. This makes possible prediction of spatial and space-time processes using observations representing averages of such process.
The shapes and locations of the regions affect the quality of the estimate at various spatial distances. If the goal of the estimation is to obtain a reliable covariance function estimate at a particular distance l it is best to have more pairs of small regions that have average distance close to l. To obtain the estimate for all distances it is better to have regions of different sizes and shapes (for more discussion see Mockus 1994).
There still are a number of issues with the described approach in general and with the treatment of the disease data in particular. The approach assumes an isotropic covariance function (in two spatial dimensions). The numeric optimization required to achieve the estimates reliably is nontrivial (see Appendix). Simple covariance functions were used in the examples; use of covariance function with larger number of parameters would make the optimization task even harder. The temporal trends of the diseases were removed using simple median polish and the spatial covariance function was fitted independently for different time lags. A more sophisticated approach would be to fit a time series model (like ARMA) together with spatial covariances.