PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004 (Home page)

Updated: 18 February 2004

New algorithms for statistical analysis of interval data

Gang Xiang, Scott A. Starks, Vladik Kreinovich, and Luc Longpre
NASA Pan-American Center for Earth and Environmental Studies PACES
University of Texas, El Paso
TX 79968, USA
email: vladik@cs.utep.edu

Once we have several results $X_1,...,X_n$ of measuring some physical quantity - e.g., the amount of pollution in a lake - traditional statistical data processing starts with computing the sample average E and the sample variance V of these results.

The values Xi come from measurements, and measurements are never 100accurate. In many real-life situations, the only information about the corresponding measurement errors is the upper bound Di on the absolute value of the measurement error. As a result, the only information we have about the actual value $x_i$ of each measured quantity is that $x_i$ belongs to the interval $[X_i-D_i,X_i+D_i]$. For such interval data, instead of the exact values of $E$ and $V$, it is desirable to get the intervals $[E]$ and $[V]$ of possible values, intervals formed by all possible values of $E$ $(corr., V)$ when each $x_i$ takes values from the interval $[X_i-D_i,X_i+D_i]$.

Computing $[E]$ is straightforward, but computing the exact range $[V]$ of $V$ turns out to be an NP-hard problem; specifically, computing the lower endpoint of $[V]$ is feasible, but computing the upper endpoint of $[V]$ is NP-hard; see, e.g., our paper [Ferson et al. 2002].

In this same paper, we also show that if all the values $X_i$ are very different from one another, then we can compute $[V]$ in feasible time (actually, quadratic time). In some practical examples, values $X_i$ are different, but in many other practical cases, we may have many similar readings $X_i$ - so the above result is not applicable.

In this paper, we describe new practically useful cases when we can compute $[V]$ by a feasible (polynomial-time) algorithm. The first case is when all the measurements are made by the same measuring instrument or by similar measurement instruments. In this case, none of two input intervals $[X_i-D_i,X_i+D_i]$ is a proper subset of one another, and as a result, we can find the exact range $[V]$ in time $O(n*log(n))$.

The second case is when instead of a single type of measuring instruments, we use a limited number $(m>1)$ of different types of measuring instruments. It turns out that in this case, we can compute $[V]$ in polynomial time $O(n^{m+1})$.

The third case is related to privacy in statistical databases. When the measurements $X_i$ correspond to data that we want to keep private, e.g., health parameters of different patients, we do not want statistical programs to have full access to the data - because otherwise, by computing sufficiently many different statistics, we would be able to uniquely reconstruct the actual values $X_i$. One way to prevent this from happening is to supply the statistical data processing programs not with the exact data, but only with intervals of possible values of this data, intervals corresponding to a fixed partition. For example, instead of the exact age, we tell the program that a person's age is between 30 and 40. For such interval data, it is also possible to compute the exact range $[V]$ in polynomial time - namely, in time $O(n*log(n))$.

Reference:
Scott Ferson, Lev Ginzburg, Vladik Kreinovich, Luc Longpre, and Monica Aviles, "Computing Variance for Interval Data is NP-Hard", ACM SIGACT News, 2002, Vol. 33, No. 2, pp. 108-118.

Home page


2004-02-18