Gang Xiang, Scott A. Starks, Vladik Kreinovich, and Luc Longpre
NASA Pan-American Center for Earth and Environmental Studies PACES
University of Texas, El Paso
TX 79968, USA
email: vladik@cs.utep.edu
Once we have several results
of measuring some physical
quantity - e.g., the amount of pollution in a lake - traditional
statistical data processing starts with computing the sample average
E and the sample variance V of these results.
The values Xi come from measurements, and measurements are never 100accurate. In many real-life situations, the only information about the
corresponding measurement errors is the upper bound Di on the
absolute value of the measurement error. As a result, the only
information we have about the actual value
of each measured
quantity is that
belongs to the interval
. For such
interval data, instead of the exact values of
and
, it is
desirable to get the intervals
and
of possible values,
intervals formed by all possible values of
when each
takes values from the interval
.
Computing
is straightforward, but computing the exact range
of
turns out to be an NP-hard problem; specifically, computing the
lower endpoint of
is feasible, but computing the upper endpoint of
is NP-hard; see, e.g., our paper [Ferson et al. 2002].
In this same paper, we also show that if all the values
are very
different from one another, then we can compute
in feasible time
(actually, quadratic time). In some practical examples, values
are
different, but in many other practical cases, we may have many similar
readings
- so the above result is not applicable.
In this paper, we describe new practically useful cases when we can
compute
by a feasible (polynomial-time) algorithm. The first case
is when all the measurements are made by the same measuring instrument
or by similar measurement instruments. In this case, none of two
input intervals
is a proper subset of one another, and
as a result, we can find the exact range
in time
.
The second case is when instead of a single type of measuring
instruments, we use a limited number
of different types of
measuring instruments. It turns out that in this case, we can compute
in polynomial time
.
The third case is related to privacy in statistical databases. When
the measurements
correspond to data that we want to keep private,
e.g., health parameters of different patients, we do not
want statistical programs to have full access to the data - because
otherwise, by computing sufficiently many different statistics, we
would be able to uniquely reconstruct the actual values
. One way to
prevent this from happening is to supply the statistical data
processing programs not with the exact data, but only with intervals
of possible values of this data, intervals corresponding to a fixed
partition. For example, instead of the exact age, we tell the program
that a person's age is between 30 and 40. For such interval data, it
is also possible to compute the exact range
in polynomial time -
namely, in time
.
Reference:
Scott Ferson, Lev Ginzburg, Vladik Kreinovich, Luc Longpre, and
Monica Aviles, "Computing Variance for Interval Data is NP-Hard", ACM
SIGACT News, 2002, Vol. 33, No. 2, pp. 108-118.