The Hoshen-Kopelman (HK) algorithm (Hoshen and Kopelman, 1976)
is a one-pass approach to
cluster identification that utilizes two working arrays, `matrix` and
`csize`, to store the post-processed input map and interim cluster statistics.
Post-processed refers to the map resulting from pre-processing - which filters data
values not corresponding to the chosen map class. The original input
ERDAS/Lan map is read into a holding buffer, filtered, and moved into the ` matrix`
array for final processing. This three-step process is referred to as pre-processing.
The algorithm traverses the map from left to right, and then from top to
bottom, assigning a temporary
cluster label (integer) to each non-zero pixel and recording any pixel label or
cluster membership changes in the two working arrays. Temporary label assignment
is a function
of the enforced neighborhood rule and the algorithm implementation, which are
explained in detail in Sections 1.3 and 1.4, respectively.
` Matrix` is a two dimensional integer array of size ,
where **n** is the height (rows) and **m** is the width (columns) of the map in pixels.
The dimensions **n** and **m** are increased by 1 to accommodate boundary columns and rows,
containing unique boundary integer values, that are necessary for this implementation.
Six integer pointers (` firstRow`, ` secondRow`, ` lastRow`, ` current`,
` west`, and ` north`) are maintained to traverse the array and make changes
when necessary. Pointers are used to index the ` matrix` array and to minimize
the overhead of base-plus-offset array indexing.

Integer storage (32 bits/pixel) is required because the ` matrix`
array serves a dual purpose. First, ` matrix` holds the post-processed
map data, and second, it serves as a working array for the pixels' temporary
cluster label. If it is impossible to store the entire ` matrix` array
in memory, a * divide-and-conquer* approach can be implemented whereby cluster
identification is performed by partitioning the map along row boundaries and
merging the results of each partition. Each partition is of size
where **k** is the number of rows in each partition. If **n** is not evenly divisible
by **k**, the last partition would have less than **k** rows.

An integer array (`csize`) is used to store either the running total of cluster size or
an index value for another entry in the ` csize` array. If the stored value is positive,
it is the running total of the pixel membership in that cluster. If it is negative,
the absolute value of the stored number is an index to the
true cluster label. Negative indexes can follow a non-circular, recursive path for
a finite number of steps. Figure 4 illustrates the ` csize` working array
displaying the positive or negative entries for each index in the array. Temporary cluster
1 shows a membership of 12 pixels whereas temporary cluster 2 is the beginning of a
5-step pointer path () pointing to
temporary cluster 3 containing 22 pixels. The existence of the path signifies
that
the temporary cluster at the head of the path (array index 2)
had been previously merged into the
temporary cluster at the tail of the path (array index 3). For
example, as the result of four
independent merge operations, temporary cluster 2 had previously been merged into
temporary cluster 3. Compression of this path is significant
to the overall efficiency of the FSM implementation
and is discussed in more detail in Section 1.5.2.

Michael W. Berry (berry@cs.utk.edu)

Sat Mar 30 23:40:13 EST 1996