Can Quadtree Encoding help with MPI Collective Algorithm
Selection Process?
Jelena Pjesivac-Grbovic, Graham E. Fagg
Abstract:
The performance of an MPI collective operation depends on a number
of different factors, some of which are at the hardware level such
as network characteristics and topology, while some others are at
the software level: algorithm implementation, choice of segment
size (if applicable), and the collective operation parameters.
Selecting the best possible collective implementation for the
particular set of parameters is important for overall application
performance.
In this paper, we focus on MPI collective algorithm selection
process and explore the applicability of the quadtree encoding
method to this problem.
We construct quadtrees with different properties from the measured
algorithm performance data and analyze the quality and performance
of decision functions generated from these trees.
The experimental data shows that in some cases, the decision
function based on a quadtree structure with a mean depth of 3 can
incur as little as a 5% performance penalty on average. The exact,
experimentally measured, decision function for all tested
collectives could be fully represented using quadtrees with a
maximum of 6 levels. We also evaluate the performance of prototype
version of quadtree-based in-memory decision systems, and find that
that the decision can be obtained from a 6-level quadtree in
close to 170ns
. In comparison, fixed reduce decision in FT-MPI
takes 45ns
to evaluate and is not an exact decision. Our
results indicate that quadtrees may be a feasible choice for both
processing of the performance data and automatic decision function
generation.
- Reduce decision map
- Broadcast decision maps
- Performance penalty broadcast decision functions
- Accuracy threshold and quadtree performance penalty
- Accuracy threshold vs. maximum depth
- Algorithm Selection Problem as a Classification
Problem
- Decision map example
- Broadcast decision tree statistics
- In-memory decision system performance
Introduction
The performance of MPI collective operations is crucial for good
performance of MPI application which use them
[12]. For this reason, significant efforts
have been put on design and implementation of efficient collective
algorithms both for homogeneous and heterogeneous cluster environments
[16,7,13,8,3,14,9,2,15,11].
Performance of these algorithms varies with the total number of nodes
involved in communication, system and network characteristics, size of
data being transferred, current load, and if applicable, the operation
that is being performed as well as the segment size which is used for
operation pipelining. Thus, selecting the best possible algorithm and
segment size combination (method) for every instance of
collective operation is important.
To ensure good performance of MPI applications, collective operations
can be tuned for the particular system. The tuning process often
involves detailed profiling of the
system possibly combined with communication modeling, analyzing the
collected data, and generating a decision function. During
run-time, the decision function selects close-to-optimal method for
a particular collective instance. This approach relies on the ability
of the decision function to accurately predict algorithm and segment
size to be used for the particular collective instance.
Alternatively, one could construct an in-memory decision system which
could be queried/searched at the run-time to provide the optimal
method information. In order for either of these approaches to be
feasible, the memory footprint and the time it takes to make decisions
need to be minimal.
This paper studies the applicability of the quadtree encoding method
as a storage and optimization technique within the MPI collective
method selection process. We assume that the system of interest has
been benchmarked and that detailed performance information exists for
each of available collective communication algorithm. With this
information, we focus our efforts on investigating whether the
quadtree encoding is a feasible way to generate static decision
functions as well as, to represent the decision function in memory.
We implemented a prototype quadtree implementation and programs to
analyze the experimental performance data, construct the quadtree
decision functions, and analyze their performance penalty in
comparison to the exact decision function. We collected detailed
profiles for broadcast and reduce MPI collective algorithms on two
different clusters, and analyzed the quality of decisions from
quadtrees built using this data but under different constraints.
The experimental data collected on Grig cluster located at the
University of Tennessee at Knoxville for broadcast and reduce MPI
collectives, shows that the decision function based on quadtree
structure with a mean depth of 3, on average can incur as little as
a 5% performance penalty. The exact, experimentally measured,
decision function for both collectives on this system could be fully
represented using quadtrees with a maximum of 6 levels. We also
measured performance of the prototype version of in-memory
quadtree-based decision system and found that 6-level tree can return
decision in close to 170ns
on average. The 3-level tree search for
both reduce and broadcast took around 150ns
on average. This is
comparable to 45.22ns
and 110.53ns
for fixed decision functions
in FT-MPI for reduce and broadcast collectives respectively.
These results indicate that quadtrees may be a feasible choice for
processing of the performance data and decision function generation.
The paper proceeds as follows: Section 2
discusses existing approaches to the decision making/algorithm
selection problem; Section 3 describes the
quadtree construction and analysis of quadtree decision function in
more detail; Section 4 presents experimental
results; Section 5 concludes the
paper with discussion of the results and future work; finally the
appendix A provides some implementation
details about the library.
Related work
The MPI collective algorithm selection problem has been addressed in
many MPI implementations.
In the FT-MPI [4], the decision function is generated
manually using visual inspection method augmented with Matlab scripts
used for analysis of the experimentally collected performance data. This
approach results in a precise albeit complex decision functions.
In the MPICH-2 MPI implementation, the algorithm selection is based on
bandwidth and latency requirements of an algorithm, and the switching
points are predetermined by the implementers
[14].
In the tuned collective module of the Open MPI [5], the
algorithm selection can be done in either of the following three ways:
via compiled decision function, via user-specified command line flags,
or using rule-based run-length encoding scheme which can be tuned for
particular system.
It is also possible to view this problem as a data mining task in
which the algorithm selection problem is replaced by an equivalent
classification problem. The new problem is to classify collective
parameters, (collective operation, communicator size, message
size), into a correct category, a method in our case, to be used at
run time.
The major benefit of this approach is that the decision making process
is a well studied topic in engineering and machine learning fields.
Decision trees are extensively used in pattern recognitions, CAD
design, signal processing, medicine, biology, and search engines
[10].
Table 1:
Interpreting the algorithm selection problem as a
classification problem.
|
|
Alternatively, one can interpret the optimal collective implementation
on a system, i.e. a decision map, as an image and apply a
standard compression algorithms to it. Figure
1 illustrates a decision map for the
reduce operation on Nano cluster. We then use the encoded structure to
generate decision function code. To the best of our knowledge, we are
the only group which has approached the MPI collective tuning process
in this way.
Figure 1:
Reduce decision map from
Nano cluster. Different colors correspond to different method
indexes.
|
Quadtrees and MPI collective operations
We use the collective algorithm performance information on a
particular system to extract the information about the optimal methods
and construct a decision map for the collective on that
system. An example of a decision map is displayed in Table
2.
The decision map which will be used to initialize the quadtree must be
a complete and square matrix with a dimension size that is a power of
two,
2k x 2k
. Complete decision map means that tests must
cover all message and communicator sizes of interest. Neither of
these requirements are real limitations, as the missing data can be
interpolated and the size of the map can be adjusted by
replicating some of the entries. The replication process does not
affect the quadtree decisions, but may affect efficiency of the
encoding (both in positive and negative manner).
Table 2:
Decision map example. The axis information relates to the
decision maps in Figure 2.
|
Communicator size (y-axis) |
Message size (x-axis) |
Algorithm |
Segment size |
Method index |
|
3 |
1 |
Linear |
none |
1 |
|
3 |
2 |
Linear |
none |
1 |
|
... |
... |
... |
... |
... |
|
16 |
128 |
Linear |
none |
1 |
|
... |
... |
... |
... |
... |
|
128 |
64KB |
BinaryTree |
8KB |
13 |
|
... |
... |
... |
... |
... |
|
Building the quadtree decision structure
Once a decision map is available, we initialize the quadtree from it
using user specified constraints such as accuracy
threshold and maximum allowed depth of the tree. The
accuracy threshold is the minimum percentage of points in a block with
the same ``color'', such that the whole block is ``colored'' in that
``color''. The quadtree with no maximum depth set and threshold of 100%
is an exact tree. The exact tree truthfully represents the measured
data. A quadtree with either threshold or maximum depth limit set
allows us to reduce the size of the tree at the cost of prediction
accuracy, as it is no longer an exact copy of the original data.
Limiting the absolute tree depth limits the maximum number of
tests we may need to execute to determine the method index for
specified communicator and message size. Setting the accuracy
threshold helps smooth the experimental data, thus possibly making the
decision function more resistant to inaccuracies in measurements.
Applying the maximum depth and/or the accuracy thresholds is
equivalent to applying low-pass filters to the original data set.
Decision quadtree properties
A property of any decision tree is that an internal node of the tree
corresponds to an attribute test, and the links to children nodes
correspond to the particular attribute values. In our encoding scheme,
every non-leaf node in the quadtree corresponds to a test which
matches both communicator and message size values. The leaf nodes
contain information about the optimal method for the particular
communicator and message size ranges. Thus, leaves represent the rules
of the particular decision function. In effect, quadtrees allow us to
perform a recursive binary search in a two-dimensional space.
Generating decision function source code
We provide functionality to generate decision function source code
from the initialized quadtree. Recursively, for every internal node
in the quadtree we generate the following code segment:
if (NW) {...} else if (NE) {...} else if (SW) {...} else if
(SE) {...} else {error}.
The current implementation is functional but lacks
optimizations, i.e. ability to merge conditions with same
color1. The conditions for
boundary points (minimum and maximum communicator and message sizes)
are expanded to cover that region fully. For example, the rule for
minimum communicator size will be used for all communicator sizes
less than the minimum communicator size.
In-memory quadtree decision structure
Alternative to generating the decision function source code is
maintaining an in-memory quadtree decision structure which can be
queried during the run time.
An optimized quadtree structure would contain 5 pointers and 1 method
field, which could probably be a single byte or an integer value.
Thus, the size of a node of the tree would be around 44B on 64-bit
architectures2. Additionally, the system would need to maintain in
memory the mapping of (algorithm, segment size) pairs to method
indexes as well.
The maximum depth decision quadtree we encountered in our tests had
6 levels. This means that in the worst case, the 6-level decision
quadtree could take up to
= 5461
nodes, which
would occupy close to 235KB of memory. However, our results
indicate that the quadtrees with 3 levels can still produce reasonably
good decisions. Three-level quadtree would occupy at most 3740B and
as such could fit into 4 1KB pages of main memory. Even so, the smaller
quadtree if cached could still occupy significant portion of the
cache.
Our prototype implementation provides program for managing the
in-memory decision function - however the node structure has 104B
instead of minimal 44B. We believe that optimized version should
achieve significantly better performance than the prototype version.
Experimental results and analysis
In order to determine whether quadtrees are a feasible choice for
encoding the automatic method selection process for MPI collective
operations, we analyzed the accuracy and the
performance3 of quadtrees built from the same experimental data but
under different constraints.
Under the assumption that the collective operations parameters are uniformly
distributed across communicator size and message size space, we expect
that the average depth of the quadtree is the average number of
conditions we need to evaluate before we can determine which method to
use. In the worst case, we will follow the longest path in the tree
to make the decision, and in the best case the shortest.
The performance data for broadcast and reduce collective algorithms
was collected on Grig cluster located at the University of Tennessee at
Knoxville and Nano cluster located at the Lawrence Berkeley National
Laboratory.
Broadcast decision maps
Figure 2 shows six different quadtree
decision maps for a broadcast collective on the Grig cluster. We
considered five different broadcast algorithms (Linear, Binomial,
Binary, Splitted-Binary, and Pipeline),4 and
four different segment sizes (no segmentation, 1KB, 8KB, and 16KB).
The measurements covered all communicator sizes between 2 and 28
processes and message sizes in 1B to 384KB range.
Figure 2:
Broadcast decision maps from
Grig cluster. Different colors correspond to different method
indexes. The trees were generated by limiting the maximum tree
depth. The x-axis scale is logarithmic. The crossover line for
1-level quadtree is not in the middle due to the ``fill-in''
points used to adjust the original size of the decision map from
25 x 48
to
64 x 64
form.
|
The exact decision map in Figure 2 exhibits
trends, but there is a considerable amount of information for intermediate
size messages (between 1KB and 10KB) and small communicator sizes.
Limiting the maximum tree depth smoothes the decision map and
subsequently decreases the size of the quadtree. Table
3 shows the mean tree depth and related
statistics for the decision maps presented in Figure
2.
Performance penalty of decision quadtrees
One possible metrics of merit is the performance penalty one would incur
by using a restricted quadtree instead of the exact one. To compute
this, one can use the performance information for methods suggested by the
restricted tree for particular set of communicator and message size
values, and compare them to the performance results for methods
suggested by the exact tree.
The reproducibility of measured results is out of scope of this paper,
but we followed the guidelines from [6] to
ensure good quality measurements. Even so, the ``exact'' decision
function corresponds to a particular data set, and the performance
penalty of other decision functions was evaluated against the data
that was used to generate them in the first place.
Figure 3 shows the
performance penalty of decision quadtrees from Figure
2 and the Table
3 summarizes the properties and
performance penalties for the same data.
Figure 3:
Performance
penalty of broadcast decision function from Grig cluster.
Colorbar represents relative performance penalty in percents.
White color means less than 25%, yellow is between 25% and 75%.
|
Table 3:
Statistics for
broadcast decision quadtrees in Figure
2. The number of leaves
corresponds to the number of regions we divided the (communicator
size, message size) space into. The number of lines in decision function
includes lines containing only braces, error handling, etc.
| Tree Depth |
Performance
penalty [%] |
Number of |
Function size |
|
Max |
Min |
Mean |
Min |
Max |
Mean |
Median |
Leaves |
[# of lines] |
|
1 |
1 |
1.0000 |
0.00 |
346.05 |
37.11 |
0.00 |
4 |
24 |
|
2 |
2 |
2.0000 |
0.00 |
436.02 |
18.63 |
0.00 |
16 |
82 |
|
3 |
2 |
2.9655 |
0.00 |
436.02 |
08.83 |
0.00 |
58 |
330 |
|
4 |
2 |
3.8554 |
0.00 |
391.53 |
06.29 |
0.00 |
166 |
932 |
|
5 |
2 |
4.7783 |
0.00 |
356.47 |
05.41 |
0.00 |
442 |
2,496 |
|
6 |
2 |
5.6269 |
0.00 |
000.00 |
00.00 |
0.00 |
973 |
5,505 |
|
The analysis shows that even for noisy decision map in Figure
2, a 3-level quadtree would have less
than 9% performance penalty on average, while the exact decision
could be represented with a total of 6 levels.
Quadtree accuracy threshold
In Section 3.3 we mentioned that an
alternative way to limit the size of quadtree is to specify the tree
accuracy threshold.
Figure 4 shows the effect of varying the accuracy
threshold on the mean performance penalty of a reduce quadtree
decision function on two different systems. On both systems, the
mean performance penalty of the reduce decision was below 10% for an
accuracy threshold of approximately 45%. This threshold corresponds
to the quadtree structures of maximum depth 3.
This means that the quadtree decision which would on average
potentially cause a 10% performance penalty would be evaluated at
most in 3 expressions.
Figure 4:
Effect of the accuracy threshold on mean quadtree performance
penalty.
|
Accuracy threshold vs. limiting maximum depth
Figure 5 shows the mean performance penalty
of broadcast and reduce decisions on Grig cluster (See Figures
2,
3, and 4,
and Table 3) as a function of the mean
quadtree depth for quadtrees constructed by specifying accuracy
threshold and maximum depth.
Figure 5:
Accuracy
threshold vs. maximum depth quadtree construction.
|
The results indicate that in the cases we considered, constructing the
decision quadtree by restricting the maximum depth of the tree
directly incurs a smaller mean performance penalty than the tree of
similar mean depth constructed by setting the accuracy threshold.
The results for the broadcast decision function show that when the
quadtree is deep enough to cover almost the whole initial data set,
the tree constructed using an accuracy thresholds achieves the smaller
mean performance penalty. This is not the case for the quadtree-based
reduce decision functions. This is probably due to the fact that the
reduce decision function was smoother to start with, so smoothing it
with an accuracy threshold had no further positive effects. Still, we
believe that the example of the broadcast decisions indicates that
the accuracy threshold setting could be used to avoid over-fitting the
data when the tree depth is not a concern.
In-memory quadtree-based decision system
Table 4 contains the performance
results for in-memory quadtree-based decision system for reduce
collective on Grig cluster.
The reported decision time value is the average time it took for the
quadtree to make decision for a random communicator and message size
from the 2 - 32
and 1 - 16MB
ranges respectively. Even though the
broadcast decision function was more complex than reduce on this
cluster, the time to decision was actually lower.
Table 4:
Performance of
prototype implementation of in-memory, quadtree-based decision
system for reduce collective on Grig cluster. Time per level was
computed as
.
|
Mean tree depth |
Mean performance penalty [%] |
Number of leaves |
Number of nodes |
Memory [Bytes] |
Decision time [ns] |
Decision time per level [ns] |
|
0.00 |
561.67 |
1 |
1 |
104 |
22.22 |
22.22 |
|
1.00 |
143.59 |
4 |
5 |
520 |
66.38 |
33.19 |
|
2.00 |
12.51 |
16 |
21 |
2,184 |
109.26 |
36.42 |
|
2.95 |
3.23 |
55 |
73 |
7,592 |
146.78 |
37.16 |
|
3.84 |
0.96 |
157 |
209 |
21,736 |
161.84 |
33.44 |
|
4.63 |
0.37 |
334 |
445 |
46,280 |
170.02 |
30.20 |
|
5.40 |
0.00 |
610 |
813 |
84,552 |
174.95 |
27.34 |
|
The results in Table 4 show that
using even unoptimized quadtree implementation the time to decision
for these quadtrees was around 30ns
per level. Furthermore, the
time to decision for the complete tree with 6 levels took less than
175ns
for reduce and 165ns
for broadcast collective. The
3-level trees, which incur less than 10% performance penalty, took
around 150ns
for both collectives.
In comparison, the fixed reduce and broadcast decisions in FT-MPI took
45.22ns
and 110.53ns
on average. Even though we did not
compute the performance penalty for the fixed decision functions, we
know they perform well on Grig cluster. Still, the results using
unoptimized quadtree-based decision system are comparable to the
fixed ones.
The optimized quadtree implementation will have both smaller memory
footprint and better memory layout due to non-recursive construction.
We expect that this quadtree implementation will decrease the time to
make decision considerably.
Discussion and future work
In this paper, we studied the applicability of a modified quadtree
encoding method to the algorithm selection problem for the MPI collective
function optimization. We analyzed the properties and performance of
quadtree decision functions constructed by either limiting the maximum
tree depth or specifying the accuracy threshold a the construction time.
Our experimental results for broadcast and reduce collectives, show
that in some cases, the decision function based on a quadtree
structure with a mean depth of 3, incurs less than a 5% performance
penalty on the average.
In other cases, deeper trees (5 or 6 levels) were necessary to achieve
the same performance. However, in all cases we considered, a quadtree
with 3-levels would incur less than a 10% performance penalty on average.
Our results indicate that quadtrees may be a feasible choice
for processing the performance data and decision function generation.
Our analysis of the performance of the in-memory quadtree decision
systems was based on unoptimized quadtree implementation, and as such
should be interpreted as worst case scenario. The results with this
system show that in around 170ns
on average, we can get the
experimentally optimal decision. Not surprisingly, the fixed decision
function in FT-MPI outperformed the in memory system in terms of
performance. Even so, the current results are good enough to persuade
us to explore this track further. We expect that
the performance of an in-memory system will depend greatly on the
implementation efficiency and the application access pattern. It is
possible that in some cases and or in combination with other methods,
it could achieve very good performance. n
One of the limitations of the quadtree encoding method is that since
the decision is based on a 2D-region in communicator size - message
size space, it will not be able to capture decisions which are
optimal for single communicator values, e.g. communicator sizes which
are power of 2. The same problem is exacerbated if the performance
measurement data used to construct trees is too sparse. The sparse
data set is a high-frequency information and applying low-pass
filters to it can cause loss of important information.
The decision map reshaping process to convert measured data from
n x m
shape to
2k x 2k
affects encoding efficiency of the
quadtree. In our current study, we did not address this issue, but in
future work we plan to further improve the efficiency of the encoding
regardless of initial data space.
The major focus of future research will be comparing the
quadtree-based decision functions, to the ones generated using
run-length encoding and standard decision tree algorithms such as
C4.5.
Finally, if one is interested in an application level optimization,
assumptions based on the premise that the communication parameters are
uniformly distributed across the communicator and message size space
are probably optimistic.
Thus, it is possible that it would make sense to refine the trees for
frequently used message and communicator sizes while the rest of the
domain is more sparse. Quadtrees may or may not be right structure for
this type of approach, but we plan to investigate this approach additionally.
Library Information
Implementation Details
The following data structures represent the core of the quadtree
decision function library:
- Decision map/Decision structure.
- The decision map in this
context it is a one-dimensional array whose value at location
n represent the method index to be used for communicator size
commsize[n/nummsgsizes] and message size
msgsize[n%nummsgsizes]. The decision structure contains
the additional information neccessary to interpret the decision
map, eg. communicator and message sizes, topology and segment sizes,
etc.
- Experimental performance data.
- The structure
exp_perf_data_t maintains the experimental performance
data information. It maintains the array of linked list where
exp_data[i][j] contains sorted performance information for
communicator size
commsize[i] and message size
msgsize[j]. The requested timer values can be obtained using
appropriate functions (epd_find_perf_data and
epd_find_perf_timer). The structure also provides
functionality to compute the performance penalty of external
decision map. However, we can only get the performance penalty for
the map whose sample points (communicator and message sizes) are
covered by the performance data.
Currently, the experimental performance data structure can be
initialized only from the OCC performance data files [1].
- Quadtree.
- The quadtree structure is basic quadtree
implementation which supports the maximum tree length and the tree
accuracy threshold as explained in Section
3.2.
- Full decision.
- This is an composite data structure which
contains the experimental performance data, decision, and quadtree
data structures, and provides the functionality to query the
quadtree using the communicator and message size, as well as to
generate the source code of the decision function using this
particular combination of experimental data and quadtree parameters.
We have provided three utility programs with the code:
- analyze_qt_dec_quality.c
- - analyzes the quality of
decisions generated by the quadtree decision functions,
- output_single_decision_function.c
- - generates source
code for the specified decision function, and
- measure_in_memory_decision_time
- - measures the time
it takes to come to a decision.
All three programs have similar look and feel, and they support same
input arguments for describing the quadtree structure. We have also
provided a shell script,
bin/generate_qt_decision_tree_stats.sh which allows user
to run analyze_qt_dec_quality.c program for range of
threshold values.
Installation
The Quadtree Decision Function V2 software can be obtained by contacting the
author via email.
To install the software follow the steps below:
- unpack the archive containing the source code:
# tar -xzvf qtdec_v2.tar.gz
- enter the newly created root directory for the library
# cd quadtree_decision_function_analysis
- review Makefiles in root, src/, and tests/ directory.
- make program from the root directory
# make
- If everything goes as planned, executables will be generated in
bin/ directory. The
generate_qt_decision_tree_stats.sh will be in the
same directory.
Enjoy!
- 1
-
OCC, Optimized collective communication
http://www.cs.utk.edu/~pjesa/projects/occ.
Accessed on March 2006.
- 2
-
BERNASCHI, M., IANNELLO, G., AND LAURIA, M.
Efficient implementation of reduce-scatter in MPI.
J. Syst. Archit. 49, 3 (2003), 89-108.
- 3
-
CHAN, E. W., HEIMLICH, M. F., PURKAYASTHA, A., AND VAN DE GEIJN, R. M.
On optimizing of collective communication.
In Cluster (2004).
- 4
-
FAGG, G. E., GABRIEL, E., BOSILCA, G., ANGSKUN, T., CHEN, Z.,
PJEŠSIVAC-GRBOVI´C, J., LONDON, K., AND DONGARRA, J.
Extending the mpi specification for process fault tolerance on high
performance computing systems.
In Proceedings of the International Supercomputer Conference
(ICS) 2004 (2004), Primeur.
- 5
-
GABRIEL, E., FAGG, G. E., BOSILCA, G., ANGSKUN, T., DONGARRA, J. J.,
SQUYRES, J. M., SAHAY, V., KAMBADUR, P., BARRETT, B., LUMSDAINE, A., CASTAIN,
R. H., DANIEL, D. J., GRAHAM, R. L., AND WOODALL, T. S.
Open MPI: Goals, concept, and design of a next generation MPI
implementation.
In Proceedings, 11th European PVM/MPI Users' Group Meeting
(Budapest, Hungary, September 2004), pp. 97-104.
- 6
-
GROPP, W., AND LUSK, E. L.
Reproducible measurements of MPI performance characteristics.
In Proceedings of the 6th European PVM/MPI Users' Group
Meeting on Recent Advances in PVM and MPI (1999), Springer-Verlag,
pp. 11-18.
- 7
-
HUSE, L. P.
Collective communication on dedicated clusters of workstations.
In Proceedings of the 6th European PVM/MPI Users' Group Meeting
on Recent Advances in PVM and MPI (1999), Springer-Verlag, pp. 469-476.
- 8
-
JUHÁSZ, S., AND KOVÁCS, F.
Asynchronous distributed broadcasting in cluster environment.
In PVM/MPI (2004), D. Kranzlmüller, P. Kacsuk, and
J. Dongarra, Eds., vol. 3241 of Lecture Notes in Computer Science,
Springer, pp. 164-172.
- 9
-
KIELMANN, T., HOFMAN, R. F. H., BAL, H. E., PLAAT, A., AND BHOEDJANG, R.
A. F.
MagPIe: MPI's collective communication operations for clustered
wide area systems.
In Proceedings of the seventh ACM SIGPLAN symposium on
Principles and practice of parallel programmin g (1999), ACM Press,
pp. 131-140.
- 10
-
MURTHY, S. K.
Automatic construction of decision trees from data: A
multi-disciplinary survey.
Data Mining and Knowledge Discovery 2, 4 (1998), 345-389.
- 11
-
PJESIVAC-GRBOVIC, J., ANGSKUN, T., BOSILCA, G., FAGG, G. E.,
GABRIEL, E., AND DONGARRA, J. J.
Performance analysis of mpi collective operations.
In IPDPS '05: Proceedings of the 19th IEEE International
Parallel and Distributed Processing Symposium (IPDPS'05) - Workshop 15
(Washington, DC, USA, 2005), IEEE Computer Society, p. 272.1.
- 12
-
RABENSEIFNER, R.
Automatic MPI counter profiling of all users: First results on a
CRAY T3E 900-512.
In Proceedings of the Message Passing Interface Developer's and
User's Conference (1999), pp. 77-85.
- 13
-
RABENSEIFNER, R., AND TRÄFF, J. L.
More efficient reduction algorithms for non-power-of-two number of
processors in message-passing parallel systems.
In Proceedings of EuroPVM/MPI (2004), Lecture Notes in
Computer Science, Springer-Verlag.
- 14
-
THAKUR, R., AND GROPP, W.
Improving the performance of collective operations in MPICH.
In Recent Advances in Parallel Virtual Machine and Message
Passing Interface (2003), J. Dongarra, D. Laforenza, and S. Orlando, Eds.,
no. 2840 in LNCS, Springer Verlag, pp. 257-267.
10th European PVM/MPI User's Group Meeting, Venice, Italy.
- 15
-
VADHIYAR, S. S., FAGG, G. E., AND DONGARRA, J. J.
Automatically tuned collective communications.
In Proceedings of the 2000 ACM/IEEE conference on Supercomputing
(CDROM) (2000), IEEE Computer Society, p. 3.
- 16
-
WORRINGEN, J.
Pipelining and overlapping for MPI collective operations.
In 28th Annyal IEEE Conference on Local Computer Network
(Boon/Königswinter, Germany, October 2003), IEE Computer Society,
pp. 548-557.
Independent Study Report
Can Quadtree Encoding help with MPI Collective Algorithm
Selection Process?
This document was generated using the
LaTeX2HTML translator Version 2002-2-1 (1.71)
Copyright © 1993, 1994, 1995, 1996,
Nikos Drakos,
Computer Based Learning Unit, University of Leeds.
Copyright © 1997, 1998, 1999,
Ross Moore,
Mathematics Department, Macquarie University, Sydney.
The command line arguments were:
latex2html -html_version 4.0,math -local_icons -dir /home/pjesa/www-home/projects/qtdec -up_url http://www.cs.utk.edu/ pjesa/projects -up_title Projects -split 3 -toc_depth 3 -link 6 -accent_images textrm -t 'MPI Collective Algorithm Selection using Decision Quadtrees' qt_decision_for_web.tex
The translation was initiated by Jelena Pjesivac-Grbovic on 2006-05-12
Jelena Pjesivac-Grbovic
2006-05-12