PARA'04 State-of-the-Art
in Scientific Computing
June 20-23, 2004

Updated: 29 February 2004

Cache Optimizations for Iterative Numerical Codes Aware of Hardware Prefetching

Josef Weidendorfer
Computer Science
Technical University Munich
Germany

Cache optimizations typically involve code transformations that increase the locality of memory accesses. An orthogonal approach is to hide memory latency fully through prefetching, i.e., by ensuring that data is loaded early enough, before it is actually used. Software prefetching achieves this by inserting cache load instructions into the program code. However, such instructions consume both decoding bandwidth and the hardware resources that track outstanding loads. Thus, modern processors are often equipped with hardware prefetching units, which predict future memory accesses in order to load data into the cache in advance.
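
As an illustration of software prefetching, the following minimal sketch adds a prefetch hint to a 1D smoothing sweep. It is not the code studied in this contribution: the function name smooth_1d, the prefetch distance PREFETCH_DIST, and the assumption of 64-byte cache lines (8 doubles) are hypothetical choices for the example. The hint uses the GCC/Clang builtin __builtin_prefetch and is issued at most once per assumed cache line, to limit the decoding overhead mentioned above.

```c
#include <stddef.h>

/* Hypothetical prefetch distance in elements; a real value must be tuned
   to the memory latency and loop cost of the target machine. */
#define PREFETCH_DIST 64

/* Assumption: 64-byte cache lines, i.e. 8 doubles per line. */
#define DOUBLES_PER_LINE 8

/* 1D smoothing sweep: each interior dst[i] becomes the mean of src[i-1],
   src[i], and src[i+1]. Boundary elements of dst are left untouched. */
void smooth_1d(double *dst, const double *src, size_t n)
{
    for (size_t i = 1; i + 1 < n; i++) {
#if defined(__GNUC__) || defined(__clang__)
        /* Issue one read prefetch (rw = 0) per assumed cache line for data
           needed PREFETCH_DIST iterations later. This is only a hint and
           never changes the computed result. */
        if (i % DOUBLES_PER_LINE == 0 && i + PREFETCH_DIST < n)
            __builtin_prefetch(&src[i + PREFETCH_DIST], 0, 3);
#endif
        dst[i] = (src[i - 1] + src[i] + src[i + 1]) / 3.0;
    }
}
```

For a purely sequential sweep like this one, a hardware prefetcher would likely cover most misses on its own; explicit hints tend to pay off mainly for access patterns the hardware does not predict.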

The best strategy seems to be to combine both prefetching approaches. In this contribution, we run the code of a 3D multigrid solver in a cache simulator enhanced with a hardware prefetcher model. Even with its simple design, the prefetcher proves its usefulness. Cache misses that the prefetcher does not predict can be easily located in the simulation results. By selectively introducing only a small number of software prefetch instructions, we show that we can still improve not only the simulated results but also the real runtime on actual processors such as the Intel Pentium 4.
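
The idea of a cache simulation extended with a hardware prefetcher can be sketched as follows. This is a toy model under stated assumptions, not the simulator used in this contribution: it assumes a direct-mapped cache with 64-byte lines and a next-line prefetcher whose prefetches arrive instantly, and all names (cache_sim, cache_init, cache_access) are hypothetical. Running an access trace with the prefetcher switched off and on shows which misses the prefetcher fails to cover; those are the candidates for software prefetch instructions.

```c
#include <stdint.h>
#include <string.h>

#define LINE_BITS 6      /* assumption: 64-byte cache lines */
#define NUM_SETS  256    /* assumption: direct-mapped, 16 KiB in total */

typedef struct {
    uint64_t tag[NUM_SETS];   /* line address currently held by each set */
    int      valid[NUM_SETS];
    int      prefetch_on;     /* 1 = next-line prefetcher enabled */
    unsigned long accesses;   /* demand accesses only */
    unsigned long misses;     /* demand misses only */
} cache_sim;

void cache_init(cache_sim *c, int prefetch_on)
{
    memset(c, 0, sizeof *c);
    c->prefetch_on = prefetch_on;
}

/* Returns 1 if the given line address is currently cached. */
static int cache_has(const cache_sim *c, uint64_t line)
{
    unsigned set = (unsigned)(line % NUM_SETS);
    return c->valid[set] && c->tag[set] == line;
}

/* Installs a line, evicting whatever occupied its set. */
static void cache_fill(cache_sim *c, uint64_t line)
{
    unsigned set = (unsigned)(line % NUM_SETS);
    c->tag[set] = line;
    c->valid[set] = 1;
}

/* Simulates one demand access; returns 1 on a miss and 0 on a hit. */
int cache_access(cache_sim *c, uint64_t addr)
{
    uint64_t line = addr >> LINE_BITS;
    int miss = !cache_has(c, line);

    c->accesses++;
    if (miss) {
        c->misses++;
        cache_fill(c, line);
    }
    /* Idealized next-line prefetch: after every access, make sure the
       following line is present. Timing and bandwidth are ignored. */
    if (c->prefetch_on && !cache_has(c, line + 1))
        cache_fill(c, line + 1);
    return miss;
}
```

In this model, a sequential scan over 1024 doubles misses once per line without the prefetcher and only on the very first line with it, whereas a scan with a stride of two lines misses on every access either way.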
