Serial Performance Optimization - Exercises

The purpose of these exercises is to familiarize the application programmer with single-node performance optimization techniques. Even for parallel codes, single-node performance is still usually the most important factor in overall performance.
  1. Consider the following loop:
    DO I = 1,N
      Y(J) = Y(J) + X(I)
    ENDDO
    
    Ideally the compiler should hoist the load of Y(J) above the loop and sink the store to Y(J) below it. Because Y(J) is an array element indexed by J rather than a scalar, however, some compilers may have trouble performing this optimization on the above loop. How could you change this code to help the compiler perform this loop optimization?
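
    One possible approach, offered only as a sketch, is to accumulate into a scalar temporary so that the loop body no longer references Y(J). The temporary T, the array sizes, and the test data below are assumptions made for illustration; they are not part of the exercise.

    ```fortran
    PROGRAM HOIST
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 1000   ! assumed problem size
      REAL :: X(N), Y(10), T
      INTEGER :: I, J

      X = 1.0
      Y = 0.0
      J = 3

      ! Load Y(J) once before the loop (the "hoist") ...
      T = Y(J)
      DO I = 1, N
        T = T + X(I)    ! scalar accumulation; T can live in a register
      ENDDO
      ! ... and store it once after the loop (the "sink").
      Y(J) = T

      PRINT *, Y(J)
    END PROGRAM HOIST
    ```

    With the scalar in place, the compiler does not have to prove that Y(J) is loop-invariant before keeping the running sum in a register.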

  2. Consider the following array declaration and nested loop code:
    REAL X(8,N)
    DO J = 1,8
      DO I = 1,N
        X(J,I) = 0.0
        CALL SUB1(X)
        ...
      ENDDO
    ENDDO
    
    Assume that the loops cannot be interchanged due to other work in the inner loop. Suppose that the X array is out-of-cache and that the systems on which you will be running the code have interleaved memory systems with eight banks. What performance problem might occur, and how might you change the code to improve the performance?
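
    As a hint, the small program below prints which bank each inner-loop access X(J,I) would touch for a leading dimension of 8 and for a padded leading dimension of 9. It assumes a simple model in which consecutive words map to consecutive banks (bank = word offset modulo 8); real memory systems may interleave differently, and the names NBANKS and LD are hypothetical.

    ```fortran
    PROGRAM BANKS
      IMPLICIT NONE
      INTEGER, PARAMETER :: NBANKS = 8   ! assumed number of memory banks
      INTEGER :: I, J, LD

      J = 1
      DO LD = 8, 9
        PRINT *, 'Leading dimension', LD
        DO I = 1, 8
          ! Zero-based word offset of X(J,I) in column-major storage
          ! is (I-1)*LD + (J-1); the bank is that offset modulo NBANKS.
          PRINT *, '  I =', I, ' bank =', MOD((I-1)*LD + (J-1), NBANKS)
        ENDDO
      ENDDO
    END PROGRAM BANKS
    ```

    Under this model every access with a leading dimension of 8 lands in the same bank, while a leading dimension of 9 cycles through all eight banks. Note that any routine receiving X, such as SUB1, would need to use the same padded declaration.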

  3. Given
    S = DDOT( 10, X, 1, Y, 1 )
    
    where DDOT is defined as
    REAL*8 FUNCTION DDOT( N, X, IX, Y, IY )
    REAL*8 X(0:N-1), Y(0:N-1), S
    S = 0.0D0
    IF (IX .EQ. 1 .AND. IY .EQ. 1) THEN
      ! Unit-stride case
      DO I = 0,N-1
        S = S + X(I) * Y(I)
      ENDDO
    ELSE
      ! General strided case
      DO I = 0,N-1
        S = S + X(I*IX) * Y(I*IY)
      ENDDO
    ENDIF
    DDOT = S
    END
    
    how might a compiler use inlining, constant propagation, and dead code elimination to optimize the code?
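
    As a sketch of what these three transformations could leave behind for the call above: inlining replaces the call with the function body, constant propagation substitutes N = 10, IX = 1, and IY = 1, and dead code elimination removes the stride test together with the strided branch, which can never execute. The surrounding program and the test data are assumptions for illustration only.

    ```fortran
    PROGRAM INLINED
      IMPLICIT NONE
      DOUBLE PRECISION :: X(0:9), Y(0:9), S
      INTEGER :: I

      X = 1.0D0
      Y = 2.0D0

      ! What remains of S = DDOT(10, X, 1, Y, 1) after inlining,
      ! propagating N = 10, IX = 1, IY = 1, and deleting the
      ! unreachable strided branch: one unit-stride loop with a
      ! trip count known at compile time.
      S = 0.0D0
      DO I = 0, 9
        S = S + X(I) * Y(I)
      ENDDO

      PRINT *, S
    END PROGRAM INLINED
    ```

    Because the trip count is now a compile-time constant, the compiler is also free to unroll the loop fully.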

  4. Consider the following loop:
    DO I=1,N
      IF (D(J) .LE. 0.0) X(I) = 0.0
      A(I) = B(I)+C(I)*D(I)
      E(I) = X(I)+F*G(I)
    ENDDO
    
    What is inefficient about this code and how could you re-code it to be more efficient?
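
    One way to approach this, shown as a sketch rather than a definitive answer, is to move a test that does not depend on I out of the loop and write one loop per outcome. The sketch assumes that D is not modified inside the loop and does not alias A, E, or X; the array sizes and values are illustrative.

    ```fortran
    PROGRAM UNSWITCH
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 8   ! assumed problem size
      REAL :: A(N), B(N), C(N), D(N), E(N), G(N), X(N), F
      INTEGER :: I, J

      B = 1.0; C = 2.0; D = -1.0; G = 3.0; X = 5.0
      F = 0.5
      J = 1

      ! D(J) does not change inside the loop, so test it once outside
      ! instead of once per iteration.
      IF (D(J) .LE. 0.0) THEN
        DO I = 1, N
          X(I) = 0.0
          A(I) = B(I) + C(I)*D(I)
          E(I) = X(I) + F*G(I)
        ENDDO
      ELSE
        DO I = 1, N
          A(I) = B(I) + C(I)*D(I)
          E(I) = X(I) + F*G(I)
        ENDDO
      ENDIF

      PRINT *, A(1), E(1)
    END PROGRAM UNSWITCH
    ```

    Each loop body is now branch-free, which generally makes it easier for a compiler to pipeline or vectorize.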

  5. Consider the following loop:
    DO I=1,N
      DO J=1,N
        A(J,I)=B(J,I)*SIN(X(J))
      ENDDO
    ENDDO
    
    How could you rewrite this code to reduce the number of calls to SIN?
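
    A minimal sketch of one possibility: since SIN(X(J)) depends only on J, its values can be computed once into a temporary array and reused for every I. The array SX is a hypothetical helper that costs N extra words of storage; the sizes and input values are assumptions.

    ```fortran
    PROGRAM SINTAB
      IMPLICIT NONE
      INTEGER, PARAMETER :: N = 100   ! assumed problem size
      REAL :: A(N,N), B(N,N), X(N), SX(N)
      INTEGER :: I, J

      B = 1.0
      DO J = 1, N
        X(J) = REAL(J)
      ENDDO

      ! SIN(X(J)) depends only on J: evaluate it N times, not N*N times.
      DO J = 1, N
        SX(J) = SIN(X(J))
      ENDDO

      DO I = 1, N
        DO J = 1, N
          A(J,I) = B(J,I)*SX(J)
        ENDDO
      ENDDO

      PRINT *, A(1,1), A(N,N)
    END PROGRAM SINTAB
    ```

    The loop order is unchanged, so the unit-stride access to A and B is preserved.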

  6. What problem is likely to occur with the following code, and how can it be fixed?
          integer nx,nz
          parameter (nx=2048,nz=2048)   
          real p(2,nx,nz)
    ...
    ...
          do 25 ix=2,nx-1
             do 20 iz=2,nz-1
                p(it1,ix,iz) = -p(it1,ix,iz)
         &                     +s*p(it2,ix-1,iz)
         &                     +s*p(it2,ix+1,iz)
         &                     +s*p(it2,ix,iz-1)
         &                     +s*p(it2,ix,iz+1)
    20       continue
    25     continue
    

  7. Use the following files to execute and measure the performance of variations of matrix-vector multiply using PAPI on the IBM POWER4: Which variation performs the best, and why? What result was surprising, and what does it indicate?

  8. Consider the following blocked matrix multiply code discussed in lecture:
    DO II = 1,N,NB
        DO JJ = 1,N,NB
            DO KK = 1,N,NB
                DO I = II,MIN(N,II+NB-1)
                    DO J = JJ,MIN(N,JJ+NB-1)
                        DO K = KK,MIN(N,KK+NB-1)
                            C(I,J) = C(I,J)+A(I,K)*B(K,J)
                        ENDDO
                    ENDDO
                ENDDO
            ENDDO
        ENDDO
    ENDDO
    
    1. Determine the best block size for the IBM POWER4 L2 cache.
    2. Use outer loop unrolling to increase the F:M ratio (floating-point operations per memory access) as much as possible without causing register spilling. Code up the result and instrument the code with PAPI to measure the relevant hardware counter metrics.
    3. Compare your performance with that of ESSL's dgemm.

shirley@cs.utk.edu