CS360 Lecture notes -- Assembler Lecture #3

  • Jim Plank
  • Directory: /blugreen/homes/plank/cs360/notes/Assembler3
  • Lecture notes: http://www.cs.utk.edu/~plank/plank/classes/cs360/360/notes/Assembler3/lecture.html
    This is the final lecture on assembler. We'll go over branches, recursion, and some other stuff.

    Branch Instructions

    Finally, there are "compare" and "branch" instructions, which are used to implement if, for and while constructs. They work as follows:
       cmp %r0, %r1       This says to compare the values of the registers
                          r0 and r1, and set the control status register 
                          (CSR) to reflect the outcome.   The CSR
                          will store whether (r0==r1), (r0 < r1) or (r0 > r1).
       b l1               This says go (branch) directly to label l1.  This sets 
                          the pc to l1 rather than (pc+4).  Note that you
                          can't "return" from a branch like you can from a
                          "jsr" statement.
       beq l1             This says that if the CSR denotes that the two compared
                          values are equal, go (set the pc) to label l1.  
                          If the two compared values are not equal, the
                          next statement (pc+4) is executed.
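       ble l1             The remaining branches work like beq, testing the first
       blt l1             operand of the preceding cmp against the second:
       bge l1             ble branches on <=, blt on <, bge on >=, bgt on >,
       bgt l1             and bne on !=.
       bne l1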
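
    To make the cmp/branch split concrete, here is a minimal C sketch of it. The type and function names (csr_t, branch_t, csr_cmp, branch_taken) are made up for this illustration and are not part of jassem; the sketch assumes the CSR only needs to remember how the two compared values were ordered.

```c
#include <stdbool.h>

/* Hypothetical model of the CSR: it remembers only how the two
   operands of the last cmp were ordered. */
typedef enum { CSR_LESS, CSR_EQUAL, CSR_GREATER } csr_t;

/* The branch instructions from the table above (enum names are illustrative). */
typedef enum { B, BEQ, BNE, BLT, BLE, BGT, BGE } branch_t;

/* cmp %ra, %rb: record the ordering of a (first operand) and b (second). */
csr_t csr_cmp(int a, int b)
{
    if (a < b) return CSR_LESS;
    if (a > b) return CSR_GREATER;
    return CSR_EQUAL;
}

/* Returns true if the branch sets the pc to its label, and false if
   execution falls through to the next instruction (pc+4). */
bool branch_taken(branch_t op, csr_t csr)
{
    switch (op) {
    case B:   return true;                 /* unconditional */
    case BEQ: return csr == CSR_EQUAL;
    case BNE: return csr != CSR_EQUAL;
    case BLT: return csr == CSR_LESS;
    case BLE: return csr != CSR_GREATER;
    case BGT: return csr == CSR_GREATER;
    case BGE: return csr != CSR_LESS;
    }
    return false;
}
```

    For instance, branch_taken(BGE, csr_cmp(3, 5)) is false, so a bge that follows a comparison of 3 and 5 falls through to the next instruction.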

    Thus, conditional constructs such as if, for and while statements are straightforward. There are multiple ways to translate them. Here is how I recommend translating each type of statement:

    if (cond) {
      S1
    } else {
      S2
    }
    S3

    becomes:

       set up conditional
       branch on the negation of the conditional to l1
       S1
       b l2
    l1:
       S2
    l2:
       S3

    For example:

    int a(int i, int j)
    {
      int k;

      if (i < j) {
        k = i;
      } else {
        k = j;
      }
      return k;
    }

    compiles into:

    a:
       push #4                     / Allocate k on the stack
       ld [fp+12] -> %r0           / Load i into r0
       ld [fp+16] -> %r1           / Load j into r1
       cmp %r0, %r1                / Compare and branch on the negation (greater than or equal)
       bge l1
       ld [fp+12] -> %r0           / k = i
       st %r0 -> [fp]
       b l2
    l1:
       ld [fp+16] -> %r0           / k = j
       st %r0 -> [fp]
    l2:
       ld [fp] -> %r0              / return k
       ret
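
    If the "branch on the negation" step seems backwards, it may help to see the same control flow written in C with gotos. This is only a sketch: a_goto is a made-up name, and each goto stands in for one branch instruction of the assembly above.

```c
/* The if/else example rewritten so that every jump is explicit.
   a_goto is a hypothetical name used only for this sketch. */
int a_goto(int i, int j)
{
    int k;

    if (i >= j) goto l1;   /* cmp + bge l1: the negation of (i < j) */
    k = i;                 /* "then" part, reached by falling through */
    goto l2;               /* b l2: skip over the "else" part */
l1:
    k = j;                 /* "else" part */
l2:
    return k;              /* code after the if statement */
}
```

    Branching on the negation lets the "then" part sit directly after the test, so the common layout of the C source is preserved in the assembly.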

    while (cond) {
      S1
    }
    S2

    becomes:

    l1:
       set up conditional
       branch on the negation of the conditional to l2
       S1
       b l1
    l2:
       S2
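
    The while pattern can be sketched the same way in C with gotos. The function halvings is a made-up example (it is not from these notes); the labels l1 and l2 play the same roles as in the pattern above.

```c
/* Counts how many times n can be halved before it reaches zero.
   halvings is a hypothetical example function. */
int halvings(int n)
{
    int count = 0;

l1:
    if (n <= 0) goto l2;   /* branch on the negation of (n > 0) */
    n = n / 2;             /* loop body */
    count++;
    goto l1;               /* b l1: go back and test again */
l2:
    return count;          /* code after the loop */
}
```

    The test sits at the top of the loop, so a condition that is false on entry skips the body entirely, exactly as a C while loop does.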

    for (S1; cond; S2) {
      S3
    }
    S4

    becomes:

       S1
       b l2
    l1:
       S2
    l2:
       set up conditional
       branch on the negation of the conditional to l3
       S3
       b l1
    l3:
       S4

    For example:
    int a(int k)
    {
      int i, j;

      j = 0;
      for (i = 0; i < k; i++) j += i;
      return j;
    }
    will compile into:
    a:
      push #8                      / Allocate i and j on the stack
      st %g0 -> [fp-4]             / Set j to zero
      st %g0 -> [fp]               / Initialize the for loop  (S1)
      b l2
    l1:
      ld [fp] -> %r0               / Do i++ (S2)
      add %r0, %g1 -> %r0
      st %r0 -> [fp]
    l2:
      ld [fp] -> %r0               / Perform the test, and branch on the negation
      ld [fp+12] -> %r1
      cmp %r0, %r1
      bge l3
      ld [fp-4] -> %r0             / Do j += i  (S3)
      ld [fp] -> %r1
      add %r0, %r1 -> %r0
      st %r0 -> [fp-4]
      b l1
    l3:
      ld [fp-4] -> %r0             / return j (S4)
      ret
    As always, this code can be optimized greatly. I'll leave it to you to figure out how.
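
    As a cross-check on the for loop translation, here is the same function sketched in C with gotos. The name a_for_goto is made up; the labels l1, l2 and l3 are meant to correspond to the branch targets in the assembly above.

```c
/* The for loop example with every jump made explicit.
   a_for_goto is a hypothetical name used only for this sketch. */
int a_for_goto(int k)
{
    int i, j;

    j = 0;
    i = 0;                 /* loop initialization */
    goto l2;               /* b l2: test before the first iteration */
l1:
    i++;                   /* loop increment */
l2:
    if (i >= k) goto l3;   /* branch on the negation of (i < k) */
    j += i;                /* loop body */
    goto l1;               /* b l1: do the increment, then test again */
l3:
    return j;              /* code after the loop */
}
```

    The initial jump to l2 is what keeps the increment from running before the first test.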


    Recursion

    By now, recursive procedures shouldn't seem mysterious. For example:

    int fact(int i)
    {
      if (i == 0) return 1;
      return fact(i-1)*i;
    }
    will compile into:
    fact:
        ld [fp+12] -> %r0          / do the if statement
        cmp %r0, %g0
        bne l1
        mov %g1 -> %r0             / i == 0: return 1
        ret
    l1:
        ld [fp+12] -> %r0          / push i-1 on the stack
        add %r0, %gm1 -> %r0
        st %r0 -> [sp]--
        jsr fact                   / jump to fact
        pop #4                     / pop the argument off the stack
        ld [fp+12] -> %r1          / multiply fact(i-1)*i
        mul %r0, %r1 -> %r0
        ret
    We'll go over the execution in class. Each recursive call pushes a new stack frame. Use jassem.tcl to trace through fact(4) (I have a main that calls fact(4) in fact4.jas).
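
    One way to picture the stack frames before tracing them in jassem.tcl is to replace the recursion with an explicit array. The sketch below is a simplification, not what the assembler does: a real frame also holds the saved pc and fp, while here each "frame" is reduced to the one value that matters, the argument i. The names fact_explicit, saved_i and MAX_ARG are made up, and the 0..12 limit is an assumption of this sketch (12! is the largest factorial that fits in a 32-bit int).

```c
#define MAX_ARG 12   /* limit of this sketch only: 12! still fits in a 32-bit int */

/* Computes i! using an array in place of the call stack, and stores
   the number of simulated frames in *frames_used.
   Returns -1 if i is outside 0..MAX_ARG. */
int fact_explicit(int i, int *frames_used)
{
    int saved_i[MAX_ARG];   /* argument of each call that is waiting on a recursive call */
    int depth = 0;
    int result;

    *frames_used = 0;
    if (i < 0 || i > MAX_ARG) return -1;

    /* Call phase: every call with i != 0 keeps its own i and calls fact(i-1). */
    while (i != 0) {
        saved_i[depth++] = i;
        i = i - 1;
    }
    *frames_used = depth + 1;   /* the waiting calls plus the i == 0 call */

    /* Return phase: the i == 0 call returns 1, then each waiting call
       multiplies the returned value by its own saved i. */
    result = 1;
    while (depth > 0) {
        result = result * saved_i[--depth];
    }
    return result;
}
```

    For fact(4) this uses five frames (i = 4, 3, 2, 1, 0) and returns 24. Each frame having its own copy of i is what lets the multiply after the jsr reload the right value from [fp+12].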

    One More Example

    I won't go over this in detail here, but behold bsort.c. This is a simple bubble sort of a 4-element array:
    void bsort(int *a, int size)
    {
      int i, j, tmp;

      for (i = size-1; i > 0; i--) {
        for (j = 0; j < i; j++) {
          if (a[j] > a[j+1]) {
            tmp = a[j];
            a[j] = a[j+1];
            a[j+1] = tmp;
          }
        }
      }
    }

    int main()
    {
      int array[4];

      array[0] = 6;
      array[1] = 1;
      array[2] = 4;
      array[3] = 2;
      bsort(array, 4);
    }
    There are a lot of array operations here, so the assembly code is lengthy. It is in bsort.jas, and below:

    bsort:
       push #12                 / i=fp-8, j=fp-4, tmp=fp
       st %r2 -> [sp]--         / Spill r2
                                / For loop #1: labels f11, f12, f13
       ld [fp+16] -> %r0        / i = size-1
       add %r0, %gm1 -> %r0
       st %r0 -> [fp-8]
       b f12
    f11:
       ld [fp-8] -> %r0         / i--
       add %r0, %gm1 -> %r0
       st %r0 -> [fp-8]
    f12:
       ld [fp-8] -> %r0         / i > 0
       cmp %r0, %g0
       ble f13
                                / For loop #2: labels f21, f22, f23
       st %g0 -> [fp-4]         / j = 0
       b f22
    f21:
       ld [fp-4] -> %r0         / j++
       add %r0, %g1 -> %r0
       st %r0 -> [fp-4]
    f22:
       ld [fp-4] -> %r0         / j < i
       ld [fp-8] -> %r1
       cmp %r0, %r1
       bge f23
                                / If (a[j] > a[j+1])
       ld [fp-4] -> %r0         / First put a[j] into register r0
       mov #4 -> %r1
       mul %r0, %r1 -> %r0
       ld [fp+12] -> %r1
       add %r0, %r1 -> %r0
       ld [r0] -> %r0
       ld [fp-4] -> %r1         / Now put a[j+1] into register r1
       add %r1, %g1 -> %r1      / without touching r0
       mov #4 -> %r2           
       mul %r1, %r2 -> %r1
       ld [fp+12] -> %r2
       add %r1, %r2 -> %r1
       ld [r1] -> %r1
       cmp %r0, %r1
       ble i1
       ld [fp-4] -> %r0         / tmp = a[j]
       mov #4 -> %r1
       mul %r0, %r1 -> %r0
       ld [fp+12] -> %r1
       add %r0, %r1 -> %r0
       ld [r0] -> %r0
       st %r0 -> [fp]
       ld [fp-4] -> %r0         / a[j] = a[j+1]
       add %r0, %g1 -> %r0      / Load a[j+1] into r0
       mov #4 -> %r1
       mul %r0, %r1 -> %r0
       ld [fp+12] -> %r1
       add %r0, %r1 -> %r0
       ld [r0] -> %r0
       ld [fp-4] -> %r1         / Load &(a[j]) into r1
       mov #4 -> %r2            
       mul %r1, %r2 -> %r1
       ld [fp+12] -> %r2
       add %r1, %r2 -> %r1
       st %r0 -> [r1]           / Store r0 into a[j]
       ld [fp] -> %r0           / a[j+1]  = tmp
       ld [fp-4] -> %r1        
       add %r1, %g1 -> %r1    
       mov #4 -> %r2            
       mul %r1, %r2 -> %r1
       ld [fp+12] -> %r2
       add %r1, %r2 -> %r1
       st %r0 -> [r1]
    i1:                         / End of if statement
       b f21                    / End of for loop #2
    f23:
       b f11                    / End of for loop #1
    f13:
       ld ++[sp] -> %r2         / Unspill r2
       ret

    main:
       push #16                 / array[0]=fp-12, ..., array[3]=fp
       mov #-1 -> %r2           / This is just to show spilling
       mov #6 -> %r0
       st %r0 -> [fp-12]
       mov #1 -> %r0
       st %r0 -> [fp-8]
       mov #4 -> %r0
       st %r0 -> [fp-4]
       mov #2 -> %r0
       st %r0 -> [fp]
       mov #4 -> %r0
       st %r0 -> [sp]--
       mov #12 -> %r0
       sub %fp, %r0 -> %r0
       st %r0 -> [sp]--
       jsr bsort
       pop #8
       ret
    Running this with jas is a bit cumbersome -- it goes blazingly fast on my Linux box, but not on my Windows box -- since this is not the most efficient Tcl/Tk code in the world. Oh well. As always, make sure you understand both the translation to assembly code, and the workings of the assembler. Yes, this code is grossly inefficient and can be made worlds faster with the judicious use of some registers.
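
    Most of the length of the assembly comes from recomputing the address of a[j] or a[j+1] on every access: multiply the index by 4, then add the base pointer stored at [fp+12]. The C sketch below makes that arithmetic explicit. The names elem_addr and bsort_addr are made up for this illustration, and it uses sizeof(int) where the assembly hardcodes 4, on the assumption that ints are 4 bytes on the machine being modeled.

```c
/* Address of a[j], computed the way the assembly does it:
   scale the index by the element size, then add the base pointer.
   elem_addr is a hypothetical helper for this sketch. */
int *elem_addr(int *a, int j)
{
    return (int *)((char *)a + j * (int)sizeof(int));
}

/* The same bubble sort as bsort(), with every array access spelled
   out as an explicit address computation followed by a load or store. */
void bsort_addr(int *a, int size)
{
    int i, j, tmp;

    for (i = size - 1; i > 0; i--) {
        for (j = 0; j < i; j++) {
            if (*elem_addr(a, j) > *elem_addr(a, j + 1)) {
                tmp = *elem_addr(a, j);                    /* tmp = a[j]    */
                *elem_addr(a, j) = *elem_addr(a, j + 1);   /* a[j] = a[j+1] */
                *elem_addr(a, j + 1) = tmp;                /* a[j+1] = tmp  */
            }
        }
    }
}
```

    Calling bsort_addr on the array {6, 1, 4, 2} with size 4 leaves it as {1, 2, 4, 6}. Counting the calls to elem_addr in the inner loop gives a rough idea of how much repeated work a register holding &a[j] could save.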

    Delay Slots

    I will only go over this in class if there is time. If not, only read this if you are interested.

    In all assembler assignments in class, in homeworks and on tests, assume that there is no delay slot. This is just for your own knowledge.

    Reading assembler from a random machine can be difficult, but usually you can figure out how its assembler maps into the one defined in this class. One point of confusion on our Sparc processors (and on a few other RISC architectures, such as MIPS) is the delay slot. There is a technique for speeding up processors called "pipelining", which means that the CPU doesn't finish executing the current instruction before it starts executing the next instruction. Usually, this does not cause much confusion. However, on jsr, ret and b instructions, there is a problem: these instructions change the pc, which means that the next instruction should not be executed. But on a pipelined processor, by the time the instruction is done, the next instruction has already been partially executed.

    The solution on our Sparcs is that the instruction after the jsr, ret and b is executed, and then control goes to the changed value of the pc. This instruction -- the one after the jsr, ret or b -- is known as the delay slot. Note that the semantics of jsr must change too -- it must push pc+8 onto the stack so that when ret is called, it returns to the instruction after the delay instruction.
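
    To see the effect of a delay slot in isolation, here is a toy interpreter sketched in C. Everything in it is an assumption made for illustration: the instruction set (add-immediate, b, noop, halt), the names instr_t and run, a pc that counts instructions rather than bytes, and the simplification that only an add or a noop may sit in a delay slot. It does not model jsr, ret or pipelining itself, only the rule that the instruction after a branch executes before control moves to the branch target.

```c
#include <stdbool.h>

typedef enum { OP_ADD, OP_B, OP_NOOP, OP_HALT } op_t;

typedef struct {
    op_t op;
    int arg;   /* OP_ADD: amount added to r0; OP_B: index of the branch target */
} instr_t;

/* Runs prog (n instructions) and returns the final value of r0.
   With delay_slot set, the instruction after a b is executed before
   the pc changes to the branch target. The step limit only guards
   against runaway loops in this toy. */
int run(const instr_t *prog, int n, bool delay_slot)
{
    int r0 = 0;
    int pc = 0;
    int steps = 0;

    while (pc >= 0 && pc < n && steps++ < 1000) {
        instr_t in = prog[pc];

        if (in.op == OP_HALT) break;
        if (in.op == OP_ADD) r0 += in.arg;
        if (in.op == OP_B) {
            /* Delay slot: run the next instruction first (adds only;
               a noop has no effect, anything else is ignored here). */
            if (delay_slot && pc + 1 < n && prog[pc + 1].op == OP_ADD) {
                r0 += prog[pc + 1].arg;
            }
            pc = in.arg;
            continue;
        }
        pc++;
    }
    return r0;
}
```

    Take the six-instruction program add 1; b 4; add 10; add 100; add 1000; halt (indices 0 to 5). Without a delay slot, run returns 1001, because the branch skips both add 10 and add 100. With a delay slot it returns 1011, because add 10 executes before control reaches index 4.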

    It is up to the compiler-writers to ensure that this slot is used correctly. Without optimization, most compilers simply insert a noop after the jsr, ret or b. For example:

    int a(int i)
    {
      return b(i+1)+1;
    }
    compiles to:
    a:
      ld [fp+12] -> %r0            / Push i+1 onto the stack
      add %r0, %g1 -> %r0
      st %r0 -> [sp]--
      jsr b                        / Call procedure b
      noop                         / Delay slot
      pop #4
      add %r0, %g1 -> %r0          / Put b(i+1)+1 into r0
      ret                          / return
      noop                         / delay slot
    An optimizing compiler, however, will use the delay slot, which makes code harder to read, since you have to remember that the instruction after the jsr, ret or b gets executed. Moreover, subroutines return to the instruction after the instruction after the jsr call. Here's an example of the above procedure compiled in such a way that the delay slots following the jsr and ret statements are used.
    a:
      ld [fp+12] -> %r0            / Push i+1 onto the stack
      add %r0, %g1 -> %r0
      jsr b                        / Call procedure b
      st %r0 -> [sp]--             / Delay slot: the push happens here
      pop #4
      ret                          / return
      add %r0, %g1 -> %r0          / Put b(i+1)+1 into r0 -- this gets executed
                                   / before the return actually occurs.