My goal is to teach you how assembly code is generated by a compiler to work on a standard processor. I don't do this in a precise manner, as would be done in a compiler course, but in a manner that is intended to be more intuitive.
The assembly code is a made-up RISC assembly code that works on a fictitious machine that has 4-byte pointers and no floating point. I have written a visual assembler for it so that you can see it in action.
We'll assume that our machine has 8 general-purpose registers in the CPU. All are 4 bytes, and can be read or written by the user. The first five are named r0, r1, r2, r3, r4. The last three registers are special:
Additionally, the computer has three read-only registers, which always contain the same values:
Finally, the computer also has two special registers that the user cannot access directly:
In other words, if the pc contains the value 0x2040, then the IR is executing the instruction contained in the 4 bytes starting at memory address 0x2040.
Assembly code is a readable encoding of instructions. A program called an assembler converts assembly code into the proper 0's and 1's that compose the program. If you call gcc with the -S flag, it will produce a .s file containing the assembler for that C program. Without the -S flag, it produces the instructions directly.
ld mem -> %reg Load the value of the register from memory. st %reg -> mem Store the value of the register into memory.There are a few ways to address memory:
st %r0 -> i Store the value of register r0 into the memory location of global variable i. st %r0 -> [r1] Treat the value of register r1 as a pointer to a memory location, and store the value of r0 in that memory location. st %r0 -> [fp+4] Treat the value of the frame pointer as a pointer to a memory location, and store the value of r0 in the memory location 4 bytes after that location. You can use any value, positive or negative -- the value must be a multiple of four. However, you cannot use a register (i.e. you can't do st %r0 -> [fp+r2]). This only works with the frame pointer. It does not work with any other register. st %r0 -> [sp]-- Treat the value of register sp as a pointer to a memory location, store the value of r0 into that memory location, and then subtract 4 to the value of sp. st %r0 -> ++[sp] Treat the value of register sp as a pointer to a memory location. First, add 4 to that value, then store the value of r0 into that memory location.
mov %reg -> %reg Copy a register's value to another mov #val -> %reg register, or set its value to a constant.All arithmetic goes from register to register:
add %reg1, %reg2 -> %reg3 Add reg1 & reg2 and put the sum in reg3. sub %reg1, %reg2 -> %reg3 Subtract reg2 from reg1. mul %reg1, %reg2 -> %reg3 Multiply reg1 & reg2. idiv %reg1, %reg2 -> %reg3 Do integer division of reg2 into reg1. imod %reg1, %reg2 -> %reg3 Do reg1 mod reg2.There are two special instructions that let you perform addition and subtraction on the stack pointer:
push %reg This subtracts the value of %reg or #val push #val from the stack pointer. pop %reg This adds the value of %reg or #val pop #val to the stack pointer.
jsr a Call the subroutine starting at instruction a. ret Return from a subroutine.There are also "compare" and "branch" instructions, which is how you implement for and if statements, but I won't go over them yet.
Finally, there are also "directives" which are not really code, but specify that memory must be allocated for variables. In this assembler, this is just one such directive:
.globl i Allocate 4 bytes in the globals segment for the variable i.
The program counter points to where the instruction register must go to load its value. On normal instructions, the pc is incremented by 4 so that the next instruction can be loaded. On control instructions, the pc gets a new value, allowing the machine to call subroutines, perform "if-then" statements, etc.
Generally, a process treats memory like a huge array of bytes; however, the bytes are organized logically into units of 4 bytes each, as that is the size of registers. We assume that this memory is of size 0x80000000. The code starts at address 0x1000. The globals follow the code, starting in the first address that is a multiple of 4096 (0x1000). On a real machine, the heap would follow the code, or start at some other address which is a multiple of 4096. With jassem, there is no heap.
As a program executes, the stack will grow and shrink, but the code and globals stay the same size. The stack grows from back to front, starting at address 0x80000000 and growing toward lower memory addresses. In between the globals and stack is unused memory:
The programs' address space: |--------------------------|<------- 0x00000000 | No access (Seg faults) | |--------------------------|<------- 0x00001000 | | | code | | | |--------------------------| | | | globals | | | |--------------------------| .... Unused memory .... | | | (grows up) | |^^^^^^^^^^^^^^^^^^^^^^^^^^| | | | stack | | | |--------------------------|<------- 0x80000000
int i; int j; int main() { i = 3; j = 2; j = i + j; } /* I am intentionally not returning anything. */ |
Will compile into the following assembly code:
.globl i / Allocate i and j as global variables. .globl j main: mov #3 -> %r0 / i = 3 st %r0 -> i mov #2 -> %r0 / j = 2 st %r0 -> j ld i -> %r0 / j = i + j ld j -> %r1 add %r0,%r1 -> %r1 st %r1 -> j ret |
This code is pretty straightforward. Each instruction in C has a corresponding set of instructions in assembler. Unless your compiler is smart, it will produce inefficient code. For example, you can probably see that:
.globl i .globl j main: mov #3 -> %r0 mov #2 -> %r1 add %r0,%r1 -> %r1 st %r1 -> j st %r0 -> i retwould work just as well, and has fewer instructions. If you call gcc with the -O flag, it will attempt to optimize your code so that it has fewer instrcutions. However, normally, gcc simply produces straightforward, unoptimized code. In this class (I'll repeat this over and over), we are going to produce unoptimized code, which means that every C statement is translated to assembly code independently. No assumptions are made from one statement to the next. We leave compiler optimization to another course (like a compiler course)
Now, suppose instead that we have the following code to run:
int main() { int i, j; i = 3; j = 2; j = i + j; } |
Since i and j are local variables, they must come from temporary storage: The stack. How does the stack work? It is governed by the sp and fp registers. The sp and fp designate what is known as a "frame" on the stack. The fp points to the bottom of the frame, and the sp points to the top. All memory locations above (less than or equal to) the stack pointer are considered unused. Thus, we can get new temporary memory by decrementing the sp, thus putting memory locations into the current stack frame.
For example, When a procedure is first called, these two registers point to the same place in the stack. The frame is considered empty.
.... |----------------| | unused | |----------------| | unused | <------------- sp, fp |----------------| | used | |----------------| | used | |----------------| .... |----------------| | used | |----------------| <---------- Location 0x80000000To allocate room for the two local variables i and j, we decrement the stack pointer by 8. This allocates two 4-byte quantities in the current stack frame: By convention, we'll call the lower one j, and the upper one i. This is something that the compiler defines. We could just as easily have called the lower one i, and the upper one j:
.... |----------------| | unused | <------------- sp |----------------| | i | |----------------| | j | <------------- fp |----------------| | used | |----------------| | used | |----------------| .... |----------------| | used | |----------------| <---------- Location 0x80000000Now, the code for main() is just like before, only instead of accessing i and j as global variables, we access them as offsets to the frame pointer.
main: push #8 / This allocates i and j mov #3 -> %r0 st %r0 -> [fp-4] / Set i to 3 mov #2 -> %r0 st %r0 -> [fp] / Set j to 2 ld [fp-4] -> %r0 ld [fp] -> %r1 add %r0,%r1 -> %r1 / Add i and j and put the result st %r1 -> [fp] / back into j ret |
Let's look at what happens when main() is executed:
Stack |----------------| | | registers | | |-----------------| | | | | r0 | | | | r1 | ..... | | | r2 | unused | | | r3 | unused | | | r4 | unused | /----------- | | sp | unused | <------------------ | | fp | used | | main | pc | .... | |-----------------| |--------------- |Note that the fp and sp point to the base of the empty stack frame. The pc points to the beginning of the main routine. This is the instruction "push #8". When this is done executing, we have the following:
Stack registers |----------------| |-----------------| | | | | r0 | | | | r1 | ..... | | | r2 | unused | | | r3 | unused | <---------------\ | | r4 | i | \- | | sp | j | <------------------ | | fp | used | | main + 4 | pc | .... | |-----------------| |--------------- |Space has been allocated on the current stack frame for i and j, and the pc has been incremented. It now points to the instruction "mov #3 -> %r0" This puts the machine into the following state:
Stack registers |----------------| |-----------------| | | | 3 | r0 | | | | r1 | ..... | | | r2 | unused | | | r3 | unused | <---------------\ | | r4 | i | \- | | sp | j | <------------------ | | fp | used | | main + 8 | pc | .... | |-----------------| |--------------- |Now, the pc points to "st %r0 -> [fp-4]". When this is done, the location for i is set to the value 1:
Stack registers |----------------| |-----------------| | | | 3 | r0 | | | | r1 | ..... | | | r2 | unused | | | r3 | unused | <---------------\ | | r4 | i=3 | \- | | sp | j | <------------------ | | fp | used | | main + 12 | pc | .... | |-----------------| |--------------- |After the next two instructions, the state of the machine will be:
mov #2 -> %r0 st %r0 -> [fp] Stack registers |----------------| |-----------------| | | | 2 | r0 | | | | r1 | ..... | | | r2 | unused | | | r3 | unused | <---------------\ | | r4 | i=3 | \- | | sp | j=2 | <------------------ | | fp | used | | main + 20 | pc | .... | |-----------------| |--------------- |Finally, the last 4 instructions do:
ld [fp-4] -> %r0 ld [fp] -> %r1 add %r0,%r1 -> %r1 st %r1 -> [fp] Stack registers |----------------| |-----------------| | | | 2 | r0 | | | 3 | r1 | ..... | | | r2 | unused | | | r3 | unused | <---------------\ | | r4 | i=3 | \- | | sp | j=3 | <------------------ | | fp | used | | main + 36 | pc | .... | |-----------------| |--------------- |
You can get this file from this directory, in jassem.tcl.
The nice thing about tcl/tk is that it works on Unix, Windows, and Macintosh. To use jassem.tcl on our machines, simply run :
UNIX> wish ~jplank/cs360/bin/jassem.tcl [filename]Wish should be installed on all of our machines.
To use jassem.tcl on a Windows or Macintosh machine, you will need to install tcl/tk. This is free -- get the code from www.scriptics.com.
In running jassem.tcl, the first thing you do is load a program, such as p1-g.jas (the program above that adds global variables) or p1.jas (the program above that adds local variables). You should see a picture of the system -- stack, registers, globals and code, as below:
Now you can step through the program, looking at everything as you go. Make sure you understand each step as you go through it. This is a very helpful tool.
In jassem, the code starts at 0x1000, and global variables, if they exist, start at the next multiple of 0x1000. There is no heap. The stack ends at 0xfff44c. That's pretty arbitrary, but such is life.
>1. Does the term uniprocessor mean that we have only one CPU, or is there > something else I should understand?That is correct -- one CPU (as opposed to a parallel processor that has many CPUs)
>2. In the register <--> memory operation > st %r0 --> [r1] > the notes said "... treat the value of register r1 as a pointer to a memory > location ...". Except for the pc, fp and the sp do we ever know where in > memory a register is pointing ?You can assume that the pc points to the code, and the fp/sp both point to memory in the stack. Even these assumptions can be violated in some systems if you are doing complex stuff (I won't go into it). Otherwise, you cannot assume that a register is pointing to a specific memory segment. r1's pointer can point to the code, globals, heap or stack.
>3. Does each process assume that the address space is 0x80000000 ? If this is > the number of bytes it looks huge to me. I got exactly 2048Mb. The hydras, > for example, have only 96Mb of RAM. > I got 2048 by doing > 8*(16^7)/(1024^2). > Am I making any wrong assumptions ? Things don't seem right.Yes, the process assumes that memory is an array of 2 GB. However, it won't use all 2 GB. In particular, the addresses between the stack and heap are unused, and they compose the bulk of the address space. Even though a processor may have much less than 2 GB in RAM, the system is set up to look as though each process can access 2 GB. This is called "virtual memory", and is something that you'll learn about in CS361. In a few weeks, we'll see how your interface to memory is limited. In particular, usually your code and globals segment is smaller than a megabyte. On my machine, the OS does not allow the stack to grow larger than 8M or so, (type "limit" and look at "stacksize") and it does not allow the heap to grow too much larger than 96MB. If you don't believe me -- try it:
The program test1.c tests the heap:
#include <stdio.h> #include <stdlib.h> #include <unistd.h> int main(int argc, char **argv) { char *s; long sz; if (argc != 2 || sscanf(argv[1], "%ld", &sz) == 0) { fprintf(stderr, "usage: test1 bytes\n"); exit(1); } s = (char *) malloc(sz); if (s == NULL) { perror("malloc"); exit(1); } printf("malloc %ld worked\n", sz); return 0; } |
In 2018, you had to allocate quite a bit of memory to have the hydra's fail:
UNIX> gcc test1.c UNIX> a.out 6000000000 malloc 6000000000 worked UNIX> a.out 7000000000 malloc: Cannot allocate memory UNIX>My Raspberry Pi fails quicker:
pi@raspberrypi:~$ gcc test1.c pi@raspberrypi:~$ ./a.out 400000000 malloc 400000000 worked pi@raspberrypi:~$ ./a.out 500000000 malloc: Cannot allocate memory pi@raspberrypi:~$This program (test2.c) tests the stack:
#include <stdio.h> int main() { char s[9000000]; printf("s = 0x%x\n", s); printf("%d\n", *s); return 0; } |
As you can see below, our stacks are (as of 2018) limited to 8 MB, which seems like plenty of memory to me. So, when test2.c tries to allocate 9,000,000 bytes on the stack, you get a seg fault:
UNIX> limit | grep stack stacksize 8192 kbytes UNIX> gcc test2.c UNIX> ./a.out Segmentation fault UNIX>If we instead only allocate 8,000,000 bytes, it succeeds:
UNIX> sed 's/90/80/' test2.c > test3.c UNIX> gcc test3.c UNIX> a.out s = 0xaf67d5d0 0 UNIX> rm test3.c a.out UNIX>