Stage 1 : Code generation for Arithmetic Expressions¶
Time estimate
0.5 week, 5-10 hours/week
Prerequisites
- You must be comfortable with LEX and YACC. ( If you are not, you must first do LEX tutorial, YACC Tutorial and Using Lex with Yacc tutorials.)
- You must have completed the XSM environment tutorial including all the exercises before staring this stage.
Learning Objectives
In this stage, you will:
- Parse an input arithmetic expression and create an expression tree using YACC and LEX.
- Recursively traverse the tree and generate assembly language code. The allocation of registers for storing results of intermediate computations will be handled enroute.
A compiler is a software that takes as input a high level program and produces a machine recognizible target program as output. The high level program typically allows variables, expressions, conditionals, iterative constructs, user defined types, functions etc. The low level target program on the other hand will be a sequence of assembly level instructions that can be run on a target machine (which is the XSM simulator in this project).
The strategy of the roadmap is to help you build the compiler in stages. We start here by building a very simple compiler whose input (high level) program contains only simple arithmetic expressions. In subsequent stages, we will add more and more constructs to the input language one by one, learning the relevant theoretical concepts along the way.
We assume that you have implemented the library routine for handling console output, which you were asked to do in the XSM execution environment tutorial.
Consider arithmetic expressions with the following syntax.
E : E + E | (E) | NUM
Where the lexeme NUM correspond to integers. Assume left associativity for +
, Thus, the tokens relevant are NUM
and +
. The attribute value associated with a number is the number read. Assume that the input file is passed as argument to the main()
function in YACC.
The lexer must pack the attribute into a tree node of the following structure:
typedef struct tnode{
int val;
char *op; //indicates the name of the operator for a non leaf node
struct tnode *left, *right; //left and right branches
} tnode;
#define YYSTYPE tnode*
Since the semantics actions in the parser must build the tree, the following function must be written:
/*Make a leaf tnode and set the value of val field*/
struct tnode* makeLeafNode(int n);
/*Make a tnode with operator, left and right branches set*/
struct tnode* makeOperatorNode(char op,struct tnode *l,struct tnode *r);
Task 1
Build the expression tree for the given input.
Exercise 1
Output the prefix and postfix forms of the expression from the tree.
Note
You would have already completed this task if you have done the Using Yacc With Lex tutorial
Now, comes the next task - to generate assembly language program equivalent for the expression and write it out into an executable file in the XEXE format. Once this is done, you can use the simulator to load the XEXE file into the memory of the XSM machine and execute it as outlined in the XSM run time environment tutorial.
To do this, one needs to know the following:
- The machine model and the instruction set of the target machine.
- Format of the executable file.
-
You need to know the address in the memory (in the target machine) where each instruction you generate will be loaded (by the OS loader). This is because program control instructions like
JMP
,CALL
etc., requires specification of the jump address.As already outlined in the XSM run time environment tutorial, the header will be loaded into addresses 2048-2055. The first instruction generated by you will be loaded to the address 2056. Each XSM instruction occupies 2 memory words. Hence, the next instruction will be loaded at address 2058 and so on. The entry point field of the header must contain the address of the first instruction to be fetched and executed.
-
You need to fix the memory addresses where variables and other data is stored. For example, for each variable in the program, the compiler will have to allocate storage space in memory. The ABI stipulates that the region for this is the stack region. Thus each variable must be stored in some address between 4096 and 5119.
-
Since XSM machine stipulates that arithmetic and logic instructions can be done only when operands are loaded into machine registers, we need to load the contents of variables/constants in the program into the machine registers before processing them. This brings in the problem of register allocation. The XSM machine makes available 20 registers (
R0
-R19
) for the compiler.
Of the above, the XSM execution environment tutorial has already explained (1) and (2). Evaluation of expressions do not involve either storage allocation or program control transfer (JMP
). Hence, we will not take up (3) and (4) at this stage. However, we need to solve (5) now.
What must be the evaluation strategy?¶
Let us take an example: If you are given a two node expression tree as shown below corresponding to the expression (3+2):
The evaluation strategy will be:
- Store 3 in a register – say R0.
- Store 2 in a register – say R1.
- ADD R0, R1.
The result will be stored in R0 and is sufficient for us. To generate code for the above tasks and write it into a target_file
, you must write code as:
fprintf(target_file, "MOV R0, 3");
fprintf(target_file, "MOV R1, 2");
fprintf(target_file, "ADD R0, R1");
However, life becomes complicated if we have an expression like (3+2)+(5+6)
resulting in the following tree.
Of course, we can “hand craft” this case also. But the strategy will not generalize. The basic issue is that your compiler does not know the input expression before-hand. Technically speaking, the issue is that the “expression is not available at compile time, but only known at run time”. Your code generation module must be more "intelligent" to handle arbitrary expressions.
The root of the problem with the above code is that R0 and R1 were picked by you and not by your compiler. Thus, we must have a register assignment policy (basically a function) that returns a free register whenever we require one. That is, you must design the following functions:
int getReg() // Allocate a free register
That returns the register number of an unallocated register, so that your code for adding 3 and 2 would look like:
int p = getReg();
int q = getReg();
fprintf(target_file, "MOV R%d, 3", p);
fprintf(target_file, "MOV R%d, 2", q);
fprintf(target_file, "ADD R%d, R%d", p, q);
In addition to allocating registers, you must also have mechanism to release a register back into the register pool. In the above example, after the ADD instruction is generated R1 can be released and send back to the register pool.
For this purpose, you will write a function
freeReg() // Releases a register.
To make the allocation strategy simple, we suggest that you generate target code in such a way that the result of a CPU instruction involving two registers will be always stored in the register with lower index. In the code above the result of the computation is kept in R0
and not R1
so that the register with the higher index value can be released. As a consequence, the freeReg()
function does not require any arguments. Instead, freeReg()
and getReg()
can be designed to keep track of the highest numbered register allocated so far and hence can keep track of the correct register that must be allocated or freed.
The following summarizes the register allocation strategy:
-
Whenever a register is needed, allocate the lowest numbered register that is free. (Thus, give R0 if possible, otherwise R1 etc.)
-
Whenever we free a register, always release the highest used register that was allocated previously. (Thus, if R0, R1 and R2 were allocated,
freeReg()
must release R2).
Finally, we must design a code generation module. The strategy here is to start with an expression tree and do the following:
- At the leaf nodes of the tree (corresponding to a NUM), Allocate a new register and store the number to the register.
-
At the intermediete nodes:
a. Generate code for the left subtree (recursively). Find out the register holding the result.
b. Evaluate the right subtree (recursively). Find out the register holding the result.
c. ADD the contents of the two registers and store the result in the lower numbered register.
d. Release the higher numbered register and return.
In the above box, as step 2.a and 2.b requires finding the index of the register which stores the result of expression evaluation. The simplest strategy is to design a codeGen()
function that can take as input an expression tree and generates code for the expression, returning the index of the register storing the result:
#define reg_index int;
reg_index codeGen( struct node *t) {
..
..
return register number storing result
}
The codeGen() function takes as input a pointer to the root of an expression tree and generates code for the subtree rooted at that node. After generating code, the function must return the index of the register storing the result. See this link for furthur details.
Task 2
Complete the simple compiler for expression evaluation and generate the executable file. The result of expression evaluation may be stored in the first location of the stack region – memory address 4096. This value may be printed out using the write system call. Note that the XEXE executable format must be adhered so that the XSM simulator can load and execute the file.
Note
To run the simulator, you must prepare the library.lib together with the XEXE executable file. Please follow instructions in the XSM environment tutorial.
Exercise 2
Modify the grammar to
E : E + E | E*E | E-E| E/E | (E) | NUM
Exercise 3
Redo Exercise 2 assuming that the input expression is given in prefix from.
Note
Here we assumed that machine registers never get exhausted. XSM provides 20 general purpose registers and these registers are sufficient for all practial purposes. However, if all registers are exhausted, then space will have to be allocated in memory. We will not address this contingency in this roadmap. If register pool is exhausted, your compiler may stop compilation and flag "Out of registers" error.