Writing a simple Compiler on my own - Generating Code for Expressions (part 2)




[Custom Thumbnail]

All the Code of the series can be found at the Github repository:

https://github.com/drifter1/compiler


Introduction

    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will tweak some of the previous code and generate the code of Arithmetic expressions and "simple" Variable References.

Requirements:

    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"
  • How we declare types and check the type inter-compatibility for different cases that include function parameters and assignments
  • How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
  • Intermediate Code Representations, including AST's
  • How we implement an AST (structure and management)
  • Action Rules for AST nodes and more
  • The target system's architecture and its assembly language (MIPS)
  • Register Allocation & Assignment and how we implement it
  • How we generate Assembly code for various cases

Difficulty:

Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Tweaking Previous Code

    During the implementation of the new Code, I found out some things that have to be changed in order for them to work correctly or to just simplify things a little more...

expression_data_type() Function

    The first change is inside of the ast.c function "expression_data_type" that returns the datatype of any given node/expression. Knowing that the actual datatype of a pointer is an integer, we will return INT_TYPE instead of the inf_type entry that can easily return REAL_TYPE for pointers to real variables.

So, the REF_NODE case in the switch-case of that function now becomes:
case REF_NODE:    /* identifier reference */
    temp_ref = (AST_Node_Ref *) node;
    /* if "simple" type */
    int type = temp_ref->entry->st_type;
    if(type == INT_TYPE || type == REAL_TYPE || type == CHAR_TYPE){
        return temp_ref->entry->st_type;
    }
    /* if array */
    else if(type == ARRAY_TYPE){
        return temp_ref->entry->inf_type;
    }
    else if(type == POINTER_TYPE){
        return INT_TYPE;
    }
    break;

main_reg_allocation() Function

    Thinking about the stuff that we talked about in part 1, relational and equality nodes don't get stored inside of temporary variables, but are translated into branch instructions. This tells us that the REL_NODE and EQU_NODE cases don't have to insert a new temporary variable. For those types of nodes we also don't have to insert edges to the adjacency graph.

    So, we will just comment-out any call of the function that's about conditions and empty-out those two previously mentioned node-cases.

The new code is:
...
case REL_NODE: /* used in branch and loop conditions - not stored */ break; case EQU_NODE: /* used in branch and loop conditions - not stored */ break;
...
case IF_NODE: temp_if = (struct AST_Node_If *) node;
//main_reg_allocation(temp_if->condition);
...
case ELSIF_NODE: temp_elsif = (struct AST_Node_Elsif *) node;
//main_reg_allocation(temp_elsif->condition); main_reg_allocation(temp_elsif->elsif_branch); break; case FOR_NODE: temp_for = (struct AST_Node_For *) node;
//main_reg_allocation(temp_for->initialize);
//main_reg_allocation(temp_for->condition);
main_reg_allocation(temp_for->for_branch);
//main_reg_allocation(temp_for->increment);
break; case WHILE_NODE: temp_while = (struct AST_Node_While *) node; //main_reg_allocation(temp_while->condition);
main_reg_allocation(temp_while->while_branch); break;
....

generate_statements() Function

    In order to make the load and store operations of the main "declared" variables less, we will add lots of new edges to the AdjGraph, so that those variables have a different register with each other and all the temporary variables.

The Code for adding those edges is:

for(i = 0; i < var_count - temp_count; i++){
    for(j = 1; j < var_count; j++){
        if(i < j){
            insertEdge(i, j);
            insertEdge(j, i);
        }
    }
}

GetRegisterName() Function

    To simplify the naming process of registers, we create a new function that takes in two parameters: the color and if the registers if GP or FP. So, each variable of the varArray is basically allocated two registers: one GP and one FP. Lots of the GP and FP registers will be allocated for other specific purposes (FP constants, type conversion etc.), but we will still have more than 10 GP registers available and 12 dual-FP registers available. Those numbers might change, but for now that's it!

    The GP registers that can be used are $s0-$s7 and $t0-$t7. The FP registers that are available are $f0-$f24. As we said last time, register $f30 is reserved for conversion purposes. Additionally, we will now also reserve register $f28 for floating-point constants and register $26 for the conversion of the result case: double to int.

Either way, for now the function's Code is:
char * GetRegisterName(int color, int isFloat){
    char* regName;
    regName = (char*) malloc(5 * sizeof(char));
if(isFloat == 0){ switch(color){ /* callee saved values */ case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: sprintf(regName, "$s%d", color); break; /* caller saved temporaries */ case 8: case 9: case 10: case 11: case 12: case 13: case 14: case 15: sprintf(regName, "$t%d", color - 8); break; default: fprintf(stderr, "Too many GP registers!\n"); exit(1); } } else{ switch(color){ /* callee saved values */ case 0: case 1: case 2: case 3: case 4: case 5: case 6: case 7: case 8: case 9: case 10: case 11: case 12: sprintf(regName, "$f%d", color*2); break; default: fprintf(stderr, "Too many FP registers!\n"); exit(1); } }
return regName; }

Generating the Main Function

main_func_traversal() Function

    To generate the code of the main funciton we have to define a new function that again uses the same "recursive" concept as "ast_traversal" or the newer "main_reg_allocation" function. We will only use that new function to traverse through the main function's AST Tree (main_func_tree). The actual generation of expressions, statements etc. will be done through other functions. In other words, we will define functions like generate_arithm() or generate_bool(), to generate the needed code for those cases.

So, how does this function look like? Well, the Code is
void main_func_traversal(FILE *fp, AST_Node *node){
    int i;
AST_Node_Declarations *temp_declarations; AST_Node_Decl *temp_decl; AST_Node_Arithm *temp_arithm; AST_Node_Bool *temp_bool; AST_Node_Rel *temp_rel; AST_Node_Equ *temp_equ; AST_Node_Ref *temp_ref; AST_Node_Statements *temp_statements; AST_Node_If *temp_if; AST_Node_Elsif *temp_elsif; AST_Node_For *temp_for; AST_Node_While *temp_while; AST_Node_Incr *temp_incr; AST_Node_Assign *temp_assign; AST_Node_Func_Call *temp_func_call; AST_Node_Call_Params *temp_call_params;
/* temp variable ST entry */ list_t *entry;>
/* check if empty */ if(node == NULL){ return; }
switch(node->type){ /* declarations case */ case DECLARATIONS: /* nothing */ break; /* declaration case */ case DECL_NODE: /* nothing */ break; /* left and right child cases */ case BASIC_NODE: main_func_traversal(fp, node->left); main_func_traversal(fp, node->right); break; case ARITHM_NODE: temp_arithm = (struct AST_Node_Arithm *) node;
main_func_traversal(fp, node->left); main_func_traversal(fp, node->right);
generate_arithm(fp, temp_arithm);
break; case BOOL_NODE: temp_bool = (struct AST_Node_Bool *) node;
main_func_traversal(fp, node->left); main_func_traversal(fp, node->right);
generate_bool(fp, temp_bool);
break; case REL_NODE: temp_rel = (struct AST_Node_Rel *) node;
main_func_traversal(fp, node->left); main_func_traversal(fp, node->right);
generate_rel(fp, temp_rel);
break; case EQU_NODE: temp_equ = (struct AST_Node_Equ *) node;
main_func_traversal(fp, node->left); main_func_traversal(fp, node->right);
generate_equ(fp, temp_equ);
break; /* reference case */ case REF_NODE: temp_ref = (struct AST_Node_Ref *) node;
/* load value from memory to register */ generate_load(fp, temp_ref); break; /* constant case */ case CONST_NODE: /* already managed in the various generation functions */ break; /* statements case */ case STATEMENTS: temp_statements = (struct AST_Node_Statements *) node; for(i = 0; i < temp_statements->statement_count; i++){ main_func_traversal(fp, temp_statements->statements[i]); } break; /* the if case */ case IF_NODE: temp_if = (struct AST_Node_If *) node;
main_func_traversal(fp, temp_if->condition);
main_func_traversal(fp, temp_if->if_branch);
if(temp_if->elseif_count > 0 ){ for(i = 0; i < temp_if->elseif_count; i++){ main_func_traversal(fp, temp_if->elsif_branches[i]); } }
if(temp_if->else_branch != NULL){ main_func_traversal(fp, temp_if->else_branch); } break; /* the else if case */ case ELSIF_NODE: temp_elsif = (struct AST_Node_Elsif *) node;
main_func_traversal(fp, temp_elsif->condition);
main_func_traversal(fp, temp_elsif->elsif_branch); break; /* for case */ case FOR_NODE: temp_for = (struct AST_Node_For *) node;
main_func_traversal(fp, temp_for->initialize);
main_func_traversal(fp, temp_for->condition);
main_func_traversal(fp, temp_for->for_branch);
main_func_traversal(fp, temp_for->increment); break; /* while case */ case WHILE_NODE: temp_while = (struct AST_Node_While *) node;
main_func_traversal(fp, temp_while->condition);
main_func_traversal(fp, temp_while->while_branch); break; /* assign case */ case ASSIGN_NODE: temp_assign = (struct AST_Node_Assign *) node;
main_func_traversal(fp, temp_assign->assign_val); break; /* simple case */ case SIMPLE_NODE: /* will be managed in another article */ break; /* increment statement */ case INCR_NODE: temp_incr = (AST_Node_Incr*) node;
/* will be covered in another article */ break; /* function call case */ case FUNC_CALL: temp_func_call = (struct AST_Node_Func_Call *) node;
if(temp_func_call->num_of_pars != 0){ for(i = 0; i < temp_func_call->num_of_pars; i++){ main_func_traversal(fp, temp_func_call->params[i]); } }
/* when function non-void */ if(temp_func_call->entry->inf_type != VOID_TYPE){
generate_func_call_res(fp, temp_func_call); } break; case CALL_PARAMS: temp_call_params = (struct AST_Node_Call_Params*) node;
/* parameters will be covered in another article */
break; /* function declaration stuff */ case FUNC_DECLS: case FUNC_DECL: case RET_TYPE: case DECL_PARAMS: case RETURN_NODE: /* can't occur in main */ break; default: /* wrong choice case */ fprintf(stderr, "Error in node selection!\n"); exit(1); }
}

The various generation function that we use are declared as following:
void generate_arithm(FILE *fp, AST_Node_Arithm *node);
void generate_bool(FILE *fp, AST_Node_Bool *node);
void generate_rel(FILE *fp, AST_Node_Rel *node);
void generate_equ(FILE *fp, AST_Node_Equ *node);
void generate_load(FILE *fp, AST_Node_Ref *node);
void generate_func_call_res(FILE *fp, AST_Node_Func_Call *node);

Let's get into the implementation of two of them...

generate_load() Function

    The load instruction is called to get the value of a variable from memory and to load it into the corresponding register of that variable. The types of load instructions that we can have are basically:
  • LW - loading value from memory
  • LA - loading address from memory
  • L.D - loading floating-point value from memory

Based on that, we write the following code
void generate_load(FILE *fp, AST_Node_Ref *node){
    if(node->entry->st_type == REAL_TYPE){
        fprintf(fp, "L.D %s, %s\n", GetRegisterName(node->entry->g_index, 1), node->entry->st_name);    
    }
    else{
        if(node->ref == 1){
            fprintf(fp, "LA %s, %s($0)\n", GetRegisterName(node->entry->g_index, 0), node->entry->st_name);
        }
        else{
            fprintf(fp, "LW %s, %s($0)\n", GetRegisterName(node->entry->g_index, 0), node->entry->st_name);
        }
    }
}

generate_arithm() Function

    Generating arithmetic expressions is veeery veeery complicated as we might have to convert values, store floating-point constant...There are basically lots of different types of operands and combinations between them. So, let's start simple...

Basic block

    The basic block of this function will be a switch-case of the operator-type and so we have:
void generate_arithm(FILE *fp, AST_Node_Arithm *node){
...
/* operation */ switch(node->op){ case ADD: /* code */ break; case SUB: /* code */ break; case MUL: /* code */ break; case DIV: /* code */ break; case INC: /* code */ break; case DEC: /* code */ break; } }

Of course the instruction depends on the operator!

Increment/Decrement Operators

    The simplest arithmetic instructions are increments and decrements. For those we basically just have to create:
  • ADDI - incrementation of integer value
  • SUBI - decrementation of integer value
  • ADD.D with FP constant "1.0" stored in $f28 - incrementation of FP value
  • SUB.D with FP constant "1.0" stored in $f28 - decrementation of FP value

In Code this looks like this
...
case INC: /* check data type */ if (node->data_type == REAL_TYPE){ fprintf(fp, "LI.D $28, 1.0\n"); fprintf(fp, "ADD.D %s, %s, $28\n", GetRegisterName(node->g_index, 1), GetRegisterName(node->g_index, 1)); } else{ fprintf(fp, "ADDI %s, %s, 1\n", GetRegisterName(node->g_index, 0), GetRegisterName(node->g_index, 0)); } break; case DEC: /* check data type */ if (node->data_type == REAL_TYPE){ fprintf(fp, "LI.D $28, 1.0\n"); fprintf(fp, "SUB.D %s, %s, $28\n", GetRegisterName(node->g_index, 1), GetRegisterName(node->g_index, 1)); } else{ fprintf(fp, "SUBI %s, %s, 1\n", GetRegisterName(node->g_index, 0), GetRegisterName(node->g_index, 0)); } break;
...

    As you can see we are using GetRegisterName() to get the specific register of the variable that gets incremented (node->g_index). Also see how we use 1 or 0 as the second parameter, to denote the type of the register (FP or GP).

Other Operators

    The code of the other operators simply changes by the prefix (ADD, SUB, MUL, DIV), but when do we use OP, OPI or OP.D and how do we manage the various operands? Well, the easiest thing to do is to define some new variables:
  • float_op flag - 0: integer operation, 1: floating-point operation
  • const_op flag - 0 : operation without constant, 1: operation contains constant
  • Result (type) - 0: integer, 1: double
  • Operand1 - 0: GP, 1: FP, 2: constant, 3: FP constant
  • Operand2 - 0: GP, 1: FP, 2: constant, 3: FP constant

First we have to check the left (or 1st) operand:
if (expression_data_type(node->left) == REAL_TYPE){
   float_op = 1;
   if(node->left->type == CONST_NODE){
      const_op = 1;
      Operand1 = 3;
   }
   else{
      Operand1 = 1;
   }
}
else{
   if(node->left->type == CONST_NODE){
      const_op = 1;
      Operand1 = 2;
   }
   else{
      Operand1 = 0;
   }
}

Similar code is written to check the right (or 2nd) operand.

Afterwards, we also check the result:
if(node->data_type == REAL_TYPE){
    float_op = 1;
    Result = 1;
}

    After getting the information of the operands and result and storing it into the previously mentioned 5 variables, we basically have to check in which of the various combinations we are in...using continuous if-statements...

Experimenting a lot I ended up with the following code for the ADD operator:
if(float_op == 1){
    if(const_op == 1){
        if(Operand1 == 2 || Operand1 == 3){
            temp_const = (AST_Node_Const *) node->left;
        }
        else{
            temp_const = (AST_Node_Const *) node->right;
        }
/* floating-point constant */ if(temp_const->const_type == REAL_TYPE){ fprintf(fp, "LI.D $f28, %.2f\n", temp_const->val); } else{ fprintf(fp, "LI.D $f28, %d.0\n", temp_const->val); } }
/* operand needs conversion */ if(Operand1 == 0){ fprintf(fp, "MTC1.D %s, $f30\n", GetRegisterName(getGraphIndex(node->left) , 0)); fprintf(fp, "CVT.D.W $f30, $f30\n"); } else if(Operand2 == 0){ fprintf(fp, "MTC1.D %s, $f30\n", GetRegisterName(getGraphIndex(node->right) , 0)); fprintf(fp, "CVT.D.W $f30, $f30\n"); }
fprintf(fp, "ADD.D ");
/* check if result needs conversion */ if(Result == 0){ fprintf(fp, "$f26, "); } else{ fprintf(fp, "%s, ", GetRegisterName(node->g_index, 1)); }
switch(Operand1){ case 0: fprintf(fp, "$f30 "); break; case 1: fprintf(fp, "%s ", GetRegisterName(getGraphIndex(node->left) , 1)); break; case 2: case 3: fprintf(fp, "$f28 "); }
switch(Operand2){ case 0: fprintf(fp, "$f30 "); break; case 1: fprintf(fp, "%s ", GetRegisterName(getGraphIndex(node->right) , 1)); break; case 2: case 3: fprintf(fp, "$f28 "); } fprintf(fp, "\n");
/* result needs type-conversion */ if(Result == 0){ fprintf(fp, "CVT.W.D $f26, $f26\n"); fprintf(fp, "MTC1 %s, $f26\n", GetRegisterName(node->g_index, 0)); } } else if(const_op == 1){ if(Operand1 != 0){ temp_const = (AST_Node_Const *) node->left; fprintf(fp, "ADDI %s, %s, %d\n", GetRegisterName(node->g_index, 0), GetRegisterName(getGraphIndex(node->right), 0), temp_const->val); } if(Operand2 != 0){ temp_const = (AST_Node_Const *) node->right; fprintf(fp, "ADDI %s, %s, %d\n", GetRegisterName(node->g_index, 0), GetRegisterName(getGraphIndex(node->left), 0), temp_const->val); } } else{ fprintf(fp, "ADD %s, %s, %s\n", GetRegisterName(node->g_index, 0), GetRegisterName(getGraphIndex(node->left), 0), GetRegisterName(getGraphIndex(node->right), 0)); }

Hope that its understandable!

For now, the other generation functions simply print out the type...

Output for Examples

Let's run our compiler for the examples to see the new output files!

Simple Example 1

    Running the simplest example, we are basically already creating the final code...

The Console output is:


The generated output file is:
.data
# variables
i: .word 0
val: .double 0.000000
res: .double 0.000000
# messages
.text main: L.D $f2, val LI.D $f28, 1.0 ADD.D $f6, $f2 $f28

Interesting right?

Simple Example 2

Running for the second example we get:


The output file is:
.data
# variables
i: .word 0
val: .double 2.500000
res: .space 80
# messages
.text main: LW $s0, i($0) Relational Expression L.D $f2, val LW $s0, i($0) Function Call result LW $s2, res($0) LW $s2, res($0)

Here we can see that some of the code is missing!

Full Example

Running for the complete example we get:


The output code is:
.data
# variables
c: .byte 'c'
i: .word 0
p: .word 0
val: .double 2.500000
res: .double 0.500000, 1.500000, 2.500000, 3.500000, 4.500000, 5.500000
# messages
msg1: .asciiz "\n"
msg2: .asciiz "\n"
msg3: .asciiz "iteration: 3\n"
msg4: .asciiz " "
msg5: .asciiz "\n"
.text main: LA $s3, res($0) LW $s0, i($0) Relational Expression LW $s0, i($0) Relational Expression LW $s0, i($0) Equality Expression LW $s0, i($0) MULI $s5, $s0, 2 Function Call result L.D $f4, val LW $s0, i($0) Function Call result LW $s3, res($0) L.D $f4, val LW $s0, i($0) Function Call result LW $s3, res($0) LW $s3, res($0) LW $s4, p($0) ADDI $t1, $s4, 1 LW $s0, i($0) Equality Expression L.D $f4, val Equality Expression Boolean Expression LW $s0, i($0) Relational Expression LW $s0, i($0) LW $s1, c($0)

    Oh no! In this example lots of the code is still missing, but we will make sure that all the empty spots get filled out, by the end of all these "generating code" articles!

The other expressions will be covered in the next part(s)...

RESOURCES

References:

No references, just using code that I implemented in my previous articles.

Images:

All of the images are custom-made!

Previous parts of the series

General Knowledge and Lexical Analysis

Syntax Analysis

Semantic Analysis (1)

Intermediate Code Generation (AST)

Semantic Analysis (2)

Machine Code Generation


Final words | Next up on the project

    And this is actually it for today's post! I hope that I explained everything as much as I needed to. In the next part(s) we will generate the code for the remaining expressions...

Next up on this series, in general, are:
  • Machine Code generation (MIPS Assembly)
  • Various Optimizations in the Compiler's Code
     Which are all topics for which we will need more than one article to complete them. Also, note that we might also get into AST Code Optimizations later on, or that we could even extend the Language by adding complex datatypes (structs and unions), more rules etc.

So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)

H2
H3
H4
3 columns
2 columns
1 column
4 Comments