Writing a simple Compiler on my own - Type Checking for Assignments [C][Flex][Bison]





[Custom Thumbnail]

All the Code of the series can be found at the Github repository:
https://github.com/drifter1/compiler


Introduction

    Hello it's a me again @drifter1! Today we continue with my Compiler Series, a series where we implement a complete compiler for a simple C-like language by using the C-tools Flex and Bison. In this article we will check if the types of the expression and identifier in an assignment statement are compatible with each other or not. Furthermore, we will also simplify some rules and tweak other things. One of those things is a "sneak-peak" of the next articles :)

The topics that we will cover today are:

  1. Simplify pointer and array rules
  2. Advanced type error message
  3. Assignment type check
  4. Some stuff around parameters

Requirements:

    Actually you need to read and understand all the topics that I covered in the series as a whole, as these articles will give you access to knowledge about:
  • What Compiler Design is (mainly the steps)
  • For which exact Language the Compiler is build for (Tokens and Grammar)
  • How to use Flex and Bison
  • How to implement a lexer and parser for the language using those tools
  • What the Symbol Table is and how we implement it
  • How we combine Flex and Bison together
  • How we can pass information from the Lexer to the Parser
  • How we define operator priorities, precedencies and associativity
  • What Semantic Analysis is (Attributes, SDT etc.)
  • How we do the so called "Scope Resolution"
  • How we declare types and check the type inter-compatibility for different cases, including function parameters
  • How we check function calls later on using a "Revisit queue", as function declarations mostly happen after functions get used (in our Language)
  • Intermediate Code Representations, including AST's
  • How we implement an AST (structure and management)
  • Action Rules for AST nodes and more

Difficulty:

Talking about the series in general this series can be rated:
  • Intermediate to Advanced
Today's topic(s) can be rated:
  • Medium
So, without further ado, let's now finally start with the actual Tutorial...

Actual Tutorial Content

Simplify pointer and array rules

    First of all, I would like to point out that I removed the DOT token (.) from the lexer, as it's not used right now. We might add it back later on, but right now it just gives us an "unused" warning. Also, the pointer and array rules from the grammar that we have right now, allow multiple pointers and multi-dimensional arrays. But, managing those in the symbol table structure and generating Assembly code is not that easily. So, to make the whole process easier for you to understand, and most importantly easier to code, I will only allow single-pointers and one-dimensional arrays. This means that the rules for them change into:
pointer: MULOP ; /* for now only single pointers! */
array: /* for now only one-dimensional arrays */ LBRACK expression RBRACK /* can only be used in expressions */ { // if declaration then error! if(declare == 1){ fprintf(stderr, "Array declaration at %d contains expression!\n", lineno); exit(1); } } | LBRACK ICONST RBRACK { // set array_size for declaration purposes $$ = $2.ival; } ;

    That way the actual structure that we have for storing such information (symbol table entry) makes much more sense now! We could also add new entries for such things, but why make things so complicated from the beginning? We can always add such behavior later on...

Advanced type error message

    Next up is the "type_error" function that we defined during Semantics Analysis and so inside of the "semantics.c" file. This function get's triggered from the "get_result_type" function, whenever there is a type conflict. To help us distinguish the error that we have, we also added an error message. But this error message just prints out numbers, and remembering all the data and operator type integer mappings is quite difficult! So, let's make it print out the actual names of the data and operator types.

In code this looks as following:
void type_error(int type_1, int type_2, int op_type){
  fprintf(stderr, "Type conflict between ");
  /* first type */
  if      (type_1 == INT_TYPE)           fprintf(stderr,"%s","int");
  else if (type_1 == REAL_TYPE)          fprintf(stderr,"%s","real");
  else if (type_1 == CHAR_TYPE)          fprintf(stderr,"%s","char");
  else                                   fprintf(stderr,"%s","other");
fprintf(stderr, " and ");
/* second type */ if (type_2 == INT_TYPE) fprintf(stderr,"%s","int"); else if (type_2 == REAL_TYPE) fprintf(stderr,"%s","real"); else if (type_2 == CHAR_TYPE) fprintf(stderr,"%s","char"); else fprintf(stderr,"%s","other");
/* operator */ fprintf(stderr," using op type "); switch(op_type){ case NONE: fprintf(stderr,"%s","NONE"); break; case ARITHM_OP: fprintf(stderr,"%s","ARITHM_OP"); break; case INCR_OP: fprintf(stderr,"%s","INCR_OP"); break; case BOOL_OP: fprintf(stderr,"%s","BOOL_OP"); break; case NOT_OP: fprintf(stderr,"%s","NOT_OP"); break; case REL_OP: fprintf(stderr,"%s","REL_OP"); break; case EQU_OP: fprintf(stderr,"%s","EQU_OP"); break; default: fprintf(stderr, "Error in operator selection!\n"); exit(1); }
/* line */ fprintf(stderr, " in line %d\n", lineno);
exit(1); }

Consider the following code snippet:
...
int i; i = 5.5;
...

Running the compiler now, when having a type conflict, would give us:

Type error cause we can't assign (operator NONE) a real to an integer!

Assignment type check

    Having all the needed functions already implemented in previous articles/parts of the series, we just have to call the "get_result_type" function with the correct arguments:
  • The variable datatype can be found using the "get_type" function.
  • The expression datatype can be found with the "expression_data_type" function
  • The operator is NONE, which corresponds to assignments and so type checks only
Knowing all that we just have to add the following code to the assignment-rule:
assigment: var_ref ASSIGN expression
{
  AST_Node_Ref *temp = (AST_Node_Ref*) $1;
  $$ = new_ast_assign_node(temp->entry, temp->ref, $3);
// check assignment semantics get_result_type( get_type(temp->entry->st_name), /* variable datatype */ expression_data_type($3), /* expression datatype */ NONE /* checking compatibility only (no operator) */ ); } ;

    Running this function with a NONE argument, will make it check if the first datatype can get a value of the second datatype. That way we know if the expression can be assigned as a value of the identifier. If not, then we will get an error message (like the one that we saw earlier).

Some stuff around parameters

    The stuff that we covered up to now is actually all that I have to do for today's topic! But, as that would be just to little, let's now also dive into some other stuff, to prepare for the next big topic of function call semantics. In this topic we have to:
  • check if the calling parameters are compatible with the formal (declaration) parameters (which includes checking the parameter count)
  • set the return type of the function, so that we can check if the function can be used in an expression, assignment etc.
    All this means that we will have to do a lot of stuff in the "Revisit queue" structure. Also, a lot of semantic checking will have to be delayed up to the point where we have the needed information (mainly datatype).
Let's first add some new entries to the revisit queue entry. To check parameters and to be able to tweak the datatype of the symbol table entry/item we have to add:
  • a "list_t" entry
  • an array of parameter datatype ("par_types"
  • the number of parameters ("num_of_pars")
So, the new structure in code looks like this now:
typedef struct revisit_queue{
  // symbol table entry
  list_t *entry;
// name of identifier char *st_name;
// type of revisit int revisit_type;
// parameters int *par_types; int num_of_pars;
// maybe additional information to simplify the process ...
struct revisit_queue *next; }revisit_queue;

    The first of those newly added entries can be accessed easily during the "entry" creation inside of the "insert" function of the Symbol Table. This makes us think that the "add_to_queue" function of the Revisit Queue structure, should contain such a parameter. In the same way, to be able to access the queue entry easier, we should also create a search function that gives back the entry without removing it. This will become useful when we actually want to add the information of parameter data-types ("par_types") and the parameter count ("num_of_pars") to this entry. That way we now end up with the following new functions:
symtab.h:
...
void add_to_queue(list_t *entry, char *name, int type); revisit_queue *search_queue(char *name);
...
----------------------------------------------------------------------------
symtab.c:
...
void add_to_queue(list_t *entry, char *name, int type){ revisit_queue *q;
/* queue is empty */ if(queue == NULL){ /* set up entry */ q = (revisit_queue*) malloc(sizeof(revisit_queue)); q->entry = entry; q->st_name = name; q->revisit_type = type; q->next = NULL;
/* q "becomes" the queue */ queue = q; } /* queue not empty */ else{ /* find last element */ q = queue; while(q->next != NULL) q = q->next;
/* add element to the end */ q->next = (revisit_queue*) malloc(sizeof(revisit_queue)); q->next->entry = entry; q->next->st_name = name; q->next->revisit_type = type; q->next->next = NULL; } }
revisit_queue *search_queue(char *name){ revisit_queue *q; /* search for the entry */ q = queue; while( strcmp(q->st_name, name) != 0 ) q = q->next;
return q; }
...

    Doing that we of course have to change the "insert"-function, as it uses the "add_to_queue" function. That's pretty easily actually, as we just have to tweak the following line:
add_to_queue(l, l->st_name, PARAM_CHECK);
which is line 68 of the "symtab.c" file.

    The "search_queue" function can be used in the "function_call"-rule, to be able to set the two other new entries of the revisit queue entry. This function gives back the queue entry and so we can then easily set the number of parameters, allocate memory for the array, and get the expression data type of each parameter, passing it to each corresponding "par_types" index. In code that looks as following:
function_call: ID LPAREN call_params RPAREN
{
  AST_Node_Call_Params *temp = (AST_Node_Call_Params*) $3;
  $$ = new_ast_func_call_node($1, temp->params, temp->num_of_pars); 
/* add information to revisit queue entry (if one exists) */ revisit_queue *q = search_queue($1->st_name); if(q != NULL){ q->num_of_pars = temp->num_of_pars; q->par_types = (int*) malloc(temp->num_of_pars * sizeof(int)); /* get the types of the parameters */ int i; for(i = 0; i < temp->num_of_pars; i++){ /* get datatype of parameter-expression */ q->par_types[i] = expression_data_type(temp->params[i]); } } } ;

More around that in the next articles!

RESOURCES

References:

No references, just using code that I implemented in my previous articles.

Images:

All of the images are custom-made!

Previous parts of the series


Final words | Next up on the project

     And this is actually it for today's post! I hope that I explained everything as much as I needed to, meaning that you learned something out of it.
Next up on this series are:
  • Semantic analysis (using even more action rules in Bison)
  • Machine Code generation (MIPS Assembly)
     Which are all topics that will need more than one article to complete. Also, note that we might also get into Optimizations later on, or could even extend the Language by adding complex datatypes (structs and unions), more rules etc.
So, see ya next time!

GitHub Account:

https://github.com/drifter1

Keep on drifting! ;)

H2
H3
H4
3 columns
2 columns
1 column
5 Comments
Ecency