Stage 1 compiler
The goal of the p1 language was to create a simple statically typed language containing the core constructs we are accustomed to in programming:
- local and global variables
- function parameters and return values
- types and type checking
- custom data types (structs)
- complex expressions with priority
- multi file ‘support’
- (syscalls)
Of course, it is the compiler’s job to implement these, which will be done in multiple passes:
- lexical analysis (tokenization)
- syntax analysis
- semantic analysis (type checking)
- memory
- code generation
Before the lexical analysis actually starts, the compiler first reads the input program from stdin into memory.
Lexical analysis
The purpose of lexical analysis is to break up the input program into tokens (groups of characters with attached meaning), for example:
- keywords (int, bool, void, if, while, …)
- identifiers (variable, parameter names)
- constants (5, ‘a’, “test”)
- symbols ( ( , ) , … , operators)
- comments/whitespace (which are skipped)
Another purpose is to catch some early errors (malformed string/character constants, invalid escape sequences).
For a list of all possible tokens and what their memory entries look like, check out the docs.
Implementation snippet
Let's look at an example (from the actual implementation) of how a token is parsed: tokenize_isKeyword_helper returns 1 in y if the character in x is not a legal character in identifiers. That means that we are indeed looking at a keyword, and not, for example, a variable name.
We can notice that this is quite cumbersome, because:
- there is no simple way of comparing strings
- there is no way to loop through strings, which means that every keyword/symbol gets its own if statement somewhere
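To make the idea concrete, here is a hedged sketch in Python (the compiler is not written in Python, and is_ident_char / match_keyword are hypothetical names): a keyword only counts as matched when the character following it is not a legal identifier character.

```python
# Hypothetical sketch of the keyword check, NOT the actual implementation.
KEYWORDS = {"int", "bool", "char", "void", "if", "else", "while",
            "return", "struct", "sizeof", "null"}

def is_ident_char(c):
    # Mirrors the role of tokenize_isKeyword_helper:
    # characters that may appear inside an identifier.
    return c.isalnum() or c == "_"

def match_keyword(src, pos, word):
    # A keyword matches only if it is NOT followed by an identifier
    # character -- otherwise we are looking at e.g. the variable "integer".
    end = pos + len(word)
    if src[pos:end] != word:
        return False
    return end >= len(src) or not is_ident_char(src[end])
```

In a language with string comparison this is one loop over a keyword table; without it, each keyword ends up with its own hand-written check, which is exactly the cumbersomeness described above.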
Example
We can observe the debug output of how the hello world program gets tokenized.
INT IDENT(write) LPARENT INT IDENT(fd) COMMA CHAR MUL IDENT(cbuf) COMMA INT IDENT(count) RPARENT SEMICOLON VOID IDENT(main) LPARENT RPARENT LCURLY IDENT(write) LPARENT CINT(1) COMMA CSTRING("Hello world!\n") COMMA CINT(13) RPARENT SEMICOLON RCURLY
Token variants
As the compiler goes over the source program, recognised tokens are stored in memory in a uniform way (4 components each):
Token | Type | ID | Optional | Line |
---|---|---|---|---|
int | 0 | 0 | - | Line no. |
char | 0 | 1 | - | Line no. |
bool | 0 | 2 | - | Line no. |
void | 0 | 3 | - | Line no. |
if | 11 | - | - | Line no. |
else | 12 | - | - | Line no. |
while | 13 | - | - | Line no. |
return | 27 | - | - | Line no. |
struct | 45 | - | - | Line no. |
sizeof | 47 | - | - | Line no. |
identifier | 2 | start | length | Line no. |
constant(int) | 3 | 0 | value | Line no. |
constant(char) | 3 | 1 | value | Line no. |
constant(boolean) | 3 | 2 | value | Line no. |
constant(string) | 3 | 3 | start | Line no. |
null | 3 | 4 | 0 | Line no. |
( | 14 | - | - | Line no. |
) | 15 | - | - | Line no. |
{ | 16 | - | - | Line no. |
} | 17 | - | - | Line no. |
[ | 18 | - | - | Line no. |
] | 19 | - | - | Line no. |
. | 20 | - | - | Line no. |
, | 21 | - | - | Line no. |
; | 22 | - | - | Line no. |
: | 23 | - | - | Line no. |
! | 24 | - | - | Line no. |
= | 1 | 0 | - | Line no. |
+= | 1 | 1 | - | Line no. |
-= | 1 | 2 | - | Line no. |
/= | 1 | 3 | - | Line no. |
*= | 1 | 4 | - | Line no. |
++ | 25 | - | - | Line no. |
-- | 26 | - | - | Line no. |
& | 28 | - | - | Line no. |
\| | 29 | - | - | Line no. |
== | 30 | - | - | Line no. |
!= | 31 | - | - | Line no. |
< | 32 | - | - | Line no. |
> | 33 | - | - | Line no. |
<= | 34 | - | - | Line no. |
>= | 35 | - | - | Line no. |
* | 36 | - | - | Line no. |
/ | 37 | - | - | Line no. |
% | 38 | - | - | Line no. |
+ | 39 | - | - | Line no. |
- | 40 | - | - | Line no. |
&& | 41 | - | - | Line no. |
|| | 42 | - | - | Line no. |
~ | 43 | - | - | Line no. |
^ | 44 | - | - | Line no. |
-> | 46 | - | - | Line no. |
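As a hedged illustration (Python tuples standing in for raw memory cells; make_token is a hypothetical helper), the first few tokens of the line int write( would be stored like this according to the table:

```python
# Hypothetical sketch of the uniform 4-component token entries:
# (type, id, optional, line). Cells marked "-" in the table are None here.
def make_token(type_, id_=None, optional=None, line=0):
    return (type_, id_, optional, line)

# Source line 1: "int write("
tokens = [
    make_token(0, 0, None, 1),      # keyword "int": type 0, id 0
    make_token(2, 4, 5, 1),         # identifier "write": start=4, length=5
    make_token(14, None, None, 1),  # "(" : type 14
]
```

Note that identifiers are not copied anywhere; the entry only records where the name starts in the source buffer and how long it is.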
Syntax analysis
The purpose of syntax analysis is to group up tokens into higher level constructs, such as:
- declaration
- type (chain)
- expression (tree)
- statements
- …
Doing this allows us to strip away keywords and other extra tokens, resulting in only semantically important constructs.
Implementation snippet
Let's take a look at an example (from the actual implementation) of how tokens are grouped into a larger construct:
The example I chose is not the simplest, but it's also far from the most complicated one. It handles variable declarations, which can be either type name; or type name = expression;. Let's break it down:
- syntax_isType is invoked. If a type was found, it will have been written into the previous memory entry (each entry = 5 cells), f will have been set to 0, and the token index i moved up.
- The next token is loaded.
- If a type was matched and is followed by an identifier, we might be looking at a declaration, so the token index is moved further.
- If there is no initializer expression (i.e. we are looking at ;), the token index is moved up and the necessary cells are stored in memory.
- If there is one, we confirm it is an initialization and continue parsing the initializer expression. The variable type is saved to n so it doesn't get trashed.
- If an expression was matched and is followed by ;, the initialization is written into memory.
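The steps above can be sketched roughly as follows (a hedged Python sketch; the real compiler writes 5-cell memory entries instead of returning tuples, and parse_declaration is a hypothetical name):

```python
# Hypothetical sketch of declaration parsing: "type name;" or
# "type name = expression;". Tokens are (kind, value) tuples here.
def parse_declaration(tokens, i):
    # Returns (node, new_index), or (None, i) if no declaration matched.
    if tokens[i][0] != "type":
        return None, i
    type_tok = tokens[i]; i += 1
    if tokens[i][0] != "ident":
        return None, i - 1
    name_tok = tokens[i]; i += 1
    if tokens[i][0] == ";":            # plain declaration, no initializer
        return ("decl", type_tok[1], None, name_tok[1]), i + 1
    if tokens[i][0] == "=":            # initialization
        expr_tok = tokens[i + 1]       # stand-in for a full expression parse
        i += 2
        if tokens[i][0] == ";":
            return ("decl", type_tok[1], expr_tok[1], name_tok[1]), i + 1
    return None, i
```

The "save the type to n" step from the walkthrough corresponds to type_tok being held in a local here: the recursive expression parse must not trash it before the node is written.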
Example
Again we can take a look at the hello world example.
int write(int fd, char* cbuf, int count);
void main(){
write(1,"Hello world!\n",13);
}
PAR PAR PAR FUNDECL const const const call const const const call EXP FUN
Expressions:
EXP[ CALL(write)[ 1, "Hello world!\n", 13, ] ]
Types:
FUN [ int ] PAR [ int ] PAR [ ptr(char) ] PAR [ int ] FUN [ void ]
Let's take a look at the first line, which contains debug logs emitted during parsing:
- PAR PAR PAR FUNDECL is correctly identified as a function declaration.
- Then const const const call repeats twice, even though there is only one function call in the program. This is because the compiler first tries to match an assignment (expr = expr, or one of the other assignment operators), fails, and tries again with an expression statement (expr;), which succeeds.
- At the end, FUN is output, because it was a normal function definition.
The post-parsing printout is more concise, but it might not always be present, since the compiler may exit with an error beforehand.
Memory structure
Syntax nodes
Similarly to tokens, syntax nodes are also stored in memory, this time with 5 components each. For the rest of the compilation, the compiler will be working with these nodes in one way or another.
Type\Offset | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
Declaration | 0 | type | expression (-1 if none) | ident | - |
Type | 1 | 0int, 1char, 2bool, 3void, 4arr, 5ptr, 6name | basetype | const token (array only) | - |
Expression | 2 | id | expr1 | expr2 | type(added later) |
Function | 3 | type | ident | 0start, 1end, 2startdecl | - |
Parameter | 4 | type | ident | - | - |
Return | 5 | expr (-1 if none) | - | - | - |
Expression statement | 6 | expr | - | - | - |
Assignment | 7 | 0=, 1+=, 2-=, 3/=,4*= | expr1 | expr2 | - |
If/Else | 8 | 0if, 1else, 2end | expr(if) | - | - |
While | 9 | 0start, 1end | expr(start only) | - | - |
Call | 10 | 0start, 1end | ident(start only) | start(end only) | - |
Argument | 11 | expr | - | - | - |
Struct | 12 | 0start, 1end | ident | -1 | - |
Note: for cast expressions, the type is stored as expr2.
Note: Declaration address is added as expr2 to ident expressions during name resolution (phase 3).
Expression types
The expression node covers many operations, which are discerned by their id:
Operator | Id |
---|---|
Postf ++ | 0 |
Postf -- | 1 |
Postf arr | 24 |
Post comp | 25 |
Post ptr comp | 31 |
Pref ++ | 2 |
Pref -- | 3 |
Pref + | 4 |
Pref - | 5 |
Pref ! | 6 |
Pref ~ | 27 |
Pref & | 28 |
Pref * | 29 |
Pref cast | 30 |
Binop * | 7 |
Binop / | 8 |
Binop % | 9 |
Binop + | 10 |
Binop - | 11 |
Binop < | 12 |
Binop > | 13 |
Binop <= | 14 |
Binop >= | 15 |
Binop == | 16 |
Binop != | 17 |
Binop & | 18 |
Binop \| | 19 |
Binop && | 20 |
Binop || | 21 |
Binop ^ | 26 |
ident | 22 |
const | 23 |
sizeof | 32 |
Expression grammar
Expressions are parsed using the following LL(1) grammar, which enables a fairly trivial implementation with a recursive descent parser:
Production | Tail production |
---|---|
E -> FE’ | E’ -> || FE’ | ε |
F -> GF’ | F’ -> && GF’ | ε |
G -> XG’ | G’ -> | XG’ | ε |
X -> HX’ | X’ -> ^ HX’ | ε |
H -> IH’ | H’ -> & IH’ | ε |
I -> JI’ | I’ -> == JI’ | != JI’ | ε |
J -> KJ’ | J’ -> < KJ’ | > KJ’ | <= KJ’ | >= KJ’ | ε |
K -> LK’ | K’ -> + LK’ | - LK’ | ε |
L -> ML’ | L’ -> * ML’ | / ML’ | % ML’ | ε |
M -> ++M | --M | +M | -M | !M | &M | *M | ~M | (type)M | N |
N -> ON’ | N’ -> ++N’ | --N’ | [E]N’ | .identN’ | ->identN’ | ε |
O -> ident | const | (E) |
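As a hedged sketch of how this grammar maps onto recursive descent, here is a Python parser for just the K, L and O productions (additive, multiplicative, primary). For brevity it evaluates integer constants directly instead of building expression nodes, which is not what the compiler does:

```python
# Hypothetical sketch: recursive descent over K, L, O from the grammar.
# Tokens are ints (constants) or one-character operator strings.
def parse(tokens):
    pos = 0
    def peek():
        return tokens[pos] if pos < len(tokens) else None
    def K():                    # K -> LK' ; K' -> + LK' | - LK' | eps
        nonlocal pos
        v = L()
        while peek() in ("+", "-"):
            op = tokens[pos]; pos += 1
            v = v + L() if op == "+" else v - L()
        return v
    def L():                    # L -> ML' ; L' -> * ML' | / ML' | % ML' | eps
        nonlocal pos
        v = O()
        while peek() in ("*", "/", "%"):
            op = tokens[pos]; pos += 1
            w = O()
            v = v * w if op == "*" else v // w if op == "/" else v % w
        return v
    def O():                    # O -> const | (E)   (E shortened to K here)
        nonlocal pos
        t = tokens[pos]; pos += 1
        if t == "(":
            v = K(); pos += 1   # consume the closing ")"
            return v
        return t                # an integer constant
    return K()
```

The tail productions K' and L' become while loops, which is also what makes the binary operators left-associative.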
Semantic analysis
The purpose of semantic analysis is to:
- resolve names (check if used variable has been declared and is in scope)
- check types (in expressions, returns, …)
- check Lvalues (determine whether an expression has an address)
Name resolver
During name resolving, declarations (and their depth/scope) are collected and written temporarily into memory. When a scope is closed, the contained declarations are removed (similar to a stack). This includes:
- function declarations
- type declarations (structs)
- variable / parameter declarations
Name resolving is done in 2 passes; the first one only handles global declarations, while the second one goes through the entire program.
When a named type / expression / call is encountered, the memory entries are checked for a matching identifier. If successful, the address of the declaration is added to the memory entry for the expression (at offset 3), named type (at offset 2) or call (at offset 2).
Memory structure
During execution of the name resolver, the memory is used in a stack-like fashion to store currently active declarations.
Type\Offset | 0 | 1 |
---|---|---|
Any declaration | address | depth |
Implementation snippet
We can take a look at an example of how new declaration entries are added:
We can also check out how names are resolved (in this case, a named expression):
For good measure, let's check out the getDeclaration method as well.
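A hedged Python sketch of this stack-like behaviour (the function names mirror the operations described above but are hypothetical, and a list of tuples stands in for the (address, depth) memory entries):

```python
# Hypothetical sketch of the name resolver's declaration stack.
scope_stack = []   # entries of (name, address, depth)

def add_declaration(name, address, depth):
    scope_stack.append((name, address, depth))

def get_declaration(name):
    # Search from the innermost (most recent) declaration outwards,
    # so shadowing works naturally.
    for entry_name, address, _depth in reversed(scope_stack):
        if entry_name == name:
            return address
    return None    # undeclared name: a resolution error

def close_scope(depth):
    # Closing a scope drops every declaration at that depth or deeper.
    while scope_stack and scope_stack[-1][2] >= depth:
        scope_stack.pop()
```

The returned address is what gets written back into the expression, named-type or call entry, as described above.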
Type resolver
During type resolving, type constraints are checked, such as:
- operand types must be compatible with the operator
- expression in if/while must be a boolean
- argument types in calls must match
- non void functions must have return statement (and its type must match)
The type resolving code is quite long and contains a lot of edge-case checking, so we'll only take a look at a simple example:
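As a hedged, heavily simplified Python sketch (the actual compiler handles many more cases, e.g. chars, pointers and structs), the operator/operand compatibility rule for binary operators could look like this, using the type ids from the syntax-node table (0 = int, 2 = bool):

```python
# Hypothetical sketch of operand/operator compatibility checking.
INT, BOOL = 0, 2

ARITH = {"+", "-", "*", "/", "%"}             # int x int -> int
COMPARE = {"<", ">", "<=", ">=", "==", "!="}  # int x int -> bool (simplified)
LOGIC = {"&&", "||"}                          # bool x bool -> bool

def binop_type(op, left, right):
    # Returns the result type, or raises on incompatible operands.
    if op in ARITH and left == right == INT:
        return INT
    if op in COMPARE and left == right == INT:
        return BOOL
    if op in LOGIC and left == right == BOOL:
        return BOOL
    raise TypeError(f"operands of '{op}' have incompatible types")
```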
Address resolving
During type resolving, expressions are also checked for addresses. Lvalues (variables, parameters) are recognised, and having an address is enforced where required: pointer dereference, array access, component access, the left side of assignments, increment/decrement, and address-of (&).
Memory
This phase handles:
- collecting global variables
- assigning ids to strings
- determining sizes of types
- calculating offsets for function parameters
- calculating offsets for struct components
- calculating offsets for local variables
- calculating size of activation record (parameters + local variables + return address + frame pointer)
Activation record
|previous f.| <- FP
|parameter 0|
| ... |
|parameter n|
|local var.0|
| ... |
|local var.n|
| old fp |
|return addr| <- SP
Local variables can be of any size (1-4 bytes) and will not take up uniform space in the activation record.
Type sizes
- char, bool and void take up 1 byte
- int and ptr take up 4 bytes
- arrays take up (length * base type size) bytes
- structs are as big as the sum of their components
Void taking up 1 byte is intentional, in order to allow incrementing void* (adding/subtracting ints from pointers moves the pointer by the size of its basetype, which would be 1 byte in this case).
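The size rules above can be sketched as follows (a hedged Python sketch; the tuple encoding of types is an assumption made for illustration, not the compiler's memory layout):

```python
# Hypothetical sketch of the type size rules. Types are tuples:
# ("int",), ("char",), ("bool",), ("void",), ("ptr", base),
# ("arr", base, length), ("struct", [component types]).
def sizeof(t):
    kind = t[0]
    if kind in ("char", "bool", "void"):
        return 1                           # 1 byte each
    if kind in ("int", "ptr"):
        return 4                           # 4 bytes each
    if kind == "arr":
        return t[2] * sizeof(t[1])         # length * base type size
    if kind == "struct":
        return sum(sizeof(c) for c in t[1])  # sum of components, no padding
    raise ValueError(f"unknown type kind: {kind}")
```

Note there is no struct padding or alignment here, matching the stated rule that a struct is exactly as big as the sum of its components.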
Codegen
We’ll take a look at only some of the more interesting aspects of the implementation, such as:
- expressions/calls
- multi file support
The implemented code generation contains just the basics, with no optimizations.
Expressions / Calls
Expression generation was designed so that register allocation is not necessary, which allows us to go directly from the memory phase into code generation. It can be described with a couple of rules:
- results of expressions (and calls) are left in r0
- if an operator needs to evaluate multiple expressions, it calculates the first one, stores the result on the stack, calculates the other one, and combines the results.
- function arguments are pushed to the stack as soon as they are calculated
- function return values are also left in r0
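A hedged sketch of this scheme in Python (the instruction mnemonics and node tuples are illustrative only, not the compiler's actual output):

```python
# Hypothetical sketch of register-free expression codegen: every
# (sub)expression leaves its result in r0; binary operators evaluate the
# left operand, push r0, evaluate the right operand, then pop and combine.
def gen_expr(node, out):
    kind = node[0]
    if kind == "const":
        out.append(f"mov r0, {node[1]}")
    elif kind == "binop":
        _, op, left, right = node
        gen_expr(left, out)        # left result lands in r0
        out.append("push r0")      # save it on the stack
        gen_expr(right, out)       # right result lands in r0
        out.append("pop r1")       # recover the left result
        out.append(f"{op} r0, r1, r0")  # combine; result back in r0

code = []
gen_expr(("binop", "add", ("const", 1), ("const", 2)), code)
```

The stack discipline means nested expressions of any depth work without ever tracking which registers are free, at the cost of extra push/pop traffic.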
Multi file support
The compiler/language does not implement the concept of header files (or equivalent), but the source code can still be in multiple files. Function declarations must still be present, and the appropriate implementation must be provided during the linking process.
In order to compile a file that ‘exports’ some functions, it must not contain a main method, so an entry point will not be generated.