Project 5 (100 points)
Project 5 is due 12/2/15 on or before 11:59:59pm MST.
Your goal is to write the front-end for a compiler. This front-end is responsible for parsing the input language and creating an intermediate representation. The back-end of the compiler takes in the intermediate representation and executes that intermediate representation. The back-end of the compiler is already written and cannot be changed.
Grammar Description
program → var_section body
var_section → id_list SEMICOLON
id_list → ID COMMA id_list | ID
body → LBRACE stmt_list RBRACE
stmt_list → stmt stmt_list | stmt
stmt → assign_stmt | print_stmt | while_stmt | if_stmt | switch_stmt
assign_stmt → ID EQUAL primary SEMICOLON
assign_stmt → ID EQUAL expr SEMICOLON
expr → primary op primary
primary → ID | NUM
op → PLUS | MINUS | MULT | DIV
print_stmt → PRINT ID SEMICOLON
while_stmt → WHILE condition body
if_stmt → IF condition body
condition → primary relop primary
relop → GREATER | LESS | NOTEQUAL
switch_stmt → SWITCH ID LBRACE case_list RBRACE
switch_stmt → SWITCH ID LBRACE case_list default_case RBRACE
case_list → case case_list | case
case → CASE NUM COLON body
default_case → DEFAULT COLON body
The tokens used in the grammar description are (note PRINT is not what you would expect):
SEMICOLON = ;
ID = letter(letter | digit)*
COMMA = ,
LBRACE = {
RBRACE = }
EQUAL = =
NUM = 0 | (digit digit*)
PLUS = +
MINUS = -
MULT = *
DIV = /
PRINT = print
WHILE = WHILE
IF = IF
GREATER = >
LESS = <
NOTEQUAL = <>
SWITCH = SWITCH
CASE = CASE
COLON = :
DEFAULT = DEFAULT
Lexing
compiler.h
and compiler.c
also define the lexing functions
(getToken
and ungetToken
) which you should use for reading in the
tokens.
Language Semantics
Division is integer division (just as in C) and the result of the division of two integers is an integer.
All variables must be explicitly declared in the
var_section
and have global scope.All variables are of type integer.
All variables initially have the value 0.
All statements in a statement list are executed sequentially according to the order in which they appear.
Boolean Condition
A boolean condition takes two operands as parameters and returns a boolean value.
If Statements
An if_stmt
has the following (standard) semantics (note that our
language does not have an else
clause to the if_stmt
):
- The condition is evaluated.
- If the condition evaluates to
true
, then the body of theif_stmt
is executed, then the next statement following theif_stmt
is executed. - If the condition evaluates to
false
, then the next statement following theif_stmt
is executed.
While Statements
while_stmt
has the following (standard) semantics:
- The condition is evaluated.
- If the condition evaluates to
true
, the body of thewhile_stmt
is executed, then goto step 1. - If the condition evaluates to
false
, then the next statement following thewhile_stmt
is executed.
Switch Semantics
switch_stmt
has the following (standard) semantics:
- The value of the switch variable is checked against each case number in order.
- If the value matches the case number, the body of the case is
executed, then the next statement following the
switch_stmt
is executed. - If the value does not match the case number, the next case is evaluated.
- If a default case is provided and the value does not match any of
the case numbers, then the body of the default case is executed,
then the next statement following the
switch_stmt
is executed. - If there is no default case and the value does not match any of the
case numbers, then the next statement following the
switch_stmt
is executed.
Print Semantics
print_stmt
will print out the value of the variable (give as the
ID
) at the time of the execution of the print_stmt
.
Intermediate Representation
The intermediate representation is a graph that the compiler back-end
knows how to interpret and execute. All intermediate representation
data structures are defined in compiler.h
, and you are not allowed
to change compiler.h
or compiler.c
, which are posted on the
submission site.
You should become very familiar with the intermediate representation
data structures in compiler.h
and their usage to execute the program
in the execute_program
function in compiler.c
.
StatementNode
The main data structure of the intermediate representation is the
StatementNode
, which has a type
field which specifies what type of
statement it is (AssignmentStatement
, PrintStatement
,
IfStatement
, or GotoStatement
) and a field next
which points to the
next StatementNode
in the list. Thus, StatementNode
forms a linked
list of statements that should be executed, in order. If the next
field is NULL
, then program execution terminates.
ValueNode
The ValueNode
data structure is used to represent a value in the
program (both literals and variables). You should read compiler.c
to
understand how ValueNode
is used when executing the intermediate
representation.
AssignmentStatement
The intermediate language AssignmentStatement
is constructed of a
left_hand_side
(the ValueNode
that is being assigned to) and a
potential operation performed on two operands. The op
field
specifies the type of the operation to perform, and an op
value of
OP_NOOP
means that no operation is performed, so the value
associated with operand1
is copied to the value associated with
left_hand_side
.
PrintStatement
The intermediate language PrintStatement
has a ValueNode
id
as
its only field, which specified the value to print out.
IfStatement
The intermediate language IfStatement
is more complicated structure
than the other structures. Similar to an AssignmentStatement
, the
condition of an IfStatement
must be an operator, represented
by condition_op
. This operator is applied to condition_operand1
and condition_operand2
, and if the result is true
, then program
execution continues executing the StatementNode
in the
true_branch
, otherwise program execution continues executing the
StatementNode
in the false_branch
.
GotoStatement
The intermediate language GotoStatement
is a fairly simple
construct. It has one field, target
, that is the StatementNode
to
begin executing from when the GotoStatement
is executed.
Implementation
You are required to implement the function
parse_generate_intermediate_representation
(in another file than
compiler.c
and compiler.h
, as you cannot modify these files) with
the following signature:
struct StatementNode* parse_generate_intermediate_representation();
You can implement this function in C or C++, follow the guidelines in
compiler.h
.
You are free to write whatever helper functions you need (and really
you should, a single huge function is not a good way to code), but you
cannot modify compiler.h
and compiler.c
.
You can assume that all input programs will properly type check and that there will not be any syntax errors.
Hopefully you have noticed by now that there is a gap between the language semantics and the intermediate representation semantics. Essentially, your goal is to overcome this gap, by parsing the program and translating program semantics that do not exist in the intermediate representation semantics into semantically equivalent intermediate representation. This is the heart of the compiler, and it is exactly what GCC does when it compiles your C program to x86 assembly: x86 does not have while loops or switch statements, but C does.
Consider the program language if_stmt
, which you must translate into
a corresponding intermediate language IfStatement
. We want the
semantics of the if_stmt
to be preserved when performing this
translation.
One way to do this would be to create a StatementNode
data
structure, which has a type
of IF_STMT
. The if_stmt
field of the
StatementNode
is an IfStatement
. The true_branch
should be the
body of the if_stmt
, and the false_branch
can be a NOOP_STMT
.
However, this is not enough! Due to the semantics of the intermediate
language, if the IfStatement
condition evaluates to false
, then
execution starts at the false_branch
. This branch is a NOOP_STMT
,
so that is executed, then execution terminates. This does not follow
the semantics of our original language! The program should continue
execution at the next statement following the if_stmt
. The same
logic causes a problem if the IfStatement
condition evaluates to
true
. The body will execution, then execution will terminate.
Therefore, we must ensure that the next
field of the last
StatementNode
in the true_branch
and the false_branch
is the
StatementNode
that follows the if_stmt
.
One way to implement this is in the following diagram:
A similar situation occurs when implementing the while_stmt
, however
the semantics of the while_stmt
can be implemented using a
GotoStatement
as the last instruction of the true_branch
, as the
following diagram demonstrates:
switch_stmt
must be handled similarly.
Examples
You will be given all the test cases for assignment statements, if statements, and while statements. However, here are some additional examples:
a;
{
a = 10 ;
print a;
a = 20;
print a;
}
Should output:
10
20
The program:
a;
{
a = 10 ;
SWITCH a
{
CASE 0 :
{
a = 20 ;
}
CASE 10 :
{
a = a + 40;
}
}
print a ;
}
Should output:
50
Evaluation
Your submission will be graded on passing the automated test cases.
The test cases (there will be multiple test cases in each category, each with equal weight) will be broken down in the following way (out of 100 points):
- Assignment statements : 20
- If statements : 25
- While statements : 25
- Switch statements : 20
- All statements : 10
Note that all test cases depend on successful implementation of print statements (otherwise, how do we test the output), and if, while, and switch statement test cases all depend on assignment statements.
There is extra credit available (described in the following section):
- For loops : 5
- Labels and GOTO : 5
Just as in the normal assignment, you’re not allowed to change compiler.c or compiler.h to implement the extra credit.
Extra Credit
Extend the language by supporting for loops and/or goto statements in the following way:
program → var_section body
var_section → id_list SEMICOLON
id_list → ID COMMA id_list | ID
body → LBRACE stmt_list RBRACE
stmt_list → stmt stmt_list | stmt
stmt → assign_stmt | print_stmt | while_stmt | if_stmt | switch_stmt | for_stmt | label_stmt | goto_stmt
for_stmt → FOR assign_stmt condition SEMICOLON assign_stmt body
label_stmt → ID COLON
goto_stmt → GOTO ID SEMICOLON
assign_stmt → ID EQUAL primary SEMICOLON
assign_stmt → ID EQUAL expr SEMICOLON
expr → primary op primary
primary → ID | NUM
op → PLUS | MINUS | MULT | DIV
print_stmt → PRINT ID SEMICOLON
while_stmt → WHILE condition body
if_stmt → IF condition body
condition → primary relop primary
relop → GREATER | LESS | NOTEQUAL
switch_stmt → SWITCH ID LBRACE case_list RBRACE
switch_stmt → SWITCH ID LBRACE case_list default_case RBRACE
case_list → case case_list | case
case → CASE NUM COLON body
default_case → DEFAULT COLON body
With the following added tokens:
FOR = FOR
GOTO = GOTO
For Loop Semantics
A for_stmt
has the following (standard) semantics:
- The first
assign_stmt
is executed. - The
condition
is evaluated. - If the
condition
evaluates totrue
, the body of thefor_stmt
is executed, then the secondassign_stmt
is executed, then goto step 2. - If the
condition
evaluates tofalse
, then the next statement following thefor_stmt
is executed.
For example, the following program:
a;
{
FOR a = 10; a > 0; a = a-1;
{
print a;
}
}
Will output the following:
10
9
8
7
6
5
4
3
2
1
Goto Semantics
A goto_stmt
has the following (standard) semantics:
- The next execution after the
goto_stmt
will be thelabel_stmt
that has anID
that matches theID
of thegoto_stmt
.
A label_stmt
is a NOOP.
For example, the following program:
foo;
{
foo = 1;
label:
IF foo < 100
{
foo = foo + 1;
IF foo > 10
{
GOTO end;
}
GOTO label;
}
end:
print foo;
}
Should output:
11
Submission
Submit your project here before the deadline.
Credit
This project description was adapted from Prof. Rida Bazzi, and is used with permission. Images used are copyright Rida Bazzi.