Project 5 (100 points)
Project 5 is due 12/2/15 on or before 11:59:59pm MST.
Your goal is to write the front-end for a compiler. This front-end is responsible for parsing the input language and creating an intermediate representation. The back-end of the compiler takes in the intermediate representation and executes that intermediate representation. The back-end of the compiler is already written and cannot be changed.
Grammar Description
program → var_section body
var_section → id_list SEMICOLON
id_list → ID COMMA id_list | ID
body → LBRACE stmt_list RBRACE
stmt_list → stmt stmt_list | stmt
stmt → assign_stmt | print_stmt | while_stmt | if_stmt | switch_stmt
assign_stmt → ID EQUAL primary SEMICOLON
assign_stmt → ID EQUAL expr SEMICOLON
expr → primary op primary
primary → ID | NUM
op → PLUS | MINUS | MULT | DIV
print_stmt → PRINT ID SEMICOLON
while_stmt → WHILE condition body
if_stmt → IF condition body
condition → primary relop primary
relop → GREATER | LESS | NOTEQUAL
switch_stmt → SWITCH ID LBRACE case_list RBRACE
switch_stmt → SWITCH ID LBRACE case_list default_case RBRACE
case_list → case case_list | case
case → CASE NUM COLON body
default_case → DEFAULT COLON body
The tokens used in the grammar description are (note PRINT is not what you would expect):
SEMICOLON = ;
ID = letter(letter | digit)*
COMMA = ,
LBRACE = {
RBRACE = }
EQUAL = =
NUM = 0 | (digit digit*)
PLUS = +
MINUS = -
MULT = *
DIV = /
PRINT = print
WHILE = WHILE
IF = IF
GREATER = >
LESS = <
NOTEQUAL = <>
SWITCH = SWITCH
CASE = CASE
COLON = :
DEFAULT = DEFAULT
Lexing
compiler.h and compiler.c also define the lexing functions
(getToken and ungetToken) which you should use for reading in the
tokens.
Language Semantics
Division is integer division (just as in C) and the result of the division of two integers is an integer.
All variables must be explicitly declared in the
var_sectionand have global scope.All variables are of type integer.
All variables initially have the value 0.
All statements in a statement list are executed sequentially according to the order in which they appear.
Boolean Condition
A boolean condition takes two operands as parameters and returns a boolean value.
If Statements
An if_stmt has the following (standard) semantics (note that our
language does not have an else clause to the if_stmt):
- The condition is evaluated.
- If the condition evaluates to
true, then the body of theif_stmtis executed, then the next statement following theif_stmtis executed. - If the condition evaluates to
false, then the next statement following theif_stmtis executed.
While Statements
while_stmt has the following (standard) semantics:
- The condition is evaluated.
- If the condition evaluates to
true, the body of thewhile_stmtis executed, then goto step 1. - If the condition evaluates to
false, then the next statement following thewhile_stmtis executed.
Switch Semantics
switch_stmt has the following (standard) semantics:
- The value of the switch variable is checked against each case number in order.
- If the value matches the case number, the body of the case is
executed, then the next statement following the
switch_stmtis executed. - If the value does not match the case number, the next case is evaluated.
- If a default case is provided and the value does not match any of
the case numbers, then the body of the default case is executed,
then the next statement following the
switch_stmtis executed. - If there is no default case and the value does not match any of the
case numbers, then the next statement following the
switch_stmtis executed.
Print Semantics
print_stmt will print out the value of the variable (give as the
ID) at the time of the execution of the print_stmt.
Intermediate Representation
The intermediate representation is a graph that the compiler back-end
knows how to interpret and execute. All intermediate representation
data structures are defined in compiler.h, and you are not allowed
to change compiler.h or compiler.c, which are posted on the
submission site.
You should become very familiar with the intermediate representation
data structures in compiler.h and their usage to execute the program
in the execute_program function in compiler.c.
StatementNode
The main data structure of the intermediate representation is the
StatementNode, which has a type field which specifies what type of
statement it is (AssignmentStatement, PrintStatement,
IfStatement, or GotoStatement) and a field next which points to the
next StatementNode in the list. Thus, StatementNode forms a linked
list of statements that should be executed, in order. If the next
field is NULL, then program execution terminates.
ValueNode
The ValueNode data structure is used to represent a value in the
program (both literals and variables). You should read compiler.c to
understand how ValueNode is used when executing the intermediate
representation.
AssignmentStatement
The intermediate language AssignmentStatement is constructed of a
left_hand_side (the ValueNode that is being assigned to) and a
potential operation performed on two operands. The op field
specifies the type of the operation to perform, and an op value of
OP_NOOP means that no operation is performed, so the value
associated with operand1 is copied to the value associated with
left_hand_side.
PrintStatement
The intermediate language PrintStatement has a ValueNode id as
its only field, which specified the value to print out.
IfStatement
The intermediate language IfStatement is more complicated structure
than the other structures. Similar to an AssignmentStatement, the
condition of an IfStatement must be an operator, represented
by condition_op. This operator is applied to condition_operand1
and condition_operand2, and if the result is true, then program
execution continues executing the StatementNode in the
true_branch, otherwise program execution continues executing the
StatementNode in the false_branch.
GotoStatement
The intermediate language GotoStatement is a fairly simple
construct. It has one field, target, that is the StatementNode to
begin executing from when the GotoStatement is executed.
Implementation
You are required to implement the function
parse_generate_intermediate_representation (in another file than
compiler.c and compiler.h, as you cannot modify these files) with
the following signature:
struct StatementNode* parse_generate_intermediate_representation();
You can implement this function in C or C++, follow the guidelines in
compiler.h.
You are free to write whatever helper functions you need (and really
you should, a single huge function is not a good way to code), but you
cannot modify compiler.h and compiler.c.
You can assume that all input programs will properly type check and that there will not be any syntax errors.
Hopefully you have noticed by now that there is a gap between the language semantics and the intermediate representation semantics. Essentially, your goal is to overcome this gap, by parsing the program and translating program semantics that do not exist in the intermediate representation semantics into semantically equivalent intermediate representation. This is the heart of the compiler, and it is exactly what GCC does when it compiles your C program to x86 assembly: x86 does not have while loops or switch statements, but C does.
Consider the program language if_stmt, which you must translate into
a corresponding intermediate language IfStatement. We want the
semantics of the if_stmt to be preserved when performing this
translation.
One way to do this would be to create a StatementNode data
structure, which has a type of IF_STMT. The if_stmt field of the
StatementNode is an IfStatement. The true_branch should be the
body of the if_stmt, and the false_branch can be a NOOP_STMT.
However, this is not enough! Due to the semantics of the intermediate
language, if the IfStatement condition evaluates to false, then
execution starts at the false_branch. This branch is a NOOP_STMT,
so that is executed, then execution terminates. This does not follow
the semantics of our original language! The program should continue
execution at the next statement following the if_stmt. The same
logic causes a problem if the IfStatement condition evaluates to
true. The body will execution, then execution will terminate.
Therefore, we must ensure that the next field of the last
StatementNode in the true_branch and the false_branch is the
StatementNode that follows the if_stmt.
One way to implement this is in the following diagram:

A similar situation occurs when implementing the while_stmt, however
the semantics of the while_stmt can be implemented using a
GotoStatement as the last instruction of the true_branch, as the
following diagram demonstrates:

switch_stmt must be handled similarly.
Examples
You will be given all the test cases for assignment statements, if statements, and while statements. However, here are some additional examples:
a;
{
a = 10 ;
print a;
a = 20;
print a;
}
Should output:
10
20
The program:
a;
{
a = 10 ;
SWITCH a
{
CASE 0 :
{
a = 20 ;
}
CASE 10 :
{
a = a + 40;
}
}
print a ;
}
Should output:
50
Evaluation
Your submission will be graded on passing the automated test cases.
The test cases (there will be multiple test cases in each category, each with equal weight) will be broken down in the following way (out of 100 points):
- Assignment statements : 20
- If statements : 25
- While statements : 25
- Switch statements : 20
- All statements : 10
Note that all test cases depend on successful implementation of print statements (otherwise, how do we test the output), and if, while, and switch statement test cases all depend on assignment statements.
There is extra credit available (described in the following section):
- For loops : 5
- Labels and GOTO : 5
Just as in the normal assignment, you’re not allowed to change compiler.c or compiler.h to implement the extra credit.
Extra Credit
Extend the language by supporting for loops and/or goto statements in the following way:
program → var_section body
var_section → id_list SEMICOLON
id_list → ID COMMA id_list | ID
body → LBRACE stmt_list RBRACE
stmt_list → stmt stmt_list | stmt
stmt → assign_stmt | print_stmt | while_stmt | if_stmt | switch_stmt | for_stmt | label_stmt | goto_stmt
for_stmt → FOR assign_stmt condition SEMICOLON assign_stmt body
label_stmt → ID COLON
goto_stmt → GOTO ID SEMICOLON
assign_stmt → ID EQUAL primary SEMICOLON
assign_stmt → ID EQUAL expr SEMICOLON
expr → primary op primary
primary → ID | NUM
op → PLUS | MINUS | MULT | DIV
print_stmt → PRINT ID SEMICOLON
while_stmt → WHILE condition body
if_stmt → IF condition body
condition → primary relop primary
relop → GREATER | LESS | NOTEQUAL
switch_stmt → SWITCH ID LBRACE case_list RBRACE
switch_stmt → SWITCH ID LBRACE case_list default_case RBRACE
case_list → case case_list | case
case → CASE NUM COLON body
default_case → DEFAULT COLON body
With the following added tokens:
FOR = FOR
GOTO = GOTO
For Loop Semantics
A for_stmt has the following (standard) semantics:
- The first
assign_stmtis executed. - The
conditionis evaluated. - If the
conditionevaluates totrue, the body of thefor_stmtis executed, then the secondassign_stmtis executed, then goto step 2. - If the
conditionevaluates tofalse, then the next statement following thefor_stmtis executed.
For example, the following program:
a;
{
FOR a = 10; a > 0; a = a-1;
{
print a;
}
}
Will output the following:
10
9
8
7
6
5
4
3
2
1
Goto Semantics
A goto_stmt has the following (standard) semantics:
- The next execution after the
goto_stmtwill be thelabel_stmtthat has anIDthat matches theIDof thegoto_stmt.
A label_stmt is a NOOP.
For example, the following program:
foo;
{
foo = 1;
label:
IF foo < 100
{
foo = foo + 1;
IF foo > 10
{
GOTO end;
}
GOTO label;
}
end:
print foo;
}
Should output:
11
Submission
Submit your project here before the deadline.
Credit
This project description was adapted from Prof. Rida Bazzi, and is used with permission. Images used are copyright Rida Bazzi.