C-to-Assembly (x86-64) compiler for a basic subset of C.
Written simply to learn more about compilers, assembly, and how not to design languages. :)
If something is missing from the list below, it is not planned.
- Operators:
  - Unary:
    - Prefix (`--`, `++`, `!`, `~`, `-`)
    - Postfix (`--`, `++`)
  - Binary:
    - Arithmetic (`+`, `-`, `*`, `/`, `%`)
    - Bitwise (`&`, `|`, `^`, `<<`, `>>`)
  - Logical (`!`, `&&`, `||`)
  - Relational (`<`, `<=`, `>`, `>=`, `==`, `!=`)
- Local variables:
  - Declaration
  - Assignments
  - Compound assignments (`+=`, `-=`, etc.)
  - Scopes
- Storage-class specifiers:
  - `static`
  - `extern`
  - `typedef`
- Conditionals and control flow:
  - If statements
  - Ternary expressions
  - Labeled statements
  - Switch statements
  - `goto` statements
  - `break` and `continue`
- Loops:
  - For loops
  - While loops
  - Do-while loops
- Functions:
  - Function declarations
  - Function definitions
  - Function calls
- Types:
  - `void`
  - `int`
  - `long`
  - `unsigned int`
  - `unsigned long`
  - `double`
  - `char`
  - `signed char`
  - `unsigned char`
  - Structs
  - Unions
  - Pointers
  - Pointer arithmetic
  - Arrays
- Memory management:
  - `sizeof` operator
  - `malloc`
  - `calloc`
  - `realloc`
  - `aligned_alloc`
  - `free`
- Optimizations:
  - Constant folding
  - Dead code elimination
  - Dead store elimination
  - Copy propagation
  - Register allocation
  - Register coalescing
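
To make the first optimization in the list concrete, here is a minimal constant-folding sketch in Python. The nested-tuple expression shape and the `fold` name are assumptions made for illustration; they are not this compiler's actual data structures or implementation language.

```python
import operator

# Hypothetical expression shape: an int constant, a variable name (str),
# or a tuple (op, left, right) for a binary operation.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "&": operator.and_,
    "|": operator.or_,
    "^": operator.xor,
}


def fold(expr):
    """Recursively replace constant subexpressions with their value."""
    if not isinstance(expr, tuple):
        return expr  # constants and variable names are already minimal
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int) and op in OPS:
        return OPS[op](left, right)
    # At least one operand is unknown at compile time (or the operator
    # is not handled here), so keep the operation as it is.
    return (op, left, right)
```

For example, `fold(("+", "x", ("*", 2, 3)))` returns `("+", "x", 6)`. The sketch ignores fixed-width integer wrapping, and it leaves `/` and `%` alone because Python's floor division differs from C's truncating division; a real folder would have to model the target type exactly.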
The grammar below is defined using EBNF-like notation.
<program> = <function>
<function> = "int" <identifier> "(" "void" ")" <block>
<block> = "{" { <block-item> } "}"
<block-item> = <declaration> | <statement>
<declaration> = "int" <identifier> [ "=" <expression> ] ";"
<statement> = "return" <expression> ";"
| <expression> ";"
| <identifier> ":" <statement>
| "if" "(" <expression> ")" <statement> [ "else" <statement> ]
| "break" ";"
| "continue" ";"
| "switch" "(" <expression> ")" <statement>
| "while" "(" <expression> ")" <statement>
| "do" <statement> "while" "(" <expression> ")" ";"
| "for" "(" <initializer> [ <expression> ] ";" [ <expression> ] ")" <statement>
| "goto" <identifier> ";"
| <block>
| ";"
<initializer> = <declaration> | [ <expression> ] ";"
<expression> = <factor>
| <expression> <binary-op> <expression>
| <expression> "?" <expression> ":" <expression>
<factor> = <unary-op> <factor> | <postfix>
<postfix> = <primary> { <postfix-op> }
<primary> = <int> | <identifier> | "(" <expression> ")"
<unary-op> = "-" | "~" | "!" | "++" | "--"
<postfix-op> = "++" | "--"
<binary-op> = "+" | "-" | "*" | "/" | "%"
| "<<" | ">>" | "&" | "|" | "^"
| "&&" | "||" | "==" | "!=" | "<" | "<=" | ">" | ">="
| "=" | "+=" | "-=" | "*=" | "/=" | "%=" | "&=" | "|=" | "^=" | "<<=" | ">>="
<identifier> = ? An identifier token ?
<int> = ? A constant token ?
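
The rule `<expression> <binary-op> <expression>` is ambiguous on its own, so a parser needs precedence and associativity rules on top of it. One common technique is precedence climbing, sketched below in Python. Whether this compiler uses it is not stated here; the function names, the pre-tokenized input, and the nested-tuple output are all assumptions, and the precedence numbers are arbitrary values that only preserve C's relative ordering.

```python
# Binding strength of each binary operator; higher binds tighter.
PRECEDENCE = {
    "*": 50, "/": 50, "%": 50,
    "+": 45, "-": 45,
    "<<": 40, ">>": 40,
    "<": 35, "<=": 35, ">": 35, ">=": 35,
    "==": 30, "!=": 30,
    "&": 25, "^": 20, "|": 15,
    "&&": 10, "||": 5,
    "?": 3,
    "=": 1, "+=": 1, "-=": 1, "*=": 1, "/=": 1, "%=": 1,
    "&=": 1, "|=": 1, "^=": 1, "<<=": 1, ">>=": 1,
}
ASSIGNMENT = {op for op, prec in PRECEDENCE.items() if prec == 1}


def expect(tokens, want):
    """Consume the next token, failing if it is not `want`."""
    if not tokens or tokens.pop(0) != want:
        raise SyntaxError(f"expected {want!r}")


def parse_expression(tokens, min_prec=0):
    """Parse a token list (consumed from the front) into nested tuples."""
    left = parse_factor(tokens)
    while tokens and tokens[0] in PRECEDENCE and PRECEDENCE[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        prec = PRECEDENCE[op]
        if op in ASSIGNMENT:
            # Right-associative: recurse with the same precedence.
            left = (op, left, parse_expression(tokens, prec))
        elif op == "?":
            # The middle operand is delimited by "?" and ":", so it
            # restarts at the lowest precedence.
            middle = parse_expression(tokens, 0)
            expect(tokens, ":")
            left = ("?:", left, middle, parse_expression(tokens, prec))
        else:
            # Left-associative: the right operand must bind strictly tighter.
            left = (op, left, parse_expression(tokens, prec + 1))
    return left


def parse_factor(tokens):
    """<factor> = <unary-op> <factor> | <postfix>"""
    if not tokens:
        raise SyntaxError("unexpected end of input")
    tok = tokens.pop(0)
    if tok in ("-", "~", "!", "++", "--"):
        return ("pre" + tok, parse_factor(tokens))
    if tok == "(":
        node = parse_expression(tokens)
        expect(tokens, ")")
    elif tok.isdigit():
        node = int(tok)
    elif tok.isidentifier():
        node = tok
    else:
        raise SyntaxError(f"unexpected token {tok!r}")
    # <postfix> = <primary> { <postfix-op> }
    while tokens and tokens[0] in ("++", "--"):
        node = ("post" + tokens.pop(0), node)
    return node
```

For example, `parse_expression(["1", "+", "2", "*", "3"])` returns `("+", 1, ("*", 2, 3))`, and `parse_expression(["a", "=", "b", "=", "1"])` returns `("=", "a", ("=", "b", 1))`.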
The AST represents the syntax tree of the program and is the structure on which semantic analysis is performed.
This IR sits between the AST and the assembly code. It lets us handle structural transformations separately from the details of assembly language (still to be done), and it is well suited to compile-time optimizations (also still to be done).
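
As a rough illustration of what such an intermediate step can look like, the Python sketch below flattens a nested expression into a list of three-address instructions. The instruction layout `(op, src1, src2, dst)`, the `tmp.N` naming, and the `lower` function are hypothetical; the actual IR of this project is not described here.

```python
import itertools


def lower(expr, instructions, temps=None):
    """Flatten a nested-tuple expression into three-address instructions.

    Appends to `instructions` and returns the operand (constant, variable
    name, or temporary) that holds the result.
    """
    if temps is None:
        temps = itertools.count()  # source of fresh temporary numbers
    if not isinstance(expr, tuple):
        return expr  # constants and variable names are already operands
    op, left, right = expr
    src1 = lower(left, instructions, temps)
    src2 = lower(right, instructions, temps)
    dst = f"tmp.{next(temps)}"
    instructions.append((op, src1, src2, dst))
    return dst
```

Lowering `("+", "a", ("*", 2, "b"))` appends `("*", 2, "b", "tmp.0")` and then `("+", "a", "tmp.0", "tmp.1")`, and returns `"tmp.1"`. Operators with control flow, such as `&&`, `||`, and `?:`, would need jumps and labels rather than a single instruction; that is left out of this sketch.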
This IR is very low-level, and is used to emit assembly code.
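
A minimal sketch of the emission step, assuming AT&T syntax as accepted by the GNU assembler (the syntax this compiler actually emits is not stated here). The operand and instruction tuples, and the `operand` and `emit` functions, are hypothetical stand-ins for the low-level IR.

```python
def operand(op):
    """Format one operand in AT&T syntax."""
    kind, value = op
    if kind == "imm":
        return f"${value}"  # immediate constant
    if kind == "reg":
        return f"%{value}"  # hardware register
    raise ValueError(f"unknown operand kind: {kind}")


def emit(function_name, instructions):
    """Render a function's instruction list as assembly text."""
    lines = [
        f"    .globl {function_name}",
        f"{function_name}:",
        "    pushq %rbp",
        "    movq %rsp, %rbp",
    ]
    for instr in instructions:
        if instr[0] == "ret":
            # Restore the caller's frame before returning.
            lines += ["    movq %rbp, %rsp", "    popq %rbp", "    ret"]
        else:
            mnemonic, src, dst = instr
            lines.append(f"    {mnemonic} {operand(src)}, {operand(dst)}")
    return "\n".join(lines) + "\n"
```

A function body equivalent to `return 2 + 3;` could then be emitted with `emit("main", [("movl", ("imm", 2), ("reg", "eax")), ("addl", ("imm", 3), ("reg", "eax")), ("ret",)])`.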
Licensed under the MIT License.