Ekans is a programming language that is basically a subset of Racket with minor difference where we build a compiler of it using itself.
The goal of the ekans project is to experience self-hosting. Self-hosting is amazing, you can build a compiler of a language using itself. On the other hand, implementing every Racket functions is not a goal. Language features are prioritized by whether or not they are used by the compiler or not.
Build a language that is compatible with Racket is basically a cheat. A typical bootstrapped language needs two compilers, one written in an established language (e.g. C++) and then once it is completed then we can use that one to build the compiler written in the target language. With limited time, that is a trade off we made.
At a high-level, ekans divides into 3 big components:
- A compiler written in ekans that compiles ekans source code to C source code
- A runtime written in C that manages ekans object references, and
- A library (part of it written in C if they are primitive) and (part of it written in ekans if it doesn't require primitive functionalities) so that the user code and compiled code can use to perform useful work.
The runtime mostly deal with data representation and memory management. Here is how the data is represented at runtime:
typedef enum {
number,
boolean,
character,
string,
environment,
closure,
nil,
cons,
symbol,
} ekans_type;
typedef struct ekans_environment {
int binding_count;
ekans_value** bindings;
struct ekans_value* parent;
} ekans_environment;
typedef struct ekans_closure {
ekans_value* closure;
ekans_function function;
} ekans_closure;
typedef struct ekans_cons {
ekans_value* head;
ekans_value* tail;
} ekans_cons;
struct ekans_value {
ekans_type type;
union {
int n;
bool b;
char a;
char* s;
ekans_environment e;
ekans_closure c;
ekans_cons l;
} value;
struct ekans_value* prev;
struct ekans_value* next;
};
extern ekans_value head;
extern ekans_value tail;
typedef struct stack_slot {
ekans_value** slot;
struct stack_slot* next;
} stack_slot;
ekans is untyped - which mean at runtime, a variable can be of any type. There we need a type that can represent variable of any type, therefore we have ekans_value
. Using ekan_value->type
, we know what type of value it is, and then we can use the union to obtain the data as we will need.
All ekans values are chained together into a doubly linked list using the prev
and next
pointers. This allows the garbage collector to scan through all objects.
The garbage collector is implemented using the classic mark and sweep strategy. The object themselves are allocated using malloc
and then chained to the list of objects before returning to the user code. User code are supposed to make sure it holds the roots alive by using stack_slot
. These stack slots contains pointers to the object in question, and we will mark them recursively during the mark phase. Then we will walk the objects sequentially and then free
them it is not marked.
A couple of data type are for internal usage only.
An ekans_environment
can be thought as an array with a parent, something looks like this:
NULL
^
|
[v3 v4]
^
|
[v1 v2 v3]
This is how a function obtain any values that it needs to use, while a value in a ekans language is identified by a name, at runtime, a value is identified by a pair (level, index)
, indicating how many times it needs to go up and go right to find the value. For example, (1, 1)
would mean v4
.
An ekans_closure
represents a lambda. A lambda in ekans is not just a function pointer, it is also all the values that it captures. The ekans_value->value.e.closure
is always an environment, so the lambda can access the captured values just like in any other functions.
Every ekans function has the underlying signature like this:
void function_name(ekans_value* env, ekans_values** pReturn)
An env
suffice to pass every needed values by the function, by convention, the arguments are at level 0, and the other levels are the captured closure.
The library implements all the primitive routines that perform the work, something like plus
or cons
is directly implemented as the library.
The code generated by the codegen mostly just use these functions instead of performing the actual work.
Physically, the library is also part of the ekans.o
binary together with the runtime.
Some ekans routines could be embedded as some secret defines hard coded in the compiler as well.
The compiler is further breakdown into the lexer/parser/codegen
- The lexer produces a stream of tokens
- The parser produces a parse tree by calling the lexer
- The codegen produces the C code.
The C code is then compiled using clang, linking with the runtime and the library routines to produce the final executable. The project intentionally avoid dealing with low level details.
The lexer mostly just read the input stream and produce the tokens. It used a simple lookahead strategy to see if it might match something interesting, and report them as is.
The parser try not to be smart, it helps with matching parentheses, handle various types of token, but that's it. The generated parse tree is mostly just a plain mirror of how the code actually is.
The codegen is the most complicated component in the compiler. It is responsible for translating the whole program into a C program, and it done so using a mix of DFS and BFS.
To begin with, the given problem is simply a list of statements. We have a generic routine (generate-statements)
to handle the code generation of a list of statements.
To generate a list of statements, we extract and convert the (define)
into a special internal form called (set-environment)
, these (set-environment)
forms will make the definition available in the execution envirnoment, so that it will be available as symbol at compile time and available as a value in the env
at runtime.
Next, individual statements are generated using (generate-statement)
. That routine is a general dispatcher that just call the statement specific generation routine.
For simple constructs, these routines simply recursively call to generate the values it needs, and combine them to produce the value it need to produce.
For more complicated constructs such as let
or cond
, instead of generating the code directly, it converts itself into simpler primitives such as lambda
or if
and reuse the generation routines there instead.
Roughly speaking, it is mostly a post-order trasveral of the parse tree.
lambda
, in particular, is the most complicated one. In addition to generating the code that produce the ekans_closure
, it also need to make sure the body is available as a function in the generated code. Therefore, in addition to the code generation at the spot, it needs to enqueue the body into a queue so that some higher level routine pick it up and generate the body for it.
Roughly speaking, we are compiling the functions in breadth first order.
Last but not least, to make the C compiler happy, we generate all the variable declarations, function forward declarations, as a post-processing step, we generate the builtins table as well.
As is, a primitive ekans compiler is built using ekans. We took all the possible shortcuts to get there. In the future, we will want to revisit and do it in a more feature rich fashion. There are a lot of things we could do, these just scratches the surface:
- Compile error checking
- Error stack traces
- A debugger
- An IR representation phase to allow for
- Generating assembly instructions instead of C
- Various optimizations
- ...
Alternatively, with a self-hosted compiler, we are official off the control of Racket, we could invent our own language features that doesn't exist on Racket, interesting inclusion could be:
- destructuring
- try/catch
- monads
- ...