Friday, 2 October 2020

How compilers and interpreters work & what’s the difference?

Whatever language you use these days, be it JavaScript or C, C# or Lua, all of your source files have eventually to go through a compiler. Some of the languages require you to compile the code just once to be run how many times you want, and some require you to ship your source code in order to compile and execute it at the client-side. It feels like, there could not be a more clear difference between the two, but where do you draw the line between interpreters and compilers?

Compiler or interpreter

We, humans, love to divide things into neat categories. Apples are fruit and potatoes are vegetables, cats are animals and rocks are not, Pluto is a planet and …

But wait a second, why exactly did Pluto stop being identified as a planet by scientists? You see, the term “planet” is pretty obscure. Here is a common definition from my dictionary:

Planet is a celestial body moving in an elliptical orbit around a star.

And this definition worked for us for centuries, until we started observing our solar system closer. It turns out, Pluto is way smaller, than we thought it was, and it has a lot of friends orbiting just near of it. Do we also call them all planets? It does sound like a pretty bad idea, just imagine a teacher in school trying to get children to memorize thousands of planets. And since Pluto was so small, the easier solution was to give it a new category: dwarf planet. But not everyone is happy about the solution, and this is understandable.

And I feel like, the same thing happened with interpreters and compilers. Compilers compile code once, and interpreters do it every time you launch the program. Seems clear enough. But let’s dig into the internal workings of both compilers and interpreters, and see, if they are really that far apart from one another, as we previously thought.

O, compiler, what do you compile into?

So, let’s talk about compilers first. What is its job? Of course, it takes your source files, then something happens inside of the black box, and on the other side of it comes out nice and clean binary files. But is it always the case? Actually, no. In fact, let’s talk about JVM (Java Virtual Machine), as the most known example of compiling into bytecode.

Bytecode

Bytecode is a sequence of instructions, that a virtual machine executes. Usually, it’s just an array of bytes, hence the name bytecode. You are probably familiar with its older brother – native code (also known as machine code). It’s the same concept, actually, but it runs on real hardware, CPU follows the instructions and does what we tell him to do. Each CPU has its own set of instructions and other details, together known as CPU architecture. In fact, you can say, that bytecode is machine code, but designed for a different machine to be run on. In the case of Java it is JVM.

A huge plus of virtual machines is that you can design it to run on any processor, hence making your bytecode instruction set cross-platform. You also have the power of imagining and developing your own instructions, and I have to say, that it’s a ton of fun.

I know it might sound scary and complex, but let’s take a concrete example. Say we wrote a program, that adds variables a and b. How would a virtual machine do that? I will use my own little language lit for this example, as it has some nice bytecode debug tools.

So, here is our program:

3 + 4

After stuffing the program into the black box (what is inside we will talk later on), we get the following bytecode program out of it:

lit -d -Ono-all -e "3 + 4"

-e evals our string.
-d dumps the resulting bytecode instead of executing it.
-Ono-all disables all optimizations, otherwise, my optimizer will convert it to just 7.

== repl ==
        3 + 4
0000    1 OP_CONSTANT         0 '3'
0002    | OP_CONSTANT         1 '4'
0004    | OP_ADD
0005    | OP_RETURN

Let’s do the work the virtual machine does ourselves. We shall go through each instruction from top to bottom and execute them. For that, we will need only one thing: the stack.

Again, this fancy name makes it sound like a bigger deal than it actually is. In the reality, it’s just an array of values, that we can push more value to (aka stack data structure).

So let’s execute our program. The first instruction is OP_CONSTANT, with argument 0.
Numbers and strings are stored in an array for easier access. So let’s see, at index 0 in our constants we have the first number – 3. We push it onto the stack so that it now looks like this:

[ 3 ]

Moving on, the next instruction is OP_CONSTANT with argument 1. Easy enough, just push the number 4 (constants[1]) onto the stack:

[ 3 ][ 4 ]

Now, this is where the fun begins: OP_ADD. We take the last two numbers from the stack, add them together, and put the result back, resulting in:

[ 7 ]

Great! Now we have only a single instruction left, OP_RETURN. We take the last value on the stack and return it to the user. In our casem it’s the answer to our expression 3 + 4 is equal to 7 (right?)!

But wait, there is more! Did you ever hear about language into language compilers?

The art of transcompiling

Yes, you’ve heard me right, you can compile languages into other languages! Why? Sometimes just for fun or to learn how compilers work without having to compile into spooky native code or writing a virtual machine. But the more interesting use cases are porting codebases from one language to another. For the sake of what, you ask? Well, let’s say you have an amazing fresh out of the development c library, but you want to integrate it with JavaScript and run it in browsers? Say no more, Emscripten got you covered! It is a compiler, that takes your C/C++ code and produces JavaScript/Wasm equivalent.

I’ve written two myself, the first one converts Lua into JavaScript and the second one converts Java into C#, I’ve actually rewritten Java version of Burning Knight in C# with its help.

The garden of abstract syntax

And finally, for the desert, we have abstract syntax trees (AST). Without going into too much detail about it, it’s a way to represent the source of your program as a tree of nodes. Let’s look at our good old friend 3 + 4 again, here he is in all his tree glory:

3 + 4 represented as AST

As we will talk a tiny bit later, most of the compilers represent the code you give them in this way, before processing, optimizing & emitting whatever the output format is. For Java its bytecode, for C it’s native code. Some, either for the sake of simplicity or processing speed, skip it, and some, just stop here. But what’s the point of having an AST tree, you may ask? Well, you can execute it too, with a little algorithm called AST-walker.

It looks at the node you gave to him and then executes it. If it’s an addition node, it will evaluate the expression on the left, then on the right, and add them. It’s an insanely simple algorithm, but as you’ve probably have guessed at this point, it is neither memory efficient, nor is it fast. But it works, and for some projects, it’s good enough!

So what is the best thing to compile into?

As with any subject this wide, there is no concrete answer to this question. Each of the methods has its pros and cons. Native code is the fastest in terms of execution time, but it requires an insane amount of knowledge and effort to compile into. On the other hand, walking an AST is super simple but very slow. Bytecode provides a nice middle ground between the two, being a lot faster than walking down AST, and a lot more simple, than native code.

But let’s talk a tiny bit more in-depth about how compilers actually create their output.

The internals of a compiler

I will remind you again, that no two compilers are the same, and some developers choose wildly different methods of compilation, but most of the compilers work on this basic structure:

(Optional) Preprocessor modifies the input source code, for example in C compiler works out all your #define at this stage.
Scanner converts the source code into a stream of tokens.
Parser takes the tokens and converts them either straight to the output type of the compiler (for example, bytecode) and we end here, or to the AST to be processed more.
(Optional) Optimizer runs down the AST and tries to rewire it to be more optimal.
Emitter runs down the AST tree and produces the compiler output.

For the sake of example, let’s discuss two languages: lua and lit, and see for ourselves what’s going on inside of each of them, using the worlds most famous program:

print("Hello, world!")

Scanning

The task of the scanner is to identify important chunks of the source strings, like print, "Hello, world!" and punctuation. It makes it much easier to work with the code later down the line while parsing, if you can refer to strings as TOKEN_STRING and not have to call find_string() every time.

Okay, but what is a token? Well, in terms of the official definition, it’s described as "a single element of a programming language". Usually, it contains the information about where it was found in the file (line), where it started and where it ended, as well it’s type. If you are curious, here are all the token types in lit.

The process of converting the source code to tokens is pretty straight forward, you just remove all the unneeded white space, and decide the token type based on what your current char stream is.

In our example, the output of a scanner should be something like this:

TOKEN_IDETIFIER // print
TOKEN_LEFT_PAREN // (
TOKEN_STRING // "Hello, world!"
TOKEN_RIGHT_PAREN )

Parsing

The next step is converting the stream of tokens into either bytecode in the case of lua, or into the AST in the case of lit.

There are two most commonly used algorithms for parsing:

Recursive descent parser is a more simple and straightforward one.
Pratt parser is a bit harder to wrap your head around but is also more efficient.

Lua uses the first one, while lit – the pratt parser. We won’t go into the details of implementing them, since it was covered quite a bit before and this post is already big enough as it is, but instead, let’s talk about why Lua has a single-pass compiler, that compiles straight into bytecode, and lit instead creates an AST tree first.

Single-pass compiler? N-pass compiler?
N-pass compiler does n passes (goes over) the code / AST n times. So a single-pass compiler does not represent the code as AST but instead emits the result straight away.
Extra passes can be optimization, emission of AST, and resolving.

Since the compilation of Lua code stops at this phase, let’s follow the lit compilation process further.

Optimization

This is, probably, the hardest part of writing a good compiler. Programmers are usually lazy or do not know full details of how everything works under the hood, so it is your job as the language creator to try and make the code run faster. Only if, of course, you choose to even implement optimizations in your language. With lit, I’ve only recently started working on the optimizer module, here is the most advanced optimization in it yet:

for (var i in 0 .. 10) {
    print(i)
}

Is converted to the equivalent of:

for (var i = 0; i < 10; i++) {
    print(i)
}

Doesn’t seem like a huge change, huh? Well, my benchmarks showed a 30% increase in execution speed of for-loop tests with this optimization, and here is why: yes, the first loop looks much more simple, but the simplicity has a hidden cost: method calls. When it’s compiled, it’s equivalent to this code:

{
    var _seq = 0 .. 10
    var _iter = null

    while (_iter = _seq.iterator(_iter)) {
        var i = _seq.iteratorValue(_iter)
        print(i)
    }
}

Since any object can define two methods required for iteration, iterator(i) and iteratorValue(i) we have no choice but to compile for the general case. This is what happens when you iterate lists, maps, classes, and anything else in lit. Except for ranges. Since range is a unique expression, we can try and optimize it into a faster c-style for loop, that we feel much lazier to write.

Lit optimizer is really basic at this point in time, but I have big plans on expanding it. A huge plus of compiling into AST being able to backtrack, look up parts of code at any place in the file, without having to recompile it again. But what happens with AST now, how does it become our program? Let’s find out.

Emitting

As I’ve said many times already, it sounds way more complex, than it actually is. An emitter is a simple AST walker, that looks at expression and tells the compiler, how it looks in bytecode.

Probably the hardest part of it is finding all the variables and linking them. I hope you won’t be surprised to hear, that accessing variables by their names at runtime is not a very cost-effective thing to do, since looking up a string in a hashmap will always be longer, than just getting a value from an array (it is as simple as adding the index to array address and accessing that region of memory). That’s why a lot of languages actually access local variables by a unique index, assigned to them (it’s unique only in the scope of the current function). How do we get from variable names to variable indexes? Well, we walk.

Let’s consider this example, and walk with the emitter:

// I see a variable expression with the name a, define it and assign index 0 to it
var a = 32
// A new variable, assign it an index of 1
var b

// By looking up a in the current function scope we find the declared before variable a with index 0
// So we emit instruction OP_GET_LOCAL 0
print(a)
// Same here, but b has index of 1
// So we emit instruction OP_GET_LOCAL 1
print(b)

// Hey, I've seen this one, I already have the variable a in this scope, throw an error
var a = 48

And that’s it. After visiting all the code, we get our program. In case of lit its bytecode.

So what exactly is an interpreter?

And now let’s get back to our question: what’s the difference between a compiler and an interpreter? Well, the most commonly agreed distinction is that an interpreter runs the result straight away, but a compiler stores it for later use. But now let’s cast our minds back to most modern "interpreted" languages: most of them allow you to compile to bytecode, save that bytecode, and run it later! Lit even allows you to embed it into runtime and have it as a native application! So what’s the difference? I think this is the problem we face so often with old definitions: we just can’t fit modern discoveries and advancements into the old boxes!

So. Is there a difference? To me, they all are compilers. They are the most amazing pieces of software, I’ve ever studied, I’m fascinated by every single of them. But I will leave it to you to decide for yourself. Is Lua a compiler or an interpreter?

P.S. by the way, I just can’t spell the word interpreter, I have such a hard time typing it compared to a compiler, so that is that.

You might also find interesting

2023: the year of 3D
Rexcellent Games 30.12.2023
3D printed: the experience
Rexcellent Games 11.12.2023
Does Go Have Subtyping?
stuffwithstuff 19.10.2023

Rexcellent Games

A blog