Wednesday 20 July 2016

Implementing A Programming Language In C - Update

Static or Dynamic

It’s been quite some time since my last post because I’ve been considering making the language statically typed. Most of the languages I’ve made are dynamically typed (except for my most recent one) and they’re perfectly fine. The problem is, it’s particularly difficult optimizing such languages for real-world use without greatly increasing the size or complexity of the implementation. Ultimately, most dynamically typed languages are still strongly typed (save a few: https://www.destroyallsoftware.com/talks/wat), and errors which would be caught at compile time are caught at runtime. This runtime overhead is easily avoided by moving type-checking over to the compiler. The language runtime itself can still be “dynamic” but type errors can be caught at compile-time (this is what python does with type hints).
One of the most common arguments against static typing is that it adds a lot of superfluous noise to the source code:
BitBuffer* bitBuffer = new BitBuffer; // lots of typenames
And while that may be true, much of this noise can be removed by inferring types, making the code practically indistinguishable from its dynamically typed equivalent:
auto bitBuffer = new BitBuffer; // not very different from 'var bitBuffer = new BitBuffer;'
Do note that no type information is lost in the latter example.
Concerning performance, bytecode produced for a dynamically typed language will likely be much slower than for a statically typed one. For example:
def add(x, y):
    return x + y
would produce something like (hypothetical machine code):
.add
    type_of_x = typeof x                 ; types extracted at runtime
    type_of_y = typeof y
    
    beq type_of_x, int, add_x_int           ; if the type of x is 'int' goto ...
    beq type_of_x, float, add_x_float       ; ...
    
    print "Cannot add values"
    error

.add_x_int
    beq type_of_y, int, add_x_y_int
    beq type_of_y, float, add_x_int_y_float
    
.add_x_float
    beq type_of_y, int, add_x_float_y_int
    beq type_of_y, float, add_x_float_y_float
    
.add_x_y_int
    result = addi x, y
    ret

.add_x_int_y_float
.add_x_float_y_int 
    print "attempted to add int and float"
    error

.add_x_float_y_float
    result = addf x, y
    ret
Whereas a statically typed program like the following:
int add(int x, int y)
{
    return x + y;
}
Would produce something like:
.add
    result = addi x, y
    ret
I’d say that’s a win.

Memory Management

Okay so languages with dynamic runtimes pretty much have to have a means of automatically managing memory because they simply do not know the size of any values at compile-time. If they didn’t, you’d have to write shit like:
def shit():
    x = 10
    free(x) # free the dynamically allocated integer value
Or at least:
def shit():
    x = 10 
    release(x) # manual ref counting
Keeping track of dynamically allocated values is hard enough. Having to keep track of values allocated on the “stack” would only exacerbate the problem. So, most languages have garbage collectors or automatic reference counting + a garbage collector for reference cycles (like python).

New Syntax

Taking all that I’ve discussed into consideration, I’ve revised the syntax of “Tut” (the programming language we’re making) to the following:
var pos : vec2; // do note that the type is declared after its use

struct vec2
{
    x : int;
    y : int;
};

func make_vec2(x : int, y : int) : vec2
{
    var result : vec2;
    
    result.x = x;
    result.y = y;
    
    return result;
}

// functions which make use of strings should take cstr's as parameters
func puts(str : cstr) : void
{
    printf("%s\n", str);
}

func main() : int
{
    pos = make_vec2(10, 10);
    printf("(%i, %i)\n", pos.x, pos.y);
    
    var name : cstr = "Andy";                       // "cstr" is an immutable reference to a string
    var start_of_name : str = substr(name, 0, 2);   // "str" is a mutable heap-allocated string
    
    puts(name);
    puts(startOfName);                              // "str"s are implicitly casted to cstr
    
    free_str(start_of_name);                        // substr allocates memory so it has to be freed
    return 0;
}
Yes, this is basically C with postfix types, but it has built-in strings and, more importantly, it’s ours to change as we see fit!
Well, I'll get to work on an implementation ASAP.

Tuesday 28 June 2016

Implementing a Programming Language in C - Part 1 - Lexer

In the previous blog post pertaining to implementing programming languages, I provided a sample of the syntax of the language:

In this blog post, we'll be going over how to split up this 'Tut' script into its constituent elements, or "tokens". This process is generally referred to as "Lexical Analysis" and it is done by what is called a "Lexer".
Let's begin!

Defining The Tokens

Before we can create our lexer, we must first define each and every valid token in the language. This can be done very easily with a C enum:


As you can see, every valid entity in the language is represented by one of these values.

The Lexer

The job of the lexer is to take a string containing "Tut" code and transform it into "TutToken" on demand. This is what the header file of the lexer looks like.

// TODO: Finish blog post
Here is the repository thus far.

Sunday 10 April 2016

My Own Simple Version Control System

Why?



Well, it was Friday night. I was trying to revert my video game to a previous version. I opened up the Git bash, stashed my work, git add -A, and git reset --hard to the previous commit. But, lo and behold, git refused to change the repository in any way whatsoever despite professing that it was successful in doing so.

"Odd.", I thought, "Perhaps I should check this out on StackOverflow and see how others do this.". I found a question similar to mine; unfortunately, every answer presented at least 3 different methods of doing the very same. Moreover, each answer had comment sections filled with people fuming about the shortcomings of the method that the answer used. 

Now, I understand that I should've consulted the manual; StackOverflow isn't exactly known for providing robust solutions to highly specified problems. That being said, the fact that there isn't one agreed upon way of reverting to a previous version (which is practically the foundation of VCS) is rather frustrating. Don't get me wrong, git works very well on a larger scale; its complexity is simply a byproduct of the type of problem it is trying to solve. But for a lone user like me who requires nothing but a simple means of keeping track of versions and being able to switch between them without hassle, it is clearly too complex.

This is why I decided to create my own simple VCS. To be quite honest, I know there are other, simpler VCS than git which I could use instead, but I was intrigued by the concept of writing one.

How?

At first, I presumed a simple program which copied the current directory and then stored a zipped version of it in another one would suffice. This would be quite trivial to implement and would serve my needs (sorta). There were, of course, some clear problems with this, most prominently: it wasted space. 

You see, I store most of my project files on Dropbox. Foolhardy, maybe, but I like having access to it regardless of where I am or what device I'm using. This is why a system like the one I proposed above simply couldn't work for me. The fact that it stored every file regardless of whether it changed or not would result it my Dropbox folder (with a meager 2 GB of space) would be filled with just a couple dozen versions of my programs. 

As such, I set out to build a system which would track changed files, and instead of storing each individual version as a compressed archive, the entire project repository and every version would be compressed as a whole into one file. This would, for the most part, result in a smaller repository than if each version were compressed separately.

Obviously, this is easier said than done. By far, the hardest problem to solve is that of tracking changes. My solution to this was as simple as I could possibly make it:

Each item in the project directory was stored in an ENTRY structure. Each entry has its own path and last write time stored, and it was tagged as either a FILE entry, DIRECTORY entry, or a REFERENCE entry. The first simply stored the size and data of a file, the second stored a linked list of child entries, and the third stored a version name.

Let's say you stored your first version named "first" (by the way, I made version names mandatory for now, and you can't have multiple versions with the same name). The system would create entries for each item it would find, and store it all in a compressed file called .ctrl (I used miniz.c for compression). You then proceeded to work on the directory, change some files and then you stored another version named "added_player_controls". The only file changed between this version and "first" was "player.c" What the system would do is that for all files which did not change, it would simply create a reference with the same path and last write time, and have it refer to the version from which it originated, in this case "first":
Forgive my crudely drawn diagram.

Anyways, this worked well, especially with compression, meaning I could store A LOT of versions with many little changes without using too much space.

But then came the problem of removing referenced versions.

I dealt with this problem by doing the simplest thing I could yet again. When a version is removed, all later versions are scanned for references to files stored in that version. The next version which references such a file now owns the file (i.e the entry is changed to a FILE entry) and all versions after that have their references now pointing to the owning version:

Again, my diagrams are horrendous, but I hope you get the idea. Now the version "added_player_controls" owns all the files it referenced previously, and "updated_loop" now references "added_player_controls" as opposed to "first".

User Interface

For now, the system is simply a command line application (only for Windows). You have access to the following commands:


Now, I know this doesn't provide any of the fancy branching, merging and forking capabilities of git; nor does it do any delta encoding (whereby only changed bytes are stored as opposed to entire files), but it's simple, fast, and it works.

And that's enough for me.

PS. If there was ever a feature I desperately needed, I could add it in quite simply too since the whole program spans about 1200 lines of C code.

Sunday 13 March 2016

Implementing a programming language in C - Introduction

Why?


There is a point in every programmers life where they are confronted with a problem which they think is unnecessarily tedious to solve in language X. You think "Hmm, if only language X had feature Y, it would make my life so much easier." Thus begins your journey into the endless abyss that is programming language design and implementation.

Your first instinct is to research. You scour whatever resources you have at your disposal to learn as much about the subject as possible. "What's this?" you ask yourself. "A book that discusses this very subject? Marvelous! It's even referred to as the 'dragon book'! It must be a fun read!"

Things are looking up. You might just be on your way to being the next Dennis Ritchie. You torrent a PDF version of the book, and open the document. You work your way through the dense introduction, a couple chapters maybe, but find yourself none the wiser for it. Dismayed, you restore your minimized text editor, and continue programming in language X.

But maybe, you were very persistent when looking for resources. Perhaps you checked the third or even the fourth page of google and encountered this blog.

You came to the right place.

Despite what most programmers think, implementing a programming language isn't a gargantuan task fit only for the Computer Science PhDs of the world. In fact, you can make one right now if you follow this tutorial.

 

The language


The language we'll be implementing has the following syntax. Do note that this is by far the least important aspect of the language, but one which most people who try this sort of thing dwell on the most (I mean, syntax has some importance insofar as how difficult it is to parse).

Here is what the language we'll be making will look like:

With all the imagination I could muster, I decided to name the language 'Tut'.

As you can see, it is a pretty complex language as far as tutorial languages go. Anyways, this will be the language we'll be implementing over the course of the next few tutorials.

Thanks for reading.