Tuesday, 28 June 2016

Implementing a Programming Language in C - Part 1 - Lexer

In the previous post in this series on implementing a programming language, I provided a sample of the language's syntax:

var x = 10
var y = 20
func add(x, y)
{
    return x + y
}
// this is a comment
write(add(x, y))
In this blog post, we'll go over how to split this 'Tut' script into its constituent elements, or "tokens". This process is generally referred to as "lexical analysis", and the component that performs it is called a "lexer".
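To make "tokens" concrete: given the first line of the example above, a lexer would produce something like the following (the exact representation is up to us; the names here anticipate the enum defined in the next section):

```
var x = 10   ->   TUT_TOK_VAR  TUT_TOK_IDENT("x")  TUT_TOK_ASSIGN  TUT_TOK_NUMBER(10)
```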
Let's begin!

Defining The Tokens

Before we can create our lexer, we must first define each and every valid token in the language. This can be done very easily with a C enum:

#ifndef TUT_TOKEN_H
#define TUT_TOKEN_H

typedef enum
{
    TUT_TOK_NUMBER,
    TUT_TOK_STRING,
    TUT_TOK_IDENT,
    TUT_TOK_VAR,
    TUT_TOK_FUNC,
    TUT_TOK_RETURN,
    TUT_TOK_IF,
    TUT_TOK_ELSE,
    TUT_TOK_WHILE,
    TUT_TOK_OPENPAREN,
    TUT_TOK_CLOSEPAREN,
    TUT_TOK_OPENSQUARE,
    TUT_TOK_CLOSESQUARE,
    TUT_TOK_OPENCURLY,
    TUT_TOK_CLOSECURLY,
    TUT_TOK_COMMA,
    TUT_TOK_PLUS,
    TUT_TOK_MINUS,
    TUT_TOK_MUL,
    TUT_TOK_DIV,
    TUT_TOK_ASSIGN,
    TUT_TOK_EOF,
    TUT_TOK_ERROR,
    TUT_TOK_COUNT
} TutToken;

const char* Tut_TokenRepr(TutToken token);

#endif

As you can see, every valid entity in the language is represented by one of these values. Note that TUT_TOK_COUNT is not a token itself: because it is the last enumerator, its value equals the number of token types. The Tut_TokenRepr function declared at the bottom maps a token to a human-readable name, which comes in handy when reporting errors.

The Lexer

The job of the lexer is to take a string containing Tut code and transform it into TutToken values, one at a time, on demand. This is what the header file of the lexer looks like:

// TODO: Finish blog post
Here is the repository thus far.
