2021-08-15: Let's write a compiler, part 2: A lexer

source link: https://briancallahan.net/blog/20210815.html
Brian Robert Callahan

academic, developer, with an eye towards a brighter techno-social life


All source code for this blog post can be found here.

Time to get started coding. We will write the lexer for the PL/0 language today. At the end of this entry, our compiler will be able to accept any free-form PL/0 source code and output all its constituent tokens, along with some metadata. The lexer should accept all valid tokens and should reject invalid tokens.

A simple main()

All good things start with a single function. I'm going to start with a simple main() function. In this function, I want to make sure we have the correct number of arguments, open the input file, read the entire input file into memory, and finally call the parser to start the compiling process. If you remember from the previous entry, we decided on an organization and workflow in which the parser controls the action. The parser will call the lexer whenever it needs a token and call the code generator whenever it has enough information to make a decision as to what C code to write out.

I am also going to put the entire PL/0 grammar as a comment in my source code for safekeeping. Here's a starting point:

#include <stdio.h>
#include <stdlib.h>

/*
 * pl0c -- PL/0 compiler.
 *
 * program	= block "." .
 * block	= [ "const" ident "=" number { "," ident "=" number } ";" ]
 *		  [ "var" ident { "," ident } ";" ]
 *		  { "procedure" ident ";" block ";" } statement .
 * statement	= [ ident ":=" expression
 *		  | "call" ident
 *		  | "begin" statement { ";" statement } "end"
 *		  | "if" condition "then" statement
 *		  | "while" condition "do" statement ] .
 * condition	= "odd" expression
 *		| expression ( "=" | "#" | "<" | ">" ) expression .
 * expression	= [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term		= factor { ( "*" | "/" ) factor } .
 * factor	= ident
 *		| number
 *		| "(" expression ")" .
 */

char *raw;

/*
 * Main.
 */

int
main(int argc, char *argv[])
{
	char *startp;

	if (argc != 2) {
		(void) fputs("usage: pl0c file.pl0\n", stderr);
		exit(1);
	}

	readin(argv[1]);
	startp = raw;

	parse();

	free(startp);

	return 0;
}

Alright, we're going to need a readin() function. We could inline it into main(), but I prefer to keep it separate purely as a matter of style.

static size_t line = 1;

/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
	va_list ap;

	(void) fprintf(stderr, "pl0c: error: %zu: ", line);

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);

	(void) fputc('\n', stderr);

	exit(1);
}

static void
readin(char *file)
{
	int fd;
	struct stat st;

	if (strrchr(file, '.') == NULL)
		error("file must end in '.pl0'");

	if (!!strcmp(strrchr(file, '.'), ".pl0"))
		error("file must end in '.pl0'");

	if ((fd = open(file, O_RDONLY)) == -1)
		error("couldn't open %s", file);

	if (fstat(fd, &st) == -1)
		error("couldn't get file size");

	if ((raw = malloc(st.st_size + 1)) == NULL)
		error("malloc failed");

	if (read(fd, raw, st.st_size) != st.st_size)
		error("couldn't read %s", file);
	raw[st.st_size] = '\0';

	(void) close(fd);
}

We also implemented our error handling routine. We're going to go easy on ourselves: as soon as we encounter an error, no matter what that error might be, we will report it to the user and then give up on the compilation.

The readin() function does a few things. First, to be a little fancy, we enforce a .pl0 file extension on all input files. We then read the whole input file into a memory buffer named raw, for "raw code," and close the input file since we no longer need it. Our compiler will output its final code to stdout. For today, we will have our compiler write out all the tokens it finds to stdout.

The lexer will also be in charge of incrementing the line counter for the error handler. There really isn't much of a choice in the matter. The lexer is the only part of the compiler that reads the free-form source code, so it is the only one that knows when you (the programmer) put a newline in that free-form source code. Line counting starts at 1.

Writing a lexer

In order to write a lexer, we need to know all the valid tokens that are possible in PL/0. We can reference the grammar to learn this. Tokens for us include variable names, reserved words, numbers, and all the different symbols that are possible.

According to the grammar, the reserved words are const, var, procedure, call, begin, end, if, then, while, do, and odd. The symbols are ., =, ,, ;, :=, #, <, >, +, -, *, /, (, and ). Finally, there are the identifiers and numbers, but we cannot know those in advance.

We'll #define all the possible tokens to integers so we can use simple comparisons on them. We'll call this the "token type." We will give a different token type to each reserved word and symbol. Because we cannot know in advance the identifiers and numbers, we will give every identifier the same token type and every number the same token type:

#define TOK_IDENT	'I'
#define TOK_NUMBER	'N'
#define TOK_CONST	'C'
#define TOK_VAR		'V'
#define TOK_PROCEDURE	'P'
#define TOK_CALL	'c'
#define TOK_BEGIN	'B'
#define TOK_END		'E'
#define TOK_IF		'i'
#define TOK_THEN	'T'
#define TOK_WHILE	'W'
#define TOK_DO		'D'
#define TOK_ODD		'O'
#define TOK_DOT		'.'
#define TOK_EQUAL	'='
#define TOK_COMMA	','
#define TOK_SEMICOLON	';'
#define TOK_ASSIGN	':'
#define TOK_HASH	'#'
#define TOK_LESSTHAN	'<'
#define TOK_GREATERTHAN	'>'
#define TOK_PLUS	'+'
#define TOK_MINUS	'-'
#define TOK_MULTIPLY	'*'
#define TOK_DIVIDE	'/'
#define TOK_LPAREN	'('
#define TOK_RPAREN	')'

I like having the TOK_ prefix on all the token types since it makes it easy to see at a glance that they are token types. For the symbols, I used the character of the symbol itself as its token type. That will simplify the lexer a little. The only exception is the assignment operator, :=, since that is a sequence of two characters. But the colon was otherwise unused, so I used it as the token type for assignment. For the identifiers, numbers, and reserved words, I used the first unique letter for their token type define. There were a couple of duplicates there, but that's why we have capital and lowercase letters.

Now we need a function that iterates over the input source code and makes a decision about what token type this is based on what ASCII character it is currently looking at. It should also skip over all whitespace and it should also, upon identifying a comment, skip over the entire comment:

static char *raw, *token;
static int type;
static size_t line = 1;

/*
 * Lexer.
 */

static void
comment(void)
{
	int ch;

	while ((ch = *raw++) != '}') {
		if (ch == '\0')
			error("unterminated comment");

		if (ch == '\n')
			++line;
	}
}

static int
ident(void)
{
	char *p;
	size_t i, len;

	p = raw;
	while (isalnum(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++)
		token[i] = *p++;
	token[i] = '\0';

	if (!strcmp(token, "const"))
		return TOK_CONST;
	else if (!strcmp(token, "var"))
		return TOK_VAR;
	else if (!strcmp(token, "procedure"))
		return TOK_PROCEDURE;
	else if (!strcmp(token, "call"))
		return TOK_CALL;
	else if (!strcmp(token, "begin"))
		return TOK_BEGIN;
	else if (!strcmp(token, "end"))
		return TOK_END;
	else if (!strcmp(token, "if"))
		return TOK_IF;
	else if (!strcmp(token, "then"))
		return TOK_THEN;
	else if (!strcmp(token, "while"))
		return TOK_WHILE;
	else if (!strcmp(token, "do"))
		return TOK_DO;
	else if (!strcmp(token, "odd"))
		return TOK_ODD;

	return TOK_IDENT;
}

static int
number(void)
{
	const char *errstr;
	char *p;
	size_t i, j = 0, len;

	p = raw;
	while (isdigit(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++) {
		if (isdigit(*p))
			token[j++] = *p;
		++p;
	}
	token[j] = '\0';

	(void) strtonum(token, 0, LONG_MAX, &errstr);
	if (errstr != NULL)
		error("invalid number: %s", token);

	return TOK_NUMBER;
}

static int
lex(void)
{

again:
	/* Skip whitespace.  */
	while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
		if (*raw++ == '\n')
			++line;
	}

	if (isalpha(*raw) || *raw == '_')
		return ident();

	if (isdigit(*raw))
		return number();

	switch (*raw) {
	case '{':
		comment();
		goto again;
	case '.':
	case '=':
	case ',':
	case ';':
	case '#':
	case '<':
	case '>':
	case '+':
	case '-':
	case '*':
	case '/':
	case '(':
	case ')':
		return (*raw);
	case ':':
		if (*++raw != '=')
			error("unknown token: ':%c'", *raw);

		return TOK_ASSIGN;
	case '\0':
		return 0;
	default:
		error("unknown token: '%c'", *raw);
	}

	return 0;
}

/*
 * Parser.
 */

static void
parse(void)
{

	while ((type = lex()) != 0) {
		++raw;
		(void) fprintf(stdout, "%zu|%d\t", line, type);
		switch (type) {
		case TOK_IDENT:
		case TOK_NUMBER:
		case TOK_CONST:
		case TOK_VAR:
		case TOK_PROCEDURE:
		case TOK_CALL:
		case TOK_BEGIN:
		case TOK_END:
		case TOK_IF:
		case TOK_THEN:
		case TOK_WHILE:
		case TOK_DO:
		case TOK_ODD:
			(void) fprintf(stdout, "%s", token);
			break;
		case TOK_DOT:
		case TOK_EQUAL:
		case TOK_COMMA:
		case TOK_SEMICOLON:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
		case TOK_PLUS:
		case TOK_MINUS:
		case TOK_MULTIPLY:
		case TOK_DIVIDE:
		case TOK_LPAREN:
		case TOK_RPAREN:
			(void) fputc(type, stdout);
			break;
		case TOK_ASSIGN:
			(void) fputs(":=", stdout);
		}
		(void) fputc('\n', stdout);
	}
}

Let's discuss what we did. The lexer itself, in the lex() function, is just a big switch looking at the current character in our raw input buffer.

Before we try to read in a token, we should move past all whitespace. Tokens may be separated by any amount of whitespace, including, in some instances, no whitespace at all, and so moving past any whitespace is the first thing the lexer does.

Next, we handle identifiers and reserved words, followed by numbers. Identifiers begin with a letter or an underscore (identifiers are case-sensitive), and all reserved words begin with a lowercase letter. There is no way to disambiguate between an identifier and a reserved word until we scan the entire token, so in all cases where we see a letter or an underscore, we enter the ident() function. This function reads in all letters, digits, and underscores until it encounters a character that isn't one of those. We then rewind to the last character of the identifier for simplicity's sake (the last step of the lexer is always to move to the next character in the input buffer, and so by rewinding here we avoid accidentally skipping a token).

Now we need to determine if this token string we have is a reserved word. We just iterate down the list of reserved words and if there is a match, we return the correct token type for that reserved word. If we get to the end of the reserved word list and we don't have a match, then we must have an identifier and report the token type as such.
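As a sketch of an alternative, the chain of strcmp() calls in ident() could be collapsed into a table-driven lookup. The keyword_type() helper below is hypothetical, not part of the compiler as written; the token-type letters match the TOK_ defines above, with 'I' standing in for TOK_IDENT:

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical table-driven replacement for the strcmp() chain in
 * ident().  The token-type letters match the TOK_ defines used in
 * the compiler; 'I' is TOK_IDENT, the fallback for non-keywords.
 */
static const struct keyword {
	const char *word;
	int type;
} keywords[] = {
	{ "const", 'C' },	{ "var", 'V' },
	{ "procedure", 'P' },	{ "call", 'c' },
	{ "begin", 'B' },	{ "end", 'E' },
	{ "if", 'i' },		{ "then", 'T' },
	{ "while", 'W' },	{ "do", 'D' },
	{ "odd", 'O' }
};

static int
keyword_type(const char *tok)
{
	size_t i;

	/* Linear scan; the keyword list is small enough not to care. */
	for (i = 0; i < sizeof(keywords) / sizeof(keywords[0]); i++) {
		if (!strcmp(tok, keywords[i].word))
			return keywords[i].type;
	}

	return 'I';
}
```

Adding a keyword then becomes a one-line table entry rather than another else-if branch.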

If instead of a letter or underscore the lexer sees a digit, then we must have a number. According to the grammar, all numbers begin with digits and are a sequence of one or more digits. I am going to add a little extension here. For really large numbers, or numbers that might follow a standard logical grouping (like dollars and cents), it might be nice to give the PL/0 programmer a way of having easy-for-humans-to-read number separation. As found in Ada, Java, D, Python, and others, underscores can be used to break up those numbers for the human reader. I think that's a nice addition, so we'll do that too.

To do that, inside the number() function, we scan all digits and underscores until we find a character that isn't one of those, and rewind the lexer to the last character of the number. We then create the token string by recording only the digits into the token string. Yes, we will allocate more space than is strictly necessary for the token string if we have underscores in the number, but that's fine.

At this point, we validate the number. We do this by using a helpful function from OpenBSD, strtonum(3). This function takes a string, a minimum value, a maximum value, and the address of a string pointer used to report any errors. If that error string is NULL after the function runs, all is OK. Otherwise, something went wrong. It doesn't really matter to us exactly what went wrong, whether the number is out of range or somehow otherwise invalid; in either case we should report the error. Usually you would assign the output of strtonum(3) (it returns a long long) to a variable. But we don't actually care about the value. We just care that the number is valid. So by casting the result to void, we turn the function into a number validator.

For completeness, I put a copy of strtonum(3) in the GitHub repository with the compiler. If you need it, remove -DHAVE_STRTONUM from the CFLAGS line in the Makefile.
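If you would rather not carry a copy of strtonum(3) around, a rough stand-in can be built on the standard strtoll(3). The my_strtonum() function below is a sketch under that assumption, not the canonical OpenBSD implementation; it mirrors only the behavior we rely on (a NULL error string on success):

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/*
 * Hypothetical stand-in for OpenBSD's strtonum(3), built on
 * strtoll(3).  A sketch only: it distinguishes valid, invalid,
 * too-small, and too-large inputs, which is all the lexer needs.
 */
static long long
my_strtonum(const char *numstr, long long minval, long long maxval,
    const char **errstrp)
{
	long long n;
	char *ep;

	*errstrp = NULL;
	errno = 0;
	n = strtoll(numstr, &ep, 10);

	if (numstr[0] == '\0' || *ep != '\0')
		*errstrp = "invalid";
	else if ((errno == ERANGE && n == LLONG_MIN) || n < minval)
		*errstrp = "too small";
	else if ((errno == ERANGE && n == LLONG_MAX) || n > maxval)
		*errstrp = "too large";

	return *errstrp == NULL ? n : 0;
}
```

The real strtonum(3) also errors out when minval > maxval; this sketch omits that check for brevity.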

There is also a little bit of awkwardness caused by the PL/0 grammar: the most negative valid number cannot be checked with this method. There are a few strategies we could use. First, we can choose not to care at all, and if we need the most negative valid number write something like x := -32767 - 1 (assuming an 8-/16-bit system). Second, we could massage the grammar to help ameliorate the problem. Third, we could borrow a concept from Pascal and add two new reserved words: minint and maxint. I am going to choose option 1, but feel free to choose any option you want.

If the lexer didn't see a letter, number, or underscore, then we must have a symbol.

If the lexer sees any symbol that isn't the assignment operator, it returns that symbol as-is, since in our long list of defines, we make the value of the symbol its ASCII character code. So they will be equivalent. Because we do this, we do not need to bother with reading these symbols into a token string like we do for identifiers, reserved words, and numbers. The token type alone already provides us with everything we need to know about the symbol.

The exception is the colon. If the lexer sees a colon, we have to make sure the very next character in the input buffer is an equals sign, since together they make the assignment operator token. Nothing else makes sense, and since there is no bare colon symbol in PL/0, we can confidently declare an error if we don't find an equals sign.

If we see a { then we know we have entered a comment, and so we enter the comment() function. That function reads through the input buffer until it reaches a }, which signifies the end of the comment. Even so, we still need to count lines, so the line counter is incremented whenever we see a newline inside a comment. We also want to ensure that every comment opened is closed, so if we see a NUL byte we somehow reached the end of the input without closing the comment, and we report an error.

If we find a comment, I am perfectly OK with using a goto to return to the top of the lex() function and try again to find a token. If you believe you are allergic to gotos, you could instead have lex() return -1 after skipping a comment and put a loop in the parser that iterates until it gets back a return code from the lexer that isn't -1.
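To illustrate the goto-free alternative, here is a hypothetical, self-contained sketch: lex_stub() stands in for a lexer that returns -1 whenever it skipped a comment, and next_token() keeps asking until it gets a real token.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of the goto-free alternative.  lex_stub()
 * stands in for a lex() that returns -1 after skipping a comment;
 * next_token() keeps asking until it gets a real token.
 */
static const int stream[] = { -1, -1, 'I', 0 };	/* two comments, ident, EOF */
static size_t pos;

static int
lex_stub(void)
{
	return stream[pos++];
}

static int
next_token(void)
{
	int t;

	while ((t = lex_stub()) == -1)
		;	/* comment skipped; try again */

	return t;
}
```

The parser would then call next_token() wherever it currently calls lex().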

Eventually, the lexer will see a NUL-byte, which signifies that we are at the end of the raw input buffer. At this point the lexer informs the parser that we are finished.

If the lexer ever sees anything else, then we have an invalid token and that's an error.

And that's it. That's all the work our lexer needs to do. To prove to ourselves that our lexer works, we'll create a parse() function that prints one token per line, alongside the line number the token was found on and its token type in numerical form.
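For example, given a small test file (the file name and contents here are hypothetical), the dump pairs each line number and numeric token type with the token string; the numeric types are just the ASCII codes of the TOK_ letters:

```
$ cat ex.pl0
var x;
begin
  x := 1
end.
$ ./pl0c ex.pl0
1|86	var
1|73	x
1|59	;
2|66	begin
3|73	x
3|58	:=
3|78	1
4|69	end
4|46	.
```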

All together now

Here is all the source code we wrote today all together:

#include <sys/stat.h>

#include <ctype.h>
#include <fcntl.h>
#include <limits.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define TOK_IDENT	'I'
#define TOK_NUMBER	'N'
#define TOK_CONST	'C'
#define TOK_VAR		'V'
#define TOK_PROCEDURE	'P'
#define TOK_CALL	'c'
#define TOK_BEGIN	'B'
#define TOK_END		'E'
#define TOK_IF		'i'
#define TOK_THEN	'T'
#define TOK_WHILE	'W'
#define TOK_DO		'D'
#define TOK_ODD		'O'
#define TOK_DOT		'.'
#define TOK_EQUAL	'='
#define TOK_COMMA	','
#define TOK_SEMICOLON	';'
#define TOK_ASSIGN	':'
#define TOK_HASH	'#'
#define TOK_LESSTHAN	'<'
#define TOK_GREATERTHAN	'>'
#define TOK_PLUS	'+'
#define TOK_MINUS	'-'
#define TOK_MULTIPLY	'*'
#define TOK_DIVIDE	'/'
#define TOK_LPAREN	'('
#define TOK_RPAREN	')'

/*
 * pl0c -- PL/0 compiler.
 *
 * program	= block "." .
 * block	= [ "const" ident "=" number { "," ident "=" number } ";" ]
 *		  [ "var" ident { "," ident } ";" ]
 *		  { "procedure" ident ";" block ";" } statement .
 * statement	= [ ident ":=" expression
 *		  | "call" ident
 *		  | "begin" statement { ";" statement } "end"
 *		  | "if" condition "then" statement
 *		  | "while" condition "do" statement ] .
 * condition	= "odd" expression
 *		| expression ( "=" | "#" | "<" | ">" ) expression .
 * expression	= [ "+" | "-" ] term { ( "+" | "-" ) term } .
 * term		= factor { ( "*" | "/" ) factor } .
 * factor	= ident
 *		| number
 *		| "(" expression ")" .
 */

static char *raw, *token;
static int type;
static size_t line = 1;

/*
 * Misc. functions.
 */

static void
error(const char *fmt, ...)
{
	va_list ap;

	(void) fprintf(stderr, "pl0c: error: %zu: ", line);

	va_start(ap, fmt);
	(void) vfprintf(stderr, fmt, ap);
	va_end(ap);

	(void) fputc('\n', stderr);

	exit(1);
}

static void
readin(char *file)
{
	int fd;
	struct stat st;

	if (strrchr(file, '.') == NULL)
		error("file must end in '.pl0'");

	if (!!strcmp(strrchr(file, '.'), ".pl0"))
		error("file must end in '.pl0'");

	if ((fd = open(file, O_RDONLY)) == -1)
		error("couldn't open %s", file);

	if (fstat(fd, &st) == -1)
		error("couldn't get file size");

	if ((raw = malloc(st.st_size + 1)) == NULL)
		error("malloc failed");

	if (read(fd, raw, st.st_size) != st.st_size)
		error("couldn't read %s", file);
	raw[st.st_size] = '\0';

	(void) close(fd);
}

/*
 * Lexer.
 */

static void
comment(void)
{
	int ch;

	while ((ch = *raw++) != '}') {
		if (ch == '\0')
			error("unterminated comment");

		if (ch == '\n')
			++line;
	}
}

static int
ident(void)
{
	char *p;
	size_t i, len;

	p = raw;
	while (isalnum(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++)
		token[i] = *p++;
	token[i] = '\0';

	if (!strcmp(token, "const"))
		return TOK_CONST;
	else if (!strcmp(token, "var"))
		return TOK_VAR;
	else if (!strcmp(token, "procedure"))
		return TOK_PROCEDURE;
	else if (!strcmp(token, "call"))
		return TOK_CALL;
	else if (!strcmp(token, "begin"))
		return TOK_BEGIN;
	else if (!strcmp(token, "end"))
		return TOK_END;
	else if (!strcmp(token, "if"))
		return TOK_IF;
	else if (!strcmp(token, "then"))
		return TOK_THEN;
	else if (!strcmp(token, "while"))
		return TOK_WHILE;
	else if (!strcmp(token, "do"))
		return TOK_DO;
	else if (!strcmp(token, "odd"))
		return TOK_ODD;

	return TOK_IDENT;
}

static int
number(void)
{
	const char *errstr;
	char *p;
	size_t i, j = 0, len;

	p = raw;
	while (isdigit(*raw) || *raw == '_')
		++raw;

	len = raw - p;

	--raw;

	free(token);

	if ((token = malloc(len + 1)) == NULL)
		error("malloc failed");

	for (i = 0; i < len; i++) {
		if (isdigit(*p))
			token[j++] = *p;
		++p;
	}
	token[j] = '\0';

	(void) strtonum(token, 0, LONG_MAX, &errstr);
	if (errstr != NULL)
		error("invalid number: %s", token);

	return TOK_NUMBER;
}

static int
lex(void)
{

again:
	/* Skip whitespace.  */
	while (*raw == ' ' || *raw == '\t' || *raw == '\n') {
		if (*raw++ == '\n')
			++line;
	}

	if (isalpha(*raw) || *raw == '_')
		return ident();

	if (isdigit(*raw))
		return number();

	switch (*raw) {
	case '{':
		comment();
		goto again;
	case '.':
	case '=':
	case ',':
	case ';':
	case '#':
	case '<':
	case '>':
	case '+':
	case '-':
	case '*':
	case '/':
	case '(':
	case ')':
		return (*raw);
	case ':':
		if (*++raw != '=')
			error("unknown token: ':%c'", *raw);

		return TOK_ASSIGN;
	case '\0':
		return 0;
	default:
		error("unknown token: '%c'", *raw);
	}

	return 0;
}

/*
 * Parser.
 */

static void
parse(void)
{

	while ((type = lex()) != 0) {
		++raw;
		(void) fprintf(stdout, "%zu|%d\t", line, type);
		switch (type) {
		case TOK_IDENT:
		case TOK_NUMBER:
		case TOK_CONST:
		case TOK_VAR:
		case TOK_PROCEDURE:
		case TOK_CALL:
		case TOK_BEGIN:
		case TOK_END:
		case TOK_IF:
		case TOK_THEN:
		case TOK_WHILE:
		case TOK_DO:
		case TOK_ODD:
			(void) fprintf(stdout, "%s", token);
			break;
		case TOK_DOT:
		case TOK_EQUAL:
		case TOK_COMMA:
		case TOK_SEMICOLON:
		case TOK_HASH:
		case TOK_LESSTHAN:
		case TOK_GREATERTHAN:
		case TOK_PLUS:
		case TOK_MINUS:
		case TOK_MULTIPLY:
		case TOK_DIVIDE:
		case TOK_LPAREN:
		case TOK_RPAREN:
			(void) fputc(type, stdout);
			break;
		case TOK_ASSIGN:
			(void) fputs(":=", stdout);
		}
		(void) fputc('\n', stdout);
	}
}

/*
 * Main.
 */

int
main(int argc, char *argv[])
{
	char *startp;

	if (argc != 2) {
		(void) fputs("usage: pl0c file.pl0\n", stderr);
		exit(1);
	}

	readin(argv[1]);
	startp = raw;

	parse();

	free(startp);

	return 0;
}

A standalone lexer

As we mentioned in the previous post, we could create a compiler that uses three separate standalone binaries: one for the lexer, one for the parser, and one for the code generator. If that was our approach, our lexer would now be complete. We could pipe the output of the lexer into the input for the parser. Our output file format where the line number, token type, and token string are placed one on a line would be enough contextual information for the parser to do all the work it needs to do. But then we would have to teach the parser how to understand this intermediate representation. It would be a bit of duplicated effort, so perhaps if we were to take this route, it might be better to combine the lexer and parser together into one "front end" binary and the code generator would be a "back end" binary.

Side project: A code formatter

An interesting program we can now write is a code formatter. Because PL/0 can be written in a myriad of styles (or no style at all), you might get some PL/0 code from someone else that is difficult for you to understand. Since we can now tokenize this free-form source code, we can in theory reassemble those tokens back into free-form source code conforming to the style of our choosing. You won't be able to validate the source code with this code formatter just yet though. That will come after we have written our parser. But if you're looking for a challenge right now, see if you can extend our lexer into a code formatter. If you choose to embark on the challenge, consider what you might do with comments and extremely long lines as well.

On the next episode: A parser

We will soon complete our front end. In the next blog post, we will write a parser for the PL/0 language. Once the parser is complete, we will have a PL/0 validator; that is, a program that can read through PL/0 source code and tell you if it is legal source code.

As always, what we did today is not the only way this problem of lexing can be solved. But it works for me and it works for PL/0. If you want to experiment and come up with other solutions, please do and let us know what you learn! I hope you enjoyed this look into writing your own lexer.

