

Write your own compiler - Station #1: the tokenizer
source link: https://blog.klipse.tech/javascript/2017/02/08/tiny-compiler-tokenizer.html

Feb 8, 2017 • Yehonathan Sharvit
The plan
Our journey is made of 4 stations - each of them depending on the previous ones:
- The tokenizer (aka “Lexical Analysis”): converting an input code - in LISP syntax - into an array of tokens.
- The parser (aka “Syntactic Analysis”): transforming an array of tokens into an Abstract Syntax Tree (AST).
- The emitter (aka “Code Generation”): string-ifying an AST into C-like code.
- The compiler (aka “You made it”): combining all the pieces together.
(The interactive code snippets are powered by a tool of mine named KLIPSE.)
The tokenizer
The tokenizer receives a string of code and breaks it down into an array of tokens.

There are three kinds of tokens:
- single-character tokens: ( and )
- multiple-character tokens: 123 or abcd
- strings: something that starts with a " and ends with a " (no escaping!)
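
For example, by the end of this station, the snippet (add 42 "hi") will break down into one token of each kind. A preview (this calls the tokenizer we build at the end of the article):

tokenizer('(add 42 "hi")')
// => [ { type: 'paren',  value: '(' },
//      { type: 'name',   value: 'add' },
//      { type: 'number', value: '42' },
//      { type: 'string', value: 'hi' },
//      { type: 'paren',  value: ')' } ]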
First, we are going to write a couple of tokenizers, each handling a single kind of token. Each tokenizer receives the code as a string and the current position, and returns:
- the length of the token
- the token as an object with two keys: type and value
Single-character token
Let’s write a generic function that tokenizes a single character:
const tokenizeCharacter = (type, value, input, current) =>
  (value === input[current]) ? [1, { type, value }] : [0, null];
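
We can call tokenizeCharacter directly to see both outcomes (a quick check, not part of the original article):

tokenizeCharacter('paren', '(', '(foo)', 0) // => [1, { type: 'paren', value: '(' }]
tokenizeCharacter('paren', '(', 'foo', 0)   // => [0, null]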
Here is the tokenizer for (:
const tokenizeParenOpen = (input, current) => tokenizeCharacter('paren', '(', input, current);
tokenizeParenOpen('(', 0)
Array [
  1,
  Object {
    "type": "paren",
    "value": "(",
  },
]
And here is the tokenizer for ):
const tokenizeParenClose = (input, current) => tokenizeCharacter('paren', ')', input, current);
tokenizeParenClose(')', 0)
Array [
  1,
  Object {
    "type": "paren",
    "value": ")",
  },
]
Multiple-character tokens
We will describe our multiple-character tokens by means of regular expressions. Here is a generic regexp tokenizer:
const tokenizePattern = (type, pattern, input, current) => {
  let char = input[current];
  let consumedChars = 0;
  // the char && guard avoids testing past the end of the input,
  // where char is undefined and pattern.test would match the string "undefined"
  if (char && pattern.test(char)) {
    let value = '';
    while (char && pattern.test(char)) {
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars, { type, value }];
  }
  return [0, null];
};
And here is the number tokenizer:
const tokenizeNumber = (input, current) => tokenizePattern("number", /[0-9]/, input, current);
tokenizeNumber("123aad", 0)
Array [
  3,
  Object {
    "type": "number",
    "value": "123",
  },
]
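
And when the character at the current position is not a digit, nothing is consumed (a quick check, not part of the original article):

tokenizeNumber("aad123", 0)
// => [0, null]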
And the name tokenizer (in our language, names are sequences of letters):
const tokenizeName = (input, current) => tokenizePattern("name", /[a-z]/i, input, current);
tokenizeName('hello world', 0)
Array [
  5,
  Object {
    "type": "name",
    "value": "hello",
  },
]
String tokenizer
A string is something that starts with a " and ends with a " (no escaping in our language!):
const tokenizeString = (input, current) => {
  if (input[current] === '"') {
    let value = '';
    let consumedChars = 1; // consume the opening quote
    let char = input[current + consumedChars];
    while (char !== '"') {
      if (char === undefined) {
        throw new TypeError('unterminated string');
      }
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars + 1, { type: 'string', value }]; // + 1 for the closing quote
  }
  return [0, null];
};
tokenizeString('"Hello World"', 0)
Array [
  13,
  Object {
    "type": "string",
    "value": "Hello World",
  },
]
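
Since there is no escaping, the only failure mode is a string that never closes; the guard inside the loop catches it (a quick check, not part of the original article):

tokenizeString('"Hello World', 0)
// => throws TypeError: unterminated string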
Last thing: we want to skip whitespace:
const skipWhiteSpace = (input, current) => (/\s/.test(input[current])) ? [1, null] : [0, null];
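
Note that skipWhiteSpace consumes a character without producing a token - the [1, null] shape tells the main loop below to advance the position without pushing anything (a quick check, not part of the original article):

skipWhiteSpace(' (add', 0) // => [1, null]
skipWhiteSpace('(add', 0)  // => [0, null]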
The tokenizer
Let’s put all our tokenizers into an array:
const tokenizers = [skipWhiteSpace, tokenizeParenOpen, tokenizeParenClose,
                    tokenizeString, tokenizeNumber, tokenizeName];
The code tokenizer goes over its input and tries all the tokenizers; when it finds a match it will:
- push the token object (if there is one)
- update the current position
Here is the code:
const tokenizer = (input) => {
  let current = 0;
  let tokens = [];
  while (current < input.length) {
    let tokenized = false;
    tokenizers.forEach(tokenizer_fn => {
      if (tokenized) { return; } // forEach cannot break early, so skip once a match is found
      let [consumedChars, token] = tokenizer_fn(input, current);
      if (consumedChars !== 0) {
        tokenized = true;
        current += consumedChars;
      }
      if (token) {
        tokens.push(token);
      }
    });
    if (!tokenized) {
      throw new TypeError("I don't know what this character is: " + input[current]);
    }
  }
  return tokens;
};
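
A note on the design: forEach cannot break out of the iteration early, hence the tokenized flag checked at the top of the callback. Here is an equivalent sketch using for...of and break (my variation, with the hypothetical name tokenizer2 - not the article's code):

const tokenizer2 = (input) => {
  let current = 0;
  const tokens = [];
  while (current < input.length) {
    let tokenized = false;
    for (const tokenize of tokenizers) {
      const [consumedChars, token] = tokenize(input, current);
      if (consumedChars === 0) continue; // this tokenizer didn't match
      if (token) tokens.push(token);     // skipWhiteSpace matches but emits no token
      current += consumedChars;
      tokenized = true;
      break;                             // stop at the first matching tokenizer
    }
    if (!tokenized) {
      throw new TypeError("I don't know what this character is: " + input[current]);
    }
  }
  return tokens;
};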
Let’s see our tokenizer in action:
tokenizer('(add 2 3)')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "number",
    "value": "3",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
]
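
If no tokenizer matches - say the input contains a character our language doesn't know - the tokenizer throws:

tokenizer('(add 2 %)')
// => throws TypeError: I don't know what this character is: %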
Our tokenizer doesn't do any semantic validation. For example, it happily reads unbalanced parentheses:
tokenizer('(add 2')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
]
Let’s make sure we can handle nested expressions properly:
tokenizer('(add 2 (subtract "314" 2))')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "subtract",
  },
  Object {
    "type": "string",
    "value": "314",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
]
Hurray!!!
Please take a short rest before moving towards Station #2: The parser.