
Write your own compiler - Station #1: the tokenizer


Feb 8, 2017 • Yehonathan Sharvit

The plan

Our journey is made of 4 stations - each of them depending on the previous ones:

  1. The tokenizer (aka “Lexical Analysis”): converting an input code - in LISP syntax - into an array of tokens.
  2. The parser (aka “Syntactic Analysis”): transforming an array of tokens into an Abstract Syntax Tree (AST).
  3. The emitter (aka “Code Generation”): string-ifying an AST into C-like code.
  4. The compiler (aka “You made it”): combining all the pieces together.

(The interactive code snippets are powered by a tool of mine named KLIPSE.)
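
To see where all of this is heading, the four stations will eventually compose into a single function, roughly like this (a sketch; the actual parser and emitter functions are built in Stations #2 and #3):

compiler = input => emitter(parser(tokenizer(input)))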

The tokenizer

The tokenizer receives a string of code and breaks it down into an array of tokens.

tokens

There are three kinds of tokens:

  1. single-character tokens: ( and )
  2. multiple-character tokens: 123 or abcd
  3. strings: anything that starts with a " and ends with a " (no escaping!)

First, we are going to write a couple of tokenizers, one per kind of token. Each tokenizer receives the code as a string and the current position, and returns (see the sketch after this list):

  1. the number of characters it consumed (0 when there is no match)
  2. the token as an object with two keys: type and value (or null when there is no match)
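
A compact way to summarize this shared contract (my own notation, not a snippet from the article):

// every tokenizer: (input, current) => [consumedChars, token]
// - a match consumes characters and emits a token, e.g. [1, {type: 'paren', value: '('}]
// - a miss consumes nothing and emits no token:         [0, null]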

Single-character token

Let’s write a generic function that tokenizes a single character:

tokenizeCharacter = (type, value, input, current) =>
  (value === input[current]) ? [1, {type, value}] : [0, null]
[Function tokenizeCharacter]
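
For completeness, here is what a failed match looks like:

tokenizeCharacter('paren', '(', 'add', 0)
Array [
  0,
  null,
]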

Here is the tokenizer for (:

tokenizeParenOpen = (input, current) => tokenizeCharacter('paren', '(', input, current)
[Function tokenizeParenOpen]
tokenizeParenOpen('(', 0)
Array [
  1,
  Object {
    "type": "paren",
    "value": "(",
  },
]

And here is the tokenizer for ):

tokenizeParenClose = (input, current) => tokenizeCharacter('paren', ')', input, current)
[Function tokenizeParenClose]
tokenizeParenClose(')', 0)
Array [
  1,
  Object {
    "type": "paren",
    "value": ")",
  },
]

Multiple character tokens

We will describe our multi-character tokens by means of regular expressions. Here is a generic regexp tokenizer:

tokenizePattern = (type, pattern, input, current) => {
  let char = input[current];
  let consumedChars = 0;
  if (pattern.test(char)) {
    let value = '';
    // consume characters as long as they match the pattern
    while (char && pattern.test(char)) {
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars, { type, value }];
  }
  return [0, null];
}
[Function tokenizePattern]

And here is the number tokenizer:

tokenizeNumber = (input, current) => tokenizePattern("number", /[0-9]/, input, current)
[Function tokenizeNumber]
tokenizeNumber("123aad", 0)
Array [
  3,
  Object {
    "type": "number",
    "value": "123",
  },
]
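
Note that the pattern must match at the current position; digits appearing later in the string do not count:

tokenizeNumber("aad123", 0)
Array [
  0,
  null,
]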

And the name tokenizer (in our language, names are sequences of letters):

tokenizeName = (input, current) => tokenizePattern("name", /[a-z]/i, input, current)
[Function tokenizeName]
tokenizeName('hello world', 0)
Array [
  5,
  Object {
    "type": "name",
    "value": "hello",
  },
]

String tokenizer

A string is something that starts with a " and ends with a " (no escaping in our language!):

tokenizeString = (input, current) => {
  if (input[current] === '"') {
    let value = '';
    let consumedChars = 1; // consume the opening quote
    let char = input[current + consumedChars];
    while (char !== '"') {
      if (char === undefined) {
        throw new TypeError("unterminated string");
      }
      value += char;
      consumedChars++;
      char = input[current + consumedChars];
    }
    return [consumedChars + 1, { type: 'string', value }]; // +1 for the closing quote
  }
  return [0, null];
}
[Function tokenizeString]
tokenizeString('"Hello World"', 0)
Array [
  13,
  Object {
    "type": "string",
    "value": "Hello World",
  },
]
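
And here is the error path, when the closing " is missing:

tokenizeString('"Hello World', 0)
TypeError: unterminated string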

One last thing: we want to skip whitespace:

skipWhiteSpace = (input, current) => (/\s/.test(input[current])) ? [1, null] : [0, null]
[Function skipWhiteSpace]
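
skipWhiteSpace consumes a single whitespace character and produces no token; the main loop below will call it again and again until all the whitespace is skipped. For instance:

skipWhiteSpace(' hello', 0)
Array [
  1,
  null,
]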

The tokenizer

Let’s put all our tokenizers into an array:

tokenizers = [skipWhiteSpace, tokenizeParenOpen, tokenizeParenClose, tokenizeString, tokenizeNumber, tokenizeName];
Array [
  [Function skipWhiteSpace],
  [Function tokenizeParenOpen],
  [Function tokenizeParenClose],
  [Function tokenizeString],
  [Function tokenizeNumber],
  [Function tokenizeName],
]

The code tokenizer goes over its input, trying each of the tokenizers at the current position. When one of them consumes characters, it will:

  1. push the token object (if there is one - skipWhiteSpace consumes a character but produces no token)
  2. update the current position

If no tokenizer consumes anything, it throws an error.

Here is the code:

tokenizer = (input) => {
  let current = 0;
  let tokens = [];
  while (current < input.length) {
    let tokenized = false;
    tokenizers.forEach(tokenizer_fn => {
      if (tokenized) {return;}
      let [consumedChars, token] = tokenizer_fn(input, current);
      if (consumedChars !== 0) {
        tokenized = true;
        current += consumedChars;
      }
      if (token) {
        tokens.push(token);
      }
    });
    if (!tokenized) {
      // no tokenizer matched: report the offending character
      throw new TypeError("I don't know what this character is: " + input[current]);
    }
  }
  return tokens;
}
[Function tokenizer]

Let’s see our tokenizer in action:

tokenizer('(add 2 3)')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "number",
    "value": "3",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
]

Our tokenizer doesn’t do any syntactic validation. As an example, it happily reads unbalanced parentheses:

tokenizer('(add 2')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
]

Let’s make sure we can handle nested expressions properly:

tokenizer('(add 2 (subtract "314" 2))')
Array [
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "add",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "paren",
    "value": "(",
  },
  Object {
    "type": "name",
    "value": "subtract",
  },
  Object {
    "type": "string",
    "value": "314",
  },
  Object {
    "type": "number",
    "value": "2",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
  Object {
    "type": "paren",
    "value": ")",
  },
]
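
And when the tokenizer meets a character it cannot handle, it throws:

tokenizer('(add 2 #)')
TypeError: I don't know what this character is: #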

Hooray!!!

Please take a short rest before moving towards Station #2: The parser.

If you enjoy this kind of interactive article, would you consider a (small) donation💸 on Patreon or at least giving a star⭐ to the Klipse repo on Github?


Discover more cool interactive articles about javascript, clojure[script], python, ruby, scheme, c++ and even brainfuck!

Give Klipse a Github star to express how much you appreciate Code Interactivity.

Subscribe to the Klipse newsletter.

Feel free to email me at [email protected] to get practical tips and tricks on writing your first interactive blog post.

