
Building a Parser

source link: https://chelseatroy.com/2019/11/11/building-a-parser/



I recently picked up Crafting Interpreters. The project guides programmers through building their own interpreters for the Lox programming language. I’m writing this blog series as I go through the project. You can see all the posts so far right here.

In Chapter 5 of Crafting Interpreters, we pick up from the scanner that turned our code into a series of tokens, and we start writing the parser.

A parser takes in a list of tokens from our source code and produces an abstract syntax tree like the ones we talked about when we drew illustrations of intermediate representation strategies.

Usually the AST sits somewhere between the source language and an intermediate representation (IR), though it’s significantly closer in form to an IR than our source code is. This is where it becomes important for us to adjudicate the grammar of our language: which tokens belong to which expressions. We haven’t gotten to adjudicating meaning yet, but the way we parse the grammar plays a critical role in that.

How would I imagine a parser to be built?

Here is the representation from Crafting Interpreters for Lox’s grammar:

[Figure: Lox’s expression grammar, as presented in Crafting Interpreters]

This representation is meant to concisely portray precedence in the grammar: which parts of an expression are evaluated first. For example, in a piece of code like 2 + 3 < 5 + 1, the sums on the left and right are calculated before the comparison < is calculated, because addition expressions have precedence over comparison expressions.

The tokens on the left in the above representation list the different types of expressions. The tokens on the right show how each of those expression types nest inside each other. When we inline each expression definition, check out the shape of the resulting diagram:

[Figure: the grammar with each expression definition inlined; the resulting shape is a tree]

It’s a tree.

Back when I wrote the scanner, before I started coding, I asked myself: given what I know about what scanners do, how would I imagine such a thing to be built?

When I do the same for parsing, I think of this tree. I have done some tree traversal in my day for visualizing method calls, querying a graph database, and building an in-memory database with transaction support. Each case employed recursion.

And that’s what the parser does, too. It assumes that each expression is of the type of lowest precedence, and when it finds an indication of higher precedence, it drops into another method for that higher precedence expression. Think of each expression as water coming out of the top of a tiered fountain, and the bowls of the fountain representing each expression type.

[Figure: a tiered fountain, with each bowl representing an expression type]

An indication of higher precedence (say, a + sign in a comparison expression) causes that part of the expression to spill over from the comparison() method to an addition() method below it. Like so:
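Here is a minimal sketch of that spillover in Python. (The book implements its parser in Java; this toy Parser class, its token list of plain strings, and its tuple-based tree are my own illustrative assumptions, covering just two precedence levels rather than Lox’s full grammar.)

```python
# Sketch of two tiers of the fountain: comparison() falls through to
# addition(), so + binds tighter than <.
class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def comparison(self):
        # Lower-precedence tier: handle < and > here...
        expr = self.addition()              # ...but parse the tier below first
        while self.peek() in ("<", ">"):
            op = self.advance()
            expr = (op, expr, self.addition())
        return expr

    def addition(self):
        # Higher-precedence tier: + and -
        expr = self.primary()
        while self.peek() in ("+", "-"):
            op = self.advance()
            expr = (op, expr, self.primary())
        return expr

    def primary(self):
        # Highest precedence: just a literal token in this sketch
        return self.advance()

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.tokens[self.pos]
        self.pos += 1
        return tok

def parse(tokens):
    return Parser(tokens).comparison()

# "2 + 3 < 5 + 1": each + spills over into addition() before < is consumed.
print(parse(["2", "+", "3", "<", "5", "+", "1"]))
# ('<', ('+', '2', '3'), ('+', '5', '1'))
```

Because comparison() calls addition() before ever looking at its own operators, the sums end up nested inside the comparison, exactly the precedence we wanted.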

So on down the fountain, all the way to the expression type with the highest precedence, primary().

Notice, though, that our tree has ‘expression’ both on the bottom and the top. This is the recursive element: larger groupings of expressions can contain smaller groupings of expressions, resulting in a parser whose methods call themselves.
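That loop back to the top can also be sketched. In this hypothetical minimal grammar (addition over numbers and parentheses, my own simplification rather than Lox’s real grammar), the highest-precedence rule, primary(), recurses back into the lowest-precedence rule when it sees an opening parenthesis:

```python
# Sketch of the recursion: primary(), at the bottom of the fountain,
# calls back up to expression() at the top when it meets "(".
def parse(tokens):
    pos = 0

    def expression():
        nonlocal pos
        expr = primary()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            expr = ("+", expr, primary())
        return expr

    def primary():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1
            expr = expression()   # recursion: back to the top of the fountain
            pos += 1              # consume the closing ")"
            return expr
        tok = tokens[pos]
        pos += 1
        return tok

    return expression()

# The parenthesized sum is an entire expression nested inside primary().
print(parse(["1", "+", "(", "2", "+", "3", ")"]))
# ('+', '1', ('+', '2', '3'))
```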

Precedence vs. Associativity

Precedence refers to which operator is evaluated first in an expression that composes multiple different operators. Associativity refers to which operator is evaluated first in an expression that composes multiple of the same operator.

In the expression 2 / 1 + 1, precedence determines whether the answer is 1 or 3.

In the expression 2 / 1 / 4, associativity determines whether the answer is one half or 8.
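Python happens to share these particular rules (division binds tighter than addition, and division is left-associative), so you can check both examples directly at a REPL:

```python
# Precedence: / is evaluated before +, so this is (2 / 1) + 1, not 2 / (1 + 1).
print(2 / 1 + 1)  # 3.0

# Associativity: / groups left, so this is (2 / 1) / 4, not 2 / (1 / 4).
print(2 / 1 / 4)  # 0.5
```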

Different programming languages have different precedence and associativity rules, and you can look up precedence and associativity tables for them if you don’t feel like figuring it out by typing in expressions and seeing what comes back. Here’s a table for C and C++ (expressions are listed in order of precedence):

[Table: operator precedence and associativity for C and C++, listed in order of precedence]

We’re going to talk more about precedence and associativity in the next post, but it makes sense to introduce associativity alongside precedence because the contrast between them helps cement in the memory what each one is doing.

Visualizing Order of Operations

The parser takes in a list of tokens and spits out an expression, which can have other expressions that will be evaluated first nested inside it. We could visualize this several different ways. I implemented the two discussed in Crafting Interpreters. 

The first prints the AST as a string that looks a lot like the Scheme programming language. We’ll find out why that choice makes sense in a future post about writing interpreters in Scheme…for Scheme. (And someday I’ll learn to make a good cliffhanger on a programming blog). Here’s an example:

[Screenshot: expressions typed at a prompt, with the token list and the Scheme-like AST representation printed for each]

Here, I’m typing in expressions and printing both the token list and the abstract syntax tree representation. If you don’t write Scheme, here’s the skinny so you can understand this image: the operator comes first, followed by the operands. The narrowest set of parentheses is evaluated first.

In Lox, mathematical operations are left-associative. So when we do 4 + 3 + 2 + 1, Lox evaluates 4 + 3 first, then adds the result to 2, then adds the sum of all those to 1. When we do 9 / 8 / 7 / 6, 9 is divided by 8, the quotient of that by 7, and the quotient of that by 6.

In Lox, 2 / 1 / 4 is one half.
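A printer for that Scheme-like notation is a small recursive walk over the tree. This Python sketch assumes the same tuple-based tree shape as my earlier sketches (my own convention, not the book’s Java classes):

```python
# Sketch of a Lisp-style AST printer: operator first, then operands,
# with nesting shown by parentheses.
def lisp_print(node):
    if isinstance(node, tuple):
        op, left, right = node
        return f"({op} {lisp_print(left)} {lisp_print(right)})"
    return str(node)

# Left-associative 4 + 3 + 2 + 1: each + groups with everything to its left.
tree = ("+", ("+", ("+", 4, 3), 2), 1)
print(lisp_print(tree))  # (+ (+ (+ 4 3) 2) 1)
```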

Here are the same operations with the AST represented in Reverse Polish Notation, with the operator at the end of the expression. This constituted one of the exercises at the end of Chapter 5, I suspect to encourage folks to practice the programming pattern we used to implement the Scheme-like AST representation.

[Screenshot: the same expressions, with the AST printed in Reverse Polish Notation]
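The RPN version is the same recursive walk with the pieces emitted in a different order: operands first, operator last. (Again a Python sketch over my tuple-based trees, not the book’s code.)

```python
# Sketch of a Reverse Polish Notation printer: same traversal as the
# Lisp-style printer, but the operator comes after its operands.
def rpn_print(node):
    if isinstance(node, tuple):
        op, left, right = node
        return f"{rpn_print(left)} {rpn_print(right)} {op}"
    return str(node)

tree = ("+", ("+", ("+", 4, 3), 2), 1)
print(rpn_print(tree))  # 4 3 + 2 + 1 +
```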

I thought about making a third representation using GraphViz because I like pictures. I decided against this, though, because my goal in this project is to learn about how to write interpreters, and futzing with the GraphViz dependency to do the same pattern a third time that I had already done twice doesn’t contribute to that goal. As we’ve discussed at length, I like to dig holes, and using projects to focus my learning helps me recognize when I’m getting out the shovel so I can force myself to put it away.

So, rather than digging that hole, in the next post we’ll move on to talk some more about abstract syntax trees.

If you liked this post, you might also like:

The Listening Series (or maybe you’ll hate it, but I’m proud of it anyway)

The series about reducing job interview anxiety (especially for folks with a little experience)

This start to a new series about what you can do as an individual to align your engineering career with your values

