
Parsing and all that

source link: https://blog.jeffsmits.net/compsci/2024/04/07/parsing-and-all-that/

Hello again! I’m picking up my series on Automata with this post that goes into what I had always meant to get to: parsers. We’ll check out the old-school linear-time parsing algorithms, which only need to go over the input once, without backtracking or caching: LL and LR, parse tables, recursive descent and recursive ascent. Welcome to the world of deterministic parsing…

Refresher from Pushy Automata

We’ll start with a brief refresher from the previous post of the series, pushy automata, since that was a little while back.

Push-Down Automata

Push-down automata (PDAs) are automata with a stack. Normal finite automata just consume input and have fixed memory via their states. PDAs can remember things on a single stack too, by pushing onto it and popping from it. Here’s a deterministic PDA for recognising the language of words that start with zeroes, followed by an equal number of ones:

[Figure: Non-regular language example, deterministic. A PDA with accepting states q₀ and q₃; transitions: q₀ →(0, ε → $)→ q₁; q₁ →(0, ε → 0)→ q₁; q₁ →(1, 0 → ε)→ q₂; q₂ →(1, 0 → ε)→ q₂; q₂ →(1, $ → ε)→ q₃.]

So we start at q₀, see if there is a 0 as input, ignore the top of the stack, and put a $ on the stack as a marker for the end of the stack. Now we’re in state q₁, in which we can consume more zeroes from the input and put those on the stack. If we find a one as input, we remove a zero from the stack by not pushing anything new on the stack. Now we’re in state q₂ where we remove zeroes from the stack for every one in the input, until we consume the final one by removing the $ from the stack.
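To make this concrete, here’s a minimal sketch of this PDA in Rust. The naming is mine, and I’ve included the q₁ →(1, $ → ε)→ q₃ transition that the aside just below points out is missing from the diagram:

enum State { Q0, Q1, Q2, Q3 }

/// Accepts the language 0ⁿ1ⁿ: the empty word in q₀, everything else via q₃.
fn accepts(input: &str) -> bool {
    let mut state = State::Q0;
    let mut stack: Vec<char> = Vec::new();
    for c in input.chars() {
        state = match (state, c) {
            // q₀ --0, ε → $--> q₁: push the end-of-stack marker.
            (State::Q0, '0') => { stack.push('$'); State::Q1 }
            // q₁ --0, ε → 0--> q₁: remember every further zero.
            (State::Q1, '0') => { stack.push('0'); State::Q1 }
            // q₁/q₂ --1, 0 → ε--> q₂: a one cancels a stacked zero.
            (State::Q1 | State::Q2, '1') if stack.last() == Some(&'0') => { stack.pop(); State::Q2 }
            // q₁/q₂ --1, $ → ε--> q₃: the final one pops the marker.
            (State::Q1 | State::Q2, '1') if stack.last() == Some(&'$') => { stack.pop(); State::Q3 }
            // No transition applies: reject.
            _ => return false,
        };
    }
    // q₀ and q₃ are the accepting (double-circled) states.
    matches!(state, State::Q0 | State::Q3)
}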

Aside: This is one of the examples from the old blog post, and I now see that it is missing a transition! This automaton rejects the input 01, because there is no transition q₁ →(1, $ → ε)→ q₃. Oops ^_^

Context-Free Grammars, Derivations, and a naive PDA translation

A context-free grammar that describes the above language is:

S = 0 S 1 (step)
S = ε (ε)

Sort S is the start symbol, the starting point in the grammar. If we’re using the grammar productively we start from the start symbol and use the rules left-to-right to replace sorts until we get the sentence in the corresponding context-free language that we want. Something like: S → 0S1 → 00S11 → 0011. This is called a derivation.

The most general, naive way to derive a push-down automaton for any context-free grammar is by pushing the end-of-stack and start symbol at the start, having a “main” state that uses the grammar rules with the body reversed (to deal with the stack order), and an accept state that pops the end-of-stack:

[Figure: The naive PDA for this grammar. The start edge pushes $ S; the main state q₀ loops with ε, S → 1 S 0; ε, S → ε; 0, 0 → ε; 1, 1 → ε; and ε, $ → ε moves to the accepting state q₁.]

Here the stack grows left-to-right, so the lowest symbol on the stack is $ (end of stack), followed by S (the grammar start symbol). By the rules of the grammar we can manipulate the top of the stack and rewrite it to the body. If the input lines up with what we have on the stack, we can eliminate both. It’s simple, but inefficient because of all the nondeterminism.
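As a sketch in Rust (naming mine): the nondeterminism of this naive PDA can be simulated with backtracking, which also shows where the inefficiency comes from. I leave out the $ marker and let the empty stack play its role:

/// Stack symbols of the naive PDA for S = 0 S 1 | ε.
#[derive(Clone, Copy, PartialEq)]
enum Symbol { S, Zero, One }

/// Simulates the nondeterministic PDA by backtracking over the two rules
/// for S; exponential in the worst case.
fn accepts(stack: &mut Vec<Symbol>, input: &[char]) -> bool {
    match stack.last().copied() {
        // Empty stack: accept iff the input is used up too.
        None => input.is_empty(),
        // Sort on top: try rule (step), then rule (ε).
        Some(Symbol::S) => {
            stack.pop();
            // (step) S = 0 S 1, pushed reversed so the 0 ends up on top.
            stack.extend([Symbol::One, Symbol::S, Symbol::Zero]);
            if accepts(stack, input) {
                return true;
            }
            stack.truncate(stack.len() - 3); // undo (step)
            // (ε) S = ε: the sort is simply popped.
            let ok = accepts(stack, input);
            stack.push(Symbol::S); // undo the pop before returning
            ok
        }
        // Terminal on top: it must match the next input character.
        Some(terminal) => {
            let expected = if terminal == Symbol::Zero { '0' } else { '1' };
            if input.first() == Some(&expected) {
                stack.pop();
                let ok = accepts(stack, &input[1..]);
                stack.push(terminal);
                ok
            } else {
                false
            }
        }
    }
}

// accepts(&mut vec![Symbol::S], &['0', '0', '1', '1']) == true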

Derivations, Parse Trees and Ambiguity

Let’s look at a slightly more interesting grammar from a parser perspective:

S = S + S (add)
S = S * S (mul)
S = 1 (one)

When you want to derive 1 + 1 * 1, you can do this in all manner of ways. The following derivation picks an arbitrary sort on which to apply a rule from the grammar:

  1. S (first S)
  2. S + S (first S)
  3. 1 + S (first S)
  4. 1 + S * S (second S)
  5. 1 + S * 1 (first S)
  6. 1 + 1 * 1

Notice how in some steps the leftmost S was replaced, while in others the rightmost was replaced. Generally speaking, you’ll want either a leftmost or a rightmost derivation for parsers, which is to say: a grammar rule is always applied to the leftmost or rightmost sort. There are three reasons for this. The first is that you want a parser to be predictable in when it applies grammar rules, as you may connect so-called semantic actions to each rule. These are pieces of code that are run when the parser applies the particular rule (a typical example is a simple calculator). Such actions can perform side-effects, so order matters, and it’s through those side-effects that leftmost vs rightmost becomes observable. The two other reasons to want a predictable derivation order are ease of implementation, and ease of proving things about your algorithm; these last two care less about whether it’s leftmost or rightmost.

The most common semantic action I’m aware of is building a syntax tree with the parser. This builds a tree structure out of the parsed text. A parse tree, or concrete syntax tree, contains all the rule applications as seen in the grammar. An abstract syntax tree abstracts over some parts of the parse tree, such as leaving out whitespace, parentheses (the shape of the tree captures the precedence anyway), or injections (grammar rules of the form S₁ = S₂). Let’s look at some parse trees of the last example, 1 + 1 * 1:

[Figure: Parse trees of 1 + (1 * 1) and (1 + 1) * 1.]

Notice how the leaves of the two trees are in the same order left-to-right as the input, but in the left tree the plus is higher up, while in the right tree the star is higher up. If we want to interpret the input as simple arithmetic, where multiplication binds tighter than addition, the left tree is the one we want. This is the precedence of the operators: * > +.
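As data, the two trees could look like this (a hypothetical AST type for this grammar, with the (one) rule inlined as a literal):

/// A possible abstract syntax tree for the S = S + S | S * S | 1 grammar.
#[derive(Debug)]
enum Expr {
    Add(Box<Expr>, Box<Expr>),
    Mul(Box<Expr>, Box<Expr>),
    One,
}

use Expr::*;

fn main() {
    // Left tree: 1 + (1 * 1), the one we want under * > +.
    let left = Add(Box::new(One), Box::new(Mul(Box::new(One), Box::new(One))));
    // Right tree: (1 + 1) * 1.
    let right = Mul(Box::new(Add(Box::new(One), Box::new(One))), Box::new(One));
    println!("{left:?} vs {right:?}");
}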

When you can get multiple trees like this, the grammar is called ambiguous. More formally, if you use only leftmost derivations (or only rightmost) and still find two distinct derivations that give the same sentence, the grammar is ambiguous. So to be clear: both of the above trees can be created with only leftmost derivations; it’s not a matter of choosing leftmost for one tree and rightmost for the other. Derivation order (leftmost or rightmost) has to do with the side-effect order of semantic actions only. When you build trees you don’t really need side-effects, so the derivation order has no effect on the result.

With that recap out of the way: for the purposes of this blog post, we’ll look at unambiguous grammars for the languages we want to parse. Still, whether you use leftmost or rightmost derivation in a parser for unambiguous grammars matters quite a lot in terms of what languages you can describe deterministically. It also influences how easily you can write a parser by hand for such a grammar, and how easily you can (programmatically) explain why your parser doesn’t accept certain inputs (parser error messages). So let’s have a look at LL and LR parsing techniques, where the first L in those abbreviations stands for Left-to-right (as in reading direction in text), and the second letter stands for leftmost and rightmost derivation respectively.

Topdown, (Strong) LL parsing

To take a good look at LL parsing, we will first work with a grammar that is not ambiguous or left-recursive:

S = F (1)
S = ( S + F ) (2)
F = a (3)

So sort S is the start symbol, we also have sort F, and we have round brackets, plusses, and a’s. This is enough information to create a table that, based on (1) the next sort to be parsed and (2) the next symbol in the input, predicts which rule from the grammar to use to parse the input further. In other words, if you know where you are in the input and grammar, you can look ahead at the next symbol of input and tell which unique grammar rule predicts the next bit of input (assuming the input fits the grammar). The table for the above grammar looks like so:

     (   a
S    2   1
F        3

A table like the above is an LL(1) parse table, because it uses only 1 symbol of “look-ahead” in the columns. LL(1) grammars are always strong LL grammars, which means that they only need the combination of the sort to be parsed and the next symbol(s) to decide on a unique grammar rule to apply. In general, LL(k) grammars do not have to be strong, and if they are not, you also need to know what was already parsed from the input (the “left context”) in order to choose a unique grammar rule1. For example, the following grammar is LL(2), and not strong:

S = A a b A b a (1)
A = a (2)
A = ε (3)

You can see this if you try to write an LL(2) parse table for it:

     a a   a b   b a
S    1     1
A    2     2,3   3

If you look ahead to a b on the input, and the next sort is A, then it really depends on whether you are at the start of the input or in the middle of rule 1. If you’re at the start, you must choose rule 3 so you can parse a b as part of rule 1, but if you’re already in the middle of rule 1, you must choose rule 2 for A so you can continue to parse the b a of rule 1.

If you mark the A’s in rule 1 with their position in the rule (S = A₁ a b A₂ b a), you get an LL(2) grammar that is strong, although the table for it is larger2:

     a a   a b   b a
S    1     1
A₁   2     3
A₂         2     3

In general, you can always use this trick to construct a strong, structurally equivalent LL grammar with the same look-ahead. This is quite useful for constructing simple LL parsers. However, the downside of these parsers is that on wrong input they can fail later than a more complicated LL(k) parser that works for the non-strong grammar. And that matters if you want to give nice error messages.

An intuition for table construction by automaton

Building the above tables was a matter of keeping in mind what they mean, and squinting a little. But in the case of a larger grammar, or a parsetable generator, of course you want an exact process. Before I dive into First and Follow sets that are the traditional method for building these tables, let me give you a method that is less practical but in my opinion more intuitive.

Step 1: Let’s build a simple automaton for each rule of the grammar, where we assume both sorts and terminals are on the input.

[Figure: Simple automata for each grammar rule from the last example. Rule 1: S₁₀ →A→ S₁₁ →a→ S₁₂ →b→ S₁₃ →A→ S₁₄ →b→ S₁₅ →a→ S₁; rule 2: A₂₀ →a→ A₂; rule 3: just the final state A₃.]

Note how each node of a rule automaton has the number of the rule followed by the offset into the body of the rule. The state represents where we are in the rule while parsing by that rule. The last node doesn’t have this offset so you can easily identify it, even when it’s no longer a final state.

Typically you’ll find texts on parsers display the position in a rule more visually with “LR item” notation. This uses the actual rule and a dot to indicate where we are in the rule. While this makes individual automata states more readable, it makes the automata large and therefore harder to display in a readable way as a whole. That’s why you won’t find that notation in this post’s automata. But just to illustrate an example of the notation:

Shorthand in this blog   LR item notation
S₁₀                      S = . A a b A b a
S₁₁                      S = A . a b A b a
S₁₅                      S = A a b A b . a
S₁                       S = A a b A b a .

Step 2: Now instead of consuming a sort with an automaton, we’ll use ε rules to jump to the automata of the rules for that sort instead. We’ll use the PDA stack with unique labels to get back to where you would be after consuming the sort.

[Figure: Single PDA using the automata from the grammar rules. Each sort transition N →A→ M is supplemented with push transitions ↓M from N to the start of each rule automaton for A, and pop transitions ↑M from the ends of those automata back to M. Here: S₁₀ has ↓S₁₁ edges to A₂₀ and A₃, with ↑S₁₁ edges from A₂ and A₃ back to S₁₁; likewise S₁₃ has ↓S₁₄ edges to A₂₀ and A₃, with ↑S₁₄ edges back to S₁₄.]

The ↓X is an abbreviation for an ε, ε → X edge, which pushes a symbol on the stack unconditionally; it was hard to get graphviz to cooperate on node placement of this graph otherwise… Now every node that had a sort transition has multiple transition options; these are where you need to look ahead. Soooo…

Step 3: at a sort transition node, for each ↓ transition, follow transitions until you’ve consumed k terminals (2 in this example) from the input. These are the terminals of the column in the parse table, and the rule of the ↓ transition gets put into that cell. You can also put the look-ahead into the automaton:

[Figure: The same PDA with the k = 2 look-ahead added to the push transitions: S₁₀ →(aa), ↓S₁₁→ A₂₀ and S₁₀ →(ab), ↓S₁₁→ A₃; S₁₃ →(ab), ↓S₁₄→ A₂₀ and S₁₃ →(ba), ↓S₁₄→ A₃.]

Building LL tables for strong LL grammars by traditional method

While the above building of automata gives a visual intuition, it’s not the most efficient way to express how we can build LL tables. The traditional method does the same thing in essence, but using some pre-computed sets to calculate the cells in the table.

A cell at the row labeled with sort A and the column labeled with terminal(s) v should have the grammar rule A = w (where w is a mix of terminals and sorts, or ε), under the following condition: v is in the First set of w, or ε is in the First set of w and v is in the Follow set of A. In other words: v ∈ First(w) · Follow(A).

Let’s unpack that. The First set of a terminal is a singleton set with that terminal. The First set of a sort is the set of first terminals that the sort can expand to, directly or indirectly. So a rule A = a […] causes a to appear in the First set of A, A = B […] causes the First set of B to be included in the First set of A, and A = ε causes ε to appear in the First set of A. This last rule says A can be expanded to “nothing”, so if that’s an option we need to check the Follow set of A.

The Follow set of A is basically every terminal that can follow A in the grammar. So when you have B = […] A a […], a is in the Follow set of A. A rule B = […] A causes the Follow set of B to be included in the Follow set of A. And the Follow set of the start symbol contains the end-of-file ‘terminal’ $.


Finally, there is the dot operator between the First and Follow sets: this is a truncated product, which takes every combination of the two sets, sticks them together (in order), and truncates to length k. That’s a bit of an abstraction over the k in LL(k), which I didn’t take into account in the explanation of First and Follow sets: the First sets should hold length-k strings of course, and so you may need to take more First/Follow sets into account when computing these. Another thing I glossed over is that we actually use the First set of w, a mix of terminals and sorts on the right-hand side of our grammar rules. If w is a A B b, then its First set is {a} · First(A) · First(B) · {b}.
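Here’s a sketch of computing the k = 1 First sets in Rust, as a fixpoint iteration; the representation is mine, and it assumes every sort used in a body also has at least one rule. Follow sets are computed with a very similar fixpoint:

use std::collections::{HashMap, HashSet};

enum Symbol {
    Sort(&'static str),
    Terminal(char),
}

/// A grammar as a list of rules: (head sort, body). A rule A = ε is a
/// rule with an empty body.
type Grammar = Vec<(&'static str, Vec<Symbol>)>;

/// First sets for k = 1, as a fixpoint: keep applying the defining
/// equations until no set grows any more. `None` in a set stands for ε.
fn first_sets(grammar: &Grammar) -> HashMap<&'static str, HashSet<Option<char>>> {
    let mut first: HashMap<&'static str, HashSet<Option<char>>> = HashMap::new();
    for (sort, _) in grammar {
        first.entry(*sort).or_default();
    }
    let mut changed = true;
    while changed {
        changed = false;
        for (sort, body) in grammar {
            let mut new: HashSet<Option<char>> = HashSet::new();
            let mut nullable = true; // can the body seen so far derive ε?
            for symbol in body {
                match symbol {
                    // A terminal starts the rest of the body: done.
                    Symbol::Terminal(c) => {
                        new.insert(Some(*c));
                        nullable = false;
                        break;
                    }
                    // A sort contributes its own First set, and only
                    // lets us look further if it can derive ε.
                    Symbol::Sort(s) => {
                        new.extend(first[s].iter().filter(|f| f.is_some()));
                        if !first[s].contains(&None) {
                            nullable = false;
                            break;
                        }
                    }
                }
            }
            if nullable {
                new.insert(None); // the whole body can derive ε
            }
            let set = first.get_mut(sort).unwrap();
            let before = set.len();
            set.extend(new);
            changed |= set.len() != before;
        }
    }
    first
}

For the LL(2) example above this yields First(S) = {a} and First(A) = {a, ε}.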

Ok, with that all done, we can use those tables. But before we do, a quick word about expressive power, because LL is not particularly powerful…

Limitations and Expressive power

For every k, there are languages that cannot be captured by an LL(k) grammar but can be captured by an LL(k+1) grammar. In other words, look-ahead size is important to the expressivity of an LL grammar, and LL(k) for any specific k does not capture all deterministic context-free languages3.

We’ve already seen an example of an LL(2) grammar, but what would be a language that cannot be captured by any LL(k)? Take the language of a’s followed by b’s, where the number of a’s ≥ the number of b’s. Or as a grammar:

S = a S (1)
S = A (2)
A = a A b (3)
A = ε (4)

The problem for LL here is that we would have to look ahead arbitrarily far, possibly until we’ve read the entire input, before we could decide whether to start consuming the input with rule 1 or rule 2 (and then rule 3).

There is a class of grammars called LL-regular (LLR) grammars that captures all LL(k) grammars for any k, and slightly more. These LLR grammars are cool in that they are still parseable in linear time, as long as you have something called a “regular partition” of your grammar. Getting that is an undecidable problem though. And since there is an LR(1) grammar that is not in LLR, this stuff is the trifecta of confusing, impractical, and less powerful4 than a much more useful technique that we will cover later in this post: LR. But first, going from tables to parsers!

Predictive Parsing

Since we already know what the tables mean, we can write a simple parse table interpreter to finish our predictive parser. The parser is called predictive because based on the k look-ahead terminals, we decide the grammar rule to use to continue parsing, which typically predicts some of the structure of the input well beyond the part we peeked at for the look-ahead.

Ok, let’s write a quick parse table interpreter for our LL(2) example. We’ll start with some definitions.

use std::collections::HashMap;
use std::env;

use lazy_static::lazy_static;
use peekmore::PeekMore;

type Terminal = char;

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum Sort {
    S,
    A1,
    A2,
}

enum Symbol {
    Sort(Sort),
    Terminal(Terminal),
}

#[derive(Debug, Eq, PartialEq)]
enum Rule {
    S,
    Aa,
    AEpsilon,
}

The imports become useful in a second, for now we have our terminals, sorts, a combination type Symbol, and the names of our grammar rules. Assuming we keep around a proper PDA stack of symbols, we can write our grammar rules now:

impl Rule {
    fn apply(&self, stack: &mut Vec<Symbol>) {
        match self {
            Rule::S => Self::s(stack),
            Rule::Aa => Self::aa(stack),
            Rule::AEpsilon => Self::a_epsilon(stack),
        }
    }

    fn s(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Sort(Sort::A2));
        stack.push(Symbol::Terminal('b'));
        stack.push(Symbol::Terminal('a'));
        stack.push(Symbol::Sort(Sort::A1));
    }

    fn aa(stack: &mut Vec<Symbol>) {
        stack.push(Symbol::Terminal('a'));
    }

    #[allow(clippy::ptr_arg)]
    fn a_epsilon(_stack: &mut Vec<Symbol>) {}
}

Clippy is great for catching all kinds of poor code, but for consistency I’ve chosen to #[allow] this time. Note that to effectively run a context-free grammar on a PDA, you need to push the symbols in your rules on the stack in reverse, as mentioned in the recap.

lazy_static! {
    static ref TABLE: HashMap<(Sort, Terminal, Terminal), Rule> = {
        let mut table = HashMap::new();
        assert_eq!(None, table.insert((Sort::S,  'a', 'a'), Rule::S));
        assert_eq!(None, table.insert((Sort::S,  'a', 'b'), Rule::S));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'a'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A1, 'a', 'b'), Rule::AEpsilon));
        assert_eq!(None, table.insert((Sort::A2, 'a', 'b'), Rule::Aa));
        assert_eq!(None, table.insert((Sort::A2, 'b', 'a'), Rule::AEpsilon));
        table
    };
}

Nothing very special really, just encoding what we had already. The main parse loop is also very unexciting now that we have implemented most of the logic of the grammar already. We basically manage the stack, eliminating terminals on the stack with those from the input and applying rules from the table based on sort and look-ahead, and give errors if we get unexpected input:

pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    // args() starts with the program name, so the argument is at index 1.
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    let mut stack = Vec::new();
    stack.push(Symbol::Sort(Sort::S));
    while let Some(symbol) = stack.pop() {
        return match symbol {
            Symbol::Terminal(predicted) => {
                if let Some(&actual) = input.next() {
                    if predicted == actual {
                        continue;
                    }
                    Err(format!(
                        "Expected terminal {predicted:?}, but got {actual:?}."
                    ))
                } else {
                    Err(format!("Expected terminal {predicted:?}, but got EOF."))
                }
            }
            Symbol::Sort(sort) => {
                if let &[Some(&term1), Some(&term2)] = input.peek_amount(2) {
                    if let Some(r) = TABLE.get(&(sort, term1, term2)) {
                        r.apply(&mut stack);
                        continue;
                    } else {
                        Err(format!(
                            "Unexpected {term1:?} {term2:?} while parsing {sort:?}"
                        ))
                    }
                } else {
                    Err("Unexpected end of input.".to_owned())
                }
            }
        };
    }
    Ok(())
}

Recursive Descent

By encoding the parse table as data, we get some amount of interpretive overhead. We have a parse table interpreter with a stack we manage ourselves, but that stack is not really used any differently from a call stack. So what if we use function calls instead? That’s the idea of recursive descent parsing. It actually makes our code smaller and more straightforward, which is why it’s so popular as a technique for hand-written parsers.

use std::env;

use peekmore::PeekMore;
use peekmore::PeekMoreIterator;

type Iter<'a> = PeekMoreIterator<std::slice::Iter<'a, Terminal>>;

type Terminal = char;

fn consume(input: &mut Iter, predicted: Terminal) -> Result<(), String> {
    if let Some(&actual) = input.next() {
        if actual == predicted {
            Ok(())
        } else {
            Err(format!(
                "Expected terminal {predicted:?}, but got {actual:?}."
            ))
        }
    } else {
        Err("Unexpected end of file.".to_owned())
    }
}

This time we only need terminals as a type, the rest is gone, and so is the hashmap import for the parsetable. We will need the input, and be able to remove predicted terminals from it, so consume comes in handy.

fn sort_s(input: &mut Iter) -> Result<(), String> {
    // S
    match input.peek_amount(2) {
        &[Some(&'a'), Some(&'a')] => s(input),
        &[Some(&'a'), Some(&'b')] => s(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing S")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A1(input: &mut Iter) -> Result<(), String> {
    // A1
    match input.peek_amount(2) {
        &[Some(&'a'), Some(&'a')] => a_a(input),
        &[Some(&'a'), Some(&'b')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn sort_A2(input: &mut Iter) -> Result<(), String> {
    // A2
    match input.peek_amount(2) {
        &[Some(&'a'), Some(&'b')] => a_a(input),
        &[Some(&'b'), Some(&'a')] => a_epsilon(input),
        &[term1, term2] => Err(format!("Unexpected {term1:?} {term2:?} while parsing A")),
        _ => Err("Unexpected end of file.".to_owned()),
    }
}

fn s(input: &mut Iter) -> Result<(), String> {
    sort_A1(input)?;
    consume(input, 'a')?;
    consume(input, 'b')?;
    sort_A2(input)?;
    consume(input, 'b')?;
    consume(input, 'a')
}

fn a_a(input: &mut Iter) -> Result<(), String> {
    consume(input, 'a')
}

fn a_epsilon(_input: &mut Iter) -> Result<(), String> {
    Ok(())
}

Our parse table has now become code directly, with these functions named after the sorts of the rows. They call rules that are also functions, which in turn use the sort functions. Those rules also use consume, this time without having to reverse the order of the rule body.

pub fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    // As before, skip the program name to get to the actual argument.
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekmore();
    sort_s(&mut input)
}

Finally, our main function just calls the right sort function instead of putting that sort on the stack. And the loop is gone, since we now use recursion.

Summary of LL, and an insight from the automaton

We’ve now seen LL(k) parsing: left-to-right, leftmost derivation. This leftmost derivation directly corresponds to walking through the parse tree topdown, depth-first, leftmost child first. Whenever we expand the leftmost sort by a rule for that sort, we have to choose a rule; therefore we use the look-ahead (of length k) to see ahead and choose based on that.

We’ve seen an LL(1) and an LL(2) grammar, and in general more look-ahead allows us to parse more grammars and more languages. Both are important: certain languages cannot be expressed in LL(1) or LL(2), and some LL(1) grammars are harder to read and write than the LL(2) grammar of the same language.

We’ve seen how we can construct simple DFAs for each rule in our grammar, and then replace the sort transitions N₁ →A→ N₂ by a (PDA) push transition (↓) from N₁ to all starts of DFAs corresponding to rules of A, and a pop transition (↑) from the ends of those DFAs to N₂. Then the LL table, the decision table of sort + look-ahead = rule, naturally follows from this PDA by finding what input will be consumed if a certain rule is chosen, and using that as the look-ahead to make the decision for that rule.

The recursive descent way of writing a parser directly as code is nice and simple, it really just follows the grammar. Since you’re writing plain old code with function calls, you can imagine people have found nice ways to extend and adapt the pattern of recursive descent parsers. For one, it’s quite easy to reason about where you are in the parse when hitting an error state, which makes it fairly easy to give friendly error messages when the parser doesn’t accept an input. You can also use a trick called node reparenting to fix up direct left-recursion, where you use a loop or tail-recursion to locally construct the tree bottom-up, as sketched below. You could argue that such a parser is a hybrid between recursive descent and ascent, a “recursive descent-ascent parser”.
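Here’s a small sketch of that node reparenting trick, on the made-up left-recursive rule E = E 1 | 1 (which would send a naive recursive descent function into infinite recursion):

/// A tree for the left-recursive rule E = E 1 | 1.
#[derive(Debug)]
enum Tree {
    Leaf,            // E = 1
    Node(Box<Tree>), // E = E 1
}

/// Parse the non-recursive base case first, then loop, reparenting the
/// tree built so far under a new node for every further 1.
fn parse_e<I: Iterator<Item = char>>(
    input: &mut std::iter::Peekable<I>,
) -> Result<Tree, String> {
    let mut tree = match input.next() {
        Some('1') => Tree::Leaf,
        other => return Err(format!("expected '1', got {other:?}")),
    };
    while input.peek() == Some(&'1') {
        input.next();
        tree = Tree::Node(Box::new(tree)); // the old tree becomes the child
    }
    Ok(tree)
}

// parse_e(&mut "111".chars().peekable()) builds Node(Node(Leaf)), the
// left-leaning tree the left-recursive grammar describes.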

Finally, if we look back at the automaton, we can see that the PDAs we build have a very strict shape. We either have a non-deterministic choice due to multiple push transitions for a sort, or we have predicted input, a single path of terminals to consume from the input. If we think back to the NFAs and DFAs from early on in this blog post series, those used the input to choose what state to go to next. Now we have single-path DFAs that just consume input, and a separate table on a look-ahead to resolve non-determinism from the pushes and pops. The strict shape here indicates that we’re not really making full use of the power of automata. This will change with the next parsing technique.

Bottomup, LR parsing

LR stands for left-to-right, rightmost derivation in reverse. If you think about it, left-to-right and rightmost derivation are incompatible: The rightmost derivation chooses the rule for the rightmost sort first every time, but that means skipping over some unknown amount of input if you read left-to-right to even get to that point. However, the reverse of the rightmost derivation is a left-to-right form of parsing. This reverse derivation describes going bottomup, left-to-right through the parse tree.

Expressive power and relation to LL

One of the biggest upsides of LR(k) parsing is its expressivity. The set of all LL(k) languages of any k is a strict subset of all LR(1) languages. Note that this is speaking of languages, not grammars. For grammars it holds that any LL(k) grammar for a specific k is also an LR(k) grammar, and not necessarily the other way around.

An LR(k) grammar for any k greater than 1 can be automatically transformed into an LR(1) grammar that is not necessarily structurally equivalent. This highlights the difference between grammar-level and language-level equivalence. We can basically capture any LR language in an LR(1) grammar, but LR with larger k may be able to describe the language in a nicer way (a smaller grammar).

A good overview of how LL and LR relate to each other on the grammar and language level is summarised on the Computer Science Stack Exchange. In the comments someone suggests making a list of examples for each of these relationships, which seems like a great idea, but not something I have the patience for right now. This blog post has enough scope creep already.

How LR works

In order to give a reverse rightmost derivation, we need to figure out what sorts can be at the leftmost leaf of the parse tree for our LR grammar. Then we try to apply the rules for those sorts all simultaneously. And to do so we can’t just use the automaton construction we’ve used for LL.

Remember that the automata we’ve used previously mapped well on recursive descent, and showed us where to use an LL parse table with look-ahead to resolve ambiguity. Crucially, those automata observe every rule we go into. But for LR we need to explore down all the rules simultaneously. Let’s see if we can’t get somewhere with that idea and the example grammar of the language that wasn’t LL:

S = a S (1)
S = A (2)
A = a A b (3)
A = ε (4)

We start again with the separate automata for each rule:

[Figure: Simple automata for each grammar rule of this example. Rule 1: S₁₀ →a→ S₁₁ →S→ S₁; rule 2: S₂₀ →A→ S₂; rule 3: A₃₀ →a→ A₃₁ →A→ A₃₂ →b→ A₃; rule 4: just the final state A₄.]

Now in order to explore to the bottom-left of the parse tree, we need to be free to go into any rule. So we will connect the rules again to the nodes that expect a certain sort, but with epsilon transitions so we don’t observe how far down we are or with what rule in particular we got there. We’ll need that later, but let’s not worry about that until we have the downward exploration.

[Figure: Partially constructed automaton using the automata from the grammar rules, with ε transitions from every node that expects a sort to the start node of each rule automaton for that sort: S₁₁ → S₁₀ and S₂₀; S₂₀ → A₃₀ and A₄; A₃₁ → A₃₀ and A₄.]

Obviously this is not a full automaton model of a parser yet, but it allows us to always go down to the next leaf of the parse tree without using the stack. Let’s clean up the automaton with an NFA-to-DFA conversion:

[Figure: The automaton after merging states through NFA-to-DFA conversion. Box0 = {S₁₀, S₂₀, A₃₀, A₄}; Box0 →a→ Box1 = {S₁₀, S₁₁, S₂₀, A₃₀, A₃₁, A₄}; Box1 →a→ Box1; Box0 →A→ S₂; Box1 →A→ Box2 = {S₂, A₃₂}; Box1 →S→ S₁; Box2 →b→ A₃.]

This is almost exactly how an LR(0) automaton would be drawn. Instead of S₁₀ and S₁₁, you would write out the LR items S = . a S and S = a . S. But otherwise it would be pretty much this. This is considered a PDA, though what’s happening on the stack is left implicit. That’s because what’s actually happening on the stack of LR automata is very structured, but a little involved. That makes the PDA harder to read and draw, but I’ll demonstrate it once:

[Figure: A fully explicit PDA that does LR parsing. Shift edges push [terminal, state] pairs, e.g. Box0 →a, ↓[a,0]→ Box1; reduce edges pop a whole rule body plus the state to return to, then push the reduced sort, e.g. A₃ →↑[b,2] [A,1] [a,0], ↓[A,0]→ S₂ and A₃ →↑[b,2] [A,1] [a,1], ↓[A,1]→ Box2.]

This should look quite familiar. We’re pushing inputs on the stack as we consume them, paired with the state we’re in at that point. And then we’re popping whole bodies of rules off the stack and replacing them with the sort of that rule. The first thing is called a shift action, the second is called a reduce action. We’ve seen this trick before in the naive PDA built from a CFG, all the way at the start of this post in the refresher. But this time we get an automaton with more states.

Notice that where a reduce action goes depends on the originating state of the last thing that’s popped. That’s why we keep track of those on the stack. When we reduce by rule 3 (state A₃), depending on whether the a came from box 1 or box 0, we go to different places. This is easier to see in our proper LR(0) automaton, where box 1 points to box 2 with a transition labeled A, while box 0 points to S₂. This is a goto action. In an LR parse table interpreter, the goto is a separate action that immediately follows a reduce action, which merely returns to the last popped state. When a reduce just returns, that’s also more like a function call and return, so that’s nice. Anyway, that’s also why a reduce transition in the above automaton always labels the originating state of the pushed sort the same as the last thing popped from the stack.

Something worth repeating now that we’re looking at the details: LL decides what rule to take before consuming the input for that rule, whereas LR decides what rule to take after consuming all the input for that rule. In other words, we only reduce by a rule when we’ve seen the entire body of the rule; that’s why there’s less trouble with look-ahead.
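To make shift, reduce, and goto concrete, here’s a minimal sketch of a table-driven LR parse loop in Rust. It’s my own simplification, not code from this post or a generator: the stack holds only states (the consumed symbols from the [symbol, state] pairs in the figure are implied by the states), and the action/goto functions passed in stand in for the parse table. A concrete pair for the LR(0) grammar below follows that section’s table.

/// One entry of an LR parse table: what to do in the current state,
/// looking at the next input symbol (None = end of input).
enum Action {
    Shift(usize),                           // consume input, push this state
    Reduce { sort: char, body_len: usize }, // apply a rule: pop, then goto
    Accept,
    Error,
}

/// The standard LR parse loop, with the parse table supplied as functions.
fn lr_parse(
    input: &str,
    action: impl Fn(usize, Option<char>) -> Action,
    goto: impl Fn(usize, char) -> usize,
) -> Result<(), String> {
    let mut input = input.chars().peekable();
    let mut stack = vec![0usize]; // stack of states; 0 is the start state
    loop {
        let state = *stack.last().unwrap();
        match action(state, input.peek().copied()) {
            Action::Shift(next) => {
                input.next(); // a shift consumes the look-ahead...
                stack.push(next); // ...and remembers the state we shifted to
            }
            Action::Reduce { sort, body_len } => {
                // A reduce pops one state per symbol in the rule body; the
                // state now exposed on top is where we were before this rule
                // started, and its goto entry for the reduced sort is where
                // parsing the sort "returns" to.
                stack.truncate(stack.len() - body_len);
                let exposed = *stack.last().unwrap();
                stack.push(goto(exposed, sort));
            }
            Action::Accept => return Ok(()),
            Action::Error => return Err(format!("unexpected input in state {state}")),
        }
    }
}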

Speaking of look-ahead: we have some shift-reduce problems in our automaton. And by that I mean: how do we choose when to shift and when to reduce when both are an option? This is a determinism issue in our current automaton, and just like in our LL automaton, we solve it with look-ahead (and yes, that can and will later be summarised in a parse table). Our latest automaton gives a clear view of what we will be able to do if we reduce, so the look-ahead follows from what input can be consumed next after each reduce:

[Figure: The same fully explicit PDA with look-ahead added where needed: Box0 →($), ↓[A,0]→ S₂; Box1 →(b,$), ↓[A,1]→ Box2; Box2 →($), ↑[A,1] ↓[S,1]→ S₁.]

As you can see, we need at most 1 look-ahead to deterministically parse this grammar. We’re sometimes looking ahead to the end-of-input represented with $. The look-ahead makes this an LALR(1) grammar; what that is and why it’s different from normal LR(1) is what we’ll see in the next section.

LR parsetable construction and expressivity

Let’s look at some example grammars, how to construct their tables, and when you need a better parsetable construction method.

LR(0)

LR(0) does not look ahead but just reduces whenever possible. If there are multiple options, you have a shift-reduce or a reduce-reduce conflict. Shift-shift conflicts don’t exist in LR since the NFA-to-DFA process would have merged the two states such conflicting transitions would point to. Let’s look at an LR(0) grammar:

S = E 2 (1)
E = E 1 (2)
E = 1 (3)

The LR automaton for this grammar is:

[Figure: The LR(0) automaton for this grammar. Box0 = {S₁₀, E₂₀, E₃₀}; Box0 →1→ E₃; Box0 →E→ Box1 = {S₁₁, E₂₁}; Box1 →2→ S₁; Box1 →1→ E₂; Box0 →S→ accept.]

The corresponding parse table follows this automaton:

       1      2      $        E
Box0   s E₃          accept   Box1
Box1   s E₂   s S₁
E₃     r 3    r 3    r 3
E₂     r 2    r 2    r 2
S₁     r 1    r 1    r 1

The transition from box 0 to E₃ that shifts 1 becomes a shift action to E₃ in the row of box 0 and the column of 1. The transition from box 0 to box 1 with E becomes a goto to box 1 in the row of box 0 and column of E. Finally a state that’s at the end of a rule will get all reduce actions by that rule (indicated by its number) in the column for input. Accepting the input is typically based on look-ahead of the end-of-input.
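To tie this to code: using the hypothetical Action/lr_parse sketch from earlier, this table can be written down directly. The state numbering is mine: 0 = Box0, 1 = Box1, 2 = E₃, 3 = E₂, 4 = S₁, and 5 for the accepting state the S goto points to.

fn lr0_action(state: usize, la: Option<char>) -> Action {
    match (state, la) {
        (0, Some('1')) => Action::Shift(2),                   // s E₃
        (1, Some('1')) => Action::Shift(3),                   // s E₂
        (1, Some('2')) => Action::Shift(4),                   // s S₁
        // LR(0): reduce regardless of the look-ahead.
        (2, _) => Action::Reduce { sort: 'E', body_len: 1 },  // r 3: E = 1
        (3, _) => Action::Reduce { sort: 'E', body_len: 2 },  // r 2: E = E 1
        (4, _) => Action::Reduce { sort: 'S', body_len: 2 },  // r 1: S = E 2
        (5, None) => Action::Accept,                          // end of input after a full S
        _ => Action::Error,
    }
}

fn lr0_goto(state: usize, sort: char) -> usize {
    match (state, sort) {
        (0, 'E') => 1, // Box0 --E--> Box1
        (0, 'S') => 5, // Box0 --S--> the accepting state
        _ => unreachable!(),
    }
}

// lr_parse("112", lr0_action, lr0_goto): shift 1 to E₃, reduce E = 1 (goto
// Box1), shift 1 to E₂, reduce E = E 1 (goto Box1), shift 2 to S₁,
// reduce S = E 2 (goto the accepting state), accept.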

Simple LR (SLR)

The smallest motivating example for Simple LR is the following grammar that parses the same language as before:

S = E 2 (1)
E = 1 E (2)
E = 1 (3)

Notice how rule 2 is now right-recursive instead of left-recursive. It’s a nice symmetry how left-recursive rules give you trouble in LL, and right-recursive rules (can) give you trouble in LR5.

[Figure: The LR(0) automaton for this grammar. Box0 = {S₁₀, E₂₀, E₃₀}; Box0 →1→ Box1 = {E₂₀, E₂₁, E₃}; Box1 →1→ Box1; Box1 →E→ E₂; Box0 →E→ S₁₁; S₁₁ →2→ S₁; Box0 →S→ accept.]

       1              2      $        E
Box0   s Box1                accept   S₁₁
Box1   s Box1 / r 3   r 3    r 3      E₂
S₁₁                   s S₁
S₁     r 1            r 1    r 1
E₂     r 2            r 2    r 2

Yay, we have a shift-reduce conflict. How do we solve it? By not blindly putting a reduce in the entire row of a state that could reduce. If we check the Follow set of the sort we’re reducing to (we defined that when we built LL parse tables, remember?), we can put the reduce action in only the column of the terminals that are in that follow set. If we look at the grammar, we can see that only 2 can follow E. So the SLR table for this grammar is:

       1        2      $        E
Box0   s Box1          accept   S₁₁
Box1   s Box1   r 3             E₂
S₁₁             s S₁
S₁                     r 1
E₂              r 2
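In terms of the same hypothetical sketch, the SLR fix shows up as guards on the reduce arms: a reduce only fires when the look-ahead is in the Follow set of the reduced sort (Follow(E) = {2}, Follow(S) = {$} here). Again with my own state numbering: 0 = Box0, 1 = Box1, 2 = S₁₁, 3 = S₁, 4 = E₂, 5 = accept.

fn slr_action(state: usize, la: Option<char>) -> Action {
    match (state, la) {
        (0, Some('1')) | (1, Some('1')) => Action::Shift(1),          // s Box1
        (1, Some('2')) => Action::Reduce { sort: 'E', body_len: 1 },  // r 3, only on Follow(E) = {2}
        (2, Some('2')) => Action::Shift(3),                           // s S₁
        (3, None) => Action::Reduce { sort: 'S', body_len: 2 },       // r 1, only on Follow(S) = {$}
        (4, Some('2')) => Action::Reduce { sort: 'E', body_len: 2 },  // r 2, only on Follow(E)
        (5, None) => Action::Accept,
        _ => Action::Error,
    }
}

fn slr_goto(state: usize, sort: char) -> usize {
    match (state, sort) {
        (0, 'E') => 2, // Box0 --E--> S₁₁
        (1, 'E') => 4, // Box1 --E--> E₂
        (0, 'S') => 5, // the accepting state
        _ => unreachable!(),
    }
}

// lr_parse("112", slr_action, slr_goto) accepts: the r 3 reduce now only
// fires on look-ahead 2, so the shift-reduce conflict in Box1 is gone.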

Look-Ahead LR (LALR)

From now on we’ll be looking at reduce-reduce conflicts only. While you can get shift-reduce conflicts with the following algorithms through grammars that don’t fit (due to ambiguity or requiring more look-ahead than you’re taking into account), when you give an LALR(k) grammar to an SLR(k) algorithm you can only get reduce-reduce conflicts. Same with an LR(k) grammar put through the LALR(k) algorithm.

Here’s our example grammar that just barely doesn’t work with SLR:

S = a E c (1)
S = a F d (2)
S = b F c (3)
E = e (4)
F = e (5)

See how rules 4 and 5 are the same except they have different sort names? Yeah, that’s going to be fun if they’re used with the same prefix like in rules 1 and 2. Let’s have a look at the automaton and SLR parse table.

[Figure: The LR(0) automaton for this grammar. Box0 →a→ Box1 = {S₁₁, S₂₁, E₄₀, F₅₀}; Box0 →b→ Box2 = {S₃₁, E₄₀, F₅₀}; Box1 →e→ Box3 = {E₄, F₅}; Box1 →E→ S₁₂ →c→ S₁; Box1 →F→ S₂₂ →d→ S₂; Box2 →e→ F₅; Box2 →F→ S₃₂ →c→ S₃; Box0 →S→ accept.]

          a        b        c           d      e        $        E      F
Box0      s Box1   s Box2                               accept
Box1                                           s Box3            S₁₂    S₂₂
Box2                                           s F₅              S₃₂
Box3                        r 4 / r 5   r 5
S₁₂                         s S₁
S₁                                                      r 1
S₂₂                                     s S₂
S₂                                                      r 2
S₃₂                         s S₃
S₃                                                      r 3
F₅                          r 5         r 5

The reduce-reduce conflict, as promised. It's in box 3, where we can reduce by E₄ or F₅ when the look-ahead is c. This is because the Follow sets of both E and F contain c, due to rules 1 and 3. If we look at the automaton though, we can clearly see that if we're about to reduce and have a c next, we should reduce by E.

Look-Ahead LR parsing uses basically this method, analysing what shifts can happen after certain reduces. Putting it in algorithmic terms: LALR doesn't use the LL Follow sets, but defines more accurate Follow sets based on the automaton. Each instance of the start of a rule in the automaton (F₅₀ in boxes 1 and 2) gets its own Follow set computed. That's how we resolve the conflict with LALR:

          a   b   c     d     e   $   E   F
Box3              r 4   r 5

Note that since the LALR Follow sets follow directly from the automaton, this is basically the same as the intuition given at the end of the previous section.
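For a grammar like ours, that per-occurrence Follow set needs very little machinery. Here's a sketch (hypothetical types; sorts are single uppercase chars, terminals lowercase, and no sort is nullable, so the symbol directly after the sort suffices where a real implementation would use First sets and nullability):

use std::collections::HashSet;

/// An LR item: rule body plus dot position.
struct Item {
    body: Vec<char>,
    dot: usize,
}

/// The LALR Follow set for the rules of `sort` started in one particular
/// state: every terminal that appears directly after `sort` in an item of
/// that state.
fn lalr_follow(state_items: &[Item], sort: char) -> HashSet<char> {
    let mut follow = HashSet::new();
    for item in state_items {
        if item.body.get(item.dot) == Some(&sort) {
            if let Some(&t) = item.body.get(item.dot + 1) {
                follow.insert(t);
            }
        }
    }
    follow
}

For box 1 above, the only item expecting E is S = a . E c, so this gives {c} for E and {d} for F, exactly the sets used in the fixed row.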

LR(1)

I like this LALR parsing story. It's so intuitive with the NFA-to-DFA conversion, just looking at the automaton to see the Follow sets. But it doesn't give you the complete power of deterministic push-down automata. I present to you the previous example grammar with one more rule:

S = a E c   (1)
S = a F d   (2)
S = b F c   (3)
E = e       (4)
F = e       (5)
S = b E d   (6)

This results in an automaton that’s almost the same as before:

An LR(0) automaton for the above grammar:

Box0 = {S₁₀, S₂₀, S₃₀, S₆₀}: on S accept; on a to Box1; on b to Box2.
Box1 = {S₁₁, S₂₁, E₄₀, F₅₀}: on e to Box3; goto S₁₂ on E; goto S₂₂ on F.
Box2 = {S₃₁, S₆₁, E₄₀, F₅₀}: on e to Box3; goto S₆₂ on E; goto S₃₂ on F.
Box3 = {E₄, F₅}: reduce states for rules 4 and 5.
S₁₂: on c to S₁ (reduce rule 1). S₂₂: on d to S₂ (reduce rule 2). S₃₂: on c to S₃ (reduce rule 3). S₆₂: on d to S₆ (reduce rule 6).

We now have a reduce-reduce conflict in box 3 again. With look-ahead c you can reduce to both E and F; the same goes for look-ahead d, by the way. It is deterministically decidable which one we should reduce to, but it now depends on which state we came from: box 1 (after an a) or box 2 (after a b).

With LALR we build an automaton for each rule, and try to reuse that rule independently of the context in which it is used. That keeps our process simple and our automaton small, but it also causes us to lose exactly the information we need to resolve the reduce-reduce conflict in box 3 above: the left context. I know the information is technically on the stack, but our parsing process decides on the rule to reduce by based on the state and look-ahead only.

LR(k) automata/parsers keep the same parsing process, they just have larger automata whose states summarise the left context. We're basically going to distinguish almost every occurrence of a sort in the grammar, similar to when we made our LL(2) grammar strong:

An LR(1) automaton for the above grammar (look-aheads in parentheses):

Box0 = {S₁₀ ($), S₂₀ ($), S₃₀ ($), S₆₀ ($)}: on S accept; on a to Box1; on b to Box2.
Box1 = {S₁₁ ($), S₂₁ ($), E₄₀ (c), F₅₀ (d)}: on e to Box3; goto S₁₂ on E; goto S₂₂ on F.
Box2 = {S₃₁ ($), S₆₁ ($), E₄₀ (d), F₅₀ (c)}: on e to Box4; goto S₆₂ on E; goto S₃₂ on F.
Box3 = {E₄₁ (c), F₅₁ (d)}: reduce rule 4 on c, rule 5 on d.
Box4 = {E₄₁ (d), F₅₁ (c)}: reduce rule 4 on d, rule 5 on c.
S₁₂: on c to S₁. S₂₂: on d to S₂. S₃₂: on c to S₃. S₆₂: on d to S₆.

How do we do this? We duplicate each rule for each terminal in the LL Follow set of the sort of that rule, and annotate each copy with that terminal. Now we do our usual thing: rules to automaton, epsilons, NFA-to-DFA. But when wiring the epsilons, the terminal annotations now have to match up with the LALR Follow set of that occurrence of the sort.
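Here's a sketch of that epsilon-wiring as an LR(1) item closure, under the same simplifications as before (hypothetical minimal types; uppercase chars are sorts, lowercase are terminals, and no sort is nullable, so the look-ahead of a new item is either the terminal after the sort or the current item's own look-ahead):

use std::collections::HashSet;

/// A grammar rule: left-hand side and body.
type Rule = (char, Vec<char>);

/// An LR(1) item: rule index, dot position, and look-ahead terminal.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Item1 {
    rule: usize,
    dot: usize,
    lookahead: char,
}

/// Close an item set: a sort right after the dot starts new items for all
/// of its rules, annotated with the follow of this particular occurrence.
fn closure(rules: &[Rule], items: &mut HashSet<Item1>) {
    let mut todo: Vec<Item1> = items.iter().copied().collect();
    while let Some(item) = todo.pop() {
        let (_, body) = &rules[item.rule];
        if let Some(&sym) = body.get(item.dot) {
            if sym.is_ascii_uppercase() {
                // Look-ahead for the new items: the terminal after the sort,
                // or the current item's look-ahead at the end of the body.
                let la = match body.get(item.dot + 1) {
                    Some(&t) => t, // assumed to be a terminal in this sketch
                    None => item.lookahead,
                };
                for (i, (lhs, _)) in rules.iter().enumerate() {
                    if *lhs == sym {
                        let new = Item1 { rule: i, dot: 0, lookahead: la };
                        if items.insert(new) {
                            todo.push(new);
                        }
                    }
                }
            }
        }
    }
}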

With this particular example, the automaton looks almost the same. There's a bit more fluff with the annotations, but they basically propagate the look-ahead for each rule. Which means we can now distinguish the contexts in which E and F are used! In general though, duplicating each rule for each terminal in the LL Follow set leads to a very large number of rules, and plenty of the time this isn't necessary: LR(1) automata have lots of redundant states that do basically the same thing, and that would have been merged in LALR without causing any reduce-reduce conflicts.

Parse table construction algorithm

You’ve already seen parse table construction by automaton for both LL and the many flavours of LR now. And you’ve seen parse table construction by First and Follow set for LL. Parse table construction for LR will of course also require First and Follow sets, sometimes including more accurate Follow sets for particular occurrences of sorts. It’s mostly an iterative build-up of the NFA-to-DFA (powerset construction) though. I’m not going to detail that in this post.

While researching the material, I found some claims of minimal LR(1) algorithms, which create LALR(1) tables when possible, and slightly larger tables when necessary. They look interesting, but they're quite something to figure out, and I haven't gotten to what I wanted to write about most yet, so that will have to wait until some other time. Perhaps I'll include the normal LR parse table construction algorithm there too as a start.

Recursive Ascent

We finally get to the original impetus for this blog post: recursive ascent parsing. As you might be able to guess, this is the LR analogue to recursive descent for LL. So we’re going to write code that directly executes the LR automaton instead of simulating it by parse table interpretation.

Before, in recursive descent parsing, we saw that rules and sorts become functions. Rule functions call sort functions to parse a sort, and sort functions check the look-ahead to choose the rule by which to parse the sort. Basically, the grammar rules became functions, and the parse table was split into a function per sort.

In recursive ascent parsing we will turn the states of the LR automaton into functions instead. Each function will shift or reduce based on the input and call the state function corresponding to that edge. Let's expand our LR(1) example a little bit, and then take a look at the recursive ascent parser:

S = a E c     (1)
S = a F d     (2)
S = b F c     (3)
E = e e       (4)
F = e e       (5)
S = b E d     (6)
S = b e e a   (7)

The reason for the extra es in rules 4 and 5 is to show how they increase the LR(1) automaton size: we now get four states for these rules (boxes 3 through 6) where LALR would merge them into two, at the price of a reduce-reduce conflict. The reason for adding rule 7 is so we have a state where we might shift or reduce depending on the look-ahead, which influences the code we generate. Let's check out the automaton first:

An LR(1) automaton for the above grammar:

Box0 = {S₁₀, S₂₀, S₃₀, S₆₀, S₇₀}: on S accept; on a to Box1; on b to Box2.
Box1 = {S₁₁, S₂₁, E₄₀, F₅₀}: on e to Box3; goto S₁₂ on E; goto S₂₂ on F.
Box2 = {S₃₁, S₆₁, S₇₁, E₄₀, F₅₀}: on e to Box4; goto S₆₂ on E; goto S₃₂ on F.
Box3 = {E₄₁, F₅₁}: on e to Box5.
Box4 = {S₇₂, E₄₁, F₅₁}: on e to Box6.
Box5 = {E₄ (c), F₅ (d)}: reduce rule 4 on c, rule 5 on d.
Box6 = {S₇₃, E₄ (d), F₅ (c)}: on a to S₇ (reduce rule 7); reduce rule 4 on d, rule 5 on c.
S₁₂: on c to S₁. S₂₂: on d to S₂. S₃₂: on c to S₃. S₆₂: on d to S₆.

Perhaps making both changes at the same time makes this a bad example to show off LR(1) automaton size… If you imagine the automaton without rule 7, you'll see that boxes 3 and 4 are the same except for their incoming and outgoing edges. This is what happens with longer rules: the intermediate and final states of the rules have to be distinguished for each different look-ahead (boxes 5 and 6 here).

The other notable difference is that we now have a box 6 that can both shift and reduce. This will make the code for the recursive ascent more interesting. Let’s start with the basics:

use std::env;
use std::iter::Peekable;

type Iter<'a> = Peekable<std::slice::Iter<'a, Terminal>>;

type Terminal = char;

#[derive(Clone, Copy, Debug, Eq, Hash, PartialEq)]
enum Sort {
    S,
    E,
    F,
}

/// Box0, the starting state of our automaton.
/// Itemset:
/// S = . a E c
/// S = . a F d
/// S = . b F c
/// S = . b E d
/// S = . b e e a
fn box0(input: &mut Iter) -> Result<(), String> {
    match input.next() {
        None => Err(String::from("Unexpected end of input: S = .")),
        Some('a') => match box1(input)? {
            Sort::S => Ok(()),
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some('b') => match box2(input)? {
            Sort::S => Ok(()),
            s => unreachable!("Unexpected sort: S = b {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = . {c}")),
    }
}

This should look familiar from the recursive descent parser code. The notable difference is that the function names are now less connected to the grammar and more to the LR automaton. This makes it harder to understand the code, stack traces, etc.

/// Box1
/// Itemset:
/// S = a . E c
/// S = a . F d
/// E = . e e
/// F = . e e
fn box1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a . E; S = a . F",
        )),
        Some('e') => match box3(input)? {
            Sort::E => s12(input),
            Sort::F => s22(input),
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = a . {c}")),
    }
}

/// S = a E . c
fn s12(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a E . c",
        )),
        Some('c') => s1(input),
        Some(c) => Err(format!("Unexpected input: S = a E . {c}")),
    }
}

/// S = a E c .
fn s1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = a E c . {c}")),
    }
}

/// S = a F . d
fn s22(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a F . d",
        )),
        Some('d') => s2(input),
        Some(c) => Err(format!("Unexpected input: S = a F . {c}")),
    }
}

/// S = a F d .
fn s2(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = a F d . {c}")),
    }
}

/// Box3
/// Itemset:
/// E = e . e (c)
/// F = e . e (d)
fn box3(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => box5(input),
        Some(c) => Err(format!("Unexpected input: E = e . {c} ; F = e . {c}")),
    }
}

/// Box5
/// Itemset:
/// E = e e . (c)
/// F = e e . (d)
fn box5(input: &mut Iter) -> Result<Sort, String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok(Sort::E),
        Some('d') => Ok(Sort::F),
        Some(c) => Err(format!("Unexpected input: E = e e . {c} ; F = e e . {c}")),
    }
}

This bit of code should give you an idea of the code pattern in the “easy case”. Each state either shifts the next terminal of the one or more rules it's in (e.g. s12, box3), shifts into new rules and expects a sort back to use for the goto (e.g. box1), or reduces (e.g. s1, box5).

/// Box2
/// Itemset:
/// S = b . F c
/// S = b . E d
/// S = b . e e a
/// E = . e e
/// F = . e e
fn box2(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => match box4(input)? {
            (0, Sort::F) => s32(input),
            (0, Sort::E) => s62(input),
            (1, Sort::S) => Ok(Sort::S),
            s => unreachable!("Unexpected return/sort: S = b {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = b . {c}")),
    }
}

This is the point where things start looking different. In box 2 we might shift e because we've entered rule 4 or 5, which will reduce to E or F. But we could also be in rule 7. If the result from box 4 is that we were in rule 7, we need to return further, to our own caller. So the box4 function returns a pair of the number of returns left to do and the sort we're reducing to. This way we can distinguish the two cases and take the appropriate action.
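To make the counting concrete, here's the chain for input beea (rule 7, with a body of 4 symbols), as it falls out of the code below:

box0 calls box2 (shift b), box2 calls box4 (shift e), box4 calls box6 (shift e), box6 calls s7 (shift a).
s7 computes (4, S) and decrements before returning, so box6 receives (3, S).
box6 decrements, so box4 receives (2, S).
box4 decrements, so box2 receives (1, S).
box2 sees (1, Sort::S): that last return is its own, so it hands Sort::S to box0, which accepts.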

If you want to keep a recursive ascent code generator simple you can of course always return such a pair. You could also generate the code in continuation passing style, where you pass in a function that takes the sort and does the goto action, instead of receiving a pair as a result. But the Rust compiler is not very good at tail call optimisation, so I'm not using that pattern here.
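For the curious, here's a minimal sketch of that continuation passing style for the a branch only (rules 4 and 5 entered from box 1); the _cps functions are made up for this sketch and reuse s12 and s22 from above:

/// Box1 in CPS: the closure is the goto action for whatever box3/box5 reduce.
fn box1_cps(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = a . E; S = a . F",
        )),
        Some('e') => box3_cps(input, |input: &mut Iter, sort| match sort {
            Sort::E => s12(input), // goto on E: continue with S = a E . c
            Sort::F => s22(input), // goto on F: continue with S = a F . d
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        }),
        Some(c) => Err(format!("Unexpected input: S = a . {c}")),
    }
}

/// E = e . e; F = e . e: thread the goto through the second shift.
fn box3_cps<G>(input: &mut Iter, goto: G) -> Result<Sort, String>
where
    G: FnOnce(&mut Iter, Sort) -> Result<Sort, String>,
{
    match input.next() {
        None => Err(String::from("Unexpected end of input while parsing E or F.")),
        Some('e') => box5_cps(input, goto),
        Some(c) => Err(format!("Unexpected input: E = e . {c} ; F = e . {c}")),
    }
}

/// E = e e . (c); F = e e . (d): reduce by calling the goto directly,
/// instead of returning a (count, sort) pair up the stack.
fn box5_cps<G>(input: &mut Iter, goto: G) -> Result<Sort, String>
where
    G: FnOnce(&mut Iter, Sort) -> Result<Sort, String>,
{
    match input.peek() {
        None => Err(String::from("Unexpected end of input while parsing E or F.")),
        Some('c') => goto(input, Sort::E),
        Some('d') => goto(input, Sort::F),
        Some(c) => Err(format!("Unexpected input: E = e e . {c} ; F = e e . {c}")),
    }
}

This works here because both rules entered from box 1 have a body of the same length, so a single goto closure can be threaded down through exactly the right number of shifts.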

/// S = b F . c
fn s32(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = b F . c",
        )),
        Some('c') => s3(input),
        Some(c) => Err(format!("Unexpected input: S = b F . {c}")),
    }
}

/// S = b F c .
fn s3(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = b F c . {c}")),
    }
}

/// S = b E . d
fn s62(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing S = b E . d",
        )),
        Some('d') => s6(input),
        Some(c) => Err(format!("Unexpected input: S = b E . {c}")),
    }
}

/// S = b E d .
fn s6(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Ok(Sort::S),
        Some(c) => Err(format!("Unexpected input: S = b E d . {c}")),
    }
}

/// Box4
/// Itemset:
/// S = b e . e a
/// E = e . e (d)
/// F = e . e (c)
fn box4(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => box6(input).map(decr),
        Some(c) => Err(format!(
            "Unexpected input: S = b e . {c}; E = e . {c} ; F = e . {c}"
        )),
    }
}

/// helper
fn decr((c, s): (usize, Sort)) -> (usize, Sort) {
    (c - 1, s)
}

Note how in box4 we call the decrement helper on the result of box6, to account for the one return we're about to do ourselves.

/// Box6
/// Itemset:
/// S = b e e . a
/// E = e e . (d)
/// F = e e . (c)
fn box6(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok((2, Sort::F)).map(decr),
        Some('d') => Ok((2, Sort::E)).map(decr),
        Some('a') => {
            input.next();
            s7(input).map(decr)
        }
        Some(c) => Err(format!(
            "Unexpected input: S = b e e . {c} ; E = e e . {c} ; F = e e . {c}"
        )),
    }
}

The number of returns to do is equal to the size of the body of the rule we are reducing. Of course we decrement right away, because we are about to do the first of those returns ourselves, hence the map(decr).

/// S = b e e a .
fn s7(_input: &mut Iter) -> Result<(usize, Sort), String> {
    Ok((4, Sort::S)).map(decr)
}

fn lex(input: String) -> Vec<Terminal> {
    input.chars().collect()
}

pub fn main() -> Result<(), String> {
    // args() yields the program name first, so take the first real argument
    let input = env::args().nth(1).expect("Argument string to parse");
    let input = lex(input);
    let mut input = input.iter().peekable();
    box0(&mut input)
}

In our main function we can call box0 with the input. Since this is LR(1), we only need a peekable iterator that can look ahead one terminal.
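If you want to convince yourself the counting works out, a quick sanity check along these lines (a sketch, assuming the functions above live in the same module) exercises every rule:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn accepts_the_grammar() {
        // aEc, aFd, bFc, bEd, beea, with E = e e and F = e e
        for ok in ["aeec", "aeed", "beec", "beed", "beea"] {
            let toks = lex(ok.to_string());
            assert_eq!(box0(&mut toks.iter().peekable()), Ok(()));
        }
    }

    #[test]
    fn rejects_other_strings() {
        for bad in ["", "a", "aeeca", "beeb", "ee"] {
            let toks = lex(bad.to_string());
            assert!(box0(&mut toks.iter().peekable()).is_err());
        }
    }
}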

Table size = Code size

With both recursive descent and recursive ascent parsing, we're representing the parsing logic directly in code, not as an explicit data representation of a parse table. As such, a larger parse table means more code. In LR, when LALR doesn't suffice, parse tables can grow quite large, as we saw to a limited extent with the last example.

Recursive Ascent-Descent Parsing

Have you noticed that in the recursive ascent code there are some pretty boring, tiny functions? I'm talking about s12, s1, s22, s2, s32, s3, s62, s6. These will likely be targeted by the inliner of the Rust compiler⁶, but aren't they a bit much to even generate?

The common denominator of these functions, and of the states of the LR automaton they correspond to, is that they have already honed in on a single rule from the grammar and are only parsing that. Kind of like in an LL parser, except we used the LR automaton mechanism to select the rule instead of an LL look-ahead. If we follow that idea to its logical conclusion, we can switch to LL parsing from any point where we know there's only one rule left (or equivalently, inline those simple functions). This means we only have box functions left:

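/// Box1, with the simple functions inlined.
/// Itemset:
/// S = a . E c
/// S = a . F d
/// E = . e e
/// F = . e e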
fn box1(input: &mut Iter) -> Result<Sort, String> {
    match input.next() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('e') => match box3(input)? {
            Sort::E => {
                consume(input, 'c')?;
                Ok(Sort::S)
            }
            Sort::F => {
                consume(input, 'd')?;
                Ok(Sort::S)
            }
            s => unreachable!("Unexpected sort: S = a {s:?}"),
        },
        Some(c) => Err(format!("Unexpected input: S = a . {c}")),
    }
}

This is using the consume function from the recursive descent parser example from before.
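For reference, that helper is roughly this shape (reproduced here as a sketch; the exact messages in the earlier section may differ):

/// Consume a single expected terminal.
fn consume(input: &mut Iter, expected: Terminal) -> Result<(), String> {
    match input.next() {
        Some(&c) if c == expected => Ok(()),
        Some(c) => Err(format!("Unexpected input: {c}, expected: {expected}")),
        None => Err(format!("Unexpected end of input, expected: {expected}")),
    }
}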

/// Box6
/// Itemset:
/// S = b e e . a
/// E = e e . (d)
/// F = e e . (c)
fn box6(input: &mut Iter) -> Result<(usize, Sort), String> {
    match input.peek() {
        None => Err(String::from(
            "Unexpected end of input while parsing E or F.",
        )),
        Some('c') => Ok((2, Sort::F)).map(decr),
        Some('d') => Ok((2, Sort::E)).map(decr),
        Some('a') => {
            input.next(); // consume 'a'
            Ok((3, Sort::S)).map(decr)
        }
        Some(c) => Err(format!(
            "Unexpected input: S = b e e . {c} ; E = e e . {c} ; F = e e . {c}"
        )),
    }
}

Note that in box 6 we now count the number of symbols in the body of the rule before the dot to come up with the number of returns.

Left Corners?

The left corner of a rule in the grammar is the left-most symbol in the body of the rule, plus, transitively, the left corners of that symbol's rules if it's a sort. So it's basically a First set with the sorts included. I found this in some of the older literature, and figured I'd add a note for myself in here.
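As a quick sketch of that definition (a hypothetical helper, nothing from the parsers above; sorts are uppercase chars, terminals lowercase, computed as a fixpoint):

use std::collections::{HashMap, HashSet};

/// Left-corner sets: the left-most symbol of each rule body, plus,
/// transitively, the left corners of that symbol when it is a sort.
fn left_corners(rules: &[(char, Vec<char>)]) -> HashMap<char, HashSet<char>> {
    let mut lc: HashMap<char, HashSet<char>> = HashMap::new();
    let mut changed = true;
    while changed {
        changed = false;
        for (lhs, body) in rules {
            if let Some(&first) = body.first() {
                // The left-most symbol itself is a left corner...
                changed |= lc.entry(*lhs).or_default().insert(first);
                // ...and so is anything in the left corners of that sort.
                if first.is_ascii_uppercase() {
                    let inner: Vec<char> = lc
                        .get(&first)
                        .map(|s| s.iter().copied().collect())
                        .unwrap_or_default();
                    for sym in inner {
                        changed |= lc.entry(*lhs).or_default().insert(sym);
                    }
                }
            }
        }
    }
    lc
}

For the grammar above this gives S: {a, b}, E: {e}, F: {e}.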

There is/was such a thing as left-corner parsing, seemingly mostly used in natural language processing (NLP). NLP mostly uses ambiguous context-free grammars, and typically uses (used?) a backtracking parser to deal with that. These can be slow of course. And it turns out left corners helped with this, by adding some “filtering” that allows the parser to backtrack less. This is connected to recursive ascent-descent parsing, which you could also see as filtering with LR to finish parsing with LL. In our case we just don’t do backtracking.

I really need to stop working on this blog post and publish it already. It's been over a year since I started working on it (on and off, during holidays when I had enough focus time)⁷. I already had an idea of where to go next (generalised parsers), but now I also want to study minimal LR(1) automaton/parse table algorithms, and look at continuation passing style again, because I think you can pass the left context as a function argument. This would give you an LALR automaton structure with LR parsing power. Is that a good idea? I don't know, it needs testing (or reading papers/blog posts; probably someone else has tried this before).

I usually have a pithy remark or sneak the Kaomoji into the footnotes, but I must be out of practice, because I can’t think of a good way to do that…

Ehh, whatever ¯\_(ツ)_/¯

Footnotes

  1. I’m fairly sure my prose description there is the same as a formal definition, and it feel a bit nicer to think about than the ones you can find on Wikipedia

  2. Technically you’d need to see A1 A₁ A1​ and A2 A₂ A2​ as separate symbols and duplicate the rules for A A A, resulting in a larger grammar in correspondence with the larger table. But I couldn’t be bothered, and the parse table as shown works just as well. This is relevant to the code size of a recursive descent parser too, since you can just reuse the code for rules 2 and 3 instead of having duplicate code for the two extra rules. What’s a recursive descent parser? That comes just a little later in the post, so keep reading ;) 

  3. Yes, there are non-deterministic context-free languages. Those are the context-free languages that can only be parsed with a non-deterministic PDA. Since this post is about deterministic parsers, we’ll ignore the non-deterministic languages. 

  4. While I find the Wikipedia article on LLR confusing, and it makes a good case for why it’s not really used, I’m still somewhat intrigued. This is one of those things that will stay on my reading list for a while I think, something I still want to understand further… 

  5. Indirect left recursion is even worse in LL. At least the direct version can still be dealt with by an automatic grammar rewrite algorithm. That’s more or less what the node-reparenting trick mentioned at the end of the LL section does. Similarly, there are automatic grammar rewrites for direct right-recursion for LR, and indirect right recursion can be more problematic… 

  6. Actually, I checked in Compiler Explorer how this turns out, and while s7 is inlined and compiled away entirely, adapting box1 to consume directly will make the assembly at opt-level=3 smaller. Adding an #[inline] hint on consume helps as well. Though I may just be seeing the effect of uniform error messages through consume. Actually following and understanding the optimised assembly code is a pain, so I just clicked around a bit to verify that the example code is reduced to a state machine with jumps and labels instead of using function call instructions. So that’s neat, basically what I was hoping for :) 

  7. I hope you appreciate how much time it took to find example grammars to steal (or occasionally develop myself) and especially how much time it took to get GraphViz to output somewhat decent automata of those examples! 

