Comments · arzg’s website

Part Fourteen: Comments

Posted in Make A Language
10 December 2020

The first thing we need to do is teach the lexer to recognise comments. We’ll begin with a test:

// lexer.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_comment() {
        check("# foo", SyntaxKind::Comment);
    }
}

Here’s the implementation:

pub(crate) enum SyntaxKind {
    // snip

    #[regex("#.*")]
    Comment,

    #[error]
    Error,

    Root,
    BinaryExpr,
    PrefixExpr,
}

Take note of how we aren’t using the logos::skip callback here; instead, we are explicitly including comments in the output of our lexer. We do this to ensure that the syntax tree the parser produces contains the entire input text, which makes the parser lossless. This makes it easier to implement tools that interact with the source text (a good example is automatic refactorings in an IDE).
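To make the idea of losslessness concrete, here’s a minimal sketch that doesn’t use Logos at all. The Kind enum and the lex function are made up for this illustration and are not part of our crate; the hand-written lexer only mimics what our real one does. The point is that when comments and whitespace are kept as lexemes, gluing the lexemes’ text back together gives us the original input:

```rust
// Hypothetical, simplified token kinds; the real SyntaxKind has many more.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Comment,
    Other,
}

// A hand-rolled stand-in for the Logos-generated lexer. It splits the input
// into whitespace runs, `#` comments (up to the end of the line) and
// everything else, without ever discarding a character.
fn lex(input: &str) -> Vec<(Kind, &str)> {
    let mut lexemes = Vec::new();
    let mut rest = input;

    while let Some(first) = rest.chars().next() {
        let (kind, len) = if first == ' ' || first == '\n' {
            let len = rest
                .find(|c: char| c != ' ' && c != '\n')
                .unwrap_or(rest.len());
            (Kind::Whitespace, len)
        } else if first == '#' {
            let len = rest.find('\n').unwrap_or(rest.len());
            (Kind::Comment, len)
        } else {
            let len = rest
                .find(|c: char| c == ' ' || c == '\n' || c == '#')
                .unwrap_or(rest.len());
            (Kind::Other, len)
        };

        let (text, tail) = rest.split_at(len);
        lexemes.push((kind, text));
        rest = tail;
    }

    lexemes
}

fn main() {
    let input = "1 + 1 # Add one";
    let lexemes = lex(input);

    // The comment is kept as a lexeme rather than being skipped.
    assert!(lexemes.contains(&(Kind::Comment, "# Add one")));

    // Concatenating every lexeme's text gives back the original input.
    let rebuilt: String = lexemes.iter().map(|(_, text)| *text).collect();
    assert_eq!(rebuilt, input);
}
```

If the lexer skipped comments instead, the rebuilt string would be missing "# Add one", and any tool working from the lexer’s output could no longer reproduce the file it was given.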

Just like with whitespace, it would be nice if we didn’t have to handle comments manually in the parser. We could add extra checks for comments to our existing eat_whitespace methods on Sink and Source, but that’s annoying. What if we have other token kinds that we want to skip automatically in future?

There’s a name for this kind of irrelevant token: trivia. As far as I can tell, the term comes from Roslyn. Let’s add an is_trivia method to SyntaxKind to abstract away this behaviour:

impl SyntaxKind {
    pub(crate) fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

Note how the method takes self by value; this is because passing SyntaxKind by value is cheaper than passing it by reference: SyntaxKind is one byte in size, whereas a reference is eight bytes on 64-bit systems. Also note that calling is_trivia won’t consume the instance of SyntaxKind, since SyntaxKind is Copy.
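If you’d like to check the size claim yourself, here’s a small sketch using std::mem::size_of. The three-variant SyntaxKind below is a cut-down stand-in for the real enum, and the one-byte result is what rustc does in practice for small fieldless enums rather than something the default representation formally promises:

```rust
use std::mem::size_of;

// A hypothetical, cut-down SyntaxKind: a fieldless enum with few variants.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
}

impl SyntaxKind {
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

fn main() {
    // A small fieldless enum fits in a single byte...
    assert_eq!(size_of::<SyntaxKind>(), 1);
    // ...whereas a reference to it is pointer-sized.
    assert_eq!(size_of::<&SyntaxKind>(), size_of::<usize>());

    let kind = SyntaxKind::Comment;
    assert!(kind.is_trivia());

    // `kind` is still usable: the call above copied it instead of moving it.
    assert_eq!(kind, SyntaxKind::Comment);

    assert!(SyntaxKind::Whitespace.is_trivia());
    assert!(!SyntaxKind::Number.is_trivia());
}
```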

Now that we have a way to ask a SyntaxKind if it is trivia, we can use this method in Sink and Source:

// source.rs

impl<'l, 'input> Source<'l, 'input> {
    // snip

    pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
        self.eat_trivia();

        let lexeme = self.lexemes.get(self.cursor)?;
        self.cursor += 1;

        Some(lexeme)
    }

    pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.peek_kind_raw()
    }

    fn eat_trivia(&mut self) {
        while self.at_trivia() {
            self.cursor += 1;
        }
    }

    fn at_trivia(&self) -> bool {
        self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia)
    }

    // snip
}

// sink.rs

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for event in reordered_events {
            match event {
                // snip
            }

            self.eat_trivia();
        }

        // snip
    }

    fn eat_trivia(&mut self) {
        while let Some(lexeme) = self.lexemes.get(self.cursor) {
            if !lexeme.kind.is_trivia() {
                break;
            }

            self.token(lexeme.kind, lexeme.text.into());
        }
    }

    // snip
}
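To see the skipping behaviour in isolation, here’s a stripped-down sketch of Source. It is an assumption-laden simplification: it stores bare SyntaxKinds in a Vec instead of borrowing lexemes, the enum has only four variants, and next_kind is a made-up counterpart to next_lexeme. The cursor logic, though, follows the same shape as the code above:

```rust
// Hypothetical, cut-down SyntaxKind with just enough variants for the demo.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
    Plus,
}

impl SyntaxKind {
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

// Simplified stand-in for Source: it holds kinds rather than lexemes.
struct Source {
    kinds: Vec<SyntaxKind>,
    cursor: usize,
}

impl Source {
    // Skips any trivia, then reports the next kind without consuming it.
    fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.peek_kind_raw()
    }

    // Skips any trivia, then consumes and returns the next kind.
    fn next_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        let kind = self.peek_kind_raw()?;
        self.cursor += 1;
        Some(kind)
    }

    fn eat_trivia(&mut self) {
        while self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia) {
            self.cursor += 1;
        }
    }

    fn peek_kind_raw(&self) -> Option<SyntaxKind> {
        self.kinds.get(self.cursor).copied()
    }
}

fn main() {
    use SyntaxKind::*;

    // The kinds a lexer would produce for the input `1 # one\n+ 1`.
    let mut source = Source {
        kinds: vec![Number, Whitespace, Comment, Whitespace, Plus, Whitespace, Number],
        cursor: 0,
    };

    assert_eq!(source.peek_kind(), Some(Number));

    let mut seen = Vec::new();
    while let Some(kind) = source.next_kind() {
        seen.push(kind);
    }

    // The parser only ever sees the non-trivia kinds.
    assert_eq!(seen, vec![Number, Plus, Number]);
}
```

Because the check goes through is_trivia, adding another trivia kind later only means extending that one matches! expression.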

Let’s write a test to find out if what we’ve made works:

// parser.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_comment() {
        check(
            "# hello!",
            expect![[r##"
Root@0..8
  Comment@0..8 "# hello!""##]],
        );
    }
}

The extra # in the raw string literal stops Rust from treating the "# in Comment@0..8 "# hello!" as the end of the string literal.
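Here’s a tiny standalone sketch of that rule. The snapshot_line function is made up for this illustration; it just returns a line shaped like the one in our expected output:

```rust
// Hypothetical helper returning one line of a snapshot.
fn snapshot_line() -> &'static str {
    // With a single `#`, the literal would end at the first `"#`, which is
    // right before `# hello!`, and the rest of the line would not compile.
    // With two, only `"##` can terminate the literal.
    r##"Comment@0..8 "# hello!""##
}

fn main() {
    // The raw literal is the same string as this escaped, non-raw one.
    assert_eq!(snapshot_line(), "Comment@0..8 \"# hello!\"");
}
```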

$ cargo t -q
running 34 tests
..................................
test result: ok. 34 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s try parsing a binary expression interspersed with comments:

// expr.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_binary_expression_interspersed_with_comments() {
        check(
            "
1
  + 1 # Add one
  + 10 # Add ten",
            expect![[r##"
Root@0..35
  Whitespace@0..1 "\n"
  BinaryExpr@1..35
    BinaryExpr@1..21
      Number@1..2 "1"
      Whitespace@2..5 "\n  "
      Plus@5..6 "+"
      Whitespace@6..7 " "
      Number@7..8 "1"
      Whitespace@8..9 " "
      Comment@9..18 "# Add one"
      Whitespace@18..21 "\n  "
    Plus@21..22 "+"
    Whitespace@22..23 " "
    Number@23..25 "10"
    Whitespace@25..26 " "
    Comment@26..35 "# Add ten""##]],
        );
    }

    // snip
}

The test fails, since the lexer doesn’t recognise newlines yet. Let’s write a lexer test for this and extend the Whitespace regex so that it matches newlines too:

// lexer.rs

#[cfg(test)]
mod tests {
    use super::*;

    fn check(input: &str, kind: SyntaxKind) {
        // snip
    }

    #[test]
    fn lex_spaces_and_newlines() {
        check("  \n ", SyntaxKind::Whitespace);
    }

    // snip
}

pub(crate) enum SyntaxKind {
    #[regex("[ \n]+")]
    Whitespace,

    // snip
}

All our tests pass now:

$ cargo t -q
running 35 tests
...................................
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

In the next part we’ll introduce another new concept to our parser: markers.
