Part Fourteen: Comments

 3 years ago
source link: https://arzg.github.io/lang/14/
Comments · arzg’s website

The first thing we need to do is teach the lexer to recognise comments. We’ll begin with a test:

// lexer.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn lex_comment() {
        check("# foo", SyntaxKind::Comment);
    }
}

Here’s the implementation:

pub(crate) enum SyntaxKind {
    // snip

    #[regex("#.*")]
    Comment,

    // snip
}

Take note of how we aren’t using #[logos::skip] here; instead, we are explicitly including comments in the output of our lexer. We do this so that the syntax tree the parser produces contains the entire input text, which makes the parser lossless. A lossless parser makes it easier to implement tools that interact with the source text (a good example is automatic refactorings in an IDE).
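To make "lossless" concrete, here is a minimal sketch that does not use Logos at all. The Kind enum and the hand-rolled lex and run functions are hypothetical stand-ins for our real lexer, written only for this illustration. The point is that no token is thrown away, so gluing the token texts back together gives the original input:

```rust
// Hypothetical stand-in for the Logos-generated lexer; not the article's real code.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind {
    Whitespace,
    Comment,
    Number,
    Plus,
    Error,
}

/// Counts how many leading bytes satisfy `pred`.
fn run(bytes: &[u8], pred: impl Fn(u8) -> bool) -> usize {
    bytes.iter().take_while(|&&b| pred(b)).count()
}

/// Splits `input` into (kind, text) pairs without discarding anything,
/// so comments and whitespace stay in the output.
fn lex(input: &str) -> Vec<(Kind, &str)> {
    let mut tokens = Vec::new();
    let mut start = 0;

    while start < input.len() {
        let rest = &input[start..];
        let bytes = rest.as_bytes();

        let (kind, len) = match bytes[0] {
            b' ' | b'\n' => (Kind::Whitespace, run(bytes, |b| b == b' ' || b == b'\n')),
            // A comment runs from `#` up to, but not including, the newline.
            b'#' => (Kind::Comment, run(bytes, |b| b != b'\n')),
            b'0'..=b'9' => (Kind::Number, run(bytes, |b| b.is_ascii_digit())),
            b'+' => (Kind::Plus, 1),
            // Anything else becomes a one-character error token.
            _ => (Kind::Error, rest.chars().next().map_or(1, char::len_utf8)),
        };

        tokens.push((kind, &rest[..len]));
        start += len;
    }

    tokens
}
```

Calling lex("1 + 2 # sum\n") and concatenating the text of every token yields "1 + 2 # sum\n" again. With a skipped comment token that round trip would lose "# sum".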

Just like with whitespace, it would be nice if we didn’t have to manually handle comments in the parser. We could add extra checks for comments to our existing eat_whitespace methods on Sink and Source, but that’s annoying. What if we have other token kinds that we want to skip automatically in future?

There’s a name for this kind of irrelevant token: trivia. As far as I can tell, the term comes from Roslyn. Let’s add an is_trivia method to SyntaxKind to abstract away this behaviour:

impl SyntaxKind {
    pub(crate) fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

Note how the method takes self; this is because it’s more efficient to pass SyntaxKind by value instead of by reference, since the size of SyntaxKind is one byte, which is less than the size of a reference (eight bytes on 64-bit systems). Also note that is_trivia won’t consume the instance of SyntaxKind, since SyntaxKind is Copy.
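If you’d like to check these claims yourself, something like the following sketch should do. The three-variant SyntaxKind here is a cut-down stand-in for the real enum, and sizes and describe are hypothetical helpers that exist only for this demonstration. The one-byte size is what rustc gives a small fieldless enum in practice, not something the language guarantees for the default representation:

```rust
use std::mem::size_of;

// Cut-down stand-in for the real SyntaxKind; only the variants needed here.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
}

impl SyntaxKind {
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

/// Returns (size of the enum, size of a reference to it) in bytes.
fn sizes() -> (usize, usize) {
    (size_of::<SyntaxKind>(), size_of::<&SyntaxKind>())
}

/// Calls `is_trivia` and then uses the same value again.
/// This only compiles because SyntaxKind is Copy.
fn describe(kind: SyntaxKind) -> String {
    let trivia = kind.is_trivia(); // `kind` is copied here, not moved
    format!("{:?}: trivia = {}", kind, trivia)
}
```

On a 64-bit target sizes() should come out as (1, 8). Removing Copy from the derive list makes describe fail to compile, because kind would be moved into is_trivia.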

Now that we have a way to ask a SyntaxKind if it is trivia, we can use this method in Sink and Source:

// source.rs

impl<'l, 'input> Source<'l, 'input> {
    // snip

    pub(super) fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
        self.eat_trivia();

        let lexeme = self.lexemes.get(self.cursor)?;
        self.cursor += 1;

        Some(lexeme)
    }

    pub(super) fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.peek_kind_raw()
    }

    fn eat_trivia(&mut self) {
        while self.at_trivia() {
            self.cursor += 1;
        }
    }

    fn at_trivia(&self) -> bool {
        self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia)
    }

    // snip
}

// sink.rs

impl<'l, 'input> Sink<'l, 'input> {
    // snip

    pub(super) fn finish(mut self) -> GreenNode {
        // snip

        for event in reordered_events {
            match event {
                // snip
            }

            self.eat_trivia();
        }

        // snip
    }

    fn eat_trivia(&mut self) {
        while let Some(lexeme) = self.lexemes.get(self.cursor) {
            if !lexeme.kind.is_trivia() {
                break;
            }

            self.token(lexeme.kind, lexeme.text.into());
        }
    }

    // snip
}
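To see the skipping in isolation, here is a standalone sketch of the Source side. It mirrors the methods above, but the cut-down SyntaxKind, the Lexeme fields, Source::new, the body of peek_kind_raw and the significant helper are all assumptions made so that the example compiles on its own:

```rust
// Cut-down stand-ins; the real types live in the lexer and parser modules.
#[derive(Debug, Clone, Copy, PartialEq)]
enum SyntaxKind {
    Whitespace,
    Comment,
    Number,
    Plus,
}

impl SyntaxKind {
    fn is_trivia(self) -> bool {
        matches!(self, Self::Whitespace | Self::Comment)
    }
}

struct Lexeme<'input> {
    kind: SyntaxKind,
    text: &'input str,
}

struct Source<'l, 'input> {
    lexemes: &'l [Lexeme<'input>],
    cursor: usize,
}

impl<'l, 'input> Source<'l, 'input> {
    fn new(lexemes: &'l [Lexeme<'input>]) -> Self {
        Self { lexemes, cursor: 0 }
    }

    /// Returns the next non-trivia lexeme, if there is one.
    fn next_lexeme(&mut self) -> Option<&'l Lexeme<'input>> {
        self.eat_trivia();

        let lexeme = self.lexemes.get(self.cursor)?;
        self.cursor += 1;

        Some(lexeme)
    }

    /// Peeks at the kind of the next non-trivia lexeme.
    fn peek_kind(&mut self) -> Option<SyntaxKind> {
        self.eat_trivia();
        self.peek_kind_raw()
    }

    fn eat_trivia(&mut self) {
        while self.at_trivia() {
            self.cursor += 1;
        }
    }

    fn at_trivia(&self) -> bool {
        self.peek_kind_raw().map_or(false, SyntaxKind::is_trivia)
    }

    // Assumed body: look at the lexeme under the cursor without skipping anything.
    fn peek_kind_raw(&self) -> Option<SyntaxKind> {
        self.lexemes.get(self.cursor).map(|lexeme| lexeme.kind)
    }
}

/// Hypothetical helper: collects every lexeme the parser would actually see.
fn significant<'input>(lexemes: &[Lexeme<'input>]) -> Vec<(SyntaxKind, &'input str)> {
    let mut source = Source::new(lexemes);
    let mut seen = Vec::new();

    while let Some(lexeme) = source.next_lexeme() {
        seen.push((lexeme.kind, lexeme.text));
    }

    seen
}
```

Feeding it the lexemes for " 1 # one\n+" gives back only the Number and the Plus. The parser never has to know that the whitespace and the comment were there, and adding a new kind of trivia later only means touching is_trivia.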

Let’s write a test to find out if what we’ve made works:

// parser.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_comment() {
        check(
            "# hello!",
            expect![[r##"
Root@0..8
  Comment@0..8 "# hello!""##]],
        );
    }
}

The extra # in the raw string literal stops Rust from thinking that the "# in Comment@0..8 "# hello!" is meant to end the string literal.
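Here is a small sketch of that rule on its own. The function names and string contents are made up for the example. A raw string literal ends at the first quote followed by as many hashes as it was opened with, so text containing "# needs two hashes on each side:

```rust
/// One hash on each side is enough when the text never contains `"#`.
fn one_hash() -> &'static str {
    r#"Comment "hello""#
}

/// Two hashes are needed here: with only one, the `"#` in front of ` hello!`
/// would close the literal early and the rest would be a syntax error.
fn two_hashes() -> &'static str {
    r##"Comment "# hello!""##
}
```

one_hash() evaluates to the text Comment "hello", and two_hashes() evaluates to Comment "# hello!", quotes and hash included.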

$ cargo t -q
running 34 tests
test result: ok. 34 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

Let’s try parsing a binary expression interspersed with comments:

// expr.rs

#[cfg(test)]
mod tests {
    // snip

    #[test]
    fn parse_binary_expression_interspersed_with_comments() {
        check(
            "
1
  + 1 # Add one
  + 10 # Add ten",
            expect![[r##"
Root@0..35
  Whitespace@0..1 "\n"
  BinaryExpr@1..35
    BinaryExpr@1..21
      Number@1..2 "1"
      Whitespace@2..5 "\n  "
      Plus@5..6 "+"
      Whitespace@6..7 " "
      Number@7..8 "1"
      Whitespace@8..9 " "
      Comment@9..18 "# Add one"
      Whitespace@18..21 "\n  "
    Plus@21..22 "+"
    Whitespace@22..23 " "
    Number@23..25 "10"
    Whitespace@25..26 " "
    Comment@26..35 "# Add ten""##]],
        );
    }

    // snip
}

The test fails, since we aren’t lexing newlines. Let’s write a test for this:

// lexer.rs

#[cfg(test)]
mod tests {
    use super::*;

    fn check(input: &str, kind: SyntaxKind) {
        // snip
    }

    #[test]
    fn lex_spaces_and_newlines() {
        check("  \n ", SyntaxKind::Whitespace);
    }

    // snip
}

pub(crate) enum SyntaxKind {
    #[regex("[ \n]+")]
    Whitespace,

    // snip
}

All our tests pass now:

$ cargo t -q
running 35 tests
test result: ok. 35 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

In the next part we’ll introduce another new concept to our parser: markers.
