4

Creating Your Own Grammar Rules

 2 years ago
source link: https://hackernoon.com/creating-your-own-grammar-rules
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Anatolii Kabanov

Developing and enjoying life.

Whenever you start wondering how to create your syntax or rules on expression, it’s probably time to check out this powerful tool you may be interested in. The Chevrotain allows you to build your own grammar rules, validate, and parse the result at the end.

What does this library structurally consist of?

First of all, it’s Lexer. Lexer is the class that basically goes through input text and match token to each symbol or sequence of symbols (words), it found. To create an instance of the Lexer class, I need to define all tokens that I want to use. The token should have a unique name, the pattern - means regex.

For example token ‘hackernoon‘ will match the exact word in the text.

const hackernoon = createToken({
  name: 'hackernoon',
  pattern: /hackernoon/
})

Also, there is a possibility to categorize tokens. For this purpose, I need to define one more token with a reserved pattern value Lexer.NA. In this case, Lexer will not use it to match some characters. And in other tokens that I want to categorize, I add my new token to a property ‘categories’. Like in the example below, I introduce MathOperator and add this token to MathPlus and MathMinus.

export const MathOperator = createToken({
    name: Token.MathOperator,
    pattern: Lexer.NA,
});

export const MathPlus = createToken({
    name: Token.MathPlus,
    pattern: /\+/,
    categories: MathOperator,
});

export const MathMinus = createToken({
    name: Token.MathMinus,
    pattern: /-/,
    categories: MathOperator,
});

export const Number = createToken({ 
    name: Token.Number, 
    pattern: /[+-]?\d*\.?\d+(?:[Ee][+-]?\d+)?/
});

export const Whitespace = createToken({
    name: Token.WhiteSpace,
    pattern: /\s+/,
    group: Lexer.SKIPPED,
});

When all tokens are defined, I need to create a new instance of the Lexer class and put there an array of these tokens.

const allTokens = [
  WhiteSpace,
  Number,
  MathPlus,
  MathMinus
]

const mathLexer = new Lexer(allTokens)

And from this moment the Lexer is ready to tokenize input strings. I need to invoke the function ‘tokenize’ of the Lexer, and as a result, I will receive an object with tokens. With them, I can go to the next step.

const lexingResult = mathLexer.tokenize(text);

lexingResult is having next interface ILexingResult. Where errors array can be very helpful to understand for the user what is wrong in the expression because it contains coordinates to the problem: line and column.

export interface ILexingResult {
  tokens: IToken[]
  groups: {
    [groupName: string]: IToken[]
  }
  errors: ILexingError[]
}

The second important thing is Parser. It is a tool to describe the rules of your expression. For example, in math expression, you are expecting that the first variable will be a Number the second should be some operator +/- and the third must be a Number as well. So rule will describe such behavior. First things first I create my own class the MathParser inherited from CstParser. That will accept tokens list in the constructor, and the constructor will contain all required rules and after them the this.performSelfAnalysis() to check all written rules.

class MathParser extends CstParser {
    public mathCond: RuleFn;
    private valueOrParen: RuleFn;
    private parentheses: RuleFn;
    private value: RuleFn;
    private mathOp: RuleFn;

    constructor(tokens: TokenType[]) {
      super(tokens);
      const $ = this;

      // Rules here
      
      this.performSelfAnalysis(); // Must be after all rules
    }
}

The mathCond rule is the main entrance rule that will be invoked to perform the parsing. Basically, in my example, I assume that expression should be starting from some number or from parentheses. You are wondering how to read this rule? It actually tells if at least one number 123 or parentheses with a number (123) or parentheses with some operation (123 + 456) (the operation described in MANY keyword), then the expression is valid.

Of course, as you can see there could be more parentheses and math operations inside one parentheses, in these terms - recursion. And MANY says that expression can repeat zero or more times. As like the example with (123 + 456) - it repeats one time if it will be like (123 + 456 - 789) then 2 repeats. (123 + 456) - 789 this can be read: 1 mathCond repeat in parentheses and one after them.

this.mathCond = $.RULE('mathCond', () => {
    $.SUBRULE($.valueOrParen, { LABEL: 'lhs' });
    $.MANY(() => {
        $.SUBRULE1($.mathOp)
        $.SUBRULE2($.valueOrParen, { LABEL: 'rhs' });
    });
});

The rule valueOrParen is expecting some Number or other parentheses with math expression.

this.valueOrParen = $.RULE('valueOrParen', () => {
    $.OR([
        { ALT: () => $.SUBRULE($.value) },
        { ALT: () => $.SUBRULE($.parentheses) },
    ]);
});

In the next rule parentheses, I want to cover the case with braces () and inside braces, recursion happens. LeftParen and RightParen are two more tokens. mathCond is the main rule function.

this.parentheses = $.RULE('parentheses', () => {
    $.CONSUME(LeftParen);
    $.SUBRULE($.mathCond);
    $.CONSUME(RightParen);
});

mathOp is the rule which looks for + OR -.

this.mathOp = $.RULE('mathOp', () => {
    $.OR([
        { ALT: () => $.CONSUME(MathPlus) },
        { ALT: () => $.CONSUME(MathMinus) },
    ]);
});

The value rule is created just to show that it can be extended with other specific types.

this.value = $.RULE('value', () => {
    $.OR([
        { ALT: () => $.CONSUME(Number) }
    ]);
});

Once the parser is ready, create it and call the main function.

const mathParser = new MathParser(allTokens);
mathParser.mathCond();

As for Lexer, there are errors that help users, here as well, after calling the main function parser may contain errors with Token where the error happened. mathParser.errors

So this library helps you to verify the string that the user typed and actually evaluate the expression. And even more, you can try a custom Visitor to create your own expression tree.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK