27

Parsing in C#: All the Tools and Libraries You Can Use (Part 2)

 5 years ago
source link: https://www.tuicool.com/articles/hit/6fIFbiq
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Welcome back! If you missed Part 1, you can check it outhere.

Parser Generators

The basic workflow of a parser generator tool is quite simple: you write a grammar that defines the language, or document, and you run the tool to generate a parser usable from your C# code.

The parser might produce the AST, that you may have to traverse yourself or you can traverse with additional ready-to-use classes, such as Listeners or  Visitors . Some tools, instead, offer the chance to embed code inside the grammar to be executed every time the specific rule is matched.

Usually, you need a runtime library and/or program to use the generated parser.

Regular (Lexer)

Tools that analyze regular languages are typically called lexers.

Gardens Point LEX

Gardens Point LEX (GPLEX) is a lexical analyzer (lexer) generator based upon finite state automata. The input language is similar to the original LEX, but it also implement some extensions of FLEX. Obviously, it has better Unicode support. GPLEX can generate a C# lexer from a grammar file,  .lex .

The typical grammar is divided in three parts, separated by '%%':

  1. Options and C# definitions.
  2. Rules with embedded actions.
  3. Usercode.

As you can see in the following example, in standalone programs the usercode section contains the Main function, so a .lex file can generate a complete functioning program. Although this makes it quite messy and hard to read for the untrained reader.

// example from the documentation
%namespace LexScanner

%option noparser nofiles


alpha [a-zA-Z]

%%

foo         |
bar         Console.WriteLine("keyword " + yytext);
{alpha}{3}  Console.WriteLine("TLA " + yytext);
{alpha}+    Console.WriteLine("ident: " + yytext);

%%


    public static void Main(string[] argp) {
        Scanner scnr = new Scanner();
        for (int i = 0; i < argp.Length; i++) {
            Console.WriteLine("Scanning \"" + argp[i] + "\"");
            scnr.SetSource(argp[i], 0);
            scnr.yylex();
        }
    }

It can be used as a standalone tool, but, being a lexer generator, it can also work with parser generators: it is designed to work with Gardens Point Parser Generator, however, it has also been used with COCO/R and custom parsers.

Context Free

Let's see the tools that generate Context Free parsers.

ANTLR

ANTLR is a great parser generator written in Java that can also generate parsers for C# and many other languages. A particularity of the C# target is that there are actually two versions: the  original by sharwell and the  new standard runtime . The original defined itself as  C# optimized , while the standard one is included in the general distribution of the tool. Neither is a fork, since the authors work together and both are mentioned in the official website. It is more of a divergent path. The grammars are compatible, but the generated parsers are not.

If you are unsure of which one to pick, I suggest the standard one, since it is slightly more updated. In fact, the standard has a release version supporting .NET Core, while the original only has a pre-release.

ANTLR is based on a new LL algorithm developed by the author and described in this paper: Adaptive LL(*) Parsing: The Power of Dynamic Analysis (PDF) .

It is quite popular for its many useful features: for instance version 4 supports direct left-recursive rules. However, a real added value of a vast community is the  large amount of grammars available .

It provides two ways to walk the AST, instead of embedding actions in the grammar: visitors and listeners. The first one is suited for when you have to manipulate or interact with the elements of the tree, while the second is useful when you just have to do something when a rule is matched.

The typical grammar is divided into two parts: lexer rules and parser rules. The division is implicit, since all the rules starting with an uppercase letter are lexer rules, while the ones starting with a lowercase letter are parser rules. Alternatively, lexer and parser grammars can be defined in separate files.

grammar simple;

basic   : NAME ':' NAME ;

NAME    : [a-zA-Z]* ;

COMMENT : '/*' .*? '*/' -> skip ;

If you are interested in learning how to use ANTLR, you can look into this giant ANTLR tutorial we have written.

Coco/R

Coco/R is a compiler generator that takes an attributed grammar and generates a scanner and a recursive descent parser. Attributed grammar means that the rules, that are written in an EBNF variant, can be annotated in several ways to change the methods of the generated parser.

The scanner includes support for dealing with things like compiler directives, called pragmas. They can be ignored by the parser and handled by custom code. The scanner can also be suppressed and substituted with one built by hand.

Technically, all the grammars must be LL(1), that is to say, the parser must be able to choose the correct rule only looking one symbol ahead. But Coco/R provides several methods to bypass this limitation, including semantic checks, which are basically custom functions that must return a boolean value. The manual also provides some suggestions for refactoring your code to respect this limitation.

A Coco/R grammar looks like this.

[Imports] 
// ident is the name of the grammar
"COMPILER" ident 
// this includes arbitrary fields and method in the target language (eg. Java)
[GlobalFieldsAndMethods]
// ScannerSpecification 
CHARACTERS
  [..]
  zero          = '0'.
  zeroToThree   = zero + "123" .
  octalDigit    = zero + "1234567" . 
  nonZeroDigit  = "123456789".
  digit         = '0' + nonZeroDigit .
  [..]

TOKENS
  ident         = letter { letter | digit }.
[..]
// ParserSpecification 
PRODUCTIONS
// just a rule is shown
IdentList = 
  ident <out int x>  (. int n = 1; .) 
  {',' ident         (. n++; .) 
  }                  (. Console.WriteLine("n = " + n); .) 
  . 
// end
"END" ident '.'

Coco/R has a good documentation, with several examples grammars. It supports several languages including Java, C#, and C++.

Gardens Point Parser Generator

Gardens Point Parser Generator (GPPG) is a parser generator that produces parsers written in the C# V2 or higher. The input language is YACC-like, and the parsers are LALR(1), with the usual automatic disambiguations. Designed to work with GPLEX.

There are some adaptations to make it work with C# and its tools (e.g. MPF ). However, a particular feature of GPPG is the possibility of also generating an HTML report of the structure of the generated parser. This is meant to simplify understanding and analysis of the parser and grammar.

It is designed to work with its brother GPLEX, however, it has also been used with COCO/R and custom lexers. The structure of the grammar is similar to the one of the brother, but instead of .lex it has the extension  .y .

Grammatica

Grammatica is a C# and Java parser generator (compiler compiler). It reads a grammar file (in an EBNF format) and creates well-commented and readable C# or Java source code for the parser. It supports LL(k) grammars, automatic error recovery, readable error messages, and a clean separation between the grammar and the source code.

The description on the Grammatica website is itself a good representation of Grammatica: simple to use, well-documented, with a good amount of features. You can build a listener by subclassing the generated classes, but not a visitor. There is a good reference, but not many examples.

A typical grammar of Grammatica is divided in three sections: header, tokens, and production. It is also clean, almost as much as an ANTLR one. It is also based on a similar Extended BNF, although the format is slightly different.

2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
%header%


GRAMMARTYPE = "LL"

[..]

%tokens%


ADD                          = "+"
SUB                          = "-"
[..]
NUMBER                       = <<[0-9]+>>
WHITESPACE                   = <<[ \t\n\r]+>> %ignore%

%productions%


Expression = Term [ExpressionTail] ;

ExpressionTail = "+" Expression
               | "-" Expression ;

Term = Factor [TermTail] ;

[..]

Atom = NUMBER
     | IDENTIFIER ;

Hime Parser Generator

Hime Parser Generator is a parser generator for .NET and Java, with a modern implementation of GLR with the RNGLR algorithm. It is a pragmatic tool with everything you need and nothing more. It does not reinvent the wheel, but it does improve it.

The grammar uses a custom language based on BNF with some enhancement. For instance, it supports context-sensitive lexing (useful for soft keywords), template rules to avoid repetition of similar rules, and features to transform the parse tree in an AST. These features are called tree actions: drop and promote. One drops the node from the tree, the other substitutes the node with its children.

Instead of embedding code in a Hime grammar, you can annotate a rule with something called semantic action  in the documentation . In practical terms, you just write the name of a function next to a rule and then you implement the function in your source code.

The grammar is put in a file with the .gram extension. The structure of the file resembles the classic one with the three sections: options, terminals, and rules. But it is much cleaner and looks like a C# class.

grammar MathExp
{
 options
 {
 Axiom = "exp";
 Separator = "SEPARATOR";
 }
 terminals
 {
 WHITE_SPACE-> U+0020 | U+0009 | U+000B | U+000C ;
 SEPARATOR-> WHITE_SPACE+;

 INTEGER-> [1-9] [0-9]* | '0' ;
 REAL-> INTEGER? '.' INTEGER  (('e' | 'E') ('+' | '-')? INTEGER)?
 |  INTEGER ('e' | 'E') ('+' | '-')? INTEGER ;
 NUMBER-> INTEGER | REAL ;
 }
 rules
 {
 exp_atom-> NUMBER^ @OnNumber // @ is a semantic action
 | '('! exp^ ')'! ; // the final ! drop the node from the final tree
 exp_op0-> exp_atom^
 |  exp_op0 '*'^ exp_atom // the ^ drop the node and changes with its children
 |  exp_op0 '/'^ exp_atom ;
 exp_op1-> exp_op0^
 |  exp_op1 '+'^ exp_op0
 |  exp_op1 '-'^ exp_op0 ;
 exp-> exp_op1^ ;
 }
}

The documentation is concise, but complete: there are tutorials and recipes to explain practical usage of the tool. There is an equally good enough grammar repository . It has debug features, like the generation of DOT files.

LLLPG

LLLPG is not a dedicated tool the way ANTLR is. Instead, LLLPG is designed to be embedded inside another programming language

So, LLLPG is a LL(k) parser generator that is not actually a standalone parser generator, but a part of a larger project called Enhanced C#. It is not really usable standalone, because it does not even generate a complete class, as the tool only translates the parts of the input file that it recognizes.

It also bizarrely claims to be better than ANTLR2 (released in 2006), despite not being updated until recently. But, we have mentioned it because, for the very narrow objective of building a custom language on .NET, it is a good tool designed just for that objective.

PEG

After the CFG parsers, is time to look at the PEG parsers available for C#.

IronMeta

The IronMeta parser generator provides a programming language and application for generating pattern matchers on arbitrary streams of objects. It is an implementation of Alessandro Warth's OMeta system in C#.

Despite the potentially perplexing reference to being a programming language, IronMeta is a PEG parser generator that works just like any other.

IronMeta improves upon its base, OMeta, by allowing direct and indirect left recursion. On the other hand, error reporting and documentation are limited.

An IronMeta grammar can contain embedded actions and conditions. A condition is a boolean expression that controls the activation of the following rule. If the condition is true, the rule activates.

// from the documentation
// IronMeta Calculator Example

using System;
using System.Linq;

ironmeta Calc<char, int> : IronMeta.Matcher.CharMatcher<int>
{
    Expression = Additive;

    Additive = Add | Sub | Multiplicative;

    Add = BinaryOp(Additive, '+', Multiplicative) -> { return _IM_Result.Results.Aggregate((total, n) => total + n); };
    Sub = BinaryOp(Additive, '-', Multiplicative) -> { return _IM_Result.Results.Aggregate((total, n) => total - n); };

    Multiplicative = Multiply | Divide;
    Multiplicative = Number(DecimalDigit);

    Multiply = BinaryOp(Multiplicative, "*", Number, DecimalDigit) -> { return _IM_Result.Results.Aggregate((p, n) => p * n); };
    Divide = BinaryOp(Multiplicative, "/", Number, DecimalDigit) -> { return _IM_Result.Results.Aggregate((q, n) => q / n); };

    BinaryOp :first :op :second .?:type = first:a KW(op) second(type):b -> { return new List<int> { a, b }; };

    Number :type = Digits(type):n WS* -> { return n; };

    Digits :type = Digits(type):a type:b -> { return a*10 + b; };
    Digits :type = type;

    DecimalDigit = .:c ?( (char)c >= '0' && (char)c <= '9' ) -> { return (char)c - '0'; };
    KW :str = str WS*;
    WS = ' ' | '\n' | '\r' | '\t';
}

Pegasus

Pegasus is a PEG (Parsing Expression Grammar) parser generator for .NET that integrates with MSBuild and Visual Studio.

Pegasus is a simple parser generator with equally sparse documentation. It supports the formal definition of PEG and does have basic features to simplify the management of indentation and debugging.

A Pegasus grammar is written in a .peg file that is quite clean, but can also includes embedded actions.

@namespace MyProject
@classname ExpressionParser

additive <double> -memoize
    = left:additive "+" right:multiplicative { left + right }
    / left:additive "-" right:multiplicative { left - right }
    / multiplicative

multiplicative <double> -memoize
    = left:multiplicative "*" right:power { left * right }
    / left:multiplicative "/" right:power { left / right }
    / power

power <double>
    = left:primary "^" right:power { Math.Pow(left, right) }
    / primary

primary <double> -memoize
    = decimal
    / "-" primary:primary { -primary }
    / "(" additive:additive ")" { additive }

decimal <double>
    = value:([0-9]+ ("." [0-9]+)?) { double.Parse(value)

That's all for Part 2! Tune back in tomorrow when we'll take a look at parser combinators, the best way to part C# code, and some great tools.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK