cpp-peglib
source link: https://www.tuicool.com/articles/hit/rIfmA37
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
cpp-peglib
C++11 header-only PEG (Parsing Expression Grammars) library.
cpp-peglib tries to provide more expressive parsing experience in a simple way. This library depends on only one header file. So, you can start using it right away just by including peglib.h
in your project.
The PEG syntax is well described on page 2 in the document . cpp-peglib also supports the following additional syntax for now:
-
<
...>
(Token boundary operator) -
~
(Ignore operator) -
\x20
(Hex number char) -
$name<
...>
(Named capture operator) -
$name
(Backreference operator) -
%whitespace
(Automatic whitespace skipping) -
%word
(Word expression) -
$name(
...)
(Capture scope operator) -
$name<
...>
(Named capture operator) -
$name
(Backreference operator) -
MACRO_NAME(
...)
(Parameterized rule or Macro)
This library also supports the linear-time parsing known as the Packrat parsing.
If you need a Go language version, please see go-peg .
How to use
This is a simple calculator sample. It shows how to define grammar, associate samantic actions to the grammar, and handle semantic values.
// (1) Include the header file #include <peglib.h> #include <assert.h> #include <iostream> using namespace peg; using namespace std; int main(void) { // (2) Make a parser auto grammar = R"( # Grammar for Calculator... Additive <- Multitive '+' Additive / Multitive Multitive <- Primary '*' Multitive / Primary Primary <- '(' Additive ')' / Number Number <- < [0-9]+ > %whitespace <- [ \t]* )"; parser parser; parser.log = [](size_t line, size_t col, const string& msg) { cerr << line << ":" << col << ": " << msg << "\n"; }; auto ok = parser.load_grammar(grammar); assert(ok); // (3) Setup actions parser["Additive"] = [](const SemanticValues& sv) { switch (sv.choice()) { case 0: // "Multitive '+' Additive" return sv[0].get<int>() + sv[1].get<int>(); default: // "Multitive" return sv[0].get<int>(); } }; parser["Multitive"] = [](const SemanticValues& sv) { switch (sv.choice()) { case 0: // "Primary '*' Multitive" return sv[0].get<int>() * sv[1].get<int>(); default: // "Primary" return sv[0].get<int>(); } }; parser["Number"] = [](const SemanticValues& sv) { return stoi(sv.token(), nullptr, 10); }; // (4) Parse parser.enable_packrat_parsing(); // Enable packrat parsing. int val; parser.parse(" (1 + 2) * 3 ", val); assert(val == 9); }
There are four semantic actions available:
[](const SemanticValues& sv, any& dt) [](const SemanticValues& sv) [](SemanticValues& sv, any& dt) [](SemanticValues& sv)
SemanticValues
value contains the following information:
- Semantic values
- Matched string information
- Token information if the rule is literal or uses a token boundary operator
- Choice number when the rule is 'prioritized choise'
any& dt
is a 'read-write' context data which can be used for whatever purposes. The initial context data is set in peg::parser::parse
method.
peg::any
is a simpler implementatin of boost::any . It can wrap arbitrary data type.
A semantic action can return a value of arbitrary data type, which will be wrapped by peg::any
. If a user returns nothing in a semantic action, the first semantic value in the const SemanticValues& sv
argument will be returned. (Yacc parser has the same behavior.)
Here shows the SemanticValues
structure:
struct SemanticValues : protected std::vector<any> { // Input text const char* path; const char* ss; // Matched string std::string str() const; // Matched string const char* c_str() const; // Matched string start size_t length() const; // Matched string length // Line number and column at which the matched string is std::pair<size_t, size_t> line_info() const; // Tokens std::vector< std::pair< const char*, // Token start size_t>> // Token length tokens; std::string token(size_t id = 0) const; // Choice number (0 based index) size_t choice() const; // Transform the semantic value vector to another vector template <typename T> vector<T> transform(size_t beg = 0, size_t end = -1) const; }
The following example uses <
... >
operator, which is token boundary operator.
auto syntax = R"( ROOT <- _ TOKEN (',' _ TOKEN)* TOKEN <- < [a-z0-9]+ > _ _ <- [ \t\r\n]* )"; peg pg(syntax); pg["TOKEN"] = [](const SemanticValues& sv) { // 'token' doesn't include trailing whitespaces auto token = sv.token(); }; auto ret = pg.parse(" token1, token2 ");
We can ignore unnecessary semantic values from the list by using ~
operator.
peg::pegparser parser( " ROOT <- _ ITEM (',' _ ITEM _)* " " ITEM <- ([a-z])+ " " ~_ <- [ \t]* " ); parser["ROOT"] = [&](const SemanticValues& sv) { assert(sv.size() == 2); // should be 2 instead of 5. }; auto ret = parser.parse(" item1, item2 ");
The following grammar is same as the above.
peg::parser parser( " ROOT <- ~_ ITEM (',' ~_ ITEM ~_)* " " ITEM <- ([a-z])+ " " _ <- [ \t]* " );
Semantic predicate support is available. We can do it by throwing a peg::parse_error
exception in a semantic action.
peg::parser parser("NUMBER <- [0-9]+"); parser["NUMBER"] = [](const SemanticValues& sv) { auto val = stol(sv.str(), nullptr, 10); if (val != 100) { throw peg::parse_error("value error!!"); } return val; }; long val; auto ret = parser.parse("100", val); assert(ret == true); assert(val == 100); ret = parser.parse("200", val); assert(ret == false);
enter and leave actions are also avalable.
parser["RULE"].enter = [](const char* s, size_t n, any& dt) { std::cout << "enter" << std::endl; }; parser["RULE"] = [](const SemanticValues& sv, any& dt) { std::cout << "action!" << std::endl; }; parser["RULE"].leave = [](const char* s, size_t n, size_t matchlen, any& value, any& dt) { std::cout << "leave" << std::endl; };
Ignoring Whitespaces
As you can see in the first example, we can ignore whitespaces between tokens automatically with %whitespace
rule.
%whitespace
rule can be applied to the following three conditions:
- trailing spaces on tokens
- leading spaces on text
- trailing spaces on literal strings in rules
These are valid tokens:
KEYWORD <- 'keyword' WORD <- < [a-zA-Z0-9] [a-zA-Z0-9-_]* > # token boundary operator is used. IDNET <- < IDENT_START_CHAR IDENT_CHAR* > # token boundary operator is used.
The following grammar accepts one, "two three", four
.
ROOT <- ITEM (',' ITEM)* ITEM <- WORD / PHRASE WORD <- < [a-z]+ > PHRASE <- < '"' (!'"' .)* '"' > %whitespace <- [ \t\r\n]*
Word expression
peg::parser parser(R"( ROOT <- 'hello' 'world' %whitespace <- [ \t\r\n]* %word <- [a-z]+ )"); parser.parse("hello world"); // OK parser.parse("helloworld"); // NG
Capture/Backreference
peg::parser parser(R"( ROOT <- CONTENT CONTENT <- (ELEMENT / TEXT)* ELEMENT <- $(STAG CONTENT ETAG) STAG <- '<' $tag< TAG_NAME > '>' ETAG <- '</' $tag '>' TAG_NAME <- 'b' / 'u' TEXT <- TEXT_DATA TEXT_DATA <- ![<] . )"); parser.parse("This is <b>a <u>test</u> text</b>."); // OK parser.parse("This is <b>a <u>test</b> text</u>."); // NG parser.parse("This is <b>a <u>test text</b>."); // NG
Parameterized Rule or Macro
# Syntax Start ← _ Expr Expr ← Sum Sum ← List(Product, SumOpe) Product ← List(Value, ProOpe) Value ← Number / T('(') Expr T(')') # Token SumOpe ← T('+' / '-') ProOpe ← T('*' / '/') Number ← T([0-9]+) ~_ ← [ \t\r\n]* # Macro List(I, D) ← I (D I)* T(x) ← < x > _
AST generation
cpp-peglib is able to generate an AST (Abstract Syntax Tree) when parsing. enable_ast
method on peg::parser
class enables the feature.
peg::parser parser("..."); parser.enable_ast(); shared_ptr<peg::Ast> ast; if (parser.parse("...", ast)) { cout << peg::ast_to_s(ast); ast = peg::AstOptimizer(true).optimize(ast); cout << peg::ast_to_s(ast); }
peg::AstOptimizer
removes redundant nodes to make a AST simpler. You can make your own AST optimizers to fit your needs.
See actual usages in the AST calculator example and PL/0 language example .
Make a parser with parser combinators
Instead of makeing a parser by parsing PEG syntax text, we can also construct a parser by hand with parser combinatorss . Here is an example:
using namespace peg; using namespace std; vector<string> tags; Definition ROOT, TAG_NAME, _; ROOT <= seq(_, zom(seq(chr('['), TAG_NAME, chr(']'), _))); TAG_NAME <= oom(seq(npd(chr(']')), dot())), [&](const SemanticValues& sv) { tags.push_back(sv.str()); }; _ <= zom(cls(" \t")); auto ret = ROOT.parse(" [tag1] [tag:2] [tag-3] ");
The following are available operators:
Operator Description seq Sequence cho Prioritized Choice zom Zero or More oom One or More opt Optional apd And predicate npd Not predicate lit Literal string cls Character class chr Character dot Any character tok Token boundary ign Ignore semantic value csc Capture scope cap Capture bkr Back reference usr User defined parserAdjust definitions
It's possible to add/override definitions.
auto syntax = R"( ROOT <- _ 'Hello' _ NAME '!' _ )"; Rules additional_rules = { { "NAME", usr([](const char* s, size_t n, SemanticValues& sv, any& dt) -> size_t { static vector<string> names = { "PEG", "BNF" }; for (const auto& name: names) { if (name.size() <= n && !name.compare(0, name.size(), s, name.size())) { return name.size(); // processed length } } return -1; // parse error }) }, { "~_", zom(cls(" \t\r\n")) } }; auto g = parser(syntax, additional_rules); assert(g.parse(" Hello BNF! "));
Unicode support
cpp-peglib accepts UTF8 text. .
matches a Unicode codepoint. Also, it supports \u????
.
peglint - PEG syntax lint utility
Build peglint
> cd lint > mkdir build > cd build > cmake .. > make > ./peglint usage: peglint [--ast] [--optimize_ast_nodes|--opt] [--source text] [--server [PORT]] [--trace] [grammar file path] [source file path]
Lint grammar
> cat a.peg A <- 'hello' ^ 'world' > peglint a.peg a.peg:1:14: syntax error
> cat a.peg A <- B > peglint a.peg a.peg:1:6: 'B' is not defined.
> cat a.peg A <- B / C B <- 'b' C <- A > peglint a.peg a.peg:1:10: 'C' is left recursive. a.peg:3:6: 'A' is left recursive.
Lint source text
> cat a.peg Additive <- Multitive '+' Additive / Multitive Multitive <- Primary '*' Multitive / Primary Primary <- '(' Additive ')' / Number Number <- < [0-9]+ > %whitespace <- [ \t\r\n]* > peglint --source "1 + a * 3" a.peg [commendline]:1:3: syntax error
> cat a.txt 1 + 2 * 3 > peglint --ast a.peg a.txt + Additive + Multitive + Primary - Number (1) + Additive + Multitive + Primary - Number (2) + Multitive + Primary - Number (3)
> peglint --ast --opt --source "1 + 2 * 3" a.peg + Additive - Multitive[Number] (1) + Additive[Multitive] - Primary[Number] (2) - Multitive[Number] (3)
Sample codes
- Calculator
- Calculator (with parser operators)
- Calculator (AST version)
- PL/0 language example
- A tiny PL/0 JIT compiler in less than 700 LOC with LLVM and PEG parser
PEG debug
A debug viewer for Parsing Expression Grammars using cpp-peglib by mqnc . Please see his gihub project page for the detail. You can see a parse result of PL/0 code here .
Tested compilers
- Visual Studio 2017
- Visual Studio 2015
- Visual Studio 2013 with update 5
- Clang++ 5.0.1
- Clang++ 5.0
- Clang++ 4.0
- Clang++ 3.5
- G++ 5.4 on Ubuntu 16.04
IMPORTANT NOTE for Ubuntu: Need -pthread
option when linking. See #23 and #46 .
TODO
- Advanced Unicode support ( Unicode regular expressoin )
License
MIT license (© 2018 Yuji Hirose)
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK