Eggex Example: Recognizing Python Integer Literals
source link: http://www.oilshell.org/blog/2019/12/22.html
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
Recognizing Python Integer Literals
Eggex Example: Recognizing Python Integer Literals
Eggex is Oil's regular expression syntax. You may want to skim the headings of the eggex doc if it's unfamiliar.
This post demonstrates egg expressions on a realistic problem:
recognizing integer literals like 42
, 0xFF
, or 0o755
.
It also shows that eggexes are appropriate for two problems that traditionally use different regex syntax:
- Ad-hoc text processing in languages like shell and Python. For example, I interactively refactor source code with regexes, and the /release/$VERSION/ page is essentially a big shell script.
- Lexing programming languages. Under the hood, Oil's lexer uses re2c to turn regexes into efficient C code. But the source is in Python, using its Perl-style regex syntax.
The Oil language and its eggex subset are both still under development. I made a few changes motivated by this example, and I'm happy with the results. Let me know what you think in the comments or follow the links on the wiki!
Three Syntaxes for Regular Languages
Python's Spec
Considered by themselves, integer literals are a regular language. You can write them in decimal, binary, octal, or hexadecimal; and add underscores for readability.
The reference manual describes them like this:
integer ::= decinteger | bininteger | octinteger | hexinteger
decinteger ::= nonzerodigit (["_"] digit)* | "0"+ (["_"] "0")*
bininteger ::= "0" ("b" | "B") (["_"] bindigit)+
octinteger ::= "0" ("o" | "O") (["_"] octdigit)+
hexinteger ::= "0" ("x" | "X") (["_"] hexdigit)+
nonzerodigit ::= "1"..."9"
digit ::= "0"..."9"
bindigit ::= "0" | "1"
octdigit ::= "0"..."7"
hexdigit ::= digit | "a"..."f" | "A"..."F"
Examples of valid integer literals:
7 2147483647 0o177 0b100110111
3 79228162514264337593543950336 0o377 0xdeadbeef
100_000_000_000 0b_1110_0101
I'm unfamiliar with this metalanguage, but I can guess that ...
is a
character range, that []
means "optional", and that *
and +
denote
repetition.
This syntax is meant for humans to read — i.e. it isn't executable.
Source: https://docs.python.org/3/reference/lexical_analysis.html#integer-literals
Eggex
Here's the same language in eggex syntax:
DecDigit = / [0-9] /
BinDigit = / [0-1] /
OctDigit = / [0-7] /
HexDigit = / [0-9 a-f A-F] /
DecInt = / [1-9] ('_'? DecDigit)* | '0'+ ('_'? '0')* /
BinInt = / '0' [b B] ('_'? BinDigit)+ /
OctInt = / '0' [o O] ('_'? OctDigit)+ /
HexInt = / '0' [x X] ('_'? HexDigit)+ /
Integer = / DecInt | BinInt | OctInt | HexInt /
It shows some of the differences between eggex and commmon Perl-style regex syntax:
- Patterns can be composed by naming them, e.g.
BinDigit
. - Literals are quoted, e.g.
'0'
and'_'
. Quotes are optional in character classes for brevity, and to make them similar to Perl/POSIX syntax. - Terms are separated with whitespace.
Unlike the Python spec, this syntax is executable. Compared with POSIX regexes, it's more readable.
(Note that I also re-organized the rules: They're in bottom-up order rather
than top-down, and I inlined the nonzerodigit
rule because it's only used
once.)
POSIX Syntax
An eggex in Oil evaluates to a POSIX extended regex (ERE) by default, so
we simply use echo
to see it.
$ echo $Integer
([1-9](_?[0-9])*|0+(_?0)*|0[bB](_?[0-1])+|0[oO](_?[0-7])+|0[xX](_?[0-9a-fA-F])+)
I wouldn't want to type this by hand! Even if I succeeded, reading it and modifying it later would be a challenge.
Perl's /x
and Python's re.VERBOSE
help to a degree, but they're not
available in shell tools like grep and awk.
Two Use Cases for Eggexes
Text Processing With Unix Tools
The integer literal problem seems naturally closer to lexing, but let's try it with Unix tools first.
Here I grep for integer literals using Oil's seamless integration:
bash$ bin/oil
~/git/oilshell/oil$ egrep $Integer */*.py
...
core/process.py: fd = posix.open(path, fd_mode, 0o666)
...
osh/string_ops.py: return ''.join(chr(b & 0xFF) for b in bytes_)
...
osh/string_ops.py: elif (starting_byte >> 3) == 0b11110:
It works! We found:
- An octal literal
0o666
for file permissions (about the only thing they're good for). - A hex literal
0xFF
for truncating an integer to a single byte. - A binary literal
0b11110
for recognizing with UTF-8. (UTF-8 is best understood in binary notation.)
Lexing Programming Languages
In addition to translating to POSIX ERE syntax, eggexes should also translate to:
- Perl-style syntax, to use with the regex engines in nearly all other languages: Python, Ruby, Java, C++, etc. See issue 488.
- The syntax that the re2c compiler accepts. This would let Oil's own lexer be expressed in eggex.
To be concrete, here's an except from Oil's frontend/lex.py:
VAR_NAME_RE = r'[a-zA-Z_][a-zA-Z0-9_]*'
rules = [
R(r'\$' + VAR_NAME_RE, Id.VSub_DollarName),
...
R(VAR_NAME_RE + '\+?=', Id.Lit_VarLike),
R(VAR_NAME_RE + '\[', Id.Lit_ArrayLhsOpen),
]
As you can see, we use the syntax from Python's re
module and string
concatenation.
We do this because Oil started as a pure Python program, and because the lexer requires "metaprogramming" to remove duplication.
But Eggex could also remove that duplication with pattern composition. It's also improved by "first class" regex syntax, whitespace, and quoted literals.
VarName = / [a-z A-Z _] [a-z A-Z 0-9 _]* /
rules = [
( / '$' VarName /, Id.VSub_DollarName),
...
( / VarName '+'? '=' /, Id.Lit_VarLike),
( / VarName '[' /, Id.Lit_ArrayLhsOpen),
]
See Appendix A for details on the lexer's architecture.
Recent Syntax Changes
I used the integer literal example to test whether Eggex is a good syntax.
I removed punctuation in:
- Pattern composition syntax:
Subpattern
is allowed in addition to@subpattern
. The latter is still supported if you don't want to use capital letters. - Character literals:
[b B]
is allowed in addition to['b' 'B']
. This is consistent with ranges like[a-f]
, and with Perl/POSIX syntax.
I also implemented the capturing and grouping syntax, which contains less punctuation than the Perl-style syntax:
Eggex:
(mygroup+)
< mycapture+ >
< mycapture+ : name >
Perl/Python:
(?:mygroup+)
(mycapture+)
(?P<name>mycapture+)
These changes were motivated by Zulip chats. Thanks to Eric Bergstrome for discussing the grouping and capturing syntax, and to Aur Saraf for discussing the use of punctuation.
The Oil language is still under development, and I've made multiple changes based on user feedback. Feel free to lurk on Zulip and see what we're talking about!
Summary
Like Oil itself, eggexes are a seamless upgrade. You write patterns in a
modern, composable syntax, but tools like egrep
and awk
are passed the
ERE syntax they expect.
The same eggex syntax can be used for both ad-hoc text processing and writing fast and correct lexers.
Try porting regex to eggex and see if it's more maintainable. Start a thread
on #oil-discuss
on Zulip if you have problems.
Also see the Help Wanted section if you want to help make Oil happen! There are many opportunities to parallelize development.
Appendix A: Oil's Lexer Uses Two Stages of Code Generation
Since I didn't finish the #lexing series, now is a good time to explain the architecture of Oil's lexer.
I started publishing the following dense source files on each /release/$VERSION/ page. If anyone wants to implement Oil in another language, this portable and principled lexer will save them many hours of work.
-
frontend/lex.py is a pure Python source file using the
re
module syntax. The dynamic lists are evaluated into (pattern,Id
) rules for 14 lexer modes.Then we use the (undocumented) regex parse tree from Python's
sre_parse.py
to produce -
osh-lex.re2c.h, a file in re2c syntax. This syntax has quoted literals like lex and Eggex.
Then re2c produces
-
osh-lex.h, a large C file with 20K lines of
switch
andgoto
statements, which nicely express a DFA.Then the C compiler produces
-
Machine code. Even though the the source code is large, the machine code isn't, according to the metrics. This is because C compilers perform a switch lowering optimization.
Why go to all this trouble? Because regexes compiled to state machines are more correct than bash's style of groveling through backslashes and braces one at a time. It's also more concise: Oil's source code is many times smaller than bash's.
Appendix B: Other Pattern Syntaxes
Eggex obeys some rough refactoring properties. You might interactively type:
$ egrep [0-9]+ foo.txt
And then extract a pattern so you can test and enhance it:
$ pat = / [0-9]+ /
$ egrep $pat foo.txt
In general the eggex should resemble the regex, but be more readable. The integer literal example is intended to show that.
Eggex also uses postfix repetition operators like +
and *
so it's
consistent with:
- POSIX regex syntax
- Perl regex syntax
- Context Free Grammar syntax like EBNF and Python's pgen2, which
Oil uses. Except that
[optional]
should beoptional?
. - Parsing Expression Grammar syntax, which Python is likely moving to.
- The metalanguage for command line syntax.
grep [OPTION]... PATTERN [FILE]...
could begrep OPTION* PATTERN FILE*
to be consistent.
Note that ...
meant "character range" in Python's spec, but means
"repetition" in command lines. One of Oil's main goals to avoid such
punning across the many languages and metalanguages that
compose a large software system.
Recommend
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK