rxe: literate and composable regular expressions
source link: https://www.tuicool.com/articles/hit/z6rqInf
Introduction
rxe is a thin wrapper around Python's re module (see the official re docs). The various rxe functions are wrappers around corresponding re patterns. For example, rxe.digit().one_or_more('a').whitespace() corresponds to \da+\s. Because rxe uses parentheses but wants to avoid unnamed groups, the internal (equivalent) representation is actually \d(?:a)+\s. This pattern can always be retrieved with get_pattern().
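To see why the two forms are interchangeable, the check below uses only the standard re module (it does not call rxe). It is a small sketch comparing the hand-written pattern with the non-capturing form on a few sample strings chosen for illustration.

```python
import re

# Pattern as one would write it by hand.
plain = re.compile(r"\da+\s")
# The non-capturing form the text describes as the internal representation.
grouped = re.compile(r"\d(?:a)+\s")

# Sample inputs picked for illustration only.
samples = ["1aaa ", "7a\t", "1 ", "aa ", "1aab"]
results = [(s, bool(plain.fullmatch(s)), bool(grouped.fullmatch(s))) for s in samples]

for text, plain_ok, grouped_ok in results:
    print(repr(text), plain_ok, grouped_ok)

# (?:...) groups without capturing, so no unnamed group is created.
print(grouped.groups)
```

Both patterns accept and reject the same samples, and the non-capturing version reports zero capture groups.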
Motivation
Suppose you want to parse geo coordinates from a string of the form (<latitude>,<longitude>), where each value is a decimal. The raw regular expression would look like \(\d+\.\d+,\d+\.\d+\). This is hard to read and maintain for the next person, and diffs to it are hard to understand and verify.
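For reference, here is a minimal sketch of the raw approach using only the standard re module. The sample strings are made up for illustration; the point is that the pattern works but gives the reader little help.

```python
import re

# Raw pattern for "(<latitude>,<longitude>)" where both values are decimals.
raw_coord = re.compile(r"\(\d+\.\d+,\d+\.\d+\)")

strict_match = raw_coord.fullmatch("(23.34,11.0)")
# A space after the comma is not allowed by this pattern.
spaced_match = raw_coord.fullmatch("(23.34, 11.0)")

print(strict_match is not None)
print(spaced_match is not None)
```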
With rxe, you can write:
    decimal = (rxe
        .one_or_more(rxe.digit())
        .literal('.')
        .one_or_more(rxe.digit())
    )

    coord = (rxe
        .literal('(')
        .exactly(1, decimal)
        .literal(',')
        .exactly(1, decimal)
        .literal(')')
    )
Note how rxe allows the decimal regex to be re-used in the coord pattern! Although it's more code, it's much more readable.
Suppose you want to support an arbitrary amount of whitespace. The diff for this change will be:
    coord = (rxe
        .literal('(')
        .zero_or_more(rxe.whitespace()) # <--- line added
        .exactly(1, decimal)
        .zero_or_more(rxe.whitespace()) # <--- line added
        .literal(',')
        .zero_or_more(rxe.whitespace()) # <--- line added
        .exactly(1, decimal)
        .zero_or_more(rxe.whitespace()) # <--- line added
        .literal(')')
    )
Okay, but we also want to extract the latitude and longitude, not just match on them. Let's extract them, but in a readable way:

    coord = (rxe
        .literal('(')
        .zero_or_more(rxe.whitespace())
        .exactly(1, rxe.named('lat', decimal)) # <--- line changed
        .zero_or_more(rxe.whitespace())
        .literal(',')
        .zero_or_more(rxe.whitespace())
        .exactly(1, rxe.named('lon', decimal)) # <--- line changed
        .zero_or_more(rxe.whitespace())
        .literal(')')
    )

    m = coord.match('(23.34, 11.0)')
    print(m.group('lat'))
    print(m.group('lon'))
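For comparison, the same extraction can be written by hand with re's named groups. This is a sketch of an equivalent pattern built with the standard library only; it is not necessarily the exact string that get_pattern() would return for the rxe version.

```python
import re

# Reusable fragment for a decimal number, composed by string concatenation.
decimal = r"\d+\.\d+"

# Hand-written equivalent: optional whitespace around each named decimal.
coord_re = re.compile(
    r"\(\s*(?P<lat>" + decimal + r")\s*,\s*(?P<lon>" + decimal + r")\s*\)"
)

m = coord_re.match("(23.34, 11.0)")
lat = m.group("lat")
lon = m.group("lon")
print(lat)
print(lon)
```

The named groups make extraction readable, but the pattern itself is still a dense string, which is the problem the builder style is meant to address.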
One more example: parsing email addresses. The regex is [\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}. The equivalent rxe code:
    username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))

    domain = (rxe
        .one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
        .literal('.')
        .at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
    )

    email = (rxe
        .exactly(1, username)
        .literal('@')
        .exactly(1, domain)
    )
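As a quick sanity check of the raw email regex quoted above, the sketch below runs it through the standard re module. The addresses are invented test inputs, and the regex is the simple one from the text rather than a full validator.

```python
import re

# The raw email regex from the text.
email_re = re.compile(r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}")

# Hypothetical sample addresses, used only to exercise the pattern.
valid = email_re.fullmatch("jane.doe+news@example.co.uk")
no_tld = email_re.fullmatch("user@localhost")
short_tld = email_re.fullmatch("user@example.c")

print(valid is not None)
print(no_tld is not None)
print(short_tld is not None)
```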
Install
Use pip:

    pip install git+https://github.com/mtrencseni/rxe
Then:
    $ python
    >>> from rxe import *
    >>> r = rxe.digit().at_least(1, 'p').at_least(2, 'q')
    >>> assert(r.match('1ppppqqqqq') is not None)
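If rxe is not installed yet, the same check can be approximated with plain re. This sketch assumes that at_least(n, x) corresponds to the {n,} quantifier applied to a non-capturing group; that mapping is an assumption made here for illustration and is not confirmed by the text.

```python
import re

# Assumed plain-re equivalent of rxe.digit().at_least(1, 'p').at_least(2, 'q').
r_plain = re.compile(r"\d(?:p){1,}(?:q){2,}")

matched = r_plain.match("1ppppqqqqq") is not None
# Only one 'q' follows the 'p', so at least two cannot be satisfied.
too_few_q = r_plain.match("1pq") is not None

print(matched)
print(too_few_q)
```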