
rxe: literate and composable regular expressions

 5 years ago
source link: https://www.tuicool.com/articles/hit/z6rqInf

Introduction

rxe is a thin wrapper around Python's re module (see the official re docs). The rxe functions wrap the corresponding re patterns. For example, rxe.digit().one_or_more('a').whitespace() corresponds to \da+\s. Because rxe wraps sub-patterns in parentheses but wants to avoid unnamed capturing groups, the internal (equivalent) representation is actually \d(?:a)+\s. This pattern can always be retrieved with get_pattern().
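The equivalence of the two forms can be checked with the standard re module alone. This is a minimal sketch that does not call rxe; the variable names and sample strings are illustrative.

```python
import re

plain = re.compile(r"\da+\s")        # the compact form
grouped = re.compile(r"\d(?:a)+\s")  # the parenthesized, non-capturing form

samples = ["1aaa ", "7a\t", "1 ", "aa "]
plain_hits = [bool(plain.fullmatch(s)) for s in samples]
grouped_hits = [bool(grouped.fullmatch(s)) for s in samples]

# Both patterns accept and reject exactly the same samples.
assert plain_hits == grouped_hits

# (?:...) does not capture, so the compiled pattern reports zero groups.
assert grouped.groups == 0
```

Since the wrapper group is non-capturing, it should not shift the numbering of any groups the caller defines.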

Motivation

Suppose you want to parse geo coordinates from a string, like (<latitude>,<longitude>), where each value is a decimal. The raw regular expression would look like \(\d+\.\d+,\d+\.\d+\). This is hard to read and maintain for the next person, and diffs will be hard to understand and verify.

With rxe, you can write:

decimal = (rxe
  .one_or_more(rxe.digit())
  .literal('.')
  .one_or_more(rxe.digit())
)
coord = (rxe
  .literal('(')
  .exactly(1, decimal)
  .literal(',')
  .exactly(1, decimal)
  .literal(')')
)

Note how rxe allows the decimal regex to be re-used in the coord pattern! Although it's more code, it's much more readable.
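For comparison, the same kind of re-use can be approximated in plain re by building the pattern string from named pieces. This is a sketch for illustration, not rxe's implementation; DECIMAL, COORD and coord_re are names chosen here.

```python
import re

DECIMAL = r"\d+\.\d+"                # digits, a dot, digits
COORD = rf"\({DECIMAL},{DECIMAL}\)"  # "(" decimal "," decimal ")"

coord_re = re.compile(COORD)

# The assembled string is the raw pattern for the coordinate format.
assert COORD == r"\(\d+\.\d+,\d+\.\d+\)"
assert coord_re.fullmatch("(23.34,11.0)") is not None
assert coord_re.fullmatch("(23,11)") is None  # integers are not decimals here
```

String interpolation gets the re-use, but nothing guards the pieces against precedence mistakes; wrapping each sub-pattern in a non-capturing group is the usual safeguard.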

Suppose you want to support an arbitrary amount of whitespace around each part. The diff for this change will be:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(',')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(')')
)
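The effect of the four added lines can be sketched in plain re, assuming zero_or_more(rxe.whitespace()) corresponds to \s*. The names strict, tolerant and WS are illustrative.

```python
import re

DECIMAL = r"\d+\.\d+"
WS = r"\s*"  # assumed equivalent of zero_or_more(rxe.whitespace())

strict = re.compile(rf"\({DECIMAL},{DECIMAL}\)")
tolerant = re.compile(rf"\({WS}{DECIMAL}{WS},{WS}{DECIMAL}{WS}\)")

spaced = "( 23.34 , 11.0 )"
compact = "(23.34,11.0)"

assert strict.fullmatch(spaced) is None        # the original pattern rejects padding
assert tolerant.fullmatch(spaced) is not None  # the extended pattern accepts it
assert tolerant.fullmatch(compact) is not None # and still accepts the compact form
```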

Okay, but we also want to extract the latitude and longitude, not just match on them. Let's extract them, but in a readable way:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lat', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(',')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lon', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(')')
)

m = coord.match('(23.34, 11.0)')
print(m.group('lat'))
print(m.group('lon'))
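Named extraction of this kind maps onto re's (?P<name>...) syntax. The sketch below uses plain re and assumes rxe.named produces an equivalent named group; coord_re is a name chosen for illustration.

```python
import re

DECIMAL = r"\d+\.\d+"
coord_re = re.compile(
    rf"\(\s*(?P<lat>{DECIMAL})\s*,\s*(?P<lon>{DECIMAL})\s*\)"
)

m = coord_re.match("(23.34, 11.0)")
lat = m.group("lat")  # text captured by the 'lat' group
lon = m.group("lon")  # text captured by the 'lon' group
print(lat, lon)
```

Each group name must be unique within a pattern: re.compile raises an error if the same name is defined twice.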

One more example: parsing email addresses. The raw regex is [\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}. The equivalent rxe code:

username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
domain = (rxe
    .one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
    .literal('.')
    .at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
)
email = (rxe
    .exactly(1, username)
    .literal('@')
    .exactly(1, domain)
)
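The raw email regex can be exercised directly with re to see what it accepts. This is a quick check of the pattern string itself, not of rxe; the sample addresses are made up.

```python
import re

email_re = re.compile(r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}")

def is_email(text):
    """Return True if the whole string matches the email pattern."""
    return email_re.fullmatch(text) is not None

assert is_email("jane.doe+tag@example.co.uk")
assert not is_email("no-at-sign.example.com")  # missing '@'
assert not is_email("user@host")               # missing '.tld' part
```

In Python 3, \w on str patterns also matches the underscore and non-ASCII letters, so this pattern is more permissive than an ASCII-only alphanumeric class.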

Install

Use pip :

pip install git+https://github.com/mtrencseni/rxe

Then:

$ python
>>> from rxe import *
>>> r = rxe.digit().at_least(1, 'p').at_least(2, 'q')
>>> assert(r.match('1ppppqqqqq') is not None)
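If rxe is not installed, the session above can be approximated with plain re. The sketch assumes at_least(n, x) corresponds to (?:x){n,}; the exact string returned by get_pattern() may differ.

```python
import re

# Assumed equivalent of rxe.digit().at_least(1, 'p').at_least(2, 'q')
r = re.compile(r"\d(?:p){1,}(?:q){2,}")

assert r.match("1ppppqqqqq") is not None
assert r.match("1pq") is None  # only one 'q', but at least two are required
assert r.match("pppqq") is None  # no leading digit
```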
