rxe: literate and composable regular expressions

Introduction

rxe is a thin wrapper around Python's re module (see the official re docs). The rxe functions wrap the corresponding re patterns. For example, rxe.digit().one_or_more('a').whitespace() corresponds to \da+\s. Because rxe groups sub-expressions with parentheses but wants to avoid unnamed capturing groups, the internal (equivalent) representation is actually \d(?:a)+\s. This pattern can always be retrieved with get_pattern().
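The claim that the two patterns are equivalent can be checked with the standard re module alone. The following is a minimal sketch that does not use rxe itself; the sample strings and the helper name same_result are made up for illustration.

```python
import re

# The short form and the non-capturing-group form quoted in the text above.
short_form = re.compile(r"\da+\s")
grouped_form = re.compile(r"\d(?:a)+\s")

def same_result(text):
    """Return True if both patterns agree on whether `text` fully matches."""
    a = short_form.fullmatch(text)
    b = grouped_form.fullmatch(text)
    return (a is None) == (b is None)

# Illustrative inputs (assumed, not from the rxe docs): some match, some do not.
samples = ["1aaa ", "7a\t", "1 ", "aa ", "9aab"]
all_agree = all(same_result(s) for s in samples)

# (?:...) groups only for precedence; it creates no numbered capture groups.
capture_group_count = grouped_form.groups
```

Because (?:...) does not capture, wrapping every sub-expression this way leaves group numbering untouched, which is presumably why it is a safe default for composing patterns.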

Motivation

Suppose you want to parse geo coordinates from a string of the form (<latitude>,<longitude>), where each value is a decimal. The raw regular expression would look like \(\d+\.\d+,\d+\.\d+\). This is hard to read and maintain for the next person, and diffs to it are hard to understand and verify.

With rxe, you can write:

decimal = (rxe
  .one_or_more(rxe.digit())
  .literal('.')
  .one_or_more(rxe.digit())
)
coord = (rxe
  .literal('(')
  .exactly(1, decimal)
  .literal(',')
  .exactly(1, decimal)
  .literal(')')
)

Note how rxe allows the decimal regex to be re-used in the coord pattern! Although it's more code, it's much more readable.

Suppose you want to allow an arbitrary amount of whitespace around each value. The diff for this change will be:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(',')
  .zero_or_more(rxe.whitespace()) # <--- line added
  .exactly(1, decimal)
  .zero_or_more(rxe.whitespace()) # <--- line added
  .literal(')')
)

Okay, but we also want to extract the latitude and longitude, not just match them. Let's extract them, but in a readable way:

coord = (rxe
  .literal('(')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lat', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(',')
  .zero_or_more(rxe.whitespace())
  .exactly(1, rxe.named('lon', decimal)) # <--- line changed
  .zero_or_more(rxe.whitespace())
  .literal(')')
)

m = coord.match('(23.34, 11.0)')
print(m.group('lat'))
print(m.group('lon'))
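For comparison, here is roughly what the same extraction looks like written by hand against the standard re module. This is a sketch only: the pattern is an assumed hand-translation of the builder code above (optional whitespace, two named decimal groups), not the output of get_pattern(), and parse_coord is a hypothetical helper.

```python
import re

# Hand-written counterpart of the coord builder: optional whitespace around
# each value, and named groups 'lat' and 'lon' for the two decimals.
COORD = re.compile(r"\(\s*(?P<lat>\d+\.\d+)\s*,\s*(?P<lon>\d+\.\d+)\s*\)")

def parse_coord(text):
    """Return (lat, lon) as floats, or None if `text` is not a coordinate."""
    m = COORD.fullmatch(text)
    if m is None:
        return None
    return float(m.group("lat")), float(m.group("lon"))

result = parse_coord("(23.34, 11.0)")
```

The single-line pattern is compact, but every later change (a sign, an optional fraction) has to be made inside that one string, which is the maintenance cost the builder style is meant to avoid.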

One more example: parsing email addresses. The regex is [\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}. The equivalent rxe code:

username = rxe.one_or_more(rxe.set([rxe.alphanumeric(), '.', '%', '+', '-']))
domain = (rxe
    .one_or_more(rxe.set([rxe.alphanumeric(), '.', '-']))
    .literal('.')
    .at_least_at_most(2, 6, rxe.set([rxe.range('a', 'z'), rxe.range('A', 'Z')]))
)
email = (rxe
    .exactly(1, username)
    .literal('@')
    .exactly(1, domain)
)
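To see what the raw email regex quoted above accepts and rejects, it can be exercised directly with the standard re module. This is a sketch using only re; the helper name looks_like_email and the sample addresses are assumptions for illustration.

```python
import re

# The raw pattern quoted in the text: username, '@', domain, '.', 2-6 letters.
EMAIL = re.compile(r"[\w.%+-]+@[\w.-]+\.[a-zA-Z]{2,6}")

def looks_like_email(text):
    """Return True if the whole string matches the email pattern."""
    return EMAIL.fullmatch(text) is not None

# Illustrative inputs (made up for this example).
accepted = looks_like_email("user.name+tag@example.co.uk")
rejected_no_tld = looks_like_email("user@host")
rejected_no_at = looks_like_email("no-at-sign.example.com")
```

Note that this pattern is a pragmatic approximation rather than a full address validator; for instance, it rejects top-level domains longer than six letters.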

Install

Use pip:

pip install git+https://github.com/mtrencseni/rxe

Then:

$ python
>>> from rxe import *
>>> r = rxe.digit().at_least(1, 'p').at_least(2, 'q')
>>> assert(r.match('1ppppqqqqq') is not None)
