1

Structural pattern matching in Python 3.10

 2 years ago
source link: https://benhoyt.com/writings/python-pattern-matching/
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Structural pattern matching in Python 3.10

September 2021

Summary: Python 3.10, which is due out in early October 2021, will include a large new language feature called structural pattern matching. This article is a critical but (hopefully) informative presentation of the feature, with examples based on real-world code.

Go to: What it is | Where it shines | My code | Other projects | Problems | Wrapping up

At a recent local Python meetup, a friend was presenting some of the new features in Python 3.8 and 3.9, and afterwards we got to talking about the pattern matching feature coming in Python 3.10. I went on a mild rant about how I thought Python had lost the plot: first assignment expressions using :=, and now this rather sprawling feature.

My friend interpreted my rant rather generously, and soon said, “it sounds like you want to give a talk about it at our next meetup”. Okay … well, why not!

In the meantime, I thought I’d get to know the feature better by writing up my thoughts and some code examples in article form. As you can gather, I’m rather biased, but I’ll try to present the positives as well as just criticism.

The pattern matching feature has no fewer than three PEPs (Python Enhancement Proposals) to describe it:

The tutorial in particular provides a good overview of the feature, so if you just want to read one of the PEPs, read that one. I’ll also demo the features below.

The cynical side of me noticed that the rationale is by far the longest (clocking in at 8500 words). Methinks thou dost protest too much? To be fair, however, it looks like they just moved the usual “Rejected Ideas” section of the PEP to its own PEP due to the length.

But I think what’s missing in the PEPs is an evaluation of the costs versus the benefits. The costs are significant new language semantics for developers to learn, as well as the cost of implementation (for CPython and other Python implementations). The benefits need to be discussed in light of real-world code: the kind of code people use Python for on a day-to-day basis, not just the fairly contrived examples in the “Motivation” section of the PEP.

Part of what I want to do here is evaluate some real code and see how much (or little) pattern matching improves it. But first, let’s take a brief look at what structural pattern matching in Python looks like.

What it is

It’s tempting to think of pattern matching as a switch statement on steroids. However, as the rationale PEP points out, it’s better thought of as a “generalized concept of iterable unpacking”. Many people have asked for a switch in Python over the years, though I can see why that has never been added. It just doesn’t provide enough value over a bunch of if ... elif statements to pay for itself. The new match ... case feature provides the basics of switch, plus the “structural” matching part – and some.

The basic syntax is shown in the following switch-like example (imagine we’re rolling our own Git CLI):

parser = argparse.ArgumentParser()
parser.add_argument('command', choices=['push', 'pull', 'commit'])
args = parser.parse_args()

match args.command:
    case 'push':
        print('pushing')
    case 'pull':
        print('pulling')
    case _:
        parser.error(f'{args.command!r} not yet implemented')

Python evaluates the match expression, and then tries each case from the top, executing the first one that matches, or the _ default case if no others match.

But here’s where the structural part comes in: the case patterns don’t just have to be literals. The patterns can also:

  • Use variable names that are set if a case matches
  • Match sequences using list or tuple syntax (like Python’s existing iterable unpacking feature)
  • Match mappings using dict syntax
  • Use * to match the rest of a list
  • Use ** to match other keys in a dict
  • Match objects and their attributes using class syntax
  • Include “or” patterns with |
  • Capture sub-patterns with as
  • Include an if “guard” clause

Wow! That’s a lot of features. Let’s see if we can use them all in one go, to see what they look like in a very contrived example (for a more gradual introduction, read the tutorial):

class Car:
    __match_args__ = ('key', 'name')
    def __init__(self, key, name):
        self.key = key
        self.name = name

expr = eval(input('Expr: '))
match expr:
    case (0, x):              # seq of 2 elems with first 0
        print(f'(0, {x})')    # (new variable x set to second elem)
    case ['a', x, 'c']:       # seq of 3 elems: 'a', anything, 'c'
        print(f"'a', {x!r}, 'c'")
    case {'foo': bar}:        # dict with key 'foo' (may have others)
        print(f"{{'foo': {bar}}}")
    case [1, 2, *rest]:       # seq of: 1, 2, ... other elements
        print(f'[1, 2, *{rest}]')
    case {'x': x, **kw}:      # dict with key 'x' (others go to kw)
        print(f"{{'x': {x}, **{kw}}}")
    case Car(key=key, name='Tesla'):  # Car with name 'Tesla' (any key)
        print(f"Car({key!r}, 'TESLA!')")
    case Car(key, name):      # similar to above, but use __match_args__
        print(f"Car({key!r}, {name!r})")
    case 1 | 'one' | 'I':     # int 1 or str 'one' or 'I'
        print('one')
    case ['a'|'b' as ab, c]:  # seq of 2 elems with first 'a' or 'b'
        print(f'{ab!r}, {c!r}')
    case (x, y) if x == y:    # seq of 2 elems with first equal to second
        print(f'({x}, {y}) with x==y')
    case _:
        print('no match')

As you can see, it’s complex but also powerful. The exact details of how matching is done are provided in the spec. Thankfully, much of the above is fairly self-explanatory, though the __match_args__ attribute requires an explanation: if position arguments are used in a class pattern, the items in the class’s __match_args__ tuple provide the names of the attributes. This is a shorthand to avoid specifying the attribute names in the class pattern.

One thing to note is that match and case are not real keywords but “soft keywords”, meaning they only operate as keywords in a match ... case block. This is by design, because people use match as a variable name all the time – I almost always use a variable named match as the result of a regex match.

Where it shines

As I mentioned, I don’t think match pays off if you’re just using it as a glorified switch. So where does it pay off?

In the tutorial PEP there are several examples that make it shine: matching commands and their arguments in a simple text-based game. I’ve merged some of those examples together and copied them below:

command = input("What are you doing next? ")
match command.split():
    case ["quit"]:
        print("Goodbye!")
        quit_game()
    case ["look"]:
        current_room.describe()
    case ["get", obj]:
        character.get(obj, current_room)
    case ["drop", *objects]:
        for obj in objects:
            character.drop(obj, current_room)
    case ["go", direction] if direction in current_room.exits:
        current_room = current_room.neighbor(direction)
    case ["go", _]:
        print("Sorry, you can't go that way")
    case _:
        print(f"Sorry, I couldn't understand {command!r}")

As a comparison, let’s do a quick rewrite of that snippet without pattern matching, old-school style. You’d almost certainly use a bunch of if ... elif blocks. I’m going to set a new variable fields to the pre-split fields, and n to the number of fields, to make the conditions simpler:

command = input("What are you doing next? ")
fields = text.split()
n = len(fields)

if fields == ["quit"]:
    print("Goodbye!")
    quit_game()
elif fields == ["look"]:
    current_room.describe()
elif n == 2 and fields[0] == "get":
    obj = fields[1]
    character.get(obj, current_room)
elif n >= 1 and fields[0] == "drop":
    objects = fields[1:]
    for obj in objects:
        character.drop(obj, current_room)
elif n == 2 and fields[0] == "go":
    direction = fields[1]
    if direction in current_room.exits:
        current_room = current_room.neighbor(direction)
    else:
        print("Sorry, you can't go that way")
else:
    print(f"Sorry, I couldn't understand {command!r}")

Apart from being a bit shorter, I think it’s fair to say that the structural matching version is easier to read, and the variable binding avoids the manual indexing like fields[1]. In pure readability terms, this example seems like a clear win for pattern matching.

The tutorial also provides examples of class-based matching, in what is presumably part of the game’s event loop:

match event.get():
    case Click((x, y), button=Button.LEFT):  # This is a left click
        handle_click_at(x, y)
    case Click():
        pass  # ignore other clicks
    case KeyPress(key_name="Q") | Quit():
        game.quit()
    case KeyPress(key_name="up arrow"):
        game.go_north()
    ...
    case KeyPress():
        pass # Ignore other keystrokes
    case other_event:
        raise ValueError(f"Unrecognized event: {other_event}")

Let’s try rewriting this one using plain if ... elif. Similar to how I handled the “go” command in the rewrite above, I’ll merge cases together where it makes sense (events of the same class):

e = event.get()
if isinstance(e, Click):
    x, y = e.position
    if e.button == Button.LEFT:
        handle_click_at(x, y)
    # ignore other clicks
elif isinstance(e, KeyPress):
    key == e.key_name
    if key == "Q":
        game.quit()
    elif key == "up arrow":
        game.go_north()
    # ignore other keystrokes
elif isinstance(e, Quit):
    game.quit()
else:
    raise ValueError(f"Unrecognized event: {e}")

To me this one seems more borderline. It’s definitely a bit nicer with pattern matching, but not by much. The match has the advantage of the cases all lining up; the if ... elif has the advantage that the event types are grouped more strongly, and we avoid repeating the types.

Despite my skepticism, I’m trying to be fair: these examples do look nice, and even if you haven’t read the pattern matching spec, it’s reasonably clear what they do – with the possible exception of the __match_args__ magic.

There’s also an expression parser and evaluator that Guido van Rossum wrote to show-case the feature. It uses match ... case heavily (11 times in a fairly small file). Here’s one example:

def eval_expr(expr):
    """Evaluate an expression and return the result."""
    match expr:
        case BinaryOp('+', left, right):
            return eval_expr(left) + eval_expr(right)
        case BinaryOp('-', left, right):
            return eval_expr(left) - eval_expr(right)
        case BinaryOp('*', left, right):
            return eval_expr(left) * eval_expr(right)
        case BinaryOp('/', left, right):
            return eval_expr(left) / eval_expr(right)
        case UnaryOp('+', arg):
            return eval_expr(arg)
        case UnaryOp('-', arg):
            return -eval_expr(arg)
        case VarExpr(name):
            raise ValueError(f"Unknown value of: {name}")
        case float() | int():
            return expr
        case _:
            raise ValueError(f"Invalid expression value: {repr(expr)}")

What would that look like with if ... elif? Once again, you’d probably structure it slightly differently, with all the BinaryOp cases together. Note that the nested if blocks don’t actually increase the indentation level, due to the double nesting required for the case clauses in a match:

def eval_expr(expr):
    """Evaluate an expression and return the result."""
    if isinstance(expr, BinaryOp):
        op, left, right = expr.op, expr.left, expr.right
        if op == '+':
            return eval_expr(left) + eval_expr(right)
        elif op == '-':
            return eval_expr(left) - eval_expr(right)
        elif op == '*':
            return eval_expr(left) * eval_expr(right)
        elif op == '/':
            return eval_expr(left) / eval_expr(right)
    elif isinstance(expr, UnaryOp):
        op, arg = expr.op, expr.arg
        if op == '+':
            return eval_expr(arg)
        elif op == '-':
            return -eval_expr(arg)
    elif isinstance(expr, VarExpr):
        raise ValueError(f"Unknown value of: {name}")
    elif isinstance(expr, (float, int)):
        return expr
    raise ValueError(f"Invalid expression value: {repr(expr)}")

It’s two more lines due to the manual attribute unpacking for the BinaryOp and UnaryOp fields. Maybe it’s just me, but I find this one just as easy to read as the match version, and more explicit.

Another place match might be useful is when validating the structure of JSON from an HTTP request (this is my own made-up example):

try:
    obj = json.loads(request.body)
except ValueError:
    raise HTTPBadRequest(f'invalid JSON: {request.body!r}')

match obj:
    case {
        'action': 'sign-in',
        'username': str(username),
        'password': str(password),
        'details': {'email': email, **other_details},
    } if username and password:
        sign_in(username, password, email=email, **other_details)
    case {'action': 'sign-out'}:
        sign_out()
    case _:
        raise HTTPBadRequest(f'invalid JSON structure: {obj}')

This is quite nice. However, one drawback is that it doesn’t give you good validation errors: ideally an API would tell the caller what fields were missing, or what types were incorrect.

Using it in my code

Let’s look at converting some existing code to use the new feature. I’m basically scanning for if ... elif blocks to see if it makes sense to convert them. I’ll start with a few examples of code I’ve written.

The first couple of examples are from pygit, a toy subset of git that’s just enough of a Git client to create a repo, commit, and push itself to GitHub (full source code).

I’ve collapsed the code blocks below by default. Just click the arrow or summary paragraph to expand. Example from find_object(). Some aspects are a bit nicer, but overall I think rewriting it to use match is over-use of the feature.Example from cat_file(), shown two different ways. Seems like a slight win.Example from argument parsing, when switching on the CLI sub-command. Makes sense to use match, but it’s a simple switch, not using the structural features.

Here are a couple more examples from Canonical’s Python Operator Framework, in pebble.py, which I wrote for work.

Example from add_layer(). It handles the various types allowed for the layer parameter. Less visual noise, though also less explicit.Example from exec(), code I’m working on now. No clearer in this case.

Maybe I just don’t write much of the kind of code that would benefit from this feature, but my guess is that there are a lot of people that fall into this camp. However, let’s scan code from a few popular Python projects to see what we can find.

Using it in other projects

Go to: Standard library | Django | Warehouse | Mercurial | Ansible

I’m going to pick examples from three different types of code: library code (from the standard library), framework code (from the Django web framework), and application code (from Warehouse, the server that powers the Python Package Index, from Mercurial, and from Ansible).

In an effort to be fair, I tried to find examples that would really benefit from match and were more than a glorified switch (there were a lot of those, but they’re not using the structural part of pattern matching, so converting them isn’t a huge win). I went looking for elif blocks which look like they test the structure of the data. There might be some good uses of match in code which just uses if without elif, but I think that would be rare.

The standard library

Python’s standard library has about 709,000 lines of code, including tests (measured using scc). The ripgrep search tool (rg --type=py 'elif ' | wc) tells us that 2529 of those lines are elif statements, or 0.4%. I realize this would find “elif ” in comments, but presumably that’s rare.

Example from ast.literal_eval(), in the _convert() helper. Not surprisingly, the first really good use case I found was in AST processing. Definitely a win.

Still, I want to find one outside the ast module. I won’t include it here, but there’s a nice long if ... elif chain in curses/textpad.py in do_command(): it’s mostly a simple switch, but it would benefit from match ... case with a few if guards.

Example from dataclasses, in _asdict_inner(). Reduces visual noise for a nice little improvement.Example from email.utils, in parsedate_tz(). Matching with tuple unpacking makes it a good bit cleaner.

Interestingly, while testing parsedate_tz() I found that this code has a bug that raises an UnboundLocalError on invalid user input: if you pass a time with more than 3 dotted segments like 12.34.56.78, the thh/tmm/tss variables won’t be defined for the following code. Check this out:

$ python3.10 -c 'import email.utils; \
    email.utils.parsedate_tz("Wed, 3 Apr 2002 12.34.56.78+0800")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python3.10/email/_parseaddr.py", line 50, in parsedate_tz
    res = _parsedate_tz(data)
  File "/usr/local/lib/python3.10/email/_parseaddr.py", line 134, in _parsedate_tz
    thh = int(thh)
UnboundLocalError: local variable 'thh' referenced before assignment

All it needs is another else: return None in the dot case. I’ve opened an issue and a pull request that adds a test case for this and fixes the bug.

Django

Django has 327,000 lines of code, including tests. Of these, there are 905 uses of elif, or 0.3%.

Example from Django admin checks, in _check_fieldsets_item(). The structural matching is great here, but doesn’t help produce good error messages.

Warehouse

Warehouse, PyPI’s server code, has 59,000 lines of Python code, including tests. There are 35 uses of elif, or 0.06%. Interestingly, that’s an order of magnitude less than either the standard library or Django, which fits with my conjecture that match won’t pay off as much in “regular” code.

Example from BigQuery syncing, in sync_bigquery_release_files(). This is the only example I found in Warehouse that (at first!) looked like it would benefit from match, but turns out it doesn’t.

It seems Warehouse wasn’t exactly crying out for match. I decided to keep it here anyway, as I think it’s a good counterpoint. Let’s find a couple more examples by skimming through two other large applications: Mercurial and Ansible.

Mercurial

Mercurial, the version control system, has 268,000 lines of Python code, including tests. There are 1941 uses of elif, or 0.7% – the highest ratio yet.

Example from context.py, in ancestor(). Small improvement using tuple unpacking.

There are quite a few cases like this where it might not be a huge win, but it is a small “quality of life” improvement for developers.

Ansible

Ansible is a widely-used configuration management system written in Python. It has 217,000 lines of Python code, including tests. There are 1594 uses of elif, which again is 0.7%. Below are a couple of cases I saw which might benefit from pattern matching.

Example from module_utils/basic.py, in _return_formatted(). Small readability improvement.Example from utils/version.py, in _Alpha.__lt__(), some version-comparison code. Type checking is a little bit nicer with match.

In all of these projects, there are many more cases that could be converted to use match, but I’ve tried to pick out a few different kinds of code where it made sense to at least try.

Some problems with the feature

As I’ve shown, pattern matching does make code clearer in a few cases, but there are a number of concerns I have with this feature. Obviously the ship has already sailed – Python 3.10 is due out in a few days! – but I think it’s valuable to consider the problems for future designs. (Python definitely doesn’t ship every feature people want: the rejected PEPs are interesting to peruse.)

There’s some trivial stuff like how match ... case requires two indentation levels: the PEP authors considered various alternatives, and I believe they chose the right route – that’s only a minor annoyance. But what about larger problems?

Learning curve and surface area. As you can see from the size of the spec PEP, there is a lot to this feature, with about 10 sub-features packed into one. Python has always been an easy-to-learn language, and this feature, while it can look good on the page, has a lot of complexity in its semantics.

Another way to do things. The Zen of Python says, “There should be one – and preferably only one – obvious way to do it.” In reality, Python has always had many different ways to do things. But now there’s one that adds a fair bit of cognitive load to developers: as shown in many of the examples, developers may often need to try both with and without match, and still be left debating which is more “obvious”.

Only useful in rarer domains. As shown above, there are cases where match really shines. But they are few and far between, mostly when handling syntax trees and writing parsers. A lot of code does have if ... elif chains, but these are often either plain switch-on-value, where elif works almost as well, or the conditions they’re testing are a more complex combination of tests that don’t fit into case patterns (unless you use awkward case _ if cond clauses, but that’s strictly worse than elif).

My hunch is that the PEP authors (Brandt Bucher and Guido van Rossum, both Python core developers) regularly write the kind of code that does benefit from pattern matching, but most application developers and script writers will need match far less often. Guido van Rossum in particular has been working on the Mypy type checker for a while, and now he’s working on speeding up CPython – compiler work no doubt involving ASTs.

Syntax works differently. There are at least two parts of this feature where syntax that looks like one thing in “normal Python” acts differently inside a pattern:

  1. Variable names: a variable in a case clause doesn’t return its value like in ordinary code, it binds it as a name. This means case RED doesn’t work as you expect – it will set a new variable named RED, not match your colour constant. To match on constants, they have to have a dot in them – so case Colors.RED works. In writing some of the code above I actually made this mistake: I wrote case ('commit' | 'tree' | 'blob', mode), expecting it to match if the tuple’s second item was equal to mode, but of course it would have set mode to the second item.
  2. Class patterns: these look like function calls, but they’re really isinstance and hasattr tests. It looks nice, but it’s sometimes confusing. It also means you can’t match on the result of an actual function call – that would have to be in an if guard.

The rationale PEP does acknowledge these syntax variations in the “Patterns” section:

Although patterns might superficially look like expressions, it is important to keep in mind that there is a clear distinction. In fact, no pattern is or contains an expression. It is more productive to think of patterns as declarative elements similar to the formal parameters in a function definition.

The __match_args__ magic. In my opinion the __match_args__ feature is too magical, and requires developers to decide which of a class’s attributes should be position-matchable, if any. It’s also strange that the __match_args__ order could be different from the order of the class’s __init__ parameters (though in practice you’d try not to do that). I can see why they’ve included this feature, as it makes the likes of AST node matching really nice, but it’s not very explicit.

Cost for other implementations. CPython is by far the most commonly-used Python interpreter, but there are also others, such as PyPy and MicroPython, that will have to decide whether or not to implement this feature. Other interpreters are always playing catch-up anyway, but a feature of this size at this stage in Python’s history will make it even harder for other implementations to keep up.

Originally I was also concerned that match’s class patterns don’t play well with Python’s use of duck typing, where you just access attributes and call methods on an object, without checking its type first (for example, when using file-like objects). With class patterns, however, you specify the type, and it performs an isinstance check. Duck typing is still possible using object(), but it would be a bit strange.

However, now that I’ve used the feature, I think this is mostly a theoretical concern – the places you’d use class patterns don’t really overlap with the places you’d use duck typing.

This duck typing concern is discussed briefly in the rationale PEP:

Paying tribute to Python’s dynamic nature with ‘duck typing’, however, we also added a more direct way to specify the presence of, or constraints on specific attributes. Instead of Node(x, y) you could also write object(left=x, right=y), effectively eliminating the isinstance() check and thus supporting any object with left and right attributes.

Wrapping up

I do like some aspects of pattern matching, and certain code is definitely cleaner with match ... case than with if ... elif. But does the feature provide enough value to justify the complexity, not to mention the cognitive burden it places on people learning Python or reading Python code?

That said, Python has always been a pragmatic programming language, not a purist’s dream. As Bjarne Stroustrup, the creator of C++, said, “There are only two kinds of languages: the ones people complain about and the ones nobody uses.” I have always liked Python, and I’ve used it successfully for many years. I’ll almost certainly continue to use it for many tasks. It’s not perfect, but if it were, no one would use it.

I’ve also been using Go a lot recently, and there’s definitely something good about how slowly the language changes (by design). Most of the release notes start with “there are no changes to the language” – for example, in Go 1.16 all the changes were in the tooling and standard library. That said, Go will have its own large new feature in a few months with generics coming in Go 1.18.

Overall I’m a bit pessimistic about structural pattern matching in Python. It’s just a big feature to add so late in the game (Python is 30 years old this year). Is the language starting to implode under its own weight?

Or, as my friend predicted, is it one of those features that will be over-used for everything for a couple of years, and then the community will settle down and only use it where it really improves the code? We shall see!


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK