Lifecycle of a Python Code - CPython's Execution Model

source link: https://www.tuicool.com/articles/hit/VZfeUj3

Motivation

  • Learn Python's internals.
  • Build cool tools with an understanding of the AST.
  • Write better, more memory-efficient code.

Pre-requirements

  • Basic understanding of how interpreters work (AST, tokenization, ...)
  • Python knowledge
  • A little bit of C
batuhan@arch-pc  ~/blogs/cpython/exec_model  python --version
Python 3.7.0

CPython's Execution Model

Python is both a compiled and an interpreted language: the Python compiler generates bytecode, and the interpreter executes that bytecode. We are going to look at these stages:

  • Parsing & Tokenization
  • Transformation To AST
  • CFG
  • Bytecode
  • Execution On CPython's Virtual Machine
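
All of these stages can be driven from Python itself. Here is a minimal sketch of the whole pipeline on one line of source (the names `src` and `ns` are just ours):

```python
import ast
import io
import tokenize

src = "x = 1 + 2\n"

# Tokenization: a stream of (type, string, position, ...) tuples
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))

# Transformation to AST
tree = ast.parse(src)

# Compilation to a code object (bytecode lives in code.co_code)
code = compile(tree, "<demo>", "exec")

# Execution on the virtual machine
ns = {}
exec(code, ns)
print(ns["x"])  # 3
```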

Parsing & Tokenization

Grammar

The grammar defines the syntax of Python. It is also what specifies how a parser should parse the language.

Python's grammar is written in an EBNF-like notation, and CPython has its own translator from this grammar to parser source code.

You can find the grammar file in cpython/Grammar/Grammar

An example from the grammar:

batuhan@arch-pc  ~/cpython/Grammar   master  grep "simple_stmt" Grammar | head -n3 | tail -n1
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

A simple statement contains one or more small statements, optionally separated by semicolons (handy for one-liners :) ), and finishes with a NEWLINE. A small statement can be an expression statement, an import, and so on.

Some statements,

small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
...
del_stmt: 'del' exprlist
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'

Tokenization ( Lexical Scanning )

Tokenization is the process of taking a stream of text and splitting it into tokens that are meaningful to the interpreter, along with some extra metadata (such as where each token starts and ends, and its string value).

Tokens

Tokens are defined in a header file ( Include/token.h ). Python currently has 59 types of tokens.

Some tokens;

#define NAME            1
#define NUMBER          2
#define STRING          3
#define NEWLINE         4
#define INDENT          5
#define DEDENT          6

You can see these tokens when you tokenize some Python code.

Tokenize with CLI

Here is our tests.py :

def greetings(name: str):
    print(f"Hello {name}")

greetings("Batuhan")

And when we tokenize it with python -m tokenize , this is the output:

batuhan@arch-pc  ~/blogs/cpython/exec_model  python -m tokenize tests.py
0,0-0,0:            ENCODING       'utf-8'        
...       
2,0-2,4:            INDENT         '    '         
2,4-2,9:            NAME           'print'        
...        
4,0-4,0:            DEDENT         ''             
...        
5,0-5,0:            ENDMARKER      ''

The numbers (e.g. 2,0-2,4 ) represent where each token starts and ends. NAME , OP , STRING and the like are the token types, and 'print' , '(' and so on are their values.

The ENCODING token is always the first token we get. It tells the encoding of the source file, and if there is a problem with that encoding, the interpreter raises an exception.
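
You can see that encoding detection in action with the stdlib helper tokenize.detect_encoding , which reads the coding cookie (or a BOM) before any other token is produced (the source bytes here are just an example):

```python
import io
import tokenize

# A source file with an explicit coding cookie on the first line
src = b"# -*- coding: utf-8 -*-\nx = 1\n"

# detect_encoding reads at most two lines and returns the encoding
# plus the raw lines it consumed
encoding, first_lines = tokenize.detect_encoding(io.BytesIO(src).readline)
print(encoding)  # utf-8
```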

Tokenize with tokenize.tokenize

If you want to tokenize a file inside your own code, you can use the tokenize module (in the stdlib).

>>> import tokenize
>>> with open('tests.py', 'rb') as f:             
...     for token in tokenize.tokenize(f.readline):
...        print(token)
... 
TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
...
TokenInfo(type=1 (NAME), string='greetings', start=(1, 4), end=(1, 13), line='def greetings(name: str):\n')
...
TokenInfo(type=53 (OP), string=':', start=(1, 24), end=(1, 25), line='def greetings(name: str):\n')
...

The type is the token id from the token.h header file. The string is the value of the token. start and end represent where the token started and ended.

In other languages, scopes are delimited with curly braces or begin/end statements, but in Python we use indentation for scopes and their levels. The INDENT and DEDENT tokens represent indentation. These tokens are really important for constructing the parse tree and abstract syntax tree.
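
A quick sketch with the stdlib tokenize module shows those INDENT / DEDENT tokens appearing around an indented block:

```python
import io
import tokenize

src = "if True:\n    x = 1\n"

# Keep only the indentation tokens from the stream
toks = [tok for tok in tokenize.generate_tokens(io.StringIO(src).readline)
        if tok.type in (tokenize.INDENT, tokenize.DEDENT)]

for tok in toks:
    print(tokenize.tok_name[tok.type], repr(tok.string))
# INDENT '    '
# DEDENT ''
```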

Parser Generation

Parser generation is the process that generates a parser from a given grammar. CPython's parser generator is known as pgen.

In the CPython repository there are two parser generators: the original one ( Parser/pgen ) and a rewritten one in Python ( Lib/lib2to3/pgen2 ).

"The original was actually the first code I wrote for Python".

  • Guido

We can tell from the quote that pgen is one of the most important pieces of a programming language implementation.

When you run pgen (in Parser/pgen ), it generates a header file and a C file (the parser itself).

If you look at the generated C code it might not make sense to you (it is not supposed to). It just contains a lot of static data and structs. I'll try to explain its basic components.

DFA

The parser defines one DFA for every non-terminal. Every DFA is defined as an array of states.

static dfa dfas[87] = {
    {256, "single_input", 0, 3, states_0,
"\004\050\340\000\002\000\000\000\012\076\011\007\262\004\020\002\000\300\220\050\037\202\000"},
    ...
    {342, "yield_arg", 0, 3, states_86,
     "\000\040\200\000\002\000\000\000\000\040\010\000\000\000\020\002\000\300\220\050\037\000\000"},
};

For each DFA we also have the start set, which shows which tokens the specific non-terminal can start with.

State

Every state is defined as an array of arcs.

static state states_86[3] = {
    {2, arcs_86_0},
    {1, arcs_86_1},
    {1, arcs_86_2},
};

Arc

Every arc has two numbers. The first number is the arc label; it refers to a symbol number. The second number is the destination state.

static arc arcs_86_0[2] = {
    {77, 1},
    {47, 2},
};
static arc arcs_86_1[1] = {
    {26, 2},
};
static arc arcs_86_2[1] = {
    {0, 2},
};

Labels

It is a special table which converts symbol ids into arc labels.

static label labels[177] = {
    {0, "EMPTY"},
    {256, 0},
    {4, 0},
    ...
    {1, "async"},
    {1, "def"},
    ...
    {1, "yield"},
    {342, 0},
};

The number 1 corresponds to all NAME tokens (identifiers and keywords). That way each keyword gets a different arc label even though they all share the same symbol id.
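
To make the dfas / states / arcs tables less abstract, here is a toy model in Python. This is NOT CPython's real data: the labels and the mini-grammar are made up, and a label of None plays the role of the {0, ...} accepting arc:

```python
# Toy pgen-style tables: a state is a list of arcs, and an arc is a
# (label, destination) pair. Labels here are strings for readability;
# CPython uses numeric arc labels instead.
states = [
    [("if", 1)],                    # state 0: only 'if' can start
    [("test", 2)],                  # state 1: expects a test
    [(":", 3)],                     # state 2: expects a colon
    [("suite", 4)],                 # state 3: expects a suite
    [("elif-part", 1), (None, 4)],  # state 4: loop back, or accept
]

def matches(tokens):
    """Follow the arcs; succeed if we end in an accepting state."""
    state = 0
    for tok in tokens:
        for label, dest in states[state]:
            if label == tok:
                state = dest
                break
        else:
            return False  # no arc matched this token
    return any(label is None for label, _ in states[state])

print(matches(["if", "test", ":", "suite"]))  # True
print(matches(["if", ":", "suite"]))          # False
```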

Simple Example

We have code that prints 'Hi' if 1 is truthy.

if 1: print('Hi')

According to the parser, the code is a single_input .

single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE

Our parse tree starts with just a root node, single_input ;

[figure: parse tree with a single root node, single_input]

And our DFA (single_input) number is 0;

static dfa dfas[87] = {
    {256, "single_input", 0, 3, states_0,
"\004\050\340\000\002\000\000\000\012\076\011\007\262\004\020\002\000\300\220\050\037\202\000"}
    ...
}

And it looks like this;

static arc arcs_0_0[3] = {
    {2, 1},
    {3, 1},
    {4, 2},
};
static arc arcs_0_1[1] = {
    {0, 1},
};
static arc arcs_0_2[1] = {
    {2, 1},
};
static state states_0[3] = {
    {3, arcs_0_0},
    {1, arcs_0_1},
    {1, arcs_0_2},
};

arcs_0_0 is made of three arcs. The first one is NEWLINE , the second one is simple_stmt and the last one is compound_stmt .

According to the start set of simple_stmt , a simple_stmt cannot start with if , and if is not a NEWLINE either, but a compound_stmt might begin with if . So we transition along the last arc ( {4, 2} ), add the compound_stmt node to the parse tree, and before we actually transition to state 2 we switch to a new DFA.

Updated parse tree;

[figure: parse tree with compound_stmt added under single_input]

The new DFA parses compound statements.

compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt

It looks like this;

static arc arcs_39_0[9] = {
    {91, 1},
    {92, 1},
    {93, 1},
    {94, 1},
    {95, 1},
    {19, 1},
    {18, 1},
    {17, 1},
    {96, 1},
};
static arc arcs_39_1[1] = {
    {0, 1},
};
static state states_39[2] = {
    {9, arcs_39_0},
    {1, arcs_39_1},
};

Only the first arc can begin with an if statement.

Updated parse tree with if
[figure: parse tree with if_stmt added under compound_stmt]

Then we switch our DFA again, the next DFA parses if statements.

if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]

It is a long one. In the initial state we have only one arc, with label 97. That corresponds to the if keyword.

static arc arcs_41_0[1] = {
    {97, 1},
};
static arc arcs_41_1[1] = {
    {26, 2},
};
static arc arcs_41_2[1] = {
    {27, 3},
};
static arc arcs_41_3[1] = {
    {28, 4},
};
static arc arcs_41_4[3] = {
    {98, 1},
    {99, 5},
    {0, 4},
};
static arc arcs_41_5[1] = {
    {27, 6},
};
static arc arcs_41_6[1] = {
    {28, 7},
};
...

Then we switch to the next state (staying in the same DFA). The next one ( arcs_41_1 ) has only one arc too, but this one is for the test non-terminal, which might begin with a number (like 1 ).

[figure: parse tree with the test (1) added under if_stmt]

In arcs_41_2 there is only one arc, which consumes the colon token ( : ).

[figure: parse tree with the colon added under if_stmt]

Then we have a suite. A suite can start with a print statement.

Finally we reach arcs_41_4 . Its first arc represents elif , the second one represents else , and the third one stands for the final state.

[figure: the completed parse tree for if 1: print('Hi')]

Basically that is it.

Python Interface For Parser

Python lets you work with the parse tree of Python code from Python itself, via a module called parser .

>>> import parser
>>> code = "if 1: print('Hi')"

We can generate a CST (Concrete Syntax Tree) with parser.suite

>>> parser.suite(code).tolist()
[257, [269, [295, [297, [1, 'if'], [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [2, '1']]]]]]]]]]]]]]]], [11, ':'], [304, [270, [271, [272, [274, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [1, 'print']], [326, [7, '('], [334, [335, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [3, "'Hi'"]]]]]]]]]]]]]]]]]], [8, ')']]]]]]]]]]]]]]]]]]], [4, '']]]]]], [4, ''], [0, '']]

Its output is a nested list. Every list in the structure starts with a symbol id (below 256 for terminals, 256 and above for non-terminals); for terminals, the second element is the token string.

The documentation of the parser module says no one should use it; use the ast module instead. But I wanted to show that this bridge module to the parser exists.

Abstract Syntax Tree

Basically, an AST is a structural representation of your source code in tree format, where every node in the tree represents a different kind of language construct (e.g. an expression, a statement, a variable, a literal).

What is a Tree?

A tree is a data structure that has a root node as a starting point. The root node can branch down to other nodes, and each node (except the root) has a single, unique parent.
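
A minimal sketch of such a tree in Python (a hypothetical Node class, not anything from CPython):

```python
# Each node holds a label and a list of children; the unique-parent
# property comes from each node being referenced by only one parent.
class Node:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

# The root branches down to other nodes, like our parse tree did:
root = Node("single_input",
            [Node("compound_stmt", [Node("if_stmt", [])])])

print(root.children[0].children[0].label)  # if_stmt
```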

What is the difference between an AST and a Parse Tree?

[figure: parse tree (right) and AST (left) for the expression 2*7+3]

On the right is the parse tree of the expression 2*7+3 . It is basically a one-to-one mapping of your source code into tree form: you can see every expression, term and factor. But it is a complicated and verbose tree for such a basic piece of code, so we can remove all the syntactic detail that is not needed to describe the code and simplify it.

The simplified version is called the AST. The one on the left is the AST for the same code; we've left all that syntactic scaffolding behind.

Note that when you transform a parse tree into an AST, you lose some information.
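
We can observe that information loss with the stdlib ast module: parentheses are pure syntax, so they disappear in the AST entirely:

```python
import ast

# Two sources that differ only in syntax (parentheses, spacing)
a = ast.dump(ast.parse("(1 + 2)"))
b = ast.dump(ast.parse("1+2"))

# Their ASTs are identical; the parse trees would not be
print(a == b)  # True
```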

Example

Python offers a built-in ast module for working with ASTs.

You can use third-party modules like Astor to generate source code back from the AST.

We have the same code: print Hi if 1 is truthy.

>>> import ast
>>> code = "if 1: print('Hi')"

Let's use the ast.parse function to get our AST.

>>> tree = ast.parse(code)
>>> tree
<_ast.Module object at 0x7f37651d76d8>

We can see a readable form of the AST with the ast.dump function.

>>> ast.dump(tree)
"Module(body=[If(test=Num(n=1), body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Str(s='Hi')], keywords=[]))], orelse=[])])"

It is way better than Parse Tree :D
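
Once you have an AST, the ast module's visitor pattern makes it easy to build small tools on top of it. A sketch that collects the names of all called functions (the CallFinder class is ours):

```python
import ast

class CallFinder(ast.NodeVisitor):
    """Walk the tree and record every call to a plain name."""

    def __init__(self):
        self.calls = []

    def visit_Call(self, node):
        if isinstance(node.func, ast.Name):
            self.calls.append(node.func.id)
        # keep walking into child nodes (arguments may contain calls)
        self.generic_visit(node)

finder = CallFinder()
finder.visit(ast.parse("if 1: print('Hi')"))
print(finder.calls)  # ['print']
```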

Control Flow Graphs

A CFG is a directed graph that models the flow of a program using basic blocks, each containing a chunk of the intermediate representation.

CFGs are usually one step away from final code output. Code is directly generated from the basic blocks (with jump targets adjusted based on the output order) by doing a post-order depth-first search on the CFG following the edges.
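
We can't inspect CPython's CFG directly from Python, but jump targets in the final bytecode mark offsets where basic blocks began. A rough sketch using the stdlib dis module (the helper name block_starts is ours):

```python
import dis

def block_starts(func):
    """Offsets where basic blocks begin: entry plus all jump targets."""
    starts = {0}
    for ins in dis.get_instructions(func):
        if ins.is_jump_target:
            starts.add(ins.offset)
    return sorted(starts)

def f(a):
    if a % 2 == 0:   # the branch creates extra basic blocks
        return 0
    return a * 2

print(block_starts(f))  # entry offset 0 plus at least one jump target
```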

Best Part - The Bytecode

Python bytecode is an intermediate set of instructions that runs on CPython's virtual machine.

When you import a module, Python creates a directory called __pycache__ and stores the compiled code there, to save time when the module is used again. That cached file contains Python bytecode.

A simple Python function in func.py :

def say_hello(name: str):
    print("Hello", name)
    print(f"How are you {name}?")

When I call it in the REPL, it prints;

>>> from func import say_hello
>>> say_hello("Batuhan")
Hello Batuhan
How are you Batuhan?

It is a function object:

>>> type(say_hello)
<class 'function'>

And it has a __code__ attribute, which is a Python code object:

>>> say_hello.__code__
<code object say_hello at 0x7f13777d9150, file "/home/batuhan/blogs/cpython/exec_model/func.py", line 1>

Code Objects

Code objects are immutable data structures that contain the instructions and metadata needed to run the code. Simply put, a code object is a representation of a piece of Python code.

You can also get code objects by compiling ASTs with the built-in compile function.

>>> import ast
>>> code = """
... def say_hello(name: str):
...     print("Hello", name)
...     print(f"How are you {name}?")
... say_hello("Batuhan")
... """
>>> tree = ast.parse(code, mode='exec')
>>> code = compile(tree, '<string>', mode='exec')
>>> type(code)
<class 'code'>
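
The resulting code object can then be handed to exec , which runs the module body. A small sketch (the source string and the namespace dict are just examples):

```python
import ast

# Parse and compile, exactly as above
source = "def greet(name):\n    return 'Hello ' + name\n"
tree = ast.parse(source, mode='exec')
code = compile(tree, '<string>', mode='exec')

# Executing the module body defines greet inside `namespace`
namespace = {}
exec(code, namespace)
print(namespace['greet']('Batuhan'))  # Hello Batuhan
```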

Every code object has attributes that hold information about it (they are prefixed with co_ ). I'll just mention a few of them.

co_name

The first one is the name. For a function it is the name of the function; for a class it is the name of the class. In the AST example we compiled a module, so we can't actually tell the compiler what we want it to be called, and it's just called <module> .

>>> code.co_name
'<module>'
>>> say_hello.__code__.co_name
'say_hello'

co_varnames

It is a tuple that contains all your local variables (including arguments).

>>> say_hello.__code__.co_varnames
('name',)
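
Arguments come first in co_varnames , then the other locals in order of first appearance. A quick illustration (the function f is just an example):

```python
def f(a, b):
    total = a + b   # first (and only) non-argument local
    return total

# Arguments a, b first, then the local total
print(f.__code__.co_varnames)  # ('a', 'b', 'total')
```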

co_consts

A tuple that contains all of the literals and constant values referenced in the body of our function.

>>> say_hello.__code__.co_consts
(None, 'Hello', 'How are you ', '?')

The only weird thing is the None among the values; we didn't put a None into our function body. Python puts it there because if the function finishes executing without reaching a return statement, it returns None .
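
You can see that implicit return with an empty function: None shows up in co_consts even though the body never mentions it:

```python
import dis

def noop():
    pass

# None is in co_consts because the compiler emits a "return None"
print(noop.__code__.co_consts)  # (None,)

# The disassembly ends with a return of that None constant
dis.dis(noop)
```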

The Bytecode (co_code)

This is the bytecode of our function.

>>> bytecode = say_hello.__code__.co_code
>>> bytecode
b't\x00d\x01|\x00\x83\x02\x01\x00t\x00d\x02|\x00\x9b\x00d\x03\x9d\x03\x83\x01\x01\x00d\x00S\x00'

This is not a string; it is a bytes object, a sequence of bytes.

>>> type(bytecode)
<class 'bytes'>

The first byte prints as the character 't' . When I look at its decimal value:

>>> ord('t')
116

It is 116, the same as bytecode[0] :

>>> assert bytecode[0] == ord('t')

The value 116 doesn't mean anything by itself. Fortunately Python offers a standard library module called dis (short for disassembler). Its opname list can help: it maps each opcode to the name of the corresponding instruction.

>>> import dis
>>> dis.opname[ord('t')]
'LOAD_GLOBAL'

Let's create another function that makes more sense.

>>> def multiple_it(a: int):
...     if a % 2 == 0:
...         return 0
...     return a * 2
... 
>>> multiple_it(6)
0
>>> multiple_it(7)
14
>>> bytecode = multiple_it.__code__.co_code

Let's check what the first instruction of this bytecode is.

>>> dis.opname[bytecode[0]]
'LOAD_FAST'

The LOAD_FAST instruction is made of two bytes: the opcode and its argument (the second byte).

>>> dis.opname[bytecode[0]], bytecode[1]
('LOAD_FAST', 0)

LOAD_FAST 0 is the instruction for: look up the local variable names tuple ( co_varnames ) and push the value of the variable at index 0 onto the evaluation stack.

>>> multiple_it.__code__.co_varnames[bytecode[1]]
'a'

Let's disassemble our code with dis.dis and turn the bytecode into a human-readable form.

>>> dis.dis(multiple_it)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_CONST               1 (2)
              4 BINARY_MODULO
              6 LOAD_CONST               2 (0)
              8 COMPARE_OP               2 (==)
             10 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               2 (0)
             14 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (a)
             18 LOAD_CONST               1 (2)
             20 BINARY_MULTIPLY
             22 RETURN_VALUE

The numbers in the first column (2, 3, 4) stand for source line numbers. Each line of Python code is turned into multiple instructions.

Execution Time

CPython's virtual machine is stack-oriented rather than register-based. This means Python code is compiled for an imaginary (virtual) computer with a stack architecture.

Each time you make a function call, you push a call frame onto Python's call stack. When a return happens, that call frame gets popped off the stack.

In addition, each function call uses two more stacks: the evaluation stack and the block stack. Most of the work happens on the evaluation stack, while the block stack tracks the currently active blocks and other things related to blocks and scopes.
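
We can peek at that call stack from Python with the stdlib inspect module (the inner / outer functions are just examples):

```python
import inspect

def inner():
    # Each frame on the live call stack knows the function it belongs to
    return [frame.function for frame in inspect.stack()[:2]]

def outer():
    return inner()   # calling inner pushes another frame

print(outer())  # ['inner', 'outer']
```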

