Lifecycle of Python Code - CPython's Execution Model
source link: https://www.tuicool.com/articles/hit/VZfeUj3
Motivation
- Learn Python's internals.
- Make cool stuff with an understanding of the AST.
- Write better, more memory-efficient code.
Pre-requirements
- Basic understanding of how interpreters work (AST, tokenization, ...)
- Python knowledge
- A little bit of C
batuhan@arch-pc ~/blogs/cpython/exec_model python --version
Python 3.7.0
CPython's Execution Model
Python is both a compiled and an interpreted language. The Python compiler generates bytecode, and the interpreter executes that bytecode. We are going to look at these stages:
- Parsing & Tokenization
- Transformation To AST
- CFG
- Bytecode
- Execution On CPython's Virtual Machine
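Before diving into each stage, here is the whole pipeline in miniature, sketched with Python's own built-ins (ast.parse, compile, exec). The file name "<example>" is just a placeholder:

```python
import ast

# The whole pipeline in miniature: source text -> AST -> code object -> execution.
source = "x = 2 * 7 + 3"

tree = ast.parse(source)                   # parsing & transformation to AST
code = compile(tree, "<example>", "exec")  # compilation down to bytecode
namespace = {}
exec(code, namespace)                      # execution on the virtual machine

print(namespace["x"])  # → 17
```

Each of the sections below zooms into one of these steps.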
Parsing & Tokenization
Grammar
The grammar defines the syntax of Python. It also specifies how a parser should parse the language. Python's grammar uses an EBNF-like syntax, and CPython has its own grammar-to-parser translator.
You can find the grammar file in cpython/Grammar/Grammar
An example of grammar,
batuhan@arch-pc ~/cpython/Grammar master grep "simple_stmt" Grammar | head -n3 | tail -n1
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
A simple statement contains a small statement, then it might take semicolons (for one-liners :) ), and it finishes with a newline. A small statement is one of a list of expressions, imports, and so on.
Some statements,
small_stmt: (expr_stmt | del_stmt | pass_stmt | flow_stmt |
             import_stmt | global_stmt | nonlocal_stmt | assert_stmt)
...
del_stmt: 'del' exprlist
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'
Tokenization ( Lexical Scanning )
Tokenization is the process of taking a stream of text and breaking it into words that are meaningful to the interpreter (tokens), with some extra metadata (such as where the token starts, where it ends, and its string value).
Tokens
Tokens are defined in a header file, Include/token.h. Python currently has 59 types of tokens.
Some tokens;
#define NAME 1
#define NUMBER 2
#define STRING 3
#define NEWLINE 4
#define INDENT 5
#define DEDENT 6
You can see these tokens when you tokenize some Python code.
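The stdlib token module exposes these same numeric ids from Python, along with a tok_name table that maps them back to their names. A quick sketch:

```python
import token

# tok_name maps the numeric token ids from Include/token.h back to their names.
print(token.tok_name[token.NAME])     # → NAME
print(token.tok_name[token.NEWLINE])  # → NEWLINE

# The ids match the #define values from the header.
print(token.NAME, token.NUMBER)       # → 1 2
```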
Tokenize with CLI
Here is our tests.py:

def greetings(name: str):
    print(f"Hello {name}")

greetings("Batuhan")
And when we tokenize it with python -m tokenize, this is the output:
batuhan@arch-pc ~/blogs/cpython/exec_model python -m tokenize tests.py
0,0-0,0:            ENCODING       'utf-8'
...
2,0-2,4:            INDENT         '    '
2,4-2,9:            NAME           'print'
...
4,0-4,0:            DEDENT         ''
...
5,0-5,0:            ENDMARKER      ''
The numbers (e.g. 2,4-2,9) represent where each token starts and where it ends. Things like NAME, OP, and STRING are the token types, and things like def, greetings, and ( are their values.
The encoding token is always the first token we get. It tells the encoding of the source file, and if there is a problem with the encoding, the interpreter raises an exception.
Tokenize with tokenize.tokenize
If you want to tokenize a file inside your code, you can use the tokenize module (in the stdlib).
>>> with open('tests.py', 'rb') as f:
...     for token in tokenize.tokenize(f.readline):
...         print(token)
...
TokenInfo(type=57 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
...
TokenInfo(type=1 (NAME), string='greetings', start=(1, 4), end=(1, 13), line='def greetings(name: str):\n')
...
TokenInfo(type=53 (OP), string=':', start=(1, 24), end=(1, 25), line='def greetings(name: str):\n')
...
The type is the token id from the token.h header file. The string is the value of the token. Start and end represent where the token starts and where it ends.
In other languages scopes are delimited with curly braces or begin/end statements, but in Python we use indentation to mark scopes and their levels. The INDENT and DEDENT tokens represent indentation. These tokens are really important for constructing parse trees and abstract syntax trees.
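You can see the INDENT/DEDENT pair by tokenizing a tiny indented block (a sketch using tokenize.generate_tokens from the stdlib):

```python
import io
import tokenize

# Tokenizing an indented block shows the INDENT/DEDENT pair marking the scope.
source = "if True:\n    x = 1\n"
tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))
names = [tokenize.tok_name[tok.type] for tok in tokens]
print(names)
```

INDENT appears right after the NEWLINE that ends the `if` header, and DEDENT closes the block at the end.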
Parser Generation
Parser generation is the process of generating a parser from a given grammar. CPython's parser generator is known as pgen.
In the CPython repository there are 2 parser generators. One is the original (Parser/pgen) and the other is a rewrite in Python (Lib/lib2to3/pgen2).
"The original was actually the first code I wrote for Python".
- Guido
We can understand from the quote that pgen is one of the most important parts of a programming language's implementation.
When you run pgen (in Parser/pgen) it generates a header file and a C file (the parser itself).
If you look at the generated C code it might not make sense to you (because it is not supposed to). It just contains a lot of static data and structs. I'll try to explain its basic components.
DFA
The parser defines one DFA for every non-terminal. Every DFA is defined as an array of states.
static dfa dfas[87] = {
    {256, "single_input", 0, 3, states_0,
     "\004\050\340\000\002\000\000\000\012\076\011\007\262\004\020\002\000\300\220\050\037\202\000"},
    ...
    {342, "yield_arg", 0, 3, states_86,
     "\000\040\200\000\002\000\000\000\000\040\010\000\000\000\020\002\000\300\220\050\037\000\000"},
};
For each DFA we also have a start set, which shows which tokens the non-terminal can start with.
State
Every state is defined as an array of arcs.
static state states_86[3] = {
    {2, arcs_86_0},
    {1, arcs_86_1},
    {1, arcs_86_2},
};
Arc
And every arc has 2 numbers. The first number is the arc label; it refers to a symbol number. The second number is the destination state.
static arc arcs_86_0[2] = {
    {77, 1},
    {47, 2},
};
static arc arcs_86_1[1] = {
    {26, 2},
};
static arc arcs_86_2[1] = {
    {0, 2},
};
Labels
Labels form a special table which converts symbol ids into arc labels.
static label labels[177] = {
    {0, "EMPTY"},
    {256, 0},
    {4, 0},
    ...
    {1, "async"},
    {1, "def"},
    ...
    {1, "yield"},
    {342, 0},
};
The number 1 corresponds to all identifiers, so each keyword gets a different arc label even though they all share the same symbol id.
Simple Example
We have code that prints 'Hi' if 1 is truthy.

if 1: print('Hi')
According to the parser, the code is a single_input.
single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
Our parse tree starts with just a root node of single_input.
And our DFA (single_input) number is 0:

static dfa dfas[87] = {
    {256, "single_input", 0, 3, states_0,
     "\004\050\340\000\002\000\000\000\012\076\011\007\262\004\020\002\000\300\220\050\037\202\000"},
    ...
};
And it looks like this:

static arc arcs_0_0[3] = {
    {2, 1},
    {3, 1},
    {4, 2},
};
static arc arcs_0_1[1] = {
    {0, 1},
};
static arc arcs_0_2[1] = {
    {2, 1},
};
static state states_0[3] = {
    {3, arcs_0_0},
    {1, arcs_0_1},
    {1, arcs_0_2},
};
arcs_0_0 is made of 3 elements. The first one is NEWLINE, the second one is simple_stmt, and the last one is compound_stmt.
According to the start set of simple_stmt, a simple_stmt cannot start with if, and if is not a NEWLINE either, but a compound_stmt might begin with if. So we transition along the last arc ({4, 2}), add the compound_stmt node to the parse tree, and before we actually transition to state 2 we switch to a new DFA.
Updated parse tree;
The new DFA parses compound statements.
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | with_stmt | funcdef | classdef | decorated | async_stmt
It looks like this:

static arc arcs_39_0[9] = {
    {91, 1},
    {92, 1},
    {93, 1},
    {94, 1},
    {95, 1},
    {19, 1},
    {18, 1},
    {17, 1},
    {96, 1},
};
static arc arcs_39_1[1] = {
    {0, 1},
};
static state states_39[2] = {
    {9, arcs_39_0},
    {1, arcs_39_1},
};
Only the first arc can begin with an if statement.
Updated parse tree with if. Then we switch our DFA again; the next DFA parses if statements.
if_stmt: 'if' test ':' suite ('elif' test ':' suite)* ['else' ':' suite]
It is a long one. In the initial state we have only one arc, with label 97. That corresponds to the if token.
static arc arcs_41_0[1] = {
    {97, 1},
};
static arc arcs_41_1[1] = {
    {26, 2},
};
static arc arcs_41_2[1] = {
    {27, 3},
};
static arc arcs_41_3[1] = {
    {28, 4},
};
static arc arcs_41_4[3] = {
    {98, 1},
    {99, 5},
    {0, 4},
};
static arc arcs_41_5[1] = {
    {27, 6},
};
static arc arcs_41_6[1] = {
    {28, 7},
};
...
Then we switch to the next state (staying in the same DFA). The next one (arcs_41_1) has only one arc too, but this one is for the test non-terminal, which might begin with a number (like 1).
In arcs_41_2 there is only one arc, which consumes the colon token (:).
Then we have a suite. A suite can start with a print statement.
Finally we have arcs_41_4. The first arc is for the elif clause, the second one represents else, and the third one stands for the final state.
Basically that is it.
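The walk above can be sketched in plain Python. This is a toy model, not CPython's real tables (the labels here are symbolic strings, while CPython uses numeric labels), but it shows how a pgen-style parser follows arcs through DFA states:

```python
# Toy model of the single_input DFA:
#   single_input: NEWLINE | simple_stmt | compound_stmt NEWLINE
# Each state maps to a list of (label, destination) arcs.
states = {
    0: [("NEWLINE", 1), ("simple_stmt", 1), ("compound_stmt", 2)],
    1: [],                    # accepting state (the {0, n} arc in CPython)
    2: [("NEWLINE", 1)],
}
accepting = {1}

def accepts(symbols):
    state = 0
    for sym in symbols:
        for label, dest in states[state]:
            if label == sym:
                state = dest
                break
        else:
            return False      # no arc matches: syntax error
    return state in accepting

print(accepts(["compound_stmt", "NEWLINE"]))  # → True
print(accepts(["compound_stmt"]))             # → False
```

Our `if 1: print('Hi')` example takes the compound_stmt arc and then needs a trailing NEWLINE to reach the accepting state, exactly as in the C tables above.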
Python Interface For Parser
Python lets you inspect the parse tree of a Python expression, from Python, via a module called parser.
>>> import parser
>>> code = "if 1: print('Hi')"
We can generate a CST (Concrete Syntax Tree) with parser.suite
>>> parser.suite(code).tolist() [257, [269, [295, [297, [1, 'if'], [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [2, '1']]]]]]]]]]]]]]]], [11, ':'], [304, [270, [271, [272, [274, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [1, 'print']], [326, [7, '('], [334, [335, [305, [309, [310, [311, [312, [315, [316, [317, [318, [319, [320, [321, [322, [323, [324, [3, "'Hi'"]]]]]]]]]]]]]]]]]], [8, ')']]]]]]]]]]]]]]]]]]], [4, '']]]]]], [4, ''], [0, '']]
Its output is a nested list. The first element of every list is a symbol id (below 256 for terminals, 256 and above for non-terminals), and for terminals the second element is the token string.
The documentation of the parser module says no one should use it; use the ast module instead. But I wanted to show that there is a bridge module for the parser.
Abstract Syntax Tree
Basically, an AST is a structural representation of your source code in a tree format, where every node in the tree represents a different type of language construct (e.g. an expression, a statement, a variable, a literal).
What is a Tree?
A tree is a data structure that has a root node as a starting point. The root node can branch down to other nodes, and each node (except the root node) has a single, unique parent.
What is the difference between an AST and a Parse Tree?
On the right is a parse tree of the 2*7+3 expression. It is basically a one-to-one mapping of your source code into tree form. You can see all the expressions, terms, and factors, but it is a very complicated tree for such a basic piece of code. So we can remove all the syntactical details that we don't need to describe the code, and simplify it.
The simplified version is called an AST. On the left is the AST for the same code; we've left behind all those syntactical details.
When you transform your parse tree to an AST you lose some information.
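As a sketch, the simplified tree of 2*7+3 can be modeled with a minimal, hypothetical Node class (this is not CPython's real node type, just an illustration of the tree shape):

```python
# A minimal, hypothetical tree node: a value plus a list of children.
class Node:
    def __init__(self, value, children=None):
        self.value = value
        self.children = children or []

def depth(node):
    # A leaf has depth 1; otherwise 1 + the deepest child.
    if not node.children:
        return 1
    return 1 + max(depth(child) for child in node.children)

# The AST of 2*7+3: '+' at the root, '*' below it, numbers at the leaves.
tree = Node("+", [Node("*", [Node(2), Node(7)]), Node(3)])
print(depth(tree))  # → 3
```

The AST is only 3 levels deep; the full parse tree of the same expression has a level for every grammar rule it passed through.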
Example
Python offers a built-in ast module for working with ASTs.
You can use 3rd-party modules like Astor to generate source code back from the AST.
We have the same code: print 'Hi' if 1 is truthy.
>>> import ast
>>> code = "if 1: print('Hi')"
Let's use the ast.parse method to get our AST.
>>> tree = ast.parse(code)
>>> tree
<_ast.Module object at 0x7f37651d76d8>
We can see a readable AST with the ast.dump method.
>>> ast.dump(tree) "Module(body=[If(test=Num(n=1), body=[Expr(value=Call(func=Name(id='print', ctx=Load()), args=[Str(s='Hi')], keywords=[]))], orelse=[])])"
It is way better than the parse tree :D
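Since the AST is a real tree of node objects, we can walk it programmatically with ast.walk and see which constructs it contains (node class names vary slightly between Python versions, e.g. Num vs Constant, so we only look at the stable ones):

```python
import ast

code = "if 1: print('Hi')"
tree = ast.parse(code)

# ast.walk yields every node in the tree, breadth-first.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)
```

This is the starting point for "cool stuff with the AST": linters, code rewriters, and source-to-source tools are all built on walks like this.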
Control Flow Graphs
CFGs are directed graphs that model the flow of a program using basic blocks that contain the intermediate representation.
CFGs are usually one step away from the final code output: code is generated directly from the basic blocks (with jump targets adjusted based on the output order) by doing a post-order depth-first search on the CFG, following the edges.
Best Part - The Bytecode
Python bytecode is an intermediate set of instructions that runs on Python's virtual machine implementation.
When you run source code, Python creates a directory called __pycache__ and stores your compiled code there to save time when it is run again. That compiled code is Python bytecode.
A simple Python function in func.py:

def say_hello(name: str):
    print("Hello", name)
    print(f"How are you {name}?")
When I call it in the REPL it prints:
>>> from func import say_hello
>>> say_hello("Batuhan")
Hello Batuhan
How are you Batuhan?
It is a function-type object:
>>> type(say_hello)
<class 'function'>
And it has a __code__ attribute, which is a Python code object:
>>> say_hello.__code__
<code object say_hello at 0x7f13777d9150, file "/home/batuhan/blogs/cpython/exec_model/func.py", line 1>
Code Objects
Code objects are immutable data structures that contain the instructions and metadata needed to run the code. Simply put, a code object is a representation of a piece of Python code.
You can also get code objects by compiling ASTs via the compile built-in.
>>> import ast
>>> code = """
... def say_hello(name: str):
...     print("Hello", name)
...     print(f"How are you {name}?")
... say_hello("Batuhan")
... """
>>> tree = ast.parse(code, mode='exec')
>>> code = compile(tree, '<string>', mode='exec')
>>> type(code)
<class 'code'>
Every code object has attributes that hold information about it (they are prefixed with co_). I'll just mention a few of them.
co_name
The first one is the name. For a function it is the name of the function; for a class it is the name of the class. In the AST example we compiled a module, so we can't actually tell the compiler what we want it to be called; it's just going to be called <module>.
>>> code.co_name
'<module>'
>>> say_hello.__code__.co_name
'say_hello'
co_varnames
It is a tuple that contains all your local variables (including arguments).
>>> say_hello.__code__.co_varnames
('name',)
co_consts
A tuple that contains all of the literals and constant values that were referenced in the body of our function.
>>> say_hello.__code__.co_consts
(None, 'Hello', 'How are you ', '?')
The only weird thing is the None among the values. We didn't put a None in our function body. Python puts it there because if our function finishes executing without reaching any return statement, it returns None.
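We can check this implicit return value directly; even a function whose body never mentions None carries it in co_consts:

```python
def no_return():
    pass

# A function that falls off the end without a return gives back None...
print(no_return() is None)                   # → True
# ...and the compiler keeps that None in the constants tuple.
print(None in no_return.__code__.co_consts)  # → True
```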
The Bytecode (co_code)
This is the bytecode of our function.
>>> bytecode = say_hello.__code__.co_code
>>> bytecode
b't\x00d\x01|\x00\x83\x02\x01\x00t\x00d\x02|\x00\x9b\x00d\x03\x9d\x03\x83\x01\x01\x00d\x00S\x00'
This is not a string, it is a bytes object: a sequence of bytes.
>>> type(bytecode)
<class 'bytes'>
The first byte prints as the character 't'. When I look at the decimal value of it:
>>> ord('t')
116
It is 116, the same as bytecode[0]:
>>> assert bytecode[0] == ord('t')
The value 116 doesn't mean anything by itself. Fortunately Python offers a standard library module called dis (from disassembly). Its opname list might help me: it is a list of all bytecode instructions, where each index holds the name of the corresponding instruction.
>>> dis.opname[ord('t')]
'LOAD_GLOBAL'
Let's create another function that makes more sense.
>>> def multiple_it(a: int):
...     if a % 2 == 0:
...         return 0
...     return a * 2
...
>>> multiple_it(6)
0
>>> multiple_it(7)
14
>>> bytecode = multiple_it.__code__.co_code
Let's check the first instruction of this bytecode.
>>> dis.opname[bytecode[0]]
'LOAD_FAST'
The LOAD_FAST instruction is made of 2 bytes: the opcode and its argument (the second byte here is the argument 0, not another opcode).

>>> dis.opname[bytecode[0]] + ' ' + dis.opname[bytecode[1]]
'LOAD_FAST <0>'

LOAD_FAST 0 is the instruction for: look up the local variable names tuple and push the value of the variable at index 0 onto the evaluation stack.
>>> multiple_it.__code__.co_varnames[bytecode[1]]
'a'
Let's disassemble our code with dis.dis and transform the bytecode into a human-readable form.
>>> dis.dis(multiple_it)
  2           0 LOAD_FAST                0 (a)
              2 LOAD_CONST               1 (2)
              4 BINARY_MODULO
              6 LOAD_CONST               2 (0)
              8 COMPARE_OP               2 (==)
             10 POP_JUMP_IF_FALSE       16

  3          12 LOAD_CONST               2 (0)
             14 RETURN_VALUE

  4     >>   16 LOAD_FAST                0 (a)
             18 LOAD_CONST               1 (2)
             20 BINARY_MULTIPLY
             22 RETURN_VALUE
The numbers in the first column (2, 3, 4) stand for line numbers. Each line of Python code is turned into multiple instructions.
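dis can also give this information programmatically: dis.get_instructions yields one Instruction object per opcode, with the name, argument, and decoded argument already resolved. (Exact opcode sets vary between CPython versions, so this sketch only relies on LOAD_FAST being present.)

```python
import dis

def multiple_it(a: int):
    if a % 2 == 0:
        return 0
    return a * 2

# Each Instruction pairs a raw opcode with its decoded argument.
for ins in dis.get_instructions(multiple_it):
    print(ins.offset, ins.opname, ins.argrepr)

opnames = [ins.opname for ins in dis.get_instructions(multiple_it)]
print("LOAD_FAST" in opnames)  # → True
```

This is handy when you want to analyze bytecode from a program instead of reading dis.dis output by eye.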
Execution Time
CPython is a stack-oriented virtual machine rather than a register-based one. That means Python code is compiled for an imaginary (virtual) computer with a stack architecture.
Each time you make a function call, Python pushes a call frame onto the call stack. When a return happens, that call frame gets popped off the stack.
Python additionally uses 2 stacks when a function is called. One is the evaluation stack and the other is the block stack. Most of the work happens on the evaluation stack; the block stack tracks how many blocks are currently active and other things related to blocks and scopes.
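We can observe these call frames from Python itself: inspect.stack() returns the frames currently on the call stack, innermost first. A small sketch:

```python
import inspect

def outer():
    return inner()

def inner():
    # Each active call sits in its own frame on the call stack;
    # inspect.stack() lists them from the innermost frame outward.
    return [frame.function for frame in inspect.stack()]

functions = outer()
print(functions[:2])  # → ['inner', 'outer']
```

When inner returns, its frame is popped; when outer returns, its frame is popped too, exactly the push/pop behavior described above.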