Outperforming everything with anything Python? Sure, why not?

Outperforming everything with anything

Python? Sure, why not?

This continues Outperforming LAPACK with C metaprogramming , and Vastly outperforming LAPACK with C++ metaprogramming . These two posts describe language tricks to make compilers generate highly-performant code for you. But do you need compilers at all?

The alternative would be writing programs in plain assembly. Although the horrors of it are vastly exaggerated, this approach has two major flaws.

Assembly code is not portable.
Although it became much easier with modern tools, assembly programming still requires a lot of tedious work.

Thankfully, we all live in the XXI century, and both of these problems have already been addressed.

The solution to the first one would be LLVM . Initially, it meant “Low Level Virtual Machine” and that is exactly what we want to ensure portability. Simply put, it takes some code written in very low-level hardware agnostic language and returns some highly optimized native code for the specific hardware platform. With LLVM we have both the power of low-level programming and the automation of hardware-oriented micro-optimizations.

And the solution to the second problem is any “scripting” language you want. Scheme, Python, Perl, even bash or AWK will do. They are all meant to eliminate tedious work. Everything you use daily for automation you can use to generate highly-performant code.

The plan

Let's repeat the same experiment I did with C and C++ in the posts from before. Let's generate a fully inlined fully unrolled solution and embed it into the benchmarking code. The plan then is as follows.

Use Clang to generate LLVM intermediate code for the benchmark that is supposed to measure yet non-existent function named solve_5
Make Python generate a linear solver code in LLVM.
Hijack a benchmark with a Python script replacing solve_5 call with the generated solver.
Use the LLVM static compiler to turn the intermediate code into the machine code.
Use the GNU assembler and the Clang's linker to make the machine code into an executable binary.

That's how it looks in a Makefile:

all: generate_llvm generate_s generate_elf test
.PHONY: generate_llvm
generate_llvm:
  <b>clang</b> benchmark.c -march=native -S -emit-llvm -o benchmark.bc
  <b>python</b> substitute_solver_call.py benchmark.bc
.PHONY: generate_s
generate_s:
  <b>llc</b> -O2 benchmark.bc -o benchmark.s
.PHONY: generate_elf
generate_elf:
  <b>as</b> benchmark.s -o benchmark.o
  <b>clang</b> -o benchmark benchmark.o

The Python part

We need a linear solver in Python just like we had with C and C++. Here it is:

# this generates n-solver in LLVM code with LLVMCode objects.
# No LLVM stuff yet, just completely Pythonic solution
def solve_linear_system(a_array, b_array, x_array, n_value):
  def a(i, j, n):
    if n == n_value:
      return a_array[i * n_value + j]
    return a(i, j, n+1)*a(n, n, n+1) - a(i, n, n+1)*a(n, j, n+1)

  def b(i, n):
    if n == n_value:
      return b_array[i]
    return a(n, n, n+1)*b(i, n+1) - a(i, n, n+1)*b(n, n+1)

  def x(i):
    d = b(i,i+1)
    for j in range(i):
      d -= a(i, j, i+1) * x_array[j]
    return d / a(i, i, i+1)

  for k in range(n_value):
    x_array[k] = x(k)

return x_array

When we run this on numbers, we have numbers. But we want the code. So let's make an object that pretends to be numbers only to spy on the algorithm. The object that writes down every operation the algorithm wants it to perform in ready to assemble LLVM intermediate language.

# this is basically the whole LLVM layer
I = 0
STACK = []

class LLVMCode:
  # the only constructor for now is by double* instruction
  def __init__(self, io, code = ''):
    self.io = io
    self.code = code
  def __getitem__(self, i):
    global I, STACK
    copy_code = "%" + str(I+1)
    copy_code += " = getelementptr inbounds double, double* "
    copy_code += self.io +", i64 " + str(i) + "\n"
    copy_code += "%" + str(I+2)
    copy_code += " = load double, double* %" + str(I+1)
    copy_code += ", align 8\n"
    I += 2
    STACK += [I]
    return LLVMCode(self.io, copy_code)
  def __setitem__(self, i, other_llvcode):
    global I, STACK
    self.code += other_llvcode.code
    self.code += "%" + str(I+1)
    self.code += " = getelementptr inbounds double, double* "
    self.code += self.io +", i64 " + str(i) + "\n"
    self.code += "store double %" + str(I)
    self.code += ", double* %" + str(I+1) + ", align 8\n"
    I += 1
    STACK = STACK[:-1]
    return self
  def general_arithmetics(self, operator, other_llvcode):
    global I, STACK
    self.code += other_llvcode.code;
    self.code += "%" + str(I+1) + " = f" + operator
    self.code += " double %" + str(STACK[-2]) + ", %"
    self.code += str(STACK[-1]) + "\n";
    I += 1
    STACK = STACK[:-2] + [I]
    return self
  def __add__(self, other_llvcode):
    return self.general_arithmetics('add', other_llvcode)
  def __sub__(self, other_llvcode):
    return self.general_arithmetics('sub', other_llvcode)
  def __mul__(self, other_llvcode):
    return self.general_arithmetics('mul', other_llvcode)
  def __div__(self, other_llvcode):
    return self.general_arithmetics('div', other_llvcode)

Now when we run the solver with this kind of objects, we get a fully functional linear solver written in LLVM intermediate language. Then we put it into the benchmark code to see how fast it is.

Instruction in LLVM are numbered and we want to preserve this enumeration, so the function that inserts our code into our benchmark is not trivial. But it's not very complicated either.

# this replaces the function call
# and updates all the instructions' indices
def replace_call(text, line, params):
  global I, STACK
  # '%12 ' -> 12
  I = int(''.join(
    [xi for xi in params[2] if xi.isdigit()]
    ))
  first_instruction_to_replace = I + 1
  STACK = []
  replacement = solve_linear_system(
    LLVMCode(params[0]),
    LLVMCode(params[1]),
    LLVMCode(params[2]), 5).code
  delta_instruction = I - first_instruction_to_replace + 1
  for i in xrange(first_instruction_to_replace, sys.maxint):
    not_found = sum(
      [text.find('%' + str(i) + c) == -1
        for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH]
      )
    if not_found == 4:
      # the last instruction has already been substituted
      break
    new_i = i + delta_instruction
    for c in POSSIBLE_CHARS_NUMBER_FOLLOWS_WITH:
      # substitute instruction number
      text = text.replace('%' + str(i) + c, '%' + str(new_i) + c)
return text.replace(line, replacement)

The whole piece of code that implements the solver, provides the Python-to-LLVM layer, and does the code insertion is only 100 lines long! You can see it in one piece on GitHub .

The benchmark

The benchmark itself is in C. When we run the Makefile, its call for solve_5 gets replaced by the LLVM code generated from Python by Python.

Step 1. Benchmark C source code

#include "stdio.h"

extern void solve_5(double* a, double* b, double* x); // fake

int main() { double sum_x[5] = {0., 0., 0., 0., 0.}; for(int i = 0; i < 1000000; ++i) { double a[5*5] = { 6.80, -6.05, -0.45, 8.32, -9.67, -2.11, -3.30, 2.58, 2.71, -5.14, 5.66, 5.36, -2.70, 4.35, -7.26, 5.97, -4.44, 0.27, -7.17, 6.08, 8.23, 1.08, 9.04, 2.14, -6.87 }; double b[5] = { 4.02, 6.19, -8.22, -7.57, -3.03, }; double x[5];

solve_5(a, b, x); // this will get replaced later

for(int j = 0; j < 5; ++j){ sum_x[j] += x[j]; } } printf("%f, %f, %f, %f, %f\n", sum_x[0], sum_x[1], sum_x[2], sum_x[3], sum_x[4]); return 0; }

Step 2. LLVM assembly language

... 27 lines before... ; <label>:6 ; preds = %3 %7 = bitcast [25 x double]* %a to i8* call void @llvm.memcpy.p0i8.p0i8.i32 (i8* %7, i8* bitcast ([25 x double]* @main.a to i8*), i32 200, i32 8, i1 false) %8 = bitcast [5 x double]* %b to i8* call void @llvm.memset.p0i8.i32(i8* %8, i8 0, i32 40, i32 8, i1 false) %9 = bitcast i8* %8 to [5 x double]* %10 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 0 store double 4.020000e+00, double* %10 %11 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 1 store double 6.190000e+00, double* %11 %12 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 2 store double -8.220000e+00, double* %12 %13 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 3 store double -7.570000e+00, double* %13 %14 = getelementptr [5 x double], [5 x double]* %9, i32 0, i32 4 store double -3.030000e+00, double* %14 %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0 %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0 %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0 call void @solve_5(double* %15, double* %16, double* %17) ; to replace store i32 0, i32* %j, align 4 br label %18 ; to update ; <label>:18 ; preds = %29, %6 %19 = load i32, i32* %j, align 4 %20 = icmp slt i32 %19, 5 br i1 %20, label %21, label %32 ... 58 lines after...

Step 3. LLVM after the call replacement

... 44 lines before ... %15 = getelementptr inbounds [25 x double], [25 x double]* %a, i32 0, i32 0 %16 = getelementptr inbounds [5 x double], [5 x double]* %b, i32 0, i32 0 %17 = getelementptr inbounds [5 x double], [5 x double]* %x, i32 0, i32 0 %18 = getelementptr inbounds double, double* %15, i64 6 ; generated %19 = load double, double* %18, align 8 ; by %20 = getelementptr inbounds double, double* %15, i64 24 ; the %21 = load double, double* %20, align 8 ; Python %22 = fmul double %19, %21 ; script ... 2407 more lines of auto-generated code ... %2422 = getelementptr inbounds double, double* %17, i64 3 %2423 = load double, double* %2422, align 8 %2424 = fmul double %2421, %2423 %2425 = fsub double %2419, %2424 %2426 = getelementptr inbounds double, double* %15, i64 24 %2427 = load double, double* %2426, align 8 %2428 = fdiv double %2425, %2427 %2429 = getelementptr inbounds double, double* %17, i64 4 store double %2428, double* %2429, align 8 store i32 0, i32* %j, align 4 br label %2430 ; instructions updated by the same script ; :18 ; preds = %2441, %6 %2431 = load i32, i32* %j, align 4 %2432 = icmp slt i32 %2431, 5 br i1 %2432, label %2433, label %2444 ... still 58 lines after...

Step 4. Native optimized assembly

... 139 lines of assembly ... vmovsd 352(%esp), %xmm0 # xmm0 = mem[0],zero vmulsd 256(%esp), %xmm2, %xmm6 vmovsd .LCPI0_0, %xmm4 # xmm4 = mem[0],zero vmulsd %xmm4, %xmm0, %xmm7 vsubsd %xmm7, %xmm6, %xmm6 vmulsd 128(%esp), %xmm0, %xmm5 # 8-byte Folded Reload vmovapd %xmm0, %xmm3 vmulsd 344(%esp), %xmm2, %xmm7 vsubsd %xmm5, %xmm7, %xmm1 vmovsd %xmm1, 128(%esp) # 8-byte Spill vmulsd 280(%esp), %xmm2, %xmm7 vmulsd 192(%esp), %xmm4, %xmm5 # 8-byte Folded Reload vsubsd %xmm5, %xmm7, %xmm7 vmovsd %xmm7, 120(%esp) # 8-byte Spill vmovsd 104(%esp), %xmm0 # 8-byte Reload

vmulsd %xmm6, %xmm0, %xmm5 vmulsd %xmm7, %xmm1, %xmm6 vsubsd %xmm6, %xmm5, %xmm5 vmovsd %xmm5, 64(%esp) # 8-byte Spill vmovapd %xmm3, %xmm7 vmulsd 88(%esp), %xmm7, %xmm3 # 8-byte Folded Reload vmulsd 336(%esp), %xmm2, %xmm6 vsubsd %xmm3, %xmm6, %xmm3 vmulsd %xmm0, %xmm3, %xmm3 vmulsd 80(%esp), %xmm1, %xmm6 # 8-byte Folded Reload vsubsd %xmm6, %xmm3, %xmm1 vmovsd %xmm1, 88(%esp) # 8-byte Spill ... only 400 more lines of assembly ...

The most noteworthy is how the ultra-verbose intermediate code generated by the Python script turns into some very compact and very effective code for the hardware. It is highly super-scalarized too. But is it good enough to compete with C and C++ solutions?

Here are some approximate numbers for the three cases: C with tricks, C++ with a trick, and Python with LLVM.

The tricks for C don't quite work for Clang, so GCC version is measured. It runs for roughly 70 ms on average.
The C++ version was built with Clang and it runs for 60 ms.
The Python version, the one described here, runs only for 55 ms.

Of course, this speed-up is not something you would kill for. But it shows that you can write programs in Python that outperform their counterparts written in C or C++. You don't have to learn some special language to create high-performant applications or libraries.

Conclusion

I think that the dichotomy between fast compiling languages and slow scripting languages is a bluff. Native code generation may very well be not a core feature but something like a pluggable option. Like a library. Think about the benefits: you can do your research and rapid prototyping in one language, and then generate highly-performant code... in the same language.

High-performance computing has no reason to remain a privilege of compiling languages. A compiler is just a soft machine for code generation. You can generate code in any language you want. I believe you can teach Matlab to generate ultra-fast LLVM code if you want to.

All the measurements conducted on Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, the code is compiled with clang version 3.8.0-2ubuntu4 and g++ 5.4.0 with -march=native -O2. The source code for the benchmark is available on Github .

The plan

The Python part

The benchmark

Conclusion

Recommend

3D cartoon-like explosion in 180 lines of bare C++

电商有赞：执行996工作制工作家庭平衡不好可以离婚

微信封抖音，不能将互联网壁垒等同于不正当竞争

竞品分析：如何来做智能家居场景自动化？

用了三年时间，我对用户运营有了新看法

扔掉Create React App，打造你自己的React生成工具！

从0到1使用VUE-CLI3开发实战（四）: Axios封装

GitHub - jpillora/chisel: A fast TCP tunnel over HTTP

再补货:DECATHLON 迪卡侬 QUECHUA JK ARP 100 RAIN 男款单层夹克 49.9元_迪卡侬运动超...

还在期待安卓9.0吗？Android 10.0要来了

About Joyk