
The Complete Idiot's Guide to Refactoring Python Using Multiprocessing Pools



Jul 31, 2020


While I would happily proclaim that my ur-language is Ruby, I spend increasingly large amounts of time these days using Python. And while there are many things I don't like about Python (the syntax makes my eyes want to weep and then die; thanks, Tim Curry in Psych, around 20 seconds in), the strength of the Python ecosystem is outstanding.

Today I'm going to talk about the Python multiprocessing library, which is part of the standard library and can be used without installing anything. This isn't going to be a theoretical explanation of processes / threads / parallelism. Instead it is a simple explanation of how my favorite Python guru taught me to love the zen of multiprocessing, with a very specific example. But we do need a few basics:

  1. In Python you want to use processes, not threads. The reason is the infamous GIL, which Real Python does a great job discussing, so I'm not going to get into it here.
  2. Unless your Python processes are heavily IO bound (for example, calling networked APIs), you generally want to use a pool of processes tied to your CPU / core count. Happily this is astonishingly trivial, because the multiprocessing library gives you multiprocessing.cpu_count() as a core primitive. Please note that I recognize I have vastly oversimplified this issue, and that many people argue for number of cores - 1. As with all complex computing issues, well, ymmv.
  3. Debugging parallel software is always harder than you think it is, so I only ever do this at the end of a project, when I know my code works and where the bottlenecks are (i.e., whether it is IO bound, for example).
  4. Consistency of coding practices makes a huge difference. In the code base I just left, I was able to transition all of it to a multiprocessing architecture trivially because I had invested heavily in consistency.
  5. Your deployment tooling makes a huge difference. If you want to experiment with multiprocessing, then you need the ability to change your instance type / number of cores and benchmark, so you know you are spending your money wisely.
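The cpu_count() primitive from point 2 is all you need to size a pool. Here is a minimal sketch covering both conventions; the pool_size helper is my own, not part of the library:

```python
import multiprocessing

def pool_size(reserve_one: bool = False) -> int:
    """Return a pool size based on the machine's CPU count.

    Some people prefer cpu_count() - 1 so one core stays free for
    the OS and the parent process; that choice is workload-dependent.
    """
    n = multiprocessing.cpu_count()
    return max(1, n - 1) if reserve_one else n
```

On a 4-core box, pool_size() returns 4 and pool_size(reserve_one=True) returns 3.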

Before

Before I implemented multiprocessing, I had an architecture across my data pipeline that looked like this:

import foo
import bar 

def main():
    # do the thing
    
    # do more things

After

import multiprocessing
import foo
import bar

def main():
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    res = pool.apply_async(do_main)
    res.wait()  # block until do_main finishes rather than spinning

def do_main():
    # do the thing
    
    # do more things

As you can see, do_main is just a rewrite of main() under a different name so it can be called by pool.apply_async(). All I did for this rewrite was:

  1. Import the multiprocessing library.
  2. Create a new def main() as per the above.
  3. Rename my former def main() to def do_main().
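One wrinkle the steps above gloss over: on platforms where Python spawns rather than forks worker processes (Windows, and macOS on recent Python versions), the pool must be created under an if __name__ == "__main__" guard, or the child processes will re-execute the module on import. A sketch of the full refactor with that guard; the "done" return value is mine, added only to make the example concrete:

```python
import multiprocessing

def do_main():
    # do the thing
    # do more things
    return "done"

def main():
    # The context manager closes and joins the pool cleanly on exit.
    with multiprocessing.Pool(multiprocessing.cpu_count()) as pool:
        res = pool.apply_async(do_main)
        res.wait()  # block until do_main finishes

if __name__ == "__main__":
    main()
```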

Thank You

I would be remiss without thanking my former colleague Grant for his assistance with this refactor. It is tremendously easy to go wrong with multiprocessing and he set me straight a number of times. Thanks Grant!

Caveats

The example above isn't doing anything to capture output from the do_main() method. The reason for this is that, for my use case, I was implementing a data pipeline where my main() method was simply doing work and advancing items from SQS queue to SQS queue. If you want to capture output from a do_main() routine, that can be done by reading the documentation for apply_async.
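For what it's worth, capturing a result is a one-liner: apply_async returns an AsyncResult, and its get() method blocks until the function returns (re-raising any exception that occurred in the worker). A small sketch, with a hypothetical work() function standing in for do_main():

```python
import multiprocessing

def work(x):
    # Stand-in for do_main(); any picklable return value works.
    return x * x

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        # get() blocks until the worker returns, then hands back the value.
        result = pool.apply_async(work, (7,))
        print(result.get())  # prints 49
```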

Posted In: #python #scalability

