
Kristján's Cosmic Percolator



In recent years I’ve been more and more pulled into network administration.  I’ve been involved in small companies with their own infrastructure and have had to learn how to work with VLANs and DHCP and all that jazz.

My current company has offices in two locations and needs to have the internal networks connected via VPN. Having planned the networks so that we have no overlapping subnet ranges, we initially thought we’d use the built-in VPN facilities of our internet gateways.  But we decided against this, since we wanted to have high throughput (up to 1Gbps) and gateway hardware isn’t typically designed for this.  So, a tunnel between two linux servers on the lan, then.

OpenVPN

Next, we set up OpenVPN on the servers.  The first need was a VPN solution for people at home to connect to the office.  There are tutorials on how to do this with OpenVPN and it is reasonably simple to do, but OpenVPN is so full of configuration options, often poorly documented, that it is non-trivial to get right.  There is all the messing about with public key cryptography, certificates, client keys and whatnot.  And then there are the sensible crypto configuration options to choose.  We have this running in both offices.  It works for people working from home.

But then we wanted to use the same to set up a VPN tunnel between the offices.  And this turned out to be trickier.  There are frankly not many tutorials showing how to do that.  Keeping the tunnel running reliably was a challenge.  And throughput was disappointing.

I spent some time trying to diagnose why throughput was bad.  In the end, I think I discovered that packets were being dropped on the boundary between the openvpn code running in user space and the linux tunnel interface.  And dropped packets kill TCP performance since TCP goes into back-off mode.  No amount of increasing network buffers seemed to cure this.

Site-to-site VPN using SSH

In the end, I set up tunnels over SSH.  This turned out to be relatively stable and moderately easy to configure, with acceptable performance.  We were getting some 150Mbps over our link, which has a round-trip latency of some 80ms.  Having set up port forwarding for the ssh connection on the gateway, setting up such a tunnel is not that hard.  I ended up writing systemd units that ran scripts, similar to this:

#! /usr/bin/bash
# Local and remote tun device numbers
LOCTUN=7
REMTUN=0
# Tunnel endpoint addresses (a point-to-point pair)
LOCADDR=10.8.0.7
REMADDR=10.8.0.8
# Remote network reachable through the tunnel
REMVPC=172.30.0.0/16
# Bring up and configure the local tun device once the connection is
# established (LocalCommand), keep the session alive, and set up the
# remote tun device and return routes on the far side.
exec ssh \
-o PermitLocalCommand=yes \
-o LocalCommand="\
sudo ip tuntap add dev tun$LOCTUN mode tun; \
ifconfig tun$LOCTUN $LOCADDR pointopoint $REMADDR && \
ifconfig tun$LOCTUN txqueuelen 10000 && \
ip route add $REMVPC via $REMADDR" \
-o ServerAliveInterval=30 \
-o ServerAliveCountMax=5 \
-n -w $LOCTUN:$REMTUN ec2-user@myhost \
"sudo ip tuntap add dev tun$REMTUN mode tun user ec2-user; \
sudo ifconfig tun$REMTUN $REMADDR pointopoint $LOCADDR && sudo ifconfig tun$REMTUN txqueuelen 10000 && \
sudo ip route add 192.168.10.0/24 via $LOCADDR; \
sudo ip route add 192.168.11.0/24 via $LOCADDR; \
sudo ip route add 192.168.12.0/24 via $LOCADDR; "

What this does is first create a tunnel interface locally and assign endpoint addresses for a peered connection.  It then opens an SSH connection to the remote server, where the remote tunnel device is created.  Appropriate routes also have to be set up, both locally and remotely.  And of course, the remote and local networks have to have static routes in place to route packets to the gateway hosts.

In addition, you need to create the remote tunnel device and grant ownership to the remote user, if you are not using the root user to log in (not recommended):

sudo ip tuntap add dev tun0 mode tun user ec2-user

This works ok.  systemd will retry the connection if it fails.  The tunnel is much more reliable than openvpn, takes seconds to set up and just kind of works.  But twiddling with tunnel interface numbers is a bother.  Also, it is a TCP tunnel, and performance isn’t that great because tunneling TCP over TCP isn’t optimal.

This is the systemd unit file, saved as a template unit under /etc/systemd/system/:

[Unit]
Description=SSH tunnel for %I
After=network.target
[Service]
ExecStart=/usr/bin/bash /etc/sshtun/%i
# Restart every >2 seconds to avoid StartLimitInterval failure
RestartSec=5
Restart=always
[Install]
WantedBy=multi-user.target
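
To use it, drop a tunnel script under /etc/sshtun/ and enable an instance of the template.  As a sketch, assuming the unit above was saved as sshtun@.service and the script is called office-b (both names are placeholders of mine):

systemctl daemon-reload
systemctl enable --now sshtun@office-b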

WireGuard

So, recently I was made aware of WireGuard.  A simple VPN encapsulation protocol to be included in the Linux kernel, no less.  I read this article and decided to give it a go.  I had reason to believe it might be better than my SSH solution:

  • It uses UDP packets
  • It is kernel based and uses its own interface type, i.e. no messing around with tun/tap interfaces.
  • There should be no user-space/kernel-space bottleneck
  • There are almost no configuration options.
  • It promises to automatically re-negotiate the tunnel if required.

Setting it up is pretty straightforward, but the use cases shown are typically home-to-office VPNs.  For site-to-site VPNs I have chosen to do things a bit differently.

A tunnel endpoint CIDR range

Selecting an IP address for the interface

When server A, on lan A, connects to server B on lan B, what IP addresses should one assign to the WireGuard interfaces?  As a concrete example:  Server Alpha has the lan address 192.168.10.100/24 and  Server Beta lives on a different lan, and has the IP address 192.168.20.60/24.

One approach would be to give the WireGuard interface on server Alpha an address from the remote lan, and vice versa.  This is fine for two networks and if you can reserve the addresses out of the DHCP range of each.  But the approach I have used is to reserve a separate private IP range for the tunnel endpoint network.  That way, we can set up a more complex topology.  I use the network 10.8.1.0/24 as the virtual tunnel network.  Each WireGuard interface on each tunnel server gets one address out of this range.

Setting up

Based on the instructions here, these are the steps needed to configure server Alpha.  We assume that you have created private and public keys on each server and put the private key in /etc/wireguard/privatekey.  Also, port forwarding on both sides will forward UDP packets on the external ports to the servers.  We also assume that IP forwarding has been enabled on each server.  Notice how we add the tunnel endpoint address to the interface and allow the remote tunnel address through.  You run the following as root:

ip link add dev wg0 type wireguard
ip address add dev wg0 10.8.1.1/24
wg set wg0 listen-port 7777
wg set wg0 private-key /etc/wireguard/privatekey
wg set wg0 peer CbX0FSQ7W2LNMnozcMeTUrru6me+Q0tbbIfNlcBzPzs= allowed-ips 192.168.20.0/24,10.8.1.2/32 endpoint networkB.company.com:8888
ip link set up wg0
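
For reference, the key generation and forwarding prerequisites mentioned above can be covered with something along these lines, run as root on each server (a sketch using the standard wireguard-tools commands):

umask 077
wg genkey | tee /etc/wireguard/privatekey | wg pubkey > /etc/wireguard/publickey
sysctl -w net.ipv4.ip_forward=1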

Now you have your basic settings, but there is no routing yet.  The cool thing is that the utility wg-quick will help you with that.  First, you save the config you have made:

touch /etc/wireguard/wg0.conf
wg-quick save wg0

This will save the config in the above file.  You can edit it if you like, but it should be fine.  You can now bring the interface down and up with wg-quick:

wg-quick down wg0
wg-quick up wg0

That was easy.  wg-quick will have created the interface, assigned the correct address, configured the interface and modified the server routing tables.
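
For orientation, the wg0.conf that wg-quick saves for the settings above should look roughly like this (the PrivateKey line is a placeholder for the contents of /etc/wireguard/privatekey):

[Interface]
Address = 10.8.1.1/24
ListenPort = 7777
PrivateKey = <contents of /etc/wireguard/privatekey>

[Peer]
PublicKey = CbX0FSQ7W2LNMnozcMeTUrru6me+Q0tbbIfNlcBzPzs=
AllowedIPs = 192.168.20.0/24, 10.8.1.2/32
Endpoint = networkB.company.com:8888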

systemd

The final trick is to make this into a service. That’s remarkably easy, due to a systemd unit being included. This is what you do:

systemctl enable wg-quick@wg0 --now

This enables the unit and runs it, in one swell foop.

The other server

On server Beta, you perform the same dance, except that you assign a different local tunnel address and put Alpha’s lan range and tunnel address in the peer section.
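
Concretely, the mirror-image commands on Beta would be something like the following (Alpha’s public key and public host name here are placeholders):

ip link add dev wg0 type wireguard
ip address add dev wg0 10.8.1.2/24
wg set wg0 listen-port 8888
wg set wg0 private-key /etc/wireguard/privatekey
wg set wg0 peer <server Alpha's public key> allowed-ips 192.168.10.0/24,10.8.1.1/32 endpoint networkA.company.com:7777
ip link set up wg0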

Testing

If all goes well, you should be able to ping between the servers now.

Adding a third server

This is when things become simple.  Just allocate a third tunnel interface address to the third server.  Add [Peer] sections to the wg0.conf files on the existing servers for the third server.  Set up the third server with the other two as peers.  Start up the interface.  It should just work.
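
For example, if the third server (call it Gamma) sits on lan 192.168.30.0/24 and gets tunnel address 10.8.1.3, the stanza added to Alpha’s and Beta’s wg0.conf would look something like this (key, lan range, host name and port are all hypothetical):

[Peer]
PublicKey = <server Gamma's public key>
AllowedIPs = 192.168.30.0/24, 10.8.1.3/32
Endpoint = networkC.company.com:9999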

Conclusion

Using WireGuard I managed to get the throughput up significantly.  Over our inter-office connection with 80ms round trip time, and with an internet connection of 1000Mbps on both ends, we now get some 350Mbps through the tunnel, using regular consumer workstations serving as tunnel endpoints.  And the tunnels just work.


So, I’ve been experimenting with running servers on Raspberry Pi3, using CentOS 7.

CentOS is widely used in the enterprise environment, and it was recently released for the RPi.  See: https://wiki.centos.org/SpecialInterestGroup/AltArch/Arm32/RaspberryPi3

Anyway, it needed a fake hwclock package, to help it cope with running without a hardware clock. Based on stuff that I found on the internet, I came up with a systemd unit to provide that service.

Find it here:  https://github.com/kristjanvalur/soft-hwclock

Note:  I originally named this package “fake-hwclock” but that is a real package for RPi.  The author contacted me and asked me to rename it, to avoid confusion.  So now it is soft-hwclock.


I sometimes get asked what it is that makes Stackless Python “stackless”. And instead of starting a rant about microthreads, co-routines, tasklets and channels, I present the essential piece of code from the implementation:

The Code:

/*
    the frame dispatcher will execute frames and manage
    the frame stack until the "previous" frame reappears.
    The "Mario" code if you know that game :-)
 */

PyObject *
slp_frame_dispatch(PyFrameObject *f, PyFrameObject *stopframe, int exc, PyObject *retval)
{
    PyThreadState *ts = PyThreadState_GET();

    ++ts->st.nesting_level;

/*
    frame protocol:
    If a frame returns the Py_UnwindToken object, this
    indicates that a different frame will be run.
    Semantics of an appearing Py_UnwindToken:
    The true return value is in its tempval field.
    We always use the topmost tstate frame and bail
    out when we see the frame that issued the
    originating dispatcher call (which may be a NULL frame).
 */

    while (1) {
        retval = f->f_execute(f, exc, retval);
        if (STACKLESS_UNWINDING(retval))
            STACKLESS_UNPACK(retval);
        /* A soft switch is only complete here */
        Py_CLEAR(ts->st.del_post_switch);
        f = ts->frame;
        if (f == stopframe)
            break;
        exc = 0;
    }
    --ts->st.nesting_level;
    /* see whether we need to trigger a pending interrupt */
    /* note that an interrupt handler guarantees current to exist */
    if (ts->st.interrupt != NULL &&
        ts->st.current->flags.pending_irq)
        slp_check_pending_irq();
    return retval;
}

(This particular piece of code is taken from an experimental branch called stackless-tealet, selected for clarity)

What is it?

It is the frame execution code. A top level loop that executes Python function frames. A “frame” represents the execution of the code inside a Python function.

Why is it important?

It is important in the way it contrasts to C Python.

Regular C Python uses the C execution stack, mirroring the execution stack of the Python program that it is interpreting. When a Python function foo() calls a Python function bar(), this happens by a recursive invocation of the C function PyEval_EvalFrame(). This means that in order to reach a certain state of execution of a C Python program, the interpreter needs to be in a certain state of recursion.

In Stackless Python, the C stack is decoupled from the Python stack as much as possible. The next frame to be executed is placed in ts->frame and the frame chain is executed in a loop.

This allows two important things:

  1. The state of execution of a Stackless python program can be saved and restored easily. All that is required is the ability to pickle execution frames and other runtime structures (Stackless adds that pickling functionality). The recursion state of a Python program can be restored without having the interpreter enter the same level of C recursion.
  2. Frames can be executed in any order. This allows many tasklets to be created and code that switches between them. Microthreads, if you will. Co-routines, if you prefer that term. But without forcing the use of the generator mechanism that C python has (in fact, generators can be more easily and elegantly implemented using this system).

That’s it!

Stackless Python is stackless, because the C stack has been decoupled from the python stack. Context switches become possible, as well as the dynamic management of execution state.

Okay, there’s more:

  • Stack slicing: A clever way of switching context even when the C stack gets in the way
  • A framework of tasklets and channels to exploit execution context switching
  • A scheduler to keep everything running

Sadly, development and support for Stackless Python has slowed down in the last few years. It nevertheless astonishes me that the core idea of stacklessness hasn’t been embraced by C Python even yet.


In a recent comment to an elderly post of mine, I was asked about the following code:

def mywrapper(func, args=(), kwargs={}):
    ...

The commenter thought that I should have made a special mention about using a dict as a default argument, “because it’s such a common gotcha.”

My response is twofold:

  1. This particular case is idiomatic, and widely used for functions that call other functions.
  2. I actually don’t think mutable default arguments are a problem, and they don’t deserve all the stigma they are getting.

I want to expand on point 2 a bit here.

Continue reading →


Recently I decided to port a little package that I had to Python 3, and ran into the traceback reference cycle problem.  This blog is the result of the detective work I had to do, both to re-familiarize myself with this issue (I haven’t been doing this sort of stuff for a few years) and to uncover the exact behaviour in Python 3.

Background

In Python 2, exceptions are stored internally as three separate objects: The type, the value and the traceback objects. The value is normally an instance of the type by the time your python code runs, so mostly we are dealing with value and traceback only. There are two pitfalls one should be aware of when writing exception handling code.

The traceback cycle problem

Normally, you don’t worry about the traceback object. You write code like:

def foo():
    try:
        return bar()
    except Exception as e:
        print "got this here", e

The trouble starts when you want to do something with the traceback. This could be to log it, or maybe translate the exception to something else:

def foo():
    try:
        return bar()
    except Exception as e:
        type, val, tb = sys.exc_info()
        print "got this here", e, repr(tb)

The problem is that the traceback, stored in tb, holds a reference to the execution frame of foo, which again holds the definition of tb. This is a cyclic reference and it means that both the traceback, and all the frames it contains, won’t disappear immediately.

This is the “traceback reference cycle problem” and it should be familiar to serious Python developers. It is a problem because a traceback contains a link to all the frames from the point of catching the exception to where it occurred, along with all temporary variables. The cyclic garbage collector will eventually reclaim it (if enabled) but that occurs unpredictably, and at some later point. The ensuing memory wastage may be problematic, as may the latency when gc finally runs, and it can also cause problems with unittests that rely on reference counts to detect when objects die. Ideally, things should just go away when not needed anymore.

The same problem occurs whenever the traceback is present in a frame where an exception is raised, or caught. For example, this pattern here will also cause the problem in the called function translate() because tb is present in the frame where it is raised.

def translate(tp, val, tb):
    # translate this into a different exception and re-raise
    raise MyException(str(val)), None, tb

In python 2, the standard solution is to either avoid retrieving the traceback object if possible, e.g. by using

tp, val = sys.exc_info()[:2]

or by explicitly clearing it yourself and thus removing the cycle:

def translate(tp, val, tb):
    # translate this into a different exception and re-raise
    try:
        raise MyException(str(val)), None, tb
    finally:
        del tb

By vigorous use of try-finally, the prudent programmer avoids leaving references to traceback objects on the stack.

The lingering exception problem

A related problem is the lingering exception problem. It occurs when exceptions are caught and handled in a function that then does not exit, for example a driving loop:

def mainloop():
    while True:
        try:
            do_work()
        except Exception as e:
            report_error(e)

As innocent as this code may look, it suffers from a problem: The most recently caught exception stays alive in the system. This includes its traceback, even though it is no longer used in the code. Even clearing the variable won’t help:

report_error(e)
e = None

This is because of the following clause from the Python documentation:

If no expressions are present, raise re-raises the last exception that was active in the current scope.

In Python 2, the exception is kept alive internally, even after the try-except construct has been exited, as long as you don’t return from the function.

The standard solution to this, in Python 2, is to use the exc_clear() function from the sys module:

def mainloop():
    while True:
        try:
            do_work()
        except Exception as e:
            report_error(e)
        sys.exc_clear() # clear the internal traceback

The prudent programmer liberally sprinkles sys.exc_clear() into his mainloop.

Python 3

In Python 3, two things have happened that change things a bit.

  1. The traceback has been rolled into the exception object
  2. sys.exc_clear() has been removed.

Let’s look at the implications in turn.

Exception.__traceback__

While it unquestionably makes sense to bundle the traceback with the exception instance as an attribute, it means that traceback reference cycles can become much more common. No longer is it sufficient to refrain from examining sys.exc_info(). Whenever you store an exception object in a variable local to a frame that is part of its traceback, you get a cycle. This includes both the function where the exception is raised, and where it is caught.

Code like this is suspect:

def catch():    
    try:
        result = bar()
    except Exception as e:
        result = e
    return result

The variable result is part of the frame that is referenced by result.__traceback__, and a cycle has been created.

(Note that the variable e is not problematic. In Python 3, this variable is automatically cleared when the except clause is exited.)
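
A tiny illustration of that automatic clearing, a toy example of my own (Python 3):

try:
    raise ValueError("boom")
except ValueError as e:
    pass

print(e)   # NameError: 'e' was deleted when the except block was exited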

Similarly suspect:

def reraise(tp, value, tb=None):
    if value is None:
        value = tp()
    if value.__traceback__ is not tb:
        raise value.with_traceback(tb)
    raise value

(The above code is taken from the six module)

Both of these cases can be handled with a well placed try-finally to clear the variables result, value and tb respectively:

def catch():    
    try:
        result = bar()
    except Exception as e:
        result = e
    try:
        return result
    finally:
        del result

def reraise(tp, value, tb=None):
    if value is None:
        value = tp()
    try:
        if value.__traceback__ is not tb:
            raise value.with_traceback(tb)
        raise value
    finally:
        del value, tb

Note that the caller of reraise() also has to clear the locals he used as arguments, because the same exception is being re-raised and the caller's frame will get added to it:

try:
    reraise(*exctuple)
finally:
    del exctuple

The lesson learned from this is the following:

Don’t store exceptions in local objects for longer than necessary. Always clear such variables when leaving the function using try-finally.

sys.exc_clear()

This method has been removed in Python 3. This is because it is no longer needed:

def mainloop():
    while True:
        try:
            do_work()
        except Exception as e:
            report_error(e)
        assert sys.exc_info() == (None, None, None)

As long as sys.exc_info() was empty when the function was called, i.e. it was not called as part of exception handling, then the internal exception state is clear outside the except clause.

However, if you want to hang on to an exception for some time, and are worried about reference cycles or memory usage, you have two options:

  1. clear the exception’s __traceback__ attribute:
    e.__traceback__ = None
  2. use the new traceback.clear_frames() function
    traceback.clear_frames(e.__traceback__)

clear_frames() was added to remove local variables from tracebacks in order to reduce their memory footprint. As a side effect, it will clear the reference cycles.
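
To make option 2 concrete, here is a small sketch of my own, reusing the suspect catch() pattern from above and then severing the cycle once the exception has been stashed away:

import traceback

def bar():
    return 1 / 0

def catch():
    try:
        result = bar()
    except ZeroDivisionError as e:
        result = e
    return result

err = catch()    # err.__traceback__ references catch()'s frame, which references err
traceback.clear_frames(err.__traceback__)
# The frames' local variables (including 'result') have now been dropped,
# breaking the cycle, while err and the traceback's line information remain usable.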

Conclusion

Exception reference cycles are a nuisance when developing robust Python applications. Python 3 has added some extra pitfalls. Even though the local variable of the except clause is automatically cleared, the user must himself clear any other variables that might contain exception objects.


Time to write a little bit about this little project of mine.

tl;dr

Multithreading made more responsive in a fork of Python 2.7.  30% more requests per second.  Satisfaction guaranteed!

Introduction

After leaving my old job at CCP Games last year, I had the urge to try to collect some of the stuff that we had done for Python 2.7 over there and make it available to the world.  So I started this little fork off 2.7.

The idea was to have a place to add “improvements” to vanilla (as opposed to Stackless) 2.7, so that they could be kept separately and in sync with CPython 2.7.

Thus far, what I’ve been mostly focusing on is modernizing thread support.  (for a full list of changes, see the whatsnew file).

When we were working on DUST 514 for the Playstation I had to make certain improvements to make networking work more efficiently on that platform.  We were interfacing stackless python with the native http api of the PS3, and had to use blocking worker threads.  Marshaling from those threads to tasklets was causing unnecessary latency.

We ended up doing a lot of experiments with condition variables, in the end, providing native C implementations to minimize GIL thrashing and reducing wakeup latency to the minimum.

In PythonPlus I have done this and some other stuff in order to improve threading performance.

The threading related changes cover among other things:

  1. Adding timeout parameters to blocking calls as in the 3.x api.
  2. Adding a native threading.Condition object
  3. Improving the GIL

Adding a native Condition object aims to reduce the thread thrashing that is otherwise associated with condition variables, since a lot of locking and context switching needs to happen for a thread to wake up with the normal .py version of those constructs.  To do this, however, the internal non-recursive locks need to be implemented using a lock and a condition variable themselves, rather than using native semaphore objects.

Changing the lock types used required the GIL to be visited, since the behaviour of the old GIL was just a random side effect of the choice of internal locks.  This also allowed me to address the old Beazley problem while at it.

The GIL change is minor.  It is simply a separate function, and when a CPU bound thread wishes to yield the GIL to another thread, it calls a new api function, _PyThread_yield_GIL().  Threads that are trying to re-acquire the GIL after releasing it are considered to be IO threads and have priority for the GIL when a CPU thread yields it.  But if no such thread is present, then the GIL won’t actually be yielded 99 out of every 100 yields.  This minimizes unnecessary thrashing among CPU threads, while allowing IO threads to quickly get their foot in when required.

Performance

I quickly got this all up and running, but then I had to prove it to be actually better than regular 2.7.  To do this, I set up two test scenarios:

  1. Tools/plus/giltest.py – a test platform to measure performance of concurrent cpu threads as well as the performance of pairs of producer/consumer threads synchronized either with threading.Condition or threading.Lock
  2. Tools/plus/testserver.py – a multithreaded webserver using a pool of threads and socketserver.py, being exercised by ab.

On windows, I found it easy to see improvements.  I got the GIL to behave better and I got the web server to increase throughput.  producer/consumer pairs using Condition variables got a real performance boost and those IO threads got a priority boost over regular CPU bound threads as expected.

However, my virtual linux box was more disappointing.  Tests showed that just replacing the native non-recursive lock which was based on the posix sem_t object with a construct using pthread_mutex_t and pthread_cond_t, slowed down execution.

Fixing linux

I decided that there ought to be no good reason for a pthread_cond_t to be so much slower than a sem_t, so I decided to write my own condition object using a sem_t.  To make a long story short, it worked.  My emulated condition variable (written using a pthread_mutex_t and a sem_t) is faster than a pthread_cond_t.  At least on my dual core virtual box.  Go figure.

The making of this new condition variable is a topic for a blog post on its own.  I doggedly refused to look up other implementations of condition variables based on semaphores, and wanted to come up with a solution on my own that did not violate the more subtle promises that the protocol makes.  Along the way, I was guided by failing unittests of the threading.Barrier class, which relies on the underlying threading.Condition to be true to its promise.  I was actually stuck on this problem for a few months, but after a recent eureka moment I think I succeeded.

The results

So, this has been some months in the making.  I set up the header files so that various aspects of my patch could be switched on or off, and a macro especially for performance testing then sets these in a sensible way.

giltest.py

First, the results of the giltest.py file, with various settings of the macro and on windows and linux:

(results table: giltest)

Some notes on this are in order.

  1. e is “efficiency”, the cpu throughput of two concurrent cpu threads (incrementing a variable) compared to just one thread.
  2. prod/con is a pair of producer/consumer threads using a threading.Lock primitive, and the column shows the number of transactions in a time-frame (one second)
  3. The green bit shows why a GIL improvement was necessary since IO threads just couldn’t get any priority over a cpu thread.  This column is showing prod/con transactions in the presence of a cpu thread.
  4. In the end, the improvements on linux are modest.  Maybe it is because of my virtual machine.  But the Beazley problem is fixed, and IO responsiveness picks up.  On windows it is more pronounced.
  5. The final column is a pair of producer/consumer threads synchronized using a threading.Condition object.  Notice on windows how performance picks up almost threefold, ending up being some 60% of a pair that’s synchronized with a threading.Lock.

 testserver.py

Now for more real-world like results.  Here the aim was to show that running many requests in parallel was better handled using the new system.  Again, improvements on linux are harder to gauge.  In fact, my initial attempts were so disappointing on linux that I almost scrapped the project.  But when I thought to rewrite the condition variable, things changed.

(results table: testserver)
  1. Notice how performance picks up with “emulated condvar” on linux (green boxes) (on windows, it is always emulated)
  2. p=1 and p=10 are the number of parallel requests that are made.  “ab” is single threaded, it shoots off n requests and then waits for them all to finish before doing the next batch, so this is perhaps not completely real-world.
  3. On linux, rps (requests per second) go up for the multi-threaded case, both when we add the new GIL (better IO responsiveness) and when we add the native threading.Condition.  Combined, it improves 30%.
  4. On windows, we see the same, except that the biggest improvement is when we modify the locks (orange boxes).
  5. On windows, we achieve better throughput with multithreading.  I.e. multiple requests now work better than single requests, whereas on linux, multiple requests performed worse.

Conclusion

These tests were performed on a dual core laptop, running windows 7.  The linux tests were done in a virtual ubuntu machine on the same laptop, using two cpus.  I’m sure that the virtual nature has its effect on the results, and so, caveat emptor.

Overall, we get 30% improvement in responsiveness when there are multiple threads doing requests using this new threading code in Python Plus.  For real world applications serving web pages, that ought to matter.

On windows, the native implementation of threading.Condition provides a staggering 167% boost in performance of two threads doing rendezvous using a condition variable.

While optimizing the linux case, I uncovered how pthread_cond_t is curiously inefficient.  A “greedy” implementation of a condition variable using the posix sem_t showed dramatic improvement on my virtual machine.  I haven’t replicated this on native linux, but I suspect that the implementors of the pthread library are using explicit scheduling, whereas we rely on the presumably greedy scheduling semantics of the semaphore primitive.  But perhaps a separate blog post on this is in order, after some more research.

Fun stuff.


I made a short presentation to my colleagues the other day about how we use the killing of tasklets as a clean and elegant way to tear down services and workers in a Stackless Python program.

My colleague Rob Galanakis wrote a short blog post on his impressions of it.

Here are the slides, for those interested.
Death to Tasklets


I haven’t written anything on Python here in a good while.  But that doesn’t mean I haven’t been busy wrestling with it.  I’ll need to take a look at my Perforce changelists over the last months and take stock.  In the meantime, I’d like to rant a bit about a most curious peculiarity of Python that I came across  a while back.

Continue reading →


At PyCon 2013 I saw a presentation, with a common function signature:

def call_later(when, function, *args):
    ...

This got me thinking about some guidelines I wrote recently on our internal tech blog about how to write such proxy functions. The current recommendation I have is for a different signature, for the reason I shall now explain:

Let’s say that you have a function that calls another function for some reason. You start with something like this:

def mywrapper(func, *args, **kwargs):
    do_something()
    return func(*args, **kwargs)

At some point though, you add another higher level wrapper:

def mybigwrapper(func, *args, **kwargs):
    do_something()
    return mywrapper(func, *args, **kwargs)

This is ok, until someone notices that this is rather slow. The reason is, that arguments are constantly being packed and unpacked. Unnecessarily so, because no one is really looking at them. So a clever software engineer comes up with a solution:

def mywrapper(func, *args, **kwargs):
    return mywrapper_without_the_stars(func, args, kwargs)

def mywrapper_without_the_stars(func, args, kwargs):
    do_something()
    return func(*args, **kwargs)

def mybigwrapper(func, *args, **kwargs):
    do_something()
    return mywrapper_without_the_stars(func, args, kwargs)

What has happened? Yes, we have created a set of functions that do not take variable arguments, but rather just take the argument tuple and keyword dict. When you nest a number of those, there is no argument packing and unpacking going on and they are all passed through verbatim. We then have a thin layer outside that does the argument packing, for api backwards compatibility.

But there is a lesson here: Perhaps it is not such a good idea to do this style of interface in the first place. Why didn’t we just write:

def mywrapper(func, args=(), kwargs={}):
    do_something()
    return func(*args, **kwargs)

to begin with? In my opinion, this is actually a much better interface. To illustrate, let’s say that we want to wrap a call to myfunc(1, 2, 3). Compare these two styles:

return mywrapper(myfunc, 1, 2, 3)
return mywrapper(myfunc, (1, 2, 3))

In the former case, we are mixing the callable (myfunc) and its arguments (1, 2, 3) into one big list. This doesn’t really make the distinction that “myfunc” is the callable and “1” is its first argument, but rather they look semantically to be equivalent, as if they were all just a chunk of arguments. In my opinion it is much clearer, when using this sort of proxy function, to make a distinction between the callable and its arguments.

Therefore, this is currently the recommended way within CCP to write such wrappers. They take the argument tuple (and keyword dict) as a non-variable argument to the function.  Variable argument lists are only used in two cases:

  1. When writing a function where that is appropriate, such as logging functions
  2. When writing wrapper functions that emulate another function’s signature.

But recently, I have been thinking even more about this because passing around “args” and “kwargs” everywhere seems unnecessarily clunky. And we arrive at the thesis of this blog post:

Wrapper functions should be written and used like this:

# wrapper takes an argument-less callable
def mywrapper(func):
    do_something()
    return func()

# call myfunc with default args
a = mywrapper(myfunc)

# call myfunc with some arguments
a = mywrapper(lambda: myfunc(1, 2, 3))

# call myfunc with something from this context
def call():
    return myfunc(foo, bar)
a = mywrapper(call)

In other words: How about using Python’s powerful lambda and closure semantics to add those arguments if and when they are needed, rather than to write layer upon layer of functions that manually carry around argument tuples and keyword dicts?


After a long hiatus, the Cosmic Percolator is back in action.  Now it is time to rant about all things Python, I think.  Let’s start with this here, which came out from work I did last year.

Stackless has had an “atomic” feature for a long time. In this post I am going to explain its purpose and how I recently extended it to make working with OS threads easier.

Scheduling

In Stackless Python, scheduling is cooperative.  This means that a tasklet normally runs uninterrupted until it explicitly does something that would cause another one to run, like sending a message over a channel.  This allows one to write logic in stackless without worrying too much about synchronization.

However, there is an important exception to this: It is possible to run stackless tasklets through the watchdog, and this will interrupt a running tasklet if it exceeds a pre-determined number of executed opcodes:

while True:
    interrupted = stackless.run(100)
    if interrupted:
        print interrupted, "has been running quite a bit!"
        interrupted.insert()
    else:
        break # Ok, nothing runnable anymore

This code may cause a tasklet to be interrupted at an arbitrary point (actually during a tick interval, the same point that yields the GIL) and cause a switch to the main tasklet.

Of course, not all code uses this execution mode, but nevertheless, it has always been considered a good idea to be aware of it.  For this reason, an atomic mode has been supported which inhibits this involuntary switching in sensitive areas:

oldvalue = stackless.getcurrent().set_atomic(1)
try:
    myglobalvariable1 += 1
    myglobalvariable2 += 2
finally:
    stackless.getcurrent().set_atomic(oldvalue)

The above is then optionally wrapped in a context manager for readability:

@contextlib.contextmanager
def atomic():
    oldv = stackless.getcurrent().set_atomic(1)
    try:
        yield None
    finally:
        stackless.getcurrent().set_atomic(oldv)

The atomic state is a property of each tasklet, and so even when voluntary switching is performed while a non-zero atomic state is in effect, it has no effect on other tasklets.  Its only effect is to inhibit involuntary switching of the tasklet on which it is set.

A Concrete Example

To better illustrate its use, let’s take a look at the implementation of the Semaphore from stacklesslib (stacklesslib.locks.Semaphore):

class Semaphore(LockMixin):
    def __init__(self, value=1):
        if value < 0:
            raise ValueError
        self._value = value
        self._chan = stackless.channel()
        set_channel_pref(self._chan)

    def acquire(self, blocking=True, timeout=None):
        with atomic():
            # Low contention logic: There is no explicit handoff to a target,
            # rather, each tasklet gets its own chance at acquiring the semaphore.
            got_it = self._try_acquire()
            if got_it or not blocking:
                return got_it

            wait_until = None
            while True:
                if timeout is not None:
                    # Adjust time.  We may have multiple wakeups since we are a
                    # low-contention lock.
                    if wait_until is None:
                        wait_until = elapsed_time() + timeout
                    else:
                        timeout = wait_until - elapsed_time()
                        if timeout < 0:
                            return False
                try:
                    lock_channel_wait(self._chan, timeout)
                except:
                    self._safe_pump()
                    raise
                if self._try_acquire():
                    return True

    def _try_acquire(self):
        if self._value > 0:
            self._value -= 1
            return True
        return False

This code illustrates how the atomic state is incremented (via a context manager) and kept non-zero while we are doing potentially sensitive things, in this case, doing logic based on self._value. Since this is code that is used for implementing a Semaphore, which itself forms the basis of other stacklesslib.locks objects such as CriticalSection and Condition objects, this is the only way we have to ensure atomicity.

Threads

It is worth noting that using the atomic property has largely been confined to such library code as the above. Most stackless programs indeed do not run the watchdog in interruptible mode, or they use the so-called soft-interrupt mode which breaks the scheduler only at the aforementioned voluntary switch points.

However, in the last two years or so, I have been increasingly using Stackless Python in conjunction with OS threads.  All the stackless constructs, such as channels and tasklets work with threads, with the caveat that synchronized rendezvous isn’t possible between tasklets of different threads.  A channel.send() where the recipient is a tasklet from a different thread from the sender will always cause the target to become runnable in that thread, rather than to cause immediate switching.

Using threads has many benefits.  For one, it simplifies certain IO operations.  Handing a job to a tasklet on a different thread won’t block the main thread.  And using the usual tasklet communication channels to talk uniformly to all tasklets, whether they belong to this thread or another, makes the architecture uniform and elegant.

The locking constructs in stacklesslib also all make use of non-immediate scheduling.  While we use the stackless.channel object to wait, we make no assumptions about immediate execution when a target is woken up.  This makes them usable for synchronization between tasklets of different threads.

Or, this is what I thought, until I started getting strange errors and realized that tasklet.atomic wasn’t inhibiting involuntary switching between threads!

The GIL

You see, Python internally can arbitrarily stop executing a particular thread and start running another.  This is called yielding the GIL and it happens at the same place in the evaluation loop as the involuntary breaking of a running tasklet would have been performed.  And stackless’ atomic property didn’t affect this behaviour.  If the python evaluation loop detects that another thread is runnable and waiting to execute python code, it may arbitrarily yield the GIL to that thread and wait to reacquire the GIL again.

When using the above lock to synchronize tasklets from two threads, we would suddenly have a race condition, because the atomic context manager would no longer prevent two tasklets from making simultaneous modifications to self._value, if those tasklets belonged to different threads.

A Conundrum

So, how to fix this?  An obvious first avenue to explore would be to use one of the threading locks in addition to the atomic flag.  For the sake of argument, let’s illustrate with a much simplified lock:

class SimpleLock(object):
    def __init__(self):
        self._chan = stackless.channel()
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0

    def acquire(self):
        # opportunistic lock, without explicit handoff.
        with atomic():
            while True:
                if self._state == 0:
                    self._state = 1
                    return
                self._chan.receive()

    def release(self):
        with atomic():
            self._state = 0
            if self._chan.balance:
                self._chan.send(None) # Wake up someone who is waiting

This lock will work nicely with tasklets on the same thread.  But when we try to use it for locking between two threads, the atomicity of changing self._state and examining self._chan.balance won’t be maintained.

We can try to fix this with a proper thread lock:

class SimpleLockedLock(object):
    def __init__(self):
        self._chan = stackless.channel()
        self._chan.preference = 0 # no preference, receiver is made runnable
        self._state = 0
        self._lock = threading.Lock()

    def acquire(self):
        # opportunistic lock, without explicit handoff.
        with atomic():
            while True:
                with self._lock:
                    if self._state == 0:
                        self._state = 1
                        return
                self._chan.receive()

    def release(self):
        with atomic():
            with self._lock:
                self._state = 0
                if self._chan.balance:
                    self._chan.send(None) # Wake up someone who is waiting

This version is more cumbersome, of course, but the problem is that it doesn’t really fix the issue. There is still a race condition in acquire(), between releasing self._lock and calling self._chan.receive().

Even if we were to modify self._chan.receive() to take a lock and atomically release it before blocking, and reacquire it before returning, that would be a very unsatisfying solution.

Thankfully, since we needed to go and modify Stackless Python anyway, there was a much simpler solution.

Fixing Atomic

You see, Python is GIL synchronized.  In the same way that only one tasklet of a particular thread is executing at a time, regular cPython has the property that only one of the process’s threads is running Python code at a time.  So, at any one time, only one tasklet of one thread is running Python code.

So, if atomic can inhibit involuntary switching between tasklets of the same thread, can’t we just extend it to inhibit involuntary switching between threads as well?  Jessörry Bob, it turns out we can.

This is the fix (ceval.c:1166, python 2.7):

    /* Do periodic things.  Doing this every time through
       the loop would add too much overhead, so we do it
       only every Nth instruction.  We also do it if
       ``pendingcalls_to_do'' is set, i.e. when an asynchronous
       event needs attention (e.g. a signal handler or
       async I/O handler); see Py_AddPendingCall() and
       Py_MakePendingCalls() above. */
#ifdef STACKLESS
    /* don't do periodic things when in atomic mode */
    if (--_Py_Ticker < 0 && !tstate->st.current->flags.atomic) {
#else
    if (--_Py_Ticker < 0) {
#endif

That’s it! Stackless’ atomic flag has been extended to also stop the involuntary yielding of the GIL from happening.  Of course voluntary yielding, such as that which is done when performing blocking system calls, is still possible, much like voluntary switching between tasklets is also possible.  But when the tasklet’s atomic value is non-zero, this guarantees that no unexpected switch to another tasklet, be it on this thread or another, happens.

This fix, dear reader, was sufficient to make sure that all the locking constructs in stacklesslib worked for all tasklets.

So, what about cPython?

It is worth noting that the locks in stacklesslib.locks can be used to replace the locks in the threading module:  If your program is just a regular threaded Python program, then it will run correctly with the locks from stacklesslib.locks replacing the ones in threading.  This includes Semaphore, Lock, RLock, Condition, Barrier, Event and so on.  And all of them are now written in Python-land using regular Python constructs, made to work by the grace of the extended tasklet.atomic property.

Which brings me to ask the question: Why doesn’t cPython have the thread.atomic property?

I have seen countless questions on the python-dev mailing lists about whether this or that operation is atomic or not.  Regularly one sees implementation changes to for example list and dict operations to add a new requirement that an operation be atomic wrt. thread switches.

Wouldn’t it be nice if the programmer himself could just say: “Ah, I’d like to make sure that my updating this container here will be atomic when seen from the other threads.  Let’s just use the thread.atomic flag for that.”

For cPython, this would be a perfect light-weight atomic primitive.  It would be very useful to synchronize access to small blocks of code like this.  For other implementations of Python, those that are truly GIL free, a thread.atomic property could be implemented with a single system global threading.RLock.  Provided that we add the caveat to thread.atomic that it should be used by all agents accessing that data, we would now have a system for mutual access that would work very cheaply on cPython and also work (via a global lock) on other implementations.
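
To make that concrete, here is a rough sketch of my own (not an existing API) of the fallback described above for GIL-free implementations: one process-wide re-entrant lock that every cooperating thread takes around its small critical sections:

import contextlib
import threading

_atomic_lock = threading.RLock()    # a single global lock standing in for "atomic"

@contextlib.contextmanager
def atomic():
    # On cPython this could instead map to the cheap thread.atomic flag;
    # here it is emulated with the global RLock.
    with _atomic_lock:
        yield

# Cooperating code would then write:
# with atomic():
#     shared_list.append(item)

As the update at the end of this post notes, this emulation is cruder than a GIL-based flag: a thread that blocks while holding the lock stalls every other participant, which the real thing would not do.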

Let’s add thread.atomic to cPython

The reasons I am enthusiastic about seeing an “atomic” flag as part of cPython are twofold:

  1. It would fill the role of a lightweight synchronization primitive that people are requesting where a true Lock is considered too expensive, and where it makes no sense to have a per-instance lock object.
  2. More importantly, it will allow Stackless functionality to be added to cPython as a pure extension module, and it will allow such inter-thread operations to be added to Greenlet-based programs in the same way as we have solved the problem for Stackless Python.
  3. And thirdly?  Because Debbie Harry says so:

 Update, 23.03.2013:

Emulating an “atomic” flag in a truly multithreaded environment with a lock is not as simple as I first thought.  The cool thing about “atomic” is that it still allows the thread to block, e.g. on an IO operation, without affecting other threads.  For an atomic-like lock to work, such a lock would need to be automatically yielded and re-acquired when blocking, bringing us back to a condition-variable-like model.  Since the whole purpose of “atomic” is to be lightweight in a GIL-like environment, forcing it to be backwards compatible with a truly multi-threaded solution is counter-productive.  So, “atomic” as a GIL only feature is the only thing that makes sense, for now.  Unless I manage to dream up an alternative.


So! My previous blog (at blogs.ccpgames.com) disappeared.  It has taken this long for me to get things back up and running.  The previous blog was on a private server run by an external party and it was compromised by one of those sneaky internet maladies that infect sites like these.

I’m in the process of salvaging all the posts from there and getting them running here.  This is a somewhat painstaking process.  I was provided with some files in a folder and I have had to re-learn unix (it’s been 10 years since I used that regularly), learn about Apache, WordPress, mysql, multi-site installs and all kinds of things.  10 years ago, they didn’t have sudo.  I’m not sure that I like it.

Anyway, I hope to finish this soon and start blogging again about my adventures in Python. PyCon 2013 is coming up and I have, as always, some ideas to put forward and Swiss army knives to grind.


Examining a recent crash case, I stumbled across this code in frameobject.c:

PyFrameObject *
PyFrame_New(PyThreadState *tstate, PyCodeObject *code, PyObject *globals,
            PyObject *locals)
...
    if (code->co_zombieframe != NULL) {
        f = code->co_zombieframe;
        code->co_zombieframe = NULL;
        _Py_NewReference((PyObject *)f);
        assert(f->f_code == code);
    }

Intrigued by the name, I examined the header where it is defined, code.h:

...
void *co_zombieframe; /* for optimization only (see frameobject.c) */
...
} PyCodeObject;

It turns out that for every PyCodeObject object that has been executed, a PyFrameObject of a suitable size is cached and kept with the code object. Now, caching is fine and good, but this cache is unbounded. Every code object has the potential to hang on to a frame, which may then never be released.
Further, there is a separate freelist cache for PyFrameObjects already, in case a frame is not found on the code object:

    if (free_list == NULL) {
        f = PyObject_GC_NewVar(PyFrameObject, &PyFrame_Type,
                               extras);
        if (f == NULL) {
            Py_DECREF(builtins);
            return NULL;
        }
    }
    else {
        assert(numfree > 0);
        --numfree;
        f = free_list;
        free_list = free_list->f_back;
        ...

Always conscious about memory these days, I tried disabling this in version 3.3 and running the pybench test. I was not able to see any conclusive difference in execution speed.

Update:

Disabling the zombieframe on the PS3 shaved off some 50k on startup.  Not the jackpot, but still, small things add up.

——————————————————————————-
PYBENCH 2.1
——————————————————————————-
* using CPython 3.3.0a3+ (default, May 23 2012, 20:02:34) [MSC v.1600 64 bit (AMD64)]
* disabled garbage collection
* system check interval set to maximum: 2147483647
* using timer: time.perf_counter
* timer: resolution=2.9680909446810176e-07, implementation=QueryPerformanceCounter()

——————————————————————————-
Benchmark: nozombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:\pydev\hg\cpython2\pcbuild\amd64\python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:02:34 (#default)
Unicode: UCS4

——————————————————————————-
Comparing with: zombie
——————————————————————————-

Rounds: 10
Warp: 10
Timer: time.perf_counter

Machine Details:
Platform ID: Windows-7-6.1.7601-SP1
Processor: Intel64 Family 6 Model 26 Stepping 5, GenuineIntel

Python:
Implementation: CPython
Executable: D:\pydev\hg\cpython2\pcbuild\amd64\python.exe
Version: 3.3.0a3+
Compiler: MSC v.1600 64 bit (AMD64)
Bits: 64bit
Build: May 23 2012 20:00:42 (#default)
Unicode: UCS4

Test minimum run-time average run-time
this other diff this other diff
——————————————————————————-
BuiltinFunctionCalls: 51ms 52ms -3.3% 52ms 53ms -2.0%
BuiltinMethodLookup: 33ms 33ms +0.0% 34ms 34ms +0.8%
CompareFloats: 50ms 50ms +0.1% 50ms 50ms +0.4%
CompareFloatsIntegers: 99ms 98ms +0.8% 99ms 99ms +0.6%
CompareIntegers: 77ms 77ms -0.5% 77ms 77ms -0.3%
CompareInternedStrings: 60ms 60ms +0.0% 61ms 61ms -0.1%
CompareLongs: 46ms 45ms +1.5% 46ms 45ms +1.2%
CompareStrings: 61ms 59ms +3.6% 61ms 59ms +3.6%
ComplexPythonFunctionCalls: 60ms 58ms +3.3% 60ms 58ms +3.2%
ConcatStrings: 48ms 47ms +2.4% 48ms 47ms +2.1%
CreateInstances: 58ms 57ms +1.3% 59ms 58ms +1.3%
CreateNewInstances: 43ms 43ms +1.1% 44ms 44ms +1.1%
CreateStringsWithConcat: 79ms 79ms -0.3% 79ms 79ms -0.1%
DictCreation: 71ms 71ms +0.4% 72ms 72ms +1.0%
DictWithFloatKeys: 72ms 70ms +2.1% 72ms 71ms +1.8%
DictWithIntegerKeys: 46ms 46ms +0.7% 46ms 46ms +0.4%
DictWithStringKeys: 41ms 41ms +0.0% 41ms 41ms -0.1%
ForLoops: 35ms 37ms -4.0% 35ms 37ms -4.0%
IfThenElse: 64ms 64ms -0.1% 64ms 64ms -0.4%
ListSlicing: 49ms 50ms -1.0% 53ms 53ms -0.8%
NestedForLoops: 54ms 51ms +6.7% 55ms 51ms +6.7%
NestedListComprehensions: 54ms 54ms -0.7% 54ms 55ms -2.2%
NormalClassAttribute: 94ms 94ms +0.1% 94ms 94ms +0.1%
NormalInstanceAttribute: 54ms 54ms +0.3% 54ms 54ms +0.2%
PythonFunctionCalls: 58ms 57ms +0.8% 58ms 58ms +0.6%
PythonMethodCalls: 65ms 61ms +6.3% 66ms 62ms +5.9%
Recursion: 84ms 85ms -1.0% 85ms 85ms -0.9%
SecondImport: 74ms 76ms -2.5% 74ms 77ms -3.5%
SecondPackageImport: 75ms 78ms -3.8% 76ms 79ms -3.9%
SecondSubmoduleImport: 163ms 169ms -3.4% 164ms 170ms -3.3%
SimpleComplexArithmetic: 43ms 43ms +1.0% 43ms 43ms +1.0%
SimpleDictManipulation: 80ms 78ms +2.2% 81ms 79ms +2.4%
SimpleFloatArithmetic: 42ms 42ms +0.1% 42ms 42ms -0.0%
SimpleIntFloatArithmetic: 52ms 53ms -1.2% 52ms 53ms -1.1%
SimpleIntegerArithmetic: 52ms 52ms -0.7% 52ms 53ms -0.8%
SimpleListComprehensions: 45ms 45ms -0.2% 45ms 45ms +0.3%
SimpleListManipulation: 44ms 46ms -4.0% 44ms 46ms -3.9%
SimpleLongArithmetic: 32ms 32ms -0.9% 32ms 32ms -0.1%
SmallLists: 58ms 57ms +1.2% 58ms 67ms -12.8%
SmallTuples: 64ms 65ms -0.5% 65ms 65ms -0.2%
SpecialClassAttribute: 148ms 149ms -0.8% 149ms 150ms -1.0%
SpecialInstanceAttribute: 54ms 54ms +0.2% 54ms 54ms +0.0%
StringMappings: 120ms 117ms +2.5% 120ms 117ms +2.5%
StringPredicates: 62ms 62ms +0.9% 62ms 62ms +1.0%
StringSlicing: 69ms 68ms +1.6% 69ms 68ms +2.1%
TryExcept: 37ms 37ms +0.0% 37ms 37ms +0.5%
TryFinally: 40ms 37ms +6.7% 40ms 37ms +6.5%
TryRaiseExcept: 19ms 20ms -1.0% 20ms 20ms -0.4%
TupleSlicing: 65ms 65ms +0.5% 66ms 65ms +1.2%
WithFinally: 57ms 56ms +1.9% 57ms 56ms +2.1%
WithRaiseExcept: 53ms 53ms +0.3% 54ms 54ms -0.8%
——————————————————————————-
Totals: 3154ms 3145ms +0.3% 3176ms 3177ms -0.0%

(this=nozombie, other=zombie)

I’m going to remove this weird, unbounded cache from the python interpreter we use on the PS3.


What follows is an account of how I found and fixed an insidious bug in Stackless Python which has been there for years.  It’s one of those war stories.  Perhaps a bit long winded and technical and full of exaggerations as such stories tend to be.

Background

Some weeks ago, because of a problem in the client library we are using, I had to switch the http library we are using on the PS3 from non-blocking IO to blocking. Previously, we were issuing all the non-blocking calls, the “select” and the tasklet blocking / scheduling on the main thread. This is similar to how gevent and other such libraries do things. Switching to blocking calls, however, meant doing things on worker threads.

The approach we took was to implement a small pool of Python workers which could execute arbitrary jobs. A new utility function, stacklesslib.util.call_async(), then performed the asynchronous call by dispatching it to a worker thread. The idea of call_async() is to have a different tasklet execute the callable while the caller blocks on a channel. The return value, or error, is then propagated to the originating tasklet using that channel. Stackless channels can be used to communicate between threads too. And synchronizing threads in stackless is even more convenient than regular Python because there is stackless.atomic, which not only prevents involuntary scheduling of tasklets, it also prevents automatic yielding of the GIL (cPython folks, take note!)

This worked well, and has been running for some time. The drawback to this approach, of course, is that we now need to keep python threads around, consuming stack space. And Python needs a lot of stack.

The problem

The only problem was that there appeared to be a bug present. One of our developers complained that sometimes, during long downloads, the http download function would return None, rather than the expected string chunk.

Now, this problem was hard to reproduce. It required a specific setup and geolocation was also an issue. This developer is in California, using servers in London. Hence, there ensued a somewhat prolonged interaction (hindered by badly overlapping time-zones) where I would provide him with modified .py files with instrumentation, and he would provide me with logs. We quickly determined, to my dismay, that apparently, sometimes a string was turning into None, while in transit through a channel.send() to a channel.receive(). This was most distressing. Particularly because the channel in question was transporting data between threads, and this particular functionality of stackless has not been as heavily used as the rest.

Tracking it down

So, I suspected a race condition of some sort. But a careful review of the channel code and the scheduling code presented no obvious candidates. Also, the somewhat unpopular GIL was being used throughout, which if done correctly ensures that things work as expected.

To cut a long story short, by a lucky coincidence I managed to reproduce a different manifestation of the problem. In some cases, a simple interaction with a local HTTP server would cause this to happen.

When a channel sends data between tasklets, it is temporarily stored on the target tasklet’s “tempval” attribute. When the target wakes up, this is then taken and returned as the result from the “receive()” call. I was able to establish that after sending the data, the target tasklet did indeed hold the correct string value in its “tempval” attribute. I then needed to find out where and why it was disappearing from that place.
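
For reference, the contract that was being broken is just the ordinary channel round-trip, which at the Python level looks like this trivial, single-threaded example:
[python]
import stackless

chan = stackless.channel()

def sender():
    # Internally, the sent object is parked on the receiver's "tempval"
    # until the receiving tasklet wakes up.
    chan.send("payload")

def receiver():
    data = chan.receive()   # must be the sent string, never None
    assert data == "payload"

stackless.tasklet(receiver)()
stackless.tasklet(sender)()
stackless.run()
[/python]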

By adding instrumentation code to the stackless core, I established that this was happening in the last line of the following snippet:

PyObject *
slp_run_tasklet(void)
{
    PyThreadState *ts = PyThreadState_GET();
    PyObject *retval;

    if ( (ts->st.main == NULL) && initialize_main_and_current()) {
        ts->frame = NULL;
        return NULL;
    }

    TASKLET_CLAIMVAL(ts->st.current, &retval);

By setting a breakpoint, I was able to see that I was in the top-level part of the “continue” bit of the “stack spilling” code.

Stack spilling is a feature of stackless where the stack slicing mechanism is used to recycle a deep callstack. When it detects that the stack has grown beyond a certain limit, it is stored away, and a hard switch is done to the top again, where it continues its downwards crawl. This can help conserve stack address space, particularly on threads where the stack cannot grow dynamically.

So, something wrong with stack spilling, then.  But even so, this was unexpected. Why was stack spilling happening when data was being transmitted across a channel? Stack spilling normally occurs only when nesting regular .py code and other such things.

By setting a breakpoint at the right place, where the stack spilling code was being invoked, I finally arrived at this callstack:

Type Function
PyObject* slp_eval_frame_newstack(PyFrameObject* f, int exc, PyObject* retval)
PyObject* PyEval_EvalFrameEx_slp(PyFrameObject* f, int throwflag, PyObject* retval)
PyObject* slp_frame_dispatch(PyFrameObject* f, PyFrameObject* stopframe, int exc, PyObject* retval)
PyObject* PyEval_EvalCodeEx(PyCodeObject* co, PyObject* globals, PyObject* locals, PyObject** args, int argcount, PyObject** kws, int kwcount, PyObject** defs, int defcount, PyObject* closure)
PyObject* function_call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_Call(PyObject* func, PyObject* arg, PyObject* kw)
PyObject* PyObject_CallFunctionObjArgs(PyObject* callable)
void PyObject_ClearWeakRefs(PyObject* object)
void tasklet_dealloc(PyTaskletObject* t)
void subtype_dealloc(PyObject* self)
int slp_transfer(PyCStackObject** cstprev, PyCStackObject* cst, PyTaskletObject* prev)
PyObject* slp_schedule_task(PyTaskletObject* prev, PyTaskletObject* next, int stackless, int* did_switch)
PyObject* generic_channel_action(PyChannelObject* self, PyObject* arg, int dir, int stackless)
PyObject* impl_channel_receive(PyChannelObject* self)
PyObject* call_function(PyObject*** pp_stack, int oparg)

Notice the “subtype_dealloc”. This callstack indicates that in the channel receive code, after the hard switch back to the target tasklet, a Py_DECREF was causing side effects, which again caused stack spilling to occur. The place was this, in slp_transfer():

/* release any objects that needed to wait until after the switch. */
Py_CLEAR(ts->st.del_post_switch);

This is code that does cleanup after tasklet switch, such as releasing the last remaining reference of the previous tasklet.

So, the bug was clear then. It was twofold:

  1. A Py_CLEAR() after switching was not careful enough to store the current tasklet’s “tempval” out of harm’s way of any side effects a Py_DECREF() might cause, and
  2. Stack slicing itself, when it happened, clobbered the current tasklet’s “tempval”

The bug was subsequently fixed by repairing stack spilling and spiriting “tempval” away during the Py_CLEAR() call.

Post mortem

The inter-thread communication turned out to be a red herring. The problem was caused by an unfortunate juxtaposition of channel communication, tasklet deletion, and stack spilling.
But why had we not seen this before? I think it is largely due to the fact that stack spilling only rarely comes into play on regular platforms. On the PS3, we deliberately set the threshold low to conserve memory space. This is also not the first stack-spilling related bug we have seen on the PS3, but the first one for two years. Hopefully it will be the last.

Since this morning, the fix is in the stackless repository at http://hg.python.org/stackless


This is another of those memory conservation stories on the PS3.

Our engineers were worried about how much memory was being spent/wasted in dictionaries. Python dicts are sparse data structures, optimized for performance: they trade memory for speed.

This code shows you the memory used by dicts of various sizes:

>>> import sys
>>> for i in range(20): print i, sys.getsizeof(dict((j,j) for j in range(i)))
...
0 148
1 148
2 148
3 148
4 148
5 148
6 532
7 532
8 532
9 532
10 532
11 532
12 532
13 532
14 532
15 532
16 532
17 532
18 532
19 532

It’s rather striking that a 10-element dict on a 32-bit system consumes more than half a kilobyte of memory. I’m pretty sure BBC Basic can’t have used dicts.

Now, I was interested in tuning the dict implementation for the PS3, sacrificing performance for memory. Looking at the code led me to a file called dictnotes.txt explaining much about dicts. The section on tuning only considers performance. Too sparse a dict, you see, loses performance because of memory cache effects. Otherwise, I’m sure, we would want dicts infinitely sparse.

It turns out that only one parameter is easily tunable, PyDict_MINSIZE. Python 2.7 sets this to 8 for reasons of cache line size, although I find that an odd generalization across a huge number of platforms. In dictobject.h, I came across this comment:

* PyDict_MINSIZE is the minimum size of a dictionary. This many slots are
* allocated directly in the dict object (in the ma_smalltable member).
* It must be a power of 2, and at least 4.

It turns out this is wrong. Python will happily run with it set to 1.

As for the other tunable parameters, I ended up macrofying things that were hard-coded in various places in the code:

/* CCP change: Tunable parameters for dict growth */
#if _PyCCP_TIGHT_DICT
/* Save memory for dust */
#define _PyDICT_MAX_LOAD_NUM 4 /* grow at 80% load */
#define _PyDICT_MAX_LOAD_DENOM 5

#define _PyDICT_GROWTHRATE_NUM 3 /* scale by 1.5 */
#define _PyDICT_GROWTHRATE_DENOM 2
#else
/* max load 2/3, default python: */
#define _PyDICT_MAX_LOAD_NUM 2
#define _PyDICT_MAX_LOAD_DENOM 3

#define _PyDICT_GROWTHRATE_NUM (mp->ma_used > 50000 ? 2 : 4)
#define _PyDICT_GROWTHRATE_DENOM 1
#endif

And then later:

#if 0
    if (!(mp->ma_used > n_used && mp->ma_fill*3 >= (mp->ma_mask+1)*2))
        return 0;
    return dictresize(mp, (mp->ma_used > 50000 ? 2 : 4) * mp->ma_used);
#else
    /* CCP change, tunable growth parameters */
    if (!(mp->ma_used > n_used && mp->ma_fill*_PyDICT_MAX_LOAD_DENOM >= (mp->ma_mask+1)*_PyDICT_MAX_LOAD_NUM))
        return 0;
    return dictresize(mp, _PyDICT_GROWTHRATE_NUM * mp->ma_used / _PyDICT_GROWTHRATE_DENOM);
#endif

By using a smalltable of size 1, setting the max fill rate to 4/5 and growth rate to 1.5, we get this:
[python]
for i in range(20): print i, sys.getsizeof(dict((j,j) for j in range(i)))

0 64
1 88
2 112
3 160
4 160
5 160
6 256
7 256
8 256
9 256
10 256
11 256
12 448
13 448
14 448
15 448
16 448
17 448
18 448
19 448
[/python]

The size of the dicts is still governed by the fact that the tables must have sizes that are powers of two, so we can only delay the onset of growth. But at least we get rid of the super optimistic quadruple growth factor and the large small table.

Perhaps a different, again less optimal, version of the dict wouldn’t have that power of two requirement. There is no inherent need for that when hashing except that it makes for nice bitwise arithmetic.
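
As a tiny illustration of that bitwise arithmetic: with a power-of-two table size, the slot index is a simple mask, which is what the lookup code relies on.
[python]
size = 8                  # a power of two
h = hash("some key")
print h & (size - 1)      # slot index via bit mask
print h % size            # same result, but needs a general modulo
[/python]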

Update:

The effect these changes had:

I just do a quick test. It saved 2MB in login screen. That’s awesome.

Thanks,

Kevin Zhang


Polishing our forthcoming console game, our team in Shanghai are relentlessly trying to minimize python memory use.
Today, an engineer complained to me that “cell” objects were being leaked(*).

This rang a bell with me. In 2009, I had posted about this to python-dev.
The response at the time wasn’t very sympathetic. I should be doing stuff differently or simply rely on the cyclic garbage collector and not try to be clever. Yet, as I pointed out, parts of the library are aware of the problem and do help you with these things, such as the xml.dom.minidom.unlink() method.

The data being leaked now appeared to pertain to the json module:

[2861.88] Python: 0: <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x12e14010>>
[2861.88] Python: 1: <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x12e14010>>
[2861.88] Python: 2: <bound method JSONEncoder.default of <json.encoder.JSONEncoder object at 0x12e14010>>

This prompted me to have a look in the json module, and behold, json.encoder contains this pattern:
[python]
def _make_iterencode(…)

    def _iterencode(o, _current_indent_level):
        if isinstance(o, basestring):
            yield _encoder(o)
        elif o is None:
            yield 'null'
        elif o is True:
            yield 'true'
        elif o is False:
            yield 'false'
        elif isinstance(o, (int, long)):
            yield str(o)
        elif isinstance(o, float):
            yield _floatstr(o)
        elif isinstance(o, (list, tuple)):
            for chunk in _iterencode_list(o, _current_indent_level):
                yield chunk
        elif isinstance(o, dict):
            for chunk in _iterencode_dict(o, _current_indent_level):
                yield chunk
        else:
            if markers is not None:
                markerid = id(o)
                if markerid in markers:
                    raise ValueError("Circular reference detected")
                markers[markerid] = o
            o = _default(o)
            for chunk in _iterencode(o, _current_indent_level):
                yield chunk
            if markers is not None:
                del markers[markerid]

    return _iterencode
[/python]

The problem is this: the returned closure has a func_closure member containing the "cell" objects, one of which points to this function. There is no way to clear func_closure after use. And so, iterencoding stuff using the json module creates reference cycles that persist until the next garbage collection, possibly causing Python to hang on to all the data that was supposed to be encoded and then thrown away.
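
The self-referencing cell is easy to see directly; this contrived snippet (not the json code) shows it:
[python]
def outer():
    def inner():
        return inner        # refers to itself through a closure cell
    return inner

f = outer()
print f.func_closure                         # a tuple holding one cell
print f.func_closure[0].cell_contents is f   # True -- the reference cycle
[/python]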

Looking for a workaround, I wrote this code, emulating part of what is going on:
[python]
def itertest(o):
    def listiter(l):
        for i in l:
            if isinstance(i, list):
                chunks = listiter(i)
                for i in chunks:
                    yield i
            else:
                yield i
    return listiter(o)
[/python]

Testing it confirmed the problem:

>>> import celltest
>>> l = [1, [2, 3]]
>>> import gc, celltest
>>> gc.collect()
>>> gc.set_debug(gc.DEBUG_LEAK)
>>> l = [1, [2, 3]]
>>> i = celltest.itertest(l)
>>> list(i)
[1, 2, 3]
>>> gc.collect()
gc: collectable <cell 01E96B50>
gc: collectable <function 01E97330>
gc: collectable <tuple 01E96910>
gc: collectable <cell 01E96B30>
gc: collectable <tuple 01E96950>
gc: collectable <function 01E973F0>
3

To fix this, it is necessary to clear the “cell” objects once there is no more need for them. It is not possible to do this from the outside, so how about from the inside? Changing the code to:
[python]
def itertest2(o):
    def listiter(l):
        for i in l:
            if isinstance(i, list):
                chunks = listiter(i)
                for i in chunks:
                    yield i
            else:
                yield i

    chunks = listiter(o)
    for i in chunks:
        yield i
    chunks = listiter = None
[/python]
does the trick. The function becomes a generator itself: it yields the data, then cleans up:

>>> o = celltest.itertest2(l)
>>> list(o)
[1, 2, 3]
>>> gc.collect()
0

It is an unfortunate situation. The workaround requires work to be done inside the function. It would be cool if it were possible to clear the function’s closure by calling, e.g., func.close(). As it is, people have to be aware of these hidden cycles and code carefully around them.

(*) Leaking in this case means not being released immediately by reference counting but lingering. We don’t want to rely on the gc module’s quirkiness in a video game.

Update:

In my toy code, I got the semantics slightly wrong.  Actually, it is more like this:
[python]
def make_iter():
    def listiter(l):
        for i in l:
            if isinstance(i, list):
                chunks = listiter(i)
                for i in chunks:
                    yield i
            else:
                yield i
    return listiter

def get_iterator(data):
    it = make_iter()
    return it(data)
[/python]

This complicates things. During iteration, no code is running in the scope of make_iter that we could use to clear those locals after iteration. Everything runs in the nested functions, and since I am using Python 2.7 (which doesn’t have the "nonlocal" keyword) there seems to be no way to clear the outer locals from the inner functions once iteration is done.

I guess that means that I’ll have to modify this code to use class objects instead.
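
For what it’s worth, here is a rough sketch of what such a class-based version might look like (hypothetical names, not our actual code): the iteration state lives on an instance instead of in closure cells, so there is no cycle and plain reference counting releases everything as soon as the iterator is dropped.
[python]
class ListIter(object):
    """Flatten nested lists without creating closure cells."""
    def __init__(self, data):
        self.stack = [iter(data)]    # explicit stack instead of recursion

    def __iter__(self):
        return self

    def next(self):
        while self.stack:
            try:
                i = next(self.stack[-1])
            except StopIteration:
                self.stack.pop()
                continue
            if isinstance(i, list):
                self.stack.append(iter(i))
            else:
                return i
        raise StopIteration

def get_iterator(data):
    return ListIter(data)
[/python]
list(get_iterator([1, [2, 3]])) still yields [1, 2, 3], but there are no closure cycles for the collector to find.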

Also, while on the topic, I think Raymond Hettinger’s class-like objects are subject to this problem if they have any sort of mutual or recursive relationship among their “members”.


I just had this problem which would have been elegantly solved by the ability to manually clear weak references pointing to an object. I am (for technical reasons) recycling an object, so instead of killing it and re-creating it, I re-initialize it. But that leaves old weak references in place. How nice would it be to be able to call "myobject.clear_weakrefs()"?
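
A toy example of the annoyance (the names here are made up): after re-initializing a recycled object, every old weak reference still resolves to it, so anything holding such a reference silently sees the recycled object as if it were the old one.
[python]
import weakref

class Resource(object):
    def __init__(self, name):
        self.init(name)

    def init(self, name):
        # re-initialize in place when the object is recycled
        self.name = name

r = Resource("first")
ref = weakref.ref(r)

r.init("second")               # recycle rather than recreate
print ref() is r, ref().name   # True second -- the stale weakref lives on
[/python]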


For the past months this blog has been completely overwhelmed with comment spam. I’ve finally had a solution installed (Akismet) and cleaned up the thousands of pending comments. I’ll be able to contribute stuff again!


I thought I’d mention a cool little patch we did to Python some years back.

We work with database tables a lot.  Game configuration data is essentially rows in a vast database.  And those rows contain a lot of floats.  At some point I recognized that common float values were not being reused.  In particular, id(0.0) != id(0.0).  I was a bit surprised by this, since I figured some floats must be more common than others.  Certainly, 0.0 is a bit special.
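
The contrast with small integers is easy to see in a CPython 2.7 session (caching of either kind is an implementation detail, of course):

>>> a = 100
>>> b = 100
>>> a is b          # small ints are cached by the interpreter
True
>>> x = 0.0
>>> y = 0.0
>>> x is y          # floats are not, not even 0.0
False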

I mentioned this on python-dev some years back but with somewhat underwhelming results.  A summary of the discussion can be found here.

Anyway, I thought I’d mention this to people doing a lot of floating point.  We saved a huge amount of memory on our servers just caching integral floating point values between -10 and +10, including both the negative and positive 0.0.  These values are very frequent, for example as multipliers in tables, and so on.

Here’s some of the code:

PyObject *
PyFloat_FromDouble(double fval)
{
    register PyFloatObject *op;
    int ival;
    if (free_list == NULL) {
        if ((free_list = fill_free_list()) == NULL)
            return NULL;
        /* CCP addition, cache common values */
        if (!f_reuse[0]) {
            int i;
            for(i = 0; i<21; i++)
                f_reuse[i] = PyFloat_FromDouble((double)(i-10));
        }
    }
    /* CCP addition, check for recycling */
    ival = (int)fval;
    if ((double)ival == fval && ival>=-10 && ival <= 10) {
#ifdef MS_WINDOWS
        /* ignore the negative zero */
        if (ival || _fpclass(fval) != _FPCLASS_NZ) {
#else
        /* can't differentiate between positive and negative zeroes, ignore both */
        if (ival) {
#endif
            ival+=10;
            if (f_reuse[ival]) {
                Py_INCREF(f_reuse[ival]);
                return f_reuse[ival];
            }
        }
    }

    /* Inline PyObject_New */
    op = free_list;
    free_list = (PyFloatObject *)Py_TYPE(op);
    PyObject_INIT(op, &PyFloat_Type);
    op->ob_fval = fval;
    return (PyObject *) op;
}

(Please excuse the lame syntax highlighter with its & and < thingies 🙂)


When doing IO, it is sometimes useful for a worker thread to notify Python that something has happened. Previously we have just had the Python main thread “poll” some external variable for that, but recently we have been experimenting with having the worker thread itself grab the GIL and perform the Python work.

This should be straightforward. Python has an API called PyGILState_Ensure() that can be called on any thread. If that thread doesn’t already have a Python thread state, it will create a temporary one. Such a thread is sometimes called an external thread.

On a server loaded to some 40% with IO, this is what happened when I turned on this feature:

[chart: process cpu]

The dark gray area is main thread CPU (initially at around 40%) and the rest is other threads.  Turning on the “ThreadWakeup” feature adds some 20% of extra CPU work to the process.

When the main thread is not working, it is idle in a MsgWaitForMultipleObjects() Windows system call (with the GIL unclaimed).  So the worker thread should have no problem acquiring the GIL.  Further, there is only ever one worker thread doing a PyGILState_Ensure()/PyGILState_Release() at the same time, and this is ensured using locking on the worker thread side.

Further tests seem to confirm that if the worker thread already owns a Python thread state, and uses that to acquire the GIL (using a PyEval_RestoreThread() call), this overhead goes away.

This was surprising to me, but it seems to indicate that it is very expensive to “acquire a thread state on demand” to claim the GIL.  This is very unfortunate, because it means that one cannot easily use arbitrary system threads to call into Python without significant overhead.  These might be threads from the Windows thread pool for example, threads that we have no control over and therefore cannot assign thread state to.

I will try to investigate this further, to see where the overhead is coming from.  It could be the extra TLS calls made, or simply the cost of the malloc()/free() involved.  Depending on the results, there are a few options:

  1. Keep a single thread state on the side for (the single) external thread that can claim the GIL at a time, ready and initialized.
  2. Allow an external thread to ‘borrow’ another thread state and not use its own.
  3. Streamline the stuff already present.

Update, Oct. 6th 2011:
Enabling dynamic GIL with thread state caching did nothing to solve this issue.
I think the problem is likely to be that spin locking is in effect for the GIL. I’ll see what happens if I explicitly define the GIL to not use spin locking.

